Jingwei Ni
2026
Can Reasoning Help Large Language Models Capture Human Annotator Disagreement?
Jingwei Ni | Yu Fan | Vilém Zouhar | Donya Rooein | Alexander Miserlis Hoyle | Mrinmaya Sachan | Markus Leippold | Dirk Hovy | Elliott Ash
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Variation in human annotation (i.e., disagreements) is common in NLP, often reflecting important information like task subjectivity and sample ambiguity. Modeling this variation is important for applications that are sensitive to such information. Although RLVR-style reasoning (Reinforcement Learning with Verifiable Rewards) has improved Large Language Model (LLM) performance on many tasks, it remains unclear whether such reasoning enables LLMs to capture informative variation in human annotation. In this work, we evaluate the influence of different reasoning settings on LLM disagreement modeling. We systematically evaluate each reasoning setting across model sizes, distribution expression methods, and steering methods, resulting in 60 experimental setups across 3 tasks. Surprisingly, our results show that RLVR-style reasoning degrades performance in disagreement modeling, while naive Chain-of-Thought (CoT) reasoning improves the performance of RLHF LLMs (RL from human feedback). These findings underscore the potential risk of replacing human annotators with reasoning LLMs, especially when disagreements are important.
Efficient Test-Time Scaling of Multi-Step Reasoning by Probing Internal States of Large Language Models
Jingwei Ni | Ekaterina Fadeeva | Tianyi Wu | Mubashara Akhtar | Jiaheng Zhang | Elliott Ash | Markus Leippold | Timothy Baldwin | See-Kiong Ng | Artem Shelmanov | Mrinmaya Sachan
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
LLMs can solve complex tasks by generating long, multi-step reasoning chains. Test-time scaling (TTS) can further improve LLM performance by sampling multiple variants of intermediate reasoning steps, verifying their correctness, and strategically choosing the best steps for continuation. However, existing verification approaches, such as Process Reward Models (PRMs), are computationally expensive, limited to specific domains, and require large-scale human or model-generated annotations. We propose a lightweight alternative for step-level reasoning verification based on probing the internal states of LLMs. We train a transformer-based probe that uses the internal states of the frozen LLM to estimate the credibility of its reasoning steps during generation. Annotation can be generated either by another larger LLM (e.g., DeepSeek-R1) or in a self-supervised manner by the original model itself. The probes are both effective and lightweight, containing fewer than 10M parameters. Across multiple domains, including mathematics, planning, and general knowledge question answering, our probes match or even exceed the performance of PRMs that are up to 810× larger. Our findings suggest that the internal states of LLMs encode their confidence in reasoning processes and can serve as reliable signals for reasoning step verification, offering a promising direction towards scalable and generalizable TTS and introspective LLMs.
Tackling the Root of Misinformation by Teaching Laypeople about Logical Fallacies via Socratic Questioning and Critical Argumentation
Minjing Shi | Junling Wang | Jingwei Ni | Sankalan Pal Chowdhury | Mrinmaya Sachan
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Identifying logical fallacies (LFs) in everyday discourse is challenging for many people. This challenge is amplified in the era of Large Language Models (LLMs), where malicious agents can deploy fallacious arguments to disseminate misinformation at scale. In this work, we explore the potential of LLMs as part of the solution. We introduce LFTutor, an intelligent tutoring system which uses LLMs to tutor humans and help them learn about logical fallacies. LFTutor integrates intent-driven Socratic questioning and critical argumentation principles to actively engage learners to reflect on their reasoning. Through both automatic and human evaluations, we demonstrate that LFTutor significantly outperforms baseline LLMs lacking such pedagogical strategies. This work highlights the promise of combining LLMs with pedagogical scaffolding to foster critical thinking and argument literacy in the age of AI.
Apertus: Democratizing Open and Compliant LLMs for Global Language Environments
Alejandro Hernández-Cano | Alexander Hägele | Allen Hao Huang | Angelika Romanou | Antoni-Joan Solergibert | Barna Pásztor | Bettina Messmer | Dhia Garbaya | Eduard Frank Ďurech | Ido Hakimi | Juan Garcia Giraldo | Mete Ismayilzada | Negar Foroutan | Skander Moalla | Tiancheng Chen | Vinko Sabolčec | Yixuan Xu | Michael Aerni | Badr AlKhamissi | Inés Altemir Marinas | Mohammad Hossein Amani | Matin Ansaripour | Ilia Badanin | Harold Benoit | Emanuela Boros | Nicholas John Browning | Fabian Bösch | Maximilian Böther | Niklas Canova | Camille Challier | Clément Charmillot | Jonathan Coles | Jan Milan Deriu | Arnout Devos | Lukas Drescher | Daniil Dzenhaliou | Maud Ehrmann | Dongyang Fan | Simin Fan | Silin Gao | Miguel Gila | María Grandury | Diba Hashemi | Alexander Miserlis Hoyle | Jiaming Jiang | Mark Klein | Andrei Kucharavy | Anastasiia Kucherenko | Frederike Lübeck | Roman Machacek | Theofilos Ioannis Manitaras | Andreas Marfurt | Kyle Matoba | Simon Matrenok | Henrique Mendonça | Fawzi Roberto Mohamed | Syrielle Montariol | Luca Mouchel | Sven Najem-Meyer | Jingwei Ni | Gennaro Oliva | Matteo Pagliardini | Elia Palme | Andrei Panferov | Léo Paoletti | Marco Passerini | Ivan Pavlov | Auguste Poiroux | Kaustubh Ponkshe | Nathan Ranchin | Javier Rando | Mathieu Sauser | Jakhongir Saydaliev | Mukhammadali Sayfiddinov | Marian Schneider | Stefano Schuppli | Marco Scialanga | Andrei Semenov | Kumar Shridhar | Raghav Singhal | Anna Sotnikova | Alexander Sternfeld | Ayush Kumar Tarun | Paul Teiletche | Jannis Vamvas | Xiaozhe Yao | Hao Zhao | Alexander Ilic | Ana Klimovic | Andreas Krause | Caglar Gulcehre | David Rosenthal | Elliott Ash | Florian Tramèr | Joost VandeVondele | Livio Veraldi | Martin Rajman | Thomas C. Schulthess | Torsten Hoefler | Antoine Bosselut | Martin Jaggi | Imanol Schlag
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Open LLMs enable AI practitioners to control development costs by building on an existing foundation for downstream applications. While offering substantial promise, current models often fail to meet the needs of users needing open solutions aligned with responsible AI principles, including data compliance, transparency, and inclusivity. In this work, we present Apertus, a fully open suite of large language models (LLMs) designed to address responsibility shortcomings in today’s open model ecosystem, namely data responsibility and global representation. Unlike many prior models that release weights without reproducible data pipelines or regard for content-owner rights, Apertus models are pretrained exclusively on openly available data, retroactively respecting robots.txt exclusions and filtering for non-permissive, toxic, and personally identifiable content. To mitigate risks of data memorization, we also adopt the Goldfish objective during pretraining, strongly suppressing verbatim recall of data while retaining downstream task performance. Apertus also drastically expands multilingual coverage, training on 15T tokens from over 1800 languages, with about 40% of pretraining data allocated to non-English content. Released at 8B and 70B scales, Apertus approaches state-of-the-art results among fully open models on multilingual benchmarks, rivaling or surpassing open-weight counterparts.
ThinkBooster: A Unified Framework for Seamless Test-Time Scaling of LLM Reasoning
Vladislav Smirnov | Quang-Chieu Nguyen | Sergey Senichev | Minh Ngoc Ta | Ekaterina Fadeeva | Artem Vazhentsev | Daria Galimzianova | Nikolai Rozanov | Viktor Mazanov | Jingwei Ni | Tianyi Wu | Igor Kiselev | Mrinmaya Sachan | Iryna Gurevych | Preslav Nakov | Timothy Baldwin | Artem Shelmanov
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Test-time compute (TTC) scaling has emerged as a powerful paradigm for improving large language model (LLM) reasoning by allocating additional compute during inference, e.g., via multi-sample generation and verifier-based reranking. Existing TTC scaling strategies and reasoning scorers remain fragmented, evaluated under inconsistent protocols, and are rarely analyzed through the lens of quality-cost trade-offs. We introduce ThinkBooster, a unified framework for seamless test-time compute scaling of LLM reasoning, which consists of (i) a modular Python library implementing state-of-the-art TTC scaling strategy and scorer families, (ii) a benchmark that jointly evaluates performance and computational efficiency, and (iii) a deployable OpenAI-compatible proxy service that enables drop-in integration of adaptive reasoning into real-world applications. We further provide a demo visual debugger for inspecting the reasoning trajectories, intermediate selection decisions, and alternative reasoning paths. Empirical results on mathematical and coding tasks reveal the performance-compute trade-offs of TTC scaling strategies and scoring methods and demonstrate that ThinkBooster provides practical gains in real-world tasks. The code is available online under an MIT license.
2025
DIRAS: Efficient LLM Annotation of Document Relevance for Retrieval Augmented Generation
Jingwei Ni | Tobias Schimanski | Meihong Lin | Mrinmaya Sachan | Elliott Ash | Markus Leippold
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Retrieval Augmented Generation (RAG) is widely employed to ground responses to queries on domain-specific documents. But do RAG implementations leave out important information when answering queries that need an integrated analysis of information (e.g., Tell me good news in the stock market today.)? To address these concerns, RAG developers need to annotate information retrieval (IR) data for their domain of interest, which is challenging because (1) domain-specific queries usually need nuanced definitions of relevance beyond shallow semantic relevance; and (2) human or GPT-4 annotation is costly and cannot cover all (query, document) pairs (i.e., annotation selection bias), thus harming the effectiveness in evaluating IR recall. To address these challenges, we propose DIRAS (**D**omain-specific **I**nformation **R**etrieval **A**nnotation with **S**calability), a manual-annotation-free schema that fine-tunes open-sourced LLMs to consider nuanced relevance definition and annotate (partial) relevance labels with calibrated relevance scores. Extensive evaluation shows that DIRAS enables smaller (8B) LLMs to achieve GPT-4-level performance on annotating and ranking unseen (query, document) pairs, and is helpful for real-world RAG development.
Co-DETECT: Collaborative Discovery of Edge Cases in Text Classification
Chenfei Xiong | Jingwei Ni | Yu Fan | Vilém Zouhar | Donya Rooein | Lorena Calvo-Bartolomé | Alexander Hoyle | Zhijing Jin | Mrinmaya Sachan | Markus Leippold | Dirk Hovy | Mennatallah El-Assady | Elliott Ash
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
We introduce Co-DETECT (Collaborative Discovery of Edge cases in TExt ClassificaTion), a novel mixed-initiative annotation framework that integrates human expertise with automatic annotation guided by large language models (LLMs). Co-DETECT starts with an initial, sketch-level codebook and dataset provided by a domain expert, then leverages the LLM to annotate the data and identify edge cases that are not well described by the initial codebook. Specifically, Co-DETECT flags challenging examples, induces high-level, generalizable descriptions of edge cases, and assists users in incorporating edge case handling rules to improve the codebook. This iterative process enables more effective handling of nuanced phenomena through compact, generalizable annotation rules. An extensive user study and qualitative and quantitative analyses demonstrate the effectiveness of Co-DETECT.
Proceedings of the 2nd Workshop on Natural Language Processing Meets Climate Change (ClimateNLP 2025)
Kalyan Dutia | Peter Henderson | Markus Leippold | Christoper Manning | Gaku Morio | Veruska Muccione | Jingwei Ni | Tobias Schimanski | Dominik Stammbach | Alok Singh | Alba (Ruiran) Su | Saeid A. Vaghefi
Proceedings of the 2nd Workshop on Natural Language Processing Meets Climate Change (ClimateNLP 2025)
2024
ClimRetrieve: A Benchmarking Dataset for Information Retrieval from Corporate Climate Disclosures
Tobias Schimanski | Jingwei Ni | Roberto Spacey Martín | Nicola Ranger | Markus Leippold
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
To handle the vast amounts of qualitative data produced in corporate climate communication, stakeholders increasingly rely on Retrieval Augmented Generation (RAG) systems. However, a significant gap remains in evaluating domain-specific information retrieval – the basis for answer generation. To address this challenge, this work simulates the typical tasks of a sustainability analyst by examining 30 sustainability reports with 16 detailed climate-related questions. As a result, we obtain a dataset with over 8.5K unique question-source-answer pairs labeled by different levels of relevance. Furthermore, we develop a use case with the dataset to investigate the integration of expert knowledge into information retrieval with embeddings. Although we show that incorporating expert knowledge works, we also outline the critical limitations of embeddings in knowledge-intensive downstream domains like climate change communication.
Proceedings of the 1st Workshop on Natural Language Processing Meets Climate Change (ClimateNLP 2024)
Dominik Stammbach | Jingwei Ni | Tobias Schimanski | Kalyan Dutia | Alok Singh | Julia Bingler | Christophe Christiaen | Neetu Kushwaha | Veruska Muccione | Saeid A. Vaghefi | Markus Leippold
Proceedings of the 1st Workshop on Natural Language Processing Meets Climate Change (ClimateNLP 2024)
Towards Faithful and Robust LLM Specialists for Evidence-Based Question-Answering
Tobias Schimanski | Jingwei Ni | Mathias Kraus | Elliott Ash | Markus Leippold
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Advances towards more faithful and traceable answers of Large Language Models (LLMs) are crucial for various research and practical endeavors. One avenue in reaching this goal is basing the answers on reliable sources. However, this Evidence-Based QA has proven to work insufficiently with LLMs in terms of citing the correct sources (source quality) and truthfully representing the information within sources (answer attributability). In this work, we systematically investigate how to robustly fine-tune LLMs for better source quality and answer attributability. Specifically, we introduce a data generation pipeline with automated data quality filters, which can synthesize diversified high-quality training and testing data at scale. We further introduce four test sets to benchmark the robustness of fine-tuned specialist models. Extensive evaluation shows that fine-tuning on synthetic data improves performance on both in- and out-of-distribution data. Furthermore, we show that data quality, which can be drastically improved by the proposed quality filters, matters more than quantity in improving Evidence-Based QA.
AFaCTA: Assisting the Annotation of Factual Claim Detection with Reliable LLM Annotators
Jingwei Ni | Minjing Shi | Dominik Stammbach | Mrinmaya Sachan | Elliott Ash | Markus Leippold
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
With the rise of generative AI, automated fact-checking methods to combat misinformation are becoming more and more important. However, factual claim detection, the first step in a fact-checking pipeline, suffers from two key issues that limit its scalability and generalizability: (1) inconsistency in definitions of the task and what a claim is, and (2) the high cost of manual annotation. To address (1), we review the definitions in related work and propose a unifying definition of factual claims that focuses on verifiability. To address (2), we introduce AFaCTA (Automatic Factual Claim deTection Annotator), a novel framework that assists in the annotation of factual claims with the help of large language models (LLMs). AFaCTA calibrates its annotation confidence with consistency along three predefined reasoning paths. Extensive evaluation and experiments in the domain of political speech reveal that AFaCTA can efficiently assist experts in annotating factual claims and training high-quality classifiers, and can work with or without expert supervision. Our analyses also result in PoliClaim, a comprehensive claim detection dataset spanning diverse political topics.
2023
CHATREPORT: Democratizing Sustainability Disclosure Analysis through LLM-based Tools
Jingwei Ni | Julia Bingler | Chiara Colesanti-Senni | Mathias Kraus | Glen Gostlow | Tobias Schimanski | Dominik Stammbach | Saeid Ashraf Vaghefi | Qian Wang | Nicolas Webersinke | Tobias Wekhof | Tingyu Yu | Markus Leippold
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
In the face of climate change, are companies really taking substantial steps toward more sustainable operations? A comprehensive answer lies in the dense, information-rich landscape of corporate sustainability reports. However, the sheer volume and complexity of these reports make human analysis very costly. Therefore, only a few entities worldwide have the resources to analyze these reports at scale, which leads to a lack of transparency in sustainability reporting. Empowering stakeholders with LLM-based automatic analysis tools can be a promising way to democratize sustainability report analysis. However, developing such tools is challenging due to (1) the hallucination of LLMs and (2) the inefficiency of bringing domain experts into the AI development loop. In this paper, we introduce ChatReport, a novel LLM-based system to automate the analysis of corporate sustainability reports, addressing existing challenges by (1) making the answers traceable to reduce the harm of hallucination and (2) actively involving domain experts in the development loop. We make our methodology, annotated datasets, and generated analyses of 1015 reports publicly available. Video Introduction: https://www.youtube.com/watch?v=Q5AzaKzPE4M Github: https://github.com/EdisonNi-hku/chatreport Live web app: reports.chatclimate.ai
When Does Aggregating Multiple Skills with Multi-Task Learning Work? A Case Study in Financial NLP
Jingwei Ni | Zhijing Jin | Qian Wang | Mrinmaya Sachan | Markus Leippold
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multi-task learning (MTL) aims at achieving a better model by leveraging data and knowledge from multiple tasks. However, MTL does not always work – sometimes negative transfer occurs between tasks, especially when aggregating loosely related skills, leaving it an open question when MTL works. Previous studies show that MTL performance can be improved by algorithmic tricks. However, what tasks and skills should be included is less well explored. In this work, we conduct a case study in Financial NLP where multiple datasets exist for skills relevant to the domain, such as numeric reasoning and sentiment analysis. Due to the task difficulty and data scarcity in the Financial NLP domain, we explore when aggregating such diverse skills from multiple datasets with MTL can work. Our findings suggest that the key to MTL success lies in skill diversity, relatedness between tasks, and choice of aggregation size and shared capacity. Specifically, MTL works well when tasks are diverse but related, and when the size of the task aggregation and the shared capacity of the model are balanced to avoid overwhelming certain tasks.
2022
Original or Translated? A Causal Analysis of the Impact of Translationese on Machine Translation Performance
Jingwei Ni | Zhijing Jin | Markus Freitag | Mrinmaya Sachan | Bernhard Schölkopf
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Human-translated text displays distinct features from naturally written text in the same language. This phenomenon, known as translationese, has been argued to confound machine translation (MT) evaluation. Yet, we find that existing work on translationese neglects some important factors and the conclusions are mostly correlational but not causal. In this work, we collect CausalMT, a dataset where the MT training data are also labeled with the human translation directions. We inspect two critical factors, the train-test direction match (whether the human translation directions in the training and test sets are aligned), and data-model direction match (whether the model learns in the same direction as the human translation direction in the dataset). We show that these two factors have a large causal effect on the MT performance, in addition to the test-model direction mismatch highlighted by existing work on the impact of translationese. In light of our findings, we provide a set of suggestions for MT training and evaluation. Our code and data are at https://github.com/EdisonNi-hku/CausalMT
2021
Causal Direction of Data Collection Matters: Implications of Causal and Anticausal Learning for NLP
Zhijing Jin | Julius von Kügelgen | Jingwei Ni | Tejas Vaidhya | Ayush Kaushal | Mrinmaya Sachan | Bernhard Schölkopf
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
The principle of independent causal mechanisms (ICM) states that generative processes of real-world data consist of independent modules which do not influence or inform each other. While this idea has led to fruitful developments in the field of causal inference, it is not widely known in the NLP community. In this work, we argue that the causal direction of the data collection process bears nontrivial implications that can explain a number of published NLP findings, such as differences in semi-supervised learning (SSL) and domain adaptation (DA) performance across different settings. We categorize common NLP tasks according to their causal direction and empirically assay the validity of the ICM principle for text data using minimum description length. We conduct an extensive meta-analysis of over 100 published SSL and 30 DA studies, and find that the results are consistent with our expectations based on causal insights. This work presents the first attempt to analyze the ICM principle in NLP, and provides constructive suggestions for future modeling choices.
Co-authors
- Markus Leippold 11
- Mrinmaya Sachan 10
- Elliott Ash 7
- Tobias Schimanski 6
- Zhijing Jin 4
- Dominik Stammbach 4
- Alexander Miserlis Hoyle 3
- Saeid A. Vaghefi 2
- Timothy Baldwin 2
- Julia Bingler 2
- Kalyan Dutia 2
- Ekaterina Fadeeva 2
- Yu Fan 2
- Dirk Hovy 2
- Mathias Kraus 2
- Veruska Muccione 2
- Donya Rooein 2
- Bernhard Schölkopf 2
- Artem Shelmanov 2
- Minjing Shi 2
- Alok Singh 2
- Qian Wang 2
- Tianyi Wu 2
- Vilém Zouhar 2
- Michael Aerni 1
- Mubashara Akhtar 1
- Badr AlKhamissi 1
- Mohammad Hossein Amani 1
- Matin Ansaripour 1
- Saeid Ashraf Vaghefi 1
- Ilia Badanin 1
- Harold Benoit 1
- Emanuela Boroş 1
- Antoine Bosselut 1
- Nicholas John Browning 1
- Fabian Bösch 1
- Maximilian Böther 1
- Lorena Calvo-Bartolomé 1
- Niklas Canova 1
- Camille Challier 1
- Clément Charmillot 1
- Tiancheng Chen 1
- Christophe Christiaen 1
- Jonathan Coles 1
- Chiara Colesanti-Senni 1
- Jan Milan Deriu 1
- Arnout Devos 1
- Lukas Drescher 1
- Daniil Dzenhaliou 1
- Maud Ehrmann 1
- Mennatallah El-Assady 1
- Dongyang Fan 1
- Simin Fan 1
- Negar Foroutan 1
- Markus Freitag 1
- Daria Galimzianova 1
- Silin Gao 1
- Dhia Garbaya 1
- Miguel Gila 1
- Juan Garcia Giraldo 1
- Glen Gostlow 1
- María Grandury 1
- Iryna Gurevych 1
- Çağlar Gülçehre 1
- Ido Hakimi 1
- Diba Hashemi 1
- Peter Henderson 1
- Alejandro Hernández-Cano 1
- Torsten Hoefler 1
- Allen Hao Huang 1
- Alexander Hägele 1
- Alexander Ilic 1
- Mete Ismayilzada 1
- Martin Jaggi 1
- Jiaming Jiang 1
- Ayush Kaushal 1
- Igor Kiselev 1
- Mark Klein 1
- Ana Klimovic 1
- Andreas Krause 1
- Andrei Kucharavy 1
- Anastasiia Kucherenko 1
- Neetu Kushwaha 1
- Meihong Lin 1
- Frederike Lübeck 1
- Roman Machacek 1
- Theofilos Ioannis Manitaras 1
- Christoper Manning 1
- Andreas Marfurt 1
- Inés Altemir Marinas 1
- Roberto Spacey Martín 1
- Kyle Matoba 1
- Simon Matrenok 1
- Viktor Mazanov 1
- Henrique Mendonça 1
- Bettina Messmer 1
- Skander Moalla 1
- Fawzi Roberto Mohamed 1
- Syrielle Montariol 1
- Gaku Morio 1
- Luca Mouchel 1
- Sven Najem-Meyer 1
- Preslav Nakov 1
- See Kiong Ng 1
- Quang-Chieu Nguyen 1
- Gennaro Oliva 1
- Matteo Pagliardini 1
- Sankalan Pal Chowdhury 1
- Elia Palme 1
- Andrei Panferov 1
- Léo Paoletti 1
- Marco Passerini 1
- Ivan Pavlov 1
- Auguste Poiroux 1
- Kaustubh Ponkshe 1
- Barna Pásztor 1
- Martin Rajman 1
- Nathan Ranchin 1
- Javier Rando 1
- Nicola Ranger 1
- Angelika Romanou 1
- David Rosenthal 1
- Nikolai Rozanov 1
- Vinko Sabolčec 1
- Mathieu Sauser 1
- Jakhongir Saydaliev 1
- Mukhammadali Sayfiddinov 1
- Imanol Schlag 1
- Marian Schneider 1
- Thomas C. Schulthess 1
- Stefano Schuppli 1
- Marco Scialanga 1
- Andrei Semenov 1
- Sergey Senichev 1
- Kumar Shridhar 1
- Raghav Singhal 1
- Vladislav Smirnov 1
- Antoni-Joan Solergibert 1
- Anna Sotnikova 1
- Alexander Sternfeld 1
- Alba (Ruiran) Su 1
- Minh Ngoc Ta 1
- Ayush Kumar Tarun 1
- Paul Teiletche 1
- Florian Tramèr 1
- Tejas Vaidhya 1
- Jannis Vamvas 1
- Joost VandeVondele 1
- Artem Vazhentsev 1
- Livio Veraldi 1
- Junling Wang 1
- Nicolas Webersinke 1
- Tobias Wekhof 1
- Chenfei Xiong 1
- Yixuan Xu 1
- Xiaozhe Yao 1
- Tingyu Yu 1
- Jiaheng Zhang 1
- Hao Zhao 1
- Julius von Kügelgen 1
- Eduard Frank Ďurech 1