Benjamin Börschinger
Also published as: Benjamin Boerschinger
2022
Tomayto, Tomahto. Beyond Token-level Answer Equivalence for Question Answering Evaluation
Jannis Bulian | Christian Buck | Wojciech Gajewski | Benjamin Börschinger | Tal Schuster
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
The predictions of question answering (QA) systems are typically evaluated against manually annotated finite sets of one or more answers. This leads to a coverage limitation that results in underestimating the true performance of systems, and is typically addressed by extending over exact match (EM) with predefined rules or with the token-level F1 measure. In this paper, we present the first systematic conceptual and data-driven analysis to examine the shortcomings of token-level equivalence measures. To this end, we define the asymmetric notion of answer equivalence (AE), accepting answers that are equivalent to or improve over the reference, and publish over 23k human judgements for candidates produced by multiple QA systems on SQuAD. Through a careful analysis of this data, we reveal and quantify several concrete limitations of the F1 measure, such as a false impression of graduality, or missing dependence on the question. Since collecting AE annotations for each evaluated model is expensive, we learn a BERT matching (BEM) measure to approximate this task. Being a simpler task than QA, we find BEM to provide significantly better AE approximations than F1, and to more accurately reflect the performance of systems. Finally, we demonstrate the practical utility of AE and BEM on the concrete application of minimal accurate prediction sets, reducing the number of required answers by up to ×2.6.
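For readers unfamiliar with the baseline the abstract contrasts BEM against, the sketch below shows token-level F1 as used in SQuAD-style evaluation. It is illustrative only: whitespace tokenization and lowercasing stand in for the official script's fuller answer normalization (which also strips punctuation and articles). BEM, by contrast, is roughly a BERT-based classifier over the (question, reference, candidate) triple, so its equivalence judgements can depend on the question.

```python
from collections import Counter

def token_f1(candidate: str, reference: str) -> float:
    """Token-level F1 between a candidate answer and one reference answer."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        # Two empty answers count as a match; otherwise no overlap is possible.
        return float(cand_tokens == ref_tokens)
    common = Counter(cand_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(cand_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# The score is purely lexical and ignores the question, one of the limitations
# the paper analyses: token_f1("November 2021", "November") ≈ 0.67 regardless
# of whether the question asked for a month or for a full date.
```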
2021
Fool Me Twice: Entailment from Wikipedia Gamification
Julian Eisenschlos | Bhuwan Dhingra | Jannis Bulian | Benjamin Börschinger | Jordan Boyd-Graber
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
We release FoolMeTwice (FM2 for short), a large dataset of challenging entailment pairs collected through a fun multi-player game. Gamification encourages adversarial examples, drastically lowering the number of examples that can be solved using “shortcuts” compared to other popular entailment datasets. Players are presented with two tasks. The first task asks the player to write a plausible claim based on the evidence from a Wikipedia page. The second one shows two plausible claims written by other players, one of which is false, and the goal is to identify it before the time runs out. Players “pay” to see clues retrieved from the evidence pool: the more evidence the player needs, the harder the claim. Game-play between motivated players leads to diverse strategies for crafting claims, such as temporal inference and diverting to unrelated evidence, and results in higher quality data for the entailment and evidence retrieval tasks. We open source the dataset and the game code.
2020
What Question Answering can Learn from Trivia Nerds
Jordan Boyd-Graber | Benjamin Börschinger
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
In addition to the traditional task of machines answering questions, question answering (QA) research creates interesting, challenging questions that help systems learn how to answer questions and reveal the best systems. We argue that creating a QA dataset—and the ubiquitous leaderboard that goes with it—closely resembles running a trivia tournament: you write questions, have agents (either humans or machines) answer the questions, and declare a winner. However, the research community has ignored the hard-learned lessons from decades of the trivia community creating vibrant, fair, and effective question answering competitions. After detailing problems with existing QA datasets, we outline the key lessons—removing ambiguity, discriminating skill, and adjudicating disputes—that can transfer to QA research and how they might be implemented.
Co-authors
- Mark Johnson 9
- Jordan Lee Boyd-Graber 2
- Jannis Bulian 2
- Katherine Demuth 2
- Emmanuel Dupoux 2
- Christian Buck 1
- Massimiliano Ciaramita 1
- Robert Dale 1
- Isabelle Dautriche 1
- Bhuwan Dhingra 1
- Mark Dras 1
- Lan Du 1
- Julian Eisenschlos 1
- Anas Elghafari 1
- Abdellah Fourtassi 1
- Wojciech Gajewski 1
- Bevan Jones 1
- François Lareau 1
- Vivi Nastase 1
- John K. Pate 1
- Tal Schuster 1
- Mark Steedman 1
- Michael Strube 1
- Gabriel Synnaeve 1
- Zhendong Zhao 1
- Cäcilia Zirn 1