Michael Xie
2026
BenchMarker: An Education-Inspired Toolkit for Highlighting Flaws in Multiple-Choice Benchmarks
Nishant Balepur | Bhavya Rajasekaran | Hyunjin Jane Oh | Michael Xie | Atrey Desai | Vipul Gupta | Steven James Moore | Eunsol Choi | Rachel Rudinger | Jordan Lee Boyd-Graber
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multiple-choice question answering (MCQA) is standard in NLP, but benchmarks lack rigorous quality control. We present BenchMarker, an education-inspired toolkit using LLM judges to flag three common MCQ flaws: 1) contamination (items appearing exactly online); 2) shortcuts (cues in the choices that enable guessing); and 3) writing errors (structural/grammatical issues based on a 19-rule education rubric). We validate BenchMarker with human annotations, then run the tool to audit 12 benchmarks, revealing: 2) contaminated MCQs tend to inflate accuracy, while writing errors tend to lower it and change rankings beyond random; and 3) prior benchmark repairs address their targeted issues (i.e., lowering accuracy with LLM-written distractors), but inadvertently add new flaws (i.e., implausible distractors, many correct answers). Overall, flaws in MCQs degrade NLP evaluation, but education research offers a path forward. We release BenchMarker to bridge the fields and improve MCQA benchmark design.
2023
Automatic Model Selection with Large Language Models for Reasoning
James Zhao | Yuxi Xie | Kenji Kawaguchi | Junxian He | Michael Xie
Findings of the Association for Computational Linguistics: EMNLP 2023
Chain-of-Thought (CoT) and Program-Aided Language Models (PAL) represent two distinct reasoning methods, each with its own strengths. CoT employs natural language, offering flexibility and interpretability, while PAL utilizes a programming language, yielding more structured and rigorous logic. We introduce a model selection method to combine the best of both worlds by employing a large language model (LLM) to dynamically select between them. Our theoretical analysis underscores the feasibility of this method, which is further corroborated by empirical results. Our proposed method demonstrates significant performance improvements across eight reasoning datasets with Codex, ChatGPT, and GPT-4. Additionally, our method is complementary to self-consistency; when integrated, it can further enhance performance while significantly reducing computation costs. Moreover, we achieve new state-of-the-art results on GSM8K and SVAMP, with respective accuracies of 96.8% and 93.7%.