2025
pdf
bib
abs
No Questions are Stupid, but some are Poorly Posed: Understanding Poorly-Posed Information-Seeking Questions
Neha Srikanth
|
Rachel Rudinger
|
Jordan Lee Boyd-Graber
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Questions help unlock information to satisfy users’ information needs. However, when a question is poorly posed, answerers (whether human or computer) may struggle to answer it in a way that satisfies the asker, despite possibly knowing everything necessary to address the asker’s latent information need. Using Reddit question-answer interactions from r/NoStupidQuestions, we develop a computational framework grounded in linguistic theory to study the poorly-posedness of questions: we generate spaces of potential interpretations of questions and compute distributions over these spaces based on the interpretations chosen both by human answerers in the Reddit question thread and by a suite of large language models. Both humans and models struggle to converge on dominant interpretations when faced with poorly-posed questions, but they employ different strategies: humans focus on specific interpretations through question negotiation, while models attempt comprehensive coverage by addressing many interpretations simultaneously.
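As a rough illustration of the interpretation-distribution idea described above, the sketch below computes an empirical distribution over interpretation labels chosen by answerers and its normalized entropy (0 when everyone converges on one reading, 1 when choices are uniform). The labels are hypothetical, and the paper's framework constructs interpretation spaces rather than assuming labels are given; this is only a sketch of the quantity being measured.

```python
from collections import Counter
from math import log

def interpretation_distribution(chosen: list[str]) -> dict[str, float]:
    """Empirical distribution over interpretation labels chosen by answerers."""
    counts = Counter(chosen)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def normalized_entropy(dist: dict[str, float]) -> float:
    """0 when all answerers pick one interpretation, 1 when choices are uniform."""
    if len(dist) <= 1:
        return 0.0
    h = -sum(p * log(p) for p in dist.values() if p > 0)
    return h / log(len(dist))

# Hypothetical example: five answerers to "Is coffee bad for you?" pick among three readings.
answers = ["health-long-term", "health-long-term", "caffeine-dose", "health-long-term", "pregnancy"]
dist = interpretation_distribution(answers)
print(dist, normalized_entropy(dist))
```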
pdf
bib
abs
Whose Boat Does it Float? Improving Personalization in Preference Tuning via Inferred User Personas
Nishant Balepur
|
Vishakh Padmakumar
|
Fumeng Yang
|
Shi Feng
|
Rachel Rudinger
|
Jordan Lee Boyd-Graber
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
LLMs are aligned to follow input instructions by learning which of two responses users prefer for a prompt. However, such preference data do not convey *why* users prefer responses that are chosen or rejected, so LLMs trained on these datasets cannot tailor responses to varied user needs. To surface these parameters of personalization, we apply *abductive reasoning* to preference data, inferring needs and interests of users, i.e., personas, that may prefer either response. We test this idea in two steps: **Persona Inference (PI)**—abductively inferring personas of users who prefer chosen or rejected outputs—and **Persona Tailoring (PT)**—training models to tailor outputs to personas from PI. We show: 1) LLMs infer personas that accurately explain why different users may prefer *both* chosen *and* rejected outputs; 2) Training on preference data augmented with PI personas via PT boosts personalization and generalizes to supporting user-written personas; and 3) Rejected-response personas form harder personalization evaluations, showing PT better aids users with uncommon preferences versus typical alignment methods. We argue for an abductive view of preferences for personalization, asking not only which response is better but when, why, and for whom.
pdf
bib
abs
Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above
Nishant Balepur
|
Rachel Rudinger
|
Jordan Lee Boyd-Graber
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multiple choice question answering (MCQA) is popular for LLM evaluation due to its simplicity and human-like testing, but we argue for its reform. We first reveal flaws in MCQA’s format, as it struggles to: 1) test generation/subjectivity; 2) match LLM use cases; and 3) fully test knowledge. We instead advocate for generative formats based on human testing—where LLMs construct and explain answers—better capturing user needs and knowledge while remaining easy to score. We then show even when MCQA is a useful format, its datasets suffer from: leakage; unanswerability; shortcuts; and saturation. In each issue, we give fixes from education, like rubrics to guide MCQ writing; scoring methods to bridle guessing; and Item Response Theory to build harder MCQs. Lastly, we discuss LLM errors in MCQA—robustness, biases, and unfaithful explanations—showing how our prior solutions better measure or address these issues. While we do not need to desert MCQA, we encourage more efforts in refining the task based on educational testing, advancing evaluations.
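One of the education-inspired fixes mentioned above, scoring methods that bridle guessing, can be illustrated with classic formula scoring from educational testing, where wrong answers are penalized so that random guessing has zero expected value. This is a generic sketch of that idea, not necessarily the paper's exact proposal.

```python
def formula_score(n_correct: int, n_wrong: int, n_options: int = 4) -> float:
    """Classic formula scoring: penalize each wrong answer by 1/(k-1) so that
    blind guessing among k options has an expected score of zero."""
    return n_correct - n_wrong / (n_options - 1)

# A test-taker with 30 correct and 12 wrong answers on 4-option MCQs.
print(formula_score(n_correct=30, n_wrong=12, n_options=4))  # 26.0
```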
pdf
bib
abs
Large Language Models Struggle to Describe the Haystack without Human Help: A Social Science-Inspired Evaluation of Topic Models
Zongxia Li
|
Lorena Calvo-Bartolomé
|
Alexander Miserlis Hoyle
|
Paiheng Xu
|
Daniel Kofi Stephens
|
Juan Francisco Fung
|
Alden Dima
|
Jordan Lee Boyd-Graber
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
A common use of NLP is to facilitate the understanding of large document collections, with models based on Large Language Models (LLMs) replacing probabilistic topic models. Yet the effectiveness of LLM-based approaches in real-world applications remains underexplored. This study measures the knowledge users acquire with topic models—including traditional, unsupervised, and supervised LLM-based approaches—on two datasets. While LLM-based methods generate more human-readable topics and show higher average win probabilities than traditional models for data exploration, they produce overly generic topics for domain-specific datasets that do not easily allow users to learn much about the documents. Adding human supervision to LLM-based topic models improves data exploration by addressing hallucination and genericity but requires more human effort. In contrast, traditional models like Latent Dirichlet Allocation (LDA) remain effective for exploration but are less user-friendly. This paper provides best practices—there is no one right model, the choice of models is situation-specific—and suggests potential improvements for scalable LLM-based topic models.
pdf
bib
abs
ProxAnn: Use-Oriented Evaluations of Topic Models and Document Clustering
Alexander Miserlis Hoyle
|
Lorena Calvo-Bartolomé
|
Jordan Lee Boyd-Graber
|
Philip Resnik
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Topic models and document-clustering evaluations either use automated metrics that align poorly with human preferences, or require expert labels that are intractable to scale. We design a scalable human evaluation protocol and a corresponding automated approximation that reflect practitioners’ real-world usage of models. Annotators—or an LLM-based proxy—review text items assigned to a topic or cluster, infer a category for the group, then apply that category to other documents. Using this protocol, we collect extensive crowdworker annotations of outputs from a diverse set of topic models on two datasets. We then use these annotations to validate automated proxies, finding that the best LLM proxy is statistically indistinguishable from a human annotator and can therefore serve as a reasonable substitute in automated evaluations.
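A minimal sketch of how one might check whether an LLM proxy behaves like "one more annotator": compare each annotator's agreement with the majority of the others, and see whether the proxy's score falls within the human range. The binary fit/no-fit labels below are hypothetical, and the paper's statistical comparison is more careful than this leave-one-out check.

```python
import numpy as np

def leave_one_out_agreement(labels) -> list[float]:
    """For each annotator (row), mean agreement of their labels with the
    majority vote of the remaining annotators. If an LLM proxy's score falls
    within the range spanned by humans, it behaves like another annotator.
    This is a simplified stand-in for the paper's evaluation."""
    labels = np.asarray(labels)            # shape: (annotators, items), 0/1 labels
    scores = []
    for i in range(labels.shape[0]):
        others = np.delete(labels, i, axis=0)
        majority = (others.mean(axis=0) >= 0.5).astype(int)
        scores.append(float((labels[i] == majority).mean()))
    return scores

# Hypothetical fit/no-fit judgments: three humans plus one LLM proxy on five documents.
humans = [[1, 0, 1, 1, 0], [1, 0, 1, 0, 0], [1, 1, 1, 1, 0]]
proxy = [1, 0, 1, 1, 0]
print(leave_one_out_agreement(humans + [proxy]))
```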
pdf
bib
abs
GRACE: A Granular Benchmark for Evaluating Model Calibration against Human Calibration
Yoo Yeon Sung
|
Eve Fleisig
|
Yu Hou
|
Ishan Upadhyay
|
Jordan Lee Boyd-Graber
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Language models are often miscalibrated, leading to confidently incorrect answers. We introduce GRACE, a benchmark for language model calibration that incorporates comparison with human calibration. GRACE consists of question-answer pairs, in which each question contains a series of clues that gradually become easier, all leading to the same answer; models must answer correctly as early as possible as the clues are revealed. This setting permits granular measurement of model calibration based on how early, accurately, and confidently a model answers. After collecting these questions, we host live human vs. model competitions to gather 1,749 data points on human and model teams’ timing, accuracy, and confidence. We propose a metric, CalScore, that uses GRACE to analyze model calibration errors and identify types of model miscalibration that differ from human behavior. We find that although humans are less accurate than models, humans are generally better calibrated. Since state-of-the-art models struggle on GRACE, it effectively evaluates progress on improving model calibration.
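Since the exact CalScore formula is not reproduced here, the sketch below only illustrates the general ingredient it builds on: a binned calibration error computed from per-clue confidences and correctness. Treat it as an assumption-laden stand-in, not the benchmark's metric, which additionally accounts for how early in the clue sequence an answer arrives.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10) -> float:
    """Generic binned ECE: mean |accuracy - confidence| weighted by bin size.
    Illustrative only; CalScore is defined differently and also rewards
    answering correctly early in the clue sequence."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return float(ece)

# One hypothetical question: confidence and correctness after each revealed clue.
conf_per_clue = [0.35, 0.55, 0.80, 0.95]
correct_per_clue = [0, 0, 1, 1]
print(expected_calibration_error(conf_per_clue, correct_per_clue, n_bins=5))
```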
pdf
bib
abs
Should I Trust You? Detecting Deception in Negotiations using Counterfactual RL
Wichayaporn Wongkamjan
|
Yanze Wang
|
Feng Gu
|
Denis Peskoff
|
Jonathan K. Kummerfeld
|
Jonathan May
|
Jordan Lee Boyd-Graber
Findings of the Association for Computational Linguistics: ACL 2025
An increasingly common socio-technical problem is people being taken in by offers that sound “too good to be true”, where persuasion and trust shape decision-making. This paper investigates how AI can help detect these deceptive scenarios. We analyze how humans strategically deceive each other in Diplomacy, a board game that requires both natural language communication and strategic reasoning. Doing so requires extracting logical forms representing proposals—agreements that players suggest during communication—and computing their relative rewards using agents’ value functions. Combined with text-based features, this improves our deception detection. Our method detects human deception with high precision, whereas a Large Language Model approach flags many true messages as deceptive. Future human-AI interaction tools can build on our methods for deception detection by triggering friction to give users a chance to interrogate suspicious proposals.
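A heavily simplified sketch of the counterfactual idea: compare each player's estimated value under the proposed joint move against a baseline alternative, and flag proposals that help the speaker while hurting the listener. The `value_fn` interface and the threshold are hypothetical placeholders, not the paper's implementation.

```python
def deception_signal(proposal, baseline, value_fn, speaker, listener, threshold=0.1):
    """Counterfactual-RL flavored check: a proposal that raises the speaker's
    estimated value while lowering the listener's (relative to a baseline
    alternative) is flagged as suspicious. `value_fn(player, state)` stands in
    for an agent's value function; the interface is illustrative."""
    speaker_gain = value_fn(speaker, proposal) - value_fn(speaker, baseline)
    listener_gain = value_fn(listener, proposal) - value_fn(listener, baseline)
    suspicious = speaker_gain > threshold and listener_gain < -threshold
    return {"speaker_gain": speaker_gain,
            "listener_gain": listener_gain,
            "suspicious": suspicious}

# Usage with any value estimator you already have (hypothetical):
# print(deception_signal(proposed_state, status_quo_state, my_value_fn, "FRANCE", "ENGLAND"))
```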
pdf
bib
abs
MoDS: Moderating a Mixture of Document Speakers to Summarize Debatable Queries in Document Collections
Nishant Balepur
|
Alexa Siu
|
Nedim Lipka
|
Franck Dernoncourt
|
Tong Sun
|
Jordan Lee Boyd-Graber
|
Puneet Mathur
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Query-focused summarization (QFS) gives a summary of documents to answer a query. Past QFS work assumes queries have one answer, ignoring debatable ones (*Is law school worth it?*). We introduce **Debatable QFS (DQFS)**, a task to create summaries that answer debatable queries via documents with opposing perspectives; summaries must *comprehensively cover* all sources and *balance perspectives*, favoring no side. These goals elude LLM QFS systems, which: 1) lack structured content plans, failing to guide LLMs to write balanced summaries, and 2) employ the same query to retrieve contexts across documents, failing to cover all perspectives specific to each document’s content. To overcome this, we design MoDS, a multi-LLM framework mirroring human panel discussions. MoDS treats documents as individual Speaker LLMs and has a Moderator LLM that picks speakers to respond to tailored queries for planned topics. Speakers use tailored queries to retrieve relevant contexts from their documents and supply perspectives, which are tracked in a rich outline, yielding a content plan to guide the final summary. Experiments on ConflictingQA with controversial web queries and DebateQFS, our new dataset of debate queries from Debatepedia, show MoDS beats SOTA by 38-59% in topic paragraph coverage and balance, based on new citation metrics. Users also find MoDS’s summaries to be readable and more balanced.
pdf
bib
abs
Is your benchmark truly adversarial? AdvScore: Evaluating Human-Grounded Adversarialness
Yoo Yeon Sung
|
Maharshi Gor
|
Eve Fleisig
|
Ishani Mondal
|
Jordan Lee Boyd-Graber
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Adversarial datasets should validate AI robustness by providing samples on which humans perform well, but models do not. However, as models evolve, datasets can become obsolete. Measuring whether a dataset remains adversarial is hindered by the lack of a standardized metric for measuring adversarialness. We propose ADVSCORE, a human-grounded evaluation metric that assesses a dataset’s adversarialness by capturing models’ and humans’ varying abilities, while also identifying poor examples. We then use ADVSCORE to motivate a new dataset creation pipeline for realistic and high-quality adversarial samples, enabling us to collect an adversarial question answering (QA) dataset, ADVQA. We apply ADVSCORE using 9,347 human responses and ten language models’ predictions to track model improvement over five years (2020–2024). ADVSCORE thus provides guidance for achieving robustness comparable with human capabilities. Furthermore, it helps determine to what extent adversarial datasets continue to pose challenges, ensuring that, rather than reflecting outdated or overly artificial difficulties, they effectively test model capabilities.
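As an illustration of human-grounded adversarialness, the sketch below computes a raw per-item gap between human and model accuracy. ADVSCORE itself models the varying abilities of humans and models (IRT-style) rather than using raw accuracy, so this is only a simplified stand-in for the quantity being captured.

```python
import numpy as np

def simple_adversarialness(human_correct, model_correct):
    """Illustrative per-item adversarialness: fraction of humans answering
    correctly minus fraction of models doing so. Positive values mean the item
    is easy for humans but hard for models. ADVSCORE additionally accounts for
    individual human/model skill instead of raw accuracy gaps."""
    human_correct = np.asarray(human_correct, dtype=float)   # shape: (items, humans)
    model_correct = np.asarray(model_correct, dtype=float)   # shape: (items, models)
    return human_correct.mean(axis=1) - model_correct.mean(axis=1)

# Three hypothetical questions, four humans, three models (1 = answered correctly).
humans = [[1, 1, 1, 0], [1, 0, 1, 1], [0, 0, 1, 0]]
models = [[0, 0, 1], [1, 1, 1], [0, 0, 0]]
print(simple_adversarialness(humans, models))  # item-level human-model gaps
```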
pdf
bib
abs
Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can’t Answer?
Nishant Balepur
|
Feng Gu
|
Abhilasha Ravichander
|
Shi Feng
|
Jordan Lee Boyd-Graber
|
Rachel Rudinger
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)
Question answering (QA)—giving correct answers to questions—is a popular task, but we test **reverse question answering (RQA)**: for an input answer, give a question with that answer. Past work tests QA and RQA separately, but we test them jointly, comparing their difficulty, aiding benchmark design, and checking reasoning consistency. We run 16 LLMs on QA and RQA with trivia questions/answers, revealing: 1) Versus QA, LLMs are much less accurate in RQA for numerical answers, but slightly more accurate in RQA for textual answers; 2) LLMs often answer their own invalid questions from RQA accurately in QA, so RQA errors are not just from knowledge gaps; 3) RQA errors correlate with question difficulty and inversely correlate with answer frequencies in the Dolma corpus; and 4) LLMs struggle to give valid multi-hop questions. By finding question and answer types that lead to RQA errors, we suggest improvements for LLM reasoning.
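The joint QA/RQA probe described above can be sketched as a round-trip consistency check: ask for a question with a given answer, then answer that generated question and compare. The `generate` callable is a hypothetical placeholder for whatever LLM client is available, and the matching rule is deliberately crude.

```python
def round_trip_check(answer: str, generate) -> dict:
    """Round-trip RQA -> QA consistency probe. `generate(prompt)` is a
    placeholder for any text-generation callable; it is an assumption here,
    not an interface from the paper."""
    question = generate(f"Write a trivia question whose answer is: {answer}")
    roundtrip = generate(f"Answer this question concisely: {question}")
    return {
        "answer": answer,
        "generated_question": question,
        "roundtrip_answer": roundtrip,
        # Crude string-containment check; real evaluation would use answer equivalence.
        "consistent": answer.strip().lower() in roundtrip.strip().lower(),
    }

# Usage with any LLM wrapper of your choice (hypothetical):
# result = round_trip_check("1776", my_llm_generate)
# print(result["consistent"])
```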
pdf
bib
abs
Personalized Help for Optimizing Low-Skilled Users’ Strategy
Feng Gu
|
Wichayaporn Wongkamjan
|
Jordan Lee Boyd-Graber
|
Jonathan K. Kummerfeld
|
Denis Peskoff
|
Jonathan May
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)
AIs can beat humans in game environments; however, how helpful those agents are to humans remains understudied. We augment Cicero, a natural language agent that demonstrates superhuman performance in Diplomacy, to generate both move and message advice based on player intentions. A dozen Diplomacy games with novice and experienced players, with varying advice settings, show that some of the generated advice is beneficial. It helps novices compete with experienced players and in some instances even surpass them. The mere presence of advice can be advantageous, even if players do not follow it.
2024
pdf
bib
abs
More Victories, Less Cooperation: Assessing Cicero’s Diplomacy Play
Wichayaporn Wongkamjan
|
Feng Gu
|
Yanze Wang
|
Ulf Hermjakob
|
Jonathan May
|
Brandon M. Stewart
|
Jonathan K. Kummerfeld
|
Denis Peskoff
|
Jordan Lee Boyd-Graber
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The boardgame Diplomacy is a challenging setting for communicative and cooperative artificial intelligence. The most prominent communicative Diplomacy AI, Cicero, has excellent strategic abilities, exceeding human players. However, the best Diplomacy players master communication, not just tactics, which is why the game has received attention as an AI challenge. This work seeks to understand the degree to which Cicero succeeds at communication. First, we annotate in-game communication with abstract meaning representation to separate in-game tactics from general language. Second, we run two dozen games with humans and Cicero, totaling over 200 human-player hours of competition. While AI can consistently outplay human players, AI-Human communication is still limited because of AI’s difficulty with deception and persuasion. This shows that Cicero relies on strategy and has not yet reached the full promise of communicative and cooperative AI.
pdf
bib
abs
KARL: Knowledge-Aware Retrieval and Representations aid Retention and Learning in Students
Matthew Shu
|
Nishant Balepur
|
Shi Feng
|
Jordan Lee Boyd-Graber
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Flashcard schedulers rely on 1) *student models* to predict the flashcards a student knows; and 2) *teaching policies* to pick which cards to show next via these predictions. Prior student models, however, just use study data like the student’s past responses, ignoring the text on cards. We propose **content-aware scheduling**, the first schedulers exploiting flashcard content. To give the first evidence that such schedulers enhance student learning, we build KARL, a simple but effective content-aware student model employing deep knowledge tracing (DKT), retrieval, and BERT to predict student recall. We train KARL by collecting a new dataset of 123,143 study logs on diverse trivia questions. KARL bests existing student models in AUC and calibration error. To ensure our improved predictions lead to better student learning, we create a novel delta-based teaching policy to deploy KARL online. Based on 32 study paths from 27 users, KARL improves learning efficiency over SOTA, showing KARL’s strength and encouraging researchers to look beyond historical study data to fully capture student abilities.
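To make the notion of a content-aware student model concrete, here is a toy recall predictor that mixes a history feature (the student's accuracy so far) with a content feature (similarity of the new card to cards already studied). KARL itself combines deep knowledge tracing, retrieval, and BERT; the weights and features below are invented for illustration.

```python
import numpy as np

def predict_recall(past_correct, past_total, card_vec, studied_vecs,
                   w=(1.5, 2.0), b=-1.0) -> float:
    """Toy content-aware recall predictor: logistic combination of a history
    feature and the max cosine similarity between the new card's embedding and
    previously studied cards. The weights are made up for illustration."""
    history = past_correct / max(past_total, 1)
    card_vec = card_vec / np.linalg.norm(card_vec)
    sims = [card_vec @ (v / np.linalg.norm(v)) for v in studied_vecs] or [0.0]
    content = max(sims)
    logit = w[0] * history + w[1] * content + b
    return float(1.0 / (1.0 + np.exp(-logit)))

# Hypothetical student: 7/10 past answers correct; new card resembles a studied card.
rng = np.random.default_rng(0)
studied = [rng.normal(size=16) for _ in range(3)]
new_card = studied[0] + 0.1 * rng.normal(size=16)   # near-duplicate content
print(predict_recall(7, 10, new_card, studied))
```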
pdf
bib
abs
A SMART Mnemonic Sounds like “Glue Tonic”: Mixing LLMs with Student Feedback to Make Mnemonic Learning Stick
Nishant Balepur
|
Matthew Shu
|
Alexander Hoyle
|
Alison Robey
|
Shi Feng
|
Seraphina Goldfarb-Tarrant
|
Jordan Lee Boyd-Graber
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Keyword mnemonics are memorable explanations that link new terms to simpler keywords. Prior work generates mnemonics for students, but does not train models on mnemonics that students prefer and that aid learning. We build SMART, a mnemonic generator trained on feedback from real students learning new terms. To train SMART, we first fine-tune LLaMA-2 on a curated set of user-written mnemonics. We then use LLM alignment to enhance SMART: we deploy mnemonics generated by SMART in a flashcard app to find preferences on mnemonics students favor. We gather 2684 preferences from 45 students across two types: **expressed** (inferred from ratings) and **observed** (inferred from student learning), yielding three key findings. First, expressed and observed preferences disagree; what students *think* is helpful does not always capture what is *truly* helpful. Second, Bayesian models can synthesize complementary data from multiple preference types into a single effectiveness signal. SMART is tuned via Direct Preference Optimization on this signal, which resolves ties and missing labels in the typical method of pairwise comparisons, augmenting data for LLM output quality gains. Third, mnemonic experts assess SMART as matching GPT-4 at much lower deployment costs, showing the utility of capturing diverse student feedback to align LLMs in education.
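A toy version of pooling two preference signals into one effectiveness estimate, using a Beta-Bernoulli posterior: expressed (rating-based) and observed (learning-based) wins for a single mnemonic are combined, with a weight on the observed signal. The paper's Bayesian model is richer; the prior and weight here are assumptions made only to show the pooling idea.

```python
def effectiveness_posterior(expressed_wins, expressed_trials,
                            observed_wins, observed_trials,
                            prior=(1.0, 1.0), weight_observed=2.0) -> float:
    """Toy Beta-Bernoulli pooling of two preference signals for one mnemonic.
    The observed (learning-based) signal is up-weighted; prior and weight are
    illustrative choices, not the paper's model."""
    a, b = prior
    a += expressed_wins + weight_observed * observed_wins
    b += (expressed_trials - expressed_wins) + weight_observed * (observed_trials - observed_wins)
    return a / (a + b)   # posterior mean effectiveness

# Hypothetical mnemonic: 6/10 expressed wins, 3/4 observed wins.
print(effectiveness_posterior(expressed_wins=6, expressed_trials=10,
                              observed_wins=3, observed_trials=4))  # 0.65
```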
pdf
bib
abs
You Make me Feel like a Natural Question: Training QA Systems on Transformed Trivia Questions
Tasnim Kabir
|
Yoo Yeon Sung
|
Saptarashmi Bandyopadhyay
|
Hao Zou
|
Abhranil Chandra
|
Jordan Lee Boyd-Graber
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Training question-answering (QA) and information retrieval systems for web queries requires large, expensive datasets that are difficult to annotate and time-consuming to gather. Moreover, while natural datasets of information-seeking questions are often ambiguous or ill-formed, there are troves of freely available, carefully crafted question datasets for many languages. Thus, we automatically generate shorter, information-seeking questions that resemble web queries in the style of the Natural Questions (NQ) dataset from longer trivia data. Training a QA system on these transformed questions is a viable alternative to more expensive training setups, with an F1 score difference of less than six points when contrasting the final systems.
pdf
bib
abs
Do great minds think alike? Investigating Human-AI Complementarity in Question Answering with CAIMIRA
Maharshi Gor
|
Hal Daumé III
|
Tianyi Zhou
|
Jordan Lee Boyd-Graber
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Recent advancements of large language models (LLMs) have led to claims of AI surpassing humans in natural language processing (NLP) tasks such as textual understanding and reasoning. This work investigates these assertions by introducing CAIMIRA, a novel framework rooted in item response theory (IRT) that enables quantitative assessment and comparison of problem-solving abilities in question-answering (QA) agents. Through analysis of over 300,000 responses from ~70 AI systems and 155 humans across thousands of quiz questions, CAIMIRA uncovers distinct proficiency patterns in knowledge domains and reasoning skills. Humans outperform AI systems in knowledge-grounded abductive and conceptual reasoning, while state-of-the-art LLMs like GPT-4 Turbo and Llama-3-70B demonstrate superior performance on targeted information retrieval and fact-based reasoning, particularly when information gaps are well-defined and addressable through pattern matching or data retrieval. These findings identify key areas for future QA tasks and model development, highlighting the critical need for questions that not only challenge higher-order reasoning and scientific thinking, but also demand nuanced linguistic and cross-contextual application.
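A minimal Rasch (1-parameter IRT) fit, sketched below, shows the kind of machinery CAIMIRA builds on: each agent gets an ability, each question a difficulty, and both are estimated from a response matrix. CAIMIRA's actual model is multidimensional and richer, so treat this as an illustration of IRT rather than the paper's method.

```python
import numpy as np

def fit_rasch(responses, n_iter=500, lr=0.05):
    """Minimal 1-parameter IRT (Rasch) fit by gradient ascent on the Bernoulli
    log-likelihood, with P(correct) = sigmoid(ability_agent - difficulty_item)."""
    responses = np.asarray(responses, dtype=float)   # shape: (agents, items), 1/0
    n_agents, n_items = responses.shape
    ability = np.zeros(n_agents)
    difficulty = np.zeros(n_items)
    for _ in range(n_iter):
        logits = ability[:, None] - difficulty[None, :]
        p = 1.0 / (1.0 + np.exp(-logits))
        err = responses - p                      # gradient of the log-likelihood
        ability += lr * err.sum(axis=1) / n_items
        difficulty -= lr * err.sum(axis=0) / n_agents
        difficulty -= difficulty.mean()          # fix the location of the scale
    return ability, difficulty

# Hypothetical: three agents (e.g., two LLMs and a human team) on four questions.
R = [[1, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 1, 1]]
abilities, difficulties = fit_rasch(R)
print(abilities, difficulties)
```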
pdf
bib
abs
AutoHallusion: Automatic Generation of Hallucination Benchmarks for Vision-Language Models
Xiyang Wu
|
Tianrui Guan
|
Dianqi Li
|
Shuaiyi Huang
|
Xiaoyu Liu
|
Xijun Wang
|
Ruiqi Xian
|
Abhinav Shrivastava
|
Furong Huang
|
Jordan Lee Boyd-Graber
|
Tianyi Zhou
|
Dinesh Manocha
Findings of the Association for Computational Linguistics: EMNLP 2024
Large vision-language models (LVLMs) are prone to hallucinations, where certain contextual cues in an image can trigger the language module to produce overconfident and incorrect reasoning about abnormal or hypothetical objects. While some benchmarks have been developed to investigate LVLM hallucinations, they often rely on hand-crafted corner cases whose failure patterns may not generalize well. Additionally, fine-tuning on these examples could undermine their validity. To address this, we aim to scale up the number of cases through an automated approach, reducing human bias in crafting such corner cases. This motivates the development of AutoHallusion, the first automated benchmark generation approach that employs several key strategies to create a diverse range of hallucination examples. Our generated visual-question pairs pose significant challenges to LVLMs, requiring them to overcome contextual biases and distractions to arrive at correct answers. AutoHallusion enables us to create new benchmarks at the minimum cost and thus overcomes the fragility of hand-crafted benchmarks. It also reveals common failure patterns and reasons, providing key insights to detect, avoid, or control hallucinations. Comprehensive evaluations of top-tier LVLMs, e.g., GPT-4V(ision), Gemini Pro Vision, Claude 3, and LLaVA-1.5, show a 97.7% and 98.7% success rate of hallucination induction on synthetic and real-world datasets of AutoHallusion, paving the way for a long battle against hallucinations. The codebase and data can be accessed at https://github.com/wuxiyang1996/AutoHallusion
pdf
bib
abs
PEDANTS: Cheap but Effective and Interpretable Answer Equivalence
Zongxia Li
|
Ishani Mondal
|
Huy Nghiem
|
Yijun Liang
|
Jordan Lee Boyd-Graber
Findings of the Association for Computational Linguistics: EMNLP 2024
Question answering (QA) can only make progress if we know whether an answer is correct, but current answer correctness (AC) metrics struggle with verbose, free-form answers from large language models (LLMs). There are two challenges with current short-form QA evaluations: a lack of diverse styles of evaluation data and an over-reliance on expensive and slow LLMs. LLM-based scorers correlate better with humans, but this expensive approach has only been tested on limited QA datasets. We rectify these issues by providing rubrics and datasets for evaluating machine QA adopted from the Trivia community. We also propose an efficient and interpretable QA evaluation that is more stable than exact match and neural methods (BERTScore).
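For context, a cheap rule-based equivalence signal of the kind PEDANTS improves upon is SQuAD-style token F1 over normalized answers; the sketch below implements that baseline, not the PEDANTS rubrics themselves.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation and articles, split into tokens (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1: a cheap, interpretable equivalence signal that is more
    forgiving than exact match for verbose LLM answers."""
    pred, ref = normalize(prediction), normalize(gold)
    common = Counter(pred) & Counter(ref)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("It was Marie Curie, the physicist.", "Marie Curie"))  # ~0.57
```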
pdf
bib
abs
SciDoc2Diagrammer-MAF: Towards Generation of Scientific Diagrams from Documents guided by Multi-Aspect Feedback Refinement
Ishani Mondal
|
Zongxia Li
|
Yufang Hou
|
Anandhavelu Natarajan
|
Aparna Garimella
|
Jordan Lee Boyd-Graber
Findings of the Association for Computational Linguistics: EMNLP 2024
Automating the creation of scientific diagrams from academic papers can significantly streamline the development of tutorials, presentations, and posters, thereby saving time and accelerating the process. Current text-to-image models (Rombach et al., 2022a; Belouadi et al., 2023) struggle with generating accurate and visually appealing diagrams from long-context inputs. We propose SciDoc2Diagram, a task that extracts relevant information from scientific papers and generates diagrams, along with a benchmarking dataset, SciDoc2DiagramBench. We develop a multi-step pipeline SciDoc2Diagrammer that generates diagrams based on user intentions using intermediate code generation. We observed that initial diagram drafts were often incomplete or unfaithful to the source, leading us to develop SciDoc2Diagrammer-Multi-Aspect-Feedback (MAF), a refinement strategy that significantly enhances factual correctness and visual appeal and outperforms existing models on both automatic and human judgement.