Maharshi Gor

2026

AI systems are fallible, and humans can make mistakes in deciding whether to trustAI over their own judgment. Thus, improving human-AI collaboration requires that we understand when,why, and how humans decide to rely on AI. We study two reliance decisions: delegating a task toAI without seeing its output (whether AI is used) and evaluating AI suggestions to decidewhether to adopt them how AI output shapes final decisions).Both matter for effective collaboration, yet prior work lacks naturalistic experiments capturing both patternsfor the same users. We address this gap by studying collaborative human–AI teams competing in aquestion-answering game in which humans can choose when and how to work with AI agents to win.Our 24 matches pair 23 expert humans with 16 AI agents, capturing 387 delegation and 1440 adoption decisions.While human–AI collaboration performs better than either AI or humansalone, humans make suboptimal collaboration decisions, bothunder-relying on correct AI suggestions (3.7% of opportunities missed) and over-relying when AI misleads them (1.5%).Both parties contribute wrong answers: reported model confidence is near chance when humans and AI disagree, while confirmation bias drives higher under-reliance (60.7%) when an AI suggestion agrees with humans’ initial incorrect answer.

2025

pdf bib abs

Is your benchmark truly adversarial? AdvScore: Evaluating Human-Grounded Adversarialness
Yoo Yeon Sung | Maharshi Gor | Eve Fleisig | Ishani Mondal | Jordan Boyd-Graber
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Adversarial datasets should validate AI robustness by providing samples on which humans perform well, but models do not. However, as models evolve, datasets can become obsolete. Measuring whether a dataset remains adversarial is hindered by the lack of a standardized metric for measuring adversarialness. We propose ADVSCORE, a human-grounded evaluation metric that assesses a dataset’s adversarialness by capturing models’ and humans’ varying abilities, while also identifying poor examples. We then use ADVSCORE to motivate a new dataset creation pipeline for realistic and high-quality adversarial samples, enabling us to collect an adversarial question answering (QA) dataset, ADVQA. We apply ADVSCORE using 9,347 human responses and ten language models’ predictions to track model improvement over five years (2020–2024). ADVSCORE thus provides guidance for achieving robustness comparable with human capabilities. Furthermore, it helps determine to what extent adversarial datasets continue to pose challenges, ensuring that, rather than reflecting outdated or overly artificial difficulties, they effectively test model capabilities.

2024

pdf bib abs

Do great minds think alike? Investigating Human-AI Complementarity in Question Answering with CAIMIRA
Maharshi Gor | Hal Daumé III | Tianyi Zhou | Jordan Boyd-Graber
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Recent advancements of large language models (LLMs)have led to claims of AI surpassing humansin natural language processing NLP tasks such as textual understanding and reasoning.%This work investigates these assertions by introducingCAIMIRA, a novel framework rooted in item response theory IRTthat enables quantitative assessment and comparison of problem-solving abilities inquestion-answering QA agents.%Through analysis of over 300,000 responses from ~ 70 AI systemsand 155 humans across thousands of quiz questions, CAIMIRA uncovers distinctproficiency patterns in knowledge domains and reasoning skills. %Humans outperform AI systems in knowledge-grounded abductive and conceptual reasoning,while state-of-the-art LLMs like GPT-4 Turbo and Llama-3-70B demonstrate superior performance ontargeted information retrieval and fact-based reasoning, particularly when information gapsare well-defined and addressable through pattern matching or data retrieval.%These findings identify key areas for future QA tasks and model development,highlighting the critical need for questions that not only challengehigher-order reasoning and scientific thinking, but also demand nuanced linguisticand cross-contextual application.

2021

pdf bib abs

MATE: Multi-view Attention for Table Transformer Efficiency
Julian Eisenschlos | Maharshi Gor | Thomas Müller | William Cohen
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

This work presents a sparse-attention Transformer architecture for modeling documents that contain large tables. Tables are ubiquitous on the web, and are rich in information. However, more than 20% of relational tables on the web have 20 or more rows (Cafarella et al., 2008), and these large tables present a challenge for current Transformer models, which are typically limited to 512 tokens. Here we propose MATE, a novel Transformer architecture designed to model the structure of web tables. MATE uses sparse attention in a way that allows heads to efficiently attend to either rows or columns in a table. This architecture scales linearly with respect to speed and memory, and can handle documents containing more than 8000 tokens with current accelerators. MATE also has a more appropriate inductive bias for tabular data, and sets a new state-of-the-art for three table reasoning datasets. For HybridQA (Chen et al., 2020), a dataset that involves large documents containing tables, we improve the best prior result by 19 points.

pdf bib abs

Toward Deconfounding the Effect of Entity Demographics for Question Answering Accuracy
Maharshi Gor | Kellie Webster | Jordan Boyd-Graber
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

The goal of question answering (QA) is to answer _any_ question. However, major QA datasets have skewed distributions over gender, profession, and nationality. Despite that skew, an analysis of model accuracy reveals little evidence that accuracy is lower for people based on gender or nationality; instead, there is more variation on professions (question topic) and question ambiguity. But QA’s lack of representation could itself hide evidence of bias, necessitating QA datasets that better represent global diversity.

Co-authors

Yu Hou 1

Venues

Fix author