Tasnim Kabir
2026
AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA
Tasnim Kabir | Dmytro Kurdydyk | Aadi Palnitkar | Liam Dorn | Ahmed Haj Ahmed | Jordan Lee Boyd-Graber
Findings of the Association for Computational Linguistics: ACL 2026
Existing audio question answering benchmarks largely emphasize sound event classification or caption-grounded queries, often enabling models to succeed through shortcut strategies, short-duration cues, lexical priors, dataset-specific biases, or even bypassing audio via metadata and captions rather than genuine reasoning. Thus, we present AUDITA (Audio Understanding from Diverse Internet Trivia Authors), a large-scale, real-world benchmark to rigorously evaluate audio reasoning beyond surface-level acoustic recognition. AUDITA comprises carefully curated, human-authored trivia questions grounded in real-world audio, designed to stress robust auditory reasoning through challenging distractors and long-range temporal dependencies, using probing queries that cannot be answered from isolated text or sound cues alone. Human average accuracy of 32.13% shows both the challenge of the task and meaningful comprehension of the audio. In stark contrast, state-of-the-art audio question answering models perform poorly, with average accuracy below 8.86%. Beyond raw accuracy, we apply Item Response Theory (IRT) to estimate latent proficiency and question difficulty, and to expose systematic deficiencies of the models and data.
2024
You Make me Feel like a Natural Question: Training QA Systems on Transformed Trivia Questions
Tasnim Kabir | Yoo Yeon Sung | Saptarashmi Bandyopadhyay | Hao Zou | Abhranil Chandra | Jordan Lee Boyd-Graber
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Training question-answering (QA) and information retrieval systems for web queries requires large, expensive datasets that are difficult to annotate and time-consuming to gather. Moreover, while natural datasets of information-seeking questions are often ambiguous or ill-formed, there are troves of freely available, carefully crafted question datasets for many languages. Thus, we automatically generate shorter, information-seeking questions, resembling web queries in the style of the Natural Questions (NQ) dataset, from longer trivia data. Training a QA system on these transformed questions is a viable alternative to more expensive training setups, with an F1 score difference of less than six points between the final systems.
2021
The University of Maryland, College Park Submission to Large-Scale Multilingual Shared Task at WMT 2021
Saptarashmi Bandyopadhyay | Tasnim Kabir | Zizhen Lian | Marine Carpuat
Proceedings of the Sixth Conference on Machine Translation
This paper describes the system submitted to the Large-Scale Multilingual Shared Task (Small Task #2) at WMT 2021. It is based on the massively multilingual open-source model FLORES101_MM100, with selective fine-tuning. Our best-performing system achieved an average BLEU score of 15.72 on the task.
The UMD Submission to the Explainable MT Quality Estimation Shared Task: Combining Explanation Models with Sequence Labeling
Tasnim Kabir | Marine Carpuat
Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems
This paper describes the UMD submission to the Explainable Quality Estimation Shared Task at the EMNLP 2021 Workshop on “Evaluation & Comparison of NLP Systems”. We participated in the word-level and sentence-level MT Quality Estimation (QE) constrained tasks for all language pairs: Estonian-English, Romanian-English, German-Chinese, and Russian-German. Our approach combines the predictions of a word-level explainer model on top of a sentence-level QE model and a sequence labeler trained on synthetic data. These models are based on pre-trained multilingual language models and do not require any word-level annotations for training, making them well suited to zero-shot settings. Our best-performing system improves over the best baseline across all metrics and language pairs, with an average gain of 0.1 in AUC, Average Precision, and Recall at Top-K score.