Ayush Garg
2025
Grounded, or a Good Guesser? A Per-Question Balanced Dataset to Separate Blind from Grounded Models for Embodied Question Answering
Miles Shelton | Nate Wingerd | Kritim K Rijal | Ayush Garg | Adelina Gutic | Brett Barnes | Catherine Finegan-Dollak
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Embodied question answering (EQA) means using *perception of* and *action in* an environment to answer natural language questions about that environment. However, previous work has demonstrated that blind language models (which do not incorporate perception, but predict an answer based solely on the question text) are a strong baseline for existing benchmarks, even compared against state-of-the-art vision and language models. To determine whether a model is grounding its answers in its specific environment, rather than relying on a language model’s expectations about the world generally, we propose PQB-EQA, a *per-question balanced* EQA dataset. In this new benchmark, every question appears twice, paired with two different environments that yield two different answers. That is, the answer distribution is balanced for each question, not just across the whole dataset. We show both theoretically and empirically that grounding in the environment is necessary to perform better than chance on PQB-EQA.
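The per-question balancing described in the abstract is easy to picture with a toy sketch. The Python snippet below uses made-up questions, scene names, and field names, not the actual PQB-EQA schema; it only illustrates why a blind model that ignores the environment can do no better than chance on paired examples.

```python
# Hypothetical illustration of a per-question balanced dataset: each question
# appears twice, paired with two environments whose ground-truth answers differ.
# Questions, scene names, and field names are invented for this sketch.
from collections import Counter

pqb_examples = [
    {"question": "What color is the sofa?", "environment": "scene_A", "answer": "red"},
    {"question": "What color is the sofa?", "environment": "scene_B", "answer": "blue"},
    {"question": "Is the fridge door open?", "environment": "scene_C", "answer": "yes"},
    {"question": "Is the fridge door open?", "environment": "scene_D", "answer": "no"},
]

# A blind model predicts from the question text alone, so it must give the same
# answer to both copies of a question; with balanced answers it can be correct
# on at most one of the two, i.e. 50% -- chance level.
for question, count in Counter(ex["question"] for ex in pqb_examples).items():
    answers = {ex["answer"] for ex in pqb_examples if ex["question"] == question}
    print(f"{question!r}: {count} copies, {len(answers)} distinct answers")
```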
2021
MIPE: A Metric Independent Pipeline for Effective Code-Mixed NLG Evaluation
Ayush Garg | Sammed Kagi | Vivek Srivastava | Mayank Singh
Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems
Code-mixing is a phenomenon of mixing words and phrases from two or more languages in a single utterance of speech and text. Due to the high linguistic diversity, code-mixing presents several challenges in evaluating standard natural language generation (NLG) tasks. Various widely popular metrics perform poorly with the code-mixed NLG tasks. To address this challenge, we present a metric-independent evaluation pipeline MIPE that significantly improves the correlation between evaluation metrics and human judgments on the generated code-mixed text. As a use case, we demonstrate the performance of MIPE on the machine-generated Hinglish (code-mixing of Hindi and English languages) sentences from the HinGE corpus. We can extend the proposed evaluation strategy to other code-mixed language pairs, NLG tasks, and evaluation metrics with minimal to no effort.
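For context, the metric-to-human correlation the abstract refers to is commonly measured with a rank correlation. The sketch below is a minimal, self-contained example with made-up scores; it shows only how such a correlation is computed, not MIPE's own pipeline.

```python
# Minimal sketch: measuring how well an automatic metric's scores track human
# judgments for generated sentences. All numbers below are invented.
from scipy.stats import spearmanr

metric_scores   = [0.41, 0.55, 0.12, 0.78, 0.33]  # per-sentence metric scores
human_judgments = [2, 4, 1, 5, 3]                  # human ratings on a 1-5 scale

rho, p_value = spearmanr(metric_scores, human_judgments)
print(f"Spearman correlation: {rho:.2f} (p = {p_value:.3f})")
```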
Co-authors
- Brett Barnes 1
- Catherine Finegan-Dollak 1
- Adelina Gutic 1
- Sammed Kagi 1
- Kritim K Rijal 1