Nandita Shankar Naik


2025

LMUNIT: Fine-grained Evaluation with Natural Language Unit Tests
Jon Saad-Falcon | Rajan Pathe Vivek | William Berrios | Nandita Shankar Naik | Matija Franklin | Bertie Vidgen | Amanpreet Singh | Douwe Kiela | Shikib Mehri
Findings of the Association for Computational Linguistics: EMNLP 2025

As language models become integral to critical workflows, assessing their behavior remains a fundamental challenge – human evaluation is costly and noisy, while automated metrics provide only coarse, difficult-to-interpret signals. We introduce natural language unit tests, a paradigm that decomposes response quality into explicit, testable criteria, along with a unified scoring model, LMUnit, which combines multi-objective training across preferences, direct ratings, and natural language rationales. Through controlled human studies, we show this paradigm significantly improves inter-annotator agreement and enables more effective LLM development workflows. LMUnit achieves state-of-the-art performance on evaluation benchmarks including FLASK, BigGenBench, and RewardBench 2, while maintaining competitive results on the original RewardBench. These results validate both our proposed paradigm and scoring model, suggesting a promising path forward for language model evaluation and development. Our code has been released at github.com/ContextualAI/LMUnit with an MIT license.
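As a concrete illustration of the paradigm, the sketch below renders a natural language unit test as an explicit, testable criterion scored against a candidate response. This is a hypothetical sketch, not the released LMUnit API: the class names, the scoring interface, and the 1-5 scale are assumptions.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class UnitTest:
        # An explicit, testable quality criterion phrased in natural language.
        criterion: str

    def evaluate(response: str, tests: list[UnitTest],
                 score_fn: Callable[[str, str], float]) -> dict[str, float]:
        # score_fn stands in for a trained scoring model (LMUnit-style),
        # mapping (response, criterion) to a rating, assumed here to be 1-5.
        return {t.criterion: score_fn(response, t.criterion) for t in tests}

    tests = [
        UnitTest("Does the response directly answer the question asked?"),
        UnitTest("Is every factual claim supported by the given context?"),
    ]
    # scores = evaluate(model_output, tests, score_fn=my_scoring_model)

Decomposing quality into per-criterion scores like this, rather than one holistic rating, is what the abstract credits for the improved inter-annotator agreement.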

2024

CommVQA: Situating Visual Question Answering in Communicative Contexts
Nandita Shankar Naik | Christopher Potts | Elisa Kreiss
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Current visual question answering (VQA) models tend to be trained and evaluated on image-question pairs in isolation. However, the questions people ask depend on their informational needs and prior knowledge about the image content. To evaluate how situating images within naturalistic contexts shapes visual questions, we introduce CommVQA, a VQA dataset consisting of images, image descriptions, real-world communicative scenarios where the image might appear (e.g., a travel website), and follow-up questions and answers conditioned on the scenario and description. CommVQA, which contains 1,000 images and 8,949 question-answer pairs, poses a challenge for current models. Error analyses and a human-subjects study suggest that generated answers still contain high rates of hallucinations, fail to appropriately address unanswerable questions, and do not suitably reflect contextual information.
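For concreteness, a single CommVQA-style record might take the shape sketched below. The field names and example values are illustrative assumptions inferred from the abstract, not the dataset's actual schema; the key point is that the question and answer are conditioned on both the image description and the communicative scenario.

    from dataclasses import dataclass

    @dataclass
    class CommVQAExample:
        image_path: str   # the image itself
        description: str  # natural-language description of the image
        scenario: str     # communicative context, e.g. "travel website"
        question: str     # follow-up question given scenario + description
        answer: str       # reference answer

    example = CommVQAExample(
        image_path="images/0001.jpg",
        description="A crowded beach at sunset with umbrellas along the shore.",
        scenario="travel website",
        question="Is this beach usually this busy in the evenings?",
        answer="The many visitors at sunset suggest it is popular then.",
    )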