Mansi Uniyal
2026
Agentic Context Strategies for Multi-Format Document Understanding: When Should Language Models Use Tools?
Mansi Uniyal | Mukul Singh | Ryan Nadel
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Large language models face fundamental trade-offs when processing long documents: full context is expensive and may exceed limits, while RAG risks missing relevant information. We evaluate four context strategies across six frontier models on three document formats (Word, Excel, and PowerPoint). Our key finding: agentic tool-augmented approaches dramatically outperform passive strategies, with RAG+Tools achieving 46% accuracy vs 6% for RAG-only. Tool benefits are consistent across formats (+28-40 points) and models. We further show that (1) intelligent routing matters more than iteration count, (2) tools provide unique capability beyond reasoning loops, and (3) forcing active exploration matches providing context proactively. These results suggest tool augmentation is crucial for complex document QA.
2024
One-to-many testing for code generation from (just) natural language
Mansi Uniyal | Mukul Singh | Gust Verbruggen | Sumit Gulwani | Vu Le
Findings of the Association for Computational Linguistics: EMNLP 2024
MBPP is a popular dataset for evaluating code generation from natural language. Despite its popularity, it has three problems: (1) it relies on providing test cases to generate the right signature, (2) instructions and evaluation test cases are poorly aligned, and (3) it is contaminated, with its exact phrasing present in training datasets. We adapt MBPP to emphasize generating code from just natural language by (1) removing ambiguity about the semantics of the task from the descriptions, and (2) evaluating generated code on multiple sets of assertions to account for ambiguity in the syntax. We compare popular open and closed weight models on the original (MBPP) and adapted (MBUPP) datasets.
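The idea of evaluating generated code on multiple sets of assertions can be illustrated with a minimal sketch. It is not the authors' evaluation harness: the acceptance rule (a candidate passes if every assertion in at least one set holds), the helper names `passes_assertion_set` and `passes_one_to_many`, and the example task are all assumptions made for illustration.

```python
def passes_assertion_set(code: str, assertions: list[str]) -> bool:
    """Return True if `code` runs and every assertion in the set holds."""
    namespace: dict = {}
    try:
        exec(code, namespace)  # define the candidate function
        for assertion in assertions:
            exec(assertion, namespace)  # raises AssertionError on failure
    except Exception:
        # Covers failed assertions, signature mismatches, and syntax errors.
        return False
    return True


def passes_one_to_many(code: str, assertion_sets: list[list[str]]) -> bool:
    """Assumed rule: accept the candidate if any one assertion set fully passes."""
    return any(passes_assertion_set(code, s) for s in assertion_sets)


# Hypothetical task: "add two numbers". The description does not say whether
# the numbers arrive as two arguments or as one list, so both are accepted.
assertion_sets = [
    ["assert add(1, 2) == 3", "assert add(0, 0) == 0"],      # two arguments
    ["assert add([1, 2]) == 3", "assert add([0, 0]) == 0"],  # one list argument
]

candidate_two_args = "def add(a, b):\n    return a + b"
candidate_list_arg = "def add(nums):\n    return nums[0] + nums[1]"
candidate_wrong = "def add(a, b):\n    return a - b"
```

Under this assumed rule, both `candidate_two_args` and `candidate_list_arg` are accepted because each satisfies one assertion set, while `candidate_wrong` is rejected because it satisfies none. Running untrusted generated code with `exec` is unsafe outside a sandbox; it is used here only to keep the sketch short.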