Akriti Jain
2026
Knowing What’s Missing: Assessing Information Sufficiency in Question Answering
Akriti Jain | Aparna Garimella
Findings of the Association for Computational Linguistics: EACL 2026
Determining whether a provided context contains sufficient information to answer a question is a critical challenge for building reliable question-answering systems. While simple prompting strategies have shown success on factual questions, they frequently fail on inferential ones that require reasoning beyond direct text extraction. We hypothesize that asking a model to first reason about what specific information is missing provides a more reliable, implicit signal for assessing overall sufficiency. To this end, we propose a structured Identify-then-Verify framework for robust sufficiency modeling. Our method first generates multiple hypotheses about missing information and establishes a semantic consensus. It then performs a critical verification step, forcing the model to re-examine the source text to confirm whether this information is truly absent. We evaluate our method against established baselines across diverse multi-hop and factual QA datasets. The results demonstrate that by guiding the model to justify its claims about missing information, our framework produces more accurate sufficiency judgments while clearly articulating any information gaps.
MARCH: Evaluating the Intersection of Ambiguity Interpretation and Multi-hop Inference
Jeonghyun Park | Ingeol Baek | Seunghyun Yoon | Haeun Jang | Aparna Garimella | Akriti Jain | Nedim Lipka | Hwanhee Lee
Findings of the Association for Computational Linguistics: ACL 2026
Real-world multi-hop QA is naturally linked with ambiguity, where a single query can trigger multiple reasoning paths that require independent resolution. Since ambiguity can occur at any stage, models must navigate layered uncertainty throughout the entire reasoning chain. Despite its prevalence in real-world user queries, previous benchmarks have primarily focused on single-hop ambiguity, leaving the complex interaction between multi-step inference and layered ambiguity underexplored. In this paper, we introduce MARCH, a benchmark for their intersection, with 2,209 multi-hop ambiguous questions curated via multi-LLM verification and validated by human annotation with strong agreement. Our experiments reveal that even state-of-the-art models struggle with MARCH, confirming that combining ambiguity resolution with multi-step reasoning is a significant challenge. To address this, we propose CLARION, a two-stage agentic framework that explicitly decouples ambiguity planning from evidence-driven reasoning, significantly outperforms existing approaches, and paves the way for robust reasoning systems.
AnalystBench: Benchmarking professional long-form report generation with web-mined multimodal tasks
Chau Minh Pham | Zichao Wang | Puneet Mathur | Alexa Siu | Akriti Jain | Aparna Garimella | Ananya B. Sai | Nedim Lipka | Mohit Iyyer | Varun Manjunatha
Findings of the Association for Computational Linguistics: ACL 2026
Large language models are increasingly used to draft long-form multimodal documents, but their end-to-end performance on professional report generation remains systematically understudied. We introduce AnalystBench, a continually extensible benchmark of 20 real-world report generation tasks grounded in multimodal document collections, where models must process millions of input tokens to produce long-form professional reports. Using expert-validated quality checklists and groundedness evaluation, we evaluate LLMs and coding agents and find that the best model, GPT-5.1, scores highly on executive summarization tasks (exceeding 90% on quality checklists) but degrades substantially on tasks requiring long-horizon synthesis over large inputs (dropping to 25-40%). Agent-based generation substantially benefits strong closed-source models such as GPT-5.1, with checklist scores improving by 20.24 percentage points and visual coverage by 37.41 points over vanilla generation, but offers little or negative gains for open-source models like DeepSeek-R1 (-3.02 points). Expert reviewers note that while generated reports are grounded and clearly separate factual description from interpretation, they often fall short in actionability, clarity, and quantitative precision, which highlights the gap between system performance and real-world professional needs.
Decisive: Guiding User Decisions with Optimal Preference Elicitation from Unstructured Documents
Akriti Jain | Anish Mulay | Divyansh Verma | Aishani Pandey | Pritika Ramu | Aparna Garimella
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Decision-making is a cognitively intensive task that requires synthesizing relevant information from multiple unstructured sources, weighing competing factors, and incorporating subjective user preferences. Existing methods, including large language models and traditional decision-support systems, fall short: they often overwhelm users with information or fail to capture nuanced preferences accurately. We present Decisive, an interactive decision-making framework that combines document-grounded reasoning with Bayesian preference inference. Our approach grounds decisions in an objective option-scoring matrix extracted from source documents, while actively learning a user’s latent preference vector through targeted elicitation. Users answer pairwise tradeoff questions adaptively selected to maximize information gain over the final decision. This process converges efficiently, minimizing user effort while ensuring recommendations remain transparent and personalized. Through extensive experiments, we demonstrate that our approach significantly outperforms both general-purpose LLMs and existing decision-making frameworks, achieving up to 20% improvement in decision accuracy over strong baselines across domains.
2025
Modeling Contextual Passage Utility for Multihop Question Answering
Akriti Jain | Aparna Garimella
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Multihop Question Answering (QA) requires systems to identify and synthesize information from multiple text passages. While most prior retrieval methods assist in identifying relevant passages for QA, further assessing the utility of the passages can help in removing redundant ones, which may otherwise add noise and inaccuracies to the generated answers. Existing utility prediction approaches model passage utility independently, overlooking a critical aspect of multi-hop reasoning: the utility of a passage can be context-dependent, influenced by its relation to other passages (whether it provides complementary information or forms a crucial link in conjunction with others). In this paper, we propose a lightweight approach to model contextual passage utility, accounting for inter-passage dependencies. We fine-tune a small transformer-based model to predict passage utility scores for multihop QA. To obtain synthetic training data, we leverage the reasoning traces from an advanced reasoning model to capture the order in which passages are used to answer a question. Through comprehensive experiments, we demonstrate that our utility-based scoring of retrieved passages leads to better reranking and downstream task performance compared to relevance-based reranking methods.
FiRST: Finetuning Router-Selective Transformers for Input-Adaptive Latency Reduction
Akriti Jain | Saransh Sharma | Koyel Mukherjee | Soumyabrata Pal
Findings of the Association for Computational Linguistics: EMNLP 2025
Auto-regressive Large Language Models (LLMs) demonstrate remarkable performance across different domains such as vision and language tasks. However, due to sequential processing through multiple transformer layers, autoregressive decoding faces significant computational challenges, particularly in resource-constrained environments like mobile and edge devices. Existing approaches in the literature that aim to improve latency by skipping layers come in two distinct flavors: (1) early exit, and (2) input-agnostic heuristics where tokens exit at pre-determined layers irrespective of the input sequence. Both strategies have limitations: the former cannot be applied in the presence of KV caching, which is essential for speed-ups in modern inference frameworks, and the latter fails to capture variation in layer importance across tasks or, more generally, across input sequences. To address these limitations, we propose FiRST, a model-agnostic framework that reduces inference latency by using layer-specific routers to adaptively skip transformer layers during decoding, based on routing decisions made from the input prompt in the prefill stage. FiRST remains fully compatible with KV caching, enabling faster decoding while maintaining quality. Our method reveals that input adaptivity is essential: different tasks rely on different subsets of layers to evolve meaningful representations. Extensive experiments show that FiRST significantly reduces latency while outperforming existing layer selection strategies in quality. It retains performance comparable to the base model without skipping. FiRST is thus a promising and efficient solution for LLM deployment in low-resource environments.
Doc2Chart: Intent-Driven Zero-Shot Chart Generation from Documents
Akriti Jain | Pritika Ramu | Aparna Garimella | Apoorv Saxena
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large Language Models (LLMs) have demonstrated strong capabilities in transforming text descriptions or tables to data visualizations via instruction-tuning methods. However, it is not straightforward to apply these methods directly to the more realistic use case of visualizing data from long documents based on user-given intents, as opposed to the user pre-selecting the relevant content manually. We introduce the task of _intent-based chart generation_ from documents: given a user-specified intent and document(s), the goal is to generate a chart adhering to the intent and grounded in the document(s) in a zero-shot setting. We propose an unsupervised, two-stage framework in which an LLM first extracts relevant information from the document(s) by decomposing the intent and iteratively validates and refines this data. Next, a heuristic-guided module selects an appropriate chart type before final code generation. To assess the data accuracy of the generated charts, we propose an attribution-based metric that uses a structured textual representation of charts, instead of relying on visual decoding metrics that often fail to capture the chart data effectively. To validate our approach, we curate a dataset comprising 1,242 <intent, document, charts> tuples from two domains, finance and scientific, in contrast to the existing datasets that are largely limited to parallel text descriptions/tables and their corresponding charts. We compare our approach with baselines using single-shot chart generation with LLMs and query-based retrieval methods; our method outperforms the best baselines by up to 9 points and 17 points in terms of chart data accuracy and chart type, respectively.
2020
IlliniMet: Illinois System for Metaphor Detection with Contextual and Linguistic Information
Hongyu Gong | Kshitij Gupta | Akriti Jain | Suma Bhat
Proceedings of the Second Workshop on Figurative Language Processing
Metaphors are rhetorical uses of words based on conceptual mapping, as opposed to their literal use. Metaphor detection, an important task in language understanding, aims to identify metaphors at the word level in given sentences. We present IlliniMet, a system to automatically detect metaphorical words. Our model combines the strengths of the contextualized representations from the widely used RoBERTa model and the rich linguistic information from external resources such as WordNet. The proposed approach is shown to outperform strong baselines on a benchmark dataset. Our best model achieves F1 scores of 73.0% on VUA ALLPOS, 77.1% on VUA VERB, 70.3% on TOEFL ALLPOS and 71.9% on TOEFL VERB.