Aaron Colak


2025

Memory-QA: Answering Recall Questions Based on Multimodal Memories
Hongda Jiang | Xinyuan Zhang | Siddhant Garg | Rishab Arora | Shiun-Zu Kuo | Jiayang Xu | Aaron Colak | Xin Luna Dong
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

We introduce Memory-QA, a novel real-world task that involves answering recall questions about visual content from previously stored multimodal memories. This task poses unique challenges, including the creation of task-oriented memories, the effective utilization of temporal and location information within memories, and the ability to draw upon multiple memories to answer a recall question. To address these challenges, we propose a comprehensive pipeline, Pensieve, integrating memory-specific augmentation, time- and location-aware multi-signal retrieval, and multi-memory QA fine-tuning. We created a multimodal benchmark to illustrate various real challenges in this task, and show the superior performance of Pensieve over state-of-the-art solutions (up to +14% on QA accuracy).

PrismRAG: Boosting RAG Factuality with Distractor Resilience and Strategized Reasoning
Mohammad Kachuee | Teja Gollapudi | Minseok Kim | Yin Huang | Kai Sun | Xiao Yang | Jiaqi Wang | Nirav Shah | Yue Liu | Aaron Colak | Anuj Kumar | Wen-tau Yih | Xin Luna Dong
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track

Retrieval-augmented generation (RAG) often falls short when retrieved context includes confusing semi-relevant passages, or when answering questions requires deep contextual understanding and reasoning. We propose an efficient fine-tuning framework, called PrismRAG, that (i) trains the model with distractor-aware QA pairs mixing gold evidence with subtle distractor passages, and (ii) instills reasoning-centric habits that make the LLM plan, rationalize, and synthesize without relying on extensive human-engineered instructions. Evaluated across 12 open-book RAG QA benchmarks spanning diverse application domains and scenarios, PrismRAG improves average factuality by 5.4%, outperforming state-of-the-art solutions. Our method is being deployed in production.
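
As a rough illustration of what a distractor-aware QA pair might look like, the sketch below mixes gold passages with distractors and shuffles them so position gives no hint. The abstract does not say how PrismRAG obtains or selects its distractors; sampling from a pre-built pool, along with the names `QAExample`, `build_distractor_aware_example`, and `num_distractors`, are assumptions made only for this example.

```python
import random
from dataclasses import dataclass


@dataclass
class QAExample:
    """Hypothetical container for one fine-tuning example."""
    question: str
    answer: str
    passages: list


def build_distractor_aware_example(question, answer, gold_passages,
                                   distractor_pool, num_distractors=2,
                                   seed=None):
    """Mix gold evidence with sampled distractors and shuffle the result.

    Assumption: distractors come from a ready-made pool of semi-relevant
    passages. The source abstract does not describe this step.
    """
    rng = random.Random(seed)
    # Never ask for more distractors than the pool holds.
    k = min(num_distractors, len(distractor_pool))
    distractors = rng.sample(distractor_pool, k)
    passages = list(gold_passages) + distractors
    # Shuffle so the gold passage is not always first.
    rng.shuffle(passages)
    return QAExample(question=question, answer=answer, passages=passages)


example = build_distractor_aware_example(
    question="Which venue hosted the final?",
    answer="Stadium A",
    gold_passages=["The final was played at Stadium A."],
    distractor_pool=[
        "The semi-final was played at Stadium B.",
        "Stadium C hosted the previous year's final.",
        "Stadium A opened in 1990.",
    ],
    num_distractors=2,
    seed=0,
)
```

Each resulting example carries one gold passage and two distractors in random order.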

2023

Effective Proxy for Human Labeling: Ensemble Disagreement Scores in Large Language Models for Industrial NLP
Wei Du | Laksh Advani | Yashmeet Gambhir | Daniel Perry | Prashant Shiralkar | Zhengzheng Xing | Aaron Colak
Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

Large language models (LLMs) have demonstrated significant capability to generalize across a large number of NLP tasks. For industry applications, it is imperative to assess the performance of the LLM on unlabeled production data from time to time to validate it in a real-world setting. Human labeling to assess model error requires considerable expense and time delay. Here we demonstrate that ensemble disagreement scores work well as a proxy for human labeling for language models in zero-shot, few-shot, and fine-tuned settings, per our evaluation on a keyphrase extraction (KPE) task. We measure fidelity of the results by comparing to true error measured from human-labeled ground truth. We contrast with the alternative of using another LLM as a source of machine labels, or ‘silver labels’. Results across various languages and domains show disagreement scores provide a better estimation of model performance with mean average error (MAE) as low as 0.4% and on average 13.8% better than using silver labels.
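
The idea of scoring disagreement across an ensemble and then checking that estimate against human-measured error can be sketched as follows. The abstract does not define the disagreement score, so the mean pairwise Jaccard distance between the keyphrase sets of ensemble members is an assumption here, as are the function names; MAE is computed as the mean absolute difference, which is also an assumption about the paper's metric.

```python
from itertools import combinations
from statistics import mean


def disagreement_score(predictions):
    """Mean pairwise Jaccard distance among ensemble members.

    `predictions` is a list of sets, one set of extracted keyphrases per
    model. Assumption: this is one plausible disagreement measure, not
    necessarily the one used in the paper.
    """
    pairs = list(combinations(predictions, 2))
    if not pairs:
        raise ValueError("need at least two ensemble members")

    def jaccard_distance(a, b):
        union = a | b
        if not union:
            return 0.0  # both models returned nothing, so they agree
        return 1.0 - len(a & b) / len(union)

    return mean(jaccard_distance(a, b) for a, b in pairs)


def mean_absolute_error(estimated, actual):
    """Average absolute gap between estimated and human-measured error."""
    if len(estimated) != len(actual):
        raise ValueError("inputs must have the same length")
    return mean(abs(e - a) for e, a in zip(estimated, actual))


# Three hypothetical ensemble members on one document.
ensemble = [
    {"llm", "evaluation"},
    {"llm", "evaluation"},
    {"llm", "metrics"},
]
score = disagreement_score(ensemble)  # 4/9: two of the three pairs differ
```

A low MAE between such scores and the error rates measured from human labels would indicate that disagreement tracks real model error on that data.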