Shuai Shao

2026

Distribution Shift Alignment Helps LLMs Simulate Survey Response Distributions
Ji Huang | Mengfei LI | Shuai Shao
Findings of the Association for Computational Linguistics: ACL 2026

Large language models (LLMs) offer a promising way to simulate human survey responses, potentially reducing the cost of large-scale data collection. However, existing zero-shot methods suffer from prompt sensitivity and low accuracy, while conventional fine-tuning approaches mostly fit the training set distributions and struggle to produce results more accurate than the training set itself, which deviates from the original goal of using LLMs to simulate survey responses. Building on this observation, we introduce Distribution Shift Alignment (DSA), a two-stage fine-tuning method that aligns both the output distributions and the distribution shifts across different backgrounds. By learning how these distributions change rather than fitting training data, DSA can provide results substantially closer to the true distribution than the training data. Empirically, DSA consistently outperforms other methods on five public survey datasets. We further conduct a comprehensive comparison covering accuracy, robustness, and data savings. DSA reduces the required real data by 53.48-69.12%, demonstrating its effectiveness and efficiency in survey simulation.

2025

Luna: A Lightweight Evaluation Model to Catch Language Model Hallucinations with High Accuracy and Low Cost
Masha Belyi | Robert Friel | Shuai Shao | Atindriyo Sanyal
Proceedings of the 31st International Conference on Computational Linguistics: Industry Track

Retrieval-Augmented Generation (RAG) systems have become pivotal in enhancing the capabilities of language models by incorporating external knowledge retrieval mechanisms. However, a significant challenge in deploying these systems in industry applications is the detection and mitigation of hallucinations: instances where the model generates information that is not grounded in the retrieved context. Addressing this issue is crucial for ensuring the reliability and accuracy of responses generated by large language models (LLMs) in industry settings. Current hallucination detection techniques fail to deliver accuracy, low latency, and low cost simultaneously. We introduce Luna: a DeBERTa-large encoder, fine-tuned for hallucination detection in RAG settings. We demonstrate that Luna outperforms GPT-3.5 and commercial evaluation frameworks on the hallucination detection task, with 97% and 91% reduction in cost and latency, respectively. Luna’s generalization capacity across multiple industry verticals and out-of-domain data makes it a strong candidate for guardrailing industry LLM applications.

Venues

COLING (1)
Findings (1)