2025
Is Semantic Chunking Worth the Computational Cost?
Renyi Qu | Ruixuan Tu | Forrest Sheng Bao
Findings of the Association for Computational Linguistics: NAACL 2025
Recent advances in Retrieval-Augmented Generation (RAG) systems have popularized semantic chunking, which aims to improve retrieval performance by dividing documents into semantically coherent segments. Despite its growing adoption, the actual benefits over simpler fixed-size chunking, where documents are split into consecutive, fixed-size segments, remain unclear. This study systematically evaluates the effectiveness of semantic chunking using three common retrieval-related tasks: document retrieval, evidence retrieval, and retrieval-based answer generation. The results show that the computational costs associated with semantic chunking are not justified by consistent performance gains. These findings challenge prevailing assumptions about semantic chunking and highlight the need for more efficient chunking strategies in RAG systems.
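For intuition, the two chunking strategies compared here can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `embed` stands in for any sentence-embedding model, and the 0.7 similarity threshold is arbitrary.

```python
import numpy as np

def fixed_size_chunks(tokens, size=256):
    """Split a token sequence into consecutive, fixed-size segments."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_chunks(sentences, embed, threshold=0.7):
    """Greedy semantic chunking: start a new chunk whenever the similarity
    between adjacent sentence embeddings falls below `threshold`.
    `embed` is any function mapping a sentence to a vector (an assumption)."""
    chunks, current = [], [sentences[0]]
    prev_vec = embed(sentences[0])
    for sent in sentences[1:]:
        vec = embed(sent)
        if cosine(prev_vec, vec) < threshold:
            chunks.append(current)
            current = []
        current.append(sent)
        prev_vec = vec
    chunks.append(current)
    return chunks
```

The per-sentence embedding calls in `semantic_chunks` are the extra computation that the study weighs against any retrieval gains.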
FaithBench: A Diverse Hallucination Benchmark for Summarization by Modern LLMs
Forrest Sheng Bao | Miaoran Li | Renyi Qu | Ge Luo | Erana Wan | Yujia Tang | Weisi Fan | Manveer Singh Tamber | Suleman Kazi | Vivek Sourabh | Mike Qi | Ruixuan Tu | Chenyu Xu | Matthew Gonzales | Ofer Mendelevitch | Amin Ahmad
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)
Summarization is one of the most common tasks performed by large language models (LLMs), especially in applications like Retrieval-Augmented Generation (RAG). However, existing evaluations of hallucinations in LLM-generated summaries, and of hallucination detection models, both suffer from a lack of diversity and recency in the LLMs and LLM families considered. This paper introduces FaithBench, a summarization hallucination benchmark comprising challenging hallucinations made by 10 modern LLMs from 8 different families, with ground-truth annotations by human experts. “Challenging” here means summaries on which popular, state-of-the-art hallucination detection models, including GPT-4o-as-a-judge, disagree. Our results show that GPT-4o and GPT-3.5-Turbo produce the fewest hallucinations. However, most state-of-the-art hallucination detection models achieve accuracies near 50% on FaithBench, indicating substantial room for future improvement.
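A rough sketch of the two ideas in the abstract, namely keeping only summaries on which detectors disagree and scoring detectors against expert labels; the function and variable names are hypothetical, not FaithBench's actual code.

```python
def is_challenging(detector_flags):
    """A summary counts as challenging when the hallucination detectors
    disagree on it (some flag a hallucination, some do not)."""
    return len(set(detector_flags)) > 1

def detector_accuracy(predictions, human_labels):
    """Fraction of summaries where a detector's binary verdict matches the
    expert ground-truth annotation."""
    correct = sum(p == y for p, y in zip(predictions, human_labels))
    return correct / len(human_labels)

# Hypothetical usage: three detectors vote on one summary.
print(is_challenging([True, False, True]))             # True -> kept in the benchmark
print(detector_accuracy([True, False], [True, True]))  # 0.5, i.e., near-chance
```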
2023
Conditioning on Dialog Acts improves Empathy Style Transfer
Renyi Qu | Lyle Ungar | João Sedoc
Findings of the Association for Computational Linguistics: EMNLP 2023
We explore the role of dialog acts in style transfer, specifically empathy style transfer: rewriting a sentence to make it more empathetic without changing its meaning. We use two novel few-shot prompting strategies: target prompting, which uses only examples of the target style (unlike traditional prompting with source/target pairs), and dialog-act-conditioned prompting, which first estimates the dialog act of the source sentence and then makes it more empathetic using few-shot examples of the same dialog act. Our study yields two key findings: (1) target prompting typically improves empathy more effectively while maintaining the same level of semantic similarity; (2) dialog acts matter: dialog-act-conditioned prompting enhances empathy while preserving both semantics and the dialog-act type. Different dialog acts benefit differently from different prompting methods, highlighting the need for further investigation of the role of dialog acts in style transfer.
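The two prompting strategies can be sketched roughly as below; this is a hedged illustration, not the authors' prompts, and `classify_dialog_act` plus the example pools are assumed inputs.

```python
def target_prompt(source, target_examples):
    """Target prompting: few-shot examples of the target (empathetic) style
    only, with no source/target pairs."""
    shots = "\n".join(f"Empathetic: {ex}" for ex in target_examples)
    return (f"{shots}\n\n"
            "Rewrite the following sentence to be more empathetic "
            f"without changing its meaning:\n{source}\nEmpathetic:")

def dialog_act_conditioned_prompt(source, examples_by_act, classify_dialog_act):
    """Dialog-act-conditioned prompting: first estimate the dialog act of the
    source sentence, then use few-shot examples of that same act."""
    act = classify_dialog_act(source)  # e.g., "question" or "statement"
    shots = "\n".join(f"Empathetic {act}: {ex}" for ex in examples_by_act[act])
    return (f"{shots}\n\n"
            f"Rewrite this {act} to be more empathetic "
            f"without changing its meaning:\n{source}\nEmpathetic {act}:")
```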