Zeinab Sadat Taghavi
2025
ImpliRet: Benchmarking the Implicit Fact Retrieval Challenge
Zeinab Sadat Taghavi
|
Ali Modarressi
|
Yunpu Ma
|
Hinrich Schuetze
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Retrieval systems are central to many NLP pipelines, but often rely on surface-level cues such as keyword overlap and lexical semantic similarity. To evaluate retrieval beyond these shallow signals, recent benchmarks introduce reasoning-heavy queries; however, they primarily shift the burden to query-side processing techniques – like prompting or multi-hop retrieval – that can help resolve complexity. In contrast, we present ImpliRet, a benchmark that shifts the reasoning challenge to document-side processing: the queries are simple, but relevance depends on facts stated implicitly in documents through temporal (e.g., resolving “two days ago”), arithmetic, and world knowledge relationships. We evaluate a range of sparse and dense retrievers, all of which struggle in this setting: the best nDCG@10 is only 14.91%. We also test whether long-context models can overcome this limitation. But even with a short context of only thirty documents, including the positive document, GPT-o4-mini scores only 55.54%, showing that document-side reasoning remains a challenge. Our code is available at github.com/ZeinabTaghavi/IMPLIRET.
2024
SLPL SHROOM at SemEval2024 Task 06: A comprehensive study on models ability to detect hallucination
Pouya Fallah
|
Soroush Gooran
|
Mohammad Jafarinasab
|
Pouya Sadeghi
|
Reza Farnia
|
Amirreza Tarabkhah
|
Zeinab Sadat Taghavi
|
Hossein Sameti
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
Language models, particularly generative models, are susceptible to hallucinations, generating outputs that contradict factual knowledge or the source text. This study explores methods for detecting hallucinations in three SemEval-2024 Task 6 tasks: Machine Translation, Definition Modeling, and Paraphrase Generation. We evaluate two methods: semantic similarity between the generated text and factual references, and an ensemble of language models that judge each other’s outputs. Our results show that semantic similarity achieves moderate accuracy and correlation scores in trial data, while the ensemble method offers insights into the complexities of hallucination detection but falls short of expectations. This work highlights the challenges of hallucination detection and underscores the need for further research in this critical area.