Maximilian Idahl
2026
LongTailQA: Benchmarking LLMs and RAG Models on Disambiguated Long-Tail Entities
William Xion | Uwe Hadler | Tim Cofala | Maximilian Idahl | Soumyadeep Roy | Wolfgang Nejdl
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Large Language Models (LLMs) struggle with memorizing long-tail facts. Retrieval-Augmented Generation (RAG) models show better performance on long-tail Question Answering (QA) by offloading memory to external knowledge sources. We demonstrate that popular QA benchmarks such as PopQA, WITQA, and EntityQA contain significant entity ambiguity, with 8-30% of long-tail questions referencing entities with non-unique names. This ambiguity confounds evaluation, obscuring true model capabilities. To perform robust benchmarking, we disambiguate these questions with the Wikipedia knowledge graph to develop LongTailQA, an improved QA benchmark that mitigates entity ambiguity in long-tail entity questions. We evaluate various recent LLMs and RAG models, such as Self-RAG and InstructRAG, investigating retriever quality and retrieval depth impacts on QA performance. We observe that: (i) disambiguation improves model accuracy up to 24.7%, (ii) RAG models benefit significantly more than vanilla LLMs, (iii) simply increasing retrieval depth does not improve RAG performance, and (iv) RAG models achieve high accuracy with perfect information, highlighting the need to filter noisy documents during retrieval. The LongTailQA benchmark facilitates robust evaluation of long-tail knowledge recall and RAG system effectiveness. We make the codebase and datasets publicly available at https://github.com/williamx854/LongTailQA-Benchmark
2025
OpenReviewer: A Specialized Large Language Model for Generating Critical Scientific Paper Reviews
Maximilian Idahl | Zahra Ahmadi
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations)
We present OpenReviewer, an open-source system for generating high-quality peer reviews of machine learning and AI conference papers. At its core is Llama-OpenReviewer-8B, an 8B parameter language model specifically fine-tuned on 79,000 expert reviews from top conferences. Given a PDF paper submission and review template as input, OpenReviewer extracts the full text, including technical content like equations and tables, and generates a structured review following conference-specific guidelines. Our evaluation on 400 test papers shows that OpenReviewer produces considerably more critical and realistic reviews compared to general-purpose LLMs like GPT-4 and Claude-3.5. While other LLMs tend toward overly positive assessments, OpenReviewer’s recommendations closely match the distribution of human reviewer ratings. The system provides authors with rapid, constructive feedback to improve their manuscripts before submission, though it is not intended to replace human peer review. OpenReviewer is available as an online demo and open-source tool.
2021
Towards Benchmarking the Utility of Explanations for Model Debugging
Maximilian Idahl | Lijun Lyu | Ujwal Gadiraju | Avishek Anand
Proceedings of the First Workshop on Trustworthy Natural Language Processing
Post-hoc explanation methods are an important class of approaches that help understand the rationale underlying a trained model’s decision. But how useful are they for an end-user towards accomplishing a given task? In this vision paper, we argue the need for a benchmark to facilitate evaluations of the utility of post-hoc explanation methods. As a first step to this end, we enumerate desirable properties that such a benchmark should possess for the task of debugging text classifiers. Additionally, we highlight that such a benchmark facilitates not only assessing the effectiveness of explanations but also their efficiency.