Jitkapat Sawatphol
We present VeReaFine, a novel “Verifier-RAG” pipeline designed to eliminate hallucinations in open-ended clinical question answering. VeReaFine interleaves three tightly coupled stages—retrieval, verification, and generation—across up to three iterations. First, a two-stage dense retriever (BM-Retriever-410M → BM-Reranker-2B) fetches and ranks top-k biomedical passages; an 8B-parameter MedReason verifier then filters these for direct relevance and identifies missing evidence. When the verifier deems the context insufficient, it formulates a focused “feedback query” to retrieve additional passages (bounded to prevent infinite loops). Once a minimal ground-truth context is assembled, a 7B-parameter generator (Qwen2.5-7B-Instruct) drafts an answer purely from that vetted context, and the verifier performs a final check—prompting the generator to refine any remaining unsupported claims. By iteratively fetching only missing facts and ensuring every assertion is evidence-backed, VeReaFine achieves monotonic factuality improvements with minimal overhead. On the BioNLP 2025 ClinIQLink “LLM Lie-Detector” shared task, our 7B generator augmented with VeReaFine matches or surpasses a 32B medical model on open-ended reasoning metrics, reducing multi-hop inverse step-identification errors by 26%. These findings demonstrate that moderate-size LLMs, when guided by targeted verification loops, can deliver expert-level reliability in clinical QA.
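To make the interleaved pipeline concrete, here is a minimal sketch of a bounded retrieve-verify-generate loop in the style described above. All object and method names (retriever.search, reranker.rank, verifier.check, generator.answer, the verdict fields) are illustrative placeholders, not the authors' actual implementation; only the three-stage structure, the feedback query, the iteration bound, and the final refinement pass come from the abstract.

```python
MAX_ITERATIONS = 3  # bound the loop to prevent infinite retrieval cycles

def verifier_rag(question, retriever, reranker, verifier, generator):
    context = []
    query = question
    for _ in range(MAX_ITERATIONS):
        # Stage 1: two-stage dense retrieval (fetch candidates, rerank top-k).
        candidates = retriever.search(query, k=50)
        passages = reranker.rank(question, candidates)[:10]
        # Stage 2: verifier filters for direct relevance and flags gaps.
        verdict = verifier.check(question, context + passages)
        context = verdict.relevant_passages
        if verdict.sufficient:
            break
        # Verifier formulates a focused feedback query for the missing facts.
        query = verdict.feedback_query
    # Stage 3: generate strictly from the vetted context, then a final check.
    answer = generator.answer(question, context)
    issues = verifier.unsupported_claims(answer, context)
    if issues:
        answer = generator.refine(question, context, answer, issues)
    return answer
```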
The evaluation of generative models in Machine Reading Comprehension (MRC) presents distinct difficulties, as traditional metrics like BLEU, ROUGE, METEOR, Exact Match, and F1 score often struggle to capture the nuanced and diverse responses these models produce. While embedding-based metrics such as BERTScore and BARTScore focus on semantic similarity, they still fail to fully address aspects such as recognizing additional helpful information and rewarding contextual faithfulness. Recent advances in large language model (LLM) based metrics offer more fine-grained evaluations, but challenges such as score clustering remain. This paper introduces a multi-aspect evaluation framework, CHIE, incorporating aspects of Correctness, Helpfulness, Irrelevance, and Extraneousness. Our approach, which uses binary categorical values rather than continuous rating scales, aligns well with human judgments, indicating its potential as a comprehensive and effective evaluation method.
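The following sketch shows how binary multi-aspect judging in the spirit of CHIE could be wired up. The prompt wording and the llm_judge callable are hypothetical stand-ins for an LLM-based evaluator; only the four aspects and the use of binary rather than continuous scores come from the abstract.

```python
ASPECTS = ["Correctness", "Helpfulness", "Irrelevance", "Extraneousness"]

def chie_judge(question, reference, answer, llm_judge):
    """Return a dict of binary (0/1) judgments, one per aspect."""
    scores = {}
    for aspect in ASPECTS:
        prompt = (
            f"Question: {question}\nReference: {reference}\n"
            f"Answer: {answer}\n"
            f"Does the answer exhibit {aspect}? Reply Yes or No."
        )
        # Binary categorical value instead of a continuous rating scale.
        scores[aspect] = int(llm_judge(prompt).strip().lower().startswith("yes"))
    return scores
```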
Authorship verification (AV) aims to identify whether a pair of texts was written by the same author. We address the challenge of evaluating AV models’ robustness against topic shifts. The conventional evaluation assumes minimal topic overlap between training and test data. However, we argue that there can still be topic leakage in test data, causing misleading model performance and unstable rankings. To address this, we propose an evaluation method called Heterogeneity-Informed Topic Sampling (HITS), which creates a smaller dataset with a heterogeneously distributed topic set. Our experimental results demonstrate that HITS-sampled datasets yield a more stable ranking of models across random seeds and evaluation splits. Our contributions include: 1. an analysis of the causes and effects of topic leakage; 2. a demonstration of HITS in reducing the effects of topic leakage; and 3. the Robust Authorship Verification bENchmark (RAVEN), which enables topic-shortcut tests to uncover AV models’ reliance on topic-specific features.
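As a rough illustration of the sampling idea, one simple way to obtain a heterogeneously distributed topic set is to cap how many evaluation pairs any single topic may contribute, so no topic dominates the split. This is only a plausible reading for illustration; the actual HITS procedure is defined in the paper, and the function and parameter names below are invented.

```python
import random
from collections import defaultdict

def hits_like_sample(pairs, topic_of, cap_per_topic=5, seed=0):
    """pairs: list of AV text pairs; topic_of: maps a pair to its topic label."""
    rng = random.Random(seed)
    by_topic = defaultdict(list)
    for p in pairs:
        by_topic[topic_of(p)].append(p)
    sampled = []
    for topic, group in by_topic.items():
        rng.shuffle(group)
        sampled.extend(group[:cap_per_topic])  # cap each topic's contribution
    rng.shuffle(sampled)
    return sampled
```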
Authorship attribution is a task that aims to identify the author of a given piece of writing. We aim to develop a generalized solution that can handle a large number of texts from authors and topics unavailable in training data. Previous studies have proposed strategies that address either unseen authors or unseen topics, but not both. Authorship representation learning has been shown to work in open-set environments with a large number of unseen authors, but it has not been explicitly designed for cross-topic environments at the same time. To handle a large number of unseen authors and topics, we propose Authorship Representation Regularization (ARR), a distillation framework that creates authorship representations with reduced reliance on topic-specific information. To assess the performance of our framework, we also propose a cross-topic open-set evaluation method. Our proposed method improves performance over baselines in the cross-topic open-set setup in 4 out of 6 cases.
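A conceptual sketch of what a distillation objective with topic regularization might look like is shown below. The specific loss combination (embedding distillation plus a topic-confusion term) is an assumption for illustration only; ARR's exact objective is defined in the paper.

```python
import torch
import torch.nn.functional as F

def arr_like_loss(student_emb, teacher_emb, topic_logits, lambda_topic=0.1):
    # Distillation: student mimics the teacher's authorship embedding.
    distill = F.mse_loss(student_emb, teacher_emb)
    # Regularization: reduce topic predictability by pushing a topic
    # classifier's output toward the uniform distribution (topic confusion).
    uniform = torch.full_like(topic_logits, 1.0 / topic_logits.size(-1))
    confusion = F.kl_div(F.log_softmax(topic_logits, dim=-1), uniform,
                         reduction="batchmean")
    return distill + lambda_topic * confusion
```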