This is an internal, incomplete preview of a proposed change to the ACL Anthology.
For efficiency reasons, we don't generate MODS or Endnote formats, and the preview may be incomplete in other ways, or contain mistakes.
Do not treat this content as an official publication.
HuiZhao
Also published as:
晖 赵
Fixing paper assignments
Please select all papers that belong to the same person.
Indicate below which author they should be assigned to.
Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains. Math Word Problems (MWPs) serve as a crucial benchmark for evaluating LLMs’ reasoning abilities. While most research primarily focuses on improving accuracy, it often neglects understanding and addressing the underlying patterns of errors. Current error classification methods rely on static and predefined categories, which limit their ability to capture the full spectrum of error patterns in mathematical reasoning. To enable systematic error analysis, we collect error samples from 15 different LLMs of varying sizes across four distinct MWP datasets using multiple sampling strategies. Based on this extensive collection, we introduce MWPES-300K, a comprehensive dataset containing 304,865 error samples that cover diverse error patterns and reasoning paths. To reduce human bias and enable fine-grained analysis of error patterns, we propose a novel framework for automated dynamic error classification in mathematical reasoning. Experimental results demonstrate that dataset characteristics significantly shape error patterns, which evolve from basic to complex manifestations as model capabilities increase. With deeper insights into error patterns, we propose Error-Aware Prompting (EAP) that incorporates common error patterns as explicit guidance, leading to significant improvements in mathematical reasoning performance.
Multimodal machine learning is often hindered by two critical challenges: modality missingness and modality imbalance. These challenges significantly degrade the performance of multimodal models. The majority of existing methods either require the availability of full-modality data during the training phase or necessitate explicit annotations to detect missing modalities. These dependencies severely limit the models’ applicability in the real world. To tackle these problems, we propose a Dynamic modality Recognition and Enhancement for Adaptive Multimodal fusion framework *DREAM*. Within DREAM, we innovatively employ a sample-level dynamic modality assessment mechanism to direct selective reconstruction of missing or underperforming modalities. Additionally, we introduce a soft masking fusion strategy that adaptively integrates different modalities according to their estimated contributions, enabling more accurate and robust predictions. Experimental results on three benchmark datasets consistently demonstrate that DREAM outperforms several representative baseline and state-of-the-art models, marking its robustness against modality missingness and imbalanced modality.
“The rapid development of artificial intelligence has led to an explosion of literature in the biomed-ical field, and Biomedical Named Entity Recognition (BioNER) can quickly and accurately iden-tify key information from unstructured text. This task has become an important topic to promotethe rapid development of intelligence in the biomedical field. However, in the Named EntityRecognition (NER) of the biomedical field, there are always some problems of unclear boundaryrecognition, the underutilization of hierarchical information in sentences and the scarcity of train-ing data resources. Based on this, this paper proposes a multi-task BioNER model based on dataaugmentation, using four data augmentation methods: Mention Replacement (MR), Label-wisetoken Replacement (LwTR), Shuffle Within Segments (SiS) and Synonym Replacement (SR)to increase the training data. The syntactic information is extracted by incorporating the inputsentence into the Graph Convolutional Network (GCN), and then the tag information encodedby BERT is interacted through a co-attention mechanism to obtain an interaction matrix. Subse-quently, NER is performed through boundary detection tasks and span classification tasks. Com-parative experiments with other methods are conducted on the BC5CDR and JNLPBA datasets,as well as the CCKS2017 dataset. The experimental results demonstrate the effectiveness of themodel proposed in this paper.”
Clinical text summarization has proven successful in generating concise and coherent summaries. However, these summaries may include unintended text with hallucinations, which can mislead clinicians and patients. Existing methods for mitigating hallucinations can be categorized into task-specific and task-agnostic approaches. Task-specific methods lack versatility for real-world applicability. Meanwhile, task-agnostic methods are not model-agnostic, so they require retraining for different models, resulting in considerable computational costs. To address these challenges, we propose MEDAL, a model-agnostic framework designed to post-process medical hallucinations. MEDAL can seamlessly integrate with any medical summarization model, requiring no additional computational overhead. MEDAL comprises a medical infilling model and a hallucination correction model. The infilling model generates non-factual summaries with common errors to train the correction model. The correction model is incorporated with a self-examination mechanism to activate its cognitive capability. We conduct comprehensive experiments using 11 widely accepted metrics on 7 baseline models across 3 medical text summarization tasks. MEDAL demonstrates superior performance in correcting hallucinations when applied to summaries generated by pre-trained language models and large language models.
Large language models (LLMs) are highly effective in various natural language processing (NLP) tasks. However, they are susceptible to producing unreliable conjectures in ambiguous contexts called hallucination. This paper presents a new method for evaluating LLM hallucination in Question Answering (QA) based on the unanswerable math word problem (MWP). To support this approach, we innovatively develop a dataset called Unanswerable Math Word Problem (UMWP) which comprises 5200 questions across five categories. We developed an evaluation methodology combining text similarity and mathematical expression detection to determine whether LLM considers the question unanswerable. The results of extensive experiments conducted on 31 LLMs, including GPT-3, InstructGPT, LLaMA, and Claude, demonstrate that in-context learning and reinforcement learning with human feedback (RLHF) training significantly enhance the model’s ability to avoid hallucination. We show that utilizing MWP is a reliable and effective approach to assess hallucination. Our code and data are available at https://github.com/Yuki-Asuuna/UMWP.
Gender bias in vision-language models (VLMs) can reinforce harmful stereotypes and discrimination. In this paper, we focus on mitigating gender bias towards vision-language tasks. We identify object hallucination as the essence of gender bias in VLMs. Existing VLMs tend to focus on salient or familiar attributes in images but ignore contextualized nuances. Moreover, most VLMs rely on the co-occurrence between specific objects and gender attributes to infer the ignored features, ultimately resulting in gender bias. We propose GAMA, a task-agnostic generation framework to mitigate gender bias. GAMA consists of two stages: narrative generation and answer inference. During narrative generation, GAMA yields all-sided but gender-obfuscated narratives, which prevents premature concentration on localized image features, especially gender attributes. During answer inference, GAMA integrates the image, generated narrative, and a task-specific question prompt to infer answers for different vision-language tasks. This approach allows the model to rethink gender attributes and answers. We conduct extensive experiments on GAMA, demonstrating its debiasing and generalization ability.