2025
Aligning Large Language Models to Follow Instructions and Hallucinate Less via Effective Data Filtering
Shuzheng Si | Haozhe Zhao | Gang Chen | Cheng Gao | Yuzhuo Bai | Zhitong Wang | Kaikai An | Kangyang Luo | Chen Qian | Fanchao Qi | Baobao Chang | Maosong Sun
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Training LLMs on data containing unfamiliar knowledge during the instruction tuning stage can encourage hallucinations. To address this challenge, we introduce NOVA, a novel framework designed to identify high-quality data that aligns well with the LLM’s learned knowledge to reduce hallucinations. NOVA includes Internal Consistency Probing (ICP) and Semantic Equivalence Identification (SEI) to measure how familiar the LLM is with instruction data. Specifically, ICP evaluates the LLM’s understanding of the given instruction by calculating the tailored consistency among multiple self-generated responses. SEI further assesses the familiarity of the LLM with the target response by comparing it to the generated responses, using the proposed semantic clustering and well-designed voting strategy. Finally, to ensure the quality of selected samples, we introduce an expert-aligned reward model, considering characteristics beyond just familiarity. By considering data quality and avoiding unfamiliar data, we can utilize the selected data to effectively align LLMs to follow instructions and hallucinate less. Experiments show that NOVA significantly reduces hallucinations while maintaining a competitive ability to follow instructions.
Value Compass Benchmarks: A Comprehensive, Generative and Self-Evolving Platform for LLMs’ Value Evaluation
Jing Yao | Xiaoyuan Yi | Shitong Duan | Jindong Wang | Yuzhuo Bai | Muhua Huang | Yang Ou | Scarlett Li | Peng Zhang | Tun Lu | Zhicheng Dou | Maosong Sun | James Evans | Xing Xie
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
As large language models (LLMs) are gradually integrated into human daily life, assessing their underlying values becomes essential for understanding their risks and alignment with specific preferences. Despite growing efforts, current value evaluation methods face two key challenges. C1. Evaluation Validity: Static benchmarks fail to reflect intended values or yield informative results due to data contamination or a ceiling effect. C2. Result Interpretation: They typically reduce the pluralistic and often incommensurable values to one-dimensional scores, which hinders users from gaining meaningful insights and guidance. To address these challenges, we present Value Compass Benchmarks, the first dynamic, online and interactive platform specially devised for comprehensive value diagnosis of LLMs. It (1) grounds evaluations in multiple basic value systems from social science; (2) develops a generative evolving evaluation paradigm that automatically creates real-world test items co-evolving with ever-advancing LLMs; (3) offers multi-faceted result interpretation, including (i) fine-grained scores and case studies across 27 value dimensions for 33 leading LLMs, (ii) customized comparisons, and (iii) visualized analysis of LLMs’ alignment with cultural values. We hope Value Compass Benchmarks serves as a navigator for further enhancing LLMs’ safety and alignment, benefiting their responsible and adaptive development.
Document Segmentation Matters for Retrieval-Augmented Generation
Zhitong Wang | Cheng Gao | Chaojun Xiao | Yufei Huang | Shuzheng Si | Kangyang Luo | Yuzhuo Bai | Wenhao Li | Tangjian Duan | Chuancheng Lv | Guoshan Lu | Gang Chen | Fanchao Qi | Maosong Sun
Findings of the Association for Computational Linguistics: ACL 2025
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by integrating external knowledge. A critical yet underexplored challenge in RAG is document segmentation, also known as document chunking. Existing widely used rule-based chunking methods usually lead to suboptimal splits, where overly large chunks introduce irrelevant information and small chunks lack semantic coherence. Existing semantic-based approaches either require costly LLM calls or fail to adaptively group contextually related sentences. To address these limitations, we propose PIC (Pseudo-Instruction for document Chunking), a simple yet effective method that leverages document summaries as pseudo-instructions to guide chunking. By computing semantic similarity between sentences and the summary, PIC dynamically groups sentences into chunks that align with the document’s key themes, ensuring semantic completeness and relevance to potential user instructions. Experiments on multiple open-domain question-answering benchmarks demonstrate that PIC can significantly improve retrieval accuracy (Hits@k) and end-to-end QA performance (Exact Match) without any additional training.
GLTW: Joint Improved Graph Transformer and LLM via Three-Word Language for Knowledge Graph Completion
Kangyang Luo | Yuzhuo Bai | Cheng Gao | Shuzheng Si | Zhu Liu | Yingli Shen | Zhitong Wang | Cunliang Kong | Wenhao Li | Yufei Huang | Ye Tian | Xuantang Xiong | Lei Han | Maosong Sun
Findings of the Association for Computational Linguistics: ACL 2025
Knowledge Graph Completion (KGC), which aims to infer missing or incomplete facts, is a crucial task for KGs. However, integrating the vital structural information of KGs into Large Language Models (LLMs) and outputting predictions deterministically remains challenging. To address this, we propose a new method called GLTW, which encodes the structural information of KGs and merges it with LLMs to enhance KGC performance. Specifically, we introduce an improved Graph Transformer (iGT) that effectively encodes subgraphs with both local and global structural information and inherits the characteristics of language model, bypassing training from scratch. Also, we develop a subgraph-based multi-classification training objective, using all entities within KG as classification objects, to boost learning efficiency. Importantly, we combine iGT with an LLM that takes KG language prompts as input. Our extensive experiments on various KG datasets show that GLTW achieves significant performance gains compared to SOTA baselines.
2024
OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems
Chaoqun He | Renjie Luo | Yuzhuo Bai | Shengding Hu | Zhen Thai | Junhao Shen | Jinyi Hu | Xu Han | Yujie Huang | Yuxiang Zhang | Jie Liu | Lei Qi | Zhiyuan Liu | Maosong Sun
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent advancements have seen Large Language Models (LLMs) and Large Multimodal Models (LMMs) surpassing general human capabilities in various tasks, approaching the proficiency level of human experts across multiple domains. With traditional benchmarks becoming less challenging for these models, new rigorous challenges are essential to gauge their advanced abilities. In this work, we present OlympiadBench, an Olympiad-level bilingual multimodal scientific benchmark, featuring 8,476 problems from Olympiad-level mathematics and physics competitions, including the Chinese college entrance exam. Each problem is detailed with expert-level annotations for step-by-step reasoning. Evaluating top-tier models on OlympiadBench, we implement a comprehensive assessment methodology to accurately evaluate model responses. Notably, the best-performing model, GPT-4V, attains an average score of 17.97% on OlympiadBench, with a mere 10.74% in physics, highlighting the benchmark rigor and the intricacy of physical reasoning. Our analysis orienting GPT-4V points out prevalent issues with hallucinations, knowledge omissions, and logical fallacies. We hope that our challenging benchmark can serve as a valuable resource for helping future AGI research endeavors. The data and evaluation code are available at
https://github.com/OpenBMB/OlympiadBench
2021
Manual Evaluation Matters: Reviewing Test Protocols of Distantly Supervised Relation Extraction
Tianyu Gao | Xu Han | Yuzhuo Bai | Keyue Qiu | Zhiyu Xie | Yankai Lin | Zhiyuan Liu | Peng Li | Maosong Sun | Jie Zhou
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
2020
IsOBS: An Information System for Oracle Bone Script
Xu Han | Yuzhuo Bai | Keyue Qiu | Zhiyuan Liu | Maosong Sun
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Oracle bone script (OBS) is the earliest known ancient Chinese writing system and the ancestor of modern Chinese. As the Chinese writing system is the oldest continuously-used system in the world, the study of OBS plays an important role in both linguistic and historical research. In order to utilize advanced machine learning methods to automatically process OBS, we construct an information system for OBS (IsOBS) to symbolize, serialize, and store OBS data at the character-level, based on efficient databases and retrieval modules. Moreover, we also apply few-shot learning methods to build an effective OBS character recognition module, which can recognize a large number of OBS characters (especially those characters with a handful of examples) and make the system easy to use. The demo system of IsOBS can be found at http://isobs.thunlp.org/. In the future, we will add more OBS data to the system, and hopefully our IsOBS can support further efforts in automatically processing OBS and advance the scientific progress in this field.