Hao Li - ACL Anthology

This page is part of a temporary preview of a proposed change that may be incomplete or contain mistakes. It is not official and will be removed when the change is merged or abandoned.

Hao Li

Also published as: 浩李

Papers on this page may belong to the following people: Hao Li, Hao Li, Hao Li, Hao Li, Hao Li

2026

AutoTaskEval: Towards Domain-Specific and Fine-Grained Evaluation for LLMs
Qingqing Lyu | Linjuan Wu | Yongliang Shen | Hengwei Liu | Hao Li | Shengpei Jiang | Yin Zhang | Weiming Lu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Despite the rapid progress of LLMs, their evaluation remains hindered by static, manually curated benchmarks with limited task coverage and poor adaptability to emerging domains. Existing automated approaches typically operate within fixed task schemas and often fail to autonomously discover new evaluation dimensions, limiting both scalability and effectiveness. To address these gaps, we propose AutoTaskEval, an automated framework that constructs domain-specific benchmarks directly from unstructured corpora. Using a refined Bloom’s Taxonomy, the framework systematically discovers tasks, enriches contextual grounding via iterative Socratic prompting, and generates diverse, progressively challenging evaluation instances. Applied to the complex and knowledge-intensive legal domain, AutoTaskEval uncovers a broader and more fine-grained task space than expert-curated benchmarks while producing high-quality instances that preserve established model-level evaluation trends. We further validate its robustness in a low-structure e-commerce review domain. Together, these results show that AutoTaskEval enables scalable, adaptive, and high-fidelity LLM assessment across domains and model families, advancing autonomous and capability-sensitive evaluation.

LogitSpec: Accelerating Retrieval-based Speculative Decoding via Next Next Token Speculation
Tianyu Liu | Qitan Lv | Hao Li | Xing Gao | Xiao Sun | Xiaoyan Sun
Findings of the Association for Computational Linguistics: ACL 2026

Speculative decoding (SD), where a small draft model is employed to propose *draft* tokens in advance and then the target model validates them in parallel, has emerged as a promising technique for LLM inference acceleration. Many endeavors to improve SD are to eliminate the need for a draft model and generate draft tokens in a retrieval-based manner in order to further alleviate the drafting overhead and significantly reduce the difficulty in deployment and applications. However, retrieval-based SD relies on a matching paradigm to retrieve the most relevant reference as the draft tokens, where these methods often fail to find matched and accurate draft tokens. To address this challenge, we propose *LogitSpec* to effectively expand the retrieval range and find the most relevant reference as drafts. *LogitSpec* is motivated by the observation that the logit of the last token can not only predict **the next token**, but also speculate **the next next token**. Specifically, *LogitSpec* generates draft tokens in two steps: (1) utilizing the last logit to speculate the next next token; (2) retrieving relevant reference for both the next token and the next next token. *LogitSpec* is training-free and plug-and-play, which can be easily integrated into existing LLM inference frameworks. Extensive experiments on a wide range of text generation benchmarks demonstrate that *LogitSpec* can achieve up to 2.61× speedup and 3.28 mean accepted tokens per decoding step.

Be Your Own Red Teamer: Safety Alignment via Self-Play and Reflective Experience Replay
Hao Wang | Yanting Wang | Hao Li | Rui Li | Lei Sha
Findings of the Association for Computational Linguistics: ACL 2026

Large Language Models (LLMs) have achieved remarkable capabilities but remain vulnerable to adversarial “jailbreak” attacks designed to bypass safety guardrails. Current safety alignment methods depend heavily on static external red teaming, utilizing fixed defense prompts or pre-collected adversarial datasets. This leads to a rigid defense that overfits known patterns and fails to generalize to novel, sophisticated threats. To address this critical limitation, we propose empowering the model to be its own red teamer, capable of achieving autonomous and evolving adversarial attacks. Specifically, we introduce Safety Self- Play (SSP), a system that utilizes a single LLM to act concurrently as both the Attacker (generating jailbreaks) and the Defender (refusing harmful requests) within a unified Reinforcement Learning (RL) loop, dynamically evolving attack strategies to uncover vulnerabilities while simultaneously strengthening defense mechanisms. To ensure the Defender effectively addresses critical safety issues during the self-play, we introduce an advanced Reflective Experience Replay Mechanism, which uses an experience pool accumulated throughout the process. The mechanism employs a Upper Confidence Bound (UCB) sampling strategy to focus on failure cases with low rewards, helping the model learn from past hard mistakes while balancing exploration and exploitation. Extensive experiments demonstrate that our SSP approach autonomously evolves robust defense capabilities, significantly outperforming baselines trained on static adversarial datasets and establishing a new benchmark for proactive safety alignment.

Demystifying Data Organization for Enhanced LLM Training
Yalun Dai | Yangyu Huang | Tongshen Yang | Yonghan Wang | Xin Zhang | Wenshan Wu | Qihao Zhao | Hao Li | Yuanyuan Gao | Kim-Hui Yap | Scarlett Li
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large Language Models (LLMs) have revolutionized various fields, yet their training efficiency is heavily reliant on effective data curation. While data selection has been widely studied, the strategic data organization for enhanced training remains an underexplored area, particularly since current LLMs are often trained for only one or a few epochs. This paper systematically explores the influence of data organization on LLM training by reusing pre-computed sample-level scores originally generated for data efficiency, thereby incurring minimal additional computational overhead. We identify and formalize four key guidances for optimizing data organization: Boundary Sharpening, Cyclic Scheduling, Curriculum Continuity, and Local Diversity. Guided by them, we introduce two novel data ordering methods termed STR and SAW. Extensive experiments across different model scales and data sizes, encompassing both pre-training and SFT stages, validate the effectiveness of our summarized guidances. They also demonstrate the robustness of our proposed data ordering methods in enhancing the stability and performance of LLM training.

LLMRouterBench: A Massive Benchmark and Unified Framework for LLM Routing
Hao Li | Yiqun Zhang | Zhaoyan Guo | Chenxu Wang | Shengji Tang | Qiaosheng Zhang | Yang Chen | Biqing Qi | Peng Ye | Lei Bai | Zhen Wang | Shuyue Hu
Findings of the Association for Computational Linguistics: ACL 2026

Large language model (LLM) routing assigns each query to the most suitable model from an ensemble. We introduce LLMRouterBench, a large-scale benchmark and unified framework for LLM routing. It comprises over 400K instances from 21 datasets and 33 models. Moreover, it provides comprehensive metrics for both performance-oriented and performance-cost trade-off routing, and integrates 10 representative routing baselines. Using LLMRouterBench, we systematically re-evaluate the field. While confirming strong model complementarity—the central premise of LLM routing—we find that many routing methods exhibit similar performance under unified evaluation, and several recent approaches, including commercial routers, fail to reliably outperform a simple baseline. Meanwhile, a substantial gap remains to the Oracle, driven primarily by persistent model-recall failures. We further show that backbone embedding models have limited impact, that larger ensembles exhibit diminishing returns compared to careful model curation, and that the benchmark also enables latency-aware analysis. All code and data are available at https://github.com/ynulihao/LLMRouterBench.

MTRouter: Cost-Aware Multi-Turn LLM Routing with History–Model Joint Embeddings
Yiqun Zhang | Hao Li | Zihan Wang | Shi Feng | Xiaocui Yang | Daling Wang | Bo Zhang | Lei Bai | Shuyue Hu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Multi-turn, long-horizon tasks are increasingly common for large language models (LLMs), but solving them typically requires many sequential model invocations, accumulating substantial inference costs. Here, we study cost-aware multi-turn LLM routing: selecting which model to invoke at each turn from a model pool, given a fixed cost budget. We propose MTRouter, which encodes the interaction history and candidate models into joint history–model embeddings, and learns an outcome estimator from logged trajectories to predict turn-level model utility. Experiments show that MTRouter improves the performance–cost trade-off: on ScienceWorld, it surpasses GPT-5 while reducing total cost by 58.7%; on Humanity’s Last Exam (HLE), it achieves competitive accuracy while reducing total cost by 43.4% relative to GPT-5, and these gains even carry over to held-out tasks. Further analyses reveal several mechanisms underlying its effectiveness: relative to prior multi-turn routers, MTRouter makes fewer model switches, is more tolerant to transient errors, and exhibits emergent specialization across models.Code: https://github.com/ZhangYiqun018/MTRouter

2025

Breaking the Reasoning Barrier A Survey on LLM Complex Reasoning through the Lens of Self-Evolution
Tao He | Hao Li | Jingchang Chen | Runxuan Liu | Yixin Cao | Lizi Liao | Zihao Zheng | Zheng Chu | Jiafeng Liang | Ming Liu | Bing Qin
Findings of the Association for Computational Linguistics: ACL 2025

The release of OpenAI’s O1 and subsequent projects like DeepSeek R1 has significantly advanced research on complex reasoning in LLMs. This paper systematically analyzes existing reasoning studies from the perspective of self-evolution, structured into three components: data evolution, model evolution, and self-evolution. Data evolution explores methods to generate higher-quality reasoning training data. Model evolution focuses on training strategies to boost reasoning capabilities. Self-evolution research autonomous system evolution via iterating cycles of data and model evolution. We further discuss the scaling law of self-evolution and analyze representative O1-like works through this lens. By summarizing advanced methods and outlining future directions, this paper aims to drive advancements in LLMs’ reasoning abilities.

When to Continue Thinking: Adaptive Thinking Mode Switching for Efficient Reasoning
Xiaoyun Zhang | Jingqing Ruan | Xing Ma | Yawen Zhu | Haodong Zhao | Hao Li | Jiansong Chen | Ke Zeng | Xunliang Cai
Findings of the Association for Computational Linguistics: EMNLP 2025

Large reasoning models (LRMs) achieve remarkable performance via long reasoning chains, but often incur excessive computational overhead due to redundant reasoning, especially on simple tasks. In this work, we systematically quantify the upper bounds of LRMs under both Long-Thinking and No-Thinking modes, and uncover the phenomenon of “Internal Self-Recovery Mechanism” where models implicitly supplement reasoning during answer generation. Building on this insight, we propose Adaptive Self-Recovery Reasoning (ASRR), a framework that suppresses unnecessary reasoning and enables implicit recovery. By introducing accuracy-aware length reward regulation, ASRR adaptively allocates reasoning effort according to problem difficulty, achieving high efficiency with negligible performance sacrifice. Experiments across multiple benchmarks and models show that, compared with GRPO, ASRR reduces reasoning budget by up to 32.5% (1.5B) and 25.7% (7B) with minimal accuracy loss (1.2% and 0.6% pass@1), and significantly boosts harmless rates on safety benchmarks (up to +21.7%). Our results highlight the potential of ASRR for enabling efficient, adaptive, and safer reasoning in LRMs.

LLMs are not Zero-Shot Reasoners for Biomedical Information Extraction
Aishik Nagar | Viktor Schlegel | Thanh-Tung Nguyen | Hao Li | Yuping Wu | Kuluhan Binici | Stefan Winkler
The Sixth Workshop on Insights from Negative Results in NLP

Large Language Models (LLMs) are increasingly adopted for applications in healthcare, reaching the performance of domain experts on tasks such as question answering and document summarisation. Despite their success on these tasks, it is unclear how well LLMs perform on tasks that are traditionally pursued in the biomedical domain, such as structured information extration. To bridge this gap, in this paper, we systematically benchmark LLM performance in Medical Classification and Named Entity Recognition (NER) tasks. We aim to disentangle the contribution of different factors to the performance, particularly the impact of LLMs’ task knowledge and reasoning capabilities, their (parametric) domain knowledge, and addition of external knowledge. To this end, we evaluate various open LLMs—including BioMistral and Llama-2 models—on a diverse set of biomedical datasets, using standard prompting, Chain-of-Thought (CoT) and Self-Consistency based reasoning as well as Retrieval-Augmented Generation (RAG) with PubMed and Wikipedia corpora. Counter-intuitively, our results reveal that standard prompting consistently outperforms more complex techniques across both tasks, laying bare the limitations in the current application of CoT, self-consistency and RAG in the biomedical domain. Our findings suggest that advanced prompting methods developed for knowledge- or reasoning-intensive tasks, such as CoT or RAG, are not easily portable to biomedical tasks where precise structured outputs are required. This highlights the need for more effective integration of external knowledge and reasoning mechanisms in LLMs to enhance their performance in real-world biomedical applications.

VLSBench: Unveiling Visual Leakage in Multimodal Safety
Xuhao Hu | Dongrui Liu | Hao Li | Xuanjing Huang | Jing Shao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Safety concerns of Multimodal large language models (MLLMs) have gradually become an important problem in various applications. Surprisingly, previous works indicate a counterintuitive phenomenon that using textual unlearning to align MLLMs achieves comparable safety performances with MLLMs aligned with image-text pairs. To explain such a phenomenon, we discover a Visual Safety Information Leakage (VSIL) problem in existing multimodal safety benchmarks, i.e., the potentially risky content in the image has been revealed in the textual query. Thus, MLLMs can easily refuse these sensitive image-text pairs according to textual queries only, leading to unreliable cross-modality safety evaluation of MLLMs. We also conduct a further comparison experiment between textual alignment and multimodal alignment to highlight this drawback. To this end, we construct Visual Leakless Safety Bench (VLSBench) with 2.2k image-text pairs through an automated data pipeline. Experimental results indicate that VLSBench poses a significant challenge to both open-source and close-source MLLMs, i.e., LLaVA, Qwen2-VL and GPT-4o. Besides, we empirically compare textual and multimodal alignment methods on VLSBench and find that textual alignment is effective enough for multimodal safety scenarios with VSIL, while multimodal alignment is preferable for safety scenarios without VSIL.

DiffusionAttacker: Diffusion-Driven Prompt Manipulation for LLM Jailbreak
Hao Wang | Hao Li | Junda Zhu | Xinyuan Wang | Chengwei Pan | Minlie Huang | Lei Sha
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Large Language Models (LLMs) are susceptible to generating harmful content when prompted with carefully crafted inputs, a vulnerability known as LLM jailbreaking. As LLMs become more powerful, studying jailbreak methods is critical to enhancing security and aligning models with human values. Traditionally, jailbreak techniques have relied on suffix addition or prompt templates, but these methods suffer from limited attack diversity. This paper introduces DiffusionAttacker, an end-to-end generative approach for jailbreak rewriting inspired by diffusion models. Our method employs a sequence-to-sequence (seq2seq) text diffusion model as a generator, conditioning on the original prompt and guiding the denoising process with a novel attack loss. Unlike previous approaches that use autoregressive LLMs to generate jailbreak prompts, which limit the modification of already generated tokens and restrict the rewriting space, DiffusionAttacker utilizes a seq2seq diffusion model, allowing more flexible token modifications. This approach preserves the semantic content of the original prompt while producing harmful content. Additionally, we leverage the Gumbel-Softmax technique to make the sampling process from the diffusion model’s output distribution differentiable, eliminating the need for iterative token search. Extensive experiments on Advbench and Harmbench demonstrate that DiffusionAttacker outperforms previous methods across various evaluation metrics, including attack success rate (ASR), fluency, and diversity.

AMoPO: Adaptive Multi-objective Preference Optimization without Reward Models and Reference Models
Qi Liu | Jingqing Ruan | Hao Li | Haodong Zhao | Desheng Wang | Jiansong Chen | Wan Guanglu | Xunliang Cai | Zhi Zheng | Tong Xu
Findings of the Association for Computational Linguistics: ACL 2025

Existing multi-objective preference alignment methods for large language models (LLMs) face limitations: (1) the inability to effectively balance various preference dimensions, and (2) reliance on auxiliary reward/reference models introduces computational complexity. To address these challenges, we propose Adaptive Multi-objective Preference Optimization (AMoPO), a novel framework that achieves dynamic balance across preference dimensions. By introducing the multi-objective optimization paradigm to use the dimension-aware generation metrics as implicit rewards, AMoPO aligns LLMs with diverse preferences without additional reward models or reference models. We introduce an adaptive weight assignment mechanism that models the generation space as a Gaussian distribution, allowing dynamic prioritization of preference dimensions. Empirical results demonstrate that AMoPO outperforms state-of-the-art baselines by 28.5%, and the experiments on 7B, 14B, and 32B models reveal the scaling ability of AMoPO. Moreover, additional analysis of multiple dimensions verifies its adaptability and effectiveness. These findings validate AMoPO’s capability to achieve dimension-aware preference alignment, highlighting its superiority. Our codes and datasets are available at https://github.com/Javkonline/AMoPO.

YNU-HPCC at SemEval-2025 Task 2: Local Cache and Online Retrieval-Based method for Entity-Aware Machine Translation
Hao Li | Jin Wang | Xuejie Zhang
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

This paper presents methods for {textbf{SemEval-2025 Task 11}} on text-based emotion detection across three tracks: Multi-label Emotion Detection, Emotion Intensity Prediction, and Cross-lingual Emotion Detection. We apply approaches such as supervised fine-tuning, preference-based reinforcement learning, and few-shot learning to enhance performance. Our combined strategies result in improved accuracy, particularly in multi-label and cross-lingual emotion detection, demonstrating the effectiveness of these methods in diverse linguistic settings.

TEaR: Improving LLM-based Machine Translation with Systematic Self-Refinement
Zhaopeng Feng | Yan Zhang | Hao Li | Bei Wu | Jiayu Liao | Wenqiang Liu | Jun Lang | Yang Feng | Jian Wu | Zuozhu Liu
Findings of the Association for Computational Linguistics: NAACL 2025

Large Language Models (LLMs) have achieved impressive results in Machine Translation (MT). However, human evaluations reveal that LLM-generated translations still contain various errors. Notably, feeding the error information back into the LLMs can facilitate self-refinement, leading to enhanced translation quality. Motivated by these findings, we introduce TEaR (Translate, Estimate, and Refine), a systematic LLM-based self-refinement framework aimed at bootstrapping translation performance. Our key results show that: 1) TEaR framework enables LLMs to improve their translation quality relying solely on self-feedback, measured by both automatic metrics and Multidimensional Quality Metrics (MQM) scores; 2) TEaR autonomously selects improvements, ensuring a robust translation quality baseline while outperforming both internal refinement and external feedback methods. Error analysis and iterative refinement experiments show its ability to continuously reduce translation errors and enhance overall translation quality. Our code and data are publicly available at https://github.com/fzp0424/self_correct_mt.

基于数据合成的多模态讽刺隐喻理解大模型的构建
Lingrui Dai | Hao Li | Yunfang Wu
Proceedings of the 24th China National Conference on Computational Linguistics (CCL 2025)

"讽刺和隐喻是文学与语言表达中常见的修辞手法,以往相关研究多聚焦于分类任务上,且更多的基于英文数据进行探索。随着大模型与多模态大模型的不断涌现,模型对各种自然语言处理任务与多模态任务的处理能力得到了显著的提高。本文利用GPT-4o进行自动数据合成,来训练多模态大模型,实现了图文多模态讽刺隐喻综合理解任务。本文训练出能理解图片或图文讽刺隐喻内容,并进行详细解释或配文的参数量较小的多模态大模型,并且保证了模型具备良好的鲁棒性和通用性能。本文精心设计了数据构造方法,包括数据源的选择,指令数据的合成,回复数据的合成,来获得了一批高质量的多模态讽刺隐喻指令微调数据。我们选用了当前表现较好的多模态大模型作为骨干模型,使用合成数据并结合公开多模态图文数据集进行训练。在模型评测方面,本文分别从讽刺隐喻理解能力和通用能力进行评测,验证了模型的可用性。本文的数据以及模型权重将在后续放置在https://github.com/652897698/Multimodal-LLMs-for-Sarcasm-and-Metaphor-Undrerstanding"

Self-Critique Guided Iterative Reasoning for Multi-hop Question Answering
Zheng Chu | Huiming Fan | Jingchang Chen | Qianyu Wang | Mingda Yang | Jiafeng Liang | Zhongjie Wang | Hao Li | Guo Tang | Ming Liu | Bing Qin
Findings of the Association for Computational Linguistics: ACL 2025

Although large language models (LLMs) have demonstrated remarkable reasoning capabilities, they still face challenges in knowledge-intensive multi-hop reasoning. Recent work explores iterative retrieval to address complex problems. However, the absence of intermediate guidance often leads to inaccurate retrieval and intermediate reasoning errors, leading to incorrect reasoning. To address these, we propose Self-Critique Guided Iterative Reasoning (SiGIR), which uses self-critique feedback to guide the iterative reasoning process. Specifically, through end-to-end training, we enable the model to iteratively address complex problems via question decomposition, while also being able to self-evaluate its intermediate reasoning steps. During iterative reasoning, the model engages in branching exploration and employs self-evaluation to guide the selection of promising reasoning trajectories. Extensive experiments on three multi-hop reasoning datasets demonstrate the effectiveness of our proposed method, surpassing the previous SOTA by 8.6%. Furthermore, our thorough analysis offers insights for future research. Our code, data, and models are available at https://github.com/zchuz/SiGIR-MHQA.

LLMs know their vulnerabilities: Uncover Safety Gaps through Natural Distribution Shifts
Qibing Ren | Hao Li | Dongrui Liu | Zhanxu Xie | Xiaoya Lu | Yu Qiao | Lei Sha | Junchi Yan | Lizhuang Ma | Jing Shao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Safety concerns in large language models (LLMs) have gained significant attention due to their exposure to potentially harmful data during pre-training. In this paper, we identify a new safety vulnerability in LLMs: their susceptibility to natural distribution shifts between attack prompts and original toxic prompts, where seemingly benign prompts, semantically related to harmful content, can bypass safety mechanisms. To explore this issue, we introduce a novel attack method, ActorBreaker, which identifies actors related to toxic prompts within pre-training distribution to craft multi-turn prompts that gradually lead LLMs to reveal unsafe content. ActorBreaker is grounded in Latour’s actor-network theory, encompassing both human and non-human actors to capture a broader range of vulnerabilities. Our experimental results demonstrate that ActorBreaker outperforms existing attack methods in terms of diversity, effectiveness, and efficiency across aligned LLMs. To address this vulnerability, we propose expanding safety training to cover a broader semantic space of toxic content. We thus construct a multi-turn safety dataset using ActorBreaker. Fine-tuning models on our dataset shows significant improvements in robustness, though with some trade-offs in utility. Code is available at https://github.com/AI45Lab/ActorAttack.

Test-Time Code-Switching for Cross-lingual Aspect Sentiment Triplet Extraction
Dongming Sheng | Kexin Han | Hao Li | Yan Zhang | Yucheng Huang | Jun Lang | Wenqiang Liu
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Aspect Sentiment Triplet Extraction (ASTE) is a thriving research area with impressive outcomes being achieved on high-resource languages. However, the application of cross-lingual transfer to the ASTE task has been relatively unexplored, and current code-switching methods still suffer from term boundary detection issues and out-of-dictionary problems. In this study, we introduce a novel Test-Time Code-SWitching (TT-CSW) framework, which bridges the gap between the bilingual training phase and the monolingual test-time prediction. During training, a generative model is developed based on bilingual code-switched training data and can produce bilingual ASTE triplets for bilingual inputs. In the testing stage, we employ an alignment-based code-switching technique for test-time augmentation. Extensive experiments on cross-lingual ASTE datasets validate the effectiveness of our proposed method. We achieve an average improvement of 3.7% in terms of weighted-averaged F1 in four datasets with different languages. Additionally, we set a benchmark using ChatGPT and GPT-4, and demonstrate that even smaller generative models fine-tuned with our proposed TT-CSW framework surpass ChatGPT and GPT-4 by 14.2% and 5.0% respectively.

Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment
Hao Li | Lijun Li | Zhenghao Lu | Xianyi Wei | Rui Li | Jing Shao | Lei Sha
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

With rapid advancement and increasing accessibility of LLMs, fine-tuning aligned models has become a critical step for adapting them to real-world applications, which makes the safety of this fine-tuning process more important than ever. However, recent studies have highlighted a critical challenge: even when fine-tuning with seemingly benign downstream datasets, the safety of aligned LLMs can be compromised, making them more susceptible to malicious instructions. In this paper, we show that fine-tuning datasets often contain samples with safety-degrading features that are not easily identifiable on the surface. These samples can significantly degrade the safety alignment of LLMs during fine-tuning. To address this issue, we propose LARF, a Layer-Aware Representation Filtering method. This method identifies safety-sensitive layers within the LLM and leverages their representations to detect which data samples in the post-training dataset contain safety-degrading features. Experimental results demonstrate that LARF can effectively identify benign data with safety-degrading features. After removing such data, the safety alignment degradation caused by fine-tuning is mitigated.

2024

360∘REA: Towards A Reusable Experience Accumulation with 360∘ Assessment for Multi-Agent System
Shen Gao | Hao Li | Zhengliang Shi | Chengrui Huang | Quan Tu | Shuo Shang | Zhiliang Tian | Minlie Huang
Findings of the Association for Computational Linguistics: ACL 2024

ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings
Hao Wang | Hao Li | Minlie Huang | Lei Sha
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

The safety defense methods of Large language models (LLMs) stays limited because the dangerous prompts are manually curated to just few known attack types, which fails to keep pace with emerging varieties. Recent studies found that attaching suffixes to harmful instructions can hack the defense of LLMs and lead to dangerous outputs. However, similar to traditional text adversarial attacks, this approach, while effective, is limited by the challenge of the discrete tokens. This gradient based discrete optimization attack requires over 100,000 LLM calls, and due to the unreadable of adversarial suffixes, it can be relatively easily penetrated by common defense methods such as perplexity filters.To cope with this challenge, in this paper, we propose an Adversarial Suffix Embedding Translation Framework (ASETF), aimed at transforming continuous adversarial suffix embeddings into coherent and understandable text. This method greatly reduces the computational overhead during the attack process and helps to automatically generate multiple adversarial samples, which can be used as data to strengthen LLM’s security defense. Experimental evaluations were conducted on Llama2, Vicuna, and other prominent LLMs, employing harmful directives sourced from the Advbench dataset.The results indicate that our method significantly reduces the computation time of adversarial suffixes and achieves a much better attack success rate than existing techniques, while significantly enhancing the textual fluency of the prompts. In addition, our approach can be generalized into a broader method for generating transferable adversarial suffixes that can successfully attack multiple LLMs, even black-box LLMs, such as ChatGPT and Gemini.

STTATTS: Unified Speech-To-Text And Text-To-Speech Model
Hawau Olamide Toyin | Hao Li | Hanan Aldarmaki
Findings of the Association for Computational Linguistics: EMNLP 2024

Speech recognition and speech synthesis models are typically trained separately, each with its own set of learning objectives, training data, and model parameters, resulting in two distinct large networks. We propose a parameter-efficient approach to learning ASR and TTS jointly via a multi-task learning objective and shared parameters. Our evaluation demonstrates thatthe performance of our multi-task model is comparable to that of individually trained models while significantly savingcomputational and memory costs (~50% reduction in the total number of parameters required for the two tasks combined). We experiment with English as a resource-rich language, and Arabic as a relatively low-resource language due to shortage of TTS data. Our models are trained with publicly available data, and both the training code and model checkpoints are openly available for further research.

DEIE: Benchmarking Document-level Event Information Extraction with a Large-scale Chinese News Dataset
Yubing Ren | Yanan Cao | Hao Li | Yingjie Li | Zixuan ZM Ma | Fang Fang | Ping Guo | Wei Ma
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

A text corpus centered on events is foundational to research concerning the detection, representation, reasoning, and harnessing of online events. The majority of current event-based datasets mainly target sentence-level tasks, thus to advance event-related research spanning from sentence to document level, this paper introduces DEIE, a unified large-scale document-level event information extraction dataset with over 56,000+ events and 242,000+ arguments. Three key features stand out: large-scale manual annotation (20,000 documents), comprehensive unified annotation (encompassing event trigger/argument, summary, and relation at once), and emergency events annotation (covering 19 emergency types). Notably, our experiments reveal that current event-related models struggle with DEIE, signaling a pressing need for more advanced event-related research in the future.

面向社交媒体多特征增强的药物不良反应检测(Adverse drug reaction detection with multi-feature enhancement for social media)
Hao Li (李浩) | Yunzhi Qiu (邱云志) | Hongfei Lin (林鸿飞)
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference)

“社交媒体是药物不良反应(ADR)检测的重要途径之一。本文提出一个基于社交媒体的药物不良反应检测模型DMFE,以全面捕捉患者对药物使用的反馈信息。与传统的文本检测相比,社交媒体数据中通常会有语法不规范与单词拼写错误的问题。本文提取出社交媒体数据的抽象语义表示(AMR)使用图注意力网络(GAT)学习抽象语义特征提高模型对语义信息的理解,使用字符级卷积神经网络(charCNN)捕获字符特征以减少单词拼写错误带来的影响。此外,本文使用提示学习的方法融入荍荥荤荄荒荁药物不良反应领域关键词进一步增强模型对领域知识的理解能力。经实验评估,本文模型DMFE在CADEC、TwiMed两个数据集上F1值与基线模型相比取得最优效果。”

2023

Intra-Event and Inter-Event Dependency-Aware Graph Network for Event Argument Extraction
Hao Li | Yanan Cao | Yubing Ren | Fang Fang | Lanxue Zhang | Yingjie Li | Shi Wang
Findings of the Association for Computational Linguistics: EMNLP 2023

Event argument extraction is critical to various natural language processing tasks for providing structured information. Existing works usually extract the event arguments one by one, and mostly neglect to build dependency information among event argument roles, especially from the perspective of event structure. Such an approach hinders the model from learning the interactions between different roles. In this paper, we raise our research question: How to adequately model dependencies between different roles for better performance? To this end, we propose an intra-event and inter-event dependency-aware graph network, which uses the event structure as the fundamental unit to construct dependencies between roles. Specifically, we first utilize the dense intra-event graph to construct role dependencies within events, and then construct dependencies between events by retrieving similar events of the current event through the retrieval module. To further optimize dependency information and event representation, we propose a dependency interaction module and two auxiliary tasks to improve the extraction ability of the model in different scenarios. Experimental results on the ACE05, RAMS, and WikiEvents datasets show the great advantages of our proposed approach.

Not all quantifiers are equal: Probing Transformer-based language models’ understanding of generalised quantifiers
Tharindu Madusanka | Iqra Zahid | Hao Li | Ian Pratt-Hartmann | Riza Batista-Navarro
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

How do different generalised quantifiers affect the behaviour of transformer-based language models (TLMs)? The recent popularity of TLMs and the central role generalised quantifiers have traditionally played in linguistics and logic bring this question into particular focus. The current research investigating this subject has not utilised a task defined purely in a logical sense, and thus, has not captured the underlying logical significance of generalised quantifiers. Consequently, they have not answered the aforementioned question faithfully or adequately. Therefore, we investigate how different generalised quantifiers affect TLMs by employing a textual entailment problem defined in a purely logical sense, namely, model-checking with natural language. Our approach permits the automatic construction of datasets with respect to which we can assess the ability of TLMs to learn the meanings of generalised quantifiers. Our investigation reveals that TLMs generally can comprehend the logical semantics of the most common generalised quantifiers, but that distinct quantifiers influence TLMs in varying ways.

Team:PULSAR at ProbSum 2023:PULSAR: Pre-training with Extracted Healthcare Terms for Summarising Patients’ Problems and Data Augmentation with Black-box Large Language Models
Hao Li | Yuping Wu | Viktor Schlegel | Riza Batista-Navarro | Thanh-Tung Nguyen | Abhinav Ramesh Kashyap | Xiao-Jun Zeng | Daniel Beck | Stefan Winkler | Goran Nenadic
Proceedings of the 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks

Medical progress notes play a crucial role in documenting a patient’s hospital journey, including his or her condition, treatment plan, and any updates for healthcare providers. Automatic summarisation of a patient’s problems in the form of a “problem list” can aid stakeholders in understanding a patient’s condition, reducing workload and cognitive bias. BioNLP 2023 Shared Task 1A focusses on generating a list of diagnoses and problems from the provider’s progress notes during hospitalisation. In this paper, we introduce our proposed approach to this task, which integrates two complementary components. One component employs large language models (LLMs) for data augmentation; the other is an abstractive summarisation LLM with a novel pre-training objective for generating the patients’ problems summarised as a list. Our approach was ranked second among all submissions to the shared task. The performance of our model on the development and test datasets shows that our approach is more robust on unknown data, with an improvement of up to 3.1 points over the same size of the larger model.

2020

Position-Aware Tagging for Aspect Sentiment Triplet Extraction
Lu Xu | Hao Li | Wei Lu | Lidong Bing
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Aspect Sentiment Triplet Extraction (ASTE) is the task of extracting the triplets of target entities, their associated sentiment, and opinion spans explaining the reason for the sentiment. Existing research efforts mostly solve this problem using pipeline approaches, which break the triplet extraction process into several stages. Our observation is that the three elements within a triplet are highly related to each other, and this motivates us to build a joint model to extract such triplets using a sequence tagging approach. However, how to effectively design a tagging approach to extract the triplets that can capture the rich interactions among the elements is a challenging research question. In this work, we propose the first end-to-end model with a novel position-aware tagging scheme that is capable of jointly extracting the triplets. Our experimental results on several existing datasets show that jointly capturing elements in the triplet using our approach leads to improved performance over the existing approaches. We also conducted extensive experiments to investigate the model effectiveness and robustness.

2019

Learning Explicit and Implicit Structures for Targeted Sentiment Analysis
Hao Li | Wei Lu
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Targeted sentiment analysis is the task of jointly predicting target entities and their associated sentiment information. Existing research efforts mostly regard this joint task as a sequence labeling problem, building models that can capture explicit structures in the output space. However, the importance of capturing implicit global structural information that resides in the input space is largely unexplored. In this work, we argue that both types of information (implicit and explicit structural information) are crucial for building a successful targeted sentiment analysis model. Our experimental results show that properly capturing both information is able to lead to better performance than competitive existing approaches. We also conduct extensive experiments to investigate our model’s effectiveness and robustness.

Reinforced Dynamic Reasoning for Conversational Question Generation
Boyuan Pan | Hao Li | Ziyu Yao | Deng Cai | Huan Sun
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

This paper investigates a new task named Conversational Question Generation (CQG) which is to generate a question based on a passage and a conversation history (i.e., previous turns of question-answer pairs). CQG is a crucial task for developing intelligent agents that can drive question-answering style conversations or test user understanding of a given passage. Towards that end, we propose a new approach named Reinforced Dynamic Reasoning network, which is based on the general encoder-decoder framework but incorporates a reasoning procedure in a dynamic manner to better understand what has been asked and what to ask next about the passage into the general encoder-decoder framework. To encourage producing meaningful questions, we leverage a popular question answering (QA) model to provide feedback and fine-tune the question generator using a reinforcement learning mechanism. Empirical results on the recently released CoQA dataset demonstrate the effectiveness of our method in comparison with various baselines and model variants. Moreover, to show the applicability of our method, we also apply it to create multi-turn question-answering conversations for passages in SQuAD.

Neural Chinese Address Parsing
Hao Li | Wei Lu | Pengjun Xie | Linlin Li
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

This paper introduces a new task – Chinese address parsing – the task of mapping Chinese addresses into semantically meaningful chunks. While it is possible to model this problem using a conventional sequence labelling approach, our observation is that there exist complex dependencies between labels that cannot be readily captured by a simple linear-chain structure. We investigate neural structured prediction models with latent variables to capture such rich structural information within Chinese addresses. We create and publicly release a new dataset consisting of 15K Chinese addresses, and conduct extensive experiments on the dataset to investigate the model effectiveness and robustness. We release our code and data at http://statnlp.org/research/sp.

2018

Learning with Structured Representations for Negation Scope Extraction
Hao Li | Wei Lu
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We report an empirical study on the task of negation scope extraction given the negation cue. Our key observation is that certain useful information such as features related to negation cue, long-distance dependencies as well as some latent structural information can be exploited for such a task. We design approaches based on conditional random fields (CRF), semi-Markov CRF, as well as latent-variable CRF models to capture such information. Extensive experiments on several standard datasets demonstrate that our approaches are able to achieve better results than existing approaches reported in the literature.

2016

Cross-genre Event Extraction with Knowledge Enrichment
Hao Li | Heng Ji
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2014

Two-Stage Hashing for Fast Document Retrieval
Hao Li | Wei Liu | Heng Ji
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2013

Linking Tweets to News: A Framework to Enrich Short Text Data in Social Media
Weiwei Guo | Hao Li | Heng Ji | Mona Diab
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2012

Combining Social Cognitive Theories with Linguistic Features for Multi-genre Sentiment Analysis
Hao Li | Yu Chen | Heng Ji | Smaranda Muresan | Dequan Zheng
Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation

2011

Cross-lingual Slot Filling from Comparable Corpora
Matthew Snover | Xiang Li | Wen-Pin Lin | Zheng Chen | Suzanne Tamang | Mingmin Ge | Adam Lee | Qi Li | Hao Li | Sam Anzaroot | Heng Ji
Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web

2010

Domain-Independent Novel Event Discovery and Semi-Automatic Event Annotation
Hao Li | Xiang Li | Heng Ji | Yuval Marton
Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation

Co-authors

Riza Theresa Batista-Navarro 2

Jingchang Chen 2

Jiansong Chen 2

Jiafeng Liang 2

Thanh-Tung Nguyen 2

Bing Qin (秦兵) 2

Jingqing Ruan 2

Viktor Schlegel 2

Stefan Winkler 2

Hanan Aldarmaki 1

Kuluhan Binici 1

Zhaopeng Feng 1

Chengrui Huang 1

Xuan-Jing Huang (黄萱菁) 1

Yucheng Huang 1

Shengpei Jiang 1

Hongfei Lin (林鸿飞) 1

Tharindu Madusanka 1

Smaranda Muresan 1

Goran Nenadic 1

Ian Pratt-Hartmann 1

Abhinav Ramesh Kashyap 1

Yongliang Shen 1

Dongming Sheng 1

Zhengliang Shi 1

Matthew Snover 1

Suzanne Tamang 1

Zhiliang Tian 1

Hawau Olamide Toyin 1

Zhongjie Wang (王钟杰) 1

Tongshen Yang 1

Xiao-Jun Zeng 1

Xiaoyun Zhang 1

Qiaosheng Zhang 1

Venues