Hao Zhang


2025

pdf bib
TrimLLM: Progressive Layer Dropping for Domain-Specific LLMs
Lanxiang Hu | Tajana Rosing | Hao Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Specializing large language models (LLMs) for local deployment in domain-specific use cases is necessary for strong performance while meeting latency and privacy constraints. However, conventional task-specific adaptation approaches do not show simultaneous memory saving and inference speedup at deployment time. Practical compression techniques like quantization and pruning require dedicated hardware or kernel support to achieve measured inference speedup. We develop TrimLLM based on the layer-wise specialization phenomenon we empirically observed and verified on contemporary LLMs. TrimLLM reduces the depth of LLMs via progressive layer dropping. We show it retains LLMs’ capacity in specific domains and achieves inference speedup irrespective of hardware and deep learning frameworks. We evaluated TrimLLM on LLMs of various sizes for inference; models adapted on medical, legal, and financial datasets all demonstrate 2.1 - 5.7× inference speedup on consumer GPUs and up to 3.1× speedup on A100 when compared to state-of-the-art model compression algorithms, with no loss in accuracy at 50∼ 60% model compression ratio.

pdf bib
Reversal of Thought: Enhancing Large Language Models with Preference-Guided Reverse Reasoning Warm-up
Jiahao Yuan | Dehui Du | Hao Zhang | Zixiang Di | Usman Naseem
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language models (LLMs) have shown remarkable performance in reasoning tasks but face limitations in mathematical and complex logical reasoning. Existing methods to improve LLMs’ logical capabilities either involve traceable or verifiable logical sequences that generate more reliable responses by constructing logical structures yet increase computational costs, or introduces rigid logic template rules, reducing flexibility. In this paper, we propose Reversal of Thought (RoT), a plug-and-play and cost-effective reasoning framework designed to enhance the logical reasoning abilities of LLMs during the warm-up phase prior to batch inference. RoT utilizes a Preference-Guided Reverse Reasoning warm-up strategy, which integrates logical symbols for pseudocode planning through meta-cognitive mechanisms and pairwise preference self-evaluation to generate task-specific prompts solely through demonstrations, aligning with LLMs’ cognitive preferences shaped by RLHF. Through reverse reasoning, we utilize a Cognitive Preference Manager to assess knowledge boundaries and further expand LLMs’ reasoning capabilities by aggregating solution logic for known tasks and stylistic templates for unknown tasks. Experiments across various tasks demonstrate that RoT surpasses existing baselines in both reasoning accuracy and efficiency.

pdf bib
Planning with Multi-Constraints via Collaborative Language Agents
Cong Zhang | Xin Deik Goh | Dexun Li | Hao Zhang | Yong Liu
Proceedings of the 31st International Conference on Computational Linguistics

The rapid advancement of neural language models has sparked a new surge of intelligent agent research. Unlike traditional agents, large language model-based agents (LLM agents) have emerged as a promising paradigm for achieving artificial general intelligence (AGI) due to their superior reasoning and generalization capabilities. Effective planning is crucial for the success of LLM agents in real-world tasks, making it a highly pursued topic in the community. Current planning methods typically translate tasks into executable action sequences. However, determining a feasible or optimal sequence for complex tasks with multiple constraints at fine granularity, which often requires compositing long chains of heterogeneous actions, remains challenging. This paper introduces Planning with Multi-Constraints (PMC), a zero-shot methodology for collaborative LLM-based multi-agent systems that simplifies complex task planning with constraints by decomposing it into a hierarchy of subordinate tasks. Each subtask is then mapped into executable actions. PMC was assessed on two constraint-intensive benchmarks, TravelPlanner and API-Bank. Notably, PMC achieved an average 42.68% success rate on TravelPlanner, significantly higher than GPT-4 (2.92%), and outperforming GPT-4 with ReAct on API-Bank by 13.64%, showing the immense potential of integrating LLM with multi-agent systems. We also show that PMC works with small LLM as the planning core, e.g., LLaMA-3.1-8B.

pdf bib
Transferable Direct Prompt Injection via Activation-Guided MCMC Sampling
Minghui Li | Hao Zhang | Yechao Zhang | Wei Wan | Shengshan Hu | Pei Xiaobing | Jing Wang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Direct Prompt Injection (DPI) attacks pose a critical security threat to Large Language Models (LLMs) due to their low barrier of execution and high potential damage. To address the impracticality of existing white-box/gray-box methods and the poor transferability of black-box methods, we propose an activations-guided prompt injection attack framework. We first construct an Energy-based Model (EBM) using activations from a surrogate model to evaluate the quality of adversarial prompts. Guided by the trained EBM, we employ the token-level Markov Chain Monte Carlo (MCMC) sampling to adaptively optimize adversarial prompts, thereby enabling gradient-free black-box attacks. Experimental results demonstrate our superior cross-model transferability, achieving 49.6% attack success rate (ASR) across five mainstream LLMs and 34.6% improvement over human-crafted prompts, and maintaining 36.6% ASR on unseen task scenarios. Interpretability analysis reveals a correlation between activations and attack effectiveness, highlighting the critical role of semantic patterns in transferable vulnerability exploitation.

pdf bib
Long-form Hallucination Detection with Self-elicitation
Zihang Liu | Jiawei Guo | Hao Zhang | Hongyang Chen | Jiajun Bu | Haishuai Wang
Findings of the Association for Computational Linguistics: ACL 2025

While Large Language Models (LLMs) have exhibited impressive performance in generating long-form content, they frequently present a hazard of producing factual inaccuracies or hallucinations. An effective strategy to mitigate this hazard is to leverage off-the-shelf LLMs to detect hallucinations after the generation. The primary challenge resides in the comprehensive elicitation of the intrinsic knowledge acquired during their pre-training phase. However, existing methods that employ multi-step reasoning chains predominantly fall short of addressing this issue. Moreover, since existing methods for hallucination detection tend to decompose text into isolated statements, they are unable to understand the contextual semantic relations in long-form content. In this paper, we study a novel concept, self-elicitation, to leverage self-generated thoughts derived from prior statements as catalysts to elicit the expression of intrinsic knowledge and understand contextual semantics. We present a framework, SelfElicit, to integrate self-elicitation with graph structures to effectively organize the elicited knowledge and facilitate factual evaluations. Extensive experiments on five datasets in various domains demonstrate the effectiveness of self-elicitation and the superiority of our proposed method.

pdf bib
Sugar-Coated Poison: Benign Generation Unlocks Jailbreaking
Yuhang Wu | Yu-Jie Xiong | Hao Zhang | Jia-Chen Zhang | Zheng Zhou
Findings of the Association for Computational Linguistics: EMNLP 2025

With the increasingly deep integration of large language models (LLMs) across diverse domains, the effectiveness of their safety mechanisms is encountering severe challenges. Currently, jailbreak attacks based on prompt engineering, which induce models to generate potentially harmful content, have become a major security threat. However, existing methods primarily rely on black-box manipulation of prompt templates, resulting in high costs and poor generalizability. To break through the bottleneck, this study reveals the potential impact of the generation of LLMs on safety for the first time that Defense Threshold Decay (DTD) phenomena: as benign content generation increases, the model’s attention to input instructions progressively diminishes. Building on this insight, we propose the Sugar-Coated Poison (SCP) attack paradigm, using a “semantic reversal” strategy, where benign inputs that are opposite in meaning to malicious intent are crafted to induce the model into a safety response mode. When the defense threshold decays, an adversarial reasoning mechanism easily bypasses safety mechanisms. Experiments show SCP outperforms existing baselines. For defense, we propose Part-of-Speech Defense (POSD), leveraging verb-noun dependencies for syntactic analysis to enhance robustness and security of LLMs. Our code is available at https://anonymous.4open.science/r/SCP-9092.

pdf bib
Sensitivity-LoRA : Low-Load Sensitivity-Based Fine-Tuning for Large Language Models
Hao Zhang | Bo Huang | Zhenjia Li | Xi Xiao | Hui Yi Leong | Zumeng Zhang | Xinwei Long | Tianyang Wang | Hao Xu
Findings of the Association for Computational Linguistics: EMNLP 2025

Large Language Models (LLMs) have transformed both everyday life and scientific research. However, adapting LLMs from general-purpose models to specialized tasks remains challenging, particularly in resource-constrained environments. Low-Rank Adaptation (LoRA), a prominent method within Parameter-Efficient Fine-Tuning (PEFT), has emerged as a promising approach to LLMs by approximating model weight updates using low-rank decomposition. However, LoRA is limited by its uniform rank ( r ) allocation to each incremental matrix, and existing rank allocation techniques aimed at addressing this issue remain computationally inefficient, complex, and unstable, hindering practical applications. To address these limitations, we propose Sensitivity-LoRA, an efficient fine-tuning method that dynamically allocates ranks to weight matrices based on both their global and local sensitivities. It leverages the second-order derivatives (Hessian Matrix) of the loss function to effectively capture weight sensitivity, enabling optimal rank allocation with minimal computational overhead. Our experimental results have demonstrated robust effectiveness, efficiency and stability of Sensitivity-LoRA across diverse tasks and benchmarks.

pdf bib
SafetyQuizzer: Timely and Dynamic Evaluation on the Safety of LLMs
Zhichao Shi | Shaoling Jing | Yi Cheng | Hao Zhang | Yuanzhuo Wang | Jie Zhang | Huawei Shen | Xueqi Cheng
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

With the expansion of the application of Large Language Models (LLMs), concerns about their safety have grown among researchers. Numerous studies have demonstrated the potential risks of LLMs generating harmful content and have proposed various safety assessment benchmarks to evaluate these risks. However, the evaluation questions in current benchmarks, especially for Chinese, are too straightforward, making them easily rejected by target LLMs, and difficult to update with practical relevance due to their lack of correlation with real-world events. This hinders the effective application of these benchmarks in continuous evaluation tasks. To address these limitations, we propose SafetyQuizzer, a question-generation framework designed to evaluate the safety of LLMs more sustainably in the Chinese context. SafetyQuizzer leverages a finetuned LLM and jailbreaking attack templates to generate subtly offensive questions, which reduces the decline rate. Additionally, by utilizing retrieval-augmented generation, SafetyQuizzer incorporates the latest real-world events into evaluation questions, improving the adaptability of the benchmarks. Our experiments demonstrate that evaluation questions generated by SafetyQuizzer significantly reduce the decline rate compared to other benchmarks while maintaining a comparable attack success rate. Our code is available at https://github.com/zhichao-stone/SafetyQuizzer. Warning: this paper contains examples that may be offensive or upsetting.

2024

pdf bib
Efficient Detection of LLM-generated Texts with a Bayesian Surrogate Model
Yibo Miao | Hongcheng Gao | Hao Zhang | Zhijie Deng
Findings of the Association for Computational Linguistics: ACL 2024

The detection of machine-generated text, especially from large language models (LLMs), is crucial in preventing serious social problems resulting from their misuse. Some methods train dedicated detectors on specific datasets but fall short in generalizing to unseen test data, while other zero-shot ones often yield suboptimal performance. Although the recent DetectGPT has shown promising detection performance, it suffers from significant inefficiency issues, as detecting a single candidate requires querying the source LLM with hundreds of its perturbations. This paper aims to bridge this gap. Concretely, we propose to incorporate a Bayesian surrogate model, which allows us to select typical samples based on Bayesian uncertainty and interpolate scores from typical samples to other samples, to improve query efficiency. Empirical results demonstrate that our method significantly outperforms existing approaches under a low query budget. Notably, when detecting the text generated by LLaMA family models, our method with just 2 or 3 queries can outperform DetectGPT with 200 queries.

pdf bib
LMDX: Language Model-based Document Information Extraction and Localization
Vincent Perot | Kai Kang | Florian Luisier | Guolong Su | Xiaoyu Sun | Ramya Sree Boppana | Zilong Wang | Zifeng Wang | Jiaqi Mu | Hao Zhang | Chen-Yu Lee | Nan Hua
Findings of the Association for Computational Linguistics: ACL 2024

Large Language Models (LLM) have revolutionized Natural Language Processing (NLP), improving state-of-the-art and exhibiting emergent capabilities across various tasks. However, their application in extracting information from visually rich documents, which is at the core of many document processing workflows and involving the extraction of key entities from semi-structured documents, has not yet been successful. The main obstacles to adopting LLMs for this task include the absence of layout encoding within LLMs, which is critical for high quality extraction, and the lack of a grounding mechanism to localize the predicted entities within the document. In this paper, we introduce Language Model-based Document Information EXtraction and Localization (LMDX), a methodology to reframe the document information extraction task for a LLM. LMDX enables extraction of singular, repeated, and hierarchical entities, both with and without training data, while providing grounding guarantees and localizing the entities within the document. Finally, we apply LMDX to the PaLM 2-S and Gemini Pro LLMs and evaluate it on VRDU and CORD benchmarks, setting a new state-of-the-art and showing how LMDX enables the creation of high quality, data-efficient parsers.

pdf bib
AdaMoE: Token-Adaptive Routing with Null Experts for Mixture-of-Experts Language Models
Zihao Zeng | Yibo Miao | Hongcheng Gao | Hao Zhang | Zhijie Deng
Findings of the Association for Computational Linguistics: EMNLP 2024

Mixture of experts (MoE) has become the standard for constructing production-level large language models (LLMs) due to its promise to boost model capacity without causing significant overheads. Nevertheless, existing MoE methods usually enforce a constant top-k routing for all tokens, which is arguably restrictive because various tokens (e.g., "<EOS>” vs. “apple”) may require various numbers of experts for feature abstraction. Lifting such a constraint can help make the most of limited resources and unleash the potential of the model for downstream tasks. In this sense, we introduce **AdaMoE** to realize token-adaptive routing for MoE, where different tokens are permitted to select a various number of experts. AdaMoE makes minimal modifications to the vanilla MoE with top-k routing—it simply introduces a fixed number of *null experts*, which do not consume any FLOPs, to the expert set and increases the value of k. AdaMoE does not force each token to occupy a fixed number of null experts but ensures the average usage of the null experts with a load-balancing loss, leading to an adaptive number of null/true experts used by each token. AdaMoE exhibits a strong resemblance to MoEs with expert choice routing while allowing for trivial auto-regressive modeling. AdaMoE is easy to implement and can be effectively applied to pre-trained (MoE-)LLMs. Extensive studies show that AdaMoE can reduce average expert load (FLOPs) while achieving superior performance. For example, on the ARC-C dataset, applying our method to fine-tuning Mixtral-8x7B can reduce FLOPs by 14.5% while increasing accuracy by 1.69%.Code is available at [this link](https://github.com/CengZihao/AdaMoE).

pdf bib
EpiGEN: An Efficient Multi-Api Code GENeration Framework under Enterprise Scenario
Sijie Li | Sha Li | Hao Zhang | Shuyang Li | Kai Chen | Jianyong Yuan | Yi Cao | Lvqing Yang
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

In recent years, Large Language Models (LLMs) have demonstrated exceptional performance in code-generation tasks. However, under enterprise scenarios where private APIs are pre-built, general LLMs often fail to meet expectations. Existing approaches are confronted with drawbacks of high resource consumption and inadequate handling of multi-API tasks. To address these challenges, we propose EpiGEN, an Efficient multi-Api code GENeration framework under enterprise scenario. It consists of three core modules: Task Decomposition Module (TDM), API Retrieval Module (ARM), and Code Generation Module (CGM), in which Langchain played an important role. Through a series of experiments, EpiGEN shows good acceptability and readability, compared to fully fine-tuned LLM with a larger number of parameters. Particularly, in medium and hard level tasks, the performance of EpiGEN on a single-GPU machine even surpasses that of a fully fine-tuned LLM that requires multi-GPU configuration. Generally, EpiGEN is model-size agnostic, facilitating a balance between the performance of code generation and computational requirements.

pdf bib
Meta-Adapter for Self-Supervised Speech Models: A Solution to Low-Resource Speech Recognition Challenges
Yaqi Chen | Hao Zhang | Xukui Yang | Wenlin Zhang | Dan Qu
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Self-supervised models have demonstrated remarkable performance in speech processing by learning latent representations from large amounts of unlabeled data. Although these models yield promising results on low-resource languages, the computational expense of fine-tuning all model parameters is prohibitively high. Adapters offer a solution by incorporating lightweight bottleneck structures into pre-trained models, enabling efficient parameter adaptation for downstream tasks. However, randomly initialized adapters often underperform in low-resource scenarios, limiting their applicability in low-resource languages. To address this issue, we develop the Meta-Adapter for self-supervised models to obtain meta-initialized parameters that facilitate quick adaptation to low-resource languages. Extensive experiments on the Common Voice and FLEURS datasets demonstrate the superior performance of Meta-Adapters on 12 low-resource languages spanning four different language families. Moreover, Meta-adapters show better generalization and extensibility than traditional pretraining methods.

2023

pdf bib
FormNetV2: Multimodal Graph Contrastive Learning for Form Document Information Extraction
Chen-Yu Lee | Chun-Liang Li | Hao Zhang | Timothy Dozat | Vincent Perot | Guolong Su | Xiang Zhang | Kihyuk Sohn | Nikolay Glushnev | Renshen Wang | Joshua Ainslie | Shangbang Long | Siyang Qin | Yasuhisa Fujii | Nan Hua | Tomas Pfister
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The recent advent of self-supervised pre-training techniques has led to a surge in the use of multimodal learning in form document understanding. However, existing approaches that extend the mask language modeling to other modalities require careful multi-task tuning, complex reconstruction target designs, or additional pre-training data. In FormNetV2, we introduce a centralized multimodal graph contrastive learning strategy to unify self-supervised pre-training for all modalities in one loss. The graph contrastive objective maximizes the agreement of multimodal representations, providing a natural interplay for all modalities without special customization. In addition, we extract image features within the bounding box that joins a pair of tokens connected by a graph edge, capturing more targeted visual cues without loading a sophisticated and separately pre-trained image embedder. FormNetV2 establishes new state-of-the-art performance on FUNSD, CORD, SROIE and Payment benchmarks with a more compact model size.

pdf bib
UKP-SQuARE v3: A Platform for Multi-Agent QA Research
Haritz Puerto | Tim Baumgärtner | Rachneet Sachdeva | Haishuo Fang | Hao Zhang | Sewin Tariverdian | Kexin Wang | Iryna Gurevych
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)

The continuous development of Question Answering (QA) datasets has drawn the research community’s attention toward multi-domain models. A popular approach is to use multi-dataset models, which are models trained on multiple datasets to learn their regularities and prevent overfitting to a single dataset. However, with the proliferation of QA models in online repositories such as GitHub or Hugging Face, an alternative is becoming viable. Recent works have demonstrated that combining expert agents can yield large performance gains over multi-dataset models. To ease research in multi-agent models, we extend UKP-SQuARE, an online platform for QA research, to support three families of multi-agent systems: i) agent selection, ii) early-fusion of agents, and iii) late-fusion of agents. We conduct experiments to evaluate their inference speed and discuss the performance vs. speed trade-off compared to multi-dataset models. UKP-SQuARE is open-source and publicly available.

pdf bib
TLM: Token-Level Masking for Transformers
Yangjun Wu | Kebin Fang | Dongxiang Zhang | Han Wang | Hao Zhang | Gang Chen
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Structured dropout approaches, such as attention dropout and DropHead, have been investigated to regularize the multi-head attention mechanism in Transformers. In this paper, we propose a new regularization scheme based on token-level rather than structure-level to reduce overfitting. Specifically, we devise a novel Token-Level Masking (TLM) training strategy for Transformers to regularize the connections of self-attention, which consists of two masking techniques that are effective and easy to implement. The underlying idea is to manipulate the connections between tokens in the multi-head attention via masking, where the networks are forced to exploit partial neighbors’ information to produce a meaningful representation. The generality and effectiveness of TLM are thoroughly evaluated via extensive experiments on 4 diversified NLP tasks across 18 datasets, including natural language understanding benchmark GLUE, ChineseGLUE, Chinese Grammatical Error Correction, and data-to-text generation. The results indicate that TLM can consistently outperform attention dropout and DropHead, e.g., it increases by 0.5 points relative to DropHead with BERT-large on GLUE. Moreover, TLM can establish a new record on the data-to-text benchmark Rotowire (18.93 BLEU). Our code will be publicly available at https://github.com/Young1993/tlm.

pdf bib
QueryForm: A Simple Zero-shot Form Entity Query Framework
Zifeng Wang | Zizhao Zhang | Jacob Devlin | Chen-Yu Lee | Guolong Su | Hao Zhang | Jennifer Dy | Vincent Perot | Tomas Pfister
Findings of the Association for Computational Linguistics: ACL 2023

Zero-shot transfer learning for document understanding is a crucial yet under-investigated scenario to help reduce the high cost involved in annotating document entities. We present a novel query-based framework, QueryForm, that extracts entity values from form-like documents in a zero-shot fashion. QueryForm contains a dual prompting mechanism that composes both the document schema and a specific entity type into a query, which is used to prompt a Transformer model to perform a single entity extraction task. Furthermore, we propose to leverage large-scale query-entity pairs generated from form-like webpages with weak HTML annotations to pre-train QueryForm. By unifying pre-training and fine-tuning into the same query-based framework, QueryForm enables models to learn from structured documents containing various entities and layouts, leading to better generalization to target document types without the need for target-specific training data. QueryForm sets new state-of-the-art average F1 score on both the XFUND (+4.6% 10.1%) and the Payment (+3.2% 9.5%) zero-shot benchmark, with a smaller model size and no additional image input.

pdf bib
ZBL2W at SemEval-2023 Task 9: A Multilingual Fine-tuning Model with Data Augmentation for Tweet Intimacy Analysis
Hao Zhang | Youlin Wu | Junyu Lu | Zewen Bai | Jiangming Wu | Hongfei Lin | Shaowu Zhang
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

This paper describes our system used in the SemEval-2023 Task 9 Multilingual Tweet Intimacy Analysis. There are two key challenges in this task: the complexity of multilingual and zero-shot cross-lingual learning, and the difficulty of semantic mining of tweet intimacy. To solve the above problems, our system extracts contextual representations from the pretrained language models, XLM-T, and employs various optimization methods, including adversarial training, data augmentation, ordinal regression loss and special training strategy. Our system ranked 14th out of 54 participating teams on the leaderboard and ranked 10th on predicting languages not in the training data. Our code is available on Github.

pdf bib
DUTIR at SemEval-2023 Task 10: Semi-supervised Learning for Sexism Detection in English
Bingjie Yu | Zewen Bai | Haoran Ji | Shiyi Li | Hao Zhang | Hongfei Lin
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

Sexism is an injustice afflicting women and has become a common form of oppression in social media. In recent years, the automatic detection of sexist instances has been utilized to combat this oppression. The Subtask A of SemEval-2023 Task 10, Explainable Detection of Online Sexism, aims to detect whether an English-language post is sexist. In this paper, we describe our system for the competition. The structure of the classification model is based on RoBERTa, and we further pre-train it on the domain corpus. For fine-tuning, we adopt Unsupervised Data Augmentation (UDA), a semi-supervised learning approach, to improve the robustness of the system. Specifically, we employ Easy Data Augmentation (EDA) method as the noising operation for consistency training. We train multiple models based on different hyperparameter settings and adopt the majority voting method to predict the labels of test entries. Our proposed system achieves a Macro-F1 score of 0.8352 and a ranking of 41/84 on the leaderboard of Subtask A.

2022

pdf bib
UKP-SQuARE v2: Explainability and Adversarial Attacks for Trustworthy QA
Rachneet Sachdeva | Haritz Puerto | Tim Baumgärtner | Sewin Tariverdian | Hao Zhang | Kexin Wang | Hossain Shaikh Saadi | Leonardo F. R. Ribeiro | Iryna Gurevych
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing: System Demonstrations

Question Answering (QA) systems are increasingly deployed in applications where they support real-world decisions. However, state-of-the-art models rely on deep neural networks, which are difficult to interpret by humans. Inherently interpretable models or post hoc explainability methods can help users to comprehend how a model arrives at its prediction and, if successful, increase their trust in the system. Furthermore, researchers can leverage these insights to develop new methods that are more accurate and less biased. In this paper, we introduce SQuARE v2, the new version of SQuARE, to provide an explainability infrastructure for comparing models based on methods such as saliency maps and graph-based explanations. While saliency maps are useful to inspect the importance of each input token for the model’s prediction, graph-based explanations from external Knowledge Graphs enable the users to verify the reasoning behind the model prediction. In addition, we provide multiple adversarial attacks to compare the robustness of QA models. With these explainability methods and adversarial attacks, we aim to ease the research on trustworthy QA models. SQuARE is available on https://square.ukp-lab.de.

pdf bib
Incorporating Instructional Prompts into a Unified Generative Framework for Joint Multiple Intent Detection and Slot Filling
Yangjun Wu | Han Wang | Dongxiang Zhang | Gang Chen | Hao Zhang
Proceedings of the 29th International Conference on Computational Linguistics

The joint multiple Intent Detection (ID) and Slot Filling (SF) is a significant challenge in spoken language understanding. Because the slots in an utterance may relate to multi-intents, most existing approaches focus on utilizing task-specific components to capture the relations between intents and slots. The customized networks restrict models from modeling commonalities between tasks and generalization for broader applications. To address the above issue, we propose a Unified Generative framework (UGEN) based on a prompt-based paradigm, and formulate the task as a question-answering problem. Specifically, we design 5-type templates as instructional prompts, and each template includes a question that acts as the driver to teach UGEN to grasp the paradigm, options that list the candidate intents or slots to reduce the answer search space, and the context denotes original utterance. Through the instructional prompts, UGEN is guided to understand intents, slots, and their implicit correlations. On two popular multi-intent benchmark datasets, experimental results demonstrate that UGEN achieves new SOTA performances on full-data and surpasses the baselines by a large margin on 5-shot (28.1%) and 10-shot (23%) scenarios, which verify that UGEN is robust and effective.

pdf bib
Language Model Decomposition: Quantifying the Dependency and Correlation of Language Models
Hao Zhang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Pre-trained language models (LMs), such as BERT (Devlin et al., 2018) and its variants, have led to significant improvements on various NLP tasks in past years. However, a theoretical framework for studying their relationships is still missing. In this paper, we fill this gap by investigating the linear dependency between pre-trained LMs. The linear dependency of LMs is defined analogously to the linear dependency of vectors. We propose Language Model Decomposition (LMD) to represent a LM using a linear combination of other LMs as basis, and derive the closed-form solution. A goodness-of-fit metric for LMD similar to the coefficient of determination is defined and used to measure the linear dependency of a set of LMs. In experiments, we find that BERT and eleven (11) BERT-like LMs are 91% linearly dependent. This observation suggests that current state-of-the-art (SOTA) LMs are highly “correlated”. To further advance SOTA we need more diverse and novel LMs that are less dependent on existing LMs.

pdf bib
STGN: an Implicit Regularization Method for Learning with Noisy Labels in Natural Language Processing
Tingting Wu | Xiao Ding | Minji Tang | Hao Zhang | Bing Qin | Ting Liu
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Noisy labels are ubiquitous in natural language processing (NLP) tasks. Existing work, namely learning with noisy labels in NLP, is often limited to dedicated tasks or specific training procedures, making it hard to be widely used. To address this issue, SGD noise has been explored to provide a more general way to alleviate the effect of noisy labels by involving benign noise in the process of stochastic gradient descent. However, previous studies exert identical perturbation for all samples, which may cause overfitting on incorrect ones or optimizing correct ones inadequately. To facilitate this, we propose a novel stochastic tailor-made gradient noise (STGN), mitigating the effect of inherent label noise by introducing tailor-made benign noise for each sample. Specifically, we investigate multiple principles to precisely and stably discriminate correct samples from incorrect ones and thus apply different intensities of perturbation to them. A detailed theoretical analysis shows that STGN has good properties, beneficial for model generalization. Experiments on three different NLP tasks demonstrate the effectiveness and versatility of STGN. Also, STGN can boost existing robust training methods.

pdf bib
Interventional Training for Out-Of-Distribution Natural Language Understanding
Sicheng Yu | Jing Jiang | Hao Zhang | Yulei Niu | Qianru Sun | Lidong Bing
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Out-of-distribution (OOD) settings are used to measure a model’s performance when the distribution of the test data is different from that of the training data. NLU models are known to suffer in OOD. We study this issue from the perspective of causality, which sees confounding bias as the reason for models to learn spurious correlations. While a common solution is to perform intervention, existing methods handle only known and single confounder, but in many NLU tasks the confounders can be both unknown and multifactorial. In this paper, we propose a novel interventional training method called Bottom-up Automatic Intervention (BAI) that performs multi-granular intervention with identified multifactorial confounders. Our experiments on three NLU tasks, namely, natural language inference, fact verification and paraphrase identification, show the effectiveness of BAI for tackling OOD settings.

pdf bib
FCGCL: Fine- and Coarse-Granularity Contrastive Learning for Speech Translation
Hao Zhang | Nianwen Si | Yaqi Chen | Zhen Li | Tong Niu | Xukui Yang | Dan Qu
Findings of the Association for Computational Linguistics: EMNLP 2022

It is notoriously difficult to implement end-to-end speech translation (E2E-ST) model because of the task complexity and data scarcity. Existing techniques often attempt to carry out implicit knowledge transfer from machine translation (MT) to ST model by imposing various constraints. However, in this transfer scenario, a significant problem is that the performance of the MT will drop significantly and the final transfer effect is also restricted. In this article, we recommend Fine and Coarse Granularity Contrastive Learning (FCGCL), which conduct explicit knowledge transfer from MT to ST model. Specially, we ensure through multi granularity contrastive learning that inputs with similar semantic between different modalities are encoded closely in the shared semantic space while inputs with different semantics are kept apart. Experiments on the MuST-C datasets on all 8 languages and further analysis show that our method can effectively improve the E2E-ST performance and achieves an average BLEU of 29.0.

pdf bib
GUTS at SemEval-2022 Task 4: Adversarial Training and Balancing Methods for Patronizing and Condescending Language Detection
Junyu Lu | Hao Zhang | Tongyue Zhang | Hongbo Wang | Haohao Zhu | Bo Xu | Hongfei Lin
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

Patronizing and Condescending Language (PCL) towards vulnerable communities in general media has been shown to have potentially harmful effects. Due to its subtlety and the good intentions behind its use, the audience is not aware of the language’s toxicity. In this paper, we present our method for the SemEval-2022 Task4 titled “Patronizing and Condescending Language Detection”. In Subtask A, a binary classification task, we introduce adversarial training based on Fast Gradient Method (FGM) and employ pre-trained model in a unified architecture. For Subtask B, framed as a multi-label classification problem, we utilize various improved multi-label cross-entropy loss functions and analyze the performance of our method. In the final evaluation, our system achieved official rankings of 17/79 and 16/49 on Subtask A and Subtask B, respectively. In addition, we explore the relationship between PCL and emotional polarity and intensity it contains.

2021

pdf bib
COSY: COunterfactual SYntax for Cross-Lingual Understanding
Sicheng Yu | Hao Zhang | Yulei Niu | Qianru Sun | Jing Jiang
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Pre-trained multilingual language models, e.g., multilingual-BERT, are widely used in cross-lingual tasks, yielding the state-of-the-art performance. However, such models suffer from a large performance gap between source and target languages, especially in the zero-shot setting, where the models are fine-tuned only on English but tested on other languages for the same task. We tackle this issue by incorporating language-agnostic information, specifically, universal syntax such as dependency relations and POS tags, into language models, based on the observation that universal syntax is transferable across different languages. Our approach, called COunterfactual SYntax (COSY), includes the design of SYntax-aware networks as well as a COunterfactual training method to implicitly force the networks to learn not only the semantics but also the syntax. To evaluate COSY, we conduct cross-lingual experiments on natural language inference and question answering using mBERT and XLM-R as network backbones. Our results show that COSY achieves the state-of-the-art performance for both tasks, without using auxiliary training data.

pdf bib
EnsLM: Ensemble Language Model for Data Diversity by Semantic Clustering
Zhibin Duan | Hao Zhang | Chaojie Wang | Zhengjue Wang | Bo Chen | Mingyuan Zhou
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Natural language processing (NLP) often faces the problem of data diversity such as different domains, themes, styles, and so on. Therefore, a single language model (LM) is insufficient to learn all knowledge from diverse samples. To solve this problem, we firstly propose an autoencoding topic model with a mixture prior (mATM) to perform clustering for the data, where the clusters defined in semantic space describes the data diversity. Having obtained the clustering assignment for each sample, we develop the ensemble LM (EnsLM) with the technique of weight modulation. Specifically, EnsLM contains a backbone that is adjusted by a few modulated weights to fit for different sample clusters. As a result, the backbone learns the shared knowledge among all clusters while modulated weights extract the cluster-specific features. EnsLM can be trained jointly with mATM with a flexible LM backbone. We evaluate the effectiveness of both mATM and EnsLM on various tasks.

pdf bib
PhotoChat: A Human-Human Dialogue Dataset With Photo Sharing Behavior For Joint Image-Text Modeling
Xiaoxue Zang | Lijuan Liu | Maria Wang | Yang Song | Hao Zhang | Jindong Chen
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

We present a new human-human dialogue dataset - PhotoChat, the first dataset that casts light on the photo sharing behavior in online messaging. PhotoChat contains 12k dialogues, each of which is paired with a user photo that is shared during the conversation. Based on this dataset, we propose two tasks to facilitate research on image-text modeling: a photo-sharing intent prediction task that predicts whether one intends to share a photo in the next conversation turn, and a photo retrieval task that retrieves the most relevant photo according to the dialogue context. In addition, for both tasks, we provide baseline models using the state-of-the-art models and report their benchmark performances. The best image retrieval model achieves 10.4% recall@1 (out of 1000 candidates) and the best photo intent prediction model achieves 58.1% F1 score, indicating that the dataset presents interesting yet challenging real-world problems. We are releasing PhotoChat to facilitate future research work among the community.

pdf bib
Parallel Attention Network with Sequence Matching for Video Grounding
Hao Zhang | Aixin Sun | Wei Jing | Liangli Zhen | Joey Tianyi Zhou | Siow Mong Rick Goh
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2020

pdf bib
Span-based Localizing Network for Natural Language Video Localization
Hao Zhang | Aixin Sun | Wei Jing | Joey Tianyi Zhou
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Given an untrimmed video and a text query, natural language video localization (NLVL) is to locate a matching span from the video that semantically corresponds to the query. Existing solutions formulate NLVL either as a ranking task and apply multimodal matching architecture, or as a regression task to directly regress the target video span. In this work, we address NLVL task with a span-based QA approach by treating the input video as text passage. We propose a video span localizing network (VSLNet), on top of the standard span-based QA framework, to address NLVL. The proposed VSLNet tackles the differences between NLVL and span-based QA through a simple and yet effective query-guided highlighting (QGH) strategy. The QGH guides VSLNet to search for matching video span within a highlighted region. Through extensive experiments on three benchmark datasets, we show that the proposed VSLNet outperforms the state-of-the-art methods; and adopting span-based QA framework is a promising direction to solve NLVL.

pdf bib
Multi-source Meta Transfer for Low Resource Multiple-Choice Question Answering
Ming Yan | Hao Zhang | Di Jin | Joey Tianyi Zhou
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Multiple-choice question answering (MCQA) is one of the most challenging tasks in machine reading comprehension since it requires more advanced reading comprehension skills such as logical reasoning, summarization, and arithmetic operations. Unfortunately, most existing MCQA datasets are small in size, which increases the difficulty of model learning and generalization. To address this challenge, we propose a multi-source meta transfer (MMT) for low-resource MCQA. In this framework, we first extend meta learning by incorporating multiple training sources to learn a generalized feature representation across domains. To bridge the distribution gap between training sources and the target, we further introduce the meta transfer that can be integrated into the multi-source meta training. More importantly, the proposed MMT is independent of backbone language models. Extensive experiments demonstrate the superiority of MMT over state-of-the-arts, and continuous improvements can be achieved on different backbone networks on both supervised and unsupervised domain adaptation settings.

pdf bib
Friendly Topic Assistant for Transformer Based Abstractive Summarization
Zhengjue Wang | Zhibin Duan | Hao Zhang | Chaojie Wang | Long Tian | Bo Chen | Mingyuan Zhou
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Abstractive document summarization is a comprehensive task including document understanding and summary generation, in which area Transformer-based models have achieved the state-of-the-art performance. Compared with Transformers, topic models are better at learning explicit document semantics, and hence could be integrated into Transformers to further boost their performance. To this end, we rearrange and explore the semantics learned by a topic model, and then propose a topic assistant (TA) including three modules. TA is compatible with various Transformer-based models and user-friendly since i) TA is a plug-and-play model that does not break any structure of the original Transformer network, making users easily fine-tune Transformer+TA based on a well pre-trained model; ii) TA only introduces a small number of extra parameters. Experimental results on three datasets demonstrate that TA is able to improve the performance of several Transformer-based models.

2019

pdf bib
Dual Adversarial Neural Transfer for Low-Resource Named Entity Recognition
Joey Tianyi Zhou | Hao Zhang | Di Jin | Hongyuan Zhu | Meng Fang | Rick Siow Mong Goh | Kenneth Kwok
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We propose a new neural transfer method termed Dual Adversarial Transfer Network (DATNet) for addressing low-resource Named Entity Recognition (NER). Specifically, two variants of DATNet, i.e., DATNet-F and DATNet-P, are investigated to explore effective feature fusion between high and low resource. To address the noisy and imbalanced training data, we propose a novel Generalized Resource-Adversarial Discriminator (GRAD). Additionally, adversarial training is adopted to boost model generalization. In experiments, we examine the effects of different components in DATNet across domains and languages and show that significant improvement can be obtained especially for low-resource data, without augmenting any additional hand-crafted features and pre-trained language model.

2018

pdf bib
WordNet Troponymy and Extraction of “Manner-Result” Relations
Aliaksandr Huminski | Hao Zhang
Proceedings of the 9th Global Wordnet Conference

Commonsense knowledge bases need to have relations that allow to predict the consequences of specific actions (say, if John stabbed Peter, Peter might be killed) and to unfold the possible actions for the specific results (Peter was killed. It could happen because of poisoning, stabbing, shooting, etc.) This kind of causal relations are established between manner verbs and result verbs: manner-result relations. We offer a procedure on how to extract manner-result relations from WordNet through the analysis of the troponym glosses. The procedure of extraction includes three steps and the results are based on the analysis of the whole set of verbs in WordNet.

pdf bib
UKP-Athene: Multi-Sentence Textual Entailment for Claim Verification
Andreas Hanselowski | Hao Zhang | Zile Li | Daniil Sorokin | Benjamin Schiller | Claudia Schulz | Iryna Gurevych
Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)

The Fact Extraction and VERification (FEVER) shared task was launched to support the development of systems able to verify claims by extracting supporting or refuting facts from raw text. The shared task organizers provide a large-scale dataset for the consecutive steps involved in claim verification, in particular, document retrieval, fact extraction, and claim classification. In this paper, we present our claim verification pipeline approach, which, according to the preliminary results, scored third in the shared task, out of 23 competing systems. For the document retrieval, we implemented a new entity linking approach. In order to be able to rank candidate facts and classify a claim on the basis of several selected facts, we introduce two extensions to the Enhanced LSTM (ESIM).

2016

pdf bib
Learning Concept Taxonomies from Multi-modal Data
Hao Zhang | Zhiting Hu | Yuntian Deng | Mrinmaya Sachan | Zhicheng Yan | Eric Xing
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2015

pdf bib
KWB: An Automated Quick News System for Chinese Readers
Yiqi Bai | Wenjing Yang | Hao Zhang | Jingwen Wang | Ming Jia | Roland Tong | Jie Wang
Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing

2012

pdf bib
NiuTrans: An Open Source Toolkit for Phrase-based and Syntax-based Machine Translation
Tong Xiao | Jingbo Zhu | Hao Zhang | Qiang Li
Proceedings of the ACL 2012 System Demonstrations

2011

pdf bib
Document-level Consistency Verification in Machine Translation
Tong Xiao | Jingbo Zhu | Shujie Yao | Hao Zhang
Proceedings of Machine Translation Summit XIII: Papers

2010

pdf bib
An Empirical Study of Translation Rule Extraction with Multiple Parsers
Tong Xiao | Jingbo Zhu | Hao Zhang | Muhua Zhu
Coling 2010: Posters

pdf bib
NEUNLPLab Chinese Word Sense Induction System for SIGHAN Bakeoff 2010
Hao Zhang | Tong Xiao | Jingbo Zhu
CIPS-SIGHAN Joint Conference on Chinese Language Processing

Search
Co-authors
Fix author