Weidong Zhang
2026
PretrainRL: Alleviating Factuality Hallucination of Large Language Models at the Beginning
Langming Liu | Kangtao Lv | Haibin Chen | Weidong Zhang | Yejing Wang | Shilei Liu | Xin Tong | Yujin Yuan | Yongwei Wang | Wenbo Su | Bo Zheng
Findings of the Association for Computational Linguistics: ACL 2026
Large language models (LLMs), despite their powerful capabilities, suffer from factual hallucinations where they generate verifiable falsehoods. We identify a root of this issue: the imbalanced data distribution in the pretraining corpus, which leads to a state of "low-probability truth" and "high-probability falsehood". Recent approaches, such as teaching models to say "I don’t know" or post-hoc knowledge editing, either evade the problem or face catastrophic forgetting. To address this issue from its root, we propose PretrainRL, a novel framework that integrates reinforcement learning into the pretraining phase to consolidate factual knowledge. The core principle of PretrainRL is "debiasing then learning." It actively reshapes the model’s probability distribution by down-weighting high-probability falsehoods, thereby making "room" for low-probability truths to be learned effectively. To enable this, we design an efficient negative sampling strategy to discover these high-probability falsehoods and introduce novel metrics to evaluate the model’s probabilistic state concerning factual knowledge. Extensive experiments on three public benchmarks demonstrate that PretrainRL significantly alleviates factual hallucinations and outperforms state-of-the-art methods.
SELECting over Tokens: Curating Pre-training Data at Scale via Token Classification
Xin Tong | Weidong Zhang | Jiaang Li | Haibin Chen | Shilei Liu | Langming Liu | Kangtao Lv | Yujin Yuan | Wenbo Su | Bo Zheng
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The quality of pre-training data critically impacts the capabilities of large language models. Existing pipelines rely on expert-crafted heuristic rules, which primarily operate at the sample level and are based on coarse statistical indicators, thus lacking content-aware, fine-grained noise detection. While recent generative approaches, e.g., ProX-C, enable token-level refinement, their reliance on synthesizing Python code incurs prohibitive computational cost at scale and can introduce hallucinations into the refined data. To overcome these limitations, we propose Selecting over Tokens (SelecT), a novel framework that reframes data refinement as a highly efficient token classification task. SelecT classifies each token as either informative or noisy and subsequently removes the latter. This design achieves fine-grained data optimization while avoiding the inefficiency of generation, ensuring scalability. When evaluated on diverse downstream benchmarks, the model trained on SelecT-refined corpora, on average, outperforms the one trained on raw data by over 2% and exceeds the best heuristic baselines by more than 1% while preserving 17% more tokens than the latter. Furthermore, SelecT achieves higher average performance than the generative ProX-C across all experimental settings, and is 2.5x faster at inference, even with twice the parameters. Our results establish SelecT as an effective, efficient, and scalable solution for pre-training data optimization.
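The classify-then-remove step described in this abstract can be illustrated with a minimal sketch. It assumes a token classifier has already produced a per-token noise probability; the function name `refine_tokens`, the `noise_probs` input, and the 0.5 threshold are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def refine_tokens(
    tokens: Sequence[str],
    noise_probs: Sequence[float],
    threshold: float = 0.5,  # assumed decision boundary, not from the paper
) -> List[str]:
    """Keep tokens whose predicted noise probability is below the threshold."""
    if len(tokens) != len(noise_probs):
        raise ValueError("tokens and noise_probs must have the same length")
    # Tokens at or above the threshold are treated as noisy and dropped.
    return [tok for tok, p in zip(tokens, noise_probs) if p < threshold]


# Toy input: two boilerplate tokens followed by two content tokens.
kept = refine_tokens(["Click", "here", "The", "cat"], [0.9, 0.8, 0.1, 0.2])
```

Because each token receives a single label in one forward pass, this kind of filter avoids generating any new text, which is the efficiency argument the abstract makes against generative refinement.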
CoMeT: Collaborative Memory Transformer for Efficient Long Context Modeling
Runsong Zhao | Shilei Liu | Jiwei Tang | Langming Liu | Haibin Chen | Weidong Zhang | Yujin Yuan | Tong Xiao | JingBo Zhu | Wenbo Su | Bo Zheng
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The quadratic complexity and indefinitely growing key-value (KV) cache of standard Transformers pose a major barrier to long-context processing. To overcome this, we introduce the **Co**llaborative **Me**mory **T**ransformer (CoMeT), a novel architecture that enables LLMs to handle arbitrarily long sequences with constant memory usage and linear time complexity. Designed as an efficient, plug-in module, CoMeT can be integrated into pre-trained models with only minimal fine-tuning. It operates on sequential data chunks, using a dual-memory system to manage context: a temporary memory on a FIFO queue for recent events, and a global memory with a gated update rule for long-range dependencies. These memories then act as a dynamic soft prompt for the next chunk. The effectiveness of our approach is remarkable: a model equipped with CoMeT and fine-tuned on 32k contexts can accurately retrieve a passkey from any position within a 1M token sequence. On the SCROLLS benchmark, CoMeT surpasses other efficient methods and achieves performance comparable to a full-attention baseline on summarization tasks. Its practical effectiveness is further validated on real-world agent and user behavior QA tasks, supported by a novel layer-level pipeline parallel training strategy that enables fine-tuning on extremely long contexts. The code is available at: https://github.com/LivingFutureLab/Comet
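The dual-memory idea in this abstract can be sketched as follows. This is an illustrative toy under stated assumptions, not the released implementation: the class `DualMemory`, the fixed scalar `gate`, and the use of plain float lists in place of learned tensors are all hypothetical, and the paper's actual gated update rule may differ.

```python
from collections import deque
from typing import Deque, List


class DualMemory:
    """Toy dual memory: a bounded FIFO queue plus a gated running summary."""

    def __init__(self, dim: int, temp_slots: int, gate: float = 0.9) -> None:
        # Temporary memory: the oldest entry is evicted once the queue is full.
        self.temporary: Deque[List[float]] = deque(maxlen=temp_slots)
        # Global memory: a fixed-size vector updated at every chunk.
        self.global_memory: List[float] = [0.0] * dim
        self.gate = gate  # assumed constant; a real gate would be learned

    def update(self, chunk_summary: List[float]) -> None:
        """Store the new chunk summary and blend it into the global memory."""
        self.temporary.append(list(chunk_summary))
        g = self.gate
        self.global_memory = [
            g * old + (1.0 - g) * new
            for old, new in zip(self.global_memory, chunk_summary)
        ]

    def soft_prompt(self) -> List[List[float]]:
        """Return the memory vectors that would be prepended to the next chunk."""
        return [self.global_memory] + list(self.temporary)


mem = DualMemory(dim=2, temp_slots=2, gate=0.5)
for summary in ([1.0, 1.0], [2.0, 2.0], [3.0, 3.0]):
    mem.update(summary)
```

However many chunks are processed, the soft prompt never exceeds `temp_slots + 1` vectors, which is what keeps memory usage constant in this sketch.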
2024
Pause-Aware Automatic Dubbing using LLM and Voice Cloning
Yuang Li | Jiaxin Guo | Min Zhang | Ma Miaomiao | Zhiqiang Rao | Weidong Zhang | Xianghui He | Daimeng Wei | Hao Yang
Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024)
Automatic dubbing aims to translate the speech of a video into another language, ensuring the new speech naturally fits the original video. This paper details Huawei Translation Services Center’s (HW-TSC) submission for IWSLT 2024’s automatic dubbing task, under an unconstrained setting. Our system’s machine translation (MT) component utilizes a Transformer-based MT model and an LLM-based post-editor to produce translations of varying lengths. The text-to-speech (TTS) component employs a VITS-based TTS model and a voice cloning module to emulate the original speaker’s vocal timbre. For enhanced dubbing synchrony, we introduce a parsing-informed pause selector. Finally, we rerank multiple results based on lip-sync error distance (LSE-D) and character error rate (CER). Our system achieves LSE-D of 10.75 and 12.19 on subset1 and subset2 of DE-EN test sets respectively, superior to last year’s best system.
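The final reranking step can be sketched as below. The abstract does not say how LSE-D and CER are combined, so the weighted sum, the `cer_weight` value, and the `Candidate` type are assumptions made only for illustration; lower is better for both metrics.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    text: str
    lse_d: float  # lip-sync error distance, lower is better
    cer: float    # character error rate in [0, 1], lower is better


def rerank(candidates: List[Candidate], cer_weight: float = 10.0) -> List[Candidate]:
    """Sort candidates by an assumed combined score: lse_d + cer_weight * cer."""
    return sorted(candidates, key=lambda c: c.lse_d + cer_weight * c.cer)


ranked = rerank([
    Candidate("a", lse_d=10.5, cer=0.20),
    Candidate("b", lse_d=10.9, cer=0.02),
    Candidate("c", lse_d=12.0, cer=0.00),
])
best = ranked[0]
```

In this toy example, candidate "a" has the best lip-sync score but loses to "b" once its higher character error rate is taken into account.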
LatEval: An Interactive LLMs Evaluation Benchmark with Incomplete Information from Lateral Thinking Puzzles
Shulin Huang | Shirong Ma | Yinghui Li | Mengzuo Huang | Wuhe Zou | Weidong Zhang | Haitao Zheng
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
As LLMs have evolved, they have acquired impressive logical reasoning, or vertical thinking, capabilities. But can they think out of the box? Do they possess proficient lateral thinking abilities? Following the setup of Lateral Thinking Puzzles, we propose a novel evaluation benchmark, LatEval, which assesses the model’s lateral thinking within an interactive framework. In our benchmark, we challenge LLMs on two aspects: (1) posing high-quality questions that break out of conventional norms but are beneficial for puzzle-solving, and (2) integrating existing information to gradually deduce the truth through reasoning. We observe that it is hard for most LLMs to accomplish lateral thinking during interactions. Even the most powerful LLM, GPT-4, faces challenges in achieving satisfactory performance, and for most open-source models, simply completing this task is quite difficult. This evaluation benchmark provides LLMs with a highly challenging and differentiating task that is crucial to an effective AI assistant. Our dataset and source code are available at https://github.com/THUKElab/LatEval.
2023
KG-IQES: An Interpretable Quality Estimation System for Machine Translation Based on Knowledge Graph
Junhao Zhu | Min Zhang | Hao Yang | Song Peng | Zhanglin Wu | Yanfei Jiang | Xijun Qiu | Weiqiang Pan | Ming Zhu | Ma Miaomiao | Weidong Zhang
Proceedings of Machine Translation Summit XIX, Vol. 2: Users Track
The widespread use of machine translation (MT) has driven the need for effective automatic quality estimation (AQE) methods. How to enhance the interpretability of MT output quality estimation is well worth exploring in the industry. From the perspective of the alignment of named entities (NEs) in the source and translated sentences, we construct a multilingual knowledge graph (KG) consisting of domain-specific NEs, and design a KG-based interpretable quality estimation (QE) system for machine translations (KG-IQES). KG-IQES effectively estimates the translation quality without relying on reference translations. Its effectiveness has been verified in our business scenarios.
2022
A Token-pair Framework for Information Extraction from Dialog Transcripts in SereTOD Challenge
Chenyue Wang | Xiangxing Kong | Mengzuo Huang | Feng Li | Jian Xing | Weidong Zhang | Wuhe Zou
Proceedings of the Towards Semi-Supervised and Reinforced Task-Oriented Dialog Systems (SereTOD)
This paper describes our solution for SereTOD Challenge Track 1: Information extraction from dialog transcripts. We propose a token-pair framework to simultaneously identify entity and value mentions and link them into corresponding triples. As entity mentions are usually coreferent, we adopt a baseline model for coreference resolution. We exploit both annotated transcripts and unsupervised dialogs for training. With model ensemble and post-processing strategies, our system significantly outperforms the baseline solution and ranks first in triple F1 and third in entity F1.
Co-authors
- Haibin Chen 3
- Langming Liu 3
- Shilei Liu 3
- Wenbo Su 3
- Yujin Yuan 3
- Bo Zheng 3
- Mengzuo Huang 2
- Kangtao Lv 2
- Ma Miaomiao 2
- Xin Tong 2
- Hao Yang 2
- Min Zhang 2
- Wuhe Zou 2
- Jiaxin Guo 1
- Xianghui He 1
- Shulin Huang 1
- Yanfei Jiang 1
- Xiangxing Kong 1
- Jiaang Li 1
- Feng Li 1
- Yuang Li 1
- Yinghui Li 1
- Shirong Ma 1
- Weiqiang Pan 1
- Song Peng 1
- Xijun Qiu 1
- Zhiqiang Rao 1
- Jiwei Tang 1
- Yejing Wang 1
- Yongwei Wang 1
- Chenyue Wang 1
- Daimeng Wei 1
- Zhanglin Wu 1
- Tong Xiao (肖桐) 1
- Jian Xing 1
- Runsong Zhao 1
- Hai-Tao Zheng 1
- JingBo Zhu (朱靖波) 1
- Junhao Zhu 1
- Ming Zhu 1