Yudong Li
2026
KoCo: Conditioning Language Model Pre-training on Knowledge Coordinates
Yudong Li | Jiawei Cai | Linlin Shen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Standard Large Language Model (LLM) pre-training typically treats corpora as flattened token sequences, often overlooking the real-world context that humans naturally rely on to contextualize information. To bridge this gap, we introduce Knowledge Coordinate Conditioning (KoCo), a simple method that maps every document into a three-dimensional semantic coordinate. By prepending these coordinates as textual prefixes for pre-training, we aim to equip the model with explicit contextual awareness so that it learns each document within the real-world knowledge structure. Experimental results demonstrate that KoCo significantly enhances performance across 10 downstream tasks and accelerates pre-training convergence by approximately 30%. Furthermore, our analysis indicates that explicitly modeling knowledge coordinates helps the model distinguish stable facts from noise, effectively mitigating hallucination in generated outputs.
MTAVG-Bench: A Diagnostic Benchmark for Multi-Talker Dialogue-Centric Audio-Video Generation
Yanghao Zhou | Haitian Li | Rexar Lin | Heyan Huang | Jinxing Zhou | Changsen Yuan | Tian Lan | Ziqin Zhou | Yudong Li | Jiajun Xu | Jingyun Liao | YiMing Cheng | Xuefeng Chen | Xian-Ling Mao | Yousheng Feng
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent advances in text-to-audio-video (T2AV) generation have enabled models to synthesize audio-visual videos with multi-participant dialogues. However, existing evaluation benchmarks remain largely designed for human-recorded videos or single-speaker settings. As a result, structural failures in generated multi-talker dialogue videos, such as identity drift, unnatural turn transitions, and audio-visual misalignment, cannot be effectively diagnosed. To address this issue, we introduce MTAVG-Bench, a failure-driven diagnostic benchmark for multi-talker dialogue-centric audio-video generation. MTAVG-Bench is built via a semi-automatic pipeline, where 1.8k videos are generated using mainstream T2AV models with carefully designed prompts, yielding 2.4k manually annotated QA pairs for fine-grained failure diagnosis. The benchmark evaluates multi-speaker dialogue generation at four levels: audio-visual signal fidelity, temporal attribute consistency, social interaction, and cinematic expression. Built on a hierarchical failure taxonomy and a targeted QA protocol, MTAVG-Bench is primarily designed to evaluate whether proprietary and open-source omni-models can reliably identify failure modes in multi-speaker T2AV outputs. We benchmark 12 proprietary and open-source omni-models on MTAVG-Bench, with Gemini 3 Pro achieving the strongest overall performance, while leading open-source models remain competitive in signal fidelity and consistency. Overall, MTAVG-Bench enables fine-grained failure analysis for rigorous model comparison and targeted video generation refinement.
2023
TencentPretrain: A Scalable and Flexible Toolkit for Pre-training Models of Different Modalities
Zhe Zhao | Yudong Li | Cheng Hou | Jing Zhao | Rong Tian | Weijie Liu | Yiren Chen | Ningyuan Sun | Haoyan Liu | Weiquan Mao | Han Guo | Weigang Gou | Taiqiang Wu | Tao Zhu | Wenhang Shi | Chen Chen | Shan Huang | Sihong Chen | Liqun Liu | Feifei Li | Xiaoshuai Chen | Xingwu Sun | Zhanhui Kang | Xiaoyong Du | Linlin Shen | Kimmo Yan
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Recently, the success of pre-training in the text domain has been fully extended to vision, audio, and cross-modal scenarios. The proposed pre-training models of different modalities are showing a rising trend of homogeneity in their model structures, which brings the opportunity to implement different pre-training models within a uniform framework. In this paper, we present TencentPretrain, a toolkit supporting pre-training models of different modalities. The core feature of TencentPretrain is the modular design. The toolkit uniformly divides pre-training models into 5 components: embedding, encoder, target embedding, decoder, and target. As almost all common modules are provided in each component, users can choose the desired modules from different components to build a complete pre-training model. The modular design enables users to efficiently reproduce existing pre-training models or build brand-new ones. We test the toolkit on text, vision, and audio benchmarks and show that it can match the performance of the original implementations.
2022
CSL: A Large-scale Chinese Scientific Literature Dataset
Yudong Li | Yuqing Zhang | Zhe Zhao | Linlin Shen | Weijie Liu | Weiquan Mao | Hui Zhang
Proceedings of the 29th International Conference on Computational Linguistics
Scientific literature serves as a high-quality corpus, supporting a lot of Natural Language Processing (NLP) research. However, existing datasets are centered around the English language, which restricts the development of Chinese scientific NLP. In this work, we present CSL, a large-scale Chinese Scientific Literature dataset, which contains the titles, abstracts, keywords and academic fields of 396k papers. To our knowledge, CSL is the first scientific document dataset in Chinese. The CSL can serve as a Chinese corpus. Also, this semi-structured data is a natural annotation that can constitute many supervised NLP tasks. Based on CSL, we present a benchmark to evaluate the performance of models across scientific domain tasks, i.e., summarization, keyword generation and text classification. We analyze the behavior of existing text-to-text models on the evaluation tasks and reveal the challenges for Chinese scientific NLP tasks, which provides a valuable reference for future research. Data and code will be publicly available.
Multi-stage Distillation Framework for Cross-Lingual Semantic Similarity Matching
Kunbo Ding | Weijie Liu | Yuejian Fang | Zhe Zhao | Qi Ju | Xuefeng Yang | Rong Tian | Zhu Tao | Haoyan Liu | Han Guo | Xingyu Bai | Weiquan Mao | Yudong Li | Weigang Guo | Taiqiang Wu | Ningyuan Sun
Findings of the Association for Computational Linguistics: NAACL 2022
Previous studies have shown that cross-lingual knowledge distillation can significantly improve the performance of pre-trained models for cross-lingual similarity matching tasks. However, the student model needs to be large in this operation. Otherwise, its performance will drop sharply, making it impractical to deploy on memory-limited devices. To address this issue, we delve into cross-lingual knowledge distillation and propose a multi-stage distillation framework for constructing a small-size but high-performance cross-lingual model. In our framework, contrastive learning, bottleneck, and parameter recurrent strategies are delicately combined to prevent performance from being compromised during the compression process. The experimental results demonstrate that our method can compress the size of XLM-R and MiniLM by more than 50%, while the performance is only reduced by about 1%.
Parameter-efficient Continual Learning Framework in Industrial Real-time Text Classification System
Tao Zhu | Zhe Zhao | Weijie Liu | Jiachi Liu | Yiren Chen | Weiquan Mao | Haoyan Liu | Kunbo Ding | Yudong Li | Xuefeng Yang
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Track
Catastrophic forgetting is a challenge for model deployment in industrial real-time systems, which requires the model to quickly master a new task without forgetting the old one. Continual learning aims to solve this problem; however, it usually updates all the model parameters, resulting in extensive training times and the inability to deploy quickly. To address this challenge, we propose a parameter-efficient continual learning framework, in which efficient parameters are selected through an offline parameter selection strategy and then trained using an online regularization method. In our framework, only a few parameters need to be updated, which not only alleviates catastrophic forgetting, but also allows the model to be saved with the changed parameters instead of all parameters. Extensive experiments are conducted to examine the effectiveness of our proposal. We believe this paper will provide useful insights and experiences on developing deep learning-based online real-time systems.
2020
CLUE: A Chinese Language Understanding Evaluation Benchmark
Liang Xu | Hai Hu | Xuanwei Zhang | Lu Li | Chenjie Cao | Yudong Li | Yechen Xu | Kai Sun | Dian Yu | Cong Yu | Yin Tian | Qianqian Dong | Weitang Liu | Bo Shi | Yiming Cui | Junyi Li | Jun Zeng | Rongzhao Wang | Weijian Xie | Yanting Li | Yina Patterson | Zuoyu Tian | Yiwen Zhang | He Zhou | Shaoweihua Liu | Zhe Zhao | Qipeng Zhao | Cong Yue | Xinrui Zhang | Zhengliang Yang | Kyle Richardson | Zhenzhong Lan
Proceedings of the 28th International Conference on Computational Linguistics
The advent of natural language understanding (NLU) benchmarks for English, such as GLUE and SuperGLUE, allows new NLU models to be evaluated across a diverse set of tasks. These comprehensive benchmarks have facilitated a broad range of research and applications in natural language processing (NLP). The problem, however, is that most such benchmarks are limited to English, which has made it difficult to replicate many of the successes in English NLU for other languages. To help remedy this issue, we introduce the first large-scale Chinese Language Understanding Evaluation (CLUE) benchmark. CLUE is an open-ended, community-driven project that brings together 9 tasks spanning several well-established single-sentence/sentence-pair classification tasks, as well as machine reading comprehension, all on original Chinese text. To establish results on these tasks, we report scores using an exhaustive set of current state-of-the-art pre-trained Chinese models (9 in total). We also introduce a number of supplementary datasets and additional tools to help facilitate further progress on Chinese NLU. Our benchmark is released at https://www.cluebenchmarks.com
Co-authors
- Zhe Zhao 5
- Weijie Liu 4
- Weiquan Mao 4
- Haoyan Liu 3
- Linlin Shen 3
- Yiren Chen 2
- Kunbo Ding 2
- Han Guo 2
- Ningyuan Sun 2
- Rong Tian 2
- Taiqiang Wu 2
- Xuefeng Yang 2
- Tao Zhu 2
- Xingyu Bai 1
- Jiawei Cai 1
- Chenjie Cao 1
- Chen Chen 1
- Sihong Chen 1
- Xiaoshuai Chen 1
- Xuefeng Chen 1
- Yiming Cheng 1
- Yiming Cui 1
- Qianqian Dong 1
- Xiaoyong Du 1
- Yuejian Fang 1
- Yousheng Feng 1
- Weigang Gou 1
- Weigang Guo 1
- Cheng Hou 1
- Hai Hu 1
- Shan Huang 1
- He-Yan Huang (黄河燕) 1
- Qi Ju 1
- Zhanhui Kang 1
- Zhenzhong Lan 1
- Tian Lan 1
- Feifei Li 1
- Lu Li 1
- Junyi Li 1
- Yanting Li 1
- Haitian Li 1
- Jingyun Liao 1
- Rexar Lin 1
- Liqun Liu 1
- Weitang Liu 1
- Shaoweihua Liu 1
- Jiachi Liu 1
- Xian-Ling Mao 1
- Yina Patterson 1
- Kyle Richardson 1
- Wenhang Shi 1
- Bo Shi 1
- Xingwu Sun 1
- Kai Sun 1
- Zhu Tao 1
- Yin Tian 1
- Zuoyu Tian 1
- Rongzhao Wang 1
- Weijian Xie 1
- Liang Xu 1
- Yechen Xu 1
- Jiajun Xu 1
- Kimmo Yan 1
- Zhengliang Yang 1
- Dian Yu 1
- Cong Yu 1
- Changsen Yuan 1
- Cong Yue 1
- Jun Zeng 1
- Yuqing Zhang 1
- Hui Zhang (张晖) 1
- Xuanwei Zhang 1
- Yiwen Zhang 1
- Xinrui Zhang 1
- Jing Zhao 1
- Qipeng Zhao 1
- He Zhou 1
- Yanghao Zhou (周杨浩) 1
- Jinxing Zhou 1
- Ziqin Zhou 1