Chao Xu
2026
Guided by the Plan: Enhancing Faithful Autoregressive Text-to-Audio Generation with Guided Decoding
Juncheng Wang | Zhe Hu | Chao Xu | Siyue Ren | Yuxiang Feng | Yang Liu | Baigui Sun | Shujun Wang
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Autoregressive (AR) models excel at generating temporally coherent audio by producing tokens sequentially, yet they often falter in faithfully following complex textual prompts, especially those describing complex sound events. We uncover a surprising capability in AR audio generators: their early prefix tokens implicitly encode global semantic attributes of the final output, such as event count and sound-object category, revealing a form of implicit planning. Building on this insight, we propose Plan-Critic, a lightweight auxiliary model trained with a Generalized Advantage Estimation (GAE)-inspired objective to predict final instruction-following quality from partial generations. At inference time, Plan-Critic enables guided exploration: it evaluates candidate prefixes early, prunes low-fidelity trajectories, and reallocates computation to high-potential planning seeds. Our Plan-Critic-guided sampling achieves up to a 10-point improvement in CLAP score over the AR baseline, establishing a new state of the art in AR text-to-audio generation, while maintaining computational parity with standard best-of-N decoding. This work bridges the gap between causal generation and global semantic alignment, demonstrating that even strictly autoregressive models can plan ahead.
MoEC: A Memory-Routed Mixture-of-Experts Controller for Adaptive Minecraft Control
Hui Wu | Chao Xu | Jianghui Wang | Ziqiong Liu | Dong Li | Yiwei Dai | Emad Barsoum
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Embodied agents in open-ended environments such as Minecraft increasingly adopt planner–controller architectures, with large language models acting as high-level planners. While planning has advanced rapidly, control remains underexplored. Existing systems commonly rely on a monolithic policy to execute subgoals across varying contexts, forcing incompatible behaviors into a shared parameter space and causing interference that scaling only partially mitigates. To address this, we propose MoEC, a Memory-Routed Mixture-of-Experts Controller for Adaptive Minecraft Control. MoEC routes via a subgoal-indexed, non-parametric expert memory and regulates capacity through failure-triggered expert growth and redundancy-aware consolidation. This design enables continual adaptation without full retraining, while maintaining parameter efficiency and bounded inference cost. We evaluate MoEC on diverse and compositional Minecraft tasks, demonstrating significant gains in adaptability, robustness, and execution consistency over strong baselines, yielding a scalable and efficient alternative for open-ended control.
MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
Junbo Niu | Zheng Liu | Zhuangcheng Gu | Bin Wang | Linke Ouyang | Zhiyuan Zhao | Tao Chu | Tianyao He | Fan Wu | Qintong Zhang | Zhenjiang Jin | Guang Liang | Rui Zhang | Wenzheng Zhang | Yuan Qu | Zhifei Ren | Yuefeng Sun | Zirui Tang | Boyu Niu | Yuanhong Zheng | Dongsheng Ma | Ziyang Miao | Hejun Dong | Siyi Qian | Junyuan Zhang | Fangdong Wang | Jingzhou Chen | Xiaomeng Zhao | Liqun Wei | Wei Li | Shasha Wang | RuiLiang Xu | Yuanyuan Cao | Lu Chen | Qianqian Wu | Huaiyu Gu | Lindong Lu | Dechen Lin | Shenguanlin | Xuanhe Zhou | Linfeng Zhang | Yuhang Zang | Xiaoyi Dong | Jiaqi Wang | Bo Zhang | Lei Bai | Pei Chu | Weijia Li | Jiang Wu | Lijun Wu | Zhenxiang Li | Guangyu Wang | Zhongying Tu | Chao Xu | Kai Chen | Bowen Zhou | Dahua Lin | Wentao Zhang | Conghui He
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
We introduce MinerU2.5, a 1.2B-parameter document parsing vision-language model that achieves state-of-the-art recognition accuracy while maintaining exceptional computational efficiency. Our approach employs a coarse-to-fine, two-stage parsing strategy that decouples global layout analysis from local content recognition. In the first stage, the model performs efficient layout analysis on downsampled images to identify structural elements, circumventing the computational overhead of processing high-resolution inputs. In the second stage, guided by the global layout, it performs targeted content recognition on native-resolution crops extracted from the original image, preserving fine-grained details in dense text, complex formulas, and tables. To support this strategy, we developed a comprehensive data engine that generates diverse, large-scale training corpora for both pretraining and fine-tuning. Ultimately, MinerU2.5 demonstrates strong document parsing ability, achieving state-of-the-art performance on multiple benchmarks, surpassing both general-purpose and domain-specific models across various recognition tasks, while maintaining significantly lower computational overhead.
2025
Language Model Based Text-to-Audio Generation: Anti-Causally Aligned Collaborative Residual Transformers
Juncheng Wang | Chao Xu | Cheng Yu | Zhe Hu | Haoyu Xie | Guoqi Yu | Lei Shang | Shujun Wang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
While language models (LMs) paired with residual vector quantization (RVQ) tokenizers have shown promise in text-to-audio (T2A) generation, they still lag behind diffusion-based models by a non-trivial margin. We identify a critical dilemma underpinning this gap: incorporating more RVQ layers improves audio reconstruction fidelity but exceeds the generation capacity of conventional LMs. To address this, we first analyze RVQ dynamics and uncover two key limitations: 1) orthogonality of features across RVQ layers hinders effective LM training, and 2) descending semantic richness in tokens from deeper RVQ layers exacerbates exposure bias during autoregressive decoding. Based on these insights, we propose Siren, a novel LM-based framework that employs multiple isolated transformers with causal conditioning and anti-causal alignment via reinforcement learning. Extensive experiments demonstrate that Siren outperforms both existing LM-based and diffusion-based T2A systems, achieving state-of-the-art results. By bridging the representational strengths of LMs with the fidelity demands of audio synthesis, our approach repositions LMs as competitive contenders against diffusion models in T2A tasks. Moreover, by aligning audio representations with linguistic structures, Siren opens a promising pathway toward unified multi-modal generation frameworks.
Lemmatization of Cuneiform Languages Using the ByT5 Model
Pengxiu Lu | Yonglong Huang | Jing Xu | Minxuan Feng | Chao Xu
Proceedings of the Second Workshop on Ancient Language Processing
Lemmatization of cuneiform languages presents a unique challenge due to their complex writing system, which combines syllabic and logographic elements. In this study, we investigate the effectiveness of the ByT5 model in addressing this challenge by developing and evaluating a ByT5-based lemmatization system. Experimental results demonstrate that ByT5 outperforms mT5 in this task, achieving an accuracy of 80.55% on raw lemmas and 82.59% on generalized lemmas, where sense numbers are removed. These findings highlight the potential of ByT5 for lemmatizing cuneiform languages and provide useful insights for future work on ancient text lemmatization.
2024
Overview of EvaHan2024: The First International Evaluation on Ancient Chinese Sentence Segmentation and Punctuation
Bin Li | Bolin Chang | Zhixing Xu | Minxuan Feng | Chao Xu | Weiguang Qu | Si Shen | Dongbo Wang
Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024
Ancient Chinese texts have no sentence boundaries or punctuation. Adding modern Chinese punctuation to these texts requires expertise, time, and effort. Automatic sentence segmentation and punctuation is considered a basic task in Ancient Chinese processing, but there has been no shared task to evaluate the performance of different systems. This paper presents the results of the first ancient Chinese sentence segmentation and punctuation bakeoff, which was held at the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) 2024. The contest uses metrics for detailed evaluations of 4 genres of unpublished texts with 11 punctuation types. Six teams submitted 32 running results. In the closed modality, where participants are only allowed to use the training data, the highest F1 scores obtained are 88.47% and 75.29% in sentence segmentation and sentence punctuation, respectively. Performance on the unseen data is 10 percent lower than on the published common data, which means there is still room for further improvement. Large language models outperform traditional models, but they change around 1-2% of the original characters due to over-generation. Thus, post-processing is needed to maintain text consistency.
PSC: Extending Context Window of Large Language Models via Phase Shift Calibration
Wenqiao Zhu | Chao Xu | Lulu Wang | Jun Wu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Rotary Position Embedding (RoPE) is an efficient position encoding approach that is widely used in large language models (LLMs). Recently, many methods have been proposed to further expand the context window based on RoPE. The core concept of those methods is to predefine or search for a set of factors to rescale the base frequencies of RoPE. Nevertheless, it is quite a challenge for existing methods to predefine an optimal factor due to the exponential search space. In view of this, we introduce PSC (Phase Shift Calibration), a small module for calibrating the frequencies predefined by existing methods. With PSC, we demonstrate that many existing methods, such as PI, YaRN, and LongRoPE, can be further enhanced. We conducted extensive experiments across multiple models and tasks. The results demonstrate that (1) when PSC is enabled, the comparative reductions in perplexity increase as the context window size grows from 16k to 32k and up to 64k, and (2) our approach is broadly applicable and exhibits robustness across a variety of models and tasks.
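The frequency rescaling that the abstract attributes to existing context-extension methods can be illustrated with a minimal sketch. It assumes the usual RoPE definition, theta_i = base^(-2i/d), and a single uniform factor as in linear position interpolation. The function names, the head dimension, and the factor of 4 are illustrative choices, and the PSC calibration module itself is not shown.

```python
import math

def rope_frequencies(head_dim, base=10000.0):
    """Base RoPE frequencies theta_i = base^(-2i/d) for i in [0, d/2)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def rescale_frequencies(freqs, factors):
    """Divide each base frequency by its own scaling factor."""
    if len(freqs) != len(factors):
        raise ValueError("need exactly one factor per frequency")
    return [f / s for f, s in zip(freqs, factors)]

def rotation_angles(position, freqs):
    """Rotation angle applied to each 2-D feature pair at a given position."""
    return [position * f for f in freqs]

# Illustrative values (assumptions, not taken from the paper).
head_dim = 8
scale = 4.0  # e.g. stretching a 16k window to 64k

base_freqs = rope_frequencies(head_dim)
scaled_freqs = rescale_frequencies(base_freqs, [scale] * len(base_freqs))

# With a uniform factor, position 4000 under the rescaled frequencies is
# rotated exactly like position 1000 under the original frequencies.
long_angles = rotation_angles(4000, scaled_freqs)
short_angles = rotation_angles(1000, base_freqs)
```

Methods that search for factors would replace the uniform list with one factor per frequency, and the size of that search space is what the abstract describes as exponential.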
AuditWen: An Open-Source Large Language Model for Audit
Jiajia Huang | Haoran Zhu | Chao Xu | Tianming Zhan | Qianqian Xie | Jimin Huang
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference)
Intelligent auditing represents a crucial advancement in modern audit practices, enhancing both the quality and efficiency of audits within the realm of artificial intelligence. With the rise of large language models (LLMs), there is enormous potential for intelligent models to contribute to the audit domain. However, general LLMs applied in the audit domain face the challenges of lacking specialized knowledge and the presence of data biases. To overcome these challenges, this study introduces AuditWen, an open-source audit LLM obtained by fine-tuning Qwen on instruction data constructed from the audit domain. We first outline the application scenarios for LLMs in auditing and extract requirements that shape the development of LLMs tailored for audit purposes. We then propose an audit LLM, called AuditWen, by fine-tuning Qwen on a constructed 30k instruction dataset from 15 audit tasks and 3 layers. In the evaluation stage, we propose a benchmark with 5k instructions that covers a set of critical audit tasks derived from the application scenarios. With the benchmark, we compare AuditWen with other existing LLMs on information extraction, question answering, and document generation. The experimental results demonstrate the superior performance of AuditWen in both question understanding and answer generation, making it an immediately valuable tool for audit. Keywords: AuditWen, LLM, instruction dataset, fine-tuning, benchmark
2022
The First International Ancient Chinese Word Segmentation and POS Tagging Bakeoff: Overview of the EvaHan 2022 Evaluation Campaign
Bin Li | Yiguo Yuan | Jingya Lu | Minxuan Feng | Chao Xu | Weiguang Qu | Dongbo Wang
Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages
This paper presents the results of the First Ancient Chinese Word Segmentation and POS Tagging Bakeoff (EvaHan), which was held at the Second Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) 2022, in the context of the 13th Edition of the Language Resources and Evaluation Conference (LREC 2022). We give the motivation for having an international shared contest, as well as the data and tracks. The contest consists of two modalities, closed and open. In the closed modality, where participants are only allowed to use the training data, the highest F1 scores are 96.03% and 92.05% in word segmentation and POS tagging, respectively. In the open modality, where participants can use whatever resources they have, the highest F1 scores are 96.34% and 92.56% in word segmentation and POS tagging, respectively. The scores on the blind test dataset decrease by around 3 points, which shows that out-of-vocabulary words are still the bottleneck for lexical analyzers.
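The abstract reports word segmentation F1 without describing the scorer. As a rough illustration only, the sketch below uses the common span-based definition, in which a predicted word counts as correct when its character span matches a gold word exactly. This is an assumption and not the bakeoff's official evaluation script; the helper names and the placeholder strings are hypothetical.

```python
import math

def word_spans(words):
    """Convert a segmented sentence into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans

def segmentation_prf(gold_words, pred_words):
    """Span-based precision, recall, and F1 for one segmented sentence."""
    if "".join(gold_words) != "".join(pred_words):
        raise ValueError("gold and predicted segmentations cover different text")
    gold, pred = word_spans(gold_words), word_spans(pred_words)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

# Placeholder example: only the first word is segmented identically.
gold = ["AB", "C", "DE"]
pred = ["AB", "CD", "E"]
precision, recall, f1 = segmentation_prf(gold, pred)
```

In this example one of the three predicted words matches a gold span, so precision, recall, and F1 all equal 1/3.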
Drum Up SUPPORT: Systematic Analysis of Image-Schematic Conceptual Metaphors
Lennart Wachowiak | Dagmar Gromann | Chao Xu
Proceedings of the 3rd Workshop on Figurative Language Processing (FLP)
Conceptual metaphors represent a cognitive mechanism to transfer knowledge structures from one domain onto another. Image-schematic conceptual metaphors (ISCMs) specialize in transferring sensorimotor experiences to abstract domains. Natural language is believed to provide evidence of such metaphors. However, approaches to verify this hypothesis largely rely on top-down methods, gathering examples by way of introspection, or on manual corpus analyses. In order to contribute towards a method that is systematic and can be replicated, we propose to bring together existing processing steps in a pipeline to detect ISCMs, exemplified for the image schema SUPPORT in the COVID-19 domain. This pipeline consists of neural metaphor detection, dependency parsing to uncover construction patterns, clustering, and BERT-based frame annotation of dependent constructions to analyse ISCMs.
2020
A Cognitively Motivated Approach to Spatial Information Extraction
Chao Xu | Emmanuelle-Anna Dietz Saldanha | Dagmar Gromann | Beihai Zhou
Proceedings of the Third International Workshop on Spatial Language Understanding
Automatic extraction of spatial information from natural language can boost human-centered applications that rely on spatial dynamics. The field of cognitive linguistics has provided theories and cognitive models to address this task. Yet, existing solutions tend to focus on specific word classes, subject areas, or machine learning techniques that cannot provide cognitively plausible explanations for their decisions. We propose an automated spatial semantic analysis (ASSA) framework building on grammar and cognitive linguistic theories to identify spatial entities and relations, bringing together methods of spatial information extraction and cognitive frameworks on spatial language. The proposed rule-based and explainable approach contributes constructions and preposition schemas and outperforms previous solutions on the CLEF-2017 standard dataset.
Co-authors
- Minxuan Feng 3
- Dagmar Gromann 2
- Zhe Hu 2
- Bin Li 2
- Weiguang Qu 2
- Dongbo Wang 2
- Juncheng Wang 2
- Shujun Wang 2
- Lei Bai 1
- Emad Barsoum 1
- Yuanyuan Cao 1
- Bolin Chang 1
- Jingzhou Chen 1
- Kai Chen 1
- Lu Chen 1
- Pei Chu 1
- Tao Chu 1
- Yiwei Dai 1
- Emmanuelle-Anna Dietz Saldanha 1
- Hejun Dong 1
- Xiaoyi Dong 1
- Yuxiang Feng 1
- Huaiyu Gu 1
- Zhuangcheng Gu 1
- Conghui He 1
- Tianyao He 1
- Jiajia Huang 1
- Jimin Huang 1
- Yonglong Huang 1
- Zhenjiang Jin 1
- Dong Li 1
- Wei Li 1
- Weijia Li 1
- Zhenxiang Li 1
- Guang Liang 1
- Dahua Lin 1
- Dechen Lin 1
- Yang Liu 1
- Zheng Liu 1
- Ziqiong Liu 1
- Jingya Lu 1
- Lindong Lu 1
- Pengxiu Lu 1
- Dongsheng Ma 1
- Ziyang Miao 1
- Boyu Niu 1
- Junbo Niu 1
- Linke Ouyang 1
- Siyi Qian 1
- Yuan Qu 1
- Siyue Ren 1
- Zhifei Ren 1
- Lei Shang 1
- Si Shen 1
- Shenguanlin 1
- Baigui Sun 1
- Yuefeng Sun 1
- Zirui Tang 1
- Zhongying Tu 1
- Lennart Wachowiak 1
- Bin Wang 1
- Fangdong Wang 1
- Guangyu Wang 1
- Jianghui Wang 1
- Jiaqi Wang 1
- Lulu Wang 1
- Shasha Wang 1
- Liqun Wei 1
- Fan Wu 1
- Hui Wu 1
- Jiang Wu 1
- Jun Wu 1
- Lijun Wu 1
- Qianqian Wu 1
- Haoyu Xie 1
- Qianqian Xie 1
- Jing Xu 1
- RuiLiang Xu 1
- Zhixing Xu (许智星) 1
- Cheng Yu 1
- Guoqi Yu 1
- Yiguo Yuan 1
- Yuhang Zang 1
- Tianming Zhan 1
- Bo Zhang 1
- Junyuan Zhang 1
- Linfeng Zhang 1
- Qintong Zhang 1
- Rui Zhang 1
- Wentao Zhang 1
- Wenzheng Zhang 1
- Xiaomeng Zhao 1
- Zhiyuan Zhao 1
- Yuanhong Zheng 1
- Beihai Zhou 1
- Bowen Zhou 1
- Xuanhe Zhou 1
- Haoran Zhu 1
- Wenqiao Zhu 1