Tao Sun
2026
MdEval: Massively Multilingual Code Debugging
Shukai Liu | Linzheng Chai | Jian Yang | Jiajun Shi | He Zhu | Liran Wang | Jin Ke | Wei Zhang | Hualei Zhu | Shuyue Guo | Tao Sun | Jiaheng Liu | Yunlong Duan | Yu Hao | Liqun Yang | Guanglin Niu | Ge Zhang | Zhoujun Li
Findings of the Association for Computational Linguistics: ACL 2026
Code large language models (LLMs) have made significant progress in code debugging by directly generating the correct code based on the buggy code snippet. Programming benchmarks, typically consisting of buggy code snippets and their associated test cases, are used to assess the debugging capabilities of LLMs. However, many existing benchmarks primarily focus on Python and are often limited in terms of language diversity (e.g., DebugBench and DebugEval). To advance the field of multilingual debugging with LLMs, we propose the first massively multilingual debugging benchmark, which includes 3.9K test samples across 20 programming languages and covers the automated program repair (APR) task, the bug localization (BL) task, and the bug identification (BI) task. In addition, we introduce the debugging instruction corpus MdEval-Instruct, built by injecting bugs into correct multilingual queries and solutions (xDebugGen). Further, we train a multilingual debugger, xDebugCoder, on MdEval-Instruct as a strong baseline designed to handle bugs across a wide range of programming languages (e.g., “Missing Mut” in Rust and “Misused Macro Definition” in C). Our extensive experiments on MdEval reveal a notable performance gap between open-source and closed-source LLMs (e.g., the GPT and Claude series), highlighting substantial room for improvement in multilingual code debugging scenarios.
2025
XFormParser: A Simple and Effective Multimodal Multilingual Semi-structured Form Parser
Xianfu Cheng | Hang Zhang | Jian Yang | Xiang Li | Weixiao Zhou | Fei Liu | Kui Wu | Xiangyuan Guan | Tao Sun | Xianjie Wu | Tongliang Li | Zhoujun Li
Proceedings of the 31st International Conference on Computational Linguistics
In the domain of Document AI, parsing semi-structured image forms is a crucial Key Information Extraction (KIE) task. The advent of pre-trained multimodal models significantly empowers Document AI frameworks to extract key information from form documents in different formats such as PDF, Word, and images. Nonetheless, form parsing is still encumbered by notable challenges such as weak multilingual parsing capabilities and diminished recall in industrial contexts with rich text and rich visuals. In this work, we introduce a simple but effective Multimodal and Multilingual semi-structured FORM PARSER (XFormParser), which is anchored on a comprehensive Transformer-based pre-trained language model and combines semantic entity recognition (SER) and relation extraction (RE) into a unified framework. Combined with a Bi-LSTM, the performance of multilingual parsing is significantly improved. Furthermore, we develop InDFormSFT, a supervised fine-tuning (SFT) industrial dataset that specifically addresses the parsing needs of forms in a variety of industrial contexts. In testing on established benchmarks, XFormParser demonstrates its effectiveness and robustness. Compared to existing state-of-the-art (SOTA) models, XFormParser achieves up to 1.79% F1 score improvement on RE tasks in language-specific settings. It also exhibits improvements in cross-task performance in both multilingual and zero-shot settings.
LLaVA-RE: Binary Image-Text Relevancy Evaluation with Multimodal Large Language Model
Tao Sun | Oliver Liu | JinJin Li | Lan Ma
Proceedings of the First Workshop of Evaluation of Multi-Modal Generation
Multimodal generative AI usually involves generating image or text responses given inputs in another modality. The evaluation of image-text relevancy is essential for measuring the response quality or ranking candidate responses. In particular, binary relevancy evaluation, i.e., “Relevant” vs. “Not Relevant”, is a fundamental problem. However, this is a challenging task considering that texts have diverse formats and the definition of relevancy varies across scenarios. We find that Multimodal Large Language Models (MLLMs) are an ideal choice to build such evaluators, as they can flexibly handle complex text formats and take in additional task information. In this paper, we present LLaVA-RE, a first attempt at binary image-text relevancy evaluation with an MLLM. It follows the LLaVA architecture and adopts detailed task instructions and multimodal in-context samples. Further, we propose a novel binary relevancy dataset covering diverse tasks. Experimental results validate the effectiveness of our framework.
2024
UniCoder: Scaling Code Large Language Model via Universal Code
Tao Sun | Linzheng Chai | Jian Yang | Yuwei Yin | Hongcheng Guo | Jiaheng Liu | Bing Wang | Liqun Yang | Zhoujun Li
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Intermediate reasoning or acting steps have successfully improved large language models (LLMs) for handling various downstream natural language processing (NLP) tasks. When applying LLMs for code generation, recent works mainly focus on directing the models to articulate intermediate natural-language reasoning steps, as in chain-of-thought (CoT) prompting, and then output code with the natural language or other structured intermediate steps. However, such output is not suitable for code translation or generation tasks, since standard CoT differs from code in its logical structure and form of expression. In this work, we introduce the universal code (UniCode) as the intermediate representation. It is a description of algorithm steps using a mix of programming-language conventions, such as assignment operators, conditional operators, and loops. We then collect an instruction dataset, UniCoder-Instruct, to train our model UniCoder on multi-task learning objectives. UniCoder-Instruct comprises natural-language questions, code solutions, and the corresponding universal code. The alignment between the intermediate universal code representation and the final code solution significantly improves the quality of the generated code. The experimental results demonstrate that UniCoder with the universal code outperforms the previous prompting methods by a large margin, showcasing the effectiveness of the structural clues in pseudo-code.
Co-authors
- Zhoujun Li 3
- Linzheng Chai 2
- Jiaheng Liu 2
- Liqun Yang 2
- Jian Yang 2
- Xianfu Cheng 1
- Yunlong Duan 1
- Xiangyuan Guan 1
- Shuyue Guo 1
- Hongcheng Guo 1
- Yu Hao 1
- Jin Ke 1
- Xiang Li 1
- Tongliang Li 1
- Jinjin Li 1
- Shukai Liu 1
- Fei Liu 1
- Oliver Liu 1
- Lan Ma 1
- Guanglin Niu 1
- Jiajun Shi 1
- Liran Wang 1
- Bing Wang 1
- Kui Wu 1
- Xianjie Wu 1
- Jian Yang 1
- Yuwei Yin 1
- Wei Zhang 1
- Ge Zhang 1
- Hang Zhang 1
- Weixiao Zhou 1
- He Zhu 1
- Hualei Zhu 1