Zhixun Chen
2026
MuBench: Assessment of Multilingual Capabilities of Large Language Models Across 61 Languages
Wenhan Han | Yifan Zhang | Zhixun Chen | Binbinliu | Mykola Pechenizkiy | Meng Fang | Yin Zheng
Findings of the Association for Computational Linguistics: ACL 2026
Multilingual large language models (LLMs) are advancing rapidly, with new models frequently claiming support for an increasing number of languages. However, existing evaluation datasets are limited and lack cross-lingual alignment, leaving assessments of multilingual capabilities fragmented in both language and skill coverage. To address this, we introduce MuBench, a benchmark covering 61 languages with 3.9M samples and evaluating a broad range of capabilities. We evaluate several state-of-the-art multilingual LLMs and find notable gaps between claimed and actual language coverage, particularly a persistent performance disparity between English and low-resource languages. Leveraging MuBench’s alignment, we propose Multilingual Consistency (MLC) as a complementary metric to accuracy for analyzing performance bottlenecks and guiding model improvement. MuBench provides flexible evaluation formats, including mixed-language testing. Experimental results show that increasing model size does not improve a model's ability to handle mixed-language contexts. We recruited human experts to evaluate translation quality and cultural sensitivity for 34k samples across 17 languages, and combined these assessments with an LLM-as-a-Judge approach to ensure overall data quality in low-resource languages.
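The abstract above names Multilingual Consistency (MLC) but does not define it. As a rough illustration only, the sketch below assumes one plausible reading: for samples aligned across languages, consistency between two languages is the fraction of items on which the model gives the same answer. The function name `pairwise_consistency` and the input layout are hypothetical and are not taken from MuBench.

```python
from itertools import combinations


def pairwise_consistency(predictions):
    """Agreement rate between every pair of languages.

    ASSUMPTION: this is an illustrative stand-in for a consistency
    metric, not the MLC definition from the paper.

    predictions: dict mapping a language code to a list of model
    answers, where index i refers to the same aligned item in
    every language.
    """
    scores = {}
    for lang_a, lang_b in combinations(sorted(predictions), 2):
        answers_a = predictions[lang_a]
        answers_b = predictions[lang_b]
        if len(answers_a) != len(answers_b) or not answers_a:
            raise ValueError("aligned answer lists must be non-empty and equal in length")
        # Count items where both languages produced the same answer,
        # regardless of whether that answer is correct.
        agreements = sum(a == b for a, b in zip(answers_a, answers_b))
        scores[(lang_a, lang_b)] = agreements / len(answers_a)
    return scores


if __name__ == "__main__":
    # Hypothetical multiple-choice answers for four aligned items.
    toy_predictions = {
        "en": ["A", "B", "C", "D"],
        "sw": ["A", "B", "D", "D"],
    }
    print(pairwise_consistency(toy_predictions))
```

Because agreement is counted independently of correctness, a measure of this kind can separate cases where a model is wrong in the same way across languages from cases where it answers differently depending on the language, which is why it complements accuracy rather than replacing it.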
2025
ATLAS: Agent Tuning via Learning Critical Steps
Zhixun Chen | Ming Li | Yuxuan Huang | Yali Du | Meng Fang | Tianyi Zhou
Findings of the Association for Computational Linguistics: ACL 2025
Large Language Model (LLM) agents have demonstrated remarkable generalization capabilities across multi-domain tasks. Existing agent tuning approaches typically employ supervised finetuning on entire expert trajectories. However, behavior cloning of full trajectories can introduce expert bias and weaken generalization to states not covered by the expert data. Additionally, critical steps, such as planning, complex reasoning for intermediate subtasks, and strategic decision-making, are essential to success in agent tasks, so learning these steps is the key to improving LLM agents. For more effective and efficient agent tuning, we propose ATLAS, which identifies the critical steps in expert trajectories and finetunes LLMs solely on these steps at reduced cost. By steering the training’s focus to a few critical steps, our method mitigates the risk of overfitting to entire trajectories and promotes generalization across different environments and tasks. In extensive experiments, an LLM finetuned on only the 30% of steps that ATLAS selects as critical outperforms the LLM finetuned on all steps and recent open-source LLM agents. ATLAS maintains and improves the base LLM's skills as a generalist agent interacting with diverse environments.
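The idea of finetuning only on selected steps can be illustrated with a loss mask. The sketch below is a minimal, framework-free illustration under stated assumptions: the set of critical step indices is taken as given (how ATLAS selects them is not described in the abstract), and the helper names `build_loss_mask` and `masked_mean_loss` are hypothetical rather than part of any released code.

```python
def build_loss_mask(step_lengths, critical_steps):
    """Return a per-token 0/1 mask over a flattened trajectory.

    step_lengths: number of tokens in each step of the trajectory.
    critical_steps: set of step indices to train on.
    ASSUMPTION: the critical steps are supplied externally; this
    sketch does not implement any selection procedure.
    """
    mask = []
    for step_index, length in enumerate(step_lengths):
        keep = 1 if step_index in critical_steps else 0
        mask.extend([keep] * length)
    return mask


def masked_mean_loss(token_losses, mask):
    """Average the per-token losses over the masked-in tokens only."""
    kept = [loss for loss, keep in zip(token_losses, mask) if keep]
    # With no critical tokens there is nothing to train on.
    return sum(kept) / len(kept) if kept else 0.0


if __name__ == "__main__":
    # Three steps of 2, 3, and 1 tokens; only step 1 is critical.
    mask = build_loss_mask([2, 3, 1], {1})
    per_token_losses = [1.0, 1.0, 2.0, 4.0, 6.0, 1.0]
    print(mask, masked_mean_loss(per_token_losses, mask))
```

In a real training loop the same mask would typically be applied to per-token cross-entropy values, so that tokens from non-critical steps still appear in the context but contribute no gradient.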