Juesi Xiao


2025

Praetor: A Fine-Grained Generative LLM Evaluator with Instance-Level Customizable Evaluation Criteria
Yongqi Leng | Renren Jin | Yue Chen | Zhuowen Han | Ling Shi | Jianxiang Peng | Lei Yang | Juesi Xiao | Deyi Xiong
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

With the increasing capability of large language models (LLMs), LLM-as-a-judge has emerged as a new evaluation paradigm. Compared with traditional automatic and manual evaluation, LLM evaluators exhibit better interpretability and efficiency. Despite this, existing LLM evaluators suffer from limited use scenarios and poor flexibility. To mitigate these issues, we propose Praetor, a fine-grained generative LLM evaluator with instance-level customizable evaluation criteria. To train Praetor, we curate a large-scale dataset guided by a hierarchical guideline that covers a wide range of tasks and instance-level evaluation criteria. We train Praetor on this dataset in a multi-task learning fashion, which enables it to evaluate LLMs via either pointwise grading or pairwise comparison and to support two languages simultaneously, with high flexibility in setting evaluation criteria. Extensive experiments demonstrate that Praetor outperforms previous LLM evaluators and instruction-tuned LLMs on multiple benchmarks, setting new SOTA results. It also exhibits the potential to generate critiques as scalable feedback to further improve LLMs. Our model and related resources are released at https://github.com/tjunlp-lab/Praetor.
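A minimal sketch of how an instance-level, pointwise-grading evaluation with a generative LLM evaluator such as Praetor might look, using the Hugging Face transformers library. The model identifier, prompt format, and criteria below are assumptions for illustration only, not the paper's actual interface; consult the released resources at https://github.com/tjunlp-lab/Praetor for the real usage.

# Hypothetical pointwise-grading sketch with instance-level criteria.
# Model path and prompt layout are assumptions, not the official interface.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "tjunlp-lab/Praetor"  # assumed identifier; check the repository for the actual one
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

# Instance-level criteria supplied at evaluation time (illustrative only).
criteria = (
    "1. Factual accuracy of the answer.\n"
    "2. Completeness with respect to the question.\n"
    "3. Clarity and conciseness."
)
instruction = "Explain why the sky appears blue."
response = "Air molecules scatter shorter wavelengths of sunlight more strongly, so blue light reaches our eyes from all directions."

prompt = (
    "You are an evaluator. Grade the response on a 1-5 scale according to the criteria, "
    "then give a short critique.\n\n"
    f"Criteria:\n{criteria}\n\nInstruction:\n{instruction}\n\nResponse:\n{response}\n\nEvaluation:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens (the score and critique).
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Pairwise comparison would follow the same pattern, with two candidate responses placed in the prompt and the evaluator asked to state a preference.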

2024

FuxiTranyu: A Multilingual Large Language Model Trained with Balanced Data
Haoran Sun | Renren Jin | Shaoyang Xu | Leiyu Pan | Supryadi | Menglong Cui | Jiangcun Du | Yikun Lei | Lei Yang | Ling Shi | Juesi Xiao | Shaolin Zhu | Deyi Xiong
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track

Large language models (LLMs) have demonstrated prowess in a wide range of tasks. However, many LLMs exhibit significant performance discrepancies between high- and low-resource languages. To mitigate this challenge, we present FuxiTranyu, an open-source multilingual LLM designed to satisfy the research community's need for balanced and high-performing multilingual capabilities. The base model, FuxiTranyu-8B, features 8 billion parameters and is trained from scratch on meticulously balanced multilingual data comprising 600 billion tokens that cover 43 natural languages and 16 programming languages. We also develop two instruction-tuned models: FuxiTranyu-8B-SFT, which is fine-tuned on a diverse multilingual instruction dataset, and FuxiTranyu-8B-DPO, which is further refined with DPO on a preference dataset for enhanced alignment. Extensive experiments on a wide range of multilingual benchmarks demonstrate the competitive performance of FuxiTranyu against existing multilingual LLMs, e.g., BLOOM-7B, PolyLM-13B, and Mistral-7B-Instruct. Both neuron and representation interpretability analyses reveal that FuxiTranyu achieves consistent multilingual representations across languages. To promote further research into multilingual LLMs, we release both the base and instruction-tuned FuxiTranyu models, together with 58 pre-training checkpoints, on Hugging Face and GitHub.
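A minimal sketch of loading an instruction-tuned FuxiTranyu checkpoint from Hugging Face with transformers and prompting it multilingually. The repository identifier below is an assumption; consult the official release for the exact model name and any required chat template.

# Hypothetical usage sketch; the Hugging Face repo id is an assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TJUNLP/FuxiTranyu-8B-SFT"  # assumed identifier for the SFT model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# A simple cross-lingual prompt; the model is trained on balanced data
# covering 43 natural languages and 16 programming languages.
prompt = "Translate into French: The weather is lovely today."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))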