2025
Multilingual Encoder Knows more than You Realize: Shared Weights Pretraining for Extremely Low-Resource Languages
Zeli Su | Ziyin Zhang | Guixian Xu | Jianing Liu | Xu Han | Ting Zhang | Yushuang Dong
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
While multilingual language models like XLM-R have advanced multilingualism in NLP, they still perform poorly in extremely low-resource languages. This situation is exacerbated by the fact that modern LLMs such as LLaMA and Qwen support far fewer languages than XLM-R, leaving many of the world's languages without any text generation model. To tackle this challenge, we propose a novel framework for adapting multilingual encoders to text generation in extremely low-resource languages. By sharing weights between the encoder and the decoder, our framework lets the model leverage the encoder's learned semantic space, enabling efficient learning and effective generalization in low-resource languages. Applying this framework to four Chinese minority languages, we present XLM-SWCM and demonstrate its superior performance on various downstream tasks, even when compared with much larger models.
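To make the weight-reuse idea concrete, here is a minimal sketch of initializing a small decoder from a pretrained multilingual encoder. Loading XLM-R via Hugging Face transformers matches the abstract, but the decoder depth, the choice of encoder layers to copy from, and the tying of the output head to the input embeddings are illustrative assumptions, not the released XLM-SWCM implementation.

```python
# Minimal sketch: build a small decoder whose self-attention is initialized
# from a pretrained multilingual encoder (XLM-R), and tie the output head to
# the input embeddings. The layer count (4) and the encoder layers copied
# from ([2, 5, 8, 11]) are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import XLMRobertaModel

encoder = XLMRobertaModel.from_pretrained("xlm-roberta-base")
cfg = encoder.config

dec_layer = nn.TransformerDecoderLayer(
    d_model=cfg.hidden_size, nhead=cfg.num_attention_heads,
    dim_feedforward=cfg.intermediate_size, batch_first=True)
decoder = nn.TransformerDecoder(dec_layer, num_layers=4)

# Copy the self-attention projections of selected encoder layers into the
# decoder's self-attention, so decoding starts in the encoder's learned
# semantic space rather than from random initialization.
for dec, enc_idx in zip(decoder.layers, [2, 5, 8, 11]):
    attn = encoder.encoder.layer[enc_idx].attention.self
    with torch.no_grad():
        dec.self_attn.in_proj_weight.copy_(
            torch.cat([attn.query.weight, attn.key.weight, attn.value.weight]))
        dec.self_attn.in_proj_bias.copy_(
            torch.cat([attn.query.bias, attn.key.bias, attn.value.bias]))

# Tie the language-model head to the encoder's input embeddings.
lm_head = nn.Linear(cfg.hidden_size, cfg.vocab_size, bias=False)
lm_head.weight = encoder.embeddings.word_embeddings.weight
```

Under this kind of initialization, subsequent generation training in the target languages starts from representations the encoder has already learned, which is the intuition the abstract describes.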
TACLR: A Scalable and Efficient Retrieval-based Method for Industrial Product Attribute Value Identification
Yindu Su | Huike Zou | Lin Sun | Ting Zhang | Haiyang Yang | Chen Li Yu | David Lo | Qingheng Zhang | Shuguang Han | Jufeng Chen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Product Attribute Value Identification (PAVI) involves identifying attribute values from product profiles, a key task for improving product search, recommendation, and business analytics on e-commerce platforms. However, existing PAVI methods face critical challenges, such as inferring implicit values, handling out-of-distribution (OOD) values, and producing normalized outputs. To address these limitations, we introduce Taxonomy-Aware Contrastive Learning Retrieval (TACLR), the first retrieval-based method for PAVI. TACLR formulates PAVI as an information retrieval task by encoding product profiles and candidate values into embeddings and retrieving values based on their similarity. It leverages contrastive training with taxonomy-aware hard negative sampling and employs adaptive inference with dynamic thresholds. TACLR offers three key advantages: (1) it effectively handles implicit and OOD values while producing normalized outputs; (2) it scales to thousands of categories, tens of thousands of attributes, and millions of values; and (3) it supports efficient inference for high-load industrial deployment. Extensive experiments on proprietary and public datasets validate the effectiveness and efficiency of TACLR. Further, it has been successfully deployed on the real-world e-commerce platform Xianyu, processing millions of product listings daily with frequently updated, large-scale attribute taxonomies. We release the code to facilitate reproducibility and future research at https://github.com/SuYindu/TACLR.
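The retrieval formulation is easy to picture in code. The sketch below embeds a product profile and candidate attribute values and keeps values whose similarity clears a threshold; the encoder (all-MiniLM-L6-v2 via sentence-transformers), the toy taxonomy, and the fixed threshold of 0.35 are illustrative assumptions, whereas TACLR trains its own contrastive encoder and derives thresholds adaptively.

```python
# Illustrative sketch of retrieval-based attribute value identification:
# embed the product profile and candidate values, rank by cosine similarity,
# and accept a value only if it clears a threshold (a fixed stand-in for
# TACLR's adaptive inference with dynamic thresholds).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder, not the paper's

profile = "Apple iPhone 13, 128GB storage, midnight black, unlocked"
candidates = {"color": ["black", "white", "red"],
              "storage": ["64GB", "128GB", "256GB"]}

profile_emb = model.encode(profile, convert_to_tensor=True,
                           normalize_embeddings=True)
for attribute, values in candidates.items():
    value_embs = model.encode(values, convert_to_tensor=True,
                              normalize_embeddings=True)
    scores = value_embs @ profile_emb        # cosine similarity (unit vectors)
    best = scores.argmax().item()
    threshold = 0.35                         # assumed; TACLR sets this dynamically
    if scores[best] >= threshold:
        print(attribute, "->", values[best], f"(score={scores[best]:.2f})")
    else:
        print(attribute, "-> null (no value passes the threshold)")
```

Because every prediction is a retrieved canonical value from the taxonomy, the output is normalized by construction, and a below-threshold score yields a null prediction, which is how a retrieval formulation can decline to guess on OOD values.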
CMHG: A Dataset and Benchmark for Headline Generation of Minority Languages in China
Guixian Xu | Zeli Su | Ziyin Zhang | Jianing Liu | Xu Han | Ting Zhang | Yushuang Dong
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Minority languages in China, such as Tibetan, Uyghur, and Traditional Mongolian, face significant challenges due to their unique writing systems, which differ from international standards. This discrepancy has led to a severe lack of relevant corpora, particularly for supervised tasks like headline generation. To address this gap, we introduce a novel dataset, Chinese Minority Headline Generation (CMHG), which includes 100,000 entries for Tibetan, and 50,000 entries each for Uyghur and Mongolian, specifically curated for headline generation tasks. Additionally, we propose a high-quality test set annotated by native speakers, designed to serve as a benchmark for future research in this domain. We hope this dataset will become a valuable resource for advancing headline generation in Chinese minority languages and contribute to the development of related benchmarks.
Process-Supervised Reinforcement Learning for Code Generation
Yufan Ye | Ting Zhang | Wenbin Jiang | Hua Huang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Existing reinforcement learning (RL) strategies based on outcome supervision have proven effective in enhancing the performance of large language models (LLMs) for code generation. While RL based on process supervision shows great potential for multi-step reasoning tasks, its effectiveness in code generation remains underexplored. The primary obstacle is the resource-intensive nature of constructing a high-quality process-supervised reward dataset, which requires substantial human expertise and computational resources. To overcome this challenge, this paper proposes a “mutation/refactoring-execution verification” strategy: a teacher model mutates and refactors statement lines or blocks, and the compiler's execution results are used to label them automatically, yielding a process-supervised reward dataset. Based on this dataset, we carry out a series of RL experiments. The results show that, compared with methods relying only on outcome supervision, RL based on process supervision performs better on complex code generation tasks. In addition, this paper confirms for the first time the advantages of Direct Preference Optimization (DPO) for process-supervised RL in code generation, providing new ideas and directions for code generation research.
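As a rough illustration of the “mutation/refactoring-execution verification” idea, the sketch below mutates individual lines of a candidate program and uses test execution as the automatic labeler. The trivial string mutation and tiny in-process test suite are stand-ins for the paper's teacher-model mutations and compiler feedback, and all names here are hypothetical.

```python
# Illustrative sketch: label individual lines of a candidate program by
# mutating them and checking whether the test suite still passes. A line
# whose mutation breaks the tests is load-bearing, and the pass/fail signal
# serves as an automatic process-level label. The mutation operator below is
# a trivial stand-in for the paper's teacher-model mutations/refactorings.

def run_tests(src: str) -> bool:
    """Execute the candidate against a tiny test suite (compiler stand-in)."""
    env: dict = {}
    try:
        exec(src, env)
        return env["add"](2, 3) == 5 and env["add"](-1, 1) == 0
    except Exception:
        return False

candidate = "def add(a, b):\n    return a + b\n"
lines = candidate.splitlines()

process_labels = []
for i, line in enumerate(lines):
    mutated = lines.copy()
    mutated[i] = line.replace("+", "-")   # trivial mutation operator
    if mutated[i] == line:                # nothing to mutate on this line
        continue
    still_passes = run_tests("\n".join(mutated))
    # If mutating the line breaks the tests, the original line is verified.
    process_labels.append((i, line.strip(),
                           "correct" if not still_passes else "uninformative"))

for idx, text, label in process_labels:
    print(f"line {idx}: {text!r} -> {label}")
```

Labels gathered this way give per-step preference pairs (original vs. mutated line) of the kind a process-supervised reward model, or DPO-style training, can consume.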
LogRules: Enhancing Log Analysis Capability of Large Language Models through Rules
Xin Huang | Ting Zhang | Wen Zhao
Findings of the Association for Computational Linguistics: NAACL 2025
Large language models (LLMs) have achieved impressive performance on natural language processing tasks. However, they still produce many hallucinations when analyzing system logs, because logs contain implicit knowledge and rules that LLMs fail to capture. To address this, we propose LogRules, a lightweight log analysis framework that generates and applies rules through LLMs. LogRules consists of three stages: an induction stage, an alignment stage, and a reasoning stage. First, in the induction stage, a strong LLM (e.g., GPT-4o-mini) generates a series of rules related to the logs, which are then validated on the training set; rules confirmed to produce correct reasoning results are added to a rule repository. Second, since small LLMs (≈8B parameters) still struggle to utilize rules, we design an alignment method based on rule-case contrastive preference optimization (CPO) to effectively enhance their rule reasoning capabilities. Finally, in the reasoning stage, the LLM constructs prompts from the rule repository and performs log analysis on the test set. Experiments show that LogRules outperforms LLM-based methods on log parsing and anomaly detection tasks and achieves better performance than case-based methods.
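A minimal sketch of the induction stage described above: a (stubbed) strong LLM proposes a candidate rule, which is kept only if it reproduces the reference outputs on the training set. The regex rule format, the function names, and the toy HDFS-style logs are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch of LogRules' induction stage: ask a strong LLM for a
# candidate parsing rule, validate it on labeled training logs, and add it to
# the rule repository only if it reproduces the reference templates. The LLM
# call is stubbed out; the rule format (a regex) is an assumption.
import re

def propose_rule(sample_log: str) -> str:
    """Stub for the strong LLM (e.g., GPT-4o-mini) proposing a rule."""
    # A real system would prompt the model; here we return a fixed regex
    # that masks block IDs, a common log-parsing normalization.
    return r"blk_-?\d+"

def apply_rule(rule: str, log: str) -> str:
    return re.sub(rule, "<*>", log)

train_set = [
    ("Received block blk_3587508140051953248 of size 67108864",
     "Received block <*> of size 67108864"),
    ("Deleting block blk_-192374 from /data",
     "Deleting block <*> from /data"),
]

rule_repository = []
rule = propose_rule(train_set[0][0])
if all(apply_rule(rule, log) == template for log, template in train_set):
    rule_repository.append(rule)   # confirmed on the training set

print("repository:", rule_repository)
```

At inference time, the validated repository would be serialized into the prompt so that a smaller, CPO-aligned model can apply the rules rather than rediscover them.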