Chen Li Yu


2025

TACLR: A Scalable and Efficient Retrieval-based Method for Industrial Product Attribute Value Identification
Yindu Su | Huike Zou | Lin Sun | Ting Zhang | Haiyang Yang | Chen Li Yu | David Lo | Qingheng Zhang | Shuguang Han | Jufeng Chen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Product Attribute Value Identification (PAVI) involves identifying attribute values from product profiles, a key task for improving product search, recommendation, and business analytics on e-commerce platforms. However, existing PAVI methods face critical challenges, such as inferring implicit values, handling out-of-distribution (OOD) values, and producing normalized outputs. To address these limitations, we introduce Taxonomy-Aware Contrastive Learning Retrieval (TACLR), the first retrieval-based method for PAVI. TACLR formulates PAVI as an information retrieval task by encoding product profiles and candidate values into embeddings and retrieving values based on their similarity. It leverages contrastive training with taxonomy-aware hard negative sampling and employs adaptive inference with dynamic thresholds. TACLR offers three key advantages: (1) it effectively handles implicit and OOD values while producing normalized outputs; (2) it scales to thousands of categories, tens of thousands of attributes, and millions of values; and (3) it supports efficient inference for high-load industrial deployment. Extensive experiments on proprietary and public datasets validate the effectiveness and efficiency of TACLR. Further, it has been successfully deployed on the real-world e-commerce platform Xianyu, processing millions of product listings daily with frequently updated, large-scale attribute taxonomies. We release the code to facilitate reproducibility and future research at https://github.com/SuYindu/TACLR.
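
The retrieval formulation described in the abstract above (embed the product profile and the candidate values, then keep candidates whose similarity clears a threshold) can be illustrated with a minimal sketch. This is not the released TACLR code: the function names `cosine` and `retrieve_values`, the hand-written three-dimensional vectors, and the per-attribute `thresholds` table are all hypothetical stand-ins for a trained encoder and the paper's dynamic thresholds.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_values(profile_vec, candidates, threshold):
    """Return (value, score) pairs whose similarity to the profile is >= threshold,
    sorted from most to least similar. An empty list means no value is predicted."""
    scored = [(value, cosine(profile_vec, vec)) for value, vec in candidates.items()]
    kept = [(value, score) for value, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy embeddings written by hand (assumption: a real system would obtain
# these from a trained encoder).
profile = [0.9, 0.1, 0.0]
color_candidates = {
    "red": [1.0, 0.0, 0.0],
    "blue": [0.0, 1.0, 0.0],
    "green": [0.0, 0.0, 1.0],
}

# Hypothetical per-attribute thresholds standing in for dynamic thresholds.
thresholds = {"color": 0.5}

result = retrieve_values(profile, color_candidates, thresholds["color"])
print(result)
```

Because every prediction is drawn from the candidate set, the output is normalized by construction, and returning an empty list when nothing clears the threshold is one simple way to represent "no value for this attribute".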

GSID: Generative Semantic Indexing for E-Commerce Product Understanding
Haiyang Yang | Qinye Xie | Qingheng Zhang | Chen Li Yu | Huike Zou | Chengbao Lian | Shuguang Han | Fei Huang | Jufeng Chen | Bo Zheng
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track

Structured representation of product information is a major bottleneck for the efficiency of e-commerce platforms, especially second-hand e-commerce platforms. Currently, most product information is organized around manually curated product categories and attributes, which often fail to adequately cover long-tail products and do not align well with buyer preferences. To address these problems, we propose Generative Semantic InDexings (GSID), a data-driven approach to generating structured product representations. GSID consists of two key components: (1) pre-training on unstructured product metadata to learn in-domain semantic embeddings, and (2) generating more effective semantic codes tailored for downstream product-centric applications. Extensive experiments validate the effectiveness of GSID, and it has been successfully deployed on a real-world e-commerce platform, achieving promising results on product understanding and other downstream tasks.

Multi-Value-Product Retrieval-Augmented Generation for Industrial Product Attribute Value Identification
Huike Zou | Haiyang Yang | Yindu Su | Chen Li Yu | Qinye Xie | Chengbao Lian | Qingheng Zhang | Shuguang Han | Fei Huang | Jufeng Chen
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track

Identifying attribute values from product profiles, a task we call Product Attribute Value Identification (PAVI), is key to improving product search, recommendation, and business analytics on e-commerce platforms. However, existing PAVI methods face critical challenges, such as cascading errors, inability to handle out-of-distribution (OOD) attribute values, and lack of generalization capability. To address these limitations, we introduce Multi-Value-Product Retrieval-Augmented Generation (MVP-RAG), combining the strengths of retrieval, generation, and classification paradigms. MVP-RAG defines PAVI as a retrieval-generation task, where the product title description serves as the query, and products and attribute values act as the corpus. It first retrieves similar products of the same category and candidate attribute values, and then generates the standardized attribute values. The key advantages of this work are: (1) a multi-level retrieval scheme, with products and attribute values as distinct hierarchical levels in the PAVI domain; (2) attribute value generation with a large language model, which significantly alleviates the OOD problem; and (3) successful deployment in a real-world industrial environment. Extensive experimental results on the dataset demonstrate that the proposed method performs better than the state-of-the-art baselines.