Haochen Zhang


2025

pdf bib
CoVE: Compressed Vocabulary Expansion Makes Better LLM-based Recommender Systems
Haochen Zhang | Tianyi Zhang | Junze Yin | Oren Gal | Anshumali Shrivastava | Vladimir Braverman
Findings of the Association for Computational Linguistics: ACL 2025

Recommender systems play a pivotal role in providing relevant content to users. With the rapid development of large language models (LLMs), researchers have begun utilizing LLMs to build more powerful recommender systems. However, existing approaches that focus on aligning LLMs with recommendation tasks do not fully leverage their sequential information processing capabilities, leading to suboptimal performance. In this paper, we propose a novel system called compressed vocabulary expansion (CoVE). In CoVE, each item is assigned a unique ID within the expanded vocabulary. Our framework effectively capitalizes on sequence understanding abilities of LLMs, significantly enhancing their performance on recommendation tasks. Additionally, we compress the embedding layer, making CoVE practical for large-scale industrial applications. The effectiveness and performance of CoVE are demonstrated through comprehensive experiments on multiple recommendation datasets and comparisons with prior works. Our code can be found at https://github.com/HaochenZhang717/CoVE-official-Repo.

2024

pdf bib
Jellyfish: Instruction-Tuning Local Large Language Models for Data Preprocessing
Haochen Zhang | Yuyang Dong | Chuan Xiao | Masafumi Oyamada
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

This paper explores the utilization of LLMs for data preprocessing (DP), a crucial step in the data mining pipeline that transforms raw data into a clean format. We instruction-tune local LLMs as universal DP task solvers that operate on a local, single, and low-priced GPU, ensuring data security and enabling further customization. We select a collection of datasets across four representative DP tasks and construct instruction data using data configuration, knowledge injection, and reasoning data distillation techniques tailored to DP. By tuning Mistral-7B, Llama 3-8B, and OpenOrca-Platypus2-13B, our models, Jellyfish-7B/8B/13B, deliver competitiveness compared to GPT-3.5/4 models and strong generalizability to unseen tasks while barely compromising the base models’ abilities in NLP tasks. Meanwhile, Jellyfish offers enhanced reasoning capabilities compared to GPT-3.5. Our models are available at: https://huggingface.co/NECOUDBFM/JellyfishOur instruction dataset is available at: https://huggingface.co/datasets/NECOUDBFM/Jellyfish-Instruct