Weinan Gan
2025
RecBase: Generative Foundation Model Pretraining for Zero-Shot Recommendation
Sashuai Zhou | Weinan Gan | Qijiong Liu | Ke Lei | Jieming Zhu | Hai Huang | Yan Xia | Ruiming Tang | Zhenhua Dong | Zhou Zhao
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Recent advances in LLM-based recommendation have shown promise, yet their cross-domain generalization is hindered by a fundamental mismatch between language-centric pretraining and the recommendation task. Existing methods, relying on language-level knowledge, fail to capture dynamic, item-level user interests across domains. To bridge this gap, we propose RecBase, a domain-agnostic foundational model pretrained with a recommendation-oriented objective. RecBase leverages a large-scale, heterogeneous, cross-domain corpus with unified textual representations and feature mappings to enhance cross-domain generalization. To further align item semantics across domains, we introduce a unified item tokenizer that encodes items into hierarchical concept identifiers, enabling structured representation and efficient vocabulary sharing. The model is trained using an autoregressive objective to capture complex item-level sequential patterns. On eight real-world datasets, our 1.5B-parameter model matches or surpasses the performance of LLM baselines up to 7B parameters in zero-shot and cross-domain recommendation tasks.
Instruction-Tuning Data Synthesis from Scratch via Web Reconstruction
Yuxin Jiang | Yufei Wang | Chuhan Wu | Xinyi Dai | Yan Xu | Weinan Gan | Yasheng Wang | Xin Jiang | Lifeng Shang | Ruiming Tang | Wei Wang
Findings of the Association for Computational Linguistics: ACL 2025
The improvement of LLMs’ instruction-following capabilities depends critically on the availability of high-quality instruction-response pairs. While existing automatic data synthesis methods alleviate the burden of manual curation, they often rely heavily on either the quality of seed data or strong assumptions about the structure and content of web documents. To tackle these challenges, we propose Web Reconstruction (WebR), a fully automated framework for synthesizing high-quality instruction-tuning (IT) data directly from raw web documents with minimal assumptions. Leveraging the inherent diversity of raw web content, we conceptualize web reconstruction as an instruction-tuning data synthesis task via a novel dual-perspective paradigm (Web as Instruction and Web as Response), in which each web document is assigned either the input or the output role to trigger the reconstruction process. Comprehensive experiments show that datasets generated by WebR outperform state-of-the-art baselines by up to 16.65% across four instruction-following benchmarks. Notably, WebR demonstrates superior compatibility, data efficiency, and scalability, enabling enhanced domain adaptation with minimal effort.
Co-authors
- Ruiming Tang 2
- Xinyi Dai 1
- Zhenhua Dong 1
- Hai Huang 1
- Yuxin Jiang 1