Token-level Preference Self-Alignment Optimization for Multi-style Outline Controllable Generation
Zihao Li | Xuekong Xu | Ziyao Chen | Lixin Zou | Ethanhjwu Ethanhjwu | Qiang Chen | Chenliang Li
Findings of the Association for Computational Linguistics: ACL 2025
Multi-style outline controllable generation is crucial for multiple applications, including document semantic structuring and retrieval-augmented generation. The great success of preference alignment approaches encourages their application to controllable generation tasks. However, these attempts encounter several limitations: (1) response pair requirements, (2) substantial computation costs, and (3) insufficient exploitation of fine-grained preference signals. To address these problems, we propose a token-level preference self-alignment optimization, named TKPO, for outline controllable generation. TKPO extends the Bradley-Terry model from pair-wise to list-wise comparison, which is further applied at the token level for fine-grained preference signal utilization. In comparison to representative methods, e.g., DPO, TKPO does not require response pairs; instead, we propose a controllable attributes-driven method to construct reject samples for self-alignment. Additionally, TKPO optimizes only the base model, thereby avoiding additional memory usage and substantial computational costs. We curate two outline controllable generation datasets with regard to language style and level-of-detail. Extensive experiments demonstrate that TKPO outperforms DPO by up to 19.28% in performance while requiring only 56.25% of the training time. We release the code and dataset resources at https://github.com/WHUIR/TKPO.
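For background on the pair-wise to list-wise extension mentioned above, the following is a minimal sketch using the standard Bradley-Terry model and its common list-wise generalization (the Plackett-Luce form), assuming real-valued preference scores s_1, ..., s_K over K ranked candidates; the exact token-level TKPO objective is defined in the paper and repository, not reproduced here.

```latex
% Pairwise Bradley-Terry: probability that candidate i is preferred over candidate j,
% given scores s_i and s_j.
P(i \succ j) = \frac{\exp(s_i)}{\exp(s_i) + \exp(s_j)}

% A standard list-wise generalization (Plackett-Luce): probability of the full ranking
% 1 \succ 2 \succ \dots \succ K, obtained by applying the Bradley-Terry choice rule
% repeatedly to the remaining candidates.
P(1 \succ 2 \succ \dots \succ K) = \prod_{k=1}^{K} \frac{\exp(s_k)}{\sum_{j=k}^{K} \exp(s_j)}
```

In a token-level setting, such scores could, for instance, be per-token log-probabilities of the preferred response versus attribute-perturbed reject samples; this is an illustrative assumption rather than the paper's stated formulation.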