Abstract
Pre-training on large-scale open-domain dialogue data can substantially improve the performance of dialogue models. However, the pre-trained dialogue model’s ability to utilize long-range context is limited due to the scarcity of long-turn dialogue sessions. Most dialogues in existing pre-training corpora contain fewer than three turns of dialogue. To alleviate this issue, we propose the Retrieve, Reorganize and Rescale framework (Re3Dial), which can automatically construct billion-scale long-turn dialogues by reorganizing existing short-turn ones. Given a short-turn session, Re3Dial first employs a session retriever to retrieve coherent consecutive sessions. To this end, we train the retriever to capture semantic and discourse relations within multi-turn dialogues through contrastive training. Next, Re3Dial samples a session from retrieved results following a diversity sampling strategy, which is designed to penalize repetitive or generic sessions. A longer session is then derived by concatenating the original session and the sampled session. By repeating the above process, Re3Dial can yield a coherent long-turn dialogue. Extensive experiments on multiple multi-turn dialogue benchmarks demonstrate that Re3Dial significantly improves the dialogue model’s ability to utilize long-range context and thus generate more sensible and informative responses. Finally, we build a toolkit for efficiently rescaling conversations with Re3Dial, which enables us to construct a corpus containing 1B Chinese dialogue sessions with 11.3 turns on average (5X longer than the original corpus). We will release our retriever model, toolkit, and data for public use.- Anthology ID:
- 2023.emnlp-main.612
- Volume:
- Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
- Month:
- December
- Year:
- 2023
- Address:
- Singapore
- Editors:
- Houda Bouamor, Juan Pino, Kalika Bali
- Venue:
- EMNLP
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 9878–9894
- Language:
- URL:
- https://preview.aclanthology.org/add_missing_videos/2023.emnlp-main.612/
- DOI:
- 10.18653/v1/2023.emnlp-main.612
- Cite (ACL):
- Jiaxin Wen, Hao Zhou, Jian Guan, Jie Zhou, and Minlie Huang. 2023. Re3Dial: Retrieve, Reorganize and Rescale Conversations for Long-Turn Open-Domain Dialogue Pre-training. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9878–9894, Singapore. Association for Computational Linguistics.
- Cite (Informal):
- Re3Dial: Retrieve, Reorganize and Rescale Conversations for Long-Turn Open-Domain Dialogue Pre-training (Wen et al., EMNLP 2023)
- PDF:
- https://preview.aclanthology.org/add_missing_videos/2023.emnlp-main.612.pdf