Open-Source Large Language Models as Multilingual Crowdworkers: Synthesizing Open-Domain Dialogues in Several Languages With No Examples in Targets and No Machine Translation
Ahmed Njifenjou, Virgile Sucal, Bassam Jabaian, Fabrice Lefèvre
Abstract
The prevailing paradigm in the field of Open-Domain Dialogue (ODD) agents focuses predominantly on a few high-resource languages such as English or Chinese. Furthermore, crowdsourcing such datasets in multiple languages requires substantial financial and time investment. Fortunately, advancements in Large Language Models (LLMs), specifically instruction-tuning, have enabled them to execute tasks based on natural language instructions. Additionally, these models can operate in various languages within a single thread. Consequently, to generate new data samples in different languages, we propose leveraging these capabilities to replicate the data collection process. We introduce a pipeline for generating ODD data in multiple target languages using LLMs, with demonstrations provided in a single source language. By eschewing explicit Machine Translation in this approach, we enhance language-specific nuances and cultural specificity. We apply this methodology to the PersonaChat dataset. To further improve the openness of the generated dialogues and mimic real-life scenarios, we added the notion of speech events, corresponding to the type of conversation the speakers are involved in, and that of common ground, which represents the premises of a conversation.
- Anthology ID:
- 2025.sigdial-1.55
- Volume:
- Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue
- Month:
- August
- Year:
- 2025
- Address:
- Avignon, France
- Editors:
- Frédéric Béchet, Fabrice Lefèvre, Nicholas Asher, Seokhwan Kim, Teva Merlin
- Venue:
- SIGDIAL
- SIG:
- SIGDIAL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 697–749
- URL:
- https://preview.aclanthology.org/corrections-2025-10/2025.sigdial-1.55/
- Cite (ACL):
- Ahmed Njifenjou, Virgile Sucal, Bassam Jabaian, and Fabrice Lefèvre. 2025. Open-Source Large Language Models as Multilingual Crowdworkers: Synthesizing Open-Domain Dialogues in Several Languages With No Examples in Targets and No Machine Translation. In Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 697–749, Avignon, France. Association for Computational Linguistics.
- Cite (Informal):
- Open-Source Large Language Models as Multilingual Crowdworkers: Synthesizing Open-Domain Dialogues in Several Languages With No Examples in Targets and No Machine Translation (Njifenjou et al., SIGDIAL 2025)
- PDF:
- https://preview.aclanthology.org/corrections-2025-10/2025.sigdial-1.55.pdf