IndoPref: A Multi-Domain Pairwise Preference Dataset for Indonesian
Vanessa Rebecca Wiyono, David Anugraha, Ayu Purwarianti, Genta Indra Winata
Abstract
Over 200 million people speak Indonesian, yet the language remains significantly underrepresented in preference-based research for large language models (LLMs). Most existing multilingual datasets are derived from English translations, often resulting in content that lacks cultural and linguistic authenticity. To address this gap, we introduce IndoPref, the first fully human-authored and multi-domain Indonesian preference dataset designed to evaluate the naturalness and quality of LLM-generated text. The dataset contains 522 prompts and yields 4,099 human-annotated pairwise preferences from comparisons across five instruction-tuned LLMs. All annotations are natively written in Indonesian with strong inter-annotator agreement, measured by Krippendorff’s alpha. Our benchmark spans 10 diverse categories, enabling practitioners to identify LLMs’ fine-grained strengths and weaknesses.
- Anthology ID:
- 2025.ijcnlp-short.12
- Volume:
- Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
- Month:
- December
- Year:
- 2025
- Address:
- Mumbai, India
- Editors:
- Kentaro Inui, Sakriani Sakti, Haofen Wang, Derek F. Wong, Pushpak Bhattacharyya, Biplab Banerjee, Asif Ekbal, Tanmoy Chakraborty, Dhirendra Pratap Singh
- Venues:
- IJCNLP | AACL
- Publisher:
- The Asian Federation of Natural Language Processing and The Association for Computational Linguistics
- Pages:
- 128–138
- URL:
- https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.ijcnlp-short.12/
- Cite (ACL):
- Vanessa Rebecca Wiyono, David Anugraha, Ayu Purwarianti, and Genta Indra Winata. 2025. IndoPref: A Multi-Domain Pairwise Preference Dataset for Indonesian. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 128–138, Mumbai, India. The Asian Federation of Natural Language Processing and The Association for Computational Linguistics.
- Cite (Informal):
- IndoPref: A Multi-Domain Pairwise Preference Dataset for Indonesian (Wiyono et al., IJCNLP-AACL 2025)
- PDF:
- https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.ijcnlp-short.12.pdf