Abstract
Large language models (LLMs) have rapidly evolved into the foundation of various natural language processing (NLP) applications. Despite their wide use, their understanding of culturally-related concepts and reasoning remains limited. Meanwhile, there is a significant need to enhance these models' cultural reasoning capabilities, especially for underrepresented regions. This paper introduces a novel pipeline for extracting high-quality, culturally-related instruction tuning datasets from vast unstructured corpora. We utilize a self-instruction generation pipeline to identify cultural concepts and to trigger instruction generation. By integrating with a general-purpose instruction tuning dataset, our model demonstrates enhanced capabilities in recognizing and understanding regional cultural nuances, thereby improving its reasoning capabilities. We conduct experiments across three regions: Singapore, the Philippines, and the United States, achieving performance improvements of up to 6%. Our research opens new avenues for extracting cultural instruction tuning sets directly from unstructured data, setting a precedent for future innovations in the field.
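The abstract describes the extraction pipeline only at a high level. The minimal Python sketch below illustrates one way the self-instruction step could look, assuming a keyword-seeded cultural concept matcher and an illustrative prompt template; the seed list, the `CulturalInstruction` type, and all function names are hypothetical stand-ins, not the authors' implementation.

```python
# Hypothetical sketch of a CRAFT-style pipeline: spot cultural concepts in raw
# text, then wrap each matching passage in a self-instruct prompt for an LLM.
# Seed concepts and the prompt template are illustrative assumptions.

from dataclasses import dataclass

SEED_CONCEPTS = {"hawker centre", "kopitiam", "jeepney", "barangay", "thanksgiving"}

@dataclass
class CulturalInstruction:
    concept: str
    context: str
    prompt: str

def extract_passages(corpus: list[str]) -> list[tuple[str, str]]:
    """Return (concept, passage) pairs for passages mentioning a seed concept."""
    hits = []
    for passage in corpus:
        lowered = passage.lower()
        for concept in SEED_CONCEPTS:
            if concept in lowered:
                hits.append((concept, passage))
    return hits

def build_self_instruct_prompt(concept: str, passage: str) -> str:
    """Wrap a passage in a template asking an LLM to write a culturally
    grounded instruction-response pair about the concept."""
    return (
        "Based on the passage below, write one instruction that tests "
        f"understanding of the cultural concept '{concept}', followed by "
        "a faithful answer grounded in the passage.\n\n"
        f"Passage: {passage}"
    )

def craft_instructions(corpus: list[str]) -> list[CulturalInstruction]:
    """Run concept extraction and prompt construction over a raw corpus."""
    return [
        CulturalInstruction(concept, passage, build_self_instruct_prompt(concept, passage))
        for concept, passage in extract_passages(corpus)
    ]

if __name__ == "__main__":
    demo_corpus = [
        "A kopitiam is a traditional coffee shop found across Singapore.",
        "The jeepney remains the most popular form of public transport in the Philippines.",
    ]
    for item in craft_instructions(demo_corpus):
        print(item.prompt, end="\n\n")
```

The generated prompts would then be sent to an LLM, and the resulting instruction-response pairs mixed with a general-purpose instruction tuning set, as the abstract outlines.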
- Anthology ID: 2024.c3nlp-1.4
- Volume: Proceedings of the 2nd Workshop on Cross-Cultural Considerations in NLP
- Month: August
- Year: 2024
- Address: Bangkok, Thailand
- Editors: Vinodkumar Prabhakaran, Sunipa Dev, Luciana Benotti, Daniel Hershcovich, Laura Cabello, Yong Cao, Ife Adebara, Li Zhou
- Venues: C3NLP | WS
- Publisher: Association for Computational Linguistics
- Pages: 42–47
- URL: https://aclanthology.org/2024.c3nlp-1.4
- Cite (ACL): Bin Wang, Geyu Lin, Zhengyuan Liu, Chengwei Wei, and Nancy Chen. 2024. CRAFT: Extracting and Tuning Cultural Instructions from the Wild. In Proceedings of the 2nd Workshop on Cross-Cultural Considerations in NLP, pages 42–47, Bangkok, Thailand. Association for Computational Linguistics.
- Cite (Informal): CRAFT: Extracting and Tuning Cultural Instructions from the Wild (Wang et al., C3NLP-WS 2024)
- PDF: https://preview.aclanthology.org/nschneid-patch-4/2024.c3nlp-1.4.pdf