What data should I include in my POS tagging training set?
Zoey Liu, Masoud Jasbi, Christan Grant, Kenji Sagae, Emily Prud’hommeaux
Abstract
Building an NLP training set for understudied languages, including Indigenous and endangered languages, often faces challenges due to varying degrees of resource limitations in the speaker communities. What are some reasonable approaches for training set construction in these cases? We address this question with POS tagging as the test case. Although many might consider POS tagging “a solved problem”, it remains a crucial task for descriptive linguistics and language documentation and requires laborious manual annotation. Drawing data from 12 language families, we compare in-context learning, active learning (AL), and random sampling. Our results suggest: (1) for communities whose language data can be ethically shared with an API, using only 1,000 randomly sampled tokens as prompt examples, the proprietary GPT-4.1-mini can deliver desirable performance (F1>0.83) on par with that from a training set of thousands of tokens in AL iterations; (2) in cases where communities prefer not to share data, 4,500-5,500 tokens selected from AL can yield reasonable results at a pace statistically significantly faster than random sampling, evidenced by growth curve modeling.
- Anthology ID:
- 2025.findings-emnlp.448
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2025
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou, China
- Editors:
- Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 8439–8455
- URL:
- https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.448/
- DOI:
- 10.18653/v1/2025.findings-emnlp.448
- Cite (ACL):
- Zoey Liu, Masoud Jasbi, Christan Grant, Kenji Sagae, and Emily Prud’hommeaux. 2025. What data should I include in my POS tagging training set?. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 8439–8455, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- What data should I include in my POS tagging training set? (Liu et al., Findings 2025)
- PDF:
- https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.448.pdf
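The abstract contrasts active-learning (AL) selection with random sampling under an annotation token budget. The sketch below illustrates the general shape of such a selection step; the data, function names, and the placeholder confidence scorer are toy assumptions for illustration, not the authors' actual implementation.

```python
# Toy sketch: selecting sentences for POS annotation under a token
# budget, comparing uncertainty-style AL ordering vs. random sampling.
# The confidence function is a placeholder, NOT a real tagger.
import random

def tagger_confidence(sentence):
    """Stand-in for a trained tagger's mean per-token confidence.
    A real AL loop would query the current model's probabilities."""
    rng = random.Random(sentence)  # deterministic per sentence
    return rng.random()

def select_batch(unlabeled, budget_tokens, strategy="al"):
    """Pick sentences to annotate until the token budget is spent.
    strategy='al': least-confident sentences first (uncertainty sampling).
    strategy='random': the random-sampling baseline."""
    pool = list(unlabeled)
    if strategy == "al":
        pool.sort(key=tagger_confidence)  # least confident first
    else:
        random.shuffle(pool)
    chosen, spent = [], 0
    for sent in pool:
        n = len(sent.split())  # crude whitespace token count
        if spent + n > budget_tokens:
            break
        chosen.append(sent)
        spent += n
    return chosen, spent

# Tiny synthetic pool: 200 sentences of 7 tokens each.
sentences = [f"toy sentence number {i} with several words" for i in range(200)]
batch, used = select_batch(sentences, budget_tokens=100, strategy="al")
```

In a real setting, the tagger would be retrained after each annotated batch and the confidence scores refreshed, which is what makes AL iterative; the paper's finding is that this loop reaches usable F1 with noticeably fewer annotated tokens than drawing the same budget at random.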