What data should I include in my POS tagging training set?

Zoey Liu, Masoud Jasbi, Christan Grant, Kenji Sagae, Emily Prud’hommeaux


Abstract
Building an NLP training set for understudied languages, including Indigenous and endangered languages, often faces challenges due to varying degrees of resource limitations in the speaker communities. What are some reasonable approaches for training set construction in these cases? We address this question with POS tagging as the test case. Although many might consider POS tagging “a solved problem”, it remains a crucial task for descriptive linguistics and language documentation and requires laborious manual annotation. Drawing data from 12 language families, we compare in-context learning, active learning (AL), and random sampling. Our results suggest: (1) for communities whose language data can be ethically shared with an API, using only 1,000 randomly sampled tokens as prompt examples, the proprietary GPT-4.1-mini can deliver desirable performance (F1>0.83) on par with that from a training set of thousands of tokens in AL iterations; (2) in cases where communities prefer not to share data, 4,500-5,500 tokens selected from AL can yield reasonable results at a pace statistically significantly faster than random sampling, evidenced by growth curve modeling.
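The contrast the abstract draws between AL selection and random sampling can be illustrated with a minimal sketch. All names, the per-sentence uncertainty scores, and the uncertainty-ranking criterion below are illustrative assumptions, not the paper's actual method or data; the point is only the shape of the two budget-constrained selection loops.

```python
import random

# Toy pool: 100 hypothetical sentences, each 10 tokens long, with a
# made-up model-uncertainty score attached (assumption for illustration).
random.seed(0)
pool = [{"id": i, "tokens": 10, "uncertainty": random.random()} for i in range(100)]

def select(pool, budget_tokens, order):
    """Greedily take sentences in the given order until the token budget is met."""
    chosen, used = [], 0
    for sent in order:
        if used + sent["tokens"] > budget_tokens:
            break
        chosen.append(sent)
        used += sent["tokens"]
    return chosen

def random_sample(pool, budget_tokens):
    """Baseline: draw sentences in a random order."""
    return select(pool, budget_tokens, random.sample(pool, len(pool)))

def active_sample(pool, budget_tokens):
    """AL-style: prioritize sentences the current tagger is least sure about."""
    return select(pool, budget_tokens, sorted(pool, key=lambda s: -s["uncertainty"]))

al = active_sample(pool, budget_tokens=200)
rnd = random_sample(pool, budget_tokens=200)
print(len(al), len(rnd))  # 20 20: both fill the same 200-token budget
```

Both strategies spend the same annotation budget; they differ only in which sentences reach the annotator, which is what the paper's growth curve modeling compares across iterations.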
Anthology ID:
2025.findings-emnlp.448
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
8439–8455
URL:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.448/
DOI:
10.18653/v1/2025.findings-emnlp.448
Cite (ACL):
Zoey Liu, Masoud Jasbi, Christan Grant, Kenji Sagae, and Emily Prud’hommeaux. 2025. What data should I include in my POS tagging training set?. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 8439–8455, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
What data should I include in my POS tagging training set? (Liu et al., Findings 2025)
PDF:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.448.pdf
Checklist:
2025.findings-emnlp.448.checklist.pdf