Abstract
This study investigates the extent to which currently available morpheme parsers/taggers apply to lesser-studied languages and language-usage contexts, with a focus on second language (L2) Korean. We pursue this inquiry by (1) training a neural-network model (pre-trained on first language [L1] Korean data) on varying L2 datasets and (2) measuring its morpheme-parsing/POS-tagging performance on L2 test sets drawn both from the same sources as the L2 training sets and from different ones. Results show that the L2-trained models generally outperform the L1 pre-trained baseline model in domain-specific tokenization and POS tagging. Interestingly, increasing the size of the L2 training data does not consistently improve model performance.
- Anthology ID:
- 2023.findings-emnlp.767
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2023
- Month:
- December
- Year:
- 2023
- Address:
- Singapore
- Editors:
- Houda Bouamor, Juan Pino, Kalika Bali
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 11461–11473
- URL:
- https://aclanthology.org/2023.findings-emnlp.767
- DOI:
- 10.18653/v1/2023.findings-emnlp.767
- Cite (ACL):
- Hakyung Sung and Gyu-Ho Shin. 2023. Diversifying language models for lesser-studied languages and language-usage contexts: A case of second language Korean. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 11461–11473, Singapore. Association for Computational Linguistics.
- Cite (Informal):
- Diversifying language models for lesser-studied languages and language-usage contexts: A case of second language Korean (Sung & Shin, Findings 2023)
- PDF:
- https://preview.aclanthology.org/fix-volume-bibkeys/2023.findings-emnlp.767.pdf