Abstract
Text classification tends to be difficult when manually labeled text corpora are scarce. In low-resource agglutinative languages such as Uyghur, Kazakh, and Kyrgyz (UKK languages), words are formed by concatenating stems with several suffixes, and stems serve as the representation of text content. This property permits a virtually unlimited derivational vocabulary, which leads to high variability in written forms and a large number of redundant features. The major challenges of low-resource agglutinative text classification are therefore the lack of labeled data in the target domain and the morphological diversity of derivations in these language structures. Fine-tuning a pre-trained language model is an effective solution, as it provides meaningful and easy-to-use feature extractors for downstream text classification tasks. To this end, we propose AgglutiFiT, a low-resource agglutinative language model fine-tuning approach. Specifically, we build a low-noise fine-tuning dataset through morphological analysis and stem extraction, and then fine-tune a cross-lingual pre-trained model on this dataset. Moreover, we propose an attention-based fine-tuning strategy that better selects relevant semantic and syntactic information from the pre-trained language model and uses these features for downstream text classification tasks. We evaluate our methods on nine Uyghur, Kazakh, and Kyrgyz classification datasets, on which they significantly outperform several strong baselines.
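As a rough illustration of the attention-based feature selection described in the abstract, the sketch below shows one plausible way to pool token representations from a pre-trained cross-lingual encoder with learned attention weights before a classification head. This is a hypothetical PyTorch reconstruction, not the authors' implementation; the module name, tensor shapes, and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Hypothetical attention-based pooling over encoder hidden states
    for downstream text classification (illustrative, not the paper's code)."""

    def __init__(self, hidden_size: int, num_classes: int):
        super().__init__()
        # Scores each token representation; softmax turns scores into weights.
        self.score = nn.Linear(hidden_size, 1)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, hidden_states: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: [batch, seq_len, hidden] from a pre-trained encoder
        # mask: [batch, seq_len], 1 for real (e.g., stem) tokens, 0 for padding
        scores = self.score(hidden_states).squeeze(-1)          # [batch, seq_len]
        scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)   # [batch, seq_len, 1]
        pooled = (weights * hidden_states).sum(dim=1)           # [batch, hidden]
        return self.classifier(pooled)                          # [batch, num_classes]

if __name__ == "__main__":
    # Random tensors stand in for encoder outputs in this sketch.
    model = AttentivePooling(hidden_size=768, num_classes=5)
    h = torch.randn(2, 16, 768)
    m = torch.ones(2, 16)
    print(model(h, m).shape)  # torch.Size([2, 5])
```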
- Anthology ID:
- 2020.ccl-1.92
- Volume:
- Proceedings of the 19th Chinese National Conference on Computational Linguistics
- Month:
- October
- Year:
- 2020
- Address:
- Haikou, China
- Editors:
- Maosong Sun (孙茂松), Sujian Li (李素建), Yue Zhang (张岳), Yang Liu (刘洋)
- Venue:
- CCL
- SIG:
- Publisher:
- Chinese Information Processing Society of China
- Note:
- Pages:
- 994–1005
- Language:
- English
- URL:
- https://aclanthology.org/2020.ccl-1.92
- DOI:
- Cite (ACL):
- Xiuhong Li, Zhe Li, Jiabao Sheng, and Wushour Slamu. 2020. Low-Resource Text Classification via Cross-lingual Language Model Fine-tuning. In Proceedings of the 19th Chinese National Conference on Computational Linguistics, pages 994–1005, Haikou, China. Chinese Information Processing Society of China.
- Cite (Informal):
- Low-Resource Text Classification via Cross-lingual Language Model Fine-tuning (Li et al., CCL 2020)
- PDF:
- https://preview.aclanthology.org/naacl-24-ws-corrections/2020.ccl-1.92.pdf