ThaiLMCut: Unsupervised Pretraining for Thai Word Segmentation
Suteera Seeha, Ivan Bilan, Liliana Mamani Sanchez, Johannes Huber, Michael Matuschek, Hinrich Schütze
Abstract
We propose ThaiLMCut, a semi-supervised approach for Thai word segmentation which utilizes a bi-directional character language model (LM) as a way to leverage useful linguistic knowledge from unlabeled data. After the language model is trained on substantial unlabeled corpora, the weights of its embedding and recurrent layers are transferred to a supervised word segmentation model, which continues fine-tuning them on a word segmentation task. Our experimental results demonstrate that applying the LM always leads to a performance gain, especially when the amount of labeled data is small. In such cases, the F1 Score increased by up to 2.02%. Even on a big labeled dataset, a small improvement can still be obtained. The approach has also been shown to be very beneficial for out-of-domain settings, with a gain in F1 Score of up to 3.13%. Finally, we show that ThaiLMCut can outperform other open source state-of-the-art models, achieving an F1 Score of 98.78% on the standard benchmark, InterBEST2009.
- Anthology ID:
- 2020.lrec-1.858
- Volume:
- Proceedings of the Twelfth Language Resources and Evaluation Conference
- Month:
- May
- Year:
- 2020
- Address:
- Marseille, France
- Editors:
- Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
- Venue:
- LREC
- Publisher:
- European Language Resources Association
- Pages:
- 6947–6957
- Language:
- English
- URL:
- https://aclanthology.org/2020.lrec-1.858
- Cite (ACL):
- Suteera Seeha, Ivan Bilan, Liliana Mamani Sanchez, Johannes Huber, Michael Matuschek, and Hinrich Schütze. 2020. ThaiLMCut: Unsupervised Pretraining for Thai Word Segmentation. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6947–6957, Marseille, France. European Language Resources Association.
- Cite (Informal):
- ThaiLMCut: Unsupervised Pretraining for Thai Word Segmentation (Seeha et al., LREC 2020)
- PDF:
- https://preview.aclanthology.org/ingest-acl-2023-videos/2020.lrec-1.858.pdf
- Code:
- meanna/ThaiLMCUT
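
The weight transfer described in the abstract (embedding and recurrent layers copied from the pretrained character LM into the segmentation model, everything else left as initialized) can be illustrated with a minimal, framework-free sketch. This is not the authors' implementation: parameters are represented as plain dictionaries, and every layer name used here (`embedding`, `rnn`, `lm_head`, `tagger`) as well as the helper `transfer_weights` is a hypothetical name chosen for illustration.

```python
import copy

# Hypothetical name prefixes of the layers shared by both models.
TRANSFERRED_PREFIXES = ("embedding.", "rnn.")


def transfer_weights(lm_params, seg_params, prefixes=TRANSFERRED_PREFIXES):
    """Return a copy of seg_params whose shared layers start from lm_params.

    Only parameters whose name starts with one of `prefixes` and that also
    exist in the segmentation model are copied; all others are left alone.
    """
    result = copy.deepcopy(seg_params)
    for name, value in lm_params.items():
        if name.startswith(prefixes) and name in result:
            result[name] = copy.deepcopy(value)
    return result


# Toy "pretrained" language model parameters (values are made up).
lm_params = {
    "embedding.weight": [0.1, 0.2],
    "rnn.weight": [0.3, 0.4],
    "lm_head.weight": [0.5, 0.6],  # LM output layer, not transferred
}

# Toy segmentation model before transfer (values are made up).
seg_params = {
    "embedding.weight": [0.0, 0.0],
    "rnn.weight": [0.0, 0.0],
    "tagger.weight": [0.9, 0.9],  # boundary classifier, keeps its own init
}

init_params = transfer_weights(lm_params, seg_params)
```

After the call, `init_params` holds the LM values for the shared layers and the original value for the segmentation-specific layer; fine-tuning on labeled data would then update all of them.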