Word Segmentation of Vietnamese Texts: a Comparison of Approaches
Quang Thắng Đinh, Hồng Phương Lê, Thị Minh Huyền Nguyễn, Cẩm Tú Nguyễn, Mathias Rossignol, Xuân Lương Vũ
Abstract
We present in this paper a comparison between three segmentation systems for the Vietnamese language. Indeed, the majority of Vietnamese words is built by semantic composition from about 7,000 syllables, which also have a meaning as isolated words. So the identification of word boundaries in a text is not a simple task, and ambiguities often appear. Beyond the presentation of the tested systems, we also propose a standard definition for word segmentation in Vietnamese, and introduce a reference corpus developed for the purpose of evaluating such a task. The results observed confirm that it can be relatively well treated by automatic means, although a solution needs to be found to take into account out-of-vocabulary words.
- Anthology ID:
- L08-1355
- Volume:
- Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
- Month:
- May
- Year:
- 2008
- Address:
- Marrakech, Morocco
- Editors:
- Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Daniel Tapias
- Venue:
- LREC
- Publisher:
- European Language Resources Association (ELRA)
- URL:
- http://www.lrec-conf.org/proceedings/lrec2008/pdf/493_paper.pdf
- Cite (ACL):
- Quang Thắng Đinh, Hồng Phương Lê, Thị Minh Huyền Nguyễn, Cẩm Tú Nguyễn, Mathias Rossignol, and Xuân Lương Vũ. 2008. Word Segmentation of Vietnamese Texts: a Comparison of Approaches. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco. European Language Resources Association (ELRA).
- Cite (Informal):
- Word Segmentation of Vietnamese Texts: a Comparison of Approaches (Đinh et al., LREC 2008)