Thai Word Segmentation a Lexical Semantic Approach

Krisda Khankasikam, Nuttanart Muansuwqan


Abstract
In Thai, word boundaries are not explicitly marked, so word segmentation is needed to determine word boundaries in Thai sentences. Many Thai language processing applications require word segmentation. Several existing approaches to Thai word segmentation, such as maximal matching, longest matching, and n-gram models, do not take semantics into consideration. This paper presents a Thai word segmentation system that uses a semantic corpus and consists of four steps: generating all possible candidates, proper noun consideration, semantic tagging, and semantic checking. The first three steps are conducted using a dictionary. Semantic checking is carried out using a corpus-based approach. Finally, we assign semantic scores to the segmented words and select the ones with the maximum semantic scores. To assign semantic scores, we use a Thai proper noun database and a semantic corpus derived from the ORCHID corpus. This approach is more reliable than approaches that do not take meaning into consideration, and it achieves an accuracy of 96-99%, depending on the characteristics of the input and the dictionary used in the segmentation.
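The pipeline described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the scoring formula, so the function names (`candidates`, `semantic_score`, `segment`), the toy lexicon, the semantic tag names, and the pair counts below are all hypothetical, chosen only to show how candidate generation followed by maximum-score selection might fit together.

```python
from typing import Dict, List, Set, Tuple


def candidates(text: str, lexicon: Set[str]) -> List[List[str]]:
    """Enumerate every way to split `text` into words found in `lexicon`."""
    if not text:
        return [[]]
    results: List[List[str]] = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            for rest in candidates(text[end:], lexicon):
                results.append([word] + rest)
    return results


def semantic_score(
    words: List[str],
    proper_nouns: Set[str],
    tags: Dict[str, str],
    pair_counts: Dict[Tuple[str, str], int],
) -> int:
    """Assumed scoring: one point per proper noun, plus the corpus count
    of each adjacent pair of semantic tags (hypothetical scheme)."""
    total = sum(1 for w in words if w in proper_nouns)
    for left, right in zip(words, words[1:]):
        pair = (tags.get(left, "UNK"), tags.get(right, "UNK"))
        total += pair_counts.get(pair, 0)
    return total


def segment(
    text: str,
    lexicon: Set[str],
    proper_nouns: Set[str],
    tags: Dict[str, str],
    pair_counts: Dict[Tuple[str, str], int],
) -> List[str]:
    """Return the candidate segmentation with the highest semantic score."""
    options = candidates(text, lexicon)
    if not options:
        return [text]  # no dictionary segmentation exists; leave unsegmented
    return max(
        options,
        key=lambda ws: semantic_score(ws, proper_nouns, tags, pair_counts),
    )


# Toy data (hypothetical): the string "ตากลม" can be read as
# "ตา" + "กลม" (eye + round) or "ตาก" + "ลม" (expose + wind).
lexicon = {"ตา", "กลม", "ตาก", "ลม"}
tags = {"ตา": "BODY_PART", "กลม": "SHAPE", "ตาก": "ACTION", "ลม": "NATURE"}
pair_counts = {("BODY_PART", "SHAPE"): 3, ("ACTION", "NATURE"): 1}

all_options = candidates("ตากลม", lexicon)
best = segment("ตากลม", lexicon, set(), tags, pair_counts)
print(all_options)
print(best)
```

With these made-up counts, the tag pair for "ตา กลม" is more frequent than the pair for "ตาก ลม", so the first reading is selected. A purely dictionary-based method such as longest matching would have no such basis for choosing between the two.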
Anthology ID:
2005.mtsummit-posters.2
Volume:
Proceedings of Machine Translation Summit X: Posters
Month:
September 13-15
Year:
2005
Address:
Phuket, Thailand
Venue:
MTSummit
Pages:
331–338
URL:
https://aclanthology.org/2005.mtsummit-posters.2
Cite (ACL):
Krisda Khankasikam and Nuttanart Muansuwqan. 2005. Thai Word Segmentation a Lexical Semantic Approach. In Proceedings of Machine Translation Summit X: Posters, pages 331–338, Phuket, Thailand.
Cite (Informal):
Thai Word Segmentation a Lexical Semantic Approach (Khankasikam & Muansuwqan, MTSummit 2005)
PDF:
https://preview.aclanthology.org/emnlp-22-attachments/2005.mtsummit-posters.2.pdf