Modeling infant segmentation of two morphologically diverse languages

Georgia-Rengina Loukatou, Sabine Stoll, Damian Blasi, Alejandrina Cristia


Abstract
A rich literature explores unsupervised segmentation algorithms that infants could use to parse their input, focusing mainly on English, an analytic language in which word, morpheme, and syllable boundaries often coincide. Synthetic languages, in which words are multi-morphemic, may present unique difficulties for segmentation. Our study tests corpora of two languages, Chintang and Japanese, selected to differ in the complexity of their morphological structure. We use three conceptually diverse word segmentation algorithms and evaluate them on both word- and morpheme-level representations. As predicted, results for the morphologically simpler Japanese are better than those for the more complex Chintang. However, the difference is small compared to the effect of the algorithm (the lexical algorithm outperforms the sub-lexical ones) and of the evaluation level (scores are lower when evaluating on words than on morphemes). There are also important interactions between language, model, and evaluation level, which ought to be considered in future work.
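To make the distinction between word-level and morpheme-level evaluation concrete, here is a minimal sketch of how one predicted segmentation can score differently against two gold standards. It assumes a boundary F-score as the metric; the abstract does not state which metric the paper uses. The function names and the English toy utterance are invented for illustration and are not taken from the Chintang or Japanese corpora.

```python
def boundaries(units):
    """Return the set of internal boundary positions for a list of units."""
    positions, offset = set(), 0
    for unit in units[:-1]:  # the final unit ends at the utterance edge
        offset += len(unit)
        positions.add(offset)
    return positions


def boundary_f_score(predicted, gold):
    """Harmonic mean of boundary precision and recall (assumed metric)."""
    predicted_b, gold_b = boundaries(predicted), boundaries(gold)
    hits = len(predicted_b & gold_b)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted_b)
    recall = hits / len(gold_b)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy utterance "thedogsbarked", segmented three ways.
gold_words = ["the", "dogs", "barked"]
gold_morphemes = ["the", "dog", "s", "bark", "ed"]
predicted = ["the", "dog", "s", "barked"]

word_f = boundary_f_score(predicted, gold_words)
morpheme_f = boundary_f_score(predicted, gold_morphemes)
print(f"word-level F = {word_f:.3f}, morpheme-level F = {morpheme_f:.3f}")
```

In this toy case the extra boundary inside "dogs" counts as an error against the word-level gold standard but as a hit against the morpheme-level one. The example only illustrates the mechanics of the two evaluation levels and says nothing about the paper's actual results.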
Anthology ID:
2018.jeptalnrecital-long.4
Volume:
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN
Month:
May
Year:
2018
Address:
Rennes, France
Venue:
JEP/TALN/RECITAL
Publisher:
ATALA
Pages:
47–60
URL:
https://aclanthology.org/2018.jeptalnrecital-long.4
Cite (ACL):
Georgia-Rengina Loukatou, Sabine Stoll, Damian Blasi, and Alejandrina Cristia. 2018. Modeling infant segmentation of two morphologically diverse languages. In Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN, pages 47–60, Rennes, France. ATALA.
Cite (Informal):
Modeling infant segmentation of two morphologically diverse languages (Loukatou et al., JEP/TALN/RECITAL 2018)
PDF:
https://preview.aclanthology.org/auto-file-uploads/2018.jeptalnrecital-long.4.pdf