A Little Linguistics Goes a Long Way: Unsupervised Segmentation with Limited Language Specific Guidance
Alexander Erdmann, Salam Khalifa, Mai Oudah, Nizar Habash, Houda Bouamor
Abstract
We present de-lexical segmentation, a linguistically motivated alternative to greedy or other unsupervised methods, requiring only minimal language-specific input. Our technique involves creating a small grammar of closed-class affixes, which can be written in a few hours. The grammar overgenerates analyses for word forms attested in a raw corpus, and these analyses are disambiguated based on features of the linguistic base proposed for each form. Extending the grammar to cover orthographic, morpho-syntactic, or lexical variation is simple, making it an ideal solution for challenging corpora with noisy, dialect-inconsistent, or otherwise non-standard content. In two evaluations, we consistently outperform competitive unsupervised baselines and approach the performance of state-of-the-art supervised models trained on large amounts of data, providing evidence for the value of linguistic input during preprocessing.
- Anthology ID:
- W19-4214
- Volume:
- Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology
- Month:
- August
- Year:
- 2019
- Address:
- Florence, Italy
- Venue:
- ACL
- SIG:
- SIGMORPHON
- Publisher:
- Association for Computational Linguistics
- Pages:
- 113–124
- URL:
- https://aclanthology.org/W19-4214
- DOI:
- 10.18653/v1/W19-4214
- Cite (ACL):
- Alexander Erdmann, Salam Khalifa, Mai Oudah, Nizar Habash, and Houda Bouamor. 2019. A Little Linguistics Goes a Long Way: Unsupervised Segmentation with Limited Language Specific Guidance. In Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 113–124, Florence, Italy. Association for Computational Linguistics.
- Cite (Informal):
- A Little Linguistics Goes a Long Way: Unsupervised Segmentation with Limited Language Specific Guidance (Erdmann et al., ACL 2019)
- PDF:
- https://preview.aclanthology.org/starsem-semeval-split/W19-4214.pdf
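
The pipeline outlined in the abstract (a small affix grammar overgenerates analyses, which are then disambiguated using the proposed base) can be illustrated with a minimal sketch. This is not the authors' implementation: the English-like affix lists, the function names `analyses` and `segment`, and the scoring rule (prefer the base proposed by the most word forms, then the longer base) are all assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical toy grammar of closed-class affixes (illustration only,
# not the grammar used in the paper).
PREFIXES = ["", "un", "re"]
SUFFIXES = ["", "s", "ed", "ing"]


def analyses(word, min_base=3):
    """Overgenerate every (prefix, base, suffix) split the toy grammar allows."""
    out = []
    for p in PREFIXES:
        for s in SUFFIXES:
            if word.startswith(p) and word.endswith(s):
                base = word[len(p):len(word) - len(s)]
                # Discard splits whose base is too short (or empty/overlapping).
                if len(base) >= min_base:
                    out.append((p, base, s))
    # Fall back to the unsegmented word if nothing was licensed.
    return out or [("", word, "")]


def segment(vocab):
    """Pick one analysis per word form using an assumed base-support score."""
    # Count how many distinct word forms propose each candidate base.
    base_support = Counter()
    for w in vocab:
        for base in {b for _, b, _ in analyses(w)}:
            base_support[base] += 1
    # Assumed disambiguation: most widely supported base, then longest base.
    return {
        w: max(analyses(w), key=lambda a: (base_support[a[1]], len(a[1])))
        for w in vocab
    }


vocab = ["play", "plays", "played", "replay", "replaying", "unplayed"]
result = segment(vocab)
print(result["replaying"])  # ('re', 'play', 'ing')
```

In this toy vocabulary, every word form has at least one analysis whose base is `play`, so that base outscores competitors such as `replay` or `played`. Extending coverage amounts to adding entries to the affix lists, which mirrors the abstract's point that the grammar is cheap to extend.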