C.m. Downey
2023
Learning to translate by learning to communicate
C.m. Downey | Xuhui Zhou | Zeyu Liu | Shane Steinert-Threlkeld
Proceedings of the 3rd Workshop on Multi-lingual Representation Learning (MRL)
Embedding Structure Matters: Comparing Methods to Adapt Multilingual Vocabularies to New Languages
C.m. Downey | Terra Blevins | Nora Goldfine | Shane Steinert-Threlkeld
Proceedings of the 3rd Workshop on Multi-lingual Representation Learning (MRL)
2022
A Masked Segmental Language Model for Unsupervised Natural Language Segmentation
C.m. Downey | Fei Xia | Gina-Anne Levow | Shane Steinert-Threlkeld
Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology
We introduce a Masked Segmental Language Model (MSLM) for joint language modeling and unsupervised segmentation. While near-perfect supervised methods have been developed for segmenting human-like linguistic units in resource-rich languages such as Chinese, many of the world’s languages are both morphologically complex and lack large datasets of “gold” segmentations for supervised training. Segmental Language Models (SLMs) offer a unique approach by conducting unsupervised segmentation as the byproduct of a neural language modeling objective. However, current SLMs are limited in their scalability due to their recurrent architecture. We propose a new type of SLM for use in both unsupervised and lightly supervised segmentation tasks. The MSLM is built on a span-masking transformer architecture, harnessing a masked bidirectional modeling context and attention, as well as adding the potential for model scalability. In a series of experiments, our model outperforms recurrent SLMs in segmentation quality on Chinese, and performs similarly to the recurrent model on English.
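To illustrate how a segmentation can fall out of a language modeling objective, the sketch below decodes the highest-scoring segmentation of a character string with dynamic programming over candidate segment boundaries. It is a minimal illustration, not the paper's implementation: the function `viterbi_segment`, the `max_len` cap, and the toy scorer `toy_logprob` with its `TOY_LEXICON` are all hypothetical stand-ins, and in a real segmental language model the segment log-probabilities would come from the neural network rather than from a lookup table.

```python
import math
from typing import Callable, List, Tuple


def viterbi_segment(
    text: str,
    segment_logprob: Callable[[str], float],
    max_len: int = 4,
) -> Tuple[List[str], float]:
    """Return the highest-scoring segmentation of `text` and its log-probability."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score of any segmentation of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last segment ending at i
    best[0] = 0.0
    for end in range(1, n + 1):
        # Consider every segment of length <= max_len that ends at `end`.
        for start in range(max(0, end - max_len), end):
            score = best[start] + segment_logprob(text[start:end])
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Follow the back-pointers from the end of the string to recover the segments.
    segments: List[str] = []
    i = n
    while i > 0:
        segments.append(text[back[i]:i])
        i = back[i]
    segments.reverse()
    return segments, best[n]


# Hypothetical toy scorer standing in for a neural segment model (assumption).
TOY_LEXICON = {"the": 0.4, "cat": 0.3, "sat": 0.2}


def toy_logprob(segment: str) -> float:
    if segment in TOY_LEXICON:
        return math.log(TOY_LEXICON[segment])
    # Unknown segments receive a heavy per-character penalty.
    return len(segment) * math.log(0.001)


segments, score = viterbi_segment("thecatsat", toy_logprob)
print(segments)  # ['the', 'cat', 'sat']
```

Training such a model would typically replace the `max` over previous boundaries with a log-sum-exp, so that the objective marginalizes over all segmentations instead of committing to one; the decoding step shown here is then applied only at inference time.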
Co-authors
- Shane Steinert-Threlkeld 3
- Xuhui Zhou 1
- Zeyu Liu 1
- Terra Blevins 1
- Nora Goldfine 1