Lexical Induction of Morphological and Orthographic Forms for Low-Resourced Languages

Taha Tobaili


Abstract
In this work we address the issue of high-degree lexical sparsity in non-standard languages under the severe constraint of resources too small to train recent powerful language models. We propose a new rule-based approach and use word embeddings to connect words with their inflectional and orthographic forms in a given corpus. Our case example is the low-resourced Lebanese dialect Arabizi. Arabizi is the name given to a new social transcription of spoken Arabic in Latin script. The term is a portmanteau of Araby (Arabic) and Englizi (English). It is an informal written language in which Arabs transcribe their dialectal mother tongue using Latin alphanumeric characters instead of Arabic script. For example, حبيبي (Ḥabībī, "my love") could be transcribed as 7abibi in Arabizi. We induced 175K forms from a list of 1.7K sentiment words. We evaluated this induction extrinsically on a sentiment-annotated dataset, raising lexicon coverage by 13% over the previous version. We named the new lexicon SenZi-Large and released it publicly.
Anthology ID:
2020.msr-1.5
Volume:
Proceedings of the Third Workshop on Multilingual Surface Realisation
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Editors:
Anya Belz, Bernd Bohnet, Thiago Castro Ferreira, Yvette Graham, Simon Mille, Leo Wanner
Venue:
MSR
SIG:
SIGGEN
Publisher:
Association for Computational Linguistics
Pages:
42–49
URL:
https://aclanthology.org/2020.msr-1.5
Cite (ACL):
Taha Tobaili. 2020. Lexical Induction of Morphological and Orthographic Forms for Low-Resourced Languages. In Proceedings of the Third Workshop on Multilingual Surface Realisation, pages 42–49, Barcelona, Spain (Online). Association for Computational Linguistics.
Cite (Informal):
Lexical Induction of Morphological and Orthographic Forms for Low-Resourced Languages (Tobaili, MSR 2020)
PDF:
https://preview.aclanthology.org/nschneid-patch-4/2020.msr-1.5.pdf