Bridge-Language Capitalization Inference in Western Iranian: Sorani, Kurmanji, Zazaki, and Tajik

Patrick Littell, David R. Mortensen, Kartik Goyal, Chris Dyer, Lori Levin


Abstract
In Sorani Kurdish, one of the most useful orthographic features in named-entity recognition – capitalization – is absent, as the language’s Perso-Arabic script does not make a distinction between uppercase and lowercase letters. We describe a system for deriving an inferred capitalization value from closely related languages by phonological similarity, and illustrate the system using several related Western Iranian languages.
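The general idea in the abstract can be illustrated with a minimal sketch. It is not the authors' system: it assumes the target-language token has already been romanized, it substitutes plain Levenshtein distance over romanized strings for the phonological similarity measure, and the names `edit_distance`, `infer_capitalization`, `bridge_words`, and `max_dist`, along with the toy word list, are hypothetical choices made for illustration only.

```python
from typing import Iterable, Optional


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (a stand-in for phonological similarity)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def infer_capitalization(token: str,
                         bridge_words: Iterable[str],
                         max_dist: int = 1) -> Optional[bool]:
    """Borrow a capitalization value from the closest bridge-language word.

    `token` is a romanized word from the caseless language (assumption).
    `bridge_words` are surface forms from a related language written in a
    cased script. Returns True or False according to whether the closest
    word within `max_dist` starts with an uppercase letter, or None when
    no bridge word is close enough.
    """
    best_word, best_dist = None, max_dist + 1
    for word in bridge_words:
        d = edit_distance(token.lower(), word.lower())
        if d < best_dist:
            best_word, best_dist = word, d
    if best_word is None:
        return None
    return best_word[0].isupper()


# Toy, hand-picked bridge-language forms (illustrative, not from the paper).
bridge_words = ["Bexda", "bajar", "Kurdistan", "av"]

result_exact = infer_capitalization("bexda", bridge_words)
result_near = infer_capitalization("bajer", bridge_words)
result_none = infer_capitalization("xyzxyz", bridge_words)
```

In this sketch an exact or near match to a capitalized bridge-language form yields `True`, a near match to a lowercase form yields `False`, and a token with no sufficiently close neighbor yields `None`, which a downstream named-entity recognizer could treat as a missing feature.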
Anthology ID:
L16-1529
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
3318–3324
URL:
https://aclanthology.org/L16-1529
Cite (ACL):
Patrick Littell, David R. Mortensen, Kartik Goyal, Chris Dyer, and Lori Levin. 2016. Bridge-Language Capitalization Inference in Western Iranian: Sorani, Kurmanji, Zazaki, and Tajik. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 3318–3324, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
Bridge-Language Capitalization Inference in Western Iranian: Sorani, Kurmanji, Zazaki, and Tajik (Littell et al., LREC 2016)
PDF:
https://preview.aclanthology.org/paclic-22-ingestion/L16-1529.pdf