@inproceedings{gaim-etal-2022-geezswitch,
    title = "{G}eez{S}witch: Language Identification in Typologically Related Low-resourced {E}ast {A}frican Languages",
    author = "Gaim, Fitsum  and
      Yang, Wonsuk  and
      Park, Jong C.",
    editor = "Calzolari, Nicoletta  and
      B{\'e}chet, Fr{\'e}d{\'e}ric  and
      Blache, Philippe  and
      Choukri, Khalid  and
      Cieri, Christopher  and
      Declerck, Thierry  and
      Goggi, Sara  and
      Isahara, Hitoshi  and
      Maegaard, Bente  and
      Mariani, Joseph  and
      Mazo, H{\'e}l{\`e}ne  and
      Odijk, Jan  and
      Piperidis, Stelios",
    booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://preview.aclanthology.org/ingest-emnlp/2022.lrec-1.707/",
    pages = "6578--6584",
    abstract = "Language identification is one of the fundamental tasks in natural language processing that is a prerequisite to data processing and numerous applications. Low-resourced languages with similar typologies are generally confused with each other in real-world applications such as machine translation, affecting the user{'}s experience. In this work, we present a language identification dataset for five typologically and phylogenetically related low-resourced East African languages that use the Ge{'}ez script as a writing system; namely Amharic, Blin, Ge{'}ez, Tigre, and Tigrinya. The dataset is built automatically from selected data sources, but we also performed a manual evaluation to assess its quality. Our approach to constructing the dataset is cost-effective and applicable to other low-resource languages. We integrated the dataset into an existing language-identification tool and also fine-tuned several Transformer based language models, achieving very strong results in all cases. While the task of language identification is easy for the informed person, such datasets can make a difference in real-world deployments and also serve as part of a benchmark for language understanding in the target languages. The data and models are made available at \url{https://github.com/fgaim/geezswitch}."
}