Soh-eun Shim


Fixing paper assignments

  1. Please select all papers that belong to the same person.
  2. Indicate below which author they should be assigned to.
Provide a valid ORCID iD here. This will be used to match future papers to this author.
Provide the name of the school or the university where the author has received or will receive their highest degree (e.g., Ph.D. institution for researchers, or current affiliation for students). This will be used to form the new author page ID, if needed.

TODO: "submit" and "cancel" buttons here


2023

pdf bib
VarDial in the Wild: Industrial Applications of LID Systems for Closely-Related Language Varieties
Fritz Hohl | Soh-eun Shim
Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)

This report describes first an industrial use case for identifying closely related languages, e.g.dialects, namely the detection of languages of movie subtitle documents. We then presenta 2-stage architecture that is able to detect macrolanguages in the first stage and languagevariants in the second. Using our architecture, we participated in the DSL-TL Shared Task of the VarDial 2023 workshop. We describe the results of our experiments. In the first experiment we report an accuracy of 97.8% on a set of 460 subtitle files. In our second experimentwe used DSL-TL data and achieve a macroaverage F1 of 76% for the binary task, and 54% for the three-way task in the dev set. In the open track, we augment the data with named entities retrieved from Wikidata and achieve minor increases of about 1% for both tracks.