Abstract
Bilingual word lexicons map words in one language to their synonyms in another language. Numerous papers have explored bilingual lexicon induction (BLI) in high-resource scenarios, framing a typical pipeline that consists of two steps: (i) unsupervised bitext mining and (ii) unsupervised word alignment. At the core of those steps are pre-trained large language models (LLMs). In this paper we present the analysis of the BLI pipeline for German and two of its dialects, Bavarian and Alemannic. This setup poses a number of unique challenges, attributed to the scarceness of resources, relatedness of the languages and lack of standardization in the orthography of dialects. We analyze the BLI outputs with respect to word frequency and the pairwise edit distance. Finally, we release an evaluation dataset consisting of manual annotations for 1K bilingual word pairs labeled according to their semantic similarity.
- Anthology ID:
- 2023.nodalida-1.39
- Volume:
- Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
- Month:
- May
- Year:
- 2023
- Address:
- Tórshavn, Faroe Islands
- Editors:
- Tanel Alumäe, Mark Fishel
- Venue:
- NoDaLiDa
- Publisher:
- University of Tartu Library
- Pages:
- 371–385
- URL:
- https://aclanthology.org/2023.nodalida-1.39
- Cite (ACL):
- Ekaterina Artemova and Barbara Plank. 2023. Low-resource Bilingual Dialect Lexicon Induction with Large Language Models. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), pages 371–385, Tórshavn, Faroe Islands. University of Tartu Library.
- Cite (Informal):
- Low-resource Bilingual Dialect Lexicon Induction with Large Language Models (Artemova & Plank, NoDaLiDa 2023)
- PDF:
- https://preview.aclanthology.org/emnlp22-frontmatter/2023.nodalida-1.39.pdf