Automatic language identity tagging on word and sentence-level in multilingual text sources: a case-study on Luxembourgish

Thomas Lavergne, Gilles Adda, Martine Adda-Decker, Lori Lamel


Abstract
Luxembourgish, embedded in a multilingual context on the divide between Romance and Germanic cultures, remains one of Europe’s under-described languages. This is because written production remains relatively low, and linguistic knowledge and resources, such as lexica and pronunciation dictionaries, are sparse. Speakers and writers frequently switch between Luxembourgish, German, and French, both from one sentence to the next and within a sentence. In order to build resources such as lexicons, especially pronunciation lexicons, or the language models needed for natural language processing tasks such as automatic speech recognition, the language used in text corpora must be identified. In this paper, we present the design of a manually annotated corpus of mixed-language sentences as well as the tools used to select these sentences. This corpus of difficult sentences was used to test a word-based language identification system. The language identification system was then used to select textual data extracted from the web, in order to build a lexicon and language models. The resulting lexicon and language models were used in an automatic speech recognition system for Luxembourgish, which obtains a 25% word error rate (WER) on the Quaero development data.
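To make the idea of word-level and sentence-level language tagging concrete, the following is a minimal sketch and not the system described in the paper, whose details the abstract does not give. It assumes a plain lexicon-lookup approach: each word is tagged with the one language whose word list contains it, and a sentence is labelled by majority vote. The word lists, the language codes `lb`, `de`, `fr`, and the function names `tag_words`, `tag_sentence`, and `is_mixed` are all hypothetical and chosen for illustration only.

```python
from collections import Counter

# Tiny illustrative word lists (hypothetical; a real system would rely on
# large lexica or statistical models rather than a handful of words).
LEXICA = {
    "lb": {"ech", "sinn", "e", "an", "net", "mir", "hunn", "dat", "ass"},
    "de": {"ich", "bin", "ein", "und", "nicht", "wir", "haben", "das", "ist"},
    "fr": {"je", "suis", "un", "et", "pas", "nous", "avons", "le", "est"},
}


def tag_words(sentence):
    """Tag each token with the single language whose lexicon contains it."""
    tags = []
    for token in sentence.lower().split():
        word = token.strip(".,;:!?\"'()")
        matches = [lang for lang, words in LEXICA.items() if word in words]
        # Unseen words and words found in several lexica are left as 'unk'.
        tags.append((word, matches[0] if len(matches) == 1 else "unk"))
    return tags


def tag_sentence(sentence):
    """Label a sentence by majority vote over its unambiguous word tags."""
    counts = Counter(tag for _, tag in tag_words(sentence) if tag != "unk")
    if not counts:
        return "unk"
    return counts.most_common(1)[0][0]


def is_mixed(sentence):
    """Report whether the tagged words belong to more than one language."""
    langs = {tag for _, tag in tag_words(sentence) if tag != "unk"}
    return len(langs) > 1
```

Under these assumptions, `is_mixed` could serve as a crude filter for picking out candidate mixed-language sentences, while `tag_sentence` could filter web text down to sentences that are predominantly Luxembourgish. Ties in the majority vote are not handled specially in this sketch.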
Anthology ID:
L14-1573
Volume:
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month:
May
Year:
2014
Address:
Reykjavik, Iceland
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
3300–3304
URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/732_Paper.pdf
Cite (ACL):
Thomas Lavergne, Gilles Adda, Martine Adda-Decker, and Lori Lamel. 2014. Automatic language identity tagging on word and sentence-level in multilingual text sources: a case-study on Luxembourgish. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 3300–3304, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal):
Automatic language identity tagging on word and sentence-level in multilingual text sources: a case-study on Luxembourgish (Lavergne et al., LREC 2014)
PDF:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/732_Paper.pdf