2024
Automatic Speech Recognition for Gascon and Languedocian Variants of Occitan
Iñigo Morcillo | Igor Leturia | Ander Corral | Xabier Sarasola | Michaël Barret | Aure Séguier | Benaset Dazéas
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
This paper describes different approaches for developing, for the first time, an automatic speech recognition system for two of the main dialects of Occitan, namely Gascon and Languedocian, and the results obtained with them. The difficulty of the task lies in the fact that Occitan is a less-resourced language. Although a great effort has been made to collect or create corpora of each variant (transcribed speech recordings for the acoustic models and two text corpora for the language models), the sizes of the corpora obtained are far from those of successful systems reported in the literature, and thus we have tested different techniques to compensate for the lack of resources. We have developed classical systems using Kaldi, creating an acoustic model for each variant and also creating language models from the collected corpora and from machine-translated texts. We have also tried fine-tuning a Whisper model with our speech corpora. We report word error rates of 20.86 for Gascon and 13.52 for Languedocian with the Kaldi systems and 16.37 for Gascon and 11.74 for Languedocian with Whisper.
MULTILINGTOOL, Development of an Automatic Multilingual Subtitling and Dubbing System
Xabier Saralegi | Ander Corral | Igor Leturia | Xabier Sarasola | Josu Murua | Iker Manterola | Itziar Cortes
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 2)
In this paper, we present the MULTILINGTOOL project, led by the Elhuyar Foundation and funded by the European Commission under the CREA-MEDIA2022-INNOVBUSMOD call. The aim of the project is to develop an advanced platform for automatic multilingual subtitling and dubbing. It will provide support for Spanish, English, and French, as well as the co-official languages of Spain, namely Basque, Catalan, and Galician.
2020
Neural Text-to-Speech Synthesis for an Under-Resourced Language in a Diglossic Environment: the Case of Gascon Occitan
Ander Corral | Igor Leturia | Aure Séguier | Michaël Barret | Benaset Dazéas | Philippe Boula de Mareüil | Nicolas Quint
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)
Occitan is a minority language spoken in Southern France, some Alpine valleys of Italy, and the Val d’Aran in Spain, which only very recently started developing language and speech technologies. This paper describes the first project for designing a Text-to-Speech synthesis system for one of its main regional varieties, namely Gascon. We used a state-of-the-art deep neural network approach, the Tacotron2-WaveGlow system. However, we faced two additional challenges: on the one hand, we wanted to test whether it was possible to obtain good-quality results with fewer recording hours than is usually reported for such systems; on the other hand, we needed to achieve a standard, non-Occitan pronunciation of French proper names, so we needed to record French words and test phoneme-based approaches. The evaluation carried out over the various developed systems and approaches shows promising results with near production-ready quality. It has also allowed us to detect the phenomena for which some flaws or drops in quality occur, pointing to directions for future work to improve the quality of the current system and to build new systems for other language varieties and voices.
2018
Massively multilingual accessible audioguides via cell phones
Itziar Cortes | Igor Leturia | Iñaki Alegria | Aitzol Astigarraga | Kepa Sarasola | Manex Garaio
Proceedings of the 21st Annual Conference of the European Association for Machine Translation
Bidaide is a web service that allows visitors to a museum, route or building to read or listen to explanations of the place they are visiting on their own mobile phone and in their own language. The visitor can access the explanations in various ways: by scanning QR codes located on site, by GPS positioning (on outdoor routes), or by automatic Bluetooth proximity activation. This makes it accessible to people with low or no vision. The platform also offers the manager of the visited site the most advanced language resources to create the texts and audio of the explanations in many languages.
2012
Evaluating Different Methods for Automatically Collecting Large General Corpora for Basque from the Web
Igor Leturia
Proceedings of COLING 2012
2008
Analysis and Performance of Morphological Query Expansion and Language-Filtering Words on Basque Web Searching
Igor Leturia | Antton Gurrutxaga | Nerea Areta | Eli Pociello
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
Morphological query expansion and language-filtering words have proved to be valid methods when searching the web for content in Basque via APIs of commercial search engines, as the implementation of these methods in recent IR and web-as-corpus tools shows. However, no real analysis has been carried out to ascertain the degree of improvement, apart from a comparison of recall and precision using a classical web search engine and measured in terms of hit counts. This paper deals with a more theoretical study that confirms the validity of the combination of both methods. We have measured the increase in recall obtained by morphological query expansion and the increase in precision and loss in recall produced by language-filtering words, not only by searching the web directly and looking at hit counts, which are not considered very reliable at best, but also by using both a Basque web corpus and a classical lemmatised corpus, thus providing more exact quantitative results. Furthermore, we provide various corpus-extracted data to be used in the aforementioned methods, such as lists of the most frequent inflections and declensions (cases, persons, numbers, tenses, etc.) for each POS (the most interesting word forms for a morphologically expanded query), or a list of the most used Basque words with their frequencies and document frequencies (the ones that should be used as language-filtering words).
2006
Structure, Annotation and Tools in the Basque ZT Corpus
N. Areta | A. Gurrutxaga | I. Leturia | Z. Polin | R. Saiz | I. Alegria | X. Artola | A. Diaz de Ilarraza | N. Ezeiza | A. Sologaistoa | A. Soroa | A. Valverde
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
The ZT corpus (Basque Corpus of Science and Technology) is a tagged collection of specialized texts in Basque, which aims to be a major resource for research and development on written technical Basque: terminology, syntax and style. It will be the first written corpus in Basque to be distributed by ELDA (at the end of 2006), and it aims to be a methodological and functional reference for future projects (e.g. a national corpus for Basque). We also present the technology and the tools used to build this corpus. These tools, Corpusgile and Eulia, provide a flexible and extensible infrastructure for creating, visualizing and managing corpora and for consulting, visualizing and modifying annotations generated by linguistic tools.