2024
MultiLeg: Dataset for Text Sanitisation in Less-resourced Languages
Rinalds Vīksna | Inguna Skadiņa
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Text sanitization is the task of detecting and removing personal information from text. While it has been well studied in monolingual settings, there is now also a need for multilingual text sanitization. In this paper, we introduce MultiLeg: a parallel, multilingual named entity (NE) dataset consisting of documents from the Court of Justice of the European Union annotated with semantic categories suitable for text sanitization. The dataset is available in 8 languages, and it contains 3082 parallel text segments for each language. We also show that the pseudonymized dataset remains useful for downstream tasks.
2023
Large Language Models for Multilingual Slavic Named Entity Linking
Rinalds Vīksna | Inguna Skadiņa | Daiga Deksne | Roberts Rozis
Proceedings of the 9th Workshop on Slavic Natural Language Processing 2023 (SlavicNLP 2023)
This paper describes our submission to the 4th Shared Task on SlavNER for three Slavic languages: Czech, Polish and Russian. We use the pre-trained multilingual XLM-R language model (Conneau et al., 2020) and fine-tune it for the three Slavic languages using the datasets provided by the organizers. Our multilingual NER model achieves an F-score of 0.896 on all corpora, with the best result for Czech (0.914) and the worst for Russian (0.880). Our cross-language entity linking module achieves an F-score of 0.669 in the official SlavNER 2023 evaluation.
2022
Assessing Multilinguality of Publicly Accessible Websites
Rinalds Vīksna | Inguna Skadiņa | Raivis Skadiņš | Andrejs Vasiļjevs | Roberts Rozis
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Although information on the Internet can be shared in many languages, the language presence on the World Wide Web is very disproportionate. The problem of multilingualism on the Web, in particular access, availability and quality of information in the world’s languages, has been the subject of UNESCO focus for several decades. Making European websites more multilingual is also one of the focal targets of the Connecting Europe Facility Automated Translation (CEF AT) digital service infrastructure. To monitor this goal, alongside other possible solutions, CEF AT needs a methodology and an easy-to-use tool to assess the degree of multilingualism of a given website. In this paper we investigate methods and tools that automatically analyse the language diversity of the Web and propose indicators and a methodology for measuring the multilingualism of European websites. We also introduce a prototype tool based on open-source software that helps to assess the multilingualism of the Web and can be run independently at set intervals. Finally, we present initial results obtained with our tool, which allow us to conclude that multilingualism on the Web is still a problem not only at the world level, but also at the European and regional level.
Latvian National Corpora Collection – Korpuss.lv
Baiba Saulite | Roberts Darģis | Normunds Gruzitis | Ilze Auzina | Kristīne Levāne-Petrova | Lauma Pretkalniņa | Laura Rituma | Peteris Paikens | Arturs Znotins | Laine Strankale | Kristīne Pokratniece | Ilmārs Poikāns | Guntis Barzdins | Inguna Skadiņa | Anda Baklāne | Valdis Saulespurēns | Jānis Ziediņš
Proceedings of the Thirteenth Language Resources and Evaluation Conference
LNCC is a diverse collection of Latvian language corpora representing both written and spoken language and is useful for both linguistic research and language modelling. The collection is intended to cover diverse Latvian language use cases and all the important text types and genres (e.g. news, social media, blogs, books, scientific texts, debates, essays, etc.), taking into account both quality and size aspects. To reach this objective, LNCC is a continuous multi-institutional and multi-project effort, supported by the Digital Humanities and Language Technology communities in Latvia. LNCC includes a broad range of Latvian texts from the Latvian National Library, Culture Information Systems Centre, Latvian National News Agency, Latvian Parliament, Latvian web crawl, various Latvian publishers, and from the Latvian language corpora created by the Institute of Mathematics and Computer Science and its partners, including spoken language corpora. All corpora of LNCC are re-annotated with a uniform morpho-syntactic annotation scheme, which enables federated search and consistent linguistic analysis across all the LNCC corpora and makes it easier to select and mix various corpora for pre-training large Latvian language models like BERT and GPT.
2021
Domain Expert Platform for Goal-Oriented Dialog Collection
Didzis Goško | Arturs Znotins | Inguna Skadina | Normunds Gruzitis | Gunta Nešpore-Bērzkalne
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations
Today, most dialogue systems are fully or partly built using neural network architectures. A crucial prerequisite for the creation of a goal-oriented neural network dialogue system is a dataset that represents typical dialogue scenarios and includes various semantic annotations, e.g. intents, slots and dialogue actions, that are necessary for training a particular neural network architecture. In this demonstration paper, we present an easy-to-use interface and its back-end, oriented to domain experts, for the collection of goal-oriented dialogue samples. The platform not only allows sample dialogues to be collected or written in a structured way, but also provides a means for simple annotation and interpretation of the dialogues. The platform itself is language-independent; it depends only on the availability of particular language processing components for a specific language. It is currently being used to collect dialogue samples in Latvian (a highly inflected language) which represent typical communication between students and the student service.
Multilingual Slavic Named Entity Recognition
Rinalds Vīksna | Inguna Skadina
Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing
Named entity recognition, in particular for morphologically rich languages, is a challenging task due to the richness of inflected forms and ambiguity. This challenge is being addressed by the SlavNER Shared Task. In this paper we describe the system submitted to this task. Our system uses a pre-trained multilingual BERT language model and is fine-tuned for the six Slavic languages of this task on texts distributed by the organizers. In our experiments this multilingual NER model achieved an F1 score of 96 on in-domain data and an F1 score of 83 on out-of-domain data. The entity coreference module achieved an F1 score of 47.6, as evaluated by the bsnlp2021 organizers.
2020
The European Language Technology Landscape in 2020: Language-Centric and Human-Centric AI for Cross-Cultural Communication in Multilingual Europe
Georg Rehm | Katrin Marheinecke | Stefanie Hegele | Stelios Piperidis | Kalina Bontcheva | Jan Hajič | Khalid Choukri | Andrejs Vasiļjevs | Gerhard Backfried | Christoph Prinz | José Manuel Gómez-Pérez | Luc Meertens | Paul Lukowicz | Josef van Genabith | Andrea Lösch | Philipp Slusallek | Morten Irgens | Patrick Gatellier | Joachim Köhler | Laure Le Bars | Dimitra Anastasiou | Albina Auksoriūtė | Núria Bel | António Branco | Gerhard Budin | Walter Daelemans | Koenraad De Smedt | Radovan Garabík | Maria Gavriilidou | Dagmar Gromann | Svetla Koeva | Simon Krek | Cvetana Krstev | Krister Lindén | Bernardo Magnini | Jan Odijk | Maciej Ogrodniczuk | Eiríkur Rögnvaldsson | Mike Rosner | Bolette Pedersen | Inguna Skadiņa | Marko Tadić | Dan Tufiș | Tamás Váradi | Kadri Vider | Andy Way | François Yvon
Proceedings of the Twelfth Language Resources and Evaluation Conference
Multilingualism is a cultural cornerstone of Europe and firmly anchored in the European treaties including full language equality. However, language barriers impacting business, cross-lingual and cross-cultural communication are still omnipresent. Language Technologies (LTs) are a powerful means to break down these barriers. While the last decade has seen various initiatives that created a multitude of approaches and technologies tailored to Europe’s specific needs, there is still an immense level of fragmentation. At the same time, AI has become an increasingly important concept in the European Information and Communication Technology area. For a few years now, AI – including many opportunities, synergies but also misconceptions – has been overshadowing every other topic. We present an overview of the European LT landscape, describing funding programmes, activities, actions and challenges in the different countries with regard to LT, including the current state of play in industry and the LT market. We present a brief overview of the main LT-related activities on the EU level in the last ten years and develop strategic guidance with regard to four key dimensions.
The Competitiveness Analysis of the European Language Technology Market
Andrejs Vasiļjevs | Inguna Skadiņa | Indra Samite | Kaspars Kauliņš | Ēriks Ajausks | Jūlija Meļņika | Aivars Bērziņš
Proceedings of the Twelfth Language Resources and Evaluation Conference
This paper presents the key results of a study on the global competitiveness of the European Language Technology market for three areas – Machine Translation, speech technology, and cross-lingual search. EU competitiveness is analyzed in comparison to North America and Asia. The study focuses on seven dimensions (research, innovations, investments, market dominance, industry, infrastructure, and Open Data) that have been selected to characterize the language technology market. The study concludes that while Europe still has strong positions in Research and Innovation, it lags behind North America and Asia in scaling innovations and conquering market share.
2019
Competitiveness Analysis of the European Machine Translation Market
Andrejs Vasiļjevs | Inguna Skadiņa | Indra Sāmīte | Kaspars Kauliņš | Ēriks Ajausks | Jūlija Meļņika | Aivars Bērziņš
Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks
2017
NMT or SMT: Case Study of a Narrow-domain English-Latvian Post-editing Project
Inguna Skadiņa | Mārcis Pinnis
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
The recent technological shift in machine translation from statistical machine translation (SMT) to neural machine translation (NMT) raises the question of the strengths and weaknesses of NMT. In this paper, we present an analysis of NMT and SMT systems’ outputs from narrow-domain English-Latvian MT systems that were trained on a rather small amount of data. We analyze post-edits produced by professional translators and manually annotated errors in these outputs. Analysis of post-edits allowed us to conclude that both approaches are comparably successful, allowing for an increase in translators’ productivity, with the NMT system showing slightly worse results. Through the analysis of annotated errors, we found that NMT translations are more fluent than SMT translations. However, errors related to accuracy, especially mistranslation and omission errors, occur more often in NMT outputs. Word form errors, which reflect the morphological richness of Latvian, are frequent for both systems, but slightly fewer in NMT outputs.
2016
What Can We Really Learn from Post-editing?
Marcis Pinnis | Rihards Kalnins | Raivis Skadins | Inguna Skadina
Conferences of the Association for Machine Translation in the Americas: MT Users' Track
Syntax-based Multi-system Machine Translation
Matīss Rikters | Inguna Skadiņa
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
This paper describes a hybrid machine translation system that uses a parser to acquire syntactic chunks of a source sentence, translates the chunks with multiple online machine translation (MT) system application program interfaces (APIs) and creates output by combining the translated chunks to obtain the best possible translation. The best translation hypothesis is selected by calculating the perplexity of each translated chunk. The goal of this approach is to enhance the baseline multi-system hybrid translation (MHyT) system, which uses only a language model to select the best translation from the translations obtained with different APIs, and to improve overall English-Latvian machine translation quality over each of the individual MT APIs. The presented syntax-based multi-system translation (SyMHyT) system demonstrates an improvement in terms of BLEU and NIST scores compared to the baseline system. Improvements range from 1.74 to 2.54 BLEU points.
2014
The Strategic Impact of META-NET on the Regional, National and International Level
Georg Rehm | Hans Uszkoreit | Sophia Ananiadou | Núria Bel | Audronė Bielevičienė | Lars Borin | António Branco | Gerhard Budin | Nicoletta Calzolari | Walter Daelemans | Radovan Garabík | Marko Grobelnik | Carmen García-Mateo | Josef van Genabith | Jan Hajič | Inma Hernáez | John Judge | Svetla Koeva | Simon Krek | Cvetana Krstev | Krister Lindén | Bernardo Magnini | Joseph Mariani | John McNaught | Maite Melero | Monica Monachini | Asunción Moreno | Jan Odijk | Maciej Ogrodniczuk | Piotr Pęzik | Stelios Piperidis | Adam Przepiórkowski | Eiríkur Rögnvaldsson | Michael Rosner | Bolette Pedersen | Inguna Skadiņa | Koenraad De Smedt | Marko Tadić | Paul Thompson | Dan Tufiş | Tamás Váradi | Andrejs Vasiļjevs | Kadri Vider | Jolanta Zabarskaite
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
This article provides an overview of the dissemination work carried out in META-NET from 2010 until early 2014; we describe its impact on the regional, national and international level, mainly with regard to politics and the funding situation for LT topics. This paper documents the initiative's work throughout Europe to boost progress and innovation in our field.
CLARA: A New Generation of Researchers in Common Language Resources and Their Applications
Koenraad De Smedt | Erhard Hinrichs | Detmar Meurers | Inguna Skadiņa | Bolette Pedersen | Costanza Navarretta | Núria Bel | Krister Lindén | Markéta Lopatková | Jan Hajič | Gisle Andersen | Przemyslaw Lenkiewicz
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
CLARA (Common Language Resources and Their Applications) is a Marie Curie Initial Training Network which ran from 2009 until 2014 with the aim of providing researcher training in crucial areas related to language resources and infrastructure. The scope of the project was broad and included infrastructure design, lexical semantic modeling, domain modeling, multimedia and multimodal communication, applications, and parsing technologies and grammar models. An international consortium of 9 partners and 12 associate partners employed researchers in 19 new positions and organized a training program consisting of 10 thematic courses and summer/winter schools. The project has resulted in new theoretical insights as well as new resources and tools. Most importantly, the project has trained a new generation of researchers who can perform advanced research and development in language resources and technologies.
Application of machine translation in localization into low-resourced languages
Raivis Skadiņš | Mārcis Pinnis | Andrejs Vasiļjevs | Inguna Skadiņa | Tomas Hudik
Proceedings of the 17th Annual Conference of the European Association for Machine Translation
2013
Baltic and Nordic Parts of the European Linguistic Infrastructure
Inguna Skadiņa | Andrejs Vasiļjevs | Lars Borin | Krister Lindén | Gyri Losnegaard | Sussi Olsen | Bolette Sandford Pedersen | Roberts Rozis | Koenraad De Smedt
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)
2012
ACCURAT Toolkit for Multi-Level Alignment and Information Extraction from Comparable Corpora
Mārcis Pinnis | Radu Ion | Dan Ştefănescu | Fangzhong Su | Inguna Skadiņa | Andrejs Vasiļjevs | Bogdan Babych
Proceedings of the ACL 2012 System Demonstrations
Creation of an Open Shared Language Resource Repository in the Nordic and Baltic Countries
Andrejs Vasiļjevs | Markus Forsberg | Tatiana Gornostay | Dorte Haltrup Hansen | Kristín Jóhannsdóttir | Gunn Lyse | Krister Lindén | Lene Offersgaard | Sussi Olsen | Bolette Pedersen | Eiríkur Rögnvaldsson | Inguna Skadiņa | Koenraad De Smedt | Ville Oksanen | Roberts Rozis
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
The META-NORD project has contributed to an open infrastructure for language resources (data and tools) under the META-NET umbrella. This paper presents the key objectives of META-NORD and reports on the results achieved in the first year of the project. META-NORD has mapped and described the national language technology landscape in the Nordic and Baltic countries in terms of language use, language technology and resources, and the main actors in academia, industry, government and society; identified and collected the first batch of language resources in the Nordic and Baltic countries; and documented, processed, linked, and upgraded the identified language resources to agreed standards and guidelines. This paper gives an overview of the three horizontal multilingual actions in META-NORD: linking and validating Nordic and Baltic wordnets, harmonising multilingual Nordic and Baltic treebanks, and consolidating multilingual terminology resources across European countries. This paper also touches upon intellectual property rights for the sharing of language resources.
Collecting and Using Comparable Corpora for Statistical Machine Translation
Inguna Skadiņa | Ahmet Aker | Nikos Mastropavlos | Fangzhong Su | Dan Tufis | Mateja Verlic | Andrejs Vasiļjevs | Bogdan Babych | Paul Clough | Robert Gaizauskas | Nikos Glaros | Monica Lestari Paramita | Mārcis Pinnis
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Lack of sufficient parallel data for many languages and domains is currently one of the major obstacles to further advancement of automated translation. The ACCURAT project is addressing this issue by researching methods for improving machine translation systems by using comparable corpora. In this paper we present tools and techniques developed in the ACCURAT project that allow additional data needed for statistical machine translation to be extracted from comparable corpora. We present methods and tools for acquisition of comparable corpora from the Web and other sources, for evaluation of the comparability of collected corpora, for multi-level alignment of comparable corpora and for extraction of lexical and terminological data for machine translation. Finally, we present initial evaluation results on the utility of collected corpora in domain-adapted machine translation and real-life applications.
2011
Evaluation of SMT in localization to under-resourced inflected language
Raivis Skadiņš | Maris Puriņš | Inguna Skadiņa | Andrejs Vasiļjevs
Proceedings of the 15th Annual Conference of the European Association for Machine Translation
META-NORD: Towards Sharing of Language Resources in Nordic and Baltic Countries
Inguna Skadiņa | Andrejs Vasiļjevs | Lars Borin | Koenraad De Smedt | Krister Lindén | Eiríkur Rögnvaldsson
Proceedings of the Workshop on Language Resources, Technology and Services in the Sharing Paradigm
Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011)
Bolette Sandford Pedersen | Gunta Nešpore | Inguna Skadiņa
Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011)
2010
Towards Improving English-Latvian Translation: A System Comparison and a New Rescoring Feature
Maxim Khalilov | José A. R. Fonollosa | Inguna Skadiņa | Edgars Brālītis | Lauma Pretkalniņa
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Translation into languages with relatively free word order has received much less attention than translation into fixed word order languages (English) or into analytical languages (Chinese). At the same time, this translation task is among the most difficult challenges for machine translation (MT), and intuitively it seems that there is room for improvement in reflecting the free word order structure of the target language. This paper presents a comparative study of two alternative approaches to statistical machine translation (SMT) and their application to the task of English-to-Latvian translation. Furthermore, a novel feature intended to reflect the relatively free word order scheme of the Latvian language is proposed and successfully applied in the n-best list rescoring step. Moving beyond the automatic scores of translation quality that are typically presented in MT research papers, we contribute a manual error analysis of the MT systems' output that helps to shed light on the advantages and disadvantages of the SMT systems under consideration.
Resource and Service Centres as the Backbone for a Sustainable Service Infrastructure
Peter Wittenburg | Nuria Bel | Lars Borin | Gerhard Budin | Nicoletta Calzolari | Eva Hajicova | Kimmo Koskenniemi | Lothar Lemnitzer | Bente Maegaard | Maciej Piasecki | Jean-Marie Pierrel | Stelios Piperidis | Inguna Skadina | Dan Tufis | Remco van Veenendaal | Tamas Váradi | Martin Wynne
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Currently, research infrastructures are being designed and established in many disciplines since they all suffer from an enormous fragmentation of their resources and tools. In the domain of language resources and tools, the CLARIN initiative has been funded since 2008 to overcome many of the integration and interoperability hurdles. CLARIN can build on knowledge and work from many projects that were carried out in recent years and wants to build stable and robust services that can be used by researchers. Service centres that have the potential to be persistent and that adhere to the criteria established by CLARIN will play an important role here. In the last year of the so-called preparatory phase, these centres are developing four use cases that can demonstrate how the various pillars CLARIN has been working on can be integrated. All four use cases fulfil the criterion of being cross-national.
2009
English–Latvian Toponym Processing: Translation Strategies and Linguistic Patterns
Tatiana Gornostay | Inguna Skadiņa
Proceedings of the 13th Annual Conference of the European Association for Machine Translation
Pattern-based English-Latvian Toponym Translation
Tatiana Gornostay | Inguna Skadiņa
Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009)
English-Latvian SMT: knowledge or data?
Inguna Skadiņa | Edgars Brālītis
Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009)
2008
Dictionary of Multiword Expressions for Translation into highly Inflected Languages
Daiga Deksne | Raivis Skadiņš | Inguna Skadiņa
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
Treatment of Multiword Expressions (MWEs) is one of the most complicated issues in natural language processing, especially in Machine Translation (MT). The paper presents a dictionary of MWEs for an English-Latvian MT system, demonstrating how MWEs could be handled for inflected languages with rich morphology and rather free word order. The proposed dictionary of MWEs consists of two constituents: a lexicon of phrases and a set of MWE rules. The lexicon of phrases is rather similar to the translation lexicon of the MT system, while the MWE rules describe the syntactic structure of the source and target sentence, allowing correct transformation of different MWE types into the target language and ensuring correct syntactic structure. The paper demonstrates this approach on different MWE types, starting from simple syntactic structures, followed by more complicated cases and including fully idiomatic expressions. Automatic evaluation shows that the described approach increases the quality of translation by 0.6 BLEU points.
2007
Comprehension Assistant for Languages of Baltic States
Inguna Skadiņa | Andrejs Vasiļjevs | Daiga Deksne | Raivis Skadiņš | Linda Goldberga
Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA 2007)