Andrejs Vasiļjevs

Also published as: Andrejs Vasiljevs

2024

The Common European Language Data Space (LDS) is an integral part of the EU data strategy, which aims at developing a single market for data. Its decentralised technical infrastructure and governance scheme are currently being developed by the LDS project, which also has dedicated tasks for proof-of-concept prototypes, handling legal aspects, raising awareness and promoting the LDS through events and social media channels. The LDS is part of a broader vision for establishing all necessary components to develop European large language models.

2022

pdf abs
Assessing Multilinguality of Publicly Accessible Websites
Rinalds Vīksna | Inguna Skadiņa | Raivis Skadiņš | Andrejs Vasiļjevs | Roberts Rozis
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Although information on the Internet can be shared in many languages, the language presence on the World Wide Web is very disproportionate. The problem of multilingualism on the Web, in particular access, availability and quality of information in the world’s languages, has been the subject of UNESCO focus for several decades. Making European websites more multilingual is also one of the focal targets of the Connecting Europe Facility Automated Translation (CEF AT) digital service infrastructure. In order to monitor this goal, alongside other possible solutions, CEF AT needs a methodology and easy to use tool to assess the degree of multilingualism of a given website. In this paper we investigate methods and tools that automatically analyse the language diversity of the Web and propose indicators and methodology on how to measure the multilingualism of European websites. We also introduce a prototype tool based on open-source software that helps to assess multilingualism of the Web and can be independently run at set intervals. We also present initial results obtained with our tool that allows us to conclude that multilingualism on the Web is still a problem not only at the world level, but also at the European and regional level.

pdf abs
Open Terminology Management and Sharing Toolkit for Federation of Terminology Databases
Andis Lagzdiņš | Uldis Siliņš | Toms Bergmanis | Mārcis Pinnis | Artūrs Vasiļevskis | Andrejs Vasiļjevs
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Consolidated access to current and reliable terms from different subject fields and languages is necessary for content creators and translators. Terminology is also needed in AI applications such as machine translation, speech recognition, information extraction, and other natural language processing tools. In this work, we facilitate standards-based sharing and management of terminology resources by providing an open terminology management solution - the EuroTermBank Toolkit. It allows organisations to manage and search their terms, create term collections, and share them within and outside the organisation by participating in the network of federated databases. The data curated in the federated databases are automatically shared with EuroTermBank, the largest multilingual terminology resource in Europe, allowing translators and language service providers as well as researchers and students to access terminology resources in their most current version.

This article presents the work in progress on the collaborative project of several European countries to develop National Language Technology Platform (NLTP). The project aims at combining the most advanced Language Technology tools and solutions in a new, state-of-the-art, Artificial Intelligence driven, National Language Technology Platform for five EU/EEA official and lower-resourced languages.

2021

Europe is a multilingual society, in which dozens of languages are spoken. The only option to enable and to benefit from multilingualism is through Language Technologies (LT), i.e., Natural Language Processing and Speech Technologies. We describe the European Language Grid (ELG), which is targeted to evolve into the primary platform and marketplace for LT in Europe by providing one umbrella platform for the European LT landscape, including research and industry, enabling all stakeholders to upload, share and distribute their services, products and resources. At the end of our EU project, which will establish a legal entity in 2022, the ELG will provide access to approx. 1300 services for all European languages as well as thousands of data sets.

2020

Multilingualism is a cultural cornerstone of Europe and firmly anchored in the European treaties including full language equality. However, language barriers impacting business, cross-lingual and cross-cultural communication are still omnipresent. Language Technologies (LTs) are a powerful means to break down these barriers. While the last decade has seen various initiatives that created a multitude of approaches and technologies tailored to Europe’s specific needs, there is still an immense level of fragmentation. At the same time, AI has become an increasingly important concept in the European Information and Communication Technology area. For a few years now, AI – including many opportunities, synergies but also misconceptions – has been overshadowing every other topic. We present an overview of the European LT landscape, describing funding programmes, activities, actions and challenges in the different countries with regard to LT, including the current state of play in industry and the LT market. We present a brief overview of the main LT-related activities on the EU level in the last ten years and develop strategic guidance with regard to four key dimensions.

With 24 official EU and many additional languages, multilingualism in Europe and an inclusive Digital Single Market can only be enabled through Language Technologies (LTs). European LT business is dominated by hundreds of SMEs and a few large players. Many are world-class, with technologies that outperform the global players. However, European LT business is also fragmented – by nation states, languages, verticals and sectors, significantly holding back its impact. The European Language Grid (ELG) project addresses this fragmentation by establishing the ELG as the primary platform for LT in Europe. The ELG is a scalable cloud platform, providing, in an easy-to-integrate way, access to hundreds of commercial and non-commercial LTs for all European languages, including running tools and services as well as data sets and resources. Once fully operational, it will enable the commercial and non-commercial European LT community to deposit and upload their technologies and data sets into the ELG, to deploy them through the grid, and to connect with other resources. The ELG will boost the Multilingual Digital Single Market towards a thriving European LT community, creating new jobs and opportunities. Furthermore, the ELG project organises two open calls for up to 20 pilot projects. It also sets up 32 national competence centres and the European LT Council for outreach and coordination purposes.

This paper presents the key results of a study on the global competitiveness of the European Language Technology market for three areas – Machine Translation, speech technology, and cross-lingual search. EU competitiveness is analyzed in comparison to North America and Asia. The study focuses on seven dimensions (research, innovations, investments, market dominance, industry, infrastructure, and Open Data) that have been selected to characterize the language technology market. The study concludes that while Europe still has strong positions in Research and Innovation, it lags behind North America and Asia in scaling innovations and conquering market share.

pdf bib
Proceedings of the 1st International Workshop on Language Technology Platforms
Georg Rehm | Kalina Bontcheva | Khalid Choukri | Jan Hajič | Stelios Piperidis | Andrejs Vasiļjevs
Proceedings of the 1st International Workshop on Language Technology Platforms

With regard to the wider area of AI/LT platform interoperability, we concentrate on two core aspects: (1) cross-platform search and discovery of resources and services; (2) composition of cross-platform service workflows. We devise five different levels (of increasing complexity) of platform interoperability that we suggest to implement in a wider federation of AI/LT platforms. We illustrate the approach using the five emerging AI/LT platforms AI4EU, ELG, Lynx, QURATOR and SPEAKER.

pdf
A Tale of Eight Countries or the EU Council Presidency Translator in Retrospect
Mārcis Pinnis | Toms Bergmanis | Kristīne Metuzāle | Valters Šics | Artūrs Vasiļevskis | Andrejs Vasiļjevs
Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 2: User Track)

2019

2018

pdf
Tilde MT Platform for Developing Client Specific MT Solutions
Mārcis Pinnis | Andrejs Vasiļjevs | Rihards Kalniņš | Roberts Rozis | Raivis Skadiņš | Valters Šics
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf
Collecting Language Resources from Public Administrations in the Nordic and Baltic Countries
Andrejs Vasiļjevs | Rihards Kalniņš | Roberts Rozis | Aivars Bērziņš
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

pdf abs
Collecting Language Resources for the Latvian e-Government Machine Translation Platform
Roberts Rozis | Andrejs Vasiļjevs | Raivis Skadiņš
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper describes corpora collection activity for building large machine translation systems for Latvian e-Government platform. We describe requirements for corpora, selection and assessment of data sources, collection of the public corpora and creation of new corpora from miscellaneous sources. Methodology, tools and assessment methods are also presented along with the results achieved, challenges faced and conclusions made. Several approaches to address the data scarceness are discussed. We summarize the volume of obtained corpora and provide quality metrics of MT systems trained on this data. Resulting MT systems for English-Latvian, Latvian English and Latvian Russian are integrated in the Latvian e-service portal and are freely available on website HUGO.LV. This paper can serve as a guidance for similar activities initiated in other countries, particularly in the context of European Language Resource Coordination action.

pdf abs
Fostering the Next Generation of European Language Technology: Recent Developments ― Emerging Initiatives ― Challenges and Opportunities
Georg Rehm | Jan Hajič | Josef van Genabith | Andrejs Vasiljevs
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

META-NET is a European network of excellence, founded in 2010, that consists of 60 research centres in 34 European countries. One of the key visions and goals of META-NET is a truly multilingual Europe, which is substantially supported and realised through language technologies. In this article we provide an overview of recent developments around the multilingual Europe topic, we also describe recent and upcoming events as well as recent and upcoming strategy papers. Furthermore, we provide overviews of two new emerging initiatives, the CEF.AT and ELRC activity on the one hand and the Cracking the Language Barrier federation on the other. The paper closes with several suggested next steps in order to address the current challenges and to open up new opportunities.

2014

pdf abs
Terminology localization guidelines for the national scenario
Juris Borzovs | Ilze Ilziņa | Iveta Keiša | Mārcis Pinnis | Andrejs Vasiļjevs
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents a set of principles and practical guidelines for terminology work in the national scenario to ensure a harmonized approach in term localization. These linguistic principles and guidelines are elaborated by the Terminology Commission in Latvia in the domain of Information and Communication Technology (ICT). We also present a novel approach in a corpus-based selection and an evaluation of the most frequently used terms. Analysis of the terms proves that, in general, in the normative terminology work in Latvia localized terms are coined according to these guidelines. We further evaluate how terms included in the database of official terminology are adopted in the general use such as newspaper articles, blogs, forums, websites etc. Our evaluation shows that in a non-normative context the official terminology faces a strong competition from other variations of localized terms. Conclusions and recommendations from lexical analysis of localized terms are provided. We hope that presented guidelines and approach in evaluation will be useful to terminology institutions, regulative authorities and researchers in different countries that are involved in the national terminology work.

This article provides an overview of the dissemination work carried out in META-NET from 2010 until early 2014; we describe its impact on the regional, national and international level, mainly with regard to politics and the situation of funding for LT topics. This paper documents the initiatives work throughout Europe in order to boost progress and innovation in our field.

pdf abs
Terminology Resources and Terminology Work Benefit from Cloud Services
Tatiana Gornostay | Andrejs Vasiļjevs
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents the concept of the innovative platform TaaS Terminology as a Service. TaaS brings the benefits of cloud services to the user, in order to foster the creation of terminology resources and to maintain their up-to-datedness by integrating automated data extraction and user-supported clean-up of raw terminological data and sharing user-validated terminology. The platform is based on cutting-edge technologies, provides single-access-point terminology services, and facilitates the establishment of emerging trends beyond conventional praxis and static models in terminology work. A cloud-based, user-oriented, collaborative, portable, interoperable, and multilingual platform offers such terminology services as terminology project creation and sharing, data collection for translation lookup, user document upload and management, terminology extraction customisation and execution, raw terminological data management, validated terminological data export and reuse, and other terminology services.

Real-world challenges in application of MT for localization: the Baltic case
Mārcis Pinnis | Raivis Skadiņš | Andrejs Vasiļjevs
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Users Track

Machine translation for e-government – the Baltic case
Andrejs Vasiļjevs | Rihards Kalniņš | Mārcis Pinnis | Raivis Skadiņš
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Users Track

pdf
How to overtake Google in MT quality - the Baltic case
Andrejs Vasiljevs
Proceedings of the 3rd Workshop on Hybrid Approaches to Machine Translation (HyTra)

pdf
Application of machine translation in localization into low-resourced languages
Raivis Skadiņš | Mārcis Pinnis | Andrejs Vasiļjevs | Inguna Skadiņa | Tomas Hudik
Proceedings of the 17th Annual Conference of the European Association for Machine Translation

2013

pdf
Application of Online Terminology Services in Statistical Machine Translation
Raivis Skadins | Marcis Pinnis | Tatiana Gornostay | Andrejs Vasiljevs
Proceedings of Machine Translation Summit XIV: Posters

pdf
TaaS: Terminology as a Service
Andrejs Vasiljevs | Tatiana Gornostay
Proceedings of Machine Translation Summit XIV: European projects

2012

The META-NORD project has contributed to an open infrastructure for language resources (data and tools) under the META-NET umbrella. This paper presents the key objectives of META-NORD and reports on the results achieved in the first year of the project. META-NORD has mapped and described the national language technology landscape in the Nordic and Baltic countries in terms of language use, language technology and resources, main actors in the academy, industry, government and society; identified and collected the first batch of language resources in the Nordic and Baltic countries; documented, processed, linked, and upgraded the identified language resources to agreed standards and guidelines. The three horizontal multilingual actions in META-NORD are overviewed in this paper: linking and validating Nordic and Baltic wordnets, the harmonisation of multilingual Nordic and Baltic treebanks, and consolidating multilingual terminology resources across European countries. This paper also touches upon intellectual property rights for the sharing of language resources.

Lack of sufficient parallel data for many languages and domains is currently one of the major obstacles to further advancement of automated translation. The ACCURAT project is addressing this issue by researching methods how to improve machine translation systems by using comparable corpora. In this paper we present tools and techniques developed in the ACCURAT project that allow additional data needed for statistical machine translation to be extracted from comparable corpora. We present methods and tools for acquisition of comparable corpora from the Web and other sources, for evaluation of the comparability of collected corpora, for multi-level alignment of comparable corpora and for extraction of lexical and terminological data for machine translation. Finally, we present initial evaluation results on the utility of collected corpora in domain-adapted machine translation and real-life applications.

pdf
LetsMT!: Cloud-Based Platform for Do-It-Yourself Machine Translation
Andrejs Vasiļjevs | Raivis Skadiņš | Jörg Tiedemann
Proceedings of the ACL 2012 System Demonstrations

2011

pdf
LetsMT!: Cloud-Based Platform for Building User Tailored Machine Translation Engines
Andrejs Vasiljevs | Raivis Skadinš | Jörg Tiedemann
Proceedings of Machine Translation Summit XIII: System Presentations

pdf
Evaluation of SMT in localization to under-resourced inflected language
Raivis Skadiņš | Maris Puriņš | Inguna Skadiņa | Andrejs Vasiļjevs
Proceedings of the 15th Annual Conference of the European Association for Machine Translation

pdf
META-NORD: Towards Sharing of Language Resources in Nordic and Baltic Countries
Inguna Skadiņa | Andrejs Vasiļjevs | Lars Borin | Koenraad De Smedt | Krister Lindén | Eiríkur Rögnvaldsson
Proceedings of the Workshop on Language Resources, Technology and Services in the Sharing Paradigm

2010

pdf
Bridging the Gap – EuroTermBank Terminology Delivered to Users’ Environment
Tatiana Gornostay | Andrejs Vasiljevs | Signe Rirdance | Roberts Rozis
Proceedings of the 14th Annual Conference of the European Association for Machine Translation

pdf abs
Corpus Based Analysis for Multilingual Terminology Entry Compounding
Andrejs Vasiljevs | Kaspars Balodis
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper proposes statistical analysis methods for improvement of terminology entry compounding. Terminology entry compounding is a mechanism that identifies matching entries across multiple multilingual terminology collections. Bilingual or trilingual term entries are unified in compounded multilingual entry. We suggest that corpus analysis can improve entry compounding results by analysing contextual terms of given term in the corpus data. Proposed algorithm is described. It is implemented in an experimental setup. Results of experiment on compounding of Latvian and Lithuanian terminology resources are provided. These results encourage further research for different language pairs and in different domains.

2007

pdf
Development of Text-To-Speech system for Latvian
Kārlis Goba | Andrejs Vasiļjevs
Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA 2007)

pdf
Comprehension Assistant for Languages of Baltic States
Inguna Skadiņa | Andrejs Vasiļjevs | Daiga Deksne | Raivis Skadiņš | Linda Goldberga
Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA 2007)

2006

pdf abs
EuroTermBank - a Terminology Resource based on Best Practice
Lina Henriksen | Claus Povlsen | Andrejs Vasiljevs
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The new EU member countries face the problems of terminology resource fragmentation and lack of coordination in terminology development in general. The EuroTermBank project aims at contributing to improve the terminology infrastructure of the new EU countries and the project will result in a centralized online terminology bank - interlinked to other terminology banks and resources - for languages of the new EU member countries. The main focus of this paper is on a description of how to identify best practice within terminology work seen from a broad perspective. Surveys of real life terminology work have been conducted and these surveys have resulted in identification of scenario specific best practice descriptions of terminology work. Furthermore, this paper will present an outline of the specific criteria that have been used for selection of existing term resources to be included in the EuroTermBank database.