2022
Synthetic Data Generation for Multilingual Domain-Adaptable Question Answering Systems
Alina Kramchaninova | Arne Defauw
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation
Deep learning models have significantly advanced the state of the art of question answering systems. However, the majority of datasets available for training such models have been annotated by humans, are open-domain, and are composed primarily in English. To deal with these limitations, we introduce a pipeline that creates synthetic data from natural text. To illustrate the domain-adaptability of our approach, as well as its multilingual potential, we use our pipeline to obtain synthetic data in English and Dutch. We combine the synthetic data with non-synthetic data (SQuAD 2.0) and evaluate multilingual BERT models on the question answering task. Models trained with synthetically augmented data demonstrate a clear improvement in performance when evaluated on the domain-specific test set, compared to the models trained exclusively on SQuAD 2.0. We expect our work to be beneficial for training domain-specific question-answering systems when the amount of available data is limited.
Automatically extracting the semantic network out of public services to support cities becoming Smart Cities
Joachim Van den Bogaert | Laurens Meeus | Alina Kramchaninova | Arne Defauw | Sara Szoc | Frederic Everaert | Koen Van Winckel | Anna Bardadym | Tom Vanallemeersch
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation
The CEFAT4Cities project aims at creating a multilingual semantic interoperability layer for Smart Cities that allows users from all EU Member States to interact with public services in their own language. The CEFAT4Cities processing pipeline transforms natural-language administrative procedures into machine-readable data using various multilingual Natural Language Processing techniques, such as semantic networks and machine translation, thus allowing for the development of more sophisticated and more user-friendly public services applications.
ELRC Action: Covering Confidentiality, Correctness and Cross-linguality
Tom Vanallemeersch | Arne Defauw | Sara Szoc | Alina Kramchaninova | Joachim Van den Bogaert | Andrea Lösch
Proceedings of the Thirteenth Language Resources and Evaluation Conference
We describe the language technology (LT) assessments carried out in the ELRC action (European Language Resource Coordination) of the European Commission, which aims to minimise language barriers across the EU. We zoom in on the two most extensive assessments. These LT specifications involve not only experiments with tools and techniques but also an extensive consultation round with stakeholders from public organisations, academia and industry, in order to gather insights into scenarios and best practices. The LT specifications concern (1) the field of automated anonymisation, which is motivated by the need of public and other organisations to be able to store and share data, and (2) the field of multilingual fake news processing, which is motivated by the increasingly pressing problem of disinformation and the limited language coverage of systems for automatically detecting misleading articles. For each specification, we set up corresponding proof-of-concept software to demonstrate the opportunities and challenges involved in the field.
2020
OCR, Classification & Machine Translation (OCCAM)
Joachim Van den Bogaert | Arne Defauw | Frederic Everaert | Koen Van Winckel | Alina Kramchaninova | Anna Bardadym | Tom Vanallemeersch | Pavel Smrž | Michal Hradiš
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation
The OCCAM project (Optical Character recognition, ClassificAtion & Machine Translation) aims at integrating the CEF (Connecting Europe Facility) Automated Translation service with image classification, Translation Memories (TMs), Optical Character Recognition (OCR), and Machine Translation (MT). It will support the automated translation of scanned business documents (a document format that, currently, cannot be processed by the CEF eTranslation service) and will also lead to a tool useful for the Digital Humanities domain.
CEFAT4Cities, a Natural Language Layer for the ISA2 Core Public Service Vocabulary
Joachim Van den Bogaert | Arne Defauw | Sara Szoc | Frederic Everaert | Koen Van Winckel | Alina Kramchaninova | Anna Bardadym | Tom Vanallemeersch
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation
The CEFAT4Cities project (2020-2022) will create a “Smart Cities natural language context” (a software layer that facilitates the conversion of natural-language administrative procedures into machine-readable data sets) on top of the existing ISA2 interoperability layer for public services. Integration with the FIWARE/ORION “Smart City” Context Broker will make existing, paper-based public services discoverable through “Smart City” frameworks, thus allowing for the development of more sophisticated and more user-friendly public services applications. An automated translation component will be included to provide a solution that can be used by all EU Member States. As a result, the project will allow EU citizens and businesses to interact with public services at the city, national, regional and EU level in their own language.
Being Generous with Sub-Words towards Small NMT Children
Arne Defauw | Tom Vanallemeersch | Koen Van Winckel | Sara Szoc | Joachim Van den Bogaert
Proceedings of the Twelfth Language Resources and Evaluation Conference
In the context of under-resourced neural machine translation (NMT), transfer learning from an NMT model trained on a high-resource language pair, or from a multilingual NMT (M-NMT) model, has been shown to boost performance to a large extent. In this paper, we focus on so-called cold start transfer learning from an M-NMT model, which means that the parent model is not trained on any of the child data. Such a set-up enables quick adaptation of M-NMT models to new languages. We investigate the effectiveness of cold start transfer learning from a many-to-many M-NMT model to an under-resourced child. We show that sufficiently large sub-word vocabularies should be used for transfer learning to be effective in such a scenario. When adopting relatively large sub-word vocabularies, we observe increases in performance thanks to transfer learning from a parent M-NMT model, both when translating to and from the under-resourced language. Our proposed approach involving dynamic vocabularies is both practical and effective. We report results on two under-resourced language pairs, i.e. Icelandic-English and Irish-English.
2019
Collecting domain specific data for MT: an evaluation of the ParaCrawl pipeline
Arne Defauw | Tom Vanallemeersch | Sara Szoc | Frederic Everaert | Koen Van Winckel | Kim Scholte | Joris Brabers | Joachim Van den Bogaert
Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks
Developing a Neural Machine Translation system for Irish
Arne Defauw | Sara Szoc | Tom Vanallemeersch | Anna Bardadym | Joris Brabers | Frederic Everaert | Kim Scholte | Koen Van Winckel | Joachim Van den Bogaert
Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages