2022
ELRC Action: Covering Confidentiality, Correctness and Cross-linguality
Tom Vanallemeersch | Arne Defauw | Sara Szoc | Alina Kramchaninova | Joachim Van den Bogaert | Andrea Lösch
Proceedings of the Thirteenth Language Resources and Evaluation Conference
We describe the language technology (LT) assessments carried out in the European Language Resource Coordination (ELRC) action of the European Commission, which aims to minimise language barriers across the EU. We zoom in on the two most extensive assessments. These LT specifications involve not only experiments with tools and techniques but also an extensive consultation round with stakeholders from public organisations, academia and industry, in order to gather insights into scenarios and best practices. The LT specifications concern (1) the field of automated anonymisation, which is motivated by the need of public and other organisations to be able to store and share data, and (2) the field of multilingual fake news processing, which is motivated by the increasingly pressing problem of disinformation and the limited language coverage of systems for automatically detecting misleading articles. For each specification, we set up corresponding proof-of-concept software to demonstrate the opportunities and challenges involved in the field.
Synthetic Data Generation for Multilingual Domain-Adaptable Question Answering Systems
Alina Kramchaninova | Arne Defauw
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation
Deep learning models have significantly advanced the state of the art of question answering systems. However, the majority of datasets available for training such models have been annotated by humans, are open-domain, and are composed primarily in English. To deal with these limitations, we introduce a pipeline that creates synthetic data from natural text. To illustrate the domain-adaptability of our approach, as well as its multilingual potential, we use our pipeline to obtain synthetic data in English and Dutch. We combine the synthetic data with non-synthetic data (SQuAD 2.0) and evaluate multilingual BERT models on the question answering task. Models trained with synthetically augmented data demonstrate a clear improvement in performance when evaluated on the domain-specific test set, compared to the models trained exclusively on SQuAD 2.0. We expect our work to be beneficial for training domain-specific question-answering systems when the amount of available data is limited.
Automatically extracting the semantic network out of public services to support cities becoming Smart Cities
Joachim Van den Bogaert | Laurens Meeus | Alina Kramchaninova | Arne Defauw | Sara Szoc | Frederic Everaert | Koen Van Winckel | Anna Bardadym | Tom Vanallemeersch
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation
The CEFAT4Cities project aims at creating a multilingual semantic interoperability layer for Smart Cities that allows users from all EU Member States to interact with public services in their own language. The CEFAT4Cities processing pipeline transforms natural-language administrative procedures into machine-readable data using various multilingual Natural Language Processing techniques, such as semantic networks and machine translation, thus allowing for the development of more sophisticated and more user-friendly public services applications.
2020
OCR, Classification & Machine Translation (OCCAM)
Joachim Van den Bogaert | Arne Defauw | Frederic Everaert | Koen Van Winckel | Alina Kramchaninova | Anna Bardadym | Tom Vanallemeersch | Pavel Smrž | Michal Hradiš
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation
The OCCAM project (Optical Character recognition, ClassificAtion & Machine Translation) aims at integrating the CEF (Connecting Europe Facility) Automated Translation service with image classification, Translation Memories (TMs), Optical Character Recognition (OCR), and Machine Translation (MT). It will support the automated translation of scanned business documents (a document format that, currently, cannot be processed by the CEF eTranslation service) and will also lead to a tool useful for the Digital Humanities domain.
CEFAT4Cities, a Natural Language Layer for the ISA2 Core Public Service Vocabulary
Joachim Van den Bogaert | Arne Defauw | Sara Szoc | Frederic Everaert | Koen Van Winckel | Alina Kramchaninova | Anna Bardadym | Tom Vanallemeersch
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation
The CEFAT4Cities project (2020-2022) will create a “Smart Cities natural language context” (a software layer that facilitates the conversion of natural-language administrative procedures into machine-readable data sets) on top of the existing ISA2 interoperability layer for public services. Integration with the FIWARE/ORION “Smart City” Context Broker will make existing, paper-based public services discoverable through “Smart City” frameworks, thus allowing for the development of more sophisticated and more user-friendly public services applications. An automated translation component will be included to provide a solution that can be used by all EU Member States. As a result, the project will allow EU citizens and businesses to interact with public services at the city, regional, national and EU level, in their own language.