Sara Szoc


2024

pdf
AI4Culture: Towards Multilingual Access for Cultural Heritage Data
Tom Vanallemeersch | Sara Szoc | Laurens Meeus
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 2)

The AI4Culture project (2023-2025), funded by the European Commission, and involving a 12-partner consortium led by the National Technical University of Athens, develops a platform serving as an online capacity building hub for AI technologies in the cultural heritage (CH) sector, enabling multilingual access to CH data. It offers access to AI-related resources, including openly labelled datasets for model training and testing, deployable and reusable tools, and capacity building materials. The tools are aimed at optical character recognition (OCR) for printed and handwritten documents, subtitle generation and validation, machine translation (MT), and metadata enrichment via image information extraction and semantic linking. The project also customises these tools to enhance interface and component usability. We illustrate this with technology that corrects OCR output using language models and adapts it for MT.

2022

pdf
ELRC Action: Covering Confidentiality, Correctness and Cross-linguality
Tom Vanallemeersch | Arne Defauw | Sara Szoc | Alina Kramchaninova | Joachim Van den Bogaert | Andrea Lösch
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We describe the language technology (LT) assessments carried out in the ELRC action (European Language Resource Coordination) of the European Commission, which aims towards minimising language barriers across the EU. We zoom in on the two most extensive assessments. These LT specifications do not only involve experiments with tools and techniques but also an extensive consultation round with stakeholders from public organisations, academia and industry, in order to gather insights into scenarios and best practices. The LT specifications concern (1) the field of automated anonymisation, which is motivated by the need of public and other organisations to be able to store and share data, and (2) the field of multilingual fake news processing, which is motivated by the increasingly pressing problem of disinformation and the limited language coverage of systems for automatically detecting misleading articles. For each specification, we set up a corresponding proof-of-concept software to demonstrate the opportunities and challenges involved in the field.

pdf
Automatically extracting the semantic network out of public services to support cities becoming Smart Cities
Joachim Van den Bogaert | Laurens Meeus | Alina Kramchaninova | Arne Defauw | Sara Szoc | Frederic Everaert | Koen Van Winckel | Anna Bardadym | Tom Vanallemeersch
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

The CEFAT4Cities project aims at creating a multilingual semantic interoperability layer for Smart Cities that allows users from all EU member States to interact with public services in their own language. The CEFAT4Cities processing pipeline transforms natural-language administrative procedures into machine-readable data using various multilingual Natural Language Processing techniques, such as semantic networks and machine translation, thus allowing for the development of more sophisticated and more user-friendly public services applications.

2021

pdf
Validating Quality Estimation in a Computer-Aided Translation Workflow: Speed, Cost and Quality Trade-off
Fernando Alva-Manchego | Lucia Specia | Sara Szoc | Tom Vanallemeersch | Heidi Depraetere
Proceedings of Machine Translation Summit XVIII: Users and Providers Track

In modern computer-aided translation workflows, Machine Translation (MT) systems are used to produce a draft that is then checked and edited where needed by human translators. In this scenario, a Quality Estimation (QE) tool can be used to score MT outputs, and a threshold on the QE scores can be applied to decide whether an MT output can be used as-is or requires human post-edition. While this could reduce cost and turnaround times, it could harm translation quality, as QE models are not 100% accurate. In the framework of the APE-QUEST project (Automated Post-Editing and Quality Estimation), we set up a case-study on the trade-off between speed, cost and quality, investigating the benefits of QE models in a real-world scenario, where we rely on end-user acceptability as quality metric. Using data in the public administration domain for English-Dutch and English-French, we experimented with two use cases: assimilation and dissemination. Results shed some light on how QE scores can be explored to establish thresholds that suit each use case and target language, and demonstrate the potential benefits of adding QE to a translation workflow.

2020

pdf
Being Generous with Sub-Words towards Small NMT Children
Arne Defauw | Tom Vanallemeersch | Koen Van Winckel | Sara Szoc | Joachim Van den Bogaert
Proceedings of the Twelfth Language Resources and Evaluation Conference

In the context of under-resourced neural machine translation (NMT), transfer learning from an NMT model trained on a high resource language pair, or from a multilingual NMT (M-NMT) model, has been shown to boost performance to a large extent. In this paper, we focus on so-called cold start transfer learning from an M-NMT model, which means that the parent model is not trained on any of the child data. Such a set-up enables quick adaptation of M-NMT models to new languages. We investigate the effectiveness of cold start transfer learning from a many-to-many M-NMT model to an under-resourced child. We show that sufficiently large sub-word vocabularies should be used for transfer learning to be effective in such a scenario. When adopting relatively large sub-word vocabularies we observe increases in performance thanks to transfer learning from a parent M-NMT model, both when translating to and from the under-resourced language. Our proposed approach involving dynamic vocabularies is both practical and effective. We report results on two under-resourced language pairs, i.e. Icelandic-English and Irish-English.

pdf
A Post-Editing Dataset in the Legal Domain: Do we Underestimate Neural Machine Translation Quality?
Julia Ive | Lucia Specia | Sara Szoc | Tom Vanallemeersch | Joachim Van den Bogaert | Eduardo Farah | Christine Maroti | Artur Ventura | Maxim Khalilov
Proceedings of the Twelfth Language Resources and Evaluation Conference

We introduce a machine translation dataset for three pairs of languages in the legal domain with post-edited high-quality neural machine translation and independent human references. The data was collected as part of the EU APE-QUEST project and comprises crawled content from EU websites with translation from English into three European languages: Dutch, French and Portuguese. Altogether, the data consists of around 31K tuples including a source sentence, the respective machine translation by a neural machine translation system, a post-edited version of such translation by a professional translator, and - where available - the original reference translation crawled from parallel language websites. We describe the data collection process, provide an analysis of the resulting post-edits and benchmark the data using state-of-the-art quality estimation and automatic post-editing models. One interesting by-product of our post-editing analysis suggests that neural systems built with publicly available general domain data can provide high-quality translations, even though comparison to human references suggests that this quality is quite low. This makes our dataset a suitable candidate to test evaluation metrics. The data is freely available as an ELRC-SHARE resource.

pdf
APE-QUEST: an MT Quality Gate
Heidi Depraetere | Joachim Van den Bogaert | Sara Szoc | Tom Vanallemeersch
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

The APE-QUEST project (2018–2020) sets up a quality gate and crowdsourcing workflow for the eTranslation system of EC’s Connecting Europe Facility to improve translation quality in specific domains. It packages these services as a translation portal for machine-to-machine and machine-to-human scenarios.

pdf
CEFAT4Cities, a Natural Language Layer for the ISA2 Core Public Service Vocabulary
Joachim Van den Bogaert | Arne Defauw | Sara Szoc | Frederic Everaert | Koen Van Winckel | Alina Kramchaninova | Anna Bardadym | Tom Vanallemeersch
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

The CEFAT4Cities project (2020-2022) will create a “Smart Cities natural language context” (a software layer that facilitates the conversion of natural-language administrative procedures, into machine-readable data sets) on top of the existing ISA2 interoperability layer for public services. Integration with the FIWARE/ORION “Smart City” Context Broker, will make existing, paper-based, public services discoverable through “Smart City” frameworks, thus allowing for the development of more sophisticated and more user-friendly public services applications. An automated translation component will be included, to provide a solution that can be used by all EU Member States. As a result, the project will allow EU citizens and businesses to interact with public services on the city, national, regional and EU level, in their own language.

2019

pdf
APE-QUEST
Joachim Van den Bogaert | Heidi Depraetere | Sara Szoc | Tom Vanallemeersch | Koen Van Winckel | Frederic Everaert | Lucia Specia | Julia Ive | Maxim Khalilov | Christine Maroti | Eduardo Farah | Artur Ventura
Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks

pdf
Collecting domain specific data for MT: an evaluation of the ParaCrawlpipeline
Arne Defauw | Tom Vanallemeersch | Sara Szoc | Frederic Everaert | Koen Van Winckel | Kim Scholte | Joris Brabers | Joachim Van den Bogaert
Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks

pdf
Developing a Neural Machine Translation system for Irish
Arne Defauw | Sara Szoc | Tom Vanallemeersch | Anna Bardadym | Joris Brabers | Frederic Everaert | Kim Scholte | Koen Van Winckel | Joachim Van den Bogaert
Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages