Vincent Vandeghinste

2024

Pretrained Language Models (PLMs) are the de facto backbone of most state-of-the-art NLP systems. In this paper, we introduce a family of domain-specific pretrained PLMs for French, focusing on three important domains: transcribed speech, medicine, and law. We use a transformer architecture based on efficient methods (LinFormer) to maximise their utility, since these domains often involve processing long documents. We evaluate and compare our models to state-of-the-art models on a diverse set of tasks and datasets, some of which are introduced in this paper. We gather the datasets into a new French-language evaluation benchmark for these three domains. We also compare various training configurations: continued pretraining, pretraining from scratch, as well as single- and multi-domain pretraining. Extensive domain-specific experiments show that it is possible to attain competitive downstream performance even when pre-training with the approximative LinFormer attention mechanism. For full reproducibility, we release the models and pretraining data, as well as contributed datasets.

Les modèles de langue préentraînés (PLM) constituent aujourd’hui de facto l’épine dorsale de la plupart des systèmes de traitement automatique des langues. Dans cet article, nous présentons Jargon, une famille de PLMs pour des domaines spécialisés du français, en nous focalisant sur trois domaines : la parole transcrite, le domaine clinique / biomédical, et le domaine juridique. Nous utilisons une architecture de transformeur basée sur des méthodes computationnellement efficaces(LinFormer) puisque ces domaines impliquent souvent le traitement de longs documents. Nous évaluons et comparons nos modèles à des modèles de l’état de l’art sur un ensemble varié de tâches et de corpus d’évaluation, dont certains sont introduits dans notre article. Nous rassemblons les jeux de données dans un nouveau référentiel d’évaluation en langue française pour ces trois domaines. Nous comparons également diverses configurations d’entraînement : préentraînement prolongé en apprentissage autosupervisé sur les données spécialisées, préentraînement à partir de zéro, ainsi que préentraînement mono et multi-domaines. Nos expérimentations approfondies dans des domaines spécialisés montrent qu’il est possible d’atteindre des performances compétitives en aval, même lors d’un préentraînement avec le mécanisme d’attention approximatif de LinFormer. Pour une reproductibilité totale, nous publions les modèles et les données de préentraînement, ainsi que les corpus utilisés.

2023

pdf abs
A New English-Dutch-NGT Corpus for the Hospitality Domain
Mirella De Sisto | Vincent Vandeghinste | Dimitar Shterionov
Proceedings of the Second International Workshop on Automatic Translation for Signed and Spoken Languages

One of the major challenges hampering the development of language technology which targets sign languages is the extremely limited availability of good quality data geared towards machine learning and deep learning approaches. In this paper we introduce the NGT-Dutch Hotel Review Corpus (NGT-HoReCo), which addresses this issue by providing multimodal parallel data in English, Dutch and Sign Language of the Netherlands (NGT). The corpus contains 283 hotel reviews in written English, translated into written Dutch and into NGT videos. It will be made publicly available through CLARIN and through the ELG platform.

pdf abs
Word Sense Disambiguation for Automatic Translation of Medical Dialogues into Pictographs
Magali Norré | Rémi Cardon | Vincent Vandeghinste | Thomas François
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

Word sense disambiguation is an NLP task embedded in different applications. We propose to evaluate its contribution to the automatic translation of French texts into pictographs, in the context of communication between doctors and patients with an intellectual disability. Different general and/or medical language models (Word2Vec, fastText, CamemBERT, FlauBERT, DrBERT, and CamemBERT-bio) are tested in order to choose semantically correct pictographs leveraging the synsets in the French WordNets (WOLF and WoNeF). The results of our automatic evaluations show that our method based on Word2Vec and fastText significantly improves the precision of medical translations into pictographs. We also present an evaluation corpus adapted to this task.

For Sign Languages (SLs), can we create a SignNet, like a WordNet for spoken languages: a network of semantic relations between constitutive elements of SLs? We first discuss approaches that link SL data to wordnets, or integrate such elements with some adaptations into the structure of WordNet. Then, we present requirements for a SignNet, which is built on SL data and then linked to WordNet.

SignON (https://signon-project.eu/) is a Horizon 2020 project, running from 2021 until the end of 2023, which addresses the lack of technology and services for the automatic translation between sign languages (SLs) and spoken languages, through an inclusive, human-centric solution, hence contributing to the repertoire of communication media for deaf, hard of hearing (DHH) and hearing individuals. In this paper, we present an update of the status of the project, describing the approaches developed to address the challenges and peculiarities of SL machine translation (SLMT).

pdf abs
GoSt-ParC-Sign: Gold Standard Parallel Corpus of Sign and spoken language
Mirella De Sisto | Vincent Vandeghinste | Lien Soetemans | Caro Brosens | Dimitar Shterionov
Proceedings of the 24th Annual Conference of the European Association for Machine Translation

Good quality training data for Sign Language Machine Translation (SLMT) is extremely scarce, and this is one of the challenges that any project focusing on Machine Translation (MT) which also targets sign languages is currently facing. The goal of this ongoing project is to create a parallel corpus of authentic Flemish Sign Language (VGT) and written Dutch which can be employed as gold standard in automated sign language translation. The availability of a gold standard corpus like Gost-ParC-Sign can facilitate the advances of SLMT; consequently, it supports and promotes inclusiveness in MT and, on a more general level, in language technology

2022

pdf abs
Investigating the Medical Coverage of a Translation System into Pictographs for Patients with an Intellectual Disability
Magali Norré | Vincent Vandeghinste | Thomas François | Bouillon Pierrette
Ninth Workshop on Speech and Language Processing for Assistive Technologies (SLPAT-2022)

Communication between physician and patients can lead to misunderstandings, especially for disabled people. An automatic system that translates natural language into a pictographic language is one of the solutions that could help to overcome this issue. In this preliminary study, we present the French version of a translation system using the Arasaac pictographs and we investigate the strategies used by speech therapists to translate into pictographs. We also evaluate the medical coverage of this tool for translating physician questions and patient instructions.

The SignON project (www.signon-project.eu) focuses on the research and development of a Sign Language (SL) translation mobile application and an open communications framework. SignON rectifies the lack of technology and services for the automatic translation between signed and spoken languages, through an inclusive, humancentric solution which facilitates communication between deaf, hard of hearing (DHH) and hearing individuals. We present an overview of the current status of the project, describing the milestones reached to date and the approaches that are being developed to address the challenges and peculiarities of Sign Language Machine Translation (SLMT).

pdf abs
Challenges with Sign Language Datasets for Sign Language Recognition and Translation
Mirella De Sisto | Vincent Vandeghinste | Santiago Egea Gómez | Mathieu De Coster | Dimitar Shterionov | Horacio Saggion
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Sign Languages (SLs) are the primary means of communication for at least half a million people in Europe alone. However, the development of SL recognition and translation tools is slowed down by a series of obstacles concerning resource scarcity and standardization issues in the available data. The former challenge relates to the volume of data available for machine learning as well as the time required to collect and process new data. The latter obstacle is linked to the variety of the data, i.e., annotation formats are not unified and vary amongst different resources. The available data formats are often not suitable for machine learning, obstructing the provision of automatic tools based on neural models. In the present paper, we give an overview of these challenges by comparing various SL corpora and SL machine learning datasets. Furthermore, we propose a framework to address the lack of standardization at format level, unify the available resources and facilitate SL research for different languages. Our framework takes ELAN files as inputs and returns textual and visual data ready to train SL recognition and translation models. We present a proof of concept, training neural translation models on the data produced by the proposed framework.

2021

pdf abs
Extending a Text-to-Pictograph System to French and to Arasaac
Magali Norré | Vincent Vandeghinste | Pierrette Bouillon | Thomas François
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

We present an adaptation of the Text-to-Picto system, initially designed for Dutch, and extended to English and Spanish. The original system, aimed at people with an intellectual disability, automatically translates text into pictographs (Sclera and Beta). We extend it to French and add a large set of Arasaac pictographs linked to WordNet 3.1. To carry out this adaptation, we automatically link the pictographs and their metadata to synsets of two French WordNets and leverage this information to translate words into pictographs. We automatically and manually evaluate our system with different corpora corresponding to different use cases, including one for medical communication between doctors and patients. The system is also compared to similar systems in other languages.

2018

pdf abs
M3TRA: integrating TM and MT for professional translators
Bram Bulté | Tom Vanallemeersch | Vincent Vandeghinste
Proceedings of the 21st Annual Conference of the European Association for Machine Translation

Translation memories (TM) and machine translation (MT) both are potentially useful resources for professional translators, but they are often still used independently in translation workflows. As translators tend to have a higher confidence in fuzzy matches than in MT, we investigate how to combine the benefits of TM retrieval with those of MT, by integrating the results of both. We develop a flexible TM-MT integration approach based on various techniques combining the use of TM and MT, such as fuzzy repair, span pretranslation and exploiting multiple matches. Results for ten language pairs using the DGT-TM dataset indicate almost consistently better BLEU, METEOR and TER scores compared to the MT, TM and NMT baselines.

pdf abs
A Comparison of Different Punctuation Prediction Approaches in a Translation Context
Vincent Vandeghinste | Lyan Verwimp | Joris Pelemans | Patrick Wambacq
Proceedings of the 21st Annual Conference of the European Association for Machine Translation

We test a series of techniques to predict punctuation and its effect on machine translation (MT) quality. Several techniques for punctuation prediction are compared: language modeling techniques, such as n-grams and long shortterm memories (LSTM), sequence labeling LSTMs (unidirectional and bidirectional), and monolingual phrase-based, hierarchical and neural MT. For actual translation, phrase-based, hierarchical and neural MT are investigated. We observe that for punctuation prediction, phrase-based statistical MT and neural MT reach similar results, and are best used as a preprocessing step which is followed by neural MT to perform the actual translation. Implicit punctuation insertion by a dedicated neural MT system, trained on unpunctuated source and punctuated target, yields similar results.

We present the highlights of the now finished 4-year SCATE project. It was completed in February 2018 and funded by the Flemish Government IWT-SBO, project No. 130041.1

2016

Compared to well-resourced languages such as English and Dutch, natural language processing (NLP) tools for Afrikaans are still not abundant. In the context of the AfriBooms project, KU Leuven and the North-West University collaborated to develop a first, small treebank, a dependency parser, and an easy to use online linguistic search engine for Afrikaans for use by researchers and students in the humanities and social sciences. The search tool is based on a similar development for Dutch, i.e. GrETEL, a user-friendly search engine which allows users to query a treebank by means of a natural language example instead of a formal search instruction.

pdf abs
Poly-GrETEL: Cross-Lingual Example-based Querying of Syntactic Constructions
Liesbeth Augustinus | Vincent Vandeghinste | Tom Vanallemeersch
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present Poly-GrETEL, an online tool which enables syntactic querying in parallel treebanks, based on the monolingual GrETEL environment. We provide online access to the Europarl parallel treebank for Dutch and English, allowing users to query the treebank using either an XPath expression or an example sentence in order to look for similar constructions. We provide automatic alignments between the nodes. By combining example-based query functionality with node alignments, we limit the need for users to be familiar with the query language and the structure of the trees in the source and target language, thus facilitating the use of parallel corpora for comparative linguistics and translation studies.

pdf
Improving Text-to-Pictograph Translation Through Word Sense Disambiguation
Leen Sevens | Gilles Jacobs | Vincent Vandeghinste | Ineke Schuurman | Frank Van Eynde
Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics

2015

pdf
Semantics-based pretranslation for SMT using fuzzy matches
Tom Vanallemeersch | Vincent Vandeghinste
Proceedings of the Ninth Workshop on Syntax, Semantics and Structure in Statistical Translation

pdf
Natural Language Generation from Pictographs
Leen Sevens | Vincent Vandeghinste | Ineke Schuurman | Frank Van Eynde
Proceedings of the 15th European Workshop on Natural Language Generation (ENLG)

pdf
Assessing linguistically aware fuzzy matching in translation memories
Tom Vanallemeersch | Vincent Vandeghinste
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

pdf
Extending a Dutch Text-to-Pictograph Converter to English and Spanish
Leen Sevens | Vincent Vandeghinste | Ineke Schuurman | Frank Van Eynde
Proceedings of SLPAT 2015: 6th Workshop on Speech and Language Processing for Assistive Technologies

2014

pdf abs
Linking Pictographs to Synsets: Sclera2Cornetto
Vincent Vandeghinste | Ineke Schuurman
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Social inclusion of people with Intellectual and Developmental Disabilities can be promoted by offering them ways to independently use the internet. People with reading or writing disabilities can use pictographs instead of text. We present a resource in which we have linked a set of 5710 pictographs to lexical-semantic concepts in Cornetto, a Wordnet-like database for Dutch. We show that, by using this resource in a text-to-pictograph translation system, we can greatly improve the coverage comparing with a baseline where words are converted into pictographs only if the word equals the filename.

pdf
Improving the Precision of Synset Links Between Cornetto and Princeton WordNet
Leen Sevens | Vincent Vandeghinste | Frank Van Eynde
Proceedings of Workshop on Lexical and Grammatical Resources for Language Processing

pdf
Improving fuzzy matching through syntactic knowledge
Tom Vanallemeersch | Vincent Vandeghinste
Proceedings of Translating and the Computer 36

2013

pdf
Example-Based Treebank Querying with GrETEL–Now Also for Spoken Dutch
Liesbeth Augustinus | Vincent Vandeghinste | Ineke Schuurman | Frank Van Eynde
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)

2012

pdf abs
Example-Based Treebank Querying
Liesbeth Augustinus | Vincent Vandeghinste | Frank Van Eynde
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The recent construction of large linguistic treebanks for spoken and written Dutch (e.g. CGN, LASSY, Alpino) has created new and exciting opportunities for the empirical investigation of Dutch syntax and semantics. However, the exploitation of those treebanks requires knowledge of specific data structures and query languages such as XPath. Linguists who are unfamiliar with formal languages are often reluctant towards learning such a language. In order to make treebank querying more attractive for non-technical users we developed GrETEL (Greedy Extraction of Trees for Empirical Linguistics), a query engine in which linguists can use natural language examples as a starting point for searching the Lassy treebank without knowledge about tree representations nor formal query languages. By allowing linguists to search for similar constructions as the example they provide, we hope to bridge the gap between traditional and computational linguistics. Two case studies are conducted to provide a concrete demonstration of the tool. The architecture of the tool is optimised for searching the LASSY treebank, but the approach can be adapted to other treebank lay-outs.

pdf abs
Large aligned treebanks for syntax-based machine translation
Gideon Kotzé | Vincent Vandeghinste | Scott Martens | Jörg Tiedemann
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present a collection of parallel treebanks that have been automatically aligned on both the terminal and the nonterminal constituent level for use in syntax-based machine translation. We describe how they were constructed and applied to a syntax- and example-based machine translation system called Parse and Corpus-Based Machine Translation (PaCo-MT). For the language pair Dutch to English, we present evaluation scores of both the nonterminal constituent alignments and the MT system itself, and in the latter case, compare them with those of Moses, a current state-of-the-art statistical MT system, when trained on the same data.

2011

pdf bib
Proceedings of the 15th Annual Conference of the European Association for Machine Translation
Mikel L. Forcada | Heidi Depraetere | Vincent Vandeghinste
Proceedings of the 15th Annual Conference of the European Association for Machine Translation

pdf
SMT-CAT integration in a Technical Domain: Handling XML Markup Using Pre & Post-processing Methods
Arda Tezcan | Vincent Vandeghinste
Proceedings of the 15th Annual Conference of the European Association for Machine Translation

2010

pdf
An Efficient, Generic Approach to Extracting Multi-Word Expressions from Dependency Trees
Scott Martens | Vincent Vandeghinste
Proceedings of the 2010 Workshop on Multiword Expressions: from Theory to Applications

pdf
Bottom-up Transfer in Example-based Machine Translation
Vincent Vandeghinste | Scott Martens
Proceedings of the 14th Annual Conference of the European Association for Machine Translation

pdf abs
Cultural Aspects of Spatiotemporal Analysis in Multilingual Applications
Ineke Schuurman | Vincent Vandeghinste
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper we want to point out some issues arising when a natural language processing task involves several languages (like multi- lingual, multidocument summarization and the machine translation aspects involved) which are often neglected. These issues are of a more cultural nature, and may even come into play when several documents in a single language are involved. We pay special attention to those aspects dealing with the spatiotemporal characteristics of a text. Correct automatic selection of (parts of) texts such as handling the same eventuality, presupposes spatiotemporal disambiguation at a rather specific level. The same holds for the analysis of the query. For generation and translation purposes, spatiotemporal aspects may be relevant as well. At the moment English (both the British and American variants) and Dutch (the Flemish and Dutch variant) are covered, all taking into account the perspective of a contemporary, Flemish user. In our approach the cultural aspects associated with for example the language of publication and the language used by the user play a crucial role.

2009

pdf
Tree-Based Target Language Modeling
Vincent Vandeghinste
Proceedings of the 13th Annual Conference of the European Association for Machine Translation

2008

The computational linguistics community in The Netherlands and Belgium has long recognized the dire need for a major reference corpus of written Dutch. In part to answer this need, the STEVIN programme was established. To pave the way for the effective building of a 500-million-word reference corpus of written Dutch, a pilot project was established. The Dutch Corpus Initiative project or D-Coi was highly successful in that it not only realized about 10% of the projected large reference corpus, but also established the best practices and developed all the protocols and the necessary tools for building the larger corpus within the confines of a necessarily limited budget. We outline the steps involved in an endeavour of this kind, including the major highlights and possible pitfalls. Once converted to a suitable XML format, further linguistic annotation based on the state-of-the-art tools developed either before or during the pilot by the consortium partners proved easily and fruitfully applicable. Linguistic enrichment of the corpus includes PoS tagging, syntactic parsing and semantic annotation, involving both semantic role labeling and spatiotemporal annotation. D-Coi is expected to be followed by SoNaR, during which the 500-million-word reference corpus of Dutch should be built.

In this paper we describe the METIS-II system and its evaluation on each of the language pairs: Dutch, German, Greek, and Spanish to English. The METIS-II system envisaged developing a data-driven approach in which no parallel corpus is required, and in which no full parser or extensive rule sets are needed. We describe evalution on a development test set and on a test set coming from Europarl, and compare our results with SYSTRAN. We also provide some further analysis, researching the impact of the number and source of the reference translations and analysing the results according to test text type. The results are expectably lower for the METIS system, but not at an unatainable distance from a mature system like SYSTRAN.

2007

pdf bib
Removing the distinction between a Translation Memory, a Bilingual Dictionary and a Parallel Corpus
Vincent Vandeghinste
Proceedings of Translating and the Computer 29

pdf
Demonstration of the Dutch-to-English METIS-II MT system
Peter Dirix | Vincent Vandeghinste | Ineke Schuurman
Proceedings of the 11th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages: Papers

2006

pdf abs
Transferring PoS-tagging and lemmatization tools from spoken to written Dutch corpus development
Antal van den Bosch | Ineke Schuurman | Vincent Vandeghinste
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

We describe a case study in the reuse and transfer of tools in language resource development, from a corpus of spoken Dutch to a corpus of written Dutch. Once tools for a particular language have been developed, it is logical, but not trivial to reuse them for other types or registers of the language than the tools were originally designed for. This paper reviews the decisions and adaptations necessary to make this particular transfer from spoken to written language, focusing on a part-of-speech tagger and a lemmatizer. While the lemmatizer can be transferred fairly straightforwardly, the tagger needs to be adaptated considerably. We show how it can be adapted without starting from scratch. We describe how the part-of-speech tagset was adapted and how the tagger was retrained to deal with written-text phenomena it had not been trained on earlier.

pdf abs
METIS-II: Machine Translation for Low Resource Languages
Vincent Vandeghinste | Ineke Schuurman | Michael Carl | Stella Markantonatou | Toni Badia
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper we describe a machine translation prototype in which we use only minimal resources for both the source and the target language. A shallow source language analysis, combined with a translation dictionary and a mapping system of source language phenomena into the target language and a target language corpus for generation are all the resources needed in the described system. Several approaches are presented.

pdf abs
Syntactic Annotation of Large Corpora in STEVIN
Gertjan van Noord | Ineke Schuurman | Vincent Vandeghinste
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The construction of a 500-million-word reference corpus of written Dutch has been identified as one of the priorities in the Dutch/Flemish STEVIN programme. For part of this corpus, manually corrected syntactic annotations will be provided. The paper presents the background of the syntactic annotation efforts, the Alpino parser which is used as an important tool for constructing the syntactic annotations, as well as a number of other annotation tools and guidelines. For the full STEVIN corpus, automatically derived syntactic annotations will be provided in a later phase of the programme. A number of arguments is provided suggesting that such a resource can be very useful for applications in information extraction, ontology building, lexical acquisition, machine translation and corpus linguistics.

2005

pdf abs
METISII: Example-based Machine Translation Using Monolingual CorporaSystem Description
Peter Dirix | Ineke Schuurman | Vincent Vandeghinste
Workshop on example-based machine translation

The METIS-II project is an example-based machine translation system, making use of minimal resources and tools for both source and target language, making use of a target-language (TL) corpus, but not of any parallel corpora. In the current paper, we discuss the view of our team on the general philosophy and outline of the METIS-II system.

pdf abs
Example-based Translation Without Parallel Corpora: First Experiments on a Prototype
Vincent Vandeghinste | Peter Dirix | Ineke Schuurman
Workshop on example-based machine translation

For the METIS-II project (IST, start: 10-2004 – end: 09-2007) we are working on an example-based machine translation system, making use of minimal resources and tools for both source and target language, i.e. making use of a target language corpus, but not of any parallel corpora. In the current paper, we present the results of the first experiments with our approach (CCL) within the METIS consortium : the translation of noun phrases from Dutch to English, using the British National Corpus as a target language corpus. Future research is planned along similar lines for the sentence as is presented here for the noun phrase.