Vincent Vandeghinste

2021

Extending a Text-to-Pictograph System to French and to Arasaac
Magali Norré | Vincent Vandeghinste | Pierrette Bouillon | Thomas François
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

We present an adaptation of the Text-to-Picto system, initially designed for Dutch, and extended to English and Spanish. The original system, aimed at people with an intellectual disability, automatically translates text into pictographs (Sclera and Beta). We extend it to French and add a large set of Arasaac pictographs linked to WordNet 3.1. To carry out this adaptation, we automatically link the pictographs and their metadata to synsets of two French WordNets and leverage this information to translate words into pictographs. We automatically and manually evaluate our system with different corpora corresponding to different use cases, including one for medical communication between doctors and patients. The system is also compared to similar systems in other languages.

2016

Compared to well-resourced languages such as English and Dutch, natural language processing (NLP) tools for Afrikaans are still not abundant. In the context of the AfriBooms project, KU Leuven and the North-West University collaborated to develop a first, small treebank, a dependency parser, and an easy-to-use online linguistic search engine for Afrikaans for use by researchers and students in the humanities and social sciences. The search tool is based on a similar development for Dutch, i.e. GrETEL, a user-friendly search engine which allows users to query a treebank by means of a natural language example instead of a formal search instruction.

Poly-GrETEL: Cross-Lingual Example-based Querying of Syntactic Constructions
Liesbeth Augustinus | Vincent Vandeghinste | Tom Vanallemeersch
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present Poly-GrETEL, an online tool which enables syntactic querying in parallel treebanks, based on the monolingual GrETEL environment. We provide online access to the Europarl parallel treebank for Dutch and English, allowing users to query the treebank using either an XPath expression or an example sentence in order to look for similar constructions. We provide automatic alignments between the nodes. By combining example-based query functionality with node alignments, we limit the need for users to be familiar with the query language and the structure of the trees in the source and target language, thus facilitating the use of parallel corpora for comparative linguistics and translation studies.

Improving Text-to-Pictograph Translation Through Word Sense Disambiguation
Leen Sevens | Gilles Jacobs | Vincent Vandeghinste | Ineke Schuurman | Frank Van Eynde
Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics

2015

Semantics-based pretranslation for SMT using fuzzy matches
Tom Vanallemeersch | Vincent Vandeghinste
Proceedings of the Ninth Workshop on Syntax, Semantics and Structure in Statistical Translation

Natural Language Generation from Pictographs
Leen Sevens | Vincent Vandeghinste | Ineke Schuurman | Frank Van Eynde
Proceedings of the 15th European Workshop on Natural Language Generation (ENLG)

Assessing linguistically aware fuzzy matching in translation memories
Tom Vanallemeersch | Vincent Vandeghinste
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

Extending a Dutch Text-to-Pictograph Converter to English and Spanish
Leen Sevens | Vincent Vandeghinste | Ineke Schuurman | Frank Van Eynde
Proceedings of SLPAT 2015: 6th Workshop on Speech and Language Processing for Assistive Technologies

2014

Linking Pictographs to Synsets: Sclera2Cornetto
Vincent Vandeghinste | Ineke Schuurman
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Social inclusion of people with Intellectual and Developmental Disabilities can be promoted by offering them ways to independently use the internet. People with reading or writing disabilities can use pictographs instead of text. We present a resource in which we have linked a set of 5710 pictographs to lexical-semantic concepts in Cornetto, a Wordnet-like database for Dutch. We show that, by using this resource in a text-to-pictograph translation system, we can greatly improve the coverage compared with a baseline where words are converted into pictographs only if the word equals the filename.

Improving fuzzy matching through syntactic knowledge
Tom Vanallemeersch | Vincent Vandeghinste
Proceedings of Translating and the Computer 36

Improving the Precision of Synset Links Between Cornetto and Princeton WordNet
Leen Sevens | Vincent Vandeghinste | Frank Van Eynde
Proceedings of Workshop on Lexical and Grammatical Resources for Language Processing

2013

Example-Based Treebank Querying with GrETEL – Now Also for Spoken Dutch
Liesbeth Augustinus | Vincent Vandeghinste | Ineke Schuurman | Frank Van Eynde
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)

2012

Example-Based Treebank Querying
Liesbeth Augustinus | Vincent Vandeghinste | Frank Van Eynde
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The recent construction of large linguistic treebanks for spoken and written Dutch (e.g. CGN, LASSY, Alpino) has created new and exciting opportunities for the empirical investigation of Dutch syntax and semantics. However, the exploitation of those treebanks requires knowledge of specific data structures and query languages such as XPath. Linguists who are unfamiliar with formal languages are often reluctant to learn such a language. In order to make treebank querying more attractive for non-technical users, we developed GrETEL (Greedy Extraction of Trees for Empirical Linguistics), a query engine in which linguists can use natural language examples as a starting point for searching the LASSY treebank without knowledge of tree representations or formal query languages. By allowing linguists to search for constructions similar to the example they provide, we hope to bridge the gap between traditional and computational linguistics. Two case studies are conducted to provide a concrete demonstration of the tool. The architecture of the tool is optimised for searching the LASSY treebank, but the approach can be adapted to other treebank layouts.

Large aligned treebanks for syntax-based machine translation
Gideon Kotzé | Vincent Vandeghinste | Scott Martens | Jörg Tiedemann
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present a collection of parallel treebanks that have been automatically aligned on both the terminal and the nonterminal constituent level for use in syntax-based machine translation. We describe how they were constructed and applied to a syntax- and example-based machine translation system called Parse and Corpus-Based Machine Translation (PaCo-MT). For the language pair Dutch to English, we present evaluation scores of both the nonterminal constituent alignments and the MT system itself, and in the latter case, compare them with those of Moses, a current state-of-the-art statistical MT system, when trained on the same data.

2011

Proceedings of the 15th Annual conference of the European Association for Machine Translation
Mikel L. Forcada | Heidi Depraetere | Vincent Vandeghinste
Proceedings of the 15th Annual conference of the European Association for Machine Translation

SMT-CAT integration in a Technical Domain: Handling XML Markup Using Pre & Post-processing Methods
Arda Tezcan | Vincent Vandeghinste
Proceedings of the 15th Annual conference of the European Association for Machine Translation

2010

Cultural Aspects of Spatiotemporal Analysis in Multilingual Applications
Ineke Schuurman | Vincent Vandeghinste
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper we want to point out some issues arising when a natural language processing task involves several languages (like multilingual, multidocument summarization and the machine translation aspects involved) which are often neglected. These issues are of a more cultural nature, and may even come into play when several documents in a single language are involved. We pay special attention to those aspects dealing with the spatiotemporal characteristics of a text. Correct automatic selection of (parts of) texts such as handling the same eventuality, presupposes spatiotemporal disambiguation at a rather specific level. The same holds for the analysis of the query. For generation and translation purposes, spatiotemporal aspects may be relevant as well. At the moment English (both the British and American variants) and Dutch (the Flemish and Dutch variants) are covered, all taking into account the perspective of a contemporary, Flemish user. In our approach the cultural aspects associated with, for example, the language of publication and the language used by the user play a crucial role.

Bottom-up Transfer in Example-based Machine Translation
Vincent Vandeghinste | Scott Martens
Proceedings of the 14th Annual conference of the European Association for Machine Translation

An Efficient, Generic Approach to Extracting Multi-Word Expressions from Dependency Trees
Scott Martens | Vincent Vandeghinste
Proceedings of the 2010 Workshop on Multiword Expressions: from Theory to Applications

2009

Tree-Based Target Language Modeling
Vincent Vandeghinste
Proceedings of the 13th Annual conference of the European Association for Machine Translation

2008

The computational linguistics community in The Netherlands and Belgium has long recognized the dire need for a major reference corpus of written Dutch. In part to answer this need, the STEVIN programme was established. To pave the way for the effective building of a 500-million-word reference corpus of written Dutch, a pilot project was established. The Dutch Corpus Initiative project or D-Coi was highly successful in that it not only realized about 10% of the projected large reference corpus, but also established the best practices and developed all the protocols and the necessary tools for building the larger corpus within the confines of a necessarily limited budget. We outline the steps involved in an endeavour of this kind, including the major highlights and possible pitfalls. Once converted to a suitable XML format, further linguistic annotation based on the state-of-the-art tools developed either before or during the pilot by the consortium partners proved easily and fruitfully applicable. Linguistic enrichment of the corpus includes PoS tagging, syntactic parsing and semantic annotation, involving both semantic role labeling and spatiotemporal annotation. D-Coi is expected to be followed by SoNaR, during which the 500-million-word reference corpus of Dutch should be built.

In this paper we describe the METIS-II system and its evaluation on each of the language pairs: Dutch, German, Greek, and Spanish to English. The METIS-II system envisaged developing a data-driven approach in which no parallel corpus is required, and in which no full parser or extensive rule sets are needed. We describe evaluation on a development test set and on a test set coming from Europarl, and compare our results with SYSTRAN. We also provide some further analysis, researching the impact of the number and source of the reference translations and analysing the results according to test text type. The results are expectably lower for the METIS system, but not at an unattainable distance from a mature system like SYSTRAN.

2007

Removing the distinction between a Translation Memory, a Bilingual Dictionary and a Parallel Corpus
Vincent Vandeghinste
Proceedings of Translating and the Computer 29

Demonstration of the Dutch-to-English METIS-II MT system
Peter Dirix | Vincent Vandeghinste | Ineke Schuurman
Proceedings of the 11th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages: Papers

2006

Transferring PoS-tagging and lemmatization tools from spoken to written Dutch corpus development
Antal van den Bosch | Ineke Schuurman | Vincent Vandeghinste
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

We describe a case study in the reuse and transfer of tools in language resource development, from a corpus of spoken Dutch to a corpus of written Dutch. Once tools for a particular language have been developed, it is logical, but not trivial, to reuse them for other types or registers of the language than the tools were originally designed for. This paper reviews the decisions and adaptations necessary to make this particular transfer from spoken to written language, focusing on a part-of-speech tagger and a lemmatizer. While the lemmatizer can be transferred fairly straightforwardly, the tagger needs to be adapted considerably. We show how it can be adapted without starting from scratch. We describe how the part-of-speech tagset was adapted and how the tagger was retrained to deal with written-text phenomena it had not been trained on earlier.

METIS-II: Machine Translation for Low Resource Languages
Vincent Vandeghinste | Ineke Schuurman | Michael Carl | Stella Markantonatou | Toni Badia
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper we describe a machine translation prototype in which we use only minimal resources for both the source and the target language. A shallow source language analysis, combined with a translation dictionary and a mapping system of source language phenomena into the target language and a target language corpus for generation are all the resources needed in the described system. Several approaches are presented.

Syntactic Annotation of Large Corpora in STEVIN
Gertjan van Noord | Ineke Schuurman | Vincent Vandeghinste
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The construction of a 500-million-word reference corpus of written Dutch has been identified as one of the priorities in the Dutch/Flemish STEVIN programme. For part of this corpus, manually corrected syntactic annotations will be provided. The paper presents the background of the syntactic annotation efforts, the Alpino parser which is used as an important tool for constructing the syntactic annotations, as well as a number of other annotation tools and guidelines. For the full STEVIN corpus, automatically derived syntactic annotations will be provided in a later phase of the programme. A number of arguments is provided suggesting that such a resource can be very useful for applications in information extraction, ontology building, lexical acquisition, machine translation and corpus linguistics.

2005

METIS-II: Example-based Machine Translation Using Monolingual Corpora - System Description
Peter Dirix | Ineke Schuurman | Vincent Vandeghinste
Workshop on example-based machine translation

The METIS-II project is an example-based machine translation system, making use of minimal resources and tools for both source and target language, i.e. making use of a target-language (TL) corpus, but not of any parallel corpora. In the current paper, we discuss the view of our team on the general philosophy and outline of the METIS-II system.

Example-based Translation Without Parallel Corpora: First Experiments on a Prototype
Vincent Vandeghinste | Peter Dirix | Ineke Schuurman
Workshop on example-based machine translation

For the METIS-II project (IST, start: 10-2004 – end: 09-2007) we are working on an example-based machine translation system, making use of minimal resources and tools for both source and target language, i.e. making use of a target language corpus, but not of any parallel corpora. In the current paper, we present the results of the first experiments with our approach (CCL) within the METIS consortium: the translation of noun phrases from Dutch to English, using the British National Corpus as a target language corpus. Future research is planned along similar lines for the sentence as is presented here for the noun phrase.