Gosse Bouma


2024

Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024
Archna Bhatia | Gosse Bouma | A. Seza Doğruöz | Kilian Evang | Marcos Garcia | Voula Giouli | Lifeng Han | Joakim Nivre | Alexandre Rademaker
Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024

2022

UDapter: Typology-based Language Adapters for Multilingual Dependency Parsing and Sequence Labeling
Ahmet Üstün | Arianna Bisazza | Gosse Bouma | Gertjan van Noord
Computational Linguistics, Volume 48, Issue 3 - September 2022

Recent advances in multilingual language modeling have brought the idea of a truly universal parser closer to reality. However, such models are still not immune to the “curse of multilinguality”: Cross-language interference and restrained model capacity remain major obstacles. To address this, we propose a novel language adaptation approach by introducing contextual language adapters to a multilingual parser. Contextual language adapters make it possible to learn adapters via language embeddings while sharing model parameters across languages based on contextual parameter generation. Moreover, our method allows for an easy but effective integration of existing linguistic typology features into the parsing model. Because not all typological features are available for every language, we further combine typological feature prediction with parsing in a multi-task model that achieves very competitive parsing performance without the need for an external prediction system for missing features. The resulting parser, UDapter, can be used for dependency parsing as well as sequence labeling tasks such as POS tagging, morphological tagging, and NER. In dependency parsing, it outperforms strong monolingual and multilingual baselines on the majority of both high-resource and low-resource (zero-shot) languages, showing the success of the proposed adaptation approach. In sequence labeling tasks, our parser surpasses the baseline on high-resource languages, and performs very competitively in a zero-shot setting. Our in-depth analyses show that adapter generation via typological features of languages is key to this success.

Hyper-X: A Unified Hypernetwork for Multi-Task Multilingual Transfer
Ahmet Üstün | Arianna Bisazza | Gosse Bouma | Gertjan van Noord | Sebastian Ruder
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Massively multilingual models are promising for transfer learning across tasks and languages. However, existing methods are unable to fully leverage training data when it is available in different task-language combinations. To exploit such heterogeneous supervision, we propose Hyper-X, a single hypernetwork that unifies multi-task and multilingual learning with efficient adaptation. It generates weights for adapter modules conditioned on both task and language embeddings. By learning to combine task-specific and language-specific knowledge, our model enables zero-shot transfer for unseen languages and task-language combinations. Our experiments on a diverse set of languages demonstrate that Hyper-X achieves the best or competitive gain when a mixture of multiple resources is available, while performing on par with strong baselines in the standard scenario. Hyper-X is also considerably more efficient in terms of parameters and resources compared to methods that train separate adapters. Finally, Hyper-X consistently produces strong results in few-shot scenarios for new languages, showing the versatility of our approach beyond zero-shot transfer.

PoS Tagging, Lemmatization and Dependency Parsing of West Frisian
Wilbert Heeringa | Gosse Bouma | Martha Hofman | Jelle Brouwer | Eduard Drenth | Jan Wijffels | Hans Van de Velde
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We present a lemmatizer/PoS tagger/dependency parser for West Frisian using a corpus of 44,714 words in 3,126 sentences that were annotated according to the guidelines of Universal Dependencies version 2. PoS tags were assigned to words by using a Dutch PoS tagger that was applied to a Dutch word-by-word translation, or to sentences of a Dutch parallel text. Best results were obtained when using word-by-word translations that were created by using the previous version of the Frisian translation program Oersetter. Morphological and syntactic annotations were generated on the basis of a Dutch word-by-word translation as well. The performance of the lemmatizer/tagger/annotator when it was trained using default parameters was compared to the performance that was obtained when using the parameter values that were used for training the LassySmall UD 2.5 corpus. We study the effects of different hyperparameter settings on the accuracy of the annotation pipeline. The Frisian lemmatizer/PoS tagger/dependency parser is released as a web app and as a web service.

2021

Proceedings of the 17th International Conference on Parsing Technologies and the IWPT 2021 Shared Task on Parsing into Enhanced Universal Dependencies (IWPT 2021)
Stephan Oepen | Kenji Sagae | Reut Tsarfaty | Gosse Bouma | Djamé Seddah | Daniel Zeman
Proceedings of the 17th International Conference on Parsing Technologies and the IWPT 2021 Shared Task on Parsing into Enhanced Universal Dependencies (IWPT 2021)

From Raw Text to Enhanced Universal Dependencies: The Parsing Shared Task at IWPT 2021
Gosse Bouma | Djamé Seddah | Daniel Zeman
Proceedings of the 17th International Conference on Parsing Technologies and the IWPT 2021 Shared Task on Parsing into Enhanced Universal Dependencies (IWPT 2021)

We describe the second IWPT task on end-to-end parsing from raw text to Enhanced Universal Dependencies. We provide details about the evaluation metrics and the datasets used for training and evaluation. We compare the approaches taken by participating teams and discuss the results of the shared task, also in comparison with the first edition of this task.

2020

UDapter: Language Adaptation for Truly Universal Dependency Parsing
Ahmet Üstün | Arianna Bisazza | Gosse Bouma | Gertjan van Noord
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Recent advances in multilingual dependency parsing have brought the idea of a truly universal parser closer to reality. However, cross-language interference and restrained model capacity remain major obstacles. To address this, we propose a novel multilingual task adaptation approach based on contextual parameter generation and adapter modules. This approach makes it possible to learn adapters via language embeddings while sharing model parameters across languages. It also allows for an easy but effective integration of existing linguistic typology features into the parsing network. The resulting parser, UDapter, outperforms strong monolingual and multilingual baselines on the majority of both high-resource and low-resource (zero-shot) languages, showing the success of the proposed adaptation approach. Our in-depth analyses show that soft parameter sharing via typological features is key to this success.

Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies
Gosse Bouma | Yuji Matsumoto | Stephan Oepen | Kenji Sagae | Djamé Seddah | Weiwei Sun | Anders Søgaard | Reut Tsarfaty | Dan Zeman
Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies

Overview of the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies
Gosse Bouma | Djamé Seddah | Daniel Zeman
Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies

This overview introduces the task of parsing into enhanced universal dependencies and describes the datasets used for training and evaluation as well as the evaluation metrics. We outline various approaches and discuss the results of the shared task.

2019

Cross-Lingual Word Embeddings for Morphologically Rich Languages
Ahmet Üstün | Gosse Bouma | Gertjan van Noord
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Cross-lingual word embedding models learn a shared vector space for two or more languages so that words with similar meaning are represented by similar vectors regardless of their language. Although the existing models achieve high performance on pairs of morphologically simple languages, they perform very poorly on morphologically rich languages such as Turkish and Finnish. In this paper, we propose a morpheme-based model in order to increase the performance of cross-lingual word embeddings on morphologically rich languages. Our model includes a simple extension which enables us to exploit morphemes for cross-lingual mapping. We applied our model to the Turkish-Finnish language pair on the bilingual word translation task. Results show that our model outperforms the baseline models by 2% in the nearest neighbour ranking.

Multi-Team: A Multi-attention, Multi-decoder Approach to Morphological Analysis.
Ahmet Üstün | Rob van der Goot | Gosse Bouma | Gertjan van Noord
Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology

This paper describes our submission to SIGMORPHON 2019 Task 2: Morphological analysis and lemmatization in context. Our model is a multi-task sequence-to-sequence neural network, which jointly learns morphological tagging and lemmatization. On the encoding side, we exploit character-level as well as contextual information. We introduce a multi-attention decoder to selectively focus on different parts of character and word sequences. To further improve the model, we train on multiple datasets simultaneously and use external embeddings for initialization. Our final model reaches an average morphological tagging F1 score of 94.54 and a lemma accuracy of 93.91 on the test data, ranking 3rd and 6th, respectively, out of 13 teams in the SIGMORPHON 2019 shared task.

2018

Expletives in Universal Dependency Treebanks
Gosse Bouma | Jan Hajic | Dag Haug | Joakim Nivre | Per Erik Solberg | Lilja Øvrelid
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)

Although treebanks annotated according to the guidelines of Universal Dependencies (UD) now exist for many languages, the goal of annotating the same phenomena in a cross-linguistically consistent fashion is not always met. In this paper, we investigate one phenomenon where we believe such consistency is lacking, namely expletive elements. Such elements occupy a position that is structurally associated with a core argument (or sometimes an oblique dependent), yet are non-referential and semantically void. Many UD treebanks identify at least some elements as expletive, but the range of phenomena differs between treebanks, even for closely related languages, and sometimes even for different treebanks for the same language. In this paper, we present criteria for identifying expletives that are applicable across languages and compatible with the goals of UD, give an overview of expletives as found in current UD treebanks, and present recommendations for the annotation of expletives so that more consistent annotation can be achieved in future releases.

2017

Increasing Return on Annotation Investment: The Automatic Construction of a Universal Dependency Treebank for Dutch
Gosse Bouma | Gertjan van Noord
Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017)

2013

The Automatic Identification of Discourse Units in Dutch Text
Nynke van der Vliet | Gosse Bouma | Gisela Redeker
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)

2012

Multi-Layer Discourse Annotation of a Dutch Text Corpus
Gisela Redeker | Ildikó Berzlánovich | Nynke van der Vliet | Gosse Bouma | Markus Egg
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We have compiled a corpus of 80 Dutch texts from expository and persuasive genres, which we annotated for rhetorical and genre-specific discourse structure, and lexical cohesion with the goal of creating a gold standard for further research. The annotations are based on a segmentation of the text in elementary discourse units that takes into account cues from syntax and punctuation. During the labor-intensive discourse-structure annotation (RST analysis), we took great care to thoroughly reconcile the initial analyses. That process and the availability of two independent initial analyses for each text allow us to analyze our disagreements and to assess the confusability of RST relations, and thereby improve the annotation guidelines and gather evidence for the classification of these relations into larger groups. We are using this resource for corpus-based studies of discourse relations, discourse markers, cohesion, and genre differences, e.g., the question of how discourse structure and lexical cohesion interact for different genres in the overall organization of texts. We are also exploring automatic text segmentation and semi-automatic discourse annotation.

2010

On Learning Subtypes of the Part-Whole Relation: Do Not Mix Your Seeds
Ashwin Ittoo | Gosse Bouma
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

Cross-lingual Ontology Alignment using EuroWordNet and Wikipedia
Gosse Bouma
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper describes a system for linking the thesaurus of the Netherlands Institute for Sound and Vision to English WordNet and DBpedia. The thesaurus contains subject (concept) terms, and names of persons, locations, and miscellaneous names. We used EuroWordNet, a multilingual wordnet, and Dutch Wikipedia as intermediaries for the two alignments. EuroWordNet covers most of the subject terms in the thesaurus, but the organization of the cross-lingual links makes selection of the most appropriate English target term almost impossible. Precision and recall of the automatic alignment with WordNet for subject terms is 0.59. Using page titles, redirects, disambiguation pages, and anchor text harvested from Dutch Wikipedia gives reasonable performance on subject terms and geographical terms. Many person and miscellaneous names in the thesaurus could not be located in (Dutch or English) Wikipedia. Precision for miscellaneous names, subjects, persons and locations for the alignment with Wikipedia ranges from 0.63 to 0.94, while recall for subject terms is 0.62.

2009

Parsed Corpora for Linguistics
Gertjan van Noord | Gosse Bouma
Proceedings of the EACL 2009 Workshop on the Interaction between Linguistics and Computational Linguistics: Virtuous, Vicious or Vacuous?

Cross-lingual Alignment and Completion of Wikipedia Templates
Gosse Bouma | Sergio Duarte | Zahurul Islam
Proceedings of the Third International Workshop on Cross Lingual Information Access: Addressing the Information Need of Multilingual Societies (CLIAWS3)

2008

A Coreference Corpus and Resolution System for Dutch
Iris Hendrickx | Gosse Bouma | Frederik Coppens | Walter Daelemans | Veronique Hoste | Geert Kloosterman | Anne-Marie Mineur | Joeri Van Der Vloet | Jean-Luc Verschelde
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We present the main outcomes of the COREA project: a corpus annotated with coreferential relations and a coreference resolution system for Dutch. In the project we developed annotation guidelines for coreference resolution for Dutch and annotated a corpus of 135K tokens. We discuss these guidelines, the annotation tool, and the inter-annotator agreement. We also show a visualization of the annotated relations. The standard approach to evaluate a coreference resolution system is to compare the predictions of the system to a hand-annotated gold standard test set (cross-validation). A more practically oriented evaluation is to test the usefulness of coreference relation information in an NLP application. We run experiments with an Information Extraction module for the medical domain, and measure the performance of this module with and without the coreference relation information. We present the results of both this application-oriented evaluation of our system and of a standard cross-validation evaluation. In a separate experiment we also evaluate the effect of coreference information produced by a simple rule-based coreference module in a Question Answering application.

2007

Mining Syntactically Annotated Corpora with XQuery
Gosse Bouma | Geert Kloosterman
Proceedings of the Linguistic Annotation Workshop

2006

Linguistic Knowledge and Question Answering
Gosse Bouma
Proceedings of the Workshop KRAQ’06: Knowledge and Reasoning for Language Processing

Learning to Identify Definitions using Syntactic Features
Ismail Fahmi | Gosse Bouma
Proceedings of the Workshop on Learning Structured Information in Natural Language Applications

2005

Automatic Acquisition of Lexico-semantic Knowledge for QA
Lonneke van der Plas | Gosse Bouma
Proceedings of OntoLex 2005 - Ontologies and Lexical Resources

2004

A New Approach to the Corpus-based Statistical Investigation of Hungarian Multi-word Lexemes
Balázs Kis | Begoña Villada | Gosse Bouma | Gábor Ugray | Tamás Bíró | Gábor Pohl | John Nerbonne
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

2002

Querying Dependency Treebanks in XML
Gosse Bouma | Geert Kloosterman
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

2000

A Finite State and Data-Oriented Method for Grapheme to Phoneme Conversion
Gosse Bouma
1st Meeting of the North American Chapter of the Association for Computational Linguistics

1999

A Modern Computational Linguistics Course Using Dutch
Gosse Bouma
EACL 1999: Computer and Internet Supported Education in Language and Speech Technology

1997

Grammatical analysis in the OVIS spoken-dialogue system
Mark-Jan Nederhof | Gosse Bouma | Rob Koeling | Gertjan van Noord
Interactive Spoken Dialog Systems: Bringing Speech and NLP Together in Real Applications

Hdrug. A Flexible and Extendible Development Environment for Natural Language Processing.
Gertjan van Noord | Gosse Bouma
Computational Environments for Grammar Development and Linguistic Engineering

1994

Adjuncts and the Processing of Lexical Rules
Gertjan van Noord | Gosse Bouma
COLING 1994 Volume 1: The 15th International Conference on Computational Linguistics

Constraint-Based Categorial Grammar
Gosse Bouma | Gertjan van Noord
32nd Annual Meeting of the Association for Computational Linguistics

1993

Head-driven Parsing for Lexicalist Grammars: Experimental Results
Gosse Bouma | Gertjan van Noord
Sixth Conference of the European Chapter of the Association for Computational Linguistics

1992

A Lexicalist Account of Icelandic Case Marking
Gosse Bouma
COLING 1992 Volume 1: The 14th International Conference on Computational Linguistics

Feature Structures and Nonmonotonicity
Gosse Bouma
Computational Linguistics, Volume 18, Number 2, Special Issue on Inheritance: I

1991

Prediction in Chart Parsing Algorithms for Categorial Unification Grammar
Gosse Bouma
Fifth Conference of the European Chapter of the Association for Computational Linguistics

1990

Defaults in Unification Grammar
Gosse Bouma
28th Annual Meeting of the Association for Computational Linguistics

1989

Efficient Processing of Flexible Categorial Grammar
Gosse Bouma
Fourth Conference of the European Chapter of the Association for Computational Linguistics