Antal van den Bosch - ACL Anthology

Antal van den Bosch

Also published as: Antal Van Den Bosch, Antal Van den Bosch

2024

Re-evaluating the Tomes for the Times
Ryan Brate | Marieke van Erp | Antal van den Bosch
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Literature is to some degree a snapshot of the time it was written in and the societal attitudes of the time. Not all depictions are pleasant or in line with modern-day sensibilities; this becomes problematic when the prevalent depictions over a large body of work are negatively biased, leading to their normalisation. Many much-loved and much-read classics are set in periods of heightened social inequality: slavery, the era before women’s rights movements, colonialism, etc. In this paper, we exploit known text co-occurrence metrics with respect to token-level contexts to identify prevailing themes associated with known problematic descriptors. We see that prevalent, negative depictions are perpetuated by classic literature. We propose that such a methodology could form the basis of a system for making such problematic associations explicit for interested parties, such as sensitivity coordinators of publishing houses, library curators, or organisations concerned with social justice.

A Bayesian Quantification of Aporophobia and the Aggravating Effect of Low–Wealth Contexts on Stigmatization
Ryan Brate | Marieke Van Erp | Antal Van Den Bosch
Proceedings of the 8th Workshop on Online Abuse and Harms (WOAH 2024)

Aporophobia, a negative social bias against poverty and the poor, has been highlighted as an overlooked phenomenon in toxicity detection in texts. Aporophobia is potentially important both as a standalone form of toxicity, but also given its potential as an aggravating factor in the wider stigmatization of groups. As yet, there has been limited quantification of this phenomenon. In this paper, we first quantify the extent of aporophobia, as observable in Reddit data: contrasting estimates of stigmatising topic propensity between low–wealth contexts and high–wealth contexts via Bayesian estimation. Next, we consider aporophobia as a causal factor in the prejudicial association of groups with stigmatising topics, by introducing people group as a variable, specifically Black people. This group is selected given its history of being the subject of toxicity. We evaluate the aggravating effect on the observed n–grams indicative of stigmatised topics observed in comments which refer to Black people, due to the presence of low–wealth contexts. We perform this evaluation via a Structural Causal Modelling approach, performing interventions on simulations via Bayesian models, for three hypothesised causal mechanisms.

2023

Contextual Profiling of Charged Terms in Historical Newspapers
Ryan Brate | Marieke Van Erp | Antal Van den Bosch
Proceedings of the 4th Conference on Language, Data and Knowledge

2022

Detecting Minority Arguments for Mutual Understanding: A Moderation Tool for the Online Climate Change Debate
Cedric Waterschoot | Ernst van den Hemel | Antal van den Bosch
Proceedings of the 29th International Conference on Computational Linguistics

Moderating user comments and promoting healthy understanding is a challenging task, especially in the context of polarized topics such as climate change. We propose a moderation tool to assist moderators in promoting mutual understanding in regard to this topic. The approach is twofold. First, we train classifiers to label incoming posts for the arguments they entail, with a specific focus on minority arguments. We apply active learning to further supplement the training data with rare arguments. Second, we dive deeper into singular arguments and extract the lexical patterns that distinguish each argument from the others. Our findings indicate that climate change arguments form clearly separable clusters in the embedding space. These classes are characterized by their own unique lexical patterns that provide a quick insight into an argument’s key concepts. Additionally, supplementing our training data was necessary for our classifiers to be able to adequately recognize rare arguments. We argue that this detailed rundown of each argument provides insight into where others are coming from. These computational approaches can be part of the toolkit for content moderators and researchers struggling with polarized topics.

Understanding Narratives from Demographic Survey Data: a Comparative Study with Multiple Neural Topic Models
Xiao Xu | Gert Stulp | Antal Van Den Bosch | Anne Gauthier
Proceedings of the Fifth Workshop on Natural Language Processing and Computational Social Science (NLP+CSS)

Fertility intentions as verbalized in surveys are a poor predictor of actual fertility outcomes, the number of children people have. This can partly be explained by the uncertainty people have in their intentions. Such uncertainties are hard to capture through traditional survey questions, although open-ended questions can be used to get insight into people’s subjective narratives of the future that determine their intentions. Analyzing such answers to open-ended questions can be done through Natural Language Processing techniques. Traditional topic models (e.g., LSA and LDA), however, often fail to do so since they rely on co-occurrences, which are often rare in short survey responses. The aim of this study was to apply and evaluate topic models on demographic survey data. In this study, we applied neural topic models (e.g. BERTopic, CombinedTM) based on language models to responses from Dutch women on their fertility plans, and compared the topics and their coherence scores from each model to expert judgments. Our results show that neural models produce topics more in line with human interpretation compared to LDA. However, the coherence score could only partly reflect this, depending on the corpus used for calculation. This research is important because, first, it helps us develop more informed strategies on model selection and evaluation for topic modeling on survey data; and second, it shows that the field of demography has much to gain from adopting NLP methods.

Correlating Political Party Names in Tweets, Newspapers and Election Results
Eric Sanders | Antal van den Bosch
Proceedings of the LREC 2022 workshop on Natural Language Processing for Political Sciences

Twitter has been used as a textual resource to attempt to predict the outcome of elections for over a decade. A body of literature suggests that this is not consistently possible. In this paper we test the hypothesis that mentions of political parties in tweets are better correlated with the appearance of party names in newspapers than with the intention of the tweeter to vote for that party. Five Dutch national elections are used in this study. We find only a small, negligible positive difference in Pearson’s correlation coefficient as well as in the absolute error of the relation between tweets and news, and between tweets and elections. However, we find a larger correlation and a smaller absolute error between party mentions in newspapers and the outcome of the elections in four of the five elections. This suggests that newspapers are a better starting point for predicting the election outcome than tweets.

2020

Less is Better: A cognitively inspired unsupervised model for language segmentation
Jinbiao Yang | Stefan L. Frank | Antal van den Bosch
Proceedings of the Workshop on the Cognitive Aspects of the Lexicon

Language users process utterances by segmenting them into many cognitive units, which vary in their sizes and linguistic levels. Although we can do such unitization/segmentation easily, its cognitive mechanism is still not clear. This paper proposes an unsupervised model, Less-is-Better (LiB), to simulate the human cognitive process with respect to language unitization/segmentation. LiB follows the principle of least effort and aims to build a lexicon which minimizes the number of unit tokens (alleviating the effort of analysis) and number of unit types (alleviating the effort of storage) at the same time on any given corpus. LiB’s workflow is inspired by empirical cognitive phenomena. The design makes the mechanism of LiB cognitively plausible and the computational requirement light-weight. The lexicon generated by LiB performs the best among different types of lexicons (e.g. ground-truth words) both from an information-theoretical view and a cognitive view, which suggests that the LiB lexicon may be a plausible proxy of the mental lexicon.

Optimising Twitter-based Political Election Prediction with Relevance and Sentiment Filters
Eric Sanders | Antal van den Bosch
Proceedings of the Twelfth Language Resources and Evaluation Conference

We study the relation between the number of mentions of political parties in the last weeks before the elections and the election results. In this paper we focus on the Dutch elections of the parliament in 2012 and for the provinces (and the senate) in 2011 and 2015. With raw counts, without adaptations, we achieve a mean absolute error (MAE) of 2.71% for 2011, 2.02% for 2012 and 2.89% for 2015. A set of over 17,000 tweets containing political party names were annotated by at least three annotators per tweet on ten features denoting communicative intent (including the presence of sarcasm, the message’s polarity, the presence of an explicit voting endorsement or explicit voting advice, etc.). The annotations were used to create oracle (gold-standard) filters. Tweets with or without a certain majority annotation are held out from the tweet counts, with the goal of attaining lower MAEs. With a grid search we tested all combinations of filters and their corresponding MAE to find the best filter ensemble. It appeared that the filters show markedly different behaviour for the three elections and only a small MAE improvement is possible when optimizing on all three elections. Larger improvements for one election are possible, but result in deterioration of the MAE for the other elections.
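
As a rough illustration of the MAE measure mentioned in this abstract, the following minimal sketch converts raw party mention counts into percentage shares and compares them with election result percentages. The party names, counts, and results are invented for illustration, and the assumption that mention shares are compared directly with vote shares is ours, not a detail confirmed by the abstract.

```python
def mention_shares(counts):
    """Convert raw mention counts per party into percentage shares."""
    total = sum(counts.values())
    return {party: 100.0 * n / total for party, n in counts.items()}


def mean_absolute_error(predicted, actual):
    """Average absolute difference, in percentage points, over all parties."""
    parties = sorted(actual)
    return sum(abs(predicted[p] - actual[p]) for p in parties) / len(parties)


# Hypothetical data: tweet mention counts and election results (in %).
counts = {"A": 500, "B": 300, "C": 200}
results = {"A": 45.0, "B": 35.0, "C": 20.0}

shares = mention_shares(counts)
mae = mean_absolute_error(shares, results)
print(f"MAE: {mae:.2f} percentage points")
```

A filter in the sense of the abstract would change `counts` by holding out some tweets before the shares are computed; the MAE computation itself would stay the same.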

2019

Question Similarity in Community Question Answering: A Systematic Exploration of Preprocessing Methods and Models
Florian Kunneman | Thiago Castro Ferreira | Emiel Krahmer | Antal van den Bosch
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Community Question Answering forums are popular among Internet users, and a basic problem they encounter is trying to find out if their question has already been posed before. To address this issue, NLP researchers have developed methods to automatically detect question-similarity, which was one of the shared tasks in SemEval. The best performing systems for this task made use of Syntactic Tree Kernels or the SoftCosine metric. However, it remains unclear why these methods seem to work, whether their performance can be improved by better preprocessing methods and what kinds of errors they (and other methods) make. In this paper, we therefore systematically combine and compare these two approaches with the more traditional BM25 and translation-based models. Moreover, we analyze the impact of preprocessing steps (lowercasing, suppression of punctuation and stop words removal) and word meaning similarity based on different distributions (word translation probability, Word2Vec, fastText and ELMo) on the performance of the task. We conduct an error analysis to gain insight into the differences in performance between the system set-ups. The implementation is made publicly available from https://github.com/fkunneman/DiscoSumo/tree/master/ranlp.

Simulating Spanish-English Code-Switching: El Modelo Está Generating Code-Switches
Chara Tsoukala | Stefan L. Frank | Antal van den Bosch | Jorge Valdés Kroff | Mirjam Broersma
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics

Multilingual speakers are able to switch from one language to the other (“code-switch”) between or within sentences. Because the underlying cognitive mechanisms are not well understood, in this study we use computational cognitive modeling to shed light on the process of code-switching. We employed the Bilingual Dual-path model, a Recurrent Neural Network of bilingual sentence production (Tsoukala et al., 2017), and simulated sentence production in simultaneous Spanish-English bilinguals. Our first goal was to investigate whether the model would code-switch without being exposed to code-switched training input. The model indeed produced code-switches even without any exposure to such input and the patterns of code-switches are in line with earlier linguistic work (Poplack,1980). The second goal of this study was to investigate an auxiliary phrase asymmetry that exists in Spanish-English code-switched production. Using this cognitive model, we examined a possible cause for this asymmetry. To our knowledge, this is the first computational cognitive model that aims to simulate code-switched sentence production.

Dependency Parsing with your Eyes: Dependency Structure Predicts Eye Regressions During Reading
Alessandro Lopopolo | Stefan L. Frank | Antal van den Bosch | Roel Willems
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics

Backward saccades during reading have been hypothesized to be involved in structural reanalysis, or to be related to the level of text difficulty. We test the hypothesis that backward saccades are involved in online syntactic analysis. If this is the case we expect that saccades will coincide, at least partially, with the edges of the relations computed by a dependency parser. In order to test this, we analyzed a large eye-tracking dataset collected while 102 participants read three short narrative texts. Our results show a relation between backward saccades and the syntactic structure of sentences.

Detecting harassment in real-time as conversations develop
Wessel Stoop | Florian Kunneman | Antal van den Bosch | Ben Miller
Proceedings of the Third Workshop on Abusive Language Online

We developed a machine-learning-based method to detect video game players that harass teammates or opponents in chat earlier in the conversation. This real-time technology would allow gaming companies to intervene during games, for example by issuing warnings or by muting or banning a player. In a proof-of-concept experiment on League of Legends data we compute and visualize evaluation metrics for a machine learning classifier as conversations unfold, and observe that the optimal precision and recall of detecting toxic players at each moment in the conversation depends on the confidence threshold of the classifier: the threshold should start low, and increase as the conversation unfolds. How fast this sliding threshold should increase depends on the training set size.

2018

Aspect-based summarization of pros and cons in unstructured product reviews
Florian Kunneman | Sander Wubben | Antal van den Bosch | Emiel Krahmer
Proceedings of the 27th International Conference on Computational Linguistics

We developed three systems for generating pros and cons summaries of product reviews. Automating this task eases the writing of product reviews, and offers readers quick access to the most important information. We compared SynPat, a system based on syntactic phrases selected on the basis of valence scores, against a neural-network-based system trained to map bag-of-words representations of reviews directly to pros and cons, and the same neural system trained on clusters of word-embedding encodings of similar pros and cons. We evaluated the systems in two ways: first on held-out reviews with gold-standard pros and cons, and second by asking human annotators to rate the systems’ output on relevance and completeness. In the second evaluation, the gold-standard pros and cons were assessed along with the system output. We find that the human-generated summaries are not deemed as significantly more relevant or complete than the SynPat systems; the latter are scored higher than the human-generated summaries on a precision metric. The neural approaches yield a lower performance in the human assessment, and are outperformed by the baseline.

A Multilingual Wikified Data Set of Educational Material
Iris Hendrickx | Eirini Takoulidou | Thanasis Naskos | Katia Lida Kermanidis | Vilelmini Sosoni | Hugo de Vos | Maria Stasimioti | Menno van Zaanen | Panayota Georgakopoulou | Valia Kordoni | Maja Popovic | Markus Egg | Antal van den Bosch
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Discovering the Language of Wine Reviews: A Text Mining Account
Els Lefever | Iris Hendrickx | Ilja Croijmans | Antal van den Bosch | Asifa Majid
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

We present the results and the findings of the Second VarDial Evaluation Campaign on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects. The campaign was organized as part of the fifth edition of the VarDial workshop, collocated with COLING’2018. This year, the campaign included five shared tasks, including two task re-runs – Arabic Dialect Identification (ADI) and German Dialect Identification (GDI) –, and three new tasks – Morphosyntactic Tagging of Tweets (MTT), Discriminating between Dutch and Flemish in Subtitles (DFS), and Indo-Aryan Language Identification (ILI). A total of 24 teams submitted runs across the five shared tasks, and contributed 22 system description papers, which were included in the VarDial workshop proceedings and are referred to in this report.

2017

Exploring Lexical and Syntactic Features for Language Variety Identification
Chris van der Lee | Antal van den Bosch
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)

We present a method to discriminate between texts written in either the Netherlandic or the Flemish variant of the Dutch language. The method draws on a feature bundle representing text statistics, syntactic features, and word n-grams. Text statistics include average word length and sentence length, while syntactic features include ratios of function words and part-of-speech n-grams. The effectiveness of the classifier was measured by classifying Dutch subtitles developed for either Dutch or Flemish television. Several machine learning algorithms were compared as well as feature combination methods in order to find the optimal generalization performance. A machine-learning meta classifier based on AdaBoost attained the best F-score of 0.92.

2016

Sarcastic Soulmates: Intimacy and irony markers in social media messaging
Koen Hallmann | Florian Kunneman | Christine Liebrecht | Antal van den Bosch | Margot van Mulken
Linguistic Issues in Language Technology, Volume 14, 2016 - Modality: Logic, Semantics, Annotation, and Machine Learning

Verbal irony, or sarcasm, presents a significant technical and conceptual challenge when it comes to automatic detection. Moreover, it can be a disruptive factor in sentiment analysis and opinion mining, because it changes the polarity of a message implicitly. Extant methods for automatic detection are mostly based on overt clues to ironic intent such as hashtags, also known as irony markers. In this paper, we investigate whether people who know each other make use of irony markers less often than people who do not know each other. We trained a machine-learning classifier to detect sarcasm in Twitter messages (tweets) that were addressed to specific users, and in tweets that were not addressed to a particular user. Human coders analyzed the top-1000 features found to be most discriminative into ten categories of irony markers. The classifier was also tested within and across the two categories. We find that tweets with a user mention contain fewer irony markers than tweets not addressed to a particular user. Classification experiments confirm that the irony in the two types of tweets is signaled differently. The within-category performance of the classifier is about 91% for both categories, while cross-category experiments yield substantially lower generalization performance scores of 75% and 71%. We conclude that irony markers are used more often when there is less mutual knowledge between sender and receiver. Senders addressing other Twitter users less often use irony markers, relying on mutual knowledge which should lead the receiver to infer ironic intent from more implicit clues. With regard to automatic detection, we conclude that our classifier is able to detect ironic tweets addressed at another user as reliably as tweets that are not addressed at a particular person.

Predicting Liaison: an Example-Based Approach
Antal van den Bosch | Alexander Greefhorst
Traitement Automatique des Langues, Volume 57, Numéro 1 : Varia [Varia]

Enhancing Access to Online Education: Quality Machine Translation of MOOC Content
Valia Kordoni | Antal van den Bosch | Katia Lida Kermanidis | Vilelmini Sosoni | Kostadin Cholakov | Iris Hendrickx | Matthias Huck | Andy Way
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The present work is an overview of the TraMOOC (Translation for Massive Open Online Courses) research and innovation project, a machine translation approach for online educational content. More specifically, video lectures, assignments, and MOOC forum text are automatically translated from English into eleven European and BRIC languages. Unlike previous approaches to machine translation, the output quality in TraMOOC relies on a multimodal evaluation schema that involves crowdsourcing, error type markup, an error taxonomy for translation model comparison, and implicit evaluation via text mining, i.e. entity recognition and its performance comparison between the source and the translated text, and sentiment analysis on the students’ forum posts. Finally, the evaluation output will result in more and better quality in-domain parallel data that will be fed back to the translation engine for higher quality output. The translation service will be incorporated into the Iversity MOOC platform and into the VideoLectures.net digital library portal.

Nederlab: Towards a Single Portal and Research Environment for Diachronic Dutch Text Corpora
Hennie Brugman | Martin Reynaert | Nicoline van der Sijs | René van Stipriaan | Erik Tjong Kim Sang | Antal van den Bosch
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The Nederlab project aims to bring together all digitized texts relevant to the Dutch national heritage, the history of the Dutch language and culture (circa 800 – present) in one user friendly and tool enriched open access web interface. This paper describes Nederlab halfway through the project period and discusses the collections incorporated, back-office processes, system back-end as well as the Nederlab Research Portal end-user web application.

Can Tweets Predict TV Ratings?
Bridget Sommerdijk | Eric Sanders | Antal van den Bosch
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We set out to investigate whether TV ratings and mentions of TV programmes on the Twitter social media platform are correlated. If such a correlation exists, Twitter may be used as an alternative source for estimating viewer popularity. Moreover, the Twitter-based rating estimates may be generated during the programme, or even before. We count the occurrences of programme-specific hashtags in an archive of Dutch tweets of eleven popular TV shows broadcast in the Netherlands in one season, and perform correlation tests. Overall we find a strong correlation of 0.82; the correlation remains strong, 0.79, if tweets are counted a half hour before broadcast time. However, the two most popular TV shows account for most of the positive effect; if we leave out the single and second most popular TV shows, the correlation drops to being moderate to weak. Also, within a TV show, correlations between ratings and tweet counts are mostly weak, while correlations between TV ratings of the previous and next shows are strong. In absence of information on previous shows, Twitter-based counts may be a viable alternative to classic estimation methods for TV ratings. Estimates are more reliable with more popular TV shows.
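
The correlation analysis described in this abstract can be sketched as follows: compute Pearson's correlation between tweet counts and ratings, then recompute it with the most popular shows left out. All numbers below are invented for illustration and are not taken from the paper; the plain-Python `pearson` helper is our own and stands in for whatever statistical test the authors used.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical per-show totals; the last two shows are far more popular.
tweets = [100, 120, 90, 110, 5000, 8000]
ratings = [0.9, 0.8, 1.0, 0.7, 2.5, 3.0]

r_all = pearson(tweets, ratings)
# Leave out the two most popular shows and recompute.
r_rest = pearson(tweets[:4], ratings[:4])
print(f"all shows: {r_all:.2f}, without the two most popular: {r_rest:.2f}")
```

With data shaped like this, a few very popular shows can dominate the overall coefficient, which is the kind of effect the abstract reports when the most popular shows are left out.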

Improving cross-domain n-gram language modelling with skipgrams
Louis Onrust | Antal van den Bosch | Hugo Van hamme
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Very quaffable and great fun: Applying NLP to wine reviews
Iris Hendrickx | Els Lefever | Ilja Croijmans | Asifa Majid | Antal van den Bosch
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Abstractive Compression of Captions with Attentive Recurrent Neural Networks
Sander Wubben | Emiel Krahmer | Antal van den Bosch | Suzan Verberne
Proceedings of the 9th International Natural Language Generation conference

2015

Automatically Identifying Periodic Social Events from Twitter
Florian Kunneman | Antal Van den Bosch
Proceedings of the International Conference Recent Advances in Natural Language Processing

Modeling dative alternations of individual children
Antal van den Bosch | Joan Bresnan
Proceedings of the Sixth Workshop on Cognitive Aspects of Computational Language Learning

2014

Using idiolects and sociolects to improve word prediction
Wessel Stoop | Antal van den Bosch
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics

Creating and using large monolingual parallel corpora for sentential paraphrase generation
Sander Wubben | Antal van den Bosch | Emiel Krahmer
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we investigate the automatic generation of paraphrases by using machine translation techniques. Three contributions we make are the construction of a large paraphrase corpus for English and Dutch, a re-ranking heuristic to use machine translation for paraphrase generation and a proper evaluation methodology. A large parallel corpus is constructed by aligning clustered headlines that are scraped from a news aggregator site. To generate sentential paraphrases we use a standard phrase-based machine translation (PBMT) framework modified with a re-ranking component (henceforth PBMT-R). We demonstrate this approach for Dutch and English and evaluate by using human judgements collected from 76 participants. The judgments are compared to two automatic machine translation evaluation metrics. We observe that as the paraphrases deviate more from the source sentence, the performance of the PBMT-R system degrades less than that of the word substitution baseline system.

Translation Assistance by Translation of L1 Fragments in an L2 Context
Maarten van Gompel | Antal van den Bosch
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

SemEval 2014 Task 5 - L2 Writing Assistant
Maarten van Gompel | Iris Hendrickx | Antal van den Bosch | Els Lefever | Véronique Hoste
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

Proceedings of the 8th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)
Kalliopi Zervanou | Cristina Vertan | Antal van den Bosch | Caroline Sporleder
Proceedings of the 8th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)

Estimating Time to Event from Tweets Using Temporal Expressions
Ali Hürriyetoǧlu | Nelleke Oostdijk | Antal van den Bosch
Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM)

The (Un)Predictability of Emotional Hashtags in Twitter
Florian Kunneman | Christine Liebrecht | Antal van den Bosch
Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM)

2013

WSD2: Parameter optimisation for Memory-based Cross-Lingual Word-Sense Disambiguation
Maarten van Gompel | Antal van den Bosch
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

The perfect solution for detecting sarcasm in tweets #not
Christine Liebrecht | Florian Kunneman | Antal van den Bosch
Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

Using character overlap to improve language transformation
Sander Wubben | Emiel Krahmer | Antal van den Bosch
Proceedings of the 7th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

Memory-based Grammatical Error Correction
Antal van den Bosch | Peter Berck
Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task

2012

The effect of domain and text type on text prediction quality
Suzan Verberne | Antal van den Bosch | Helmer Strik | Lou Boves
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

DutchSemCor: Targeting the ideal sense-tagged corpus
Piek Vossen | Attila Görög | Rubén Izquierdo | Antal van den Bosch
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Word Sense Disambiguation (WSD) systems require large sense-tagged corpora along with lexical databases to reach satisfactory results. The number of English language resources developed for WSD has increased in the past years while most other languages are still under-resourced. The situation is no different for Dutch. In order to overcome this data bottleneck, the DutchSemCor project will deliver a Dutch corpus that is sense-tagged with senses from the Cornetto lexical database. In this paper, we discuss the different conflicting requirements for a sense-tagged corpus and our strategies to fulfill them. We report on a first series of experiments to support our semi-automatic approach to build the corpus.

Sentence Simplification by Monolingual Machine Translation
Sander Wubben | Antal van den Bosch | Emiel Krahmer
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Proceedings of the 6th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities
Kalliopi Zervanou | Antal van den Bosch
Proceedings of the 6th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

Memory-based text correction for preposition and determiner errors
Antal van den Bosch | Peter Berck
Proceedings of the Seventh Workshop on Building Educational Applications Using NLP

Proceedings of the Workshop on Detecting Structure in Scholarly Discourse
Antal Van Den Bosch | Hagit Shatkay
Proceedings of the Workshop on Detecting Structure in Scholarly Discourse

2011

Enrichment and Structuring of Archival Description Metadata
Kalliopi Zervanou | Ioannis Korkontzelos | Antal van den Bosch | Sophia Ananiadou
Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

Comparing Phrase-based and Syntax-based Paraphrase Generation
Sander Wubben | Erwin Marsi | Antal van den Bosch | Emiel Krahmer
Proceedings of the Workshop on Monolingual Text-To-Text Generation

A Link to the Past: Constructing Historical Social Networks
Matje van de Camp | Antal van den Bosch
Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA 2.011)

2010

Supertags as Source Language Context in Hierarchical Phrase-Based SMT
Rejwanul Haque | Sudip Naskar | Antal van den Bosch | Andy Way
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers

Statistical machine translation (SMT) models have recently begun to include source context modeling, under the assumption that the proper lexical choice of the translation for an ambiguous word can be determined from the context in which it appears. Various types of lexical and syntactic features have been explored as effective source context to improve phrase selection in SMT. In the present work, we introduce lexico-syntactic descriptions in the form of supertags as source-side context features in the state-of-the-art hierarchical phrase-based SMT (HPB) model. These features enable us to exploit source similarity in addition to target similarity, as modelled by the language model. In our experiments two kinds of supertags are employed: those from lexicalized tree-adjoining grammar (LTAG) and combinatory categorial grammar (CCG). We use a memory-based classification framework that enables the efficient estimation of these features. Despite the differences between the two supertagging approaches, they give similar improvements. We evaluate the performance of our approach on an English-to-Dutch translation task, and report statistically significant improvements of 4.48% and 6.3% BLEU scores in translation quality when adding CCG and LTAG supertags, respectively, as context-informed features.

Paraphrase Generation as Monolingual Translation: Data and Evaluation
Sander Wubben | Antal van den Bosch | Emiel Krahmer
Proceedings of the 6th International Natural Language Generation Conference

2009

A Constraint Satisfaction Approach to Machine Translation
Sander Canisius | Antal van den Bosch
Proceedings of the 13th Annual Conference of the European Association for Machine Translation

Dependency Parsing and Semantic Role Labeling as a Single Task
Roser Morante | Vincent Van Asch | Antal van den Bosch
Proceedings of the International Conference RANLP-2009

Instance-Driven Discovery of Ontological Relation Labels
Marieke van Erp | Antal van den Bosch | Sander Wubben | Steve Hunt
Proceedings of the EACL 2009 Workshop on Language Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and Education (LaTeCH – SHELT&R 2009)

Clustering and Matching Headlines for Automatic Paraphrase Acquisition
Sander Wubben | Antal van den Bosch | Emiel Krahmer | Erwin Marsi
Proceedings of the 12th European Workshop on Natural Language Generation (ENLG 2009)

Joint Memory-Based Learning of Syntactic and Semantic Dependencies in Multiple Languages
Roser Morante | Vincent Van Asch | Antal van den Bosch
Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009): Shared Task

Comparing Alternative Data-Driven Ontological Vistas of Natural History (short paper)
Marieke van Erp | Piroska Lendvai | Antal van den Bosch
Proceedings of the Eighth International Conference on Computational Semantics

A semantic relatedness metric based on free link structure (short paper)
Sander Wubben | Antal van den Bosch
Proceedings of the Eighth International Conference on Computational Semantics

Dependency Relations as Source Context in Phrase-Based SMT
Rejwanul Haque | Sudip Kumar Naskar | Antal van den Bosch | Andy Way
Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Volume 1

2007

Exploiting source similarity for SMT using context-informed features
Nicolas Stroppa | Antal van den Bosch | Andy Way
Proceedings of the 11th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages: Papers

Letter to the Editor
Walter Daelemans | Antal van den Bosch
Computational Linguistics, Volume 33, Number 1, March 2007

Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics
Annie Zaenen | Antal van den Bosch
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics

ILK: Machine learning of semantic relations with shallow features and almost no data
Iris Hendrickx | Roser Morante | Caroline Sporleder | Antal van den Bosch
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

Proceedings of the Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2007).
Caroline Sporleder | Antal van den Bosch | Claire Grover
Proceedings of the Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2007).

2006

Transferring PoS-tagging and lemmatization tools from spoken to written Dutch corpus development
Antal van den Bosch | Ineke Schuurman | Vincent Vandeghinste
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

We describe a case study in the reuse and transfer of tools in language resource development, from a corpus of spoken Dutch to a corpus of written Dutch. Once tools for a particular language have been developed, it is logical, but not trivial, to reuse them for other types or registers of the language than the tools were originally designed for. This paper reviews the decisions and adaptations necessary to make this particular transfer from spoken to written language, focusing on a part-of-speech tagger and a lemmatizer. While the lemmatizer can be transferred fairly straightforwardly, the tagger needs to be adapted considerably. We show how it can be adapted without starting from scratch. We describe how the part-of-speech tagset was adapted and how the tagger was retrained to deal with written-text phenomena it had not been trained on earlier.

Identifying Named Entities in Text Databases from the Natural History Domain
Caroline Sporleder | Marieke van Erp | Tijn Porcelijn | Antal van den Bosch | Pim Arntzen
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper, we investigate whether it is possible to bootstrap a named entity tagger for textual databases by exploiting the database structure to automatically generate domain and database-specific gazetteer lists. We compare three tagging strategies: (i) using the extracted gazetteers in a look-up tagger, (ii) using the gazetteers to automatically extract training data to train a database-specific tagger, and (iii) using a generic named entity tagger. Our results suggest that automatically built gazetteers in combination with a look-up tagger lead to a relatively good performance and that generic taggers do not perform particularly well on this type of data.

Spotting the ‘Odd-one-out’: Data-Driven Error Detection and Correction in Textual Databases
Caroline Sporleder | Marieke van Erp | Tijn Porcelijn | Antal van den Bosch
Proceedings of the Workshop on Adaptive Text Extraction and Mining (ATEM 2006)

Constraint Satisfaction Inference: Non-probabilistic Global Inference for Sequence Labelling
Sander Canisius | Antal van den Bosch | Walter Daelemans
Proceedings of the Workshop on Learning Structured Information in Natural Language Applications

Dependency Parsing by Inference over High-recall Dependency Predictions
Sander Canisius | Toine Bogers | Antal van den Bosch | Jeroen Geertzen | Erik Tjong Kim Sang
Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X)

Improved morpho-phonological sequence processing with constraint satisfaction inference
Antal van den Bosch | Sander Canisius
Proceedings of the Eighth Meeting of the ACL Special Interest Group on Computational Phonology and Morphology at HLT-NAACL 2006

All-word Prediction as the Ultimate Confusible Disambiguation
Antal van den Bosch
Proceedings of the Workshop on Computationally Hard Problems and Joint Inference in Speech and Language Processing

2005

Improving Sequence Segmentation Learning by Predicting Trigrams
Antal van den Bosch | Walter Daelemans
Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005)

Applying Spelling Error Correction Techniques for Improving Semantic Role Labelling
Erik Tjong Kim Sang | Sander Canisius | Antal van den Bosch | Toine Bogers
Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005)

Memory-Based Morphological Analysis Generation and Part-of-Speech Tagging of Arabic
Erwin Marsi | Antal van den Bosch | Abdelhadi Soudi
Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages

2004

GAMBL, genetic algorithm optimization of memory-based WSD
Bart Decadt | Véronique Hoste | Walter Daelemans | Antal van den Bosch
Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text

Memory-based semantic role labeling: Optimizing features, algorithm, and output
Antal van den Bosch | Sander Canisius | Walter Daelemans | Iris Hendrickx | Erik Tjong Kim Sang
Proceedings of the Eighth Conference on Computational Natural Language Learning (CoNLL-2004) at HLT-NAACL 2004

2003

Learning PP attachment for filtering prosodic phrasing
Olga van Herwijnen | Jacques Terken | Antal van den Bosch | Erwin Marsi
10th Conference of the European Chapter of the Association for Computational Linguistics

Learning to Predict Pitch Accents and Prosodic Boundaries in Dutch
Erwin Marsi | Martin Reynaert | Antal van den Bosch | Walter Daelemans | Véronique Hoste
Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics

Memory-based one-step named-entity recognition: Effects of seed list features, classifier stacking, and unannotated data
Iris Hendrickx | Antal van den Bosch
Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003

Machine Learning for Shallow Interpretation of User Utterances in Spoken Dialogue Systems
Piroska Lendvai | Antal van den Bosch | Emiel Krahmer
Proceedings of the 2003 EACL Workshop on Dialogue Systems: interaction, adaptation and styles of management

2002

Shallow Parsing on the Basis of Words Only: A Case Study
Antal van den Bosch | Sabine Buchholz
Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics

Dutch Word Sense Disambiguation: Optimizing the Localness of Context
Antal van den Bosch | Iris Hendrickx | Veronique Hoste | Walter Daelemans
Proceedings of the ACL-02 Workshop on Word Sense Disambiguation: Recent Successes and Future Directions

Evaluating the results of a memory-based word-expert approach to unrestricted word sense disambiguation
Veronique Hoste | Walter Daelemans | Iris Hendrickx | Antal van den Bosch
Proceedings of the ACL-02 Workshop on Word Sense Disambiguation: Recent Successes and Future Directions

2001

Detecting Problematic Turns in Human-Machine Interactions: Rule-induction Versus Memory-based Learning Approaches
Antal van den Bosch | Emiel Krahmer | Marc Swerts
Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics

Dutch Word Sense Disambiguation: Data and Preliminary Results
Iris Hendrickx | Antal van den Bosch
Proceedings of SENSEVAL-2 Second International Workshop on Evaluating Word Sense Disambiguation Systems

2000

Integrating Seed Names and ngrams for a Named Entity List and Classifier
Sabine Buchholz | Antal van den Bosch
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

Using Induced Rules as Complex Features in Memory-Based Language Learning
Antal van den Bosch
Fourth Conference on Computational Natural Language Learning and the Second Learning Language in Logic Workshop

Single-Classifier Memory-Based Phrase Chunking
Jorn Veenstra | Antal van den Bosch
Fourth Conference on Computational Natural Language Learning and the Second Learning Language in Logic Workshop

1999

Memory-Based Morphological Analysis
Antal van den Bosch | Walter Daelemans
Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics

1998

Modularity in Inductively-Learned Word Pronunciation Systems
Antal van den Bosch | Ton Weijters | Walter Daelemans
New Methods in Language Processing and Computational Natural Language Learning

Do Not Forget: Full Memory in Memory-Based Learning of Word Pronunciation
Antal van den Bosch | Walter Daelemans
New Methods in Language Processing and Computational Natural Language Learning

1993

Data-Oriented Methods for Grapheme-to-Phoneme Conversion
Antal van den Bosch | Walter Daelemans
Sixth Conference of the European Chapter of the Association for Computational Linguistics

Co-authors

Marieke van Erp 7

Sander Canisius 6

Veronique Hoste 5

Katia Lida Kermanidis 5

Valia Kordoni 5

Vilelmini Sosoni 5

Caroline Sporleder 5

Kostadin Cholakov 4

Erik Tjong Kim Sang 4

Menno van Zaanen 4

Stefan L. Frank 3

Panayota Georgakopoulou 3

Maria Gialama 3

Christine Liebrecht 3

Roser Morante 3

Michael Papadopoulos 3

Dimitrios Tsoumakos 3

Kalliopi Zervanou 3

Maarten van Gompel 3

Sabine Buchholz 2

Ilja Croijmans 2

Rejwanul Haque 2

Piroska Lendvai 2

Sudip Kumar Naskar 2

Nelleke Oostdijk 2

Maja Popović 2

Tijn Porcelijn 2

Martin Reynaert 2

Vincent Van Asch 2

Suzan Verberne 2

Chris van der Lee 2

Sophia Ananiadou 1

Mirjam Broersma 1

Hennie Brugman 1

Thiago Castro Ferreira 1

Federico Gaspari 1

Anne Gauthier 1

Jeroen Geertzen 1

Yota Georgakopolou 1

Alexander Greefhorst 1

Stefan Grondelaers 1

Claire Grover 1

Attila Görög 1

Koen Hallmann 1

Matthias Huck 1

Ali Hürriyetoğlu 1

Rubén Izquierdo 1

Ioannis Korkontzelos 1

Bornini Lahiri 1

Nikola Ljubešić 1

Alessandro Lopopolo 1

Shervin Malmasi 1

Joss Moorkens 1

Preslav Nakov 1

Thanasis Naskos 1

Tanja Samardzic 1

Yves Scherrer 1

Ineke Schuurman 1

Rico Sennrich 1

Hagit Shatkay 1

Bridget Sommerdijk 1

Abdelhadi Soudi 1

Dirk Speelman 1

Maria Stasimioti 1

Nicolas Stroppa 1

Eirini Takoulidou 1

Jacques Terken 1

Jörg Tiedemann 1

Chara Tsoukala 1

Jorge Valdés Kroff 1

Vincent Vandeghinste 1

Jorn Veenstra 1

Cristina Vertan 1

Cedric Waterschoot 1

Marcos Zampieri 1

Hugo Van hamme 1

Olga van Herwijnen 1

Margot van Mulken 1

René van Stipriaan 1

Matje van de Camp 1

Ernst van den Hemel 1

Nicoline van der Sijs 1

Venues