Antal van den Bosch

Also published as: Antal Van Den Bosch, Antal Van den Bosch

2024

A Bayesian Quantification of Aporophobia and the Aggravating Effect of Low–Wealth Contexts on Stigmatization
Ryan Brate | Marieke Van Erp | Antal Van Den Bosch
Proceedings of the 8th Workshop on Online Abuse and Harms (WOAH 2024)

Aporophobia, a negative social bias against poverty and the poor, has been highlighted as an overlooked phenomenon in toxicity detection in texts. Aporophobia is potentially important both as a standalone form of toxicity, but also given its potential as an aggravating factor in the wider stigmatization of groups. As yet, there has been limited quantification of this phenomenon. In this paper, we first quantify the extent of aporophobia, as observable in Reddit data: contrasting estimates of stigmatising topic propensity between low–wealth contexts and high–wealth contexts via Bayesian estimation. Next, we consider aporophobia as a causal factor in the prejudicial association of groups with stigmatising topics, by introducing people group as a variable, specifically Black people. This group is selected given its history of being the subject of toxicity. We evaluate the aggravating effect on the observed n–grams indicative of stigmatised topics observed in comments which refer to Black people, due to the presence of low–wealth contexts. We perform this evaluation via a Structural Causal Modelling approach, performing interventions on simulations via Bayesian models, for three hypothesised causal mechanisms.

Re-evaluating the Tomes for the Times
Ryan Brate | Marieke van Erp | Antal van den Bosch
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Literature is to some degree a snapshot of the time it was written in and the societal attitudes of the time. Not all depictions are pleasant or in line with modern-day sensibilities; this becomes problematic when the prevalent depictions over a large body of work are negatively biased, leading to their normalisation. Many much-loved and much-read classics are set in periods of heightened social inequality: slavery, pre-women’s rights movements, colonialism, etc. In this paper, we exploit known text co-occurrence metrics with respect to token-level contexts to identify prevailing themes associated with known problematic descriptors. We see that prevalent, negative depictions are perpetuated by classic literature. We propose that such a methodology could form the basis of a system for making explicit such problematic associations, for interested parties such as sensitivity coordinators of publishing houses, library curators, or organisations concerned with social justice.

2023

Contextual Profiling of Charged Terms in Historical Newspapers
Ryan Brate | Marieke Van Erp | Antal Van den Bosch
Proceedings of the 4th Conference on Language, Data and Knowledge

2022

Understanding Narratives from Demographic Survey Data: a Comparative Study with Multiple Neural Topic Models
Xiao Xu | Gert Stulp | Antal Van Den Bosch | Anne Gauthier
Proceedings of the Fifth Workshop on Natural Language Processing and Computational Social Science (NLP+CSS)

Fertility intentions as verbalized in surveys are a poor predictor of actual fertility outcomes, the number of children people have. This can partly be explained by the uncertainty people have in their intentions. Such uncertainties are hard to capture through traditional survey questions, although open-ended questions can be used to get insight into people’s subjective narratives of the future that determine their intentions. Analyzing such answers to open-ended questions can be done through Natural Language Processing techniques. Traditional topic models (e.g., LSA and LDA), however, often fail to do so since they rely on co-occurrences, which are often rare in short survey responses. The aim of this study was to apply and evaluate topic models on demographic survey data. In this study, we applied neural topic models (e.g., BERTopic, CombinedTM) based on language models to responses from Dutch women on their fertility plans, and compared the topics and their coherence scores from each model to expert judgments. Our results show that neural models produce topics more in line with human interpretation compared to LDA. However, the coherence score could only partly reflect this, depending on the corpus used for calculation. This research is important because, first, it helps us develop more informed strategies on model selection and evaluation for topic modeling on survey data; and second, it shows that the field of demography has much to gain from adopting NLP methods.

Correlating Political Party Names in Tweets, Newspapers and Election Results
Eric Sanders | Antal van den Bosch
Proceedings of the LREC 2022 workshop on Natural Language Processing for Political Sciences

Twitter has been used as a textual resource to attempt to predict the outcome of elections for over a decade. A body of literature suggests that this is not consistently possible. In this paper we test the hypothesis that mentions of political parties in tweets are better correlated with the appearance of party names in newspapers than to the intention of the tweeter to vote for that party. Five Dutch national elections are used in this study. We find only a small positive, negligible difference in Pearson’s correlation coefficient as well as in the absolute error of the relation between tweets and news, and between tweets and elections. However, we find a larger correlation and a smaller absolute error between party mentions in newspapers and the outcome of the elections in four of the five elections. This suggests that newspapers are a better starting point for predicting the election outcome than tweets.

Detecting Minority Arguments for Mutual Understanding: A Moderation Tool for the Online Climate Change Debate
Cedric Waterschoot | Ernst van den Hemel | Antal van den Bosch
Proceedings of the 29th International Conference on Computational Linguistics

Moderating user comments and promoting healthy understanding is a challenging task, especially in the context of polarized topics such as climate change. We propose a moderation tool to assist moderators in promoting mutual understanding in regard to this topic. The approach is twofold. First, we train classifiers to label incoming posts for the arguments they entail, with a specific focus on minority arguments. We apply active learning to further supplement the training data with rare arguments. Second, we dive deeper into singular arguments and extract the lexical patterns that distinguish each argument from the others. Our findings indicate that climate change arguments form clearly separable clusters in the embedding space. These classes are characterized by their own unique lexical patterns that provide a quick insight in an argument’s key concepts. Additionally, supplementing our training data was necessary for our classifiers to be able to adequately recognize rare arguments. We argue that this detailed rundown of each argument provides insight into where others are coming from. These computational approaches can be part of the toolkit for content moderators and researchers struggling with polarized topics.

2020

Optimising Twitter-based Political Election Prediction with Relevance and Sentiment Filters
Eric Sanders | Antal van den Bosch
Proceedings of the Twelfth Language Resources and Evaluation Conference

We study the relation between the number of mentions of political parties in the last weeks before the elections and the election results. In this paper we focus on the Dutch elections of the parliament in 2012 and for the provinces (and the senate) in 2011 and 2015. With raw counts, without adaptations, we achieve a mean absolute error (MAE) of 2.71% for 2011, 2.02% for 2012 and 2.89% for 2015. A set of over 17,000 tweets containing political party names was annotated by at least three annotators per tweet on ten features denoting communicative intent (including the presence of sarcasm, the message’s polarity, the presence of an explicit voting endorsement or explicit voting advice, etc.). The annotations were used to create oracle (gold-standard) filters. Tweets with or without a certain majority annotation are held out from the tweet counts, with the goal of attaining lower MAEs. With a grid search we tested all combinations of filters and their corresponding MAE to find the best filter ensemble. It appeared that the filters show markedly different behaviour for the three elections and only a small MAE improvement is possible when optimizing on all three elections. Larger improvements for one election are possible, but result in deterioration of the MAE for the other elections.

Less is Better: A cognitively inspired unsupervised model for language segmentation
Jinbiao Yang | Stefan L. Frank | Antal van den Bosch
Proceedings of the Workshop on the Cognitive Aspects of the Lexicon

Language users process utterances by segmenting them into many cognitive units, which vary in their sizes and linguistic levels. Although we can do such unitization/segmentation easily, its cognitive mechanism is still not clear. This paper proposes an unsupervised model, Less-is-Better (LiB), to simulate the human cognitive process with respect to language unitization/segmentation. LiB follows the principle of least effort and aims to build a lexicon which minimizes the number of unit tokens (alleviating the effort of analysis) and number of unit types (alleviating the effort of storage) at the same time on any given corpus. LiB’s workflow is inspired by empirical cognitive phenomena. The design makes the mechanism of LiB cognitively plausible and the computational requirement light-weight. The lexicon generated by LiB performs the best among different types of lexicons (e.g. ground-truth words) both from an information-theoretical view and a cognitive view, which suggests that the LiB lexicon may be a plausible proxy of the mental lexicon.

2019

Simulating Spanish-English Code-Switching: El Modelo Está Generating Code-Switches
Chara Tsoukala | Stefan L. Frank | Antal van den Bosch | Jorge Valdés Kroff | Mirjam Broersma
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics

Multilingual speakers are able to switch from one language to the other (“code-switch”) between or within sentences. Because the underlying cognitive mechanisms are not well understood, in this study we use computational cognitive modeling to shed light on the process of code-switching. We employed the Bilingual Dual-path model, a Recurrent Neural Network of bilingual sentence production (Tsoukala et al., 2017), and simulated sentence production in simultaneous Spanish-English bilinguals. Our first goal was to investigate whether the model would code-switch without being exposed to code-switched training input. The model indeed produced code-switches even without any exposure to such input and the patterns of code-switches are in line with earlier linguistic work (Poplack, 1980). The second goal of this study was to investigate an auxiliary phrase asymmetry that exists in Spanish-English code-switched production. Using this cognitive model, we examined a possible cause for this asymmetry. To our knowledge, this is the first computational cognitive model that aims to simulate code-switched sentence production.

Dependency Parsing with your Eyes: Dependency Structure Predicts Eye Regressions During Reading
Alessandro Lopopolo | Stefan L. Frank | Antal van den Bosch | Roel Willems
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics

Backward saccades during reading have been hypothesized to be involved in structural reanalysis, or to be related to the level of text difficulty. We test the hypothesis that backward saccades are involved in online syntactic analysis. If this is the case we expect that saccades will coincide, at least partially, with the edges of the relations computed by a dependency parser. In order to test this, we analyzed a large eye-tracking dataset collected while 102 participants read three short narrative texts. Our results show a relation between backward saccades and the syntactic structure of sentences.

Detecting harassment in real-time as conversations develop
Wessel Stoop | Florian Kunneman | Antal van den Bosch | Ben Miller
Proceedings of the Third Workshop on Abusive Language Online

We developed a machine-learning-based method to detect video game players that harass teammates or opponents in chat earlier in the conversation. This real-time technology would allow gaming companies to intervene during games, such as by issuing warnings or muting or banning a player. In a proof-of-concept experiment on League of Legends data we compute and visualize evaluation metrics for a machine learning classifier as conversations unfold, and observe that the optimal precision and recall of detecting toxic players at each moment in the conversation depend on the confidence threshold of the classifier: the threshold should start low, and increase as the conversation unfolds. How fast this sliding threshold should increase depends on the training set size.

Question Similarity in Community Question Answering: A Systematic Exploration of Preprocessing Methods and Models
Florian Kunneman | Thiago Castro Ferreira | Emiel Krahmer | Antal van den Bosch
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Community Question Answering forums are popular among Internet users, and a basic problem they encounter is trying to find out if their question has already been posed before. To address this issue, NLP researchers have developed methods to automatically detect question-similarity, which was one of the shared tasks in SemEval. The best performing systems for this task made use of Syntactic Tree Kernels or the SoftCosine metric. However, it remains unclear why these methods seem to work, whether their performance can be improved by better preprocessing methods and what kinds of errors they (and other methods) make. In this paper, we therefore systematically combine and compare these two approaches with the more traditional BM25 and translation-based models. Moreover, we analyze the impact of preprocessing steps (lowercasing, suppression of punctuation and stop words removal) and word meaning similarity based on different distributions (word translation probability, Word2Vec, fastText and ELMo) on the performance of the task. We conduct an error analysis to gain insight into the differences in performance between the system set-ups. The implementation is made publicly available from https://github.com/fkunneman/DiscoSumo/tree/master/ranlp.

2018

Aspect-based summarization of pros and cons in unstructured product reviews
Florian Kunneman | Sander Wubben | Antal van den Bosch | Emiel Krahmer
Proceedings of the 27th International Conference on Computational Linguistics

We developed three systems for generating pros and cons summaries of product reviews. Automating this task eases the writing of product reviews, and offers readers quick access to the most important information. We compared SynPat, a system based on syntactic phrases selected on the basis of valence scores, against a neural-network-based system trained to map bag-of-words representations of reviews directly to pros and cons, and the same neural system trained on clusters of word-embedding encodings of similar pros and cons. We evaluated the systems in two ways: first on held-out reviews with gold-standard pros and cons, and second by asking human annotators to rate the systems’ output on relevance and completeness. In the second evaluation, the gold-standard pros and cons were assessed along with the system output. We find that the human-generated summaries are not deemed as significantly more relevant or complete than the SynPat systems; the latter are scored higher than the human-generated summaries on a precision metric. The neural approaches yield a lower performance in the human assessment, and are outperformed by the baseline.

Discovering the Language of Wine Reviews: A Text Mining Account
Els Lefever | Iris Hendrickx | Ilja Croijmans | Antal van den Bosch | Asifa Majid
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

We present the results and the findings of the Second VarDial Evaluation Campaign on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects. The campaign was organized as part of the fifth edition of the VarDial workshop, collocated with COLING’2018. This year, the campaign included five shared tasks, including two task re-runs – Arabic Dialect Identification (ADI) and German Dialect Identification (GDI) –, and three new tasks – Morphosyntactic Tagging of Tweets (MTT), Discriminating between Dutch and Flemish in Subtitles (DFS), and Indo-Aryan Language Identification (ILI). A total of 24 teams submitted runs across the five shared tasks, and contributed 22 system description papers, which were included in the VarDial workshop proceedings and are referred to in this report.

2017

Exploring Lexical and Syntactic Features for Language Variety Identification
Chris van der Lee | Antal van den Bosch
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)

We present a method to discriminate between texts written in either the Netherlandic or the Flemish variant of the Dutch language. The method draws on a feature bundle representing text statistics, syntactic features, and word n-grams. Text statistics include average word length and sentence length, while syntactic features include ratios of function words and part-of-speech n-grams. The effectiveness of the classifier was measured by classifying Dutch subtitles developed for either Dutch or Flemish television. Several machine learning algorithms were compared as well as feature combination methods in order to find the optimal generalization performance. A machine-learning meta classifier based on AdaBoost attained the best F-score of 0.92.

2016

Sarcastic Soulmates: Intimacy and irony markers in social media messaging
Koen Hallmann | Florian Kunneman | Christine Liebrecht | Antal van den Bosch | Margot van Mulken
Linguistic Issues in Language Technology, Volume 14, 2016 - Modality: Logic, Semantics, Annotation, and Machine Learning

Verbal irony, or sarcasm, presents a significant technical and conceptual challenge when it comes to automatic detection. Moreover, it can be a disruptive factor in sentiment analysis and opinion mining, because it changes the polarity of a message implicitly. Extant methods for automatic detection are mostly based on overt clues to ironic intent such as hashtags, also known as irony markers. In this paper, we investigate whether people who know each other make use of irony markers less often than people who do not know each other. We trained a machine-learning classifier to detect sarcasm in Twitter messages (tweets) that were addressed to specific users, and in tweets that were not addressed to a particular user. Human coders analyzed the top-1000 features found to be most discriminative into ten categories of irony markers. The classifier was also tested within and across the two categories. We find that tweets with a user mention contain fewer irony markers than tweets not addressed to a particular user. Classification experiments confirm that the irony in the two types of tweets is signaled differently. The within-category performance of the classifier is about 91% for both categories, while cross-category experiments yield substantially lower generalization performance scores of 75% and 71%. We conclude that irony markers are used more often when there is less mutual knowledge between sender and receiver. Senders addressing other Twitter users less often use irony markers, relying on mutual knowledge which should lead the receiver to infer ironic intent from more implicit clues. With regard to automatic detection, we conclude that our classifier is able to detect ironic tweets addressed at another user as reliably as tweets that are not addressed at a particular person.

Predicting Liaison: an Example-Based Approach
Antal van den Bosch | Alexander Greefhorst
Traitement Automatique des Langues, Volume 57, Numéro 1 : Varia [Varia]

Abstractive Compression of Captions with Attentive Recurrent Neural Networks
Sander Wubben | Emiel Krahmer | Antal van den Bosch | Suzan Verberne
Proceedings of the 9th International Natural Language Generation conference

The present work is an overview of the TraMOOC (Translation for Massive Open Online Courses) research and innovation project, a machine translation approach for online educational content. More specifically, video lectures, assignments, and MOOC forum text are automatically translated from English into eleven European and BRIC languages. Unlike previous approaches to machine translation, the output quality in TraMOOC relies on a multimodal evaluation schema that involves crowdsourcing, error type markup, an error taxonomy for translation model comparison, and implicit evaluation via text mining, i.e. entity recognition and its performance comparison between the source and the translated text, and sentiment analysis on the students’ forum posts. Finally, the evaluation output will result in more and better quality in-domain parallel data that will be fed back to the translation engine for higher quality output. The translation service will be incorporated into the Iversity MOOC platform and into the VideoLectures.net digital library portal.

Nederlab: Towards a Single Portal and Research Environment for Diachronic Dutch Text Corpora
Hennie Brugman | Martin Reynaert | Nicoline van der Sijs | René van Stipriaan | Erik Tjong Kim Sang | Antal van den Bosch
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The Nederlab project aims to bring together all digitized texts relevant to the Dutch national heritage, the history of the Dutch language and culture (circa 800 – present) in one user friendly and tool enriched open access web interface. This paper describes Nederlab halfway through the project period and discusses the collections incorporated, back-office processes, system back-end as well as the Nederlab Research Portal end-user web application.

Can Tweets Predict TV Ratings?
Bridget Sommerdijk | Eric Sanders | Antal van den Bosch
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We set out to investigate whether TV ratings and mentions of TV programmes on the Twitter social media platform are correlated. If such a correlation exists, Twitter may be used as an alternative source for estimating viewer popularity. Moreover, the Twitter-based rating estimates may be generated during the programme, or even before. We count the occurrences of programme-specific hashtags in an archive of Dutch tweets of eleven popular TV shows broadcast in the Netherlands in one season, and perform correlation tests. Overall we find a strong correlation of 0.82; the correlation remains strong, 0.79, if tweets are counted a half hour before broadcast time. However, the two most popular TV shows account for most of the positive effect; if we leave out the single and second most popular TV shows, the correlation drops to being moderate to weak. Also, within a TV show, correlations between ratings and tweet counts are mostly weak, while correlations between TV ratings of the previous and next shows are strong. In absence of information on previous shows, Twitter-based counts may be a viable alternative to classic estimation methods for TV ratings. Estimates are more reliable with more popular TV shows.

Improving cross-domain n-gram language modelling with skipgrams
Louis Onrust | Antal van den Bosch | Hugo Van hamme
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Very quaffable and great fun: Applying NLP to wine reviews
Iris Hendrickx | Els Lefever | Ilja Croijmans | Asifa Majid | Antal van den Bosch
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2015

Automatically Identifying Periodic Social Events from Twitter
Florian Kunneman | Antal Van den Bosch
Proceedings of the International Conference Recent Advances in Natural Language Processing

Modeling dative alternations of individual children
Antal van den Bosch | Joan Bresnan
Proceedings of the Sixth Workshop on Cognitive Aspects of Computational Language Learning

2014

Creating and using large monolingual parallel corpora for sentential paraphrase generation
Sander Wubben | Antal van den Bosch | Emiel Krahmer
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we investigate the automatic generation of paraphrases by using machine translation techniques. Three contributions we make are the construction of a large paraphrase corpus for English and Dutch, a re-ranking heuristic to use machine translation for paraphrase generation and a proper evaluation methodology. A large parallel corpus is constructed by aligning clustered headlines that are scraped from a news aggregator site. To generate sentential paraphrases we use a standard phrase-based machine translation (PBMT) framework modified with a re-ranking component (henceforth PBMT-R). We demonstrate this approach for Dutch and English and evaluate by using human judgements collected from 76 participants. The judgments are compared to two automatic machine translation evaluation metrics. We observe that as the paraphrases deviate more from the source sentence, the performance of the PBMT-R system degrades less than that of the word substitution baseline system.

SemEval 2014 Task 5 - L2 Writing Assistant
Maarten van Gompel | Iris Hendrickx | Antal van den Bosch | Els Lefever | Véronique Hoste
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

Proceedings of the 8th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)
Kalliopi Zervanou | Cristina Vertan | Antal van den Bosch | Caroline Sporleder
Proceedings of the 8th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)

Estimating Time to Event from Tweets Using Temporal Expressions
Ali Hürriyetoǧlu | Nelleke Oostdijk | Antal van den Bosch
Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM)

The (Un)Predictability of Emotional Hashtags in Twitter
Florian Kunneman | Christine Liebrecht | Antal van den Bosch
Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM)

Translation Assistance by Translation of L1 Fragments in an L2 Context
Maarten van Gompel | Antal van den Bosch
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Using idiolects and sociolects to improve word prediction
Wessel Stoop | Antal van den Bosch
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics

2013

The perfect solution for detecting sarcasm in tweets #not
Christine Liebrecht | Florian Kunneman | Antal van den Bosch
Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

Using character overlap to improve language transformation
Sander Wubben | Emiel Krahmer | Antal van den Bosch
Proceedings of the 7th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

Memory-based Grammatical Error Correction
Antal van den Bosch | Peter Berck
Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task

WSD2: Parameter optimisation for Memory-based Cross-Lingual Word-Sense Disambiguation
Maarten van Gompel | Antal van den Bosch
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

2012

Sentence Simplification by Monolingual Machine Translation
Sander Wubben | Antal van den Bosch | Emiel Krahmer
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Proceedings of the 6th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities
Kalliopi Zervanou | Antal van den Bosch
Proceedings of the 6th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

Memory-based text correction for preposition and determiner errors
Antal van den Bosch | Peter Berck
Proceedings of the Seventh Workshop on Building Educational Applications Using NLP

Proceedings of the Workshop on Detecting Structure in Scholarly Discourse
Antal Van Den Bosch | Hagit Shatkay
Proceedings of the Workshop on Detecting Structure in Scholarly Discourse

DutchSemCor: Targeting the ideal sense-tagged corpus
Piek Vossen | Attila Görög | Rubén Izquierdo | Antal van den Bosch
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Word Sense Disambiguation (WSD) systems require large sense-tagged corpora along with lexical databases to reach satisfactory results. The number of English language resources for developed WSD increased in the past years while most other languages are still under-resourced. The situation is no different for Dutch. In order to overcome this data bottleneck, the DutchSemCor project will deliver a Dutch corpus that is sense-tagged with senses from the Cornetto lexical database. In this paper, we discuss the different conflicting requirements for a sense-tagged corpus and our strategies to fulfill them. We report on a first series of experiments to support our semi-automatic approach to build the corpus.

The effect of domain and text type on text prediction quality
Suzan Verberne | Antal van den Bosch | Helmer Strik | Lou Boves
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

2011

Enrichment and Structuring of Archival Description Metadata
Kalliopi Zervanou | Ioannis Korkontzelos | Antal van den Bosch | Sophia Ananiadou
Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

Comparing Phrase-based and Syntax-based Paraphrase Generation
Sander Wubben | Erwin Marsi | Antal van den Bosch | Emiel Krahmer
Proceedings of the Workshop on Monolingual Text-To-Text Generation

A Link to the Past: Constructing Historical Social Networks
Matje van de Camp | Antal van den Bosch
Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA 2.011)

2010

Supertags as Source Language Context in Hierarchical Phrase-Based SMT
Rejwanul Haque | Sudip Naskar | Antal van den Bosch | Andy Way
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers

Statistical machine translation (SMT) models have recently begun to include source context modeling, under the assumption that the proper lexical choice of the translation for an ambiguous word can be determined from the context in which it appears. Various types of lexical and syntactic features have been explored as effective source context to improve phrase selection in SMT. In the present work, we introduce lexico-syntactic descriptions in the form of supertags as source-side context features in the state-of-the-art hierarchical phrase-based SMT (HPB) model. These features enable us to exploit source similarity in addition to target similarity, as modelled by the language model. In our experiments two kinds of supertags are employed: those from lexicalized tree-adjoining grammar (LTAG) and combinatory categorial grammar (CCG). We use a memory-based classification framework that enables the efficient estimation of these features. Despite the differences between the two supertagging approaches, they give similar improvements. We evaluate the performance of our approach on an English-to-Dutch translation task, and report statistically significant improvements of 4.48% and 6.3% BLEU scores in translation quality when adding CCG and LTAG supertags, respectively, as context-informed features.

pdf
Paraphrase Generation as Monolingual Translation: Data and Evaluation
Sander Wubben | Antal van den Bosch | Emiel Krahmer
Proceedings of the 6th International Natural Language Generation Conference

We describe a case study in the reuse and transfer of tools in language resource development, from a corpus of spoken Dutch to a corpus of written Dutch. Once tools for a particular language have been developed, it is logical, but not trivial, to reuse them for types or registers of the language other than those the tools were originally designed for. This paper reviews the decisions and adaptations necessary to make this particular transfer from spoken to written language, focusing on a part-of-speech tagger and a lemmatizer. While the lemmatizer can be transferred fairly straightforwardly, the tagger needs to be adapted considerably. We show how it can be adapted without starting from scratch. We describe how the part-of-speech tagset was adapted and how the tagger was retrained to deal with written-text phenomena it had not been trained on earlier.

pdf abs
Identifying Named Entities in Text Databases from the Natural History Domain
Caroline Sporleder | Marieke van Erp | Tijn Porcelijn | Antal van den Bosch | Pim Arntzen
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper, we investigate whether it is possible to bootstrap a named entity tagger for textual databases by exploiting the database structure to automatically generate domain- and database-specific gazetteer lists. We compare three tagging strategies: (i) using the extracted gazetteers in a look-up tagger, (ii) using the gazetteers to automatically extract training data to train a database-specific tagger, and (iii) using a generic named entity tagger. Our results suggest that automatically built gazetteers in combination with a look-up tagger lead to relatively good performance and that generic taggers do not perform particularly well on this type of data.

pdf
Spotting the ‘Odd-one-out’: Data-Driven Error Detection and Correction in Textual Databases
Caroline Sporleder | Marieke van Erp | Tijn Porcelijn | Antal van den Bosch
Proceedings of the Workshop on Adaptive Text Extraction and Mining (ATEM 2006)

pdf bib
Constraint Satisfaction Inference: Non-probabilistic Global Inference for Sequence Labelling
Sander Canisius | Antal van den Bosch | Walter Daelemans
Proceedings of the Workshop on Learning Structured Information in Natural Language Applications

pdf
Dependency Parsing by Inference over High-recall Dependency Predictions
Sander Canisius | Toine Bogers | Antal van den Bosch | Jeroen Geertzen | Erik Tjong Kim Sang
Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X)

pdf
Improved morpho-phonological sequence processing with constraint satisfaction inference
Antal van den Bosch | Sander Canisius
Proceedings of the Eighth Meeting of the ACL Special Interest Group on Computational Phonology and Morphology at HLT-NAACL 2006

pdf
All-word Prediction as the Ultimate Confusible Disambiguation
Antal van den Bosch
Proceedings of the Workshop on Computationally Hard Problems and Joint Inference in Speech and Language Processing