Ben Wellner

Also published as: Benjamin Wellner

2019

This paper describes MITRE’s participation in SemEval-2019 Task 5, HatEval: Multilingual detection of hate speech against immigrants and women in Twitter. The techniques explored range from simple bag-of-ngrams classifiers to neural architectures with varied attention mechanisms. We describe several styles of transfer learning from auxiliary tasks, including a novel method for adapting pre-trained BERT models to Twitter data. Logistic regression ties the systems together into an ensemble submitted for evaluation. The resulting system was used to produce predictions for all four HatEval subtasks, achieving the best mean rank of all teams that participated in all four conditions.

2009

Sources of Performance in CRF Transfer Training: a Business Name-tagging Case Study
Marc Vilain | Jonathan Huggins | Ben Wellner
Proceedings of the International Conference RANLP-2009

A simple feature-copying approach for long-distance dependencies
Marc Vilain | Jonathan Huggins | Ben Wellner
Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009)

2008

Identification of Duplicate News Stories in Web Pages
John Gibson | Ben Wellner | Susan Lubar
Proceedings of the 4th Web as Corpus Workshop

Identifying near-duplicate documents is a challenge often faced in the field of information discovery. Unfortunately, many algorithms that find near-duplicate pairs of plain text documents perform poorly when used on web pages, where metadata and other extraneous information make that process much more difficult. If the content of the page (e.g., the body of a news article) can be extracted from the page, then the accuracy of the duplicate detection algorithms is greatly increased. Using machine learning techniques to identify the content portion of web pages, we achieve duplicate detection accuracy that is nearly identical to that on plain text and significantly better than simple heuristic approaches to content extraction. We performed these experiments on a small but fully annotated corpus.

pdf bib abs

SpatialML is an annotation scheme for marking up references to places in natural language. It covers both named and nominal references to places, grounding them where possible with geo-coordinates, including both relative and absolute locations, and characterizes relationships among places in terms of a region calculus. A freely available annotation editor has been developed for SpatialML, along with a corpus of annotated documents released by the Linguistic Data Consortium. Inter-annotator agreement on SpatialML is 77.0 F-measure for extents on that corpus. An automatic tagger for SpatialML extents scores 78.5 F-measure. A disambiguator scores 93.0 F-measure and 93.4 Predictive Accuracy. In adapting the extent tagger to new domains, merging the training data from the above corpus with annotated data in the new domain provides the best performance.

2007

Automatically Identifying the Arguments of Discourse Connectives
Ben Wellner | James Pustejovsky
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

2006

Leveraging Machine Readable Dictionaries in Discriminative Sequence Models
Ben Wellner | Marc Vilain
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Many natural language processing tasks make use of a lexicon, typically the words collected from some annotated training data along with their associated properties. We demonstrate here the utility of corpora-independent lexicons derived from machine readable dictionaries. Lexical information is encoded in the form of features in a Conditional Random Field tagger, providing improved performance in cases where (i) limited training data is available, (ii) the data is case-less, and (iii) the test data genre or domain is different from that of the training data. We show substantial error reductions, especially on unknown words, for the tasks of part-of-speech tagging and shallow parsing, achieving up to a 20% error reduction on Penn TreeBank part-of-speech tagging and up to a 15.7% error reduction for shallow parsing using the CoNLL 2000 data. Our results point towards a simple but effective methodology for increasing the adaptability of text processing systems by training models with annotated data in one genre augmented with general lexical information or lexical information pertinent to the target genre (or domain).

Machine Learning of Temporal Relations
Inderjeet Mani | Marc Verhagen | Ben Wellner | Chong Min Lee | James Pustejovsky
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

A Pilot Study on Acquiring Metric Temporal Constraints for Events
Inderjeet Mani | Ben Wellner
Proceedings of the Workshop on Annotating and Reasoning about Time and Events

Classification of Discourse Coherence Relations: An Exploratory Study using Multiple Knowledge Sources
Ben Wellner | James Pustejovsky | Catherine Havasi | Anna Rumshisky | Roser Saurí
Proceedings of the 7th SIGdial Workshop on Discourse and Dialogue