Matthias Eck


2023

A Keyword Based Approach to Understanding the Overpenalization of Marginalized Groups by English Marginal Abuse Models on Twitter
Kyra Yee | Alice Schoenauer Sebag | Olivia Redfield | Matthias Eck | Emily Sheng | Luca Belli
Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023)

Harmful content detection models tend to have higher false positive rates for content from marginalized groups. In the context of marginal abuse modeling on Twitter, such disproportionate penalization poses the risk of reduced visibility, where marginalized communities lose the opportunity to voice their opinion on the platform. Current approaches to algorithmic harm mitigation and bias detection for NLP models are often ad hoc and subject to human bias. We make two main contributions in this paper. First, we design a novel methodology that provides a principled approach to detecting and measuring the severity of potential harms associated with a text-based model. Second, we apply our methodology to audit Twitter’s English marginal abuse model, which is used for removing amplification eligibility of marginally abusive content. Without utilizing demographic labels or dialect classifiers, we are still able to detect and measure the severity of issues related to the over-penalization of the speech of marginalized communities, such as the use of reclaimed speech, counterspeech, and identity-related terms. To mitigate the associated harms, we experiment with adding true negative examples and find that doing so improves our fairness metrics without large degradations in model performance.

2014

Extracting translation pairs from social network content
Matthias Eck | Yuri Zemlyanskiy | Joy Zhang | Alex Waibel
Proceedings of the 11th International Workshop on Spoken Language Translation: Papers

We introduce two methods to collect additional training data for statistical machine translation systems from public social network content. The first method identifies multilingual content where the author translated their own post to reach additional friends, fans or customers. Once identified, we can split the post into its language segments and extract translation pairs from this content. The second method considers web links (URLs) that users add to their posts to point the reader to a video, article or website. If the same URL is shared by users of different languages, there is a chance they might give the same comment in their respective languages. We use a support vector machine (SVM) as a classifier to identify true translations among all candidate pairs. We collected additional translation pairs using both methods for the language pairs Spanish-English and Portuguese-English. Using the collected data as additional training data for statistical machine translation yielded significant improvements of up to 5 BLEU points on in-domain test sets.
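
The candidate-generation step of the second method could look roughly like the sketch below. It is only an illustration under stated assumptions: the `Post` record, its fields, and the `candidate_pairs` function are hypothetical names, not taken from the paper, and the SVM classification step that filters the candidates is not shown.

```python
from collections import defaultdict
from itertools import product
from typing import Iterable, List, NamedTuple, Tuple


class Post(NamedTuple):
    """Hypothetical record for a public post that shares a URL."""
    url: str
    lang: str
    text: str


def candidate_pairs(
    posts: Iterable[Post], src_lang: str, tgt_lang: str
) -> List[Tuple[str, str]]:
    """Pair up comments in two languages that were posted with the same URL."""
    # Group post texts first by URL, then by language.
    by_url = defaultdict(lambda: defaultdict(list))
    for post in posts:
        by_url[post.url][post.lang].append(post.text)

    pairs: List[Tuple[str, str]] = []
    for by_lang in by_url.values():
        # Every source-language comment is paired with every target-language
        # comment on the same URL; a classifier would then filter these.
        pairs.extend(product(by_lang.get(src_lang, []), by_lang.get(tgt_lang, [])))
    return pairs


posts = [
    Post("http://example.com/a", "es", "Gran video"),
    Post("http://example.com/a", "en", "Great video"),
    Post("http://example.com/b", "en", "Nice article"),
]
pairs = candidate_pairs(posts, "es", "en")
```

Most candidates produced this way will not be translations of each other, which is why the abstract describes a classifier on top of this step.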

2010

Tools for Collecting Speech Corpora via Mechanical-Turk
Ian Lane | Matthias Eck | Kay Rottmann | Alex Waibel
Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk

2009

Incremental Adaptation of Speech-to-Speech Translation
Nguyen Bach | Roger Hsiao | Matthias Eck | Paisarn Charoenpornsawat | Stephan Vogel | Tanja Schultz | Ian Lane | Alex Waibel | Alan Black
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers

2008

Communicating Unknown Words in Machine Translation
Matthias Eck | Stephan Vogel | Alex Waibel
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

A new approach to handling unknown words in machine translation is presented. The basic idea is to find definitions for the unknown words on the source language side and translate those definitions instead. Only monolingual resources are required, which generally offer broader coverage than bilingual resources and are available for a large number of languages. To use this in a machine translation system, definitions are extracted automatically from online dictionaries and encyclopedias. The translated definition is then inserted and clearly marked in the original hypothesis. This is shown to lead to significant improvements in (subjective) translation quality.
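
The insertion step can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function name `insert_definitions`, the bracket marking, and the idea that unknown words are passed through unchanged into the hypothesis are not taken from the paper, and `translate` stands in for a real machine translation system.

```python
from typing import Callable, Dict, List, Set


def insert_definitions(
    hypothesis_tokens: List[str],
    unknown_words: Set[str],
    definitions: Dict[str, str],
    translate: Callable[[str], str],
) -> str:
    """Replace passed-through unknown words with marked, translated definitions."""
    output = []
    for token in hypothesis_tokens:
        if token in unknown_words and token in definitions:
            # Translate the source-language definition instead of the word
            # itself and mark the insertion so the reader can recognize it.
            output.append(f"[{token}: {translate(definitions[token])}]")
        else:
            output.append(token)
    return " ".join(output)


# Toy stand-in for an MT system: upper-casing plays the role of translation.
toy_translate = str.upper
result = insert_definitions(
    ["the", "Zugspitze", "is", "high"],
    unknown_words={"Zugspitze"},
    definitions={"Zugspitze": "highest mountain in germany"},
    translate=toy_translate,
)
```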

2007

Translation Model Pruning via Usage Statistics for Statistical Machine Translation
Matthias Eck | Stephan Vogel | Alex Waibel
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers

Estimating phrase pair relevance for translation model pruning
Matthias Eck | Stephan Vogel | Alex Waibel
Proceedings of Machine Translation Summit XI: Papers

2006

The UKA/CMU statistical machine translation system for IWSLT 2006
Matthias Eck | Ian Lane | Nguyen Bach | Sanjika Hewavitharana | Muntsin Kolss | Bing Zhao | Almut Silja Hildebrand | Stephan Vogel | Alex Waibel
Proceedings of the Third International Workshop on Spoken Language Translation: Evaluation Campaign

A Flexible Online Server for Machine Translation Evaluation
Matthias Eck | Stephan Vogel | Alex Waibel
Proceedings of the 11th Annual Conference of the European Association for Machine Translation

2005

Low Cost Portability for Statistical Machine Translation based on N-gram Coverage
Matthias Eck | Stephan Vogel | Alex Waibel
Proceedings of Machine Translation Summit X: Papers

Statistical machine translation relies heavily on the available training data. However, in some cases, it is necessary to limit the amount of training data that can be created for or actually used by the systems. To address this problem, we introduce a weighting scheme that tries to select more informative sentences first. This selection is based on the previously unseen n-grams the sentences contain, and it allows us to sort the sentences according to their estimated importance. After sorting, we can construct smaller training corpora, and we demonstrate that systems trained on much less training data perform competitively with baseline systems that use all available training data.
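
A greedy selection by previously unseen n-grams might be sketched as follows. This is not the paper's exact weighting scheme: the score used here (number of unseen n-grams divided by sentence length), the whitespace tokenization, and the names `ngrams` and `rank_by_unseen_ngrams` are assumptions chosen for illustration.

```python
from typing import List, Set, Tuple


def ngrams(tokens: List[str], max_n: int) -> Set[Tuple[str, ...]]:
    """All n-grams of order 1..max_n in a token list."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def rank_by_unseen_ngrams(sentences: List[str], max_n: int = 2) -> List[str]:
    """Greedily order sentences so those adding the most unseen n-grams come first."""
    seen: Set[Tuple[str, ...]] = set()
    remaining = [(s, s.split()) for s in sentences]
    ranked: List[str] = []

    def score(item: Tuple[str, List[str]]) -> float:
        tokens = item[1]
        if not tokens:
            return 0.0
        # Assumed weighting: unseen n-grams normalized by sentence length.
        return len(ngrams(tokens, max_n) - seen) / len(tokens)

    while remaining:
        best = max(remaining, key=score)
        remaining.remove(best)
        seen |= ngrams(best[1], max_n)  # These n-grams are now covered.
        ranked.append(best[0])
    return ranked


ranked = rank_by_unseen_ngrams(["the cat sat", "the cat sat", "dogs bark"])
```

A smaller training corpus would then be the first k sentences of the ranked list. In this toy input the duplicate sentence contributes no new n-grams once its first copy is selected, so it ends up last.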

Overview of the IWSLT 2005 Evaluation Campaign
Matthias Eck | Chiori Hori
Proceedings of the Second International Workshop on Spoken Language Translation

The CMU Statistical Machine Translation System for IWSLT2005
Sanjika Hewavitharana | Bing Zhao | Almut Silja Hildebrand | Matthias Eck | Chiori Hori | Stephan Vogel | Alex Waibel
Proceedings of the Second International Workshop on Spoken Language Translation

Low Cost Portability for Statistical Machine Translation based on N-gram Frequency and TF-IDF
Matthias Eck | Stephan Vogel | Alex Waibel
Proceedings of the Second International Workshop on Spoken Language Translation

Adaptation of the translation model for statistical machine translation based on information retrieval
Almut Silja Hildebrand | Matthias Eck | Stephan Vogel | Alex Waibel
Proceedings of the 10th EAMT Conference: Practical applications of machine translation

2004

Phrase Pair Rescoring with Term Weighting for Statistical Machine Translation
Bing Zhao | Stephan Vogel | Matthias Eck | Alex Waibel
Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing

Language Model Adaptation for Statistical Machine Translation via Structured Query Models
Bing Zhao | Matthias Eck | Stephan Vogel
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

Improving Statistical Machine Translation in the Medical Domain using the Unified Medical Language system
Matthias Eck | Stephan Vogel | Alex Waibel
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

Language Model Adaptation for Statistical Machine Translation Based on Information Retrieval
Matthias Eck | Stephan Vogel | Alex Waibel
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)