Kay Rottmann

2022

Unsupervised training data re-weighting for natural language understanding with local distribution approximation
Jose Garrido Ramas | Dieu-thu Le | Bei Chen | Manoj Kumar | Kay Rottmann
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track

One of the major challenges of training Natural Language Understanding (NLU) production models lies in the discrepancy between the distributions of the offline training data and of the online live data, due to, e.g., a biased sampling scheme, cyclic seasonality shifts, annotated training data coming from a variety of different sources, and a changing pool of users. Consequently, a model trained on the offline data is biased. We often observe this problem in task-oriented conversational systems, where topics of interest and the characteristics of users using the system change over time. In this paper we propose an unsupervised approach to mitigate the offline training data sampling bias in multiple NLU tasks. We show that a local distribution approximation in the pre-trained embedding space enables the estimation of importance weights for training samples, which guide re-sampling for effective bias mitigation. We illustrate our novel approach using multiple NLU datasets and show improvements obtained without additional annotation, making this a general approach for mitigating the effects of sampling bias.
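
The abstract does not spell out how the importance weights are computed. As a rough illustration of the general idea only, the sketch below estimates a weight for each training embedding from the share of live-data points among its nearest neighbours and then re-samples the training set with those weights. The k-nearest-neighbour estimator, the add-one smoothing, and the names `knn_importance_weights` and `resample` are assumptions made for this example and are not taken from the paper.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_importance_weights(train, live, k=5):
    """Approximate w(x) ~ p_live(x) / p_train(x) for each training embedding.

    Hypothetical estimator (an assumption, not the paper's method): look at
    the k nearest neighbours of x in the pooled train + live set and compare
    how many of them come from each source.
    """
    pooled = [(vec, "train") for vec in train] + [(vec, "live") for vec in live]
    weights = []
    for i, x in enumerate(train):
        # Training points occupy the first len(train) slots of `pooled`,
        # so index i is x itself and is skipped.
        neighbours = sorted(
            (p for j, p in enumerate(pooled) if j != i),
            key=lambda p: euclidean(x, p[0]),
        )[:k]
        n_live = sum(1 for _, source in neighbours if source == "live")
        n_train = len(neighbours) - n_live
        # Add-one smoothing keeps the ratio finite; the size factor corrects
        # for pools of unequal size.
        ratio = (n_live + 1) / (n_train + 1)
        weights.append(ratio * len(train) / len(live))
    return weights


def resample(train, weights, n, seed=0):
    """Draw n training samples with probability proportional to their weight."""
    rng = random.Random(seed)
    return rng.choices(train, weights=weights, k=n)


# Toy 2-D "embeddings": training data over-represents the cluster near the
# origin, live data over-represents the cluster near (10, 10).
train = [(0.1 * i, 0.0) for i in range(8)] + [(10.0, 10.0), (10.1, 10.0)]
live = [(0.05, 0.1), (0.35, 0.1)] + [(10.0 + 0.1 * i, 10.1) for i in range(8)]

weights = knn_importance_weights(train, live, k=3)
resampled = resample(train, weights, n=20)
```

In this toy setup the two training points near (10, 10) receive larger weights than the eight points near the origin, so re-sampling shifts the training distribution toward the region that the live data favours.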

To improve deep learning models’ robustness, adversarial training has been frequently used in computer vision with satisfying results. However, adversarial perturbations on text have turned out to be more challenging due to the discrete nature of text. The generated adversarial text might not sound natural or might not preserve semantics, which is key for real-world applications where text classification is based on semantic meaning. In this paper, we describe a new way of generating adversarial samples by using pseudo-labeled in-domain text data to train a seq2seq model for adversarial generation and combining it with paraphrase detection. We showcase the benefit of our approach for a real-world Natural Language Understanding (NLU) task, which maps a user’s request to an intent. Furthermore, we experiment with gradient-based training for the NLU task and try using token importance scores to guide the adversarial text generation. We show that our approach can generate realistic and relevant adversarial samples compared to other state-of-the-art adversarial training methods. Applying adversarial training using these generated samples helps the NLU model to recover up to 70% of these types of errors and makes the model more robust, especially in the tail distribution in a large-scale real-world application.

Proceedings of the Massively Multilingual Natural Language Understanding Workshop (MMNLU-22)
Jack FitzGerald | Kay Rottmann | Julia Hirschberg | Mohit Bansal | Anna Rumshisky | Charith Peris | Christopher Hench
Proceedings of the Massively Multilingual Natural Language Understanding Workshop (MMNLU-22)

Massively Multilingual Natural Language Understanding 2022 (MMNLU-22) Workshop and Competition
Jack FitzGerald | Christopher Hench | Charith Peris | Kay Rottmann
Proceedings of the Massively Multilingual Natural Language Understanding Workshop (MMNLU-22)

To be written (workshop summary paper)

2021

Training data reduction for multilingual Spoken Language Understanding systems
Anmol Bansal | Anjali Shenoy | Krishna Chaitanya Pappu | Kay Rottmann | Anurag Dwarakanath
Proceedings of the 18th International Conference on Natural Language Processing (ICON)

Fine-tuning self-supervised pre-trained language models such as BERT has significantly improved state-of-the-art performance on natural language processing tasks. Similar fine-tuning setups can also be used in commercial large-scale Spoken Language Understanding (SLU) systems to perform intent classification and slot tagging on user queries. Fine-tuning such powerful models for use in commercial systems requires large amounts of training data and compute resources to achieve high performance. This paper is a study of different empirical methods of identifying training data redundancies for the fine-tuning paradigm. In particular, we explore rule-based and semantic techniques to reduce data in a multilingual fine-tuning setting and report our results on key SLU metrics. Through our experiments, we show that fine-tuning on a reduced data set can achieve on-par or better performance compared to a model fine-tuned on the entire data set.

2010

Tools for Collecting Speech Corpora via Mechanical-Turk
Ian Lane | Matthias Eck | Kay Rottmann | Alex Waibel
Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk

2008

2007

This paper describes the CMU-UKA statistical machine translation systems submitted to the IWSLT 2007 evaluation campaign. Systems were submitted for three language pairs: Japanese→English, Chinese→English and Arabic→English. All systems were based on a common phrase-based SMT (statistical machine translation) framework, but for each language pair a specific research problem was tackled. For Japanese→English we focused on two problems: first, punctuation recovery, and second, how to incorporate topic knowledge into the translation framework. Our Chinese→English submission focused on syntax-augmented SMT, and for the Arabic→English task we focused on incorporating morphological decomposition into the SMT framework. This research strategy enabled us to evaluate a wide variety of approaches which proved effective for the language pairs they were evaluated on.

Word reordering in statistical machine translation with a POS-based distortion model
Kay Rottmann | Stephan Vogel
Proceedings of the 11th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages: Papers

The ISL Phrase-Based MT System for the 2007 ACL Workshop on Statistical Machine Translation
Matthias Paulik | Kay Rottmann | Jan Niehues | Silja Hildebrand | Stephan Vogel
Proceedings of the Second Workshop on Statistical Machine Translation