Tanja Schultz

2024

Uncovering the Full Potential of Visual Grounding Methods in VQA
Daniel Reich | Tanja Schultz
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Visual Grounding (VG) methods in Visual Question Answering (VQA) attempt to improve VQA performance by strengthening a model’s reliance on question-relevant visual information. The presence of such relevant information in the visual input is typically assumed in training and testing. This assumption, however, is inherently flawed when dealing with imperfect image representations common in large-scale VQA, where the information carried by visual features frequently deviates from expected ground-truth contents. As a result, training and testing of VG-methods is performed with largely inaccurate data, which obstructs proper assessment of their potential benefits. In this study, we demonstrate that current evaluation schemes for VG-methods are problematic due to the flawed assumption of availability of relevant visual information. Our experiments show that these methods can be much more effective when evaluation conditions are corrected. Code is provided.

2023

Measuring Faithful and Plausible Visual Grounding in VQA
Daniel Reich | Felix Putze | Tanja Schultz
Findings of the Association for Computational Linguistics: EMNLP 2023

Metrics for Visual Grounding (VG) in Visual Question Answering (VQA) systems primarily aim to measure a system’s reliance on relevant parts of the image when inferring an answer to the given question. Lack of VG has been a common problem among state-of-the-art VQA systems and can manifest in over-reliance on irrelevant image parts or a disregard for the visual modality entirely. Although inference capabilities of VQA models are often illustrated by a few qualitative illustrations, most systems are not quantitatively assessed for their VG properties. We believe that an easily calculated criterion for meaningfully measuring a system’s VG can help remedy this shortcoming, as well as add another valuable dimension to model evaluations and analysis. To this end, we propose a new VG metric that captures if a model a) identifies question-relevant objects in the scene, and b) actually relies on the information contained in the relevant objects when producing its answer, i.e., if its visual grounding is both “faithful” and “plausible”. Our metric, called Faithful & Plausible Visual Grounding (FPVG), is straightforward to determine for most VQA model designs. We give a detailed description of FPVG and evaluate several reference systems spanning various VQA architectures. Code to support the metric calculations on the GQA data set is available on GitHub.

2020

Analysis of GlobalPhone and Ethiopian Languages Speech Corpora for Multilingual ASR
Martha Yifiru Tachbelie | Solomon Teferra Abate | Tanja Schultz
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper, we present the analysis of GlobalPhone (GP) and speech corpora of Ethiopian languages (Amharic, Tigrigna, Oromo and Wolaytta). The aim of the analysis is to select speech data from GP for the development of a multilingual Automatic Speech Recognition (ASR) system for the Ethiopian languages. To this end, phonetic overlaps among GP and Ethiopian languages have been analyzed. The result of our analysis shows that there is much phonetic overlap among Ethiopian languages although they are from three different language families. From GP, Turkish, Uyghur and Croatian are found to have much overlap with the Ethiopian languages. On the other hand, Korean has less phonetic overlap with the rest of the languages. Moreover, morphological complexity of the GP and Ethiopian languages, reflected by type-to-token ratio (TTR) and out-of-vocabulary (OOV) rate, has been analyzed. Both metrics indicated the morphological complexity of the languages. Korean and Amharic have been identified as extremely morphologically complex compared to the other languages. Tigrigna, Russian, Turkish, Polish, etc. are also among the morphologically complex languages.

Automatic Speech Recognition for Uyghur through Multilingual Acoustic Modeling
Ayimunishagu Abulimiti | Tanja Schultz
Proceedings of the Twelfth Language Resources and Evaluation Conference

Low-resource languages suffer from lower performance of Automatic Speech Recognition (ASR) systems due to the lack of data. As a common approach, multilingual training has been applied to achieve more context coverage and has shown better performance over monolingual training (Heigold et al., 2013). However, the difference between the donor language and the target language may distort the acoustic model trained with multilingual data, especially when a much larger amount of data from donor languages is used for training the models of the low-resource language. This paper presents our effort towards improving the performance of the ASR system for the under-resourced Uyghur language with multilingual acoustic training. To develop a multilingual speech recognition system for Uyghur, we used Turkish as the donor language, which we selected from the GlobalPhone corpus as the most similar language to Uyghur. By generating subsets of Uyghur training data, we explored the performance of multilingual speech recognition systems trained with different sizes of Uyghur and Turkish data. The best speech recognition system for Uyghur is achieved by multilingual training using all Uyghur data (10 hours) and 17 hours of Turkish data, with a WER of 19.17%, which corresponds to a 4.95% relative improvement over monolingual training.

DNN-Based Multilingual Automatic Speech Recognition for Wolaytta using Oromo Speech
Martha Yifiru Tachbelie | Solomon Teferra Abate | Tanja Schultz
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

It is known that Automatic Speech Recognition (ASR) is very useful for human-computer interaction in all the human languages. However, due to its requirement for a big speech corpus, which is very expensive, it has not been developed for most of the languages. Multilingual ASR (MLASR) has been suggested to share existing speech corpora among related languages to develop an ASR for languages which do not have the required speech corpora. Literature shows that phonetic relatedness goes across language families. We have, therefore, conducted experiments on MLASR taking two language families: one as source (Oromo from Cushitic) and the other as target (Wolaytta from Omotic). Using an Oromo Deep Neural Network (DNN) based acoustic model together with a Wolaytta pronunciation dictionary and language model, we have achieved a Word Error Rate (WER) of 48.34% for Wolaytta. Moreover, our experiments show that adding only 30 minutes of speech data from the target language (Wolaytta) to the whole training data (22.8 hours) of the source language (Oromo) results in a relative WER reduction of 32.77%. Our results show the possibility of developing an ASR system for a language, if we have a pronunciation dictionary and language model, using an existing speech corpus of another language irrespective of their language family.

Building Language Models for Morphological Rich Low-Resource Languages using Data from Related Donor Languages: the Case of Uyghur
Ayimunishagu Abulimiti | Tanja Schultz
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

Huge amounts of data are needed to build reliable statistical language models. Automatic speech processing tasks in low-resource languages typically suffer from lower performances due to weak or unreliable language models. Furthermore, language modeling for agglutinative languages is very challenging, as the morphological richness results in higher Out Of Vocabulary (OOV) rate. In this work, we show our effort to build word-based as well as morpheme-based language models for Uyghur, a language that combines both challenges, i.e. it is a low-resource and agglutinative language. Fortunately, there exists a closely-related rich-resource language, namely Turkish. Here, we present our work on leveraging Turkish text data to improve Uyghur language models. To maximize the overlap between Uyghur and Turkish words, the Turkish data is pre-processed on the word surface level, which results in 7.76% OOV-rate reduction on the Uyghur development set. To investigate various levels of low-resource conditions, different subsets of Uyghur data are generated. Morpheme-based language models trained with bilingual data achieved up to 40.91% relative perplexity reduction over the language models trained only with Uyghur data.

2016

Towards Automatic Transcription of ILSE ― an Interdisciplinary Longitudinal Study of Adult Development and Aging
Jochen Weiner | Claudia Frankenberg | Dominic Telaar | Britta Wendelstein | Johannes Schröder | Tanja Schultz
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The Interdisciplinary Longitudinal Study on Adult Development and Aging (ILSE) was created to facilitate the study of challenges posed by rapidly aging societies in developed countries such as Germany. ILSE contains over 8,000 hours of biographic interviews recorded from more than 1,000 participants over the course of 20 years. Investigations on various aspects of aging, such as cognitive decline, often rely on the analysis of linguistic features which can be derived from spoken content like these interviews. However, transcribing speech is a time- and cost-consuming manual process and so far only 380 hours of ILSE interviews have been transcribed. Thus, it is the aim of our work to establish technical systems to fully automatically transcribe the ILSE interview data. The joint occurrence of poor recording quality, long audio segments, erroneous transcriptions, varying speaking styles & crosstalk, and emotional & dialectal speech in these interviews presents challenges for automatic speech recognition (ASR). We describe our ongoing work towards the fully automatic transcription of all ILSE interviews and the steps we implemented in preparing the transcriptions to meet the interviews’ challenges. Using a recursive long audio alignment procedure, 96 hours of the transcribed data have been made accessible for ASR training.

2014

GlobalPhone: Pronunciation Dictionaries in 20 Languages
Tanja Schultz | Tim Schlippe
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper describes the advances in the multilingual text and speech database GlobalPhone, a multilingual database of high-quality read speech with corresponding transcriptions and pronunciation dictionaries in 20 languages. GlobalPhone was designed to be uniform across languages with respect to the amount of data, speech quality, the collection scenario, the transcription and phone set conventions. With more than 400 hours of transcribed audio data from more than 2000 native speakers GlobalPhone supplies an excellent basis for research in the areas of multilingual speech recognition, rapid deployment of speech processing systems to yet unsupported languages, language identification tasks, speaker recognition in multiple languages, multilingual speech synthesis, as well as monolingual speech recognition in a large variety of languages. Very recently the GlobalPhone pronunciation dictionaries have been made available for research and commercial purposes by the European Language Resources Association (ELRA).

Exploration of the Impact of Maximum Entropy in Recurrent Neural Network Language Models for Code-Switching Speech
Ngoc Thang Vu | Tanja Schultz
Proceedings of the First Workshop on Computational Approaches to Code Switching

2013

Combination of Recurrent Neural Networks and Factored Language Models for Code-Switching Language Modeling
Heike Adel | Ngoc Thang Vu | Tanja Schultz
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2011

This paper describes the speech-to-text systems used to provide automatic transcriptions used in the Quaero 2010 evaluation of Machine Translation from speech. Quaero (www.quaero.org) is a large research and industrial innovation program focusing on technologies for automatic analysis and classification of multimedia and multilingual documents. The ASR transcript is the result of a Rover combination of systems from three teams (KIT, RWTH, LIMSI+VR) for the French and German languages. The case-sensitive word error rates (WER) of the combined systems were respectively 20.8% and 18.1% on the 2010 evaluation data, relative WER reductions of 14.6% and 17.4% respectively over the best component system.

2009

Joint Learning of Preposition Senses and Semantic Roles of Prepositional Phrases
Daniel Dahlmeier | Hwee Tou Ng | Tanja Schultz
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

2008

Modeling Vocal Interaction for Text-Independent Participant Characterization in Multi-Party Conversation
Kornel Laskowski | Mari Ostendorf | Tanja Schultz
Proceedings of the 9th SIGdial Workshop on Discourse and Dialogue

Speech Translation for Triage of Emergency Phonecalls in Minority Languages
Udhyakumar Nallasamy | Alan Black | Tanja Schultz | Robert Frederking | Jerry Weltman
Coling 2008: Proceedings of the workshop on Speech Processing for Safety Critical Translation and Pervasive Applications

NineOneOne: Recognizing and Classifying Speech for Handling Minority Language Emergency Calls
Udhyakumar Nallasamy | Alan Black | Tanja Schultz | Robert Frederking
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper, we describe NineOneOne (9-1-1), a system designed to recognize and translate Spanish emergency calls for better dispatching. We analyze the research challenges in adapting speech translation technology to 9-1-1 domain. We report our initial research towards building the system and the results of our initial experiments.

2007

Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference
Candace Sidner | Tanja Schultz | Matthew Stone | ChengXiang Zhai
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference

Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers
Candace Sidner | Tanja Schultz | Matthew Stone | ChengXiang Zhai
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers

A Geometric Interpretation of Non-Target-Normalized Maximum Cross-Channel Correlation for Vocal Activity Detection in Meetings
Kornel Laskowski | Tanja Schultz
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers

Advances in the CMU/Interact Arabic GALE Transcription System
Mohamed Noamany | Thomas Schaaf | Tanja Schultz
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers

Bilingual-LSA Based LM Adaptation for Spoken Language Translation
Yik-Cheung Tam | Ian Lane | Tanja Schultz
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics

Improving spoken language translation by automatic disfluency removal: evidence from conversational speech transcripts
Sharath Rao | Ian Lane | Tanja Schultz
Proceedings of Machine Translation Summit XI: Papers

Modeling Vocal Interaction for Text-Independent Classification of Conversation Type
Kornel Laskowski | Mari Ostendorf | Tanja Schultz
Proceedings of the 8th SIGdial Workshop on Discourse and Dialogue

The paper describes our portable two-way speech-to-speech translation system using a completely eyes-free/hands-free user interface. This system translates between the language pair English and Iraqi Arabic as well as between English and Farsi, and was built within the framework of the DARPA TransTac program. The Farsi language support was developed within a 90-day period, testing our ability to rapidly support new languages. The paper gives an overview of the system’s components along with the individual component objective measures and a discussion of issues relevant for the overall usage of the system. We found that usability, flexibility, and robustness serve as severe constraints on system architecture and design.

2006

Thai Grapheme-Based Speech Recognition
Paisarn Charoenpornsawat | Sanjika Hewavitharana | Tanja Schultz
Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers

2005

Rapid Development of an Afrikaans English Speech-to-Speech Translator
Herman A. Engelbrecht | Tanja Schultz
Proceedings of the Second International Workshop on Spoken Language Translation

2004