Timo Baumann

2024

pdf
Evaluating and Fine-Tuning Retrieval-Augmented Language Models to Generate Text with Accurate Citations
Vinzent Penzkofer | Timo Baumann
Proceedings of the 20th Conference on Natural Language Processing (KONVENS 2024)

2022

pdf abs
Merkel Podcast Corpus: A Multimodal Dataset Compiled from 16 Years of Angela Merkel’s Weekly Video Podcasts
Debjoy Saha | Shravan Nayak | Timo Baumann
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We introduce the Merkel Podcast Corpus, an audio-visual-text corpus in German collected from 16 years of (almost) weekly Internet podcasts of former German chancellor Angela Merkel. To the best of our knowledge, this is the first single speaker corpus in the German language consisting of audio, visual and text modalities of comparable size and temporal extent. We describe the methods used with which we have collected and edited the data which involves downloading the videos, transcripts and other metadata, forced alignment, performing active speaker recognition and face detection to finally curate the single speaker dataset consisting of utterances spoken by Angela Merkel. The proposed pipeline is general and can be used to curate other datasets of similar nature, such as talk show contents. Through various statistical analyses and applications of the dataset in talking face generation and TTS, we show the utility of the dataset. We argue that it is a valuable contribution to the research community, in particular, due to its realistic and challenging material at the boundary between prepared and spontaneous speech.

2020

Dubbing has two shades; synchronisation constraints are applied only when the actor’s mouth is visible on screen, while the translation is unconstrained for off-screen dubbing. Consequently, different synchronisation requirements, and therefore translation strategies, are applied depending on the type of dubbing. In this work, we manually annotate an existing dubbing corpus (Heroes) for this dichotomy. We show that, even though we did not observe distinctive features between on- and off-screen dubbing at the textual level, on-screen dubbing is more difficult for MT (-4 BLEU points). Moreover, synchronisation constraints dramatically decrease translation quality for off-screen dubbing. We conclude that, distinguishing between on-screen and off-screen dubbing is necessary for determining successful strategies for dubbing-customised Machine Translation.

2019

pdf abs
Integration of Dubbing Constraints into Machine Translation
Ashutosh Saboo | Timo Baumann
Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers)

Translation systems aim to perform a meaning-preserving conversion of linguistic material (typically text but also speech) from a source to a target language (and, to a lesser degree, the corresponding socio-cultural contexts). Dubbing, i.e., the lip-synchronous translation and revoicing of speech adds to this constraints about the close matching of phonetic and resulting visemic synchrony characteristics of source and target material. There is an inherent conflict between a translation’s meaning preservation and ‘dubbability’ and the resulting trade-off can be controlled by weighing the synchrony constraints. We introduce our work, which to the best of our knowledge is the first of its kind, on integrating synchrony constraints into the machine translation paradigm. We present first results for the integration of synchrony constraints into encoder decoder-based neural machine translation and show that considerably more ‘dubbable’ translations can be achieved with only a small impact on BLEU score, and dubbability improves more steeply than BLEU degrades.

2018

pdf abs
Style Detection for Free Verse Poetry from Text and Speech
Timo Baumann | Hussein Hussein | Burkhard Meyer-Sickendiek
Proceedings of the 27th International Conference on Computational Linguistics

Modern and post-modern free verse poems feature a large and complex variety in their poetic prosodies that falls along a continuum from a more fluent to a more disfluent and choppy style. As the poets of modernism overcame rhyme and meter, they oriented themselves in these two opposing directions, creating a free verse spectrum that calls for new analyses of prosodic forms. We present a method, grounded in philological analysis and current research on cognitive (dis)fluency, for automatically analyzing this spectrum. We define and relate six classes of poetic styles (ranging from parlando to lettristic decomposition) by their gradual differentiation. Based on this discussion, we present a model for automatic prosodic classification of spoken free verse poetry that uses deep hierarchical attention networks to integrate the source text and audio and predict the assigned class. We evaluate our model on a large corpus of German author-read post-modern poetry and find that classes can reliably be differentiated, reaching a weighted f-measure of 0.73, when combining textual and phonetic evidence. In our further analyses, we validate the model’s decision-making process, the philologically hypothesized continuum of fluency and investigate the relative importance of various features.

pdf abs
Analysis of Rhythmic Phrasing: Feature Engineering vs. Representation Learning for Classifying Readout Poetry
Timo Baumann | Hussein Hussein | Burkhard Meyer-Sickendiek
Proceedings of the Second Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

We show how to classify the phrasing of readout poems with the help of machine learning algorithms that use manually engineered features or automatically learn representations. We investigate modern and postmodern poems from the webpage lyrikline, and focus on two exemplary rhythmical patterns in order to detect the rhythmic phrasing: The Parlando and the Variable Foot. These rhythmical patterns have been compared by using two important theoretical works: The Generative Theory of Tonal Music and the Rhythmic Phrasing in English Verse. Using both, we focus on a combination of four different features: The grouping structure, the metrical structure, the time-span-variation, and the prolongation in order to detect the rhythmic phrasing in the two rhythmical types. We use manually engineered features based on text-speech alignment and parsing for classification. We also train a neural network to learn its own representation based on text, speech and audio during pauses. The neural network outperforms manual feature engineering, reaching an f-measure of 0.85.

2016

pdf abs
Large-scale Analysis of Spoken Free-verse Poetry
Timo Baumann | Burkhard Meyer-Sickendiek
Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH)

Most modern and post-modern poems have developed a post-metrical idea of lyrical prosody that employs rhythmical features of everyday language and prose instead of a strict adherence to rhyme and metrical schemes. This development is subsumed under the term free verse prosody. We present our methodology for the large-scale analysis of modern and post-modern poetry in both their written form and as spoken aloud by the author. We employ language processing tools to align text and speech, to generate a null-model of how the poem would be spoken by a naïve reader, and to extract contrastive prosodic features used by the poet. On these, we intend to build our model of free verse prosody, which will help to understand, differentiate and relate the different styles of free verse poetry. We plan to use our processing scheme on large amounts of data to iteratively build models of styles, to validate and guide manual style annotation, to identify further rhythmical categories, and ultimately to broaden our understanding of free verse poetry. In this paper, we report on a proof-of-concept of our methodology using smaller amounts of poems and a limited set of features. We find that our methodology helps to extract differentiating features in the authors’ speech that can be explained by philological insight. Thus, our automatic method helps to guide the literary analysis and this in turn helps to improve our computational models.

pdf abs
Mining the Spoken Wikipedia for Speech Data and Beyond
Arne Köhn | Florian Stegen | Timo Baumann
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present a corpus of time-aligned spoken data of Wikipedia articles as well as the pipeline that allows to generate such corpora for many languages. There are initiatives to create and sustain spoken Wikipedia versions in many languages and hence the data is freely available, grows over time, and can be used for automatic corpus creation. Our pipeline automatically downloads and aligns this data. The resulting German corpus currently totals 293h of audio, of which we align 71h in full sentences and another 86h of sentences with some missing words. The English corpus consists of 287h, for which we align 27h in full sentence and 157h with some missing words. Results are publically available.

pdf abs
Predictive Incremental Parsing Helps Language Modeling
Arne Köhn | Timo Baumann
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Predictive incremental parsing produces syntactic representations of sentences as they are produced, e.g. by typing or speaking. In order to generate connected parses for such unfinished sentences, upcoming word types can be hypothesized and structurally integrated with already realized words. For example, the presence of a determiner as the last word of a sentence prefix may indicate that a noun will appear somewhere in the completion of that sentence, and the determiner can be attached to the predicted noun. We combine the forward-looking parser predictions with backward-looking N-gram histories and analyze in a set of experiments the impact on language models, i.e. stronger discriminative power but also higher data sparsity. Conditioning N-gram models, MaxEnt models or RNN-LMs on parser predictions yields perplexity reductions of about 6%. Our method (a) retains online decoding capabilities and (b) incurs relatively little computational overhead which sets it apart from previous approaches that use syntax for language modeling. Our method is particularly attractive for modular systems that make use of a syntax parser anyway, e.g. as part of an understanding pipeline where predictive parsing improves language modeling at no additional cost.

2014

pdf bib abs
Towards simultaneous interpreting: the timing of incremental machine translation and speech synthesis
Timo Baumann | Srinivas Bangalore | Julia Hirschberg
Proceedings of the 11th International Workshop on Spoken Language Translation: Papers

In simultaneous interpreting, human experts incrementally construct and extend partial hypotheses about the source speaker’s message, and start to verbalize a corresponding message in the target language, based on a partial translation – which may have to be corrected occasionally. They commence the target utterance in the hope that they will be able to finish understanding the source speaker’s message and determine its translation in time for the unfolding delivery. Of course, both incremental understanding and translation by humans can be garden-pathed, although experts are able to optimize their delivery so as to balance the goals of minimal latency, translation quality and high speech fluency with few corrections. We investigate the temporal properties of both translation input and output to evaluate the tradeoff between low latency and translation quality. In addition, we estimate the improvements that can be gained with a tempo-elastic speech synthesizer.

pdf
Situationally Aware In-Car Information Presentation Using Incremental Speech Generation: Safer, and More Effective
Spyros Kousidis | Casey Kennington | Timo Baumann | Hendrik Buschmeier | Stefan Kopp | David Schlangen
Proceedings of the EACL 2014 Workshop on Dialogue in Motion