This is an internal, incomplete preview of a proposed change to the ACL Anthology.
For efficiency reasons, we don't generate MODS or Endnote formats, and the preview may be incomplete in other ways, or contain mistakes.
Do not treat this content as an official publication.
FlorianSchiel
Fixing paper assignments
Please select all papers that belong to the same person.
Indicate below which author they should be assigned to.
This paper presents the results of an ongoing collaboration to develop an Arabic variety-independent romanization system that aims to homogenize and simplify the romanization of the Arabic script, and introduces an Arabic variety-independent WebMAUS service offering a free to use forced-alignment service fully integrated within the WebMAUS services. We present the rationale for developing such a system, highlighting the need for a detailed romanization system with graphemes corresponding to the phonemic short and long vowels/consonants in Arabic varieties. We describe how the acoustic model was created, followed by several hands-on recipes for applying the forced alignment webservice either online or programatically. Finally, we discuss some of the issues we faced during the development of the system.
The BAS CLARIN speech data repository is introduced. At the current state it comprises 31 pre-dominantly German corpora of spoken language. It is compliant to the CLARIN-D as well as the OLAC requirements. This enables its embedding into several infrastructures. We give an overview over its structure, its implementation as well as the corpora it contains.
In 2012 the Bavarian Archive for Speech Signals started providing some of its tools from the field of spoken language in the form of Software as a Service (SaaS). This means users access the processing functionality over a web browser and therefore do not have to install complex software packages on a local computer. Amongst others, these tools include segmentation & labeling, grapheme-to-phoneme conversion, text alignment, syllabification and metadata generation, where all but the last are available for a variety of languages. Since its creation the number of available services and the web interface have changed considerably. We give an overview and a detailed description of the system architecture, the available web services and their functionality. Furthermore, we show how the number of files processed over the system developed in the last four years.
Language documentation projects supported by recent funding intiatives have created a large number of multimedia corpora of typologically diverse languages. Most of these corpora provide a manual alignment of transcription and audio data at the level of larger units, such as sentences or intonation units. Their usefulness both for corpus-linguistic and psycholinguistic research and for the development of tools and teaching materials could, however, be increased by achieving a more fine-grained alignment of transcription and audio at the word or even phoneme level. Since most language documentation corpora contain data on small languages, there usually do not exist any speech recognizers or acoustic models specifically trained on these languages. We therefore investigate the feasibility of untrained forced alignment for such corpora. We report on an evaluation of the tool (Web)MAUS (Kisler, 2012) on several language documentation corpora and discuss practical issues in the application of forced alignment. Our evaluation shows that (Web)MAUS with its existing acoustic models combined with simple grapheme-to-phoneme conversion can be successfully used for word-level forced alignment of a diverse set of languages without additional training, especially if a manual prealignment of larger annotation units is already avaible.
Speech uttered under the influence of alcohol is known to deviate from the speech of the same person when sober. This is an important feature in forensic investigations and could also be used to detect intoxication in the automotive environment. Aside from acoustic-phonetic features and speech content which have already been studied by others in this contribution we address the question whether speakers use dialectal variation or dialect words more frequently when intoxicated than when sober. We analyzed 300,000 recorded word tokens in read and spontaneous speech uttered by 162 female and male speakers within the German Alcohol Language Corpus. We found that contrary to our expectations the frequency of dialectal forms decreases significantly when speakers are under the influence. We explain this effect with a compensatory over-shoot mechanism: speakers are aware of their intoxication and that they are being monitored. In forensic analysis of speech this ‘awareness factor’ must be taken into account.
In this study we investigate the idea to automatically evaluate newly created pronunciation encodings for being correct or containing a potential error. Using a cascaded triphone detector and phonotactical n-gram modeling with an optimal Bayesian threshold we classify unknown pronunciation transcripts into the classes 'probably faulty' or 'probably correct'. Transcripts tagged 'probably faulty' are forwarded to a manual inspection performed by an expert, while encodings tagged 'probably correct' are passed without further inspection. An evaluation of the new method on the German PHONOLEX lexical resource shows that with a tolerable error margin of approximately 3% faulty transcriptions a major reduction in work effort during the production of a new lexical resource can be achieved.
A new type of language resource called 'BAStat' has been released by the Bavarian Archive for Speech Signals at Ludwig Maximilians Universitaet, Munich. In contrast to primary resources like speech and text corpora BAStat comprises statistical estimates based on a number of primary spoken language resources: first and second order occurrence probability of phones, syllables and words, duration statistics, probabilities of pronunciation variants of words and probabilities of context information. Unlike other statistical speech resources BAStat is based solely on recordings of conversational German and therefore models spoken language not text. The resource consists of a bundle of 7-bit ASCII tables and matrices to maximize inter-operability between different operation systems and can be downloaded for free from the BAS web-site. This contribution gives a detailed description about the empirical basis, the contained data types, the format of the resulting statistical data, some interesting interpretations of grand figures and a brief comparison to the text-based statistical resource CELEX.
Nowadays portable devices such as smart phones can be used to capture the face of a user simultaneously with the voice input. Server based or even embedded dialogue system might utilize this additional information to detect whether the speaking user addresses the system or other parties or whether the listening user is focused on the display or not. Depending on these findings the dialogue system might change its strategy to interact with the user improving the overall communication between human and system. To develop and test methods for On/Off-Focus detection a multimodal corpus of user-machine interactions was recorded within the German SmartWeb project. The corpus comprises 99 recording sessions of a triad communication between the user, the system and a human companion. The user can address/watch/listen to the system but also talk to his companion, read from the display or simply talk to herself. Facial video is captured with a standard built-in video camera of a smart phone while voice input in being recorded by a high quality close microphone as well as over a realistic transmission line via Bluetooth and WCDMA. The resulting SmartWeb Video Corpus (SVC) can be obtained from the Bavarian Archive for Speech Signals.
The first release of the German Ph@ttSessionz speech database contains read and spontaneous speech from 864 adolescent speakers and is the largest database of its kind for German. It was recorded via the WWW in over 40 public schools in all dialect regions of Germany. In this paper, we present a cross-sectional study of f0 measurements on this database. The study documents the profound changes in male voices at the age 13-15. Furthermore, it shows that on a perceptive mel-scale, there is little difference in the relative f0 variability for male and female speakers. A closer analysis reveals that f0 variability is dependent on the speech style and both the length and the type of the utterance. The study provides statistically reliable voice parameters of adolescent speakers for German. The results may contribute to making spoken dialog systems more robust by restricting user input to utterances with low f0 variability.
A number of forensic studies published during the last 50 years report that intoxication with alcohol influences speech in a way that is made manifest in certain features of the speech signal. However, most of these studies are based on data that are not publicly available nor of statistically sufficient size. Furthermore, in spite of the positive reports nobody ever successfully implemented a method to detect alcoholic intoxication from the speech signal. The Alcohol Language Corpus (ALC) aims to answer these open questions by providing a publicly available large and statistically sound corpus of intoxicated and sober speech. This paper gives a detailed description of the corpus features and methodology. Also, we will present some preliminary results on a series of verifications about reported potential features that are claimed to reliably indicate alcoholic intoxication.
In this paper we outline the German speech data collection for the SmartWeb project, which is fundedby the German Ministry of Science and Education. We focus on the SmartWeb Handheld Corpus (SHC), which has been collected by the Bavarian Archive for Speech Signals (BAS) at the Phonetic Institute (IPSK) of Munich University. Signals of SHC are being recorded in real-life environments(indoor and outdoor) with real background noise as well as real transmission line errors. We developed a new elicitation method and recording technique, calledsituational prompting, which facilitates collecting realistic dialogue speech data in a cost efficient way. We can show that almost realistic speech queries to a dialogue system issued over a mobile PDA or smart phonecan be collected very efficiently using an automatic speech server. We describe the technical and linguistic features of the resulting speech corpus, which will bepublicly available at BAS or ELDA.
Three advanced German speech corpora have been collected during theGerman SmartWeb project. One of them, the SmartWeb MotorbikeCorpus (SMC) is described in this paper. As with all SmartWeb speech corpora SMC is designed for a dialogue system dealing with open domains. The corpus is recorded under the special circumstances of a motorbike ride and contains utterances of the driver related to information retrieval from various sources and different topics. Audio tracks show characteristic noise from the engine and surrounding traffic as well as drop outs caused by the transmission over Bluetooth and the UMTS mobile network. We discuss the problems of the technical setup and the fully automatic evocation of natural-spoken queries by means of dialogue-like sequences.