This is an internal, incomplete preview of a proposed change to the ACL Anthology.
For efficiency reasons, we don't generate MODS or Endnote formats, and the preview may be incomplete in other ways, or contain mistakes.
Do not treat this content as an official publication.
NickCampbell
Fixing paper assignments
Please select all papers that belong to the same person.
Indicate below which author they should be assigned to.
Casual conversation has become a focus for artificial dialogue applications. Such talk is ubiquitous and its structure differs from that found in the task-based interactions which have been the focus of dialogue system design for many years. It is unlikely that such conversations can be modelled as an extension of task-based talk. We review theories of casual conversation, report on our studies of the structure of casual dialogue, and outline challenges we see for the development of spoken dialog systems capable of carrying on casual friendly conversation in addition to performing well-defined tasks.
We introduce a Multi-modal Neural Machine Translation model in which a doubly-attentive decoder naturally incorporates spatial visual features obtained using pre-trained convolutional neural networks, bridging the gap between image description and translation. Our decoder learns to attend to source-language words and parts of an image independently by means of two separate attention mechanisms as it generates words in the target language. We find that our model can efficiently exploit not just back-translated in-domain multi-modal data but also large general-domain text-only MT corpora. We also report state-of-the-art results on the Multi30k data set.
This paper presents the multimodal Interlingual Map Task Corpus (ILMT-s2s corpus) collected at Trinity College Dublin, and discuss some of the issues related to the collection and analysis of the data. The corpus design is inspired by the HCRC Map Task Corpus which was initially designed to support the investigation of linguistic phenomena, and has been the focus of a variety of studies of communicative behaviour. The simplicity of the task, and the complexity of phenomena it can elicit, make the map task an ideal object of study. Although there are studies that used replications of the map task to investigate communication in computer mediated tasks, this ILMT-s2s corpus is, to the best of our knowledge, the first investigation of communicative behaviour in the presence of three additional “filters”: Automatic Speech Recognition (ASR), Machine Translation (MT) and Text To Speech (TTS) synthesis, where the instruction giver and the instruction follower speak different languages. This paper details the data collection setup and completed annotation of the ILMT-s2s corpus, and outlines preliminary results obtained from the data.
This paper reports the preservation of an old speech synthesis website as a corpus. CHATR was a revolutionary technique developed in the mid nineties for concatenative speech synthesis. The method has since become the standard for high quality speech output by computer although much of the current research is devoted to parametric or hybrid methods that employ smaller amounts of data and can be more easily tunable to individual voices. The system was first reported in 1994 and the website was functional in 1996. The ATR labs where this system was invented no longer exist, but the website has been preserved as a corpus containing 1537 samples of synthesised speech from that period (118 MB in aiff format) in 211 pages under various finely interrelated themes The corpus can be accessed from www.speech-data.jp as well as www.tcd-fastnet.com, where the original code and samples are now being maintained.
Casual multiparty conversation is an understudied but very common genre of spoken interaction, whose analysis presents a number of challenges in terms of data scarcity and annotation. We describe the annotation process used on the d64 and DANS multimodal corpora of multiparty casual talk, which have been manually segmented, transcribed, annotated for laughter and disfluencies, and aligned using the Penn Aligner. We also describe a visualization tool, STAVE, developed during the annotation process, which allows long stretches of talk or indeed entire conversations to be viewed, aiding preliminary identification of features and patterns worthy of analysis. It is hoped that this tool will be of use to other researchers working in this field.
Biosignals, such as electrodermal activity (EDA) and heart rate, are increasingly being considered as potential data sources to provide information about the temporal fluctuations in affective experience during human interaction. This paper describes an English-speaking, multiple session corpus of small groups of people engaged in informal, unscripted conversation while wearing wireless, wrist-based EDA sensors. Additionally, one participant per recording session wore a heart rate monitor. This corpus was collected in order to observe potential interactions between various social and communicative phenomena and the temporal dynamics of the recorded biosignals. Here we describe the communicative context, technical set-up, synchronization process, and challenges in collecting and utilizing such data. We describe the segmentation and annotations to date, including laughter annotations, and how the research community can access and collaborate on this corpus now and in the future. We believe this corpus is particularly relevant to researchers interested in unscripted social conversation as well as to researchers with a specific interest in observing the dynamics of biosignals during informal social conversation rich with examples of laughter, conversational turn-taking, and non-task-based interaction.
This paper presents methodologies and tools for language resource (LR) construction. It describes a database of interactive speech collected over a three-month period at the Science Gallery in Dublin, where visitors could take part in a conversation with a robot. The system collected samples of informal, chatty dialogue -- normally difficult to capture under laboratory conditions for human-human dialogue, and particularly so for human-machine interaction. The conversations were based on a script followed by the robot consisting largely of social chat with some task-based elements. The interactions were audio-visually recorded using several cameras together with microphones. As part of the conversation the participants were asked to sign a consent form giving permission to use their data for human-machine interaction research. The multimodal corpus will be made available to interested researchers and the technology developed during the three-month exhibition is being extended for use in education and assisted-living applications.
We investigate the influence of audiovisual features on the perception of speaking style and performance of politicians, utilizing a large publicly available dataset of German parliament recordings. We conduct a human perception experiment involving eye-tracker data to evaluate human ratings as well as behavior in two separate conditions, i.e. audiovisual and video only. The ratings are evaluated on a five dimensional scale comprising measures of insecurity, monotony, expressiveness, persuasiveness, and overall performance. Further, they are statistically analyzed and put into context in a multimodal feature analysis, involving measures of prosody, voice quality and motion energy. The analysis reveals several statistically significant features, such as pause timing, voice quality measures and motion energy, that highly positively or negatively correlate with certain human ratings of speaking style. Additionally, we compare the gaze behavior of the human subjects to evaluate saliency regions in the multimodal and visual only conditions. The eye-tracking analysis reveals significant changes in the gaze behavior of the human subjects; participants reduce their focus of attention in the audiovisual condition mainly to the region of the face of the politician and scan the upper body, including hands and arms, in the video only condition.
This paper describes a software toolkit for the interactive display and analysis of automatically extracted or manually derived annotation features of visual and audio data. It has been extensively tested with material collected as part of the FreeTalk Multimodal Conversation Corpus. Both the corpus and the software are available for download from sites in Europe and Japan. The corpus consists of several hours of video and audio recordings from a variety of capture devices, and includes subjective annotations of the content, along with derived data obtained from image processing. Because of the large size of the corpus, it is unrealistic to expect researchers to download all the material before deciding whether it will be useful to them in their research. We have therefore devised a means for interactive browsing of the content and for viewing at different levels of granularity. This has resulted in a simple set of tools that can be added to any website to allow similar browsing of audio- video recordings and their related data and annotations.
This paper describes tools and techniques for accessing large quantities of speech data and for the visualisation of discourse interactions and events at levels above that of linguistic content. We are working with large quantities of dialogue speech including business meetings, friendly discourse, and telephone conversations, and have produced web-based tools for the visualisation of non-verbal and paralinguistic features of the speech data. In essence, they provide higher-level displays so that specific sections of speech, text, or other annotation can be accessed by the researcher and provide an interactive interface to the large amount of data through an Archive Browser.
At ATR, we are collecting and analysing meetings data using a table-top sensor device consisting of a small 360-degree camera surrounded by an array of high-quality directional microphones. This equipment provides a stream of information about the audio and visual events of the meeting which is then processed to form a representation of the verbal and non-verbal interpersonal activity, or discourse flow, during the meeting. This paper describes the resulting corpus of speech and video data which is being collected for the abovere search. It currently includes data from 12 monthly sessions, comprising 71 video and 33 audio modules. Collection is continuingmonthly and is scheduled to include another ten sessions.