Mark Dingemanse


Building and curating conversational corpora for diversity-aware language science and technology
Andreas Liesenfeld | Mark Dingemanse
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We present an analysis pipeline and best practice guidelines for building and curating corpora of everyday conversation in diverse languages. Surveying language documentation corpora and other resources that cover 67 languages and varieties from 28 phyla, we describe the compilation and curation process, specify minimal properties of a unified format for interactional data, and develop methods for quality control that take into account turn-taking and timing. Two case studies show the broad utility of conversational data for (i) charting human interactional infrastructure and (ii) tracing challenges and opportunities for current ASR solutions. Linguistically diverse conversational corpora can provide new insights for the language sciences and stronger empirical foundations for language technology.

From text to talk: Harnessing conversational corpora for humane and diversity-aware language technology
Mark Dingemanse | Andreas Liesenfeld
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Informal social interaction is the primordial home of human language. Linguistically diverse conversational corpora are an important and largely untapped resource for computational linguistics and language technology. Through the efforts of a worldwide language documentation movement, such corpora are increasingly becoming available. We show how interactional data from 63 languages (26 families) harbours insights about turn-taking, timing, sequential structure and social action, with implications for language technology, natural language understanding, and the design of conversational interfaces. Harnessing linguistically diverse conversational corpora will provide the empirical foundations for flexible, localizable, humane language technologies of the future.

Evaluation of Automatic Speech Recognition for Conversational Speech in Dutch, English and German: What Goes Missing?
Alianda Lopez | Andreas Liesenfeld | Mark Dingemanse
Proceedings of the 18th Conference on Natural Language Processing (KONVENS 2022)


A simple repair mechanism can alleviate computational demands of pragmatic reasoning: simulations and complexity analysis
Jacqueline van Arkel | Marieke Woensdregt | Mark Dingemanse | Mark Blokpoel
Proceedings of the 24th Conference on Computational Natural Language Learning

How can people communicate successfully while keeping resource costs low in the face of ambiguity? We present a principled theoretical analysis comparing two strategies for disambiguation in communication: (i) pragmatic reasoning, where communicators reason about each other, and (ii) other-initiated repair, where communicators signal and resolve trouble interactively. Using agent-based simulations and computational complexity analyses, we compare the efficiency of these strategies in terms of communicative success, computation cost and interaction cost. We show that agents with a simple repair mechanism can increase efficiency, compared to pragmatic agents, by reducing their computational burden at the cost of longer interactions. We also find that efficiency is highly contingent on the mechanism, highlighting the importance of explicit formalisation and computational rigour.


A high speed transcription interface for annotating primary linguistic data
Mark Dingemanse | Jeremy Hammond | Herman Stehouwer | Aarthy Somasundaram | Sebastian Drude
Proceedings of the 6th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities