Constructing the CODA Corpus: A Parallel Corpus of Monologues and Expository Dialogues

Svetlana Stoyanchev, Paul Piwek


Abstract
We describe the construction of the CODA corpus, a parallel corpus of monologues and expository dialogues. The dialogue part of the corpus consists of expository (i.e., information-delivering rather than dramatic) dialogues written by several acclaimed authors. The monologue part is a paraphrase of these dialogues in monologue form, written by a human annotator. The annotator-written monologue preserves all information present in the original dialogue and introduces no new information. The corpus was constructed as a resource for extracting rules for the automated generation of dialogue from monologue. Using authored dialogues allows us to analyse the techniques that accomplished writers use to present information in dialogue form. The dialogues are annotated with dialogue acts and the monologues with rhetorical structure. We developed annotation and translation guidelines, together with a custom tool for carrying out translation, alignment and annotation of the dialogues. The final parallel CODA corpus consists of 1000 dialogue turns that are tagged with dialogue acts and aligned with monologue text that expresses the same information and is annotated with rhetorical structure relations.
Anthology ID:
L10-1079
Volume:
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Month:
May
Year:
2010
Address:
Valletta, Malta
Editors:
Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, Daniel Tapias
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
URL:
http://www.lrec-conf.org/proceedings/lrec2010/pdf/127_Paper.pdf
Cite (ACL):
Svetlana Stoyanchev and Paul Piwek. 2010. Constructing the CODA Corpus: A Parallel Corpus of Monologues and Expository Dialogues. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).
Cite (Informal):
Constructing the CODA Corpus: A Parallel Corpus of Monologues and Expository Dialogues (Stoyanchev & Piwek, LREC 2010)