Syntactic annotation of spontaneous speech: application to call-center conversation data

Thierry Bazillon, Melanie Deplano, Frederic Bechet, Alexis Nasr, Benoit Favre


Abstract
This paper describes the syntactic annotation process of the DECODA corpus. This corpus contains manual transcriptions of spoken conversations recorded in the French call-center of the Paris Public Transport Authority (RATP). Three levels of syntactic annotation have been performed with a semi-supervised approach: POS tags, syntactic chunks, and dependency parses. The main idea is to use off-the-shelf NLP tools and models, originally developed and trained on written text, to perform a first automatic annotation of the manually transcribed corpus. At the same time, a fully manual annotation process is performed on a subset of the original corpus, called the GOLD corpus. An iterative process is then applied: errors found in the automatic annotations are manually corrected, the linguistic models of the NLP tools are retrained on this corrected corpus, and the quality of the adapted models is checked against the fully manual annotations of the GOLD corpus. This process iterates until a certain error rate is reached. This paper describes this process and the main issues that arise when adapting NLP tools to process speech transcriptions, and presents the first evaluations performed with these new adapted tools.
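The iterative adaptation loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: a toy most-frequent-tag lexicon stands in for the off-the-shelf NLP tools, the `correct` callable stands in for the human annotator, and every function name, the tag labels, and the example sentences are assumptions made for this sketch.

```python
from collections import Counter, defaultdict


def train(corpus):
    """Train a toy most-frequent-tag lexicon from sentences of (word, tag) pairs."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def annotate(model, words, default="NOUN"):
    """Automatically tag a list of words; unknown words get a default tag."""
    return [(word, model.get(word, default)) for word in words]


def error_rate(model, gold):
    """Fraction of GOLD tokens whose predicted tag differs from the manual tag."""
    total = wrong = 0
    for sentence in gold:
        words = [word for word, _ in sentence]
        for (_, predicted), (_, expected) in zip(annotate(model, words), sentence):
            total += 1
            wrong += predicted != expected
    return wrong / total if total else 0.0


def adapt(written_corpus, transcript_batches, correct, gold, target_error):
    """Annotate, correct, retrain, and evaluate until the target error is reached."""
    training = list(written_corpus)
    model = train(training)              # initial model, trained on written text only
    history = [error_rate(model, gold)]
    for batch in transcript_batches:
        if history[-1] <= target_error:
            break
        auto = [annotate(model, words) for words in batch]    # automatic annotation
        training.extend(correct(sentence) for sentence in auto)  # manual correction
        model = train(training)                                # retraining
        history.append(error_rate(model, gold))                # check on GOLD
    return model, history


# Hypothetical toy data: written text has no disfluencies, the transcripts do.
written = [[("le", "DET"), ("bus", "NOUN"), ("arrive", "VERB")]]
batches = [[["euh", "le", "bus", "arrive"]], [["ben", "le", "bus"]]]
gold = [
    [("euh", "INTJ"), ("le", "DET"), ("bus", "NOUN"), ("arrive", "VERB")],
    [("ben", "INTJ"), ("le", "DET"), ("bus", "NOUN")],
]
reference = {"euh": "INTJ", "ben": "INTJ"}  # what the simulated annotator fixes


def correct(sentence):
    """Simulated manual correction: override tags the annotator knows are wrong."""
    return [(word, reference.get(word, tag)) for word, tag in sentence]


model, history = adapt(written, batches, correct, gold, target_error=0.0)
print(history)
```

In this toy setting the error rate on the GOLD subset drops after each round of corrected data, which is the behaviour the stopping criterion relies on. A real setup would replace `train` and `annotate` with the tagger, chunker, or parser being adapted.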
Anthology ID:
L12-1397
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
1338–1342
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/682_Paper.pdf
Cite (ACL):
Thierry Bazillon, Melanie Deplano, Frederic Bechet, Alexis Nasr, and Benoit Favre. 2012. Syntactic annotation of spontaneous speech: application to call-center conversation data. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 1338–1342, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
Syntactic annotation of spontaneous speech: application to call-center conversation data (Bazillon et al., LREC 2012)
PDF:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/682_Paper.pdf