Chunk Different Kind of Spoken Discourse: Challenges for Machine Learning

Iris Eshkol-Taravella, Mariame Maarouf, Flora Badin, Marie Skrovec, Isabelle Tellier


Abstract
This paper describes the development of a chunker for spoken data, trained by supervised machine learning with conditional random fields (CRFs) on a small reference corpus composed of two kinds of discourse: prepared monologue vs. spontaneous talk in interaction. The methodology takes the specific characteristics of spoken data into account. The learning process uses the output of several available taggers as is, without manual correction. Experiments show that the discourse type (monologue vs. free talk), the nature of the speech (spontaneous vs. prepared) and the corpus size can all influence the results of the machine learning process and must be considered when interpreting those results.
Anthology ID:
2020.lrec-1.635
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
5164–5168
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.635
Cite (ACL):
Iris Eshkol-Taravella, Mariame Maarouf, Flora Badin, Marie Skrovec, and Isabelle Tellier. 2020. Chunk Different Kind of Spoken Discourse: Challenges for Machine Learning. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 5164–5168, Marseille, France. European Language Resources Association.
Cite (Informal):
Chunk Different Kind of Spoken Discourse: Challenges for Machine Learning (Eshkol-Taravella et al., LREC 2020)
PDF:
https://preview.aclanthology.org/landing_page/2020.lrec-1.635.pdf