Chunk Different Kind of Spoken Discourse: Challenges for Machine Learning
Iris Eshkol-Taravella, Mariame Maarouf, Flora Badin, Marie Skrovec, Isabelle Tellier
Abstract
This paper describes the development of a chunker for spoken data, trained by supervised machine learning with conditional random fields (CRFs) on a small reference corpus composed of two kinds of discourse: prepared monologue and spontaneous talk in interaction. The methodology takes into account the specific character of spoken data. The learner uses the output of several available taggers as is, without manual correction. Experiments show that the discourse type (monologue vs. free talk), the nature of the speech (spontaneous vs. prepared), and the corpus size can all influence the results of the machine learning process and must be considered when interpreting them.
- Anthology ID: 2020.lrec-1.635
- Volume: Proceedings of the Twelfth Language Resources and Evaluation Conference
- Month: May
- Year: 2020
- Address: Marseille, France
- Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
- Venue: LREC
- Publisher: European Language Resources Association
- Pages: 5164–5168
- Language: English
- URL: https://aclanthology.org/2020.lrec-1.635
- Cite (ACL): Iris Eshkol-Taravella, Mariame Maarouf, Flora Badin, Marie Skrovec, and Isabelle Tellier. 2020. Chunk Different Kind of Spoken Discourse: Challenges for Machine Learning. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 5164–5168, Marseille, France. European Language Resources Association.
- Cite (Informal): Chunk Different Kind of Spoken Discourse: Challenges for Machine Learning (Eshkol-Taravella et al., LREC 2020)
- PDF: https://preview.aclanthology.org/naacl24-info/2020.lrec-1.635.pdf