Punctuation Prediction for Unsegmented Transcript Based on Word Vector

Xiaoyin Che, Cheng Wang, Haojin Yang, Christoph Meinel


Abstract
In this paper we propose an approach to predict punctuation marks for unsegmented speech transcripts. The approach is purely lexical, with pre-trained word vectors as the only input. A Deep Neural Network (DNN) or Convolutional Neural Network (CNN) model is trained to classify whether a punctuation mark should be inserted after the third word of a 5-word sequence and, if so, which kind of punctuation mark it should be. TED talks within the IWSLT dataset are used in both the training and evaluation phases. The proposed approach shows its effectiveness by achieving better results than the state-of-the-art lexical solution that works with the same type of data, especially when predicting punctuation position only.
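The 5-word framing described in the abstract can be illustrated with a minimal sketch of how training examples might be built from punctuated text. This is not the authors' code: the function name `make_windows`, the label names, the three punctuation classes, and the `<pad>` token used at the transcript boundaries are all assumptions made for illustration. In a full system, each word in a window would then be replaced by its pre-trained word vector before being passed to the DNN or CNN classifier; that step is omitted here.

```python
from typing import List, Tuple

# Assumed label set; the abstract does not list the punctuation classes.
PUNCT = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}
PAD = "<pad>"  # hypothetical padding token for transcript boundaries


def make_windows(tokens: List[str], size: int = 5) -> List[Tuple[List[str], str]]:
    """Turn a punctuated token stream into (window, label) pairs.

    Each window holds `size` words. The label says what follows the
    middle word (the third word when size is 5): "O" for nothing,
    otherwise the punctuation class.
    """
    words: List[str] = []
    labels: List[str] = []
    for tok in tokens:
        if tok in PUNCT:
            # Attach the mark to the word that precedes it.
            if labels:
                labels[-1] = PUNCT[tok]
        else:
            words.append(tok)
            labels.append("O")

    half = size // 2
    padded = [PAD] * half + words + [PAD] * half
    # Window i is centred on words[i], so labels[i] is its target.
    return [(padded[i:i + size], labels[i]) for i in range(len(words))]


if __name__ == "__main__":
    text = "hello everyone , thank you for coming . shall we start ?"
    for window, label in make_windows(text.split()):
        print(" ".join(window), "->", label)
```

Centring the window on the word before the candidate position gives the classifier two words of left context and two of right context, which is one straightforward reading of "after the third word of a 5-word sequence".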
Anthology ID:
L16-1103
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
654–658
URL:
https://aclanthology.org/L16-1103
Cite (ACL):
Xiaoyin Che, Cheng Wang, Haojin Yang, and Christoph Meinel. 2016. Punctuation Prediction for Unsegmented Transcript Based on Word Vector. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 654–658, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
Punctuation Prediction for Unsegmented Transcript Based on Word Vector (Che et al., LREC 2016)
PDF:
https://preview.aclanthology.org/nschneid-patch-4/L16-1103.pdf