The NAIST-NTT TED talk treebank

Graham Neubig, Katsuhiro Sudoh, Yusuke Oda, Kevin Duh, Hajime Tsukuda, Masaaki Nagata


Abstract
Syntactic parsing is a fundamental natural language processing technology that has proven useful in machine translation, language modeling, sentence segmentation, and a number of other applications related to speech translation. However, there is a paucity of manually annotated syntactic parsing resources for speech, and particularly for the lecture speech that is the current target of the IWSLT translation campaign. In this work, we present a new manually annotated treebank of TED talks that we hope will prove useful for investigation into the interaction between syntax and these speech-related applications. The first version of the corpus includes 1,217 sentences and 23,158 words manually annotated with parse trees and aligned with translations in 26-43 different languages. In this paper we describe the collection of the corpus and present an analysis of its various characteristics.
Anthology ID:
2014.iwslt-papers.16
Volume:
Proceedings of the 11th International Workshop on Spoken Language Translation: Papers
Month:
December 4-5
Year:
2014
Address:
Lake Tahoe, California
Venue:
IWSLT
SIG:
SIGSLT
Pages:
265–270
URL:
https://aclanthology.org/2014.iwslt-papers.16
Cite (ACL):
Graham Neubig, Katsuhiro Sudoh, Yusuke Oda, Kevin Duh, Hajime Tsukuda, and Masaaki Nagata. 2014. The NAIST-NTT TED talk treebank. In Proceedings of the 11th International Workshop on Spoken Language Translation: Papers, pages 265–270, Lake Tahoe, California.
Cite (Informal):
The NAIST-NTT TED talk treebank (Neubig et al., IWSLT 2014)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2014.iwslt-papers.16.pdf