Katsuhiro Sudoh


2014

pdf
The NAIST-NTT TED talk treebank
Graham Neubig | Katsuhiro Sudoh | Yusuke Oda | Kevin Duh | Hajime Tsukuda | Masaaki Nagata
Proceedings of the 11th International Workshop on Spoken Language Translation: Papers

Syntactic parsing is a fundamental natural language processing technology that has proven useful in machine translation, language modeling, sentence segmentation, and a number of other applications related to speech translation. However, there is a paucity of manually annotated syntactic parsing resources for speech, and particularly for the lecture speech that is the current target of the IWSLT translation campaign. In this work, we present a new manually annotated treebank of TED talks that we hope will prove useful for investigation into the interaction between syntax and these speechrelated applications. The first version of the corpus includes 1,217 sentences and 23,158 words manually annotated with parse trees, and aligned with translations in 26-43 different languages. In this paper we describe the collection of the corpus, and an analysis of its various characteristics.