The NAIST-NTT TED talk treebank
Graham Neubig, Katsuhiro Sudoh, Yusuke Oda, Kevin Duh, Hajime Tsukuda, Masaaki Nagata
Abstract
Syntactic parsing is a fundamental natural language processing technology that has proven useful in machine translation, language modeling, sentence segmentation, and a number of other applications related to speech translation. However, there is a paucity of manually annotated syntactic parsing resources for speech, and particularly for the lecture speech that is the current target of the IWSLT translation campaign. In this work, we present a new manually annotated treebank of TED talks that we hope will prove useful for investigation into the interaction between syntax and these speech-related applications. The first version of the corpus includes 1,217 sentences and 23,158 words manually annotated with parse trees, and aligned with translations in 26-43 different languages. In this paper we describe the collection of the corpus and present an analysis of its various characteristics.
- Anthology ID:
- 2014.iwslt-papers.16
- Volume:
- Proceedings of the 11th International Workshop on Spoken Language Translation: Papers
- Month:
- December 4-5
- Year:
- 2014
- Address:
- Lake Tahoe, California
- Venue:
- IWSLT
- SIG:
- SIGSLT
- Pages:
- 265–270
- URL:
- https://aclanthology.org/2014.iwslt-papers.16
- Cite (ACL):
- Graham Neubig, Katsuhiro Sudoh, Yusuke Oda, Kevin Duh, Hajime Tsukuda, and Masaaki Nagata. 2014. The NAIST-NTT TED talk treebank. In Proceedings of the 11th International Workshop on Spoken Language Translation: Papers, pages 265–270, Lake Tahoe, California.
- Cite (Informal):
- The NAIST-NTT TED talk treebank (Neubig et al., IWSLT 2014)
- PDF:
- https://preview.aclanthology.org/paclic-22-ingestion/2014.iwslt-papers.16.pdf