Word-based Partial Annotation for Efficient Corpus Construction

Graham Neubig, Shinsuke Mori


Abstract
In order to utilize the corpus-based techniques that have proven effective in natural language processing in recent years, costly and time-consuming manual creation of linguistic resources is often necessary. Traditionally these resources are created on the document or sentence-level. In this paper, we examine the benefit of annotating only particular words with high information content, as opposed to the entire sentence or document. Using the task of Japanese pronunciation estimation as an example, we devise a machine learning method that can be trained on data annotated word-by-word. This is done by dividing the estimation process into two steps (word segmentation and word-based pronunciation estimation), and introducing a point-wise estimator that is able to make each decision independent of the other decisions made for a particular sentence. In an evaluation, the proposed strategy is shown to provide greater increases in accuracy using a smaller number of annotated words than traditional sentence-based annotation techniques.
Anthology ID:
L10-1281
Volume:
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Month:
May
Year:
2010
Address:
Valletta, Malta
Venue:
LREC
SIG:
Publisher:
European Language Resources Association (ELRA)
Note:
Pages:
Language:
URL:
http://www.lrec-conf.org/proceedings/lrec2010/pdf/408_Paper.pdf
DOI:
Bibkey:
Cite (ACL):
Graham Neubig and Shinsuke Mori. 2010. Word-based Partial Annotation for Efficient Corpus Construction. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).
Cite (Informal):
Word-based Partial Annotation for Efficient Corpus Construction (Neubig & Mori, LREC 2010)
Copy Citation:
PDF:
http://www.lrec-conf.org/proceedings/lrec2010/pdf/408_Paper.pdf