Quality Assurance of Automatic Annotation of Very Large Corpora: a Study based on heterogeneous Tagging System

Chu-Ren Huang, Lung-Hao Lee, Wei-guang Qu, Jia-Fei Hong, Shiwen Yu


Abstract
We propose a set of heuristics for efficiently improving the annotation quality of very large corpora. The Xinhua News portion of the Chinese Gigaword Corpus was tagged independently with both the Peking University ICL tagset and the Academia Sinica CKIP tagset. The corpus-based POS tag mapping will serve as a basis for examining possible contrasts in grammatical systems between the PRC and Taiwan. It can also serve as a basic model for mapping between the CKIP and ICL tagging systems for any data.
Anthology ID:
L08-1106
Volume:
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
Month:
May
Year:
2008
Address:
Marrakech, Morocco
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
URL:
http://www.lrec-conf.org/proceedings/lrec2008/pdf/686_paper.pdf
Cite (ACL):
Chu-Ren Huang, Lung-Hao Lee, Wei-guang Qu, Jia-Fei Hong, and Shiwen Yu. 2008. Quality Assurance of Automatic Annotation of Very Large Corpora: a Study based on heterogeneous Tagging System. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco. European Language Resources Association (ELRA).
Cite (Informal):
Quality Assurance of Automatic Annotation of Very Large Corpora: a Study based on heterogeneous Tagging System (Huang et al., LREC 2008)