CLTC: A Chinese-English Cross-lingual Topic Corpus

Yunqing Xia, Guoyu Tang, Peng Jin, Xia Yang


Abstract
Cross-lingual topic detection in text is a feasible way to overcome the language barrier in accessing information. This paper presents a Chinese-English cross-lingual topic corpus (CLTC), in which 90,000 Chinese articles and 90,000 English articles are organized into 150 topics. Compared with the TDT corpora, CLTC has three advantages. First, CLTC is larger, which makes it possible to evaluate large-scale cross-lingual text clustering methods. Second, articles are evenly distributed across the topics, so the corpus can be used to produce test datasets for different purposes. Third, CLTC can be used as a cross-lingual comparable corpus for developing cross-lingual information access methods. A preliminary evaluation with the CLTC corpus indicates that it is effective for evaluating cross-lingual topic detection methods.
Anthology ID:
L12-1197
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
532–537
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/389_Paper.pdf
Cite (ACL):
Yunqing Xia, Guoyu Tang, Peng Jin, and Xia Yang. 2012. CLTC: A Chinese-English Cross-lingual Topic Corpus. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 532–537, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
CLTC: A Chinese-English Cross-lingual Topic Corpus (Xia et al., LREC 2012)