Co-clustering of bilingual datasets as a mean for assisting the construction of thematic bilingual comparable corpora

Guiyao Ke, Pierre-Francois Marteau


Abstract
In this paper, we address the assisted construction of bilingual thematic comparable corpora by co-clustering bilingual documents collected from raw sources such as the Web. The proposed approach relies on a quantitative comparability measure and a co-clustering method that combine the similarity measures available in each of the two linguistic spaces with a “thematic” comparability measure that defines a mapping between these two spaces. Building on the resulting improvement in co-clustering (k-medoids) performance, we use a comparability threshold and manual verification to ensure a good and robust alignment of co-clusters (co-medoids). Finally, from any available raw corpus, we enrich the aligned clusters to provide “thematic” comparable corpora of good quality and controlled size. In a case study that exploits raw web data, we show that this approach scales reasonably well and is well suited to the construction of thematic comparable corpora of good quality.
Anthology ID:
L14-1677
Volume:
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month:
May
Year:
2014
Address:
Reykjavik, Iceland
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
1992–1999
URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/88_Paper.pdf
Cite (ACL):
Guiyao Ke and Pierre-Francois Marteau. 2014. Co-clustering of bilingual datasets as a mean for assisting the construction of thematic bilingual comparable corpora. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 1992–1999, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal):
Co-clustering of bilingual datasets as a mean for assisting the construction of thematic bilingual comparable corpora (Ke & Marteau, LREC 2014)