Innovations in Parallel Corpus Search Tools

Martin Volk, Johannes Graën, Elena Callegaro


Abstract
Recent years have seen increased interest in, and availability of, parallel corpora. Large corpora from international organizations (e.g. European Union, United Nations, European Patent Office) or from multilingual Internet sites (e.g. OpenSubtitles) are now easily available and are used not only for statistical machine translation but also for online search by different user groups. This paper gives an overview of different usages and different types of search systems. In the past, parallel corpus search systems were based on sentence-aligned corpora. We argue that automatic word alignment allows for major innovations in searching parallel corpora. Some online query systems already employ word alignment for sorting translation variants, but none supports the full query functionality that has been developed for parallel treebanks. We propose to develop such a system for efficiently searching large parallel corpora with a powerful query language.
Anthology ID:
L14-1418
Volume:
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month:
May
Year:
2014
Address:
Reykjavik, Iceland
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
3172–3178
URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/504_Paper.pdf
Cite (ACL):
Martin Volk, Johannes Graën, and Elena Callegaro. 2014. Innovations in Parallel Corpus Search Tools. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 3172–3178, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal):
Innovations in Parallel Corpus Search Tools (Volk et al., LREC 2014)