Multilingual Corpus Creation for Multilingual Semantic Similarity Task
Mahtab Ahmed, Chahna Dixit, Robert E. Mercer, Atif Khan, Muhammad Rifayat Samee, Felipe Urra
Abstract
In natural language processing, performance on a semantic similarity task relies heavily on the availability of a large corpus. Various monolingual corpora are available (mainly in English), but multilingual resources are very limited. In this work, we describe a semi-automated framework for creating a multilingual corpus that can be used for the multilingual semantic similarity task. The similar sentence pairs are obtained by crawling bilingual websites, whereas the dissimilar sentence pairs are selected by applying topic modeling and an OpenAI GPT model to the similar sentence pairs. We focus on websites in the government, insurance, and banking domains to collect English-French and English-Spanish sentence pairs; however, this corpus creation approach can be applied to any other industry vertical, provided that a bilingual website exists. We also show experimental results for multilingual semantic similarity to verify the quality of the corpus and demonstrate its usage.
- Anthology ID: 2020.lrec-1.516
- Volume: Proceedings of the Twelfth Language Resources and Evaluation Conference
- Month: May
- Year: 2020
- Address: Marseille, France
- Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
- Venue: LREC
- Publisher: European Language Resources Association
- Pages: 4190–4196
- Language: English
- URL: https://aclanthology.org/2020.lrec-1.516
- Cite (ACL): Mahtab Ahmed, Chahna Dixit, Robert E. Mercer, Atif Khan, Muhammad Rifayat Samee, and Felipe Urra. 2020. Multilingual Corpus Creation for Multilingual Semantic Similarity Task. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4190–4196, Marseille, France. European Language Resources Association.
- Cite (Informal): Multilingual Corpus Creation for Multilingual Semantic Similarity Task (Ahmed et al., LREC 2020)
- PDF: https://preview.aclanthology.org/nschneid-patch-2/2020.lrec-1.516.pdf