Abstract
Neural machine translation requires large number of parallel sentences along with in-domain parallel data to attain best results. Nevertheless, no scientific parallel corpus for English-Persian language pair is available. In this paper, a parallel corpus called Esposito is introduced, which contains 3.5 million parallel sentences in the scientific domain for English-Persian language pair. In addition, we present a manually validated scientific test set that might serve as a baseline for future studies. We show that a system trained using Esposito along with other publicly available data improves the baseline on average by 7.6 and 8.4 BLEU scores for En->Fa and Fa->En directions, respectively. Additionally, domain analysis using the 5-gram KenLM model revealed notable distinctions between our parallel corpus and the existing generic parallel corpus. This dataset will be available to the public upon the acceptance of the paper.- Anthology ID:
- 2024.lrec-main.557
- Volume:
- Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
- Month:
- May
- Year:
- 2024
- Address:
- Torino, Italia
- Editors:
- Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
- Venues:
- LREC | COLING
- SIG:
- Publisher:
- ELRA and ICCL
- Note:
- Pages:
- 6299–6308
- Language:
- URL:
- https://aclanthology.org/2024.lrec-main.557
- DOI:
- Cite (ACL):
- Mersad Esalati, Mohammad Javad Dousti, and Heshaam Faili. 2024. Esposito: An English-Persian Scientific Parallel Corpus for Machine Translation. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 6299–6308, Torino, Italia. ELRA and ICCL.
- Cite (Informal):
- Esposito: An English-Persian Scientific Parallel Corpus for Machine Translation (Esalati et al., LREC-COLING 2024)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-4/2024.lrec-main.557.pdf