Itihasa: A large-scale corpus for Sanskrit to English translation
Rahul Aralikatte, Miryam de Lhoneux, Anoop Kunchukuttan, Anders Søgaard
Abstract
This work introduces Itihasa, a large-scale translation dataset containing 93,000 pairs of Sanskrit shlokas and their English translations. The shlokas are extracted from two Indian epics, the Ramayana and the Mahabharata. We first describe the motivation behind the curation of such a dataset and follow up with an empirical analysis to bring out its nuances. We then benchmark the performance of standard translation models on this corpus and show that even state-of-the-art transformer architectures perform poorly, emphasizing the complexity of the dataset.
- Anthology ID:
- 2021.wat-1.22
- Original:
- 2021.wat-1.22v1
- Version 2:
- 2021.wat-1.22v2
- Volume:
- Proceedings of the 8th Workshop on Asian Translation (WAT2021)
- Month:
- August
- Year:
- 2021
- Address:
- Online
- Editors:
- Toshiaki Nakazawa, Hideki Nakayama, Isao Goto, Hideya Mino, Chenchen Ding, Raj Dabre, Anoop Kunchukuttan, Shohei Higashiyama, Hiroshi Manabe, Win Pa Pa, Shantipriya Parida, Ondřej Bojar, Chenhui Chu, Akiko Eriguchi, Kaori Abe, Yusuke Oda, Katsuhito Sudoh, Sadao Kurohashi, Pushpak Bhattacharyya
- Venue:
- WAT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 191–197
- URL:
- https://preview.aclanthology.org/icon-24-ingestion/2021.wat-1.22/
- DOI:
- 10.18653/v1/2021.wat-1.22
- Cite (ACL):
- Rahul Aralikatte, Miryam de Lhoneux, Anoop Kunchukuttan, and Anders Søgaard. 2021. Itihasa: A large-scale corpus for Sanskrit to English translation. In Proceedings of the 8th Workshop on Asian Translation (WAT2021), pages 191–197, Online. Association for Computational Linguistics.
- Cite (Informal):
- Itihasa: A large-scale corpus for Sanskrit to English translation (Aralikatte et al., WAT 2021)
- PDF:
- https://preview.aclanthology.org/icon-24-ingestion/2021.wat-1.22.pdf
- Data
- Itihasa