PRALEKHA: Cross-Lingual Document Alignment for Indic Languages
Sanjay Suryanarayanan, Haiyue Song, Mohammed Safi Ur Rahman Khan, Anoop Kunchukuttan, Raj Dabre
Abstract
Mining parallel document pairs for document-level machine translation (MT) remains challenging due to the limitations of existing Cross-Lingual Document Alignment (CLDA) techniques. Existing methods often rely on metadata such as URLs, which are scarce, or on pooled document representations that fail to capture fine-grained alignment cues. Moreover, the limited context window of sentence embedding models hinders their ability to represent document-level context, while sentence-based alignment introduces a combinatorially large search space, leading to high computational cost. To address these challenges for Indic languages, we introduce Pralekha, a benchmark containing over 3 million aligned document pairs across 11 Indic languages and English, which includes 1.5 million English–Indic pairs. Furthermore, we propose Document Alignment Coefficient (DAC), a novel metric for fine-grained document alignment. Unlike pooling-based methods, DAC aligns documents by matching smaller chunks and computes similarity as the ratio of aligned chunks to the average number of chunks in a pair. Intrinsic evaluation shows that our chunk-based method is 2–3× faster while maintaining competitive performance, and that DAC achieves substantial gains over pooling-based baselines. Extrinsic evaluation further demonstrates that document-level MT models trained on DAC-aligned pairs consistently outperform those using baseline alignment methods. These results highlight DAC’s effectiveness for parallel document mining. The dataset and evaluation framework are publicly available to support further research.- Anthology ID:
- 2025.ijcnlp-long.37
- Volume:
- Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
- Month:
- December
- Year:
- 2025
- Address:
- Mumbai, India
- Editors:
- Kentaro Inui, Sakriani Sakti, Haofen Wang, Derek F. Wong, Pushpak Bhattacharyya, Biplab Banerjee, Asif Ekbal, Tanmoy Chakraborty, Dhirendra Pratap Singh
- Venues:
- IJCNLP | AACL
- SIG:
- Publisher:
- The Asian Federation of Natural Language Processing and The Association for Computational Linguistics
- Note:
- Pages:
- 662–676
- Language:
- URL:
- https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.ijcnlp-long.37/
- DOI:
- Cite (ACL):
- Sanjay Suryanarayanan, Haiyue Song, Mohammed Safi Ur Rahman Khan, Anoop Kunchukuttan, and Raj Dabre. 2025. PRALEKHA: Cross-Lingual Document Alignment for Indic Languages. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 662–676, Mumbai, India. The Asian Federation of Natural Language Processing and The Association for Computational Linguistics.
- Cite (Informal):
- PRALEKHA: Cross-Lingual Document Alignment for Indic Languages (Suryanarayanan et al., IJCNLP-AACL 2025)
- PDF:
- https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.ijcnlp-long.37.pdf