Accessible Sanskrit: A Cascading System for Text Analysis and Dictionary Access

Giacomo De Luca


Abstract
Sanskrit text processing presents unique computational challenges due to its complex morphology, frequent compound formation, and the phenomenon of Sandhi. While several approaches to Sanskrit word segmentation exist, the field lacks integrated tools that make texts accessible while maintaining high accuracy. We present a hybrid approach combining rule-based and statistical methods that achieves reliable Sanskrit text analysis through a cascade mechanism in which a deterministic matching using inflection tables is used for simple cases and statistical approaches are used for the more complex ones. The goal of the system is to provide automatic text annotation and inflected dictionary search, returning for each word root forms, comprehensive grammatical analysis, inflection tables, and dictionary entries from multiple sources. The system is evaluated on 300 randomly selected compounds from the GRETIL corpus across different length categories and maintains 90% accuracy regardless of compound length, with 91% accuracy on the 40+ characters long compounds. The approach is also tested on the complete text of the Yoga Sutra, demonstrating 96% accuracy in the practical use case. This approach is implemented both as an open-source Python library and a web application, making Sanskrit text analysis accessible to scholars and interested readers while retaining state-of-the-art accuracy.
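
The cascade described in the abstract can be illustrated with a minimal sketch: try an exact inflection-table match first, and hand the word to a statistical stage only when that fails. Everything named here is an assumption made for illustration and is not taken from the paper's library: the toy table `INFLECTION_TABLE`, the functions `lookup_exact` and `analyze`, the result fields, and the stand-in `placeholder_segmenter`, which merely marks a word as unresolved where a real system would run a statistical segmenter.

```python
from typing import Callable, Dict, List, Optional, Tuple

Analysis = Dict[str, Optional[str]]

# Hypothetical toy inflection table: inflected form -> (stem, grammatical analysis).
# A real table would be generated from full paradigms; these two entries are examples.
INFLECTION_TABLE: Dict[str, Tuple[str, str]] = {
    "yogaḥ": ("yoga", "nominative singular masculine"),
    "yogasya": ("yoga", "genitive singular masculine"),
}


def lookup_exact(word: str) -> Optional[List[Analysis]]:
    """Stage 1: deterministic match against the inflection table."""
    entry = INFLECTION_TABLE.get(word)
    if entry is None:
        return None
    stem, analysis = entry
    return [{"form": word, "stem": stem, "analysis": analysis, "stage": "table"}]


def placeholder_segmenter(word: str) -> List[Analysis]:
    """Stage 2 stand-in (assumption): a real system would segment the word
    statistically here; this stub only flags it as unresolved."""
    return [{"form": word, "stem": None, "analysis": "unresolved", "stage": "statistical"}]


def analyze(word: str, fallback: Callable[[str], List[Analysis]]) -> List[Analysis]:
    """Run the cascade: cheap deterministic lookup first, fallback only on a miss."""
    result = lookup_exact(word)
    if result is not None:
        return result
    return fallback(word)


if __name__ == "__main__":
    for w in ("yogasya", "cittavṛttinirodhaḥ"):
        print(w, analyze(w, placeholder_segmenter))
```

Passing the fallback in as a parameter keeps the deterministic stage testable on its own and lets the more expensive stage be swapped without touching the lookup.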
Anthology ID:
2025.alp-1.5
Volume:
Proceedings of the Second Workshop on Ancient Language Processing
Month:
May
Year:
2025
Address:
The Albuquerque Convention Center, Laguna
Editors:
Adam Anderson, Shai Gordin, Bin Li, Yudong Liu, Marco C. Passarotti, Rachele Sprugnoli
Venues:
ALP | WS
Publisher:
Association for Computational Linguistics
Pages:
38–46
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.alp-1.5/
Cite (ACL):
Giacomo De Luca. 2025. Accessible Sanskrit: A Cascading System for Text Analysis and Dictionary Access. In Proceedings of the Second Workshop on Ancient Language Processing, pages 38–46, The Albuquerque Convention Center, Laguna. Association for Computational Linguistics.
Cite (Informal):
Accessible Sanskrit: A Cascading System for Text Analysis and Dictionary Access (De Luca, ALP 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.alp-1.5.pdf