Giacomo De Luca


2026

We present Sanskrit Travelogue, to our knowledge the largest open, unified and richly annotated Sanskrit corpus. Aggregating eight digital libraries, it comprises 12,394 texts, 73.1M tokens and 9M segments after de-duplication. A reproducible pipeline standardizes transliteration to IAST, reconciles heterogeneous metadata, preserves structural semantics (verse markers, chapter hierarchies, textual apparatus) and adds automatic annotations. We provide corpus-scale morphosyntactic annotation combining two systems: the BYT-5 Sanskrit model for compound and sandhi splitting, and the process-sanskrit library for inflection removal and morphological tagging through a hybrid deterministic-statistical cascade. For each segment we materialize synchronized representations: cleaned, analyzed (sandhi/compound split), stemmed, diacritic-normalized and morphologically tagged. These representations are indexed jointly for retrieval. Both approaches achieve high accuracy (84.61% sentence-level exact matches for BYT-5 segmentation, 92.37% correct root extraction for compounds, 95.94% on the Yoga Sūtra). Manual evaluation on the Yoga Sūtra showed 98% correct root extraction when combining both methods, outperforming each individual approach. These annotations enable searching across orthographic sandhi and within compounds, support robust lemma-level retrieval despite rich inflectional variation, and provide training material for segmentation and lemmatization while maintaining ambiguity for downstream modeling. We release the annotated corpus as TSV shards, together with the code for corpus acquisition, processing and annotation and a query normalizer, all under a Creative Commons non-commercial license.
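The diacritic-normalized representation and the query normalizer described in this abstract can be illustrated with a minimal sketch. It assumes that normalization amounts to Unicode decomposition followed by removal of combining marks; the function names `strip_diacritics` and `normalize_query` are hypothetical and are not taken from the released code.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Fold IAST diacritics to their base letters (e.g. 'ā' -> 'a').

    Assumption: NFD decomposition plus dropping combining marks is an
    adequate stand-in for the corpus's diacritic normalization.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_query(query: str) -> str:
    """Hypothetical query normalizer: trim, lowercase, strip diacritics."""
    return strip_diacritics(query.strip().lower())
```

A query typed without diacritics and a diacritic-normalized segment then reduce to the same string, so the two can be compared directly.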

2025

Sanskrit Voyager is a web application for searching, reading, and analyzing the texts in the Sanskrit literary corpus. Unlike previous tools that require expert linguistic knowledge or manual normalization, Sanskrit Voyager enables users to search for words and phrases as they actually appear in texts, handling inflection, sandhi, and compound forms automatically while supporting any transliteration. The system integrates four core functionalities: (1) multi-dictionary lookup with morphological analysis and inflection tables; (2) real-time text parsing and annotation; (3) an interactive reader for over 900 digitized texts; and (4) advanced corpus search with fuzzy matching and filtering. Evaluation shows over 92% parsing accuracy on complex compounds and substantially higher search recall than BuddhaNexus on challenging queries. The source code is publicly available under a CC-BY-NC license. The system is resource-efficient and designed for both learners and researchers, offering the first fully integrated, user-friendly platform for computational Sanskrit studies.
Sanskrit text processing presents unique computational challenges due to its complex morphology, frequent compound formation, and the phenomenon of sandhi. While several approaches to Sanskrit word segmentation exist, the field lacks integrated tools that make texts accessible while maintaining high accuracy. We present a hybrid approach combining rule-based and statistical methods that achieves reliable Sanskrit text analysis through a cascade mechanism: deterministic matching against inflection tables handles simple cases, and statistical approaches handle the more complex ones. The goal of the system is to provide automatic text annotation and inflected dictionary search, returning for each word its root forms, a comprehensive grammatical analysis, inflection tables, and dictionary entries from multiple sources. The system is evaluated on 300 randomly selected compounds from the GRETIL corpus across different length categories and maintains 90% accuracy regardless of compound length, with 91% accuracy on compounds of 40 or more characters. The approach is also tested on the complete text of the Yoga Sutra, demonstrating 96% accuracy in this practical use case. The approach is implemented both as an open-source Python library and as a web application, making Sanskrit text analysis accessible to scholars and interested readers while retaining state-of-the-art accuracy.
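The cascade described in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's actual API: the toy `INFLECTION_TABLE`, the `analyze` function, and the `stub_segmenter` that stands in for the statistical stage are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Toy inflection table (hypothetical): inflected form -> (stem, tag).
INFLECTION_TABLE: Dict[str, Tuple[str, str]] = {
    "yogaḥ": ("yoga", "nom. sg. m."),
    "yogasya": ("yoga", "gen. sg. m."),
    "nirodhaḥ": ("nirodha", "nom. sg. m."),
}


def lookup(word: str) -> Optional[Tuple[str, str]]:
    """Deterministic stage: exact match against the inflection table."""
    return INFLECTION_TABLE.get(word)


def analyze(word: str, segmenter: Callable[[str], List[str]]) -> List[dict]:
    """Try the table first; fall back to the segmenter only on a miss."""
    hit = lookup(word)
    if hit is not None:
        return [{"form": word, "stem": hit[0], "tag": hit[1], "source": "table"}]
    results = []
    for part in segmenter(word):
        # Each segment proposed by the fallback is retried deterministically.
        part_hit = lookup(part)
        if part_hit is not None:
            stem, tag, source = part_hit[0], part_hit[1], "segmenter+table"
        else:
            stem, tag, source = part, None, "segmenter"
        results.append({"form": part, "stem": stem, "tag": tag, "source": source})
    return results


def stub_segmenter(word: str) -> List[str]:
    """Stand-in for a statistical model: a hard-coded split for the demo."""
    splits = {"cittavṛttinirodhaḥ": ["citta", "vṛtti", "nirodhaḥ"]}
    return splits.get(word, [word])
```

Calling `analyze("yogaḥ", stub_segmenter)` resolves entirely in the deterministic stage, while `analyze("cittavṛttinirodhaḥ", stub_segmenter)` goes through the fallback and then reuses the table for the final member of the compound.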