2025
Automatic Accent Restoration in Vedic Sanskrit with Neural Language Models
Yuzuki Tsukagoshi | Ikki Ohmukai
Proceedings of the 1st Workshop on Benchmarks, Harmonization, Annotation, and Standardization for Human-Centric AI in Indian Languages (BHASHA 2025)
Vedic Sanskrit, the oldest attested form of Sanskrit, employs a distinctive pitch-accent system that marks one syllable per word. This work presents the first application of large language models to the automatic restoration of accent marks in transliterated Vedic Sanskrit texts. A comprehensive corpus was assembled by extracting major Vedic works from the TITUS project and constructing paired samples of unaccented input and correctly accented references, yielding more than 100,000 training examples. Three generative LLMs were fine-tuned on this corpus: a LoRA-adapted Llama 3.1 8B Instruct model, OpenAI GPT-4.1 nano, and Google Gemini 2.5 Flash. These models were trained in a sequence-to-sequence fashion to insert accent marks at appropriate positions. Evaluation on roughly 2,000 sentences using precision, recall, F1, character error rate, word error rate, and ChrF1 metrics shows that fine-tuned models substantially outperform their untuned baselines. The LoRA-tuned Llama achieves the highest F1, followed by Gemini 2.5 Flash and GPT-4.1 nano. Error analysis reveals that the models learn to infer accents from grammatical and phonological context. These results demonstrate that LLMs can capture complex accentual patterns and recover lost information, opening possibilities for improved sandhi splitting, morphological analysis, syntactic parsing, and machine translation in Vedic NLP pipelines.
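The paired training data described above can be illustrated with a minimal sketch: assuming the transliteration marks pitch accents with combining acute and grave diacritics (e.g. á for udātta), the unaccented input is obtained by stripping those marks while keeping all other diacritics. The function names and the example line are illustrative, not the paper's code.

```python
import unicodedata

# Assumption: pitch accents appear as combining acute/grave marks in the
# Latin transliteration; other diacritics (vowel length, dots) must be kept.
ACCENT_MARKS = {"\u0301", "\u0300"}  # combining acute, combining grave

def strip_accents(accented: str) -> str:
    """Remove pitch-accent diacritics only."""
    decomposed = unicodedata.normalize("NFD", accented)
    stripped = "".join(ch for ch in decomposed if ch not in ACCENT_MARKS)
    return unicodedata.normalize("NFC", stripped)

def make_pair(accented_line: str) -> dict:
    """One sequence-to-sequence example: unaccented input -> accented reference."""
    return {"input": strip_accents(accented_line), "target": accented_line}

# Rigveda 1.1.1a in a common transliteration:
print(make_pair("agním īḷe puróhitaṃ"))
# {'input': 'agnim īḷe purohitaṃ', 'target': 'agním īḷe puróhitaṃ'}
```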
Towards Accent-Aware Vedic Sanskrit Optical Character Recognition Based on Transformer Models
Yuzuki Tsukagoshi | Ryo Kuroiwa | Ikki Ohmukai
Computational Sanskrit and Digital Humanities - World Sanskrit Conference 2025
2024
The Metronome Approach to Sanskrit Meter: Analysis for the Rigveda
Yuzuki Tsukagoshi | Ikki Ohmukai
Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024)
This study analyzes the verses of the Rigveda, the oldest Sanskrit text, from a metrical perspective. Each verse is represented by four elements of its metrical structure: light syllables, heavy syllables, word boundaries, and line boundaries. The analysis shows that verses traditionally categorized under the same metrical name can nevertheless form distinct clusters. It also reveals commonalities in metrical structure, with similar metrical patterns grouping together despite differences in the number of lines. Going forward, this methodology is expected to enable comparisons across multiple languages within the Indo-European language family.
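A minimal sketch of the four-element representation is shown below. It follows standard scansion conventions (a syllable is heavy if its vowel is long or is followed by two or more consonants, anusvāra, or visarga), but the symbols, the simplified syllable-weight rules, and the treatment of line-final syllables are illustrative assumptions, not the paper's implementation.

```python
import re

# Symbols for the four elements (our labels, not necessarily the paper's):
LIGHT, HEAVY, WORD_B, LINE_B = "L", "H", "|", "/"

LONG_VOWELS = {"ā", "ī", "ū", "e", "o", "ai", "au"}
VOWEL_RE = re.compile(r"ai|au|ā|ī|ū|e|o|a|i|u")  # longest matches first

def scan_line(line: str) -> str:
    """Encode one verse line as light/heavy syllables plus word boundaries.
    Simplified scansion: a syllable is heavy if its vowel is long, or is
    followed by anusvara, visarga, or two or more consonants (counted
    across word boundaries).  Line-final syllables, which are metrically
    indifferent (anceps), are not special-cased here."""
    # Collapse aspirate digraphs so that 'th', 'bh', ... count as one consonant.
    text = re.sub(r"([kgcjṭḍtdpb])h", r"\1", line.lower())
    nuclei = list(VOWEL_RE.finditer(text))
    symbols = []
    for i, m in enumerate(nuclei):
        end = nuclei[i + 1].start() if i + 1 < len(nuclei) else len(text)
        tail = text[m.end():end]            # everything up to the next vowel
        consonants = tail.replace(" ", "")  # consonant letters only
        heavy = (m.group(0) in LONG_VOWELS
                 or len(consonants) >= 2
                 or "ṃ" in consonants or "ḥ" in consonants)
        symbols.append(HEAVY if heavy else LIGHT)
        if " " in tail:                     # a word ends after this syllable
            symbols.append(WORD_B)
    return " ".join(symbols) + " " + LINE_B

# The opening pada of the Rigveda (accents omitted):
print(scan_line("agnim īḷe purohitaṃ"))
# -> H L | H H | L H L H /
```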
N-gram-Based Preprocessing for Sandhi Reversion in Vedic Sanskrit
Yuzuki Tsukagoshi | Ikki Ohmukai
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities
This study addresses the challenges posed by sandhi in Vedic Sanskrit, a phenomenon that complicates the computational analysis of Sanskrit texts. Sandhi fuses word boundaries or changes the sounds around them, which makes text processing difficult; by focusing on sandhi reversion, the research seeks to improve the accuracy of processing Vedic Sanskrit, an older layer of the language. We developed a transformer-based model with a novel n-gram preprocessing strategy to improve the accuracy of sandhi reversion for Vedic. We created character-based n-gram texts of varying lengths (n = 2, 3, 4, 5, 6) from the Rigveda, the oldest Vedic text, and trained models on these texts to perform machine translation from post-sandhi to pre-sandhi forms. The model trained on 5-gram text achieved the highest accuracy, likely because 5-grams capture the maximum phonemic context in which Vedic sandhi occurs. These findings suggest that by leveraging the inherent characteristics of phonological change in language, even simple preprocessing methods such as n-gram segmentation can significantly improve the accuracy of complex linguistic tasks.
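As an illustration of the preprocessing step, the sketch below segments a transliterated line into space-separated character n-gram tokens. The abstract does not specify whether the n-grams overlap or how whitespace is treated, so this version uses non-overlapping chunks and marks original spaces with '_'; both choices are assumptions made for the example.

```python
def to_char_ngrams(text: str, n: int = 5) -> str:
    """Segment a sentence into space-separated character n-gram tokens.
    Original spaces are replaced by '_' so token boundaries stay
    unambiguous; the final chunk may be shorter than n.  Assumes the
    text is NFC-normalized so each diacritic letter is one character."""
    s = text.replace(" ", "_")
    chunks = [s[i:i + n] for i in range(0, len(s), n)]
    return " ".join(chunks)

# The opening of the Rigveda (accents omitted for simplicity):
line = "agnim īḷe purohitaṃ yajñasya devam"
print(to_char_ngrams(line, n=5))
# agnim _īḷe_ puroh itaṃ_ yajña sya_d evam
```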
2022
A Japanese Masked Language Model for Academic Domain
Hiroki Yamauchi | Tomoyuki Kajiwara | Marie Katsurai | Ikki Ohmukai | Takashi Ninomiya
Proceedings of the Third Workshop on Scholarly Document Processing
We release a pretrained Japanese masked language model for the academic domain. Pretrained masked language models have recently improved the performance of various natural language processing applications, and in domains rich in technical terms, such as medicine and academia, domain-specific pretraining is effective. While domain-specific Japanese masked language models for the medical and SNS domains are widely used alongside domain-independent ones, no pretrained model specific to the academic domain has been publicly available. In this study, we pretrained a RoBERTa-based Japanese masked language model on paper abstracts from the academic database CiNii Articles. Experimental results on Japanese text classification in the academic domain demonstrate the effectiveness of the proposed model over existing pretrained models.
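For context, a released masked language model of this kind would typically be queried through the Hugging Face transformers fill-mask interface, as in the sketch below. The model identifier is a placeholder (the abstract does not name the released checkpoint), and the mask token is taken from the model's own tokenizer.

```python
from transformers import pipeline

# Placeholder id: substitute the actual repository name of the released model.
MODEL_ID = "path/to/japanese-academic-roberta"

fill_mask = pipeline("fill-mask", model=MODEL_ID)

# Academic-style Japanese sentence with one masked token
# ("In this study, we <mask> a method based on machine learning.").
masked = f"本研究では機械学習に基づく手法を{fill_mask.tokenizer.mask_token}する。"
for candidate in fill_mask(masked):
    print(candidate["token_str"], round(candidate["score"], 3))
```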