Akhil Rajeev P


2026

Text generation has achieved remarkable performance with large language models, and recent work has shown that these models are capable of creative generation tasks, though predominantly for high-resource languages. This prompts a fundamental question: can these (large) language models be used for structured poetry generation in a low-resource language such as Sanskrit? We present Chandomitra, a dataset for translating English input into structured Sanskrit poetry, specifically adhering to the Anushtubh meter. We benchmark various open and closed models and scrutinize specialized techniques, such as constrained decoding and instruction fine-tuning, for the proposed task. Our constrained decoding methodology achieves 99.86% syntactic accuracy in generating metrically valid Sanskrit poetry, outperforming GPT-4o (1-shot: 31.24%). Our best-performing instruction-tuned model, on the other hand, achieves better semantic coherence with the English input, at the expense of slightly lower syntactic accuracy. Human evaluation further reveals that the instruction fine-tuned model better captures the poetic aspects.
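The abstract leaves the constrained-decoding mechanics implicit. Below is a minimal, hypothetical sketch of the general idea, not the paper's exact method: at each step, keep only candidate tokens whose syllables remain a prefix of the target laghu/guru (light/heavy) weight pattern of the meter. The helper `weights_of` (a metrical scanner mapping Devanagari text to an "L"/"G" string) is an assumed function, not from the paper.

```python
# Hypothetical sketch of meter-constrained greedy decoding with a
# Hugging Face-style causal LM. `weights_of(text)` is an ASSUMED helper
# that scans text into its laghu/guru syllable-weight string ("L"/"G").
import torch

def constrained_decode(model, tokenizer, prompt, pattern, max_steps=64):
    """Greedily pick the most probable token whose syllable weights
    keep the verse a prefix of `pattern` (e.g. one Anushtubh pada)."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    verse = ""
    for _ in range(max_steps):
        with torch.no_grad():
            logits = model(ids).logits[0, -1]
        # Scan the top candidates in probability order; accept the first
        # one that does not violate the remaining metrical pattern.
        for tok_id in torch.argsort(logits, descending=True)[:50].tolist():
            candidate = verse + tokenizer.decode([tok_id])
            if pattern.startswith(weights_of(candidate)):
                verse = candidate
                ids = torch.cat([ids, torch.tensor([[tok_id]])], dim=1)
                break
        if weights_of(verse) == pattern:  # pada metrically complete
            break
    return verse
```

In this toy formulation the scanner is assumed to score only complete syllables, so tokens that would split a syllable mid-way are simply rejected; the paper's 99.86% syntactic accuracy presumably reflects a more careful treatment of such boundary cases.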

2025

The Rigveda, among the oldest Indian texts in Vedic Sanskrit, employs a distinctive pitch-accent system - udatta, anudatta, svarita - whose marks encode melodic and interpretive cues but are often absent from modern e-texts. This work develops a parallel corpus of accented-unaccented ślokas and conducts a controlled comparison of three strategies for automatic accent placement in Rigvedic verse: (i) full fine-tuning of ByT5, a byte-level Transformer that operates directly on Unicode combining marks, (ii) a from-scratch BiLSTM-CRF sequence-labeling baseline, and (iii) LoRA-based parameter-efficient fine-tuning atop ByT5. Evaluation uses Word Error Rate (WER) and Character Error Rate (CER) for orthographic fidelity, plus a task-specific Diacritic Error Rate (DER) that isolates accent edits. Full ByT5 fine-tuning attains the lowest error across all metrics; LoRA offers strong efficiency-accuracy trade-offs, and BiLSTM-CRF serves as a transparent baseline. The study underscores practical requirements for accent restoration - Unicode-safe preprocessing, mark-aware tokenization, and evaluation that separates grapheme from accent errors - and positions heritage-language technology as an emerging NLP area connecting computational modeling with philological and pedagogical aims. Results establish reproducible baselines for Rigvedic accent restoration and provide guidance for downstream tasks such as accent-aware OCR, ASR/chant synthesis, and digital scholarship.
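The abstract defines DER only informally. A minimal sketch of one plausible formulation is given below, assuming the accents are encoded as the Devanagari stress signs U+0951 and U+0952 and that reference and hypothesis share the same underlying unaccented text; this is an illustration, not the paper's exact definition.

```python
# Hypothetical Diacritic Error Rate (DER) sketch: compare only the accent
# label attached to each base character, normalized by the number of
# accent marks in the reference. Not the paper's exact formulation.
import unicodedata

ACCENTS = {"\u0951", "\u0952"}  # Devanagari stress signs (assumed encoding)

def accent_seq(text):
    """Return, per base character, its accent mark (or None)."""
    text = unicodedata.normalize("NFC", text)  # Unicode-safe preprocessing
    seq = []
    for ch in text:
        if ch in ACCENTS:
            if seq:
                seq[-1] = ch      # attach mark to the preceding base char
        else:
            seq.append(None)      # base character, initially unaccented
    return seq

def der(ref, hyp):
    """Fraction of accent-label disagreements per reference accent mark."""
    r, h = accent_seq(ref), accent_seq(hyp)
    assert len(r) == len(h), "texts must differ only in accent marks"
    n_ref_marks = sum(m is not None for m in r)
    errors = sum(a != b for a, b in zip(r, h))
    return errors / max(n_ref_marks, 1)
```

Under this reading, WER and CER capture any orthographic drift in the restored text, while DER reports accent-label disagreements alone, which is what makes it suitable for separating grapheme errors from accent errors.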
Grammatical error correction (GEC) for Indic languages faces limited supervision, diverse scripts, and rich morphology. We propose an augmentation-free setup that uses instruction-tuned large language models and conservative decoding. A 12B Gemma 3 model is instruction-tuned in bnb 4-bit precision with parameter-efficient fine-tuning and Alpaca-style formatting. Decoding follows a deterministic, constraint-aware procedure with a lightweight normaliser that encourages minimal, meaning-preserving edits. We operationalise inference, subsequent to instruction fine-tuning (IFT), via a fixed, language-specific prompt synthesised directly from a deterministic error classifier's taxonomy, label distributions, and precedence ordering computed on the training corpus. Under the official untuned GLEU evaluation, the system scores 92.41 on Malayalam, sixth overall, and 81.44 on Hindi, third overall. These results indicate that classifier-informed prompt design, adapter-based instruction tuning, and deterministic decoding provide a reproducible and computationally efficient alternative to augmentation-centred pipelines for Indic GEC. The approach also motivates future work on stronger morphosyntactic constraints and human-centred evaluation of conservative edits.
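For readers unfamiliar with this setup, the following is an illustrative sketch of 4-bit adapter-based instruction tuning with Alpaca-style formatting using the Hugging Face transformers/peft/bitsandbytes stack. The checkpoint id, LoRA hyperparameters, and prompt template are assumptions for illustration, not the paper's configuration.

```python
# Minimal sketch of bnb 4-bit LoRA instruction tuning; model id and
# hyperparameters are ILLUSTRATIVE, not the paper's exact settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

MODEL = "google/gemma-3-12b-it"  # assumed checkpoint id

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, quantization_config=bnb)

# Parameter-efficient fine-tuning: train low-rank adapters only.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
))

def alpaca_format(instruction, source, target):
    """Alpaca-style supervised example: instruction + input + response."""
    return ("### Instruction:\n" + instruction +
            "\n\n### Input:\n" + source +
            "\n\n### Response:\n" + target)

# Deterministic, conservative decoding at inference: greedy, no sampling.
def correct(prompt, max_new_tokens=128):
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, do_sample=False, num_beams=1,
                         max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```

In this sketch the `instruction` slot would carry the fixed, classifier-informed, language-specific prompt the abstract describes, and a post-hoc normaliser (not shown) would filter the output toward minimal, meaning-preserving edits.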