Mahalakshmi S
2026
Under the Surface: Probing Tamil Paraphrase Intelligence
Viswadarshan R R | Dr. J. Felicia Lilian | Mahalakshmi S
Proceedings of the 15th Joint Conference on Lexical and Computational Semantics (*SEM 2026)
We present a systematic study of paraphrase detection in Tamil, constructing a unified dataset through translation and semantic verification of three English benchmarks: QQP, PAWS, and MRPC. Unlike prior efforts that focus on individual sources or limited scales, our dataset combines multiple paraphrase detection paradigms and is evaluated using semantic similarity metrics, round-trip translation checks, and classifier agreement analysis. We fine-tune five multilingual transformer models (mBERT, XLM-R, IndicBERT, MuRIL, and DistilmBERT) and a Tamil-specific compact model, TLMR (Tamil Language Model - DeBERTa), pretrained on 525M Tamil tokens. Furthermore, we assess the representational quality of the sentence embeddings extracted from these models using lightweight classifiers (SVM, XGBoost, and Logistic Regression). To facilitate resource-aware evaluation, we formulate an efficiency-oriented metric that incorporates top-5 accuracy, vocabulary usage, and script fidelity in relation to perplexity. The experimental findings highlight differences in generalization and efficiency across models and lay the groundwork for future Tamil semantic understanding tasks.
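The abstract does not define its efficiency metric or how script fidelity is measured. As a purely illustrative sketch, one plausible reading of script fidelity is the share of non-whitespace characters that fall in the Tamil Unicode block (U+0B80 to U+0BFF). The function name `tamil_script_ratio` and this definition are assumptions, not the authors' formulation.

```python
def tamil_script_ratio(text: str) -> float:
    """Share of non-whitespace characters in the Tamil Unicode block.

    Assumed definition for illustration only; the paper's actual
    script-fidelity measure is not given in the abstract.
    """
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    # The Tamil block spans U+0B80 through U+0BFF.
    tamil = sum(1 for ch in chars if "\u0b80" <= ch <= "\u0bff")
    return tamil / len(chars)
```

A model output written entirely in Tamil script would score 1.0 under this assumed definition, while output mixed with Latin characters would score lower.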
2024
Enhancing Masked Word Prediction in Tamil Language Models: A Synergistic Approach Using BERT and SBERT
Viswadarshan R R | Viswaa Selvam S | Felicia Lilian J | Mahalakshmi S
Proceedings of the 21st International Conference on Natural Language Processing (ICON)
This work presents a novel approach to enhancing masked word prediction and sentence-level semantic analysis in Tamil language models. By combining BERT and Sentence-BERT (SBERT) models, we leverage the strengths of both architectures to capture contextual understanding and semantic relationships in Tamil sentences. Our methodology incorporates sentence tokenization as a crucial pre-processing step, preserving the grammatical structure and word-level dependencies of Tamil sentences. We trained BERT and SBERT on a diverse corpus of Tamil data, including synthetic datasets, the OSCAR corpus, the AI4Bharat Parallel Corpus, and data extracted from Tamil Wikipedia and news websites. The combined model effectively predicts masked words while maintaining semantic coherence in the generated sentences. While traditional accuracy metrics may not fully capture the model’s performance, intrinsic and extrinsic evaluations reveal the model’s ability to generate contextually relevant and linguistically sound outputs. Our research highlights the importance of sentence tokenization and of combining BERT and SBERT for improving masked word prediction in Tamil sentences.
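The abstract does not say how the BERT and SBERT outputs are combined. One possible arrangement, sketched here under that caveat, is to re-rank masked-word candidates by mixing each candidate's masked-LM probability with the cosine similarity between the filled-in sentence's embedding and a context embedding. The names `rerank` and `alpha`, the linear mixing rule, and the toy vectors standing in for real model outputs are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rerank(candidates, context_vec, alpha=0.5):
    """Re-rank (word, mlm_prob, sentence_vec) candidates, best first.

    Score = alpha * mlm_prob + (1 - alpha) * cosine(sentence_vec, context_vec).
    The linear mix is an assumption made for illustration.
    """
    scored = [
        (word, alpha * prob + (1 - alpha) * cosine(vec, context_vec))
        for word, prob, vec in candidates
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

In a real pipeline, `mlm_prob` would come from a masked language model and the vectors from a sentence encoder; here they are plain Python lists so the sketch runs without any dependencies.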
Monolingual text summarization for Indic Languages using LLMs
Jothir Adithya T K | Nithish Kumar S | Felicia Lilian J | Mahalakshmi S
Proceedings of the 21st International Conference on Natural Language Processing (ICON)
We analyze the development of advanced text summarization methods that leverage LLMs for Indic languages. Text summarization transforms a longer text into a more concise version while preserving its most prominent information and key meanings. Our goal is to produce concise and accurate summaries of longer texts, with a focus on retaining detail and coherence. We use NLP techniques for text cleaning, keyword extraction, and summarization, and evaluate performance with ROUGE, BLEU, and BERTScore. The results demonstrate an incremental improvement in the quality of the generated summaries, with particular emphasis on enhancing informativeness while minimizing redundancy. This work also highlights the importance of parameter tuning and of leveraging advanced models to produce high-quality summaries across diverse domains for Indic languages.
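For readers unfamiliar with ROUGE, the following is a minimal sketch of ROUGE-1 (unigram overlap with clipped counts) using whitespace tokenization. It is not the evaluation code used in the paper. The function name `rouge1` is an assumption, and whitespace splitting is a simplification that a real evaluation of Indic-language text would likely replace with a language-aware tokenizer.

```python
from collections import Counter


def rouge1(reference: str, candidate: str):
    """Return (precision, recall, f1) for unigram overlap with clipped counts."""
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Recall rewards a summary for covering the reference, precision penalizes extra material, and F1 balances the two, which is why summarization work often reports all three.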