Vidhyakshaya Kannan


2025

ConfReady: A RAG based Assistant and Dataset for Conference Checklist Responses
Michael Galarnyk | Rutwik Routu | Vidhyakshaya Kannan | Kosha Bheda | Prasun Banerjee | Agam Shah | Sudheer Chava
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

The ARR Responsible NLP Research checklist website states that the “checklist is designed to encourage best practices for responsible research, addressing issues of research ethics, societal impact and reproducibility.” Answering the questions is an opportunity for authors to reflect on their work and make sure any shared scientific assets follow best practices. Ideally, considering a checklist before submission can favorably impact the writing of a research paper. However, previous research has shown that self-reported checklist responses do not always accurately represent papers. In this work, we introduce ConfReady, a retrieval-augmented generation (RAG) application that empowers authors to reflect on their work and assists them with conference checklists. To evaluate checklist assistants, we curate a dataset of 1,975 ACL checklist responses, analyze problems in human answers, and benchmark RAG and Large Language Model (LLM) based systems on an evaluation subset. Our code is released under the AGPL-3.0 license on GitHub, with documentation covering the user interface and PyPI package.
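
As a rough illustration of the RAG pattern the abstract describes, the sketch below embeds paper passages, retrieves the ones most relevant to a checklist question, and assembles a prompt for a language model. It is a minimal sketch, not ConfReady's actual pipeline: the embedding model, the example passages, and the prompt wording are all hypothetical choices, not taken from the paper.

# Minimal RAG sketch: rank paper passages by similarity to a checklist
# question, then build a prompt from the top hits. Illustrative only --
# ConfReady's actual models, chunking, and prompts may differ.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical choice

# Paper text split into passages (in practice, parsed from the submission).
passages = [
    "We release our code under the AGPL-3.0 license on GitHub.",
    "All experiments were run on a single A100 GPU for 12 hours.",
    "The dataset contains 1,975 checklist responses from ACL submissions.",
]
question = "Did you discuss the license for your released assets?"

# Embed passages and question, then score passages by cosine similarity.
passage_emb = embedder.encode(passages, convert_to_tensor=True)
query_emb = embedder.encode(question, convert_to_tensor=True)
scores = util.cos_sim(query_emb, passage_emb)[0]
top = scores.argsort(descending=True)[:2]

context = "\n".join(passages[int(i)] for i in top)
prompt = (
    "Using only the excerpts below, draft a checklist response.\n"
    f"Excerpts:\n{context}\n\nQuestion: {question}\nAnswer:"
)
print(prompt)  # in a full assistant, this prompt would be sent to an LLM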

T-VEC: A Telecom-Specific Vectorization Model with Enhanced Semantic Understanding via Deep Triplet Loss Fine-Tuning
Vignesh Ethiraj | Ashwath D | Sidhanth Menon | Divya Vijay | Vidhyakshaya Kannan
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track

The specialized vocabulary and nuanced concepts of the telecommunications industry pose persistent challenges for standard Natural Language Processing (NLP) models. Generic embedding models often struggle to represent telecom-specific semantics, limiting their utility in retrieval and downstream tasks. We present T-VEC (Telecom Vectorization Model), a domain-adapted embedding model fine-tuned from the gte-Qwen2-1.5B-instruct backbone using a triplet loss objective. Fine-tuning was performed on T-Embed, a high-quality, large-scale dataset covering diverse telecom concepts, standards, and operational scenarios. Although T-Embed contains some proprietary material and cannot be fully released, we open source 75% of the dataset to support continued research in domain-specific representation learning. On a custom benchmark comprising 1,500 query-passage pairs from IETF RFCs and vendor manuals, T-VEC surpasses MPNet, BGE, Jina and E5, demonstrating superior domain grounding and semantic precision in telecom-specific retrieval. Embedding visualizations further showcase tight clustering of telecom-relevant concepts. We release T-VEC and its tokenizer to support semantically faithful NLP applications within the telecom domain.
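
To make the training objective concrete, the sketch below shows what triplet-loss fine-tuning of a sentence-embedding model looks like with the sentence-transformers library, assuming that toolchain. The abstract names gte-Qwen2-1.5B-instruct as the backbone, but the Hugging Face model id, the example triplet, and all hyperparameters here are illustrative assumptions, not T-VEC's actual training setup on T-Embed.

# Triplet-loss fine-tuning sketch in the style the abstract describes.
# Data, margin, and hyperparameters are illustrative; T-VEC's exact loss
# configuration and the T-Embed training set are not specified here.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# The Hugging Face id below is an assumption about the named backbone;
# a smaller model works for a dry run on modest hardware.
model = SentenceTransformer("Alibaba-NLP/gte-Qwen2-1.5B-instruct")

# Each example: (anchor query, positive passage, negative passage).
train_examples = [
    InputExample(texts=[
        "What does 5G NR stand for?",
        "NR (New Radio) is the radio access technology for 5G systems.",
        "DSL delivers broadband over legacy copper telephone lines.",
    ]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=1)

# Triplet loss pulls anchors toward positives and pushes negatives away
# by at least a margin (sentence-transformers defaults the margin to 5).
loss = losses.TripletLoss(model=model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=0)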