Sourav Das


2025

Can LLMs be Literary Companions?: Analysing LLMs on Bengali Figures of Speech Identification
Sourav Das | Kripabandhu Ghosh
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Although Bengali is among the most widely spoken languages and carries great cultural importance and richness, NLP work on it remains relatively limited. Figures of Speech (FoS) not only contribute to the phonetic and semantic nuances of a language, but also convey aesthetics, expression, and creativity in literature. To our knowledge, this paper presents the first Bengali figures of speech classification dataset, **BengFoS**, built on works of six renowned poets of Bengali literature. We apply state-of-the-art Large Language Models (LLMs) to this dataset in a zero-shot setup, then fine-tune the best-performing models, and finally dissect them through language model probing. This reveals novel insights into the intrinsic behavior of two open-source LLMs (Llama and DeepSeek) in FoS detection. **Though we have limited ourselves to Bengali, the experimental framework can be reproduced for English as well as for other low-resource languages**.

2024

AcKnowledge: Acquired Knowledge Representation by Small Language Model Without Pre-training
Sourav Das | Sanjay Chatterji | Imon Mukherjee
Proceedings of the 1st Workshop on Towards Knowledgeable Language Models (KnowLLM 2024)

Large language models (LLMs) are pre-trained on enormous amounts of text data and have shown widely acclaimed success in knowledge representation. However, there are two bottlenecks with this approach. (1) Pre-training data cannot be regularly updated once the models are deployed, and a model that cannot represent updated knowledge is of limited use. (2) The steadily increasing model size and computational requirements make it difficult for non-commercial and individual researchers to fine-tune and scale these language models. Major LLMs with external knowledge are also proprietary. In this paper, we propose AcKnowledge, a framework wrapped around a small, non-pre-trained language model for an open-domain question-answering (QA) experiment. AcKnowledge learns relevant knowledge from the internet via meta-learning based on user questions, and re-learns from user feedback if knowledge is misrepresented. Our efficient knowledge representation framework avoids pre-training overhead while enabling updated information. Benchmarking shows competitive performance against similarly sized state-of-the-art (SoTA) LLMs on gold standard QA datasets, demonstrating the potential of integrating internet search and user feedback for improved performance and generalizability.

2023

Combating Hallucination and Misinformation: Factual Information Generation with Tokenized Generative Transformer
Sourav Das | Sanjay Chatterji | Imon Mukherjee
Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages

Large language models (LLMs) have risen meteorically in recent years. With the prominence of LLMs, hallucination and misinformation generation have become serious concerns. To combat this issue, we propose a contextual topic modeling approach called Co-LDA for generative transformers. It is based on Latent Dirichlet Allocation and is designed for accurate sentence-level information generation. This method extracts cohesive topics from COVID-19 research literature, grouping them into relevant categories. These contextually rich topic words serve as masked tokens in our proposed Tokenized Generative Transformer, a modified Generative Pre-Trained Transformer for generating accurate information on any designated topic. Our approach addresses micro hallucination and incorrect information issues in experimentation with the LLMs. We also introduce a Perplexity-Similarity Score system to measure semantic similarity between generated and original documents, offering a check on the accuracy and authenticity of generated texts. Evaluation on benchmark datasets, covering question answering, language understanding, and language similarity, demonstrates the effectiveness of our text generation method, surpassing some state-of-the-art transformer models.