Sanjay Chatterji

Large language models (LLMs) are pre-trained on enormous amounts of text data and show acclaimed success in knowledge representation. However, there are two bottlenecks with this approach. (1) Pre-training data cannot be regularly updated once the models are deployed, and it is not very fruitful if the model cannot represent updated knowledge. (2) The consistently increasing size and computational resources make it difficult for non-commercial and individual researchers to fine-tune and scale these language models. Major LLMs with external knowledge are also proprietary. In this paper, we propose AcKnowledge, a framework wrapped around a small, non-pre-trained language model for an open-domain question-answering (QA) experiment. AcKnowledge learns relevant knowledge from the internet via meta-learning based on user questions, and re-learns from user feedback if knowledge is misrepresented. Our efficient knowledge representation framework avoids pre-training overhead while enabling updated information. Benchmarking shows competitive performance against similarly sized state-of-the-art (SoTA) LLMs on gold standard QA datasets, demonstrating the potential of integrating internet search and user feedback for improved performance and generalizability.

2023

This research investigates the correlation between Sentiment and SEPSIS(SpEculation, oPinion, biaS, and twISt) characteristics in news sentences through an ablation study. Various Sentiment analysis models, including TextBlob, Vader, and RoBERTa, are examined to discern their impact on news sentences. Additionally, we explore the Logistic Regression(LR), Decision Trees(DT), Support Vector Machines(SVM) and Convolutional Neural Network (CNN) models for Septic sentence classification.

Large language models have gained a meteoric rise recently. With the prominence of LLMs, hallucination and misinformation generation have become a severity too. To combat this issue, we propose a contextual topic modeling approach called Co-LDA for generative transformer. It is based on Latent Dirichlet Allocation and is designed for accurate sentence-level information generation. This method extracts cohesive topics from COVID-19 research literature, grouping them into relevant categories. These contextually rich topic words serve as masked tokens in our proposed Tokenized Generative Transformer, a modified Generative Pre-Trained Transformer for generating accurate information in any designated topics. Our approach addresses micro hallucination and incorrect information issues in experimentation with the LLMs. We also introduce a Perplexity-Similarity Score system to measure semantic similarity between generated and original documents, offering accuracy and authenticity for generated texts. Evaluation of benchmark datasets, including question answering, language understanding, and language similarity demonstrates the effectiveness of our text generation method, surpassing some state-of-the-art transformer models.

2019

Often sentences of correct news are either made biased towards a particular person or a group of persons or parties or maybe distorted to add some sentiment or importance in it. Engaged readers often are not able to extract the inherent meaning of such synthetic sentences. In Bengali, the news contents of the synthetic sentences are presented in such a rich way that it usually becomes difficult to identify the synthetic part of it. We have used machine learning algorithms to classify Bengali news sentences into synthetic and legitimate and then used some rule-based postprocessing on each of these models. Finally, we have developed a voting based combination of these models to build a hybrid model for Bengali synthetic sentence identification. This is a new task and therefore we could not compare it with any existing work in the field. Identification of such types of sentences may be used to improve the performance of identifying fake news and satire news. Thus, identifying molecular level biasness in news articles.