Sanjay Chatterji


2023

Combating Hallucination and Misinformation: Factual Information Generation with Tokenized Generative Transformer
Sourav Das | Sanjay Chatterji | Imon Mukherjee
Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages

Large language models (LLMs) have risen meteorically in recent years. With their prominence, hallucination and misinformation generation have become serious problems. To combat this issue, we propose a contextual topic modeling approach called Co-LDA for generative transformers. It is based on Latent Dirichlet Allocation and is designed for accurate sentence-level information generation. This method extracts cohesive topics from COVID-19 research literature, grouping them into relevant categories. These contextually rich topic words serve as masked tokens in our proposed Tokenized Generative Transformer, a modified Generative Pre-Trained Transformer for generating accurate information on any designated topic. Our approach addresses micro hallucination and incorrect information issues in experimentation with the LLMs. We also introduce a Perplexity-Similarity Score system to measure semantic similarity between generated and original documents, offering accuracy and authenticity for generated texts. Evaluation on benchmark datasets, including question answering, language understanding, and language similarity, demonstrates the effectiveness of our text generation method, which surpasses some state-of-the-art transformer models.

2019

Identification of Synthetic Sentence in Bengali News using Hybrid Approach
Soma Das | Sanjay Chatterji
Proceedings of the 16th International Conference on Natural Language Processing

Sentences of correct news are often either biased towards a particular person, group, or party, or distorted to add sentiment or importance. Engaged readers are often unable to extract the inherent meaning of such synthetic sentences. In Bengali, the news content of synthetic sentences is presented so richly that it is usually difficult to identify the synthetic part. We used machine learning algorithms to classify Bengali news sentences as synthetic or legitimate, and then applied rule-based postprocessing to each of these models. Finally, we developed a voting-based combination of these models to build a hybrid model for Bengali synthetic sentence identification. This is a new task, and therefore we could not compare it with any existing work in the field. Identifying such sentences may improve the detection of fake news and satire news, and thus help identify molecular-level bias in news articles.

2012

A Hybrid Dependency Parser for Bangla
Arnab Dhar | Sanjay Chatterji | Sudeshna Sarkar | Anupam Basu
Proceedings of the 10th Workshop on Asian Language Resources

Repairing Bengali Verb Chunks for Improved Bengali to Hindi Machine Translation
Sanjay Chatterji | Nabanita Datta | Arnab Dhar | Biswanath Barik | Sudeshna Sarkar | Anupam Basu
Proceedings of the 10th Workshop on Asian Language Resources

Translations of Ambiguous Hindi Pronouns to Possible Bengali Pronouns
Sanjay Chatterji | Sudeshna Sarkar | Anupam Basu
Proceedings of the 10th Workshop on Asian Language Resources

A Three Stage Hybrid Parser for Hindi
Sanjay Chatterji | Arnab Dhar | Sudeshna Sarkar | Anupam Basu
Proceedings of the Workshop on Machine Translation and Parsing in Indian Languages

An Efficient Technique for De-Noising Sentences using Monolingual Corpus and Synonym Dictionary
Sanjay Chatterji | Diptesh Chatterjee | Sudeshna Sarkar
Proceedings of COLING 2012: Demonstration Papers

2008

A Hybrid Named Entity Recognition System for South and South East Asian Languages
Sujan Kumar Saha | Sanjay Chatterji | Sandipan Dandapat | Sudeshna Sarkar | Pabitra Mitra
Proceedings of the IJCNLP-08 Workshop on Named Entity Recognition for South and South East Asian Languages