byteSizedLLM@DravidianLangTech 2024: Fake News Detection in Dravidian Languages - Unleashing the Power of Custom Subword Tokenization with Subword2Vec and BiLSTM

Rohith Gowtham Kodali; Durga Prasad Manukonda

Abstract

This paper focuses on detecting fake news in resource-constrained languages, particularly Malayalam. We present a novel framework combining subword tokenization, Sanskrit-transliterated Subword2Vec embeddings, and a powerful Bidirectional Long Short-Term Memory (BiLSTM) architecture. Despite using only monolingual Malayalam data, our model excelled in the FakeDetect-Malayalam challenge, ranking 4th. The innovative subword tokenizer achieves a remarkable 200x compression ratio, highlighting its efficiency in minimizing model size without compromising accuracy. Our work facilitates resource-efficient deployment in diverse linguistic landscapes and sparks discussion on the potential of multilingual data augmentation. This research provides a promising avenue for mitigating linguistic challenges in the NLP-driven battle against deceptive content.

Anthology ID: 2024.dravidianlangtech-1.12
Volume: Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages
Month: March
Year: 2024
Address: St. Julian's, Malta
Editors: Bharathi Raja Chakravarthi, Ruba Priyadharshini, Anand Kumar Madasamy, Sajeetha Thavareesan, Elizabeth Sherly, Rajeswari Nadarajan, Manikandan Ravikiran
Venues: DravidianLangTech | WS
Publisher: Association for Computational Linguistics
Pages: 79–84
URL: https://preview.aclanthology.org/ingest-emnlp/2024.dravidianlangtech-1.12/
Cite (ACL): Rohith Gowtham Kodali and Durga Prasad Manukonda. 2024. byteSizedLLM@DravidianLangTech 2024: Fake News Detection in Dravidian Languages - Unleashing the Power of Custom Subword Tokenization with Subword2Vec and BiLSTM. In Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages, pages 79–84, St. Julian's, Malta. Association for Computational Linguistics.
Cite (Informal): byteSizedLLM@DravidianLangTech 2024: Fake News Detection in Dravidian Languages - Unleashing the Power of Custom Subword Tokenization with Subword2Vec and BiLSTM (Kodali & Manukonda, DravidianLangTech 2024)
PDF: https://preview.aclanthology.org/ingest-emnlp/2024.dravidianlangtech-1.12.pdf
Video: https://preview.aclanthology.org/ingest-emnlp/2024.dravidianlangtech-1.12.mp4
