Abstract
Hate speech and offensive language recognition on social media platforms has been an active field of research in recent years. In countries where English is not the native language, social media texts are mostly in code-mixed or script-mixed/switched form. The current study presents extensive experiments using multiple machine learning, deep learning, and transfer learning models to detect offensive content on Twitter. The datasets used in this study are Tanglish (Tamil and English) code-mixed, Manglish (Malayalam and English) code-mixed, and Malayalam script-mixed. The experimental results showed that 1- to 6-gram character TF-IDF features work better for this task. The best performing models were naive Bayes, logistic regression, and a vanilla neural network for the Tamil code-mixed, Malayalam code-mixed, and Malayalam script-mixed datasets, respectively, rather than the more popular transfer learning models such as BERT and ULMFiT or hybrid deep models.
- Anthology ID:
- 2021.dravidianlangtech-1.5
- Volume:
- Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages
- Month:
- April
- Year:
- 2021
- Address:
- Kyiv
- Editors:
- Bharathi Raja Chakravarthi, Ruba Priyadharshini, Anand Kumar M, Parameswari Krishnamurthy, Elizabeth Sherly
- Venue:
- DravidianLangTech
- Publisher:
- Association for Computational Linguistics
- Pages:
- 36–45
- URL:
- https://aclanthology.org/2021.dravidianlangtech-1.5
- Cite (ACL):
- Sunil Saumya, Abhinav Kumar, and Jyoti Prakash Singh. 2021. Offensive language identification in Dravidian code mixed social media text. In Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, pages 36–45, Kyiv. Association for Computational Linguistics.
- Cite (Informal):
- Offensive language identification in Dravidian code mixed social media text (Saumya et al., DravidianLangTech 2021)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-1/2021.dravidianlangtech-1.5.pdf
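
The abstract reports that 1- to 6-gram character TF-IDF features worked best. As a rough illustration of what such features look like, here is a minimal pure-Python sketch. It is not the authors' implementation: the function names `char_ngrams` and `tfidf_vectors`, the smoothed IDF formula, the L2 normalisation, and the toy strings are all assumptions made for this example.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=1, n_max=6):
    """Return every character n-gram of length n_min..n_max in text."""
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


def tfidf_vectors(docs, n_min=1, n_max=6):
    """Map each document to a dict of {n-gram: TF-IDF weight}.

    Assumed weighting (not taken from the paper): raw term counts,
    smoothed idf = log((1 + N) / (1 + df)) + 1, then L2 normalisation.
    """
    counts = [Counter(char_ngrams(d, n_min, n_max)) for d in docs]
    # Document frequency: number of documents containing each n-gram.
    df = Counter(gram for c in counts for gram in c)
    n_docs = len(docs)
    vectors = []
    for c in counts:
        weights = {
            gram: tf * (math.log((1 + n_docs) / (1 + df[gram])) + 1.0)
            for gram, tf in c.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({gram: w / norm for gram, w in weights.items()})
    return vectors


if __name__ == "__main__":
    # Invented toy strings, not drawn from the datasets in the paper.
    docs = ["super movie da", "movie semma super"]
    for doc, vec in zip(docs, tfidf_vectors(docs)):
        top = sorted(vec, key=vec.get, reverse=True)[:5]
        print(doc, "->", top)
```

The resulting sparse vectors could then be passed to a classifier such as naive Bayes or logistic regression. In practice a library vectorizer, for example scikit-learn's `TfidfVectorizer` with `analyzer="char"` and `ngram_range=(1, 6)`, would be the more usual choice; whether the authors used it is not stated in the abstract.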