A Corpus of Hindi-English Code-Mixed Posts to Hate Speech Detection

Prashant Kapil; Asif Ekbal

Abstract

Social media content, such as blog posts, comments, and tweets, often contains offensive language, including racial hate speech, personal attacks, and sexual harassment. Detecting inappropriate language is crucial for user safety and for preventing hateful behavior and aggression. This study introduces HECM, a corpus of Hindi-English code-mixed tweets, to fill a gap in Hindi language resources. The corpus comprises approximately 9.4K tweets labeled as hateful or non-hateful. The paper documents the data in detail, including the annotation schema, the label definitions, and an inter-annotator agreement score of 85%. The study evaluates traditional machine learning, deep neural network, and transformer encoder-based approaches. The results show a significant improvement in macro-F1 and weighted-F1 scores. Additionally, a lexicon of 2,000 entries tagged with 21 categories is created based on the multilingual HURTLEX lexicon. Combining this lexicon with the transformer encoder yields a marginal improvement in macro-F1 and weighted-F1. The study also experiments with a Hindi-Devanagari dataset to assess the impact of the lexicon on performance metrics.
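The abstract reports both macro-F1 and weighted-F1. As a minimal sketch of how the two averages differ on an imbalanced two-class task, the Python below computes them from scratch. The labels and predictions are invented toy data, not drawn from the HECM corpus, and the function names are hypothetical.

```python
from collections import Counter

def f1_per_class(y_true, y_pred, label):
    """F1 for one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_and_weighted_f1(y_true, y_pred):
    """Return (macro-F1, weighted-F1) over the classes present in y_true."""
    labels = sorted(set(y_true))
    support = Counter(y_true)  # number of gold examples per class
    scores = {lab: f1_per_class(y_true, y_pred, lab) for lab in labels}
    # Macro: unweighted mean, so every class counts equally.
    macro = sum(scores.values()) / len(labels)
    # Weighted: mean weighted by class support, so frequent classes dominate.
    weighted = sum(scores[lab] * support[lab] for lab in labels) / len(y_true)
    return macro, weighted

# Toy data (assumption): 2 hateful and 8 non-hateful gold labels.
y_true = ["hate"] * 2 + ["non-hate"] * 8
y_pred = ["hate", "non-hate"] + ["non-hate"] * 7 + ["hate"]

macro, weighted = macro_and_weighted_f1(y_true, y_pred)
print(f"macro-F1 = {macro:.4f}, weighted-F1 = {weighted:.4f}")
```

On this toy split the minority class scores lower than the majority class, so macro-F1 comes out below weighted-F1. That gap is one reason both figures are commonly reported for hate speech data, where the hateful class is usually the smaller one.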

Anthology ID: 2024.icon-1.9
Volume: Proceedings of the 21st International Conference on Natural Language Processing (ICON)
Month: December
Year: 2024
Address: AU-KBC Research Centre, Chennai, India
Editors: Sobha Lalitha Devi, Karunesh Arora
Venue: ICON
Publisher: NLP Association of India (NLPAI)
Pages: 79–85
URL: https://preview.aclanthology.org/fix-sig-urls/2024.icon-1.9/
Cite (ACL): Prashant Kapil and Asif Ekbal. 2024. A Corpus of Hindi-English Code-Mixed Posts to Hate Speech Detection. In Proceedings of the 21st International Conference on Natural Language Processing (ICON), pages 79–85, AU-KBC Research Centre, Chennai, India. NLP Association of India (NLPAI).
Cite (Informal): A Corpus of Hindi-English Code-Mixed Posts to Hate Speech Detection (Kapil & Ekbal, ICON 2024)
PDF: https://preview.aclanthology.org/fix-sig-urls/2024.icon-1.9.pdf
