Team Innovative at SemEval-2024 Task 8: Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection

Surbhi Sharma, Irfan Mansuri


Abstract
With the widespread adoption of large language models (LLMs), such as ChatGPT and GPT-4, in various domains, concerns regarding their potential misuse, including spreading misinformation and disrupting education, have escalated. The need to discern between human-generated and machine-generated text has become increasingly crucial. This paper addresses the challenge of automatic text classification with a focus on distinguishing between human-written and machine-generated text. Leveraging the robust capabilities of the RoBERTa model, we propose an approach for text classification, termed as RoBERTa hybrid, which involves fine-tuning the pre-trained Roberta model coupled with additional dense layers and softmax activation for authorship attribution. In this paper, we present an approach that leverages Stylometric features, hybrid features, and the output probabilities of a fine-tuned RoBERTa model. Our method achieves a test accuracy of 73% and a validation accuracy of 89%, demonstrating promising advancements in the field of machine-generated text detection. These results mark significant progress in the domain of machine-generated text detection, as evidenced by our 74th position on the leaderboard for Subtask-A of SemEval-2024 Task 8.
Anthology ID:
2024.semeval-1.171
Volume:
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Atul Kr. Ojha, A. Seza Doğruöz, Harish Tayyar Madabushi, Giovanni Da San Martino, Sara Rosenthal, Aiala Rosá
Venue:
SemEval
SIG:
SIGLEX
Publisher:
Association for Computational Linguistics
Note:
Pages:
1172–1176
Language:
URL:
https://aclanthology.org/2024.semeval-1.171
DOI:
Bibkey:
Cite (ACL):
Surbhi Sharma and Irfan Mansuri. 2024. Team Innovative at SemEval-2024 Task 8: Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection. In Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024), pages 1172–1176, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
Team Innovative at SemEval-2024 Task 8: Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection (Sharma & Mansuri, SemEval 2024)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingestion-checklist/2024.semeval-1.171.pdf
Supplementary material:
 2024.semeval-1.171.SupplementaryMaterial.txt
Supplementary material:
 2024.semeval-1.171.SupplementaryMaterial.zip