BadRock at SemEval-2024 Task 8: DistilBERT to Detect Multigenerator, Multidomain and Multilingual Black-Box Machine-Generated Text

Marco Siino


Abstract
The rise of Large Language Models (LLMs) has brought about a notable shift, rendering them increasingly ubiquitous and readily accessible. This accessibility has precipitated a surge in machine-generated content across diverse platforms encompassing news outlets, social media platforms, question-answering forums, educational platforms, and even academic domains. Recent iterations of LLMs, exemplified by entities like ChatGPT and GPT-4, exhibit a remarkable ability to produce coherent and contextually relevant responses across a broad spectrum of user inquiries. The fluidity and sophistication of these generated texts position LLMs as compelling candidates for substituting human labor in numerous applications. Nevertheless, this proliferation of machine-generated content has raised apprehensions regarding potential misuse, including the dissemination of misinformation and disruption of educational ecosystems. Given that humans marginally outperform random chance in discerning between machine-generated and human-authored text, there arises a pressing imperative to develop automated systems capable of accurately distinguishing machine-generated text. This pursuit is driven by the overarching objective of curbing the potential misuse of machine-generated content. Our manuscript delineates the approach we adopted for participation in this competition. Specifically, we detail the use of a DistilBERT model for classifying each sample in the test set provided. Our submission is able to reach an accuracy equal to 0.754 in place of the worst result obtained at the competition that is equal to 0.231.
Anthology ID:
2024.semeval-1.37
Volume:
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Atul Kr. Ojha, A. Seza Doğruöz, Harish Tayyar Madabushi, Giovanni Da San Martino, Sara Rosenthal, Aiala Rosá
Venue:
SemEval
SIG:
SIGLEX
Publisher:
Association for Computational Linguistics
Note:
Pages:
239–245
Language:
URL:
https://aclanthology.org/2024.semeval-1.37
DOI:
10.18653/v1/2024.semeval-1.37
Bibkey:
Cite (ACL):
Marco Siino. 2024. BadRock at SemEval-2024 Task 8: DistilBERT to Detect Multigenerator, Multidomain and Multilingual Black-Box Machine-Generated Text. In Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024), pages 239–245, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
BadRock at SemEval-2024 Task 8: DistilBERT to Detect Multigenerator, Multidomain and Multilingual Black-Box Machine-Generated Text (Siino, SemEval 2024)
Copy Citation:
PDF:
https://preview.aclanthology.org/nschneid-patch-4/2024.semeval-1.37.pdf
Supplementary material:
 2024.semeval-1.37.SupplementaryMaterial.zip
Supplementary material:
 2024.semeval-1.37.SupplementaryMaterial.txt