Groningen team D at SemEval-2024 Task 8: Exploring data generation and a combined model for fine-tuning LLMs for Multidomain Machine-Generated Text Detection

Thijs Brekhof, Xuanyi Liu, Joris Ruitenbeek, Niels Top, Yuwen Zhou


Abstract
In this system description, we describe our process and the systems that we created for the subtasks A monolingual, A multilingual, and B forthe SemEval-2024 Task 8: Multigenerator, Multidomain, and Multilingual Black-Box MachineGenerated Text Detection. This shared task aimsat detecting and differentiating between machinegenerated text and human-written text. SubtaskA is focused on detecting if a text is machinegenerated or human-written both in a monolingualand a multilingual setting. Subtask B is also focused on detecting if a text is human-written ormachine-generated, though it takes it one step further by also requiring the detection of the correct language model used for generating the text.For the monolingual aspects of this task, our approach is centered around fine-tuning a debertav3-large LM. For the multilingual setting, we created an ensemble model utilizing different monolingual models and a language identification toolto classify each text. We also experiment with thegeneration of extra training data. Our results showthat the generation of extra data aids our modelsand leads to an increase in accuracy.
Anthology ID:
2024.semeval-1.61
Volume:
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Atul Kr. Ojha, A. Seza Doğruöz, Harish Tayyar Madabushi, Giovanni Da San Martino, Sara Rosenthal, Aiala Rosá
Venue:
SemEval
SIG:
SIGLEX
Publisher:
Association for Computational Linguistics
Note:
Pages:
391–398
Language:
URL:
https://aclanthology.org/2024.semeval-1.61
DOI:
Bibkey:
Cite (ACL):
Thijs Brekhof, Xuanyi Liu, Joris Ruitenbeek, Niels Top, and Yuwen Zhou. 2024. Groningen team D at SemEval-2024 Task 8: Exploring data generation and a combined model for fine-tuning LLMs for Multidomain Machine-Generated Text Detection. In Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024), pages 391–398, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
Groningen team D at SemEval-2024 Task 8: Exploring data generation and a combined model for fine-tuning LLMs for Multidomain Machine-Generated Text Detection (Brekhof et al., SemEval 2024)
Copy Citation:
PDF:
https://preview.aclanthology.org/jeptaln-2024-ingestion/2024.semeval-1.61.pdf
Supplementary material:
 2024.semeval-1.61.SupplementaryMaterial.txt