PetKaz at SemEval-2024 Task 8: Can Linguistics Capture the Specifics of LLM-generated Text?

Kseniia Petukhova, Roman Kazakov, Ekaterina Kochmar


Abstract
In this paper, we present our submission to the SemEval-2024 Task 8 “Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection”, focusing on the detection of machine-generated texts (MGTs) in English. Specifically, our approach combines embeddings from RoBERTa-base with diversity features and uses a resampled training set. We rank 16th out of 139 in Subtask A, and our results show that our approach generalizes to unseen models and domains, achieving an accuracy of 0.91.
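The feature combination described in the abstract can be pictured with a minimal sketch. It is an illustration only, not the authors' implementation: the helper names are hypothetical, the type-token ratio is an assumed stand-in for the paper's diversity features, and a zero vector of length 768 (the RoBERTa-base hidden size) stands in for a real embedding.

```python
from typing import List


def type_token_ratio(text: str) -> float:
    """Share of distinct whitespace tokens; an assumed, simple diversity feature."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def combine_features(embedding: List[float], text: str) -> List[float]:
    """Append the diversity feature to the text embedding (hypothetical helper)."""
    return embedding + [type_token_ratio(text)]


# Placeholder embedding: a real pipeline would take this vector from RoBERTa-base.
embedding = [0.0] * 768
features = combine_features(embedding, "the cat sat on the mat")
print(len(features))
```

A downstream classifier would then be trained on the combined vectors; the resampling of the training set mentioned in the abstract is not shown here.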
Anthology ID:
2024.semeval-1.166
Volume:
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Atul Kr. Ojha, A. Seza Doğruöz, Harish Tayyar Madabushi, Giovanni Da San Martino, Sara Rosenthal, Aiala Rosá
Venue:
SemEval
SIG:
SIGLEX
Publisher:
Association for Computational Linguistics
Pages:
1140–1147
URL:
https://aclanthology.org/2024.semeval-1.166
Cite (ACL):
Kseniia Petukhova, Roman Kazakov, and Ekaterina Kochmar. 2024. PetKaz at SemEval-2024 Task 8: Can Linguistics Capture the Specifics of LLM-generated Text?. In Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024), pages 1140–1147, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
PetKaz at SemEval-2024 Task 8: Can Linguistics Capture the Specifics of LLM-generated Text? (Petukhova et al., SemEval 2024)
PDF:
https://preview.aclanthology.org/bionlp-24-ingestion/2024.semeval-1.166.pdf
Supplementary material:
 2024.semeval-1.166.SupplementaryMaterial.zip
Supplementary material:
 2024.semeval-1.166.SupplementaryMaterial.txt