SubmissionNumber#=%=#17
FinalPaperTitle#=%=#Genaios at SemEval-2024 Task 8: Detecting Machine-Generated Text by Mixing Language Model Probabilistic Features
ShortPaperTitle#=%=#
NumberOfPages#=%=#7
CopyrightSigned#=%=#José Ángel González Barba
JobTitle#==#
Organization#==#Genaios, Valencia, Spain.
Abstract#==#This paper describes the participation of the Genaios team in the monolingual track of Subtask A at SemEval-2024 Task 8. Our best system, LLMixtic, is a Transformer Encoder that mixes token-level probabilistic features extracted from four LLaMA-2 models. We obtained the best results in the official ranking (96.88% accuracy), with a false positive ratio of 4.38% and a false negative ratio of 1.97% on the test set. We further study LLMixtic through ablation, probabilistic, and attention analyses, finding that (i) performance improves as more LLMs and probabilistic features are included, (ii) LLMixtic puts most attention on the features of the last tokens, (iii) it fails on samples where the probabilities of human text become consistently higher than those of generated text, and (iv) LLMixtic's false negatives exhibit a bias towards text with newlines.
Author{1}{Firstname}#=%=#Areg Mikael
Author{1}{Lastname}#=%=#Sarvazyan
Author{1}{Email}#=%=#areg.sarvazyan@genaios.ai
Author{1}{Affiliation}#=%=#Genaios
Author{2}{Firstname}#=%=#José Ángel
Author{2}{Lastname}#=%=#González
Author{2}{Username}#=%=#jogonba2
Author{2}{Email}#=%=#jose.gonzalez@genaios.ai
Author{2}{Affiliation}#=%=#Genaios
Author{3}{Firstname}#=%=#Marc
Author{3}{Lastname}#=%=#Franco-Salvador
Author{3}{Username}#=%=#neosyon
Author{3}{Email}#=%=#marc.franco@symanto.com
Author{3}{Affiliation}#=%=#Symanto Research

==========
èéáğö