Genaios at SemEval-2024 Task 8: Detecting Machine-Generated Text by Mixing Language Model Probabilistic Features

Areg Mikael Sarvazyan; José-Ángel González; Marc Franco-Salvador

doi:10.18653/v1/2024.semeval-1.17

Abstract

This paper describes the participation of the Genaios team in the monolingual track of Subtask A at SemEval-2024 Task 8. Our best system, LLMixtic, is a Transformer Encoder that mixes token-level probabilistic features extracted from four LLaMA-2 models. We obtained the best results in the official ranking (96.88% accuracy), showing a false positive ratio of 4.38% and a false negative ratio of 1.97% on the test set. We further study LLMixtic through ablation, probabilistic, and attention analyses, finding that (i) performance improves as more LLMs and probabilistic features are included, (ii) LLMixtic puts most attention on the features of the last tokens, (iii) it fails on samples where human text probabilities become consistently higher than for generated text, and (iv) LLMixtic’s false negatives exhibit a bias towards text with newlines.
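To make "token-level probabilistic features" concrete, here is a minimal sketch of how such features could be computed and concatenated across several language models. The abstract does not list the exact features, so the three used here (log-probability of the observed token, log-probability of the most likely token, and entropy of the next-token distribution) are illustrative assumptions, as are the function names `log_softmax`, `token_features`, and `mix_features`. The sketch also assumes all models share one tokenizer, so that position t refers to the same token in every model. The Transformer Encoder that consumes these features is not shown.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one next-token logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_features(logits, observed_id):
    """Return [log p(observed token), log p(most likely token), entropy].

    This feature set is an assumption for illustration only.
    """
    log_probs = log_softmax(logits)
    entropy = -sum(math.exp(lp) * lp for lp in log_probs)
    return [log_probs[observed_id], max(log_probs), entropy]


def mix_features(per_model_logits, observed_ids):
    """Concatenate the per-token features of every model.

    per_model_logits[m][t] is the logit vector that model m assigns at
    position t; observed_ids[t] is the token actually present at position t.
    Returns one feature row per token, of length 3 * number_of_models.
    """
    rows = []
    for t, token_id in enumerate(observed_ids):
        row = []
        for model_logits in per_model_logits:
            row.extend(token_features(model_logits[t], token_id))
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Toy example: two hypothetical models, two tokens, vocabulary of size 4.
    model_a = [[0.0, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0]]
    model_b = [[1.0, 0.0, 0.0, 0.0], [0.0, 3.0, 0.0, 0.0]]
    for row in mix_features([model_a, model_b], observed_ids=[1, 0]):
        print([round(value, 3) for value in row])
```

In a real pipeline the logit vectors would come from forward passes of the language models over the input text, and the resulting feature rows would form the input sequence of the classifier.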

Anthology ID: 2024.semeval-1.17
Volume: Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
Month: June
Year: 2024
Address: Mexico City, Mexico
Editors: Atul Kr. Ojha, A. Seza Doğruöz, Harish Tayyar Madabushi, Giovanni Da San Martino, Sara Rosenthal, Aiala Rosá
Venue: SemEval
SIG: SIGLEX
Publisher: Association for Computational Linguistics
Pages: 101–107
URL: https://aclanthology.org/2024.semeval-1.17
DOI: 10.18653/v1/2024.semeval-1.17
Cite (ACL): Areg Mikael Sarvazyan, José Ángel González, and Marc Franco-Salvador. 2024. Genaios at SemEval-2024 Task 8: Detecting Machine-Generated Text by Mixing Language Model Probabilistic Features. In Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024), pages 101–107, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal): Genaios at SemEval-2024 Task 8: Detecting Machine-Generated Text by Mixing Language Model Probabilistic Features (Sarvazyan et al., SemEval 2024)
PDF: https://preview.aclanthology.org/dois-2013-emnlp/2024.semeval-1.17.pdf
Supplementary material: 2024.semeval-1.17.SupplementaryMaterial.txt