Areg Mikael Sarvazyan
2026
How Many Samples Do We Need? A Toolkit for Power-Aware Evaluation Design
Angelo Basile | Areg Mikael Sarvazyan | José Ángel González
Proceedings of the Fifteenth Language Resources and Evaluation Conference
If datasets are the telescopes of our field, then statistical power is their resolution, i.e., their ability to reveal a true difference in model performance when one exists. Many NLP evaluations are underpowered, leading to overstated claims of improvement. This paper introduces sk-power, an open-source Python library that helps researchers and practitioners design well-powered evaluations. Built with familiar scikit-learn-style abstractions, sk-power enables users to simulate evaluation scenarios, estimate minimum detectable effects, and assess the reliability of reported gains. We also illustrate what can go wrong when power analysis isn’t carried out. Our goal is to position power analysis as a first-class, practical step in evaluation planning.
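The abstract does not specify sk-power's API, so the following is a minimal, hypothetical sketch of the kind of simulation-based power analysis the paper advocates: estimating, for a given test-set size, the probability that a one-sided two-proportion test detects a true accuracy gain. The function name, parameters, and baseline accuracy are illustrative assumptions, not the library's actual interface.

```python
import numpy as np

def simulated_power(n, delta, p_base=0.85, n_sims=2000, seed=0):
    """Estimate the power of a one-sided two-proportion z-test (alpha=0.05)
    to detect a true accuracy gain `delta` over baseline accuracy `p_base`
    when both systems are evaluated on `n` test samples.
    NOTE: illustrative sketch only, not the sk-power API."""
    rng = np.random.default_rng(seed)
    detections = 0
    for _ in range(n_sims):
        # observed accuracies of baseline and improved system
        a = rng.binomial(n, p_base) / n
        b = rng.binomial(n, p_base + delta) / n
        # pooled standard error for the difference of two proportions
        p_pool = (a + b) / 2
        se = np.sqrt(2 * p_pool * (1 - p_pool) / n)
        if se > 0 and (b - a) / se > 1.645:  # one-sided alpha = 0.05
            detections += 1
    return detections / n_sims
```

For a true 2-point gain over an 85% baseline, a few hundred test samples yield low power, while several thousand are needed to reach conventional levels (around 0.8), which is the kind of gap an underpowered evaluation hides.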
2024
Genaios at SemEval-2024 Task 8: Detecting Machine-Generated Text by Mixing Language Model Probabilistic Features
Areg Mikael Sarvazyan | José Ángel González | Marc Franco-Salvador
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
This paper describes the participation of the Genaios team in the monolingual track of Subtask A at SemEval-2024 Task 8. Our best system, LLMixtic, is a Transformer Encoder that mixes token-level probabilistic features extracted from four LLaMA-2 models. We obtained the best results in the official ranking (96.88% accuracy), showing a false positive ratio of 4.38% and a false negative ratio of 1.97% on the test set. We further study LLMixtic through ablation, probabilistic, and attention analyses, finding that (i) performance improves as more LLMs and probabilistic features are included, (ii) LLMixtic puts most attention on the features of the last tokens, (iii) it fails on samples where human text probabilities become consistently higher than for generated text, and (iv) LLMixtic’s false negatives exhibit a bias towards text with newlines.
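The abstract describes LLMixtic's inputs as token-level probabilistic features from causal LMs. The paper's exact feature pipeline is not given here, so the snippet below is only a generic sketch of the underlying computation: gathering the per-token log-probability that each model assigned to the observed tokens, then stacking one feature column per model. All names are assumptions for illustration.

```python
import numpy as np

def token_logprob_features(logits, token_ids):
    """Per-token log-probabilities of the observed tokens under one causal LM.
    logits: (seq_len, vocab_size) next-token logits, aligned so that row i
    predicts token_ids[i]. Returns a (seq_len,) array of log-probabilities."""
    # numerically stable log-softmax over the vocabulary dimension
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return logp[np.arange(len(token_ids)), token_ids]

def stack_model_features(per_model_logits, token_ids):
    """Stack per-token log-prob features from several LMs into a
    (seq_len, n_models) matrix, i.e. one feature column per model."""
    return np.stack(
        [token_logprob_features(l, token_ids) for l in per_model_logits],
        axis=-1,
    )
```

In the paper's setting the features would come from four LLaMA-2 models and feed a Transformer Encoder; here the design choice shown is simply that each position carries one probabilistic feature per scoring model.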