Adrian Ghajari
2025
Robust Estimation of Population-Level Effects in Repeated-Measures NLP Experimental Designs
Alejandro Benito-Santos
|
Adrian Ghajari
|
Víctor Fresno
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
NLP research frequently grapples with multiple sources of variability—spanning runs, datasets, annotators, and more—yet conventional analysis methods often neglect these hierarchical structures, threatening the reproducibility of findings. To address this gap, we contribute a case study illustrating how linear mixed-effects models (LMMs) can rigorously capture systematic language-dependent differences (i.e., population-level effects) in a population of monolingual and multilingual language models. In the context of a bilingual hate speech detection task, we demonstrate that LMMs can uncover significant population-level effects—even under low-resource (small-N) experimental designs—while mitigating confounds and random noise. By setting out a transparent blueprint for repeated-measures experimentation, we encourage the NLP community to embrace variability as a feature, rather than a nuisance, in order to advance more robust, reproducible, and ultimately trustworthy results.
Beyond Averages: Learning with Annotator Disagreement in STS
Alejandro Benito-Santos
|
Adrian Ghajari
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
This work investigates capturing and modeling disagreement in Semantic Textual Similarity (STS), where sentence pairs are assigned ordinal similarity labels (0–5). Conventional STS systems average multiple annotator scores and focus on a single numeric estimate, overlooking label dispersion. By leveraging the disaggregated SemEval-2015 dataset (Soft-STS-15), this paper proposes and compares two disagreement-aware strategies that treat STS as an ordinal distribution prediction problem: a lightweight truncated Gaussian head for standard regression models, and a cross-encoder trained with a distance-aware objective, refined with temperature scaling. Results show improved performance in distance-based metrics, with the calibrated soft-label model proving best overall and notably more accurate on the most ambiguous pairs. This demonstrates that modeling disagreement benefits both calibration and ranking accuracy, highlighting the value of retaining and modeling full annotation distributions rather than collapsing them to a single mean label.
On the Correspondence between the Squared Norm and Information Content in Text Embeddings
Enrique Amigo
|
Adrian Ghajari
|
Alejandro Benito-Santos
|
Diego De La Fuente Rodríguez
Findings of the Association for Computational Linguistics: EMNLP 2025
Previous work has reported both empirical and theoretical evidence, for specific training models, of the correspondence between the squared norm of an embedding and the information content of the text it represents.In this paper, we investigate the relationship at the theoretical and empirical levels, focusing on the mechanisms and composition functions used to combine token embeddings. i) We formally derive two sufficient theoretical conditions for this correspondence to hold in embedding models. ii) We empirically examine the correspondence and the validity of these conditions at the word level for both static and contextual embeddings and different subword token composition mechanisms.iii) Building on Shannon’s Constant Entropy Rate (CER) principle, we explore whether embedding mechanisms exhibit a linearly monotonic increase in information content as text length increases.Our formal analysis and experiments reveal that:i) At the word embedding level, models satisfy the sufficient conditions and show a strong correspondence when certain subword composition functions are applied.ii) Only scaled embedding averages proposed in this paper and certain information-theoretic composition functions preserve the correspondence. Some non-compositional representations—such as the CLS token in BERT or the EOS token in LLaMA—tend to converge toward a fixed point. The CLS token in ModernBERT, however, exhibits behavior that aligns more closely with the CER hypothesis.