Seffi Cohen
2026
DFPE: A Diverse Fingerprint Ensemble for Enhancing LLM Performance
Seffi Cohen | Nurit Cohen Inger | Niv Goldshlager | Bracha Shapira | Lior Rokach
Findings of the Association for Computational Linguistics: EACL 2026
Large Language Models (LLMs) demonstrate impressive capabilities but exhibit inconsistent performance across diverse domains. We propose DFPE (Diverse Fingerprint Ensemble), a novel training-free method that systematically constructs subject-adaptive ensembles by balancing model diversity and competence. DFPE introduces three key innovations: (1) semantic fingerprinting using averaged response embeddings to capture distinct problem-solving patterns, (2) DBSCAN-based clustering with quantile-based competence filtering to ensure diverse yet capable model selection, and (3) exponentially weighted aggregation adapted to subject-specific performance. Our method's effectiveness is highlighted on the challenging MMLU-Pro benchmark, where DFPE achieves a striking 17.1 percentage point gain over the best single model, reaching 71.4% accuracy. This strong performance is consistent across other standard benchmarks, with significant accuracy improvements of 4.4 points on AGIEval and 2.7 points on MMLU. Our results underscore that a systematic approach to ensemble construction, one that balances diversity, subject-specific competence, and adaptive weighting, can substantially enhance the generalization and robustness of LLMs on multifaceted language understanding tasks.
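To make the three stages concrete, here is a minimal Python sketch of the pipeline the abstract outlines: averaged-embedding fingerprints, DBSCAN clustering with a quantile competence filter, and exponentially weighted voting. Function names, hyperparameters (`eps`, `quantile`, `temperature`), and the keep-the-strongest-per-cluster selection rule are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative sketch of a DFPE-style ensemble; details are assumptions,
# not the authors' code.
import numpy as np
from sklearn.cluster import DBSCAN

def select_ensemble(fingerprints, val_accuracy, eps=0.5, quantile=0.5):
    """Pick a diverse yet competent subset of models.

    fingerprints: (n_models, d) array; each row is the average embedding
                  of one model's responses on a probe set.
    val_accuracy: (n_models,) validation accuracy per model.
    """
    # Cluster models by behavioral similarity of their fingerprints.
    labels = DBSCAN(eps=eps, min_samples=1, metric="cosine").fit_predict(fingerprints)
    # Quantile-based competence filter: discard weak models.
    threshold = np.quantile(val_accuracy, quantile)
    selected = []
    for cluster in np.unique(labels):
        members = np.where(labels == cluster)[0]
        capable = [m for m in members if val_accuracy[m] >= threshold]
        if capable:
            # One representative per cluster preserves diversity
            # (assumed selection rule: the most accurate member).
            selected.append(max(capable, key=lambda m: val_accuracy[m]))
    return selected

def weighted_vote(answers, subject_accuracy, temperature=0.1):
    """Aggregate answers with exponential weights from subject-level accuracy."""
    scores = {}
    for ans, acc in zip(answers, subject_accuracy):
        scores[ans] = scores.get(ans, 0.0) + np.exp(acc / temperature)
    return max(scores, key=scores.get)
```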
2025
Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon
Nurit Cohen Inger | Yehonatan Elisha | Bracha Shapira | Lior Rokach | Seffi Cohen
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs) often appear to excel on public benchmarks, but these high scores may mask an overreliance on dataset-specific surface cues rather than true language understanding. We introduce the Chameleon Benchmark Overfit Detector (C-BOD), a meta-evaluation framework designed to reveal such overfitting. C-BOD systematically rephrases benchmark inputs via a parameterized transformation that preserves semantic content and labels, enabling the detection of performance degradation indicative of superficial pattern reliance. We conduct extensive experiments across two datasets, three rephrasing models, and multiple distortion levels, evaluating 32 state-of-the-art LLMs. On the MMLU benchmark, C-BOD reveals an average performance drop of 2.75% under modest rephrasings, with over 80% of models exhibiting statistically significant differences. Notably, higher-performing models and larger LLMs tend to show greater sensitivity, suggesting a deeper dependence on benchmark-specific phrasing. Due to its dataset- and model-agnostic design, C-BOD can be easily integrated into evaluation pipelines and offers a promising foundation for overfitting mitigation strategies. Our findings challenge the community to look beyond leaderboard scores and prioritize resilience and generalization in LLM evaluation. Our code and benchmark datasets are available at: https://github.com/nuritci/cbod
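To illustrate the detection criterion, here is a minimal Python sketch of comparing per-item correctness before and after rephrasing and testing the drop for significance. A normal approximation to McNemar's paired test is used as an illustrative stand-in, since the abstract does not name the exact statistical test; the real framework at https://github.com/nuritci/cbod differs in detail.

```python
# Illustrative C-BOD-style overfit check; the significance test is an
# assumed stand-in (McNemar normal approximation), not the paper's exact choice.
from statistics import NormalDist

def paired_accuracy_drop(correct_orig, correct_reph):
    """correct_orig / correct_reph: lists of 0/1 per benchmark item,
    scored on the original vs. the label-preserving rephrased wording.
    Returns (accuracy drop, two-sided p-value)."""
    # Discordant pairs: right on original but wrong after rephrasing, and vice versa.
    b = sum(1 for o, r in zip(correct_orig, correct_reph) if o and not r)
    c = sum(1 for o, r in zip(correct_orig, correct_reph) if not o and r)
    drop = (sum(correct_orig) - sum(correct_reph)) / len(correct_orig)
    if b + c == 0:
        return drop, 1.0
    z = (b - c) / (b + c) ** 0.5  # normal approximation to McNemar's test
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return drop, p

# A large positive drop with a small p-value signals that the model's score
# depends on benchmark-specific phrasing rather than underlying understanding.
```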