FLUKE: A Linguistically-Driven and Task-Agnostic Framework for Robustness Evaluation
Yulia Otmakhova, Thinh Hung Truong, Rahmad Mahendra, Zenan Zhai, Rongxin Zhu, Daniel Beck, Jey Han Lau
Abstract
We present FLUKE (Framework for LingUistically-driven and tasK-agnostic robustness Evaluation), a framework for assessing model robustness through systematic minimal variations of test data. FLUKE introduces controlled variations across linguistic levels, from orthography to dialect and style, and leverages large language models (LLMs) with human validation to generate modifications. We demonstrate FLUKE’s utility by evaluating both fine-tuned models and LLMs across six diverse NLP tasks (four classification and two generation tasks), and reveal that (1) the impact of linguistic variations is highly task-dependent, with some tests being critical for certain tasks but irrelevant for others; (2) LLMs still exhibit significant brittleness to certain linguistic variations, with reasoning LLMs surprisingly less robust than base models on some tasks, and scaling improving robustness only for surface-level modifications; (3) models are overall more brittle to natural, fluent modifications such as syntax or style changes (and especially to negation) than to corruption-style tests such as letter flipping; (4) the ability of a model to use a linguistic feature in generation does not correlate with its robustness to this feature on downstream tasks. These findings highlight the importance of systematic robustness testing for understanding model behaviors.
- Anthology ID:
- 2026.findings-eacl.269
- Volume:
- Findings of the Association for Computational Linguistics: EACL 2026
- Month:
- March
- Year:
- 2026
- Address:
- Rabat, Morocco
- Editors:
- Vera Demberg, Kentaro Inui, Lluís Marquez
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 5103–5123
- URL:
- https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.269/
- Cite (ACL):
- Yulia Otmakhova, Thinh Hung Truong, Rahmad Mahendra, Zenan Zhai, Rongxin Zhu, Daniel Beck, and Jey Han Lau. 2026. FLUKE: A Linguistically-Driven and Task-Agnostic Framework for Robustness Evaluation. In Findings of the Association for Computational Linguistics: EACL 2026, pages 5103–5123, Rabat, Morocco. Association for Computational Linguistics.
- Cite (Informal):
- FLUKE: A Linguistically-Driven and Task-Agnostic Framework for Robustness Evaluation (Otmakhova et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.269.pdf