Tolgahan Türker


2026

Grammatical Error Correction (GEC) requires models to make edit decisions under competing objectives: correcting errors while either minimizing changes or maximizing fluency. However, we lack a principled characterization of how instruction-following Large Language Models (LLMs) shift their edit decisions across such editing modes, and of whether standard evaluation setups faithfully reflect these shifts. We address this gap by defining three modes (Neutral, Minimal-Edit, and Fluency-Edit) and measuring neutral-anchored performance shifts to quantify instructional sensitivity. We benchmark seven LLMs, including proprietary and open-weight models, under a unified zero-shot prompting schema on the CoNLL-2014, BEA-2019, and JFLEG datasets. The Minimal-Edit instruction mitigates over-editing and typically boosts precision; in some settings, strong models also improve recall, suggesting more selective and effective corrections. In contrast, the Fluency-Edit instruction often encourages broader paraphrastic rewriting that may improve perceived fluency while lowering GLEU, suggesting both a metric-objective mismatch and a shift away from targeted local correction. Notably, Claude-Sonnet-4.5 demonstrates superior zero-shot capabilities, outperforming previously reported scores and matching or even exceeding few-shot results on CoNLL-2014 (F_0.5: 67.05), BEA-2019 (F_0.5: 64.91), and JFLEG (GLEU: 66.09).
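
As a rough illustration of the quantities named above, the following sketch computes the standard F_0.5 score from precision and recall and a neutral-anchored shift. The shift is assumed here to be each mode's score minus the Neutral-mode score; the abstract does not spell out the exact definition, so this is one plausible reading. The function names and the input numbers are hypothetical placeholders, not results from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta = 0.5 weights precision more than recall."""
    denom = beta**2 * precision + recall
    if denom == 0:
        return 0.0  # convention when both precision and recall are zero
    return (1 + beta**2) * precision * recall / denom


def neutral_anchored_shift(scores: dict, anchor: str = "neutral") -> dict:
    """Assumed definition: score under each mode minus the Neutral-mode score."""
    base = scores[anchor]
    return {mode: s - base for mode, s in scores.items() if mode != anchor}


if __name__ == "__main__":
    # Placeholder precision/recall pairs per mode (illustrative only).
    pr_by_mode = {
        "neutral": (0.50, 0.50),
        "minimal": (0.75, 0.50),
        "fluency": (0.25, 0.50),
    }
    f_scores = {mode: f_beta(p, r) for mode, (p, r) in pr_by_mode.items()}
    for mode, delta in neutral_anchored_shift(f_scores).items():
        print(f"{mode}: shift in F_0.5 relative to neutral = {delta:+.4f}")
```

Under this reading, a positive shift means the instruction improved the metric relative to the Neutral prompt and a negative shift means it degraded it. The same helper applies unchanged to GLEU scores on JFLEG.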