Processing Inconsistency Predicts Language Competence: LLM Evaluation Without Answer Labels on Turkic Languages

Ilya Galyukshev; Ilseyar Alimova

Abstract

Most languages lack labeled evaluation benchmarks for large language models (LLMs). Creating such benchmarks requires native speakers, domain expertise, and answer annotation—resources unavailable for the vast majority of languages. We investigate whether a model’s internal processing signals—such as generation entropy and tokenizer statistics—correlate with its actual accuracy on a language, with the long-term goal of estimating language competence without labeled data. Our key observation is that for languages a model does not know, both tokenizer segmentation and generation entropy become highly variable across questions, whereas for known languages they remain consistent. We call this the *inconsistency hypothesis* and test it on 11 instruction-tuned LLMs (1B–70B parameters) across 14 language–script varieties (12 Turkic plus English and Russian controls). We extract over 25 processing features per model–language pair; individually, even the strongest correlate only moderately with accuracy (Pearson |r| up to 0.55). Yet combining just three complementary features—a tokenizer coverage ratio, entropy variability, and the model’s English/Russian benchmark score—explains 75% of accuracy variance in leave-one-language-out evaluation, nearly doubling the 44% explained by a model-mean baseline. The variability of processing signals (standard deviation) consistently outperforms mean values as a predictor across all five model families, but only for greedy-pass measures; sampling-based measures show no such pattern.
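The leave-one-language-out comparison described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the data are synthetic, the regressor is ordinary least squares, the "model-mean baseline" is taken to mean each model's average accuracy over the training languages, and R² is pooled over all held-out predictions. The function name `lolo_r2` and the three feature columns (`coverage`, `entropy_std`, `bench`) are hypothetical stand-ins for the features named in the abstract.

```python
import numpy as np


def lolo_r2(X, y, lang_ids, model_ids):
    """Pooled R^2 of leave-one-language-out predictions.

    Returns (r2_features, r2_baseline), where the baseline predicts each
    held-out pair with that model's mean accuracy on the training languages
    (an assumed reading of "model-mean baseline").
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    lang_ids = np.asarray(lang_ids)
    model_ids = np.asarray(model_ids)
    pred = np.empty_like(y)
    base = np.empty_like(y)

    for lang in np.unique(lang_ids):
        test = lang_ids == lang
        train = ~test
        # Ordinary least squares with an intercept, fit without the held-out language.
        A_train = np.column_stack([np.ones(train.sum()), X[train]])
        coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        A_test = np.column_stack([np.ones(test.sum()), X[test]])
        pred[test] = A_test @ coef
        # Baseline: per-model mean accuracy over the training languages.
        for m in np.unique(model_ids[test]):
            sel = test & (model_ids == m)
            base[sel] = y[train & (model_ids == m)].mean()

    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = lambda p: 1.0 - ((y - p) ** 2).sum() / ss_tot
    return r2(pred), r2(base)


# Synthetic stand-in data: 11 models x 14 language varieties (sizes from the abstract).
rng = np.random.default_rng(0)
n_models, n_langs = 11, 14
model_ids = np.repeat(np.arange(n_models), n_langs)
lang_ids = np.tile(np.arange(n_langs), n_models)

bench = rng.uniform(0.4, 0.9, n_models)[model_ids]    # per-model benchmark score
coverage = rng.uniform(0.2, 1.0, model_ids.size)       # tokenizer coverage ratio
entropy_std = rng.uniform(0.0, 1.0, model_ids.size)    # entropy variability
X = np.column_stack([coverage, entropy_std, bench])

# Invented linear relationship plus noise; the coefficients are illustrative only.
y = (0.2 + 0.5 * coverage - 0.3 * entropy_std + 0.3 * bench
     + rng.normal(0.0, 0.02, model_ids.size))

r2_feat, r2_base = lolo_r2(X, y, lang_ids, model_ids)
print(f"features R^2 = {r2_feat:.2f}, model-mean baseline R^2 = {r2_base:.2f}")
```

Holding out every model's row for a language at once keeps the test language entirely unseen during fitting, which is what makes the resulting R² an estimate of how well the features would transfer to a language with no labels.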

Anthology ID: 2026.acl-srw.94
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Santosh T.Y.S.S., Juan Diego Rodriguez, Ona de Gibert
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 1074–1086
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-srw.94/
Cite (ACL): Ilya Galyukshev and Ilseyar Alimova. 2026. Processing Inconsistency Predicts Language Competence: LLM Evaluation Without Answer Labels on Turkic Languages. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), pages 1074–1086, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Processing Inconsistency Predicts Language Competence: LLM Evaluation Without Answer Labels on Turkic Languages (Galyukshev & Alimova, ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-srw.94.pdf
