Speech Disfluencies and LLM Confidence: Length Bias and Pragmatic Insensitivity in Brazilian Portuguese

Valeria Santos


Abstract
Large Language Models (LLMs) are trained predominantly on written, curated corpora, which may limit their reliability on spontaneous speech. Oral language exhibits real-time planning markers (filled pauses, repetitions, false starts, and vowel lengthenings) that modulate epistemic commitment. This pilot study investigates how such disfluencies affect the alignment between LLM confidence and a discourse-pragmatic uncertainty proxy in a Portuguese model (Llama-3.1-8B-Instruct). Using a benchmark of 344 turns from the Roda Viva corpus, we contrast faithful Conversation Analysis transcriptions with sanitized versions and combine binned divergence metrics (ECE, OE) with rank correlation and multivariate regression analyses. We find that model confidence is overwhelmingly driven by a surface feature, turn length ($\beta_{\text{std}} = +14.47$, $p < 0.001$), rather than by pragmatic markers of uncertainty ($\beta_{\text{oral}} = -3.09$, $\beta_{\text{hedges}} = -0.97$, both non-significant; $R^2 = 0.29$). After controlling for length, the residual effects of disfluency markers point in the human-expected direction but are dwarfed by the length bias. We argue that this surface-feature dominance subsumes the pragmatic blindness phenomenon and explains the substantial divergence between faithful and sanitized conditions, as measured by ECE (41.95) and OE (4.29).
Anthology ID:
2026.codi-1.5
Volume:
Proceedings of the 2nd Joint Workshop on Computational Approaches to Discourse, Context and Document-Level Inferences and Computational Models of Reference, Anaphora and Coreference (CODI-CRAC 2026)
Month:
July
Year:
2026
Address:
San Diego, California, USA
Editors:
Chloé Braud, Christian Hardmeier, Maciej Ogrodniczuk, Sharid Loaiciga, Amir Zeldes, Michal Novák, Chuyuan Li, Michael Strube, Junyi Jessy Li
Venues:
CODI | CRAC | WS
Publisher:
Association for Computational Linguistics
Pages:
24–28
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.codi-1.5/
Cite (ACL):
Valeria Santos. 2026. Speech Disfluencies and LLM Confidence: Length Bias and Pragmatic Insensitivity in Brazilian Portuguese. In Proceedings of the 2nd Joint Workshop on Computational Approaches to Discourse, Context and Document-Level Inferences and Computational Models of Reference, Anaphora and Coreference (CODI-CRAC 2026), pages 24–28, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal):
Speech Disfluencies and LLM Confidence: Length Bias and Pragmatic Insensitivity in Brazilian Portuguese (Santos, CODI-CRAC 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.codi-1.5.pdf
Supplementary material:
2026.codi-1.5.SupplementaryMaterial.zip