Investigating the Multilingual Calibration Effects of Language Model Instruction Tuning
Jerry Huang, Peng Lu, Qiuhao Zeng, Yusuke Iwasawa, Yutaka Matsuo, Sarath Chandar, Edison Marrese-Taylor, Irene Li
Abstract
Ensuring that deep learning models are well-calibrated in their predictive uncertainty is essential to maintaining their trustworthiness and reliability, yet despite rapid advances in foundation model research, the relationship between large language models (LLMs) and their calibration remains an open area of research. In this work, we examine a critical gap in the calibration of LLMs in multilingual settings, aiming to better understand how data scarcity can lead to different calibration effects and how commonly used techniques apply in these settings. Our analysis on two multilingual benchmarks, covering 29 and 42 languages respectively, reveals that even in low-resource languages, model confidence can increase significantly after instruction tuning on high-resource-language SFT datasets. However, improvements in accuracy are marginal or non-existent, resulting in miscalibration and highlighting a critical shortcoming of standard SFT in multilingual settings. Furthermore, we observe that label smoothing is a reasonable method to alleviate this concern, again without any need for low-resource SFT data, maintaining better calibration across all languages. Overall, this highlights the importance of multilingual considerations when both training and tuning LLMs in order to improve their reliability and fairness in downstream use.
- Anthology ID:
- 2026.eacl-short.1
- Volume:
- Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)
- Month:
- March
- Year:
- 2026
- Address:
- Rabat, Morocco
- Editors:
- Vera Demberg, Kentaro Inui, Lluís Màrquez
- Venue:
- EACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 1–59
- URL:
- https://preview.aclanthology.org/ingest-eacl/2026.eacl-short.1/
- Cite (ACL):
- Jerry Huang, Peng Lu, Qiuhao Zeng, Yusuke Iwasawa, Yutaka Matsuo, Sarath Chandar, Edison Marrese-Taylor, and Irene Li. 2026. Investigating the Multilingual Calibration Effects of Language Model Instruction Tuning. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1–59, Rabat, Morocco. Association for Computational Linguistics.
- Cite (Informal):
- Investigating the Multilingual Calibration Effects of Language Model Instruction Tuning (Huang et al., EACL 2026)
- PDF:
- https://preview.aclanthology.org/ingest-eacl/2026.eacl-short.1.pdf
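The abstract's two key ingredients, label-smoothed training and calibration measurement, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names (`label_smoothed_nll`, `expected_calibration_error`), the uniform-mixture form of smoothing, and the 10-bin ECE are standard textbook choices assumed here, not details taken from the paper.

```python
import numpy as np

def label_smoothed_nll(logits, target, eps=0.1):
    """Cross-entropy with label smoothing: the one-hot target is mixed
    with a uniform distribution over the vocabulary (mixing weight eps),
    discouraging the model from assigning near-certain probability mass."""
    logits = np.asarray(logits, dtype=float)
    n = logits.shape[-1]
    # log-softmax (subtract log-sum-exp for normalization)
    log_probs = logits - np.log(np.sum(np.exp(logits - logits.max()))) - logits.max()
    smooth = np.full(n, eps / n)          # uniform component
    smooth[target] += 1.0 - eps           # remaining mass on the true label
    return -float(np.dot(smooth, log_probs))

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: average |accuracy - confidence| over equal-width confidence
    bins, weighted by the fraction of examples falling in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = i / n_bins, (i + 1) / n_bins
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return float(ece)
```

With `eps=0` the loss reduces to ordinary cross-entropy; the paper's observed failure mode corresponds to confidences rising in the ECE computation while `correct` stays flat.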