GSM-Noise: Exploring and Enhancing Large Language Models’ Reasoning under Noisy Inputs

Zhengxin Zhang; Chengyu Huang; Xufu Liu; Dan Zhao; Jinyan Su; Claire Cardie

Abstract

Large language models (LLMs) have demonstrated impressive reasoning capabilities, yet they often struggle with complex, ill-formed, or noisy inputs, which frequently occur in interactions with real users. LLMs typically lack the refinement capabilities needed to filter out irrelevant details and restructure key points before reasoning over the text and responding, which leads to suboptimal performance and incorrect answers. From an information-theoretic perspective, this behavior is akin to decoding a high-entropy problem without first reducing its entropy. In this work, we first introduce GSM-Noise, a benchmark of grade-school math problems systematically perturbed to reflect real-world input variability. We show that noise can compromise the reasoning ability of open-source models (e.g., the LLaMA and Qwen series), while closed-source models are more robust. To improve LLM robustness under noisy conditions, we propose that LLMs first refine their inputs, thereby reducing their entropy, before engaging in in-depth analysis. We investigate three approaches to instill this refinement capability: prompt engineering (PE), supervised fine-tuning (SFT), and reinforcement learning (RL). Experimental results show that input refinement yields consistent performance gains: 2–12% with PE, 4–13% with SFT, and 3–25% with RL. These results highlight the importance of an explicit refinement phase for improving the robustness and reliability of LLM reasoning in real-world scenarios.
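The prompt-engineering variant of "refine first, then reason" can be pictured as two chained model calls. The following is a minimal sketch only, assuming a generic text-in/text-out model wrapper passed in as `llm`. The prompt wording, `refine_then_solve`, and `stub_llm` are hypothetical illustrations, not the paper's actual prompts or code.

```python
from typing import Callable, Tuple

# Hypothetical prompt templates (assumption): the paper's real prompts are not reproduced here.
REFINE_PROMPT = (
    "Rewrite the following math problem so that it is clear and concise. "
    "Remove irrelevant details, fix typos, and keep every quantity needed "
    "to solve it. Do not solve the problem.\n\nProblem:\n{problem}"
)
SOLVE_PROMPT = (
    "Solve the following math problem step by step and give the final "
    "answer on the last line as 'Answer: <number>'.\n\nProblem:\n{problem}"
)


def refine_then_solve(problem: str, llm: Callable[[str], str]) -> Tuple[str, str]:
    """Stage 1 refines the noisy input; stage 2 reasons over the refined text only."""
    refined = llm(REFINE_PROMPT.format(problem=problem)).strip()
    # If the refinement step returns nothing, fall back to the original input.
    if not refined:
        refined = problem
    solution = llm(SOLVE_PROMPT.format(problem=refined))
    return refined, solution


def stub_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model, so the sketch runs offline."""
    if prompt.startswith("Rewrite"):
        return "Tom has 3 apples and buys 4 more. How many apples does he have?"
    return "3 + 4 = 7\nAnswer: 7"


if __name__ == "__main__":
    noisy = "tom hass 3 aples, his dog is brown, and he buys 4 more. how many aples??"
    refined, solution = refine_then_solve(noisy, stub_llm)
    print(refined)
    print(solution)
```

Keeping the two stages as separate calls makes the refined problem an inspectable intermediate artifact. The SFT and RL variants described above would instead train a single model to produce such a refinement itself before reasoning.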

Anthology ID: 2026.findings-acl.1748
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 35020–35045
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1748/
Cite (ACL): Zhengxin Zhang, Chengyu Huang, Xufu Liu, Dan Zhao, Jinyan Su, and Claire Cardie. 2026. GSM-Noise: Exploring and Enhancing Large Language Models’ Reasoning under Noisy Inputs. In Findings of the Association for Computational Linguistics: ACL 2026, pages 35020–35045, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): GSM-Noise: Exploring and Enhancing Large Language Models’ Reasoning under Noisy Inputs (Zhang et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1748.pdf
Checklist: 2026.findings-acl.1748.checklist.pdf

PDF Cite Search Checklist Fix data