Multiplication in Multimodal LLMs: Computation with Text, Image, and Audio Inputs

Samuel Gideon Balter, Ethan Jerzak, Connor Thomas Jerzak


Abstract
Multimodal LLMs can accurately perceive numerical content across modalities yet fail to perform exact multi-digit multiplication when the identical underlying arithmetic problem is presented as numerals, number words, images, or audio. Because existing benchmarks often lack systematically paired instances across modalities, it remains difficult to compare genuine arithmetic limits within and across model families. We therefore introduce a controlled multimodal multiplication benchmark that factorially varies digit length, digit sparsity, representation (e.g., numerals vs. number words), and modality (text, rendered images, and audio), with paired instances from a reproducible generator. We also define arithmetic load, C, as the product of the total digit count and the non-zero digit count, a compact, mechanistically motivated proxy for operation count. Across evaluations, accuracy falls sharply as C grows, often nearing zero by C > 100. Indeed, C remains predictive of performance across modalities and models, with R² > 0.5, nearing the value obtained from more complex measures of arithmetic load that count the number of intermediate arithmetic steps. A separate perception-versus-computation decomposition shows that multimodal degradation is primarily computational rather than perceptual: on matched perception checks, models are near-perfect (>99%) across modalities even when multiplication accuracy drops substantially. Beyond measuring when models fail, we ask which procedures they are predisposed to follow. We introduce a style-controlled forced-completion loss probe that scores heuristic-specific reasoning prefixes, including columnar multiplication, distributive decomposition, and rounding/compensation. Here, distributive decomposition is favored in both text and vision modalities. Heuristic-specific LoRA adapters produce near-orthogonal updates yet degrade accuracy, indicating that the base model maintains a well-tuned internal router.
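As a rough illustration of the arithmetic load C described in the abstract, the sketch below computes one possible reading of it. The abstract does not say whether digits are counted per operand or pooled, so pooling the counts over both operands is an assumption made here, and the function names count_digits and arithmetic_load are hypothetical rather than taken from the paper.

```python
def count_digits(n: int) -> tuple:
    """Return (total digits, non-zero digits) of an integer, ignoring its sign."""
    digits = str(abs(n))
    total = len(digits)
    nonzero = sum(ch != "0" for ch in digits)
    return total, nonzero


def arithmetic_load(a: int, b: int) -> int:
    """Assumed reading of C: (total digits) * (non-zero digits), pooled over both operands."""
    total_a, nonzero_a = count_digits(a)
    total_b, nonzero_b = count_digits(b)
    return (total_a + total_b) * (nonzero_a + nonzero_b)


if __name__ == "__main__":
    # Sparse operands (many zeros) get a lower load than dense ones of equal length.
    for a, b in [(12, 34), (100, 200), (123456, 789012)]:
        print(a, b, arithmetic_load(a, b))
```

Under this reading, a pair of dense six-digit operands already exceeds C = 100, the region where the abstract reports accuracy often nearing zero, while zero-heavy operands of the same length stay well below it.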
Anthology ID:
2026.findings-acl.2025
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
40766–40780
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.2025/
Cite (ACL):
Samuel Gideon Balter, Ethan Jerzak, and Connor Thomas Jerzak. 2026. Multiplication in Multimodal LLMs: Computation with Text, Image, and Audio Inputs. In Findings of the Association for Computational Linguistics: ACL 2026, pages 40766–40780, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Multiplication in Multimodal LLMs: Computation with Text, Image, and Audio Inputs (Balter et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.2025.pdf
Checklist:
2026.findings-acl.2025.checklist.pdf