Similar Predictions, Different Processes: A Multi-Level Comparison of Human and Multimodal LLM Language Prediction

Shuqi Wang; Zhenguang Cai


Abstract

Humans and large language models (LLMs) both generate predictions during language processing, but whether they integrate structural and prosodic cues similarly during visually grounded speech remains underexplored. Multimodal LLMs that jointly process speech and vision now make it possible to compare not only what humans and models predict, but also when predictions emerge. We compared Mandarin speakers and Qwen2.5-Omni-7B on Mandarin dative constructions in a visual world paradigm (VWP), asking how these cues guide predictions about upcoming referents. Experiment 1 used a cloze-in-VWP task to assess offline prediction outputs; Experiment 2 examined online processing via human eye-tracking and a model audio-to-image cross-modal attention measure. In Experiment 1, humans and the model were both sensitive to structure and prosody, consistent with partial output-level alignment, but the model showed a larger structural effect and a condition-specific atypical prosody pattern. In Experiment 2, the time courses diverged: humans showed structural effects before the contrastive connective, whereas the model’s sensitivity emerged later, after connective onset. These findings indicate that output-level and process-level alignment can dissociate in this paradigm. This study contributes a methodology for multi-level human–model comparison and provides empirical constraints on claims about the cognitive plausibility of multimodal LLMs.

Anthology ID: 2026.conll-main.6
Volume: Proceedings of the 30th Conference on Computational Natural Language Learning
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Claire Bonial, Yevgeni Berzak
Venues: CoNLL | WS
Publisher: Association for Computational Linguistics
Pages: 70–89
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.conll-main.6/
Cite (ACL): Shuqi Wang and Zhenguang Cai. 2026. Similar Predictions, Different Processes: A Multi-Level Comparison of Human and Multimodal LLM Language Prediction. In Proceedings of the 30th Conference on Computational Natural Language Learning, pages 70–89, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): Similar Predictions, Different Processes: A Multi-Level Comparison of Human and Multimodal LLM Language Prediction (Wang & Cai, CoNLL 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.conll-main.6.pdf
