Challenges for AI in Multimodal STEM Assessments: a Human-AI Comparison

Aymeric de Chillaz, Anna Sotnikova, Patrick Jermann, Antoine Bosselut


Abstract
Generative AI systems have rapidly advanced, with multimodal input capabilities enabling reasoning beyond text-based tasks. In education, these advancements could influence assessment design and question answering, presenting both opportunities and challenges. To investigate these effects, we introduce a high-quality dataset of 201 university-level STEM questions, manually annotated with features such as image type, role, problem complexity, and question format. Our study analyzes how these features affect generative AI performance compared to students. We evaluate four model families with five prompting strategies, comparing results to the average of 546 student responses per question. Although the best model correctly answers on average 58.5% of the questions using majority vote aggregation, human participants consistently outperform AI on questions involving visual components. Interestingly, human performance remains stable across question features but varies by subject, whereas AI performance is sensitive to both subject matter and question features. Finally, we provide actionable insights for educators, demonstrating how question design can enhance academic integrity by leveraging features that challenge current AI systems without increasing the cognitive burden for students.
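The "majority vote aggregation" mentioned in the abstract is a standard way to combine repeated model samples: the same question is posed several times and the modal answer is kept. Below is a minimal sketch of this aggregation step, assuming answers are short strings; the `majority_vote` helper and the sample data are hypothetical illustrations, not code from the paper.

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Return the most frequent answer among repeated model samples.

    Illustrative majority-vote aggregation: answers are normalized,
    counted, and the modal answer is returned. Ties are broken by
    first occurrence, which is how Counter.most_common orders items.
    """
    normalized = [a.strip().lower() for a in answers]
    return Counter(normalized).most_common(1)[0][0]

# Example: five sampled answers to one multiple-choice question.
samples = ["B", "b", "C", "B", "A"]
print(majority_vote(samples))  # -> "b"
```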
Anthology ID:
2025.bea-1.22
Volume:
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Ekaterina Kochmar, Bashar Alhafni, Marie Bexte, Jill Burstein, Andrea Horbach, Ronja Laarmann-Quante, Anaïs Tack, Victoria Yaneva, Zheng Yuan
Venues:
BEA | WS
Publisher:
Association for Computational Linguistics
Pages:
279–293
URL:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.bea-1.22/
Cite (ACL):
Aymeric de Chillaz, Anna Sotnikova, Patrick Jermann, and Antoine Bosselut. 2025. Challenges for AI in Multimodal STEM Assessments: a Human-AI Comparison. In Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025), pages 279–293, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Challenges for AI in Multimodal STEM Assessments: a Human-AI Comparison (de Chillaz et al., BEA 2025)
PDF:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.bea-1.22.pdf