ProMQA-Assembly: Multimodal Procedural QA Dataset on Assembly

Kimihiro Hasegawa, Wiradee Imrattanatrai, Masaki Asada, Susan E. Holm, Yuran Wang, Xuanang Zhou, Ken Fukuda, Teruko Mitamura


Abstract
Assistants for assembly tasks have great potential to benefit humans, from helping with everyday tasks to interacting in industrial settings. However, evaluation resources for assembly activities remain underexplored. To foster system development, we propose a new multimodal QA evaluation dataset on assembly activities. Our dataset, ProMQA-Assembly, consists of 646 QA pairs that require multimodal understanding of human activity videos and their instruction manuals in an online-style manner. To make data creation cost-effective, we adopt a semi-automated QA annotation approach in which LLMs generate candidate QA pairs and humans verify them. We further improve QA generation by integrating fine-grained action labels to diversify question types. Additionally, we create 81 instruction task graphs for our target assembly tasks. These newly created task graphs are used in our benchmarking experiment and to facilitate the human verification process. With our dataset, we benchmark models, including competitive proprietary multimodal models. We find that ProMQA-Assembly contains challenging multimodal questions, on which reasoning models show promising results. We believe our new evaluation dataset contributes to the further development of procedural-activity assistants.
Anthology ID:
2026.lrec-main.714
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
ELRA Language Resources Association
Pages:
9082–9104
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.714/
Cite (ACL):
Kimihiro Hasegawa, Wiradee Imrattanatrai, Masaki Asada, Susan E. Holm, Yuran Wang, Xuanang Zhou, Ken Fukuda, and Teruko Mitamura. 2026. ProMQA-Assembly: Multimodal Procedural QA Dataset on Assembly. International Conference on Language Resources and Evaluation, main:9082–9104.
Cite (Informal):
ProMQA-Assembly: Multimodal Procedural QA Dataset on Assembly (Hasegawa et al., LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.714.pdf