DualFact+: A Multimodal Fact Verification Framework for Procedural Video Captioning

Cennet Oguz, Yasser Hamidullah, Josef Van Genabith, Simon Ostermann


Abstract
Evaluating factual correctness in procedural video captions is challenging because captions must reflect both the abstract procedural roles (e.g., actions, ingredients, tools, locations) and their visual execution. Existing evaluation metrics, which rely on lexical overlap or holistic semantic similarity, often miss role-specific omissions and misclassify visually present but task-irrelevant content as hallucinations. We introduce DualFact+, a role-aware, fact-level evaluation framework that distinguishes conceptual facts, encoding ontology-based role typing of procedural steps (Action, Object or Ingredient, Tool, Location), from contextual facts, encoding video-grounded predicate–argument relations that specify how these roles are instantiated during execution. To enable complete and role-consistent evaluation, DualFact+ incorporates visually grounded implicit arguments and contrastive fact sets, and operates in two complementary modes: DualFact-C for text-based verification and DualFact-V for video-grounded verification. Experiments on YouCook3-Fact and CraftBench-Fact show that state-of-the-art captioning models produce fluent but often incomplete descriptions with systematic role-level errors. DualFact+ achieves stronger correlation with human factuality judgments than standard lexical and embedding-based metrics, highlighting the importance of role-aware evaluation for procedural video understanding.
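The distinction between conceptual and contextual facts can be pictured with a small data-structure sketch. This is illustrative only and is not the authors' implementation: the four role labels come from the abstract, while the class names, the `missing_by_role` helper, the exact-match comparison, and the example cooking step are all hypothetical assumptions made for this sketch.

```python
from dataclasses import dataclass

# Role labels as listed in the abstract; everything else below is hypothetical.
ROLES = ("Action", "Object or Ingredient", "Tool", "Location")


@dataclass(frozen=True)
class ConceptualFact:
    """A role-typed element of a procedural step, e.g. (Tool, 'fork')."""
    role: str
    value: str


@dataclass(frozen=True)
class ContextualFact:
    """A predicate-argument relation showing how roles are instantiated."""
    predicate: str
    arguments: tuple  # ((role, value), ...)


def missing_by_role(reference, caption):
    """Group reference facts that the caption omits, keyed by role.

    Assumption: facts are compared by exact equality, which is far simpler
    than any text-based or video-grounded verification.
    """
    caption_set = set(caption)
    missing = {role: [] for role in ROLES}
    for fact in reference:
        if fact not in caption_set:
            missing[fact.role].append(fact.value)
    return missing


# Hypothetical reference facts for one step: "whisk the eggs in a bowl with a fork".
reference = [
    ConceptualFact("Action", "whisk"),
    ConceptualFact("Object or Ingredient", "eggs"),
    ConceptualFact("Tool", "fork"),
    ConceptualFact("Location", "bowl"),
]

# Hypothetical facts extracted from a fluent but incomplete caption: "whisk the eggs".
caption = [
    ConceptualFact("Action", "whisk"),
    ConceptualFact("Object or Ingredient", "eggs"),
]

# The same step expressed as a single contextual (predicate-argument) fact.
step = ContextualFact(
    "whisk",
    (("Object or Ingredient", "eggs"), ("Tool", "fork"), ("Location", "bowl")),
)

omissions = missing_by_role(reference, caption)
```

In this toy case the caption covers the Action and Ingredient roles but omits the Tool and Location, the kind of role-specific omission that a lexical-overlap score would not attribute to any particular role.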
Anthology ID:
2026.findings-acl.1912
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
38356–38371
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1912/
Cite (ACL):
Cennet Oguz, Yasser Hamidullah, Josef Van Genabith, and Simon Ostermann. 2026. DualFact+: A Multimodal Fact Verification Framework for Procedural Video Captioning. In Findings of the Association for Computational Linguistics: ACL 2026, pages 38356–38371, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
DualFact+: A Multimodal Fact Verification Framework for Procedural Video Captioning (Oguz et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1912.pdf
Checklist:
 2026.findings-acl.1912.checklist.pdf