DualFact+: A Multimodal Fact Verification Framework for Procedural Video Captioning
Cennet Oguz, Yasser Hamidullah, Josef Van Genabith, Simon Ostermann
Abstract
Evaluating factual correctness in procedural video captions is challenging because captions must reflect both the abstract procedural roles (e.g., actions, ingredients, tools, locations) and their visual execution. Existing evaluation metrics, which rely on lexical overlap or holistic semantic similarity, often miss role-specific omissions and misclassify visually present but task-irrelevant content as hallucinations. We introduce DualFact+, a role-aware, fact-level evaluation framework that distinguishes conceptual facts, encoding ontology-based role typing of procedural steps (Action, Object or Ingredient, Tool, Location), from contextual facts, encoding video-grounded predicate–argument relations that specify how these roles are instantiated during execution. To enable complete and role-consistent evaluation, DualFact+ incorporates visually grounded implicit arguments and contrastive fact sets, and operates in two complementary modes: DualFact-C for text-based verification and DualFact-V for video-grounded verification. Experiments on YouCook3-Fact and CraftBench-Fact show that state-of-the-art captioning models produce fluent but often incomplete descriptions with systematic role-level errors. DualFact+ achieves stronger correlation with human factuality judgments than standard lexical and embedding-based metrics, highlighting the importance of role-aware evaluation for procedural video understanding.
- Anthology ID:
- 2026.findings-acl.1912
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 38356–38371
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1912/
- Cite (ACL):
- Cennet Oguz, Yasser Hamidullah, Josef Van Genabith, and Simon Ostermann. 2026. DualFact+: A Multimodal Fact Verification Framework for Procedural Video Captioning. In Findings of the Association for Computational Linguistics: ACL 2026, pages 38356–38371, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- DualFact+: A Multimodal Fact Verification Framework for Procedural Video Captioning (Oguz et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1912.pdf