2025
Generating Questions, Answers, and Distractors for Videos: Exploring Semantic Uncertainty of Object Motions
Wenjian Ding | Yao Zhang | Jun Wang | Adam Jatowt | Zhenglu Yang
Findings of the Association for Computational Linguistics: ACL 2025
Video Question-Answer-Distractors (QADs) show promising value for assessing how well systems perceive and comprehend multimedia content. Given the significant cost and labor demands of manual annotation, existing large-scale video QAD benchmarks are typically generated automatically from video captions. Since video captions are incomplete representations of visual content and susceptible to error propagation, direct generation of QADs from video is crucial. This work first leverages a large vision-language model for video QAD generation. To enhance the consistency and diversity of the generated QADs, we propose utilizing temporal motion to describe the video objects. In addition, we design a selection mechanism that chooses diverse temporal object motions to generate diverse QADs focusing on different objects and interactions, maximizing the overall semantic uncertainty for a given video. Evaluation on the NExT-QA and Perception Test benchmarks demonstrates that the proposed approach significantly improves both the consistency and diversity of QADs generated by a range of large vision-language models, highlighting its effectiveness and generalizability.
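The abstract does not spell out how the selection mechanism maximizes semantic uncertainty; below is a minimal sketch, assuming object motions are available as embedding vectors, of a greedy farthest-first selection that favors mutually dissimilar motions. The function name and the cosine-distance criterion are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def select_diverse_motions(motion_embs: np.ndarray, k: int) -> list[int]:
    """Greedily pick k motion embeddings that are maximally spread out
    (farthest-first under cosine similarity), as a proxy for covering
    different objects and interactions. Hypothetical stand-in for the
    paper's selection mechanism, which the abstract does not specify."""
    # Normalize rows so dot products are cosine similarities.
    embs = motion_embs / np.linalg.norm(motion_embs, axis=1, keepdims=True)
    # Start from the embedding farthest from the mean (most atypical motion).
    mean = embs.mean(axis=0)
    selected = [int(np.argmin(embs @ mean))]
    while len(selected) < k:
        sims = embs @ embs[selected].T            # (n, |selected|)
        max_sim = sims.max(axis=1)                # similarity to nearest pick
        max_sim[selected] = np.inf                # exclude already-chosen items
        selected.append(int(np.argmin(max_sim)))  # farthest-first step
    return selected

# Toy usage: 10 motion embeddings of dimension 8, pick 3 diverse ones.
rng = np.random.default_rng(0)
print(select_diverse_motions(rng.normal(size=(10, 8)), k=3))
```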
2024
Exploring Union and Intersection of Visual Regions for Generating Questions, Answers, and Distractors
Wenjian Ding | Yao Zhang | Jun Wang | Adam Jatowt | Zhenglu Yang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Multiple-choice visual question answering (VQA) requires automatically choosing a correct answer from a set of choices after reading an image. Existing efforts have been devoted to separately generating an image-related question, a correct answer, or challenging distractors. By contrast, we turn to a holistic generation and optimization of questions, answers, and distractors (QADs) in this study. This integrated generation strategy eliminates the need for human curation and guarantees information consistency. Furthermore, we are the first to propose putting the spotlight on different image regions to diversify QADs. Accordingly, a novel framework, ReBo, is formulated in this paper. ReBo cyclically generates each QAD based on a recurrent multimodal encoder, with each generation focusing on a different area of the image from those already covered by the previously generated QADs. In addition to traditional VQA comparisons with state-of-the-art approaches, we also validate the capability of ReBo to generate augmented data that benefits VQA models.
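As a rough illustration of generation that "focuses on a different area of the image" each cycle, the sketch below greedily picks the candidate region with the least IoU overlap against regions already used by earlier QADs. This is a hypothetical stand-in assuming bounding-box regions; ReBo's actual handling of unions and intersections of visual regions may differ.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def next_region(candidates: List[Box], used: List[Box]) -> Box:
    """Pick the candidate region overlapping least with regions already
    covered by earlier QADs, so each new QAD attends to a different
    part of the image."""
    if not used:
        return candidates[0]
    overlap = lambda c: max(iou(c, u) for u in used)
    return min(candidates, key=overlap)

# Toy usage: three candidate regions, one already used by a prior QAD.
cands = [(0, 0, 10, 10), (8, 8, 20, 20), (30, 30, 40, 40)]
print(next_region(cands, used=[(0, 0, 10, 10)]))  # -> (30, 30, 40, 40)
```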
2023
ACROSS: An Alignment-based Framework for Low-Resource Many-to-One Cross-Lingual Summarization
Peiyao Li | Zhengkun Zhang | Jun Wang | Liang Li | Adam Jatowt | Zhenglu Yang
Findings of the Association for Computational Linguistics: ACL 2023
This research addresses the challenges of Cross-Lingual Summarization (CLS) in low-resource scenarios and over imbalanced multilingual data. Existing CLS studies mostly resort to pipeline frameworks or multi-task methods in bilingual settings. However, they ignore the data imbalance in multilingual scenarios and do not utilize high-resource monolingual summarization data. In this paper, we propose the Aligned CROSs-lingual Summarization (ACROSS) model to tackle these issues. Our framework aligns low-resource cross-lingual data with high-resource monolingual data via contrastive and consistency losses, which helps enrich low-resource information for high-quality summaries. In addition, we introduce a data augmentation method that selects informative monolingual sentences, which facilitates a deep exploration of high-resource information and introduces new information for low-resource languages. Experiments on the CrossSum dataset show that ACROSS outperforms baseline models and obtains consistently dominant performance on 45 language pairs.
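The abstract names a contrastive loss and a consistency loss for aligning cross-lingual with monolingual representations. Below is a minimal PyTorch sketch of two common instantiations of those terms (in-batch InfoNCE plus an MSE consistency penalty), assuming paired sentence-level encoder outputs; ACROSS's exact formulations may differ.

```python
import torch
import torch.nn.functional as F

def across_style_losses(cls_repr: torch.Tensor,
                        mono_repr: torch.Tensor,
                        temperature: float = 0.1):
    """Sketch of the two alignment objectives named in the abstract,
    assuming cls_repr[i] (cross-lingual) is paired with mono_repr[i]
    (monolingual). Not the paper's verified formulation."""
    z1 = F.normalize(cls_repr, dim=-1)
    z2 = F.normalize(mono_repr, dim=-1)
    # InfoNCE-style contrastive loss: matched pairs are positives,
    # all other in-batch pairs serve as negatives.
    logits = z1 @ z2.t() / temperature
    targets = torch.arange(z1.size(0))
    contrastive = F.cross_entropy(logits, targets)
    # Consistency loss: pull paired representations together directly.
    consistency = F.mse_loss(z1, z2)
    return contrastive, consistency

# Toy usage with a random batch of 4 paired 16-dim representations.
a, b = torch.randn(4, 16), torch.randn(4, 16)
print([t.item() for t in across_style_losses(a, b)])
```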
2022
Multi-Party Empathetic Dialogue Generation: A New Task for Dialog Systems
Ling.Yu Zhu | Zhengkun Zhang | Jun Wang | Hongbin Wang | Haiying Wu | Zhenglu Yang
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Empathetic dialogue combines emotion understanding, feeling projection, and appropriate response generation. Existing work on empathetic dialogue generation concentrates on the two-party conversation scenario; multi-party dialogues, however, are pervasive in reality. Furthermore, emotion and sensibility are typically confused, and a refined empathy analysis is needed to comprehend fragile and nuanced human feelings. We address these issues by proposing a novel task, Multi-Party Empathetic Dialogue Generation, in this study. Additionally, a Static-Dynamic model for Multi-Party Empathetic Dialogue Generation, SDMPED, is introduced as a baseline that explores static sensibility and dynamic emotion for multi-party empathetic dialogue learning, the aspects that help SDMPED achieve state-of-the-art performance.
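The abstract does not detail SDMPED's architecture; the following is a purely hypothetical sketch of one way to pair a static per-speaker sensibility embedding with a dynamic per-turn emotion state, only to make the static/dynamic split concrete. All module names and shapes are assumptions.

```python
import torch
import torch.nn as nn

class StaticDynamicFusion(nn.Module):
    """Hypothetical static-dynamic split: a fixed per-speaker
    'sensibility' embedding (static) fused with an emotion state
    tracked across dialogue turns (dynamic). Not SDMPED's actual
    architecture, which the abstract does not specify."""

    def __init__(self, num_speakers: int, dim: int):
        super().__init__()
        self.sensibility = nn.Embedding(num_speakers, dim)     # static, per speaker
        self.emotion_rnn = nn.GRU(dim, dim, batch_first=True)  # dynamic, per turn
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, speaker_ids, utter_emotions):
        # speaker_ids: (batch, turns); utter_emotions: (batch, turns, dim)
        static = self.sensibility(speaker_ids)          # (batch, turns, dim)
        dynamic, _ = self.emotion_rnn(utter_emotions)   # (batch, turns, dim)
        return self.fuse(torch.cat([static, dynamic], dim=-1))

# Toy usage: 3 speakers, one 5-turn multi-party dialogue, 16-dim features.
m = StaticDynamicFusion(num_speakers=3, dim=16)
out = m(torch.tensor([[0, 1, 2, 0, 1]]), torch.randn(1, 5, 16))
print(out.shape)  # torch.Size([1, 5, 16])
```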
Modeling Temporal-Modal Entity Graph for Procedural Multimodal Machine Comprehension
Huibin Zhang | Zhengkun Zhang | Yao Zhang | Jun Wang | Yufan Li | Ning Jiang | Xin Wei | Zhenglu Yang
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Procedural Multimodal Documents (PMDs) organize textual instructions and corresponding images step by step. Comprehending PMDs and inducing their representations for downstream reasoning tasks is designated Procedural MultiModal Machine Comprehension (M3C). In this study, we approach Procedural M3C at a fine-grained level, namely the entity level, in contrast to existing explorations at the document or sentence level. We model each entity in both its temporal and cross-modal relations and propose a novel Temporal-Modal Entity Graph (TMEG). Specifically, a graph structure is formulated to capture textual and visual entities and trace their temporal-modal evolution. In addition, a graph aggregation module is introduced to conduct graph encoding and reasoning. Comprehensive experiments across three Procedural M3C tasks are conducted on the traditional dataset RecipeQA and our new dataset CraftQA, which can better evaluate the generalization of TMEG.
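To make the graph encoding concrete, here is a minimal sketch of mean-aggregation message passing over an entity graph whose edges link entities across time steps (temporal) and across text and image (modal). The normalization scheme and layer count are assumptions; TMEG's aggregation module is not specified in the abstract.

```python
import torch

def aggregate_entity_graph(node_feats: torch.Tensor,
                           adj: torch.Tensor,
                           num_layers: int = 2) -> torch.Tensor:
    """Minimal mean-aggregation message passing over an entity graph.
    Stand-in for TMEG's graph aggregation module, whose exact
    architecture is not given in the abstract."""
    # Row-normalize adjacency (with self-loops) so each node averages
    # its own features with those of its temporal/modal neighbors.
    a = adj + torch.eye(adj.size(0))
    a = a / a.sum(dim=1, keepdim=True)
    h = node_feats
    for _ in range(num_layers):
        h = torch.relu(a @ h)
    return h

# Toy usage: 4 entity nodes (2 textual, 2 visual), 8-dim features.
feats = torch.randn(4, 8)
adj = torch.tensor([[0., 1, 1, 0],   # temporal edge 0-1, modal edge 0-2
                    [1, 0, 0, 1],
                    [1, 0, 0, 1],
                    [0, 1, 1, 0]])
print(aggregate_entity_graph(feats, adj).shape)  # torch.Size([4, 8])
```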