Huma Jamil
2025
Multimodal Common Ground Annotation for Partial Information Collaborative Problem Solving
Yifan Zhu | Changsoo Jung | Kenneth Lai | Videep Venkatesha | Mariah Bradford | Jack Fitzgerald | Huma Jamil | Carine Graff | Sai Kiran Ganesh Kumar | Bruce Draper | Nathaniel Blanchard | James Pustejovsky | Nikhil Krishnaswamy
Proceedings of the 21st Joint ACL-ISO Workshop on Interoperable Semantic Annotation (ISA-21)
This project note describes challenges and procedures undertaken in annotating an audiovisual dataset capturing a multimodal situated collaborative construction task. In the task, all participants begin with different partial information and must collaborate using speech, gesture, and action to arrive at a solution that satisfies all individual pieces of private information. This rich data poses a number of annotation challenges, from small objects in a confined space to the implicit and multimodal fashion in which participants express agreement, disagreement, and beliefs. We discuss the data collection procedure, annotation schemas and tools, and future use cases.
A Graph Autoencoder Approach for Gesture Classification with Gesture AMR
Huma Jamil | Ibrahim Khebour | Kenneth Lai | James Pustejovsky | Nikhil Krishnaswamy
Proceedings of the 16th International Conference on Computational Semantics
We present a novel graph autoencoder (GAE) architecture for classifying gestures using Gesture Abstract Meaning Representation (GAMR), a structured semantic annotation framework for gestures in collaborative tasks. We leverage the inherent graphical structure of GAMR by employing Graph Neural Networks (GNNs), specifically an Edge-aware Graph Attention Network (EdgeGAT), to learn embeddings of gesture semantic representations. Using the EGGNOG dataset, which captures diverse physical gesture forms expressing similar semantics, we evaluate our GAE on a multi-label classification task for gestural actions. Results indicate that our approach significantly outperforms naive baselines and is competitive with specialized Transformer-based models like AMRBART, despite using considerably fewer parameters and no pretraining. This work highlights the effectiveness of structured graphical representations in modeling multimodal semantics, offering a scalable and efficient approach to gesture interpretation in situated human-agent collaborative scenarios.
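The abstract describes the architecture only at a high level. As a rough illustration, the sketch below combines an edge-aware graph attention encoder, an inner-product edge decoder, and a multi-label classification head in a single graph autoencoder. This is not the authors' implementation: the use of PyTorch Geometric's GATConv with edge features, the hidden dimensions, the pooling choice, and the label set size are all assumptions.

# Minimal sketch (not the paper's code): an edge-aware GAT graph autoencoder
# with a multi-label classification head. Dimensions are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GATConv, global_mean_pool

class EdgeAwareGAE(nn.Module):
    def __init__(self, node_dim, edge_dim, hidden_dim, num_labels):
        super().__init__()
        # Edge-aware attention: edge features contribute to attention scores.
        self.enc1 = GATConv(node_dim, hidden_dim, edge_dim=edge_dim)
        self.enc2 = GATConv(hidden_dim, hidden_dim, edge_dim=edge_dim)
        # Multi-label head over the pooled graph embedding.
        self.classifier = nn.Linear(hidden_dim, num_labels)

    def encode(self, x, edge_index, edge_attr):
        h = F.relu(self.enc1(x, edge_index, edge_attr))
        return self.enc2(h, edge_index, edge_attr)

    def decode(self, z, edge_index):
        # Inner-product decoder: score each (src, dst) pair for reconstruction.
        src, dst = edge_index
        return (z[src] * z[dst]).sum(dim=-1)

    def forward(self, x, edge_index, edge_attr, batch):
        z = self.encode(x, edge_index, edge_attr)
        recon_logits = self.decode(z, edge_index)                    # edge reconstruction
        label_logits = self.classifier(global_mean_pool(z, batch))   # multi-label prediction
        return recon_logits, label_logits

Training such a model would typically combine a reconstruction loss over positive and sampled negative edges with a BCEWithLogitsLoss on the multi-label targets; the paper's actual objective and hyperparameters may differ.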
2024
Multimodal Cross-Document Event Coreference Resolution Using Linear Semantic Transfer and Mixed-Modality Ensembles
Abhijnan Nath | Huma Jamil | Shafiuddin Rehan Ahmed | George Arthur Baker | Rahul Ghosh | James H. Martin | Nathaniel Blanchard | Nikhil Krishnaswamy
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Event coreference resolution (ECR) is the task of determining whether distinct mentions of events within a multi-document corpus are actually linked to the same underlying occurrence. Images of the events can facilitate resolution when language is ambiguous. Here, we propose a multimodal cross-document event coreference resolution method that integrates visual and textual cues with a simple linear map between vision and language models. As existing ECR benchmark datasets rarely provide images for all event mentions, we augment the popular ECB+ dataset with event-centric images scraped from the internet and generated using image diffusion models. We establish three methods that incorporate images and text for coreference: 1) a standard fused model with finetuning, 2) a novel linear mapping method without finetuning, and 3) an ensembling approach based on splitting mention pairs by semantic and discourse-level difficulty. We evaluate on two datasets: the augmented ECB+ and AIDA Phase 1. Our ensemble systems using cross-modal linear mapping establish an upper limit (91.9 CoNLL F1) on ECB+ ECR performance given the preprocessing assumptions used, and establish a novel baseline on AIDA Phase 1. Our results demonstrate the utility of multimodal information in ECR for certain challenging coreference problems, and highlight a need for more multimodal resources in the coreference resolution space.
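As a rough sketch of the kind of linear semantic transfer the abstract mentions, the snippet below fits a least-squares map from a vision encoder's embedding space into a text encoder's space and scores cross-modal pairs by cosine similarity in that shared space. It is illustrative only: the embedding dimensions, the paired data, and the closed-form least-squares fit are assumptions, not details taken from the paper.

# Minimal sketch (not the paper's implementation): a linear map from image
# embeddings to text embeddings, learned without finetuning either encoder.
import torch
import torch.nn.functional as F

def fit_linear_map(img_embs: torch.Tensor, txt_embs: torch.Tensor) -> torch.Tensor:
    """Least-squares W such that img_embs @ W approximates txt_embs."""
    # img_embs: (n, d_img), txt_embs: (n, d_txt) for n paired image/text mentions.
    return torch.linalg.lstsq(img_embs, txt_embs).solution  # shape (d_img, d_txt)

def cross_modal_score(img_emb: torch.Tensor, txt_emb: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between a projected image embedding and a text embedding."""
    return F.cosine_similarity(img_emb @ W, txt_emb, dim=-1)

if __name__ == "__main__":
    # Random placeholder embeddings; real inputs would come from frozen
    # vision and language encoders over paired event mentions.
    imgs, txts = torch.randn(2000, 768), torch.randn(2000, 1024)
    W = fit_linear_map(imgs, txts)
    print(cross_modal_score(imgs[0], txts[0], W))

Scores like these could then feed a pairwise coreference scorer or an ensemble that routes mention pairs by difficulty, in the spirit of the approach described above.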