This is an internal, incomplete preview of a proposed change to the ACL Anthology.
For efficiency reasons, we don't generate MODS or Endnote formats, and the preview may be incomplete in other ways, or contain mistakes.
Do not treat this content as an official publication.
IvanMarsic
Fixing paper assignments
Please select all papers that belong to the same person.
Indicate below which author they should be assigned to.
This paper describes our system used in SemEval-2023 Task-1: Visual Word Sense Disambiguation (VWSD). The VWSD task is to identify the correct image that corresponds to an ambiguous target word given limited textual context. To reduce word ambiguity and enhance image selection, we proposed several text augmentation techniques, such as prompting, WordNet synonyms, and text generation. We experimented with different vision-language pre-trained models to capture the joint features of the augmented text and image. Our approach achieved the best performance using a combination of GPT-3 text generation and the CLIP model. On the multilingual test sets, our system achieved an average hit rate (at top-1) of 51.11 and a mean reciprocal rank of 65.69.
We examine the utility of linguistic content and vocal characteristics for multimodal deep learning in human spoken language understanding. We present a deep multimodal network with both feature attention and modality attention to classify utterance-level speech data. The proposed hybrid attention architecture helps the system focus on learning informative representations for both modality-specific feature extraction and model fusion. The experimental results show that our system achieves state-of-the-art or competitive results on three published multimodal datasets. We also demonstrated the effectiveness and generalization of our system on a medical speech dataset from an actual trauma scenario. Furthermore, we provided a detailed comparison and analysis of traditional approaches and deep learning methods on both feature extraction and fusion.
Multimodal affective computing, learning to recognize and interpret human affect and subjective information from multiple data sources, is still a challenge because: (i) it is hard to extract informative features to represent human affects from heterogeneous inputs; (ii) current fusion strategies only fuse different modalities at abstract levels, ignoring time-dependent interactions between modalities. Addressing such issues, we introduce a hierarchical multimodal architecture with attention and word-level fusion to classify utterance-level sentiment and emotion from text and audio data. Our introduced model outperforms state-of-the-art approaches on published datasets, and we demonstrate that our model is able to visualize and interpret synchronized attention over modalities.