Alexander G. Hauptmann

Also published as: Alex Hauptmann, Alexander Hauptmann


KAT: A Knowledge Augmented Transformer for Vision-and-Language
Liangke Gui | Borui Wang | Qiuyuan Huang | Alexander Hauptmann | Yonatan Bisk | Jianfeng Gao
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

The primary focus of recent work with large-scale transformers has been on optimizing the amount of information packed into the model’s parameters. In this work, we ask a complementary question: Can multimodal transformers leverage explicit knowledge in their reasoning? Existing, primarily unimodal, methods have explored approaches under the paradigm of knowledge retrieval followed by answer prediction, but leave open questions about the quality and relevance of the retrieved knowledge used, and how the reasoning processes over implicit and explicit knowledge should be integrated. To address these challenges, we propose the Knowledge Augmented Transformer (KAT), which achieves a strong state-of-the-art result (+6% absolute) on the open-domain multimodal task of OK-VQA. Our approach integrates implicit and explicit knowledge in an encoder-decoder architecture, while still jointly reasoning over both knowledge sources during answer generation. Additionally, explicit knowledge integration improves interpretability of model predictions in our analysis.


Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models
Po-Yao Huang | Mandela Patrick | Junjie Hu | Graham Neubig | Florian Metze | Alexander Hauptmann
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

This paper studies zero-shot cross-lingual transfer of vision-language models. Specifically, we focus on multilingual text-to-video search and propose a Transformer-based model that learns contextual multilingual multimodal embeddings. Under a zero-shot setting, we empirically demonstrate that performance degrades significantly when we query the multilingual text-video model with non-English sentences. To address this problem, we introduce a multilingual multimodal pre-training strategy, and collect a new multilingual instructional video dataset (Multi-HowTo100M) for pre-training. Experiments on VTT show that our method significantly improves video search in non-English languages without additional annotations. Furthermore, when multilingual annotations are available, our method outperforms recent baselines by a large margin in multilingual text-to-video search on VTT and VATEX, as well as in multilingual text-to-image search on Multi30K. Our model and Multi-HowTo100M are available at


Unsupervised Multimodal Neural Machine Translation with Pseudo Visual Pivoting
Po-Yao Huang | Junjie Hu | Xiaojun Chang | Alexander Hauptmann
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Unsupervised machine translation (MT) has recently achieved impressive results with monolingual corpora only. However, it is still challenging to associate source-target sentences in the latent space. As people who speak different languages biologically share similar visual systems, the potential of achieving better alignment through visual content is promising yet under-explored in unsupervised multimodal MT (MMT). In this paper, we investigate how to utilize visual content for disambiguation and for promoting latent space alignment in unsupervised MMT. Our model employs multimodal back-translation and features pseudo visual pivoting, in which we learn a shared multilingual visual-semantic embedding space and incorporate visually-pivoted captioning as additional weak supervision. The experimental results on the widely used Multi30K dataset show that the proposed model significantly improves over the state-of-the-art methods and generalizes well when images are not available at test time.

Event-Related Bias Removal for Real-time Disaster Events
Salvador Medina Maza | Evangelia Spiliopoulou | Eduard Hovy | Alexander Hauptmann
Findings of the Association for Computational Linguistics: EMNLP 2020

Social media has become an important tool to share information about crisis events such as natural disasters and mass attacks. Detecting actionable posts that contain useful information requires rapid analysis of huge volumes of data in real-time. This poses a complex problem due to the large number of posts that do not contain any actionable information. Furthermore, the classification of information in real-time systems requires training on out-of-domain data, as we do not have any data from a newly emerging crisis. Prior work focuses on models pre-trained on similar event types. However, those models capture unnecessary event-specific biases, like the location of the event, which affect the generalizability and performance of the classifiers on unseen data from a newly emerging event. In our work, we train an adversarial neural model to remove latent event-specific biases and improve the performance on tweet importance classification.


Multi-Head Attention with Diversity for Learning Grounded Multilingual Multimodal Representations
Po-Yao Huang | Xiaojun Chang | Alexander Hauptmann
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

With the aim of promoting and understanding the multilingual version of image search, we leverage visual object detection and propose a model with diverse multi-head attention to learn grounded multilingual multimodal representations. Specifically, our model attends to different types of textual semantics in two languages and visual objects for fine-grained alignments between sentences and images. We introduce a new objective function which explicitly encourages attention diversity to learn an improved visual-semantic embedding space. We evaluate our model in the German-Image and English-Image matching tasks on the Multi30K dataset, and in the Semantic Textual Similarity task with the English descriptions of visual content. Results show that our model yields a significant performance gain over other methods in all of the three tasks.

ExCL: Extractive Clip Localization Using Natural Language Descriptions
Soham Ghosh | Anuva Agarwal | Zarana Parekh | Alexander Hauptmann
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

The task of retrieving clips within videos based on a given natural language query requires cross-modal reasoning over multiple frames. Prior approaches such as sliding window classifiers are inefficient, while text-clip similarity driven ranking-based approaches such as segment proposal networks are far more complicated. In order to select the most relevant video clip corresponding to the given text description, we propose a novel extractive approach that predicts the start and end frames by leveraging cross-modal interactions between the text and video; this removes the need to retrieve and re-rank multiple proposal segments. Using recurrent networks, we encode the two modalities into a joint representation which is then used in different variants of start-end frame predictor networks. Through extensive experimentation and ablative analysis, we demonstrate that our simple and elegant approach significantly outperforms the state of the art on two datasets and has comparable performance on a third.


Vox Populi Annotation: Measuring Intensity of Ideological Perspectives by Aggregating Group Judgments
Wei-Hao Lin | Alexander Hauptmann
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Polarizing discussions about political and social issues are common in mass media. Annotations on the degree to which a sentence expresses an ideological perspective can be valuable for evaluating computer programs that can automatically identify strongly biased sentences, but such annotations remain scarce. We annotated the intensity of ideological perspectives expressed in 250 sentences by aggregating judgments from 18 annotators. We proposed methods of determining the number of annotators and assessing reliability, and showed the annotations were highly consistent across different annotator groups.


Which Side are You on? Identifying Perspectives at the Document and Sentence Levels
Wei-Hao Lin | Theresa Wilson | Janyce Wiebe | Alexander Hauptmann
Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X)

Are These Documents Written from Different Perspectives? A Test of Different Perspectives Based on Statistical Distribution Divergence
Wei-Hao Lin | Alexander Hauptmann
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics


A New Probabilistic Model for Title Generation
Rong Jin | Alexander G. Hauptmann
COLING 2002: The 19th International Conference on Computational Linguistics


Automatic Title Generation for Spoken Broadcast News
Rong Jin | Alexander G. Hauptmann
Proceedings of the First International Conference on Human Language Technology Research


A Prototype Reading Coach that Listens: Summary of Project LISTEN
Alex Hauptmann | Jack Mostow | Steven F. Roth | Matthew Kane | Adam Swift
Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994


A Comparison of Speech and Typed Input
Alexander G. Hauptmann | Alexander I. Rudnicky
Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990


Parsing Spoken Language: a Semantic Caseframe Approach
Philip J. Hayes | Alexander G. Hauptmann | Jaime G. Carbonell | Masaru Tomita
Coling 1986 Volume 1: The 11th International Conference on Computational Linguistics