Trevor Darrell


2021

pdf bib
Modular Networks for Compositional Instruction Following
Rodolfo Corona | Daniel Fried | Coline Devin | Dan Klein | Trevor Darrell
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Standard architectures used in instruction following often struggle on novel compositions of subgoals (e.g. navigating to landmarks or picking up objects) observed during training. We propose a modular architecture for following natural language instructions that describe sequences of diverse subgoals. In our approach, subgoal modules each carry out natural language instructions for a specific subgoal type. A sequence of modules to execute is chosen by learning to segment the instructions and predicting a subgoal type for each segment. When compared to standard, non-modular sequence-to-sequence approaches on ALFRED, a challenging instruction following benchmark, we find that modularization improves generalization to novel subgoal compositions, as well as to environments unseen in training.

pdf bib
NewsCLIPpings: Automatic Generation of Out-of-Context Multimodal Media
Grace Luo | Trevor Darrell | Anna Rohrbach
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Online misinformation is a prevalent societal issue, with adversaries relying on tools ranging from cheap fakes to sophisticated deep fakes. We are motivated by the threat scenario where an image is used out of context to support a certain narrative. While some prior datasets for detecting image-text inconsistency generate samples via text manipulation, we propose a dataset where both image and text are unmanipulated but mismatched. We introduce several strategies for automatically retrieving convincing images for a given caption, capturing cases with inconsistent entities or semantic context. Our large-scale automatically generated the NewsCLIPpings Dataset: (1) demonstrates that machine-driven image repurposing is now a realistic threat, and (2) provides samples that represent challenging instances of mismatch between text and image in news that are able to mislead humans. We benchmark several state-of-the-art multimodal models on our dataset and analyze their performance across different pretraining domains and visual backbones.

2019

pdf bib
Are You Looking? Grounding to Multiple Modalities in Vision-and-Language Navigation
Ronghang Hu | Daniel Fried | Anna Rohrbach | Dan Klein | Trevor Darrell | Kate Saenko
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Vision-and-Language Navigation (VLN) requires grounding instructions, such as “turn right and stop at the door”, to routes in a visual environment. The actual grounding can connect language to the environment through multiple modalities, e.g. “stop at the door” might ground into visual objects, while “turn right” might rely only on the geometric structure of a route. We investigate where the natural language empirically grounds under two recent state-of-the-art VLN models. Surprisingly, we discover that visual features may actually hurt these models: models which only use route structure, ablating visual features, outperform their visual counterparts in unseen new environments on the benchmark Room-to-Room dataset. To better use all the available modalities, we propose to decompose the grounding procedure into a set of expert models with access to different modalities (including object detections) and ensemble them at prediction time, improving the performance of state-of-the-art models on the VLN task.

2018

pdf bib
Localizing Moments in Video with Temporal Language
Lisa Anne Hendricks | Oliver Wang | Eli Shechtman | Josef Sivic | Trevor Darrell | Bryan Russell
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Localizing moments in a longer video via natural language queries is a new, challenging task at the intersection of language and video understanding. Though moment localization with natural language is similar to other language and vision tasks like natural language object retrieval in images, moment localization offers an interesting opportunity to model temporal dependencies and reasoning in text. We propose a new model that explicitly reasons about different temporal segments in a video, and shows that temporal context is important for localizing phrases which include temporal language. To benchmark whether our model, and other recent video localization models, can effectively reason about temporal language, we collect the novel TEMPOral reasoning in video and language (TEMPO) dataset. Our dataset consists of two parts: a dataset with real videos and template sentences (TEMPO - Template Language) which allows for controlled studies on temporal language, and a human language dataset which consists of temporal sentences annotated by humans (TEMPO - Human Language).

pdf bib
Object Hallucination in Image Captioning
Anna Rohrbach | Lisa Anne Hendricks | Kaylee Burns | Trevor Darrell | Kate Saenko
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Despite continuously improving performance, contemporary image captioning models are prone to “hallucinating” objects that are not actually in a scene. One problem is that standard metrics only measure similarity to ground truth captions and may not fully capture image relevance. In this work, we propose a new image relevance metric to evaluate current models with veridical visual labels and assess their rate of object hallucination. We analyze how captioning model architectures and learning objectives contribute to object hallucination, explore when hallucination is likely due to image misclassification or language priors, and assess how well current sentence metrics capture object hallucination. We investigate these questions on the standard image captioning benchmark, MSCOCO, using a diverse set of models. Our analysis yields several interesting findings, including that models which score best on standard sentence metrics do not always have lower hallucination and that models which hallucinate more tend to make errors driven by language priors.

2016

pdf bib
Learning to Compose Neural Networks for Question Answering
Jacob Andreas | Marcus Rohrbach | Trevor Darrell | Dan Klein
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding
Akira Fukui | Dong Huk Park | Daylen Yang | Anna Rohrbach | Trevor Darrell | Marcus Rohrbach
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

2009

pdf bib
Who is “You”? Combining Linguistic and Gaze Features to Resolve Second-Person References in Dialogue
Matthew Frampton | Raquel Fernández | Patrick Ehlen | Mario Christoudias | Trevor Darrell | Stanley Peters
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)