Jesse Thomason


Improving Sign Recognition with Phonology
Lee Kezar | Jesse Thomason | Zed Sehyr
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

We use insights from research on American Sign Language (ASL) phonology to train models for isolated sign language recognition (ISLR), a step towards automatic sign language understanding. Our key insight is to explicitly recognize the role of phonology in sign production to achieve more accurate ISLR than existing work which does not consider sign language phonology. We train ISLR models that take in pose estimations of a signer producing a single sign to predict not only the sign but additionally its phonological characteristics, such as the handshape. These auxiliary predictions lead to a nearly 9% absolute gain in sign recognition accuracy on the WLASL benchmark, with consistent improvements in ISLR regardless of the underlying prediction model architecture. This work has the potential to accelerate linguistic research in the domain of signed languages and reduce communication barriers between deaf and hearing people.


Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions
Jing Gu | Eliana Stefani | Qi Wu | Jesse Thomason | Xin Wang
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

A long-term goal of AI research is to build intelligent agents that can communicate with humans in natural language, perceive the environment, and perform real-world tasks. Vision-and-Language Navigation (VLN) is a fundamental and interdisciplinary research topic towards this goal, and receives increasing attention from natural language processing, computer vision, robotics, and machine learning communities. In this paper, we review contemporary studies in the emerging field of VLN, covering tasks, evaluation metrics, methods, etc. Through structured analysis of current progress and challenges, we also highlight the limitations of current VLN and opportunities for future work. This paper serves as a thorough reference for the VLN research community.

Generalization Differences between End-to-End and Neuro-Symbolic Vision-Language Reasoning Systems
Wang Zhu | Jesse Thomason | Robin Jia
Findings of the Association for Computational Linguistics: EMNLP 2022

For vision-and-language reasoning tasks, both fully connectionist, end-to-end methods and hybrid, neuro-symbolic methods have achieved high in-distribution performance. In which out-of-distribution settings does each paradigm excel? We investigate this question on both single-image and multi-image visual question-answering through four types of generalization tests: a novel segment-combine test for multi-image queries, contrast set, compositional generalization, and cross-benchmark transfer.Vision-and-language end-to-end trained systems exhibit sizeable performance drops across all these tests. Neuro-symbolic methods suffer even more on cross-benchmark transfer from GQA to VQA, but they show smaller accuracy drops on the other generalization tests and their performance quickly improves by few-shot training. Overall, our results demonstrate the complementary benefits of these two paradigms, and emphasize the importance of using a diverse suite of generalization tests to fully characterize model robustness to distribution shift.

ALFRED-L: Investigating the Role of Language for Action Learning in Interactive Visual Environments
Arjun Akula | Spandana Gella | Aishwarya Padmakumar | Mahdi Namazifar | Mohit Bansal | Jesse Thomason | Dilek Hakkani-Tur
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Embodied Vision and Language Task Completion requires an embodied agent to interpret natural language instructions and egocentric visual observations to navigate through and interact with environments. In this work, we examine ALFRED, a challenging benchmark for embodied task completion, with the goal of gaining insight into how effectively models utilize language. We find evidence that sequence-to-sequence and transformer-based models trained on this benchmark are not sufficiently sensitive to changes in input language instructions. Next, we construct a new test split – ALFRED-L to test whether ALFRED models can generalize to task structures not seen during training that intuitively require the same types of language understanding required in ALFRED. Evaluation of existing models on ALFRED-L suggests that (a) models are overly reliant on the sequence in which objects are visited in typical ALFRED trajectories and fail to adapt to modifications of this sequence and (b) models trained with additional augmented trajectories are able to adapt relatively better to such changes in input language instructions.


pdf bib
Proceedings of the First Workshop on Advances in Language and Vision Research
Xin Wang | Jesse Thomason | Ronghang Hu | Xinlei Chen | Peter Anderson | Qi Wu | Asli Celikyilmaz | Jason Baldridge | William Yang Wang
Proceedings of the First Workshop on Advances in Language and Vision Research

Experience Grounds Language
Yonatan Bisk | Ari Holtzman | Jesse Thomason | Jacob Andreas | Yoshua Bengio | Joyce Chai | Mirella Lapata | Angeliki Lazaridou | Jonathan May | Aleksandr Nisnevich | Nicolas Pinto | Joseph Turian
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Language understanding research is held back by a failure to relate language to the physical world it describes and to the social interactions it facilitates. Despite the incredible effectiveness of language processing models to tackle tasks after being trained on text alone, successful linguistic communication relies on a shared experience of the world. It is this shared experience that makes utterances meaningful. Natural language processing is a diverse field, and progress throughout its development has come from new representational theories, modeling techniques, data collection paradigms, and tasks. We posit that the present success of representation learning approaches trained on large, text-only corpora requires the parallel tradition of research on the broader physical and social context of language to address the deeper questions of communication.

RMM: A Recursive Mental Model for Dialogue Navigation
Homero Roman Roman | Yonatan Bisk | Jesse Thomason | Asli Celikyilmaz | Jianfeng Gao
Findings of the Association for Computational Linguistics: EMNLP 2020

Language-guided robots must be able to both ask humans questions and understand answers. Much existing work focuses only on the latter. In this paper, we go beyond instruction following and introduce a two-agent task where one agent navigates and asks questions that a second, guiding agent answers. Inspired by theory of mind, we propose the Recursive Mental Model (RMM). The navigating agent models the guiding agent to simulate answers given candidate generated questions. The guiding agent in turn models the navigating agent to simulate navigation steps it would take to generate answers. We use the progress agents make towards the goal as a reinforcement learning reward signal to directly inform not only navigation actions, but also both question and answer generation. We demonstrate that RMM enables better generalization to novel environments. Interlocutor modelling may be a way forward for human-agent RMM where robots need to both ask and answer questions.


pdf bib
Proceedings of the Combined Workshop on Spatial Language Understanding (SpLU) and Grounded Communication for Robotics (RoboNLP)
Archna Bhatia | Yonatan Bisk | Parisa Kordjamshidi | Jesse Thomason
Proceedings of the Combined Workshop on Spatial Language Understanding (SpLU) and Grounded Communication for Robotics (RoboNLP)

Shifting the Baseline: Single Modality Performance on Visual Navigation & QA
Jesse Thomason | Daniel Gordon | Yonatan Bisk
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We demonstrate the surprising strength of unimodal baselines in multimodal domains, and make concrete recommendations for best practices in future research. Where existing work often compares against random or majority class baselines, we argue that unimodal approaches better capture and reflect dataset biases and therefore provide an important comparison when assessing the performance of multimodal techniques. We present unimodal ablations on three recent datasets in visual navigation and QA, seeing an up to 29% absolute gain in performance over published baselines.


Improving Black-box Speech Recognition using Semantic Parsing
Rodolfo Corona | Jesse Thomason | Raymond Mooney
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Speech is a natural channel for human-computer interaction in robotics and consumer applications. Natural language understanding pipelines that start with speech can have trouble recovering from speech recognition errors. Black-box automatic speech recognition (ASR) systems, built for general purpose use, are unable to take advantage of in-domain language models that could otherwise ameliorate these errors. In this work, we present a method for re-ranking black-box ASR hypotheses using an in-domain language model and semantic parser trained for a particular task. Our re-ranking method significantly improves both transcription accuracy and semantic understanding over a state-of-the-art ASR’s vanilla output.

Guiding Interaction Behaviors for Multi-modal Grounded Language Learning
Jesse Thomason | Jivko Sinapov | Raymond Mooney
Proceedings of the First Workshop on Language Grounding for Robotics

Multi-modal grounded language learning connects language predicates to physical properties of objects in the world. Sensing with multiple modalities, such as audio, haptics, and visual colors and shapes while performing interaction behaviors like lifting, dropping, and looking on objects enables a robot to ground non-visual predicates like “empty” as well as visual predicates like “red”. Previous work has established that grounding in multi-modal space improves performance on object retrieval from human descriptions. In this work, we gather behavior annotations from humans and demonstrate that these improve language grounding performance by allowing a system to focus on relevant behaviors for words like “white” or “half-full” that can be understood by looking or lifting, respectively. We also explore adding modality annotations (whether to focus on audio or haptics when performing a behavior), which improves performance, and sharing information between linguistically related predicates (if “green” is a color, “white” is a color), which improves grounding recall but at the cost of precision.

Integrated Learning of Dialog Strategies and Semantic Parsing
Aishwarya Padmakumar | Jesse Thomason | Raymond J. Mooney
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

Natural language understanding and dialog management are two integral components of interactive dialog systems. Previous research has used machine learning techniques to individually optimize these components, with different forms of direct and indirect supervision. We present an approach to integrate the learning of both a dialog strategy using reinforcement learning, and a semantic parser for robust natural language understanding, using only natural dialog interaction for supervision. Experimental results on a simulated task of robot instruction demonstrate that joint learning of both components improves dialog performance over learning either of these components alone.


Integrating Language and Vision to Generate Natural Language Descriptions of Videos in the Wild
Jesse Thomason | Subhashini Venugopalan | Sergio Guadarrama | Kate Saenko | Raymond Mooney
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers


Differences in User Responses to a Wizard-of-Oz versus Automated System
Jesse Thomason | Diane Litman
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies