Andrew Owens
2025
Masked Diffusion Captioning for Visual Feature Learning
Chao Feng | Zihao Wei | Andrew Owens
Findings of the Association for Computational Linguistics: EMNLP 2025
We learn visual features by captioning images with an image-conditioned masked diffusion language model, a formulation we call masked diffusion captioning (MDC). During training, text tokens in each image–caption pair are masked at a randomly chosen ratio, and a decoder conditioned on visual features is trained to reconstruct the original text. After training, the learned visual features can be applied to downstream vision tasks. Unlike in autoregressive captioning, the strength of the visual learning signal in MDC does not depend on each token's position in the sequence, reducing the need for auxiliary objectives. Linear probing experiments across a variety of academic-scale models and datasets show that the learned visual features are competitive with those produced by autoregressive and contrastive approaches.
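The training objective described above lends itself to a compact sketch. The following minimal PyTorch example illustrates one MDC-style training step under stated assumptions: vision_encoder, decoder, mask_token_id, and pad_token_id are hypothetical placeholders for an image encoder, a visually conditioned text decoder (e.g. via cross-attention), and special token ids; this is an illustrative sketch, not the paper's actual code.

import torch
import torch.nn.functional as F

def mdc_training_step(images, captions, vision_encoder, decoder,
                      mask_token_id, pad_token_id):
    """One masked-diffusion-captioning-style step: mask caption tokens at a
    randomly chosen ratio, then reconstruct them conditioned on visual
    features. All module and argument names are illustrative assumptions."""
    B, T = captions.shape

    # Sample one masking ratio per example, uniform in [0, 1).
    ratios = torch.rand(B, 1, device=captions.device)

    # Mask each non-padding token independently with its example's ratio.
    is_text = captions != pad_token_id
    masked = (torch.rand(B, T, device=captions.device) < ratios) & is_text
    inputs = torch.where(masked,
                         torch.full_like(captions, mask_token_id),
                         captions)

    # Visual features condition the text decoder.
    visual_feats = vision_encoder(images)   # (B, N, D), assumed shape
    logits = decoder(inputs, visual_feats)  # (B, T, vocab), assumed shape

    # Reconstruction loss only on masked positions.
    loss = F.cross_entropy(logits[masked], captions[masked])
    return loss

Because the loss is averaged only over masked positions, every token contributes to the visual learning signal regardless of where it sits in the sequence, which is the property the abstract contrasts with autoregressive captioning.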
2022
Towards Understanding the Relation between Gestures and Language
Artem Abzaliev | Andrew Owens | Rada Mihalcea
Proceedings of the 29th International Conference on Computational Linguistics
In this paper, we explore the relation between gestures and language. Using a multimodal dataset of TED talks in which the language is aligned with the speakers' gestures, we adapt a semi-supervised multimodal model to learn gesture embeddings. We show that gestures are predictive of the speaker's native language, and that gesture embeddings further improve language prediction results. In addition, gesture embeddings may contain some linguistic information, as we show by probing the embeddings for psycholinguistic categories. Finally, we analyze the words that lead to the most expressive gestures and find that function words drive the expressiveness of gestures.
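The probing analysis mentioned above amounts to training a simple classifier on frozen embeddings. A minimal scikit-learn sketch, assuming hypothetical precomputed gesture_embeddings and integer-coded psycholinguistic category labels (the file paths and variable names are placeholders, not the paper's artifacts):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical inputs: frozen gesture embeddings and per-word
# psycholinguistic category labels, both placeholder files.
gesture_embeddings = np.load("gesture_embeddings.npy")  # (N, D)
category_labels = np.load("category_labels.npy")        # (N,), int codes

X_train, X_test, y_train, y_test = train_test_split(
    gesture_embeddings, category_labels, test_size=0.2, random_state=0)

# A linear probe: if it beats the majority-class baseline, the frozen
# embeddings carry some information about the probed category.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:   ", probe.score(X_test, y_test))

majority_class = np.bincount(y_train.astype(int)).argmax()
print("majority baseline:", np.mean(y_test == majority_class))

Keeping the embeddings frozen and the probe linear is the standard way to test what information an embedding already contains, rather than what a more powerful classifier could extract.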