Francesc Moreno-Noguer


Belief Revision Based Caption Re-ranker with Visual Semantic Information
Ahmed Sabir | Francesc Moreno-Noguer | Pranava Madhyastha | Lluís Padró
Proceedings of the 29th International Conference on Computational Linguistics

In this work, we focus on improving the captions generated by image-caption generation systems. We propose a novel re-ranking approach that leverages visual-semantic measures to identify the ideal caption that maximally captures the visual information in the image. Our re-ranker utilizes the Belief Revision framework (Blok et al., 2003) to calibrate the original likelihood of the top-n captions by explicitly exploiting semantic relatedness between the depicted caption and the visual context. Our experiments demonstrate the utility of our approach: we observe that our re-ranker can enhance the performance of a typical image-captioning system without any additional training or fine-tuning.
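
As a rough illustration of re-ranking top-n captions by revising their likelihoods with a semantic-relatedness score, a minimal Python sketch follows. The revision rule (raising the likelihood to an exponent that shrinks as similarity grows) is an assumption made for illustration, since the abstract does not state the formula. The function names, the `visual_prob` parameter, and the toy scores are likewise hypothetical.

```python
def revise(likelihood, similarity, visual_prob):
    """Revise a caption likelihood given its relatedness to the visual context.

    Assumed rule (illustrative, not stated in the abstract):
        revised = likelihood ** alpha
        alpha   = ((1 - similarity) / (1 + similarity)) ** (1 - visual_prob)
    For a likelihood in (0, 1), higher similarity gives a smaller alpha
    and therefore a larger revised score.
    """
    if not 0.0 < likelihood < 1.0:
        raise ValueError("likelihood must be in (0, 1)")
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must be in [0, 1]")
    if not 0.0 <= visual_prob <= 1.0:
        raise ValueError("visual_prob must be in [0, 1]")
    alpha = ((1.0 - similarity) / (1.0 + similarity)) ** (1.0 - visual_prob)
    return likelihood ** alpha


def rerank(candidates, visual_prob):
    """Sort (caption, likelihood, similarity) tuples by revised score, best first."""
    scored = [(cap, revise(p, s, visual_prob)) for cap, p, s in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical top-2 beam output: (caption, likelihood, similarity to visual context).
    beam = [
        ("a man sitting on a bench", 0.4, 0.1),
        ("a man riding a wave", 0.3, 0.8),
    ]
    for caption, score in rerank(beam, visual_prob=0.5):
        print(f"{score:.3f}  {caption}")
```

In this toy run the second caption starts with the lower likelihood but moves to the top because it is more related to the (hypothetical) visual context. No model is trained or fine-tuned; only the scores are recomputed.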


The BreakingNews Dataset
Arnau Ramisa | Fei Yan | Francesc Moreno-Noguer | Krystian Mikolajczyk
Proceedings of the Sixth Workshop on Vision and Language

We present BreakingNews, a novel dataset with approximately 100K news articles including images, text and captions, and enriched with heterogeneous meta-data (e.g. GPS coordinates and popularity metrics). The tenuous connection between the images and text in news data is appropriate to take work at the intersection of Computer Vision and Natural Language Processing to the next step, hence we hope this dataset will help spur progress in the field.

Multi-Modal Fashion Product Retrieval
Antonio Rubio Romano | LongLong Yu | Edgar Simo-Serra | Francesc Moreno-Noguer
Proceedings of the Sixth Workshop on Vision and Language

Finding a product in the fashion world can be a daunting task. Every day, e-commerce sites are updated with thousands of images and their associated metadata (textual information), deepening the problem. In this paper, we leverage both the images and textual metadata and propose a joint multi-modal embedding that maps both the text and images into a common latent space. Distances in the latent space correspond to similarity between products, allowing us to effectively perform retrieval in this latent space. We compare against existing approaches and show significant improvements in retrieval tasks on a large-scale e-commerce dataset.
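
The retrieval step described above (nearest products by distance in a shared latent space) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the embeddings are hand-written toy vectors standing in for the outputs of learned text and image encoders, Euclidean distance is an assumed choice of metric, and the names `retrieve` and `catalog` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve(query_vec, catalog, k=3):
    """Return the ids of the k catalog items closest to query_vec.

    catalog maps a product id to its embedding in the shared latent space.
    """
    ranked = sorted(catalog, key=lambda pid: euclidean(query_vec, catalog[pid]))
    return ranked[:k]


if __name__ == "__main__":
    # Toy image embeddings (hypothetical values, not learned).
    catalog = {
        "red_dress": [0.9, 0.1],
        "blue_jeans": [0.1, 0.9],
        "red_skirt": [0.8, 0.2],
    }
    # Toy embedding of a text query, assumed to live in the same space.
    text_query = [1.0, 0.0]
    print(retrieve(text_query, catalog, k=2))
```

Because text and images share one space, the same function serves text-to-image and image-to-text retrieval; only the source of the query vector changes.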


Structured Prediction with Output Embeddings for Semantic Image Annotation
Ariadna Quattoni | Arnau Ramisa | Pranava Swaroop Madhyastha | Edgar Simo-Serra | Francesc Moreno-Noguer
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies


Combining Geometric, Textual and Visual Features for Predicting Prepositions in Image Descriptions
Arnau Ramisa | Josiah Wang | Ying Lu | Emmanuel Dellandrea | Francesc Moreno-Noguer | Robert Gaizauskas
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Semantic Tuples for Evaluation of Image to Sentence Generation
Lily D. Ellebracht | Arnau Ramisa | Pranava Swaroop Madhyastha | Jose Cordero-Rama | Francesc Moreno-Noguer | Ariadna Quattoni
Proceedings of the Fourth Workshop on Vision and Language