Miguel Carvalho

2025

Efficient Architectures for High Resolution Vision-Language Models
Miguel Carvalho | Bruno Martins
Proceedings of the 31st International Conference on Computational Linguistics

Vision-Language Models (VLMs) have recently advanced significantly. However, challenges persist in accurately recognizing fine details within high-resolution images, which limits performance in multiple tasks. This work introduces Pheye, a novel architecture that efficiently processes high-resolution images while training fewer parameters than similarly sized VLMs. Notably, Pheye achieves high efficiency while maintaining strong performance, particularly in tasks that demand fine-grained image understanding and/or the handling of scene text.

Co-authors

Bruno Martins 1

Venues

coling 1
