Abstract
Despite the success of data augmentation in improving CLIP model, existing methods that utilize LLM or SAM to enrich the information in captions still suffer from several limitations, including insufficient detail and excessive hallucinations, ultimately resulting in compromised alignment and masking the true potential of dense information. This can lead to erroneous conclusions about CLIP’s ability to handle rich data, impeding the development of more effective models. To address the limitations of existing methods, we introduce a novel pipeline that generates highly detailed, factually accurate captions for images, which facilitates in-depth analysis of the potential for dense information in multimodal alignment. Contrary to previous findings, our investigation revealed that lengthening captions boosts performance across diverse benchmarks, even surpassing the effectiveness of meticulously crafted hard negative samples. Building on these insights, DELIP is introduced, demonstrably enhancing both foundational multimodal alignment and compositional reasoning abilities. Finally, we explore strategies to expand the context window of the text encoder, unlocking the potential of richer data for CLIP and paving the way for advancements in leveraging dense information for multimodal alignment.- Anthology ID:
- 2024.findings-acl.797
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2024
- Month:
- August
- Year:
- 2024
- Address:
- Bangkok, Thailand
- Editors:
- Lun-Wei Ku, Andre Martins, Vivek Srikumar
- Venue:
- Findings
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 13440–13451
- Language:
- URL:
- https://preview.aclanthology.org/build-pipeline-with-new-library/2024.findings-acl.797/
- DOI:
- 10.18653/v1/2024.findings-acl.797
- Cite (ACL):
- Zhiyuan Fan, Zhihong Chen, and Benyou Wang. 2024. Exploring the Potential of Dense Information in Multimodal Alignment. In Findings of the Association for Computational Linguistics: ACL 2024, pages 13440–13451, Bangkok, Thailand. Association for Computational Linguistics.
- Cite (Informal):
- Exploring the Potential of Dense Information in Multimodal Alignment (Fan et al., Findings 2024)
- PDF:
- https://preview.aclanthology.org/build-pipeline-with-new-library/2024.findings-acl.797.pdf