Abstract
Thanks to recent progress in vision-language modeling and the evolving nature of news consumption, the tasks of automatic summarization and headline generation from multimodal news articles have been gaining popularity. One limitation of current approaches stems from the commonly used sophisticated modular architectures, built upon hierarchical cross-modal encoders and modality-specific decoders, which restrict a model's applicability to specific data modalities: once trained on, e.g., text+video pairs, there is no straightforward way to apply the model to text+image or text-only data. In this work, we propose a unified task formulation that utilizes a simple encoder-decoder model to generate headlines from uni- and multi-modal news articles. This model is trained jointly on data of several modalities and extends the textual decoder to handle the multimodal output.
- Anthology ID:
- 2024.findings-eacl.30
- Volume:
- Findings of the Association for Computational Linguistics: EACL 2024
- Month:
- March
- Year:
- 2024
- Address:
- St. Julian’s, Malta
- Editors:
- Yvette Graham, Matthew Purver
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 437–450
- URL:
- https://aclanthology.org/2024.findings-eacl.30
- Cite (ACL):
- Mateusz Krubiński and Pavel Pecina. 2024. Towards Unified Uni- and Multi-modal News Headline Generation. In Findings of the Association for Computational Linguistics: EACL 2024, pages 437–450, St. Julian’s, Malta. Association for Computational Linguistics.
- Cite (Informal):
- Towards Unified Uni- and Multi-modal News Headline Generation (Krubiński & Pecina, Findings 2024)
- PDF:
- https://preview.aclanthology.org/add_acl24_videos/2024.findings-eacl.30.pdf