Mong Yuan Sim
Vision-language models (VLMs) integrate textual and visual information, enabling them to process visual inputs and leverage visual information to generate predictions. Such models are in demand for tasks such as visual question answering, image captioning, and visual grounding. However, recent work has found that VLMs often rely heavily on textual information and ignore visual information, yet still achieve competitive performance on vision-language (VL) tasks. This survey reviews work on modality collapse analysis to provide insight into the causes of this unintended behavior. It also reviews probing studies of fine-grained vision-language understanding, presenting current findings on the information encoded in VL representations and highlighting potential directions for future research.
The Impression section of a radiology report summarizes the report's critical findings and thus plays a crucial role in communication between radiologists and physicians. Research on radiology report summarization mostly focuses on generating the Impression section by summarizing information from the Findings section, which typically details the radiologist's observations of the radiology images. Recent work has started to explore incorporating radiology images as input to multimodal summarization models, under the assumption that the images contain richer information and can therefore improve the quality of generated summaries. However, the actual effectiveness of radiology images remains unclear. To answer this, we conduct a thorough analysis of whether current multimodal models can utilize radiology images when summarizing the Findings section. Our analysis reveals that current multimodal models often fail to effectively utilize radiology images; for example, masking the image input leads to minimal or no performance drop. An expert annotation study further shows that radiologists do not need the images when writing the Impression section.
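To make the masking ablation concrete, here is a minimal sketch using BLIP as a stand-in multimodal model; the checkpoint, the placeholder image path, and the toy Findings text are illustrative assumptions, not the models or data analyzed in the paper.

```python
# Sketch of the image-masking ablation: generate a summary with the
# real image, then again with the pixel values zeroed out, and compare.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("chest_xray.png").convert("RGB")  # placeholder path
findings = "Findings: mild cardiomegaly, no focal consolidation."  # toy input

inputs = processor(images=image, text=findings, return_tensors="pt")

# Generation with the real image.
with_image = model.generate(**inputs, max_new_tokens=60)

# Ablation: zero out the pixel values but keep the tensor shape, so
# only the visual information changes, not the forward pass.
inputs["pixel_values"] = torch.zeros_like(inputs["pixel_values"])
without_image = model.generate(**inputs, max_new_tokens=60)

print(processor.decode(with_image[0], skip_special_tokens=True))
print(processor.decode(without_image[0], skip_special_tokens=True))
```

If the two decoded outputs barely differ, the model is effectively ignoring the visual input, which is the failure mode the abstract reports.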
Lay summarisation aims to generate a summary for a non-expert audience, allowing them to keep up to date with the latest research in a specific field. Despite significant advances in text summarisation, lay summarisation remains relatively under-explored. We present a comprehensive set of experiments and analyses investigating the effectiveness of existing pre-trained language models in generating lay summaries. When evaluated on the BioNLP shared task BioLaySumm, our submission ranked second on the relevance criterion and third overall among 21 competing teams.
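As a hedged illustration of the kind of pre-trained baseline probed here, the sketch below generates a summary with an off-the-shelf model through the Hugging Face pipeline API; the generic BART checkpoint is a stand-in, not the submission's actual fine-tuned model.

```python
# Generate a summary of a biomedical abstract with a generic
# pre-trained summarization model (illustrative stand-in only).
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

abstract = ("Biomedical abstract text to be rewritten for a "
            "non-expert audience goes here ...")

lay_summary = summarizer(abstract, max_length=120, min_length=30)
print(lay_summary[0]["summary_text"])
```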
Multi-document summarization (MDS) is the task of generating an informative and concise summary from multiple topic-related documents. Many studies have analyzed the quality of MDS datasets or models; however, none has done so from the perspective of topic preservation. In this work, we fill the gap by performing an empirical analysis on two MDS datasets and studying topic preservation in summaries generated by 8 MDS models. Our key findings are that i) the Multi-News dataset has better gold summaries than Multi-XScience in terms of topic distribution consistency, and ii) extractive approaches preserve topic information from the source documents better than abstractive approaches. We hope our findings help the development of summarization models that generate topic-focused summaries and inspire researchers to create datasets for this challenging task.
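One plausible way to operationalize topic distribution consistency is sketched below: fit LDA on the source documents and compare the source and summary topic distributions with Jensen-Shannon distance. The topic model, topic count, and divergence measure are illustrative assumptions, not necessarily the paper's exact setup.

```python
# Measure how well a generated summary preserves the topics of its
# source documents, using LDA topic mixtures (illustrative setup).
from scipy.spatial.distance import jensenshannon
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

sources = ["first topic-related document ...",
           "second topic-related document ..."]
summary = "a generated summary of the documents ..."

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(sources)

lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(X)

# Average topic distribution over the source documents.
source_topics = lda.transform(X).mean(axis=0)
# Topic distribution of the summary under the same fitted model.
summary_topics = lda.transform(vectorizer.transform([summary]))[0]

# Lower Jensen-Shannon distance = better topic preservation.
print(jensenshannon(source_topics, summary_topics))
```

A lower distance means the summary's topic mixture stays closer to the sources', which is the sense of preservation the abstract describes.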