Shruti Palaskar


On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization
Shruti Palaskar | Akshita Bhagia | Yonatan Bisk | Florian Metze | Alan W Black | Ana Marasovic
Findings of the Association for Computational Linguistics: EMNLP 2022

Combining the visual modality with pretrained language models has been surprisingly effective for simple descriptive tasks such as image captioning. More general text generation however remains elusive. We take a step back and ask: How do these models work for more complex generative tasks, i.e. conditioning on both text and images? Are multimodal models simply visually adapted language models, or do they combine they reason jointly over modalities?We investigate these questions in the context of self-rationalization (jointly generating task labels/answers and free-text explanations) of three tasks: (i) visual question answering in VQA-X, (ii) visual commonsense reasoning in VCR, and (iii) visual-textual entailment in E-SNLI-VE. We show that recent unimodal advances, CLIP image representations and scaling of language models, do not consistently improveself-rationalization in multimodal tasks. We find that no single model type works universally best across tasks, datasets, and finetuning data sizes. Our findings motivate the need for novel general backbones that move text generation from images and text beyond image captioning.


pdf bib
Towards Understanding ASR Error Correction for Medical Conversations
Anirudh Mani | Shruti Palaskar | Sandeep Konam
Proceedings of the First Workshop on Natural Language Processing for Medical Conversations

Domain Adaptation for Automatic Speech Recognition (ASR) error correction via machine translation is a useful technique for improving out-of-domain outputs of pre-trained ASR systems to obtain optimal results for specific in-domain tasks. We use this technique on our dataset of Doctor-Patient conversations using two off-the-shelf ASR systems: Google ASR (commercial) and the ASPIRE model (open-source). We train a Sequence-to-Sequence Machine Translation model and evaluate it on seven specific UMLS Semantic types, including Pharmacological Substance, Sign or Symptom, and Diagnostic Procedure to name a few. Lastly, we breakdown, analyze and discuss the 7% overall improvement in word error rate in view of each Semantic type.


Multimodal Abstractive Summarization for How2 Videos
Shruti Palaskar | Jindřich Libovický | Spandana Gella | Florian Metze
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

In this paper, we study abstractive summarization for open-domain videos. Unlike the traditional text news summarization, the goal is less to “compress” text information but rather to provide a fluent textual summary of information that has been collected and fused from different source modalities, in our case video and audio transcripts (or text). We show how a multi-source sequence-to-sequence model with hierarchical attention can integrate information from different modalities into a coherent output, compare various models trained with different modalities and present pilot experiments on the How2 corpus of instructional videos. We also propose a new evaluation metric (Content F1) for abstractive summarization task that measures semantic adequacy rather than fluency of the summaries, which is covered by metrics like ROUGE and BLEU.