Paolo Garza


2024

pdf
3MVRD: Multimodal Multi-task Multi-teacher Visually-Rich Form Document Understanding
Yihao Ding | Lorenzo Vaiani | Caren Han | Jean Lee | Paolo Garza | Josiah Poon | Luca Cagliero
Findings of the Association for Computational Linguistics: ACL 2024

This paper presents a groundbreaking multimodal, multi-task, multi-teacher joint-grained knowledge distillation model for visually-rich form document understanding. The model is designed to leverage insights from both fine-grained and coarse-grained levels by facilitating a nuanced correlation between token and entity representations, addressing the complexities inherent in form documents. Additionally, we introduce new inter-grained and cross-grained loss functions to further refine diverse multi-teacher knowledge distillation transfer process, presenting distribution gaps and a harmonised understanding of form documents. Through a comprehensive evaluation across publicly available form document understanding datasets, our proposed model consistently outperforms existing baselines, showcasing its efficacy in handling the intricate structures and content of visually complex form documents.

pdf
Beyond Accuracy Optimization: Computer Vision Losses for Large Language Model Fine-Tuning
Daniele Rege Cambrin | Giuseppe Gallipoli | Irene Benedetto | Luca Cagliero | Paolo Garza
Findings of the Association for Computational Linguistics: EMNLP 2024

Large Language Models (LLMs) have demonstrated impressive performance across various tasks. However, current training approaches combine standard cross-entropy loss with extensive data, human feedback, or ad hoc methods to enhance performance. These solutions are often not scalable or feasible due to their associated costs, complexity, or resource requirements. This study investigates the use of established semantic segmentation loss functions in natural language generation to create a versatile, practical, and scalable solution for fine-tuning different architectures. We evaluate their effectiveness in solving Math Word Problems and question answering across different models of varying sizes. For the analyzed tasks, we found that the traditional Cross-Entropy loss represents a sub-optimal choice, while models trained to minimize alternative (task-dependent) losses, such as Focal or Lovász, achieve a mean improvement of +36% on exact match without requiring additional data or human feedback. These findings suggest a promising pathway for more efficient and accessible training processes.

2023

pdf
PoliTo at SemEval-2023 Task 1: CLIP-based Visual-Word Sense Disambiguation Based on Back-Translation
Lorenzo Vaiani | Luca Cagliero | Paolo Garza
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

Visual-Word Sense Disambiguation (V-WSD) entails resolving the linguistic ambiguity in a text by selecting a clarifying image from a set of (potentially misleading) candidates. In this paper, we address V-WSD using a state-of-the-art Image-Text Retrieval system, namely CLIP. We propose to alleviate the linguistic ambiguity across multiple domains and languages via text and image augmentation. To augment the textual content we rely on back-translation with the aid of a variety of auxiliary languages. The approach based on finetuning CLIP on the full phrases is effective in accurately disambiguating words and incorporating back-translation enhance the system’s robustness and performance on the test samples written in Indo-European languages.