Alberto Lavelli

Also published as: A. Lavelli


2024

MedMT5: An Open-Source Multilingual Text-to-Text LLM for the Medical Domain
Iker García-Ferrero | Rodrigo Agerri | Aitziber Atutxa Salazar | Elena Cabrio | Iker de la Iglesia | Alberto Lavelli | Bernardo Magnini | Benjamin Molinet | Johana Ramirez-Romero | German Rigau | Jose Maria Villa-Gonzalez | Serena Villata | Andrea Zaninello
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Research on language technology for the development of medical applications is currently a hot topic in Natural Language Understanding and Generation. Thus, a number of large language models (LLMs) have recently been adapted to the medical domain, so that they can be used as a tool for mediating in human-AI interaction. While these LLMs display competitive performance on automated medical texts benchmarks, they have been pre-trained and evaluated with a focus on a single language (English mostly). This is particularly true of text-to-text models, which typically require large amounts of domain-specific pre-training data, often not easily accessible for many languages. In this paper, we address these shortcomings by compiling, to the best of our knowledge, the largest multilingual corpus for the medical domain in four languages, namely English, French, Italian and Spanish. This new corpus has been used to train Medical mT5, the first open-source text-to-text multilingual model for the medical domain. Additionally, we present two new evaluation benchmarks for all four languages with the aim of facilitating multilingual research in this domain. A comprehensive evaluation shows that Medical mT5 outperforms both encoders and similarly sized text-to-text models for the Spanish, French, and Italian benchmarks, while being competitive with current state-of-the-art LLMs in English.

Get the Best out of 1B LLMs: Insights from Information Extraction on Clinical Documents
Saeed Farzi | Soumitra Ghosh | Alberto Lavelli | Bernardo Magnini
Proceedings of the 23rd Workshop on Biomedical Natural Language Processing

While the popularity of large, versatile language models like ChatGPT continues to rise, the landscape shifts when considering open-source models tailored to specific domains. Moreover, many areas, such as clinical documents, suffer from a scarcity of training data, often amounting to only a few hundred instances. Additionally, in certain settings, such as hospitals, cloud-based solutions pose privacy concerns, necessitating the deployment of language models on traditional hardware, such as single GPUs or powerful CPUs. To address these complexities, we conduct extensive experiments on both clinical entity detection and relation extraction in clinical documents using 1B parameter models. Our study delves into traditional fine-tuning, continuous pre-training in the medical domain, and instruction-tuning methods, providing valuable insights into their effectiveness in a multilingual setting. Our results underscore the importance of domain-specific models and pre-training for clinical natural language processing tasks. Furthermore, data augmentation using cross-lingual information improves performance in most cases, highlighting the potential for multilingual enhancements.

2022

What’s in a (dataset’s) name? The case of BigPatent
Silvia Casola | Alberto Lavelli | Horacio Saggion
Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

Sharing datasets and benchmarks has been crucial for rapidly improving Natural Language Processing models and systems. Documenting datasets’ characteristics (and any modification introduced over time) is equally important to avoid confusion and make comparisons reliable. Here, we describe the case of BigPatent, a dataset for patent summarization that exists in at least two rather different versions under the same name. While previous literature has not clearly distinguished among versions, their differences do not only lie on a surface level but also modify the dataset’s core nature and, thus, the complexity of the summarization task. While this paper describes a specific case, we aim to shed light on new challenges that might emerge in resource sharing and advocate for comprehensive documentation of datasets and models.

Proceedings of the 13th International Workshop on Health Text Mining and Information Analysis (LOUHI)
Alberto Lavelli | Eben Holderness | Antonio Jimeno Yepes | Anne-Lyse Minard | James Pustejovsky | Fabio Rinaldi
Proceedings of the 13th International Workshop on Health Text Mining and Information Analysis (LOUHI)

2021

Proceedings of the 12th International Workshop on Health Text Mining and Information Analysis
Eben Holderness | Antonio Jimeno Yepes | Alberto Lavelli | Anne-Lyse Minard | James Pustejovsky | Fabio Rinaldi
Proceedings of the 12th International Workshop on Health Text Mining and Information Analysis

2020

Comparing Machine Learning and Deep Learning Approaches on NLP Tasks for the Italian Language
Bernardo Magnini | Alberto Lavelli | Simone Magnolini
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present a comparison between deep learning and traditional machine learning methods for various NLP tasks in Italian. We carried out experiments using available datasets (e.g., from the Evalita shared tasks) on two sequence tagging tasks (i.e., named entities recognition and nominal entities recognition) and four classification tasks (i.e., lexical relations among words, semantic relations among sentences, sentiment analysis and text classification). We show that deep learning approaches outperform traditional machine learning algorithms in sequence tagging, while for classification tasks that heavily rely on semantics, approaches based on feature engineering are still competitive. We think that a similar analysis could be carried out for other languages to provide an assessment of machine learning / deep learning models across different languages.

Proceedings of the 11th International Workshop on Health Text Mining and Information Analysis
Eben Holderness | Antonio Jimeno Yepes | Alberto Lavelli | Anne-Lyse Minard | James Pustejovsky | Fabio Rinaldi
Proceedings of the 11th International Workshop on Health Text Mining and Information Analysis

FBK@SMM4H2020: RoBERTa for Detecting Medications on Twitter
Silvia Casola | Alberto Lavelli
Proceedings of the Fifth Social Media Mining for Health Applications Workshop & Shared Task

This paper describes a classifier for tweets that mention medications or supplements, based on a pretrained transformer. We developed such a system for our participation in Subtask 1 of the Social Media Mining for Health Application workshop, which featured an extremely unbalanced dataset. The model showed promising results, with an F1 of 0.8 (task mean: 0.66).

2019

Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI 2019)
Eben Holderness | Antonio Jimeno Yepes | Alberto Lavelli | Anne-Lyse Minard | James Pustejovsky | Fabio Rinaldi
Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI 2019)

2018

Proceedings of the Ninth International Workshop on Health Text Mining and Information Analysis
Alberto Lavelli | Anne-Lyse Minard | Fabio Rinaldi
Proceedings of the Ninth International Workshop on Health Text Mining and Information Analysis

PoSTWITA-UD: an Italian Twitter Treebank in Universal Dependencies
Manuela Sanguinetti | Cristina Bosco | Alberto Lavelli | Alessandro Mazzei | Oronzo Antonelli | Fabio Tamburini
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Annotating Italian Social Media Texts in Universal Dependencies
Manuela Sanguinetti | Cristina Bosco | Alessandro Mazzei | Alberto Lavelli | Fabio Tamburini
Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017)

2013

FBK-irst : A Multi-Phase Kernel Based Approach for Drug-Drug Interaction Detection and Classification that Exploits Linguistic Information
Md. Faisal Mahbub Chowdhury | Alberto Lavelli
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

FBK: Sentiment Analysis in Twitter with Tweetsted
Md. Faisal Mahbub Chowdhury | Marco Guerini | Sara Tonelli | Alberto Lavelli
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

Exploiting the Scope of Negations and Heterogeneous Features for Relation Extraction: A Case Study for Drug-Drug Interaction Extraction
Md. Faisal Mahbub Chowdhury | Alberto Lavelli
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Proceedings of the Joint Symposium on Semantic Processing. Textual Inference and Structures in Corpora
Octavian Popescu | Alberto Lavelli
Proceedings of the Joint Symposium on Semantic Processing. Textual Inference and Structures in Corpora

2012

Impact of Less Skewed Distributions on Efficiency and Effectiveness of Biomedical Relation Extraction
Md. Faisal Mahbub Chowdhury | Alberto Lavelli
Proceedings of COLING 2012: Posters

An Evaluation of the Effect of Automatic Preprocessing on Syntactic Parsing for Biomedical Relation Extraction
Md. Faisal Mahbub Chowdhury | Alberto Lavelli
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Relation extraction (RE) is an important text mining task which is the basis for further complex and advanced tasks. In state-of-the-art RE approaches, syntactic information obtained through parsing plays a crucial role. In the context of biomedical RE previous studies report usage of various automatic preprocessing techniques applied before parsing the input text. However, these studies do not specify to what extent such techniques improve RE results and to what extent they are corpus specific as well as parser specific. In this paper, we aim at addressing these issues by using various preprocessing techniques, two syntactic tree kernel based RE approaches and two different parsers on 5 widely used benchmark biomedical corpora of the protein-protein interaction (PPI) extraction task. We also provide analyses of various corpus characteristics to verify whether there are correlations between these characteristics and the RE results obtained. These analyses of corpus characteristics can be exploited to compare the 5 PPI corpora.

A treebank-based study on the influence of Italian word order on parsing performance
Anita Alicante | Cristina Bosco | Anna Corazza | Alberto Lavelli
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The aim of this paper is to contribute to the debate on the issues raised by Morphologically Rich Languages, and more precisely to investigate, in a cross-paradigm perspective, the influence of the constituent order on the data-driven parsing of one of such languages (i.e. Italian). It shows therefore new evidence from experiments on Italian, a language characterized by a rich verbal inflection, which leads to a widespread diffusion of the pro-drop phenomenon and to a relatively free word order. The experiments are performed by using state-of-the-art data-driven parsers (i.e. MaltParser and Berkeley parser) and are based on an Italian treebank available in formats that vary according to two dimensions, i.e. the paradigm of representation (dependency vs. constituency) and the level of detail of linguistic information.

A Corpus of Scientific Biomedical Texts Spanning over 168 Years Annotated for Uncertainty
Ramona Bongelli | Carla Canestrari | Ilaria Riccioni | Andrzej Zuczkowski | Cinzia Buldorini | Ricardo Pietrobon | Alberto Lavelli | Bernardo Magnini
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Uncertainty language permeates biomedical research and is fundamental for the computer interpretation of unstructured text. And yet, a coherent, cognitive-based theory to interpret Uncertainty language and guide Natural Language Processing is, to our knowledge, non-existing. The aim of our project was therefore to detect and annotate Uncertainty markers, which play a significant role in building knowledge or beliefs in readers' minds, in a biomedical research corpus. Our corpus includes 80 manually annotated articles from the British Medical Journal randomly sampled from a 168-year period. Uncertainty markers have been classified according to a theoretical framework based on a combined linguistic and cognitive theory. The corpus was manually annotated according to such principles. We performed preliminary experiments to assess the manually annotated corpus and establish a baseline for the automatic detection of Uncertainty markers. The results of the experiments show that most of the Uncertainty markers can be recognized with good accuracy.

Combining Tree Structures, Flat Features and Patterns for Biomedical Relation Extraction
Md. Faisal Mahbub Chowdhury | Alberto Lavelli
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

2011

A Study on Dependency Tree Kernels for Automatic Extraction of Protein-Protein Interaction
Faisal Md. Chowdhury | Alberto Lavelli | Alessandro Moschitti
Proceedings of BioNLP 2011 Workshop

Assessing the practical usability of an automatically annotated corpus
Md. Faisal Mahbub Chowdhury | Alberto Lavelli
Proceedings of the 5th Linguistic Annotation Workshop

2010

Disease Mention Recognition with Specific Features
Md. Faisal Mahbub Chowdhury | Alberto Lavelli
Proceedings of the 2010 Workshop on Biomedical Natural Language Processing

Comparing the Influence of Different Treebank Annotations on Dependency Parsing
Cristina Bosco | Simonetta Montemagni | Alessandro Mazzei | Vincenzo Lombardo | Felice Dell’Orletta | Alessandro Lenci | Leonardo Lesmo | Giuseppe Attardi | Maria Simi | Alberto Lavelli | Johan Hall | Jens Nilsson | Joakim Nivre
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

As the interest of the NLP community grows to develop several treebanks also for languages other than English, we observe efforts towards evaluating the impact of different annotation strategies used to represent particular languages or with reference to particular tasks. This paper contributes to the debate on the influence of resources used for the training and development on the performance of parsing systems. It presents a comparative analysis of the results achieved by three different dependency parsers developed and tested with respect to two treebanks for the Italian language, namely TUT and ISST-TANL, which differ significantly at the level of both corpus composition and adopted dependency representations.

2008

Comparing Italian parsers on a common Treebank: the EVALITA experience
Cristina Bosco | Alessandro Mazzei | Vincenzo Lombardo | Giuseppe Attardi | Anna Corazza | Alberto Lavelli | Leonardo Lesmo | Giorgio Satta | Maria Simi
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The EVALITA 2007 Parsing Task has been the first contest among parsing systems for Italian. It is the first attempt to compare the approaches and the results of the existing parsing systems specific for this language using a common treebank annotated using both a dependency and a constituency-based format. The development data set for this parsing competition was taken from the Turin University Treebank, which is annotated both in dependency and constituency format. The evaluation metrics were those standardly applied in CoNLL and PARSEVAL. The parsing results are very promising and higher than the state-of-the-art for dependency parsing of Italian. An analysis of such results is provided, which takes into account other experiences in treebank-driven parsing for Italian and for other Romance languages (in particular, the CoNLL X & 2007 shared tasks for dependency parsing). It focuses on the characteristics of data sets, i.e. type of annotation and size, parsing paradigms and approaches applied also to languages other than Italian.

2007

FBK-IRST: Kernel Methods for Semantic Relation Extraction
Claudio Giuliano | Alberto Lavelli | Daniele Pighin | Lorenza Romano
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

2006

Simple Information Extraction (SIE): A Portable and Effective IE System
Claudio Giuliano | Alberto Lavelli | Lorenza Romano
Proceedings of the Workshop on Adaptive Text Extraction and Mining (ATEM 2006)

Exploiting Shallow Linguistic Information for Relation Extraction from Biomedical Literature
Claudio Giuliano | Alberto Lavelli | Lorenza Romano
11th Conference of the European Chapter of the Association for Computational Linguistics

Investigating a Generic Paraphrase-Based Approach for Relation Extraction
Lorenza Romano | Milen Kouylekov | Idan Szpektor | Ido Dagan | Alberto Lavelli
11th Conference of the European Chapter of the Association for Computational Linguistics

2004

A Critical Survey of the Methodology for IE Evaluation
A. Lavelli | M. E. Califf | F. Ciravegna | D. Freitag | C. Giuliano | N. Kushmerick | L. Romano
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

We survey the evaluation methodology adopted in Information Extraction (IE), as defined in the MUC conferences and in later independent efforts applying machine learning to IE. We point out a number of problematic issues that may hamper the comparison between results obtained by different researchers. Some of them are common to other NLP tasks: e.g., the difficulty of exactly identifying the effects on performance of the data (sample selection and sample size), of the domain theory (features selected), and of algorithm parameter settings. Issues specific to IE evaluation include: how leniently to assess inexact identification of filler boundaries, the possibility of multiple fillers for a slot, and how the counting is performed. We argue that, when specifying an information extraction task, a number of characteristics should be clearly defined. However, in the papers only a few of them are usually explicitly specified. Our aim is to elaborate a clear and detailed experimental methodology and propose it to the IE community. The goal is to reach a widespread agreement on such proposal so that future IE evaluations will adopt the proposed methodology, making comparisons between algorithms fair and reliable. In order to achieve this goal, we will develop and make available to the community a set of tools and resources that incorporate a standardized IE methodology.

2002

SiSSA: An Infrastructure for Developing NLP Applications
Alberto Lavelli | Fabio Pianesi | Ermanno Maci | Irina Prodanof | Luca Dini | Giampaolo Mazzini
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

2001

SiSSA - An Infrastructure for NLP Application Development
Alberto Lavelli | F. Pianesi | E. Maci | I. Prodanof | L. Dini | G. Mazzini
Proceedings of the ACL 2001 Workshop on Sharing Tools and Resources

2000

Grammar Organization for Cascade-based Parsing in Information Extraction
Fabio Ciravegna | Alberto Lavelli
Proceedings of the Sixth International Workshop on Parsing Technologies

1999

Full Text Parsing using Cascades of Rules: an Information Extraction Perspective
Fabio Ciravegna | Alberto Lavelli
Ninth Conference of the European Chapter of the Association for Computational Linguistics

1997

Controlling Bottom-Up Chart Parsers through Text Chunking
Fabio Ciravegna | Alberto Lavelli
Proceedings of the Fifth International Workshop on Parsing Technologies

In this paper we propose to use text chunking for controlling a bottom-up parser. As is well known, during analysis such parsers produce many constituents not contributing to the final solution(s). Most of these constituents are introduced due to the parser's inability to check the input context around them. Preliminary text chunking allows the parser to focus directly on the constituents that seem more likely and to prune the search space when some satisfactory solutions are found. Preliminary experiments show that a CYK-like parser controlled through chunking is definitely more efficient than a traditional parser without significantly losing in correctness. Moreover the quality of possible partial results produced by the controlled parser is high. The strategy is particularly suited for tasks like Information Extraction from text (IE) where sentences are often long and complex and it is very difficult to have a complete coverage. Hence, there is a strong necessity of focusing on the most likely solutions; furthermore, in IE the quality of partial results is important.

Participatory Design for Linguistic Engineering: the Case of the GEPPETTO Development Environment
Fabio Ciravegna | Alberto Lavelli | Daniela Petrelli | Fabio Pianesi
Computational Environments for Grammar Development and Linguistic Engineering

1995

On Parsing Control for Efficient Text Analysis
Fabio Ciravegna | Alberto Lavelli
Proceedings of the Fourth International Workshop on Parsing Technologies

1992

An Approach to Multilevel Semantics for Applied Systems
Alberto Lavelli | Bernardo Magnini | Carlo Strapparava
Third Conference on Applied Natural Language Processing

1991

Bidirectional Parsing of Lexicalized Tree Adjoining Grammars
Alberto Lavelli | Giorgio Satta
Fifth Conference of the European Chapter of the Association for Computational Linguistics

1990

When Something Is Missing: Ellipsis, Coordination and the Chart
Alberto Lavelli | Oliviero Stock
COLING 1990 Volume 3: Papers presented to the 13th International Conference on Computational Linguistics