Andrea Galassi


2024

A Corpus for Sentence-Level Subjectivity Detection on English News Articles
Francesco Antici | Federico Ruggeri | Andrea Galassi | Katerina Korre | Arianna Muti | Alessandra Bardi | Alice Fedotova | Alberto Barrón-Cedeño
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We develop novel annotation guidelines for sentence-level subjectivity detection, which are not limited to language-specific cues. We use our guidelines to collect NewsSD-ENG, a corpus of 638 objective and 411 subjective sentences extracted from English news articles on controversial topics. Our corpus paves the way for subjectivity detection in English and across other languages without relying on language-specific tools, such as lexicons or machine translation. We evaluate state-of-the-art multilingual transformer-based models on the task in mono-, multi-, and cross-language settings. For this purpose, we re-annotate an existing Italian corpus. We observe that models trained in the multilingual setting achieve the best performance on the task.

2023

LEXTREME: A Multi-Lingual and Multi-Task Benchmark for the Legal Domain
Joel Niklaus | Veton Matoshi | Pooja Rani | Andrea Galassi | Matthias Stürmer | Ilias Chalkidis
Findings of the Association for Computational Linguistics: EMNLP 2023

Lately, propelled by phenomenal advances around the transformer architecture, the legal NLP field has enjoyed spectacular growth. To measure progress, well-curated and challenging benchmarks are crucial. Previous efforts have produced numerous benchmarks for general NLP models, typically based on news or Wikipedia. However, these may not fit specific domains such as law, with its unique lexicons and intricate sentence structures. Even though there is a rising need to build NLP systems for languages other than English, many benchmarks are available only in English and no multilingual benchmark exists in the legal NLP field. We survey the legal NLP literature and select 11 datasets covering 24 languages, creating LEXTREME. To fairly compare models, we propose two aggregate scores, i.e., dataset aggregate score and language aggregate score. Our results show that even the best baseline only achieves modest results, and ChatGPT also struggles with many tasks. This indicates that LEXTREME remains a challenging task with ample room for improvement. To facilitate easy use for researchers and practitioners, we release LEXTREME on Hugging Face along with a public leaderboard and the necessary code to evaluate models. We also provide a public Weights and Biases project containing all runs for transparency.

A First Attempt to Detect Misinformation in Russia-Ukraine War News through Text Similarity
Nina Khairova | Bogdan Ivasiuk | Fabrizio Lo Scudo | Carmela Comito | Andrea Galassi
Proceedings of the 4th Conference on Language, Data and Knowledge

2022

Detecting Arguments in CJEU Decisions on Fiscal State Aid
Giulia Grundler | Piera Santin | Andrea Galassi | Federico Galli | Francesco Godano | Francesca Lagioia | Elena Palmieri | Federico Ruggeri | Giovanni Sartor | Paolo Torroni
Proceedings of the 9th Workshop on Argument Mining

The successful application of argument mining in the legal domain can dramatically impact many disciplines related to law. For this purpose, we present Demosthenes, a novel corpus for argument mining in legal documents, composed of 40 decisions of the Court of Justice of the European Union on matters of fiscal state aid. The annotation specifies three hierarchical levels of information: the argumentative elements, their types, and their argument schemes. In our experimental evaluation, we address 4 different classification tasks, combining advanced language models and traditional classifiers.

Multimodal Argument Mining: A Case Study in Political Debates
Eleonora Mancini | Federico Ruggeri | Andrea Galassi | Paolo Torroni
Proceedings of the 9th Workshop on Argument Mining

We propose a study on multimodal argument mining in the domain of political debates. We collate and extend existing corpora and provide an initial empirical study on multimodal architectures, with a special emphasis on input encoding methods. Our results provide interesting indications about future directions in this important domain.

A Sentiment and Emotion Annotated Dataset for Bitcoin Price Forecasting Based on Reddit Posts
Pavlo Seroyizhko | Zhanel Zhexenova | Muhammad Zohaib Shafiq | Fabio Merizzi | Andrea Galassi | Federico Ruggeri
Proceedings of the Fourth Workshop on Financial Technology and Natural Language Processing (FinNLP)

Cryptocurrencies have gained enormous momentum in finance and are nowadays commonly adopted as a medium of exchange for online payments. After recent events during which GameStop’s stocks were believed to be influenced by the WallStreetBets subReddit, Reddit has become a very hot topic in the cryptocurrency market. The influence of public opinions on cryptocurrency price trends has inspired researchers to explore solutions that integrate such information in crypto price change forecasting. A popular integration technique represents social media opinions via sentiment features. However, this research direction is still in its infancy, and only a limited number of publicly available datasets with sentiment annotations exist. We propose a novel Bitcoin Reddit Sentiment Dataset, a ready-to-use dataset annotated with state-of-the-art sentiment and emotion recognition. The dataset contains pre-processed Reddit posts and comments about Bitcoin from several domain-related subReddits along with Bitcoin’s financial data. We evaluate several widely adopted neural architectures for crypto price change forecasting. Our results show controversial benefits of sentiment and emotion features, advocating for more sophisticated social media integration techniques. We make our dataset publicly available for research.

Combining WordNet and Word Embeddings in Data Augmentation for Legal Texts
Sezen Perçin | Andrea Galassi | Francesca Lagioia | Federico Ruggeri | Piera Santin | Giovanni Sartor | Paolo Torroni
Proceedings of the Natural Legal Language Processing Workshop 2022

Creating balanced labeled textual corpora for complex tasks, like legal analysis, is a challenging and expensive process that often requires the collaboration of domain experts. To address this problem, we propose a data augmentation method based on the combination of GloVe word embeddings and the WordNet ontology. We present an example of application in the legal domain, specifically on decisions of the Court of Justice of the European Union. Our evaluation with human experts confirms that our method is more robust than the alternatives.

2021

A Corpus for Multilingual Analysis of Online Terms of Service
Kasper Drawzeski | Andrea Galassi | Agnieszka Jablonowska | Francesca Lagioia | Marco Lippi | Hans Wolfgang Micklitz | Giovanni Sartor | Giacomo Tagiuri | Paolo Torroni
Proceedings of the Natural Legal Language Processing Workshop 2021

We present the first annotated corpus for multilingual analysis of potentially unfair clauses in online Terms of Service. The data set comprises a total of 100 contracts, obtained from 25 documents annotated in four different languages: English, German, Italian, and Polish. For each contract, potentially unfair clauses for the consumer are annotated, for nine different unfairness categories. We show how a simple yet efficient annotation projection technique based on sentence embeddings could be used to automatically transfer annotations across languages.

2020

Cross-lingual Annotation Projection in Legal Texts
Andrea Galassi | Kasper Drazewski | Marco Lippi | Paolo Torroni
Proceedings of the 28th International Conference on Computational Linguistics

We study annotation projection in text classification problems where source documents are published in multiple languages and may not be an exact translation of one another. In particular, we focus on the detection of unfair clauses in privacy policies and terms of service. We present the first English-German parallel asymmetric corpus for the task at hand. We study and compare several language-agnostic sentence-level projection methods. Our results indicate that a combination of word embeddings and dynamic time warping performs best.

2018

Argumentative Link Prediction using Residual Networks and Multi-Objective Learning
Andrea Galassi | Marco Lippi | Paolo Torroni
Proceedings of the 5th Workshop on Argument Mining

We explore the use of residual networks for argumentation mining, with an emphasis on link prediction. The method we propose makes no assumptions on document or argument structure. We evaluate it on a challenging dataset consisting of user-generated comments collected from an online platform. Results show that our model outperforms an equivalent deep network and offers results comparable with state-of-the-art methods that rely on domain knowledge.