2024
pdf
abs
PyRater: A Python Toolkit for Annotation Analysis
Angelo Basile
|
Marc Franco-Salvador
|
Paolo Rosso
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
We introduce PyRater, an open-source Python toolkit designed for analysing corpora annotations. When creating new annotated language resources, probabilistic models of annotation are the state-of-the-art solution for identifying the best annotators, retrieving the gold standard, and more generally separating annotation signal from noise. PyRater offers a unified interface for several such models and includes an API for the addition of new ones. Additionally, the toolkit has built-in functions to read datasets with multiple annotations and plot the analysis outcomes. In this work, we also demonstrate a novel application of PyRater to zero-shot classifiers, where it effectively selects the best-performing prompt. We make PyRater available to the research community.
2023
pdf
abs
Zero-Shot Data Maps. Efficient Dataset Cartography Without Model Training
Angelo Basile
|
Marc Franco-Salvador
|
Paolo Rosso
Findings of the Association for Computational Linguistics: EMNLP 2023
Data Maps (Swayamdipta, et al. 2020) have emerged as a powerful tool for diagnosing large annotated datasets. Given a model fitted on a dataset, these maps show each data instance from the dataset in a 2-dimensional space defined by a) the model’s confidence in the true class and b) the variability of this confidence. In previous work, confidence and variability are usually computed using training dynamics, which requires the fitting of a strong model to the dataset. In this paper, we introduce a novel approach: Zero-Shot Data Maps based on fast bi-encoder networks. For each data point, confidence on the true label and variability are computed over the members of an ensemble of zero-shot models constructed with different — but semantically equivalent — label descriptions, i.e., textual representations of each class in a given label space. We conduct a comparative analysis of maps compiled using traditional training dynamics and our proposed zero-shot models across various datasets. Our findings reveal that Zero-Shot Data Maps generally match those produced by the traditional method while delivering up to a 14x speedup. The code is available [here](https://github.com/symanto-research/zeroshot-cartography).
2021
pdf
abs
PROTEST-ER: Retraining BERT for Protest Event Extraction
Tommaso Caselli
|
Osman Mutlu
|
Angelo Basile
|
Ali Hürriyetoğlu
Proceedings of the 4th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2021)
We analyze the effect of further retraining BERT with different domain specific data as an unsupervised domain adaptation strategy for event extraction. Portability of event extraction models is particularly challenging, with large performance drops affecting data on the same text genres (e.g., news). We present PROTEST-ER, a retrained BERT model for protest event extraction. PROTEST-ER outperforms a corresponding generic BERT on out-of-domain data of 8.1 points. Our best performing models reach 51.91-46.39 F1 across both domains.
pdf
abs
Probabilistic Ensembles of Zero- and Few-Shot Learning Models for Emotion Classification
Angelo Basile
|
Guillermo Pérez-Torró
|
Marc Franco-Salvador
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
Emotion Classification is the task of automatically associating a text with a human emotion. State-of-the-art models are usually learned using annotated corpora or rely on hand-crafted affective lexicons. We present an emotion classification model that does not require a large annotated corpus to be competitive. We experiment with pretrained language models in both a zero-shot and few-shot configuration. We build several of such models and consider them as biased, noisy annotators, whose individual performance is poor. We aggregate the predictions of these models using a Bayesian method originally developed for modelling crowdsourced annotations. Next, we show that the resulting system performs better than the strongest individual model. Finally, we show that when trained on few labelled data, our systems outperform fully-supervised models.
2019
pdf
abs
You Write like You Eat: Stylistic Variation as a Predictor of Social Stratification
Angelo Basile
|
Albert Gatt
|
Malvina Nissim
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Inspired by Labov’s seminal work on stylisticvariation as a function of social stratification,we develop and compare neural models thatpredict a person’s presumed socio-economicstatus, obtained through distant supervision,from their writing style on social media. Thefocus of our work is on identifying the mostimportant stylistic parameters to predict socio-economic group. In particular, we show theeffectiveness of morpho-syntactic features aspredictors of style, in contrast to lexical fea-tures, which are good predictors of topic
pdf
abs
SymantoResearch at SemEval-2019 Task 3: Combined Neural Models for Emotion Classification in Human-Chatbot Conversations
Angelo Basile
|
Marc Franco-Salvador
|
Neha Pawar
|
Sanja Štajner
|
Mara Chinea Rios
|
Yassine Benajiba
Proceedings of the 13th International Workshop on Semantic Evaluation
In this paper, we present our participation to the EmoContext shared task on detecting emotions in English textual conversations between a human and a chatbot. We propose four neural systems and combine them to further improve the results. We show that our neural ensemble systems can successfully distinguish three emotions (SAD, HAPPY, and ANGRY) and separate them from the rest (OTHERS) in a highly-imbalanced scenario. Our best system achieved a 0.77 F1-score and was ranked fourth out of 165 submissions.
2018
pdf
abs
TAJJEB at SemEval-2018 Task 2: Traditional Approaches Just Do the Job with Emoji Prediction
Angelo Basile
|
Kenny W. Lino
Proceedings of the 12th International Workshop on Semantic Evaluation
Emojis are widely used on social media andunderstanding their meaning is important forboth practical purposes (e.g. opinion mining,sentiment detection) and theoretical purposes(e.g. how different L1 speakers use them, dothey have some syntax?); this paper presents aset of experiments that aim to predict a singleemoji from a tweet. We built different mod-els and we found that the test results are verydifferent from the validation results.
2016
pdf
abs
D(H)ante: A New Set of Tools for XIII Century Italian
Angelo Basile
|
Federico Sangati
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
In this paper we describe 1) the process of converting a corpus of Dante Alighieri from a TEI XML format in to a pseudo-CoNLL format; 2) how a pos-tagger trained on modern Italian performs on Dante’s Italian 3) the performances of two different pos-taggers trained on the given corpus. We are making our conversion scripts and models available to the community. The two other models trained on the corpus performs reasonably well. The tool used for the conversion process might turn useful for bridging the gap between traditional digital humanities and modern NLP applications since the TEI original format is not usually suitable for being processed with standard NLP tools. We believe our work will serve both communities: the DH community will be able to tag new documents and the NLP world will have an easier way in converting existing documents to a standardized machine-readable format.