2025
PolyNarrative: A Multilingual, Multilabel, Multi-domain Dataset for Narrative Extraction from News Articles
Nikolaos Nikolaidis | Nicolas Stefanovitch | Purificação Silvano | Dimitar Iliyanov Dimitrov | Roman Yangarber | Nuno Guimarães | Elisa Sartori | Ion Androutsopoulos | Preslav Nakov | Giovanni Da San Martino | Jakub Piskorski
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We present PolyNarrative, a new multilingual dataset of news articles annotated for narratives. Narratives are overt or implicit claims, recurring across articles and languages, that promote a specific interpretation or viewpoint on an ongoing topic, often propagating mis/disinformation. We developed two-level taxonomies with coarse- and fine-grained narrative labels for two domains: (i) climate change and (ii) the military conflict between Ukraine and Russia. We collected news articles in four languages (Bulgarian, English, Portuguese, and Russian) related to the two domains and manually annotated them at the paragraph level. We make the dataset publicly available, along with experimental results of several strong baselines that assign narrative labels to news articles at the paragraph or the document level. We believe that this dataset will foster research in narrative detection and enable new research directions towards more multi-domain and highly granular narrative-related tasks.
Entity Framing and Role Portrayal in the News
Tarek Mahmoud | Zhuohan Xie | Dimitar Iliyanov Dimitrov | Nikolaos Nikolaidis | Purificação Silvano | Roman Yangarber | Shivam Sharma | Elisa Sartori | Nicolas Stefanovitch | Giovanni Da San Martino | Jakub Piskorski | Preslav Nakov
Findings of the Association for Computational Linguistics: ACL 2025
We introduce a novel multilingual and hierarchical corpus annotated for entity framing and role portrayal in news articles. The dataset uses a unique taxonomy inspired by storytelling elements, comprising 22 fine-grained roles, or archetypes, nested within three main categories: protagonist, antagonist, and innocent. Each archetype is carefully defined, capturing nuanced portrayals of entities such as guardian, martyr, and underdog for protagonists; tyrant, deceiver, and bigot for antagonists; and victim, scapegoat, and exploited for innocents. The dataset includes 1,378 recent news articles in five languages (Bulgarian, English, Hindi, European Portuguese, and Russian) focusing on two critical domains of global significance: the Ukraine-Russia War and Climate Change. Over 5,800 entity mentions have been annotated with role labels. This dataset serves as a valuable resource for research into role portrayal and has broader implications for news analysis. We describe the characteristics of the dataset and the annotation process, and we report evaluation results on fine-tuned state-of-the-art multilingual transformers and hierarchical zero-shot learning using LLMs at the level of a document, a paragraph, and a sentence.
Enhancing an Annotation Scheme for Clinical Narratives in Portuguese through Human Variation Analysis
Ana Luisa Fernandes | Purificação Silvano | António Leal | Nuno Guimarães | Rita Rb-Silva | Luís Filipe Cunha | Alípio Jorge
Proceedings of the 19th Linguistic Annotation Workshop (LAW-XIX-2025)
The development of a robust annotation scheme and corresponding guidelines is crucial for producing annotated datasets that advance both linguistic and computational research. This paper presents a case study that outlines a methodology for designing an annotation scheme and its guidelines, specifically aimed at representing morphosyntactic and semantic information regarding temporal features, as well as medical information in medical reports written in Portuguese. We detail a multi-step process that includes reviewing existing frameworks, conducting an annotation experiment to determine the optimal approach, and designing a model based on these findings. We validated the approach through a pilot experiment where we assessed the reliability and applicability of the annotation scheme and guidelines. In this experiment, two annotators independently annotated a patient’s medical report consisting of six documents using the proposed model, while a curator established the ground truth. The analysis of inter-annotator agreement and the annotation results enabled the identification of sources of human variation and provided insights for further refinement of the annotation scheme and guidelines.
2024
ISO 24617-8 Applied: Insights from Multilingual Discourse Relations Annotation in English, Polish, and Portuguese
Aleksandra Tomaszewska | Purificação Silvano | António Leal | Evelin Amorim
Proceedings of the 20th Joint ACL - ISO Workshop on Interoperable Semantic Annotation @ LREC-COLING 2024
The main objective of this study is to contribute to multilingual discourse research by employing ISO-24617 Part 8 (Semantic Relations in Discourse, Core Annotation Schema – DR-core) for annotating discourse relations. Centering around a parallel discourse relations corpus that includes English, Polish, and European Portuguese, we initiate one of the few ISO-based comparative analyses through a multilingual corpus that aligns discourse relations across these languages. In this paper, we discuss the project’s contributions, including the annotated corpus, research findings, and statistics related to the use of discourse relations. The paper further discusses the challenges encountered in complying with the ISO standard, such as defining the scope of arguments and annotating specific relation types like Expansion. Our findings highlight the necessity for clearer definitions of certain discourse relations and more precise guidelines for argument spans, especially concerning the inclusion of connectives. Additionally, the study underscores the importance of ongoing collaborative efforts to broaden the inclusion of languages and more comprehensive datasets, with the objective of widening the reach of ISO-guided multilingual discourse research.
MultiLexBATS: Multilingual Dataset of Lexical Semantic Relations
Dagmar Gromann | Hugo Goncalo Oliveira | Lucia Pitarch | Elena-Simona Apostol | Jordi Bernad | Eliot Bytyçi | Chiara Cantone | Sara Carvalho | Francesca Frontini | Radovan Garabik | Jorge Gracia | Letizia Granata | Fahad Khan | Timotej Knez | Penny Labropoulou | Chaya Liebeskind | Maria Pia Di Buono | Ana Ostroški Anić | Sigita Rackevičienė | Ricardo Rodrigues | Gilles Sérasset | Linas Selmistraitis | Mahammadou Sidibé | Purificação Silvano | Blerina Spahiu | Enriketa Sogutlu | Ranka Stanković | Ciprian-Octavian Truică | Giedre Valunaite Oleskeviciene | Slavko Zitnik | Katerina Zdravkova
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Understanding the relation between the meanings of words is an important part of comprehending natural language. Prior work has either focused on analysing lexical semantic relations in word embeddings or probing pretrained language models (PLMs), with some exceptions. Given the rarity of highly multilingual benchmarks, it is unclear to what extent PLMs capture relational knowledge and are able to transfer it across languages. To start addressing this question, we propose MultiLexBATS, a multilingual parallel dataset of lexical semantic relations adapted from BATS in 15 languages, including low-resource languages such as Bambara, Lithuanian, and Albanian. As an experiment on cross-lingual transfer of relational knowledge, we test the PLMs’ ability to (1) capture analogies across languages, and (2) predict translation targets. We find considerable differences across relation types and languages, with a clear preference for hypernymy and antonymy as well as Romance languages.
BATS-PT: Assessing Portuguese Masked Language Models in Lexico-Semantic Analogy Solving and Relation Completion
Hugo Gonçalo Oliveira | Ricardo Rodrigues | Bruno Ferreira | Purificação Silvano | Sara Carvalho
Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 1
2023
Validation of Language Agnostic Models for Discourse Marker Detection
Mariana Damova | Kostadin Mishev | Giedrė Valūnaitė-Oleškevičienė | Chaya Liebeskind | Purificação Silvano | Dimitar Trajanov | Ciprian-Octavian Truica | Elena-Simona Apostol | Christian Chiarcos | Anna Baczkowska
Proceedings of the 4th Conference on Language, Data and Knowledge
ISO-DR-core Plugs into ISO-dialogue Acts for a Cross-linguistic Taxonomy of Discourse Markers
Purificação Silvano | Mariana Damova
Proceedings of the 4th Conference on Language, Data and Knowledge
DRIPPS: a Corpus with Discourse Relations in Perfect Participial Sentences
Purificação Silvano | João Cordeiro | António Leal | Sebastião Pais
Proceedings of the 4th Conference on Language, Data and Knowledge
2022
The place of ISO-Space in Text2Story multilayer annotation scheme
António Leal | Purificação Silvano | Evelin Amorim | Inês Cantante | Fátima Silva | Alípio Mario Jorge | Ricardo Campos
Proceedings of the 18th Joint ACL - ISO Workshop on Interoperable Semantic Annotation within LREC2022
Reasoning about spatial information is fundamental in natural language to fully understand relationships between entities and/or between events. However, the complexity underlying such reasoning makes it hard to represent spatial information formally. Despite the growing interest in this topic, and the development of some frameworks, many problems persist regarding, for instance, the coverage of a wide variety of linguistic constructions and of languages. In this paper, we present a proposal for integrating ISO-Space into an ISO-based multilayer annotation scheme designed to annotate news in European Portuguese. This scheme already enables annotation at three levels (temporal, referential, and thematic) by combining postulates from ISO 24617-1, 4 and 9. Since the corpus comprises news articles, and spatial information is relevant in this kind of text, a more detailed account of space was required. The main objective of this paper is to discuss the process of integrating ISO-Space with the existing layers of our annotation scheme, assessing the compatibility of the aforementioned parts of ISO 24617, and the problems posed by the harmonization of the four layers and by some specifications of ISO-Space.
ISO-based Annotated Multilingual Parallel Corpus for Discourse Markers
Purificação Silvano | Mariana Damova | Giedrė Valūnaitė Oleškevičienė | Chaya Liebeskind | Christian Chiarcos | Dimitar Trajanov | Ciprian-Octavian Truică | Elena-Simona Apostol | Anna Baczkowska
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Discourse markers carry information about discourse structure and organization, and also signal local dependencies or the epistemological stance of the speaker. They provide instructions on how to interpret the discourse, and their study is paramount to understanding the mechanism underlying discourse organization. This paper presents a new language resource, an ISO-based annotated multilingual parallel corpus for discourse markers. The corpus comprises nine languages: Bulgarian, Lithuanian, German, European Portuguese, Hebrew, Romanian, Polish, and Macedonian, with English as a pivot language. In order to represent the meaning of the discourse markers, we propose an annotation scheme of discourse relations from ISO 24617-8 with a plug-in to ISO 24617-2 for communicative functions. We describe an experiment in which we applied the annotation scheme to assess its validity. The results reveal that, although some extensions are required to cover all the multilingual data, it provides a proper representation of the value of discourse markers. Additionally, we report some relevant contrastive phenomena concerning the interpretation of discourse markers and their role in discourse. This first step will allow us to develop deep learning methods to identify and extract discourse relations and communicative functions, and to represent that information as Linguistic Linked Open Data (LLOD).
2021
Developing a multilayer semantic annotation scheme based on ISO standards for the visualization of a newswire corpus
Purificação Silvano | António Leal | Fátima Silva | Inês Cantante | Fatima Oliveira | Alípio Mario Jorge
Proceedings of the 17th Joint ACL - ISO Workshop on Interoperable Semantic Annotation
In this paper, we describe the process of developing a multilayer semantic annotation scheme designed for extracting information from a European Portuguese corpus of news articles at three levels: temporal, referential, and semantic role labelling. The novelty of this scheme is the harmonization of parts 1, 4 and 9 of the ISO 24617 Language resource management - Semantic annotation framework. The resulting annotation scheme includes a set of entity structures (participants, events, times) and a set of links (temporal, aspectual, subordination, objectal and semantic roles) with several tags and attribute values that ensure adequate semantic and visual representations of news stories.