A European Portuguese corpus annotated for verbal idioms

David Antunes, Jorge Baptista, Nuno J. Mamede


Abstract
This paper presents the construction of VIDiom-PT, a corpus in European Portuguese annotated for verbal idioms (e.g. O Rui bateu a bota, lit.: Rui hit the boot ‘Rui died’). This linguistic resource aims to support the development of systems capable of processing such constructions in this language variety. Two tools were built to assist the annotation effort: the first detects possible instances of verbal idioms in texts, and the second provides a graphical interface for annotating them. The effort resulted in the annotation of 5,178 instances of 747 different verbal idioms in more than 200,000 sentences in European Portuguese. Inter-annotator agreement was high: Krippendorff’s alpha for nominal data reached 0.869 on the 5% of the data that 3 experts annotated independently. Part of the annotated corpus is made publicly available.
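For readers unfamiliar with the agreement measure named in the abstract, the following is a minimal sketch of the textbook coincidence-matrix formulation of Krippendorff's alpha for nominal data. It is not the authors' implementation; the function name `krippendorff_alpha_nominal`, the input layout (one list of labels per annotated item, `None` for a missing annotation), and the labels in the usage lines are assumptions made for illustration.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels (illustrative sketch).

    `units` is a list of lists: each inner list holds the labels that the
    annotators assigned to one item, with None marking a missing annotation.
    """
    # Coincidence matrix: every ordered pair of labels given to the same item
    # by two different annotators contributes 1 / (m - 1), where m is the
    # number of labels that item received.
    coincidences = Counter()
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m < 2:
            continue  # items with fewer than two labels carry no pairing information
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label, and the total number of pairable values.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("not enough pairable annotations")

    # Observed and expected disagreement (both scaled by n, which cancels).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label occurs")
    return 1.0 - observed / expected


# Hypothetical usage: three annotators label four candidate instances.
example = [
    ["idiom", "idiom", "idiom"],
    ["idiom", "literal", "idiom"],
    ["literal", "literal", "literal"],
    ["literal", "literal", None],
]
alpha = krippendorff_alpha_nominal(example)
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance, so the reported 0.869 sits well within the range usually treated as reliable.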
Anthology ID:
2025.mwe-1.7
Volume:
Proceedings of the 21st Workshop on Multiword Expressions (MWE 2025)
Month:
May
Year:
2025
Address:
Albuquerque, New Mexico, U.S.A.
Editors:
Atul Kr. Ojha, Voula Giouli, Verginica Barbu Mititelu, Mathieu Constant, Gražina Korvel, A. Seza Doğruöz, Alexandre Rademaker
Venues:
MWE | WS
Publisher:
Association for Computational Linguistics
Pages:
58–66
URL:
https://preview.aclanthology.org/landing_page/2025.mwe-1.7/
Cite (ACL):
David Antunes, Jorge Baptista, and Nuno J. Mamede. 2025. A European Portuguese corpus annotated for verbal idioms. In Proceedings of the 21st Workshop on Multiword Expressions (MWE 2025), pages 58–66, Albuquerque, New Mexico, U.S.A. Association for Computational Linguistics.
Cite (Informal):
A European Portuguese corpus annotated for verbal idioms (Antunes et al., MWE 2025)
PDF:
https://preview.aclanthology.org/landing_page/2025.mwe-1.7.pdf