Tomás Cerveira Da Cruz Pinto
Also published as: Tomás Pinto
2026
RelEx-PT: A Portuguese Sentence-Level Relation Extraction Dataset
Tomás Pinto | Catarina Silva | Hugo Gonçalo Oliveira
Proceedings of the Fifteenth Language Resources and Evaluation Conference
We introduce RelEx-PT, a new sentence-level Relation Extraction dataset for Portuguese. Addressing the scarcity of high-quality, controlled resources for the language, RelEx-PT provides a balanced benchmark comprising 18 Wikidata-derived relation types across diverse domains. The dataset is built through a distant supervision pipeline that links Wikidata triples with Portuguese Wikipedia sentences and is refined by a Natural Language Inference (NLI)-based filtering step, combining scalability with quality assurance. Additionally, we conduct baseline experiments to evaluate the dataset's applicability across diverse extraction settings, including Relation Classification (RC), Relation Triple Extraction, and Open Information Extraction. These experiments leverage both prompting and fine-tuning strategies with Large Language Models. The results show that RelEx-PT effectively supports a range of extraction paradigms, yielding high performance in RC and competitive results in structured triple generation, while also highlighting key challenges in open-ended extraction.
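As a rough illustration of the NLI filtering step described above, the sketch below scores whether a candidate sentence entails a verbalized Wikidata triple and keeps only high-confidence pairs. The model name, the verbalization template, and the 0.9 threshold are illustrative assumptions, not the paper's actual choices.

```python
# Hedged sketch: NLI-based filtering of distant-supervision candidates.
# Model, template, and threshold are assumptions, not the paper's setup.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "MoritzLaurer/mDeBERTa-v3-base-mnli-xnli"  # assumed multilingual NLI model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def entailment_score(premise: str, hypothesis: str) -> float:
    """Probability that the premise (Wikipedia sentence) entails the hypothesis."""
    inputs = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = logits.softmax(dim=-1)[0]
    # Look up the entailment index from the model config instead of hard-coding it.
    ent_idx = model.config.label2id.get("entailment", 0)
    return probs[ent_idx].item()

def keep_example(sentence: str, subj: str, obj: str,
                 template: str = "{subj} nasceu em {obj}.",  # hypothetical verbalization
                 threshold: float = 0.9) -> bool:
    """Keep a distantly supervised sentence only if it entails the verbalized triple."""
    return entailment_score(sentence, template.format(subj=subj, obj=obj)) >= threshold

# Example: a candidate sentence for a "place of birth" triple.
print(keep_example("Fernando Pessoa nasceu em Lisboa em 1888.",
                   subj="Fernando Pessoa", obj="Lisboa"))
```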
2025
Exploring Medium-Sized LLMs for Knowledge Base Construction
Tomás Cerveira Da Cruz Pinto | Hugo Gonçalo Oliveira | Chris-Bennet Fleger
Proceedings of the 5th Conference on Language, Data and Knowledge
Knowledge base construction (KBC) is one of the great challenges in Natural Language Processing (NLP) and is of fundamental importance to the growth of the Semantic Web. Large Language Models (LLMs) may be useful for extracting structured knowledge, including subject-predicate-object triples. We tackle the LM-KBC 2023 Challenge by leveraging LLMs for KBC, using its dataset and benchmarking our results against those of the challenge participants. Prompt engineering and ensemble strategies are tested for object prediction with pretrained LLMs in the 0.5-2B parameter range, which falls between the limits of tracks 1 and 2 of the challenge. Selected models are assessed with zero-shot and few-shot learning approaches when predicting the objects of 21 relations. Results demonstrate that instruction-tuned LLMs outperform generative baselines by up to four times, with relation-adapted prompts playing a crucial role in performance. The ensemble approach further enhances triple extraction, with a relation-based selection strategy achieving the highest F1 score. These findings highlight the potential of medium-sized LLMs and prompt engineering methods for efficient KBC.
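As a minimal sketch of the few-shot, relation-adapted prompting described above, the example below queries an instruction-tuned LLM for the objects of one relation. The model choice, the relation, and the prompt wording are illustrative assumptions rather than the configuration used in the paper.

```python
# Hedged sketch: few-shot object prediction for one KBC relation.
# Model and prompt are assumptions, not the paper's exact setup.
from transformers import pipeline

# Assumed instruction-tuned model within the 0.5-2B parameter range.
generator = pipeline("text-generation", model="Qwen/Qwen2.5-1.5B-Instruct")

def predict_objects(subject: str) -> str:
    """Relation-adapted prompt with two in-context examples (few-shot)."""
    messages = [
        {"role": "system",
         "content": "List the countries that share a land border with the given "
                    "country. Answer only with a comma-separated list."},
        # In-context demonstrations of the expected input/output format.
        {"role": "user", "content": "Portugal"},
        {"role": "assistant", "content": "Spain"},
        {"role": "user", "content": "Austria"},
        {"role": "assistant",
         "content": "Czechia, Germany, Hungary, Italy, Liechtenstein, "
                    "Slovakia, Slovenia, Switzerland"},
        {"role": "user", "content": subject},
    ]
    out = generator(messages, max_new_tokens=64, do_sample=False)
    # The pipeline returns the full chat; the last message is the model's answer.
    return out[0]["generated_text"][-1]["content"].strip()

print(predict_objects("Belgium"))
```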