Marlene Amorim
2026
Benchmarking Portuguese Open Information Extraction
Gabriel Silva | Mário Rodrigues | António Teixeira | Marlene Amorim
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Open Information Extraction (OIE) has seen significant advancements for English, but progress in Portuguese has been hindered by a lack of resources such as datasets and standardized evaluation benchmarks. This work addresses this critical gap by establishing a systematic and reproducible benchmark for Portuguese OIE systems. We conduct a comprehensive evaluation of eight systems, spanning a decade of research and encompassing both rule-based and neural architectures. The performance of these systems is measured against three distinct Portuguese corpora (WIKI200, CETEN200, and Gamalho) using the established CaRB methodology. Our results reveal that no single system excels across all three datasets. Rule-based models perform strongly on general text (WIKI200, CETEN200) but falter on specialized corpora (Gamalho), while neural systems demonstrate more consistent but not superior performance. With overall F1 scores averaging around 40%, our findings confirm that Portuguese OIE remains a largely unsolved task. This benchmark provides a baseline for future research and highlights the need for a high-quality, manually annotated gold-standard dataset to drive meaningful progress in the field. The evaluation framework is publicly available at https://github.com/gabrielrsilva11/PT-OIE-Benchmark.
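As a rough illustration of how extracted triples can be scored against references with token-level overlap, the sketch below computes precision, recall, and F1 for a set of predicted triples. It is a simplified stand-in written for this listing, not the CaRB implementation or the benchmark's own code; the function names (`triple_tokens`, `shared`, `best_match`, `score`) and the example sentence are hypothetical.

```python
from collections import Counter


def triple_tokens(triple):
    """Flatten an (arg1, relation, arg2) triple into lowercased tokens."""
    return [tok for part in triple for tok in part.lower().split()]


def shared(a, b):
    """Size of the multiset intersection of two token lists."""
    return sum((Counter(a) & Counter(b)).values())


def best_match(tokens, candidates):
    """Largest fraction of `tokens` covered by any single candidate."""
    if not tokens:
        return 0.0
    return max((shared(tokens, c) / len(tokens) for c in candidates), default=0.0)


def score(predicted, gold):
    """Return (precision, recall, f1) under the simplified matching above.

    Assumption: each triple is matched to its best-overlapping counterpart;
    real scorers such as CaRB apply more careful matching rules.
    """
    pred = [triple_tokens(t) for t in predicted]
    ref = [triple_tokens(t) for t in gold]
    precision = sum(best_match(p, ref) for p in pred) / len(pred) if pred else 0.0
    recall = sum(best_match(g, pred) for g in ref) / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical example: the prediction drops one token from the relation.
    gold = [("Lisboa", "é a capital de", "Portugal")]
    predicted = [("Lisboa", "é capital de", "Portugal")]
    print(score(predicted, gold))
```

In this toy case the prediction contains only tokens that appear in the reference, so precision is perfect, while the missing relation token lowers recall.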
2025
Inductive Learning on Heterogeneous Graphs Enhanced by LLMs for Software Mention Detection
Gabriel Silva | Mário Rodrigues | António Teixeira | Marlene Amorim
Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025)
This paper explores the synergy between Knowledge Graphs (KGs), Graph Machine Learning (Graph ML), and Large Language Models (LLMs) for multilingual Named Entity Recognition (NER) and Relation Extraction (RE), specifically targeting software mentions within the SOMD 2025 challenge. We propose a methodology where documents are first transformed into heterogeneous KGs enriched with linguistic features (Universal Dependencies) and external knowledge (entity linking). An inductive GraphSAGE model, operating on PyTorch Geometric's HeteroData structure with dynamically generated multilingual embeddings, performs node classification tasks. For NER, Graph ML identifies candidate entities and types, with an LLM (DeepSeek v3) acting as a validation layer. For RE, Graph ML predicts dependency path convergence points indicative of relations, while the LLM classifies the relation type and direction based on entity context. Our results demonstrate the potential of this hybrid approach, showing significant performance gains post-competition (NER Phase 2 Macro F1 improved from 0.2953 to 0.4364; RE Phase 1 Macro F1 of 0.3355), which are reported in this paper, and highlighting the benefits of integrating structured graph learning with LLM reasoning for information extraction.
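For readers unfamiliar with the Macro F1 figures quoted above, the following minimal sketch shows how a macro-averaged F1 is computed over per-item labels: F1 is calculated for each class separately and the per-class values are then averaged with equal weight. The function name `macro_f1` and the label names are hypothetical, and the official SOMD 2025 scorer may differ (for example, by scoring at span level).

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    per_class = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        per_class.append(2 * precision * recall / total if total else 0.0)
    return sum(per_class) / len(per_class) if per_class else 0.0


if __name__ == "__main__":
    # Hypothetical labels for four tokens; one "Application" is mislabeled.
    gold = ["Application", "Application", "Version", "O"]
    pred = ["Application", "Version", "Version", "O"]
    print(round(macro_f1(gold, pred), 4))
```

Because every class counts equally, rare entity or relation types influence the macro average as much as frequent ones, which is one reason macro scores tend to be lower than micro-averaged ones on imbalanced label sets.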