Rafael Torres Anchiêta

2026

Extending an Ensemble Baseline with Corpus-Based Graph Features for Portuguese Pun Detection
Avelar Rodrigues de Sousa | Camilla Soares Sousa | Carlos Henrique Santos Barros | Rafael Torres Anchiêta
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1

Automatic pun detection remains challenging because it depends on lexical ambiguity and contextual interaction, which are not explicitly captured by linear text representations. In Portuguese, TF-IDF-based ensemble methods provide competitive and interpretable baselines, but remain limited by surface-level features. This work investigates whether corpus-based graph information can complement such methods. Three graph representations are constructed from the Puntuguese corpus: a Co-occurrence graph, a PPMI-weighted graph, and a Pun-Context graph. In the current pipeline, each graph is converted into low-dimensional node embeddings with TruncatedSVD, which are then aggregated into document-level features and concatenated with TF-IDF representations in a soft-voting ensemble. Experimental results on the test set show that graph-based enrichment does not uniformly improve performance: Pun-Context and PPMI yield the strongest graph-augmented results, whereas combining all graphs degrades performance. These findings indicate that the usefulness of graph-based information depends strongly on how lexical relations are encoded and aggregated at the document level.
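As a rough illustration of one of the three representations named in the abstract, the PPMI-weighted graph, the following minimal sketch builds positive-PMI edges from tokenized sentences. It assumes sentence-level co-occurrence and a base-2 logarithm; the function name `ppmi_graph` and these choices are hypothetical and are not taken from the paper, which does not state its window size or weighting details here. The later steps in the abstract (TruncatedSVD embeddings, document-level aggregation, TF-IDF concatenation) are not shown.

```python
import math
from collections import Counter
from itertools import combinations


def ppmi_graph(sentences):
    """Return {(word_a, word_b): ppmi} for word pairs with positive PMI.

    Assumption: two words co-occur when they appear in the same sentence.
    """
    # Count each unordered word pair once per sentence.
    pair_counts = Counter()
    for tokens in sentences:
        for a, b in combinations(sorted(set(tokens)), 2):
            pair_counts[(a, b)] += 1

    # Marginal counts: total co-occurrence mass attached to each word.
    row_sums = Counter()
    for (a, b), count in pair_counts.items():
        row_sums[a] += count
        row_sums[b] += count
    total = sum(row_sums.values())

    # Keep only edges whose PMI is strictly positive (the "P" in PPMI).
    edges = {}
    for (a, b), count in pair_counts.items():
        pmi = math.log2(count * total / (row_sums[a] * row_sums[b]))
        if pmi > 0:
            edges[(a, b)] = pmi
    return edges
```

Pairs that co-occur less often than their individual frequencies would predict receive no edge, which is one way such a graph can differ from a raw co-occurrence graph.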


Token-Level Pun Location Using Multi-Layer BERT with Mixture of Experts
Rafael Torres Anchiêta | Roney Lira de Sales Santos | Raimundo Santos Moura
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1

Humor processing remains a complex challenge in Natural Language Processing, particularly the task of pun location, which involves identifying the specific “pivot word” that creates linguistic ambiguity. This paper presents a novel two-stage approach for token-level pun location in Portuguese, addressing the scarcity of research in this language. The first stage uses an ensemble of traditional classifiers to filter out non-pun sentences, thereby reducing class imbalance. The second stage employs a pre-trained BERT encoder combined with a Mixture-of-Experts (MoE) layer to capture specialized linguistic features for token classification. We validate our approach on the Puntuguese corpus, achieving an F-score of 0.74 without requiring post-processing heuristics. Interpretability analyses demonstrate that the MoE experts learn to specialize in distinct mechanisms, such as punchline detection and morphological patterns, thereby confirming the model’s capacity to capture the nuances of humor.
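To make the Mixture-of-Experts idea in the second stage concrete, here is a minimal sketch of a densely gated MoE head applied to a single token's hidden vector. It assumes softmax gating over all experts and linear experts; the abstract does not specify the gating scheme, the number of experts, or how multiple BERT layers are combined, so the function `moe_token_logits` and its parameters are hypothetical illustrations and not the authors' architecture.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def linear(weights, bias, x):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def moe_token_logits(hidden, gate_w, gate_b, experts):
    """Mix expert outputs for one token using softmax gate weights.

    hidden:  token representation (e.g. from an encoder), list of floats
    gate_w, gate_b: gating layer with one output per expert
    experts: list of (weights, bias) pairs, one linear expert each
    Returns (mixed_logits, gate_weights).
    """
    gate = softmax(linear(gate_w, gate_b, hidden))
    outputs = [linear(w, b, hidden) for w, b in experts]
    n_out = len(outputs[0])
    mixed = [sum(g * out[j] for g, out in zip(gate, outputs))
             for j in range(n_out)]
    return mixed, gate
```

Returning the gate weights alongside the logits is a convenience for the kind of interpretability analysis the abstract mentions, since it shows which expert dominated for a given token.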
