International Conference on Computational Processing of Portuguese (2026)


Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1

This paper introduces a Graph Retrieval-Augmented Generation (GraphRAG) pipeline tailored for Question Answering (QA) within Portuguese legal documents. Applied to a corpus of 203 normative resolutions from Companhia Energética de Minas Gerais (CEMIG), the proposed approach addresses the structural complexity of legal texts, such as hierarchical dependencies and temporal modifications. By explicitly modeling documents as knowledge graphs with nodes representing structural units (Articles, Paragraphs, Items) and edges denoting normative relationships, the system preserves context and traceability. The retrieval mechanism reconstructs evidence paths from root to leaf, performing semantic re-ranking before generation. Evaluation using the RAGAS framework yielded a mean answer accuracy of 0.81, with a median of 1.00. Results indicate that the system performs robustly on short, focused queries, while intermediate-length questions present challenges related to semantic dispersion. The findings suggest that structurally aware retrieval significantly enhances the interpretability and precision of legal QA systems.
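The root-to-leaf evidence-path reconstruction described above can be sketched as follows; this is an illustrative example with hypothetical node names and texts, not the authors' implementation:

```python
# Illustrative sketch (not the authors' code): reconstructing a root-to-leaf
# evidence path in a document graph whose nodes are structural units
# (Article -> Paragraph -> Item) linked by parent pointers.

def evidence_path(nodes, leaf_id):
    """Walk parent links from a retrieved leaf up to the root,
    then return the path in reading order (root first)."""
    path = []
    current = leaf_id
    while current is not None:
        node = nodes[current]
        path.append(node["text"])
        current = node.get("parent")
    return list(reversed(path))

# Toy graph: one resolution with an article, a paragraph, and an item.
nodes = {
    "art1":  {"text": "Art. 1 - Scope of the resolution.", "parent": None},
    "par1":  {"text": "Paragraph 1 - Conditions.",         "parent": "art1"},
    "item1": {"text": "Item a) Applies to distribution.",  "parent": "par1"},
}

print(evidence_path(nodes, "item1"))
```

Passing the reconstructed path (rather than the leaf alone) to the generator is what preserves the hierarchical context the abstract emphasizes.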
This study analyzes gender identification in Brazilian Portuguese using Amazon reviews drawn from ten product categories. Nine models were evaluated: three classical classifiers (Logistic Regression, Random Forest, and SVM), a multilingual BERT, and five LLMs (ChatGPT 4o, ChatGPT 3.5, DeepSeek, Sabia3, and Sabiazinho). Experiments show that BERT achieved the best performance (macro-F1 = 0.634), outperforming ChatGPT 4o and Logistic Regression by less than one percentage point. Reviews authored by women reach an average F1 of 0.654—four points higher than those by men. Performance also varies by domain: books and automotive are easier, whereas baby and pets are more challenging.
Stance detection is the task of determining whether an input text expresses a stance in favour of or against a given target topic. In a standard supervised setting, this typically requires a new set of labelled training examples for each test topic. As an alternative to full supervision (or costly LLM-based methods), this study leverages political alignment information by assuming that stances on related moral or political issues tend to co-occur (e.g., support for a right-wing politician correlating with support for the death penalty or opposition to abortion). This alignment, treated here as a form of distant labelling, enables stance inference without constructing new corpora and is evaluated against standard cross-domain and prompt-based methods using a large corpus of stances in the Portuguese language.
We investigate the effect of dependency distance and its directionality on eye-tracking measures in Brazilian Portuguese. Using the RastrOS corpus enriched with surprisal and syntactic annotations, we find that absolute dependency distance significantly improves the prediction of first fixation durations, supporting memory-based accounts of sentence processing. In contrast, the direction of the dependency (whether the dependent precedes or follows the head) shows weaker and less consistent effects. These results indicate that early lexical retrieval is sensitive to distance magnitude, while later reading measures reflecting integration are less affected, highlighting the complementary role of syntactic distance alongside surprisal in modelling reading behaviour.
Context: The increasing availability of textual data has driven the application of Natural Language Processing (NLP) techniques in public administration to improve public services. Objective: This study aims to analyze topic modeling methods in the context of public health audits conducted by the National Department of SUS Auditing (AudSUS). Methods: A controlled in vitro experiment was conducted to assess the performance of the methods in topic modeling tasks using coherence metrics. Results: The LSA method stood out among models with the highest average C_V and C_NPMI coherence. LSA-based models achieved superior performance compared to 215 other models in configurations with lower top-n and top-k values. Overall, the statistical analysis confirms that the observed differences among the models are not due to random variation. Conclusion: The results underscore the potential of topic modeling methods for clustering news articles that exhibit indications of irregularities, thereby guiding information retrieval during the analytical phase of the audit process. This approach enhances the overall effectiveness of audits and facilitates faster preparation of teams for the operational stage.
Contact-center operations often face significant challenges in identifying candidates whose vocal performance aligns with high-quality customer interactions. Existing speech analytics tools typically assess only content, providing limited insight into how candidates speak. To address this gap, we introduce SR-Voice, a multilingual speech analytics module designed to support call-center hiring. SR-Voice extends a previous text-only auditor by integrating segment-level, audio-native analysis capable of generating judgments, concise evidence-based rationales, and 0–10 scores across three dimensions: Emotion, Communication, and Rhythm. Our two-stage architecture first applies an audio-native model to propose a label, which is then reassessed by a lightweight auditor that combines transcript cues with acoustic and timing indicators grounded in phonetic and prosodic theory. We evaluate SR-Voice on a production-like volunteer dataset, reporting strong agreement and calibration (Macro-F1 = 0.83; Expected Calibration Error (ECE) = 0.053). The hybrid system achieves state-of-the-art calibration without post-hoc adjustment, with the audio-only variant attaining the lowest Negative Log-Likelihood (NLL = 0.472). Designed for operational practicality, SR-Voice emphasizes traceability, short rationales, and well-calibrated probabilities suitable for threshold-based decisions and human-in-the-loop triage. We also discuss privacy-preserving storage and the prospective masking of Personally Identifiable Information (PII) for archival data.
This paper reports on the development of a leaderboard of Open Large Language Models (LLMs) for European Portuguese (PT-PT), and on its associated benchmarks. This leaderboard comes as a way to address a gap in the evaluation of LLMs for European Portuguese, which so far had no leaderboard dedicated to this variant of the language. The paper also reports on novel benchmarks, including some that address aspects of performance that have not previously been available in benchmarks for European Portuguese, namely model safeguards and alignment to Portuguese culture. The leaderboard is available at https://huggingface.co/spaces/PORTULAN/portuguese-llm-leaderboard.
Portuguese serves as the official language of multiple countries across four continents. It is classified into two primary variants (European Portuguese and Brazilian Portuguese), but there is limited research on and resources for European Portuguese compared to the Brazilian variant. In this paper, we consider the task of Machine Translation (MT) into Portuguese. Given the resource imbalance, standard MT systems produce translations that are typically closer to the Brazilian standard. We compare four methods available to bias the translation toward the minority European Portuguese variant that target different places in the MT lifecycle: (1) reranking n-best MT outputs according to a variant classifier; (2) biasing hypothesis generation at inference time toward the target variant; (3) fine-tuning for the target variants; (4) moving completely to an LLM-based approach. We find that all methods can bias translation outputs to an extent. The LLM-based approach yields numerically the highest results, but the impact of memorisation remains unclear.
This work investigates Differential Object Marking (DOM) in Brazilian Portuguese (BP), specifically a-marked objects, or prepositional accusatives (PP-ACCs), across four variables: semantic requirements, constituent order, verb semantics, and textual genre. An optimized parsing model was trained to recognize instances of PP-ACCs and automatically annotate historical documents for these objects in the Tycho Brahe and Colonia corpora. Contrary to expectations based on the low frequency of these objects and prior diachronic studies on European Portuguese (EP), our results reveal that PP-ACCs remain present in BP from the 18th century onward. Our findings confirm previous patterns for EP and point to textual genre (specifically, narrative texts and theater plays) as a possibly relevant variable, though this warrants further investigation. Constituent order proved to be less significant than previously suggested. This work also reveals methodological challenges in using computational models and NLP tools for research in historical Portuguese.
This study analyzes texts from multiple sources, including social media and news portals, to observe how different sectors of Brazilian society discuss antimicrobial resistance. The main goal is to support epidemiological surveillance and public policy decisions through computational tools. Three datasets were used: tweets collected between 2008 and 2025 (64,225 documents), news articles from G1 (4,363 documents), and official government publications (.gov.br, 1,515 documents). These sources enable comparative analysis between informal discourse (social media) and institutional or journalistic discourse (official and media outlets). The study applies and compares topic modeling techniques, particularly those designed for Short Text Topic Modeling (STTM), such as GSDMM and BERTopic, to identify discursive trends, semantic patterns, and emerging topics related to antimicrobial resistance. By exploring these distinct contexts, this work demonstrates the potential of Natural Language Processing (NLP) and AI methods as instruments for integrated analysis of public health data in both informal and formal environments.
Large generative language models have demonstrated impressive performance in various Natural Language Processing (NLP) tasks. However, the geological domain presents unique challenges for NLP due to its specialized language, which is full of technical terms. Therefore, language models pre-trained on generic corpora may not be suitable for performing geological domain-specific tasks. This article compares several models to identify those with the best performance in the Portuguese geological domain for a text summarization task. We applied the models to a dataset from Revista Geologia USP, consisting of abstracts of scientific texts and their respective titles, which the models aim to approximate through summarization. We tested the models in various scenarios, with and without examples, and at two temperature levels. We then evaluated the models’ performance using quantitative metrics and a brief qualitative analysis comparing the titles proposed by the models with the originals. The results show that the Gemma3:27b model was better in some scenarios, while the Llama3:8b model performed best in others.
The spread of online misinformation has made fake news detection an essential tool for mitigating its negative impact, but many studies disregard temporal information, and existing datasets become outdated as news evolves. Some modern solutions using Retrieval-Augmented Generation (RAG) can address the problem of unseen news events by providing context to the models. However, there are no studies evaluating the feasibility of web searches to obtain context for deciding whether a news article is true. This work addresses this gap by conducting a comparative study between RAG-based solutions, traditional fake news classification methods, and deep learning-based methods. The results show that although RAG is a modern and promising technique, it cannot outperform techniques already adopted in the literature.
Linking citizen complaints to the public services they concern remains a major challenge in the Brazilian federal administration. In 2025, over 1.2 million manifestations were submitted across 328 agencies, yet only about 1.8% are currently associated with a specific service, limiting large-scale monitoring and evidence-based management. We cast this task as an extreme multi-class text classification problem marked by severe class imbalance and strong lexical–semantic gaps between citizen language and official service descriptions. Building on recent work that reframes the task as information retrieval, we combine sparse retrieval with BM25 over representative complaint corpora and dense retrieval enriched with RAG-labels: semantically expanded label descriptions generated via Retrieval-Augmented Generation and Small Language Models. This approach markedly reduces vocabulary mismatch and semantic ambiguity, yielding substantial gains over direct text or embedding matching. To our knowledge, this is the first Portuguese-language application of RAG-labels for service–complaint association. In real operational data from the Federal Ombudsman Office, our method can automatically assign plausible services to roughly 73% of previously unlabeled cases, improving coverage and supporting more effective public service evaluation.
Negation plays a fundamental role in human communication and logical reasoning, yet it remains underrepresented in natural language inference (NLI) datasets. This work investigates the impact of targeted data augmentation using negation cues on the main NLI datasets for Portuguese (InferBR, ASSIN and ASSIN2). By synthetically generating new instances with negated hypotheses, we create more diverse training and test sets. A BERT-based model was fine-tuned and tested on the combined datasets and augmented data. The results show that the model was heavily influenced by negation bias and that increased data diversity improves the model’s handling of negation.
Text-to-SQL systems aim to translate natural language questions into Structured Query Language (SQL) queries, enabling database access without requiring SQL expertise. In real-world scenarios, these systems often need to manage multiple databases with heterogeneous schemas, making Schema Linking a crucial preliminary step for identifying relevant databases, tables, and columns. This study investigates Schema Linking for questions written in Brazilian Portuguese and compares two schema representation strategies: natural-language descriptions generated by Large Language Models (LLMs) and representations based on Data Definition Language (DDL) and Data Manipulation Language (DML) commands. Experiments conducted on a Brazilian Portuguese version of the Spider dataset, with over 200 databases, evaluated several LLMs and embedding models. The experimental results based on Hit@k show that natural language descriptions consistently outperform DDL/DML-based representations, demonstrating the effectiveness of LLM-generated schema descriptions for Schema Linking tasks.
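The Hit@k measure used in this Schema Linking evaluation can be sketched as follows; the candidate database names and rankings below are invented for illustration, not taken from the Spider dataset results:

```python
# Hedged sketch of Hit@k: a query counts as a hit when the gold database
# appears among the top-k retrieved candidates; the score is the fraction
# of queries that are hits.

def hit_at_k(ranked_candidates, gold, k):
    """Fraction of queries whose gold item is within the top-k ranking."""
    hits = sum(1 for ranking, g in zip(ranked_candidates, gold)
               if g in ranking[:k])
    return hits / len(gold)

# Two toy queries, each with a ranked list of candidate databases.
rankings = [
    ["concert_singer", "orchestra", "museum_visit"],
    ["flight_2", "airline", "airport"],
]
gold = ["orchestra", "airport"]

print(hit_at_k(rankings, gold, 1))  # 0.0: neither gold db is ranked first
print(hit_at_k(rankings, gold, 3))  # 1.0: both appear within the top 3
```

Reporting several k values, as the abstract implies, shows how quickly the correct schema surfaces as the candidate budget grows.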
Automated Essay Scoring systems can relieve teachers of the laborious task of essay grading and allow students to practice more frequently due to faster feedback cycles. In Brazilian Portuguese, there is growing interest in automatic scoring systems for the standardized ENEM exam. However, the only available datasets consist of essays written as practice for the official exam. In the literature, to the best of our knowledge, there is no work that evaluates official ENEM essays using mock-exam datasets. This work fills that gap by presenting a new labeled dataset composed of 157 essays written for the official ENEM exam. The analysis shows that this dataset shares characteristics similar to existing datasets of mock-exam essays. The results also indicate that, for small datasets such as this one, the use of LLMs pretrained on mock exams significantly improves the performance of automatic scorers for official ENEM essays, yielding an average gain of 0.27 points in the Quadratic Weighted Kappa metric compared to training solely on official data.
Reliable inflation forecasts play a critical role in economic stability and policy decisions. Traditional econometric models perform well but often overlook qualitative signals that could improve predictive accuracy. Recent advances in AI-based Natural Language Processing enable the extraction of latent sentiment, offering a promising avenue for inflation forecasting. This study proposes a framework that uses Large Language Models (LLMs) to extract sentiment variables from the Brazilian Monetary Policy Committee (COPOM) minutes, optimizes their bias to match human-collected sentiment, and integrates them into ARIMA and LSTM models for one-step-ahead monthly IPCA prediction. Results show that LLM-generated sentiment trends are temporally coherent with historical inflation patterns and highly statistically significant (p < 0.001). Models whose sentiment evaluations aligned more closely with human assessments (e.g., grok-4-fast and llama-4-maverick) achieved superior forecasting performance. ARIMA models consistently benefited from sentiment inclusion, while LSTM results were more variable.
High-quality corpora are essential for advancing Natural Language Processing (NLP) in Portuguese. Building on previous encoder-only models such as BERTimbau and Albertina PT-BR, we introduce NorBERTo, a modern encoder based on the ModernBERT architecture, featuring long-context support and efficient attention mechanisms. NorBERTo is trained on Aurora-PT, a newly curated Brazilian Portuguese corpus comprising 331 billion GPT-2 tokens collected from diverse web sources and existing multilingual datasets. We systematically benchmark NorBERTo against strong baselines on semantic similarity, textual entailment and classification tasks using standardized datasets such as ASSIN 2 and PLUE. On PLUE, NorBERTo-large achieves the best results among the encoder models we evaluated, notably reaching 0.9191 F1 on MRPC and 0.7689 accuracy on RTE. On ASSIN 2, NorBERTo-large attains the highest entailment F1 (0.904) among all encoders considered and remains competitive on the remaining metrics, although Albertina-900M and BERTimbau-large still hold an advantage there. To the best of our knowledge, Aurora-PT is currently the largest openly available monolingual Portuguese corpus, surpassing previous resources. NorBERTo provides a modern, mid-sized encoder designed for realistic deployment scenarios: it is straightforward to fine-tune, efficient to serve, and well suited as a backbone for retrieval-augmented generation and other downstream Portuguese NLP systems.
We propose AspectRAG, a retrieval-and-generation architecture for ASTE in Portuguese that operates without supervised training. The method extracts aspects with an LLM, encodes them as dense vectors, and uses only these vectors to retrieve highly specific evidence through approximate search and rank fusion. The retrieved evidence composes the context of the generator model, which produces the final triples. On the ReLi and ReHol datasets, AspectRAG achieves up to 93.47% in ATE, 80.68% in OTE, and 69.83% in ASTE, outperforming supervised models such as OTE-MTL, CMLA-MTL, and BOTE, the state of the art in Portuguese. The ablation study shows that aspect-guided semantic retrieval is the main factor behind the observed gains, while the size of the LLM has a secondary impact. The results show that the AspectRAG architecture is an efficient solution, competitive even without fine-tuning, relying solely on vector retrieval and contextualized inference.
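One common way to implement the rank-fusion step this abstract mentions is reciprocal rank fusion (RRF). The sketch below is an illustration under assumptions, not the authors' implementation: the constant k = 60 and the toy rankings are invented:

```python
# Illustrative reciprocal rank fusion (RRF): each document's fused score is
# the sum over input rankings of 1 / (k + rank), so items ranked well by
# several retrievers rise to the top.

def rrf(rankings, k=60):
    """Fuse several ranked lists into a single ranking by RRF score."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits   = ["rev_7", "rev_2", "rev_9"]  # approximate nearest-neighbour search
lexical_hits = ["rev_2", "rev_5", "rev_7"]  # keyword-based search

print(rrf([dense_hits, lexical_hits]))
```

Here "rev_2" wins because it is ranked highly by both lists, even though neither retriever placed it first alone.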
This paper analyzes the semantic parsing of relative clauses in Portuguese in two meaning representation frameworks: Abstract Meaning Representation (AMR) and Lexicalized Meaning Representation (LMR). While both treat relatives as noun modifiers, AMR fails to distinguish restrictive from appositive clauses, an important traditional grammatical distinction. We argue for explicitly encoding this difference. The study draws on annotated translations of *The Little Prince* (Saint-Exupéry, 1943) in Brazilian and European Portuguese, highlighting issues in the Brazilian AMR annotations.
Robust sentiment analysis in Portuguese is central to applications across Lusophone contexts, yet systematic evaluations still focus predominantly on English and proprietary systems. This paper presents a comparative study of 29 open-source Large Language Models (LLMs) and two proprietary models on Portuguese sentiment classification under four prompting strategies: Zero-Shot, Few-Shot, Chain-of-Thought (CoT), and CoT with Few-Shot (CoT+FS). Experiments on a unified three-class benchmark built from three public review corpora (about 3,000 instances) comprise roughly 372,000 inferences, totaling approximately 150M input tokens and 65M output tokens. Results show that CoT+FS generally yields the best performance for larger models, while several compact open-source models obtain competitive F1-scores with substantially lower computational cost, making them suitable for real-world deployments. We identify concrete teacher–student configurations tailored for knowledge distillation in Portuguese sentiment analysis.
Automatic metrics are widely used to evaluate text quality across various natural language processing tasks. Despite their convenience and scalability, the extent to which these metrics reliably reflect textual quality remains an open challenge. The LLM-as-a-judge paradigm has recently emerged, aligning more closely with human judgments by using LLMs themselves as evaluators. However, there is still a gap in such evaluations across specific domains and languages, as most prior work focuses on generic task benchmarks in English. In this paper, we examine the robustness of both traditional automatic metrics and the LLM-as-a-judge approach for assessing the quality of financial commentaries in Portuguese, an underexplored task and language that has been neglected in previous work. We introduce fine-grained perturbations into the texts generated by specialists to analyze which types of noise most significantly affect evaluation outcomes, using noise-free counterparts as references. The results highlight the weaknesses of classical metrics in this specific task and the limitations of even recent evaluation paradigms, underscoring the need to develop context- and domain-sensitive evaluation methods.
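The fine-grained perturbation idea can be illustrated with a minimal sketch: inject controlled noise into a reference commentary and observe how a surface metric reacts. The perturbation types, helper names, and the toy token-overlap metric below are assumptions for illustration, not the authors' setup:

```python
# Minimal sketch of perturbation-based metric probing (assumed setup):
# distort numbers or drop words in a reference text, then score the noisy
# version against the clean one with a simple token-overlap F1.

import random
import re

def perturb_numbers(text, rng):
    """Replace each number with a distorted value (a factual-error perturbation)."""
    return re.sub(r"\d+", lambda m: str(int(m.group()) + rng.randint(1, 9)), text)

def drop_words(text, rng, p=0.2):
    """Randomly drop words (a fluency/completeness perturbation)."""
    words = text.split()
    kept = [w for w in words if rng.random() > p]
    return " ".join(kept or words[:1])

def token_f1(candidate, reference):
    """Toy surface metric: token-overlap F1 between candidate and reference."""
    c, r = set(candidate.split()), set(reference.split())
    overlap = len(c & r)
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(c), overlap / len(r)
    return 2 * prec * rec / (prec + rec)

rng = random.Random(0)
reference = "Revenue grew 12 percent driven by credit operations"
noisy = perturb_numbers(reference, rng)
print(noisy)
print(token_f1(noisy, reference))
```

A factual distortion of a single number barely moves a surface-overlap score, which is exactly the kind of blind spot such perturbation probes are designed to expose.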
The rapid dissemination of digital information has exposed financial markets to the risks of disinformation. Although numerous methods exist to detect fake news, they predominantly focus on textual features, often neglecting the significant role of image-based content. This paper introduces a novel framework for detecting financial fake news in Brazilian Portuguese that bridges this gap. The proposed system integrates Natural Language Processing (NLP) with an image-to-text classification strategy: using Tesseract-based OCR, the system extracts text from images and processes it through the same unified pipeline used for text classification. Experiments on the Fake.BR and FakeRecogna corpora and BBC News Brasil show that our approach achieves 98% accuracy using BERTimbau fine-tuned on financial news. These findings underscore the critical importance of analyzing visual text and demonstrate that the multimodal strategy is effective for disinformation detection.
This paper reports on an effort to recover the classical morphosyntactically annotated corpus MacMorpho and realign it with the current version of the Universal Dependencies framework. We introduce a knowledge-rich approach grounded in a syntactic parser and a specially designed tagset-compatibility strategy in order to generate a "silver-standard" resource: MacMorpho-UD-2.17. We evaluate this resource through multiple complementary methods, providing evidence for the quality of both our approach and the resulting annotation.
Named Entity Recognition (NER) is an important task in Natural Language Processing. Achieving good results in this task usually requires a large amount of labeled data to train models, which is especially difficult to obtain for domain-specific datasets and low-resource languages. To mitigate the high cost of human-annotated data, data augmentation can be used. In this work, we evaluate data augmentation techniques for NER, focusing on domain-specific datasets in Portuguese. We employed augmentation techniques based on rules, back-translation, and large language models on four datasets of varying sizes to train Transformer-based NER models. The results showed that most techniques improved over the baseline, with the best results achieved using PP-LLM, SR, and MR.
Humor processing remains a complex challenge in Natural Language Processing, particularly the task of pun location, which involves identifying the specific "pivot word" that creates linguistic ambiguity. This paper presents a novel two-stage approach for token-level pun location in Portuguese, addressing the scarcity of research in this language. The first stage uses an ensemble of traditional classifiers to filter out non-pun sentences, thereby reducing class imbalance. The second stage employs a pre-trained BERT encoder combined with a Mixture-of-Experts (MoE) layer to capture specialized linguistic features for token classification. We validate our approach on the Puntuguese corpus, achieving an F-score of 0.74 without requiring post-processing heuristics. Interpretability analyses demonstrate that the MoE experts learn to specialize in distinct mechanisms, such as punchline detection and morphological patterns, thereby confirming the model’s capacity to capture the nuances of humor.
Text-to-SQL systems allow users to query relational databases using natural language, but accuracy remains sensitive to the choice of language, model architecture, and prompting strategy. Although recent Large Language Models (LLMs) incorporate reasoning mechanisms that improve multi-step problem solving in other domains, their effects on multilingual Text-to-SQL are not yet well understood. This work evaluates a diverse set of LLMs on the BIRD benchmark and BIRD_PT, a Portuguese version produced by translating the questions and external knowledge while keeping the original English database schema and values unchanged. We compare four controlled scenarios that vary internal reasoning and guided reasoning for SQL generation. The results show a consistent decrease in accuracy when switching from English to Portuguese, with large variations in robustness across models. Reasoning alone does not reliably improve execution accuracy and can reduce performance in Portuguese, while combining reasoning with a guided plan provides the most stable improvements, although still weaker than in English. These findings highlight ongoing challenges in multilingual Text-to-SQL and emphasize the need to jointly consider language understanding, reasoning activation, and task-aligned planning when designing future systems.
Fine-tuned small language models (SLMs) have emerged as effective alternatives for closed-domain tasks, where large language models (LLMs) often lack sufficient parametric knowledge. This study presents a methodology for adapting a small language model to a closed-domain question answering (QA) task. For each question, the model is trained to output an answer based on the most relevant context passage among ten provided candidates, thus reproducing the logic of a Retrieval-Augmented Generation (RAG) framework. The fine-tuning data were derived from PetroKGraph, an existing knowledge graph built from Portuguese-language resources in the oil and gas (O&G) domain. Experimental results show that the fine-tuned model achieves a 20-percentage-point accuracy improvement over the base model on closed-domain questions. It also surpasses GPT-4o and GPT-4o Mini by 12 and 25 points, respectively. Moreover, its performance on general-domain tasks remains comparable to that of the base model, indicating that the specialized model effectively learned domain-specific knowledge while maintaining general reasoning capabilities.
The conceptual ambiguity among terms like ’hate speech’, ’toxic speech’, and ’dangerous speech’ creates a significant bottleneck for both research and automated moderation. Traditional NLP models, often focused on lexical cues, struggle to differentiate these nuanced forms of linguistic violence, especially when the harm is implicit. This paper addresses this gap with a twofold objective. First, we conduct a conceptual review and propose a unified ontology that differentiates these concepts—including verbal aggression and cyberbullying—based on their core attributes, such as their target, intent, and associated rhetorical hallmarks. Second, we propose a computational methodology designed to operationalize this ontology. Our framework uses a multi-stage NLP pipeline that leverages semantic analysis, specifically Semantic Role Labeling and Named Entity Recognition, to deconstruct speech acts into their core components (e.g., target and action). This component-based approach allows for a granular classification that can robustly distinguish between seemingly similar phenomena, such as a general insult and a targeted identity-based attack. This methodology is particularly promising for low-resource languages, such as Portuguese, as it relies on core semantic tasks for which multilingual models are available, rather than requiring massive, task-specific labeled datasets.
This paper presents Ethos AT, a desktop software for automatic transcription that uses OpenAI Whisper models, enabling local processing and ensuring data privacy and accessibility for users who are not necessarily programming experts, such as oral history researchers. A comparative analysis of six Whisper models (small, medium, large, large-v2, large-v3, and turbo) was conducted to analyze performance in terms of transcription accuracy, error types, and processing time. Results indicate that larger models achieve higher lexical accuracy, while smaller ones provide faster execution with acceptable quality for general use; the turbo model showed an effective balance between accuracy and speed. Overall, Ethos AT offers a secure, efficient, and user-friendly solution for academic and institutional contexts.
The growing amount of text available on the Web makes text mining tools essential for extracting valuable information for many applications. Beyond the texts themselves, however, knowing the characteristics of their authors is crucial for some organizations. Since texts can be published anonymously, there is growing interest in research aimed at creating computational techniques to infer the demographic characteristics of their authors. Even so, for the problem of predicting the age range of authors of texts written in Portuguese, the limited amount of resources and the low predictive performance highlight the need for more research focused on this task. This work therefore proposes and evaluates an approach that, in addition to a traditional classifier, uses word dictionaries to capture the specificities of the textual domain and improve the predictive performance of the age-range prediction task. Experimental results obtained with the proposed approach show that exploiting domain characteristics of the texts can contribute positively to performance on this task.
Enhanced Universal Dependencies (EUD) provide a more informative syntactic representation than Basic Universal Dependencies by relaxing tree constraints to allow for graph structures. While conversion rules from basic to enhanced relations have been established for Portuguese, they were previously evaluated only on journalistic text using gold-standard basic syntactic trees. This paper evaluates the robustness of these rules in diverse scenarios ("in the wild"), encompassing other text genres and domains, as well as realistic parsing pipelines that rely on automatically generated basic syntax. Our results demonstrate that Portuguese-specific rules consistently outperform universal rules. However, the reliance on automatic basic syntax significantly impacts performance. This degradation is particularly severe when the domain of the input text differs from the training data of the basic parser. We also provide a detailed error analysis, identifying specific difficult linguistic phenomena and scenarios.
The legal domain presents several challenges for Natural Language Processing (NLP), particularly due to its linguistic complexity and the lack of public datasets. Named Entity Recognition (NER), a subarea of NLP, has been successfully used to extract useful knowledge from legal texts, but its widespread use is limited by the scarcity of legal text corpora. This paper introduces UlyssesLegalNER-Br, a comprehensive corpus of Brazilian legal documents for NER, covering bills, case law, and laws, including the first NER corpus based exclusively on Brazilian laws. This research expands the UlyssesNER-Br corpus, previously focused only on the Brazilian legislative domain. The proposed corpus comprises 560 public documents annotated using a hybrid approach, organized into 9 categories and 23 fine-grained types, and experimentally evaluated with the CRF, BiLSTM, and BERTimbau architectures in terms of predictive performance, computational cost, and label-level results. The best micro F1-score, 96.18%, was achieved by BERTimbau on the unified corpus, providing a strong baseline for Brazilian legal NER. At the label level, six categories and seven types achieved an F1-score above 95%, while the lowest scores fell in the 71-82% range.
The PROPOR conference has been the main venue for Portuguese language Natural Language Processing (NLP) research for over two decades. This paper presents a longitudinal bibliometric analysis of PROPOR from 2003 to 2024, examining thematic evolution, community structure, and scientific impact. We identify a shift from speech-oriented research toward text-based tasks, alongside the sustained importance of resources and linguistic theory. The community exhibits a stable structure, with complementary leadership models centered on institutional hubs and brokerage roles. Scientific impact is highly concentrated, following a long tail distribution, and distinguishes between cumulative productivity-driven impact and rapidly accelerating citation uptake in recent editions. These findings characterize PROPOR as a resilient regional linguistic ecosystem evolving in dialogue with broader NLP paradigms.
For two decades, the HAREM corpus has served as the foundational benchmark for Portuguese Named Entity Recognition (NER), establishing its evaluation paradigm. Virtually all major progress has been measured against its fixed train/test split. This paper presents the first systematic audit of this split, revealing 153 overlapping (contaminated) sentences. We re-evaluate 13 NER models (ranging from CRFs to Transformers) on both the original and a new, decontaminated version of the corpus. Our statistical analysis reveals that decontamination has a significant (p < 0.05) and positive impact on the majority of models. We find that performance gains are most pronounced in the macro-F1 score (up to +4 points), demonstrating that the contamination primarily harmed generalization on rare entity types. Furthermore, our audit reveals clear evidence of overfitting in some models that benefited from data leakage. We conclude that even minor contamination can distort performance metrics and mask true model generalization. We release our decontaminated benchmark to ensure more reliable future evaluations.
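The core of such a split-contamination audit can be sketched as a normalized-sentence intersection between train and test sets. This is a sketch of the general technique, not the paper's exact procedure (the HAREM data format and normalization choices may differ).

```python
# Sketch of a train/test contamination audit on plain-text sentence lists.
def normalize(sentence: str) -> str:
    # Lowercase and collapse whitespace so trivially re-tokenized
    # duplicates are still detected.
    return " ".join(sentence.lower().split())

def find_contamination(train, test):
    """Return test sentences whose normalized form also occurs in train."""
    train_set = {normalize(s) for s in train}
    return [s for s in test if normalize(s) in train_set]

train = ["O Porto fica em Portugal .", "Lisboa é a capital ."]
test = ["lisboa é a capital .", "O Brasil é grande ."]
assert find_contamination(train, test) == ["lisboa é a capital ."]
```

Decontamination then amounts to dropping the flagged sentences from the test set before evaluation.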
Recent works in the fields of computer vision and natural language processing have enabled the recognition and identification of objects in images, generating automatic descriptions. Despite these advancements, the main research in this field is primarily related to the English language, requiring some adaptation when dealing with other languages, such as Portuguese. One of these methods is the translate-train approach, which involves translating the training dataset into the desired language. However, there are various translators with different levels of effectiveness available. The primary objective of this work is to evaluate the behavior of image captioning models when trained on datasets translated into Portuguese by different automatic translators, both quantitatively (cost, training time, metrics on the test set) and qualitatively (comparative evaluation form, error analysis). The results indicate that it is possible to obtain valid automatic descriptions in Portuguese from image captioning models trained on translated datasets, and that more robust translators produce more meaningful descriptions.
Analyzing how large-scale multi-party dialogues shape collective behavior is a central challenge in computational linguistics. However, traditional text-based methods often overlook the complex, non-linear turn-taking dynamics defining these interactions. To address this gap, we propose a framework based on Dialogue Action Flows (DAFs) that integrates verbal utterances and non-verbal actions into a unified probabilistic representation of interactional behavior. Interactions are encoded as speaker-action states, forming a probabilistic DAF that reveals dominant behavioral trajectories and recurrent patterns. We validate this framework on five years of Portuguese Parliament debates. Analysis reveals systematic behavioral asymmetries driven by party roles: while government parties exhibit increasing alignment, opposition forces, particularly the radical wing, maintain persistently high conflict. Additionally, the rising volume of interactions across legislative years indicates a progressively heated environment. Overall, our framework provides a quantitative and interpretable approach for modeling polarization, alignment, and interactional dynamics in multi-party political discourse.
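The probabilistic DAF construction can be illustrated by estimating transition probabilities over speaker-action states from an interaction sequence. This is a toy sketch: the state inventory and the maximum-likelihood estimator below are illustrative assumptions, not the paper's full framework.

```python
# Toy probabilistic Dialogue Action Flow: maximum-likelihood transition
# probabilities between speaker-action states.
from collections import Counter, defaultdict

def transition_probs(states):
    """Estimate P(next_state | state) from one observed sequence."""
    counts = defaultdict(Counter)
    for a, b in zip(states, states[1:]):
        counts[a][b] += 1
    return {a: {b: n / sum(c.values()) for b, n in c.items()}
            for a, c in counts.items()}

seq = ["gov:speak", "opp:interrupt", "gov:speak", "opp:applaud"]
probs = transition_probs(seq)
assert probs["gov:speak"] == {"opp:interrupt": 0.5, "opp:applaud": 0.5}
```

Dominant behavioral trajectories then correspond to high-probability paths through this transition structure.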
Despite rapid progress in open large language models (LLMs), European Portuguese (pt-PT) remains underrepresented in both training data and native evaluation, with machine-translated benchmarks likely missing the variant’s linguistic and cultural nuances. We introduce AMALIA, a fully open LLM that prioritizes pt-PT by using more high-quality pt-PT data during both the mid- and post-training stages. To evaluate pt-PT more faithfully, we release a suite of pt-PT benchmarks that includes translated standard tasks and four new datasets targeting pt-PT generation, linguistic competence, and pt-PT/pt-BR bias. Experiments show that AMALIA matches strong baselines on translated benchmarks while substantially improving performance on pt-PT-specific evaluations, supporting the case for targeted training and native benchmarking for European Portuguese.
Document simplification has recently attracted increasing attention due to its broader practical applicability compared to sentence-level simplification. Beyond simplifying individual sentences, this task involves preserving fluency, conciseness, and coherence across the entire text, often incorporating summarization techniques. Despite its importance, research in this area remains largely concentrated on a few languages, particularly English. In this work, we introduce LegalSim-PT, the first large-scale Portuguese dataset for document simplification based on legal texts. To mitigate reliance on manual evaluation, we combined data augmentation strategies with readability, semantic similarity, and diversity metrics to select the most suitable document pairs. We conducted a comprehensive analysis of the resulting dataset, first characterizing its surface features and comparing them with those of existing simplification corpora. Next, we assessed its quality using automatic metrics, linguistic indicators, and human evaluations. Finally, we selected representative models as baselines and fine-tuned two models on LegalSim-PT, achieving improved performance in document-level simplification.
Orthographic neighbors (ONs) play a central role in models of visual word recognition and have been shown to influence reading speed, lexical access, and literacy development. Despite their importance, resources providing detailed and flexible ON information remain scarce for European Portuguese. This paper introduces Portho, a corpus-based lexical resource that provides multiple ON metrics for over 43,000 word forms, using several ON definitions. In addition to classical neighborhood size measures, Portho provides frequency-based statistics and graded orthographic distance (OD) features. We analyze the statistical properties of the resource and evaluate its empirical utility in automatic text complexity assessment using the iRead4Skills corpus. Results show that while ON features alone are insufficient to predict readability, they contribute complementary information and compare favorably with existing resources for Portuguese. Portho is made publicly available in different formats to support research in psycholinguistics, readability modeling, and Natural Language Processing (NLP) for Portuguese.
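The classical neighborhood-size measure mentioned above (Coltheart's N: same-length word forms differing by exactly one letter substitution) can be sketched as follows; the lexicon below is a toy example, not Portho's data.

```python
# Coltheart-style orthographic neighbors: equal length, one substitution.
def is_neighbor(a: str, b: str) -> bool:
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

def neighborhood(word, lexicon):
    """All neighbors of `word` in `lexicon` (neighborhood size = len)."""
    return [w for w in lexicon if is_neighbor(word, w)]

lex = ["gato", "gado", "pato", "gata", "prato"]
assert neighborhood("gato", lex) == ["gado", "pato", "gata"]
```

Graded orthographic-distance features, as provided by Portho, generalize this by scoring near-neighbors (e.g., edit distance 2) instead of using a binary criterion.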
The analysis of unstructured civil petitions is often hindered by procedural noise and verbose argumentation. To address this, we propose a pipeline composed of LLM-based fact extraction followed by legal-domain embeddings of texts for unsupervised density clustering. We employ Large Language Models to isolate factual narratives from raw texts, which are then encoded using domain-specific representations (Legal-BERT) and grouped via UMAP dimensionality reduction and the HDBSCAN algorithm. Comparative experiments on a Brazilian judicial corpus reveal that clustering based solely on the extracted factual narratives yields significantly more cohesive and semantically well-defined groups than clustering over the raw texts, which suffers from fragmentation due to content variability. Results indicate that the proposed method is a promising approach for thematic organization, procedural triage support, and large-scale discovery of legal patterns.
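As a very simplified stand-in for the grouping stage: the pipeline clusters embeddings of extracted facts rather than raw petitions. The paper uses Legal-BERT embeddings with UMAP and HDBSCAN; the greedy cosine-threshold grouping below only illustrates the idea of clustering fact vectors, not those algorithms.

```python
# Greedy cosine-threshold grouping of fact-embedding vectors (a toy
# stand-in for density clustering; threshold is an arbitrary choice).
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def greedy_cluster(vectors, threshold=0.9):
    clusters = []  # list of (representative vector, member indices)
    for i, v in enumerate(vectors):
        for rep, members in clusters:
            if cosine(rep, v) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((v, [i]))
    return [m for _, m in clusters]

vecs = [(1.0, 0.0), (0.99, 0.05), (0.0, 1.0)]
assert greedy_cluster(vecs) == [[0, 1], [2]]
```

Density-based methods like HDBSCAN additionally leave low-density points unclustered, which this sketch does not model.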
This work presents a study of automated reformulation of argumentative essays written by college-bound native speakers of Brazilian Portuguese as a form of pedagogical feedback. We first evaluate the feasibility of using large language models (LLMs) to score argument quality with respect to three criteria: the defense of a point of view, organization, and development. We then employ an LLM to provide a reformulated version of the essay as feedback. As we discuss, the main challenge is to constrain the automated feedback to address only argument quality, rather than improving other aspects such as spelling or cohesion, and to modify the essay as little as possible. We achieve levels of agreement in automatic essay scoring comparable to human inter-rater agreement metrics, while increasing explainability. Instructing the LLM to add argument support (facts, examples, etc.) was the best way to get non-superficial changes to the arguments, and it was able to add true examples and facts to the essays even without being provided with background information on the topic.
This paper presents a comparative evaluation of automatic classification strategies for Brazilian university entrance exam questions by subject and fine-grained topic. A central contribution of this study is the creation and curation of a large-scale Portuguese-language dataset comprising approximately 17,000 questions collected from the Agatha.edu platform, carefully cleaned and normalized. We investigated two alternative classification strategies: a single-step approach that directly predicts fine-grained topics and a two-stage approach in which an initial model predicts the subject, followed by specialized topic classifiers. These strategies were evaluated using both classical machine learning methods, such as Support Vector Machines, Naive Bayes, and Random Forest, and transformer-based language models pre-trained for Portuguese. Experimental results show the feasibility of large-scale automatic question classification and highlight the potential of NLP-based classification strategies to support the curation, analysis, and organization of educational question banks.
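The two-stage strategy can be sketched with hypothetical rule-based stand-ins for the trained classifiers; the subjects, topics, and keywords below are invented for illustration only.

```python
# Two-stage classification: a subject classifier routes each question to
# a specialized topic classifier. Real systems would use SVMs or
# transformer models in place of these keyword rules.
def subject_clf(question: str) -> str:
    return "math" if "equação" in question else "history"

TOPIC_CLFS = {
    "math": lambda q: "algebra" if "equação" in q else "geometry",
    "history": lambda q: "brasil" if "império" in q else "geral",
}

def two_stage(question: str):
    subject = subject_clf(question)
    return subject, TOPIC_CLFS[subject](question)

assert two_stage("Resolva a equação x+1=2") == ("math", "algebra")
assert two_stage("O império no Brasil") == ("history", "brasil")
```

The single-step alternative evaluated in the paper would instead predict the fine-grained topic directly, with no routing stage.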
The proliferation of fake news in digital environments poses serious challenges to democratic processes, particularly in morphologically rich languages such as Portuguese. While most existing approaches focus on stylistic cues or propagation patterns in social networks, this paper proposes an automated fake news verification methodology grounded in Knowledge Graphs (KGs). Instead of treating news as raw text, we represent each article as a set of factual events encoded as semantic triples of subject, predicate, and object. A proprietary knowledge graph is built from Brazilian data sources, and a verification algorithm is introduced to estimate the veracity of news articles based on graph connectivity evidence. Experimental results confirm the feasibility of the proposed approach and highlight its inherent explainability as a key advantage over deep learning black-box models. Error analysis further indicates that the main limitation stems from the syntactic complexity of Open Information Extraction in Portuguese, suggesting that improvements at this extraction stage are essential to increase system robustness.
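The connectivity-based verification idea can be sketched as a bounded graph search over semantic triples: a claim linking a subject and an object is treated as supported when the two are connected in the knowledge graph within a few hops. This is a sketch under assumed toy data, not the paper's algorithm.

```python
# Triple-based verification via graph connectivity.
from collections import defaultdict

def build_graph(triples):
    adj = defaultdict(set)
    for s, _p, o in triples:
        adj[s].add(o)
        adj[o].add(s)  # treat edges as undirected for connectivity
    return adj

def connected(adj, a, b, max_hops=2):
    """Breadth-limited search: is b reachable from a within max_hops?"""
    frontier, seen = {a}, {a}
    for _ in range(max_hops):
        frontier = {n for v in frontier for n in adj[v]} - seen
        if b in frontier:
            return True
        seen |= frontier
    return False

kg = [("Brasília", "capital_de", "Brasil"),
      ("Brasil", "parte_de", "América do Sul")]
adj = build_graph(kg)
assert connected(adj, "Brasília", "América do Sul")
assert not connected(adj, "Brasília", "Lisboa")
```

A path found between the claim's subject and object serves as explainable evidence, which is the interpretability advantage the abstract highlights.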
The proliferation of online hate speech requires a rigorous examination of the datasets used to train detection models. In this work, we analyze six Brazilian Portuguese datasets annotated for hate speech or toxicity to investigate how their lexical "anatomy" and domain characteristics affect cross-domain generalization. We combine HurtLex-based lexical profiling with cross-dataset evaluation in a feature-based transfer-learning setup, using BERTimbau embeddings and an XGBoost classifier. Our analysis shows that, although the datasets share a broadly similar macro-level focus, they diverge substantially in how specific terms are used and labeled across platforms and topics. Results indicate that lexical breadth and annotation practices strongly predict transferability: datasets with broader and more heterogeneous toxic vocabulary yield better cross-domain performance, whereas resources with narrow, profanity-centered labeling lead to severe generalization gaps, even when lexical overlap is high. These findings underscore the impact of collection and labeling strategies on the curation and evaluation of Portuguese hate speech datasets. Warning! This work and the referenced datasets contain examples of offensive and hateful language.
Large Language Models (LLMs) have shown impressive performance on medical reasoning tasks. However, their robustness to linguistic variation remains underexplored, especially in languages other than English, such as Portuguese. In this work, we investigate how the input language affects the performance and reasoning behavior of medical LLMs, and whether Retrieval-Augmented Generation (RAG) can mitigate limitations arising from these variations. To this end, we ran experiments in Portuguese and English using two variants of the MedGemma model, with 4B and 27B parameters, evaluating them on three medical datasets. The evaluation combines quantitative accuracy metrics with qualitative and structural analyses of the reasoning chains and answers generated by the models. The results indicate that linguistic variation has a stronger impact on smaller models. In particular, the 4B-parameter variant performs consistently worse when inputs are provided in Portuguese. In contrast, the 27B-parameter variant is more robust across languages, maintaining similar levels of accuracy and reasoning structure in both Portuguese and English. Although the implemented RAG system retrieves documents of good quality, its integration does not yield consistent gains for the smaller model, suggesting limitations in the effective use of the additional context. Overall, this work contributes to the understanding of the current limits of medical LLMs in multilingual settings, highlighting the challenges associated with performance in lower-resourced languages.
This work presents BIPA, a phonetic transcription corpus for Brazilian Portuguese that covers regional dialectal variations. The corpus was constructed through automated extraction from Wiktionary, resulting in 53,353 unique words and 350,021 transcriptions in IPA format, distributed across six dialects: general Brazilian, Rio de Janeiro, São Paulo, South Region, Northeast Region, and Center-West Region. The average density of 6.56 transcriptions per word reflects multiple regionally conditioned phonetic variations. To validate the utility of the corpus, the ByT5-small model was fine-tuned for grapheme-to-phoneme conversion, achieving a Minimum Phoneme Error Rate of 2.66% on the validation set. BIPA addresses the scarcity of computational linguistic resources for Brazilian Portuguese, enabling applications in regional speech synthesis, automatic accent recognition, and computational sociolinguistic analysis.
Automated Essay Scoring (AES) for Brazilian Portuguese remains a challenging task, particularly in the context of the Enem exam, in which textual quality is assessed through multiple competências (scoring criteria) and grades are ordinal in nature. In this paper, we investigate hybrid modeling strategies for competência-level AES, combining explicit linguistic features with contextual representations. Using the Enem-AES corpus, the assessment of each competência was modeled as an ordinal prediction problem through the CORAL framework. We performed a controlled empirical comparison of traditional lexical representations, a broad set of linguistic metrics extracted with the NILC-Metrix system, task-oriented handcrafted features, contextual embeddings, and combinations of these representations. The results show that hybrid models achieve the highest average agreement with human grades, although performance varies across competências and depends on the type of representation used. We also analyzed model behavior in scenarios of disagreement between human raters, which highlighted the impact of annotation variability on model performance. Overall, the results provide evidence that combining linguistic indicators with contextual embeddings is a promising strategy for AES in the Enem setting.
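The CORAL framework turns an ordinal grade into K-1 cumulative binary targets, so that predictions are rank-consistent. A minimal sketch of the encoding, assuming the usual Enem per-competência scale of 0 to 200 in steps of 40 (the scale is an assumption about the setup, not stated in the abstract):

```python
# CORAL-style ordinal target encoding for Enem competência scores.
LEVELS = [0, 40, 80, 120, 160, 200]

def coral_targets(score: int) -> list:
    """Encode an ordinal level as K-1 cumulative binary targets:
    target[k] = 1 iff score exceeds LEVELS[k]."""
    rank = LEVELS.index(score)
    return [1 if rank > k else 0 for k in range(len(LEVELS) - 1)]

assert coral_targets(0) == [0, 0, 0, 0, 0]
assert coral_targets(120) == [1, 1, 1, 0, 0]
assert coral_targets(200) == [1, 1, 1, 1, 1]
```

A CORAL model then learns one shared feature extractor with K-1 binary heads, and the predicted grade is recovered by counting heads whose output exceeds 0.5.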
This paper presents an evaluation of large language models (LLMs) applied to the task of normalizing eighteenth-century written texts. Several LLMs were employed to process texts in pre-contemporary spellings and update them according to contemporary Portuguese orthography. Their outputs were rigorously compared against a curated reference corpus. The findings indicate marked disparities in model performance, with the Portuguese-specialized model Sabiá demonstrating a statistically significant advantage over multilingual alternatives.
Speech-language assessment of stuttering is traditionally manual, subjective, and time-consuming. This paper presents the development of software for automatic detection and classification of stuttering-related disfluencies in Brazilian Portuguese, aiming to support clinical assessment. The system follows a two-stage hybrid approach. In the first stage, it applies deterministic algorithms based on automatic speech recognition (ASR) and temporal information to identify simple disfluencies, such as repetitions and pauses. In the second stage, it employs a hierarchical architecture combining a Kohonen network (Self-Organizing Map, SOM) and a Multilayer Perceptron (MLP) to classify complex disfluencies, specifically blocks and prolongations, using acoustic features. Because no publicly available annotated resources exist for this task in Brazilian Portuguese, we built an initial dataset annotated by specialists. The system achieved 89.5% accuracy in classifying complex disfluencies, with a Matthews Correlation Coefficient (MCC) of 0.812. These results indicate the feasibility of the tool as decision support for clinical assessment and establish a baseline for future research.
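The deterministic first stage can be sketched from timestamped ASR output: flag immediate word repetitions and long inter-word silences. The token format and the pause threshold below are illustrative assumptions, not the paper's parameters.

```python
# Rule-based detection of simple disfluencies from ASR output with
# per-word timestamps: (token, start_seconds, end_seconds).
def detect_simple_disfluencies(words, pause_threshold=0.5):
    events = []
    for (w1, _s1, e1), (w2, s2, _e2) in zip(words, words[1:]):
        if w1.lower() == w2.lower():
            events.append(("repetition", w2))
        if s2 - e1 > pause_threshold:
            events.append(("pause", s2 - e1))
    return events

asr = [("eu", 0.0, 0.2), ("eu", 0.3, 0.5), ("fui", 1.2, 1.4)]
assert detect_simple_disfluencies(asr) == [
    ("repetition", "eu"),
    ("pause", 1.2 - 0.5),
]
```

Complex disfluencies (blocks, prolongations) cannot be caught by such rules, which is why the second stage relies on acoustic features and learned classifiers.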
This work proposes the identification of social biases in Portuguese in the GPT-4o, GPT-4o-mini, Sabiá-3, and Sabiázinho-3 models, using an esteem metric to assess the models' level of respect and deference toward different demographic groups. The evaluation covers subjects with explicit social markers of gender, race, and Brazilian region, under conditions with and without a technique for bypassing moderation restrictions (jailbreaking). The findings show that the evaluated language models reproduce systematic patterns of differential valuation across social groups, revealing esteem biases associated with gender, race, and region markers in Brazilian Portuguese. Subjects with emphasized social markers, especially racial ones, tend to receive lower esteem. The jailbreaking technique did not have a uniform impact: it could either widen or narrow the esteem differences.
The choice between large-scale, multilingual, foundation models and specialized monolingual models for languages like Brazilian Portuguese (PT-BR) presents a complex trade-off between generalization and specialization. This paper investigates this trade-off through an empirical study across a diverse suite of tasks. We evaluate multiple families of language models under both linear probing and fine-tuning regimes. We find that monolingual encoders exhibit greater "adaptation plasticity" during fine-tuning, improving on both classification and semantic similarity, where global (multilingual) models degrade. However, this plasticity comes at a cost: our tokenization analysis suggests that monolingual models struggle with foreign terms, whereas modern multilingual tokenizers show surprising morphological competence, challenging a long-standing assumption in the field. We conclude that the optimal model choice is a task-dependent trade-off between vocabulary coverage and adaptation flexibility.
This work presents and evaluates two specialized sentence embedding models for the Portuguese legal domain, LexIris-pt and LexBert-pt, obtained through supervised fine-tuning of BERT-based models using pairs of initial petitions. We propose a comparative evaluation protocol along three fronts: (i) zero-shot inference with pretrained embeddings, (ii) supervised fine-tuning on these pairs, and (iii) vector retrieval with incremental clustering over a corpus of 20,000 initial petitions. The results show that fine-tuning consistently increases correlations with reference scores and improves performance in vector retrieval; additionally, the vector retrieval stage indicates that the metric configured in the index (cosine similarity or inner product) can change the granularity of the partitioning under a fixed threshold, reinforcing the need for joint calibration among the encoder, metric and threshold. After auditing by specialists from the partner institution, LexIris-pt and LexBert-pt were operationally adopted to support the screening and organization of repetitive claims and predatory litigation.
Questions and answers are among the most fundamental forms of human communication. Question Answering (QA) is the task of correctly generating answers based on a context. To assess the success of the task, the answers are typically evaluated using traditional metrics such as BLEU, ROUGE, and METEOR. However, these metrics often fail to reflect the actual quality of the outputs. More recently, new evaluation metrics and the LLM-as-a-judge paradigm have also been applied to the evaluation of QA. To gain a deeper understanding of the capabilities and limitations of QA metrics, this work performs a comparative analysis of both traditional and more recent approaches for QA evaluation. Experiments were conducted on the Pirá dataset (in Portuguese) using four LLMs to generate answers. Additionally, human evaluation was performed to assess aspects such as correctness, completeness, clarity, and relevance of the generated content. We demonstrate that lexical metrics are limited in evaluating QA. We also observed that human evaluators favor models that provide higher information density, even when this contradicts prompt constraints, whereas lexical metrics penalize this verbosity. This divergence confirms that traditional metrics are insufficient for capturing the trade-off between instruction adherence and the semantic richness valued by native speakers.
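The limitation of lexical metrics described above can be seen with a toy unigram-F1 score against a single reference. This metric is a simplified stand-in for BLEU/ROUGE-style overlap, not the paper's exact evaluation setup.

```python
# Toy unigram F1: a correct short answer or paraphrase is penalized
# simply for sharing few surface tokens with the reference.
def unigram_f1(candidate: str, reference: str) -> float:
    c, r = set(candidate.lower().split()), set(reference.lower().split())
    overlap = len(c & r)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(c), overlap / len(r)
    return 2 * precision * recall / (precision + recall)

ref = "a capital do brasil é brasília"
assert unigram_f1("a capital do brasil é brasília", ref) == 1.0
# A correct, concise answer scores poorly despite being valid.
assert unigram_f1("brasília", ref) < 0.4
```

This is the same mechanism by which, as the abstract notes, lexical metrics penalize verbosity (or brevity) independently of semantic correctness.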
Semantic re-ranking architectures based on cross-encoders are essential for high-precision Information Retrieval (IR) in the legal domain, but they face a dilemma: their high computational latency renders large-scale applications challenging, particularly in resource-constrained environments. Traditional single-stage approaches force a choice between computational efficiency and ranking quality. This work presents an empirical evaluation of established cascade re-ranking architectures to optimize this balance through the adaptive application of off-the-shelf models of increasing complexity over progressively smaller sets of candidates. We validated the architecture on a corpus of 300,000 legal documents in Portuguese from the Court of Accounts of the State of Goiás (TCE-GO). Experiments demonstrate a 60.3% reduction in latency (from 11.75s to 4.66s per query) compared to the most precise single-stage baseline, with a marginal degradation of only 2 p.p. in R@avg and 0.0224 in MRR@avg. The results validate the semantic funnel as a computationally viable solution for semantic document-to-document search within the specific context of the TCE-GO repository, establishing a baseline for future transferability studies in broader Portuguese legal contexts.
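The cascade ("semantic funnel") idea can be sketched generically: scorers of increasing cost are applied to progressively smaller candidate pools. The toy scorers below are placeholders for, e.g., a sparse retriever, a bi-encoder, and a cross-encoder; none of this is the paper's actual model stack.

```python
# Generic cascade re-ranker: each stage re-scores and truncates the pool.
def cascade_rerank(query, candidates, stages):
    """stages: list of (scorer, keep_top_k) applied in order of cost."""
    pool = list(candidates)
    for scorer, keep in stages:
        pool.sort(key=lambda doc: scorer(query, doc), reverse=True)
        pool = pool[:keep]
    return pool

# Toy scorers of increasing "precision": character vs. word overlap.
cheap = lambda q, d: len(set(q) & set(d))
costly = lambda q, d: len(set(q.split()) & set(d.split()))

docs = ["contrato de obra", "multa por atraso", "parecer técnico"]
top = cascade_rerank("multa de atraso", docs, [(cheap, 2), (costly, 1)])
assert top == ["multa por atraso"]
```

The latency saving comes from the expensive final scorer seeing only the few survivors of the cheap stages, at a small cost in recall, mirroring the trade-off reported in the abstract.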
In this paper we describe the addition of coreference annotation to public literary corpora for the task of character profiling in distant reading. We begin by motivating this task within computational literary studies; we explain how we made the annotation readable and revisable by scholars in Literary Studies by transferring it to BRAT; we describe the first results and a small public annotated corpus; and we discuss the creation of two coreference modules.
In this paper we analyze the structural and linguistic dynamics of online toxicity in Reddit discussion trees, focusing on how trigger comments escalate conflicts in Brazilian Portuguese. Using a fine-tuned BERTAbaporu model, we show that toxic discussions are deeper, more engaging, and initially semantically cohesive, but degrade over time, while non-toxic interactions emphasize social bonding. Our findings contribute to a better understanding of toxicity escalation and support early detection of discursive conflicts.
In this work, we study disentanglement between speaker and environment by combining an adversarial framework with contrastive learning objectives. We investigate supervised contrastive learning (SupCon), which exploits environment labels to structure the environment subspace, and self-supervised SimCLR, which learns invariance from augmented views. Experiments on a controlled synthetic dataset (ST1) and a more realistic corpus (CML-TTS) show that SupCon yields the most discriminative and stable speaker embeddings on ST1, achieving the best verification performance (EER=4.70%, MinDCF=0.24). Overall, our findings emphasize (i) the importance of synthetic benchmarks for diagnosing disentanglement under controlled factor variation and (ii) the effectiveness of combining contrastive and adversarial objectives to encourage speaker representations that are both discriminative and less sensitive to environmental factors.
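The supervised contrastive (SupCon) objective referenced above is commonly written as follows; the notation is the standard formulation from the SupCon literature, not taken from this abstract.

```latex
\mathcal{L}_{\text{SupCon}}
  = \sum_{i \in I} \frac{-1}{|P(i)|}
    \sum_{p \in P(i)}
    \log \frac{\exp(z_i \cdot z_p / \tau)}
              {\sum_{a \in A(i)} \exp(z_i \cdot z_a / \tau)}
```

Here $z_i$ is the normalized embedding of anchor $i$, $P(i)$ is the set of samples sharing the anchor's label (environment labels, in this work's environment subspace), $A(i)$ is the set of all other samples in the batch, and $\tau$ is a temperature. SimCLR is the special case where the only positive is an augmented view of the anchor.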
This paper investigates morphosyntactic covariation in Brazilian Portuguese (BP) to assess whether dialectal origin can be inferred from the combined behavior of linguistic variables. Focusing on four grammatical phenomena related to second-person pronouns, correlation and clustering methods are applied to model covariation and dialectal distribution. The results indicate that correlation captures only limited pairwise associations, whereas clustering reveals speaker groupings that reflect regional dialectal patterns. Despite the methodological constraints imposed by differences in sample size requirements between sociolinguistic and computational approaches, the study highlights the importance of interdisciplinary research: the benefits of developing fair and inclusive language technologies that respect dialectal diversity outweigh the challenges of integrating these fields.
This paper presents a syntactic lexicon of Brazilian Portuguese predicative adjectives that are not regularly derived from verbs. From the 7,000 most frequent adjectives in a large web corpus, 3,161 lexical items were selected and annotated with 36 syntactic properties. These properties were established through introspection and corpus evidence, covering argument structure, copular verbs, prepositions, transformations (e.g., raising, nominalization), semantic roles, and others. The resulting resource constitutes a machine-readable lexicon of predicative adjectives for Brazilian Portuguese.
Extracting structured information from lengthy documents using Large Language Models (LLMs) is computationally expensive and prone to accuracy degradation as input size increases. We present a two-stage pipeline for extracting products from Brazilian tender documents (editais de licitação), combining NLP-based page classification with LLM extraction. We construct a novel dataset of 11,190 annotated pages from 350 documents across five product domains. Our experiments compare transformer-based classifiers (BERTimbau, DistilBERT) with classical machine learning approaches using engineered features. Results show that XGBoost with domain-specific features achieves 97.75% F1-score, outperforming fine-tuned BERT models by over 4 percentage points. The complete pipeline reduces LLM input tokens by 64-88% while maintaining extraction completeness, enabling cost-effective document processing at scale.
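The first-stage filtering idea can be sketched with a hypothetical keyword feature standing in for the engineered features used with XGBoost: only pages classified as likely product listings are forwarded to the LLM, which is where the token reduction comes from. Keywords and threshold below are invented for illustration.

```python
# Cheap page classifier gating an expensive LLM extraction stage.
KEYWORDS = {"item", "quantidade", "unidade", "especificação"}

def is_product_page(page_text: str) -> bool:
    """Hypothetical rule: a page is a product page if it contains at
    least two tender-table keywords."""
    tokens = set(page_text.lower().split())
    return len(tokens & KEYWORDS) >= 2

def filter_pages(pages):
    return [p for p in pages if is_product_page(p)]

pages = [
    "Edital de licitação - disposições gerais",
    "Item 1 Quantidade 10 Unidade cx Especificação papel A4",
]
assert filter_pages(pages) == [pages[1]]
```

With such a gate, LLM cost scales with the number of product pages rather than with total document length.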
Twitter/X remains a key source of user-generated content, requiring Natural Language Processing tools capable of handling non-canonical language. This study presents a manual annotation of lexical and orthographic phenomena in DANTEStocks, a corpus of financial tweets in Brazilian Portuguese, using a hierarchical typology to capture both creative uses and deviations from the standard norm. Results show that orthographic variation is strongly influenced by creative forms, mainly driven by platform- and domain-specific innovations. Standard norm variation is systematic, mostly involving predictable omissions of diacritics and the cedilla, and most tokens exhibit only one phenomenon, reflecting stable and largely independent patterns of variation in this Twitter subgenre. The identified variant forms enabled the construction of a lexicon for evaluating embedding models. We assessed how BERTimbau, Word2Vec, and FastText handle lexical variation in raw, unnormalized data, showing that the lexicon reduces out-of-vocabulary rates and improves coverage. These results highlight model robustness and the value of curated lexical resources in complementing both fixed and data-driven vocabularies.
Multilingual emotional speech synthesis for Brazilian Portuguese remains little explored. This work investigates different approaches for incorporating emotional control into Portuguese-English multilingual synthesis, comparing five variants: the base YourTTS model, fine-tuning with emotional data, conditioning via textual tokens, and the VECL-TTS architecture with emotion embeddings under different configurations. We used emotional datasets in English (RAVDESS, Emotional Speech Dataset) and Brazilian Portuguese (VERBO), totaling 14.4 hours, for fine-tuning from the pre-trained YourTTS model. The evaluation combined objective metrics (emotion- and speaker-embedding similarity) with a subjective evaluation by ten participants. The results show that architecturally simple approaches can achieve perceptual performance comparable or superior to more complex methods: fine-tuned YourTTS obtained the best overall quality, token-based conditioning achieved the highest perceived emotional similarity, and VECL-TTS maximized objective emotional control at the cost of degraded quality and speaker similarity. We also observed a trade-off between emotional control and preservation of vocal identity, as well as discrepancies between objective metrics and human perception. This work demonstrates the feasibility of multilingual emotional transfer to Brazilian Portuguese via fine-tuning with limited resources.
Electoral debates are influential moments in public discourse, providing candidates with a high-visibility platform to present their proposals, contrast their positions, and engage in exchanges that shape voter decisions. In Brazil, these debates reach a broad and diverse audience, reflecting regional, social, and ideological variations that affect linguistic choices and thematic content. This paper presents CoDEl-BR (Corpus de Debates Eleitorais, in Portuguese), a corpus of transcripts from 22 second-round mayoral debates held in 13 Brazilian state capitals during the 2024 municipal elections. It comprises 2,943 transcript segments totaling approximately 32 hours. Exploratory analyses reveal differences in thematic priorities between candidates and voters’ questions, as well as variations by race and party affiliation. The corpus aims to enable research in discourse and argumentation analysis, stance and sentiment detection, polarization modeling, and other related NLP tasks. We demonstrate that this initial release provides a curated, high-quality subset of debates with significant potential for expansion.
We present Causal_QA.PT, a human–LLM co-curated benchmark for causal question answering in Portuguese, addressing the lack of high-quality evaluation resources for causal reasoning in non-English languages. The dataset is developed through a hybrid human–LLM process with targeted generation, validation, and evaluation procedures, and is organized according to the PEARL causal typology. Using this resource, we evaluate the ability of Large Language Models to answer causal questions in Portuguese and examine the role of explicitly providing causal class information in prompt design. Our findings show that current LLMs are capable of producing high-quality causal responses in Portuguese, with GPT-5 Mini in particular demonstrating strong performance in judgment-based evaluation. Explicit causal class information yields model- and question-dependent benefits, particularly for interventional and counterfactual questions. Finally, we observe that human reference answers are not always superior, underscoring the importance of careful benchmark curation and robust evaluation for underrepresented languages.
We present ConsumerBR, a large-scale corpus of consumer complaints and company responses in Brazilian Portuguese, compiled from publicly available data on the Consumidor.gov.br platform. The corpus comprises over 3.1 million consumer–company interactions collected between 2021 and 2025 and combines anonymized textual content with rich structured metadata, including temporal information, complaint outcomes, and consumer satisfaction indicators. We describe a data collection strategy tailored to the platform’s dynamic interface, a preprocessing pipeline that includes response clustering to identify template-based replies, and a hybrid anonymization approach designed to mitigate privacy risks. We also provide a detailed statistical characterization of the corpus, highlighting its scale, coverage, and distributional properties. ConsumerBR is publicly available for research purposes and supports a wide range of applications, including complaint analysis, sentiment modeling, dialogue and response generation, and preference-based evaluation.
The Semantic Web aims to make web data understandable not only to humans but also to machines, enabling more efficient data integration, sharing, and reuse. Linked Open Data (LOD) initiatives have supported this vision by promoting the publication of semantically annotated and interconnected data. However, querying LOD repositories typically requires knowledge of SPARQL, a complex query language that limits access for non-expert users. Although several approaches have been proposed to automatically generate SPARQL queries from natural-language questions, most are designed for English and are tightly coupled to specific domains, which hinders reuse. This article presents a generic, domain-independent approach for generating SPARQL queries from questions written in Portuguese. The proposed method uses reference questions, parameterized query templates, and a synonym dictionary enriched by lexical resources and similarity metrics. The implementation is supported by the Natural2SPARQL tool, and the approach is validated through a case study in the financial domain using real data from the Brazilian stock exchange (B3). The results indicate that the method enables flexible and semantically accurate SPARQL query generation from natural-language input. Unlike learning-based approaches, our method avoids retraining and achieves up to 93.3% end-to-end success in controlled settings, demonstrating robustness and low adaptation cost.
This paper presents the development of Retrato_Cantado, a dataset of sentences extracted from Brazilian song lyrics and manually annotated to identify and categorize predicative constructions that describe individuals. The corpus findings validate the effectiveness of lexical-syntactic patterns for identifying predicative sentences, confirming their suitability for large-scale linguistic annotation tasks. The dataset also serves as a valuable resource for the analysis of textual discourse and the representation of social groups in Brazilian culture. We additionally trained a person-characterization classifier to illustrate the applicability of the dataset to the automatic detection of predicative descriptions, which achieved high accuracy and highlights the potential for creating more specialized models capable of detecting physical and sociocognitive categories, as well as performing sentiment polarity analysis.
As Large Language Models (LLMs) expand across multilingual domains, evaluating their performance in under-represented languages becomes increasingly important. European Portuguese (pt-PT) is particularly affected, as existing training data and benchmarks are mainly in Brazilian Portuguese (pt-BR). To address this, we introduce ALBA, a linguistically grounded benchmark designed from the ground up to assess LLM proficiency on linguistic tasks in pt-PT across eight linguistic dimensions: Language Variety, Culture-bound Semantics, Discourse Analysis, Word Plays, Syntax, Morphology, Lexicology, and Phonetics and Phonology. ALBA is manually constructed by language experts and paired with an LLM-as-a-judge framework for scalable evaluation of pt-PT generated language. Experiments on a diverse set of models reveal performance variability across linguistic dimensions, highlighting the need for comprehensive, variety-sensitive benchmarks that support further development of tools in pt-PT.
This article investigates automatic multi-label classification of Indigenous letters to Brazil into thematic categories. Drawing on the digital collection "Cartas Indígenas ao Brasil", a corpus of 871 letters annotated with 18 categories, we compare three classification approaches: a lexical model (TF-IDF + logistic regression), a contextual model (BERTimbau-base), and a classifier based on large language models (LLMs). To handle corpus imbalance, we employ class-balancing strategies in the neural model. The results reveal a trade-off between precision and recall: the lexical baseline achieves higher precision (0.65), while BERTimbau shows higher recall (0.67), especially for minority categories. Both reach a macro-F1 of 0.42, showing that multi-label classification in this domain is a challenging task, particularly due to corpus imbalance and semantic overlap between categories. The LLM-based classifier attains high recall, especially for minority categories, but tends to overestimate the number of labels per document, reinforcing the precision–coverage trade-off observed in the other two approaches. A detailed per-class analysis reveals complementary behaviors across the models, suggesting that hybrid approaches may overcome the individual limitations of each method. The corpus and experiment scripts will be made publicly available.
The spread of disinformation in digital media requires robust detection mechanisms, a task in which language models perform satisfactorily. However, the literature often features analyses that disregard the degradation of the models' generalization ability on real data different from that on which training or fine-tuning was performed. This work investigates the behavior of the BERTimbau and mBERT models in cross-generalization scenarios (test data different from the training and validation data). To that end, fine-tuning was performed using four Brazilian corpora (Fake.br, Fakepedia, FakeRecogna, and FakeTrueBR). The results confirm the hypothesis that intra-corpus evaluations yield high performance, while cross-corpus evaluations yield low performance and strong degradation under cross-generalization, even though the goal of identifying fake news is maintained. Regarding predictive capacity, BERTimbau proved slightly better on average, with 71% accuracy and 67% F1-score, versus 69% and 64%, respectively, for mBERT.
This work introduces and evaluates JAMEX (Judicial Multi-Agent Metadata Extraction), a multi-agent pipeline for extracting structured metadata from Brazilian court decisions (Espelho do Acórdão), and compares it against a strong single-prompt baseline under an Information Retrieval-only (IR-only) setting. We first ran a pilot on 300 decisions and then reran the experiment on a stratified dataset of n=1,225; completion rates varied across executions, yielding between 779 and 1,216 successfully completed instances, with non-completion concentrated in agentic configurations. Across re-executions, the accuracy impact of agents was strategy-dependent: GPT-5 improves over the baseline in multiple agentic strategies but not across all orchestration variants, while smaller models (Gemma3-12B/Gemma3-27B) show no robust gains. Orchestration refinements motivated by the agent design literature (memory, planning, and directed review) improved traceability, but performance remained sensitive to task decomposition and context splitting. Overall, JAMEX increases token usage and operational complexity, so deployment must balance accuracy, completion reliability, and cost for Portuguese legal metadata extraction.
Automated Essay Scoring (AES) is a central challenge in large-scale educational assessment, such as the Brazilian National High School Exam (Enem), in which essays are scored on multiple competencies. This work presents a comparative analysis of textual representations for competency-level AES in Brazilian Portuguese. We evaluated feature-based models using TF-IDF, linguistic metrics extracted with NILC-Metrix, and a hybrid combination of both, as well as transformer-based models. Experiments were conducted on the Enem-AES corpus, considering both classification and regression formulations. The results indicate that regression formulations are generally more suitable than multiclass classification, as they better accommodate the ordinal structure of the scores. Transformer-based models achieved higher agreement on competencies related to language use and textual cohesion, while feature-based representations showed comparable performance on competencies associated with thematic relevance. Despite achieving high accuracy under Enem's tolerance criterion, all approaches struggled to predict extreme scores, mainly due to corpus imbalance. We therefore conclude that the methodologies are complementary and that hybrid systems are promising for AES.
This article evaluates an end-to-end Retrieval-Augmented Generation (RAG) system for querying regulatory hospital documents in Portuguese. The study analyzes the impact of optimizing each component (retrieval, re-ranking, and generation) in a resource-constrained setting. The methodology combined the creation of a hybrid dataset (synthetic and expert-validated) with quantitative evaluations using metrics such as MRR, NDCG@10, and BERTScore. The results show that the intfloat/multilingual-e5-small embedding model was the most robust, with a retrieval failure rate of only 1.4%. In the re-ranking stage, the RRF method stood out for its balance between computational cost and performance. We conclude that the optimized architecture, integrating these components with the Gemini 2.5 Flash generator, offers an efficient and accurate solution for decision support in hospital environments.
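Reciprocal Rank Fusion (RRF), the re-ranking method this abstract highlights for its cost/performance balance, is simple enough to sketch directly. The smoothing constant k=60 follows common practice and the document ids below are hypothetical:

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: combine several ranked lists of doc ids.

    Each ranking is a list of doc ids ordered best-first. Every ranker
    contributes 1 / (k + rank) to a document's fused score, so documents
    ranked highly by several rankers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, 1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Three hypothetical rankers over the same three documents.
fused = rrf([["a", "b", "c"], ["b", "a", "c"], ["a", "b", "c"]])
```

Because RRF only needs rank positions, not scores, it fuses heterogeneous retrievers (lexical and dense) without score calibration, which is part of why it is computationally cheap.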
Automatic pun detection remains challenging because it depends on lexical ambiguity and contextual interaction, which are not explicitly captured by linear text representations. In Portuguese, TF-IDF-based ensemble methods provide competitive and interpretable baselines, but remain limited by surface-level features. This work investigates whether corpus-based graph information can complement such methods. Three graph representations are constructed from the Puntuguese corpus: a Co-occurrence graph, a PPMI-weighted graph, and a Pun-Context graph. In the current pipeline, each graph is converted into low-dimensional node embeddings with TruncatedSVD, which are then aggregated into document-level features and concatenated with TF-IDF representations in a soft-voting ensemble. Experimental results on the test set show that graph-based enrichment does not uniformly improve performance: Pun-Context and PPMI yield the strongest graph-augmented results, whereas combining all graphs degrades performance. These findings indicate that the usefulness of graph-based information depends strongly on how lexical relations are encoded and aggregated at the document level.
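A minimal sketch of building a PPMI-weighted co-occurrence graph like the one this abstract describes; the windowing scheme and function name are illustrative assumptions, not the paper's exact pipeline:

```python
import math
from collections import Counter

def ppmi_edges(docs, window=2):
    """Build PPMI-weighted edges from token co-occurrence within a window."""
    word_counts = Counter()
    pair_counts = Counter()
    total = 0
    for tokens in docs:
        for i, w in enumerate(tokens):
            word_counts[w] += 1
            total += 1
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                pair_counts[frozenset((w, tokens[j]))] += 1
    edges = {}
    n_pairs = sum(pair_counts.values())
    for pair, c in pair_counts.items():
        if len(pair) < 2:
            continue  # skip a word co-occurring with itself
        a, b = tuple(pair)
        pmi = math.log((c / n_pairs) /
                       ((word_counts[a] / total) * (word_counts[b] / total)))
        if pmi > 0:  # PPMI keeps only positive associations
            edges[(a, b)] = pmi
    return edges
```

In the paper's pipeline, a graph like this would then be factorized (e.g., with TruncatedSVD) into node embeddings and aggregated into document-level features.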
Racist discourse on social media appears both through explicit attacks and subtle, context-dependent forms, remaining a challenge for Natural Language Processing. We introduce RacismoBR, a culturally grounded dataset for detecting racist discourse in Brazilian Portuguese, manually annotated exclusively by Black researchers to ensure sociolinguistic validity and epistemic representativeness. We conduct a controlled evaluation of binary racism classification on our dataset considering several classification modeling paradigms: classical machine learning, supervised Transformer-based (Small) Language Models, and Large Language Models under in-context, few-shot learning. Results show that GPT-4.1 and BERTimbau yield the highest Macro-F1 scores; however, Wilcoxon signed-rank tests reveal no statistically significant differences across models, mostly due to high variability. Across paradigms, classifiers consistently display higher precision for non-racist content and higher recall for racist content. A qualitative analysis highlights persistent difficulties with implicit, euphemized, and context-dependent racism. These findings indicate that culturally grounded annotation plays a more decisive role than architectural sophistication alone in advancing racism detection.
Top-performing Artificial Intelligence models often operate as black boxes. Explainable AI (XAI) can increase transparency, but its evaluation is currently hindered by a lack of annotated explanation data and agreed-upon validation standards. We propose a framework for evaluating the faithfulness of explanations in Portuguese hate speech detection. Our approach is based on the premise that a faithful explanation should identify features whose removal degrades a model's performance. We follow a three-step process: (i) prediction on the original input; (ii) identification and removal of explanatory keywords; and (iii) prediction on the modified input, with performance differences used as an evaluation signal. We conduct experiments using ensemble classifiers, multiple keyword selection strategies, and SHAP and LIME as XAI methods. In addition, Large Language Models (LLMs) are explored both as classifiers and as explainers. Results demonstrate that removing explanatory keywords degrades model performance more than random word removal, indicating explanation faithfulness. Notably, SHAP and LIME consistently provided more faithful explanations than LLM-generated or manual alternatives, although the impact depends on the keyword selection strategy. These findings highlight the importance of standardised, unsupervised evaluation protocols for XAI and the faithfulness limitations of current generative LLM explanations.
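The three-step removal protocol described above can be sketched as a generic harness; the classifier and keyword selectors passed in are hypothetical stand-ins for the paper's models and XAI methods:

```python
def faithfulness_gap(classify, texts, explain_fn, random_fn):
    """Steps (i)-(iii) of the protocol: predict, remove keywords, re-predict.

    classify(text) -> label; explain_fn(text) / random_fn(text) -> sets of
    words to remove. A faithful explainer flips more predictions than
    random removal, so a positive gap signals faithfulness.
    """
    def flip_rate(remove_fn):
        flipped = 0
        for text in texts:
            original = classify(text)
            removed = remove_fn(text)
            reduced = " ".join(w for w in text.split() if w not in removed)
            if classify(reduced) != original:
                flipped += 1
        return flipped / len(texts)
    return flip_rate(explain_fn) - flip_rate(random_fn)
```

With a toy keyword classifier (positive iff "good" occurs), an explainer that selects "good" flips every prediction while random removal flips none, yielding the maximal gap of 1.0.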
Trait-specific automated scoring of essays written for the standardized Brazilian National Entrance Exam (ENEM) has received significant attention in recent years. The task is important both in the classroom, to provide timely and personalized learning feedback, and in the official exam, to make the scoring process more scalable and consistent. State-of-the-art systems approach the task as a purely statistical prediction problem, ignoring the knowledge provided to human graders and test takers in the form of rubrics and guidelines. Aiming to produce more interpretable and informative formative feedback, in this work we leverage the official ENEM Grader's handbook and develop two neuro-symbolic approaches to trait-specific essay scoring. The first approach uses a Large Language Model (GPT-4o) to write an evaluative explanation of the essay score according to the subcriteria described in the guidelines; the explanation is then fed into a statistical model to predict the score; the good scoring performance validates the quality of the explanations. The second approach formalizes the guideline grading rubrics as logical rules that derive the essay score as a function of subcriteria, mimicking the recommended human grader's scoring approach. To provide weak supervision during training and to evaluate the quality of the model, we build a dataset of 63 essays annotated with their subcriteria by two expert human graders. Our empirical results suggest that both approaches perform on par with purely statistical methods while providing more helpful and fine-grained feedback.
Modeling dialectal variation faces challenges when it relies on subword-based language models, which often fail to process the complexity of phonetic transcriptions due to vocabulary constraints and semantic biases. This work introduces dialect2vec, a method for capturing the dialectal diversity of Brazilian Portuguese. Our proposal adopts the token-free ByT5 model to encode International Phonetic Alphabet (IPA) sequences at the byte level, mitigating the information loss caused by unknown tokens. Experiments were carried out on data from the Linguistic Atlas of Brazil (ALiB), in which the phonetic dimension alone proved viable for unsupervised clustering tasks, with performance close to the lexical state of the art (BERTimbau). This shows that byte-based architectures can recover complex dialectal structures exclusively from phonological cues, offering a more granular mapping of linguistic boundaries.
Compression-based language complexity metrics show promise as holistic parameters for measuring linguistic complexity across intra- and cross-linguistic scenarios. Yet, their sensitivity to specific forms of linguistic variation requires further experimental validation. We examine the sensitivity of this metric family to register variation in Portuguese, a phenomenon already established for English. We refine the validation process found in previous literature by introducing a more granular statistical analysis to evaluate both the individual and joint sensitivity of these metrics to register variation at the sentence level. Our results confirm they are highly sensitive to functional variation in Portuguese, exhibiting the same structural morphosyntactic trade-off consistent with that observed in English and in cross-linguistic studies.
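A crude member of the compression-based metric family discussed above can be sketched with the standard library: the compressed-to-raw size ratio of a text. This is an illustrative proxy, not the study's exact metric:

```python
import zlib

def compression_ratio(text: str) -> float:
    """Compressed size over raw size: higher means less redundancy,
    a crude proxy for compression-based complexity metrics."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw, 9)) / len(raw)
```

Highly repetitive text compresses to a small fraction of its size, while varied text does not, which is the intuition behind using compressors to quantify morphosyntactic and lexical complexity across registers.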
Robust text-to-speech (TTS) systems must be trained on speech that mirrors the variability and imperfections of spontaneous dialogue. However, TTS systems trained on existing Brazilian Portuguese datasets are typically limited to clean, scripted, or studio-recorded speech. Certas Palavras (CP) bridges this gap with 70 hours of spontaneous, multi-speaker dialogues from a Brazilian radio program broadcast in the 1980s–1990s. The extensive manual annotation process captures conversational dynamics, including orality markers, filled pauses, and hesitations. Reflecting the analog medium, we also annotated non-verbal phenomena such as musical interference, noise, and segmental corrections, describing a challenging acoustic environment for synthesis. Baseline YourTTS and F5-TTS models were trained on a 9-hour single-speaker subset corresponding to one of the two main program hosts. Objective evaluation shows that the synthesized speech remains intelligible, with moderate WER and CER. In contrast, subjective evaluation reveals a clear gap in perceived naturalness, with lower MOS scores and higher inter-rater variability compared to ground-truth audio. Together, these properties make the dataset a strong benchmark for TTS robustness.
This work investigates the application of the monolingual BERTimbau model to Aspect-Based Sentiment Analysis (ABSA) in Portuguese, aiming to establish a robust baseline for the hotel domain. Two fine-tuning strategies are compared: a pipeline approach (extraction followed by classification) and an end-to-end approach (multi-task with a collapsed tagging scheme). Evaluated on the ABSAPT 2024 competition dataset, the results reveal an architectural trade-off: the pipeline favors recall in aspect extraction (F1: 0.840), while the end-to-end approach prioritizes precision but suffers from class dispersion. The composite analysis shows competitive performance (F-measure of 0.72 for both), offering a starting point for future work on hybrid and generative architectures for Portuguese.
Automatic Speech Recognition (ASR) systems require large amounts of annotated speech, which are difficult to obtain in specialized domains. This paper introduces GARAGEM (General Automotive Real and Artificial speech corpus for Garage Environments and Maintenance), a domain-specific ASR dataset for Brazilian Portuguese focused on automotive repair, combining real speech collected from online sources with synthetic speech generated from curated technical terminology. A reproducible methodology is proposed, encompassing real data acquisition, domain-guided synthetic data generation, dataset consolidation, and ASR model fine-tuning. Experiments conducted with the Whisper, Wav2vec 2.0, and Conformer models show that synthetic data provides improvements when used to complement real recordings. Quantitative and qualitative analyses show reductions in Word Error Rate (WER) and Character Error Rate (CER) and improved recognition of domain-specific terms absent from the real training set. The results indicate that domain-guided synthetic speech is an effective data augmentation strategy for ASR adaptation in specialized and low-resource scenarios.
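Word Error Rate, the headline metric here, is the word-level Levenshtein distance normalized by reference length; CER is the same computation over characters. A generic sketch (the Portuguese phrases in the usage are hypothetical examples):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

For example, a perfect transcript of "trocar o filtro de óleo" scores 0.0, while dropping one of three reference words yields 1/3.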
Dense retrieval is a critical component of Retrieval-Augmented Generation (RAG) systems and is highly sensitive to document representations. In consumer complaint settings, raw interaction texts are often lengthy and noisy, which limits retrieval effectiveness. This paper investigates whether schema-guided structured summaries can improve dense retrieval in RAG. We compare embeddings derived from raw interaction texts and from LLM-generated structured summaries in a controlled evaluation on Portuguese-language consumer complaints. Summary-based retrieval achieves a Recall@1 of 0.527, compared to 0.001 when indexing raw interactions, and reaches Recall@10 of 0.610, demonstrating gains of more than two orders of magnitude. These results show that structured summaries enable more effective and reliable retrieval at low cutoffs, making them particularly suitable for RAG pipelines.
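Recall@k over dense embeddings, as reported above, can be sketched with plain cosine similarity; the vectors below are hypothetical toy embeddings standing in for encoder outputs:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def recall_at_k(query_vecs, doc_vecs, relevant, k):
    """Fraction of queries whose single gold document is in the top-k.

    relevant[i] is the index of the gold document for query i."""
    hits = 0
    for i, q in enumerate(query_vecs):
        ranked = sorted(range(len(doc_vecs)),
                        key=lambda d: cosine(q, doc_vecs[d]), reverse=True)
        if relevant[i] in ranked[:k]:
            hits += 1
    return hits / len(query_vecs)
```

The abstract's comparison amounts to running this metric twice over the same queries: once against an index of raw-interaction embeddings and once against an index of structured-summary embeddings.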
Celpe-Bras is the official Brazilian proficiency exam in Portuguese as an Additional Language (Inep, 2020). The written part of the exam requires participants to produce four texts in response to tasks based on video, audio, and input texts, which means exam preparation must be grounded in text (re)writing practice. Teachers preparing students for the exam face a high volume of texts to grade, and students have few accessible learning resources aligned with the theoretical construct of Celpe-Bras. In this context, and driven by recent advances in Natural Language Processing (NLP), Large Language Models (LLMs), and Artificial Intelligence, this study aims to map and compare methods for the automatic assessment of texts produced for the Celpe-Bras exam. Several models are presented and tested, covering both traditional machine learning algorithms and pre-trained language models such as BERT, BART, and T5. The best results were obtained by adaptations of the BERT model, slightly superior to the remaining models but at considerably higher computational cost.
The development of Small Language Models (SLMs) for Portuguese faces significant challenges in balancing parameter efficiency with specialized capabilities, particularly in mathematical reasoning, where existing models demonstrate limited native competence. This work introduces the first model in the Biatron series, a 345-million-parameter language model specifically optimized for Brazilian Portuguese through strategic data curation rather than brute-force parameter scaling. Using a carefully designed 60-30-10 data mixture combining high-quality Portuguese text from GigaVerbo, chain-of-thought reasoning examples, and mathematical datasets, Biatron was trained on 300 billion tokens using the Megatron-LM framework, achieving 32% Model FLOP Utilization. The model attains an overall score of 0.245 (aggregate performance) on Portuguese-specific benchmarks, coming within 1.6% of Tucano-630M's performance while using 45% fewer parameters. Most significantly, Biatron achieves 7.5% Pass@1 accuracy on mathematical reasoning tasks, more than doubling the performance of Tucano-2.4B (3.5%) despite being nearly seven times smaller. These results validate that strategic data mixing can rival parameter scaling for language model development, establishing a reproducible methodology for efficient AI development in resource-constrained language contexts. To support reproducibility and further research, the final model weights, training logs, and intermediate checkpoints are publicly available.
As part of the institution’s 2024–2027 strategic plan, which includes the objective of understanding how the media portrays the organization to strengthen its public image, this paper investigates the application of deep learning algorithms in sentiment analysis of headline news about a public security institution. Four deep learning methods were applied in combination with three textual representations, resulting in twelve trained models. For each combination, a class-based analysis of the results was conducted. Models using BERT as the textual representation achieved strong performance, with an F1-score of approximately 90%.
This article presents an evaluation of gender bias in machine translation (MT) from English into Portuguese, analyzing the performance of three commercial translators (Google Translate, Microsoft Translator, Amazon Translate) and three general-purpose language models (GPT-3.5 Turbo, GPT-4o-mini, and Llama-3 8B-Instruct). Using the WinoMT test corpus (Stanovsky et al., 2019), the quantitative analysis measured accuracy and bias (ΔG and ΔS) in the translated corpus. The results show that all systems exhibit bias, performing better when translating masculine target entities (positive ΔG) and entities that conform to occupational stereotypes (positive ΔS). The qualitative analysis, grounded in Systemic Functional Theory and focused on the professions 'nurse' and 'physician', reveals how gender bias constructs meanings that diverge from the source sentences with respect to the target entities and compromises referential cohesion. The study validates an evaluation algorithm adapted for Portuguese and reiterates the persistence of bias as a sociotechnical problem (Savoldi et al., 2025b). We conclude by noting the need for continuous evaluation and for assessment methods that consider different MT usage contexts, especially in critical domains, in order to weigh and mitigate harm.
Automatic summarization of financial news in Portuguese lacks reliable reference-free evaluation metrics. While LLM-as-a-Judge approaches are gaining traction, their correlation with human perception in specialized domains remains under-explored. This work evaluates the efficacy of Question Answering (QA) based metrics against a direct LLM-as-a-Judge baseline for Portuguese financial news. We propose a pipeline comparing Lexical, Binary, and Semantic (LLM-based) QA scoring methods, validated against a human ground truth of 50 news items annotated for Faithfulness and Completeness. Our results show that granular QA metrics significantly outperform the monolithic LLM-Judge in evaluating Completeness, with QA-Binary achieving the highest rank correlation (ρ ≈ 0.49 with pessimistic human aggregation). For Faithfulness, we observe a strong ceiling effect in human evaluation, yet the Semantic QA metric demonstrated a "super-human" ability to detect subtle hallucinations (e.g., temporal shifts) missed by annotators. We conclude that decomposing evaluation into atomic QA pairs is superior to holistic judging for the Portuguese financial domain.
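The rank correlation reported above (Spearman's ρ) has a closed form over rank differences; a minimal sketch without tie correction, with hypothetical score lists in the test:

```python
def spearman_rho(x, y):
    """Spearman rank correlation between two score lists (no tie handling)."""
    def ranks(v):
        order = sorted(range(len(v)), key=v.__getitem__)
        r = [0] * len(v)
        for rank, idx in enumerate(order, 1):
            r[idx] = rank
        return r
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    # rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

Because it compares rankings rather than raw scores, ρ is the natural choice for validating automatic metrics against human judgments whose scales differ.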
The Brazilian DataSUS platform provides vast health databases in relational formats that, while operationally efficient, lack the robust representation needed for advanced scientific data management, restricting interoperability. In this paper, we develop a knowledge engineering pipeline using Scenario 2 of the NeOn methodology to extract, process, and transform knowledge from the DataSUS Health Terminology Repository into a formal knowledge graph that adheres to World Wide Web Consortium standards. We illustrate the potential of this formalization by showing how the graph captures the domain's complex relationships. The resulting graph comprises over 1.4 million triples, with approximately 700,000 associations generated solely through logical inference. Our pipeline provides a foundational resource that enables advanced structural and semantic querying in Portuguese.
Although large language models have transformed natural language processing, their computational costs create accessibility barriers for low-resource languages such as Brazilian Portuguese. This work presents a systematic evaluation of Parameter-Efficient Fine-Tuning (PEFT) and quantization techniques applied to BERTimbau for Question Answering on SQuAD-BR, the Brazilian Portuguese translation of SQuAD v1. We evaluate 40 configurations combining four PEFT methods (LoRA, DoRA, QLoRA, QDoRA) across two model sizes (Base: 110M and Large: 335M parameters). Our findings reveal three critical insights: (1) LoRA achieves 95.8% of baseline performance on BERTimbau-Large while reducing training time by 73.5% (F1 = 81.32 vs. 84.86); (2) higher learning rates (2e-4) substantially improve PEFT performance, with F1 gains of up to +19.71 points compared to standard rates; and (3) larger models show twice the quantization resilience (loss of 4.83 vs. 9.56 F1 points). These results demonstrate that encoder-based models can be efficiently fine-tuned for extractive Brazilian Portuguese question answering with substantially lower computational cost than large generative LLMs, promoting more sustainable approaches aligned with Green AI principles. An exploratory evaluation of Tucano and Sabiá on the same benchmark shows that although generative models can achieve competitive F1 scores with LoRA fine-tuning, they require up to 4.2 times more GPU memory and three times more training time than BERTimbau-Base, reinforcing the efficiency advantage of smaller encoder-based architectures for this task.
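The LoRA forward pass evaluated above amounts to adding a scaled low-rank update to a frozen linear map. A dependency-free sketch with toy nested-list matrices (real implementations use tensor libraries such as PEFT; the shapes and alpha value here are illustrative):

```python
def matvec(M, x):
    """Plain matrix-vector product over nested lists."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(x, W, A, B, alpha=4, r=2):
    """LoRA: frozen dense output W@x plus scaled low-rank update B@(A@x).

    Only A (r x d_in) and B (d_out x r) are trained, so the trainable
    parameter count scales with r instead of d_in * d_out.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]
```

At initialization B is all zeros, so the adapted model reproduces the frozen model exactly and fine-tuning starts from the pre-trained behavior.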
Structured Sentiment Analysis (SSA) aims to extract fine-grained opinion structures as tuples (holder, target, expression, polarity). While recent advances have improved SSA for English, Brazilian Portuguese lacks dedicated resources. This paper presents an exploratory study introducing a manually annotated dataset of hotel reviews for SSA in Brazilian Portuguese. We propose a baseline approach fine-tuning the BERTimbau model under a BIO tagging scheme to extract sentiment spans. Unlike traditional approaches that model relations explicitly, we assess the viability of span-level extraction as a first step for SSA in this language. Experimental results using a strict train/validation/test split show that our approach achieves a span-level F1-score of 48.41 for holder extraction and a macro F1-score of 61.52. We also discuss the linguistic challenges of holder extraction in Portuguese, specifically regarding implicit subjects (pro-drop), and provide a detailed error analysis. These results establish a preliminary baseline for future relation-aware models in Portuguese.
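Span extraction under a BIO scheme, as in the SSA baseline above, reduces to decoding tag sequences into labeled spans. A minimal, library-free sketch of that decoding step follows; the lenient handling of stray I- tags (starting a new span) is an assumption, not necessarily the paper's convention.

```python
def bio_to_spans(tags):
    """Decode a BIO tag sequence into (label, start, end) spans,
    end-exclusive. A stray I- tag with no matching open span
    starts a new span (lenient decoding)."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or (tag.startswith("I-") and label != tag[2:]):
            if label is not None:
                spans.append((label, start, i))   # close previous span
            label, start = tag[2:], i
        elif tag == "O":
            if label is not None:
                spans.append((label, start, i))
            label, start = None, None
    if label is not None:                          # span reaching sequence end
        spans.append((label, start, len(tags)))
    return spans

print(bio_to_spans(["O", "B-HOLDER", "I-HOLDER", "O", "B-EXP"]))
# -> [('HOLDER', 1, 3), ('EXP', 4, 5)]
```

Predicted spans decoded this way can then be compared against gold spans for the span-level F1 the paper reports.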
Encoder-based language models remain essential for natural language understanding tasks such as classification, semantic similarity, and retrieval-augmented generation. However, the lack of high-quality monolingual encoders for Brazilian Portuguese poses a significant challenge to performance. In this work, we systematically explore the training of Portuguese-specific encoder models from scratch using two modern architectures: DeBERTa, trained with Replaced Token Detection (RTD), and ModernBERT, trained with Masked Language Modeling (MLM). All models are pre-trained on the large-scale Jabuticaba corpus. Our DeBERTa-Large model achieves results comparable to the state-of-the-art, with F1 scores of 0.920 on ASSIN2 RTE and 0.915 on LeNER. Crucially, it matches the performance of the 900M-parameter Albertina model while utilizing significantly fewer parameters. We also release custom tokenizers that reduce token fertility rates compared to multilingual baselines. These findings provide evidence that careful architectural choices and monolingual tokenization can yield competitive performance without massive model scaling.
Small language models (SLMs) are increasingly adopted for machine translation due to their lower computational and deployment costs, yet focused, systematic evaluation for English-to-Portuguese remains limited. We benchmarked dozens of SLMs (135M–20B parameters) across multiple architectures and quantization schemes (FP16, Q8_0, Q4_K_M) on two datasets: FLORES-101 (Portuguese subset, 1,012 sentences) and the multidomain OPUS-100 dataset (~10k sentences). We computed lexical and semantic metrics (BLEU, chrF, and BERTScore) and assessed statistical differences using non-parametric Friedman tests over paired sentence-level scores, followed by Wilcoxon signed-rank post-hoc comparisons with Holm correction; normality assumptions were evaluated using the Shapiro–Wilk test. Our results strongly suggest that 8-bit quantization (Q8_0) preserves semantic quality with negligible average loss. Although 4-bit quantization (Q4_K_M) reaches statistical significance in roughly half of the model configurations, paired effect sizes (Cliff’s δ) remain negligible to small in magnitude, with measurable degradation concentrated in lower-capacity models. Model scale exhibits only a weak correlation with translation quality: medium-sized models can match or outperform larger ones depending on model family and pretraining. These findings highlight trade-offs between efficiency and quality and inform the design of practical English-to-Portuguese translation pipelines based on SLMs.
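The Holm correction applied after the Wilcoxon post-hoc comparisons above is a step-down procedure over the family of p-values. A minimal sketch, with an illustrative interface rather than the authors' actual analysis code:

```python
def holm_correction(pvalues, alpha=0.05):
    """Holm step-down procedure: sort p-values ascending and compare
    the k-th smallest (0-indexed) against alpha / (m - k); reject
    hypotheses until the first comparison fails, then stop."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for k, i in enumerate(order):
        if pvalues[i] <= alpha / (m - k):
            reject[i] = True
        else:
            break                      # step-down: all later tests fail too
    return reject

print(holm_correction([0.01, 0.04, 0.03, 0.005]))
# -> [True, False, False, True]
```

Compared with a plain Bonferroni cut-off of alpha/m for every test, the step-down thresholds are progressively less strict, which is why Holm is uniformly more powerful while still controlling the family-wise error rate.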
Large Language Models (LLMs) have introduced reasoning capabilities through multi-step problem-solving processes. These models predominantly perform reasoning in English, limiting their effectiveness in other languages. This paper introduces Bode Reasoning, a Portuguese-language reasoning approach built upon fine-tuned Qwen3-4B and Qwen3-4B-Thinking models, and the Bode Reasoning Portuguese Dataset, comprising 13,961 instances from Brazilian examinations and translated datasets. Through supervised fine-tuning, the proposed approach successfully shifts the reasoning process to Brazilian Portuguese while reducing output verbosity. Experimental evaluation demonstrates that fine-tuned models generate Portuguese reasoning in 86-98.7% of outputs and achieve superior lexical alignment with reference answers. However, this specialization results in moderate mean G-Eval and accuracy degradation across diverse multiple-choice question types, highlighting inherent trade-offs in adapting multilingual reasoning models.
This study evaluates the ability of large language models (LLMs) to detect incoherence between the text of product reviews and their assigned rating (1 or 5 stars). Using popular LLMs such as GPT-5, Llama-4 and DeepSeek-3.2, as well as models optimized for Brazilian Portuguese, Sabiá-3.1 and Bode-3.1, we show that some are capable of detecting incoherence between texts and ratings (F1 > 90%) in a zero-shot protocol. The models also show high agreement in their predictions, with several prediction rounds yielding low variability (Fleiss’ κ > 0.95). With the demonstrated incoherence present in all product categories (approx. 10% of comments), the results suggest that LLMs are highly promising for this semantically demanding interpretation task and can serve as valuable tools for online monitoring and recommendation systems.
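Fleiss' κ, used above to quantify agreement across prediction rounds, can be computed directly from a ratings-count matrix. A minimal pure-Python sketch, for illustration only:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a ratings matrix: counts[i][j] is the number
    of raters (here, prediction rounds) assigning item i to category j;
    every item must have the same total number of raters."""
    n = len(counts)                # items
    r = sum(counts[0])             # raters per item
    k = len(counts[0])             # categories
    # Per-item observed agreement P_i and overall category proportions p_j
    p_i = [(sum(c * c for c in row) - r) / (r * (r - 1)) for row in counts]
    p_j = [sum(row[j] for row in counts) / (n * r) for j in range(k)]
    p_bar = sum(p_i) / n           # mean observed agreement
    p_e = sum(p * p for p in p_j)  # chance agreement
    return (p_bar - p_e) / (1 - p_e)

# Three rounds, two categories (coherent / incoherent), perfect agreement:
print(fleiss_kappa([[3, 0], [0, 3]]))
# -> 1.0
```

κ > 0.95, as reported in the abstract, means the repeated prediction rounds are nearly interchangeable, i.e., model decisions are stable rather than noisy.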
Masked Diffusion Language Models (MDLM) have recently demonstrated that discrete diffusion can achieve competitive performance in text generation. However, training these models remains computationally expensive, particularly for lower-resourced languages like Portuguese. In this work, we adapt REPresentation Alignment (REPA), a technique originally proposed for vision, to the textual domain. We systematically evaluate the impact of aligning the internal representations of a Portuguese MDLM with those of pretrained teacher encoders (e.g., Qwen, BERTimbau). Our experiments show that REPA significantly accelerates training and improves final perplexity by 28.6% compared to a baseline without alignment. We also identify optimal hyperparameters, finding that mid-level alignment with modern teacher encoders yields the best results.
Automatic assessment of reading in children who are learning to read is challenging due to the lack of data and the high variability of children’s speech. This work investigates the improvement of Automatic Speech Recognition (ASR) models for the analysis of reading decoding of isolated words in Brazilian Portuguese. We propose a methodology based on fine-tuning Wav2Vec2.0 models, with a paradigm shift from orthographic to phonemic transcription. Using a novel corpus of 5,400 audio word samples from children in the 2nd and 3rd grades of Elementary School, we compare Portuguese and multilingual pre-trained models. Results reveal that the phonemic approach, combined with fine-tuning strategies, data augmentation, and adapted tokenization, significantly reduces the Phoneme Error Rate (PER). This overcomes the limitations of commercial tools and validates the use of ASR for the detailed diagnosis of decoding errors and phonological acquisition.
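The Phoneme Error Rate reported above is an edit distance over phoneme sequences, normalized by the reference length. A minimal sketch of the metric, illustrative rather than the study's actual evaluation code:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (single-row DP)."""
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev_diag, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cur = min(d[j] + 1,                           # deletion
                      d[j - 1] + 1,                       # insertion
                      prev_diag + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev_diag, d[j] = d[j], cur
    return d[-1]

def per(ref, hyp):
    """Phoneme Error Rate: edit distance / reference length."""
    return edit_distance(ref, hyp) / len(ref)

# One substituted phoneme out of four (illustrative phoneme strings):
print(per(list("kaza"), list("kasa")))
# -> 0.25
```

Scoring at the phoneme level is what makes the diagnosis fine-grained: a single substitution is localized to the specific phoneme the child misdecoded, instead of flagging the whole word as wrong.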
Idiomatic expressions are a well-known challenge for neural machine translation, including both traditional sequence-to-sequence models and large language models (LLMs). This paper presents a systematic approach to improve idiom translation between Spanish and Galician. First, we build a high-quality parallel dataset of idioms manually aligned across both languages. Then, we automatically extend this dataset into a large synthetic parallel corpus using LLMs, following a strategy that prioritizes the most frequent idioms observed in authentic corpora. This augmented dataset is used to retrain a seq2seq translation model. We evaluate the resulting system and compare it both to the baseline model without idiom data and to state-of-the-art LLM-based translators such as SalamandraTA. Results show that the translation of idioms improves significantly after the training, alongside a slight boost in the model’s overall performance.
Natural language interfaces supported by LLMs have been used to translate user questions into SQL queries, but sending the complete database schema in each prompt entails high token consumption and computational cost, especially in corporate databases with hundreds of tables. This work presents a multi-agent Text-to-SQL architecture with dynamic context windows, which combines RAG and metadata dictionaries to select, at query time, only the relevant tables and columns. In a case study with Firebird enterprise databases, the approach reduces the number of processed tokens by an average of 84.4%, resulting in more efficient queries without loss of quality, thereby contributing to the democratization of access to corporate databases.
We present the first public, user-friendly system for Galician poetry scansion, a symbolic system derived from a well-performing mixed-meter Spanish scansion library. We adapted its resources to Galician and added a preprocessing module. The system achieves 88% per-line accuracy in exact stress-pattern match on data unseen during development, and has practical value: First, it helps create a large annotated corpus to train scansion systems. Second, its web interface can help engage a non-specialist public. Third, its current accuracy is helpful for annotating large volumes of poetry and studying metrical trends in Computational Literary Studies use cases.
The use of large language models (LLMs) for complex mathematical reasoning is an emergent area of research, with fast progress in methods, models, and benchmark datasets. However, most mathematical reasoning evaluations exhibit a significant linguistic bias, with the vast majority of benchmark datasets being exclusively in English or (at best) translated from English. We address this limitation by introducing MATH-PT, a novel dataset comprising 1,729 mathematical problems written in European and Brazilian Portuguese. MATH-PT is curated from a variety of high-quality native sources, including mathematical Olympiads, competitions, and exams from Portugal and Brazil. We present a comprehensive benchmark of current state-of-the-art LLMs on MATH-PT, revealing that frontier reasoning models achieve strong performance on multiple-choice questions compared to open-weight models, but that their performance decreases for questions with figures or open-ended questions. To facilitate future research, we release the benchmark dataset and model outputs.
Legal systems produce large volumes of high-stakes decisions in unstructured natural language, making large-scale empirical analysis costly, difficult to reproduce, and unevenly accessible. This bottleneck is especially acute for legal analytics and policy evaluation in low-resource languages such as Portuguese. To address it, we present a resource-efficient pipeline for information extraction from Brazilian criminal case law that reuses a legacy dataset to fine-tune open-weight LLMs with Q-LoRA. Operating in a small-data setting and using schema-constrained JSON generation, the pipeline extracts 47 legal variables spanning charges, evidence, and sentencing outcome. In held-out evaluation, a fine-tuned Phi-4 (14B) model achieves 92.8% accuracy and 0.826 macro-F1, approaching proprietary baselines while retaining the cost and privacy benefits of local deployment. We then use the extracted data in a case study of the short-term effects of a recent Brazilian Supreme Court ruling on drug decriminalization, finding no statistically significant change in trafficking-conviction rates (p≥0.05), a pattern consistent with short-run institutional inertia. More broadly, the paper contributes a reproducible framework for legal NLP and shows how legacy empirical datasets can support scalable legal analytics under severe resource constraints.
Towards improving metadata in academic repositories, this study evaluates the efficacy of different transformer-based models in the automatic classification of the Field of Science and Technology (FOS) of academic theses written in Portuguese. We compare the performance of four different encoder models, two multilingual and two Portuguese-specific, against five larger decoder-based LLMs, on a dataset of 9,696 theses characterized by their title, keywords, and abstract. Fine-tuned encoder-based models achieved the best scores (F1 = 88%), outperforming general-purpose decoder models prompted for the task. These results suggest that, for localized academic domains, task-specific fine-tuning remains more effective than general-purpose LLM prompting.
The anthropomorphization of Artificial Intelligence systems has become particularly relevant in Portuguese Natural Language Processing contexts, where expressions such as "the model understands" or "the system hallucinates" can generate conceptual misunderstandings, contributing to a mistaken perception of the models' capabilities. This article proposes a terminological framework for describing Portuguese Natural Language Processing systems without resorting to anthropomorphic metaphors, presenting a set of linguistic reformulations intended to improve conceptual precision and Artificial Intelligence literacy.
This paper investigates whether injecting semantic structural knowledge of low-resource or unfamiliar languages into Large Language Models (LLMs) enhances performance on downstream Text-to-SQL tasks. We evaluate our approach on Galician, a Romance low-resource language, and, to demonstrate its generality, also on Guarani, a (very) low-resource language of an entirely distinct linguistic profile. Our empirical results show that semantically-aware models consistently outperform baselines across all benchmark metrics.
The safe deployment of Large Language Models remains challenging in multilingual settings, particularly when models are exposed to adversarial or malicious prompts in underrepresented languages. In this work, we present Curupira, a Brazilian Portuguese-language guard model designed to mitigate harmful prompt exploitation. To do this, we establish a three-step methodology that involves adaptation, data generation, and fine-tuning. We also evaluate our model against two state-of-the-art open guardrail architectures. The results show that targeted fine-tuning leads to consistent improvements in safety classification for Portuguese prompts, with favorable efficiency–performance trade-offs for compact models and limited degradation in cross-lingual evaluation.
Quantization is key for efficient LLM inference, but its language-specific effects are understudied. We compare INT8 and FP8 (E4M3) quantization for Meta-Llama-3-8B on English and Brazilian Portuguese (PT-BR). INT8 with outlier handling preserves perplexity in both languages, while naive FP8 casting degrades English far more than PT-BR (+18% vs. +3.9%). Activation analysis shows rarer, larger English spikes (>35) that are more prone to saturation under unscaled E4M3, whereas PT-BR activations are more concentrated. Our FP8 results reflect a naive casting stress test (no calibration/scaling), not an optimized FP8 recipe.
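The outlier handling that distinguishes robust INT8 from naive casting in the abstract above can be illustrated with an absmax scheme that keeps activation spikes in full precision, in the spirit of mixed-precision decomposition. This is a toy sketch over Python lists; the threshold value and data layout are assumptions, not the paper's actual setup.

```python
def quantize_int8(x, outlier_threshold=30.0):
    """Absmax INT8 quantization with simple outlier handling: values
    beyond the threshold bypass quantization and stay in full precision;
    the rest are scaled into [-127, 127] and rounded."""
    inliers = [v for v in x if abs(v) <= outlier_threshold]
    scale = max((abs(v) for v in inliers), default=1.0) / 127.0 or 1.0
    q, outliers = [], {}
    for i, v in enumerate(x):
        if abs(v) > outlier_threshold:
            outliers[i] = v            # spike kept in full precision
            q.append(0)
        else:
            q.append(round(v / scale))
    return q, scale, outliers

def dequantize_int8(q, scale, outliers):
    return [outliers.get(i, qi * scale) for i, qi in enumerate(q)]
```

Without the bypass, a single spike (such as the >35 English activations described above) would inflate the scale and crush the resolution available to the small, concentrated values, which is exactly the saturation failure the naive FP8 cast exhibits.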
Hate speech detection is often treated as a binary task, ignoring the hierarchical nature of toxicity, such as severity levels and specific target groups. This work presents a Multitask Learning (MTL) approach for the HateBR dataset, utilizing a shared BERTimbau encoder to simultaneously predict binary offensiveness, ordinal severity, and hate speech targets. Our experiments demonstrate that the MTL architecture outperforms Single-Task baselines on the primary offensive detection task, increasing the Matthews Correlation Coefficient from 0.80 to 0.82. Beyond predictive performance, we show that joint training implicitly enforces hierarchical sanity: the unified model yields a 0% target-inconsistency rate (i.e., no cases where a comment is predicted Non-offensive while still assigned a hate target). However, we observe negative transfer in the fine-grained multilabel target task (Micro-F1 drops from 0.59 to 0.42), highlighting a trade-off between logical consistency and target attribution under extreme imbalance.
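The target-inconsistency rate reported above (a comment predicted non-offensive yet still assigned a hate target) is a simple joint check over the multitask heads' outputs. A minimal sketch; the prediction encoding (a binary offensiveness flag plus a multilabel target vector) is an assumption for illustration.

```python
def target_inconsistency_rate(predictions):
    """Fraction of predictions where a comment is labeled non-offensive
    (offensive == 0) yet still assigned at least one hate-speech target.
    Each prediction is (offensive_flag, target_multilabel_vector)."""
    bad = sum(1 for offensive, targets in predictions
              if offensive == 0 and any(targets))
    return bad / len(predictions)

preds = [(0, [0, 0, 0]),   # consistent: non-offensive, no target
         (1, [1, 0, 0]),   # consistent: offensive with a target
         (0, [0, 1, 0]),   # inconsistent: non-offensive but targeted
         (1, [0, 0, 0])]
print(target_inconsistency_rate(preds))
# -> 0.25
```

The abstract's finding is that joint MTL training drives this rate to 0% without an explicit constraint, whereas independently trained single-task heads can contradict each other.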
Recent advances in the field have revolutionized Question Answering (QA). However, for languages like Portuguese, progress is often hindered by the lack of native training resources. To address this gap, this paper introduces LARI, a new dataset designed to benchmark and enhance QA in Portuguese. Our methodology combines the capabilities of the Sabiá-7B model, fine-tuned via QLoRA on a domain-specific corpus, with human validation. We utilized the book Natural Language Processing – Concepts, Techniques, and Applications in Portuguese (2nd Edition) as a case study for content extraction. The generated instances underwent expert human evaluation, achieving an average quality score of 4.47 out of 5. The final dataset, comprising 464 context-question-answer triples, is made publicly available to the community, offering a valuable resource for future research in low-resource settings.
This work analyzes the evolution of linguistic patterns in the abstracts of Portuguese-language papers of the Sociedade Brasileira de Computação between 2020 and 2025, based on linguistic metrics from NILC-Metrix. We applied 72 metrics to a set of more than 10,000 abstracts and performed statistical comparisons (t-tests) between the reference period (2020–2022) and the subsequent years. The results indicate transformations starting in 2023, including structural simplification, increased lexical density, reconfiguration of discursive strategies, and changes in the use of connectives. In 2024 and 2025, more than 95% of the papers exhibit multiple metrics significantly different from the reference period.
The Libras-UFPel Corpus is a multimodal, multilayer parallel resource designed for the documentation and computational analysis of Brazilian Sign Language (Libras) in systematic alignment with written Portuguese. By integrating controlled recordings with naturalistic data from the Inventário Nacional de Libras-Pelotas, the corpus ensures interoperability through shared methodological standards. The dataset currently comprises 4,800 controlled audiovisual records (2,400 sentences and 2,400 isolated signs) fully paired with Portuguese translations, supplemented by approximately 10 hours of spontaneous interaction from three new naturalistic interviews, currently in the editing phase. To date, 1,200 controlled sentences have been lemmatized, gloss-annotated and translated, providing a structured parallel subset for Libras-to-Portuguese Sign Language Processing tasks such as recognition and machine translation. The annotation model follows a hierarchical structure covering lexical, partially lexical, and non-lexical signs, including independent tiers for non-manual markers. By bridging descriptive linguistics and Natural Language Processing, Libras-UFPel Corpus serves as a reference source for bilingual data-driven modeling, advancing digital inclusion and linguistic accessibility.
Supervised models trained on community-labeled data have shown promise in Health Question Answering (HQA), but relying on “likes” as a proxy for clinical usefulness remains controversial. This work investigates the alignment between automated predictions and human perception in Portuguese HQA. Using a subset of the SaudeBR-QA corpus, we compare a Random Forest classifier against a controlled evaluation conducted by laypeople and healthcare professionals. Our results reveal a recurring divergence that we term Superficiality Bias: human evaluators frequently validate very brief answers, whereas the classifier often labels these cases as non-useful under its learned criteria. Rather than indicating that the model is inherently more clinically accurate, this pattern suggests a misalignment between community feedback and feature-driven utility judgments. We argue that crowd-based labels in medical domains should be treated cautiously and complemented with more rigorous annotation protocols.
While scaling laws suggest increasing model and dataset sizes for better results, efficient pre-training techniques for low-resource scenarios present unique challenges that require further investigation. This work introduces FlexQwen, a model based on the Qwen 3 architecture adapted for a hybrid causal-masked objective, and the Carolina Originality dataset, a subset of the Corpus Carolina tailored for efficient pre-training in Portuguese. We investigate two primary research questions: the influence of hybrid causal-masked modeling and the impact of text originality on model performance. Our experiments compare a high-originality Gold split against a length-matched control group. Results indicate that hybrid objectives may be viable for efficient training. Furthermore, we provide open access to our code, datasets, and training logs to foster further research in efficient Portuguese LLMs.
Clinical narratives written in free text contain valuable information for patient care. However, their unstructured nature and linguistic variability pose significant challenges for automatic processing and interoperability. In particular, mapping clinical terms to standardized terminologies such as SNOMED Clinical Terms (SNOMED CT) remains difficult for languages other than English, including Brazilian Portuguese. This paper presents NormaTex-MapSNOMED, a proposed component of the NormaTex framework that focuses on mapping clinical terms to predefined categories aligned with SNOMED CT. Given previously extracted terms, the method leverages large language models (LLMs) guided by a structured prompt to assign terms to target categories. Experiments were conducted on Portuguese-language clinical narratives and evaluated using three complementary strategies: lexical similarity based on Levenshtein distance, contextual similarity using a BERT-based model, and semantic validation using LLMs. The results show that LLM-based evaluation consistently outperforms lexical and contextual baselines across different models, with higher precision observed for disease-related terms compared to symptom-related expressions. These findings indicate that LLMs are a promising approach for semantic mapping of clinical terms in Brazilian Portuguese and can support clinical term normalization and interoperability with standardized terminologies.
The growing volume and complexity of legal texts highlight the need for automatic methods capable of extracting structured information from unstructured documents. Motivated by the limited availability and high cost of annotated legal data, this challenge is even more severe for the Portuguese language. This work investigates whether prompt engineering over Large Language Models (LLMs) can effectively support legal Named Entity Recognition (NER) in low-supervision and low-resource settings through In-Context Learning (ICL). Using the LeNER-Br corpus, we evaluate category-specific prompts, different chunking sizes, and prompt engineering strategies. Entity-level evaluation using Exact Match Micro F1 shows that prompt engineering has a stronger impact on performance than other strategies. The best results were obtained with larger models, the 4-bit quantised Qwen-2.5:32B and GPT-5.2, achieving scores of 57.9% and 71.9%, respectively, highlighting the potential of this approach as an alternative to traditional supervised NER pipelines.
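Entity-level Exact Match Micro F1, as used in the legal NER evaluation above, pools exact matches across all documents before computing precision and recall. A minimal sketch; the (type, start, end) tuple encoding is illustrative.

```python
def exact_match_micro_f1(gold, pred):
    """Micro-averaged precision/recall/F1 over exact (type, start, end)
    entity matches, with counts pooled across all documents."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g, p = set(g), set(p)
        tp += len(g & p)       # exactly matched entities
        fp += len(p - g)       # spurious predictions
        fn += len(g - p)       # missed gold entities
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [[("PER", 0, 2), ("ORG", 5, 7)]]
pred = [[("PER", 0, 2), ("ORG", 5, 8)]]   # boundary off by one token
print(exact_match_micro_f1(gold, pred))
# -> (0.5, 0.5, 0.5)
```

Exact matching is strict: a boundary that is off by a single token counts as both a false positive and a false negative, which is one reason prompted-LLM NER scores like the 57.9%–71.9% above are hard-won.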

Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 2

Reviewing academic works is a crucial yet costly stage in the training of researchers. Previous work has achieved good results with automated review approaches in English. In this context, we present ARAMIS, a multi-agent tool based on open-source Large Language Models (LLMs), designed to review undergraduate theses (Trabalhos de Conclusão de Curso, TCC) in Portuguese. The solution focuses on three pillars: grammatical correction, logical coherence, and methodological rigor, allowing the user to receive structured reviews for each chosen pillar. Even at an experimental stage, tests achieved excellent usability results on the System Usability Scale (SUS), with a score of 90.5/100.
Adult Learning (AL) programmes need short, trustworthy texts that match learners’ reading abilities, but educators rarely have time, tools, or evidence-based guidelines to select and adapt materials consistently. We present a live demo of iRead4Skills for European Portuguese: a web-based system that (i) estimates readability/complexity for AL-oriented levels aligned with CEFR, (ii) highlights where complexity concentrates (lexical, grammatical, semantic), and (iii) supports rewriting by offering actionable, level-aware suggestions and curated lexical resources. The demo emphasises transparency and “trainer-first” workflows: users see *why* a text is complex and *how* to revise it without losing meaning.
This demo showcases a web-based interface that provides open, interactive access to a large-scale grammatical database of European Portuguese verbal constructions. Through a unified search and exploration environment, users can query, inspect, and compare more than 7,000 distributionally free verbal constructions and over 2,700 verbal idioms (frozen constructions), grounded in long-standing Lexicon–Grammar descriptions. For each construction, the interface exposes core linguistic properties such as argument structure, distributional constraints, semantic roles, major syntactic transformations, and curated usage examples with English translations. The demo illustrates how detailed, manually validated grammatical knowledge can be explored dynamically via the web, supporting linguistic research, language teaching, and NLP development. To the best of our knowledge, this is the largest publicly accessible, web-based grammatical resource dedicated to European Portuguese verbal constructions.
This paper describes Bruna, a data-centric smart voice assistant powered by multiple Large Language Models designed to support Stilingue and Blip products. Our architecture provides an enriched conversational experience, delivering strategic insights in real-time.
Analyzing large conversational datasets is often inefficient due to the linear nature of text, which hinders the tracking of interaction evolution over time. To address this, we present FlowDisco, an interactive platform for the automatic discovery and exploration of dialogue flows. The framework uses semantic embeddings and modular clustering to transform raw text into probabilistic dialogue flows. By providing a web interface with dynamic filtering and a suite of analytical metrics, FlowDisco simplifies the visual identification and validation of conversational behaviors at scale. The platform’s utility is demonstrated through real-world application scenarios, including customer support interactions and multi-party political debates, where it successfully uncovers complex patterns and sentiment shifts that traditional sequential analysis often overlooks.
This paper presents AttentionApp, an interactive demonstration system designed to support the inspection and linguistic analysis of attention mechanisms in Transformer-based language models for Portuguese. The tool allows users to input sentences in Portuguese and visualize attention distributions across layers and heads, enabling fine-grained qualitative analysis of syntactic and semantic patterns captured by the model. AttentionApp is intended as a research-oriented tool, facilitating exploratory analysis, hypothesis generation, and interpretability studies for Portuguese Natural Language Processing.
This article presents a multimodal computational system, named NOAH, to support disaster risk management (DRM) in Brazilian cities, addressing the need for information exchange and communication between public DRM agents and members of the population in risk and disaster situations. The system is being developed through the application of artificial intelligence (AI), integrating a chatbot with natural language processing (NLP), speech recognition, image classification, and information retrieval via retrieval-augmented generation (RAG). The system focuses on direct communication with the population via WhatsApp, enabling the collection of Portuguese-language reports in text, audio, and image formats. NOAH's practical contribution lies in combining a topic modeling technique (BERTopic) for text classification, Whisper Small for audio transcription, and ResNet50 convolutional neural networks for visual analysis of the incident type. This approach enables the development of a practical and scalable tool to support decision-making by the municipal Civil Protection and Defense agencies responsible for DRM, contributing to a more efficient response to emergency situations in Portuguese-speaking localities.
This work presents Lispector, a family of language models specialized in grammatical and spelling revision for Brazilian Portuguese. We compare two inference strategies for the grammatical text-correction task with large language models (LLMs): (1) supervised fine-tuning and (2) few-shot prompting of larger-scale models. Using a dataset of 4,500 pairs of real user texts (2,500 records for training, 1,000 for evaluation, and 1,000 for testing), with references corrected by linguists, we analyze two Lispector variants of different parameter sizes. The evaluation employed the BLEU, GLEU, METEOR, and ROUGE metrics. The results show that smaller models subjected to supervised fine-tuning consistently outperform larger prompting-only models on all metrics, with Lispector small achieving substantial gains on textual-similarity metrics such as GLEU (+12%) and BLEU (+13%). Beyond the performance gains, the fine-tuned models exhibit more predictable and conservative behavior, desirable characteristics in industrial assisted-writing applications. Regarding latency, Lispector small achieved the lowest median response time among all models and the lowest P95 among the fine-tuned ones; Lispector large was also competitive. These findings indicate that, for specific text-revision tasks in Brazilian Portuguese, fine-tuning can offer significant advantages in performance and computational efficiency.
Large Language Models (LLMs) are effective text generators but hallucinate legal citations at non-trivial rates, a failure mode with serious consequences in legal practice. In Brazilian Portuguese the risk is amplified by citation variability (juridiquês), fragment-level references (article → paragraph → item), and the need to distinguish jurisdictions and court instances. We describe a production Retrieval-Augmented Generation (RAG) system deployed at a Brazilian legal-technology platform. The system combines (1) domain-tuned hybrid retrieval (lexical, dense, and cross-encoder reranking) over a large-scale legal corpus; (2) grounded generation with explicit citation constraints; and (3) a post-generation Reference Audit layer that extracts legislation and jurisprudence mentions via specialized taggers, normalizes them to a canonical schema, checks existence against authoritative databases at fragment granularity, verifies fidelity against official texts, and triggers targeted rewrites when inconsistencies are detected. We report production telemetry from 184,895 audited answers containing 43,175 extracted legal references. Legislation references resolve at 81.7%, while jurisprudence references resolve at only 47.1%, identifying case-law normalization as the primary bottleneck for practitioners. Fidelity verification corrected 6.5% of checked answers before delivery, preventing misrepresented legal claims from reaching end users. By converting silent hallucinations into explicit warnings with per-reference status, the system enables legal professionals to trust verified citations and efficiently review flagged ones, rather than manually checking every authority.
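Normalizing citation mentions to a canonical schema, as the Reference Audit layer above does, might look like the following regex-based sketch for legislation references. The pattern and the output schema are hypothetical illustrations, not the production system's actual taggers.

```python
import re

def normalize_reference(text):
    """Normalize a Brazilian legislation mention such as
    'art. 5º, § 2º, da Lei nº 8.078/90' into a canonical dict.
    Pattern and schema are illustrative only."""
    m = re.search(
        r"art\.?\s*(\d+)[ºo]?"                 # article number
        r"(?:,\s*§\s*(\d+)[ºo]?)?"             # optional paragraph
        r".*?lei\s*n?[ºo.]*\s*([\d.]+/\d+)",   # statute identifier
        text, flags=re.IGNORECASE)
    if not m:
        return None
    article, paragraph, law = m.groups()
    return {"law": law, "article": int(article),
            "paragraph": int(paragraph) if paragraph else None}

print(normalize_reference("art. 5º, § 2º, da Lei nº 8.078/90"))
# -> {'law': '8.078/90', 'article': 5, 'paragraph': 2}
```

Once mentions are reduced to a canonical form like this, existence checks against authoritative databases can be run at fragment granularity (law, then article, then paragraph) rather than by fuzzy string matching.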
This Ph.D. dissertation advances the state-of-the-art in Natural Language Processing (NLP) for Portuguese by proposing innovative data resources and explainable methods for hate speech detection and automated fact-checking. The thesis introduces several benchmark datasets for Brazilian Portuguese (HateBR, HateBRXplain, HateBRMoralXplain, MFTCXplain, MOL, and FactNews), which have been widely adopted by the research community and address critical gaps in the availability of high-quality annotated resources for Portuguese. In addition, the dissertation proposes novel post-hoc and self-explaining NLP methods: Sentence-Level Factual Reasoning (SELFAR), Social Stereotype Analysis (SSA), Contextual Bag-of-Words with Interpretable Input and Feature Optimization (B+M), Supervised Rational Attention (SRA), and Supervised Moral Rational Attention (SMRA). Across multiple tasks and datasets in Portuguese, these methods outperform baselines while improving interpretability and robustness, demonstrating that explainability and performance can be jointly optimized. The thesis has also achieved significant national and international impact, being cited by leading universities and research institutes worldwide and fostering new M.Sc. and Ph.D. research projects in Brazil. Its scientific and social contributions have been recognized with multiple prestigious national and international awards, including the Google LARA, the Maria Carolina Monard Best Thesis Award in Artificial Intelligence, the Trevisan Prize for Students “AI for Good” from Bocconi University for rigorous computer science research in AI with social impact, and the Diversity and Inclusion Award from the Association for Computational Linguistics (ACL). Lastly, the thesis has received two nominations for the Brazilian Computer Society Thesis Awards, in Computer Science and in Multimedia, Hypermedia, and Web.
Brazil’s ENEM, a high-stakes assessment determining university admission for millions of students annually, creates an immense evaluation burden where human raters process hundreds of essays daily. Automated Essay Scoring (AES) offers a potential solution, yet Portuguese-language systems remain understudied due to fragmented datasets and the complexity of ENEM’s multi-trait rubric. This work investigated cross-prompt, trait-specific essay scoring using a corpus of 385 essays across 38 prompts, where models evaluated essays on unseen prompts across five traits scored on a six-point ordinal scale. We compared three model classes: feature-based methods (72 features), encoder-only transformers (109M–1.5B parameters), and decoder architectures (2.4B–671B parameters) with fine-tuned and zero-shot configurations. Experiments under varying information access and rubric conditioning revealed that no single approach serves all evaluation needs: encoder models excel at mechanical traits (fluency, cohesion) despite context limitations; decoder models achieve superior performance on argumentation (QWK 0.73) and writing style (QWK 0.60) when provided full context; and language-specific pretraining benefits only surface-level features without improving complex reasoning. Best-performing models achieved QWK scores of 0.60–0.73. Gaps to oracle bounds ranged from 0.15 (argumentation) to 0.29 (writing style), with the largest disparities in writing style and persuasiveness.
Gender-based violence (GBV) is a major public health issue, with the World Health Organization estimating that one in three women experiences physical or sexual violence by an intimate partner during her lifetime. In Brazil, although healthcare professionals are legally required to report such cases, underreporting remains significant due to difficulties in identifying abuse and limited integration between public information systems. This study investigates whether FrameNet-based semantic annotation of open-text fields in electronic medical records can support the identification of patterns of GBV. We compare the performance of an SVM classifier for GBV cases trained on (1) frame-annotated text, (2) annotated text combined with parameterized data, and (3) parameterized data alone. Quantitative and qualitative analyses show that models incorporating semantic annotation outperform categorical models, achieving over 0.3 improvement in F1 score and demonstrating that domain-specific semantic representations provide meaningful signals beyond structured demographic data. The findings support the hypothesis that semantic analysis of clinical narratives can enhance early identification strategies and support more informed public health interventions.
Asthma is a chronic respiratory disease that affects breathing and may also influence speech and voice production. In this paper, we examine whether short mobile-recorded Brazilian Portuguese voice and speech audio contain cues that can be used to distinguish individuals with asthma from those without asthma. We approach this problem using transfer learning with pretrained neural audio models based on convolutional architectures trained on large-scale audio datasets (PANNs). We evaluate two recording types: sustained vowel phonation and read speech. Models are trained for a binary classification task and evaluated at both the segment level and the patient level. Read speech performs better than sustained vowels. The best configuration (CNN14 on speech) achieves a patient-level balanced accuracy of 0.85 (overall accuracy 0.85), with ROC-AUC 0.93 and PR-AUC 0.98, performing comparably to CNN10. Training from scratch performs worse than fine-tuning a pretrained model, showing that pretraining helps when data is limited. Performance also varies across age groups, suggesting demographic sensitivity. These findings support the feasibility of audio-based asthma classification from voice and speech and motivate further investigation of pretrained audio models in biomedical applications.
The adoption of LLMs in hospital environments demands solutions that ensure information security, computational efficiency, and rigorous control over sensitive institutional data. This work presents the development and evaluation of a chatbot based on RAG, using exclusively local LLMs, applied to internal documents of a university hospital in Portuguese, composed of Standard Operating Procedures and technical manuals. The methodology initially evaluates the quality of information retrieval through dense embedding models, measured by the Mean Reciprocal Rank (MRR) metric. Then, the generation stage is analyzed in two distinct scenarios: (i) RAG with fixed context, in which multiple chunks are provided simultaneously to the model, and (ii) Incremental page retrieval, in which chunks are sent sequentially according to the retrieval ranking. The generation assessment was conducted with four local LLMs — MedGemma3:27B, Gemma3:27B, Gpt-oss:20B, and Mistral Small 3.1 — using BERTScore as a quality metric. The results indicate that indiscriminate context increase in the fixed-context scenario degrades generation quality, even while increasing the probability of recovering the relevant chunk. In contrast, the incremental page retrieval technique showed improvements in BERTScore values, with the MedGemma3:27B model standing out with the best overall results. These findings demonstrate that adaptive context control is a critical factor in increasing the reliability and efficiency of RAG systems based on local LLMs in the healthcare domain.
Anaphylaxis is an acute, potentially life-threatening allergic reaction that requires rapid recognition in clinical settings. Natural language processing (NLP) approaches for automatic detection of anaphylaxis in clinical narratives can support large-scale analysis of health records and retrospective clinical research. However, such approaches depend on high-quality labeled corpora, and resources for Portuguese remain scarce. This paper introduces a corpus of Brazilian Portuguese clinical notes annotated by domain specialists for the presence or absence of anaphylaxis. The dataset comprises 969 clinical narratives drawn from three sources: clinician-authored synthetic clinical notes designed to represent realistic scenarios, case reports from the medical literature rewritten into note-like format by specialists, and a subset of de-identified notes from the publicly available SemClinBr corpus. All texts were reviewed and labeled by allergists using established clinical diagnostic criteria, and the corpus reflects realistic prevalence conditions, with approximately 5% positive cases. We describe the corpus design, data sources, annotation methodology, and composition, discuss potential research applications, and address ethical considerations. The corpus is intended as a reusable resource for Portuguese clinical NLP, supporting future work on document classification, information extraction, and language modeling in the medical domain.
Ensuring safety in clinical applications of large language models (LLMs) remains an unresolved challenge, particularly for high-risk and underrepresented conditions such as Sickle Cell Disease (SCD). Consequently, these models may exhibit limited reliability for SCD, including hallucinations and clinically unsafe outputs. This paper proposes an LLM-based Multi-Agent System (MAS) enhanced by Retrieval-Augmented Generation (RAG) to support the generation of medical care plans for SCD. The MAS decomposes clinical reasoning into specialized agents responsible for diagnosis, investigation, and treatment planning. Retrieval is framed not as a performance optimization, but as a safety control mechanism. Three RAG strategies, namely LLM-Guided Tree Retrieval, Metadata-Filtered Retrieval, and Semantic Similarity Retrieval, are evaluated alongside a baseline. Our experiments considered LLM-as-a-Judge evaluations and independent assessments by physicians. The results demonstrate high clinical quality, with safety scores exceeding 4 on a 5-point scale. While average performance was similar between RAG and baseline conditions, the Tree Retrieval strategy reduced the frequency of clinically unsafe outputs compared to conventional Semantic Retrieval. These findings provide evidence that average performance is insufficient to evaluate clinical AI systems, particularly in high-risk scenarios where retrieval serves as a safety control layer.
The evaluation of Large Language Models (LLMs) in medicine has predominantly relied on English-language benchmarks aligned with North American clinical guidelines, limiting their applicability to other healthcare systems. In this paper, we evaluate twenty-two proprietary and open-weight LLMs on the 2025 National Examination for the Evaluation of Medical Training (ENAMED), a high-stakes, government-standardized assessment used to evaluate medical graduates in Brazil. The benchmark comprises 90 multiple-choice questions grounded in Brazilian public health policy, clinical practice, and Portuguese medical terminology, and is released as an open dataset. Model performance is measured using both standard accuracy and the official Item Response Theory (IRT) framework employed by ENAMED, enabling direct comparison with human proficiency thresholds. Results reveal a clear stratification of model capabilities: proprietary frontier models achieve the highest performance, whereas many open-weight and smaller-domain-adapted models fail to meet the minimum proficiency criterion. Across comparable scales, large generalist models consistently outperform specialized medical fine-tunes, suggesting that general reasoning capacity is a stronger predictor of success than narrow domain adaptation in this setting. These findings establish ENAMED as a rigorous benchmark for evaluating medical LLMs in Portuguese and highlight both the potential and current limitations of such models for educational assessment.
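The abstract above mentions scoring models under the official Item Response Theory (IRT) framework. ENAMED's exact configuration is not stated here; as a hedged illustration only, the three-parameter logistic (3PL) model commonly used in large-scale assessments gives the probability of a correct response as a function of examinee ability:

```python
import math

def p_correct(theta, a, b, c):
    """3PL item response function (illustrative, not ENAMED's actual setup).

    theta: examinee ability
    a: item discrimination
    b: item difficulty
    c: pseudo-guessing floor (lower asymptote)
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
```

At `theta == b` the probability sits exactly halfway between the guessing floor `c` and 1, which is why `b` is read as the item's difficulty.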
Retrieval-Augmented Generation (RAG) is proposed to reduce hallucination and improve grounding in clinical language models, yet its effectiveness across different levels of clinical reasoning remains unclear. We conducted a controlled evaluation of medication-related question answering in Portuguese using over 7,000 Brazilian regulatory drug leaflets and a complementary clinical benchmark derived from national medical licensing examinations (Revalida and Fuvest). Retrieval substantially improved factual recall and clinical coherence in medication-specific queries, increasing F1 from 0.276 to 0.412. However, naive retrieval did not consistently improve complex clinical reasoning and sometimes reduced accuracy compared to a parametric-only baseline. We identify retrieval-induced anchoring bias, where partially relevant evidence shifts model decisions toward clinically incorrect conclusions. Critique-based and adaptive retrieval mitigated this effect and achieved the highest clinical benchmark accuracy (54.25%). Clinically grounded evaluation dimensions revealed safety-relevant differences beyond traditional NLP metrics. These results show that retrieval augmentation is effective in regulatory settings but requires adaptive control for higher-level clinical reasoning.
While most essential medicines have become widely accessible across all social strata in Brazil due to government initiatives and market shifts, a significant barrier remains: the technical complexity of medication leaflets. This pragmatic and linguistic gap hinders patient comprehension of critical risks and benefits. Thus, adapting these texts into plain language patterns is crucial for patient safety and treatment adherence. Large language models have been increasingly effective as practical solutions for text simplification, an important Natural Language Processing (NLP) task that serves as a basis for several other linguistic and computational tasks. However, the scarcity of annotated datasets remains a bottleneck for rigorous evaluation. To bridge this gap, we propose a streamlined pipeline for generating simplified medical leaflets and introduce an initial benchmark dataset of 30 expertly annotated samples. Our results, supported by semantic and morphosyntactic evaluations, demonstrate that the proposed method produces high-quality, simplified content suitable for health applications.
Clinical NLP for Brazilian Portuguese remains limited by the lack of semantically structured resources that support interoperability and downstream health applications. Although existing corpora provide annotated clinical narratives, their flat annotation schemes restrict semantic expressiveness and alignment with standardized terminologies. In this work, we present a lightweight domain ontology that models clinical entities, contextual qualifiers, and semantic relations in Brazilian Portuguese texts. The ontology is derived from the original corpus annotations and conceptually aligned with standards to enhance interoperability while preserving corpus-specific semantics. This work establishes foundational infrastructure for Portuguese clinical NLP, supporting tasks such as entity normalization, semantic search, and ontology-guided annotation.
Depressive symptomatology may be reflected in the language used by possible depressive profiles (PDP). This paper investigates to what extent symptoms of depression are manifested in Brazilian Portuguese narrative texts, and whether these texts can be used to identify relevant linguistic clues related to PDP. Moreover, the relation between these symptoms and PDP is explored, characterising the lexical, syntactic, and psycholinguistic aspects of texts produced by PDP. We found that texts associated with PDP differed in some of these characteristics from non-PDP texts. The interactions between symptoms and PDP can also shed light on differences in communication patterns and on the relationship between the two. The results of this paper can help to characterise and understand indicators that can be used to train more bespoke and accurate large language models.
Fake news is a major problem for society. With generative Artificial Intelligence, machine-produced fake news has proliferated, making the scenario more challenging. Despite the relevance of this problem, in under-represented languages such as Portuguese, research seeking to differentiate human- from machine-generated fake news is incipient. To fill this gap, this paper explores the Fake.br and FakeTrueBR corpora expanded with automatically generated fake news, characterizing human- and machine-produced fake news lexically and syntactically. The results show that machine-generated texts present significantly longer words, greater use of adjectival modifiers, and lower syntactic diversity, despite employing more syntactic rules per sentence. In contrast, human-authored texts exhibit greater stylistic variability across all analyzed dimensions.
This paper investigates symbolic methods for emotion detection in Portuguese texts, considering multiple corpora, domains, and different preprocessing configurations. The results show large variation in absolute performance across domains but stability in the relative performance of the methods, evidencing the influence of corpus properties and the trade-off between complexity and interpretability. Including the neutral class tends to degrade performance by increasing ambiguity and, frequently, class imbalance, whereas more extensive preprocessing especially benefits symbolic approaches. Qualitative analysis indicates that part of the errors stems from linguistic ambiguities, from the large room for subjectivity in the annotation process, and from emotional nuances themselves, reinforcing the importance of multi-domain comparative evaluations.
Prosodic segmentation is the task of dividing a sound unit into smaller units, distinguishing units that convey a completed idea, marked by terminal breaks (TBs), from non-autonomous units, marked by non-terminal breaks (NTBs). It is a useful task for enhancing the performance of ASR and TTS systems, and it remains relevant for Brazilian Portuguese due to the diversity of conditions and speaker-related factors that influence its performance. Here, we explore a low-impact, open-source approach based on a Random Forest classifier and a set of features that include fundamental frequency, speech rate, pauses, and energy (Craveiro et al., 2025). We perform a robustness evaluation of this ML model, modifying a few conditions of its training, comparing its performance when tested on other datasets, and comparing its results with those of other studies using the same data samples. We experiment with augmenting the training dataset and evaluate how bias related to speaker-profile aspects is affected when the size and diversity of the training set change. Although we do not obtain statistically significant values in the bias evaluation, we observe that inequalities grow as the training dataset is expanded with a much larger, but less diverse, sample of data.
Approaches based solely on textual representations have limitations in capturing structural relations between legal entities, particularly in documents with high lexical similarity. This paper presents ongoing work on a dynamic clustering system for judicial decisions that integrates hybrid representations, combining semantic embeddings from legal-domain Portuguese models with knowledge graphs automatically constructed from documents. The architecture supports incremental clustering and generates cluster justifications using Large Language Models grounded on knowledge graph relations. Preliminary evaluation relies on the quantitative metrics Silhouette Score, Davies-Bouldin Index, and Calinski-Harabasz Index.
The need for tools that assist in case management, automating tasks and reducing delays in the judicial system, justifies improving traditional Information Retrieval systems, which are often limited by vocabulary mismatch and the length of legal texts. Although Transformer-based models capture semantic particularities, they face input-size constraints that make it difficult to process long texts without losing information. In this work, we propose a hybrid system applied to the legal domain, combining the BM25L algorithm and the BumbaLM language model.
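The abstract above does not say how the lexical (BM25L) and neural (BumbaLM) rankings are merged. One standard, model-agnostic way to fuse two ranked lists, shown here purely as an illustrative sketch and not as the authors' method, is reciprocal rank fusion (RRF):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion over several ranked lists of document ids.

    Each document's fused score is the sum of 1 / (k + rank) across the
    input rankings; k=60 is the conventional smoothing constant.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked moderately well by both retrievers can thereby outrank one that only a single retriever placed first, which is the usual motivation for hybrid lexical-plus-dense setups.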
The growing use of Large Language Models (LLMs) has amplified concerns about social bias and algorithmic fairness. This paper presents a Systematic Literature Review of 60 studies published between 2020 and 2025, analyzing mitigation strategies, evaluation metrics, types of discrimination, and the languages considered. The results indicate a strong predominance of evaluations in English, a disproportionate focus on gender bias treated in binary terms, and greater emphasis on diagnosis than on mitigation. There is also a scarcity of intersectional, multilingual, and real-world-oriented analyses, revealing methodological and sociocultural gaps in the current literature.
Large language models (LLMs) are increasingly used for Natural Language Inference (NLI), yet their ability to perform logic-sensitive semantic reasoning, especially outside English, remains underexplored. This paper presents a preliminary investigation into the feasibility and usefulness of developing FraCaS-BR, a Portuguese adaptation of the FraCaS benchmark for semantic inference. Using a small diagnostic subset of seven FraCaS problems focusing on generalized quantifiers, plurals, and nominal anaphora, we evaluate the behavior of three LLMs (ChatGPT, Maritalk, and Evaristo) on Brazilian Portuguese translations. Each problem is submitted multiple times to assess correctness, variance, and consistency relative to the original FraCaS gold labels. The results reveal systematic differences across models. While ChatGPT shows higher overall correctness and stability, all models exhibit limitations that undermine their reliability on logic-controlled inference tasks. The extent of manual correction required during translation further underscores the necessity of human-in-the-loop evaluation. Taken together, these findings support and motivate the development of FraCaS-BR as a controlled evaluation resource for assessing semantic reasoning in Portuguese.
This paper evaluates the impact of expanding the UD_Nheengatu-CompLin treebank on parsing performance for Nheengatu, a Brazilian endangered Indigenous language. We hypothesized that the inclusion of annotated data would result in a 10% improvement in the Labeled Attachment Score (LAS). To test this hypothesis, we conducted a 10-fold cross-validation experiment using UDPipe 1.4 under two conditions: parsing with gold tokenization and gold tags, and automatic parsing from raw text. Statistical significance was determined using the Mann-Whitney U test. Although the expected gain was not achieved, the results show improvements in parsing accuracy and reduced variance across folds. The findings highlight the importance of corpus expansion and standardized annotation workflows for improving parsing performance in low-resource language scenarios and for supporting reproducible evaluation methods in the computational modeling of minority languages.
Automated translation systems exhibit a tendency toward cultural drift when processing non-literal language, often favoring standardized outputs that diverge from the original pragmatic intent. Although Large Language Models (LLMs) have introduced more sophisticated context-handling capabilities, the transition from literal decoding to effective cultural adaptation remains inconsistent. This study investigates these linguistic detours by evaluating ChatGPT-4o, Gemini 1.5 Pro, and Google Translate using a corpus of 100 Brazilian Portuguese expressions. To ensure contemporary relevance, the expressions were validated through the Corpus Carolina and categorized into four groups: classical idioms, regionalisms, metaphors, and intensifiers. Translation quality was assessed using the Multidimensional Quality Metrics (MQM) framework, focusing on adequacy, fluency, and cultural adaptation. The analysis reveals that, even when grammatical accuracy is achieved, automated systems frequently overlook the socio-cultural weight embedded in the source language. Such semantic shifts pose significant challenges in high-stakes professional communication, where nuanced mediation is essential. The findings underscore the limitations of current AI systems in cultural competence and reinforce the ongoing necessity of human intervention to bridge the gap between algorithmic processing and regional identity.
We present an ongoing research project focused on the construction of a Universal Dependencies (UD) corpus of Portuguese epidemiological reports derived from documents published within the Brazilian public health system. We describe findings and challenges to build such a corpus from PDF reports processed through a controlled document extraction pipeline that contrasts layout-aware extraction with raw PDF text extraction, explicitly addressing the impact of tabular content on downstream syntactic analysis. Narrative text is annotated using multiple UD parsers for Portuguese, including widely used and state-of-the-art tools, and their outputs are systematically compared using descriptive structural indicators and targeted qualitative inspection. Our analysis highlights domain-specific challenges in epidemiological texts and shows that document extraction and representation choices have a stronger effect on parsing behavior than parser selection alone. Based on these findings, we identify robust preprocessing configurations and discuss design choices for a UD-epidemiological corpus to support future research on syntactic parsing, domain adaptation, and downstream natural language processing tasks in epidemiology and public health.
We study gender-associated stylistic variation in Brazilian Portuguese Google Play reviews. Using IBGE name frequencies, we infer binary gender from first names in 76.7M reviews (96 apps, 2011–2025), obtaining 22.25M high-confidence labels. Women-associated reviews show markedly higher paralinguistic expressivity (about 60% higher emoji density and more lengthening/punctuation), while lexical diversity (MTLD) is nearly identical across groups. Ratings are mostly positive, with men contributing relatively more 1-star reviews and women more 5-star reviews. These findings contribute to a deeper understanding of digital sociolinguistic behavior within the Brazilian context. We discuss limitations of name-based gender inference and future demographic extensions.
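The high-confidence labeling described above can be approximated by a simple thresholding rule over name-frequency counts. The `freq` table schema and the cutoff values below are illustrative assumptions for the sketch, not the IBGE data format or the authors' actual criteria:

```python
def infer_gender(first_name, freq, min_total=20, min_ratio=0.9):
    """Label a first name "F" or "M" only when counts are decisive.

    freq maps a lowercased name to (female_count, male_count); names that
    are rare or ambiguous are left unlabeled (None) rather than guessed.
    """
    f, m = freq.get(first_name.lower(), (0, 0))
    total = f + m
    if total < min_total:          # too rare to trust
        return None
    if f / total >= min_ratio:     # overwhelmingly female-associated
        return "F"
    if m / total >= min_ratio:     # overwhelmingly male-associated
        return "M"
    return None                    # ambiguous (e.g., unisex names)
```

Discarding ambiguous and rare names is what turns 76.7M reviews into a smaller set of high-confidence labels, at the cost of the coverage and binary-inference limitations the abstract itself notes.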
Coreference resolution is a crucial task in natural language processing (NLP) that aims to identify and link expressions in a text that refer to the same entity. However, the lack of annotated data for coreference resolution in Portuguese has hindered the development of robust and accurate systems for this language. In this paper, we present an assessment of coreference annotation with large language models (LLMs) for Portuguese: we propose LLM-PREF to annotate coreference in Portuguese texts, and evaluate it against a system previously proposed in the literature. The results show that although the model's world knowledge and inference capacity are quite rich (allowing it to recognize complex coreference patterns, including pronominal anaphora), it does not surpass the previously developed rule-based system.
Digital trace data have expanded empirical opportunities in the social sciences while intensifying the methodological challenge of scale: researchers increasingly face corpora too large and fast-moving to read exhaustively without sacrificing interpretive rigor. This article presents Social-RAG, a modular Retrieval-Augmented Generation (RAG) architecture designed to support scalable qualitative inquiry over large text corpora while preserving evidence traceability, auditability, and researcher control. Our empirical basis consists of messages from public Telegram groups and channels, organized into two thematic subsets: vaccine-related discourse and debates surrounding Brazil’s Lei Rouanet cultural funding policy. We detail key design decisions, including a “one post = one chunk” indexing strategy, semantic retrieval over vector embeddings with efficient ANN search, an Adaptive-K dynamic cutoff for context selection, MMR re-ranking for diversity, and structured analytical instructions that constrain generation to retrieved evidence. We evaluate system behavior using two complementary question blocks, hermeneutic (narrative) and factual, and compare outputs across three language models with distinct deployment profiles (a local open-weight model, a cloud open-weight model, and a commercial closed model), using an LLM-as-judge protocol with explicit qualitative criteria. Results show consistent behaviour across both thematic corpora and highlight a key trade-off: the two larger/closed models perform similarly and robustly in both narrative and factual tasks when evidential discipline is maintained, whereas the smaller local model remains useful for exploratory narrative synthesis but is less reliable for strict factual extraction and attribution. We conclude by discussing methodological implications, limitations, and future directions, with a focus on scalability and extensibility to new data types and analytical problems.
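Among the design decisions listed above, MMR re-ranking has a compact, standard formulation. The sketch below is a generic Maximal Marginal Relevance selection over embedding vectors, offered as an assumption-laden illustration (function names and the λ default are ours, not the Social-RAG code):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mmr(query_vec, doc_vecs, k, lam=0.7):
    """Pick k document indices balancing relevance and diversity.

    Score of a candidate i: lam * sim(query, i)
                            - (1 - lam) * max sim(i, already selected).
    """
    selected, candidates = [], list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            rel = cosine(query_vec, doc_vecs[i])
            red = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                      default=0.0)
            return lam * rel - (1 - lam) * red
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

Lower λ weights diversity more heavily, so near-duplicate posts (a common feature of Telegram data) stop crowding out dissimilar but relevant evidence in the retrieved context.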
In this article we briefly describe the ReadingFood project on the semantic field of food and drink in literature, which aims to compare works from four countries in the period 1840-1920, although here we restrict ourselves to Portugal and Brazil. After presenting the infrastructures already developed, which make research in this domain publicly available, we present the work carried out so far and some preliminary studies: the creation of a taxonomy of the domain in literature, disambiguation in context, and the study of meals (or events related to food and drink).
This article addresses tasks that precede the computational processing of 18th-century historical sources in Portuguese. The work focused on highly specialized domains: fauna and flora. Given this characteristic, a low level of lexical ambiguity was expected, but that was not the case. We therefore present a roadmap of the orthographic normalization process; describe the construction of the corpus annotated with Named Entities; and, above all, discuss problems related to lexical variation in these specialized thesauri and the constraints of the process. In this way, we aim to contribute to the reflection on what the normalization of historical sources entails and to draw attention to the importance of good practices in this setting.
This paper analyzes the performance of several terminology extraction methods when confronted with historical specialized texts that do not conform to modern orthographic norms. We tested two extraction methods based on linguistic patterns, four prompt-based generative artificial intelligence (GenAI) models, and one BERT-like model. Some of these models were fine-tuned for terminology extraction, and one is specialized in extracting medical terms from documents written in Portuguese. For the GenAI models, we tested four different prompting strategies. As the test set, we used chapter fifteen of the second part of the book Aviso á Gente do Mar sobre a sua Saude [Advice to Sea People about their Health], originally written in French by G. Mauran at the end of the 18th century and translated and adapted to Portuguese in 1794. The chapter was annotated with terminology, and the evaluation was conducted independently both in terms of F-measure and in terms of pure precision, to observe whether the automatic extraction methods could complement the manual token-based annotation. Results show that using automatic extraction methods to complement the manual annotation can improve coverage and that, although individual models do not achieve high extraction quality, combining two or more models can reach a recall of more than 90% on the test data.
This article presents the semantic modeling of named entities in Os Lusíadas, by Luís de Camões, based on the TEI P5 standard. We propose a hybrid annotation workflow that combines NER (spaCy), an authority dictionary (gazetteer), and manual philological post-editing. Anthroponyms, mythonyms, and toponyms are typified through the elements <persName> (person name), <placeName> (place name), and <rs> (referencing string), with special attention to the markup of epithets. The study highlights the limitations of models trained on journalistic corpora when faced with epic syntax and the orthography of the 1572 edition, demonstrating the need for a hybrid approach. We conclude that XML/TEI acts as a tool for modeling literary knowledge.
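The gazetteer stage of such a hybrid workflow can be sketched as a dictionary lookup that wraps matched names in TEI P5 elements. The entries below are illustrative assumptions (names drawn from Os Lusíadas, but the mapping is hypothetical); the paper's actual pipeline also uses spaCy NER and manual philological post-editing.

```python
# Hypothetical gazetteer mapping entity names to TEI P5 element names.
GAZETTEER = {
    "Vasco da Gama": "persName",   # anthroponym
    "Baco": "persName",            # mythonym
    "Melinde": "placeName",        # toponym
}

def tag_tei(text: str, gazetteer: dict) -> str:
    """Wrap known entity names in TEI elements, longest names first
    so that multiword names are not split by shorter matches."""
    for name in sorted(gazetteer, key=len, reverse=True):
        tag = gazetteer[name]
        text = text.replace(name, f"<{tag}>{name}</{tag}>")
    return text

print(tag_tei("Partiu Vasco da Gama rumo a Melinde.", GAZETTEER))
```

In a real TEI workflow the output would live inside a valid TEI document and be revised by hand, which is where the philological post-editing step comes in.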
This article presents the Lusíadas Digital project, which proposes the development of a virtual philological edition of Os Lusíadas by Luís de Camões, integrating principles of textual criticism, Digital Humanities, and Natural Language Processing (NLP). The project aims to develop a digital platform bringing together facsimiles of the 1572 editions, diplomatic and modernized transcriptions, a dynamic critical apparatus, a lexical glossary with etymological information, historical and literary commentary, and translations aligned with the original text. The methodology combines traditional philological practices with XML-TEI text encoding, OCR techniques, automatic lemmatization, version alignment, and lexical mining. Initially focused on Canto I, the project seeks to establish a scalable and replicable model for the remaining cantos of the work. By proposing an open, interoperable, and data-oriented digital infrastructure, the initiative contributes to the advancement of e-Philology in Brazil and to the development of technologies applied to the digital critical editing of manuscripts and early printed editions.
Recorded interviews can capture their subjects’ memories, perceptions, and emotions. When conducted with notable figures, they also have the potential to serve as a resource for interdisciplinary research, impacting various branches of science. In this work, we mark the beginning of a significant project analyzing interviews from the Roda Viva program, the longest-running interview show on Brazilian television. In this initial study, we examined six memorable interviews with six Brazilian Formula One drivers to compare the performance of two named entity recognition methods: a statistical-neural method and large language models, both evaluated against manual annotations. The comparison nevertheless highlighted relevant qualitative distinctions: the statistical method showed a rigid dependence on capitalisation and lexical familiarity, leading to mechanical false positives and missed non-capitalised entities, while the LLM exhibited greater linguistic sensitivity, retrieving contextual entities and proving robust to transcription errors, though it still produced false positives. The LLM-based model appears more promising due to its flexibility and the potential for refinement via instructions to filter out ambiguities, favouring the automation of social network extraction from the corpus.
Uniform Meaning Representation (UMR) is a cross-linguistic semantic representation framework designed to encode sentence meaning in a structured and interpretable way. Building on the foundations of Abstract Meaning Representation (AMR), UMR extends semantic coverage to events, participants, semantic roles, temporal/aspectual information, modality, and discourse links. It is language-agnostic and therefore suitable for multilingual exploration. This tutorial provides a beginner’s introduction to UMR aimed at an audience with no prior experience with AMR, UMR, or meaning representations. The tutorial begins with a simple introduction to the essentials of Universal Dependencies (UD) needed to understand how UMR graphs can be constructed from syntactic information. Using simple Portuguese examples, the tutorial illustrates how basic UD structures guide the creation of UMR graphs. Participants will leave with a foundational understanding of what UMR is, how it relates to syntax and semantic roles, how to create minimal UMR graphs, and how Portuguese UD treebanks can support UMR annotation.