Leandro Balby Marinho
2026
Benchmark Data Contamination in Underrepresented Languages: A Comprehensive Analysis Using Brazilian Data
Iriedson Souto Maior de Moraes Vilar | David Candeia Maia | João Brunet | Fabio Morais | Leandro Balby Marinho
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Large Language Models (LLMs) are typically evaluated using standardized benchmarks to enable consistent performance measurement and model comparison. However, the reliability of these benchmarks can be undermined by data contamination, which occurs when evaluation items are inadvertently included in training corpora. While this issue has been investigated primarily in high-resource languages such as English and Chinese, its impact on underrepresented languages, such as Brazilian Portuguese, remains understudied. In this paper, we present one of the first systematic investigations of benchmark data contamination (BDC) in an underrepresented language setting, using Brazilian Portuguese as a case study. Using validated methodologies from the literature, we evaluate specialized and multilingual models across four benchmarks: BLUEX, ENEM Challenge, OAB Exams, and HealthQA-BR. Our approach applies TS-Guessing to detect contamination via memorized knowledge, alongside a 50-character n-gram similarity strategy to identify benchmark items leaked into training data. Our results provide consistent evidence of contamination, revealing that models with stronger memorization and retrieval abilities tend to achieve artificially inflated benchmark scores. Our contributions include: (i) classifying models according to their contamination risk, (ii) identifying the benchmarks most affected by data leakage, and (iii) reporting contaminated training corpora.
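The abstract does not spell out how the 50-character n-gram similarity check is computed. As a rough illustration only, and not the authors' implementation, the idea of looking for benchmark text inside a training document can be sketched as below. The function names `char_ngrams` and `overlap_ratio`, the whitespace and case normalization, and the use of an exact substring match are all assumptions made for this sketch.

```python
def normalize(text: str) -> str:
    # Assumed normalization: lowercase and collapse runs of whitespace.
    return " ".join(text.lower().split())


def char_ngrams(text: str, n: int = 50) -> set[str]:
    # All sliding windows of n characters over the normalized text.
    norm = normalize(text)
    if len(norm) < n:
        # Text shorter than one window: treat the whole text as a single gram.
        return {norm} if norm else set()
    return {norm[i:i + n] for i in range(len(norm) - n + 1)}


def overlap_ratio(item: str, corpus_doc: str, n: int = 50) -> float:
    # Fraction of the item's n-grams that occur verbatim in the document.
    grams = char_ngrams(item, n)
    if not grams:
        return 0.0
    doc = normalize(corpus_doc)
    hits = sum(1 for gram in grams if gram in doc)
    return hits / len(grams)


# Hypothetical benchmark item and training documents, for illustration only.
item = "Qual é a capital do Brasil? A resposta correta é Brasília, no Distrito Federal."
leaked_doc = "Simulado resolvido: " + item + " Fim da questão."
clean_doc = "Um texto qualquer que não contém a questão do exame."

leaked_score = overlap_ratio(item, leaked_doc)
clean_score = overlap_ratio(item, clean_doc)
```

In this sketch a ratio close to 1 suggests the item appears nearly verbatim in the document, while a ratio of 0 means no 50-character window matched. A real pipeline over large corpora would likely need hashing or an index rather than substring scans.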
2020
Computing with Subjectivity Lexicons
Caio L. M. Jeronimo | Claudio E. C. Campelo | Leandro Balby Marinho | Allan Sales | Adriano Veloso | Roberta Viola
Proceedings of the Twelfth Language Resources and Evaluation Conference
In this paper, we introduce a new set of lexicons for expressing subjectivity in text documents written in Brazilian Portuguese. Beyond targeting a non-English language, these lexicons differ from other available subjectivity lexicons in two ways: they represent different subjectivity dimensions (other than sentiment) and they are more compact in number of terms. The compactness is intentional and designed to leverage word embedding techniques: with the words mapped to an embedding space and an appropriate distance measure, we can easily capture words semantically related to those in the lexicons. Thus, we do not need to build comprehensive vocabularies and can focus on the most representative words for each lexicon dimension. We showcase the use of these lexicons in three highly non-trivial tasks: (1) Automated Essay Scoring in the Presence of Biased Ratings, (2) Subjectivity Bias in Brazilian Presidential Elections, and (3) Fake News Classification Based on Text Subjectivity. All these tasks involve text documents written in Portuguese.
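The idea of pairing a compact lexicon with an embedding space can be illustrated with a minimal sketch. Everything here is an assumption for illustration rather than the paper's method: the three-dimensional toy vectors, the choice of cosine similarity as the measure, the 0.8 threshold, and the helper names `cosine` and `expand_lexicon`.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    # Cosine similarity between two vectors; 0.0 if either has zero norm.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def expand_lexicon(seeds: set[str],
                   embeddings: dict[str, list[float]],
                   threshold: float = 0.8) -> set[str]:
    # Add every vocabulary word that is close enough to at least one seed.
    known_seeds = [s for s in seeds if s in embeddings]
    expanded = set(seeds)
    for word, vec in embeddings.items():
        if word in expanded:
            continue
        if any(cosine(vec, embeddings[s]) >= threshold for s in known_seeds):
            expanded.add(word)
    return expanded


# Toy embedding table (hypothetical values, not trained vectors).
toy_embeddings = {
    "talvez": [0.90, 0.10, 0.00],
    "provavelmente": [0.85, 0.20, 0.05],
    "mesa": [0.00, 0.10, 0.95],
}

expanded = expand_lexicon({"talvez"}, toy_embeddings)
```

With these toy vectors the seed "talvez" pulls in "provavelmente" but not "mesa", which mirrors the point that a small set of representative words can stand in for a much larger vocabulary.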