Inês Calvo


2026

As Large Language Models (LLMs) expand across multilingual domains, evaluating their performance in under-represented languages becomes increasingly important. European Portuguese (pt-PT) is particularly affected, as existing training data and benchmarks are mainly in Brazilian Portuguese (pt-BR). To address this, we introduce ALBA, a linguistically grounded benchmark designed from the ground up to assess LLM proficiency on linguistics-related tasks in pt-PT across eight linguistic dimensions: Language Variety, Culture-bound Semantics, Discourse Analysis, Word Plays, Syntax, Morphology, Lexicology, and Phonetics and Phonology. ALBA is manually constructed by language experts and paired with an LLM-as-a-judge framework for scalable evaluation of generated pt-PT text. Experiments on a diverse set of models reveal performance variability across linguistic dimensions, highlighting the need for comprehensive, variety-sensitive benchmarks that support further development of tools in pt-PT.
We present PHEB, a comprehensive benchmark designed to evaluate Large Language Models (LLMs) on real high-school-level national exams in European Portuguese. The goal is to promote the development of NLP tools and provide a reliable resource for benchmarking the multilingual and educational capabilities of LLMs. Covering over 3,500 questions spanning 18 years (2006–2023) across six core subjects, the benchmark compiles high-quality questions from Portuguese National Exams, written and thoroughly curated by professors to ensure topic diversity, linguistic accuracy, and alignment with national curricula. The six subjects are Mathematics, Portuguese Language and Literature, History, Geography, Biology/Geology, and Philosophy. Questions include both multiple-choice and long-form answers to assess factual knowledge, reasoning capabilities, and language understanding. We comprehensively benchmark state-of-the-art LLMs, shedding light on key challenges such as models’ knowledge, language coverage, answer-format biases, and robustness to machine translation.
Despite rapid progress in open large language models (LLMs), European Portuguese (pt-PT) remains underrepresented in both training data and native evaluation, and machine-translated benchmarks are likely to miss the variety’s linguistic and cultural nuances. We introduce AMALIA, a fully open LLM that prioritizes pt-PT by using more high-quality pt-PT data during both the mid- and post-training stages. To evaluate pt-PT more faithfully, we release a suite of pt-PT benchmarks that includes translated standard tasks and four new datasets targeting pt-PT generation, linguistic competence, and pt-PT/pt-BR bias. Experiments show that AMALIA matches strong baselines on translated benchmarks while substantially improving performance on pt-PT-specific evaluations, supporting the case for targeted training and native benchmarking for European Portuguese.