José Matos
2026
CorEGe-PT: Compiling a Large Corpus of Academic Texts in Portuguese
Tanara Zingano Kuhn | José Matos | Bruno Neves | Daniela Pereira | Elisabete Cação | Ivo Simões | Jacinto Estima | Delfim Leão | Hugo Goncalo Oliveira
Proceedings of the Fifteenth Language Resources and Evaluation Conference
This paper describes the creation of a large-scale corpus of academic texts in Portuguese, dubbed CorEGe-PT, extracted from the institutional repository of a Portuguese university. Its compilation methodology, which combined automatic and manual procedures, is detailed, together with the challenges faced and proposed solutions. The process included a thorough analysis of the metadata, which will be publicly released together with the documents, extracted in Markdown format. CorEGe-PT covers five areas of knowledge and, with over 34,000 documents and 1B tokens, is the largest corpus of its kind in Portuguese, which will enable in-depth linguistic studies while providing data for adapting Large Language Models to academic Portuguese and related tasks.
2025
Cognitive Flow: An LLM-Automated Framework for Quantifying Reasoning Distillation
José Matos | Catarina Silva | Hugo Goncalo Oliveira
Proceedings of the 18th International Natural Language Generation Conference
The ability of large language models (LLMs) to reason effectively is crucial for a wide range of applications, from complex decision-making to scientific research. However, it remains unclear how well reasoning capabilities are transferred or preserved when LLMs undergo Knowledge Distillation (KD), a process that typically reduces model size while attempting to retain performance. In this study, we explore the effects of model distillation on the reasoning abilities of various reasoning language models (RLMs). We introduce Cognitive Flow, a novel framework that systematically extracts meaning and maps states in Chain-of-Thought (CoT) processes, offering new insights into model reasoning and enabling quantitative comparisons across RLMs. Using this framework, we investigate the impact of KD on CoTs produced by RLMs. We target DeepSeek-R1-671B and its distilled 70B, 32B and 14B versions, as well as QwQ-32B from the Qwen series. We evaluate the models on three subsets of mathematical reasoning tasks with varying complexity from the MMLU benchmark. Our findings demonstrate that while distillation can effectively replicate a similar reasoning style under specific conditions, it struggles with simpler problems, revealing a significant divergence in the observable thought process and a potential limitation in the transfer of a robust and adaptable problem-solving capability.