Tanara Zingano Kuhn
2026
CorEGe-PT: Compiling a Large Corpus of Academic Texts in Portuguese
Tanara Zingano Kuhn | José Matos | Bruno Neves | Daniela Pereira | Elisabete Cação | Ivo Simões | Jacinto Estima | Delfim Leão | Hugo Goncalo Oliveira
Proceedings of the Fifteenth Language Resources and Evaluation Conference
This paper describes the creation of a large-scale corpus of academic texts in Portuguese, dubbed CorEGe-PT, extracted from the institutional repository of a Portuguese university. Its compilation methodology, which combined automatic and manual procedures, is detailed, together with the challenges faced and the solutions proposed. The process included a thorough analysis of the metadata, which will be publicly released together with the documents, extracted in Markdown format. CorEGe-PT covers five areas of knowledge and, with over 34,000 documents and 1B tokens, is the largest corpus of its kind in Portuguese. It will enable in-depth linguistic studies while providing data for adapting Large Language Models to academic Portuguese and related tasks.
CorSpell: Introducing a Semiautomatic Tool for Spelling Normalization in Brazilian Portuguese
Juliana Schoffen | Dennis Giovani Balreira | Elisa Marchioro Stumpf | Larissa Goulart | Tanara Zingano Kuhn | Rafael Oleques Nunes | Gabriel Ricci Pazzinato | Isadora Dahmer Hanauer | José Henrique de Souza Silva | Luiza Sarmento Divino | Marine Matte
Proceedings of the Fifteenth Language Resources and Evaluation Conference
With the growing availability of large text collections, efficient tools for corpus annotation and normalization have become increasingly important in linguistic and computational research. This paper presents CorSpell, a semiautomatic tool developed to support the spelling normalization of Brazilian Portuguese texts within the CorCel project, a corpus comprising over 15,000 handwritten exam responses from the Celpe-Bras proficiency test. Given the scale of the corpus, manual normalization is impractical; CorSpell streamlines this process by enabling users to visualize, select, and replace tokens directly through an intuitive web interface. The tool integrates automatic suggestions from PT-BR dictionaries with human validation, giving users a single interface for accessing and editing the texts. CorSpell significantly reduces annotation time, minimizes errors, and facilitates collaborative work, providing a practical and scalable solution for corpus normalization and a foundation for LLM-based modeling of Portuguese proficiency.