Larissa Goulart
2026
CorSpell: Introducing a Semiautomatic Tool for Spelling Normalization in Brazilian Portuguese
Juliana Schoffen | Dennis Giovani Balreira | Elisa Marchioro Stumpf | Larissa Goulart | Tanara Zingano Kuhn | Rafael Oleques Nunes | Gabriel Ricci Pazzinato | Isadora Dahmer Hanauer | José Henrique de Souza Silva | Luiza Sarmento Divino | Marine Matte
Proceedings of the Fifteenth Language Resources and Evaluation Conference
With the growing availability of large text collections, efficient tools for corpus annotation and normalization have become increasingly important in linguistic and computational research. This paper presents CorSpell, a semiautomatic tool developed to support the spelling normalization of Brazilian Portuguese texts within the CorCel project—a corpus comprising over 15,000 handwritten exam responses from the Celpe-Bras proficiency test. Given the corpus scale, manual normalization is impractical; CorSpell streamlines this process by enabling users to visualize, select, and replace tokens directly through an intuitive web interface. The tool integrates automatic suggestions from PT-BR dictionaries with human validation, providing an interface for users to access and manipulate the texts. CorSpell significantly reduces annotation time, minimizes errors, and facilitates collaborative work, providing a practical and scalable solution for corpus normalization and a foundation for LLM-based modeling of Portuguese proficiency.
2022
Towards better structured and less noisy Web data: Oscar with Register annotations
Veronika Laippala | Anna Salmela | Samuel Rönnqvist | Alham Fikri Aji | Li-Hsin Chang | Asma Dhifallah | Larissa Goulart | Henna Kortelainen | Marc Pàmies | Deise Prina Dutra | Valtteri Skantsi | Lintang Sutawika | Sampo Pyysalo
Proceedings of the Eighth Workshop on Noisy User-generated Text (W-NUT 2022)
Web-crawled datasets are known to be noisy, as they feature a wide range of language use covering both user-generated and professionally edited content as well as noise originating from the crawling process. This article presents one solution to reduce this noise by using automatic register (genre) identification, that is, determining whether the texts are, e.g., forum discussions, lyrical or how-to pages. We apply the multilingual register identification model by Rönnqvist et al. (2021) and label the widely used Oscar dataset. Additionally, we evaluate the model against eight new languages, showing that the performance is comparable to previous findings on a restricted set of languages. Finally, we present and apply a machine learning method for further cleaning text files originating from Web crawls from remains of boilerplate and other elements not belonging to the main text of the Web page. The register-labeled and cleaned dataset covers 351 million documents in 14 languages and is available at https://huggingface.co/datasets/TurkuNLP/register_oscar.
Co-authors
- Alham Fikri Aji 1
- Dennis Giovani Balreira 1
- Li-Hsin Chang 1
- Isadora Dahmer Hanauer 1
- Asma Dhifallah 1
- Henna Kortelainen 1
- Veronika Laippala 1
- Elisa Marchioro Stumpf 1
- Marine Matte 1
- Rafael Oleques Nunes 1
- Deise Prina Dutra 1
- Sampo Pyysalo 1
- Marc Pàmies 1
- Gabriel Ricci Pazzinato 1
- Samuel Rönnqvist 1
- Anna Salmela 1
- Luiza Sarmento Divino 1
- Juliana Schoffen 1
- Valtteri Skantsi 1
- Lintang Sutawika 1
- Tanara Zingano Kuhn 1
- José Henrique de Souza Silva 1