Abstract
Lexical simplification (LS) is the task of automatically replacing complex words for easier ones making texts more accessible to various target populations (e.g. individuals with low literacy, individuals with learning disabilities, second language learners). To train and test models, LS systems usually require corpora that feature complex words in context along with their potential substitutions. To continue improving the performance of LS systems we introduce ALEXSIS-PT, a novel multi-candidate dataset for Brazilian Portuguese LS containing 9,605 candidate substitutions for 387 complex words. ALEXSIS-PT has been compiled following the ALEXSIS-ES protocol for Spanish opening exciting new avenues for cross-lingual models. ALEXSIS-PT is the first LS multi-candidate dataset that contains Brazilian newspaper articles. We evaluated three models for substitute generation on this dataset, namely mBERT, XLM-R, and BERTimbau. The latter achieved the highest performance across all evaluation metrics.- Anthology ID:
- 2022.coling-1.529
- Volume:
- Proceedings of the 29th International Conference on Computational Linguistics
- Month:
- October
- Year:
- 2022
- Address:
- Gyeongju, Republic of Korea
- Venue:
- COLING
- SIG:
- Publisher:
- International Committee on Computational Linguistics
- Note:
- Pages:
- 6057–6062
- Language:
- URL:
- https://aclanthology.org/2022.coling-1.529
- DOI:
- Cite (ACL):
- Kai North, Marcos Zampieri, and Tharindu Ranasinghe. 2022. ALEXSIS-PT: A New Resource for Portuguese Lexical Simplification. In Proceedings of the 29th International Conference on Computational Linguistics, pages 6057–6062, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
- Cite (Informal):
- ALEXSIS-PT: A New Resource for Portuguese Lexical Simplification (North et al., COLING 2022)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/2022.coling-1.529.pdf