Abstract
Reading is a complex process, and not only because of the words or sections that a reader finds difficult to understand. Complex word identification (CWI) is the task of detecting, in the content of documents, the words that are difficult or complex for people of a certain group to understand. Annotated corpora for English learners are widely available, but they are less common for Spanish. In this article, we present CLexIS2, a new Spanish corpus intended to advance research in Lexical Simplification, specifically the identification and prediction of complex words in computing studies. Several metrics used to evaluate the complexity of texts in Spanish were applied, such as LC, LDI, ILFW, SSR, SCI, ASL, and CS. Furthermore, as an initial baseline, two experiments were performed to predict the complexity of words: one using a supervised learning approach and the other using an unsupervised solution based on the frequency of words in a general corpus.
- Anthology ID:
- 2021.ranlp-1.121
- Volume:
- Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
- Month:
- September
- Year:
- 2021
- Address:
- Held Online
- Editors:
- Ruslan Mitkov, Galia Angelova
- Venue:
- RANLP
- Publisher:
- INCOMA Ltd.
- Pages:
- 1075–1083
- URL:
- https://aclanthology.org/2021.ranlp-1.121
- Cite (ACL):
- Jenny A. Ortiz Zambrano and Arturo Montejo-Ráez. 2021. CLexIS2: A New Corpus for Complex Word Identification Research in Computing Studies. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pages 1075–1083, Held Online. INCOMA Ltd.
- Cite (Informal):
- CLexIS2: A New Corpus for Complex Word Identification Research in Computing Studies (Ortiz Zambrano & Montejo-Ráez, RANLP 2021)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-3/2021.ranlp-1.121.pdf
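
The abstract mentions an unsupervised baseline based on the frequency of words in a general corpus but does not describe how it works. The sketch below illustrates one common way such a baseline could be built, and it is not the authors' implementation. The function names (`build_frequencies`, `complexity_score`, `is_complex`), the log-scaled score, the `threshold` value, and the toy reference text are all assumptions made for illustration.

```python
import math
from collections import Counter


def build_frequencies(reference_tokens):
    """Count lowercase token occurrences in a general reference corpus."""
    return Counter(token.lower() for token in reference_tokens)


def complexity_score(word, freqs):
    """Map a word to [0, 1]; rarer words in the reference corpus score higher.

    Assumes `freqs` is non-empty. The log scaling is an illustrative choice,
    not something taken from the paper.
    """
    max_count = max(freqs.values())
    count = freqs.get(word.lower(), 0)
    # Add-one smoothing keeps unseen words at a finite score of 1.0;
    # the logarithm dampens the gap between very frequent and rare words.
    return 1.0 - math.log(count + 1) / math.log(max_count + 1)


def is_complex(word, freqs, threshold=0.5):
    """Label a word as complex when its score reaches the (hypothetical) threshold."""
    return complexity_score(word, freqs) >= threshold


# Toy stand-in for a general Spanish corpus (hypothetical data).
reference = (
    "el algoritmo ordena la lista y el programa lee la lista de la memoria"
).split()
freqs = build_frequencies(reference)

for candidate in ["la", "lista", "heurística"]:
    print(candidate, round(complexity_score(candidate, freqs), 3),
          is_complex(candidate, freqs))
```

In practice the reference counts would come from a large general-domain corpus, and the threshold would be tuned against annotated data such as a CWI corpus; both choices are left open here.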