CLexIS2: A New Corpus for Complex Word Identification Research in Computing Studies

Jenny A. Ortiz Zambrano, Arturo Montejo-Ráez


Abstract
Reading is a complex process, in part because of the words or passages that are difficult for the reader to understand. Complex word identification (CWI) is the task of detecting, in the content of documents, the words that are difficult or complex for people of a certain group to understand. Annotated corpora for English learners are widely available, while they are less common for Spanish. In this article, we present CLexIS2, a new Spanish corpus intended to advance research in Lexical Simplification, specifically the identification and prediction of complex words in computing studies. Several metrics used to evaluate the complexity of Spanish texts were applied, including LC, LDI, ILFW, SSR, SCI, ASL, and CS. Furthermore, as an initial baseline, two experiments were performed to predict word complexity: one using a supervised learning approach and the other using an unsupervised approach based on word frequency in a general corpus.
Anthology ID:
2021.ranlp-1.121
Volume:
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
Month:
September
Year:
2021
Address:
Held Online
Venue:
RANLP
Publisher:
INCOMA Ltd.
Pages:
1075–1083
URL:
https://aclanthology.org/2021.ranlp-1.121
Cite (ACL):
Jenny A. Ortiz Zambrano and Arturo Montejo-Ráez. 2021. CLexIS2: A New Corpus for Complex Word Identification Research in Computing Studies. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pages 1075–1083, Held Online. INCOMA Ltd.
Cite (Informal):
CLexIS2: A New Corpus for Complex Word Identification Research in Computing Studies (Ortiz Zambrano & Montejo-Ráez, RANLP 2021)
PDF:
https://preview.aclanthology.org/update-css-js/2021.ranlp-1.121.pdf