Investigating the importance of linguistic complexity features across different datasets related to language learning

Ildikó Pilán, Elena Volodina


Abstract
We present the results of our investigation into the most informative linguistic complexity features for classifying language learning levels in three different datasets. The datasets vary along two dimensions: the size of the instances (texts vs. sentences) and the language learning skill they involve (reading comprehension texts vs. texts written by learners themselves). For each dataset, we present a subset of the most predictive features, taking into consideration significant differences in their per-class mean values, and show that these subsets lead not only to simpler models but also to improved classification performance. Furthermore, we pinpoint fourteen central features, spanning both morpho-syntactic and lexical dimensions, that are good predictors regardless of the size of the linguistic unit analyzed or the skills involved.
Anthology ID:
W18-4606
Volume:
Proceedings of the Workshop on Linguistic Complexity and Natural Language Processing
Month:
August
Year:
2018
Address:
Santa Fe, New Mexico
Venue:
WS
Publisher:
Association for Computational Linguistics
Pages:
49–58
URL:
https://aclanthology.org/W18-4606
Cite (ACL):
Ildikó Pilán and Elena Volodina. 2018. Investigating the importance of linguistic complexity features across different datasets related to language learning. In Proceedings of the Workshop on Linguistic Complexity and Natural Language Processing, pages 49–58, Santa Fe, New Mexico. Association for Computational Linguistics.
Cite (Informal):
Investigating the importance of linguistic complexity features across different datasets related to language learning (Pilán & Volodina, 2018)
PDF:
https://preview.aclanthology.org/author-url/W18-4606.pdf