Cross-corpus Native Language Identification via Statistical Embedding
Francisco Rangel, Paolo Rosso, Julian Brooke, Alexandra Uitdenbogerd
Abstract
In this paper, we approach the task of native language identification in a realistic cross-corpus scenario where a model is trained with available data and has to predict the native language from data of a different corpus. The motivation behind this study is to investigate native language identification in the Australian academic scenario where a majority of students come from China, Indonesia, and Arabic-speaking nations. We have proposed a statistical embedding representation reporting a significant improvement over common single-layer approaches of the state of the art, identifying Chinese, Arabic, and Indonesian in a cross-corpus scenario. The proposed approach was shown to be competitive even when the data is scarce and imbalanced.
- Anthology ID: W18-1605
- Volume: Proceedings of the Second Workshop on Stylistic Variation
- Month: June
- Year: 2018
- Address: New Orleans
- Editors: Julian Brooke, Lucie Flekova, Moshe Koppel, Thamar Solorio
- Venue: Style-Var
- Publisher: Association for Computational Linguistics
- Pages: 39–43
- URL: https://aclanthology.org/W18-1605
- DOI: 10.18653/v1/W18-1605
- Cite (ACL): Francisco Rangel, Paolo Rosso, Julian Brooke, and Alexandra Uitdenbogerd. 2018. Cross-corpus Native Language Identification via Statistical Embedding. In Proceedings of the Second Workshop on Stylistic Variation, pages 39–43, New Orleans. Association for Computational Linguistics.
- Cite (Informal): Cross-corpus Native Language Identification via Statistical Embedding (Rangel et al., Style-Var 2018)
- PDF: https://preview.aclanthology.org/emnlp22-frontmatter/W18-1605.pdf