Native Language Identification Using Large, Longitudinal Data
Xiao Jiang, Yufan Guo, Jeroen Geertzen, Dora Alexopoulou, Lin Sun, Anna Korhonen
Abstract
Native Language Identification (NLI) is the task of determining the native language (L1) of learners of a second language (L2) on the basis of their written texts. To date, research on NLI has focused on relatively small corpora. We apply NLI to the recently released EFCamDat corpus, which is not only multiple times larger than previous L2 corpora but also provides longitudinal data at several proficiency levels. Our investigation, using accurate machine learning with a wide range of linguistic features, reveals interesting patterns in the longitudinal data which are useful for both the further development of NLI and its application to research on L2 acquisition.
- Anthology ID:
- L14-1051
- Volume:
- Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
- Month:
- May
- Year:
- 2014
- Address:
- Reykjavik, Iceland
- Editors:
- Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
- Venue:
- LREC
- Publisher:
- European Language Resources Association (ELRA)
- Pages:
- 3309–3312
- URL:
- http://www.lrec-conf.org/proceedings/lrec2014/pdf/1068_Paper.pdf
- Cite (ACL):
- Xiao Jiang, Yufan Guo, Jeroen Geertzen, Dora Alexopoulou, Lin Sun, and Anna Korhonen. 2014. Native Language Identification Using Large, Longitudinal Data. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 3309–3312, Reykjavik, Iceland. European Language Resources Association (ELRA).
- Cite (Informal):
- Native Language Identification Using Large, Longitudinal Data (Jiang et al., LREC 2014)