Cross-Genre Native Language Identification with Open-Source Large Language Models

Robin Nicholls; Kenneth Alperin

Abstract

Native Language Identification (NLI) is a crucial area within computational linguistics, aimed at determining an author’s first language (L1) based on their proficiency in a second language (L2). Recent studies have shown remarkable improvements in NLI accuracy due to advancements in large language models (LLMs). This paper investigates the performance of open-source LLMs on short-form comments from the Reddit-L2 corpus compared to their performance on the TOEFL11 corpus of non-native English essays. Our experiments revealed that fine-tuning on TOEFL11 significantly improved accuracy on Reddit-L2, demonstrating the transferability of linguistic features across different text genres. Conversely, models fine-tuned on Reddit-L2 also generalised well to TOEFL11, achieving over 90% accuracy and F1 scores for the native languages that appear in both corpora. This shows the strong transfer performance from long-form to short-form text and vice versa. Additionally, we explored the task of classifying authors as native or non-native English speakers, where fine-tuned models achieve near-perfect accuracy on the Reddit-L2 dataset. Our findings emphasize the impact of document length on model performance, with optimal results observed up to approximately 1200 tokens. This study highlights the effectiveness of open-source LLMs in NLI tasks across diverse linguistic contexts, suggesting their potential for broader applications in real-world scenarios.

Anthology ID: 2025.luhme-1.10
Volume: Proceedings of the 2nd LUHME Workshop
Month: October
Year: 2025
Address: Bologna, Italy
Editors: Henrique Lopes Cardoso, Rui Sousa-Silva, Maarit Koponen, Antonio Pareja-Lora
Venue: LUHME
Publisher: LUHME
Pages: 103–108
URL: https://preview.aclanthology.org/ingest-luhme/2025.luhme-1.10/
Cite (ACL): Robin Nicholls and Kenneth Alperin. 2025. Cross-Genre Native Language Identification with Open-Source Large Language Models. In Proceedings of the 2nd LUHME Workshop, pages 103–108, Bologna, Italy. LUHME.
Cite (Informal): Cross-Genre Native Language Identification with Open-Source Large Language Models (Nicholls & Alperin, LUHME 2025)
PDF: https://preview.aclanthology.org/ingest-luhme/2025.luhme-1.10.pdf
