Cross-Genre Native Language Identification with Open-Source Large Language Models

Robin Nicholls, Kenneth Alperin


Abstract
Native Language Identification (NLI) is a crucial area within computational linguistics, aimed at determining an author’s first language (L1) based on their proficiency in a second language (L2). Recent studies have shown remarkable improvements in NLI accuracy due to advancements in large language models (LLMs). This paper investigates the performance of open-source LLMs on short-form comments from the Reddit-L2 corpus compared to their performance on the TOEFL11 corpus of non-native English essays. Our experiments revealed that fine-tuning on TOEFL11 significantly improved accuracy on Reddit-L2, demonstrating the transferability of linguistic features across different text genres. Conversely, models fine-tuned on Reddit-L2 also generalised well to TOEFL11, achieving over 90% accuracy and F1 scores for the native languages that appear in both corpora. This shows the strong transfer performance from long-form to short-form text and vice versa. Additionally, we explored the task of classifying authors as native or non-native English speakers, where fine-tuned models achieve near-perfect accuracy on the Reddit-L2 dataset. Our findings emphasize the impact of document length on model performance, with optimal results observed up to approximately 1200 tokens. This study highlights the effectiveness of open-source LLMs in NLI tasks across diverse linguistic contexts, suggesting their potential for broader applications in real-world scenarios.
Anthology ID:
2025.luhme-1.10
Volume:
Proceedings of the 2nd LUHME Workshop
Month:
October
Year:
2025
Address:
Bologna, Italy
Editors:
Henrique Lopes Cardoso, Rui Sousa-Silva, Maarit Koponen, Antonio Pareja-Lora
Venue:
LUHME
Publisher:
LUHME
Pages:
103–108
URL:
https://preview.aclanthology.org/ingest-luhme/2025.luhme-1.10/
Cite (ACL):
Robin Nicholls and Kenneth Alperin. 2025. Cross-Genre Native Language Identification with Open-Source Large Language Models. In Proceedings of the 2nd LUHME Workshop, pages 103–108, Bologna, Italy. LUHME.
Cite (Informal):
Cross-Genre Native Language Identification with Open-Source Large Language Models (Nicholls & Alperin, LUHME 2025)
PDF:
https://preview.aclanthology.org/ingest-luhme/2025.luhme-1.10.pdf