@inproceedings{ng-markov-2025-leveraging,
    title = "Leveraging Open-Source Large Language Models for Native Language Identification",
    author = "Ng, Yee Man  and
      Markov, Ilia",
    editor = "Scherrer, Yves  and
      Jauhiainen, Tommi  and
      Ljube{\v{s}}i{\'c}, Nikola  and
      Nakov, Preslav  and
      Tiedemann, J{\"o}rg  and
      Zampieri, Marcos",
    booktitle = "Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects",
    month = jan,
    year = "2025",
    address = "Abu Dhabi, UAE",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.vardial-1.3/",
    pages = "20--28",
    abstract = "Native Language Identification (NLI) {--} the task of identifying the native language (L1) of a person based on their writing in the second language (L2) {--} has applications in forensics, marketing, and second language acquisition. Historically, conventional machine learning approaches that heavily rely on extensive feature engineering have outperformed transformer-based language models on this task. Recently, closed-source generative large language models (LLMs), e.g., GPT-4, have demonstrated remarkable performance on NLI in a zero-shot setting, including promising results in open-set classification. However, closed-source LLMs have many disadvantages, such as high costs and the undisclosed nature of training data. This study explores the potential of using open-source LLMs for NLI. Our results indicate that open-source LLMs do not reach the accuracy levels of closed-source LLMs when used out-of-the-box. However, when fine-tuned on labeled training data, open-source LLMs can achieve performance comparable to that of commercial LLMs."
}