Andrei Kucharavy

2025

Low-Perplexity LLM-Generated Sequences and Where To Find Them
Arthur Wuhrmann | Andrei Kucharavy | Anastasiia Kucherenko
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

As Large Language Models (LLMs) become increasingly widespread, understanding how specific training data shapes their outputs is crucial for transparency, accountability, privacy, and fairness. To explore how LLMs leverage and replicate their training data, we introduce a systematic approach centered on analyzing low-perplexity sequences, that is, high-probability text spans generated by the model. Our pipeline reliably extracts long sequences of this kind across diverse topics while avoiding degeneration, then traces them back to their sources in the training data. Surprisingly, we find that a substantial portion of these low-perplexity spans cannot be mapped to the corpus. For those that do match, we quantify the distribution of occurrences across source documents, highlighting the scope and nature of verbatim recall and paving the way toward a better understanding of how LLMs' training data impacts their behavior.
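To make the term "low-perplexity sequence" in the abstract above concrete, here is a minimal sketch of how the perplexity of a token span relates to its per-token probabilities, and of how spans below a threshold could be flagged. It is not the authors' pipeline: the function names, the fixed window size, and the threshold are hypothetical choices for illustration, and the token probabilities are toy values rather than model outputs.

```python
import math

def perplexity(logprobs):
    """Perplexity of a span: exp of the negative mean natural-log token probability."""
    return math.exp(-sum(logprobs) / len(logprobs))

def low_perplexity_windows(logprobs, window, threshold):
    """Start indices of fixed-size windows whose perplexity is below `threshold`.

    `window` and `threshold` are illustrative parameters (assumptions),
    not values taken from the paper.
    """
    starts = []
    for i in range(len(logprobs) - window + 1):
        if perplexity(logprobs[i:i + window]) < threshold:
            starts.append(i)
    return starts

# Toy per-token probabilities (made up for illustration).
token_logprobs = [math.log(p) for p in [0.2, 0.9, 0.95, 0.9, 0.1, 0.3]]
hits = low_perplexity_windows(token_logprobs, window=3, threshold=1.2)
print(hits)  # only the window starting at index 1 is high-probability throughout
```

A perplexity close to 1 means the model assigned nearly all probability mass to each token it emitted, which is why such spans are natural candidates for verbatim recall.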

LLMs Protégés: Tutoring LLMs with Knowledge Gaps Improves Student Learning Outcome
Andrei Kucharavy | Cyril Vallez | Dimitri Percia David
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)

Since the release of ChatGPT, Large Language Models (LLMs) have been proposed as potential tutors for students to improve education outcomes. Such an LLM-as-tutor metaphor is problematic, notably due to counterfactual generation, the perception of learned skills as mastered by an automated system and hence non-valuable, and over-reliance on LLMs in learning. We propose instead the LLM-as-mentee tutoring schema, LLM Protégés, which leverages the Learning-by-Teaching protégé effect in peer tutoring. In this configuration, counterfactual generation is desirable, allowing students to operationalize the learning material and better understand the limitations of LLM-based systems, both a skill in itself and an additional learning motivation. Our preliminary results suggest that LLM Protégés are effective. Students in an introductory algorithms class who successfully diagnosed an LLM teachable agent system prompted to err on course material gained an average of 0.72 points on a 1-6 scale. Remarkably, if fully adopted, this approach would reduce the failure rate in the second midterm from 28% to 8%, mitigating 72% of midterm failures. We publish code for on-premises deployment of LLM Protégés at https://github.com/Reliable-Information-Lab-HEVS/LLM_Proteges.

Low-Resource Languages LLM Disinformation is Within Reach: The Case of Walliserdeutsch
Andrei Kucharavy | Sherine Seppey | Cyril Vallez | Dimitri Percia David | Ljiljana Dolamic
Findings of the Association for Computational Linguistics: EMNLP 2025

LLM-augmented online disinformation is of particular concern for low-resource languages, given their prior limited exposure to it. While current LLMs lack fluency in such languages, their multilingual and emerging capabilities can potentially still be leveraged. In this paper, we investigate whether a moderately sophisticated attacker can leverage such capabilities and perform an impersonation attack in the Walliserdeutsch dialect, a low-resource (100k speakers) Swiss German Highest Alemannic dialect that is generally unintelligible to speakers of both Standard German and other Swiss German dialects and that presents considerable within-dialect variability. We show that while standard few-shot prompting of SotA LLMs, even by native Walliserdeutsch speakers, yields easily human-detectable texts, an expert attacker performing parameter-efficient fine-tuning (PEFT) on a small SotA LLM is partially able to perform such an impersonation with minimal resources, even if the fine-tuned LLM does not advertise any capabilities in Germanic languages. With Walliserdeutsch presenting many features of low-resource languages and dialects, our results suggest that LLM-augmented disinformation is within reach for low-resource languages, highlighting the urgency of LLM detectability research in low-resource languages.

2021

Can the Transformer Be Used as a Drop-in Replacement for RNNs in Text-Generating GANs?
Kevin Blin | Andrei Kucharavy
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

In this paper we address the problem of fine-tuned text generation with a limited computational budget. For that, we use a well-performing text generative adversarial network (GAN) architecture, Diversity-Promoting GAN (DPGAN), and attempt a drop-in replacement of the LSTM layer with a self-attention-based Transformer layer in order to leverage its efficiency. The resulting Self-Attention DPGAN (SADPGAN) was evaluated for performance, quality and diversity of generated text, and stability. Computational experiments suggest that a Transformer architecture cannot serve as a drop-in replacement for the LSTM layer: it under-performs during the pre-training phase and undergoes a complete mode collapse during the GAN tuning phase. Our results suggest that the Transformer architecture needs to be adapted before it can be used as a replacement for RNNs in text-generating GANs.

Co-authors

Sherine Seppey 1

Arthur Wuhrmann 1
