Aldan Creo


2026

The widespread use of AI-generated code raises questions about software maintenance and academic integrity. However, tools to detect it are still in their infancy. In this article, we explore the issue of out-of-distribution (OOD) detection: while embedder models like CodeBERT can easily achieve high accuracies in the context of their training data, they are unable to properly generalize to unseen contexts or programming languages. We argue that this is caused by an overfitting of such models to the training distribution, e.g. memorizing a language’s "AI syntax" instead of the true generative artifacts, and develop an approach that is able to naturally generalize to completely unseen languages and domains. Our system is also considerably more interpretable than the deep neural alternatives. In particular, we propose three orthogonal views (lexical, structural, and symbolic) to capture the indicators of AI-generated code. To deal with OOD shift, we normalize the scores per language with Z-scoring and a Gaussian Mixture Model to remove the language bias automatically. We test our approach on the SemEval-2026 Task 13 dataset, where our experiments reached a macro F1 of 0.602 compared to the task baseline of 0.305, demonstrating the generalization capabilities of our system. We make our source code and data available at https://github.com/ACMCMC/COODetect.
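As a rough illustration of the per-language normalization mentioned in the abstract above, the following Python sketch Z-scores raw detector scores within each language, so that languages with very different score ranges become comparable. The function name, the sample scores, and the choice of population standard deviation are assumptions made for illustration and are not taken from the COODetect repository; the Gaussian Mixture Model step is omitted.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscore_per_language(samples):
    """Z-score each raw score against the other scores of the same language.

    `samples` is a list of (language, raw_score) pairs (hypothetical format).
    Returns the normalized scores in the same order as the input.
    """
    by_lang = defaultdict(list)
    for lang, score in samples:
        by_lang[lang].append(score)

    stats = {}
    for lang, scores in by_lang.items():
        sigma = pstdev(scores)
        # Guard against a zero spread (e.g. a single sample per language).
        stats[lang] = (mean(scores), sigma if sigma > 0 else 1.0)

    return [(score - stats[lang][0]) / stats[lang][1] for lang, score in samples]


# Two languages whose raw scores live on very different scales.
samples = [("python", 0.2), ("python", 0.4), ("go", 5.0), ("go", 7.0)]
normalized = zscore_per_language(samples)
```

After normalization, the lower-scoring sample of each language maps to roughly -1 and the higher-scoring one to roughly +1, regardless of the language's original scale.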

2025

In this paper, we propose an approach to detecting hallucinations based on a Named Entity Recognition (NER) task. We focus on efficiency, aiming to develop a model that can detect hallucinations without relying on external data sources or expensive computations that involve state-of-the-art large language models with upwards of tens of billions of parameters. We utilize the SQuAD question answering dataset to generate a synthetic version that contains both correct and hallucinated responses and train encoder language models of moderate size (RoBERTa and FLAN-T5) to predict spans of text that are highly likely to contain a hallucination. We test our models on a separate dataset of expert-annotated question-answer pairs and find that our approach achieves a Jaccard similarity of up to 0.358 and a Spearman correlation of 0.227, which suggests that our models can serve as moderately accurate hallucination detectors, ideally as part of a detection pipeline involving human supervision. We also observe that larger models seem to develop an emergent ability to leverage their background knowledge to make more informed decisions, while smaller models seem to take shortcuts that can lead to a higher number of false positives. We make our data and code publicly accessible, along with an online visualizer. We also release our trained models under an open license.
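For readers unfamiliar with the Jaccard similarity reported in the abstract above, the sketch below computes it between a predicted and a gold set of hallucinated positions. Treating spans as sets of token indices, and scoring two empty sets as 1.0, are assumptions for illustration; the paper's exact evaluation granularity is not stated here.

```python
def jaccard(predicted, gold):
    """Jaccard similarity: size of the intersection over size of the union."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        # Assumed convention: nothing predicted and nothing to find counts as a match.
        return 1.0
    return len(predicted & gold) / len(predicted | gold)


# Hypothetical token indices flagged as hallucinated.
predicted_tokens = {3, 4, 5, 6}
gold_tokens = {5, 6, 7}
score = jaccard(predicted_tokens, gold_tokens)  # 2 shared / 5 total = 0.4
```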
The advent of Large Language Models (LLMs) has enabled the generation of text that increasingly exhibits human-like characteristics. As the detection of such content is of significant importance, substantial research has been conducted with the objective of developing reliable AI-generated text detectors. These detectors have demonstrated promising results on test data, but recent research has revealed that they can be circumvented by employing different techniques. In this paper, we present homoglyph-based attacks (‘A’ → Cyrillic ‘А’) as a means of circumventing existing detectors. We conduct a comprehensive evaluation to assess the effectiveness of these attacks on seven detectors, including ArguGPT, Binoculars, DetectGPT, Fast-DetectGPT, Ghostbuster, OpenAI’s detector, and watermarking techniques, on five different datasets. Our findings demonstrate that homoglyph-based attacks can effectively circumvent state-of-the-art detectors, leading them to classify all texts as either AI-generated or human-written (decreasing the average Matthews Correlation Coefficient from 0.64 to -0.01). Through further examination, we extract the technical justification underlying the success of the attacks, which varies across detectors. Finally, we discuss the implications of these findings and potential defenses against such attacks.
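The substitution at the core of a homoglyph-based attack can be sketched in a few lines of Python. Only the ‘A’ → Cyrillic ‘А’ pair comes from the abstract above; the other mapping entries, the function name, and the sample text are assumptions chosen for illustration, and the paper's actual replacement strategy may differ.

```python
# Latin characters mapped to visually similar Cyrillic code points.
HOMOGLYPHS = {
    "A": "\u0410",  # Cyrillic capital A (the pair named in the abstract)
    "a": "\u0430",  # Cyrillic small a (assumed extra pair)
    "e": "\u0435",  # Cyrillic small ie (assumed extra pair)
    "o": "\u043e",  # Cyrillic small o (assumed extra pair)
}


def apply_homoglyphs(text, mapping=HOMOGLYPHS):
    """Replace every character that has a look-alike; leave the rest unchanged."""
    return "".join(mapping.get(ch, ch) for ch in text)


original = "A note"
attacked = apply_homoglyphs(original)

# The two strings render almost identically but differ at the code point level.
changed = sum(1 for a, b in zip(original, attacked) if a != b)
```

Here three of the six characters change, and the text keeps the same length in characters while growing in UTF-8 bytes, so a detector that tokenizes the attacked string sees a different input than the one a human reader perceives.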