Ryuto Koike
2026
ExaGPT: Example-Based Machine-Generated Text Detection for Human Interpretability
Ryuto Koike | Masahiro Kaneko | Ayana Niwa | Preslav Nakov | Naoaki Okazaki
Findings of the Association for Computational Linguistics: ACL 2026
Detecting texts generated by Large Language Models (LLMs) could cause grave mistakes due to incorrect decisions, such as undermining students’ academic dignity. LLM text detection thus needs to ensure the interpretability of the decision, which can help users judge how reliable its prediction is. When humans verify whether a text is human-written or LLM-generated, they intuitively investigate with which of the two it shares more similar spans. However, existing interpretable detectors are not aligned with the human decision-making process and fail to offer evidence that users easily understand. To bridge this gap, we introduce ExaGPT, an interpretable detection approach grounded in the human decision-making process for verifying the origin of a text. ExaGPT classifies a text by checking whether it shares more similar spans with human-written or with LLM-generated texts from a datastore. This approach can provide, as evidence, similar span examples that contribute to the decision for each span in the text. Our human evaluation demonstrates that providing similar span examples contributes more effectively to judging the correctness of the decision than existing interpretable methods. Moreover, extensive experiments in four domains and three generators show that ExaGPT massively outperforms prior interpretable detectors by up to +37.0 points of accuracy at a false positive rate of 1%.
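The span-comparison idea in the abstract above can be illustrated with a toy sketch. This is not the authors' implementation: the word n-gram spans, the difflib string similarity, the majority vote, and all function names (spans, best_match, detect) are assumptions chosen only to keep the example self-contained.

```python
from difflib import SequenceMatcher


def spans(text, n=3):
    """Split text into overlapping word n-grams (an assumed span definition)."""
    words = text.lower().split()
    if len(words) <= n:
        return [" ".join(words)]
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def best_match(span, datastore):
    """Return (similarity, example) for the most similar span in a non-empty datastore."""
    return max((SequenceMatcher(None, span, ex).ratio(), ex) for ex in datastore)


def detect(text, human_store, llm_store, n=3):
    """Label a text by majority vote over its spans and keep the matched examples.

    Each span votes for whichever datastore holds its closest example; ties go
    to "human". The returned evidence lists (span, vote, closest example).
    """
    evidence = []
    llm_votes = 0
    for span in spans(text, n):
        h_sim, h_ex = best_match(span, human_store)
        l_sim, l_ex = best_match(span, llm_store)
        if l_sim > h_sim:
            llm_votes += 1
            evidence.append((span, "llm", l_ex))
        else:
            evidence.append((span, "human", h_ex))
    label = "llm" if llm_votes * 2 > len(evidence) else "human"
    return label, evidence


# Toy datastores built from one made-up sentence each (illustrative only).
human_store = spans("honestly i kinda forgot my homework again lol")
llm_store = spans("in conclusion it is important to note that education plays a vital role")

label, evidence = detect("it is important to note that education matters",
                         human_store, llm_store)
for span, vote, example in evidence:
    print(f"{span!r} -> {vote} (closest: {example!r})")
print("decision:", label)
```

The evidence list is the point of the sketch: each span is shown next to the datastore example that drove its vote, which is the kind of output a user could inspect to judge the decision.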
Is Human-Like Text Liked by Humans? Multilingual Human Detection and Preference Against AI
Yuxia Wang | Rui Xing | Jonibek Mansurov | Giovanni Puccetti | Zhuohan Xie | Minh Ngoc Ta | Jiahui Geng | Jinyan Su | Mervat Abassy | Saadeldine Eletter | Kareem Elozeiri | Nurkhan Laiyk | Maiya Goloburda | Tarek Mahmoud | Raj Vardhan Tomar | Alexander Aziz | Ryuto Koike | Masahiro Kaneko | Artem Shelmanov | Ekaterina Artemova | Vladislav Mikhailov | Akim Tsvigun | Alham Fikri Aji | Nizar Habash | Iryna Gurevych | Preslav Nakov
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Prior studies have shown that distinguishing text generated by Large Language Models (LLMs) from human-written text is highly challenging for humans, and often no better than random guessing. To verify the generalizability of this finding across languages and domains, we perform an extensive case study to identify the upper bound of human detection accuracy. Across 16 datasets covering 9 languages and 9 domains, 19 annotators achieved an average detection accuracy of 87.6%, thus challenging previous conclusions. We find that major gaps between human and machine text lie in concreteness, cultural nuances, and diversity. Explicitly explaining these distinctions in the prompts can partially bridge the gaps in over 50% of the cases. However, we also find that humans do not always prefer human-written text, particularly when they cannot clearly identify its source. We release our dataset, the human labels, and the annotator metadata at https://github.com/xnlp-lab/HumanEval-MGT.
2025
GenAI Content Detection Task 1: English and Multilingual Machine-Generated Text Detection: AI vs. Human
Yuxia Wang | Artem Shelmanov | Jonibek Mansurov | Akim Tsvigun | Vladislav Mikhailov | Rui Xing | Zhuohan Xie | Jiahui Geng | Giovanni Puccetti | Ekaterina Artemova | Jinyan Su | Minh Ngoc Ta | Mervat Abassy | Kareem Ashraf Elozeiri | Saad El Dine Ahmed El Etter | Maiya Goloburda | Tarek Mahmoud | Raj Vardhan Tomar | Nurkhan Laiyk | Osama Mohammed Afzal | Ryuto Koike | Masahiro Kaneko | Alham Fikri Aji | Nizar Habash | Iryna Gurevych | Preslav Nakov
Proceedings of the 1st Workshop on GenAI Content Detection (GenAIDetect)
We present the GenAI Content Detection Task 1, a shared task on binary machine-generated text detection, conducted as a part of the GenAI workshop at COLING 2025. The task consists of two subtasks: Monolingual (English) and Multilingual. The shared task attracted many participants: 36 teams made official submissions to the Monolingual subtask during the test phase and 27 teams to the Multilingual subtask. We provide a comprehensive overview of the data, a summary of the results (including system rankings and performance scores), detailed descriptions of the participating systems, and an in-depth analysis of submissions.
2024
Likelihood-based Mitigation of Evaluation Bias in Large Language Models
Masanari Oi | Masahiro Kaneko | Ryuto Koike | Mengsay Loem | Naoaki Okazaki
Findings of the Association for Computational Linguistics: ACL 2024
Large Language Models (LLMs) are widely used to evaluate natural language generation tasks as automated metrics. However, the likelihood, a measure of LLM’s plausibility for a sentence, can vary due to superficial differences in sentences, such as word order and sentence structure. It is therefore possible that there might be a likelihood bias if LLMs are used for evaluation: they might overrate sentences with higher likelihoods while underrating those with lower likelihoods. In this paper, we investigate the presence and impact of likelihood bias in LLM-based evaluators. We also propose a method to mitigate the likelihood bias. Our method utilizes highly biased instances as few-shot examples for in-context learning. Our experiments in evaluating the data-to-text and grammatical error correction tasks reveal that several LLMs we test display a likelihood bias. Furthermore, our proposed method successfully mitigates this bias, also improving evaluation performance (in terms of correlation of models with human scores) significantly.
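One rough way to picture the likelihood bias described above is to check whether the gap between an LLM evaluator's score and the human score rises with the sentence's likelihood. The abstract does not define the measurement, so the sketch below is only an assumption: it uses a plain Pearson correlation, and the function names and toy numbers are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def likelihood_bias(log_likelihoods, llm_scores, human_scores):
    """Correlate likelihood with how far the LLM score sits above the human score.

    A value near +1 suggests the evaluator overrates high-likelihood sentences
    and underrates low-likelihood ones; a value near 0 suggests no such trend.
    """
    gaps = [llm - human for llm, human in zip(llm_scores, human_scores)]
    return pearson(log_likelihoods, gaps)


# Made-up numbers for three sentences that humans rated equally.
bias = likelihood_bias(
    log_likelihoods=[-30.0, -20.0, -10.0],
    llm_scores=[2, 3, 5],
    human_scores=[3, 3, 3],
)
print(f"bias estimate: {bias:.3f}")
```

Under this assumed measure, the instances with the largest absolute gap would be natural candidates for the "highly biased instances" that the abstract says are used as few-shot examples.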
How You Prompt Matters! Even Task-Oriented Constraints in Instructions Affect LLM-Generated Text Detection
Ryuto Koike | Masahiro Kaneko | Naoaki Okazaki
Findings of the Association for Computational Linguistics: EMNLP 2024
To combat the misuse of Large Language Models (LLMs), many recent studies have presented LLM-generated-text detectors with promising performance. When users instruct LLMs to generate texts, the instruction can include different constraints depending on the user’s need. However, most recent studies do not cover such diverse instruction patterns when creating datasets for LLM detection. In this paper, we reveal that even task-oriented constraints (constraints that would naturally be included in an instruction and are not related to detection evasion) cause existing powerful detectors to have a large variance in detection performance. We focus on student essay writing as a realistic domain and manually create task-oriented constraints based on several factors for essay quality. Our experiments show that the standard deviation (SD) of current detector performance on texts generated by an instruction with such a constraint is significantly larger (up to an SD of 14.4 F1-score) than that caused by generating texts multiple times or paraphrasing the instruction. We also observe an overall trend where the constraints can make LLM detection more challenging than without them. Finally, our analysis indicates that the high instruction-following ability of LLMs fosters the large impact of such constraints on detection performance.
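The variance measurement mentioned in this abstract can be sketched as follows: score the detector separately on the texts produced under each constraint, then take the standard deviation of those F1 scores. The abstract does not say whether a sample or population SD is used, so the sample SD here is an assumption, and the constraint names and predictions are made-up toy data.

```python
from statistics import stdev


def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class (here: 1 means LLM-generated)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def f1_spread(y_true, preds_by_constraint):
    """Return per-constraint F1 scores and their sample standard deviation."""
    scores = {name: f1_score(y_true, preds)
              for name, preds in preds_by_constraint.items()}
    return scores, stdev(scores.values())


# Toy labels: three LLM-generated essays and one human-written essay.
y_true = [1, 1, 1, 0]
# Hypothetical detector outputs on essays generated under two instructions.
preds_by_constraint = {
    "no constraint": [1, 1, 1, 0],
    "formal tone": [1, 0, 0, 0],
}

scores, sd = f1_spread(y_true, preds_by_constraint)
print(scores)
print(f"SD across constraints: {100 * sd:.1f} F1 points")
```

A large spread across constraints with a small spread across repeated generations of the same instruction would correspond to the pattern the abstract reports.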
Co-authors
- Masahiro Kaneko 5
- Preslav Nakov 3
- Naoaki Okazaki 3
- Mervat Abassy 2
- Alham Fikri Aji 2
- Ekaterina Artemova 2
- Jiahui Geng 2
- Maiya Goloburda 2
- Iryna Gurevych 2
- Nizar Habash 2
- Nurkhan Laiyk 2
- Tarek Mahmoud 2
- Jonibek Mansurov 2
- Vladislav Mikhailov 2
- Giovanni Puccetti 2
- Artem Shelmanov 2
- Jinyan Su 2
- Minh Ngoc Ta 2
- Raj Vardhan Tomar 2
- Akim Tsvigun 2
- Yuxia Wang 2
- Zhuohan Xie 2
- Rui Xing 2
- Osama Mohammed Afzal 1
- Alexander Aziz 1
- Saad El Dine Ahmed El Etter 1
- Saadeldine Eletter 1
- Kareem Elozeiri 1
- Kareem Ashraf Elozeiri 1
- Mengsay Loem 1
- Ayana Niwa 1
- Masanari Oi 1