Denny Vrandečić
Also published as: Denny Vrandecic
2025
WETBench: A Benchmark for Detecting Task-Specific Machine-Generated Text on Wikipedia
Gerrit Quaremba | Elizabeth Black | Denny Vrandecic | Elena Simperl
Proceedings of the 2nd Workshop on Advancing Natural Language Processing for Wikipedia (WikiNLP 2025)
Given Wikipedia’s role as a trusted source of high-quality, reliable content, there are growing concerns about the proliferation of low-quality machine-generated text (MGT) produced by large language models (LLMs) on its platform. Reliable detection of MGT is therefore essential, yet existing work primarily evaluates MGT detectors on generic generation tasks rather than on tasks more commonly performed by Wikipedia editors. This misalignment can lead to poor generalisability when applied to real-world Wikipedia contexts. We introduce WETBench, a multilingual, multi-generator, and task-specific benchmark for MGT detection. We define three editing tasks empirically grounded in Wikipedia editors’ perceived use cases for LLM-assisted editing: Paragraph Writing, Summarisation, and Text Style Transfer, which we implement using two new datasets across three languages. For each writing task, we evaluate three prompts, produce MGT across multiple generators using the best-performing prompt, and benchmark diverse detectors. We find that, across settings, training-based detectors achieve an average accuracy of 78%, while zero-shot detectors average 58%. These results demonstrate that detectors struggle with MGT in realistic generation scenarios and underscore the importance of evaluating such models on diverse, task-specific data to assess their reliability in editor-driven contexts.
2020
Wiki-40B: Multilingual Language Model Dataset
Mandy Guo | Zihang Dai | Denny Vrandečić | Rami Al-Rfou
Proceedings of the Twelfth Language Resources and Evaluation Conference
We propose a new multilingual language model benchmark that is composed of 40+ languages spanning several scripts and linguistic families. With around 40 billion characters, we hope this new resource will accelerate research on multilingual modeling. We train monolingual causal language models using a state-of-the-art model (Transformer-XL), establishing baselines for many languages. We also introduce the task of multilingual causal language modeling, where we train our model on the combined text of 40+ languages from Wikipedia with different vocabulary sizes and evaluate on the languages individually. We release the cleaned-up text of 40+ Wikipedia language editions, the corresponding trained monolingual language models, and several multilingual language models with different fixed vocabulary sizes.
Introducing Lexical Masks: a New Representation of Lexical Entries for Better Evaluation and Exchange of Lexicons
Bruno Cartoni | Daniel Calvelo Aros | Denny Vrandecic | Saran Lertpradit
Proceedings of the Twelfth Language Resources and Evaluation Conference
The evaluation and exchange of large lexicon databases remains a challenge in many NLP applications. Despite the existence of commonly accepted standards for the format and the features used in a lexicon, there is still a lack of precise and interoperable specification requirements about what lexical entries of a particular language should look like, both in terms of the number of forms and in terms of the features associated with these forms. This paper presents the notion of “lexical masks”, a powerful tool used to evaluate and exchange lexicon databases in many languages.