Introducing OmniGEC: A Silver Multilingual Dataset for Grammatical Error Correction

Roman Kovalchuk, Mariana Romanyshyn, Petro Ivaniuk


Abstract
In this paper, we introduce OmniGEC, a collection of multilingual silver-standard datasets for the task of Grammatical Error Correction (GEC), covering eleven languages: Czech, English, Estonian, German, Greek, Icelandic, Italian, Latvian, Slovene, Swedish, and Ukrainian. These datasets facilitate the development of multilingual GEC solutions and help bridge the data gap in adapting English GEC solutions to multilingual GEC. The texts in the datasets originate from three sources: Wikipedia edits for the eleven target languages, subreddits from Reddit in the eleven target languages, and the Ukrainian-only UberText 2.0 social media corpus. While Wikipedia edits were derived from human-made corrections, the Reddit and UberText 2.0 data were automatically corrected with the GPT-4o-mini model. The quality of the corrections in the datasets was evaluated both automatically and manually. Finally, we fine-tune two open-source large language models — Aya-Expanse (8B) and Gemma-3 (12B) — on the multilingual OmniGEC corpora and achieve state-of-the-art (SOTA) results for paragraph-level multilingual GEC. The dataset collection and the best-performing models are available on Hugging Face.
Anthology ID: 2025.unlp-1.17
Volume: Proceedings of the Fourth Ukrainian Natural Language Processing Workshop (UNLP 2025)
Month: July
Year: 2025
Address: Vienna, Austria (online)
Editor: Mariana Romanyshyn
Venues: UNLP | WS
Publisher: Association for Computational Linguistics
Pages: 162–178
URL: https://preview.aclanthology.org/acl25-workshop-ingestion/2025.unlp-1.17/
Cite (ACL): Roman Kovalchuk, Mariana Romanyshyn, and Petro Ivaniuk. 2025. Introducing OmniGEC: A Silver Multilingual Dataset for Grammatical Error Correction. In Proceedings of the Fourth Ukrainian Natural Language Processing Workshop (UNLP 2025), pages 162–178, Vienna, Austria (online). Association for Computational Linguistics.
Cite (Informal): Introducing OmniGEC: A Silver Multilingual Dataset for Grammatical Error Correction (Kovalchuk et al., UNLP 2025)
PDF: https://preview.aclanthology.org/acl25-workshop-ingestion/2025.unlp-1.17.pdf