QSpell 250K: A Large-Scale, Practical Dataset for Chinese Search Query Spell Correction

Dezhi Ye; Haomei Jia; Junwei Hu; Tian Bowen; Jie Liu; Haijin Liang; Jin Ma; Wenmin Wang

Abstract

Chinese Search Query Spell Correction is the task of automatically identifying and correcting typographical errors in search engine queries. Although comprehensive datasets such as Microsoft Speller and Webis are available, their monolingual nature and limited scope make it difficult to evaluate modern pre-trained language models such as BERT and GPT. To address this, we introduce QSpell 250K, a large-scale benchmark developed specifically for Chinese Query Spelling Correction. QSpell 250K offers several advantages: 1) It contains over 250K samples, ten times more than previous datasets. 2) It covers a broad range of topics, from formal entities to everyday colloquialisms and idiomatic expressions. 3) It includes both Chinese and English, addressing the complexities of code-switching. Each query undergoes three rounds of high-fidelity annotation to ensure accuracy. Our extensive testing across three popular models demonstrates that QSpell 250K effectively evaluates representative spelling correctors. We believe that QSpell 250K will significantly advance spelling correction methodologies. The accompanying data and code will be made publicly available.

Anthology ID: 2025.naacl-industry.13
Volume: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)
Month: April
Year: 2025
Address: Albuquerque, New Mexico
Editors: Weizhu Chen, Yi Yang, Mohammad Kachuee, Xue-Yong Fu
Venue: NAACL
Publisher: Association for Computational Linguistics
Pages: 148–155
URL: https://preview.aclanthology.org/fix-sig-urls/2025.naacl-industry.13/
Cite (ACL): Dezhi Ye, Haomei Jia, Junwei Hu, Tian Bowen, Jie Liu, Haijin Liang, Jin Ma, and Wenmin Wang. 2025. QSpell 250K: A Large-Scale, Practical Dataset for Chinese Search Query Spell Correction. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track), pages 148–155, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal): QSpell 250K: A Large-Scale, Practical Dataset for Chinese Search Query Spell Correction (Ye et al., NAACL 2025)
PDF: https://preview.aclanthology.org/fix-sig-urls/2025.naacl-industry.13.pdf