Yike Zhao


2025

UnifiedGEC: Integrating Grammatical Error Correction Approaches for Multi-languages with a Unified Framework
Yike Zhao | Xiaoman Wang | Yunshi Lan | Weining Qian
Proceedings of the 31st International Conference on Computational Linguistics: System Demonstrations

Grammatical Error Correction (GEC) is an important research direction in NLP. Although many models with different architectures, trained on datasets across different languages, have been developed to support this research, there is no comprehensive evaluation of these models, and the diversity of architectures makes it hard for developers to implement them on their own. To address this limitation, we present UnifiedGEC, the first open-source GEC-oriented toolkit, which consists of several core components and reusable modules. In UnifiedGEC, we integrate 5 widely-used GEC models and compare their performance on 7 datasets in different languages. Additionally, GEC-related modules such as data augmentation and prompt engineering are also deployed in it. Developers can implement new models and run and evaluate them on existing benchmarks through our framework in a simple way. Code, documentation, and detailed results of UnifiedGEC are available at https://github.com/AnKate/UnifiedGEC.

More Data or Better Data? A Critical Analysis of Data Selection and Synthesis for Mathematical Reasoning
Yike Zhao | Simin Guo | Ziqing Yang | Shifan Han | Dahua Lin | Fei Tan
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track

The reasoning capabilities of Large Language Models (LLMs) play a critical role in many downstream tasks, yet they depend strongly on the quality of training data. Despite various proposed data construction methods, their practical utility in real-world pipelines remains underexplored. In this work, we conduct a comprehensive analysis of open-source datasets and data synthesis techniques for mathematical reasoning, evaluating them under a unified pipeline designed to mirror training and deployment scenarios. We further distill effective data selection strategies and identify practical methods suitable for industrial applications. Our findings highlight that structuring data in more interpretable formats or distilling from stronger models often outweighs simply scaling up data volume. This study provides actionable guidance for integrating training data to enhance LLM capabilities, supporting both cost-effective data curation and scalable model enhancement. We hope this work will inspire further research on how to balance “more data” versus “better data” for real-world reasoning tasks.

VisCGEC: Benchmarking the Visual Chinese Grammatical Error Correction
Xiaoman Wang | Dan Yuan | Xin Liu | Yike Zhao | Xiaoxiao Zhang | Xizhi Chen | Yunshi Lan
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)