Tao Bai

2026

OmniOData: Unleashing Small Language Models for OData Query Generation with Synthetic Data and Reinforcement Learning
Tao Bai | Zhaochen Li | Hongxin Shao | Daniel Dahlmeier
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)

Despite the success of Large Language Models (LLMs) in structured query generation, OData, a critical RESTful protocol for enterprise APIs, remains under-researched due to a lack of high-fidelity, execution-validated datasets. To bridge this gap, we introduce OmniOData, a framework that generates SynOData, the first large-scale OData corpus featuring execution-grounded queries and reasoning traces. Using this corpus, we develop OmniOData-R1 (1.5B–3B parameters), a family of models that match or surpass frontier proprietary systems, such as GPT-4o and Gemini 3, on realistic industrial benchmarks. Our results demonstrate that the synergy of execution-verified synthetic data and Reinforcement Learning (RL) effectively unlocks the latent reasoning of Small Language Models (SLMs), providing a high-performance, low-latency solution for specialized enterprise query generation. The code and data will be released under an open-source license.