Grammar Pruning: Enabling Low-Latency Zero-Shot Task-Oriented Language Models for Edge AI

Octavian Alexandru Trifan, Jason Lee Weber, Marc Titus Trifan, Alexandru Nicolau, Alexander Veidenbaum


Abstract
Edge deployment of task-oriented semantic parsers demands high accuracy under tight latency and memory budgets. We present Grammar Pruning, a lightweight zero-shot framework that begins with a user-defined schema of API calls and couples a rule-based entity extractor with an iterative grammar-constrained decoder: extracted items dynamically prune the context-free grammar, limiting generation to only those intents, slots, and values that remain plausible at each step. This aggressive search-space reduction both reduces hallucinations and slashes decoding time. On the adapted FoodOrdering, APIMixSNIPS, and APIMixATIS benchmarks, Grammar Pruning with small language models achieves an average execution accuracy of over 90%, rivaling state-of-the-art cloud-based solutions, while sustaining at least 2× lower end-to-end latency than existing methods. By requiring nothing beyond the domain's full set of API schema values yet delivering precise, real-time natural-language understanding, Grammar Pruning positions itself as a practical building block for future edge-AI applications that cannot rely on large models or cloud offloading.
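To make the abstract's core idea concrete, the following is a minimal illustrative sketch, not the paper's implementation: a hypothetical schema (`SCHEMA`) maps intents to slots and allowed values, a rule-based extractor keeps only the schema values that literally appear in the utterance, and the resulting matches prune the candidate intents and slot values before any constrained decoding would run. All names and the toy schema here are assumptions for illustration.

```python
# Hypothetical API schema: intent -> slot -> allowed values.
SCHEMA = {
    "order_food": {"item": ["pizza", "burger"], "size": ["small", "large"]},
    "book_flight": {"origin": ["SFO", "JFK"], "dest": ["SFO", "JFK"]},
}

def extract_entities(utterance: str) -> dict:
    """Rule-based extraction: keep any schema value found verbatim in the text."""
    found: dict = {}
    text = utterance.lower()
    for intent, slots in SCHEMA.items():
        for slot, values in slots.items():
            for value in values:
                if value.lower() in text:
                    found.setdefault(intent, {}).setdefault(slot, []).append(value)
    return found

def prune_grammar(utterance: str) -> dict:
    """Return the pruned grammar: only intents/slots/values supported by extracted entities.

    A constrained decoder would then be restricted to productions in this subset,
    shrinking the search space at each generation step.
    """
    return extract_entities(utterance)

print(prune_grammar("a large pizza please"))
# only the order_food intent survives, with item/size narrowed to the matched values
```

In this toy version pruning and extraction coincide; the paper's iterative variant interleaves pruning with decoding, re-restricting the grammar after each step.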
Anthology ID:
2025.emnlp-main.858
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
16955–16968
URL:
https://preview.aclanthology.org/ingest-luhme/2025.emnlp-main.858/
DOI:
10.18653/v1/2025.emnlp-main.858
Cite (ACL):
Octavian Alexandru Trifan, Jason Lee Weber, Marc Titus Trifan, Alexandru Nicolau, and Alexander Veidenbaum. 2025. Grammar Pruning: Enabling Low-Latency Zero-Shot Task-Oriented Language Models for Edge AI. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 16955–16968, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Grammar Pruning: Enabling Low-Latency Zero-Shot Task-Oriented Language Models for Edge AI (Trifan et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-luhme/2025.emnlp-main.858.pdf
Checklist:
2025.emnlp-main.858.checklist.pdf