Utkarsh Garg


2026

Large language models increasingly power AI agents for tasks requiring iterative refinement: document editing demands targeted revisions while preserving cross-references, code refactoring requires tracking function dependencies, and knowledge base updates cascade through related entities. Iterative editing with AI agents faces a fundamental efficiency-consistency tradeoff: maintaining consistency requires full-context awareness of dependencies, but processing entire documents for each edit incurs prohibitive token costs and latency. Isolated edits improve efficiency but risk breaking cross-references and violating semantic constraints. We introduce LEDGER (scaLing Agentic document editing with Dependency-aware Graph rEtRieval), a framework that constructs lightweight dependency graphs capturing semantic relationships and structural hierarchies across document elements. For each edit, graph traversal identifies affected elements and retrieves only necessary context. Experiments across 1,900 test cases spanning six state-of-the-art models show LEDGER achieves 76% consistency versus 56% for the baseline while reducing token usage by 85%. Critically, LEDGER with low reasoning effort matches baseline performance at high reasoning effort using 70% fewer tokens, suggesting that explicit dependency representations can substitute for expensive internal reasoning, with implications for agentic systems operating on structured data.
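The retrieval step described in this abstract (traverse a dependency graph from the edited element, then pass only the reachable elements as context) can be illustrated with a minimal Python sketch. This is not LEDGER's implementation: the graph layout, the `max_hops` cutoff, the function names, and the toy document are all assumptions made for illustration.

```python
from collections import deque


def affected_elements(graph, edited, max_hops=2):
    """Breadth-first traversal from the edited element over dependency edges.

    `graph` maps an element id to the ids of elements that depend on it.
    `max_hops` is a hypothetical cutoff, not a parameter named in the abstract.
    """
    seen = {edited}
    order = [edited]
    queue = deque([(edited, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append((neighbor, depth + 1))
    return order


def build_context(elements, graph, edited, max_hops=2):
    """Concatenate only the text of elements reachable from the edit."""
    ids = affected_elements(graph, edited, max_hops)
    return "\n".join(elements[i] for i in ids)


# Hypothetical toy document: four elements, three of them linked.
elements = {
    "sec1": "Section 1 defines the budget.",
    "sec2": "Section 2 refers to the budget in Section 1.",
    "sec3": "Section 3 summarizes Section 2.",
    "sec4": "Section 4 is unrelated.",
}
graph = {"sec1": ["sec2"], "sec2": ["sec3"]}

context = build_context(elements, graph, "sec1")
```

In this toy case an edit to `sec1` pulls in `sec2` and `sec3` but leaves `sec4` out of the prompt, which is the kind of saving the abstract attributes to retrieving only necessary context.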

2023

Dialogue systems need to produce responses that realize multiple types of dialogue acts (DAs) with high semantic fidelity. In the past, natural language generators (NLGs) for dialogue were trained on large parallel corpora that map from a domain-specific DA and its semantic attributes to an output utterance. Recent work shows that large pretrained language models (LLMs) offer new possibilities for controllable NLG using prompt-based learning. Here we develop a novel few-shot overgenerate-and-rank approach that achieves the controlled generation of DAs. We compare eight few-shot prompt styles, including a novel method that generates from textual pseudo-references using a textual style transfer approach. We develop six automatic ranking functions that identify outputs with both the correct DA and high semantic accuracy at generation time. We test our approach on three domains and four LLMs. To our knowledge, this is the first work on NLG for dialogue that automatically ranks outputs using both DA and attribute accuracy. For completeness, we compare our results to fine-tuned few-shot models trained with 5 to 100 instances per DA. Our results show that several prompt settings achieve perfect DA accuracy and near-perfect semantic accuracy (99.81%), and that they perform better than few-shot fine-tuning.
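The overgenerate-and-rank idea can be sketched in a few lines of Python: given several candidate utterances for one input, prefer those with the intended DA, then those that realize the most attributes. The abstract does not specify the six ranking functions, so everything below is an assumption: the verbatim string-match scorer, the punctuation-based DA classifier, and the example candidates are hypothetical stand-ins, and the candidates are hard-coded rather than sampled from an LLM.

```python
def slot_accuracy(text, attributes):
    """Fraction of attribute values that appear verbatim in the text.

    A simple string-match proxy for semantic accuracy (an assumption).
    """
    if not attributes:
        return 1.0
    lowered = text.lower()
    hits = sum(1 for value in attributes.values() if value.lower() in lowered)
    return hits / len(attributes)


def toy_da_classifier(text):
    """Hypothetical stand-in for a DA classifier: questions end with '?'."""
    return "question" if text.strip().endswith("?") else "inform"


def rank(candidates, target_da, attributes, classify=toy_da_classifier):
    """Sort candidates by (correct DA, slot accuracy), best first."""
    def score(text):
        return (classify(text) == target_da, slot_accuracy(text, attributes))
    return sorted(candidates, key=score, reverse=True)


# Hypothetical candidates that a few-shot prompt might have produced.
candidates = [
    "Do you like Italian food?",
    "Zizzi serves Italian food in the city centre.",
    "Zizzi is a restaurant.",
]
attributes = {"name": "Zizzi", "food": "Italian", "area": "city centre"}

ranked = rank(candidates, "inform", attributes)
best = ranked[0]
```

Putting DA correctness first in the score tuple means a candidate with the wrong DA never outranks one with the right DA, regardless of how many attributes it mentions; a weighted sum would be an equally plausible design.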