Adithya Sagar
2026
CoSy: Conversational Synthesis for Grounded Question Answering
Patrick Huber | Arash Einolghozati | Rylan Conway | Kanika Narang | Matt Smith | Waqar Nayyar | Adithya Sagar | Ahmed A Aly | Akshat Shrivastava
Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
High-quality, large-scale conversational datasets are scarce, making it difficult to train on-device language models (~1B parameters) as effective assistants. We introduce CoSy (Conversational Synthesis), a novel framework for generating diverse, steerable, multi-turn conversations at scale. CoSy combines three key mechanisms: (1) conversational graphs that ensure natural dialogue flow, (2) turn-based prompt augmentations for diversity, and (3) explicit linguistic phenomena for coherence. We evaluate CoSy on conversational grounded reasoning tasks (i.e., answering questions based on contextual information), a core on-device use case. Our on-device-sized models trained on CoSy-synthesized data achieve competitive performance with human-annotated baselines and outperform instruction-tuned models of up to 70B parameters in zero-shot settings.
CoSMoEs: Compact Sparse Mixture of Experts
Patrick Huber | Akshat Shrivastava | Ernie Chang | Chinnadhurai Sankar | Ahmed A Aly | Adithya Sagar
Proceedings of the 4th Workshop on Advances in Language and Vision Research (ALVR)
Sparse Mixture of Experts (MoE) models are widely used foundation architectures at large scale, yet remain under-explored at smaller sizes. In this work, we introduce Compact Sparse Mixture of Experts (CoSMoEs) for on-device inference, addressing three key challenges: quality, memory, and latency. On the quality front, we conduct a fair evaluation (removing confounding factors) and show that MoE architectures outperform dense models at on-device scale. We further propose weight-decomposed experts, which improve MoE performance beyond the standard formulation. On the memory and latency front, we address the prohibitively large parameter count of MoE models by improving expert offloading efficiency through a novel training-time loss, reducing inference latency for on-device deployment.
MobileLLM-Flash: Latency-Guided On-Device LLM Design for Industry Scale Deployment
Hanxian Huang | Igor Fedorov | Andrey Gromov | Bernard Beckerman | Naveen Suda | David Eriksson | Maximilian Balandat | Rylan Conway | Patrick Huber | Chinnadhurai Sankar | Ayushi Dalmia | Zechun Liu | Lemeng Wu | Tarek Elgamal | Adithya Sagar | Vikas Chandra | Raghuraman Krishnamoorthi
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)
Real-time AI experiences call for on-device large language models (OD-LLMs) optimized for efficient deployment on resource-constrained hardware. The most useful OD-LLMs produce near-real-time responses and exhibit broad hardware compatibility, maximizing user reach. We present a methodology for designing such models using hardware-in-the-loop architecture search under mobile latency constraints. This system is amenable to industry-scale deployment: it generates models deployable without custom kernels and compatible with standard mobile runtimes like ExecuTorch. Our methodology avoids specialized attention mechanisms and instead uses attention skipping for long-context acceleration. Our approach jointly optimizes model architecture (layers, dimensions) and attention pattern. To efficiently evaluate candidates, we treat each as a pruned version of a pretrained backbone with inherited weights, thereby achieving high accuracy with minimal continued pretraining. We leverage the low cost of latency evaluation in a staged process: learning an accurate latency model first, then searching for the Pareto frontier across latency and quality. This yields MobileLLM-Flash, a family of foundation models (350M, 650M, 1.4B) for efficient on-device use with strong capabilities, supporting up to 8k context length. MobileLLM-Flash delivers up to 1.8x and 1.6x faster prefill and decode, respectively, on mobile CPUs with comparable or superior quality. Our analysis of Pareto-frontier design choices offers actionable principles for OD-LLM design.
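As an illustration of the final search stage mentioned in the abstract, the following is a minimal sketch of selecting a Pareto frontier over latency and quality. It is not the authors' implementation: the `Candidate` type, the candidate names, and all numbers are hypothetical, and the abstract does not say how latency or quality are measured.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """Hypothetical architecture candidate (fields are assumptions)."""
    name: str
    latency_ms: float  # lower is better
    quality: float     # higher is better


def pareto_frontier(candidates: list[Candidate]) -> list[Candidate]:
    """Return candidates that no other candidate beats on both axes.

    Exact duplicates in (latency, quality) are collapsed to one entry.
    """
    # Sort by latency ascending; on ties, put higher quality first.
    ordered = sorted(candidates, key=lambda c: (c.latency_ms, -c.quality))
    frontier: list[Candidate] = []
    best_quality = float("-inf")
    for c in ordered:
        # Keep c only if it improves on every faster candidate's quality.
        if c.quality > best_quality:
            frontier.append(c)
            best_quality = c.quality
    return frontier


# Made-up numbers, for illustration only.
candidates = [
    Candidate("A", 10.0, 0.50),
    Candidate("B", 12.0, 0.48),
    Candidate("C", 15.0, 0.60),
    Candidate("D", 15.0, 0.55),
    Candidate("E", 20.0, 0.60),
]
frontier_names = [c.name for c in pareto_frontier(candidates)]
print(frontier_names)
```

Sorting first keeps the selection to a single pass; in this made-up example only A and C survive, since B, D, and E are each matched or beaten on quality by a candidate that is at least as fast.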
2024
PRoDeliberation: Parallel Robust Deliberation for End-to-End Spoken Language Understanding
Trang Le | Daniel Lazar | Suyoun Kim | Shan Jiang | Duc Le | Adithya Sagar | Aleksandr Livshits | Ahmed A Aly | Akshat Shrivastava
Findings of the Association for Computational Linguistics: EMNLP 2024
Spoken Language Understanding (SLU) is a critical component of voice assistants; it consists of converting speech to semantic parses for task execution. Previous works have explored end-to-end models to improve the quality and robustness of SLU models with Deliberation; however, these models have remained autoregressive, resulting in higher latencies. In this work we introduce PRoDeliberation, a novel method leveraging a Connectionist Temporal Classification-based decoding strategy as well as a denoising objective to train robust non-autoregressive deliberation models. We show that PRoDeliberation achieves the latency reduction of parallel decoding (2-10x improvement over autoregressive models) while retaining the ability of autoregressive deliberation systems to correct Automatic Speech Recognition (ASR) mistranscriptions. We further show that the design of the denoising training allows PRoDeliberation to overcome the limitations of small ASR devices, and we provide analysis on the necessity of each component of the system.
Large Language Models as Zero-shot Dialogue State Tracker through Function Calling
Zekun Li | Zhiyu Zoey Chen | Mike Ross | Patrick Huber | Seungwhan Moon | Zhaojiang Lin | Luna Dong | Adithya Sagar | Xifeng Yan | Paul A. Crook
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) are increasingly prevalent in conversational systems due to their advanced understanding and generative capabilities in general contexts. However, their effectiveness in task-oriented dialogues (TOD), which require not only response generation but also effective dialogue state tracking (DST) within specific tasks and domains, remains less than satisfactory. In this work, we propose a novel approach, FnCTOD, for solving DST with LLMs through function calling. This method improves zero-shot DST, allowing adaptation to diverse domains without extensive data collection or model tuning. Our experimental results demonstrate that our approach achieves exceptional performance with both modestly sized open-source and proprietary LLMs: with in-context prompting it enables various 7B or 13B parameter models to surpass the previous state-of-the-art (SOTA) achieved by ChatGPT, and improves ChatGPT’s performance, beating the SOTA by 5.6% average joint goal accuracy (JGA). Individual model results for GPT-3.5 and GPT-4 are boosted by 4.8% and 14%, respectively. We also show that by fine-tuning on a small collection of diverse task-oriented dialogues, we can equip modestly sized models, specifically a 13B parameter LLaMA2-Chat model, with function-calling capabilities and DST performance comparable to ChatGPT while maintaining their chat capabilities. We have made the code publicly available at https://github.com/facebookresearch/FnCTOD.
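The abstract reports results in joint goal accuracy (JGA) without defining it. The sketch below assumes the commonly used definition, under which a turn counts as correct only if every slot in the predicted dialogue state matches the gold state exactly. The slot names and values are hypothetical, and this is not taken from the FnCTOD code base.

```python
DialogueState = dict[str, str]  # slot name -> value (assumed representation)


def joint_goal_accuracy(
    predicted: list[DialogueState], gold: list[DialogueState]
) -> float:
    """Fraction of turns whose predicted state equals the gold state exactly."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must cover the same turns")
    if not gold:
        return 0.0
    # Dict equality requires the same slots with the same values,
    # so a single wrong, missing, or extra slot makes the turn incorrect.
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical four-turn dialogue; the third prediction misses one slot.
gold_states = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "4"},
    {"hotel-area": "north", "hotel-stars": "4", "hotel-parking": "yes"},
    {"hotel-area": "north", "hotel-stars": "4", "hotel-parking": "yes"},
]
predicted_states = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "4"},
    {"hotel-area": "north", "hotel-stars": "4"},
    {"hotel-area": "north", "hotel-stars": "4", "hotel-parking": "yes"},
]
jga = joint_goal_accuracy(predicted_states, gold_states)
print(jga)
```

Because the metric is all-or-nothing per turn, the one missing slot in the third turn costs that whole turn, giving 3 of 4 turns correct here.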
2022
RetroNLU: Retrieval Augmented Task-Oriented Semantic Parsing
Vivek Gupta | Akshat Shrivastava | Adithya Sagar | Armen Aghajanyan | Denis Savenkov
Proceedings of the 4th Workshop on NLP for Conversational AI
While large pre-trained language models accumulate a lot of knowledge in their parameters, it has been demonstrated that augmenting them with non-parametric retrieval-based memory has a number of benefits, ranging from improved accuracy to data efficiency, for knowledge-focused tasks such as question answering. In this work, we apply retrieval-based modeling ideas to the challenging, complex task of multi-domain task-oriented semantic parsing for conversational assistants. Our technique, RetroNLU, extends a sequence-to-sequence model architecture with a retrieval component, which is used to retrieve existing similar samples and present them as additional context to the model. In particular, we analyze two settings, where we augment an input with (a) retrieved nearest neighbor utterances (utterance-nn), and (b) ground-truth semantic parses of nearest neighbor utterances (semparse-nn). Our technique outperforms the baseline method by 1.5% absolute macro-F1, especially in the low-resource setting, matching the baseline model accuracy with only 40% of the complete data. Furthermore, we analyze the quality, model sensitivity, and performance of the nearest neighbor retrieval component for semantic parses of varied utterance complexity.
Co-authors
- Patrick Huber 4
- Akshat Shrivastava 4
- Ahmed A Aly 3
- Rylan Conway 2
- Chinnadhurai Sankar 2
- Armen Aghajanyan 1
- Maximilian Balandat 1
- Bernard Beckerman 1
- Vikas Chandra 1
- Ernie Chang 1
- Zhiyu Zoey Chen 1
- Paul A. Crook 1
- Ayushi Dalmia 1
- Xin Luna Dong 1
- Arash Einolghozati 1
- Tarek Elgamal 1
- David Eriksson 1
- Igor Fedorov 1
- Andrey Gromov 1
- Vivek Gupta 1
- Hanxian Huang 1
- Shan Jiang 1
- Suyoun Kim 1
- Raghuraman Krishnamoorthi 1
- Daniel Lazar 1
- Duc Le 1
- Trang Le 1
- Zekun Li 1
- Zhaojiang Lin 1
- Zechun Liu 1
- Aleksandr Livshits 1
- Seungwhan Moon 1
- Kanika Narang 1
- Waqar Nayyar 1
- Mike Ross 1
- Denis Savenkov 1
- Matt Smith 1
- Naveen Suda 1
- Lemeng Wu 1
- Xifeng Yan 1