Matt Smith
2026
CoSy: Conversational Synthesis for Grounded Question Answering
Patrick Huber | Arash Einolghozati | Rylan Conway | Kanika Narang | Matt Smith | Waqar Nayyar | Adithya Sagar | Ahmed A Aly | Akshat Shrivastava
Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
High-quality, large-scale conversational datasets are scarce, making it difficult to train on-device language models (~1B parameters) as effective assistants. We introduce CoSy (Conversational Synthesis), a novel framework for generating diverse, steerable, multi-turn conversations at scale. CoSy combines three key mechanisms: (1) conversational graphs that ensure natural dialogue flow, (2) turn-based prompt augmentations for diversity, and (3) explicit linguistic phenomena for coherence. We evaluate CoSy on conversational grounded reasoning tasks (i.e., answering questions based on contextual information), a core on-device use case. Our on-device-sized models trained on CoSy-synthesized data achieve competitive performance with human-annotated baselines and outperform instruction-tuned models of up to 70B parameters in zero-shot settings.
2024
AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model
Seungwhan Moon | Andrea Madotto | Zhaojiang Lin | Tushar Nagarajan | Matt Smith | Shashank Jain | Chun-Fu Yeh | Prakash Murugesan | Peyman Heidari | Yue Liu | Kavya Srinet | Babak Damavandi | Anuj Kumar
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
We present Any-Modality Augmented Language Model (AnyMAL), a unified model that reasons over diverse input modality signals (i.e., text, image, video, audio, IMU motion sensor) and generates textual responses. AnyMAL inherits the powerful text-based reasoning abilities of state-of-the-art LLMs, including Llama-3 (70B), and converts modality-specific signals to the joint textual space through a pre-trained aligner module. In this paper, we provide details on the optimizations implemented to efficiently scale the training pipeline, and present a comprehensive recipe for model and training configurations. We conduct a comprehensive empirical analysis comprising both human and automatic evaluations, and demonstrate state-of-the-art performance on various multimodal tasks compared to industry-leading models, albeit with a relatively small number of trainable parameters.