Jean Lahoud
2025
LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM
Sambal Shikhar | Mohammed Irfan Kurpath | Sahal Shaji Mullappilly | Jean Lahoud | Fahad Shahbaz Khan | Rao Muhammad Anwer | Salman Khan | Hisham Cholakkal
Findings of the Association for Computational Linguistics: ACL 2025
Recent advancements in speech-to-speech dialogue systems leverage LLMs for multimodal interactions, yet they remain hindered by fine-tuning requirements, high computational overhead, and text-speech misalignment. Existing speech-enabled LLMs often degrade conversational quality by modifying the LLM, thereby compromising its linguistic capabilities. In contrast, we propose LLMVoX, a lightweight 30M-parameter, LLM-agnostic, autoregressive streaming TTS system that generates high-quality speech with low latency, while fully preserving the capabilities of the base LLM. Our approach achieves a significantly lower Word Error Rate compared to speech-enabled LLMs, while operating at comparable latency. By decoupling speech synthesis from LLM processing via a multi-queue token streaming system, LLMVoX enables seamless, infinite-length dialogues. Its plug-and-play design also facilitates extension to various tasks with different backbones. Furthermore, LLMVoX generalizes to new languages with minimal dataset adaptation, attaining a low Character Error Rate on an Arabic speech task. Evaluations demonstrate that LLMVoX matches or surpasses existing speech-enabled LLMs in both speech quality and latency, while maintaining the original linguistic strengths of the LLM. Additionally, we have integrated LLMVoX with a Vision-Language Model to create an omni-model with speech, text, and vision capabilities, without requiring additional multimodal training.
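The abstract's key architectural idea is decoupling speech synthesis from LLM decoding through a multi-queue token streaming system. The sketch below illustrates that decoupling pattern in general terms: an LLM producer thread streams text tokens into one queue while a separate TTS worker drains buffered text and pushes audio chunks into a second queue. This is only a minimal illustrative pipeline; `fake_llm_stream`, `fake_tts`, the chunking threshold, and the thread layout are hypothetical stand-ins, not the actual LLMVoX implementation.

```python
# Minimal sketch of a decoupled text/audio streaming pipeline (illustrative only).
# The LLM producer and the TTS worker run in separate threads connected by queues,
# so speech synthesis never blocks LLM decoding and dialogues can stream indefinitely.
import queue
import threading

text_q: "queue.Queue[str | None]" = queue.Queue()
audio_q: "queue.Queue[bytes | None]" = queue.Queue()

def fake_llm_stream(prompt: str):
    # Stand-in for a streaming LLM; yields text tokens one at a time.
    for tok in f"Echoing: {prompt}".split():
        yield tok + " "

def fake_tts(text: str) -> bytes:
    # Stand-in for a lightweight streaming TTS; returns a dummy audio chunk.
    return text.encode("utf-8")

def llm_producer(prompt: str) -> None:
    for tok in fake_llm_stream(prompt):
        text_q.put(tok)
    text_q.put(None)  # sentinel: LLM finished

def tts_worker(chunk_chars: int = 16) -> None:
    buf = ""
    while True:
        tok = text_q.get()
        if tok is None:
            break
        buf += tok
        if len(buf) >= chunk_chars:  # synthesize once enough text has accumulated
            audio_q.put(fake_tts(buf))
            buf = ""
    if buf:
        audio_q.put(fake_tts(buf))
    audio_q.put(None)  # sentinel: no more audio

threading.Thread(target=llm_producer, args=("hello world streaming demo",)).start()
threading.Thread(target=tts_worker).start()

while (chunk := audio_q.get()) is not None:
    print(f"play {len(chunk)} bytes")  # in practice: stream to an audio device
```

In this layout, latency is governed by how much text the TTS worker buffers before synthesizing, while the LLM continues decoding in parallel; the queues are what allow "infinite-length" dialogue without either side waiting on the other.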
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
Omkar Thawakar | Dinura Dissanayake | Ketan Pravin More | Ritesh Thawkar | Ahmed Heakl | Noor Ahsan | Yuhao Li | Ilmuz Zaman Mohammed Zumri | Jean Lahoud | Rao Muhammad Anwer | Hisham Cholakkal | Ivan Laptev | Mubarak Shah | Fahad Shahbaz Khan | Salman Khan
Findings of the Association for Computational Linguistics: ACL 2025
Step-by-step reasoning is crucial for solving complex visual tasks, yet existing approaches lack a comprehensive framework for evaluating this capability and do not emphasize step-wise problem-solving. To this end, we propose a comprehensive framework for advancing multi-step visual reasoning in large multimodal models (LMMs) through three key contributions. First, we introduce a Visual Reasoning Chain Benchmark, the most comprehensive benchmark for multi-step visual reasoning, covering eight diverse categories and over 4k reasoning steps. This enables rigorous evaluation of LMMs’ ability to reason accurately and interpretably across multiple steps. Second, we propose a fine-grained reasoning metric that evaluates correctness and logical coherence at each step, providing deeper insights beyond traditional accuracy metrics. Third, we introduce LlamaV-o1, a state-of-the-art multimodal reasoning model trained using a multi-step curriculum learning approach. LlamaV-o1 is optimized for structured, step-by-step reasoning and significantly outperforms existing open-source models. It surpasses Llava-CoT with a 3.8% absolute gain across six benchmarks, achieving an average score of 67.3 while being 5x faster during inference scaling. Our benchmark, model, and code are available at https://github.com/mbzuai-oryx/LlamaV-o1.
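The abstract proposes scoring correctness and logical coherence at each reasoning step rather than only the final answer. The sketch below shows one generic way such a step-level score could be computed: each predicted step is aligned to the corresponding reference step and judged by token overlap, and per-step judgments are averaged so incomplete chains are penalized. This is an illustrative assumption, not the paper's actual metric; the function names, the overlap measure, and the threshold are hypothetical.

```python
# Illustrative step-wise reasoning score (not the paper's exact metric):
# score each predicted step against its aligned reference step, then average.
def step_overlap(pred: str, ref: str) -> float:
    # Jaccard overlap between the token sets of a predicted and a reference step.
    p, r = set(pred.lower().split()), set(ref.lower().split())
    return len(p & r) / len(p | r) if p | r else 1.0

def chain_score(pred_steps: list[str], ref_steps: list[str], threshold: float = 0.5) -> float:
    # A step counts as correct when its overlap with the aligned reference step
    # exceeds the threshold; missing steps score zero, penalizing incomplete chains.
    if not ref_steps:
        return 1.0
    correct = 0
    for i, ref in enumerate(ref_steps):
        pred = pred_steps[i] if i < len(pred_steps) else ""
        if step_overlap(pred, ref) >= threshold:
            correct += 1
    return correct / len(ref_steps)

# Example: a three-step reference chain with one missing predicted step.
ref = ["identify the objects in the image",
       "count the red objects",
       "compare the count to five"]
pred = ["identify objects in the image",
        "count red objects"]
print(chain_score(pred, ref))  # -> 0.666..., since two of three steps are matched
```

A position-aligned, per-step score of this kind exposes where a reasoning chain breaks down, which is the kind of insight a single final-answer accuracy number cannot provide.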