Vasista Sai Lodagala
2025
Octopus: Towards Building the Arabic Speech LLM Suite
Sara Althubaiti | Vasista Sai Lodagala | Tjad Clark | Yousseif Ahmed Elshahawy | Daniel Izham | Abdullah Alrajeh | Aljawahrah Bin Tamran | Ahmed Ali
Proceedings of The Third Arabic Natural Language Processing Conference
We present Octopus, a first family of modular speech-language models designed for Arabic-English ASR, dialect identification, and speech translation. Built on Whisper-V3 and enhanced with large language models such as ALLaM, LLaMA, and DeepSeek, Octopus bridges speech and text through a lightweight projection layer and a Q-Former. To broaden its scope beyond speech, Octopus integrates BEATs, a general-purpose audio encoder, allowing it to understand both linguistic and acoustic events. Despite its simplicity, this dual-encoder design supports robust performance across multilingual and code-switched scenarios. We also introduce TinyOctopus, a distilled variant using smaller models (Distil-Whisper + LLaMA3-1B / DeepSeek-1.5B) that achieves competitive results with a fraction of the parameters. Fine-tuning on synthetic code-switched data further boosts its performance. Octopus demonstrates the power of compact, extensible architectures in Arabic-centric speech modeling and sets the stage for unified multilingual audio-language understanding.
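The dual-encoder idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the abstract does not say how the two encoder outputs are combined or how the Q-Former is configured, so the per-frame concatenation, the single attention head, the random weights, and all names and dimensions below are assumptions chosen only to show the data flow (two feature streams, a small set of queries that compress them, and a linear projection into the LLM embedding size).

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Random stand-in for encoder outputs or learned weights."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def fuse_frames(speech_feats, audio_feats):
    """Assumption: fuse the two encoders by concatenating features per frame."""
    assert len(speech_feats) == len(audio_feats), "frame counts must match"
    return [s + a for s, a in zip(speech_feats, audio_feats)]


def query_cross_attention(queries, frames):
    """Simplified Q-Former-style step: each query attends over all frames.

    A real Q-Former stacks several transformer layers with learned
    key/value projections; this keeps only one cross-attention pass.
    """
    dim = len(queries[0])
    frames_t = [list(col) for col in zip(*frames)]  # (dim x num_frames)
    scores = matmul(queries, frames_t)  # (num_queries x num_frames)
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    return matmul(weights, frames)  # (num_queries x dim)


rng = random.Random(0)
num_frames, speech_dim, audio_dim = 20, 8, 6  # hypothetical sizes
num_queries, llm_dim = 4, 16  # hypothetical sizes

speech_feats = random_matrix(num_frames, speech_dim, rng)  # stand-in for the speech encoder
audio_feats = random_matrix(num_frames, audio_dim, rng)  # stand-in for the audio encoder
frames = fuse_frames(speech_feats, audio_feats)

queries = random_matrix(num_queries, speech_dim + audio_dim, rng)
projection = random_matrix(speech_dim + audio_dim, llm_dim, rng)

attended = query_cross_attention(queries, frames)
llm_inputs = matmul(attended, projection)  # vectors that would be fed to the LLM
print(len(llm_inputs), len(llm_inputs[0]))  # 4 16
```

The point of the sketch is that the number of vectors handed to the language model is set by the number of queries (4 here), not by the number of audio frames (20), which is one reason a query-based bridge keeps the interface to the LLM lightweight.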