Aljawahrah Bin Tamran


2025

Octopus: Towards Building the Arabic Speech LLM Suite
Sara Althubaiti | Vasista Sai Lodagala | Tjad Clark | Yousseif Ahmed Elshahawy | Daniel Izham | Abdullah Alrajeh | Aljawahrah Bin Tamran | Ahmed Ali
Proceedings of The Third Arabic Natural Language Processing Conference

We present Octopus, a first family of modular speech-language models designed for Arabic-English ASR, dialect identification, and speech translation. Built on Whisper-V3 and enhanced with large language models such as ALLaM, LLaMA, and DeepSeek, Octopus bridges speech and text through a lightweight projection layer and a Q-Former. To broaden its scope beyond speech, Octopus integrates BEATs, a general-purpose audio encoder, allowing it to understand both linguistic and acoustic events. Despite its simplicity, this dual-encoder design supports robust performance across multilingual and code-switched scenarios. We also introduce TinyOctopus, a distilled variant using smaller models (Distil-Whisper + LLaMA3-1B / DeepSeek-1.5B), achieving competitive results with just a fraction of the parameters. Fine-tuning on synthetic code-switched data further boosts its performance. Octopus demonstrates the power of compact, extensible architectures in Arabic-centric speech modeling and sets the stage for unified multilingual audio-language understanding.
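
The dual-encoder bridge described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that both encoders emit the same number of frames, that their features are fused by frame-wise concatenation, and that the Q-Former can be approximated by a single cross-attention step with learned queries; the abstract specifies none of these. All function names, dimensions, and random weights are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng):
    """Small random matrix standing in for learned weights or encoder output."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def bridge(speech_feats, audio_feats, queries, w_proj):
    """Fuse two encoder outputs and map them into an LLM embedding space.

    speech_feats: T x Ds frames (hypothetical stand-in for a speech encoder)
    audio_feats:  T x Da frames (hypothetical stand-in for an audio encoder)
    queries:      Q x (Ds + Da) learned query vectors
    w_proj:       (Ds + Da) x Dllm projection weights
    """
    # Assumption: fuse by concatenating the two feature vectors of each frame.
    fused = [s + a for s, a in zip(speech_feats, audio_feats)]
    d = len(fused[0])
    # Single-head cross-attention: every query scores every fused frame.
    fused_t = [list(col) for col in zip(*fused)]
    scores = matmul(queries, fused_t)                      # Q x T
    weights = [softmax([v / math.sqrt(d) for v in row]) for row in scores]
    pooled = matmul(weights, fused)                        # Q x (Ds + Da)
    # Linear projection into the LLM embedding dimension.
    return matmul(pooled, w_proj)                          # Q x Dllm

# Usage with toy sizes: 6 frames, 2 queries, LLM dimension 5.
rng = random.Random(0)
T, Ds, Da, Q, Dllm = 6, 4, 3, 2, 5
tokens = bridge(
    rand_matrix(T, Ds, rng),
    rand_matrix(T, Da, rng),
    rand_matrix(Q, Ds + Da, rng),
    rand_matrix(Ds + Da, Dllm, rng),
)
print(len(tokens), len(tokens[0]))
```

Whatever the input length T, the sketch returns a fixed number Q of vectors in the LLM embedding dimension, which is the property that lets a query-based bridge compress long audio into a short prefix for the language model.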