Muhammad Huzaifah


2025

Benchmarking Contextual and Paralinguistic Reasoning in Speech-LLMs: A Case Study with In-the-Wild Data
Qiongqiong Wang | Hardik Bhupendra Sailor | Tianchi Liu | Wenyu Zhang | Muhammad Huzaifah | Nattadaporn Lertcheva | Shuo Sun | Nancy F. Chen | Jinyang Wu | AiTi Aw
Findings of the Association for Computational Linguistics: EMNLP 2025

Recent speech-LLMs have shown impressive performance on tasks such as transcription and translation, yet they remain limited in understanding the paralinguistic aspects of speech that are crucial for social and emotional intelligence. We propose CP-Bench, a benchmark for evaluating speech-LLMs on contextual paralinguistic reasoning: the integration of verbal content with non-verbal cues such as emotion and prosody. The benchmark includes two curated question answering (QA) datasets requiring both linguistic and empathetic understanding. We evaluate state-of-the-art open- and closed-source speech-LLMs and perform a comprehensive analysis across different question types. We further analyze the top two models under varying decoding temperatures to understand the effect of temperature on this task. Our benchmark reveals a key gap in existing evaluations and offers insights into building more context-aware and emotionally intelligent speech-capable LLMs.
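
For illustration only, a minimal sketch of what a temperature sweep over a contextual paralinguistic QA set might look like; the model wrapper, dataset fields, and exact-match scoring below are hypothetical placeholders, not CP-Bench's actual evaluation protocol.

```python
# Hypothetical sketch: probe how decoding temperature affects QA accuracy.
# `model.generate`, the item fields, and exact-match scoring are assumptions.

def evaluate_at_temperature(model, dataset, temperature):
    """Return accuracy of `model` on (audio, question, answer) items at one temperature."""
    correct = 0
    for item in dataset:
        # The wrapper is assumed to accept raw audio plus a text question.
        prediction = model.generate(
            audio=item["audio"],
            prompt=item["question"],
            temperature=temperature,
            max_new_tokens=64,
        )
        # Simple exact-match scoring; real benchmarks often use choice parsing
        # or an LLM judge instead.
        if prediction.strip().lower() == item["answer"].strip().lower():
            correct += 1
    return correct / len(dataset)


def temperature_sweep(model, dataset, temperatures=(0.0, 0.3, 0.7, 1.0)):
    """Map each temperature to the resulting accuracy for one model."""
    return {t: evaluate_at_temperature(model, dataset, t) for t in temperatures}
```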

2024

Evaluating Code-Switching Translation with Large Language Models
Muhammad Huzaifah | Weihua Zheng | Nattapol Chanpaisit | Kui Wu
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Recent advances in large language models (LLMs) have shown that they can match or surpass finetuned models on many natural language processing tasks. An increasing number of studies are now assessing whether this performance extends across different languages. In this paper, we present a thorough evaluation of LLMs in the less well-researched code-switching translation setting, where inputs mix multiple languages. We benchmark six state-of-the-art LLMs across seven datasets, with GPT-4 and GPT-3.5 displaying strong ability relative to supervised translation models and commercial engines. GPT-4 was also found to be particularly robust across different code-switching conditions. We propose several methods to further improve code-switching translation, including in-context learning and pivot translation. Through our code-switching experiments, we argue that LLMs show promising ability for cross-lingual understanding.
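
As an illustration of the two proposed strategies, a hedged sketch of few-shot in-context prompting and pivot translation for code-switched input; the `chat` callable, the example pair, and the prompt wording are assumptions for the sketch, not the paper's actual setup.

```python
# Hypothetical sketch of two strategies for code-switching translation:
# (1) few-shot in-context learning, (2) translation through a pivot language.
# `chat` stands in for any text-in/text-out LLM call.

FEW_SHOT_EXAMPLES = [
    # (code-switched source, target translation) pairs, e.g. drawn from a dev set.
    ("I makan already, you want to join later?",
     "Ich habe schon gegessen, willst du später dazukommen?"),
]

def build_prompt(source, target_lang, examples=FEW_SHOT_EXAMPLES):
    """Compose a few-shot prompt showing the model how to handle mixed-language input."""
    lines = [f"Translate the code-switched sentences into {target_lang}."]
    for src, tgt in examples:
        lines.append(f"Source: {src}\n{target_lang}: {tgt}")
    lines.append(f"Source: {source}\n{target_lang}:")
    return "\n\n".join(lines)

def pivot_translate(chat, source, pivot_lang, target_lang):
    """Translate via an intermediate pivot language (e.g., English) in two LLM calls."""
    pivot_text = chat(
        f"Translate into {pivot_lang}, preserving the meaning of every language in the input:\n{source}"
    )
    return chat(f"Translate into {target_lang}:\n{pivot_text}")
```

Pivoting through a high-resource language is a common workaround when one side of a code-switched pair is poorly covered by the model; the sketch simply chains two monolingual-style requests.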

2023

I2R’s End-to-End Speech Translation System for IWSLT 2023 Offline Shared Task
Muhammad Huzaifah | Kye Min Tan | Richeng Duan
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)

This paper describes I2R’s submission to the offline speech translation track of IWSLT 2023. We focus on an end-to-end approach to translating English audio into German text, one of the three language directions available in this year’s edition. The I2R system builds its base model on pretrained models that have been exposed to large-scale audio and text data. We introduce several stages of additional pretraining followed by fine-tuning to adapt the system to the downstream speech translation task. This strategy is supplemented by further techniques such as data augmentation, domain tagging, knowledge distillation, and model ensembling. We evaluate the system on several publicly available test sets for comparison.
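
A minimal sketch of one of the listed techniques, token-level knowledge distillation, assuming a standard cross-entropy plus softened-KL formulation in PyTorch; the paper's exact distillation setup is not given in the abstract, so the blending weight and temperature here are illustrative.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, temperature=2.0):
    """Blend cross-entropy on gold target tokens with a KL term toward the
    teacher's temperature-softened distribution (a common KD recipe, assumed here)."""
    vocab = student_logits.size(-1)
    # Cross-entropy against the reference target tokens (padding marked with -100).
    ce = F.cross_entropy(
        student_logits.view(-1, vocab),
        labels.view(-1),
        ignore_index=-100,
    )
    # KL divergence between softened student and teacher distributions,
    # rescaled by T^2 to keep gradient magnitudes comparable.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return alpha * ce + (1.0 - alpha) * kd
```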