Xiong Wang

2026

LLM-ForcedAligner: A Non-Autoregressive and Accurate LLM-Based Forced Aligner for Multilingual and Long-Form Speech
Bingshen Mu | Xian Shi | Xiong Wang | Hexin Liu | Jin Xu | Lei Xie
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Forced alignment (FA) predicts start and end timestamps for words or characters in speech, but existing methods are language-specific and prone to cumulative temporal shifts. The multilingual speech understanding and long-sequence processing abilities of speech large language models (SLLMs) make them promising for FA in multilingual, crosslingual, and long-form speech settings. However, directly applying the next-token prediction paradigm of SLLMs to FA results in hallucinations and slow inference. To bridge the gap, we propose LLM-ForcedAligner, reformulating FA as a slot-filling paradigm: timestamps are treated as discrete indices, and special timestamp tokens are inserted as slots into the transcript. Conditioned on the speech embeddings and the transcript with slots, the SLLM directly predicts the time indices at slots. During training, causal attention masking with non-shifted input and label sequences allows each slot to predict its own timestamp index based on itself and preceding context, with loss computed only at slot positions. Dynamic slot insertion enables FA at arbitrary positions. Moreover, non-autoregressive inference is supported, avoiding hallucinations and improving speed. Experiments across multilingual, crosslingual, and long-form speech scenarios show that LLM-ForcedAligner achieves a 69% to 78% relative reduction in accumulated averaging shift compared with prior methods. The checkpoint and inference code will be released later.
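
The slot-filling formulation in the abstract above can be illustrated with a small sketch. It shows only how a transcript might be expanded with slot tokens and how non-shifted labels could be built so that loss applies at slot positions alone. The slot token name `<ts>`, the 80 ms index resolution, the ignore value `-100`, and the placement of two slots after each word are all assumptions made for illustration; the abstract does not specify them.

```python
# Hypothetical constants, not taken from the paper.
SLOT = "<ts>"      # assumed name for the special timestamp token
IGNORE = -100      # assumed label for positions excluded from the loss
FRAME_MS = 80      # assumed duration covered by one discrete time index


def to_index(t_ms, frame_ms=FRAME_MS):
    """Map a timestamp in milliseconds to a discrete time index."""
    return t_ms // frame_ms


def build_slotted(words, spans_ms):
    """Insert a start slot and an end slot after each word.

    Returns (tokens, labels) of equal length. Labels are NOT shifted:
    the label at a slot position is that slot's own time index, and
    every non-slot position carries IGNORE.
    """
    tokens, labels = [], []
    for word, (start_ms, end_ms) in zip(words, spans_ms):
        tokens += [word, SLOT, SLOT]
        labels += [IGNORE, to_index(start_ms), to_index(end_ms)]
    return tokens, labels


words = ["hello", "world"]
spans_ms = [(0, 400), (480, 960)]
tokens, labels = build_slotted(words, spans_ms)

# Positions that would contribute to the loss under this sketch.
slot_positions = [i for i, tok in enumerate(tokens) if tok == SLOT]
```

Because every slot is already present in the input, all slot indices could in principle be read off in a single forward pass, which is the sense in which inference is non-autoregressive.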

Federated low-rank adaptation (LoRA) enables multiple clients to collaboratively fine-tune large language models (LLMs) without disclosing their raw data. However, existing works often suffer performance degradation due to biased model aggregation and are hindered by significant communication and computation burdens, both of which limit training efficiency. In this paper, we propose iFLoRA, an improved Federated LoRA fine-tuning system for LLMs featuring pipelined error-mitigated model aggregation and adaptive matrix-wise parameter freezing. Specifically, iFLoRA mitigates aggregation error by first reconstructing local update matrices from clients’ low-rank matrices. These are then aggregated into a global update, which is decomposed via singular value decomposition (SVD) to form low-rank matrices for the next round. To mitigate the overhead from SVD, iFLoRA employs a pipeline to overlap global aggregation, local computation, and communication. Additionally, iFLoRA implements an adaptive matrix-wise freezing scheme that assesses the stability of the low-rank matrices and selectively freezes them for adaptively adjusted periods, alleviating client training overheads without compromising model performance. Extensive experiments on real-world datasets show that iFLoRA improves time-to-target by 2.17-8.48× over state-of-the-art methods. Our code is available at: https://github.com/whr819987540/iflora.
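
A toy example may help show why reconstructing local updates before aggregation matters. The sketch below assumes the biased baseline averages each client's low-rank factors separately, which is one common source of aggregation error in federated LoRA; the abstract does not name the baseline, so this is an assumption. The two clients, their rank-1 factors, and the helper functions are hypothetical, and the SVD step is omitted.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]


def mean(mats):
    """Element-wise mean of equally shaped matrices."""
    n = len(mats)
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(m[i][j] for m in mats) / n for j in range(cols)]
            for i in range(rows)]


# Two hypothetical clients with rank-1 LoRA factors: B is 2x1, A is 1x2.
B1, A1 = [[1.0], [0.0]], [[1.0, 0.0]]
B2, A2 = [[0.0], [1.0]], [[0.0, 1.0]]

# Assumed biased baseline: average the factors, then multiply.
naive = matmul(mean([B1, B2]), mean([A1, A2]))

# Reconstruct each local update B @ A first, then average the updates.
exact = mean([matmul(B1, A1), matmul(B2, A2)])
```

Here `naive` is a matrix of 0.25 everywhere, while `exact` is 0.5 on the diagonal and 0 elsewhere, so the two aggregation orders disagree. The exact aggregate also has rank 2, higher than either client's factors, which is why a decomposition such as SVD is needed to return to low-rank matrices for the next round.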

2025

Recent advancements in speech large language models (SpeechLLMs) have attracted considerable attention. Nonetheless, current methods exhibit suboptimal performance in adhering to speech instructions. Notably, the intelligence of models diminishes significantly when processing speech-form input as compared to direct text-form input. Prior work has attempted to mitigate this semantic inconsistency between speech and text representations through techniques such as representation and behavior alignment, which involve the meticulous design of data pairs during the post-training phase. In this paper, we introduce a simple and scalable training method called InSerter, which stands for Interleaved Speech-Text Representation Pre-training. InSerter is designed to pre-train on large-scale unsupervised speech-text sequences, where the speech is synthesized from randomly selected segments of an extensive text corpus using text-to-speech conversion. Consequently, the model acquires the ability to generate textual continuations corresponding to the provided speech segments, obviating the need for intensive data design. To systematically evaluate speech instruction-following capabilities, we introduce SpeechInstructBench, the first comprehensive benchmark specifically designed for speech-oriented instruction-following tasks. Our proposed model InSerter achieves state-of-the-art performance on SpeechInstructBench and demonstrates superior or competitive results across diverse speech processing tasks.