DNCASR: End-to-End Training for Speaker-Attributed ASR

Xianrui Zheng, Chao Zhang, Phil Woodland


Abstract
This paper introduces DNCASR, a novel end-to-end trainable system designed for joint neural speaker clustering and automatic speech recognition (ASR), enabling speaker-attributed transcription of long multi-party meetings. DNCASR uses two separate encoders to independently encode global speaker characteristics and local waveform information, along with two linked decoders to generate speaker-attributed transcriptions. The use of linked decoders allows the entire system to be jointly trained under a unified loss function. By employing a serialised training approach, DNCASR effectively addresses overlapping speech in real-world meetings, where the link improves the prediction of speaker indices in overlapping segments. Experiments on the AMI-MDM meeting corpus demonstrate that the jointly trained DNCASR outperforms a parallel system that does not have links between the speaker and ASR decoders. Using cpWER to measure the speaker-attributed word error rate, DNCASR achieves a 9.0% relative reduction on the AMI-MDM Eval set.
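To make the architecture described in the abstract concrete, below is a minimal, hypothetical sketch of the layout it outlines: two separate encoders (one for global speaker characteristics, one for local acoustic information) feeding two linked decoders that jointly emit word tokens and speaker indices. Module names (e.g. DNCASRSketch), dimensions, and the exact linking mechanism (the speaker decoder attending over the ASR decoder states) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a DNCASR-style model, not the paper's code.
import torch
import torch.nn as nn

class DNCASRSketch(nn.Module):
    def __init__(self, n_mels=80, d_model=256, vocab_size=5000, max_speakers=4):
        super().__init__()
        self.frontend = nn.Linear(n_mels, d_model)
        # Encoder 1: global speaker characteristics over the whole segment/meeting window.
        self.speaker_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        # Encoder 2: local waveform/acoustic information for ASR.
        self.asr_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=4)
        # Decoder 1: predicts the (serialised) word sequence.
        self.asr_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        # Decoder 2: predicts speaker indices; "linked" here by also consuming
        # the ASR decoder states (an assumption about the linking mechanism).
        self.speaker_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.word_head = nn.Linear(d_model, vocab_size)
        self.speaker_head = nn.Linear(d_model, max_speakers)

    def forward(self, feats, prev_tokens):
        # feats: (B, T, n_mels) filterbank features; prev_tokens: (B, U) token ids.
        x = self.frontend(feats)
        spk_mem = self.speaker_encoder(x)           # global speaker memory
        asr_mem = self.asr_encoder(x)               # local acoustic memory
        y = self.token_embed(prev_tokens)
        asr_states = self.asr_decoder(y, asr_mem)   # word-level decoder states
        # Link: speaker-index prediction conditions on ASR decoder context,
        # which is where the abstract says overlapping segments benefit.
        spk_states = self.speaker_decoder(asr_states, spk_mem)
        return self.word_head(asr_states), self.speaker_head(spk_states)
```

Under this reading, the "unified loss" would be a single joint objective, e.g. the sum of a word-level cross-entropy from word_head and a speaker-index cross-entropy from speaker_head, so that both decoders are trained end-to-end together; this is a plausible instantiation rather than the paper's stated loss.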
Anthology ID:
2025.acl-long.899
Volume:
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
18369–18383
URL:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.899/
Cite (ACL):
Xianrui Zheng, Chao Zhang, and Phil Woodland. 2025. DNCASR: End-to-End Training for Speaker-Attributed ASR. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 18369–18383, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
DNCASR: End-to-End Training for Speaker-Attributed ASR (Zheng et al., ACL 2025)
PDF:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.899.pdf