Daniel Povey
Also published as: D. Povey
2026
ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching
Han Zhu | Wei Kang | Liyong Guo | Zengwei Yao | Fangjun Kuang | Weiji Zhuang | Zhaoqing Li | Zhifeng Han | Dong Zhang | Xin Zhang | Xingchen Song | Lingxuan Ye | Long Lin | Daniel Povey
Findings of the Association for Computational Linguistics: ACL 2026
Generating spoken dialogue is inherently more complex than monologue text-to-speech (TTS), as it demands both realistic turn-taking and the maintenance of distinct speaker timbres. While existing autoregressive (AR) models have made progress, they often suffer from high inference latency and stability issues. To overcome these limitations, we propose ZipVoice-Dialog, a non-autoregressive (NAR) zero-shot spoken dialogue generation model based on flow-matching. Observing that applying vanilla flow-matching to dialogue generation leads to poor speech intelligibility and turn-taking precision, we introduce two simple yet effective methods to adapt flow-matching architectures for dialogue generation: (1) a curriculum learning strategy to ensure robust speech-text alignment, and (2) speaker-turn embeddings to govern precise speaker turn-taking. Additionally, we introduce dedicated strategies to support stereo dialogue generation. Recognizing the lack of training datasets in this field, we curate and release OpenDialog, the first large-scale (6.8k hours) open-source spoken dialogue dataset derived from in-the-wild speech data. Moreover, for fair and rigorous evaluations, we established a benchmark to comprehensively evaluate dialogue generation models. Experiments demonstrate the effectiveness of the proposed methods and dataset, showing that ZipVoice-Dialog achieves superior performance in inference speed, intelligibility, speaker turn-taking accuracy, and speaker similarity. Our code, model checkpoints, and the OpenDialog dataset are publicly available.
2019
Robust Document Representations for Cross-Lingual Information Retrieval in Low-Resource Settings
Mahsa Yarmohammadi | Xutai Ma | Sorami Hisamoto | Muhammad Rahman | Yiming Wang | Hainan Xu | Daniel Povey | Philipp Koehn | Kevin Duh
Proceedings of Machine Translation Summit XVII: Research Track
2015
A Coarse-Grained Model for Optimal Coupling of ASR and SMT Systems for Speech Translation
Gaurav Kumar | Graeme Blackwood | Jan Trmal | Daniel Povey | Sanjeev Khudanpur
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing
2014
Translations of the Callhome Egyptian Arabic corpus for conversational speech translation
Gaurav Kumar | Yuan Cao | Ryan Cotterell | Chris Callison-Burch | Daniel Povey | Sanjeev Khudanpur
Proceedings of the 11th International Workshop on Spoken Language Translation: Papers
Translation of the output of automatic speech recognition (ASR) systems, also known as speech translation, has received a lot of research interest recently. This is especially true for programs such as DARPA BOLT which focus on improving spontaneous human-human conversation across languages. However, this research is hindered by the dearth of datasets developed for this explicit purpose. For Egyptian Arabic-English, in particular, no parallel speech-transcription-translation dataset exists in the same domain. In order to support research in speech translation, we introduce the Callhome Egyptian Arabic-English Speech Translation Corpus. This supplements the existing LDC corpus with four reference translations for each utterance in the transcripts. The result is a three-way parallel dataset of Egyptian Arabic speech, transcriptions, and English translations.
2006
Co-authors
- Sanjeev Khudanpur 2
- Gaurav Kumar 2
- Graeme Blackwood 1
- Chris Callison-Burch 1
- Yuan Cao 1
- Ryan Cotterell 1
- Kevin Duh 1
- Liyong Guo 1
- Zhifeng Han 1
- Sorami Hisamoto 1
- Wei Kang 1
- Brian Kingsbury 1
- Philipp Koehn 1
- Fangjun Kuang 1
- Zhaoqing Li 1
- Long Lin 1
- Xutai Ma 1
- Lidia Mangu 1
- Muhammad Rahman 1
- Bhuvana Ramabhadran 1
- G. Saon 1
- O. Siohan 1
- Xingchen Song 1
- Jan Trmal 1
- Yiming Wang 1
- Hainan Xu 1
- Zengwei Yao 1
- Mahsa Yarmohammadi 1
- Lingxuan Ye 1
- Dong Zhang 1
- Xin Zhang 1
- Han Zhu 1
- Weiji Zhuang 1
- Geoffrey Zweig 1