Ling Dong

Also published as:


2024

pdf bib
基于预训练模型与序列建模的音素分割方法(Sequence Modeling)
Shanglong Yang (杨尚龙) | Zhengtao Yu (余正涛) | Wenjun Wang (王文君) | Ling Dong (董凌) | Shengxiang Gao (高盛祥)
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference)

“音素分割作为语音处理领域内的一个重要任务,对于关键词识别、自动语音识别等应用具有至关重要的意义。传统方法往往独立预测每一帧音频是否为音素边界,忽视了音素边界与整个音频序列以及相邻帧之间的内在联系,从而影响了分割的准确性和连贯性。本文提出一种基于预训练模型与序列建模的音素分割方法,在HuBERT模型提取声学特征的基础上,结合BiLSTM捕捉长期依赖,再用CRF优化序列,提升了音素边界检测的性能。在TIMIT和Buckeye数据集上的实验表明,本文方法优于现有技术,证明了序列建模在音素分割任务中的有效性。”

pdf bib
DialectMoE: An End-to-End Multi-Dialect Speech Recognition Model with Mixture-of-Experts
Jie Zhou | Shengxiang Gao | Zhengtao Yu | Ling Dong | Wenjun Wang
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference)

“Dialect speech recognition has always been one of the challenges in Automatic Speech Recog-nition (ASR) systems. While lots of ASR systems perform well in Mandarin, their performancesignificantly drops when handling dialect speech. This is mainly due to the obvious differencesbetween dialects and Mandarin in pronunciation and the data scarcity of dialect speech. In thispaper, we propose DialectMoE, a Chinese multi-dialects speech recognition model based onMixture-of-Experts (MoE) in a low-resource conditions. Specifically, DialectMoE assigns inputsequences to a set of experts using a dynamic routing algorithm, with each expert potentiallytrained for a specific dialect. Subsequently, the outputs of these experts are combined to derivethe final output. Due to the similarities among dialects, distinct experts may offer assistance inrecognizing other dialects as well. Experimental results on the Datatang dialect public datasetshow that, compared with the baseline model, DialectMoE reduces Character Error Rate (CER)for Sichuan, Yunnan, Hubei and Henan dialects by 23.6%, 32.6%, 39.2% and 35.09% respec-tively. The proposed DialectMoE model demonstrates outstanding performance in multi-dialectsspeech recognition.”

2023

pdf bib
基于语音文本跨模态表征对齐的端到端语音翻译(End-to-end Speech Translation Based on Cross-modal Representation Alignment of Speech and Text)
Ling Zhou, Guojiang ang Dong | Zhengtao Yu | Shengxiang Gao | Wenjun Wang | Houli Ma | 国江 周 | 凌 董 | 正涛 余 | 盛祥 高 | 文君 王 | 候丽 马
Proceedings of the 22nd Chinese National Conference on Computational Linguistics

“端到端语音翻译需要解决源语言语音到目标语言文本的跨语言和跨模态映射,有限标注数据条件下,建立语音文本表征间的统一映射,缓解跨模态差异是提升语音翻译性能的关键。本文提出语音文本跨模态表征对齐方法,对语音文本表征进行多粒度对齐并进行混合作为并行输入,基于多模态表征的一致性约束进行多任务融合训练。在MuST-C数据集上的实验表明,本文所提方法优于现有端到端语音翻译跨模态表征相关方法,有效提升了语音翻译模型跨模态映射能力和翻译性能。”

pdf bib
基于离散化自监督表征增强的老挝语非自回归语音合成方法(A Discretized Self-Supervised Representation Enhancement based Non-Autoregressive Speech Synthesis Method for Lao Language)
Zijian Feng (冯子健) | Linqin Wang (王琳钦) | Shengxaing Gao (高盛祥) | Zhengtao Yu (余正涛) | Ling Dong (董凌)
Proceedings of the 22nd Chinese National Conference on Computational Linguistics

“老挝语的语音合成对中老两国合作与交流意义重大,但老挝语语音发音复杂,存在声调、音节及音素等发音特性,现有语音合成方法在老挝语上效果不尽人意。基于注意力机制建模的自回归模型难以拟合复杂的老挝语语音,模型泛化能力差,容易出现漏字、跳字等灾难性错误,合成音频缺乏自然性和流畅性。本文提出基于离散化自监督表征增强的老挝语非自回归语音合成方法,结合老挝语的语言语音特点,使用老挝语音素粒度的标注时长信息构建非自回归架构声学模型,通过自监督学习的预训练语音模型来提取语音内容和声调信息的离散化表征,融入到声学模型中增强模型的语音生成能力,增强合成音频的流畅性和自然性。实验证明,本文方法合成音频达到了4.03的MOS评分,基于离散化自监督表征增强的非自回归建模方法,能更好的在声调、音素时长、音高等细粒度层面刻画老挝语的语音特性。”

2022

pdf bib
多特征融合的越英端到端语音翻译方法(A Vietnamese-English end-to-end speech translation method based on multi-feature fusion)
Houli Ma (马候丽) | Ling Dong (董凌) | Wenjun Wang (王文君) | Jian Wang (王剑) | Shengxiang Gao (高盛祥) | Zhengtao Yu (余正涛)
Proceedings of the 21st Chinese National Conference on Computational Linguistics

“语音翻译的编码器需要同时编码语音中的声学和语义信息,单一的Fbank或Wav2vec2语音特征表征能力存在不足。本文通过分析人工的Fbank特征与自监督的Wav2vec2特征间的差异性,提出基于交叉注意力机制的声学特征融合方法,并探究了不同的自监督特征和融合方式,加强模型对语音中声学和语义信息的学习。结合越南语语音特点,以Fbank特征为主、Pitch特征为辅混合编码Fbank表征,构建多特征融合的越-英语音翻译模型。实验表明,使用多特征的语音翻译模型相比单特征翻译效果更优,与简单的特征拼接方法相比更有效,所提的多特征融合方法在越-英语音翻译任务上提升了1.97个BLEU值。”

pdf bib
融合外部语言知识的流式越南语语音识别(Streaming Vietnamese Speech Recognition Based on Fusing External Vietnamese Language Knowledge)
Junqiang Wang (王俊强) | Zhengtao Yu (余正涛) | Ling Dong (董凌) | Shengxiang Gao (高盛祥) | Wenjun Wang (王文君)
Proceedings of the 21st Chinese National Conference on Computational Linguistics

“越南语为低资源语言,训练语料难以获取;流式端到端模型在训练过程中难以学习到外部大量文本中的语言知识,这些问题在一定程度上都限制了流式越南语语音识别模型的性能。因此,本文以越南语音节作为语言模型和流式越南语语音识别模型的建模单元,提出了一种将预训练越南语语言模型在训练阶段融合到流式语音识别模型的方法。在训练阶段,通过最小化预训练越南语语言模型和解码器的输出计算一个新的损失函数LAE D−LM ,帮助流式越南语语音识别模型学习一些越南语语言知识从而优化其模型参数;在解码阶段,使用孓孨孡孬孬孯孷 孆孵孳孩孯孮或者字孆孓孔技术再次融合预训练语言模型进一步提升模型识别率。实验结果表明,在孖孉孖孏孓数据集上,相比基线模型,在训练阶段融合语言模型可以将流式越南语语音识别模型的词错率提升嬲嬮嬴嬵嬥;在解码阶段使用孓孨孡孬孬孯孷 孆孵孳孩孯孮或字孆孓孔再次融合语言模型,还可以将模型词错率分别提升嬱嬮嬳嬵嬥和嬴嬮嬷嬵嬥。”