2024
A Tone-based Hierarchical Structure of Chinese Prosody
Ya Li
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference)
In Chinese speech engineering, many projects use a conventional, syllable-based prosodic hierarchy as an underlying framework to process natural or synthesized speech. However, Chinese as a tone language has its own way of expressing prosody, that is, through tonal interaction, especially tone sandhi. By utilizing the capacity of tone as a dual unit of pitch and timing, the present study proposes a tone-based, three-layer-four-level structure for Chinese prosody. The three layers are tone, tone prosody, and intonation, respectively composed of one level of pitch units, two levels of tone prosody units (basic and derived), and one level of intonation units. These four levels of units are used to replace syllable, prosodic word, phonological phrase, and intonational phrase in a conventional hierarchy. Tone prosody units are established based on sizes or types of tone sandhi domains, so when applied to the same clause uttered in Mandarin and Shanghai Wu Chinese, they are timed differently and branched toward different directions at different levels, hence capable of capturing rhythmic and melodic patterns of the two distinctive types of Chinese. Overall, given its theory-friendly design, the proposed structure may be used as a unifying framework in Chinese speech engineering.
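The three-layer-four-level structure lends itself to a simple nested representation. Below is a minimal Python sketch of one possible encoding of the four levels; all class and field names are hypothetical illustrations of mine, not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical illustration: names and fields are mine, not the paper's.

@dataclass
class PitchUnit:
    """Level 1: a tone as a dual unit of pitch and timing."""
    tone_category: str   # e.g. "T1".."T4" for Mandarin citation tones
    duration_ms: float

@dataclass
class BasicToneProsodyUnit:
    """Level 2: basic tone prosody unit, sized by a tone sandhi domain."""
    pitch_units: List[PitchUnit] = field(default_factory=list)

@dataclass
class DerivedToneProsodyUnit:
    """Level 3: derived tone prosody unit grouping basic units."""
    basic_units: List[BasicToneProsodyUnit] = field(default_factory=list)

@dataclass
class IntonationUnit:
    """Level 4: intonation unit, replacing the intonational phrase."""
    derived_units: List[DerivedToneProsodyUnit] = field(default_factory=list)
```

Because tone prosody units are defined by sandhi domains rather than by syllable count, the same clause would populate these containers differently for Mandarin and for Shanghai Wu.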
2023
Exploring Prompt-based Multi-task Learning for Multimodal Dialog State Tracking and Immersive Multimodal Conversation
Yirong Chen | Ya Li | Tao Wang | Xiaofen Xing | Xiangmin Xu | Quan Liu | Cong Liu | Guoping Hu
Proceedings of the Eleventh Dialog System Technology Challenge
With the rise of the metaverse, immersive multimodal conversation has attracted increasing attention from researchers. Multimodal context will become more important for human-computer interaction in the metaverse, especially in the shopping domain. Unlike traditional conversation tasks, immersive multimodal conversation poses challenges such as multimodal ambiguous candidate identification and multimodal coreference resolution, which make dialog state tracking and response generation more difficult, as described in the SIMMC 2.1 challenge, part of DSTC11. In particular, the difficulty increases dramatically as the number of objects in the scene grows. We propose a prompt-based multi-task learning encoder-decoder in which each subtask uses a different prompt, steering the model toward the current subtask. Our system won first place in ambiguous candidate identification and finished runner-up in multimodal coreference resolution (MM-Coref), multimodal dialog state tracking (MM-DST), and assistant response generation. Our code and model are publicly available at https://github.com/scutcyr/dstc11-simmc2.1-scut-bds-lab.
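A minimal sketch of the prompt-based multi-task idea: one shared encoder-decoder receives a different prompt per subtask, which conditions the model on that subtask. The prompt strings, subtask keys, and the T5 backbone below are my own illustrative assumptions, not the authors' exact configuration; their actual code is at the linked repository.

```python
# Sketch only: prompts and the T5 backbone are illustrative assumptions.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# One shared model, one prompt per subtask (hypothetical prompt texts).
SUBTASK_PROMPTS = {
    "ambiguous_candidates": "identify ambiguous candidates: ",
    "mm_coref": "resolve multimodal coreference: ",
    "mm_dst": "track dialog state: ",
    "response": "generate assistant response: ",
}

def run_subtask(subtask: str, dialog_context: str) -> str:
    """Prepend the subtask's prompt so the shared model focuses on it."""
    inputs = tokenizer(
        SUBTASK_PROMPTS[subtask] + dialog_context, return_tensors="pt"
    )
    output_ids = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(run_subtask("mm_dst", "User: I like the blue jacket on the left."))
```

During multi-task training, batches from all subtasks would be mixed, so the shared parameters learn every task while the prompt alone selects the behavior.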
Multi-Stage Coarse-to-Fine Contrastive Learning for Conversation Intent Induction
Caiyuan Chu | Ya Li | Yifan Liu | Jia-Chen Gu | Quan Liu | Yongxin Ge | Guoping Hu
Proceedings of the Eleventh Dialog System Technology Challenge
Intent recognition is critical for task-oriented dialogue systems. However, for emerging domains and new services, it is difficult to accurately identify the key intent of a conversation due to time-consuming data annotation and comparatively poor model transferability. Automatic induction of dialogue intents is therefore very important for intelligent dialogue systems. This paper presents our solution to Track 2, Intent Induction from Conversations for Task-Oriented Dialogue, at the Eleventh Dialog System Technology Challenge (DSTC11). The essence of intent clustering lies in distinguishing the representations of different dialogue utterances: for any given set of new data, the sentence representations produced by the model should be well separated across different labels. We therefore propose a multi-stage coarse-to-fine contrastive learning training scheme, consisting of unsupervised contrastive learning pre-training, supervised contrastive learning pre-training, and fine-tuning with joint contrastive learning and clustering, to obtain a better dialogue utterance representation model for the clustering task. In the released DSTC11 Track 2 evaluation results, our system ranked first on both subtasks of this track.
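The ingredient shared by all three stages is a contrastive (InfoNCE-style) objective over utterance embeddings, where positives come from dropout-based augmentation in the unsupervised stage or from shared intent labels in the supervised stage. Below is a minimal PyTorch sketch of such a loss; it is my own generic formulation, not the authors' exact training code.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1: torch.Tensor, z2: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """Generic InfoNCE sketch: z1[i] and z2[i] embed two views of the
    same utterance (two dropout passes in the unsupervised stage, or
    two utterances sharing an intent label in the supervised stage)."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    # Cosine-similarity matrix between all pairs in the batch.
    sim = z1 @ z2.t() / temperature            # (batch, batch)
    labels = torch.arange(z1.size(0), device=z1.device)
    # Row i's positive is column i; every other column is a negative.
    return F.cross_entropy(sim, labels)

# Toy usage: 8 utterances embedded into 128 dims by some encoder.
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
print(info_nce_loss(z1, z2))
```

In the final stage, a loss like this would be combined with a clustering objective so the embedding space sharpens around the induced intent clusters.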
2021
Cross Attention Augmented Transducer Networks for Simultaneous Translation
Dan Liu | Mengge Du | Xiaoxi Li | Ya Li | Enhong Chen
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
This paper proposes a novel architecture, Cross Attention Augmented Transducer (CAAT), for simultaneous translation. The framework aims to jointly optimize the policy and translation models. To effectively consider all possible READ-WRITE simultaneous translation action paths, we adapt the online automatic speech recognition (ASR) model RNN-T but remove its strong monotonic constraint, a step critical for translation, which must accommodate reordering. To make CAAT work, we introduce a novel latency loss whose expectation can be optimized by a forward-backward algorithm. We implement CAAT with the Transformer, though the general CAAT architecture can also be built on other attention-based encoder-decoder frameworks. Experiments on both speech-to-text (S2T) and text-to-text (T2T) simultaneous translation tasks show that CAAT achieves significantly better latency-quality trade-offs than state-of-the-art simultaneous translation approaches.
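The key architectural move, replacing RNN-T's pointwise joiner (which pairs one encoder frame with one prediction state) with attention over the source prefix read so far, can be sketched as follows. This is a heavily simplified illustration with hypothetical module and dimension names, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionJoiner(nn.Module):
    """Sketch: at each READ/WRITE step, the decoder state attends over
    all encoder states of the source prefix read so far, removing the
    monotonic one-frame pairing of standard RNN-T and allowing the
    model to account for reordering."""
    def __init__(self, d_model: int = 256, n_heads: int = 4,
                 vocab: int = 1000):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out = nn.Linear(d_model, vocab + 1)  # +1 for the blank (READ) action

    def forward(self, dec_state, enc_prefix):
        # dec_state:  (batch, 1, d_model)       current prediction state
        # enc_prefix: (batch, t_read, d_model)  source read so far
        ctx, _ = self.attn(dec_state, enc_prefix, enc_prefix)
        return self.out(ctx)  # logits over vocab plus the blank symbol

joiner = CrossAttentionJoiner()
logits = joiner(torch.randn(2, 1, 256), torch.randn(2, 7, 256))
print(logits.shape)  # torch.Size([2, 1, 1001])
```

Training would then marginalize over all READ-WRITE paths with a forward-backward pass, adding the paper's latency loss in expectation; that machinery is omitted here for brevity.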