Nayu Liu
2022
ChipSong: A Controllable Lyric Generation System for Chinese Popular Song
Nayu Liu | Wenjing Han | Guangcan Liu | Da Peng | Ran Zhang | Xiaorui Wang | Huabin Ruan
Proceedings of the First Workshop on Intelligent and Interactive Writing Assistants (In2Writing 2022)
In this work, we take a further step towards satisfying the practical demands of musical short-video creators in Chinese lyric generation, with respect to the challenges of songs' format constraints, creating specific lyrics from open-ended inspiration inputs, and the grace of language rhyme. One representative detail in these demands is controlling the lyric format at the word level: for Chinese songs, creators may even expect fixed-length words at certain positions in a lyric to match a particular melody, an ability previous methods lack. Although the lyric generation community has recently made gratifying progress, most methods are not comprehensive enough to meet all these demands simultaneously. We therefore propose ChipSong, an assisted lyric generation system built on a Transformer-based autoregressive language model, which generates controlled lyric paragraphs suited for musical short-video display by designing 1) a novel Begin-Internal-End (BIE) word-granularity embedding sequence with a guided attention mechanism for word-level length-format control, plus an explicit symbol set for sentence-level length-format control; 2) an open-ended trigger-word mechanism to guide the generation of specific lyric content; and 3) a paradigm of reverse-order training and shielding decoding for rhyme control. Extensive experiments show that ChipSong generates fluent lyrics while maintaining high consistency with the pre-determined control conditions.
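To make the rhyme-control idea concrete, here is a minimal sketch (not the authors' implementation; the reverse-order language model `next_logits`, the `rhyme_class_of` lookup, and all parameter names are illustrative assumptions). Generating each line right-to-left means the rhyming, line-final token is decoded first, so the vocabulary can be "shielded" down to tokens in the target rhyme class at exactly that step:

```python
import math
import random

def shielded_sample(logits, allowed_ids):
    """Sample a token id after shielding (masking out) everything outside allowed_ids."""
    kept = {i: logits[i] for i in allowed_ids if i in logits}
    z = sum(math.exp(v) for v in kept.values())
    r, acc = random.random(), 0.0
    for i, v in kept.items():
        acc += math.exp(v) / z
        if r <= acc:
            return i
    return next(iter(kept))  # numerical fallback

def generate_line_reversed(next_logits, rhyme_class_of, target_rhyme, length, vocab):
    """Generate one lyric line right-to-left so the rhyming (line-final) token
    is decoded first under a rhyme constraint; later positions decode freely.
    Returns the line in normal reading order."""
    tokens = []
    for step in range(length):
        logits = next_logits(tokens)  # reverse-order LM: suffix-so-far -> {id: logit}
        if step == 0:  # line-final position: keep only tokens in the target rhyme class
            allowed = {i for i in vocab if rhyme_class_of(i) == target_rhyme}
        else:
            allowed = set(vocab)
        tokens.append(shielded_sample(logits, allowed))
    return list(reversed(tokens))

# toy demo: 4-token vocab, uniform LM, tokens 2 and 3 share rhyme class "a"
vocab = [0, 1, 2, 3]
rhymes = {0: "b", 1: "b", 2: "a", 3: "a"}
line = generate_line_reversed(lambda toks: {i: 0.0 for i in vocab},
                              rhymes.get, "a", length=4, vocab=vocab)
assert rhymes[line[-1]] == "a"  # the rhyming token obeys the constraint
```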
2020
Multistage Fusion with Forget Gate for Multimodal Summarization in Open-Domain Videos
Nayu Liu | Xian Sun | Hongfeng Yu | Wenkai Zhang | Guangluan Xu
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Multimodal summarization for open-domain videos is an emerging task that aims to generate a summary from multi-source information (video, audio, transcript). Despite the success of recent multi-encoder-decoder frameworks on this task, existing methods lack fine-grained multimodal interactions over the multi-source inputs. Moreover, unlike other multimodal tasks, this task involves longer multimodal sequences with more redundancy and noise. To address these two issues, we propose a multistage fusion network with a fusion forget gate module, which models fine-grained interactions between the modalities through a multistep fusion schema and controls the flow of redundant information between long multimodal sequences via a forgetting module. Experimental results on the How2 dataset show that our proposed model achieves new state-of-the-art performance. Comprehensive analysis empirically verifies the effectiveness of our fusion schema and forgetting module on multiple encoder-decoder architectures. Notably, when using high-noise ASR transcripts (WER > 30%), our model still achieves performance close to that of the ground-truth transcript model, which reduces manual annotation cost.
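As a rough illustration of the forget-gate fusion idea (a minimal numpy sketch under assumed shapes and randomly initialized parameters, not the paper's architecture), a sigmoid gate computed from the paired modality states decides, per dimension, how much of the candidate fusion signal to keep versus how much of the original text state to pass through, suppressing redundant or noisy cross-modal content:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden size (arbitrary for this sketch)

# hypothetical learned parameters
W_f = rng.normal(scale=0.1, size=(2 * d, d))  # fusion-candidate projection
W_g = rng.normal(scale=0.1, size=(2 * d, d))  # forget-gate projection

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_with_forget_gate(h_text, h_video):
    """One fusion step over aligned text/video states of shape (seq_len, d).

    The gate lies in (0, 1) per dimension: values near 0 'forget' the
    cross-modal candidate and keep the original text state instead."""
    pair = np.concatenate([h_text, h_video], axis=-1)  # (seq_len, 2d)
    fused = np.tanh(pair @ W_f)                        # candidate fusion signal
    gate = sigmoid(pair @ W_g)                         # per-dim forget gate
    return gate * fused + (1.0 - gate) * h_text        # gated residual update

h_text = rng.normal(size=(5, d))
h_video = rng.normal(size=(5, d))
print(fuse_with_forget_gate(h_text, h_video).shape)  # (5, 8)
```

In a multistage schema, a step like this would be applied repeatedly across encoder layers so the modalities interact at several depths rather than being merged once.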