TED: A Pretrained Unsupervised Summarization Model with Theme Modeling and Denoising
Ziyi Yang, Chenguang Zhu, Robert Gmyr, Michael Zeng, Xuedong Huang, Eric Darve
Abstract
Text summarization aims to extract the essential information from a piece of text and render it in a concise form. Existing unsupervised abstractive summarization models are built on recurrent neural networks, even though the more recently proposed transformer architecture is considerably more capable. Moreover, most previous summarization models ignore the abundant unlabeled corpora available for pretraining. To address these issues, we propose TED, a transformer-based unsupervised abstractive summarization system pretrained on large-scale data. We first leverage the lead bias in news articles to pretrain the model on millions of unlabeled articles. Next, we finetune TED on target domains with theme modeling and a denoising autoencoder to improve the quality of the generated summaries. Notably, TED outperforms all unsupervised abstractive baselines on the NYT, CNN/DM, and English Gigaword datasets, which span a range of document styles. Further analysis shows that the summaries generated by TED are highly abstractive and that each component of TED's objective function is effective.
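The two training signals the abstract names can be sketched in a few lines. For lead-bias pretraining, treat the opening sentences of each unlabeled news article as a pseudo-summary and train the model to generate them from the rest of the article; for denoising, pair a corrupted input with its clean reconstruction. The sketch below is a minimal illustration under that reading: the helper names, the naive sentence splitter, the choice of three lead sentences, and the token-dropping noise model are assumptions for illustration, not the authors' exact implementation.

```python
import random
import re

def split_sentences(text: str) -> list[str]:
    # Naive regex-based splitter for illustration; a real pipeline would
    # use a proper sentence tokenizer.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def lead_bias_pair(article: str, num_lead: int = 3) -> tuple[str, str]:
    # Lead-bias pretraining pair: the first `num_lead` sentences become the
    # target pseudo-summary; the remaining sentences become the source.
    # The lead-3 split is an assumption, not the paper's confirmed setting.
    sents = split_sentences(article)
    target = " ".join(sents[:num_lead])
    source = " ".join(sents[num_lead:])
    return source, target

def corrupt(text: str, drop_prob: float = 0.1, seed: int = 0) -> str:
    # Generic denoising-autoencoder corruption (random token dropping);
    # the paper's exact noise model may differ.
    rng = random.Random(seed)
    tokens = text.split()
    kept = [t for t in tokens if rng.random() > drop_prob]
    return " ".join(kept or tokens)
```

A denoising training step would then ask the model to reconstruct the clean text from `corrupt(text)`, while pretraining uses the `(source, target)` pairs from `lead_bias_pair`.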
- Anthology ID:
- 2020.findings-emnlp.168
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2020
- Month:
- November
- Year:
- 2020
- Address:
- Online
- Editors:
- Trevor Cohn, Yulan He, Yang Liu
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 1865–1874
- URL:
- https://aclanthology.org/2020.findings-emnlp.168
- DOI:
- 10.18653/v1/2020.findings-emnlp.168
- Cite (ACL):
- Ziyi Yang, Chenguang Zhu, Robert Gmyr, Michael Zeng, Xuedong Huang, and Eric Darve. 2020. TED: A Pretrained Unsupervised Summarization Model with Theme Modeling and Denoising. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1865–1874, Online. Association for Computational Linguistics.
- Cite (Informal):
- TED: A Pretrained Unsupervised Summarization Model with Theme Modeling and Denoising (Yang et al., Findings 2020)
- PDF:
- https://aclanthology.org/2020.findings-emnlp.168.pdf