SEHY: A Simple yet Effective Hybrid Model for Summarization of Long Scientific Documents

Zhihua Jiang, Junzhan Yang, Dongning Rao


Abstract
Long-document summarization (e.g., of scientific papers) has received increasing attention in recent years. Extractive approaches attempt to choose salient sentences by understanding the whole document, but long documents cover numerous subjects with varying details, which hinders content understanding. Abstractive approaches, in contrast, aim to generate relevant tokens but must truncate the source document because of their limited input sizes. To this end, we propose a Simple yet Effective HYbrid approach, which we call SEHY, that exploits the discourse information of a document to select salient sections, instead of sentences, for summary generation. On the one hand, SEHY avoids full-text understanding; on the other hand, it retains salient information given the length limit. In particular, we design two simple strategies for training the extractor: incremental section extraction and salience-analysis-based extraction. We then use strong abstractive models to generate the final summary. We evaluate our approach on a large-scale scientific paper dataset: arXiv. Further, we discuss how the disciplinary class (e.g., computer science, math, or physics) of a scientific paper, as reflected in its writing style, affects the performance of SEHY, which has not been explored in existing work. Experimental results show the effectiveness of our approach and yield interesting findings on arXiv and the subsets generated in this paper.
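To make the extract-then-abstract pipeline concrete, below is a minimal Python sketch of section-level selection followed by a hand-off to an abstractive model. The unigram-overlap salience scorer, the greedy budget-constrained selection, and the names `salience`, `select_sections`, and `budget_tokens` are illustrative assumptions, not the paper's exact method.

```python
# Minimal sketch of a SEHY-style extract-then-abstract pipeline.
# The salience scorer is a simple unigram-overlap proxy (an assumption
# for illustration; the paper's salience analysis may differ).

from collections import Counter

def salience(section_text: str, reference: str) -> float:
    """Unigram-overlap recall of `reference` against a section."""
    sec = Counter(section_text.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(sec[w], ref[w]) for w in ref)
    return overlap / max(1, sum(ref.values()))

def select_sections(sections, reference, budget_tokens=1024):
    """Greedy, incremental selection: add the most salient remaining
    section until the abstractive model's input budget is exhausted."""
    ranked = sorted(sections, key=lambda s: salience(s, reference), reverse=True)
    chosen, used = [], 0
    for sec in ranked:
        n = len(sec.split())
        if used + n > budget_tokens:
            continue
        chosen.append(sec)
        used += n
    # Restore original document order so the abstractor sees coherent discourse.
    return [s for s in sections if s in chosen]

# At training time, `reference` would be the gold summary; at inference,
# a learned extractor would replace this oracle scoring. The concatenated
# sections are then passed to any strong abstractive model, e.g.:
#   from transformers import pipeline
#   abstractor = pipeline("summarization")
#   summary = abstractor(" ".join(select_sections(sections, reference)))
```

Selecting whole sections rather than sentences keeps each chosen unit internally coherent, which is the intuition the abstract describes: the abstractor receives salient, discourse-complete input without requiring full-text understanding.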
Anthology ID:
2022.findings-aacl.9
Volume:
Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022
Month:
November
Year:
2022
Address:
Online only
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
96–106
URL:
https://aclanthology.org/2022.findings-aacl.9
Cite (ACL):
Zhihua Jiang, Junzhan Yang, and Dongning Rao. 2022. SEHY: A Simple yet Effective Hybrid Model for Summarization of Long Scientific Documents. In Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022, pages 96–106, Online only. Association for Computational Linguistics.
Cite (Informal):
SEHY: A Simple yet Effective Hybrid Model for Summarization of Long Scientific Documents (Jiang et al., Findings 2022)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2022.findings-aacl.9.pdf
Software:
 2022.findings-aacl.9.Software.rar