@inproceedings{ragazzi-etal-2024-token,
  title     = {What Are You Token About? Differentiable Perturbed {Top-$k$} Token Selection for Scientific Document Summarization},
  author    = {Ragazzi, Luca and
               Italiani, Paolo and
               Moro, Gianluca and
               Panni, Mattia},
  editor    = {Ku, Lun-Wei and
               Martins, Andre and
               Srikumar, Vivek},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2024},
  month     = aug,
  year      = {2024},
  address   = {Bangkok, Thailand},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2024.findings-acl.561/},
  doi       = {10.18653/v1/2024.findings-acl.561},
  pages     = {9427--9440},
  abstract  = {Scientific document summarization aims to condense complex and long articles in both technical and plain-language terms to facilitate the accessibility and dissemination of scientific findings. Existing datasets suffer from a deficiency in source heterogeneity, as their data predominantly stem from a single common resource, hindering effective model training and generalizability. First, we introduce SciLay, a novel dataset that includes documents from multiple natural science journals with expert-authored technical and lay summaries. Second, we propose PrunePert, a new transformer-based model that incorporates a differentiable perturbed top-$k$ encoder layer to prune irrelevant tokens in end-to-end learning. Experimental results show that our model achieves a nearly 2x speed-up compared to a state-of-the-art linear transformer, remaining comparable in effectiveness. Additional examinations underscore the importance of employing a training dataset that includes different sources to enhance the generalizability of the models. Code is available at https://github.com/disi-unibo-nlp/sci-lay.},
}
@comment{Markdown (Informal):
[What Are You Token About? Differentiable Perturbed Top-k Token Selection for Scientific Document Summarization](https://aclanthology.org/2024.findings-acl.561/) (Ragazzi et al., Findings 2024)
ACL}