Revisiting Transformer-based Models for Long Document Classification

Xiang Dai; Ilias Chalkidis; Sune Darkner; Desmond Elliott

doi:10.18653/v1/2022.findings-emnlp.534

Abstract

The recent literature in text classification is biased towards short text sequences (e.g., sentences or paragraphs). In real-world applications, multi-page multi-paragraph documents are common and they cannot be efficiently encoded by vanilla Transformer-based models. We compare different Transformer-based Long Document Classification (TrLDC) approaches that aim to mitigate the computational overhead of vanilla Transformers to encode much longer text, namely sparse attention and hierarchical encoding methods. We examine several aspects of sparse attention (e.g., size of local attention window, use of global attention) and hierarchical (e.g., document splitting strategy) Transformers on four document classification datasets covering different domains. We observe a clear benefit from being able to process longer text, and, based on our results, we derive practical advice on applying Transformer-based models to long document classification tasks.
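To make the two families named in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. The function names `split_into_segments` and `local_global_mask`, and the parameters `segment_len`, `overlap`, `window`, and `global_positions`, are hypothetical and chosen only for illustration. The first function shows one possible document splitting strategy for a hierarchical encoder (fixed-length segments with optional overlap); the second builds a boolean mask combining a local attention window with a few global tokens, in the spirit of sparse attention.

```python
def split_into_segments(tokens, segment_len, overlap=0):
    """Split a token list into fixed-length segments (hypothetical helper).

    Consecutive segments share `overlap` tokens; the last segment may be shorter.
    Each segment would be encoded separately by a hierarchical model.
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    if not 0 <= overlap < segment_len:
        raise ValueError("overlap must satisfy 0 <= overlap < segment_len")
    stride = segment_len - overlap
    segments = []
    for start in range(0, len(tokens), stride):
        segments.append(tokens[start:start + segment_len])
        # Stop once this segment already reaches the end of the document.
        if start + segment_len >= len(tokens):
            break
    return segments


def local_global_mask(seq_len, window, global_positions=()):
    """Build a boolean attention mask (hypothetical helper).

    mask[i][j] is True when token i may attend to token j. Each token sees
    neighbours within window // 2 positions on either side; tokens listed in
    `global_positions` attend to, and are attended by, every token.
    """
    half = window // 2
    mask = [[abs(i - j) <= half for j in range(seq_len)] for i in range(seq_len)]
    for g in set(global_positions):
        for k in range(seq_len):
            mask[g][k] = True  # the global token attends everywhere
            mask[k][g] = True  # every token attends to the global token
    return mask


if __name__ == "__main__":
    doc = list(range(10))  # stand-in for token ids
    print(split_into_segments(doc, segment_len=4))
    print(split_into_segments(doc, segment_len=4, overlap=2))
    mask = local_global_mask(seq_len=6, window=2, global_positions=(0,))
    print(sum(mask[3]))  # number of positions token 3 may attend to
```

With a local window, the number of allowed attention pairs grows linearly with sequence length rather than quadratically, which is the source of the computational saving the abstract refers to.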

Anthology ID: 2022.findings-emnlp.534
Volume: Findings of the Association for Computational Linguistics: EMNLP 2022
Month: December
Year: 2022
Address: Abu Dhabi, United Arab Emirates
Editors: Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 7212–7230
URL: https://preview.aclanthology.org/fix-sig-urls/2022.findings-emnlp.534/
DOI: 10.18653/v1/2022.findings-emnlp.534
Cite (ACL): Xiang Dai, Ilias Chalkidis, Sune Darkner, and Desmond Elliott. 2022. Revisiting Transformer-based Models for Long Document Classification. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 7212–7230, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal): Revisiting Transformer-based Models for Long Document Classification (Dai et al., Findings 2022)
PDF: https://preview.aclanthology.org/fix-sig-urls/2022.findings-emnlp.534.pdf
Video: https://preview.aclanthology.org/fix-sig-urls/2022.findings-emnlp.534.mp4