Abstract
In the legal domain, we often perform classification tasks on very long documents, for example court judgements. These documents often contain thousands of words, so their length poses a challenge for this modelling task. In this research paper, we present a comprehensive evaluation of strategies for long text classification using Transformers, in conjunction with strategies for selecting document chunks using traditional NLP models. We conduct our experiments on 6 benchmark datasets comprising lengthy documents, 4 of which are publicly available. Each dataset has a median word count exceeding 1,000. Our evaluation encompasses state-of-the-art Transformer models, including RoBERTa, Longformer, HAT, MEGA, and LegalBERT, and compares them with a traditional TF-IDF + Neural Network (NN) baseline. We investigate the effectiveness of pre-training on large corpora, fine-tuning strategies, and transfer learning techniques in the context of long text classification.
- Anthology ID:
- 2023.nllp-1.3
- Volume:
- Proceedings of the Natural Legal Language Processing Workshop 2023
- Month:
- December
- Year:
- 2023
- Address:
- Singapore
- Editors:
- Daniel Preoțiuc-Pietro, Catalina Goanta, Ilias Chalkidis, Leslie Barrett, Gerasimos (Jerry) Spanakis, Nikolaos Aletras
- Venues:
- NLLP | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 17–24
- URL:
- https://aclanthology.org/2023.nllp-1.3
- DOI:
- 10.18653/v1/2023.nllp-1.3
- Cite (ACL):
- Mohit Tuteja and Daniel González Juclà. 2023. Long Text Classification using Transformers with Paragraph Selection Strategies. In Proceedings of the Natural Legal Language Processing Workshop 2023, pages 17–24, Singapore. Association for Computational Linguistics.
- Cite (Informal):
- Long Text Classification using Transformers with Paragraph Selection Strategies (Tuteja & González Juclà, NLLP-WS 2023)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-2/2023.nllp-1.3.pdf
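The chunk-selection idea described in the abstract — using a traditional model such as TF-IDF to pick the most informative paragraphs before feeding a length-limited Transformer — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the scoring rule (sum of TF-IDF weights per paragraph), and the value of `k` are all assumptions made for the example.

```python
import math
from collections import Counter

def select_paragraphs(paragraphs, k=2):
    """Illustrative TF-IDF paragraph selection: treat each paragraph as a
    document, score it by the sum of TF-IDF weights of its terms, and keep
    the k highest-scoring paragraphs in their original document order."""
    tokenized = [p.lower().split() for p in paragraphs]
    n = len(tokenized)
    # Document frequency of each term across paragraphs.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    # Smoothed IDF so terms appearing in every paragraph still score > 0.
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    scored = []
    for i, toks in enumerate(tokenized):
        tf = Counter(toks)
        total = len(toks) or 1
        score = sum((count / total) * idf[t] for t, count in tf.items())
        scored.append((score, i))
    # Take the top-k by score, then restore document order for the model input.
    top = sorted(sorted(scored, reverse=True)[:k], key=lambda pair: pair[1])
    return [paragraphs[i] for _, i in top]
```

In a pipeline like the one the abstract describes, the selected paragraphs would then be concatenated and truncated to the Transformer's context window (e.g. 512 tokens for RoBERTa or LegalBERT) before classification.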