Abstract
Transformers are at the forefront of Natural Language Processing and have driven many of its recent advances; in this paper, we revisit the fundamentals of this deep learning architecture. For long documents, conventional full self-attention exceeds available compute and memory, as it scales quadratically with sequence length. A local self-attention using a sliding window, on the other hand, loses the global context of the input document, which can hurt the performance of the task at hand. For long documents (ranging from 500 to 16K tokens), the proposed Dispersed Hierarchical Attention component captures local context using a sliding window and global context using a linearly-scaled dispersion approach. This achieves O(N) linear complexity, where N is the length of the input sequence or document.
- Anthology ID:
- 2023.icon-1.10
- Volume:
- Proceedings of the 20th International Conference on Natural Language Processing (ICON)
- Month:
- December
- Year:
- 2023
- Address:
- Goa University, Goa, India
- Editors:
- Jyoti D. Pawar, Sobha Lalitha Devi
- Venue:
- ICON
- SIG:
- SIGLEX
- Publisher:
- NLP Association of India (NLPAI)
- Note:
- Pages:
- 90–98
- Language:
- URL:
- https://aclanthology.org/2023.icon-1.10
- DOI:
- Cite (ACL):
- Ajay Mukund S. and Easwarakumar K.S. 2023. Dispersed Hierarchical Attention Network for Machine Translation and Language Understanding on Long Documents with Linear Complexity. In Proceedings of the 20th International Conference on Natural Language Processing (ICON), pages 90–98, Goa University, Goa, India. NLP Association of India (NLPAI).
- Cite (Informal):
- Dispersed Hierarchical Attention Network for Machine Translation and Language Understanding on Long Documents with Linear Complexity (S. & K.S., ICON 2023)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-4/2023.icon-1.10.pdf
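The attention pattern the abstract describes (a local sliding window combined with dispersed globally-attending positions) can be sketched as a sparsity mask over standard scaled dot-product attention. This is a minimal illustrative sketch only, not the paper's implementation: the function names and the `window` and `stride` parameters are assumptions, and the dense masked attention below is a reference computation (an actual O(N) implementation would compute only the allowed entries).

```python
import numpy as np

def dispersed_attention_mask(n, window=4, stride=8):
    """Boolean mask: entry (i, j) is True where query i may attend to key j.
    Combines a local sliding window (|i - j| <= window) with dispersed
    global tokens: every `stride`-th position attends to, and is attended
    by, all positions. Parameter names are illustrative assumptions."""
    idx = np.arange(n)
    local = np.abs(idx[:, None] - idx[None, :]) <= window
    is_global = (idx % stride == 0)
    glob = is_global[:, None] | is_global[None, :]
    return local | glob

def sparse_attention(q, k, v, mask):
    """Masked scaled dot-product attention (dense reference version)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(mask, scores, -1e9)  # block disallowed pairs
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)  # row-wise softmax
    return w @ v

rng = np.random.default_rng(0)
n, d = 32, 16
q = rng.standard_normal((n, d))
k = rng.standard_normal((n, d))
v = rng.standard_normal((n, d))
mask = dispersed_attention_mask(n)
out = sparse_attention(q, k, v, mask)
print(out.shape)  # (32, 16)
```

With a fixed window size and a dispersion stride chosen so that each token attends to O(1) global positions, the number of attended key positions per query stays bounded, which is what yields the linear scaling in N claimed by the abstract.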