Dispersed Hierarchical Attention Network for Machine Translation and Language Understanding on Long Documents with Linear Complexity

Ajay Mukund S., Easwarakumar K.S.


Abstract
Transformers are at the forefront of Natural Language Processing and a pioneer of its recent developments; in this paper, we tweak the very fundamentals of this giant Deep Learning model. For long documents, conventional Full Self-Attention exceeds the available compute and memory, as it scales quadratically with sequence length. If we instead use Local Self-Attention with a sliding window, we lose the global context present in the input document, which can hurt the performance of the task at hand. For long documents (ranging from 500 to 16K tokens), the proposed Dispersed Hierarchical Attention component captures the local context using a sliding window and the global context using a linearly-scaled dispersion approach. This achieves O(N) linear complexity, where N is the length of the input sequence or document.
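The abstract only describes the method at a high level: local context via a sliding window and global context via a linearly-scaled dispersion. The PyTorch sketch below is an illustration of that general idea, not the authors' implementation; the function name and the window and n_dispersed parameters are hypothetical. It shows why attending to a fixed-size local window plus a fixed number of dispersed (evenly strided) global positions keeps the work per query constant, so the full attention pass scales as O(N) rather than O(N^2).

import torch
import torch.nn.functional as F

def sparse_local_dispersed_attention(q, k, v, window=4, n_dispersed=4):
    # Toy sparse attention: each query attends to a local window of
    # neighbouring positions plus a small, fixed set of dispersed
    # (evenly strided) global positions. With `window` and `n_dispersed`
    # held constant, the work per query is O(1), so the whole pass is
    # O(N) in the sequence length N.
    # q, k, v: tensors of shape (N, d)
    N, d = q.shape
    out = torch.empty_like(v)
    # Dispersed global positions: a fixed number of evenly spaced tokens.
    stride = max(1, N // n_dispersed)
    dispersed = torch.arange(0, N, stride)
    for i in range(N):
        lo, hi = max(0, i - window), min(N, i + window + 1)
        local = torch.arange(lo, hi)
        idx = torch.unique(torch.cat([local, dispersed]))
        scores = (q[i] @ k[idx].T) / d ** 0.5   # (|idx|,)
        weights = F.softmax(scores, dim=-1)
        out[i] = weights @ v[idx]               # (d,)
    return out

# Example: 1,024-token sequence with 64-dimensional heads.
q = torch.randn(1024, 64)
out = sparse_local_dispersed_attention(q, q, q)
print(out.shape)  # torch.Size([1024, 64])

In a practical implementation the per-position loop would be replaced by batched, masked matrix multiplications, but the cost argument is the same: each query touches a bounded number of keys, so total work grows linearly with document length.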
Anthology ID:
2023.icon-1.10
Volume:
Proceedings of the 20th International Conference on Natural Language Processing (ICON)
Month:
December
Year:
2023
Address:
Goa University, Goa, India
Editors:
Jyoti D. Pawar, Sobha Lalitha Devi
Venue:
ICON
SIG:
SIGLEX
Publisher:
NLP Association of India (NLPAI)
Pages:
90–98
URL:
https://aclanthology.org/2023.icon-1.10
Cite (ACL):
Ajay Mukund S. and Easwarakumar K.S. 2023. Dispersed Hierarchical Attention Network for Machine Translation and Language Understanding on Long Documents with Linear Complexity. In Proceedings of the 20th International Conference on Natural Language Processing (ICON), pages 90–98, Goa University, Goa, India. NLP Association of India (NLPAI).
Cite (Informal):
Dispersed Hierarchical Attention Network for Machine Translation and Language Understanding on Long Documents with Linear Complexity (S. & K.S., ICON 2023)
PDF:
https://preview.aclanthology.org/naacl-24-ws-corrections/2023.icon-1.10.pdf