Ajay Mukund S.


2023

Dispersed Hierarchical Attention Network for Machine Translation and Language Understanding on Long Documents with Linear Complexity
Ajay Mukund S. | Easwarakumar K.s.
Proceedings of the 20th International Conference on Natural Language Processing (ICON)

Transformers are at the forefront of Natural Language Processing and have pioneered its recent developments; in this paper, we tweak the very fundamentals of this giant Deep Learning model. For long documents, conventional Full Self-Attention exceeds the available compute power and memory because it scales quadratically with the sequence length. If we instead use Local Self-Attention with a sliding window, we lose the global context present in the input document, which can hurt performance on the task at hand. For long documents (ranging from 500 to 16K tokens), the proposed Dispersed Hierarchical Attention component captures the local context using a sliding window and the global context using a linearly scaled dispersion approach. This achieves O(N) linear complexity, where N is the length of the input sequence or document.
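To make the scaling argument concrete, the sketch below counts how many (query, key) pairs are scored under full self-attention versus a sparse pattern that combines a sliding window with a small, fixed set of far-away positions. It is only an illustration under stated assumptions: the abstract does not describe how the dispersion approach selects global positions, so the evenly spaced positions here are a hypothetical stand-in, and the names `window` and `num_global` are invented for this example.

```python
def full_attention_pairs(n):
    """Number of (query, key) pairs scored by full self-attention: N * N."""
    return n * n


def sparse_attention_keys(i, n, window, num_global):
    """Key positions visible to query i under the illustrative sparse pattern.

    Local part: a sliding window of `window` tokens on each side of i.
    Global part: `num_global` evenly spaced positions. This is an ASSUMPTION
    standing in for the paper's dispersion approach, which the abstract
    does not specify.
    """
    lo = max(0, i - window)
    hi = min(n - 1, i + window)
    local = set(range(lo, hi + 1))
    stride = max(1, n // num_global)
    dispersed = set(range(0, n, stride)[:num_global])
    return local | dispersed


def sparse_attention_pairs(n, window, num_global):
    """Total (query, key) pairs scored by the sparse pattern over all queries."""
    return sum(len(sparse_attention_keys(i, n, window, num_global))
               for i in range(n))


if __name__ == "__main__":
    window, num_global = 4, 8  # hypothetical settings, not taken from the paper
    for n in (500, 1000, 2000, 4000):
        print(n, full_attention_pairs(n),
              sparse_attention_pairs(n, window, num_global))
```

Because each query sees at most 2 * window + 1 + num_global keys regardless of N, the sparse count grows in proportion to N, while the full count quadruples each time N doubles.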