Abstract
How can we reduce the compute and memory requirements of neural networks (NNs) without sacrificing performance? Many recent works use sparse Mixtures of Experts (MoEs) to build resource-efficient large language models (LMs). Here we introduce several novel perspectives on MoEs, presenting a general framework that *unifies* various methods to *approximate two-layer NNs* (e.g., feedforward blocks of Transformers), including product-key memories (PKMs). Leveraging insights from this framework, we propose methods to improve both MoEs and PKMs. Unlike prior work that compares MoEs with dense baselines under the *compute-equal* condition, our evaluation condition is *parameter-equal*, which is crucial to properly evaluate LMs. We show that our MoEs are competitive with the *dense* Transformer-XL on both the WikiText-103 and enwik8 datasets at two different scales, while being much more resource efficient. This demonstrates that MoEs are relevant not only to extremely large LMs but also to any-scale resource-efficient LMs. Our code is public.
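To make the setting concrete, below is a minimal PyTorch sketch contrasting a dense two-layer feedforward block with a sparse MoE layer that splits the hidden units into experts and evaluates only the top-k experts per token. This is an illustrative sketch, not the paper's exact method: the class names, sigmoid gating, and hyperparameters (`n_experts`, `k`) are assumptions made for the example.

```python
# Sketch: dense two-layer FFN vs. a sparse MoE approximation of it.
# Illustrative only; routing details and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseFFN(nn.Module):
    """Standard two-layer feedforward block: d_model -> d_ff -> d_model."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.fc2(F.relu(self.fc1(x)))


class MoEFFN(nn.Module):
    """Sparse MoE: the d_ff hidden units are split into n_experts groups,
    and only the top-k expert groups are evaluated for each token."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int = 16, k: int = 4):
        super().__init__()
        assert d_ff % n_experts == 0
        self.k = k
        d_expert = d_ff // n_experts
        # Expert weights as batched matrices: (n_experts, d_model, d_expert), etc.
        self.w1 = nn.Parameter(torch.randn(n_experts, d_model, d_expert) * d_model ** -0.5)
        self.w2 = nn.Parameter(torch.randn(n_experts, d_expert, d_model) * d_expert ** -0.5)
        self.router = nn.Linear(d_model, n_experts)  # expert-selection (gating) network

    def forward(self, x):
        # x: (batch, seq, d_model) -> flatten to a list of tokens
        b, s, d = x.shape
        tokens = x.reshape(-1, d)                            # (T, d_model)
        scores = torch.sigmoid(self.router(tokens))          # (T, n_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)  # (T, k)

        out = torch.zeros_like(tokens)
        for slot in range(self.k):
            idx = topk_idx[:, slot]                          # chosen expert per token
            gate = topk_scores[:, slot].unsqueeze(-1)        # gating weight
            # Gather that expert's weights per token and apply both layers.
            h = torch.relu(torch.einsum('td,tdh->th', tokens, self.w1[idx]))
            out = out + gate * torch.einsum('th,thd->td', h, self.w2[idx])
        return out.reshape(b, s, d)


if __name__ == "__main__":
    x = torch.randn(2, 8, 64)
    dense, moe = DenseFFN(64, 256), MoEFFN(64, 256, n_experts=8, k=2)
    print(dense(x).shape, moe(x).shape)  # both (2, 8, 64)
```

With `n_experts=8` and `k=2`, the MoE layer evaluates roughly a quarter of the hidden units per token while keeping the same total parameter count as the dense block, which mirrors the parameter-equal comparison emphasized in the abstract.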
- Anthology ID: 2023.findings-emnlp.49
- Volume: Findings of the Association for Computational Linguistics: EMNLP 2023
- Month: December
- Year: 2023
- Address: Singapore
- Editors: Houda Bouamor, Juan Pino, Kalika Bali
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 674–692
- URL: https://aclanthology.org/2023.findings-emnlp.49
- DOI: 10.18653/v1/2023.findings-emnlp.49
- Cite (ACL): Róbert Csordás, Kazuki Irie, and Jürgen Schmidhuber. 2023. Approximating Two-Layer Feedforward Networks for Efficient Transformers. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 674–692, Singapore. Association for Computational Linguistics.
- Cite (Informal): Approximating Two-Layer Feedforward Networks for Efficient Transformers (Csordás et al., Findings 2023)
- PDF: https://preview.aclanthology.org/nschneid-patch-2/2023.findings-emnlp.49.pdf