Felicity Wang
2022
Numerical Optimizations for Weighted Low-rank Estimation on Language Models
Ting Hua | Yen-Chang Hsu | Felicity Wang | Qian Lou | Yilin Shen | Hongxia Jin
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Singular value decomposition (SVD) is one of the most popular compression methods, approximating a target matrix with smaller matrices. However, standard SVD treats all parameters within the matrix with equal importance, which is a simple but unrealistic assumption. The parameters of a trained neural network model may affect task performance unevenly, which suggests non-equal importance among the parameters. Compared to SVD, a decomposition method that is aware of parameter importance is the more practical choice in real cases. Unlike standard SVD, weighted value decomposition is a non-convex optimization problem that lacks a closed-form solution. We systematically investigated multiple optimization strategies to tackle the problem and examined our method by compressing Transformer-based language models. Further, we designed a metric to predict when SVD may introduce a significant performance drop, for which our method can be a rescue strategy. Extensive evaluations demonstrate that our method can perform better than current SOTA methods in compressing Transformer-based language models.
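A minimal sketch of the weighted low-rank idea the abstract describes, not the paper's actual optimizer: it contrasts a plain SVD baseline with a factorization that minimizes an importance-weighted reconstruction error by simple gradient descent. The importance matrix `I`, rank `r`, step size, and initialization below are illustrative assumptions.

```python
import numpy as np

def weighted_low_rank(W, I, r, steps=2000, lr=1e-2, seed=0):
    """Rank-r factors A (m x r), B (r x n) minimizing ||sqrt(I) * (W - A @ B)||_F^2."""
    rng = np.random.default_rng(seed)
    m, n = W.shape
    A = rng.normal(scale=0.1, size=(m, r))
    B = rng.normal(scale=0.1, size=(r, n))
    for _ in range(steps):
        R = I * (A @ B - W)   # importance-weighted residual
        grad_A = R @ B.T      # gradient of the weighted loss w.r.t. A
        grad_B = A.T @ R      # gradient of the weighted loss w.r.t. B
        A -= lr * grad_A
        B -= lr * grad_B
    return A, B

def svd_low_rank(W, r):
    """Plain SVD baseline: every parameter receives equal importance."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :r] * s[:r], Vt[:r]
```

Because the per-parameter weights break the structure that gives SVD its closed-form solution, the weighted objective generally has to be optimized iteratively, as in the loop above.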
2019
Bag-of-Words Transfer: Non-Contextual Techniques for Multi-Task Learning
Seth Ebner | Felicity Wang | Benjamin Van Durme
Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019)
Many architectures for multi-task learning (MTL) have been proposed to take advantage of transfer among tasks, often involving complex models and training procedures. In this paper, we ask if the sentence-level representations learned in previous approaches provide significant benefit beyond that provided by simply improving word-based representations. To investigate this question, we consider three techniques that ignore sequence information: a syntactically-oblivious pooling encoder, pre-trained non-contextual word embeddings, and unigram generative regularization. Compared to a state-of-the-art MTL approach to textual inference, the simple techniques we use yield similar performance on a universe of task combinations while reducing training time and model size.
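A minimal sketch of one reasonable reading of the "syntactically-oblivious pooling encoder" with pre-trained non-contextual word embeddings, not the paper's implementation: sentences are encoded by mean-pooling frozen word vectors, discarding word order. The class name, interface, and choice of mean pooling are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MeanPoolEncoder(nn.Module):
    """Bag-of-words sentence encoder: average of pre-trained word embeddings."""

    def __init__(self, pretrained_vectors, freeze=True):
        super().__init__()
        # pretrained_vectors: (vocab_size, dim) tensor of non-contextual embeddings
        self.embed = nn.Embedding.from_pretrained(pretrained_vectors, freeze=freeze)

    def forward(self, token_ids, mask):
        # token_ids: (batch, seq_len) int tensor; mask: (batch, seq_len) with 1 for real tokens
        mask = mask.float()
        vecs = self.embed(token_ids)                      # (batch, seq_len, dim)
        summed = (vecs * mask.unsqueeze(-1)).sum(dim=1)   # ignore padding positions
        lengths = mask.sum(dim=1, keepdim=True).clamp(min=1.0)
        return summed / lengths                           # order-insensitive mean pooling
```

Such an encoder ignores sequence information entirely, which is the point of the comparison in the abstract: any gains over it by sentence-level MTL architectures must come from more than improved word representations.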