Guoxuan Chen



2025

Self-Adjust Softmax
Chuanyang Zheng | Yihang Gao | Guoxuan Chen | Han Shi | Jing Xiong | Xiaozhe Ren | Chao Huang | Zhenguo Li | Yu Li
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

The softmax function is crucial in Transformer attention, where it normalizes each row of the attention scores so that it sums to one. Usually, tokens with larger attention scores are important for the final prediction. However, the softmax function can suffer from vanishing gradients for such important tokens (e.g., when their probabilities are close to one), which makes them difficult to optimize and can limit performance. In this paper, we propose Self-Adjust Softmax (SA-Softmax) to address this issue by modifying softmax(z) to z ⋅ softmax(z) and its normalized variant (z - min(z_min, 0)) / (max(0, z_max) - min(z_min, 0)) ⋅ softmax(z). We theoretically show that SA-Softmax provides enhanced gradient properties compared to the vanilla softmax function. Moreover, SA-Softmax attention can be seamlessly integrated into the attention mechanisms of existing Transformer models with minor adjustments. We conducted experiments to evaluate the empirical performance of Transformer models using SA-Softmax compared to the vanilla softmax function. These experiments, involving models with up to 2.7 billion parameters, are conducted across diverse datasets, language tasks, and positional encoding methods.
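
The two formulas in the abstract can be illustrated on a single row of attention scores. What follows is a minimal sketch, not the authors' implementation. The function names, the use of plain Python lists, and the small `eps` term in the denominator are assumptions made for illustration. Taking z_min and z_max as the minimum and maximum over the row is also an assumption.

```python
import math


def softmax(z):
    """Numerically stable softmax over one row of scores (a list of floats)."""
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def sa_softmax(z):
    """Elementwise z * softmax(z), the first variant named in the abstract."""
    return [x * p for x, p in zip(z, softmax(z))]


def sa_softmax_normalized(z, eps=1e-9):
    """(z - min(z_min, 0)) / (max(0, z_max) - min(z_min, 0)) * softmax(z).

    z_min and z_max are taken over the row (an assumption of this sketch).
    eps is a hypothetical safeguard against division by zero when every
    score in the row is zero; it is not taken from the abstract.
    """
    lo = min(min(z), 0.0)
    hi = max(0.0, max(z))
    scale = hi - lo + eps
    return [(x - lo) / scale * p for x, p in zip(z, softmax(z))]


scores = [2.0, 0.0, -1.0]
plain = softmax(scores)
scaled = sa_softmax(scores)
normalized = sa_softmax_normalized(scores)
```

In this sketch the z ⋅ softmax(z) outputs can be negative and do not in general sum to one, while the normalized variant rescales the multiplier into the range [0, 1] before applying it to the softmax probabilities.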