Chun-Nam Yu


2025

RoCoFT: Efficient Finetuning of Large Language Models with Row-Column Updates
Md Kowsher | Tara Esmaeilbeig | Chun-Nam Yu | Chen Chen | Mojtaba Soltanalian | Niloofar Yousefi
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We propose Row-Column Fine-Tuning (RoCoFT), a parameter-efficient fine-tuning method for large language models based on updating only a few rows and columns of the weight matrices in transformers. Through extensive experiments with medium-sized LMs such as RoBERTa and DeBERTa and larger LMs such as Bloom-7B, Llama2-7B, and Llama2-13B, we show that our method gives comparable or better accuracy than state-of-the-art parameter-efficient fine-tuning methods while also being more memory- and computation-efficient. We also study the reason behind the effectiveness of our method with tools from neural tangent kernel theory. We empirically demonstrate that our kernel, constructed using a restricted set of row and column parameters, is numerically close to the full-parameter kernel and gives comparable classification performance. Ablation studies investigate the impact of different algorithmic choices, including the robustness of RoCoFT to the particular selection of rows and columns and the optimal rank for an effective implementation of our method.
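
To make the row-and-column update concrete, here is a minimal PyTorch sketch of the idea as described in the abstract; the module name RowColLinear, the rank argument, and the choice of the first rows and columns are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the row-column update idea on a single linear layer.
# RowColLinear, `rank`, and the choice of the first rows/columns are
# illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RowColLinear(nn.Module):
    """Freezes a pretrained linear layer and trains only additive updates
    to a few rows and columns of its weight matrix."""

    def __init__(self, base: nn.Linear, rank: int = 1):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # keep pretrained weights frozen
        out_features, in_features = base.weight.shape
        # The abstract notes RoCoFT is robust to which rows/columns are chosen,
        # so this sketch simply takes the first `rank` of each.
        self.register_buffer("row_idx", torch.arange(rank))
        self.register_buffer("col_idx", torch.arange(rank))
        self.row_update = nn.Parameter(torch.zeros(rank, in_features))
        self.col_update = nn.Parameter(torch.zeros(out_features, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta = torch.zeros_like(self.base.weight)                    # mostly-zero update
        delta = delta.index_add(0, self.row_idx, self.row_update)     # add row updates
        delta = delta.index_add(1, self.col_idx, self.col_update)     # add column updates
        return F.linear(x, self.base.weight + delta, self.base.bias)
```

Only row_update and col_update receive gradients, so each wrapped layer trains rank * (in_features + out_features) parameters instead of the full weight matrix.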

Predicting Through Generation: Why Generation Is Better for Prediction
Md Kowsher | Nusrat Jahan Prottasha | Prakash Bhat | Chun-Nam Yu | Mojtaba Soltanalian | Ivan Garibay | Ozlem Garibay | Chen Chen | Niloofar Yousefi
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

This paper argues that generating output tokens is more effective than using pooled representations for prediction tasks because token-level generation retains more mutual information. Since LLMs are trained on massive text corpora using next-token prediction, generation aligns naturally with their learned behavior. Using the Data Processing Inequality (DPI), we provide both theoretical and empirical evidence supporting this claim. However, autoregressive models face two key challenges when used for prediction: (1) exposure bias, where the model sees ground-truth tokens during training but relies on its own predictions during inference, leading to errors, and (2) format mismatch, where discrete tokens do not always align with the task’s required output structure. To address these challenges, we introduce PredGen (Predicting Through Generating), an end-to-end framework that (i) uses scheduled sampling to reduce exposure bias, and (ii) introduces a task adapter to convert the generated tokens into structured outputs. Additionally, we introduce Writer-Director Alignment Loss (WDAL), which ensures consistency between token generation and final task predictions, improving both text coherence and numerical accuracy. We evaluate PredGen on multiple classification and regression benchmarks. Our results show that PredGen consistently outperforms standard baselines, demonstrating its effectiveness in structured prediction tasks.
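
The scheduled-sampling step used to reduce exposure bias can be sketched as below, assuming a HuggingFace-style causal LM interface (model(input_ids=...).logits); the function name and the mixing scheme are assumptions for illustration, not the PredGen implementation.

```python
# Illustrative sketch of scheduled sampling for reducing exposure bias.
# Assumes a HuggingFace-style causal LM (`model(input_ids=...).logits`);
# the function name and mixing scheme are assumptions, not PredGen's code.
import torch


def mix_with_model_predictions(model, input_ids: torch.Tensor,
                               sampling_prob: float) -> torch.Tensor:
    """Replace a fraction of ground-truth tokens with the model's own greedy
    predictions, so training-time inputs resemble inference-time inputs."""
    with torch.no_grad():
        logits = model(input_ids=input_ids).logits        # (batch, seq_len, vocab)
        greedy = logits.argmax(dim=-1)                     # model's guess for the next token
    # Shift right so position t holds the model's guess for token t,
    # keeping the very first token as ground truth.
    shifted = torch.cat([input_ids[:, :1], greedy[:, :-1]], dim=1)
    use_model_token = torch.rand(input_ids.shape, device=input_ids.device) < sampling_prob
    return torch.where(use_model_token, shifted, input_ids)
```

In practice the mixing probability would typically be annealed upward during training so the model is gradually exposed to its own predictions.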

Does Self-Attention Need Separate Weights in Transformers?
Md Kowsher | Nusrat Jahan Prottasha | Chun-Nam Yu | Ozlem Garibay | Niloofar Yousefi
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)

Self-attention has revolutionized natural language processing by capturing long-range dependencies and improving context understanding. However, it comes with high computational costs and struggles with the inherent directionality of sequential data. This paper investigates whether self-attention needs separate weights for keys, queries, and values and presents a simplified approach called “shared weight self-attention,” where a single weight matrix is used for all three instead of separate matrices for each. This approach cuts trainable parameters by more than half and significantly reduces training time. Our method not only improves efficiency but also achieves strong performance on tasks from the GLUE benchmark, even outperforming the standard BERT baseline in handling noisy and out-of-domain data. Experimental results show a 66.53% reduction in parameter size within the attention block and competitive accuracy improvements of 3.55% and 0.89% over symmetric and pairwise attention-based BERT models, respectively.
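
A minimal single-head sketch of the shared-weight idea follows, assuming a PyTorch module with one projection reused for queries, keys, and values; the class and argument names are illustrative, and the paper's actual models are multi-head BERT variants.

```python
# Single-head sketch of shared-weight self-attention: one projection matrix
# serves as W_Q, W_K, and W_V. Module and argument names are illustrative.
import math
import torch
import torch.nn as nn


class SharedWeightSelfAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.shared_proj = nn.Linear(d_model, d_model)   # replaces separate Q/K/V projections
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # One projection used for queries, keys, and values alike.
        q = k = v = self.shared_proj(x)                  # (batch, seq_len, d_model)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        attn = scores.softmax(dim=-1)
        return self.out_proj(attn @ v)
```

Collapsing the three query/key/value projections into one removes roughly two-thirds of those parameters, which is consistent with the 66.53% reduction within the attention block reported above.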