Peng Lu
2026
MultiFinBen: Benchmarking Large Language Models for Multilingual and Multimodal Financial Application
Xueqing Peng | Lingfei Qian | Yan Wang | Ruoyu Xiang | Yueru He | Yang Ren | Mingyang Jiang | Vincent Jim Zhang | Yuqing Guo | Jeff Zhao | Huan He | Yi Han | Yun Feng | Yuechen Jiang | Yupeng Cao | Haohang Li | Yangyang Yu | Xiaoyu Wang | Penglei Gao | Shengyuan Lin | Keyi Wang | Shanshan Yang | Yilun Zhao | Zhiwei Liu | Peng Lu | Jerry Huang | Suyuchen Wang | Triantafillos Papadopoulos | Polydoros Giannouris | Efstathia Soufleri | Nuo Chen | Zhiyang Deng | Heming Fu | Yijia Zhao | Mingquan Lin | Meikang Qiu | Kaleb E Smith | Arman Cohan | Xiao-Yang Liu | Jimin Huang | Guojun Xiong | Alejandro Lopez-Lira | Xi Chen | Junichi Tsujii | Jian-Yun Nie | Sophia Ananiadou | Qianqian Xie
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Xueqing Peng | Lingfei Qian | Yan Wang | Ruoyu Xiang | Yueru He | Yang Ren | Mingyang Jiang | Vincent Jim Zhang | Yuqing Guo | Jeff Zhao | Huan He | Yi Han | Yun Feng | Yuechen Jiang | Yupeng Cao | Haohang Li | Yangyang Yu | Xiaoyu Wang | Penglei Gao | Shengyuan Lin | Keyi Wang | Shanshan Yang | Yilun Zhao | Zhiwei Liu | Peng Lu | Jerry Huang | Suyuchen Wang | Triantafillos Papadopoulos | Polydoros Giannouris | Efstathia Soufleri | Nuo Chen | Zhiyang Deng | Heming Fu | Yijia Zhao | Mingquan Lin | Meikang Qiu | Kaleb E Smith | Arman Cohan | Xiao-Yang Liu | Jimin Huang | Guojun Xiong | Alejandro Lopez-Lira | Xi Chen | Junichi Tsujii | Jian-Yun Nie | Sophia Ananiadou | Qianqian Xie
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Real-world financial analysis involves information across multiple languages and modalities, from reports and news to scanned filings and meeting recordings. Yet most existing evaluations of LLMs in finance remain text-only, monolingual, and largely saturated by current models. To bridge these gaps, we present MultiFinBen, the first expert-annotated multilingual (five languages) and multimodal (text, vision, audio) benchmark for evaluating LLMs in realistic financial contexts. MultiFinBen introduces two new task families: multilingual financial reasoning, which tests cross-lingual evidence integration from filings and news, and financial OCR, which extracts structured text from scanned documents containing tables and charts. Rather than aggregating all available datasets, we apply a structured, difficulty-aware selection based on advanced model performance, ensuring balanced challenge and removing redundant tasks. Evaluating 21 leading LLMs shows that even frontier multimodal models like GPT-4o achieve only 46.01% overall, stronger on vision and audio but dropping sharply in multilingual settings. These findings expose persistent limitations in multilingual, multimodal, and expert-level financial reasoning. All datasets, evaluation scripts, and leaderboards are publicly released.
EvoEdit: Evolving Null-space Alignment for Robust and Efficient Knowledge Editing
Sicheng Lyu | Yu Gu | Xinyu Wang | Jerry Huang | Sitao Luan | Yufei Cui | Xiao-Wen Chang | Peng Lu
Findings of the Association for Computational Linguistics: ACL 2026
Sicheng Lyu | Yu Gu | Xinyu Wang | Jerry Huang | Sitao Luan | Yufei Cui | Xiao-Wen Chang | Peng Lu
Findings of the Association for Computational Linguistics: ACL 2026
Large language models (LLMs) require continual updates to rectify outdated or erroneous knowledge. Model editing has emerged as a compelling paradigm for introducing targeted modifications without the computational burden of full retraining. Existing approaches are mainly based on a locate-then-edit framework. However, in sequential editing contexts, where multiple updates are applied over time, they exhibit significant limitations and suffer from catastrophic interference, i.e., new edits compromise previously integrated updates and degrade preserved knowledge. To address these challenges, we introduce EvoEdit, a novel editing strategy that mitigates catastrophic interference through sequential null-space alignment, enabling stable and efficient model editing. By performing sequential null-space alignment for each incoming edit, EvoEdit preserves both original and previously modified knowledge representations and maintains output invariance on preserved knowledge even across long edit sequences, effectively mitigating interference. Evaluations on real-world sequential knowledge-editing benchmarks show that EvoEdit achieves better or comparable performance than prior state-of-the-art locate-then-edit techniques, with up to 3.53× speedup. Overall, these results underscore the necessity of developing more principled approaches for designing LLMs in dynamically evolving information settings, while providing a simple yet effective solution with strong theoretical guarantees.
Investigating the Multilingual Calibration Effects of Language Model Instruction Tuning
Jerry Huang | Peng Lu | Qiuhao Zeng | Yusuke Iwasawa | Yutaka Matsuo | Sarath Chandar | Edison Marrese-Taylor | Irene Li
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)
Jerry Huang | Peng Lu | Qiuhao Zeng | Yusuke Iwasawa | Yutaka Matsuo | Sarath Chandar | Edison Marrese-Taylor | Irene Li
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)
Ensuring that deep learning models are well-calibrated in terms of their predictive uncertainty is essential in maintaining their trustworthiness and reliability, yet despite increasing advances in foundation model research, the relationship between such large language models (LLMs) and their calibration remains an open area of research. In this work, we look at a critical gap in the calibration of LLMs within multilingual settings, in an attempt to better understand how the data scarcity can potentially lead to different calibration effects and how commonly used techniques can apply in these settings. Our analysis on two multilingual benchmarks, over 29 and 42 languages respectively, reveals that even in low-resource languages, model confidence can increase significantly after instruction-tuning on high-resource language SFT datasets. However, improvements in accuracy are marginal or non-existent, resulting in mis-calibration, highlighting a critical shortcoming of standard SFT for multilingual languages. Furthermore, we observe that the use of label smoothing to be a reasonable method alleviate this concern, again without any need for low-resource SFT data, maintaining better calibration across all languages. Overall, this highlights the importance of multilingual considerations for both training and tuning LLMs in order to improve their reliability and fairness in downstream use.
MATCH: Modulating Attention via In-Context Retrieval for Long-Context Transformers
Linrui Ma | Chun Hei Lo | Xinyu Wang | Peng Lu | Xihao Yuan | Hanting Chen | Kai Han | Xinghao Chen | Chengjun Zhan | Hanlin xu | Yichun Yin | Lifeng Shang | Feng Wen | Boxing Chen | Yufei Cui
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Linrui Ma | Chun Hei Lo | Xinyu Wang | Peng Lu | Xihao Yuan | Hanting Chen | Kai Han | Xinghao Chen | Chengjun Zhan | Hanlin xu | Yichun Yin | Lifeng Shang | Feng Wen | Boxing Chen | Yufei Cui
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The quadratic computational cost of traditional attention mechanisms poses a major bottleneck to the scalability and practical deployment of large language models (LLMs), particularly in long-context scenarios. To improve efficiency, existing approaches often enforce rigid structural constraints such as local attention windows. However, these strategies typically lead to substantial performance degradation on tasks requiring precise long-range recall. In this work, we propose MATCH, a scalable and efficient framework that augments sparsified attention mechanisms with dynamically integrated in-context information through an efficient retrieval system. Empirical results show that MATCH significantly improves the performance of sparse-attention models on both synthetic and real-world natural-language tasks. These findings highlight the versatility of MATCH as a general approach for enhancing in-context retrieval capabilities while maintaining the efficiency benefits of sparse attention architectures.
2025
ReGLA: Refining Gated Linear Attention
Peng Lu | Ivan Kobyzev | Mehdi Rezagholizadeh | Boxing Chen | Philippe Langlais
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Peng Lu | Ivan Kobyzev | Mehdi Rezagholizadeh | Boxing Chen | Philippe Langlais
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Recent advancements in Large Language Models (LLMs) have set themselves apart with their exceptional performance in complex language modelling tasks. However, these models are also known for their significant computational and storage requirements, primarily due to the quadratic computation complexity of softmax attention. To mitigate this issue, linear attention has been designed to reduce the quadratic space-time complexity that is inherent in standard transformers. In this work, we embarked on a comprehensive exploration of three key components that substantially impact the performance of the Gated Linear Attention module: feature maps, normalization, and the gating mechanism. We developed a feature mapping function to address some crucial issues that previous suggestions overlooked. Then we offered further rationale for the integration of normalization layers to stabilize the training process. Moreover, we explored the saturation phenomenon of the gating mechanism and augmented it with a refining module. We conducted extensive experiments and showed our architecture outperforms previous Gated Linear Attention mechanisms in extensive tasks including training from scratch and post-linearization with continual pre-training.
2024
Draft on the Fly: Adaptive Self-Speculative Decoding using Cosine Similarity
Michael R. Metel | Peng Lu | Boxing Chen | Mehdi Rezagholizadeh | Ivan Kobyzev
Findings of the Association for Computational Linguistics: EMNLP 2024
Michael R. Metel | Peng Lu | Boxing Chen | Mehdi Rezagholizadeh | Ivan Kobyzev
Findings of the Association for Computational Linguistics: EMNLP 2024
We present a simple on the fly method for faster inference of large language models. Unlike other (self-)speculative decoding techniques, our method does not require fine-tuning or black-box optimization to generate a fixed draft model, relying instead on simple rules to generate varying draft models adapted to the input context. We show empirically that our light-weight algorithm is competitive with the current SOTA for self-speculative decoding, while being a truly plug-and-play method.
Resonance RoPE: Improving Context Length Generalization of Large Language Models
Suyuchen Wang | Ivan Kobyzev | Peng Lu | Mehdi Rezagholizadeh | Bang Liu
Findings of the Association for Computational Linguistics: ACL 2024
Suyuchen Wang | Ivan Kobyzev | Peng Lu | Mehdi Rezagholizadeh | Bang Liu
Findings of the Association for Computational Linguistics: ACL 2024
This paper addresses the challenge of train-short-test-long (TSTL) scenarios in Large Language Models (LLMs) equipped with Rotary Position Embedding (RoPE), where models pre-trained on shorter sequences face difficulty with out-of-distribution (OOD) token positions in longer sequences. We introduce Resonance RoPE, a novel approach designed to narrow the generalization gap in TSTL scenarios by refining the interpolation of RoPE features for OOD positions, significantly improving the model performance without additional online computational costs. Furthermore, we present PosGen, a new synthetic benchmark specifically designed for fine-grained behavior analysis in TSTL scenarios, aiming to isolate the constantly increasing difficulty of token generation on long contexts from the challenges of recognizing new token positions. Our experiments on synthetic tasks show that after applying Resonance RoPE, Transformers recognize OOD position better and more robustly. Our extensive LLM experiments also show superior performance after applying Resonance RoPE to the current state-of-the-art RoPE scaling method, YaRN, on both upstream language modeling tasks and a variety of downstream long-text applications.
2023
LABO: Towards Learning Optimal Label Regularization via Bi-level Optimization
Peng Lu | Ahmad Rashid | Ivan Kobyzev | Mehdi Rezagholizadeh | Phillippe Langlais
Findings of the Association for Computational Linguistics: ACL 2023
Peng Lu | Ahmad Rashid | Ivan Kobyzev | Mehdi Rezagholizadeh | Phillippe Langlais
Findings of the Association for Computational Linguistics: ACL 2023
Regularization techniques are crucial to improving the generalization performance and training efficiency of deep neural networks. Many deep learning algorithms rely on weight decay, dropout, batch/layer normalization to converge faster and generalize. Label Smoothing (LS) is another simple, versatile and efficient regularization which can be applied to various supervised classification tasks. Conventional LS, however, regardless of the training instance assumes that each non-target class is equally likely. In this work, we present a general framework for training with label regularization, which includes conventional LS but can also model instance-specific variants. Based on this formulation, we propose an efficient way of learning LAbel regularization by devising a Bi-level Optimization (LABO) problem. We derive a deterministic and interpretable solution of the inner loop as the optimal label smoothing without the need to store the parameters or the output of a trained model. Finally, we conduct extensive experiments and demonstrate our LABO consistently yields improvement over conventional label regularization on various fields, including seven machine translation and three image classification tasks across various neural network architectures while maintaining training efficiency.
Efficient Classification of Long Documents via State-Space Models
Peng Lu | Suyuchen Wang | Mehdi Rezagholizadeh | Bang Liu | Ivan Kobyzev
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Peng Lu | Suyuchen Wang | Mehdi Rezagholizadeh | Bang Liu | Ivan Kobyzev
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Transformer-based models have achieved state-of-the-art performance on numerous NLP applications. However, long documents which are prevalent in real-world scenarios cannot be efficiently processed by transformers with the vanilla self-attention module due to their quadratic computation complexity and limited length extrapolation ability. Instead of tackling the computation difficulty for self-attention with sparse or hierarchical structures, in this paper, we investigate the use of State-Space Models (SSMs) for long document classification tasks. We conducted extensive experiments on six long document classification datasets, including binary, multi-class, and multi-label classification, comparing SSMs (with and without pre-training) to self-attention-based models. We also introduce the SSM-pooler model and demonstrate that it achieves comparable performance while being on average 36% more efficient. Additionally our method exhibits higher robustness to the input noise even in the extreme scenario of 40%.
Do we need Label Regularization to Fine-tune Pre-trained Language Models?
Ivan Kobyzev | Aref Jafari | Mehdi Rezagholizadeh | Tianda Li | Alan Do-Omri | Peng Lu | Pascal Poupart | Ali Ghodsi
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics
Ivan Kobyzev | Aref Jafari | Mehdi Rezagholizadeh | Tianda Li | Alan Do-Omri | Peng Lu | Pascal Poupart | Ali Ghodsi
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics
Knowledge Distillation (KD) is a prominent neural model compression technique that heavily relies on teacher network predictions to guide the training of a student model. Considering the ever-growing size of pre-trained language models (PLMs), KD is often adopted in many NLP tasks involving PLMs. However, it is evident that in KD, deploying the teacher network during training adds to the memory and computational requirements of training. In the computer vision literature, the necessity of the teacher network is put under scrutiny by showing that KD is a label regularization technique that can be replaced with lighter teacher-free variants such as the label-smoothing technique. However, to the best of our knowledge, this issue is not investigated in NLP. Therefore, this work concerns studying different label regularization techniques and whether we actually need them to improve the fine-tuning of smaller PLM networks on downstream tasks. In this regard, we did a comprehensive set of experiments on different PLMs such as BERT, RoBERTa, and GPT with more than 600 distinct trials and ran each configuration five times. This investigation led to a surprising observation that KD and other label regularization techniques do not play any meaningful role over regular fine-tuning when the student model is pre-trained. We further explore this phenomenon in different settings of NLP and computer vision tasks and demonstrate that pre-training itself acts as a kind of regularization, and additional label regularization is unnecessary.
2022
Improving Generalization of Pre-trained Language Models via Stochastic Weight Averaging
Peng Lu | Ivan Kobyzev | Mehdi Rezagholizadeh | Ahmad Rashid | Ali Ghodsi | Phillippe Langlais
Findings of the Association for Computational Linguistics: EMNLP 2022
Peng Lu | Ivan Kobyzev | Mehdi Rezagholizadeh | Ahmad Rashid | Ali Ghodsi | Phillippe Langlais
Findings of the Association for Computational Linguistics: EMNLP 2022
Knowledge Distillation (KD) is a commonly used technique for improving the generalization of compact Pre-trained Language Models (PLMs) on downstream tasks. However, such methods impose the additional burden of training a separate teacher model for every new dataset.Alternatively, one may directly work on the improvement of the optimization procedure of the compact model towards better generalization. Recent works observe that the flatness of the local minimum correlates well with better generalization.In this work, we adapt Stochastic Weight Averaging (SWA), a method encouraging convergence to a flatter minimum, to fine-tuning PLMs. We conduct extensive experiments on various NLP tasks (text classification, question answering, and generation) and different model architectures and demonstrate that our adaptation improves the generalization without extra computation cost. Moreover, we observe that this simple optimization technique is able to outperform the state-of-the-art KD methods for compact models.
2021
RW-KD: Sample-wise Loss Terms Re-Weighting for Knowledge Distillation
Peng Lu | Abbas Ghaddar | Ahmad Rashid | Mehdi Rezagholizadeh | Ali Ghodsi | Philippe Langlais
Findings of the Association for Computational Linguistics: EMNLP 2021
Peng Lu | Abbas Ghaddar | Ahmad Rashid | Mehdi Rezagholizadeh | Ali Ghodsi | Philippe Langlais
Findings of the Association for Computational Linguistics: EMNLP 2021
Knowledge Distillation (KD) is extensively used in Natural Language Processing to compress the pre-training and task-specific fine-tuning phases of large neural language models. A student model is trained to minimize a convex combination of the prediction loss over the labels and another over the teacher output. However, most existing works either fix the interpolating weight between the two losses apriori or vary the weight using heuristics. In this work, we propose a novel sample-wise loss weighting method, RW-KD. A meta-learner, simultaneously trained with the student, adaptively re-weights the two losses for each sample. We demonstrate, on 7 datasets of the GLUE benchmark, that RW-KD outperforms other loss re-weighting methods for KD.
2019
SC-LSTM: Learning Task-Specific Representations in Multi-Task Learning for Sequence Labeling
Peng Lu | Ting Bai | Philippe Langlais
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
Peng Lu | Ting Bai | Philippe Langlais
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
Multi-task learning (MTL) has been studied recently for sequence labeling. Typically, auxiliary tasks are selected specifically in order to improve the performance of a target task. Jointly learning multiple tasks in a way that benefit all of them simultaneously can increase the utility of MTL. In order to do so, we propose a new LSTM cell which contains both shared parameters that can learn from all tasks, and task-specific parameters that can learn task-specific information. We name it a Shared-Cell Long-Short Term Memory SC-LSTM. Experimental results on three sequence labeling benchmarks (named-entity recognition, text chunking, and part-of-speech tagging) demonstrate the effectiveness of our SC-LSTM cell.
Search
Fix author
Co-authors
- Mehdi Rezagholizadeh 8
- Ivan Kobyzev 7
- Philippe Langlais 5
- Boxing Chen 3
- Ali Ghodsi 3
- Jerry Huang 3
- Ahmad Rashid 3
- Suyuchen Wang 3
- Yufei Cui 2
- Bang Liu 2
- Xinyu Wang 2
- Sophia Ananiadou 1
- Ting Bai 1
- Yupeng Cao 1
- Sarath Chandar 1
- Xiao-Wen Chang 1
- Nuo Chen 1
- Xi Chen 1
- Hanting Chen 1
- Xinghao Chen 1
- Arman Cohan 1
- Zhiyang Deng 1
- Alan Do-Omri 1
- Yun Feng 1
- Heming Fu 1
- Penglei Gao 1
- Abbas Ghaddar 1
- Polydoros Giannouris 1
- Yu Gu (谷峪) 1
- Yuqing Guo 1
- Yi Han 1
- Kai Han 1
- Yueru He 1
- Huan He 1
- Jimin Huang 1
- Yusuke Iwasawa 1
- Aref Jafari 1
- Mingyang Jiang 1
- Yuechen Jiang 1
- Haohang Li 1
- Tianda Li 1
- Irene Li 1
- Shengyuan Lin 1
- Mingquan Lin 1
- Zhiwei Liu 1
- Xiao-Yang Liu 1
- Chun Hei Lo 1
- Alejandro Lopez-Lira 1
- Sitao Luan 1
- Sicheng Lyu 1
- Linrui Ma 1
- Edison Marrese-Taylor 1
- Yutaka Matsuo 1
- Michael R. Metel 1
- Jian-Yun Nie 1
- Triantafillos Papadopoulos 1
- Xueqing Peng 1
- Pascal Poupart 1
- Lingfei Qian 1
- Meikang Qiu 1
- Yang Ren 1
- Lifeng Shang 1
- Kaleb E. Smith 1
- Efstathia Soufleri 1
- Jun’ichi Tsujii 1
- Yan Wang 1
- Xiaoyu Wang 1
- Keyi Wang 1
- Feng Wen 1
- Ruoyu Xiang 1
- Qianqian Xie 1
- Guojun Xiong 1
- Shanshan Yang 1
- Yichun Yin 1
- Yangyang Yu 1
- Xihao Yuan 1
- Qiuhao Zeng 1
- Chengjun Zhan 1
- Vincent Jim Zhang 1
- Jeff Zhao 1
- Yilun Zhao 1
- Yijia Zhao 1
- Hanlin xu 1