Suyuchen Wang
2026
MultiFinBen: Benchmarking Large Language Models for Multilingual and Multimodal Financial Application
Xueqing Peng | Lingfei Qian | Yan Wang | Ruoyu Xiang | Yueru He | Yang Ren | Mingyang Jiang | Vincent Jim Zhang | Yuqing Guo | Jeff Zhao | Huan He | Yi Han | Yun Feng | Yuechen Jiang | Yupeng Cao | Haohang Li | Yangyang Yu | Xiaoyu Wang | Penglei Gao | Shengyuan Lin | Keyi Wang | Shanshan Yang | Yilun Zhao | Zhiwei Liu | Peng Lu | Jerry Huang | Suyuchen Wang | Triantafillos Papadopoulos | Polydoros Giannouris | Efstathia Soufleri | Nuo Chen | Zhiyang Deng | Heming Fu | Yijia Zhao | Mingquan Lin | Meikang Qiu | Kaleb E Smith | Arman Cohan | Xiao-Yang Liu | Jimin Huang | Guojun Xiong | Alejandro Lopez-Lira | Xi Chen | Junichi Tsujii | Jian-Yun Nie | Sophia Ananiadou | Qianqian Xie
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Real-world financial analysis involves information across multiple languages and modalities, from reports and news to scanned filings and meeting recordings. Yet most existing evaluations of LLMs in finance remain text-only, monolingual, and largely saturated by current models. To bridge these gaps, we present MultiFinBen, the first expert-annotated multilingual (five languages) and multimodal (text, vision, audio) benchmark for evaluating LLMs in realistic financial contexts. MultiFinBen introduces two new task families: multilingual financial reasoning, which tests cross-lingual evidence integration from filings and news, and financial OCR, which extracts structured text from scanned documents containing tables and charts. Rather than aggregating all available datasets, we apply a structured, difficulty-aware selection based on advanced model performance, ensuring balanced challenge and removing redundant tasks. Evaluating 21 leading LLMs shows that even frontier multimodal models like GPT-4o achieve only 46.01% overall, stronger on vision and audio but dropping sharply in multilingual settings. These findings expose persistent limitations in multilingual, multimodal, and expert-level financial reasoning. All datasets, evaluation scripts, and leaderboards are publicly released.
2025
R3Mem: Bridging Memory Retention and Retrieval via Reversible Compression
Xiaoqiang Wang | Suyuchen Wang | Yun Zhu | Bang Liu
Findings of the Association for Computational Linguistics: ACL 2025
Memory plays a key role in enhancing LLMs’ performance when deployed to real-world applications. Existing solutions face trade-offs: explicit memory designs based on external storage require complex management and incur storage overhead, while implicit memory designs that store information via parameters struggle with reliable retrieval. In this paper, we propose R3Mem, a memory network that optimizes both information Retention and Retrieval through Reversible context compression. Specifically, R3Mem employs virtual memory tokens to compress and encode infinitely long histories, further enhanced by a hierarchical compression strategy that refines information from document- to entity-level for improved assimilation across granularities. For retrieval, R3Mem employs a reversible architecture, reconstructing raw data by invoking the model backward with compressed information. Implemented via parameter-efficient fine-tuning, it can integrate seamlessly with any Transformer-based model. Experiments demonstrate that our memory design achieves state-of-the-art performance in long-context language modeling and retrieval-augmented generation tasks. It also significantly outperforms conventional memory modules in long-horizon interaction tasks like conversational agents, showcasing its potential for next-generation retrieval systems.
WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation
Rabiul Awal | Mahsa Massoud | Aarash Feizi | Zichao Li | Suyuchen Wang | Christopher Pal | Aishwarya Agrawal | David Vazquez | Siva Reddy | Juan A. Rodriguez | Perouz Taslakian | Spandana Gella | Sai Rajeswar
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
We present WebMMU, a multilingual benchmark that evaluates three core web tasks: (1) website visual question answering, (2) code editing involving HTML/CSS/JavaScript, and (3) mockup-to-code generation. Unlike prior benchmarks that treat these tasks separately, WebMMU unifies them using expert-annotated, real-world web data to assess models’ abilities in complex multi-step reasoning, precise element grounding, and functional UI comprehension and coding. Our evaluation shows that while multimodal large language models (MLLMs) perform well on basic information extraction, they struggle with reasoning and grounding, editing code to preserve functionality, and generating design-to-code that maintains hierarchy and supports multilingual content. These findings reveal key limitations in current MLLMs and underscore the need for improved multimodal and cross-lingual reasoning to build future web agents capable of automating diverse web development tasks.
LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models
Zhiyuan Hu | Yuliang Liu | Jinman Zhao | Suyuchen Wang | Yan Wang | Wei Shen | Qing Gu | Anh Tuan Luu | See-Kiong Ng | Zhiwei Jiang | Bryan Hooi
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) face significant challenges in handling long-context tasks because of their limited effective context window size during pretraining, which restricts their ability to generalize over extended sequences. Meanwhile, extending the context window in LLMs through post-pretraining is highly resource-intensive. To address this, we introduce LongRecipe, an efficient training strategy for extending the context window of LLMs, including impactful token analysis, position index transformation, and training optimization strategies. It simulates long-sequence inputs while maintaining training efficiency and significantly improves the model’s understanding of long-range dependencies. Experiments on three types of LLMs show that LongRecipe can utilize long sequences while requiring only 30% of the target context window size, and reduces computational training resources by over 85% compared to full-sequence training. Furthermore, LongRecipe also preserves the original LLM’s capabilities in general tasks. Ultimately, we can extend the effective context window of open-source LLMs from 8k to 128k, achieving performance close to GPT-4 with just one day of dedicated training using a single GPU with 80G memory. Our code is released at https://github.com/zhiyuanhubj/LongRecipe.
Improving Context Fidelity via Native Retrieval-Augmented Reasoning
Suyuchen Wang | Jinlin Wang | Xinyu Wang | Shiqi Li | Xiangru Tang | Sirui Hong | Xiao-Wen Chang | Chenglin Wu | Bang Liu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs) often struggle with context fidelity, producing inconsistent answers when responding to questions based on provided information. Existing approaches either rely on expensive supervised fine-tuning to generate evidence post-answer or train models to perform web searches without necessarily improving utilization of the given context. We propose CARE, a novel native retrieval-augmented reasoning framework that teaches LLMs to explicitly integrate in-context evidence within their reasoning process with the model’s own retrieval capabilities. Our method requires limited labeled evidence data while significantly enhancing both retrieval accuracy and answer generation performance through strategically retrieved in-context tokens in the reasoning chain. Extensive experiments on multiple real-world and counterfactual QA benchmarks demonstrate that our approach substantially outperforms supervised fine-tuning, traditional retrieval-augmented generation methods, and external retrieval solutions. This work represents a fundamental advancement in making LLMs more accurate, reliable, and efficient for knowledge-intensive tasks.
STRICT: Stress-Test of Rendering Image Containing Text
Tianyu Zhang | Xinyu Wang | Lu Li | Zhenghan Tai | Jijun Chi | Jingrui Tian | Hailin He | Suyuchen Wang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
While diffusion models have revolutionized text-to-image generation with their ability to synthesize realistic and diverse scenes, they continue to struggle with generating consistent and legible text within images. This shortcoming is commonly attributed to the locality bias inherent in diffusion-based generation, which limits their capacity to model long-range spatial dependencies. In this paper, we introduce STRICT, a benchmark designed to systematically stress-test the ability of diffusion models to render coherent and instruction-aligned text in images. Our benchmark evaluates models across multiple dimensions: (1) the maximum length of readable text that can be generated and (2) the correctness and legibility of the generated text. We assess several state-of-the-art models, including proprietary and open-source variants, and reveal persistent limitations in long-range consistency and instruction-following capabilities. Our findings provide insights into architectural bottlenecks and motivate future research directions in multimodal generative modeling.
FACT: Examining the Effectiveness of Iterative Context Rewriting for Multi-fact Retrieval
Jinlin Wang | Suyuchen Wang | Ziwen Xia | Sirui Hong | Yun Zhu | Bang Liu | Chenglin Wu
Findings of the Association for Computational Linguistics: NAACL 2025
Large Language Models (LLMs) are proficient at retrieving single facts from extended contexts, yet they struggle with tasks requiring the simultaneous retrieval of multiple facts, especially during generation. This paper identifies a novel “lost-in-the-middle” phenomenon, where LLMs progressively lose track of critical information throughout the generation process, resulting in incomplete or inaccurate retrieval. To address this challenge, we introduce Find All Crucial Texts (FACT), an iterative retrieval method that refines context through successive rounds of rewriting. This approach enables models to capture essential facts incrementally, which are often overlooked in single-pass retrieval. Experiments demonstrate that FACT substantially enhances multi-fact retrieval performance across various tasks, though improvements are less notable in general-purpose QA scenarios. Our findings shed light on the limitations of LLMs in multi-fact retrieval and underscore the need for more resilient long-context retrieval strategies.
2024
Resonance RoPE: Improving Context Length Generalization of Large Language Models
Suyuchen Wang | Ivan Kobyzev | Peng Lu | Mehdi Rezagholizadeh | Bang Liu
Findings of the Association for Computational Linguistics: ACL 2024
This paper addresses the challenge of train-short-test-long (TSTL) scenarios in Large Language Models (LLMs) equipped with Rotary Position Embedding (RoPE), where models pre-trained on shorter sequences face difficulty with out-of-distribution (OOD) token positions in longer sequences. We introduce Resonance RoPE, a novel approach designed to narrow the generalization gap in TSTL scenarios by refining the interpolation of RoPE features for OOD positions, significantly improving model performance without additional online computational costs. Furthermore, we present PosGen, a new synthetic benchmark specifically designed for fine-grained behavior analysis in TSTL scenarios, aiming to isolate the constantly increasing difficulty of token generation on long contexts from the challenges of recognizing new token positions. Our experiments on synthetic tasks show that after applying Resonance RoPE, Transformers recognize OOD positions better and more robustly. Our extensive LLM experiments also show superior performance after applying Resonance RoPE to the current state-of-the-art RoPE scaling method, YaRN, on both upstream language modeling tasks and a variety of downstream long-text applications.
2023
Efficient Classification of Long Documents via State-Space Models
Peng Lu | Suyuchen Wang | Mehdi Rezagholizadeh | Bang Liu | Ivan Kobyzev
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Transformer-based models have achieved state-of-the-art performance on numerous NLP applications. However, long documents, which are prevalent in real-world scenarios, cannot be efficiently processed by transformers with the vanilla self-attention module due to their quadratic computation complexity and limited length extrapolation ability. Instead of tackling the computation difficulty for self-attention with sparse or hierarchical structures, in this paper, we investigate the use of State-Space Models (SSMs) for long document classification tasks. We conducted extensive experiments on six long document classification datasets, including binary, multi-class, and multi-label classification, comparing SSMs (with and without pre-training) to self-attention-based models. We also introduce the SSM-pooler model and demonstrate that it achieves comparable performance while being on average 36% more efficient. Additionally, our method exhibits higher robustness to input noise, even in the extreme scenario of 40%.
Co-authors
- Bang Liu 5
- Peng Lu 3
- Sirui Hong 2
- Ivan Kobyzev 2
- Mehdi Rezagholizadeh 2
- Jinlin Wang 2
- Xinyu Wang 2
- Chenglin Wu 2
- Yun Zhu 2
- Aishwarya Agrawal 1
- Sophia Ananiadou 1
- Rabiul Awal 1
- Yupeng Cao 1
- Xiao-Wen Chang 1
- Nuo Chen 1
- Xi Chen 1
- Jijun Chi 1
- Arman Cohan 1
- Zhiyang Deng 1
- Aarash Feizi 1
- Yun Feng 1
- Heming Fu 1
- Penglei Gao 1
- Spandana Gella 1
- Polydoros Giannouris 1
- Qing Gu 1
- Yuqing Guo 1
- Yi Han 1
- Yueru He 1
- Huan He 1
- Hailin He 1
- Bryan Hooi 1
- Zhiyuan Hu 1
- Jerry Huang 1
- Jimin Huang 1
- Mingyang Jiang 1
- Yuechen Jiang 1
- Zhiwei Jiang 1
- Haohang Li 1
- Zichao Li 1
- Shiqi Li 1
- Lu Li 1
- Shengyuan Lin 1
- Mingquan Lin 1
- Zhiwei Liu 1
- Xiao-Yang Liu 1
- Yuliang Liu 1
- Alejandro Lopez-Lira 1
- Mahsa Massoud 1
- See Kiong Ng 1
- Jian-Yun Nie 1
- Christopher Pal 1
- Triantafillos Papadopoulos 1
- Xueqing Peng 1
- Lingfei Qian 1
- Meikang Qiu 1
- Sai Rajeswar 1
- Siva Reddy 1
- Yang Ren 1
- Juan A. Rodriguez 1
- Wei Shen 1
- Kaleb E. Smith 1
- Efstathia Soufleri 1
- Zhenghan Tai 1
- Xiangru Tang 1
- Perouz Taslakian 1
- Jingrui Tian 1
- Jun’ichi Tsujii 1
- Luu Anh Tuan 1
- David Vazquez 1
- Yan Wang 1
- Xiaoyu Wang 1
- Keyi Wang 1
- Xiaoqiang Wang 1
- Yan Wang 1
- Ziwen Xia 1
- Ruoyu Xiang 1
- Qianqian Xie 1
- Guojun Xiong 1
- Shanshan Yang 1
- Yangyang Yu 1
- Vincent Jim Zhang 1
- Tianyu Zhang 1
- Jeff Zhao 1
- Yilun Zhao 1
- Yijia Zhao 1
- Jinman Zhao 1