Shaobo Wang
2026
Stability Implies Redundancy: Delta Attention Selective Halting for Efficient Long-Context Prefilling
Yujie Chen | Tailai Chen | Yifeng Gao | Zoe Wanying He | Yijue Xu | Shaobo Wang | Linfeng Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Prefilling computational costs pose a significant bottleneck for Large Language Models (LLMs) and Large Multimodal Models (LMMs) in long-context settings. While token pruning reduces sequence length, prior methods rely on heuristics that break compatibility with hardware-efficient kernels like FlashAttention. In this work, we observe that tokens evolve toward semantic fixed points, making further processing redundant. Building on this observation, we introduce Delta Attention Selective Halting (DASH), a training-free policy that monitors the layer-wise update dynamics of the self-attention mechanism to selectively halt stabilized tokens. Extensive evaluation confirms that DASH generalizes across language and vision benchmarks, delivering significant prefill speedups while preserving model accuracy and hardware efficiency. Code will be released at https://github.com/verach3n/DASH.git .
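The halting rule described above can be illustrated with a minimal NumPy sketch. It assumes that a token counts as "stabilized" once its relative per-layer update, ||h_l - h_{l-1}|| / ||h_{l-1}||, falls below a threshold `tau`; the threshold, the function names `relative_update` and `prefill_with_halting`, and the toy layers are all hypothetical and are not taken from the DASH code. For simplicity, each layer here acts on active tokens only; in a real transformer, halted tokens would still need to supply keys and values to the attention of the remaining tokens.

```python
import numpy as np


def relative_update(h_prev: np.ndarray, h_curr: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Per-token relative change ||h_curr - h_prev|| / ||h_prev||, shape [num_tokens]."""
    delta = np.linalg.norm(h_curr - h_prev, axis=-1)
    base = np.linalg.norm(h_prev, axis=-1)
    return delta / (base + eps)


def prefill_with_halting(h0: np.ndarray, layers, tau: float = 0.05):
    """Apply `layers` to active tokens only; halt a token once its relative update < tau.

    Hypothetical sketch: returns the final hidden states and, per token, the layer
    index at which it halted (-1 if it never halted).
    """
    h = np.array(h0, dtype=float)  # working copy of the hidden states
    active = np.ones(h.shape[0], dtype=bool)
    halted_at = np.full(h.shape[0], -1)
    for i, layer in enumerate(layers):
        idx = np.flatnonzero(active)
        if idx.size == 0:  # every token has halted; skip the remaining layers
            break
        h_new = layer(h[idx])  # toy layer: maps [n_active, d] -> [n_active, d]
        stable = relative_update(h[idx], h_new) < tau
        h[idx] = h_new  # keep this layer's output, then freeze the stable tokens
        halted_at[idx[stable]] = i
        active[idx[stable]] = False
    return h, halted_at
```

With two tokens that each receive an update of [1, 0] per layer and `tau=0.095`, the token starting at [10, 0] halts at layer 1 while the token starting at [1, 0] keeps being updated, because the same absolute change is relatively smaller for the larger hidden state.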
FinReporting: An Agentic Workflow for Localized Reporting of Cross-Jurisdiction Financial Disclosure
Fan Zhang | Mingzi Song | Rania Elbadry | Yankai Chen | Shaobo Wang | Yixi Zhou | Xunwen Zheng | Yueru He | Yuyang Dai | Georgi Nenkov Georgiev | Ayesha Gull | Muhammad Usman Safder | Fan Wu | Liyuan Meng | Fengxian Ji | Junning Zhao | Xueqing Peng | Jimin Huang | YU Chen | Xue Liu | Preslav Nakov | Zhuohan Xie
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Financial reporting systems increasingly leverage Large Language Models (LLMs) to extract and summarize corporate disclosures. However, most existing approaches assume a single-market setting and overlook structural differences across jurisdictions. Variations in accounting taxonomies, tagging infrastructures (e.g., XBRL vs. PDF), and aggregation conventions introduce substantial challenges for semantic alignment and reliable verification. Here, we aim to bridge this gap. We present FinReporting, an agentic workflow for localized cross-jurisdiction financial reporting. The system constructs a unified canonical ontology spanning the income statement, balance sheet, and cash flow statement, and decomposes reporting into auditable stages, including filing acquisition, extraction, canonical mapping, and anomaly logging. Rather than treating LLMs as free-form generators, FinReporting employs them as constrained verifiers operating under explicit decision rules with evidence grounding. Evaluated on annual filings from the USA, Japan, and China, FinReporting improves consistency and reliability under heterogeneous reporting regimes. We further release an interactive demo that enables cross-market inspection and supports structured export of localized financial statements. Our demo is available at https://huggingface.co/spaces/BoomQ/FinReporting-Demo. A video describing our system is available at https://www.youtube.com/watch?v=f65jdEL31Kk.
2025
Stop Looking for “Important Tokens” in Multimodal Language Models: Duplication Matters More
Zichen Wen | Yifeng Gao | Shaobo Wang | Junyuan Zhang | Qintong Zhang | Weijia Li | Conghui He | Linfeng Zhang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Vision tokens in multimodal large language models often account for most of the computational overhead because they are far more numerous than language tokens. Many recent methods address this problem with token pruning, which first defines an importance criterion for tokens and then prunes the unimportant vision tokens during inference. However, in this paper, we show that importance is not an ideal indicator of whether a token should be pruned. Surprisingly, importance-based pruning usually performs worse than random token pruning and is incompatible with efficient attention computation operators. Instead, we propose DART (Duplication-Aware Reduction of Tokens), which prunes tokens based on their duplication with other tokens, yielding significant, training-free acceleration. Concretely, DART selects a small subset of pivot tokens and then retains the tokens with low duplication to the pivots, ensuring minimal information loss during token pruning. Experiments demonstrate that DART can prune 88.9% of vision tokens while maintaining comparable performance, yielding 1.99× and 2.99× speed-ups in total time and in the prefilling stage, respectively, while remaining compatible with efficient attention operators.
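The pivot-and-duplication procedure can be sketched in a few lines of NumPy. This is an illustrative sketch, not DART's released implementation: it assumes that duplication is measured by cosine similarity, that a token's duplication score is its maximum similarity to any pivot, and that pivots are the tokens with the largest L2 norm. The function name `dart_style_prune` and its parameters are hypothetical.

```python
import numpy as np


def dart_style_prune(tokens: np.ndarray, num_pivots: int, keep: int) -> np.ndarray:
    """Return sorted indices of tokens to keep: pivots plus the least-duplicated tokens.

    Hypothetical sketch: duplication = max cosine similarity to any pivot.
    """
    n = tokens.shape[0]
    if not 0 < num_pivots <= keep:
        raise ValueError("need 0 < num_pivots <= keep")
    if keep >= n:  # budget covers every token; nothing to prune
        return np.arange(n)
    # Assumption: pivots are the tokens with the largest L2 norm.
    norms = np.linalg.norm(tokens, axis=-1)
    pivots = np.argsort(-norms, kind="stable")[:num_pivots]
    # Cosine similarity of every token to every pivot, reduced to the max per token.
    unit = tokens / (norms[:, None] + 1e-8)
    dup = (unit @ unit[pivots].T).max(axis=1)
    dup[pivots] = np.inf  # pivots are always kept, so exclude them from the ranking
    # Keep the remaining budget for the tokens least similar to any pivot.
    others = np.argsort(dup, kind="stable")[: keep - num_pivots]
    return np.sort(np.concatenate([pivots, others]))
```

For the four tokens [[3, 0], [2.9, 0.1], [0, 1], [1, 1]] with one pivot and a budget of three, the sketch keeps indices [0, 2, 3], dropping [2.9, 0.1] because it nearly duplicates the pivot [3, 0].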
Data Whisperer: Efficient Data Selection for Task-Specific LLM Fine-Tuning via Few-Shot In-Context Learning
Shaobo Wang | Xiangqi Jin | Ziming Wang | Jize Wang | Jiajun Zhang | Kaixin Li | Zichen Wen | Zhong Li | Conghui He | Xuming Hu | Linfeng Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Fine-tuning large language models (LLMs) on task-specific data is essential for their effective deployment. As dataset sizes grow, efficiently selecting optimal subsets for training becomes crucial to balancing performance and computational costs. Traditional data selection methods often require fine-tuning a scoring model on the target dataset, which is time-consuming and resource-intensive, or rely on heuristics that fail to fully leverage the model’s predictive capabilities. To address these challenges, we propose Data Whisperer, an efficient, training-free, attention-based method that leverages few-shot in-context learning with the model to be fine-tuned. Comprehensive evaluations were conducted on both raw and synthetic datasets across diverse tasks and models. Notably, on the Llama-3-8B-Instruct model, fine-tuning on just 10% of the GSM8K data selected by Data Whisperer outperforms fine-tuning on the full dataset, and Data Whisperer outperforms existing methods with a 3.1-point improvement and a 7.4× speedup.
Co-authors
- Linfeng Zhang 3
- Yifeng Gao 2
- Conghui He 2
- Zichen Wen 2
- Tailai Chen 1
- YU Chen (陈昱) 1
- Yankai Chen 1
- Yujie Chen 1
- Yuyang Dai 1
- Rania Elbadry 1
- Georgi Nenkov Georgiev 1
- Ayesha Gull 1
- Yueru He 1
- Zoe Wanying He 1
- Xuming Hu 1
- Jimin Huang 1
- Fengxian Ji 1
- Xiangqi Jin 1
- Kaixin Li 1
- Weijia Li 1
- Zhong Li 1
- Xue Liu 1
- Liyuan Meng 1
- Preslav Nakov 1
- Xueqing Peng 1
- Muhammad Usman Safder 1
- Mingzi Song 1
- Jize Wang 1
- Ziming Wang 1
- Fan Wu 1
- Zhuohan Xie 1
- Yijue Xu 1
- Fan Zhang 1
- Jiajun Zhang 1
- Junyuan Zhang 1
- Qintong Zhang 1
- Junning Zhao 1
- Xunwen Zheng 1
- Yixi Zhou 1