Wen Ma


2025

Mitigating Posterior Salience Attenuation in Long-Context LLMs with Positional Contrastive Decoding
Zikai Xiao | Ziyang Wang | Wen Ma | Yan Zhang | Wei Shen | WangYan WangYan | Luqi Gong | Zuozhu Liu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

While Large Language Models (LLMs) support long contexts, they suffer from performance degradation within the context window. Current solutions incur prohibitive training costs, leaving statistical behaviors and cost-effective approaches underexplored. From the decoding perspective, we identify the Posterior Salience Attenuation (PSA) phenomenon, where the salience ratio correlates with long-text performance degradation. Notably, despite the attenuation, gold tokens still occupy high-ranking positions in the decoding space. Motivated by this observation, we propose the training-free Positional Contrastive Decoding (PCD), which contrasts the logits derived from long-aware attention with those from a designed local-aware attention, enabling the model to focus on the gains introduced by large-scale short-to-long training. Through an analysis of long-term decay simulations, we demonstrate that PCD effectively alleviates attention score degradation. Experimental results show that PCD achieves state-of-the-art performance on long-context benchmarks.
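The abstract describes contrasting two logit distributions, one from a long-aware attention pass and one from a local-aware pass. Below is a minimal sketch of that idea in the style of standard contrastive decoding; the linear combination rule, the `alpha` weight, and the function name are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def positional_contrastive_decode(logits_long, logits_local, alpha=0.5):
    """Contrast logits from a long-aware attention pass with logits from a
    restricted local-aware pass, amplifying what long-context training adds
    beyond local context alone. The combination rule here is an illustrative
    assumption; the paper's exact contrast formula is not in the abstract."""
    return (1.0 + alpha) * logits_long - alpha * logits_local

# Toy usage: two vocabulary-sized logit vectors from the two attention passes.
vocab_size = 8
rng = np.random.default_rng(0)
logits_long = rng.normal(size=vocab_size)   # full long-aware attention pass
logits_local = rng.normal(size=vocab_size)  # local-window attention pass
next_token = int(np.argmax(positional_contrastive_decode(logits_long, logits_local)))
print(next_token)
```

Since the method operates purely on output logits at decoding time, it requires no retraining, consistent with the training-free claim.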

LongWeave: A Long-Form Generation Benchmark Bridging Real-World Relevance and Verifiability
Zikai Xiao | Fei Huang | Jianhong Tu | Jianhui Wei | Wen Ma | Yuxuan Zhou | Jian Wu | Bowen Yu | Zuozhu Liu | Junyang Lin
Findings of the Association for Computational Linguistics: EMNLP 2025

Generating long, informative, and factual outputs remains a major challenge for Large Language Models (LLMs). Existing benchmarks for long-form generation typically assess real-world queries with hard-to-verify metrics or use synthetic setups that ease evaluation but overlook real-world intricacies. In this paper, we introduce LongWeave, which balances real-world relevance and verifiability through Target-Anchored Evaluation (TAE). TAE constructs tasks by first defining verifiable targets within real-world scenarios, then systematically generating corresponding queries, textual materials, and anchors based on these targets. This ensures that tasks are both realistic and objectively assessable, enabling rigorous assessment of model capabilities in meeting complex real-world constraints. LongWeave supports customizable input/output lengths (up to 64K/8K tokens) across seven distinct tasks. Evaluation on 23 LLMs shows that even state-of-the-art models encounter significant challenges in long-form generation as real-world complexity and output length increase. The dataset will be made publicly available.
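The abstract's anchor-based scoring idea can be illustrated with a small sketch: a long-form output is scored against the verifiable anchors defined for its target. The function name and the exact-substring matching are hypothetical simplifications; the benchmark's real anchor verification is presumably more sophisticated.

```python
def tae_anchor_score(output_text: str, anchors: list[str]) -> float:
    """Score a long-form output as the fraction of predefined verifiable
    anchors it contains. Case-insensitive substring matching is an
    assumption for illustration, not LongWeave's actual metric."""
    text = output_text.lower()
    hits = sum(1 for anchor in anchors if anchor.lower() in text)
    return hits / len(anchors)

# Toy usage: anchors derived from a verifiable target.
anchors = ["founded in 1998", "headquartered in Zurich", "three subsidiaries"]
output = "The firm, founded in 1998 and headquartered in Zurich, grew rapidly."
print(tae_anchor_score(output, anchors))  # 2 of 3 anchors present -> ~0.67
```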

2020

Evaluation of Pretrained BERT Model by Using Sentence Clustering
Naoki Shibayama | Rui Cao | Jing Bai | Wen Ma | Hiroyuki Shinnou
Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation