Van Yang


Fixing paper assignments

  1. Please select all papers that do not belong to this person.
  2. Indicate below which author they should be assigned to.
Provide a valid ORCID iD here. This will be used to match future papers to this author.
Provide the name of the school or the university where the author has received or will receive their highest degree (e.g., Ph.D. institution for researchers, or current affiliation for students). This will be used to form the new author page ID, if needed.

TODO: "submit" and "cancel" buttons here


2025

pdf bib
100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?
Van Yang | Hongye Jin | Shaochen Zhong | Song Jiang | Qifan Wang | Vipin Chaudhary | Xiaotian Han
Findings of the Association for Computational Linguistics: ACL 2025

Long-context capability is considered one of the most important abilities of LLMs, as a truly long context-capable LLM shall enable its users to effortlessly process many originally exhausting tasks — e.g., digesting a long-form document to find answers v.s., directly asking an LLM about it. However, existing real-task-based long-context evaluation benchmarks have a few major shortcomings. For instance, some Needle-in-a-Haystack-like benchmarks are too synthetic, and therefore do not represent the real world usage of LLMs. While some real-task-based benchmarks like LongBench avoid this problem, such benchmarks are often formed in a way where each data sample has a fixed sequence length, which not only makes them solely suitable for models with a certain range of context windows, but also lacks a proxy to know at what length the model/method-of-interest would fail. Last, most benchmarks tend to not provide proper metrics to separate long-context performance from the model’s baseline ability, so when conducting a cross-model/recipe comparison, such conflation makes the user unable to understand how exactly one model or recipe excels at the long-context task in relation to its baseline ability. To address these issues, we introduce a length-controllable, real-life reflective benchmark with a novel metric that disentangles baseline knowledge from long-context capabilities. Experiments demonstrate the superiority of our datasets in effectively evaluating LLMs. All assets are available at https://github.com/uservan/100-LongBench.git.