Mizuki Arai




2025

Evaluating LLMs’ Ability to Understand Numerical Time Series for Text Generation
Mizuki Arai | Tatsuya Ishigaki | Masayuki Kawarada | Yusuke Miyao | Hiroya Takamura | Ichiro Kobayashi
Proceedings of the 18th International Natural Language Generation Conference

Data-to-text generation tasks often take numerical time series as input, such as financial statistics or meteorological data. Although large language models (LLMs) are a powerful approach to data-to-text generation, we still lack a comprehensive understanding of how well they actually understand time-series data. We therefore introduce a benchmark with 18 evaluation tasks to assess LLMs’ ability to interpret numerical time series. The tasks fall into four categories: 1) event detection: identifying maxima and minima; 2) computation: averaging and summation; 3) pairwise comparison: comparing values over time; and 4) inference: imputation and forecasting. Our experiments reveal five key findings: 1) even state-of-the-art LLMs struggle with complex multi-step reasoning; 2) accuracy drops significantly on tasks that require extracting values or performing computations within a specified range of the time series; 3) instruction tuning offers inconsistent improvements for numerical interpretation; 4) reasoning-based models outperform standard LLMs in complex numerical tasks; and 5) LLMs perform interpolation better than forecasting. These results establish a clear baseline and serve as a wake-up call for anyone aiming to blend fluent language with trustworthy numeric precision in time-series scenarios.
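
To make the four task categories concrete, the following is a minimal Python sketch of reference answers that such tasks might be scored against. It is an illustration only, not the benchmark's actual implementation: the toy series, the function names, the linear-neighbour imputation, and the last-value forecast are all assumptions introduced here.

```python
from statistics import mean

# Hypothetical toy series; not data from the paper.
series = [3.0, 5.0, 2.0, 8.0, 6.0]


def event_detection(values):
    """Event detection: return (index of maximum, index of minimum)."""
    i_max = max(range(len(values)), key=values.__getitem__)
    i_min = min(range(len(values)), key=values.__getitem__)
    return i_max, i_min


def computation(values, start, end):
    """Computation: return (sum, average) over the inclusive range [start, end]."""
    window = values[start:end + 1]
    return sum(window), mean(window)


def pairwise_comparison(values, i, j):
    """Pairwise comparison: describe values[i] relative to values[j]."""
    if values[i] > values[j]:
        return "higher"
    if values[i] < values[j]:
        return "lower"
    return "equal"


def impute_linear(values, i):
    """Inference (imputation): estimate an interior point as the mean of its
    two neighbours. This is an assumed baseline, not the paper's method."""
    return (values[i - 1] + values[i + 1]) / 2


def naive_forecast(values):
    """Inference (forecasting): repeat the last observed value.
    Also an assumed baseline, not the paper's method."""
    return values[-1]


print(event_detection(series))            # (3, 2)
print(computation(series, 1, 3))          # (15.0, 5.0)
print(pairwise_comparison(series, 0, 4))  # lower
print(impute_linear(series, 2))           # 6.5
print(naive_forecast(series))             # 6.0
```

The `computation` function takes an explicit index range because the abstract singles out range-restricted extraction and computation as a source of accuracy loss; an LLM's answer to a prompt such as "what is the average from step 1 to step 3?" could be checked against its return value.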