Sanhorn Chen
2026
TSAQA: Time Series Analysis Question And Answering Benchmark
Baoyu Jing | Sanhorn Chen | Lecheng Zheng | Boyu Liu | Zihao Li | Jiaru Zou | Tianxin Wei | Zhining Liu | Zhichen Zeng | Ruizhong Qiu | Xiao Lin | Yuchen Yan | Dongqi Fu | Jingchao Ni | Jingrui He | Hanghang Tong
Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
Time series data are integral to applications across domains such as finance, healthcare, transportation, and environmental science. While recent work has begun to explore time series question answering (QA), existing benchmarks still provide limited coverage of analytical capabilities under a standardized evaluation framework. We introduce TSAQA, a novel unified benchmark designed to broaden task coverage and evaluate diverse temporal analysis capabilities. TSAQA integrates 6 diverse tasks under a single framework, ranging from conventional analysis, including anomaly detection and classification, to advanced analysis, such as characterization, comparison, data transformation, and temporal relationship analysis. Spanning 210k samples across 13 domains, the dataset employs diverse formats, including true-or-false (TF), multiple-choice (MC), and a novel puzzling (PZ) format, to comprehensively assess time series analysis. Zero-shot evaluation shows that TSAQA remains challenging for current Large Language Models (LLMs): the best-performing commercial model, Gemini-2.5-Flash, achieves an average accuracy of 65.08. Although instruction tuning improves open-source models’ performance, even the best-performing model, LLaMA-3.1-8B, still shows significant room for improvement. We further evaluate language-capable time series foundation models (TSFMs), demonstrating that TSAQA is applicable beyond general-purpose LLMs. The data are available at https://huggingface.co/datasets/TSAQA/TSAQA-Benchmark.