Sanhorn Chen

2026

Time series data are integral to applications across domains such as finance, healthcare, transportation, and environmental science. While recent work has begun to explore time series question answering (QA), existing benchmarks still provide limited coverage of analytical capabilities under a standardized evaluation framework. We introduce TSAQA, a novel unified benchmark designed to broaden task coverage and evaluate diverse temporal analysis capabilities. TSAQA integrates six diverse tasks under a single framework, ranging from conventional analysis, including anomaly detection and classification, to advanced analysis, such as characterization, comparison, data transformation, and temporal relationship analysis. Spanning 210k samples across 13 domains, the dataset employs diverse formats, including true-or-false (TF), multiple-choice (MC), and a novel puzzling (PZ) format, to comprehensively assess time series analysis. Zero-shot evaluation shows that TSAQA remains challenging for current Large Language Models (LLMs): the best-performing commercial model, Gemini-2.5-Flash, achieves 65.08 average accuracy. Although instruction tuning improves open-source models’ performance, even the best-performing model, LLaMA-3.1-8B, shows significant room for improvement. We further evaluate language-capable time series foundation models (TSFMs), showing that TSAQA extends beyond general-purpose LLMs. The data are available at https://huggingface.co/datasets/TSAQA/TSAQA-Benchmark.

Co-authors

Boyu Liu 1

Venues

GEM 1
WS 1
