Jingjing Wu
2026
ChessArena: A Chess Testbed for Evaluating Strategic Reasoning Capabilities of Large Language Models
Jincheng Liu | Sijun He | Jingjing Wu | Xiangsen Wang | Yang Chen | Zhaoqi Kuang | Siqi Bao | Yuan Yao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent large language models (LLMs) have shown strong reasoning capabilities. However, a critical question remains: do these models possess genuine strategic reasoning, or do they primarily excel at pattern recognition? To address this, we present ChessArena, a chess-based testbed for evaluating LLMs. Chess demands strategic reasoning, precise rule adherence, and the ability to track complex game states. ChessArena is a competitive framework where LLMs play against each other under four play modes. We evaluate 13 LLMs in more than 800 games, testing basic understanding, move selection, and puzzle solving. Results reveal significant shortcomings: no model beats Maia-1100 (human amateur level), and some lose to random play. We also present a strong baseline: our fine-tuned Qwen3-8B substantially improves performance, approaching much larger state-of-the-art reasoning models.
2025
Theorem-Validated Reverse Chain-of-Thought Problem Generation for Geometric Reasoning
Deng Linger | Linghao Zhu | Yuliang Liu | Yu Wang | Qunyi Xie | Jingjing Wu | Gang Zhang | Yingying Zhu | Xiang Bai
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large Multimodal Models (LMMs) face limitations in geometric reasoning due to insufficient Chain of Thought (CoT) image-text training data. While existing approaches leverage template-based or LLM-assisted methods for geometric CoT data creation, they often face challenges in achieving both diversity and precision. To bridge this gap, we introduce a two-stage Theorem-Validated Reverse Chain-of-Thought Reasoning Synthesis (TR-CoT) framework. The first stage, TR-Engine, synthesizes theorem-grounded geometric diagrams with structured descriptions and properties. The second stage, TR-Reasoner, employs reverse reasoning to iteratively refine question-answer pairs by cross-validating geometric properties and description fragments. Our approach expands theorem-type coverage, corrects long-standing misunderstandings, and enhances geometric reasoning. Fine-grained CoT improves theorem understanding and increases logical consistency by 24.5%. Our best models surpass the baselines on MathVista and GeoQA by 10.1% and 4.7%, respectively, outperforming advanced closed-source models like GPT-4o.