Zike Yuan


2025

pdf bib
GraCoRe: Benchmarking Graph Comprehension and Complex Reasoning in Large Language Models
Zike Yuan | Ming Liu | Hui Wang | Bing Qin
Proceedings of the 31st International Conference on Computational Linguistics

Evaluating the graph comprehension and reasoning abilities of Large Language Models (LLMs) is challenging and often incomplete. Existing benchmarks focus primarily on pure graph understanding, lacking a comprehensive evaluation across all graph types and detailed capability definitions. This paper presents GraCoRe, a benchmark for systematically assessing LLMs’ graph comprehension and reasoning. GraCoRe uses a three-tier hierarchical taxonomy to categorize and test models on pure graph and heterogeneous graphs, subdividing capabilities into 10 distinct areas tested through 19 tasks. Our benchmark includes 11 datasets with 5,140 graphs of varying complexity. We evaluate four closed-source and eight open-source LLMs, conducting thorough analyses from both ability and task perspectives. Key findings reveal that OpenAI o1 model has amazing comprehension and reasoning capabilities, semantic enrichment enhances reasoning performance, node ordering impacts task success, and the ability to process longer texts does not necessarily improve graph comprehension or reasoning.

pdf bib
From Long to Lean: Performance-aware and Adaptive Chain-of-Thought Compression via Multi-round Refinement
JianZhi Yan | Le Liu | Youcheng Pan | Shiwei Chen | Zike Yuan | Yang Xiang | Buzhou Tang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Chain-of-Thought (CoT) reasoning improves performance on complex tasks but introduces significant inference latency due to its verbosity. In this work, we propose Multiround Adaptive Chain-of-Thought Compression (MACC), a framework that leverages the token elasticity phenomenon—where overly small token budgets may paradoxically increase output length—to progressively compress CoTs via multiround refinement. This adaptive strategy allows MACC to dynamically determine the optimal compression depth for each input. Our method achieves an average accuracy improvement of 5.6% over state-of-the-art baselines, while also reducing CoT length by an average of 47 tokens and significantly lowering latency. Furthermore, we show that test-time performance—accuracy and token length—can be reliably predicted using interpretable features like perplexity and compression rate on training set. Evaluated across different models, our method enables efficient model selection and forecasting without repeated fine-tuning, demonstrating that CoT compression is both effective and predictable. Our code will be released in https://github.com/Leon221220/MACC.

pdf bib
MA-GTS: A Multi-Agent Framework for Solving Complex Graph Problems in Real-World Applications
Zike Yuan | Ming Liu | Hui Wang | Bing Qin
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Graph-theoretic problems arise in real-world applications like logistics, communication networks, and traffic optimization. These problems are often complex, noisy, and irregular, posing challenges for traditional algorithms. Large language models offer potential solutions but face several challenges, including limited accuracy, input length constraints, and suboptimal algorithm selection. To address these challenges, we propose MA-GTS(Multi-Agent Graph Theory Solver), a multi-agent framework that decomposes these complex problems through agent collaboration. MA-GTS maps the implicitly expressed text-based graph data into clear, structured graph representations and dynamically selects the most suitable algorithm based on problem constraints and graph structure scale. We validate MA-GTS using the G-REAL dataset, a real-world-inspired graph theory dataset we created. Experimental results show that MA-GTS outperforms state-of-the-art methods in cost-effectiveness, accuracy, and scalability, achieving strong results on multiple benchmarks (G-REAL 93.6%, GraCoRe 96.9% ,NLGraph 98.4%) with robust performance on both closed- and open-source models.