Zhixian He
2026
Lunar-Bench: Towards Evaluating Task-Oriented Reasoning of LLMs in Lunar Exploration Scenarios
Xin-Yu Xiao | Ye Tian | Erwei Yin | Zhixian He | Shiqi Wang | Yalei Liu | Qianchen Xia
Findings of the Association for Computational Linguistics: ACL 2026
The increasing complexity of lunar exploration calls for intelligent systems capable of supporting autonomous operations and scientific decision-making under uncertain and resource-limited conditions. Advances in large language models (LLMs) create new opportunities for mission planning, but their reliability in dynamic, safety-critical environments remains insufficiently evaluated. Existing benchmarks focus on static, context-independent reasoning tasks and fail to capture the constraints and dependencies of lunar missions. To address this gap, we introduce Lunar-Bench, a benchmark designed to assess the task-oriented reasoning and decision-making performance of LLMs through 3,000 tasks derived from mission procedures and documentation. We further propose the Environmental Scenario Indicators (ESI), a process-based framework that evaluates safety, efficiency, integrity, and alignment beyond conventional accuracy. Experiments on 36 representative models show that the best model achieves 47.8% accuracy, compared with 65.1% for human experts. Lunar-Bench and ESI together provide a principled foundation for developing reliable systems for future missions.