Mengliang He


2025

pdf bib
Mis-prompt: Benchmarking Large Language Models for Proactive Error Handling
Jiayi Zeng | Yizhe Feng | Mengliang He | Wenhui Lei | Wei Zhang | Zeming Liu | Xiaoming Shi | Aimin Zhou
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language models (LLMs) have demonstrated significant advancements in error handling. Current error-handling works are performed in a passive manner, with explicit error-handling instructions. However, in real-world scenarios, explicit error-handling instructions are usually unavailable. In this paper, our work identifies this challenge as how to conduct proactive error handling without explicit error handling instructions. To promote further research, this work introduces a new benchmark, termed Mis-prompt, consisting of four evaluation tasks, an error category taxonomy, and a new evaluation dataset. Furthermore, this work analyzes current LLMs’ performance on the benchmark, and the experimental results reveal that current LLMs show poor performance on proactive error handling, and SFT on error handling instances improves LLMs’ proactive error handling capabilities. The dataset will be publicly available.

pdf bib
Flow2Code: Evaluating Large Language Models for Flowchart-based Code Generation Capability
Mengliang He | Jiayi Zeng | Yankai Jiang | Wei Zhang | Zeming Liu | Xiaoming Shi | Aimin Zhou
Findings of the Association for Computational Linguistics: ACL 2025

While large language models (LLMs) show promise in code generation, existing benchmarks neglect the flowchart-based code generation. To promote further research on flowchart-based code generation, this work presents Flow2Code, a novel benchmark for flowchart-based code generation evaluation. The evaluation dataset spans 15 programming languages and includes 5,622 code segments paired with 16,866 flowcharts of three types: code, UML, and pseudocode. Extensive experiments with 13 multimodal LLMs reveal that current LLMs can not generate code based on flowcharts perfectly. Besides, experiment results show that the supervised fine-tuning technique contributes greatly to the models’ performance. The dataset will be publicly available.