Junwei Yang
2026
Failure makes the agent stronger: Enhancing Accuracy through Structured Reflection for Reliable Tool Interactions
Junhao Su | Yuanliang Wan | Junwei Yang | Hengyu Shi | Tianyang Han | Yurui Qiu | Junfeng Luo
Findings of the Association for Computational Linguistics: ACL 2026
Tool-augmented large language models (LLMs) are typically trained via supervised imitation learning or coarse-grained reinforcement learning, approaches that primarily optimize one-shot tool calls. Existing practices of self-reflection largely rely on heuristic prompting or unidirectional reasoning traces: the model is encouraged to “think more,” rather than to treat error diagnosis and correction as a learnable capability. This makes them fragile in multi-turn interaction settings: once a call fails, the model tends to repeat the same mistake instead of recovering. To address this issue, we propose structured reflection, which transforms the “from error to repair” process into a first-class, controllable, and trainable action. The agent produces a concise yet precise reflection process: specifically, the model diagnoses the error based on evidence from the previous step and then proposes a correct and executable follow-up call. During training, we combine the objective functions of DAPO and GSPO and design a more principled reward mechanism tailored to tool calling, optimizing the stepwise strategy Reflect → Call → Final. To evaluate this capability, we introduce Tool-Reflection-Bench, a lightweight benchmark dataset that programmatically verifies structural validity, executability, parameter correctness, and result consistency. Tasks in the benchmark are constructed as miniature trajectories of Erroneous Call → Reflection → Corrected Call and are split into disjoint training and testing sets. Experiments on BFCL v3 and Tool-Reflection-Bench show that our method achieves significant improvements in multi-turn tool-call success rates and error recovery, while also reducing redundant calls. These results demonstrate that making reflection explicit and treating it as an optimization objective can substantially enhance the reliability of tool interaction, providing a reproducible pathway for agents to grow stronger by learning from failure. We will release all the code and datasets as open source once the paper is accepted by the community.
SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models
Yiyang Gu | Junwei Yang | Junyu Luo | Ye Yuan | Bin Feng | Yingce Xia | Shufang Xie | Kaili Liu | Bohan Wu | Qi Shi | Haoran Li | Beier Xiao | Zhiping Xiao | Xiao Luo | Weizhi Zhang | Philip S. Yu | Zequn Liu | Ming Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) are increasingly applied to scientific research, yet existing evaluations often fail to reflect the fine-grained capabilities required in practice. Most benchmarks are manually curated or domain-generic, limiting scalability and alignment with real scientific use cases. In this paper, we propose a new framework named SciCustom to address the problem. It enables the custom construction of benchmarks from large-scale scientific data to evaluate application-specific scientific capabilities in LLMs. SciCustom first organizes scientific knowledge into ontology-grounded knowledge units with controlled granularity and trains a tagger to map large-scale data instances into this knowledge space. Given a custom requirement, relevant knowledge units are identified via voting-based multi-model consensus. These units enable relevance-aware benchmark retrieval via binary search, followed by proxy subset selection and data-grounded benchmark generation for efficient evaluation. Experiments in chemistry and healthcare demonstrate that SciCustom reveals fine-grained differences in LLM scientific capabilities that standard benchmarks overlook, while requiring neither expert annotation nor synthetic question generation. This work provides a scalable and application-aware foundation for benchmarking scientific capabilities in LLMs.
2022
Pathway2Text: Dataset and Method for Biomedical Pathway Description Generation
Junwei Yang | Zequn Liu | Ming Zhang | Sheng Wang
Findings of the Association for Computational Linguistics: NAACL 2022
Biomedical pathways have been extensively used to characterize the mechanism of complex diseases. One essential step in biomedical pathway analysis is to curate the description of a pathway based on its graph structure and node features. Neural text generation could be a plausible technique to circumvent the tedious manual curation. In this paper, we propose a new dataset, Pathway2Text, which contains 2,367 pairs of biomedical pathways and textual descriptions. All pathway graphs are experimentally derived or manually curated. All textual descriptions are written by domain experts. We formulate this problem as a Graph2Text task and propose a novel graph-based text generation approach, kNN-Graph2Text, which explicitly exploits descriptions of similar graphs to generate new descriptions. We observed substantial improvement of our method on both Graph2Text and the reverse task of Text2Graph. We further illustrated how our dataset can be used as a novel benchmark for biomedical named entity recognition. Collectively, we envision our method will become an important benchmark for evaluating Graph2Text methods and advance biomedical research for complex diseases.
MetaFill: Text Infilling for Meta-Path Generation on Heterogeneous Information Networks
Zequn Liu | Kefei Duan | Junwei Yang | Hanwen Xu | Ming Zhang | Sheng Wang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Heterogeneous information networks (HINs) are essential for studying complicated networks containing multiple edge types and node types. The meta-path, a sequence of node types and edge types, is the core technique for embedding HINs. Since manually curating meta-paths is time-consuming, there is a pressing need to develop automated meta-path generation approaches. Existing meta-path generation approaches cannot fully exploit the rich textual information in HINs, such as node names and edge type names. To address this problem, we propose MetaFill, a text-infilling-based approach for meta-path generation. The key idea of MetaFill is to formulate the meta-path identification problem as a word sequence infilling problem, which can be advanced by pretrained language models (PLMs). We observed the superior performance of MetaFill against existing meta-path generation methods and graph embedding methods that do not leverage meta-paths in both link prediction and node classification on two real-world HIN datasets. We further demonstrated how MetaFill can accurately classify edges in the zero-shot setting, where existing approaches cannot generate any meta-paths. MetaFill exploits PLMs to generate meta-paths for graph embedding, opening up new avenues for language model applications in graph analysis.
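As an illustration of the terms used in the MetaFill abstract, the sketch below represents a meta-path as alternating node types and edge types and renders it as a word sequence, together with a masked template of the kind a text-infilling model might be asked to complete. This is not the authors' implementation: the `MetaPath` class, the `infilling_template` helper, the `[MASK]` token, and the example type names are all hypothetical and chosen only to make the definition concrete.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MetaPath:
    """Hypothetical container: a sequence of node types joined by edge types."""
    node_types: Tuple[str, ...]  # e.g. ("author", "paper", "author")
    edge_types: Tuple[str, ...]  # one edge type between each adjacent pair of node types

    def __post_init__(self) -> None:
        # A path over n node types needs exactly n - 1 edge types.
        if len(self.edge_types) != len(self.node_types) - 1:
            raise ValueError("need exactly one edge type between adjacent node types")

    def to_text(self) -> str:
        """Render the meta-path as a plain word sequence."""
        parts = [self.node_types[0]]
        for edge, node in zip(self.edge_types, self.node_types[1:]):
            parts += [edge, node]
        return " ".join(parts)


def infilling_template(head: str, tail: str, mask: str = "[MASK]") -> str:
    """Assumed template format: the span between two node types is left blank."""
    return f"{head} {mask} {tail}"


# Example types are illustrative, not taken from the paper's datasets.
apa = MetaPath(("author", "paper", "author"), ("writes", "written by"))
print(apa.to_text())
print(infilling_template("author", "author"))
```

In this toy form, generating a meta-path between two node types amounts to filling the masked span with a sequence of edge-type and node-type names, which is why the node and edge names in the network matter as text.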