Ziyi Wang
Papers on this page may belong to the following people: Ziyi Wang, Ziyi Wang
2026
TriEx: A Game-based Tri-View Framework for Explaining Internal Reasoning in Multi-Agent LLMs
Ziyi Wang | Chen Zhang | Wenjun Peng | Qi Wu | Xinyu Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Ziyi Wang | Chen Zhang | Wenjun Peng | Qi Wu | Xinyu Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Explainability for Large Language Model (LLM) agents is especially challenging in interactive, partially observable settings, where decisions depend on evolving beliefs and other agents. We present TriEx, a tri-view explainability framework that instruments sequential decision making with aligned artifacts: (i) structured first-person self-reasoning bound to an action, (ii) explicit second-person belief states about opponents updated over time, and (iii) third-person oracle audits grounded in environment-derived reference signals. This design turns explanations from free-form narratives into evidence-anchored objects that can be compared and checked across time and perspectives. Using imperfect-information strategic games as a controlled testbed, we show that TriEx enables scalable analysis of explanation faithfulness, belief dynamics, and evaluator reliability, revealing systematic mismatches between what agents say, what they believe, and what they do. Our results highlight explainability as an interaction-dependent property and motivate multi-view, evidence-grounded evaluation for LLM agents.
2025
Are Your LLMs Capable of Stable Reasoning?
Junnan Liu | Hongwei Liu | Linchen Xiao | Ziyi Wang | Kuikun Liu | Songyang Gao | Wenwei Zhang | Songyang Zhang | Kai Chen
Findings of the Association for Computational Linguistics: ACL 2025
Junnan Liu | Hongwei Liu | Linchen Xiao | Ziyi Wang | Kuikun Liu | Songyang Gao | Wenwei Zhang | Songyang Zhang | Kai Chen
Findings of the Association for Computational Linguistics: ACL 2025
The rapid advancement of large language models (LLMs) has shown remarkable progress in complex reasoning tasks. However, a significant disparity exists between benchmark performances and real-world applications. We attribute this gap primarily to current evaluation protocols and metrics, which inadequately capture the full spectrum of LLM capabilities, especially in complex reasoning tasks where both accuracy and consistency are essential. In this paper, we introduce **G-Pass@**k, a novel evaluation metric that continuously assesses model performance across multiple sampling attempts, quantifying both the model’s performance potential and its stability. Through extensive experiments on various public and newly constructed benchmarks, we employ G-Pass@k in conjunction with state-of-the-art large language models to provide comprehensive insights into their potential capabilities and operational consistency. Our findings reveal a significant opportunity to enhance the realistic reasoning abilities of LLMs, underscoring the necessity for more robust evaluation metrics.
2024
Self-Training with Pseudo-Label Scorer for Aspect Sentiment Quad Prediction
Yice Zhang | Jie Zeng | Weiming Hu | Ziyi Wang | Shiwei Chen | Ruifeng Xu
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Yice Zhang | Jie Zeng | Weiming Hu | Ziyi Wang | Shiwei Chen | Ruifeng Xu
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Aspect Sentiment Quad Prediction (ASQP) aims to predict all quads (aspect term, aspect category, opinion term, sentiment polarity) for a given review, which is the most representative and challenging task in aspect-based sentiment analysis. A key challenge in the ASQP task is the scarcity of labeled data, which limits the performance of existing methods. To tackle this issue, we propose a self-training framework with a pseudo-label scorer, wherein a scorer assesses the match between reviews and their pseudo-labels, aiming to filter out mismatches and thereby enhance the effectiveness of self-training. We highlight two critical aspects to ensure the scorer’s effectiveness and reliability: the quality of the training dataset and its model architecture. To this end, we create a human-annotated comparison dataset and train a generative model on it using ranking-based objectives. Extensive experiments on public ASQP datasets reveal that using our scorer can greatly and consistently improve the effectiveness of self-training. Moreover, we explore the possibility of replacing humans with large language models for comparison dataset annotation, and experiments demonstrate its feasibility. We will release our code and data via GitHub.