Xinyan Zhang
2026
Neo-Classic: A Benchmark for Evaluating Linguistic-Aesthetic Reasoning in Classical Chinese Poetry
Han Zhang | Zihan Gu | Zhiyuan Wang | Tianyi Ma | Jiacheng Lu | Xinyan Zhang | Yuhao Wei | Cheng Hua
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
While Large Language Models (LLMs) achieve high accuracy on established Classical Chinese Poetry benchmarks, it remains challenging to distinguish transferable Linguistic-Aesthetic Reasoning from reliance on familiar pre-training patterns. To address this issue, we introduce Neo-Classic, an evaluation benchmark that combines a constructionist Out-of-Sample (OOS) dataset with a suite of reverse understanding probes. Unlike traditional benchmarks that rely on verification or generation over historical corpora, Neo-Classic comprises strictly metrical poetry authored by contemporary experts, reducing the possibility of direct retrieval. We evaluate state-of-the-art models, including Qwen3-Max, Gemini-3-Pro, and DeepSeek-V3.2, across five behavioral probes designed to test hierarchical constraint satisfaction. Our results reveal two primary limitations. First, a performance gap of 20%–50% emerges when models transition from historical to contemporary texts. Second, models exhibit substantial difficulties in discourse-level ordering tasks, with standard accuracy remaining low (0–13%). Although expert-level guidance improves the performance of reasoning-enhanced models to 36%, a notable gap with human experts persists. These findings suggest that while current LLMs capture local formal patterns, they struggle with global hierarchical planning required for robust Linguistic-Aesthetic Reasoning.
Diagnosing Hidden Instabilities in Model Editing via Uncertainty Quantification
Zihan Gu | TianYi Zhang | Xinyan Zhang | Zhiyuan Wang | Han Zhang | Yuhao Wei | Jiacheng Lu | Tianyi Ma | Xingsheng Zhang | Hua Zhang | Yue Hu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Model editing provides a promising mechanism for updating large language models (LLMs) without expensive retraining. Existing approaches, particularly locate-and-edit methods based on least-squares optimization, aim to introduce targeted knowledge changes while preserving pre-trained behavior. In this work, we show that this objective is fundamentally fragile under standard single-edit evaluation protocols. We first develop a unified theoretical framework that characterizes activation-based editing as a constrained intervention on intermediate representations. Within this framework, we demonstrate that least-squares edits cannot, in general, isolate target updates from unrelated activations, giving rise to unavoidable interference that accumulates with successive edits. Crucially, this degradation can remain undetected in single-edit settings when assessed using conventional success and locality metrics. To expose such hidden instabilities, we introduce an uncertainty-based evaluation protocol that combines structured semantic perturbations with uncertainty quantification based on Sampling with Perturbation for UQ. By measuring edit-induced growth in aleatoric and epistemic uncertainty, our method reveals local knowledge conflicts that are invisible to existing benchmarks. Extensive experiments across multiple models, datasets, and editing algorithms show that both least-squares and other parameter-update-based methods consistently increase post-edit uncertainty. Together, our results suggest that current evaluation practices substantially overestimate the reliability of single-edit model editing, and that uncertainty-based diagnostics are necessary for assessing edit stability.
2025
Multi-Scale Temporal Scenario Planning for Financial Networks: A GNN Approach to Stress Testing
Xinyan Zhang | Xiaobeng Feng | Xiujuan Xu | Rongxuan Zhao | Peng Zhang | Jinghua Lian
Proceedings of the 2nd Workshop on Agent AI for Scenario Planning
RIRAG: A Bi-Directional Retrieval-Enhanced Framework for Financial Legal QA in ObliQA Shared Task
Xinyan Zhang | Xiaobing Feng | Xiujuan Xu | Zhiliang Zheng | Kai Wu
Proceedings of the 1st Regulatory NLP Workshop (RegNLP 2025)
In professional financial-legal consulting services, accurately and efficiently retrieving and answering legal questions is crucial. Although some breakthroughs have been made in information retrieval and answer generation, few frameworks have successfully integrated these tasks. Therefore, we propose RIRAG (Retrieval-In-the-loop Response and Answer Generation), a bi-directional retrieval-enhanced framework for financial-legal question answering in the ObliQA Shared Task. The system introduces BDD-FinLegal (Bi-Directional Dynamic finance-legal), a novel retrieval mechanism specifically designed for financial-legal documents that combines traditional retrieval algorithms with modern neural network methods. Legal answer generation is implemented through large language models retrained on expert-annotated datasets. Our method significantly improves the professionalism and interpretability of the answers while maintaining high retrieval accuracy. Experiments on the ADGM dataset show that the system achieves a significant improvement in the Recall@10 evaluation metric, and financial-legal experts recognized the accuracy and professionalism of its generated answers. This study provides new ideas for building efficient and reliable question-answering systems in the financial-legal domain.