Jin Zhang (张瑾) - ACL Anthology

Jin Zhang

Also published as: 瑾张

2025

pdf bib abs
MPVStance: Mitigating Hallucinations in Stance Detection with Multi-Perspective Verification
ZhaoDan Zhang | Zhao Zhang | Jin Zhang | Hui Xu | Xueqi Cheng
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Stance detection is a pivotal task in Natural Language Processing (NLP), identifying textual attitudes toward various targets. Despite advances in using Large Language Models (LLMs), challenges persist due to hallucination-models generating plausible yet inaccurate content. Addressing these challenges, we introduce MPVStance, a framework that incorporates Multi-Perspective Verification (MPV) with Retrieval-Augmented Generation (RAG) across a structured five-step verification process. Our method enhances stance detection by rigorously validating each response from factual accuracy, logical consistency, contextual relevance, and other perspectives. Extensive testing on the SemEval-2016 and VAST datasets, including scenarios that challenge existing methods and comprehensive ablation studies, demonstrates that MPVStance significantly outperforms current models. It effectively mitigates hallucination issues and sets new benchmarks for reliability and accuracy in stance detection, particularly in zero-shot, few-shot, and challenging scenarios.

pdf bib abs
MPRF: Interpretable Stance Detection through Multi-Path Reasoning Framework
ZhaoDan Zhang | Jin Zhang | Hui Xu | Jiafeng Guo | Xueqi Cheng
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Stance detection, a critical task in Natural Language Processing (NLP), aims to identify the attitude expressed in text toward specific targets. Despite advancements in Large Language Models (LLMs), challenges such as limited interpretability and handling nuanced content persist. To address these issues, we propose the Multi-Path Reasoning Framework (MPRF), a novel framework that generates, evaluates, and integrates multiple reasoning paths to improve accuracy, robustness, and transparency in stance detection. Unlike prior work that relies on single-path reasoning or static explanations, MPRF introduces a structured end-to-end pipeline: it first generates diverse reasoning paths through predefined perspectives, then dynamically evaluates and optimizes each path using LLM-based scoring, and finally fuses the results via weighted aggregation to produce interpretable and reliable predictions. Extensive experiments on the SEM16, VAST, and PStance datasets demonstrate that MPRF outperforms existing models. Ablation studies further validate the critical role of MPRF’s components, highlighting its effectiveness in enhancing interpretability and handling complex stance detection tasks.

pdf bib abs
T-MAD: Target-driven Multimodal Alignment for Stance Detection
ZhaoDan Zhang | Jin Zhang | Xueqi Cheng | Hui Xu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Multimodal Stance Detection (MSD) aims to determine a user’s stance - support, oppose, or neutral - toward a target by analyzing multimodal content such as texts and images from social media. Existing MSD methods struggle with generalizing to unseen targets and handling modality inconsistencies. To address these challenges, we propose the Target-driven Multi-modal Alignment and Dynamic Weighting Model (T-MAD), which combines target-driven multi-modal alignment and dynamic weighting mechanisms to capture target-specific relationships and balance modality contributions. The model incorporates iterative reasoning to iteratively refine predictions, achieving robust performance in both in-target and zero-shot settings. Experiments on the MMSD and MultiClimate datasets show that T-MAD outperforms state-of-the-art models, with optimal results achieved using RoBERTa, ViT, and an iterative depth of 5. Ablation studies further confirm the importance of multi-modal alignment and dynamic weighting in enhancing model effectiveness.

End-to-end automatic speech recognition (ASR) based on deep learning has achieved impressive progress in recent years. However, the performance of ASR foundation model often degrades significantly on out-of-domain data due to real-world domain shifts. Test-Time Adaptation (TTA) methods aim to mitigate this issue by adapting models during inference without access to source data. Despite recent progress, existing ASR TTA methods often struggle with instability under continual and long-term distribution shifts. To alleviate the risk of performance collapse due to error accumulation, we propose Dynamic Model-bank Single-Utterance Test-time Adaptation (DMSUTA), a sustainable continual TTA framework based on adaptive ASR model ensembling. DMSUTA maintains a dynamic model bank, from which a subset of checkpoints is selected for each test sample based on confidence and uncertainty criteria. To preserve both model plasticity and long-term stability, DMSUTA actively manages the bank by filtering out potentially collapsed models. This design allows DMSUTA to continually adapt to evolving domain shifts in ASR test-time scenarios. Experiments on diverse, continuously shifting ASR TTA benchmarks show that DMSUTA consistently outperforms existing continual TTA baselines, demonstrating superior robustness to domain shifts in ASR.

2024

pdf bib abs
M³CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought
Qiguang Chen | Libo Qin | Jin Zhang | Zhi Chen | Xiao Xu | Wanxiang Che
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Multi-modal Chain-of-Thought (MCoT) requires models to leverage knowledge from both textual and visual modalities for step-by-step reasoning, which gains increasing attention. Nevertheless, the current MCoT benchmark still faces some challenges: (1) absence of visual modal reasoning, (2) single-step visual modal reasoning, and (3) domain missing, thereby hindering the development of MCoT. Motivated by this, we introduce a novel benchmark (M³CoT) to address the above challenges, advancing the multi-domain, multi-step, and multi-modal CoT. Additionally, we conduct a thorough evaluation involving abundant MCoT approaches on Vision Large Language Models (VLLMs). In addition, we highlight that the current VLLMs still struggle to correctly reason in M³CoT and there is a large gap between VLLMs and human performance in M³CoT, despite their superior results on previous MCoT benchmarks. To our knowledge, we take the first meaningful step toward the multi-domain, multi-step, and multi-modal scenario in MCoT. We hope that M³CoT will serve as a valuable resource, providing a pioneering foundation in multi-domain, multi-step, multi-modal chain-of-thought research.

pdf bib abs
基于字节对编码的端到端藏语语音识别研究(End-to-End Tibetan Speech Recognition Study Based on Byte Pair Coding)
Yuqing Cai (蔡郁青) | Chao Wang (王超) | Duojie Renzeng (仁增多杰) | Yulei Zhu (朱宇雷) | Jin Zhang (张瑾) | Tashi Nyima (尼玛扎西)
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference)

“针对藏语端到端语音识别研究中存在的建模单元不统一和识别效果不理想的问题,本文提出了一种BPE-Conformer-CTC/Attention端到端藏语语音识别方法。首先,该方法采用了字节对编码算法进行语音建模,通过反复合并出现频率最高的字符对,将文本分割成易于管理、有意义的单元,平衡建模单元的粒度,从而解决藏语语音识别中建模单元不统一的问题。其次 , 使用了Conformer编码器 , 有效地融合了音频序列的全局和局部依赖关系,从而增强了模型的表征能力。最后,通过CTC/Attention联合解码策略,加速了对齐和解码过程,进而提高了识别效果的准确性和效率。在开源数据集XBMU-AMDO31和TIBMD@MUCI上的实验结果表明,所提出的BPE-Conformer-CTC/Attention模型分别取得了9.0%和4.6%的词错误率,相较于基线模型Transformer-CTC/Attention,词错误率分别相对降低了14.2%和30.3%。该研究方法为藏语端到端语音识别任务提供了一种有效的解决方案。”

pdf bib abs
LLM-Driven Knowledge Injection Advances Zero-Shot and Cross-Target Stance Detection
Zhao Zhang | Yiming Li | Jin Zhang | Hui Xu
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

Stance detection aims at inferring an author’s attitude towards a specific target in a text. Prior methods mainly consider target-related background information for a better understanding of targets while neglecting the accompanying input texts. In this study, we propose to prompt Large Language Models (LLMs) to explicitly extract the relationship between paired text and target as contextual knowledge. We then inject such LLM-driven knowledge into a generation model BART to exploit the rich contexts and semantics. Moreover, to further enhance the decoding capability of BART, a novel prototypical contrastive scheme is designed to align input contents with stance labels. Our experimental results demonstrate the state-of-the-art performance across several publicly available datasets, showcasing effectiveness in both zero-shot and cross-target stance detection scenarios. We publicly release our code to facilitate future research.