Sihang Jiang
2026
The “Knowledge–Behavior Gap” in Cultural Taboo Safety of Large Language Models
Ying He | Sihang Jiang | Xingzhou Chen | Zhouhong Gu | Yiwei Gu | Minggui HE | Shimin Tao | Mahongxia | Yanghua Xiao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Cultural taboo safety is essential for deploying large language models (LLMs), as culturally insensitive outputs may cause offense or even social harm. However, existing cultural benchmarks primarily assess cultural knowledge or value biases, while overlooking whether LLMs can recognize and respect cultural taboos, especially when taboos are implicitly hidden in seemingly harmless questions. Moreover, cultural taboos are implicit and context-dependent, and thus pose unique challenges for reliable evaluation. To address these gaps, we introduce **CulShield**, the first public benchmark dedicated to evaluating and improving the cultural taboo safety of LLMs. CulShield spans 77 countries and regions and includes over 2,020 taboos. It evaluates models along both explicit knowledge and implicit behaviors. Experiments on several advanced LLMs (e.g., GPT-4o-mini, Gemini-2.5-pro) reveal a clear "knowledge-behavior gap": models often fail to apply known taboos during interaction. We further show that variations in linguistic context can significantly affect LLMs’ cultural taboo safety. Code and data are accessible here: https://anonymous.4open.science/r/CulShield-7A0E.
Why Did Apple Fall: Evaluating Curiosity in Large Language Models
Haoyu Wang | Sihang Jiang | Yuyan Chen | Yitong Wang | Xiaojun Meng | Jiansheng Wei | Yanghua Xiao
Findings of the Association for Computational Linguistics: ACL 2026
Curiosity serves as a fundamental construct in human cognition. Inspired by curiosity, reinforcement learning with intrinsic rewards for large language models (LLMs) has shown substantial potential. However, it remains unclear whether existing curiosity-driven methods genuinely reflect curiosity-like behaviors in LLMs, and to what extent psychological notions of curiosity can be transferred to these models. In this work, we propose a psychology-inspired framework to evaluate and leverage curiosity in LLMs. We adapt the Five-Dimensional Curiosity Scale Revised (5DCR) to LLMs and combine questionnaire-based self-reports with a behavioral study. We find that although LLMs can exhibit curiosity-like behavioral patterns resembling those of humans, such patterns do not reflect an intrinsic trait of curiosity. Building on this insight, we design a curiosity-driven thinking pipeline to examine the functional role of human-like curious behaviors. Experiments show that instructing LLMs to emulate curious strategies leads to better performance on selected downstream tasks, indicating that mimicking curious behaviors holds promise for reasoning enhancement.
Don’t Tell the Answer, Truly Guide the Reasoning During RL Rollouts
Xinyi Wang | Jinyi Han | Zishang Jiang | Tingyun li | Jiaqing Liang | Sihang Jiang | Zhaoqian Dai | Ma Shuguang | Fei Yu | Yanghua Xiao
Findings of the Association for Computational Linguistics: ACL 2026
Reinforcement learning (RL) has emerged as a key approach for improving long chain-of-thought (CoT) reasoning in large language models (LLMs). However, existing methods such as GRPO often break down when task difficulty exceeds the model’s capacity, resulting in sparse rewards and inefficient training. While prior work attempts to address this issue using off-policy data, it frequently introduces distributional mismatch, leading to unstable policy updates. In this work, we identify a fundamental issue underlying these limitations, which we term *low training affinity*, and propose **Affinity**, the first quantitative metric for measuring the compatibility between external guidance and a model’s intrinsic policy. Based on this insight, we introduce **HINT**, an adaptive framework designed to enhance reasoning performance while explicitly preserving high Affinity. HINT consists of two key components. First, instead of providing partial answers, it introduces **Meta-Hints**, which serve as abstract cognitive scaffolding that guides the model to independently construct solutions. Second, we propose **Affinity-Aware Policy Optimization (AAPO)**, which dynamically adjusts the learning objective based on the Affinity signal to ensure stable training. Extensive experiments across diverse benchmarks demonstrate that HINT consistently outperforms strong baselines, while achieving improved training stability and robust generalization to out-of-distribution tasks. Code is available at: https://github.com/ViviqwerAsd/HINT
Immediate Inference: The Missing Foundation in Large Language Model Logical Reasoning
Sihang Jiang | Zhiyu Lu | Keyi Wang | Jiaqing Liang | Yanghua Xiao | Xiaojun Meng | Jiansheng Wei
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
While extensive research has evaluated LLMs on complex reasoning tasks, the foundational building blocks of logical reasoning remain underexplored. We introduce IIBench, a benchmark evaluating immediate inference (elementary operations over categorical propositions). Our evaluation reveals that even SoTA models exhibit systematic deficiencies in immediate inference, and establishes immediate inference as foundational: it mediates approximately 40% of the effect on syllogistic reasoning, with near-perfect correlation (coefficient of 0.98) across reasoning benchmarks. Our analysis reveals that models lack robust operator grounding, oscillating between structural reasoning and surface pattern matching with inconsistent handling of quantifiers and negation.
ADaPT: Token-Level Decoupling for Efficient Large Reasoning Models
Tingyun li | Zishang Jiang | Jinyi Han | Xinyi Wang | Sihang Jiang | Han Xia | Zhaoqian Dai | Ma Shuguang | Fei Yu | Jiaqing Liang | Yanghua Xiao
Findings of the Association for Computational Linguistics: ACL 2026
Large reasoning models rely on long chain-of-thought to achieve strong performance, but applying such reasoning uniformly incurs high computational cost. Existing efficiency-oriented methods attempt to shorten or mix reasoning strategies, yet often degrade reasoning capability. We identify the root cause as sequence-level coupling between efficiency incentives and correctness optimization, which implicitly penalizes long but correct reasoning trajectories. To address this issue, we propose Adaptive Dual-Process Thinking (ADaPT), a token-level dual-process framework that explicitly decouples efficiency and correctness signals during training. ADaPT introduces a mode-selection token to control fast and slow reasoning, applying efficiency-related rewards exclusively to this token to avoid penalizing correct long reasoning while encouraging efficiency when appropriate. Moreover, ADaPT enables precise and continuous control over the efficiency–performance trade-off at inference time: by adjusting the generation probability of the mode-selection token, a single trained model can smoothly move along the efficiency–performance Pareto frontier. Extensive experiments demonstrate that ADaPT significantly reduces inference cost while maintaining strong reasoning performance across multiple benchmarks.
2025
Data-Efficient Selection via Grammatical Complexity in Continual Pre-training of Domain-Specific LLMs
Yizhou Ying | Geng Zhang | Cui Danxin | Chengyu Du | Guanglei Yue | Sihang Jiang | Jiaqing Liang | Yifei Fu | Hailin Hu | Yanghua Xiao
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Data efficiency is crucial in domain-specific continual pre-training (CPT) of large language models (LLMs), especially under resource constraints. Aiming for “small data, big impact,” this work addresses the limitations of existing domain-specific data selection strategies, which often rely on scarce labeled data or computationally expensive LLMs. We introduce CDF Sampling with Grammatical Complexity (CDF-GC), an annotation-independent, efficient and interpretable data selection framework for CPT. Our approach comprehensively evaluates grammatical complexity using lexical diversity and syntactic complexity, and employs a cumulative distribution function (CDF)-based sampling strategy to balance complexity and diversity. To validate the effectiveness of CDF-GC, we conducted experiments on a financial dataset. The results demonstrate that CDF-GC significantly outperforms baselines, achieving 2.0% improvement in financial QA at the same selection ratio and even surpassing full-data training by 1.7% using only 20% of the data.
From Remembering to Metacognition: Do Existing Benchmarks Accurately Evaluate LLMs?
Geng Zhang | Yizhou Ying | Sihang Jiang | Jiaqing Liang | Guanglei Yue | Yifei Fu | Hailin Hu | Yanghua Xiao
Findings of the Association for Computational Linguistics: EMNLP 2025
Despite the rapid development of large language models (LLMs), existing benchmark datasets often focus on low-level cognitive tasks, such as factual recall and basic comprehension, while providing limited coverage of higher-level reasoning skills, including analysis, evaluation, and creation. In this work, we systematically assess the cognitive depth of popular LLM benchmarks using Bloom’s Taxonomy to evaluate both the cognitive and knowledge dimensions. Our analysis reveals a pronounced imbalance: most datasets concentrate on “Remembering” and “Understanding”, with metacognitive and creative reasoning largely underrepresented. We also find that incorporating higher-level cognitive instructions into the current instruction fine-tuning process improves model performance. These findings highlight the importance of future benchmarks incorporating metacognitive evaluations to more accurately assess and enhance model performance.
2024
CR-LLM: A Dataset and Optimization for Concept Reasoning of Large Language Models
Nianqi Li | Jingping Liu | Sihang Jiang | Haiyun Jiang | Yanghua Xiao | Jiaqing Liang | Zujie Liang | Feng Wei | Jinglei Chen | Zhenghong Hao | Bing Han
Findings of the Association for Computational Linguistics: ACL 2024
Concept reasoning is an important capability for models to understand the world. However, existing datasets, such as those for concept extraction and concept generation, suffer from knowledge leakage and context leakage. To address these limitations, we construct a dataset of concept reasoning for large language models (CR-LLM) with knowledge leakage prevention and context leakage prevention, which consists of 2,167 samples and covers different concept types. In addition, we propose a hybrid reasoning method, consisting of inductive reasoning, deductive reasoning and a controller. This method allows large language models to adaptively select the optimal reasoning method for each input sample. Finally, we conduct extensive experiments on CR-LLM using different models and methods. The results show that existing large language models and reasoning methods perform sub-optimally in the concept reasoning task. In contrast, our proposed method significantly improves the capabilities, achieving a 7% increase in accuracy compared to CoT and demonstrating better granularity. We release CR-LLM and code at https://github.com/Nianqi-Li/Concept-Reasoning-for-LLMs.
Co-authors
- Yanghua Xiao 8
- Jiaqing Liang 6
- Zhaoqian Dai 2
- Yifei Fu 2
- Jinyi Han 2
- Hailin Hu 2
- Zishang Jiang 2
- Xiaojun Meng 2
- Ma Shuguang 2
- Xinyi Wang 2
- Jiansheng Wei 2
- Yizhou Ying 2
- Fei Yu 2
- Guanglei Yue 2
- Geng Zhang 2
- Tingyun li 2
- Xingzhou Chen 1
- Yuyan Chen 1
- Jinglei Chen 1
- Cui Danxin 1
- Chengyu Du 1
- Zhouhong Gu 1
- Yiwei Gu 1
- Bing Han 1
- Zhenghong Hao 1
- Ying He 1
- Minggui He 1
- Haiyun Jiang 1
- Nianqi Li 1
- Zujie Liang 1
- Jingping Liu 1
- Zhiyu Lu 1
- Mahongxia 1
- Shimin Tao 1
- Haoyu Wang 1
- Yitong Wang 1
- Keyi Wang 1
- Feng Wei 1
- Han Xia 1