Aimin Zhou
2026
Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF
Yuan Fang | Yiming Luo | Aimin Zhou | Fei Tan
Findings of the Association for Computational Linguistics: ACL 2026
Ensuring the safety of large language models (LLMs) requires robust red teaming, yet the systematic synthesis of high-quality toxic data remains under-explored. We propose Reverse Constitutional AI (R-CAI), a framework for automated and controllable adversarial data generation that moves beyond isolated jailbreak prompts. By inverting a harmless constitution into a constitution of toxicity and iteratively refining model outputs through a critique–revision pipeline, R-CAI enables scalable synthesis of multi-dimensional adversarial data without human annotation. Optimizing solely for toxicity-related rewards, however, can lead to reward hacking and degraded semantic coherence. To address this challenge, we introduce probability clamping within reinforcement learning from AI feedback, which stabilizes adversarial optimization while preserving adversarial intent. Experiments demonstrate that R-CAI generates diverse, high-quality toxic data and that probability clamping substantially improves semantic coherence (15%) without sacrificing adversarial strength. Overall, R-CAI provides a fully automated framework for red teaming data generation and systematic safety evaluation of aligned language models.
SCMAPR: Self-Correcting Multi-Agent Prompt Refinement for Complex-Scenario Text-to-Video Generation
Chengyi Yang | Pengzhen Li | Jiayin Qi | Aimin Zhou | Ji Wu | Ji Liu
Findings of the Association for Computational Linguistics: ACL 2026
Text-to-Video (T2V) generation has benefited from recent advances in diffusion models, yet current systems still struggle under complex scenarios, which are generally exacerbated by the ambiguity and underspecification of text prompts. In this work, we formulate complex-scenario prompt refinement as a stage-wise multi-agent refinement process and propose SCMAPR, i.e., a scenario-aware and Self-Correcting Multi-Agent Prompt Refinement framework for T2V prompting. SCMAPR coordinates specialized agents to (i) route each prompt to a taxonomy-grounded scenario for strategy selection, (ii) synthesize scenario-aware rewriting policies and perform policy-conditioned refinement, and (iii) conduct structured semantic verification that triggers conditional revision when violations are detected. To clarify what constitutes complex scenarios in T2V prompting, provide representative examples, and enable rigorous evaluation under such challenging conditions, we further introduce T2V-Complexity, a T2V benchmark consisting exclusively of complex-scenario prompts. Extensive experiments on 3 existing benchmarks and our T2V-Complexity benchmark demonstrate that SCMAPR consistently improves text-video alignment and overall generation quality under complex scenarios, achieving up to 2.67% and 3.28 gains in average score on VBench and EvalCrafter, and up to 0.028 improvement on T2V-CompBench over 3 state-of-the-art baselines. The code of SCMAPR is publicly available at https://github.com/HiThink-Research/SCMAPR.
HACHIMI: Scalable and Controllable Student Persona Generation via Orchestrated Agents
Yilin Jiang | Fei Tan | Xuanyu Yin | Leng Jing | Aimin Zhou
Findings of the Association for Computational Linguistics: ACL 2026
Student Personas (SPs) are emerging as infrastructure for educational LLMs, yet prior work often relies on ad-hoc prompting or hand-crafted profiles with limited control over educational theory and population distributions. We formalize this as Theory-Aligned and Distribution-Controllable Persona Generation (TAD-PG) and introduce HACHIMI, a multi-agent Propose-Validate-Revise framework that generates theory-aligned, quota-controlled personas. HACHIMI factorizes each persona into a theory-anchored educational schema, enforces developmental and psychological constraints via a neuro-symbolic validator, and combines stratified sampling with semantic deduplication to reduce mode collapse. The resulting HACHIMI-1M corpus comprises 1 million personas for Grades 1-12. Intrinsic evaluation shows near-perfect schema validity, accurate quotas, and substantial diversity, while external evaluation instantiates personas as student agents answering CEPS and PISA 2022 surveys; across 16 cohorts, math and curiosity/growth constructs align strongly between humans and agents, whereas classroom-climate and well-being constructs are only moderately aligned, revealing a fidelity gradient. All personas are generated with Qwen2.5-72B, and HACHIMI provides a standardized synthetic student population for group-level benchmarking and social-science simulations. Resources available at https://github.com/ZeroLoss-Lab/HACHIMI.
AlphaContext: An Evolutionary Tree-based Psychometric Context Generator for Creativity Assessment
Yixuan Wang | Yue Huang | Hong Qian | Yunzhao Wei | Yifei Ding | Wenkai Wang | Zhi Liu | Zhongjing Huang | Aimin Zhou | Jiajun Guo
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Creativity has become a core competence in the era of LLMs and human–AI collaboration, underpinning innovation in real-world problem solving. Crucially, the systematic improvement of creativity necessitates scientifically valid assessment instruments. Psychometric research recognizes context-based assessment as an effective way to measure creative thinking. However, high-quality expert-designed contexts remain scarce. Existing LLM-based generators often struggle with insufficient assessment cues, weak narrative coherence, limited stylistic diversity, and poor support for creative thinking. To address these challenges, we propose AlphaContext, an evolutionary tree-based psychometric context generator for creativity assessment. First, the HyperTree Outline Planner formalizes expert-designed outlining as a rule-guided hypertree and performs top-down hierarchical planning. The MCTS-based Context Generator fills the outline via MCTS to balance global structure and local quality. Then, the Evolutionary Context Optimizer evolves contexts with MAP-Elites by repeatedly updating niche elites to jointly improve diversity and quality. Finally, the Assessment-Guided Evolution Refiner simulates virtual participants with diverse styles and recycles weak contexts for further evolution. Experiments show that AlphaContext yields an average improvement of 8% over competitive methods across 6 quality metrics.
2025
The Role of Visual Modality in Multimodal Mathematical Reasoning: Challenges and Insights
Yufang Liu | Yao Du | Tao Ji | Jianing Wang | Yang Liu | Yuanbin Wu | Aimin Zhou | Mengdi Zhang | Xunliang Cai
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent research has increasingly focused on multimodal mathematical reasoning, particularly emphasizing the creation of relevant datasets and benchmarks. Despite this, the role of visual information in reasoning has been underexplored. Our findings show that existing multimodal mathematical models minimally leverage visual information, and model performance remains largely unaffected by changes to or removal of images in the dataset. We attribute this to the dominance of textual information and answer options that inadvertently guide the model to correct answers. To improve evaluation methods, we introduce the HC-M3D dataset, specifically designed to require image reliance for problem-solving and to challenge models with similar, yet distinct, images that change the correct answer. In testing leading models, their failure to detect these subtle visual differences suggests limitations in current visual perception capabilities. Additionally, we observe that the common approach of improving general VQA capabilities by combining various types of image encoders does not contribute to math reasoning performance. This finding also presents a challenge to enhancing visual reliance during math reasoning.
FinDABench: Benchmarking Financial Data Analysis Ability of Large Language Models
Shu Liu | Shangqing Zhao | Chenghao Jia | Xinlin Zhuang | Zhaoguang Long | Jie Zhou | Aimin Zhou | Man Lan | Yang Chong
Proceedings of the 31st International Conference on Computational Linguistics
Large Language Models (LLMs) have demonstrated impressive capabilities across a wide range of tasks. However, their proficiency and reliability in the specialized domain of financial data analysis, particularly focusing on data-driven thinking, remain uncertain. To bridge this gap, we introduce FinDABench, a comprehensive benchmark designed to evaluate the financial data analysis capabilities of LLMs within this context. The benchmark comprises 15,200 training instances and 8,900 test instances, all meticulously crafted by human experts. FinDABench assesses LLMs across three dimensions: 1) Core Ability, evaluating the models’ ability to perform financial indicator calculation and corporate sentiment risk assessment; 2) Analytical Ability, determining the models’ ability to quickly comprehend textual information and analyze abnormal financial reports; and 3) Technical Ability, examining the models’ use of technical knowledge to address real-world data analysis challenges involving analysis generation and chart visualization from multiple perspectives. We will release FinDABench and the evaluation scripts at https://github.com/xxx. FinDABench aims to provide a measure for in-depth analysis of LLM abilities and foster the advancement of LLMs in the field of financial data analysis.
Mis-prompt: Benchmarking Large Language Models for Proactive Error Handling
Jiayi Zeng | Yizhe Feng | Mengliang He | Wenhui Lei | Wei Zhang | Zeming Liu | Xiaoming Shi | Aimin Zhou
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) have demonstrated significant advancements in error handling. Current error-handling work operates in a passive manner, relying on explicit error-handling instructions. However, in real-world scenarios, explicit error-handling instructions are usually unavailable. This work frames the resulting challenge as how to conduct proactive error handling without explicit error-handling instructions. To promote further research, this work introduces a new benchmark, termed Mis-prompt, consisting of four evaluation tasks, an error category taxonomy, and a new evaluation dataset. Furthermore, this work analyzes current LLMs’ performance on the benchmark. The experimental results reveal that current LLMs perform poorly on proactive error handling and that SFT on error-handling instances improves their proactive error-handling capabilities. The dataset will be publicly available.
Flow2Code: Evaluating Large Language Models for Flowchart-based Code Generation Capability
Mengliang He | Jiayi Zeng | Yankai Jiang | Wei Zhang | Zeming Liu | Xiaoming Shi | Aimin Zhou
Findings of the Association for Computational Linguistics: ACL 2025
While large language models (LLMs) show promise in code generation, existing benchmarks neglect flowchart-based code generation. To promote further research on flowchart-based code generation, this work presents Flow2Code, a novel benchmark for flowchart-based code generation evaluation. The evaluation dataset spans 15 programming languages and includes 5,622 code segments paired with 16,866 flowcharts of three types: code, UML, and pseudocode. Extensive experiments with 13 multimodal LLMs reveal that current LLMs cannot generate code based on flowcharts perfectly. In addition, experimental results show that supervised fine-tuning contributes greatly to the models’ performance. The dataset will be publicly available.
2024
Chinese Essay Fluency Evaluation (CEFE) Task
Xinlin Zhuang | Xinshu Shen | Hongyi Wu | Man Lan | Xiaopeng Bai | Yuanbin Wu | Aimin Zhou | Shaoguang Mao
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 3: Evaluations)
This paper presents a detailed review of Task 7 in the CCL24-Eval: the second Chinese Essay Fluency Evaluation (CEFE). The task aims to identify fine-grained grammatical errors that impair readability and coherence in essays authored by Chinese primary and secondary school students, evaluate the essays’ fluency levels, and recommend corrections to improve their written fluency. The evaluation comprises three tracks: (1) Coarse-grained and fine-grained error identification; (2) Error sentence rewriting; and (3) Essay Fluency Level Recognition. We garnered 29 completed registrations, resulting in 180 submissions from 10 dedicated teams. The paper discusses the submissions and analyzes the results from all participating teams.
Are U a Joke Master? Pun Generation via Multi-Stage Curriculum Learning towards a Humor LLM
Yang Chen | Chong Yang | Tu Hu | Xinhao Chen | Man Lan | Li Cai | Xinlin Zhuang | Xuan Lin | Xin Lu | Aimin Zhou
Findings of the Association for Computational Linguistics: ACL 2024
Although large language models (LLMs) acquire extensive world knowledge and some reasoning abilities, their proficiency in generating humorous sentences remains a challenge. Previous research has demonstrated that the humor generation capabilities of ChatGPT are confined to producing merely 25 unique jokes. In this work, we concentrate on endowing LLMs with the ability to generate puns, a particular category of humor, via preference learning. We propose a multi-stage curriculum preference learning framework to optimize both pun structure preferences and humor preferences. Specifically, we improve the Direct Preference Optimization (DPO) algorithm to address the multi-objective alignment problem. In addition, to facilitate further advancement in this field, we collect a Chinese Pun (ChinesePun) dataset, containing 2.1k puns and corresponding annotations. Experimental results on both Chinese and English benchmark datasets demonstrate that our method significantly outperforms all the baseline models.
Investigating and Mitigating Object Hallucinations in Pretrained Vision-Language (CLIP) Models
Yufang Liu | Tao Ji | Changzhi Sun | Yuanbin Wu | Aimin Zhou
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Large Vision-Language Models (LVLMs) have achieved impressive performance, yet research has pointed out a serious issue with object hallucinations within these models. However, there is no clear conclusion as to which part of the model these hallucinations originate from. In this paper, we present an in-depth investigation into the object hallucination problem specifically within the CLIP model, which serves as the backbone for many state-of-the-art vision-language systems. We unveil that even in isolation, the CLIP model is prone to object hallucinations, suggesting that the hallucination problem is not solely due to the interaction between vision and language modalities. To address this, we propose a counterfactual data augmentation method by creating negative samples with a variety of hallucination issues. We demonstrate that our method can effectively mitigate object hallucinations for the CLIP model, and we show that the enhanced model can be employed as a visual encoder, effectively alleviating the object hallucination issue in LVLMs.
2023
Overview of CCL23-Eval Task 8: Chinese Essay Fluency Evaluation (CEFE) Task
Xinshu Shen | Hongyi Wu | Xiaopeng Bai | Yuanbin Wu | Aimin Zhou | Shaoguang Mao | Tao Ge | Yan Xia
Proceedings of the 22nd Chinese National Conference on Computational Linguistics (Volume 3: Evaluations)
This paper provides a comprehensive review of the CCL23-Eval Task 8, i.e., Chinese Essay Fluency Evaluation (CEFE). The primary aim of this task is to systematically identify the types of grammatical fine-grained errors that affect the readability and coherence of essays written by Chinese primary and secondary school students, and then to suggest suitable corrections to enhance the fluidity of their written expression. This task consists of three distinct tracks: (1) Coarse-grained and fine-grained error identification; (2) Character-level error identification and correction; (3) Error sentence rewriting. In the end, we received 44 completed registration forms, leading to a total of 130 submissions from 11 dedicated participating teams. We present the results of all participants and our analysis of these results. Both the dataset and evaluation tool used in this task are available.
Co-authors
- Yuanbin Wu 4
- Man Lan 3
- Xinlin Zhuang 3
- Xiaopeng Bai 2
- Mengliang He 2
- Tao Ji 2
- Yufang Liu 2
- Zeming Liu 2
- Shaoguang Mao 2
- Xinshu Shen 2
- Xiaoming Shi 2
- Fei Tan 2
- Hongyi Wu 2
- Jiayi Zeng 2
- Wei Zhang 2
- Xunliang Cai 1
- Li Cai 1
- Yang Chen 1
- Xinhao Chen 1
- Yang Chong 1
- Yifei Ding 1
- Yao Du 1
- Yuan Fang 1
- Yizhe Feng 1
- Tao Ge 1
- Jiajun Guo 1
- Tu Hu 1
- Yue Huang 1
- Zhongjing Huang 1
- Chenghao Jia 1
- Yilin Jiang 1
- Yankai Jiang 1
- Leng Jing 1
- Wenhui Lei 1
- Pengzhen Li 1
- Xuan Lin 1
- Yang Liu 1
- Shu Liu 1
- Ji Liu 1
- Zhi Liu 1
- Zhaoguang Long 1
- Xin Lu 1
- Yiming Luo 1
- Jiayin Qi 1
- Hong Qian 1
- Changzhi Sun 1
- Jianing Wang 1
- Yixuan Wang 1
- Wenkai Wang 1
- Yunzhao Wei 1
- Ji Wu 1
- Yan Xia 1
- Chengyi Yang 1
- Chong Yang 1
- Xuanyu Yin 1
- Mengdi Zhang 1
- Shangqing Zhao 1
- Jie Zhou 1