Yutao Hou


2026

Large language models (LLMs) are increasingly applied in financial scenarios. However, they may produce harmful outputs, including facilitating illegal activities or unethical behavior, posing serious compliance risks. To systematically evaluate LLM safety in finance, we propose FinSafetyBench, a bilingual (English-Chinese) red-teaming benchmark designed to test an LLM’s refusal of requests that violate financial compliance. Grounded in real-world financial crime cases and ethics standards, the benchmark comprises 14 subcategories spanning financial crimes and ethical violations. Through extensive experiments on general-purpose and finance-specialized LLMs under three representative attack settings, we identify critical vulnerabilities that allow adversarial prompts to bypass compliance safeguards. Further analysis reveals stronger susceptibility in Chinese contexts and highlights the limitations of prompt-level defenses against sophisticated or implicit manipulation strategies.
Large Language Models (LLMs) have demonstrated remarkable capabilities in various reasoning-intensive tasks. However, these models exhibit unexpected brittleness, often failing on simple variations of the same underlying task. Existing robustness evaluations predominantly rely on hand-crafted templates or a limited set of perturbation rules. Consequently, such approaches lack the adaptability to probe latent vulnerabilities unique to specific models and remain susceptible to data contamination. To address this, we propose the Math Stress Tester (MaSTer), an automated framework inspired by software stress testing. MaSTer generates adversarial variants via a multi-round rewrite-verify loop, ensuring semantic consistency while successfully inducing model failure. Our framework generates benchmark variants dynamically for each LLM, thus minimizing the risk of data contamination. Experiments on GSM8K and MATH-500 demonstrate the effectiveness of MaSTer on mathematical tasks. Additionally, we validate the framework’s extensibility to non-mathematical tasks, highlighting its broad applicability. Furthermore, we demonstrate that the synthesized variants generated by MaSTer can be utilized as a fine-tuning dataset to significantly enhance the model’s robustness.

2025

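This paper proposes a multi-agent collaborative framework for generating interference data, aimed at evaluating and analyzing the robustness of large language models under complex interference. Starting from the mathematics domain, the framework is progressively extended to medicine, law, science, and general scenarios, yielding AntIF, a cross-domain dataset of nearly 5,000 examples covering four types of interference: spelling, numerical, type, and rumor interference. On this basis, we systematically evaluate the interference resistance of mainstream open-source language models and, in combination with different prompt engineering strategies and model fine-tuning methods, analyze in depth the practical effectiveness of AntIF in improving model robustness.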