Zhihao Zhang

2026

CheMM-R1: Enhancing Chemical Structure Recognition and Elucidation with Reasoning Multimodal Large Language Models
Liting Huang | Zhihao Zhang | Shoujin Wang
Findings of the Association for Computational Linguistics: ACL 2026

While Multimodal Large Language Models (MLLMs) demonstrate strong reasoning capabilities, they lack domain-specific expertise to effectively perform chemical tasks. For example, existing MLLMs struggle with both the lower-level task of molecular structure recognition and the higher-level task of chemical spectral data elucidation. When faced with complex molecular structures and multimodal chemical data (including spectral images and texts), they often fail to provide reliable inference, resulting in poor performance. Moreover, there are no benchmark datasets for evaluating multi-step multimodal reasoning capacities in the chemistry domain. To this end, we establish CheMM-Bench, a comprehensive benchmark dataset with 48,500 reasoning steps across four chemical tasks (SmilesQA, IupacQA, MwQA, SpectraQA) for evaluating visual reasoning in both molecular structure recognition and spectral analysis. On top of this, we present CheMM-R1, a state-of-the-art chemistry-specific MLLM trained with CheMMGRPO, a novel adaptation of Group Relative Policy Optimisation tailored for chemical reasoning. CheMMGRPO employs domain-specific reward functions to assess chemical validity, structural accuracy, format compliance, and factual correctness. CheMM-R1 surpasses leading proprietary models (GPT-o3, Gemini-2.5-Pro, Claude-3.5-Sonnet, and Grok-2) across all CheMM-Bench tasks. The evaluation code and model are publicly available.
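As a rough illustration of the Group Relative Policy Optimisation idea the abstract refers to, the sketch below scores a group of sampled responses and normalises each reward against the group mean and standard deviation. It is not the CheMMGRPO implementation. The equal weights, the `combined_reward` helper, and the example scores are assumptions made for illustration: the abstract names the four reward components but does not say how they are combined.

```python
from statistics import mean, pstdev

# Hypothetical equal weights over the four reward components named in the
# abstract; the actual weighting used by CheMMGRPO is not given there.
WEIGHTS = {"validity": 0.25, "structure": 0.25, "format": 0.25, "factual": 0.25}


def combined_reward(scores):
    """Weighted sum of per-component scores, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage: (reward - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative group of three sampled responses to one prompt (made-up scores).
group = [
    {"validity": 1.0, "structure": 1.0, "format": 1.0, "factual": 1.0},
    {"validity": 1.0, "structure": 0.0, "format": 1.0, "factual": 0.0},
    {"validity": 0.0, "structure": 0.0, "format": 1.0, "factual": 0.0},
]
rewards = [combined_reward(s) for s in group]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group average receive a positive advantage and those below receive a negative one, so no separate value model is needed to provide a baseline.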


Safety alignment in Large Language Models is critical for healthcare; however, reliance on binary refusal boundaries often results in over-refusal of benign queries or unsafe compliance with harmful ones. While existing benchmarks measure these extremes, they fail to evaluate Safe Completion: the model’s ability to maximise helpfulness on dual-use or borderline queries by providing safe, high-level guidance without crossing into actionable harm. We introduce Health-ORSC-Bench, the first large-scale benchmark designed to systematically measure Over-Refusal and Safe Completion quality in healthcare. Comprising 31,920 benign boundary prompts across seven health categories (e.g., self-harm, medical misinformation), our framework uses an automated pipeline with human validation to test models at varying levels of intent ambiguity. We evaluate 30 state-of-the-art LLMs, including GPT-5 and Claude-4, revealing a significant tension: safety-optimised models frequently refuse up to 80% of "Hard" benign prompts, while domain-specific models often sacrifice safety for utility. Our findings demonstrate that model family and size significantly influence calibration: larger frontier models (e.g., GPT-5, Llama-4) exhibit "safety-pessimism" and higher over-refusal than smaller or MoE-based counterparts (e.g., Qwen-3-Next), highlighting that current LLMs struggle to balance refusal and compliance. Health-ORSC-Bench provides a rigorous standard for calibrating the next generation of medical AI assistants toward nuanced, safe, and helpful completions. Our code and data are available at: https://github.com/ZhihaoZhang97/Health-ORSC-Bench. Warning: Some content may be toxic or undesirable.

2025


Infodemics and health misinformation have a significant negative impact on individuals and society, exacerbating confusion and increasing hesitancy in adopting recommended health measures. Recent advancements in generative AI, capable of producing realistic, human-like text and images, have significantly accelerated the spread and expanded the reach of health misinformation, resulting in an alarming surge in its dissemination. To combat infodemics, most existing work has focused on developing misinformation datasets from social media and fact-checking platforms, but has faced limitations in topical coverage, inclusion of AI-generated content, and accessibility of raw content. To address these gaps, we present MM-Health, a large-scale multimodal misinformation dataset in the health domain consisting of 34,746 news articles encompassing both textual and visual information. MM-Health includes human-generated multimodal information (5,776 articles) and AI-generated multimodal information (28,880 articles) from various SOTA generative AI models. Additionally, we benchmarked our dataset on three tasks (reliability checks, originality checks, and fine-grained AI detection), demonstrating that existing SOTA models struggle to accurately distinguish the reliability and origin of information. Our dataset aims to support the development of misinformation detection across various health scenarios, facilitating the detection of human and machine-generated content at multimodal levels.
