Zhihui Zhu

2026

MolSafeEval: A Benchmark for Uncovering Safety Risks in AI-Generated Molecules
Tong Xu | Xinzhe Cao | Zhihui Zhu | Keyan Ding | Huajun Chen
Findings of the Association for Computational Linguistics: ACL 2026

Current molecular generation benchmarks emphasize task complexity, molecule novelty, and property alignment; they largely overlook a critical concern: the potential safety risks of AI-generated molecules. In practice, many generative models may produce molecules with toxic, reactive, or otherwise hazardous characteristics—posing hidden dangers that remain insufficiently addressed. To address this gap, we introduce MolSafeEval, a benchmark dedicated to evaluating and analyzing the safety risks of molecular generation. Unlike prior approaches that rely on narrow toxicity predictors, MolSafeEval integrates heterogeneous safety knowledge—ranging from toxicological databases to hazard rules—into a structured molecular safety knowledge graph. This graph serves as a foundation for large language model–based reasoning, enabling systematic detection and explanation of unsafe features in generated compounds. We further categorize molecular generative models into four representative task types—unconditional generation, property optimization, target protein–based design, and text-based generation—and provide standardized datasets and safety evaluation protocols for each.

2025

pdf bib abs

EcomMMMU: Strategic Utilization of Visuals for Robust Multimodal E-commerce Models
Xinyi Ling | Hanwen Du | Zhihui Zhu | Xia Ning
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

E-commerce platforms are rich in multimodal data, featuring a variety of images that depict product details. However, this raises an important question: do these images always enhance product understanding, or can they sometimes introduce redundancy or degrade performance? Existing datasets are limited in both scale and design, making it difficult to systematically examine this question. To this end, we introduce EcomMMMU, an e-commerce multimodal multitask understanding dataset with 406,190 samples and 8,989,510 images. EcomMMMU is comprised of multi-image visual-language data designed with 8 essential tasks and a specialized VSS subset to benchmark the capability of multimodal large language models (MLLMs) to effectively utilize visual content. Analysis on EcomMMMU reveals that product images do not consistently improve performance and can, in some cases, degrade it. This indicates that MLLMs may struggle to effectively leverage rich visual content for e-commerce tasks. Building on these insights, we propose SUMEI, a data-driven method that strategically utilizes multiple images via predicting visual utilities before using them for downstream tasks. Comprehensive experiments demonstrate the effectiveness and robustness of SUMEI. The data and code are available through https://github.com/ninglab/EcomMMMU.

pdf bib abs

Captions Speak Louder than Images: Generalizing Foundation Models for E-commerce from High-quality Multimodal Instruction Data
Xinyi Ling | Hanwen Du | Bo Peng | Zhihui Zhu | Xia Ning
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Multimodal foundation models (MFMs) have demonstrated strong capabilities in e-commerce by effectively leveraging multimodal data to enhance product understanding and user experienceHowever, the development of e-commerce MFMs is hindered by two challenges: (1) the scarcity of large-scale, high-quality multimodal benchmark datasets; and (2) the lack of effective multimodal information integration methods in e-commerce. To address these challenges, we introduce MMECInstruct, the first large-scale, high-quality multimodal instruction dataset designed specifically for e-commerce MFMs. MMECInstruct comprises 75,000 samples covering 7 real-world e-commerce tasks, supporting both in-domain (IND) and out-of-domain (OOD) evaluations. Leveraging MMECInstruct, we develop CASLIE, a lightweight framework that enhances multimodal information understanding and integration for e-commerce. Our comprehensive evaluation demonstrates that MMECInstruct endows CASLIE with advanced capability and strong generalizability in e-commerce applications. MMECInstruct and CASLIE models are publicly accessible through https://github.com/ninglab/CASLIE.

Co-authors

Keyan Ding 1

Bo Peng 1

Tong Xu 1

Venues

Fix author