Fengnan Li


2026

Existing datasets for evaluating text-to-image generation focus mostly on real-life images, which poses challenges for assessing academicfigure generation given real scientific captions, which is a hot topic in AI for Science. To fill the gap, we propose HE4AFG, a novel datasetwhich first provides a Holistic Evaluation for Academic caption-to-Figure Generation (AFG). Specifically, HE4AFG collects real figure captions from 8 scientific domains and finally generates 3,900 evaluation samples (particularly, including multi-panel figures) using 5 mainstream large multimodal models (LMMs). For each sample, we provide high-quality human ratings in terms of three aspects—scientific aesthetic (SA), topic relevance (TR), and attribute correctness (AC). Moreover, we present two trainable models: (1) HE4AFG-E, an automated Evaluation model for AFG, which generates aspect-aware training examples and then use them to train three aspect-specific evaluation modules via contrastive learning; (2) HE4AFG-R, an automated Refinement model, which generates and utilizes feedback on the quality of the figures (e.g., unfaithful elements) to continuously improve AFG. Extensive experiments on HE4AFG demonstrate the effectiveness and performance advantages of our models.

2025

Transformer-based models have achieved state-of-the-art performance in document classification but struggle with long-text processing due to the quadratic computational complexity in the self-attention module. Existing solutions, such as sparse attention, hierarchical models, and key sentence extraction, partially address the issue but still fall short when the input sequence is exceptionally lengthy. To address this challenge, we propose **IRIS** (**I**nterpretable **R**etrieval-Augmented Classification for long **I**nterspersed Document **S**equences), a novel, lightweight framework that utilizes retrieval to efficiently classify long documents while enhancing interpretability. IRIS segments documents into chunks, stores their embeddings in a vector database, and retrieves those most relevant to a given task using learnable query vectors. A linear attention mechanism then aggregates the retrieved embeddings for classification, allowing the model to process arbitrarily long documents without increasing computational cost and remaining trainable on a single GPU. Our experiments across six datasets show that IRIS achieves comparable performance to baseline models on standard benchmarks, and excels in three clinical note disease risk prediction tasks where documents are extremely long and key information is sparse. Furthermore, IRIS provides global interpretability by revealing a clear summary of key risk factors identified by the model. These findings highlight the potential of IRIS as an efficient and interpretable solution for long-document classification, particularly in healthcare applications where both performance and explainability are crucial.