Peng Zhou

2026

Multimodal Chemical Structure-Text Coreference in Intellectual Property via Rule-guided Reinforcement Learning
Hanmeng Zhong | Wentao Wu | Linqing Chen | Peng Zhou
Findings of the Association for Computational Linguistics: ACL 2026

Navigating biopharmaceutical intellectual property requires precisely associating visual chemical structures with their textual referents across lengthy documents. Despite its critical role in drug discovery, this multimodal coreference task remains underexplored. It presents unique challenges, including handling Markush structures and distinguishing atom-level differences between adjacent structures. To bridge this gap, we define the multimodal Chemical Structure-Text coreference task and introduce CheST, the first dataset explicitly designed for it. Furthermore, to satisfy the strict logical consistency the task requires, we propose RULER, a RULE-guided multimodal Reinforcement learning framework built upon an SFT cold start. RULER uses rule-driven reward functions that operationalize multidimensional consistency constraints and act as a domain-specific "verifier" for acquiring correct domain knowledge. Experimental results show that RULER achieves a 40% improvement over the strongest baseline, Gemini-2.5-Pro, demonstrating its efficacy.

The Dominance of Text Space: Unveiling the Asymmetric Nature of Cross-Modal Alignment in Large Language Models
Linqing Chen | Hanmeng Zhong | Wentao Wu | Peng Zhou
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recent advancements in Multimodal Large Language Models (MLLMs) have largely been driven by aligning visual encoders with pre-trained Large Language Models (LLMs). While effective, the geometric nature of this alignment remains under-explored. Existing methods often assume a symmetric interaction between visual and textual modalities, implying that both spaces adapt to each other. In this paper, we challenge this assumption and propose the "Text Space as Anchor" hypothesis. We argue that the semantic space of LLMs is rigid, anisotropic, and dominant; thus, effective cross-modal alignment may be an asymmetric projection of visual features onto this pre-existing text manifold without distorting it. We identify a potential issue in current parameter-efficient tuning paradigms where task-specific visual adjustments inadvertently disrupt the projector’s geometry, leading to "catastrophic forgetting" of the alignment mechanism itself. To address this, we introduce Anchor-Preserving Projection (APP), a novel method that regularizes the projector to maintain the geometric structure of the text embedding space via spectral filtering. Extensive experiments on 8 diverse cross-modal tasks and 3 pure language benchmarks demonstrate that APP preserves the LLM’s inherent linguistic capabilities (e.g., MMLU, GSM8K) and reduces object hallucination significantly better than standard fine-tuning methods. We release our code.

