Chuanghao Ding

2026

MiMIC: Mitigating Visual Modality Collapse in Universal Multimodal Retrieval While Avoiding Semantic Misalignment
Li Juan | Chuanghao Ding | Xujie Zhang | Cam-Tu Nguyen
Findings of the Association for Computational Linguistics: ACL 2026

Universal Multimodal Retrieval (UMR) aims to map different modalities (e.g., visual and textual) into a shared embedding space for multimodal retrieval. Existing UMR methods can be broadly divided into two categories: early-fusion approaches, such as Marvel, which project visual features into the language model (LM) space for integration with the text modality, and late-fusion approaches, such as UniVL-DR, which encode visual and textual inputs using separate encoders and obtain fused embeddings through addition. Our pilot study reveals that Marvel exhibits visual modality collapse, which is characterized by the model’s tendency to disregard visual features while depending excessively on textual cues. In contrast, although UniVL-DR is less affected by this issue, it is more susceptible to semantic misalignment, where semantically related content is positioned far apart in the embedding space. To address these challenges, we propose MiMIC, which introduces two key innovations: (1) a fusion-in-decoder architecture for effective multimodal integration, and (2) robust training through single-modality mix-in and random caption dropout. Experiments on the WebQA+ and EVQA+ datasets, where images in documents or queries might lack captions, indicate that MiMIC consistently outperforms both early- and late-fusion baselines.

2025

The synergistic mechanism based on Speculative Decoding (SD) has garnered considerable attention as a simple yet effective approach for accelerating the inference of large language models (LLMs). Nonetheless, high rejection rates require repeated LLM calls to validate draft tokens, undermining the overall efficiency gain of SD. In this work, we revisit existing verification mechanisms and propose a novel synergetic mechanism, Consultant Decoding (CD). CD achieves up to a 2.5-fold increase in inference speed compared to the target model, while maintaining comparable generation quality (~100% of the target model’s performance). Interestingly, this is achieved by combining models whose parameter sizes differ by two orders of magnitude. In addition, CD reduces the call frequency of the large target model to below 10%, particularly in more demanding tasks. CD’s performance was even found to surpass that of the large target model, which theoretically represents the upper bound for speculative decoding.

2024

CMR Scaling Law: Predicting Critical Mixture Ratios for Continual Pre-training of Language Models
Jiawei Gu | Zacc Yang | Chuanghao Ding | Rui Zhao | Fei Tan
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Large Language Models (LLMs) excel in diverse tasks but often underperform in specialized fields due to limited domain-specific or proprietary corpora. Continual pre-training (CPT) enhances LLM capabilities by imbuing new domain-specific or proprietary knowledge while replaying general corpus to prevent catastrophic forgetting. The data mixture ratio of general corpus and domain-specific corpus, however, has been chosen heuristically, leading to sub-optimal training efficiency in practice. In this context, we revisit the scaling behavior of LLMs under CPT and discover a power-law relationship between loss, mixture ratio, and training token scale. We formalize the trade-off between general and domain-specific capabilities, leading to a well-defined Critical Mixture Ratio (CMR) of general and domain data. By striking the balance, CMR maintains the model’s general ability and achieves the desired domain transfer, ensuring the highest utilization of available resources. Considering the balance between efficiency and effectiveness, CMR can be regarded as the optimal mixture ratio. Through extensive experiments, we ascertain the predictability of CMR, propose the CMR scaling law, and substantiate its generalization. These findings offer practical guidelines for optimizing LLM training in specialized domains, ensuring both general and domain-specific performance while efficiently managing training resources.

Venues

Findings: 2
EMNLP: 1
