Yu-Tong Cao
2026
MTIVE: Multi-Task Image Verification Engine Using Vision-Language Models for E-commerce
Yu-Tong Cao | Vishnu Prabhakaran | Arunita Das | Purav Aggarwal | Anoop Saladi
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Vision-language models show promise for e-commerce automation but struggle with noisy real-world images and multi-task requirements. We introduce MTIVE, a curriculum learning framework that progressively adapts base models through three stages: continued pre-training on large-scale e-commerce datasets with contrastive learning and diverse dialogue templates, instruction tuning on synthetic data, and modular task-specific expert training. Our architecture uses frozen base weights with stacked LoRA adapters—shared modules for domain knowledge and lightweight task-specific experts—enabling continual learning without catastrophic forgetting. MTIVE outperforms open-source and proprietary baselines in both standard and continual learning settings.
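The adapter layout described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the class names (`LoRAAdapter`, `StackedLoRALinear`), the task labels, the dimensions, and the choice to combine the shared and task-specific adapters by simple addition are all assumptions made for illustration. The sketch only shows why keeping the base weights frozen and giving each task its own low-rank expert leaves earlier tasks unaffected when a new expert is added.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def add(u, v):
    """Element-wise sum of two equal-length vectors."""
    return [a + b for a, b in zip(u, v)]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Matrix of small Gaussian values drawn from the given RNG."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class LoRAAdapter:
    """Low-rank update: delta(x) = (alpha / rank) * B (A x)."""

    def __init__(self, d_in, d_out, rank, rng, alpha=1.0):
        self.A = rand_matrix(rank, d_in, rng)          # down-projection
        self.B = [[0.0] * rank for _ in range(d_out)]  # up-projection, zero-initialised
        self.scale = alpha / rank

    def delta(self, x):
        return [self.scale * y for y in matvec(self.B, matvec(self.A, x))]


class StackedLoRALinear:
    """Frozen linear layer plus one shared adapter and per-task expert adapters.

    Assumption: adapters are combined additively; the paper may stack them differently.
    """

    def __init__(self, d_in, d_out, rank, seed=0):
        self.d_in, self.d_out, self.rank = d_in, d_out, rank
        self.rng = random.Random(seed)
        self.W = rand_matrix(d_out, d_in, self.rng)  # base weights, never updated
        self.shared = LoRAAdapter(d_in, d_out, rank, self.rng)  # domain knowledge
        self.experts = {}  # task name -> lightweight expert adapter

    def add_expert(self, task):
        """Register a new task expert without touching existing parameters."""
        self.experts[task] = LoRAAdapter(self.d_in, self.d_out, self.rank, self.rng)
        return self.experts[task]

    def forward(self, x, task=None):
        y = add(matvec(self.W, x), self.shared.delta(x))
        if task is not None:
            y = add(y, self.experts[task].delta(x))
        return y


# Hypothetical usage: two tasks added one after the other.
layer = StackedLoRALinear(d_in=4, d_out=3, rank=2, seed=1)
x = [1.0, -0.5, 0.25, 2.0]

expert_a = layer.add_expert("task_a")
expert_a.B = rand_matrix(3, 2, layer.rng)  # stand-in for training task_a
out_a_before = layer.forward(x, task="task_a")

expert_b = layer.add_expert("task_b")
expert_b.B = rand_matrix(3, 2, layer.rng)  # stand-in for training task_b
out_a_after = layer.forward(x, task="task_a")

print(out_a_before == out_a_after)
```

Because a new expert only adds parameters and the forward pass for an existing task never reads them, the output for `task_a` is identical before and after `task_b` is registered. In a real system the shared adapter would also have to stay fixed (or be updated carefully) during later stages for this guarantee to hold.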