Arsh Keshari
2026
Grounded Multimodal In-Context Learning for Product Weight Estimation at Scale in E-commerce
Bhavuk Singhal | Arsh Keshari | Ravindra Kumar Yadav
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Accurately inferring implicit physical attributes of products, such as weight, is critical for large-scale e-commerce logistics but challenging due to sparse or unreliable textual metadata and high visual variability. We formulate weight estimation as a grounded multimodal reasoning problem and investigate whether large vision-language models (LVLMs) can infer discretized weight buckets through in-context learning (ICL) over product images and descriptions. We introduce a scalable inference framework that conditions predictions on automatically retrieved, category-specific exemplars and propose a distribution-calibrated retrieval strategy that aligns few-shot contexts with the empirical weight distribution of each product sub-category. This calibration substantially improves few-shot multimodal reasoning compared to random or embedding-based retrieval baselines. Across 14 high-variance categories, our approach significantly outperforms strong multimodal KNN baselines in both exact-match accuracy and near-bucket reliability. Deployed in production on a large e-commerce platform, our system processes millions of listings daily and reduces shipping-related revenue leakage by ∼22%, demonstrating that multimodal ICL can serve as a practical and cost-effective alternative to manual or hardware-based verification.
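The abstract does not spell out how the distribution-calibrated retrieval is computed. One plausible reading, offered only as a minimal sketch, is to split the k few-shot slots across weight buckets in proportion to each bucket's empirical frequency within the sub-category, then draw exemplars from each bucket. The function names (`allocate_slots`, `select_exemplars`), the largest-remainder rounding, the uniform within-bucket sampling, and the bucket labels are all assumptions made for illustration, not details taken from the paper.

```python
import random
from collections import defaultdict


def allocate_slots(bucket_counts, k):
    """Split k exemplar slots across buckets in proportion to their counts.

    Uses largest-remainder rounding with integer arithmetic (an assumed
    choice; the abstract does not describe the rounding scheme).
    """
    total = sum(bucket_counts.values())
    if not 0 < k <= total:
        raise ValueError("k must be between 1 and the pool size")
    # Integer part of each bucket's proportional quota k * count / total.
    slots = {b: (k * c) // total for b, c in bucket_counts.items()}
    leftover = k - sum(slots.values())
    # Give leftover slots to the largest fractional remainders;
    # ties are broken by bucket label so the result is deterministic.
    order = sorted(bucket_counts, key=lambda b: (-((k * bucket_counts[b]) % total), b))
    for b in order[:leftover]:
        slots[b] += 1
    return slots


def select_exemplars(pool, k, seed=0):
    """Pick k (item_id, bucket) exemplars whose bucket mix mirrors the pool.

    `pool` is a list of (item_id, bucket) pairs for one product sub-category
    (a hypothetical data layout).
    """
    rng = random.Random(seed)
    by_bucket = defaultdict(list)
    for item_id, bucket in pool:
        by_bucket[bucket].append(item_id)
    counts = {b: len(ids) for b, ids in by_bucket.items()}
    slots = allocate_slots(counts, k)
    chosen = []
    for bucket in sorted(slots):
        # Uniform sampling within a bucket; a similarity-ranked pick
        # could be substituted here.
        for item_id in rng.sample(by_bucket[bucket], slots[bucket]):
            chosen.append((item_id, bucket))
    return chosen


if __name__ == "__main__":
    # Toy sub-category: 5 light, 3 medium, 2 heavy labeled listings.
    pool = (
        [(f"light-{i}", "light") for i in range(5)]
        + [(f"medium-{i}", "medium") for i in range(3)]
        + [(f"heavy-{i}", "heavy") for i in range(2)]
    )
    for item_id, bucket in select_exemplars(pool, k=4):
        print(item_id, bucket)
```

Under this reading, the few-shot context shown to the LVLM reflects how often each weight bucket actually occurs in the sub-category, rather than whatever mix random or nearest-neighbor retrieval happens to return.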