@inproceedings{mullick-etal-2025-text,
    title = "Text Takes Over: A Study of Modality Bias in Multimodal Intent Detection",
    author = "Mullick, Ankan  and
      Sharma, Saransh  and
      Jana, Abhik  and
      Goyal, Pawan",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1226/",
    pages = "24039--24069",
    ISBN = "979-8-89176-332-6",
    abstract = "The rise of multimodal data, integrating text, audio, and visuals, has created new opportunities for studying multimodal tasks such as intent detection. This work investigates the effectiveness of Large Language Models (LLMs) and non-LLMs, including text-only and multimodal models, in the multimodal intent detection task. Our study reveals that Mistral-7B, a text-only LLM, outperforms most competitive multimodal models by approximately 9{\%} on MIntRec-1 and 4{\%} on the MIntRec2.0 dataset. This performance advantage comes from a strong textual bias in these datasets, where over 90{\%} of the samples require textual input, either alone or in combination with other modalities, for correct classification. We confirm the modality bias of these datasets via human evaluation, too. Next, we propose a framework to debias the datasets, and upon debiasing, more than 70{\%} of the samples in MIntRec-1 and more than 50{\%} in MIntRec2.0 get removed, resulting in significant performance degradation across all models, with smaller multimodal fusion models being the most affected with an accuracy drop of over 50--60{\%}. Further, we analyze the context-specific relevance of different modalities through empirical analysis. Our findings highlight the challenges posed by modality bias in multimodal intent datasets and emphasize the need for unbiased datasets to evaluate multimodal models effectively. We release both the code and the dataset used for this work at https://github.com/Text-Takes-Over-EMNLP-2025/MultiModal-Intent-EMNLP-2025."
}