2024
i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data
Ziyi Yang | Mahmoud Khademi | Yichong Xu | Reid Pryzant | Yuwei Fang | Chenguang Zhu | Dongdong Chen | Yao Qian | Xuemei Gao | Yi-Ling Chen | Robert Gmyr | Naoyuki Kanda | Noel Codella | Bin Xiao | Yu Shi | Lu Yuan | Takuya Yoshioka | Michael Zeng | Xuedong Huang
Findings of the Association for Computational Linguistics: NAACL 2024
The convergence of text, visual, and audio data is crucial for human-like artificial intelligence; however, the current Vision-Language-Speech landscape is dominated by encoder-only models that lack generative abilities. We propose closing this gap with i-Code V2, one of the first models capable of generating natural language from any combination of Vision, Language, and Speech data. i-Code V2 leverages state-of-the-art single-modality encoders, combining their outputs with a new modality-fusing encoder that projects combinations of modalities into a shared representational space. Language tokens are generated from these representations by an autoregressive decoder. i-Code V2 is pretrained end-to-end on a large collection of dual- and single-modality datasets with a novel text completion objective that generalizes across arbitrary combinations of modalities. i-Code V2 matches or outperforms state-of-the-art single- and dual-modality baselines on 7 multimodal tasks, demonstrating the power of generative multimodal pretraining across a diversity of tasks and signals.
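As a rough illustration of the pipeline the abstract describes (single-modality encoders, a modality-fusing encoder over a shared representation space, and an autoregressive language decoder), the following PyTorch sketch shows one plausible wiring. All names, dimensions, the linear stand-ins for the pretrained encoders, and the simple concatenation-based fusion are assumptions for illustration only, not the paper's actual architecture.

```python
# Minimal sketch of a fuse-then-decode multimodal generator.
# Assumptions: 768-dim precomputed vision/speech features, a 512-dim shared
# space, and plain nn.Transformer blocks standing in for the real encoders.
import torch
import torch.nn as nn


class FuseThenDecodeSketch(nn.Module):
    def __init__(self, d_model=512, vocab_size=32000, n_layers=4, n_heads=8):
        super().__init__()
        # Placeholder projections; the paper uses pretrained single-modality encoders.
        self.vision_proj = nn.Linear(768, d_model)
        self.speech_proj = nn.Linear(768, d_model)
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Modality-fusing encoder over the concatenated modality sequences.
        fusion_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, n_layers)
        # Autoregressive decoder that generates language tokens from the fused memory.
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, vision_feats, speech_feats, text_ids, target_ids):
        # Project each available modality into the shared space and concatenate.
        fused_input = torch.cat(
            [self.vision_proj(vision_feats),
             self.speech_proj(speech_feats),
             self.text_embed(text_ids)], dim=1)
        memory = self.fusion(fused_input)
        # Teacher-forced autoregressive decoding with a causal mask.
        tgt = self.text_embed(target_ids)
        seq_len = tgt.size(1)
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(hidden)  # logits over the text vocabulary


model = FuseThenDecodeSketch()
logits = model(torch.randn(2, 10, 768),              # vision features
               torch.randn(2, 20, 768),              # speech features
               torch.randint(0, 32000, (2, 15)),     # input text ids
               torch.randint(0, 32000, (2, 12)))     # target text ids
print(logits.shape)  # torch.Size([2, 12, 32000])
```

In this sketch, dropping a modality simply means omitting its sequence from the concatenation before the fusion encoder, which mirrors how a text completion objective could be applied to arbitrary modality combinations.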