2022
Cross-Modal Discrete Representation Learning
Alexander Liu | SouYoung Jin | Cheng-I Lai | Andrew Rouditchenko | Aude Oliva | James Glass
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
In contrast to recent advances focusing on high-level representation learning across modalities, in this work we present a self-supervised learning framework that learns representations capturing finer levels of granularity across modalities, such as concepts or events represented by visual objects or spoken words. Our framework relies on a discretized embedding space, created via vector quantization, that is shared across different modalities. Beyond the shared embedding space, we propose a Cross-Modal Code Matching objective that forces representations from different views (modalities) to have similar distributions over the discrete embedding space, so that cross-modal object/action localization can be performed without direct supervision. We show that the proposed discretized multi-modal fine-grained representation (e.g., pixel/word/frame) can complement high-level summary representations (e.g., video/sentence/waveform) for improved performance on cross-modal retrieval tasks. We also observe that the discretized representation uses individual clusters to represent the same semantic concept across modalities.
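The core ideas in this abstract — a vector-quantized codebook shared across modalities, and a matching objective that pulls the per-modality code-usage distributions together — can be illustrated with a minimal numpy sketch. All names, dimensions, and the symmetric-KL loss below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared discrete codebook: K codewords of dimension D, used by both modalities.
K, D = 8, 4
codebook = rng.normal(size=(K, D))

def quantize(features):
    """Vector quantization: map each feature vector to its nearest codeword."""
    # dists has shape (N, K): squared Euclidean distance to every codeword.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)  # codeword index per feature vector

def code_distribution(codes):
    """Empirical distribution over the K codewords for one view/modality."""
    counts = np.bincount(codes, minlength=K).astype(float)
    return counts / counts.sum()

# Two "views" of the same content (e.g., visual frames and spoken words).
video_feats = rng.normal(size=(32, D))
audio_feats = rng.normal(size=(32, D))

p = code_distribution(quantize(video_feats))
q = code_distribution(quantize(audio_feats))

# Cross-modal matching loss: penalize divergence between the two code-usage
# distributions (a smoothed symmetric KL here; the paper's objective may differ).
eps = 1e-8
kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
matching_loss = 0.5 * (kl(p, q) + kl(q, p))
```

Minimizing a loss like `matching_loss` encourages both modalities to route semantically matching content through the same discrete clusters, which is what makes unsupervised cross-modal localization possible.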
SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities
Hsiang-Sheng Tsai | Heng-Jui Chang | Wen-Chin Huang | Zili Huang | Kushal Lakhotia | Shu-wen Yang | Shuyan Dong | Andy Liu | Cheng-I Lai | Jiatong Shi | Xuankai Chang | Phil Hall | Hsuan-Jui Chen | Shang-Wen Li | Shinji Watanabe | Abdelrahman Mohamed | Hung-yi Lee
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Transfer learning has proven crucial to advancing the state of speech and natural language processing research in recent years. In speech, a model pre-trained by self-supervised learning transfers remarkably well to multiple tasks. However, the lack of a consistent evaluation methodology limits a holistic understanding of the efficacy of such models. SUPERB was a step towards introducing a common benchmark for evaluating pre-trained models across various speech tasks. In this paper, we introduce SUPERB-SG, a new benchmark focusing on evaluating the semantic and generative capabilities of pre-trained models by increasing task diversity and difficulty over SUPERB. We use a lightweight methodology to test the robustness of representations learned by pre-trained models under shifts in data domain and quality across different types of tasks. It entails freezing the pre-trained model's parameters and training only simple task-specific heads. The goal is to be inclusive of all researchers and to encourage efficient use of computational resources. We also show that the task diversity of SUPERB-SG, coupled with limited task supervision, is an effective recipe for evaluating the generalizability of model representations.
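The evaluation recipe described here — freeze the pre-trained model, train only a small task head — can be sketched in a few lines of numpy. The "encoder" below is a fixed random projection standing in for a real self-supervised model, and the task, dimensions, and learning rate are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen "pre-trained encoder": a fixed random projection whose
# weights are never updated during downstream training.
D_IN, D_REP, N_CLASSES = 16, 8, 3
W_frozen = rng.normal(size=(D_IN, D_REP))

def encode(x):
    """Frozen feature extractor: no gradient updates touch W_frozen."""
    return np.tanh(x @ W_frozen)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Toy downstream task: labels are a hidden linear function of the frozen
# features, so a lightweight head can in principle recover them.
X = rng.normal(size=(64, D_IN))
H = encode(X)  # computed once and treated as fixed input to the head
y = (H @ rng.normal(size=(D_REP, N_CLASSES))).argmax(axis=1)
Y = np.eye(N_CLASSES)[y]

# Lightweight task-specific head: the ONLY trainable parameters.
W_head = np.zeros((D_REP, N_CLASSES))

def loss(W):
    return -np.mean(np.sum(Y * np.log(softmax(H @ W) + 1e-12), axis=1))

loss_before = loss(W_head)
lr = 0.5
for _ in range(300):
    P = softmax(H @ W_head)
    W_head -= lr * (H.T @ (P - Y)) / len(X)  # cross-entropy gradient, head only

loss_after = loss(W_head)
train_acc = (softmax(H @ W_head).argmax(axis=1) == y).mean()
```

Because only `W_head` is trained, comparing different frozen encoders under the same head isolates the quality of the pre-trained representation itself, which is the point of the lightweight methodology.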
2019
Controlling the Reading Level of Machine Translation Output
Kelly Marchisio | Jialiang Guo | Cheng-I Lai | Philipp Koehn
Proceedings of Machine Translation Summit XVII: Research Track