Zhouhan Lin


2022

pdf
Revisit Systematic Generalization via Meaningful Learning
Ning Shi | Boxin Wang | Wei Wang | Xiangyu Liu | Zhouhan Lin
Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

Humans can systematically generalize to novel compositions of existing concepts. Recent studies argue that neural networks appear inherently ineffective in such cognitive capacity, leading to a pessimistic view and a lack of attention to optimistic results. We revisit this controversial topic from the perspective of meaningful learning, an exceptional capability of humans to learn novel concepts by connecting them with known ones. We reassess the compositional skills of sequence-to-sequence models conditioned on the semantic links between new and old concepts. Our observations suggest that models can successfully one-shot generalize to novel concepts and compositions through semantic linking, either inductively or deductively. We demonstrate that prior knowledge plays a key role as well. In addition to synthetic tests, we further conduct proof-of-concept experiments in machine translation and semantic parsing, showing the benefits of meaningful learning in applications. We hope our positive findings will encourage excavating modern neural networks’ potential in systematic generalization through more advanced learning schemes.

pdf
Text Editing as Imitation Game
Ning Shi | Bin Tang | Bo Yuan | Longtao Huang | Yewen Pu | Jie Fu | Zhouhan Lin
Findings of the Association for Computational Linguistics: EMNLP 2022

Text editing, such as grammatical error correction, arises naturally from imperfect textual data. Recent works frame text editing as a multi-round sequence tagging task, where operations – such as insertion and substitution – are represented as a sequence of tags. While achieving good results, this encoding is limited in flexibility as all actions are bound to token-level tags. In this work, we reformulate text editing as an imitation game using behavioral cloning. Specifically, we convert conventional sequence-to-sequence data into state-to-action demonstrations, where the action space can be as flexible as needed. Instead of generating the actions one at a time, we introduce a dual decoders structure to parallel the decoding while retaining the dependencies between action tokens, coupled with trajectory augmentation to alleviate the distribution shift that imitation learning often suffers. In experiments on a suite of Arithmetic Equation benchmarks, our model consistently outperforms the autoregressive baselines in terms of performance, efficiency, and robustness. We hope our findings will shed light on future studies in reinforcement learning applying sequence-level action generation to natural language processing.

pdf
Syntax-guided Localized Self-attention by Constituency Syntactic Distance
Shengyuan Hou | Jushi Kai | Haotian Xue | Bingyu Zhu | Bo Yuan | Longtao Huang | Xinbing Wang | Zhouhan Lin
Findings of the Association for Computational Linguistics: EMNLP 2022

Recent works have revealed that Transformers are implicitly learning the syntactic information in its lower layers from data, albeit is highly dependent on the quality and scale of the training data. However, learning syntactic information from data is not necessary if we can leverage an external syntactic parser, which provides better parsing quality with well-defined syntactic structures. This could potentially improve Transformer’s performance and sample efficiency. In this work, we propose a syntax-guided localized self-attention for Transformer that allows directly incorporating grammar structures from an external constituency parser. It prohibits the attention mechanism to overweight the grammatically distant tokens over close ones. Experimental results show that our model could consistently improve translation performance on a variety of machine translation datasets, ranging from small to large dataset sizes, and with different source languages.

pdf
Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition
Xichen Pan | Peiyu Chen | Yichen Gong | Helong Zhou | Xinbing Wang | Zhouhan Lin
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Training Transformer-based models demands a large amount of data, while obtaining aligned and labelled data in multimodality is rather cost-demanding, especially for audio-visual speech recognition (AVSR). Thus it makes a lot of sense to make use of unlabelled unimodal data. On the other side, although the effectiveness of large-scale self-supervised learning is well established in both audio and visual modalities, how to integrate those pre-trained models into a multimodal scenario remains underexplored. In this work, we successfully leverage unimodal self-supervised learning to promote the multimodal AVSR. In particular, audio and visual front-ends are trained on large-scale unimodal datasets, then we integrate components of both front-ends into a larger multimodal framework which learns to recognize parallel audio-visual data into characters through a combination of CTC and seq2seq decoding. We show that both components inherited from unimodal self-supervised learning cooperate well, resulting in that the multimodal framework yields competitive results through fine-tuning. Our model is experimentally validated on both word-level and sentence-level tasks. Especially, even without an external language model, our proposed model raises the state-of-the-art performances on the widely accepted Lip Reading Sentences 2 (LRS2) dataset by a large margin, with a relative improvement of 30%.

pdf
Transkimmer: Transformer Learns to Layer-wise Skim
Yue Guan | Zhengyi Li | Jingwen Leng | Zhouhan Lin | Minyi Guo
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Transformer architecture has become the de-facto model for many machine learning tasks from natural language processing and computer vision. As such, improving its computational efficiency becomes paramount. One of the major computational inefficiency of Transformer based models is that they spend the identical amount of computation throughout all layers. Prior works have proposed to augment the Transformer model with the capability of skimming tokens to improve its computational efficiency. However, they suffer from not having effectual and end-to-end optimization of the discrete skimming predictor. To address the above limitations, we propose the Transkimmer architecture, which learns to identify hidden state tokens that are not required by each layer. The skimmed tokens are then forwarded directly to the final output, thus reducing the computation of the successive layers. The key idea in Transkimmer is to add a parameterized predictor before each layer that learns to make the skimming decision. We also propose to adopt reparameterization trick and add skim loss for the end-to-end training of Transkimmer. Transkimmer achieves 10.97x average speedup on GLUE benchmark compared with vanilla BERT-base baseline with less than 1% accuracy degradation.

pdf
RASAT: Integrating Relational Structures into Pretrained Seq2Seq Model for Text-to-SQL
Jiexing Qi | Jingyao Tang | Ziwei He | Xiangpeng Wan | Yu Cheng | Chenghu Zhou | Xinbing Wang | Quanshi Zhang | Zhouhan Lin
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Relational structures such as schema linking and schema encoding have been validated as a key component to qualitatively translating natural language into SQL queries. However, introducing these structural relations comes with prices: they often result in a specialized model structure, which largely prohibits using large pretrained models in text-to-SQL. To address this problem, we propose RASAT: a Transformer seq2seq architecture augmented with relation-aware self-attention that could leverage a variety of relational structures while inheriting the pretrained parameters from the T5 model effectively. Our model can incorporate almost all types of existing relations in the literature, and in addition, we propose introducing co-reference relations for the multi-turn scenario. Experimental results on three widely used text-to-SQL datasets, covering both single-turn and multi-turn scenarios, have shown that RASAT could achieve competitive results in all three benchmarks, achieving state-of-the-art execution accuracy (75.5% EX on Spider, 52.6% IEX on SParC, and 37.4% IEX on CoSQL).

pdf
Proceedings of the Workshop on Unimodal and Multimodal Induction of Linguistic Structures (UM-IoS)
Wenjuan Han | Zilong Zheng | Zhouhan Lin | Lifeng Jin | Yikang Shen | Yoon Kim | Kewei Tu
Proceedings of the Workshop on Unimodal and Multimodal Induction of Linguistic Structures (UM-IoS)

2021

pdf
Annotation Inconsistency and Entity Bias in MultiWOZ
Kun Qian | Ahmad Beirami | Zhouhan Lin | Ankita De | Alborz Geramifard | Zhou Yu | Chinnadhurai Sankar
Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue

MultiWOZ (Budzianowski et al., 2018) is one of the most popular multi-domain taskoriented dialog datasets, containing 10K+ annotated dialogs covering eight domains. It has been widely accepted as a benchmark for various dialog tasks, e.g., dialog state tracking (DST), natural language generation (NLG) and end-to-end (E2E) dialog modeling. In this work, we identify an overlooked issue with dialog state annotation inconsistencies in the dataset, where a slot type is tagged inconsistently across similar dialogs leading to confusion for DST modeling. We propose an automated correction for this issue, which is present in 70% of the dialogs. Additionally, we notice that there is significant entity bias in the dataset (e.g., “cambridge” appears in 50% of the destination cities in the train domain). The entity bias can potentially lead to named entity memorization in generative models, which may go unnoticed as the test set suffers from a similar entity bias as well. We release a new test set with all entities replaced with unseen entities. Finally, we benchmark joint goal accuracy (JGA) of the state-of-theart DST baselines on these modified versions of the data. Our experiments show that the annotation inconsistency corrections lead to 7-10% improvement in JGA. On the other hand, we observe a 29% drop in JGA when models are evaluated on the new test set with unseen entities.

2020

pdf
Exploiting Syntactic Structure for Better Language Modeling: A Syntactic Distance Approach
Wenyu Du | Zhouhan Lin | Yikang Shen | Timothy J. O’Donnell | Yoshua Bengio | Yue Zhang
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

It is commonly believed that knowledge of syntactic structure should improve language modeling. However, effectively and computationally efficiently incorporating syntactic structure into neural language models has been a challenging topic. In this paper, we make use of a multi-task objective, i.e., the models simultaneously predict words as well as ground truth parse trees in a form called “syntactic distances”, where information between these two separate objectives shares the same intermediate representation. Experimental results on the Penn Treebank and Chinese Treebank datasets show that when ground truth parse trees are provided as additional training signals, the model is able to achieve lower perplexity and induce trees with better quality.

2019

pdf
Interactive Language Learning by Question Answering
Xingdi Yuan | Marc-Alexandre Côté | Jie Fu | Zhouhan Lin | Chris Pal | Yoshua Bengio | Adam Trischler
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Humans observe and interact with the world to acquire knowledge. However, most existing machine reading comprehension (MRC) tasks miss the interactive, information-seeking component of comprehension. Such tasks present models with static documents that contain all necessary information, usually concentrated in a single short substring. Thus, models can achieve strong performance through simple word- and phrase-based pattern matching. We address this problem by formulating a novel text-based question answering task: Question Answering with Interactive Text (QAit). In QAit, an agent must interact with a partially observable text-based environment to gather information required to answer questions. QAit poses questions about the existence, location, and attributes of objects found in the environment. The data is built using a text-based game generator that defines the underlying dynamics of interaction with the environment. We propose and evaluate a set of baseline models for the QAit task that includes deep reinforcement learning agents. Experiments show that the task presents a major challenge for machine reading systems, while humans solve it with relative ease.

2018

pdf
Straight to the Tree: Constituency Parsing with Neural Syntactic Distance
Yikang Shen | Zhouhan Lin | Athul Paul Jacob | Alessandro Sordoni | Aaron Courville | Yoshua Bengio
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In this work, we propose a novel constituency parsing scheme. The model first predicts a real-valued scalar, named syntactic distance, for each split position in the sentence. The topology of grammar tree is then determined by the values of syntactic distances. Compared to traditional shift-reduce parsing schemes, our approach is free from the potentially disastrous compounding error. It is also easier to parallelize and much faster. Our model achieves the state-of-the-art single model F1 score of 92.1 on PTB and 86.4 on CTB dataset, which surpasses the previous single model results by a large margin.

pdf
Learning Hierarchical Structures On-The-Fly with a Recurrent-Recursive Model for Sequences
Athul Paul Jacob | Zhouhan Lin | Alessandro Sordoni | Yoshua Bengio
Proceedings of the Third Workshop on Representation Learning for NLP

We propose a hierarchical model for sequential data that learns a tree on-the-fly, i.e. while reading the sequence. In the model, a recurrent network adapts its structure and reuses recurrent weights in a recursive manner. This creates adaptive skip-connections that ease the learning of long-term dependencies. The tree structure can either be inferred without supervision through reinforcement learning, or learned in a supervised manner. We provide preliminary experiments in a novel Math Expression Evaluation (MEE) task, which is created to have a hierarchical tree structure that can be used to study the effectiveness of our model. Additionally, we test our model in a well-known propositional logic and language modelling tasks. Experimental results have shown the potential of our approach.