Prior work in semantic parsing has shown that conventional seq2seq models fail at compositional generalization tasks. This limitation led to a resurgence of methods that model alignments between sentences and their corresponding meaning representations, either implicitly through latent variables or explicitly by taking advantage of alignment annotations. We take the second direction and propose TPol, a two-step approach that first translates input sentences monotonically and then reorders them to obtain the correct output. This is achieved with a modular framework comprising a Translator and a Reorderer component. We test our approach on two popular semantic parsing datasets. Our experiments show that, by means of the monotonic translations, TPol can learn reliable lexico-logical patterns from aligned data, significantly improving compositional generalization over both conventional seq2seq models and other approaches that exploit gold alignments.
Before the advent of deep learning, the semantic parsing community was interested in understanding and modeling the range of possible word alignments between natural language sentences and their corresponding meaning representations. Sequence-to-sequence models changed the research landscape, suggesting that we no longer need to worry about alignments since they can be learned automatically by means of an attention mechanism. More recently, researchers have started to question this premise. In this work, we investigate whether seq2seq models can handle both simple and complex alignments. To answer this question, we augment the popular Geo semantic parsing dataset with alignment annotations and create Geo-Aligned. We then study the performance of standard seq2seq models on the examples that can be aligned monotonically versus examples that require more complex alignments. Our empirical study shows that performance is significantly better on monotonic alignments.
We address the annotation data bottleneck for sequence classification. Specifically, we ask the question: given a budget of N annotations, which samples should be selected for annotation? The solution we propose looks for diversity in the selected sample by maximizing the amount of information that is useful for the learning algorithm, or equivalently by minimizing the redundancy of samples in the selection. This is formulated in the context of spectral learning of recurrent functions for sequence classification. Our method represents unlabeled data in the form of a Hankel matrix, and uses the notion of spectral max-volume to find a compact sub-block from which annotation samples are drawn. Experiments on sequence classification confirm that our spectral sampling strategy is in fact efficient and yields good models.
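To make the Hankel representation mentioned above concrete, here is a minimal sketch of one common construction. It assumes rows are indexed by prefixes, columns by suffixes, and each entry counts how often the concatenation of the two occurs as a full string in the sample. The function name `build_hankel` and the toy data are hypothetical, and the max-volume selection step is not shown.

```python
from collections import Counter

def build_hankel(samples, prefixes, suffixes):
    """Return a dense Hankel block where H[i][j] is the number of
    sample strings equal to prefixes[i] + suffixes[j]."""
    counts = Counter(samples)
    return [[counts[p + s] for s in suffixes] for p in prefixes]

# Toy unlabeled sample (hypothetical data, for illustration only).
samples = ["ab", "ab", "a", "abb", "b"]
prefixes = ["", "a", "ab"]
suffixes = ["", "b", "bb"]

hankel = build_hankel(samples, prefixes, suffixes)
for prefix, row in zip(prefixes, hankel):
    print(repr(prefix), row)
```

Entries indexed by different prefix and suffix pairs that spell the same string are equal by construction (for example, `"a" + "b"` and `"ab" + ""`), which is the redundancy a sub-block selection can exploit.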
We compare a classical CNN architecture for sequence classification involving several convolutional and max-pooling layers against a simple model based on weighted finite state automata (WFA). Each model has its advantages and disadvantages and it is possible that they could be combined. However, we believe that the first research goal should be to investigate and understand how these two apparently dissimilar models compare in the context of specific natural language processing tasks. This paper is the first step towards that goal. Our experiments with five sequence classification datasets suggest that, despite the apparent simplicity of WFA models and training algorithms, the performance of WFAs is comparable to that of the CNNs.
Spectral models for learning weighted non-deterministic automata have nice theoretical and algorithmic properties. Despite this, it has been challenging to obtain competitive results in language modeling tasks, for two main reasons. First, in order to capture long-range dependencies of the data, the method must use statistics from long substrings, which results in very large matrices that are difficult to decompose. Second, the loss function behind spectral learning, based on moment matching, differs from the probabilistic metrics used to evaluate language models. In this work we employ a technique for scaling up spectral learning, and use interpolated predictions that are optimized to minimize perplexity. Our experiments in character-based language modeling show that our method matches the performance of state-of-the-art n-gram models, while being very fast to train.
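As an illustration of interpolated predictions tuned for perplexity, the sketch below linearly mixes the next-symbol distributions of two predictors and picks the mixing weight by grid search on a held-out sequence. The abstract does not specify the interpolation scheme, so the linear mixture, the grid search, and all names and numbers here are assumptions.

```python
import math

def interpolate(p_a, p_b, lam):
    """Mix two next-symbol distributions, with weight lam on the first."""
    return {sym: lam * p_a[sym] + (1.0 - lam) * p_b[sym] for sym in p_a}

def perplexity(probs):
    """Perplexity given the probability assigned to each observed symbol."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

# Hypothetical per-position predictions of two models on the held-out string "ab".
model_a = [{"a": 0.9, "b": 0.1}, {"a": 0.8, "b": 0.2}]
model_b = [{"a": 0.5, "b": 0.5}, {"a": 0.5, "b": 0.5}]
observed = "ab"

def heldout_perplexity(lam):
    """Perplexity of the interpolated model on the held-out string."""
    probs = [interpolate(pa, pb, lam)[sym]
             for pa, pb, sym in zip(model_a, model_b, observed)]
    return perplexity(probs)

# Lower perplexity is better, so the weight is chosen by minimization.
best_lam = min([i / 10 for i in range(11)], key=heldout_perplexity)
print(best_lam, heldout_perplexity(best_lam))
```

In this toy setting the mixture beats both individual predictors on the held-out string, which is the usual motivation for interpolation.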
We present a low-rank multi-linear model for the task of solving prepositional phrase attachment ambiguity (PP task). Our model exploits tensor products of word embeddings, capturing all possible conjunctions of latent embeddings. Our results on a wide range of datasets and task settings show that tensor products are the best compositional operation and that a relatively simple multi-linear model that uses only word embeddings of lexical features can outperform more complex non-linear architectures that exploit the same information. Our proposed model gives the current best reported performance on an out-of-domain evaluation and performs competitively on out-of-domain dependency parsing datasets.
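The idea of scoring over tensor products of embeddings with a low-rank parameterization can be sketched as follows. This is a simplified two-way (head, modifier) illustration, not the model from the abstract: the function names, the factor matrices `U` and `V`, and the toy vectors are all assumptions. It shows that a weight matrix written as a sum of rank-one terms gives the same score as the full product with the outer-product features, without materializing them.

```python
def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

def outer(u, v):
    """Outer (tensor) product: every conjunction of coordinates of u and v."""
    return [[ui * vj for vj in v] for ui in u]

def score_full(W, head, mod):
    """Inner product between W and the outer product of head and mod."""
    feats = outer(head, mod)
    return sum(W[i][j] * feats[i][j]
               for i in range(len(head)) for j in range(len(mod)))

def score_low_rank(U, V, head, mod):
    """Same score when W is the sum over r of outer(U[r], V[r])."""
    return sum(dot(u, head) * dot(v, mod) for u, v in zip(U, V))

# Hypothetical rank-2 factors and toy embeddings.
U = [[1, 0], [0, 2]]
V = [[1, 1], [0, 1]]
W = [[sum(u[i] * v[j] for u, v in zip(U, V)) for j in range(2)]
     for i in range(2)]
head, mod = [1, 2], [3, 4]

print(score_full(W, head, mod), score_low_rank(U, V, head, mod))
```

The low-rank form costs one pair of inner products per rank-one term, which is why it scales to larger embeddings than the full tensor-product feature space.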
Event Schema Induction is the task of learning a representation of events (e.g., bombing) and the roles involved in them (e.g., victim and perpetrator). This paper presents InToEventS, an interactive tool for learning these schemas. InToEventS allows users to explore a corpus and discover which kinds of events are present. We show how users can create useful event schemas using two interactive clustering steps.
In recent years we have seen the development of efficient and provably correct algorithms for learning weighted automata and closely related function classes such as weighted transducers and weighted context-free grammars. The common denominator of all these algorithms is the so-called spectral method, which gives an efficient and robust way to estimate recursively defined functions from empirical estimations of observable statistics. These algorithms are appealing because of the existence of theoretical guarantees (e.g. they are not susceptible to local minima) and because of their efficiency. However, despite their simplicity and wide applicability to real problems, their impact in NLP applications is still moderate. One of the goals of this tutorial is to remedy this situation. The contents that will be presented in this tutorial will offer a complementary perspective with respect to previous tutorials on spectral methods presented at ICML-2012, ICML-2013 and NAACL-2013. Rather than using the language of graphical models and signal processing, we tell the story from the perspective of formal languages and automata theory (without assuming a background in formal algebraic methods). Our presentation highlights the common intuitions lying behind different spectral algorithms by presenting them in a unified framework based on the concepts of low-rank factorizations and completions of Hankel matrices. In addition, we provide an interpretation of the method in terms of forward and backward recursions for automata and grammars. This provides extra intuitions about the method and stresses the importance of matrix factorization for learning automata and grammars. We believe that this complementary perspective might be appealing for an NLP audience and serve to put spectral learning in a wider and, perhaps for some, more familiar context.
Our hope is that this will broaden the understanding of these methods by the NLP community and empower many researchers to apply these techniques to novel problems. The content of the tutorial will be divided into four blocks of 45 minutes each, as follows. The first block will introduce the basic definitions of weighted automata and Hankel matrices, and present a key connection between the fundamental theorem of weighted automata and learning. In the second block we will discuss the case of probabilistic automata in detail, touching upon all aspects from the underlying theory to the tricks required to achieve accurate and scalable learning algorithms. The third block will present extensions to related models, including sequence tagging models, finite-state transducers and weighted context-free grammars. The last block will describe a general framework for using spectral techniques in more general situations where a matrix completion pre-processing step is required; several applications of this approach will be described.
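The connection between weighted automata, Hankel matrices, and forward and backward recursions can be illustrated with a small sketch. It assumes the standard definition of a weighted automaton's value, f(x) = alpha^T A_{x1} ... A_{xn} beta, and checks that each Hankel entry f(prefix + suffix) equals the inner product of a forward vector and a backward vector, so the Hankel block factors through a dimension equal to the number of states. The toy automaton and all function names are hypothetical, and no learning step is shown.

```python
def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

def vec_mat(v, M):
    """Row vector times matrix."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def mat_vec(M, v):
    """Matrix times column vector."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def forward(alpha, A, prefix):
    """Forward recursion: alpha^T A_{p1} ... A_{pk}."""
    v = alpha
    for sym in prefix:
        v = vec_mat(v, A[sym])
    return v

def backward(beta, A, suffix):
    """Backward recursion: A_{s1} ... A_{sk} beta."""
    v = beta
    for sym in reversed(suffix):
        v = mat_vec(A[sym], v)
    return v

def value(alpha, A, beta, string):
    """Value the automaton assigns to a string."""
    return dot(forward(alpha, A, string), beta)

# Hypothetical 2-state weighted automaton over the alphabet {a, b}.
alpha, beta = [1, 0], [0, 1]
A = {"a": [[1, 1], [0, 1]], "b": [[1, 0], [0, 2]]}
prefixes = suffixes = ["", "a", "b"]

# Hankel block from string values, and the same block from the factorization.
hankel = [[value(alpha, A, beta, p + s) for s in suffixes] for p in prefixes]
factored = [[dot(forward(alpha, A, p), backward(beta, A, s)) for s in suffixes]
            for p in prefixes]
print(hankel == factored, hankel)
```

Spectral algorithms run this relationship in reverse: they start from an estimated Hankel block and recover a factorization, and from it the automaton's operators.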