In recent years, the two-step approach to text classification based on pre-training plus fine-tuning has led to significant improvements in classification performance. In this paper, we study the low-budget scenario and ask whether it is justified to allocate the additional resources needed for fine-tuning complex models. To do so, we isolate the gains obtained from pre-training from those obtained from fine-tuning. We find that, once the gains from pre-training are factored out, complex transformer models yield only marginal improvements over simpler models. In this scenario, using simpler classifiers on top of pre-trained representations is therefore a viable alternative.
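A minimal sketch of the kind of low-budget baseline the abstract refers to: a linear classifier trained on frozen, pre-trained sentence embeddings. The random arrays below are stand-ins for real embeddings and labels; any encoder that produces fixed-size vectors would do.

```python
# Minimal sketch: a simple classifier on top of frozen pre-trained embeddings.
# X_train / X_test stand in for precomputed sentence embeddings from a frozen
# encoder; y_train / y_test are the corresponding labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(100, 768)), rng.integers(0, 2, 100)
X_test, y_test = rng.normal(size=(20, 768)), rng.integers(0, 2, 20)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```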
Most current state-of-the-art approaches for text classification are based on fine-tuning the representations computed by large language models (LLMs). This strategy has led to significant improvements in classification performance and has reduced the amount of labeled data required to train a model. However, for some challenging classification tasks, providing enough annotations to ensure reliable classification remains the main bottleneck, especially when the class distribution is highly imbalanced. This paper proposes to tackle this bottleneck by exploiting the structural properties of pre-trained embeddings. We develop a label propagation method that uses pre-trained embeddings to spread information from the labeled samples to nearby samples in the induced space, making optimal use of the annotations. Our approach is simple and relatively low-cost, since it only requires computing distances in the embedding space. We conduct experiments on different text classification datasets showing that the proposed method is efficient and significantly outperforms both self-training and random-walk label propagation strategies.
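For illustration, a hedged sketch of one simple nearest-neighbour propagation rule over precomputed embeddings; the paper's exact propagation method is not reproduced here, and the distance threshold is an illustrative parameter.

```python
# Sketch of nearest-neighbour label propagation over pre-trained embeddings
# (not the paper's exact rule): each unlabeled point inherits the label of its
# closest labeled point if the cosine distance is below a threshold.
import numpy as np
from scipy.spatial.distance import cdist

def propagate(emb_labeled, labels, emb_unlabeled, max_dist=0.5):
    d = cdist(emb_unlabeled, emb_labeled, metric="cosine")  # pairwise distances
    nearest = d.argmin(axis=1)           # index of the closest labeled sample
    pseudo = labels[nearest].copy()      # propagated (pseudo-)labels
    pseudo[d.min(axis=1) > max_dist] = -1  # -1 = left unlabeled
    return pseudo
```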
Recent work on semantic parsing has shown that seq2seq models find compositional generalization challenging. Several strategies have been proposed to mitigate this challenge. One such strategy is to improve compositional generalization via data augmentation techniques. In this paper, we follow this line of work and propose Archer, a data-augmentation strategy that exploits alignment annotations between sentences and their corresponding meaning representations. More precisely, we use alignments to train a two-step generative model that combines monotonic lexical generation with reordering. Our experiments show that Archer leads to significant improvements in compositional generalization performance.
Textual representations based on pre-trained language models are key, especially in few-shot learning scenarios. What makes a representation good for text classification? Is it the geometric properties of the space, or is it that the representation is well aligned with the task? We hypothesize the latter. To test this, we develop a task alignment score based on hierarchical clustering that measures alignment at different levels of granularity. Our experiments on text classification validate our hypothesis by showing that task alignment can explain the classification performance of a given representation.
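As a hedged sketch of what such a score could look like (the paper's exact definition may differ): cluster the embeddings hierarchically, cut the tree at several granularities, and average how well the clusters agree with the task labels. The granularity levels and the agreement measure below are illustrative choices.

```python
# Sketch of a task-alignment-style score: hierarchical clustering of the
# embeddings, cut at several granularities, scored against the task labels.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import adjusted_mutual_info_score

def task_alignment_score(embeddings, labels, levels=(2, 4, 8, 16)):
    Z = linkage(embeddings, method="ward")          # build the cluster tree
    scores = [adjusted_mutual_info_score(labels,
                                         fcluster(Z, t=k, criterion="maxclust"))
              for k in levels]                       # agreement at each granularity
    return float(np.mean(scores))
```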
Prior work in semantic parsing has shown that conventional seq2seq models fail at compositional generalization tasks. This limitation has led to a resurgence of methods that model alignments between sentences and their corresponding meaning representations, either implicitly through latent variables or explicitly by taking advantage of alignment annotations. We take the second direction and propose TPol, a two-step approach that first translates input sentences monotonically and then reorders them to obtain the correct output. This is achieved with a modular framework comprising a Translator and a Reorderer component. We test our approach on two popular semantic parsing datasets. Our experiments show that, by means of the monotonic translations, TPol can learn reliable lexico-logical patterns from aligned data, significantly improving compositional generalization over both conventional seq2seq models and other approaches that exploit gold alignments.
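A toy illustration (not TPol's actual implementation) of the translate-then-reorder structure: a word-by-word lexical step that preserves input order, followed by a separate permutation step. In TPol both steps are learned components; here the lexicon and the permutation are hard-coded placeholders.

```python
# Toy translate-then-reorder pipeline: step 1 is monotonic (word-by-word,
# order-preserving); step 2 permutes the fragments into target order.
LEXICON = {"which": "answer(", "rivers": "river(R)",
           "cross": "traverse(R, S)", "texas": "const(S, tex)"}

def translate_monotonic(tokens):
    # Step 1: map each word to a logical fragment, preserving word order.
    return [LEXICON[t] for t in tokens if t in LEXICON]

def reorder(fragments, permutation):
    # Step 2: reorder the fragments (predicted by a learned Reorderer in TPol).
    return [fragments[i] for i in permutation]

fragments = translate_monotonic("which rivers cross texas".split())
print(reorder(fragments, [0, 1, 3, 2]))  # illustrative target order
```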
Many real-world NLP applications face the challenge of training an entity disambiguation model for a specific domain with a small labeling budget. In this setting there is often access to a large unlabeled pool of documents, so it is natural to ask: which samples should be selected for annotation? In this paper we propose a solution that combines feature diversity with a low-rank correction. Our sampling strategy is formulated in the context of bilinear tensor models. Our experiments show that the proposed approach can significantly reduce the amount of labeled data necessary to achieve a given performance.
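A hedged sketch of the feature-diversity component only (the paper additionally applies a low-rank correction, which is not reproduced here): a greedy farthest-point selection over feature vectors.

```python
# Greedy farthest-point selection: repeatedly pick the sample that is farthest,
# in feature space, from everything selected so far. A stand-in for the
# diversity part of the sampling strategy; the low-rank correction is omitted.
import numpy as np

def diverse_selection(features, budget):
    selected = [int(np.argmax(np.linalg.norm(features, axis=1)))]  # arbitrary start
    for _ in range(budget - 1):
        d = np.min(np.linalg.norm(features[:, None] - features[selected], axis=2),
                   axis=1)                      # distance to the current selection
        selected.append(int(np.argmax(d)))      # farthest point joins the selection
    return selected
```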
Prior to deep learning, the semantic parsing community was interested in understanding and modeling the range of possible word alignments between natural language sentences and their corresponding meaning representations. Sequence-to-sequence models changed the research landscape, suggesting that we no longer need to worry about alignments since they can be learned automatically by means of an attention mechanism. More recently, researchers have started to question this premise. In this work we investigate whether seq2seq models can handle both simple and complex alignments. To answer this question, we augment the popular Geo semantic parsing dataset with alignment annotations and create Geo-Aligned. We then study the performance of standard seq2seq models on examples that can be aligned monotonically versus examples that require more complex alignments. Our empirical study shows that performance is significantly better on monotonic alignments.
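For concreteness, a small sketch of the kind of check that separates the two groups of examples, assuming each alignment is a list of (source index, target index) pairs; the dataset's actual annotation format may differ.

```python
# An alignment is monotonic if target positions never decrease as we move
# left-to-right through the source sentence.
def is_monotonic(alignment):
    targets = [t for _, t in sorted(alignment)]           # order by source index
    return all(a <= b for a, b in zip(targets, targets[1:]))

print(is_monotonic([(0, 0), (1, 1), (2, 3)]))  # True
print(is_monotonic([(0, 2), (1, 0)]))          # False (crossing alignment)
```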
We address the annotation data bottleneck for sequence classification. Specifically, we ask: given a budget of N annotations, which samples should we select for annotation? The solution we propose looks for diversity in the selected samples, maximizing the amount of information that is useful to the learning algorithm, or equivalently minimizing the redundancy of the samples in the selection. This is formulated in the context of spectral learning of recurrent functions for sequence classification. Our method represents unlabeled data in the form of a Hankel matrix and uses the notion of spectral max-volume to find a compact sub-block from which annotation samples are drawn. Experiments on sequence classification confirm that our spectral sampling strategy is indeed efficient and yields good models.
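A hedged sketch of the ingredients: a prefix-by-suffix Hankel-style count matrix built from unlabeled sequences, and a pivoted-QR heuristic used here as a stand-in for the paper's spectral max-volume selection of a compact sub-block.

```python
# Build a (prefix x suffix) Hankel-style count matrix from unlabeled sequences
# and rank prefix rows by a max-volume proxy (column-pivoted QR on H^T).
from collections import Counter
import numpy as np
from scipy.linalg import qr

def hankel_matrix(sequences, prefixes, suffixes):
    counts = Counter(tuple(s) for s in sequences)
    return np.array([[counts[p + s] for s in suffixes] for p in prefixes],
                    dtype=float)

def select_rows(H, k):
    # Pivoted QR orders rows of H by how much "volume" each one adds.
    _, _, piv = qr(H.T, pivoting=True)
    return piv[:k]  # indices of prefixes whose samples would be sent for annotation
```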
We compare a classical CNN architecture for sequence classification, involving several convolutional and max-pooling layers, against a simple model based on weighted finite state automata (WFA). Each model has its advantages and disadvantages, and it is possible that they could be combined. However, we believe that the first research goal should be to investigate and understand how these two apparently dissimilar models compare in the context of specific natural language processing tasks. This paper is the first step towards that goal. Our experiments with five sequence classification datasets suggest that, despite the apparent simplicity of WFA models and training algorithms, the performance of WFAs is comparable to that of the CNNs.
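For readers unfamiliar with WFAs, a minimal sketch of how such a model scores a sequence: an initial weight vector, one transition matrix per input symbol, and a final weight vector. The numbers below are an arbitrary toy example.

```python
# A WFA scores a sequence as f(x) = alpha^T A[x1] ... A[xT] beta.
import numpy as np

def wfa_score(alpha, A, beta, sequence):
    state = alpha
    for symbol in sequence:
        state = state @ A[symbol]   # apply the transition matrix for this symbol
    return float(state @ beta)

# toy 2-state WFA over the alphabet {0, 1}
alpha = np.array([1.0, 0.0])
beta = np.array([0.0, 1.0])
A = {0: np.array([[0.5, 0.5], [0.0, 1.0]]),
     1: np.array([[1.0, 0.0], [0.2, 0.8]])}
print(wfa_score(alpha, A, beta, [0, 1, 0]))
```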
Spectral models for learning weighted non-deterministic automata have nice theoretical and algorithmic properties. Despite this, it has been challenging to obtain competitive results in language modeling tasks, for two main reasons. First, in order to capture long-range dependencies in the data, the method must use statistics from long substrings, which results in very large matrices that are difficult to decompose. Second, the loss function behind spectral learning, based on moment matching, differs from the probabilistic metrics used to evaluate language models. In this work we employ a technique for scaling up spectral learning, and use interpolated predictions that are optimized to minimize perplexity. Our experiments in character-based language modeling show that our method matches the performance of state-of-the-art n-gram models, while being very fast to train.
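A hedged sketch of one way such an interpolation could be tuned (the paper's optimization may differ): mix the two models' next-character probabilities and pick the mixture weight with the lowest held-out perplexity.

```python
# Grid-search the interpolation weight between two character-level models,
# scored by held-out perplexity. p_spectral / p_ngram are the probabilities
# each model assigns to the gold next character on held-out data.
import numpy as np

def perplexity(probs):
    return float(np.exp(-np.mean(np.log(probs))))

def best_interpolation(p_spectral, p_ngram, grid=np.linspace(0.0, 1.0, 21)):
    return min(grid,
               key=lambda lam: perplexity(lam * p_spectral + (1 - lam) * p_ngram))
```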
Event Schema Induction is the task of learning a representation of events (e.g., bombing) and the roles involved in them (e.g., victim and perpetrator). This paper presents InToEventS, an interactive tool for learning these schemas. InToEventS allows users to explore a corpus and discover which kinds of events are present. We show how users can create useful event schemas using two interactive clustering steps.
We present a low-rank multi-linear model for the task of solving prepositional phrase attachment ambiguity (PP task). Our model exploits tensor products of word embeddings, capturing all possible conjunctions of latent embeddings. Our results on a wide range of datasets and task settings show that tensor products are the best compositional operation and that a relatively simple multi-linear model that uses only word embeddings of lexical features can outperform more complex non-linear architectures that exploit the same information. Our proposed model gives the current best reported performance on an out-of-domain evaluation and performs competitively on out-of-domain dependency parsing datasets.
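A hedged sketch of a low-rank multi-linear scorer of this kind, assuming a (head, preposition, child) triple of word embeddings; the projection matrices U, V, W are illustrative parameters, and the paper's exact parameterization may differ.

```python
# Low-rank multi-linear score: a sum of R rank-1 interactions of the three
# embeddings, equivalent to contracting a rank-R tensor with h ⊗ p ⊗ c.
import numpy as np

def multilinear_score(h, p, c, U, V, W):
    # U, V, W: (R x d) projection matrices; h, p, c: d-dimensional embeddings.
    return float(np.sum((U @ h) * (V @ p) * (W @ c)))
```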
In recent years we have seen the development of efficient and provably correct algorithms for learning weighted automata and closely related function classes such as weighted transducers and weighted context-free grammars. The common denominator of all these algorithms is the so-called spectral method, which gives an efficient and robust way to estimate recursively defined functions from empirical estimations of observable statistics. These algorithms are appealing because of the existence of theoretical guarantees (e.g., they are not susceptible to local minima) and because of their efficiency. However, despite their simplicity and wide applicability to real problems, their impact in NLP applications is still moderate. One of the goals of this tutorial is to remedy this situation.

The contents that will be presented in this tutorial will offer a complementary perspective with respect to previous tutorials on spectral methods presented at ICML-2012, ICML-2013 and NAACL-2013. Rather than using the language of graphical models and signal processing, we tell the story from the perspective of formal languages and automata theory (without assuming a background in formal algebraic methods). Our presentation highlights the common intuitions lying behind different spectral algorithms by presenting them in a unified framework based on the concepts of low-rank factorizations and completions of Hankel matrices. In addition, we provide an interpretation of the method in terms of forward and backward recursions for automata and grammars. This provides extra intuitions about the method and stresses the importance of matrix factorization for learning automata and grammars. We believe that this complementary perspective might be appealing for an NLP audience and serve to put spectral learning in a wider and, perhaps for some, more familiar context. Our hope is that this will broaden the understanding of these methods by the NLP community and empower many researchers to apply these techniques to novel problems.

The content of the tutorial will be divided into four blocks of 45 minutes each, as follows. The first block will introduce the basic definitions of weighted automata and Hankel matrices, and present a key connection between the fundamental theorem of weighted automata and learning. In the second block we will discuss the case of probabilistic automata in detail, touching upon all aspects from the underlying theory to the tricks required to achieve accurate and scalable learning algorithms. The third block will present extensions to related models, including sequence tagging models, finite-state transducers and weighted context-free grammars. The last block will describe a general framework for using spectral techniques in more general situations where a matrix completion pre-processing step is required; several applications of this approach will be described.
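To make the core recipe concrete, a schematic sketch of spectral learning of a weighted automaton from an empirical Hankel matrix, in the spirit of the factorization view described above; the exact estimation and normalization details depend on the model and are omitted here.

```python
# Schematic spectral recipe: truncated SVD of the Hankel matrix, then read the
# automaton's operators off the low-rank factors.
import numpy as np

def spectral_wfa(H, H_sigma, h_prefix, h_suffix, rank):
    # H: prefix x suffix Hankel matrix of empirical statistics.
    # H_sigma[s]: Hankel matrix shifted by symbol s.
    # h_prefix: column of H for the empty suffix; h_suffix: row for the empty prefix.
    U, S, Vt = np.linalg.svd(H, full_matrices=False)
    V = Vt[:rank].T                      # top-rank right singular vectors
    P_pinv = np.linalg.pinv(H @ V)       # pseudo-inverse of the left factor
    A = {s: P_pinv @ Hs @ V for s, Hs in H_sigma.items()}  # transition operators
    alpha = h_suffix @ V                 # initial weight vector
    beta = P_pinv @ h_prefix             # final weight vector
    return alpha, A, beta
```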