David Blei

Also published as: David M. Blei


2023

An Invariant Learning Characterization of Controlled Text Generation
Carolina Zheng | Claudia Shi | Keyon Vafa | Amir Feder | David Blei
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Controlled generation refers to the problem of creating text that contains stylistic or semantic attributes of interest. Many approaches reduce this problem to training a predictor of the desired attribute. For example, researchers hoping to deploy a large language model to produce non-toxic content may use a toxicity classifier to filter generated text. In practice, the generated text to classify, which is determined by user prompts, may come from a wide range of distributions. In this paper, we show that the performance of controlled generation may be poor if the distributions of text in response to user prompts differ from the distribution the predictor was trained on. To address this problem, we cast controlled generation under distribution shift as an invariant learning problem: the most effective predictor should be invariant across multiple text environments. We then discuss a natural solution that arises from this characterization and propose heuristics for selecting natural environments. We study this characterization and the proposed method empirically using both synthetic and real data. Experiments demonstrate both the challenge of distribution shift in controlled generation and the potential of invariance methods in this setting.
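
The invariant-predictor idea above lends itself to a standard invariance penalty. Below is a minimal PyTorch sketch of that general recipe, using the IRMv1 penalty of Arjovsky et al. (2019) as one concrete instantiation; the toy environments, the linear probe, and the penalty weight are illustrative assumptions, not the paper's exact estimator.

```python
# Hedged sketch: an invariance-regularized attribute classifier trained
# across multiple text "environments" (distinct text distributions).
# The IRMv1 penalty is one instantiation of the invariance idea; the
# paper's method may differ. All data here is a toy stand-in.
import torch
import torch.nn.functional as F

def irm_penalty(logits, labels):
    # Gradient of the environment risk w.r.t. a fixed scalar "dummy"
    # multiplier; a nonzero gradient means the shared classifier is not
    # simultaneously optimal in this environment.
    scale = torch.tensor(1.0, requires_grad=True)
    loss = F.binary_cross_entropy_with_logits(logits * scale, labels)
    (grad,) = torch.autograd.grad(loss, [scale], create_graph=True)
    return grad.pow(2)

torch.manual_seed(0)
# Three environments of (text features, binary attribute label); in
# practice the features would come from a frozen text encoder.
envs = [(torch.randn(64, 128), torch.randint(0, 2, (64,)).float())
        for _ in range(3)]
probe = torch.nn.Linear(128, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for step in range(200):
    risk = penalty = 0.0
    for x, y in envs:
        logits = probe(x).squeeze(-1)
        risk = risk + F.binary_cross_entropy_with_logits(logits, y)
        penalty = penalty + irm_penalty(logits, y)
    opt.zero_grad()
    (risk + 10.0 * penalty).backward()  # penalty weight is a hyperparameter
    opt.step()
```

A predictor trained this way is pushed toward attribute features that work in every environment, which is the property the paper argues a controlled-generation classifier needs under distribution shift.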

2022

Heterogeneous Supervised Topic Models
Dhanya Sridhar | Hal Daumé III | David Blei
Transactions of the Association for Computational Linguistics, Volume 10

Researchers in the social sciences are often interested in the relationship between text and an outcome of interest, where the goal is to both uncover latent patterns in the text and predict outcomes for unseen texts. To this end, this paper develops the heterogeneous supervised topic model (HSTM), a probabilistic approach to text analysis and prediction. HSTMs posit a joint model of text and outcomes to find heterogeneous patterns that help with both text analysis and prediction. The main benefit of HSTMs is that they capture heterogeneity in the relationship between text and the outcome across latent topics. To fit HSTMs, we develop a variational inference algorithm based on the auto-encoding variational Bayes framework. We study the performance of HSTMs on eight datasets and find that they consistently outperform related methods, including fine-tuned black-box models. Finally, we apply HSTMs to analyze news articles labeled with pro- or anti-tone. We find evidence of differing language used to signal a pro- and anti-tone.
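
To make the abstract's notion of heterogeneity concrete, here is a schematic of a supervised topic model whose outcome coefficients vary by topic. The notation is mine, not the paper's; the HSTM's exact likelihood and priors are given in the paper.

```latex
% Hedged schematic, not the paper's exact specification.
\theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad
w_{dn} \mid \theta_d \sim \mathrm{Cat}\Big(\textstyle\sum_k \theta_{dk}\,\beta_k\Big), \qquad
\mathbb{E}[y_d \mid \theta_d, w_d] = g^{-1}\Big(c + \textstyle\sum_k \theta_{dk}\,\eta_k^\top \bar{w}_d\Big),
```

where $\bar{w}_d$ is document $d$'s empirical word distribution and $\eta_k$ is a topic-specific coefficient vector, so the same word can move the outcome in different directions depending on which topics are active. In a standard supervised topic model, the per-topic term collapses to a single scalar per topic.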

2021

Rationales for Sequential Predictions
Keyon Vafa | Yuntian Deng | David Blei | Alexander Rush
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Sequence models are a critical component of modern NLP systems, but their predictions are difficult to explain. We consider model explanations through rationales, subsets of context that can explain individual model predictions. We find sequential rationales by solving a combinatorial optimization problem: the best rationale is the smallest subset of input tokens that would predict the same output as the full sequence. Enumerating all subsets is intractable, so we propose an efficient greedy algorithm, called greedy rationalization, that approximates this objective and applies to any model. For this approach to be effective, the model should form compatible conditional distributions when making predictions on incomplete subsets of the context; this condition can be enforced with a short fine-tuning step. We study greedy rationalization on language modeling and machine translation. Compared to existing baselines, greedy rationalization is best at optimizing the sequential objective and provides the most faithful rationales. On a new dataset of annotated sequential rationales, greedy rationales are most similar to human rationales.
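
The greedy procedure the abstract describes is short enough to sketch directly. In the hedged pseudocode below, `predict_probs` is an assumed helper that returns the model's next-token distribution for a (possibly incomplete) subset of the context; as the abstract notes, a short fine-tuning step is what makes such incomplete contexts well-behaved.

```python
# Hedged sketch of greedy rationalization for a single prediction.
# `predict_probs(tokens)` is an assumed helper returning a list of
# next-token probabilities given an (incomplete) context.

def greedy_rationale(context, predict_probs):
    """Greedily grow the smallest context subset predicting the same token."""
    full_probs = predict_probs(context)
    target = max(range(len(full_probs)), key=full_probs.__getitem__)

    rationale = []  # sorted indices into `context`
    remaining = set(range(len(context)))
    while remaining:
        # Add whichever remaining token most raises the target's probability.
        best = max(remaining, key=lambda i: predict_probs(
            [context[j] for j in sorted(rationale + [i])])[target])
        rationale = sorted(rationale + [best])
        remaining.discard(best)
        probs = predict_probs([context[j] for j in rationale])
        if max(range(len(probs)), key=probs.__getitem__) == target:
            break  # the rationale now yields the same prediction as the full context
    return rationale
```

Each iteration adds one token, so the loop runs at most once per context token, which is what makes the approximation tractable where exhaustive subset search is not.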

2020

Topic Modeling in Embedding Spaces
Adji B. Dieng | Francisco J. R. Ruiz | David M. Blei
Transactions of the Association for Computational Linguistics, Volume 8

Topic modeling analyzes documents to learn meaningful patterns of words. However, existing topic models fail to learn interpretable topics when working with large and heavy-tailed vocabularies. To this end, we develop the embedded topic model (ETM), a generative model of documents that marries traditional topic models with word embeddings. More specifically, the ETM models each word with a categorical distribution whose natural parameter is the inner product between the word’s embedding and an embedding of its assigned topic. To fit the ETM, we develop an efficient amortized variational inference algorithm. The ETM discovers interpretable topics even with large vocabularies that include rare words and stop words. It outperforms existing document models, such as latent Dirichlet allocation, in terms of both topic quality and predictive performance.
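
The abstract's per-word likelihood can be written out directly. With $\rho_v$ the embedding of word $v$ and $\alpha_k$ the embedding of topic $k$ (symbols reconstructed here, not quoted from the paper):

```latex
% Each word is categorical with natural parameter rho_v^T alpha_k:
p(w_{dn} = v \mid z_{dn} = k)
  = \mathrm{softmax}(\rho^\top \alpha_k)_v
  = \frac{\exp(\rho_v^\top \alpha_k)}{\sum_{v'} \exp(\rho_{v'}^\top \alpha_k)}.
```

Because topics live in the same space as words, rare words land near related frequent words, which is what lets the ETM stay interpretable on heavy-tailed vocabularies.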

Text-Based Ideal Points
Keyon Vafa | Suresh Naidu | David Blei
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Ideal point models analyze lawmakers’ votes to quantify their political positions, or ideal points. But votes are not the only way to express a political position. Lawmakers also give speeches, release press statements, and post tweets. In this paper, we introduce the text-based ideal point model (TBIP), an unsupervised probabilistic topic model that analyzes texts to quantify the political positions of their authors. We demonstrate the TBIP with two types of politicized text data: U.S. Senate speeches and senator tweets. Though the model does not analyze votes or political affiliations, the TBIP separates lawmakers by party, learns interpretable politicized topics, and infers ideal points close to the classical vote-based ideal points. One benefit of analyzing texts, as opposed to votes, is that the TBIP can estimate ideal points of anyone who authors political texts, including non-voting actors. To this end, we use it to study tweets from the 2020 Democratic presidential candidates. Using only the texts of their tweets, it identifies them along an interpretable progressive-to-moderate spectrum.
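
For readers who want the shape of the model, the TBIP's count likelihood combines Poisson factorization with a per-author scalar ideal point. The following is a hedged reconstruction with my own notation; the exact priors and parameterization are in the paper.

```latex
% Hedged reconstruction of the TBIP count model:
% theta_{dk} >= 0: document-topic intensities;  beta_{kv} >= 0: neutral topics;
% x_{a_d}: ideal point of document d's author;  eta_{kv}: per-topic ideological shift.
y_{dv} \sim \mathrm{Poisson}\Big( \sum_k \theta_{dk}\, \beta_{kv}\, \exp\big(x_{a_d}\, \eta_{kv}\big) \Big).
```

An author with $x_{a_d} = 0$ uses a topic's neutral language; as $x_{a_d}$ moves in either direction, words with large positive or negative $\eta_{kv}$ become more likely, which is how the model can separate lawmakers without ever seeing votes or party labels.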

2016

Detecting and Characterizing Events
Allison Chaney | Hanna Wallach | Matthew Connelly | David Blei
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

2011

Bayesian Checking for Topic Models
David Mimno | David Blei
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

2010

Variational Inference for Adaptor Grammars
Shay B. Cohen | David M. Blei | Noah A. Smith
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

2007

PU-BCD: Exponential Family Models for the Coarse- and Fine-Grained All-Words Tasks
Jonathan Chang | Miroslav Dudík | David Blei
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

PUTOP: Turning Predominant Senses into a Topic Model for Word Sense Disambiguation
Jordan Boyd-Graber | David Blei
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

A Topic Model for Word Sense Disambiguation
Jordan Boyd-Graber | David Blei | Xiaojin Zhu
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)