We present a nonparametric Bayesian model of tree structures based on the hierarchical Dirichlet process (HDP).
Our HDP-PCFG model allows the complexity of the grammar to grow as more training data becomes available.
In addition to presenting a fully Bayesian model for the PCFG, we also develop an efficient variational inference procedure.
On synthetic data, we recover the correct grammar without having to specify its complexity in advance.
We also show that our techniques can be applied to full-scale parsing applications by demonstrating their effectiveness in learning state-split grammars.
1 Introduction
Probabilistic context-free grammars (PCFGs) have been a core modeling technique for many aspects of linguistic structure, particularly syntactic phrase structure in treebank parsing (Charniak, 1996; Collins, 1999).
An important question when learning PCFGs is how many grammar symbols to allocate to the learning algorithm based on the amount of available data.
The question of "how many clusters (symbols)?" has been tackled in the Bayesian nonparametrics literature via Dirichlet process (DP) mixture models (Antoniak, 1974).
DP mixture models have since been extended to hierarchical Dirichlet processes (HDPs) and HDP-HMMs (Teh et al., 2006; Beal et al., 2002) and applied to many different types of clustering/induction problems in NLP (Johnson et al., 2006; Goldwater et al., 2006).
In this paper, we present the hierarchical Dirichlet process PCFG (HDP-PCFG), a nonparametric Bayesian model of syntactic tree structures based on Dirichlet processes.
Specifically, an HDP-PCFG is defined to have an infinite number of symbols; the Dirichlet process (DP) prior penalizes the use of more symbols than are supported by the training data.
Note that "nonparametric" does not mean "no parameters"; rather, it means that the effective number of parameters can grow adaptively as the amount of data increases, which is a desirable property of a learning algorithm.
As models increase in complexity, so does the uncertainty over parameter estimates.
In this regime, point estimates are unreliable since they do not take into account the fact that there are different amounts of uncertainty in the various components of the parameters.
The HDP-PCFG is a Bayesian model which naturally handles this uncertainty.
We present an efficient variational inference algorithm for the HDP-PCFG based on a structured mean-field approximation of the true posterior over parameters.
The algorithm is similar in form to EM and thus inherits its simplicity, modularity, and efficiency.
Unlike EM, however, the algorithm is able to take the uncertainty of parameters into account and thus incorporate the DP prior.
Finally, we develop an extension of the HDP-PCFG for grammar refinement (HDP-PCFG-GR).
Since treebanks generally consist of coarsely-labeled context-free tree structures, the maximum-likelihood treebank grammar is typically a poor model as it makes overly strong independence assumptions.
As a result, many generative approaches to parsing construct refinements of the treebank grammar which are more suitable for the modeling task.
Lexical methods split each pre-terminal symbol into many subsymbols, one for each word, and then focus on smoothing sparse lexical statis-
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 688-697, Prague, June 2007.
©2007 Association for Computational Linguistics
model to automatically learn the number of subsymbols for each symbol.
2 Models based on Dirichlet processes
At the heart of the HDP-PCFG is the Dirichlet process (DP) mixture model (Antoniak, 1974), which is the nonparametric Bayesian counterpart to the classical finite mixture model.
In order to build up an understanding of the HDP-PCFG, we first review the Bayesian treatment of the finite mixture model (Section 2.1).
We then consider the DP mixture model (Section 2.2) and use it as a building block for developing nonparametric structured versions of the HMM (Section 2.3) and PCFG (Section 2.4).
Our presentation highlights the similarities between these models so that each step along this progression reflects only the key differences.
2.1 Bayesian finite mixture model
We begin by describing the Bayesian finite mixture model to establish basic notation that will carry over to the more complex models we consider later.
Bayesian finite mixture model
The model has $K$ components whose prior distribution is specified by $\beta = (\beta_1, \dots, \beta_K)$.
The Dirichlet hyperparameter $\alpha$ controls how uniform this distribution is: as $\alpha$ increases, it becomes increasingly likely that the components have equal probability.
For each mixture component $z \in \{1, \dots, K\}$, the parameters of the component $\phi_z$ are drawn from some prior $G_0$.
Given the model parameters $(\beta, \phi)$, the data points are generated i.i.d. by first choosing a component and then generating from a data model $F$ parameterized by that component.
In document clustering, for example, each data point $x_i$ is a document represented by its term-frequency vector.
Each component (cluster) $z$ has multinomial parameters $\phi_z$, which specify a distribution $F(\cdot; \phi_z)$ over words.
It is customary to use a conjugate Dirichlet prior $G_0 = \text{Dirichlet}(\alpha', \dots, \alpha')$ over the multinomial parameters, which can be interpreted as adding $\alpha' - 1$ pseudocounts for each word.
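To make the generative story concrete, the following is a minimal sketch of sampling from a Bayesian finite mixture of multinomials for the document clustering example. It is an illustration only: the function names, the fixed document length `doc_len`, and the use of a symmetric Dirichlet for both priors are assumptions of this sketch.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw from Dirichlet(alphas) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def sample_finite_mixture(n, K, vocab_size, doc_len, alpha, alpha_prime, seed=0):
    """Generate n documents from a Bayesian finite mixture of multinomials."""
    rng = random.Random(seed)
    # Component probabilities beta ~ Dirichlet(alpha, ..., alpha).
    beta = sample_dirichlet([alpha] * K, rng)
    # Component parameters phi_z ~ G_0 = Dirichlet(alpha', ..., alpha').
    phi = [sample_dirichlet([alpha_prime] * vocab_size, rng) for _ in range(K)]
    assignments, docs = [], []
    for _ in range(n):
        # Choose a component, then generate the document from that component.
        z = rng.choices(range(K), weights=beta)[0]
        words = rng.choices(range(vocab_size), weights=phi[z], k=doc_len)
        counts = [0] * vocab_size  # term-frequency vector x_i
        for w in words:
            counts[w] += 1
        assignments.append(z)
        docs.append(counts)
    return beta, phi, assignments, docs
```

Calling `sample_finite_mixture(n=5, K=3, vocab_size=4, doc_len=10, alpha=1.0, alpha_prime=1.0)` returns the sampled parameters together with five term-frequency vectors, each summing to 10.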
2.2 DP mixture model
We now consider the extension of the Bayesian finite mixture model to a nonparametric Bayesian mixture model based on the Dirichlet process.
We focus on the stick-breaking representation (Sethuraman, 1994) of the Dirichlet process instead of the stochastic process definition (Ferguson, 1973) or the Chinese restaurant process (Pitman, 2002).
The stick-breaking representation captures the DP prior most explicitly and allows us to extend the finite mixture model with minimal changes.
Later, it will enable us to readily define structured models in a form similar to their classical versions.
Furthermore, an efficient variational inference algorithm can be developed in this representation (Section 2.6).
The key difference between the Bayesian finite mixture model and the DP mixture model is that the latter has a countably infinite number of mixture components while the former has a predefined K. Note that if we have an infinite number of mixture components, it no longer makes sense to consider a symmetric prior over the component probabilities; the prior over component probabilities must decay in some way.
The stick-breaking distribution achieves this as follows.
We write $\beta \sim \text{GEM}(\alpha)$ to mean that $\beta = (\beta_1, \beta_2, \dots)$ is distributed according to the stick-breaking distribution.
Here, the concentration parameter $\alpha$ controls the number of effective components.
To draw $\beta \sim \text{GEM}(\alpha)$, we first generate a countably infinite collection of stick-breaking proportions $u_1, u_2, \dots$, where each $u_z \sim \text{Beta}(1, \alpha)$.
The stick-breaking weights $\beta$ are then defined in terms of the stick proportions:
$$\beta_z = u_z \prod_{z'=1}^{z-1} (1 - u_{z'}). \qquad (1)$$
The procedure for generating $\beta$ can be viewed as iteratively breaking off remaining portions of a unit-length stick (Figure 1).
Figure 1: A sample $\beta \sim \text{GEM}(1)$.
The component probabilities $\{\beta_z\}$ will decay exponentially in expectation, but there is always some probability of getting a smaller component before a larger one.
The parameter $\alpha$ determines the decay of these probabilities: a larger $\alpha$ implies a slower decay and thus more components.
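The stick-breaking construction can be sketched in a few lines of Python. This is an illustrative sketch only: it generates just the first `K` weights of an infinite sequence, and the function name and the returned leftover stick length are choices made for this example.

```python
import random

def sample_gem_truncated(alpha, K, seed=0):
    """Return the first K stick-breaking weights of a draw beta ~ GEM(alpha),
    together with the length of the stick that has not been broken off yet."""
    rng = random.Random(seed)
    weights = []
    remaining = 1.0  # length of the stick not yet assigned to any component
    for _ in range(K):
        u = rng.betavariate(1.0, alpha)  # stick-breaking proportion u_z ~ Beta(1, alpha)
        weights.append(u * remaining)    # beta_z = u_z * prod_{z' before z} (1 - u_z')
        remaining *= 1.0 - u
    return weights, remaining
```

With a larger `alpha`, each proportion `u` tends to be smaller, so the stick is consumed more slowly and more components receive non-negligible weight.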
Given the component probabilities, the rest of the DP mixture model is identical to the finite mixture model:
DP mixture model
2.3 HDP-HMM
The next stop on the way to the HDP-PCFG is the HDP hidden Markov model (HDP-HMM).
An HMM consists of a set of hidden states, where each state can be thought of as a mixture component.
The parameters of the mixture component are the emission and transition parameters.
The main aspect that distinguishes it from a flat finite mixture model is that the transition parameters themselves must specify a distribution over next states.
Hence, we have not just one top-level mixture model over states, but also a collection of mixture models, one for each state.
In developing a nonparametric version of the HMM in which the number of states is infinite, we need to ensure that the transition mixture models of each state share a common inventory of possible next states.
We can achieve this by tying these mixture models together using the hierarchical Dirichlet process (HDP) (Teh et al., 2006).
The stick-breaking representation of an HDP is defined as follows: first, the top-level stick-breaking weights $\beta$ are drawn according to the stick-breaking prior as before.
Then, a new set of stick-breaking weights $\beta'$ is generated based on $\beta$:
$$\beta' \sim \text{DP}(\alpha', \beta), \qquad (2)$$
where the distribution DP can be characterized in terms of the following finite partition property: for all partitions of the positive integers into sets $A_1, \dots, A_m$,
$$(\beta'(A_1), \dots, \beta'(A_m)) \sim \text{Dirichlet}(\alpha' \beta(A_1), \dots, \alpha' \beta(A_m)), \qquad (3)$$
where $\beta(A) = \sum_{k \in A} \beta_k$.¹ The resulting $\beta'$ is another distribution over the positive integers whose similarity to $\beta$ is controlled by a concentration parameter $\alpha'$.
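HDP-HMM
For each state $z$:
  $\phi^E_z \sim$ … [draw emission parameters]
  $\phi^T_z \sim \text{DP}(\alpha', \beta)$ [draw transition parameters]
For each time step $i$:
  $x_i \sim$ … [emit current observation]
  $z_{i+1} \sim \text{Multinomial}(\phi^T_{z_i})$ [choose next state]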
Each state $z$ is associated with emission parameters $\phi^E_z$.
In addition, each $z$ is also associated with transition parameters $\phi^T_z$, which specify a distribution over next states.
These transition parameters are drawn from a DP centered on the top-level stick-breaking weights $\beta$ according to Equations (2) and (3).
Assume that $z_1$ is always fixed to a special START state, so we do not need to generate it.
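As an illustration of how the transition distributions share a common inventory of states, the sketch below samples from a version of the HDP-HMM truncated to `K` states. Several simplifications are assumptions of this sketch rather than part of the model above: the truncation itself, the symmetric Dirichlet prior `gamma` on emissions, multinomial emissions over a small vocabulary, and the use of state 0 as the START state. It relies on the finite partition property: when the top-level weights have finite support, a draw from the DP centered on them is a Dirichlet draw.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw from Dirichlet(alphas) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def sample_truncated_hdp_hmm(K, n, vocab_size, alpha, alpha_prime, gamma, seed=0):
    """Sample parameters and one length-n sequence from a K-state truncation."""
    rng = random.Random(seed)
    # Top-level weights beta: stick-breaking, with the K-th stick taking the remainder.
    beta, remaining = [], 1.0
    for z in range(K):
        u = 1.0 if z == K - 1 else rng.betavariate(1.0, alpha)
        beta.append(u * remaining)
        remaining *= 1.0 - u
    # Emission parameters: symmetric Dirichlet(gamma) prior (an assumption of this sketch).
    emit = [sample_dirichlet([gamma] * vocab_size, rng) for _ in range(K)]
    # Transition parameters: for finite beta, DP(alpha', beta) is Dirichlet(alpha' * beta).
    # The small floor only guards against zero-valued Gamma shape parameters.
    trans = [sample_dirichlet([max(alpha_prime * b, 1e-12) for b in beta], rng)
             for _ in range(K)]
    states, observations = [], []
    z = 0  # state 0 plays the role of the fixed START state
    for _ in range(n):
        states.append(z)
        observations.append(rng.choices(range(vocab_size), weights=emit[z])[0])  # emit
        z = rng.choices(range(K), weights=trans[z])[0]                           # next state
    return beta, trans, states, observations
```

Because every row of `trans` is drawn around the same `beta`, states that are globally likely tend to be likely successors of every state.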
2.4 HDP-PCFG
We now present the HDP-PCFG, which is the focus of this paper.
For simplicity, we consider Chomsky normal form (CNF) grammars, which have two types of rules: emissions and binary productions.
We consider each grammar symbol as a mixture component whose parameters are the rule probabilities for that symbol.
In general, we do not know the appropriate number of grammar symbols, so our strategy is to let the number of grammar symbols be infinite and place a DP prior over grammar symbols.
¹ Note that this property is a specific instance of the general stochastic process definition of Dirichlet processes.
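HDP-PCFG
$\beta \sim \text{GEM}(\alpha)$ [draw top-level symbol weights]
For each grammar symbol $z \in \{1, 2, \dots\}$:
  $\phi^T_z \sim \text{Dirichlet}(\alpha^T)$ [draw rule type parameters]
  $\phi^E_z \sim \text{Dirichlet}(\alpha^E)$ [draw emission parameters]
  $\phi^B_z \sim \text{DP}(\alpha^B, \beta\beta^\top)$ [draw binary production parameters]
For each node $i$ in the parse tree:
  $t_i \sim \text{Multinomial}(\phi^T_{z_i})$ [choose rule type]
  If $t_i$ = EMISSION: $x_i \sim \text{Multinomial}(\phi^E_{z_i})$ [emit terminal symbol]
  If $t_i$ = BINARY-PRODUCTION: $(z_{L(i)}, z_{R(i)}) \sim \text{Multinomial}(\phi^B_{z_i})$ [generate children symbols]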
Figure 2: The definition and graphical model of the HDP-PCFG.
Since parse trees have unknown structure, there is no convenient way of representing them in the visual language of traditional graphical models.
Instead, we show a simple fixed example tree.
Node 1 has two children, 2 and 3, each of which has one observed terminal child.
We use $L(i)$ and $R(i)$ to denote the left and right children of node $i$.
In the HMM, the transition parameters of a state specify a distribution over single next states; similarly, the binary production parameters of a grammar symbol must specify a distribution over pairs of grammar symbols for its children.
We adapt the HDP machinery to tie these binary production distributions together.
The key difference is that now we must tie distributions over pairs of grammar symbols together via distributions over single grammar symbols.
Another difference is that in the HMM, at each time step, both a transition and an emission are made, whereas in the PCFG either a binary production or an emission is chosen.
Therefore, each grammar symbol must also have a distribution over the type of rule to apply.
In a CNF PCFG, there are only two types of rules, but this can be easily generalized to include unary productions, which we use for our parsing experiments.
To summarize, the parameters of each grammar symbol $z$ consist of (1) a distribution $\phi^T_z$ over a finite number of rule types, (2) an emission distribution $\phi^E_z$ over terminal symbols, and (3) a binary production distribution $\phi^B_z$ over pairs of children grammar symbols.
Figure 2 describes the model in detail.
Figure 3 shows the generation of the binary production distributions $\phi^B_z$.
We draw $\phi^B_z$ from a DP centered on $\beta\beta^\top$, which is the product distribution over pairs of symbols.
The result is a doubly-infinite matrix where most of the probability mass is concentrated in the upper left, just like the top-level distribution $\beta\beta^\top$.
Note that we have replaced the general ($G_0$, $F$) pair with Dirichlet($\alpha^E$) and Multinomial($\phi^E_{z_i}$) to specialize to natural language, but there is no difficulty in working with parse trees with arbitrary non-multinomial observations or more sophisticated word models.
Figure 3: The generation of binary production probabilities given the top-level symbol probabilities $\beta$ (axes: left child state, right child state). First, $\beta$ is drawn from the stick-breaking prior, as in any DP-based model (a).
Next, the outer product $\beta\beta^\top$ is formed, resulting in a doubly-infinite matrix (b).
We use this as the base distribution for generating the binary production distribution from a DP centered on $\beta\beta^\top$ (c).
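The generation of a binary production distribution can be sketched for a truncated set of symbols as follows. This is a minimal illustration under stated assumptions: the top-level weights are a fixed finite example vector, so the DP centered on the outer product reduces to a Dirichlet over the finitely many symbol pairs, and the function names are hypothetical.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw from Dirichlet(alphas) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def sample_binary_production(beta, alpha_B, rng):
    """Draw a K x K matrix of binary production probabilities centered on beta beta^T."""
    K = len(beta)
    # Base distribution beta beta^T over (left, right) pairs, flattened row by row.
    base = [beta[l] * beta[r] for l in range(K) for r in range(K)]
    # Finite-support case: the DP draw is a Dirichlet with parameters alpha_B * base.
    # The small floor only guards against zero-valued Gamma shape parameters.
    flat = sample_dirichlet([max(alpha_B * b, 1e-12) for b in base], rng)
    # Reshape into a matrix indexed as phi_B[left][right].
    return [flat[l * K:(l + 1) * K] for l in range(K)]

rng = random.Random(0)
beta = [0.5, 0.25, 0.15, 0.1]  # example truncated top-level weights (assumed values)
phi_B = sample_binary_production(beta, alpha_B=10.0, rng=rng)
```

Pairs of symbols with large top-level weight receive large Dirichlet parameters, which is why the mass of `phi_B` tends to sit in the upper-left corner.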
In many natural language applications, there is a hard distinction between pre-terminal symbols (those that only emit a word) and non-terminal symbols (those that only rewrite as two non-terminal or pre-terminal symbols).
This can be accomplished by letting $\alpha^T = (0, 0)$, which forces a draw $\phi^T_z$ to assign probability 1 to one rule type.
An alternative definition of an HDP-PCFG would be as follows: for each symbol $z$, draw a distribution over left child symbols $l_z \sim \text{DP}(\beta)$ and an independent distribution over right child symbols $r_z \sim \text{DP}(\beta)$.
Then define the binary production distribution as their cross-product $\phi^B_z = l_z r_z^\top$.
This also yields a distribution over symbol pairs and hence defines a different type of nonparametric PCFG.
This model is simpler and does not require any additional machinery beyond the HDP-HMM.
However, the modeling assumptions imposed by this alternative are unappealing as they assume the left child and right child are independent given the parent, which is certainly not the case in natural language.
2.5 HDP-PCFG for grammar refinement
An important motivation for the HDP-PCFG is that of refining an existing treebank grammar to alleviate unrealistic independence assumptions and to improve parsing accuracy.
In this scenario, the set of symbols is known, but we do not know how many subsymbols to allocate per symbol.
We introduce the HDP-PCFG for grammar refinement (HDP-PCFG-GR) for this task.
The essential difference is that now we have a collection of HDP-PCFG models, one for each symbol $s \in S$, each operating at the subsymbol level.
While these HDP-PCFGs are independent in the prior, they are coupled through their interactions in the parse trees.
For completeness, we have also included unary productions, which are essentially the PCFG counterpart of transitions in HMMs.
Finally, since each node $i$ in the parse tree involves a symbol-subsymbol pair $(s_i, z_i)$, each subsymbol needs to specify a distribution over both child symbols and subsymbols.
The former can be handled through a finite Dirichlet distribution since all symbols are known and observed, but the latter must be handled with the Dirichlet process machinery, since the number of subsymbols is unknown.
2.6 Variational inference
We present an inference algorithm for the HDP-PCFG model described in Section 2.4, which can also be adapted to the HDP-PCFG-GR model with a bit more bookkeeping.
Most previous inference algorithms for DP-based models involve sampling (Escobar and West, 1995; Teh et al., 2006).
However, we chose to use variational inference (Blei and Jordan, 2005), which provides a fast deterministic alternative to sampling, hence avoiding issues of diagnosing convergence and aggregating samples.
Furthermore, our variational inference algorithm establishes a strong link with past work on PCFG refinement and induction, which has traditionally employed the EM algorithm.
In EM, the E-step involves a dynamic program that exploits the Markov structure of the parse tree, and the M-step involves computing ratios based on expected counts extracted from the E-step.
Our variational algorithm resembles the EM algorithm in form, but the ratios in the M-step are replaced with weights that reflect the uncertainty in parameter estimates.
Because of this procedural similarity, our method is able to exploit the desirable properties of EM such as simplicity, modularity, and efficiency.
HDP-PCFG for grammar refinement (HDP-PCFG-GR) [model definition box]
Figure 4 (panels: Parameters, Trees): We approximate the true posterior $p$ over parameters $\theta$ and latent parse trees $z$ using a structured mean-field distribution $q$, in which the distribution over parameters is completely factorized but the distribution over parse trees is unconstrained.
2.7 Structured mean-field approximation
We denote parameters of the HDP-PCFG as $\theta = (\beta, \phi)$, where $\beta$ denotes the top-level symbol probabilities and $\phi$ denotes the rule probabilities.
The hidden variables of the model are the training parse trees $z$. We denote the observed sentences as $x$.
The goal of Bayesian inference is to compute the posterior distribution $p(\theta, z \mid x)$.
The central idea behind variational inference is to approximate this intractable posterior with a tractable approximation.
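In particular, we want to find the best distribution $q^*$ as defined by
$$q^* = \operatorname*{argmin}_{q \in Q} \text{KL}(q(\theta, z) \,\|\, p(\theta, z \mid x)), \qquad (4)$$
where $Q$ is a tractable subset of distributions.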
We use a structured mean-field approximation, meaning that we only consider distributions that factorize as follows (Figure 4):
$$Q = \Big\{ q(z)\, q(\beta) \prod_{z=1}^{K} q(\phi^T_z)\, q(\phi^E_z)\, q(\phi^B_z) \Big\}. \qquad (5)$$
We constrain $q(\beta)$ to be a degenerate distribution truncated at $K$; i.e., $\beta_z = 0$ for $z > K$.
While the posterior grammar does have an infinite number of symbols, the exponential decay of the DP prior ensures that most of the probability mass is contained in the first few symbols (Ishwaran and James, 2001).²
While our variational approximation $q$ is truncated, the actual PCFG model is not.
As K increases, our approximation improves.
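A rough calculation illustrates how quickly the prior mass beyond a truncation level shrinks. Under the stick-breaking prior, each proportion $u_z \sim \text{Beta}(1, \alpha)$ has $E[1 - u_z] = \alpha / (1 + \alpha)$, and the proportions are independent, so the expected stick length left after $K$ breaks is $(\alpha / (1 + \alpha))^K$. The sketch below evaluates this quantity; note that it concerns the prior expectation only, which is related to but not the same as the variational distance discussed in the footnote, and the function name is hypothetical.

```python
def expected_leftover_mass(alpha, K):
    """Expected prior mass on components beyond the first K under GEM(alpha)."""
    return (alpha / (1.0 + alpha)) ** K

# Expected leftover mass for a few truncation levels with alpha = 1.
leftover = {K: expected_leftover_mass(1.0, K) for K in (1, 2, 4, 8, 16, 20)}
```

With `alpha = 1`, the expected leftover mass halves with every additional component, so at `K = 20` it is already below one in a million.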
2.8 Coordinate-wise ascent
The optimization problem defined by Equation (4) is intractable and nonconvex, but we can use a simple coordinate-ascent algorithm that iteratively optimizes each factor of q in turn while holding the others fixed.
The algorithm turns out to be similar in form to EM for an ordinary PCFG: optimizing $q(z)$ is the analogue of the E-step, and optimizing $q(\phi)$ is the analogue of the M-step; however, optimizing $q(\beta)$ has no analogue in EM.
We summarize each of these updates below (see (Liang et al., 2007) for complete derivations).
Parse trees $q(z)$: The distribution over parse trees $q(z)$ can be summarized by the expected sufficient statistics (rule counts), which we denote as $C(z \to z_l\, z_r)$ for binary productions and $C(z \to x)$ for emissions.
We can compute these expected counts using dynamic programming as in the E-step of EM.
While the classical E-step uses the current rule probabilities $\phi$, our mean-field approximation involves an entire distribution $q(\phi)$.
Fortunately, we can still handle this case by replacing each rule probability with a weight that summarizes the uncertainty over the rule probability as represented by $q$. We define this weight in the sequel.
It is a common perception that Bayesian inference is slow because one needs to compute integrals. Our mean-field inference algorithm is a counterexample: because we can represent uncertainty over rule probabilities with single numbers, much of the existing PCFG machinery based on EM can be modularly imported into the Bayesian framework.
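To illustrate why single-number weights let existing PCFG machinery be reused, here is a minimal sketch of the inside pass of the dynamic program for a CNF grammar. The routine only multiplies and sums rule scores, so it runs unchanged whether those scores are normalized rule probabilities or unnormalized mean-field weights. The dictionary-based grammar representation and the function name are assumptions of this sketch, rule-type weights are taken to be folded into the rule scores, and the outside pass needed for the expected counts is omitted.

```python
from collections import defaultdict

def inside_score(words, emission_w, binary_w, root):
    """Sum of the scores of all CNF parses of `words` rooted at `root`.

    emission_w: dict mapping (z, word) to a rule score
    binary_w:   dict mapping (z, z_left, z_right) to a rule score
    """
    n = len(words)
    chart = defaultdict(float)  # (start, end, z) maps to the inside score of that span
    # Width-1 spans: emission rules.
    for i, w in enumerate(words):
        for (z, word), score in emission_w.items():
            if word == w:
                chart[i, i + 1, z] += score
    # Wider spans: combine two adjacent sub-spans with a binary rule.
    for width in range(2, n + 1):
        for start in range(0, n - width + 1):
            end = start + width
            for split in range(start + 1, end):
                for (z, zl, zr), score in binary_w.items():
                    left = chart.get((start, split, zl), 0.0)
                    right = chart.get((split, end, zr), 0.0)
                    if left and right:
                        chart[start, end, z] += score * left * right
    return chart.get((0, n, root), 0.0)
```

For example, with scores `{("S", "S", "S"): 0.5}` and `{("S", "a"): 0.5}`, the sentence `["a", "a", "a"]` has two parses, each scoring `0.5 ** 5`, and the routine returns their sum.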
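² In particular, the variational distance between the stick-breaking distribution and the truncated version decreases exponentially as the truncation level $K$ increases.
Rule probabilities $q(\phi)$: For an ordinary PCFG, the M-step simply involves taking ratios of expected counts:
$$\phi^B_z(z_l, z_r) = \frac{C(z \to z_l\, z_r)}{C(z \to *\, *)}. \qquad (6)$$
For the variational HDP-PCFG, the optimal $q(\phi)$ is given by the standard posterior update for Dirichlet distributions:³
$$q(\phi^B_z) = \text{Dirichlet}(\phi^B_z;\ \alpha^B \beta\beta^\top + C(z)), \qquad (7)$$
where $C(z)$ is the matrix of counts of rules with left-hand side $z$. These distributions can then be summarized with multinomial weights, which are the only necessary quantities for updating $q(z)$ in the next iteration:
$$W^B_z(z_l, z_r) \overset{\text{def}}{=} \exp E_q[\log \phi^B_z(z_l, z_r)] \qquad (8)$$
$$= \frac{\exp \Psi(C(z \to z_l\, z_r) + \alpha^B \beta_{z_l} \beta_{z_r})}{\exp \Psi(C(z \to *\, *) + \alpha^B)}, \qquad (9)$$
where $\Psi$ is the digamma function.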
The emission parameters can be defined similarly.
Inspection of Equations (6) and (9) reveals that the only difference between the maximum likelihood and the mean-field update is that the latter applies the $\exp(\Psi(\cdot))$ function to the counts (Figure 5).
When the truncation $K$ is large, $\alpha^B \beta_{z_l} \beta_{z_r}$ is near 0 for most right-hand sides $(z_l, z_r)$, so $\exp(\Psi(\cdot))$ has the effect of downweighting counts: for counts $x$ larger than about 1, $\exp(\Psi(x)) \approx x - 0.5$.
Since this subtraction affects small counts more than large counts, there is a rich-get-richer effect: rules that already have large counts will be preferred.
Specifically, consider a set of rules with the same left-hand side.
The weights for all these rules only differ in the numerator (Equation (9)), so applying $\exp(\Psi(\cdot))$ creates a local preference for right-hand sides with larger counts.
Also note that the rule weights are not normalized; they always sum to at most one and are equal to one exactly when $q(\phi)$ is degenerate.
This lack of normalization gives an extra degree of freedom not present in maximum likelihood estimation: it creates a global preference for left-hand sides that have larger total counts.
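The effect of the $\exp(\Psi(\cdot))$ transformation can be checked numerically. The sketch below compares maximum likelihood ratios with mean-field weights for a set of rules sharing one left-hand side. It is an illustration under stated assumptions: the digamma function is approximated with a standard recurrence plus asymptotic series (the Python standard library does not provide one), the counts and prior values are made-up examples, and the function names are hypothetical.

```python
import math

def digamma(x):
    """Approximate the digamma function for x > 0."""
    result = 0.0
    while x < 6.0:  # shift upward using psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # Asymptotic series: ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    result += math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result

def mean_field_weights(counts, prior):
    """exp(psi(count + prior)) / exp(psi(total)) for rules sharing a left-hand side."""
    posterior = [c + p for c, p in zip(counts, prior)]
    log_norm = digamma(sum(posterior))
    return [math.exp(digamma(a) - log_norm) for a in posterior]

def max_likelihood_probs(counts):
    """Ordinary M-step: ratios of expected counts."""
    total = sum(counts)
    return [c / total for c in counts]

counts = [90.0, 9.0, 1.0]   # example expected counts (assumed values)
prior = [0.01, 0.01, 0.01]  # example prior pseudocounts, near 0 as for large K
weights = mean_field_weights(counts, prior)
probs = max_likelihood_probs(counts)
```

In this example the weights sum to slightly less than one, and the rule with count 1 loses a much larger fraction of its maximum likelihood value than the rule with count 90.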
³ Because we have truncated the top-level symbol weights, the DP prior on $\phi^B_z$ reduces to a finite Dirichlet distribution.
Figure 5: The $\exp(\Psi(\cdot))$ function, which is used in computing the multinomial weights for mean-field inference.
It has the effect of reducing a larger fraction of small counts than large counts.
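Top-level symbol probabilities $q(\beta)$: Since $q(\beta)$ is degenerate, optimizing it amounts to finding a single best $\beta^*$. Unlike for $q(\phi)$ and $q(z)$, there is no closed-form expression for the optimal $\beta^*$, and the objective function (Equation (4)) is not convex in $\beta^*$.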
Nonetheless, we can apply a standard gradient projection method (Bertsekas, 1999) to improve $\beta^*$ to a local maximum.
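The part of the objective function in Equation (4) that depends on $\beta^*$ is as follows:
$$L(\beta^*) = \log \text{GEM}(\beta^*; \alpha) + \sum_{z=1}^{K} E_q[\log \text{Dirichlet}(\phi^B_z;\ \alpha^B \beta^* \beta^{*\top})]. \qquad (10)$$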
See Liang et al. (2007) for the derivation of the gradient.
In practice, this optimization has very little effect on performance.
We suspect that this is because the objective function is dominated by $p(x \mid z)$ and $p(z \mid \phi)$, while the contribution of $p(\phi \mid \beta)$ is minor.
3 Experiments
We now present an empirical evaluation of the HDP-PCFG(-GR) model and variational inference techniques.
We first give an illustrative example of the ability of the HDP-PCFG to recover a known grammar and then present the results of experiments on large-scale treebank parsing.
3.1 Recovering a synthetic grammar
In this section, we show that the HDP-PCFG-GR can recover a simple grammar while a standard PCFG fails to do so because it has no built-in control over grammar complexity.
Figure 6: (a) A synthetic grammar with a uniform distribution over rules.
(b) The grammar generates trees of the form shown on the right.
From the grammar in Figure 6, we generated 2000 trees.
The two terminal symbols always have the same subscript, but we collapsed $X_i$ to $X$ in the training data.
We trained the HDP-PCFG-GR, with truncation K = 20, for both S and X for 100 iterations.
We set all hyperparameters to 1.
Figure 7 shows that the HDP-PCFG-GR recovers the original grammar, which contains only 4 subsymbols, leaving the other 16 subsymbols unused.
The standard PCFG allocates all the subsymbols to fit the exact co-occurrence statistics of left and right terminals.
Recall that a rule weight, as defined in Equation (9), is analogous to a rule probability for standard PCFGs.
We say a rule is effective if its weight is at least $10^{-6}$ and its left-hand side has posterior at least $10^{-6}$.
In general, rules with weight smaller than $10^{-6}$ can be safely pruned without affecting parsing accuracy.
The standard PCFG uses all 20 subsymbols of both S and X to explain the data, resulting in 8320 effective rules; in contrast, the HDP-PCFG uses only 4 subsymbols for X and 1 for S, resulting in only 68 effective rules.
If the threshold is relaxed from $10^{-6}$ to $10^{-3}$, then only 20 rules are effective, which corresponds exactly to the true grammar.
3.2 Parsing the Penn Treebank
In this section, we show that our variational HDP-PCFG can scale up to real-world data sets.
We ran experiments on the Wall Street Journal (WSJ) portion of the Penn Treebank.
We trained on sections 2-21, used section 24 for tuning hyperparameters, and tested on section 22.
Figure 7: The posteriors over the subsymbols of the standard PCFG are roughly uniform, whereas the posteriors of the HDP-PCFG are concentrated on four subsymbols, which is the true number of symbols in the grammar.
We binarize the trees in the treebank as follows: for each non-terminal node with symbol $X$, we introduce a right-branching cascade of new nodes with symbol $\bar{X}$. The end result is that each node has at most two children.
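The right-branching binarization can be sketched as follows. This is one plausible reading of the procedure, not necessarily the exact transformation used in the experiments: trees are represented as `(label, children)` tuples with words as plain strings, the new nodes are labeled by appending the hypothetical suffix `-BAR` to the parent symbol, and the first child is kept directly under the original node.

```python
def binarize(tree):
    """Right-binarize a tree given as (label, [children]); leaves are plain strings."""
    if isinstance(tree, str):
        return tree
    label, children = tree
    children = [binarize(child) for child in children]
    if len(children) <= 2:
        return (label, children)
    bar_label = label + "-BAR"  # symbol for the newly introduced nodes
    # The innermost new node holds the last two children.
    right = (bar_label, children[-2:])
    # Wrap the remaining middle children from right to left.
    for child in reversed(children[1:-2]):
        right = (bar_label, [child, right])
    return (label, [children[0], right])

tree = ("NP", [("DT", ["the"]), ("JJ", ["big"]), ("JJ", ["red"]), ("NN", ["dog"])])
binarized = binarize(tree)
```

In this example the four children of `NP` become a cascade of two `NP-BAR` nodes, so every node in `binarized` has at most two children.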
To cope with unknown words, we replace any word appearing fewer than 5 times in the training set with one of 50 unknown word tokens derived from 10 word-form features.
Our goal is to learn a refined grammar, where each symbol in the training set is split into K subsymbols.
We compare an ordinary PCFG estimated with maximum likelihood (Matsuzaki et al., 2005) and the HDP-PCFG estimated using the variational inference algorithm described in Section 2.6.
To parse new sentences with a grammar, we compute the posterior distribution over rules at each span and extract the tree with the maximum expected correct number of rules (Petrov and Klein, 2007).
There are six hyperparameters in the HDP-PCFG-GR model, which we set in the following manner: $\alpha = 1$, $\alpha^T = 1$ (uniform distribution over unaries versus binaries), $\alpha^E = 1$ (uniform distribution over terminal words), $\alpha^u(s) = \alpha^b(s) =$ …, where $N(s)$ is the number of different unary (binary) right-hand sides of rules with left-hand side $s$ in the treebank grammar.
The two most important hyperparameters are $\alpha^U$ and $\alpha^B$, which govern the sparsity of the right-hand side for unary and binary rules.
We set $\alpha^U = \alpha^B$, although more performance could probably be gained by tuning these individually.
It turns out that there is not a single $\alpha^B$ that works for all truncation levels, as shown in Table 1.
If the top-level distribution $\beta$ is uniform, the value of $\alpha^B$ corresponding to a uniform prior over pairs of children subsymbols is $K^2$, since each entry of $\beta\beta^\top$ is then $1/K^2$ and the Dirichlet pseudocounts $\alpha^B \beta_{z_l} \beta_{z_r}$ all equal 1.
Interestingly, the optimal $\alpha^B$ appears to be superlinear but subquadratic in $K$. We used these values of $\alpha^B$ in the following experiments.
Table 1 (rows: truncation $K$, uniform $\alpha^B$): For each truncation level, we report the $\alpha^B$ that yielded the highest $F_1$ score on the development set.
Table 2 (rows: PCFG (smoothed), HDP-PCFG): Development $F_1$ and grammar sizes (the number of effective rules) as we increase the truncation $K$.
The regime in which Bayesian inference is most important is when training data is scarce relative to the complexity of the model.
We train on just section 2 of the Penn Treebank.
Table 2 shows how the HDP-PCFG-GR can produce compact grammars that guard against overfitting.
Without smoothing, ordinary PCFGs trained using EM improve as K increases but start to overfit around K = 4.
Simple add-1.01 smoothing prevents overfitting but at the cost of a sharp increase in grammar sizes.
The HDP-PCFG obtains comparable performance with a much smaller number of rules.
We also trained on sections 2-21 to demonstrate that our methods can scale up and achieve broadly comparable results to existing state-of-the-art parsers.
When using a truncation level of $K = 16$, the standard PCFG with smoothing obtains an $F_1$ score of 88.36 using 706157 effective rules while the HDP-PCFG-GR obtains an $F_1$ score of 87.08 using 428375 effective rules.
We expect to see greater benefits from the HDP-PCFG with a larger truncation level.
4 Related work
The question of how to select the appropriate grammar complexity has been studied in earlier work.
It is well known that more complex models necessarily have higher likelihood and thus a penalty must be imposed for more complex grammars.
Examples of such penalized likelihood procedures include Stolcke and Omohundro (1994), which used an asymptotic Bayesian model selection criterion and Petrov et al. (2006), which used a split-merge algorithm which procedurally determines when to switch between grammars of various complexities.
These techniques are model selection techniques that use heuristics to choose among competing statistical models; in contrast, the HDP-PCFG relies on the Bayesian formalism to provide implicit control over model complexity within the framework of a single probabilistic model.
Johnson et al. (2006) also explored nonparametric grammars, but they do not give an inference algorithm for recursive grammars, e.g., grammars including rules of the form $A \to B\,C$ and $B \to D\,A$.
Recursion is a crucial aspect of PCFGs and our inference algorithm does handle it.
Finkel et al. (2007) independently developed another nonparametric model of grammars.
Though their model is also based on hierarchical Dirichlet processes and is similar to ours, they present a different inference algorithm which is based on sampling.
Kurihara and Sato (2004) and Kurihara and Sato (2006) applied variational inference to PCFGs.
Their algorithm is similar to ours, but they did not consider nonparametric models.
5 Conclusion
We have presented the HDP-PCFG, a nonparametric Bayesian model for PCFGs, along with an efficient variational inference algorithm.
While our primary contribution is the elucidation of the model and algorithm, we have also explored some important empirical properties of the HDP-PCFG and demonstrated the potential of variational HDP-PCFGs on a full-scale parsing task.
