This paper provides an algorithmic framework for learning statistical models involving directed spanning trees, or equivalently non-projective dependency structures.
We show how partition functions and marginals for directed spanning trees can be computed by an adaptation of Kirchhoff's Matrix-Tree Theorem.
To demonstrate an application of the method, we perform experiments which use the algorithm in training both log-linear and max-margin dependency parsers.
The new training methods give improvements in accuracy over perceptron-trained models.
1 Introduction
Learning with structured data typically involves searching or summing over a set with an exponential number of structured elements, for example the set of all parse trees for a given sentence.
Methods for summing over such structures include the inside-outside algorithm for probabilistic context-free grammars (Baker, 1979), the forward-backward algorithm for hidden Markov models (Baum et al., 1970), and the belief-propagation algorithm for graphical models (Pearl, 1988).
These algorithms compute marginal probabilities and partition functions, quantities which are central to many methods for the statistical modeling of complex structures (e.g., the EM algorithm (Baker, 1979; Baum et al., 1970), contrastive estimation (Smith and Eisner, 2005), training algorithms for CRFs (Lafferty et al., 2001), and training algorithms for max-margin models (Bartlett et al., 2004; Taskar et al., 2004a)).
This paper describes inside-outside-style algorithms for the case of directed spanning trees.
These structures are equivalent to non-projective dependency parses (McDonald et al., 2005b), and more generally could be relevant to any task that involves learning a mapping from a graph to an underlying spanning tree.
Unlike the case for projective dependency structures, partition functions and marginals for non-projective trees cannot be computed using dynamic-programming methods such as the inside-outside algorithm.
In this paper we describe how these quantities can be computed by adapting a well-known result in graph theory: Kirchhoff's Matrix-Tree Theorem (Tutte, 1984).
A naive application of the theorem yields $O(n^4)$ and $O(n^6)$ algorithms for computation of the partition function and marginals, respectively.
However, our adaptation finds the partition function and marginals in $O(n^3)$ time using simple matrix determinant and inversion operations.
We demonstrate an application of the new inference algorithm to non-projective dependency parsing.
Specifically, we show how to implement two popular supervised learning approaches for this task: globally-normalized log-linear models and max-margin models.
Log-linear estimation critically depends on the calculation of partition functions and marginals, which can be computed by our algorithms.
For max-margin models, Bartlett et al. (2004) have provided a simple training algorithm, based on exponentiated-gradient (EG) updates, that requires computation of marginals and can thus be implemented within our framework.
Both of these methods explicitly minimize the loss incurred when parsing the entire training set.
This contrasts with the online learning algorithms used in previous work with spanning-tree models (McDonald et al., 2005b).
We applied the above two marginal-based training algorithms to six languages with varying degrees of non-projectivity, using datasets obtained from the CoNLL-X shared task (Buchholz and Marsi, 2006).
Our experimental framework compared three training approaches: log-linear models, max-margin models, and the averaged perceptron.
Each of these was applied to both projective and non-projective parsing.
Our results demonstrate that marginal-based training yields models which outperform those trained using the averaged perceptron.
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 141-150, Prague, June 2007.
©2007 Association for Computational Linguistics
In summary, the contributions of this paper are:
• We introduce algorithms for inside-outside-style calculations for directed spanning trees, or equivalently non-projective dependency structures. These algorithms should have wide applicability in learning problems involving spanning-tree structures.
• We illustrate the utility of these algorithms in log-linear training of dependency parsing models, and show improvements in accuracy when compared to averaged-perceptron training.
• We also train max-margin models for dependency parsing via an EG algorithm (Bartlett et al., 2004). The experiments presented here constitute the first application of this algorithm to a large-scale problem. We again show improved performance over the perceptron.
The goal of our experiments is to give a rigorous comparative study of the marginal-based training algorithms and a highly-competitive baseline, the averaged perceptron, using the same feature sets for all approaches.
We stress, however, that the purpose of this work is not to give competitive performance on the CoNLL data sets; this would require further engineering of the approach.
Similar adaptations of the Matrix-Tree Theorem have been developed independently and simultaneously by Smith and Smith (2007) and McDonald and Satta (2007); see Section 5 for more discussion.
2 Background
2.1 Discriminative Dependency Parsing
Dependency parsing is the task of mapping a sentence $x$ to a dependency structure $y$. Given a sentence $x$ with $n$ words, a dependency for that sentence is a tuple $(h, m)$ where $h \in [0 \ldots n]$ is the index of the head word in the sentence, and $m \in [1 \ldots n]$ is the index of a modifier word.
The value $h = 0$ is a special root-symbol that may only appear as the head of a dependency.
We use $D(x)$ to refer to all possible dependencies for a sentence $x$: $D(x) = \{(h, m) : h \in [0 \ldots n], m \in [1 \ldots n]\}$.
A dependency parse is a set of dependencies that forms a directed tree, with the sentence's root-symbol as its root.
Figure 1: Examples of the four types of dependency structures. We draw dependency arcs from head to modifier. (Surviving panel labels: Projective, Non-projective, Multi Root; example sentence: "root He saw her".)
We will consider both projective trees, where dependencies are not allowed to cross, and non-projective trees, where crossing dependencies are allowed.
Dependency annotations for some languages, for example Czech, can exhibit a significant number of crossing dependencies.
In addition, we consider both single-root and multi-root trees.
In a single-root tree y, the root-symbol has exactly one child, while in a multi-root tree, the root-symbol has one or more children.
This distinction is relevant as our training sets include both single-root corpora (in which all trees are single-root structures) and multi-root corpora (in which some trees are multi-root structures).
The two distinctions described above are orthogonal, yielding four classes of dependency structures; see Figure 1 for examples of each kind of structure.
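The two distinctions can be checked mechanically. The following is a minimal sketch, not taken from the paper: it assumes a parse is stored as a list `heads` in which `heads[m - 1]` is the head of word `m` and `0` denotes the root-symbol, placed to the left of the sentence. The function names are hypothetical, and the code does not verify that the input is a tree.

```python
def is_projective(heads):
    """Return True if no two dependencies cross.

    heads[m - 1] is the head h of word m, with h = 0 for the root-symbol,
    which is treated as position 0, to the left of the sentence.
    """
    arcs = [(min(h, m), max(h, m)) for m, h in enumerate(heads, start=1)]
    # Two arcs cross when exactly one endpoint of one arc lies strictly
    # inside the span of the other; checking all ordered pairs covers
    # both orientations with a single condition.
    return not any(a < c < b < d for (a, b) in arcs for (c, d) in arcs)


def is_single_root(heads):
    """Return True if the root-symbol has exactly one child."""
    return list(heads).count(0) == 1


# "He saw her": "saw" attaches to the root, "He" and "her" attach to "saw".
print(is_projective([2, 0, 2]), is_single_root([2, 0, 2]))
# A four-word structure in which the arcs (3, 1) and (4, 2) cross.
print(is_projective([3, 4, 0, 3]), is_single_root([3, 4, 0, 3]))
```

The quadratic pairwise check is chosen for clarity; it is adequate for sentence-length inputs.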
We use $T_p^s(x)$ to denote the set of all possible projective single-root dependency structures for a sentence $x$, and $T_{np}^s(x)$ to denote the set of single-root non-projective structures for $x$. The sets $T_p^m(x)$ and $T_{np}^m(x)$ are defined analogously for multi-root structures.
In contexts where any class of dependency structures may be used, we use the notation $T(x)$ as a placeholder that may be defined as $T_p^s(x)$, $T_{np}^s(x)$, $T_p^m(x)$ or $T_{np}^m(x)$.
Following McDonald et al. (2005a), we use a discriminative model for dependency parsing.
Features in the model are defined through a function $f(x, h, m)$ which maps a sentence $x$ together with a dependency $(h, m)$ to a feature vector in $\mathbb{R}^d$. A feature vector can be sensitive to any properties of the triple $(x, h, m)$.
Given a parameter vector $w$, the optimal dependency structure for a sentence $x$ is
$$y^*(x; w) = \arg\max_{y \in T(x)} \sum_{(h,m) \in y} w \cdot f(x, h, m) \tag{1}$$
where the set $T(x)$ can be defined as $T_p^s(x)$, $T_{np}^s(x)$, $T_p^m(x)$ or $T_{np}^m(x)$, depending on the type of parsing.
The parameters $w$ will be learned from a training set $\{(x_i, y_i)\}_{i=1}^N$ where each $x_i$ is a sentence and each $y_i$ is a dependency structure.
Much of the previous work on learning w has focused on training local models (see Section 5).
McDonald et al. (2005a; 2005b) trained global models using online algorithms such as the perceptron algorithm or MIRA.
In this paper we consider training algorithms based on work in conditional random fields (CRFs) (Lafferty et al., 2001) and max-margin methods (Taskar et al., 2004a).
2.2 Three Inference Problems
This section highlights three inference problems which arise in training and decoding discriminative dependency parsers, and which are central to the approaches described in this paper.
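Assume that we have a vector $\theta$ with values $\theta_{h,m} \in \mathbb{R}$ for all $(h, m) \in D(x)$; these values correspond to weights on the different dependencies in $D(x)$.
Define a conditional distribution over all dependency structures $y \in T(x)$ as follows:
$$P(y \mid x; \theta) = \frac{\exp\left\{\sum_{(h,m) \in y} \theta_{h,m}\right\}}{Z(x; \theta)}, \qquad Z(x; \theta) = \sum_{y \in T(x)} \exp\left\{\sum_{(h,m) \in y} \theta_{h,m}\right\} \tag{2}$$
The function $Z(x; \theta)$ is commonly referred to as the partition function.
The inference problems are then as follows:
Problem 1: Decoding: Find $\arg\max_{y \in T(x)} \sum_{(h,m) \in y} \theta_{h,m}$.
Problem 2: Computation of the Partition Function: Calculate $Z(x; \theta)$.
Problem 3: Computation of the Marginals: For all $(h, m) \in D(x)$, calculate $\mu_{h,m}(x; \theta) = \sum_{y \in T(x) : (h,m) \in y} P(y \mid x; \theta)$.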
Note that all three problems require a maximization or summation over the set T(x), which is exponential in size.
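To make the three problems concrete, the sketch below solves them by explicit enumeration for the non-projective case. It is only an illustration for very short sentences, since the number of trees grows exponentially with $n$; the function names and the nested-list layout `theta[h][m]` are assumptions made here, not part of the paper.

```python
import itertools
import math


def is_tree(heads):
    """True if every word reaches the root-symbol 0 by following head links."""
    for m in range(1, len(heads) + 1):
        seen, node = set(), m
        while node != 0:
            if node in seen:          # a cycle (or self-loop) was found
                return False
            seen.add(node)
            node = heads[node - 1]
    return True


def enumerate_trees(n, single_root=True):
    """Yield every non-projective dependency tree over n words as a head tuple."""
    for heads in itertools.product(range(n + 1), repeat=n):
        if single_root and heads.count(0) != 1:
            continue
        if is_tree(heads):
            yield heads


def brute_force(theta, single_root=True):
    """Solve Problems 1-3 by enumeration.

    theta[h][m] is the weight of dependency (h, m); column 0 is unused.
    Returns (best tree, partition function Z, marginals mu[h][m]).
    """
    n = len(theta) - 1
    z = 0.0
    mu = [[0.0] * (n + 1) for _ in range(n + 1)]
    best, best_score = None, -math.inf
    for heads in enumerate_trees(n, single_root):
        score = sum(theta[h][m] for m, h in enumerate(heads, start=1))
        if score > best_score:
            best, best_score = heads, score
        weight = math.exp(score)
        z += weight
        for m, h in enumerate(heads, start=1):
            mu[h][m] += weight        # accumulate unnormalized marginals
    mu = [[v / z for v in row] for row in mu]
    return best, z, mu


theta = [[0.0] * 4 for _ in range(4)]   # three words, all weights zero
_, z_single, _ = brute_force(theta, single_root=True)
_, z_multi, _ = brute_force(theta, single_root=False)
print(z_single, z_multi)                # with zero weights, Z counts the trees
```

Such an enumerator is mainly useful as a reference implementation against which faster inference routines can be checked.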
There is a clear motivation for being able to solve Problem 1: by setting $\theta_{h,m} = w \cdot f(x, h, m)$, the optimal dependency structure $y^*(x; w)$ (see Eq. 1) can be computed.
In this paper the motivation for solving Problems 2 and 3 arises from training algorithms for discriminative models.
As we will describe in Section 4, both log-linear and max-margin models can be trained via methods that make direct use of algorithms for Problems 2 and 3.
In the case of projective dependency structures (i.e., $T(x)$ defined as $T_p^s(x)$ or $T_p^m(x)$), there are well-known algorithms for all three inference problems.
Decoding can be carried out using Viterbi-style dynamic-programming algorithms, for example the $O(n^3)$ algorithm of Eisner (1996).
Computation of the marginals and partition function can also be achieved in $O(n^3)$ time, using a variant of the inside-outside algorithm (Baker, 1979) applied to the Eisner (1996) data structures (Paskin, 2001).
In the non-projective case (i.e., $T(x)$ defined as $T_{np}^s(x)$ or $T_{np}^m(x)$), McDonald et al. (2005b) describe how the CLE algorithm (Chu and Liu, 1965; Edmonds, 1967) can be used for decoding.
However, it is not possible to compute the marginals and partition function using the inside-outside algorithm.
We next describe a method for computing these quantities in $O(n^3)$ time using matrix inverse and determinant operations.
3 Spanning-tree inference using the Matrix-Tree Theorem
In this section we present algorithms for computing the partition function and marginals, as defined in Section 2.2, for non-projective parsing.
We first reiterate the observation of McDonald et al. (2005a) that non-projective parses correspond to directed spanning trees on a complete directed graph of n nodes, where n is the length of the sentence.
The above inference problems thus involve summation over the set of all directed spanning trees.
Note that this set is exponentially large, and there is no obvious method for decomposing the sum into dynamic-programming-like subproblems.
This section describes how a variant of Kirchhoff's Matrix-Tree Theorem (Tutte, 1984) can be used to evaluate the partition function and marginals efficiently.
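Let the weight of a dependency structure $y \in T_{np}^s(x)$ be defined as:
$$\psi(y; \theta) = r_{\mathrm{root}(y)}(\theta) \prod_{(h,m) \in y : h \neq 0} A_{h,m}(\theta)$$
Here $\mathrm{root}(y)$ is the single child of the root-symbol in $y$, $r_m(\theta) = \exp\{\theta_{0,m}\}$ is the root-selection score of word $m$, and $A_{h,m}(\theta) = \exp\{\theta_{h,m}\}$ is the weight of the dependency $(h, m)$. With these definitions $\psi(y; \theta) = \exp\{\sum_{(h,m) \in y} \theta_{h,m}\}$, so the partition function is the sum of $\psi(y; \theta)$ over all $y \in T_{np}^s(x)$.
In the remainder of this section, we drop the notational dependence on $x$ for brevity.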
The original Matrix-Tree Theorem addressed the problem of counting the number of undirected spanning trees in an undirected graph.
For the models we study here, we require a sum of weighted and directed spanning trees.
Tutte (1984) extended the Matrix-Tree Theorem to this case.
We briefly summarize his method below.
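Theorem 1 (Tutte, 1984) Let $G$ be a graph with nodes $V = \{1, \ldots, n\}$ and weighted directed edges $A_{h,m}(\theta)$. Define the Laplacian matrix $L(\theta)$ of $G$ by $L_{h,m}(\theta) = \sum_{h' \neq m} A_{h',m}(\theta)$ if $h = m$, and $L_{h,m}(\theta) = -A_{h,m}(\theta)$ otherwise. For a matrix $X$, let the minor $X^{(h,m)}$ be the determinant of the matrix formed by deleting row $h$ and column $m$ from $X$. Finally, define the weight of any directed spanning tree of $G$ to be the product of the weights $A_{h,m}(\theta)$ for the edges in that tree. Then for any $m$, the sum of the weights of all directed spanning trees of $G$ rooted at $m$ is equal to $L^{(m,m)}(\theta)$.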
3.1 Partition functions via matrix determinants
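From Theorem 1, it directly follows that
$$Z(\theta) = \sum_{m=1}^{n} r_m(\theta) \, L^{(m,m)}(\theta) \tag{5}$$
where $L^{(m,m)}(\theta)$ is the minor of the Laplacian $L(\theta)$ obtained by deleting row $m$ and column $m$.
The above would require calculating $n$ determinants, resulting in $O(n^4)$ complexity.
However, as we show below, $Z(\theta)$ may be obtained in $O(n^3)$ time using a single determinant evaluation.
Define a new matrix $\hat{L}(\theta)$ to be $L(\theta)$ with the first row replaced by the root-selection scores:
$$\hat{L}_{h,m}(\theta) = \begin{cases} r_m(\theta) & h = 1 \\ L_{h,m}(\theta) & h > 1 \end{cases}$$
This matrix allows direct computation of the partition function, as the following proposition shows.
Proposition 1 The partition function in Eq. 5 is given by $Z(\theta) = |\hat{L}(\theta)|$.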
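Proposition 1 translates into a few lines of NumPy. The sketch below is an illustration under stated assumptions rather than the authors' implementation: it takes the unlabeled edge weights to be $A_{h,m}(\theta) = \exp\{\theta_{h,m}\}$ and the root-selection scores to be $r_m(\theta) = \exp\{\theta_{0,m}\}$, stores $\theta$ as an $(n+1) \times (n+1)$ array whose row 0 is the root-symbol, and uses the hypothetical function name `single_root_partition`.

```python
import numpy as np


def single_root_partition(theta):
    """Partition function over single-root non-projective trees.

    theta[h, m] is the weight of dependency (h, m); row 0 is the
    root-symbol, and column 0 and the diagonal are ignored.
    """
    theta = np.asarray(theta, dtype=float)
    A = np.exp(theta[1:, 1:])          # A[h-1, m-1] = exp(theta[h, m])
    np.fill_diagonal(A, 0.0)           # a word cannot modify itself
    r = np.exp(theta[0, 1:])           # root-selection scores r_m
    # Laplacian: weighted in-degree on the diagonal, -A elsewhere.
    L_hat = np.diag(A.sum(axis=0)) - A
    L_hat[0, :] = r                    # replace the first row by r
    return np.linalg.det(L_hat)        # a single O(n^3) determinant


# With all weights zero, Z counts the single-root trees over three words.
print(single_root_partition(np.zeros((4, 4))))
```

For long sentences or large weights the exponentials and the determinant can overflow; working with `np.linalg.slogdet` after shifting the weights is one common remedy, which is omitted here for brevity.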
3.2 Marginals via matrix inversion
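The marginals we require are given by
$$\mu_{h,m}(\theta) = \sum_{y \in T_{np}^s(x) : (h,m) \in y} P(y \mid x; \theta)$$
To calculate these marginals efficiently for all values of $(h, m)$ we use a well-known identity relating the log partition-function to marginals:
$$\mu_{h,m}(\theta) = \frac{\partial \log Z(\theta)}{\partial \theta_{h,m}}$$
Since the partition function in this case has a closed-form expression (i.e., the determinant of a matrix constructed from $\theta$), the marginals can also be obtained in closed form.
Using the chain rule, the derivative of the log partition-function in Proposition 1 is
$$\frac{\partial \log Z(\theta)}{\partial \theta_{h,m}} = \sum_{h'=1}^{n} \sum_{m'=1}^{n} \frac{\partial \log |\hat{L}(\theta)|}{\partial \hat{L}_{h',m'}(\theta)} \, \frac{\partial \hat{L}_{h',m'}(\theta)}{\partial \theta_{h,m}}$$
To perform the derivative, we use the identity
$$\frac{\partial \log |X|}{\partial X} = \left(X^{-1}\right)^{T}$$
together with the fact that the entries of $\hat{L}(\theta)$ depend on $\theta$ only through the edge weights $A_{h,m}(\theta) = \exp\{\theta_{h,m}\}$ and the root-selection scores $r_m(\theta) = \exp\{\theta_{0,m}\}$. This gives
$$\mu_{h,m}(\theta) = (1 - \delta_{1,m}) \, A_{h,m}(\theta) \left[\hat{L}^{-1}(\theta)\right]_{m,m} - (1 - \delta_{h,1}) \, A_{h,m}(\theta) \left[\hat{L}^{-1}(\theta)\right]_{m,h}$$
$$\mu_{0,m}(\theta) = r_m(\theta) \left[\hat{L}^{-1}(\theta)\right]_{m,1}$$
where $\delta_{h,m}$ is the Kronecker delta.
Thus, the complexity of evaluating all the relevant marginals is dominated by the matrix inversion, and the total complexity is therefore $O(n^3)$.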
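The derivative argument can be sketched in NumPy as follows. The formulas in the comments are what one obtains by differentiating $\log |\hat{L}(\theta)|$ under the assumptions $A_{h,m}(\theta) = \exp\{\theta_{h,m}\}$ and $r_m(\theta) = \exp\{\theta_{0,m}\}$; the array layout and the name `single_root_marginals` are choices made for this illustration.

```python
import numpy as np


def single_root_marginals(theta):
    """Marginals mu[h, m] over single-root non-projective trees.

    theta[h, m] is the weight of dependency (h, m); row 0 is the
    root-symbol, and column 0 and the diagonal are ignored.
    """
    theta = np.asarray(theta, dtype=float)
    n = theta.shape[0] - 1
    A = np.exp(theta[1:, 1:])
    np.fill_diagonal(A, 0.0)
    r = np.exp(theta[0, 1:])
    L_hat = np.diag(A.sum(axis=0)) - A
    L_hat[0, :] = r                       # first row holds the root scores
    inv = np.linalg.inv(L_hat)            # the single O(n^3) inversion

    mu = np.zeros((n + 1, n + 1))
    # Root dependencies: mu[0, m] = r_m * [L_hat^{-1}]_{m,1}.
    mu[0, 1:] = r * inv[:, 0]
    # Other dependencies:
    #   mu[h, m] = (1 - delta_{1,m}) A[h, m] [L_hat^{-1}]_{m,m}
    #            - (1 - delta_{h,1}) A[h, m] [L_hat^{-1}]_{m,h}
    not_first_m = np.ones(n)
    not_first_m[0] = 0.0                  # (1 - delta_{1,m}), indexed by m
    not_first_h = np.ones(n)
    not_first_h[0] = 0.0                  # (1 - delta_{h,1}), indexed by h
    diag_term = A * (not_first_m * np.diag(inv))[None, :]
    off_term = A * inv.T * not_first_h[:, None]
    mu[1:, 1:] = diag_term - off_term
    return mu


mu = single_root_marginals(np.zeros((4, 4)))
print(mu[:, 1:].sum(axis=0))   # each word has exactly one head in every tree
```

A useful sanity check is that the marginals in each column sum to one, since every word has exactly one head in every tree.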
In the case of multiple roots, we can still compute the partition function and marginals efficiently.
In fact, the derivation of this case is simpler than for single-root structures.
Create an extended graph $G'$ which augments $G$ with a dummy root node that has edges pointing to all of the existing nodes, weighted by the appropriate root-selection scores.
Note that there is a bijection between directed spanning trees of $G'$ rooted at the dummy root and multi-root structures $y \in T_{np}^m(x)$.
Thus, Theorem 1 can be used to compute the partition function directly: construct a Laplacian matrix $L(\theta)$ for $G'$ and compute the minor $L^{(0,0)}(\theta)$.
Since this minor is also a determinant, the marginals can be obtained analogously to the single-root case.
More concretely, this technique corresponds to defining the matrix $\hat{L}(\theta)$ as
$$\hat{L}(\theta) = L(\theta) + \mathrm{diag}(r(\theta))$$
where $L(\theta)$ is the Laplacian of the original graph $G$ and $\mathrm{diag}(v)$ is the diagonal matrix with the vector $v$ on its diagonal.
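The multi-root construction is short enough to sketch in full. The marginal formulas below are not spelled out in the text; they are what differentiating $\log |\hat{L}(\theta)|$ yields when $\hat{L}(\theta)$ is the Laplacian plus the diagonal matrix of root-selection scores, assuming $A_{h,m}(\theta) = \exp\{\theta_{h,m}\}$ and $r_m(\theta) = \exp\{\theta_{0,m}\}$. The function name is hypothetical.

```python
import numpy as np


def multi_root_inference(theta):
    """Partition function and marginals over multi-root non-projective trees.

    theta[h, m] is the weight of dependency (h, m); row 0 is the
    root-symbol, and column 0 and the diagonal are ignored.
    """
    theta = np.asarray(theta, dtype=float)
    n = theta.shape[0] - 1
    A = np.exp(theta[1:, 1:])
    np.fill_diagonal(A, 0.0)
    r = np.exp(theta[0, 1:])
    # L_hat = L + diag(r): root scores are added to the weighted in-degrees.
    L_hat = np.diag(A.sum(axis=0) + r) - A
    z = np.linalg.det(L_hat)
    inv = np.linalg.inv(L_hat)

    mu = np.zeros((n + 1, n + 1))
    d = np.diag(inv)
    mu[0, 1:] = r * d                      # r_m appears only in entry (m, m)
    # A[h, m] appears in entry (m, m) with sign + and in (h, m) with sign -.
    mu[1:, 1:] = A * (d[None, :] - inv.T)
    return z, mu


z, mu = multi_root_inference(np.zeros((4, 4)))
print(z)                       # with zero weights, Z counts multi-root trees
print(mu[:, 1:].sum(axis=0))   # marginals over heads sum to one per word
```

Compared with the single-root case, no row replacement is needed, which is why no Kronecker-delta factors appear.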
The techniques above extend easily to the case where dependencies are labeled.
For a model with $L$ different labels, it suffices to define the edge and root scores as $A_{h,m}(\theta) = \sum_{\ell=1}^{L} \exp\{\theta_{h,m,\ell}\}$ and $r_m(\theta) = \sum_{\ell=1}^{L} \exp\{\theta_{0,m,\ell}\}$.
The partition function over labeled trees is obtained by operating on these values as described previously, and the marginals are given by an application of the chain rule.
Both inference problems are solvable in $O(n^3 + Ln^2)$ time.
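The labeled extension can be sketched as two small helpers: one that collapses labeled weights into the edge and root scores, and one that applies the chain rule to turn unlabeled marginals into labeled ones. The array layout `theta_lab[h, m, l]` and the function names are assumptions made for this illustration; `mu` stands for unlabeled marginals computed by any of the unlabeled inference routines.

```python
import numpy as np


def collapse_labels(theta_lab):
    """Sum label weights into edge scores A (n x n) and root scores r (n,).

    theta_lab[h, m, l] is the weight of dependency (h, m) with label l;
    row 0 is the root-symbol, and column 0 is ignored.
    """
    weights = np.exp(np.asarray(theta_lab, dtype=float))
    A = weights[1:, 1:, :].sum(axis=2)     # O(L n^2) work
    np.fill_diagonal(A, 0.0)
    r = weights[0, 1:, :].sum(axis=1)
    return A, r


def labeled_marginals(theta_lab, mu):
    """Chain rule: mu[h, m, l] = mu[h, m] * softmax_l(theta_lab[h, m, :])."""
    theta_lab = np.asarray(theta_lab, dtype=float)
    # Subtract the per-edge maximum for numerical stability.
    shifted = theta_lab - theta_lab.max(axis=2, keepdims=True)
    label_probs = np.exp(shifted)
    label_probs /= label_probs.sum(axis=2, keepdims=True)
    return np.asarray(mu, dtype=float)[:, :, None] * label_probs


theta_lab = np.zeros((3, 3, 2))            # two words, two labels
A, r = collapse_labels(theta_lab)
print(A, r)
```

The label-collapsing step costs $O(Ln^2)$ and the unlabeled inference costs $O(n^3)$, matching the stated total.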
4 Training Algorithms
This section describes two methods for parameter estimation that rely explicitly on the computation of the partition function and marginals.
4.1 Log-Linear Estimation
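Conditional log-linear models define a distribution over parse trees for a sentence $x$ as follows:
$$P(y \mid x; w) = \frac{\exp\left\{\sum_{(h,m) \in y} w \cdot f(x, h, m)\right\}}{Z(x; w)} \tag{7}$$
where $Z(x; w)$ is the partition function, a sum over $T_p^s(x)$, $T_{np}^s(x)$, $T_p^m(x)$ or $T_{np}^m(x)$.
The parameters $w$ are estimated by minimizing a loss $L(w)$ that combines the negative log-likelihood of the training set under this distribution with a regularization term on $w$.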
The parameter C > 0 is a constant dictating the level of regularization in the model.
Since L(w) is a convex function, gradient descent methods can be used to search for the global minimum.
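Such methods typically involve repeated computation of the loss $L(w)$ and gradient $\partial L(w) / \partial w$, requiring efficient implementations of both functions.
Note that the log-probability of a parse is
$$\log P(y \mid x; w) = \sum_{(h,m) \in y} w \cdot f(x, h, m) - \log Z(x; w)$$
so that the main issue in calculating the loss function $L(w)$ is the evaluation of the partition functions $Z(x_i; w)$.
The gradient of the loss is determined by the gradients of these log-probabilities,
$$\frac{\partial \log P(y_i \mid x_i; w)}{\partial w} = \sum_{(h,m) \in y_i} f(x_i, h, m) - \sum_{(h,m) \in D(x_i)} \mu_{h,m}(x_i; w) \, f(x_i, h, m)$$
where $\mu_{h,m}(x; w) = \sum_{y \in T(x) : (h,m) \in y} P(y \mid x; w)$ is the marginal probability of a dependency $(h, m)$.
Thus, the main issue in the evaluation of the gradient is the computation of the marginals $\mu_{h,m}(x_i; w)$.
Note that Eq. 7 forms a special case of the log-linear distribution defined in Eq. 2 in Section 2.2.
If we set $\theta_{h,m} = w \cdot f(x, h, m)$ then we have $P(y \mid x; w) = P(y \mid x; \theta)$, $Z(x; w) = Z(x; \theta)$, and $\mu_{h,m}(x; w) = \mu_{h,m}(x; \theta)$.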
Thus in the projective case the inside-outside algorithm can be used to calculate the partition function and marginals, thereby enabling training of a log-linear model; in the non-projective case the algorithms in Section 3 can be used for this purpose.
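As an illustration of how partition functions and marginals feed a gradient-based optimizer, the sketch below evaluates a regularized negative log-likelihood and its gradient for multi-root non-projective trees. Several things here are assumptions rather than details from the text: the objective is taken to be $\frac{1}{2}\|w\|^2 - C \sum_i \log P(y_i \mid x_i; w)$ (the exact placement of $C$ is a guess), features are stored as a dense array `F[h, m, :]`, and all function names are hypothetical.

```python
import numpy as np


def log_z_and_marginals(theta):
    """Multi-root log partition function and marginals via L_hat = L + diag(r)."""
    A = np.exp(theta[1:, 1:])
    np.fill_diagonal(A, 0.0)
    r = np.exp(theta[0, 1:])
    L_hat = np.diag(A.sum(axis=0) + r) - A
    _, log_z = np.linalg.slogdet(L_hat)    # log-determinant, avoids overflow
    inv = np.linalg.inv(L_hat)
    d = np.diag(inv)
    mu = np.zeros_like(theta)
    mu[0, 1:] = r * d
    mu[1:, 1:] = A * (d[None, :] - inv.T)
    return log_z, mu


def loss_and_grad(w, data, C):
    """Assumed objective: 0.5 * ||w||^2 - C * sum_i log P(y_i | x_i; w).

    data is a list of (F, heads): F has shape (n+1, n+1, d) with
    F[h, m] = f(x, h, m), and heads[m - 1] is the gold head of word m.
    """
    loss = 0.5 * (w @ w)
    grad = w.copy()
    for F, heads in data:
        theta = F @ w                      # theta[h, m] = w . f(x, h, m)
        log_z, mu = log_z_and_marginals(theta)
        gold = np.zeros_like(theta)
        for m, h in enumerate(heads, start=1):
            gold[h, m] = 1.0
        loss -= C * ((gold * theta).sum() - log_z)
        # d log P / dw = sum over gold edges of f - sum over all edges of mu * f
        grad -= C * np.tensordot(gold - mu, F, axes=([0, 1], [0, 1]))
    return loss, grad


rng = np.random.default_rng(0)
data = [(rng.normal(scale=0.5, size=(4, 4, 5)), [2, 0, 2])]
loss, grad = loss_and_grad(np.zeros(5), data, C=1.0)
print(loss, grad.shape)
```

A function with this signature can be handed to an off-the-shelf gradient-based optimizer; a finite-difference comparison of `loss` and `grad` is a cheap way to validate the marginals.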
4.2 Max-Margin Estimation
The second learning algorithm we consider is the large-margin approach for structured prediction (Taskar et al., 2004a; Taskar et al., 2004b).
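Learning in this framework again involves minimization of a convex function $L(w)$.
Let the margin for parse tree $y$ on the $i$'th training example be defined as
$$m_{i,y}(w) = \sum_{(h,m) \in y_i} w \cdot f(x_i, h, m) - \sum_{(h,m) \in y} w \cdot f(x_i, h, m)$$
The loss function $L(w)$ combines a regularization term on $w$ with the hinge loss $\max_{y \in T(x_i)} (E_{i,y} - m_{i,y}(w))$ of each training example, where $E_{i,y}$ is a measure of the loss (or number of errors) for parse $y$ on the $i$'th training sentence.
In this paper we take $E_{i,y}$ to be the number of incorrect dependencies in the parse tree $y$ when compared to the gold-standard parse tree $y_i$.
Note that $E_{i,y_i} = 0$ and $m_{i,y_i}(w) = 0$, so that the hinge loss is always nonnegative.
In addition, the hinge loss is 0 if and only if $m_{i,y}(w) \geq E_{i,y}$ for all $y \in T(x_i)$.
Thus the hinge loss directly penalizes margins $m_{i,y}(w)$ which are less than their corresponding losses $E_{i,y}$.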
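The quantities in this definition are easy to compute for an explicit list of candidate parses. The sketch below is illustrative only: it assumes the dependency scores $w \cdot f(x_i, h, m)$ have already been collected in a nested list `theta[h][m]`, it represents parses as head lists, and it maximizes over a supplied candidate list rather than over all of $T(x_i)$. The function names are hypothetical.

```python
def score(theta, heads):
    """Sum of dependency scores theta[h][m] over the parse."""
    return sum(theta[h][m] for m, h in enumerate(heads, start=1))


def num_errors(gold_heads, heads):
    """E_{i,y}: number of words whose head differs from the gold head."""
    return sum(1 for g, h in zip(gold_heads, heads) if g != h)


def margin(theta, gold_heads, heads):
    """m_{i,y}(w): score of the gold parse minus score of parse y."""
    return score(theta, gold_heads) - score(theta, heads)


def hinge_loss(theta, gold_heads, candidates):
    """max over candidate parses y of E_{i,y} - m_{i,y}(w)."""
    return max(num_errors(gold_heads, y) - margin(theta, gold_heads, y)
               for y in candidates)


gold = [2, 0, 2]
candidates = [gold, [2, 0, 1], [3, 0, 2]]
theta = [[0.0] * 4 for _ in range(4)]
print(hinge_loss(theta, gold, candidates))   # no margin yet: loss is positive
for m, h in enumerate(gold, start=1):
    theta[h][m] = 2.0                        # reward the gold dependencies
print(hinge_loss(theta, gold, candidates))   # every margin now covers its loss
```

Including the gold parse among the candidates guarantees a nonnegative result, mirroring the argument in the text.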
Figure 2 shows an algorithm for minimizing L(w) that is based on the exponentiated-gradient algorithm for large-margin optimization described by Bartlett et al. (2004).
The algorithm maintains a set of weights $\theta_{i,h,m}$ for $i = 1 \ldots N$, $(h, m) \in D(x_i)$, which are updated example-by-example.
The algorithm relies on the repeated computation of marginal values $\mu_{i,h,m}$, which are defined as follows:1
$$\mu_{i,h,m} = \sum_{y \in T(x_i) : (h,m) \in y} P(y \mid x_i), \qquad P(y \mid x_i) = \frac{\exp\left\{\sum_{(h,m) \in y} \theta_{i,h,m}\right\}}{\sum_{y' \in T(x_i)} \exp\left\{\sum_{(h,m) \in y'} \theta_{i,h,m}\right\}} \tag{8}$$
A similar definition is used to derive marginal values $\mu'_{i,h,m}$ from the values $\theta'_{i,h,m}$. Computation of the $\mu$ and $\mu'$ values is again inference of the form described in Problem 3 in Section 2.2, and can be achieved using the inside-outside algorithm for projective structures, and the algorithms described in Section 3 for non-projective structures.
1 Bartlett et al. (2004) write $P(y \mid x_i)$ as $\alpha_{i,y}$.
The $\alpha_{i,y}$ variables are dual variables that appear in the dual objective function, i.e., the convex dual of $L(w)$.
Analysis of the algorithm shows that as the $\theta_{i,h,m}$ variables are updated, the dual variables converge to the optimal point of the dual objective, and the parameters $w$ converge to the minimum of $L(w)$.
Inputs: Training examples $\{(x_i, y_i)\}_{i=1}^N$.
Parameters: Regularization constant $C$, starting point $\beta$, number of passes over training set $T$.
Data Structures: Real values $\theta_{i,h,m}$ and $l_{i,h,m}$ for $i = 1 \ldots N$, $(h, m) \in D(x_i)$. Learning rate $\eta$.
where $\delta_{i,h,m} = (1 - l_{i,h,m} - \mu_{i,h,m})$ and the $\mu_{i,h,m}$ values are calculated from the $\theta_{i,h,m}$ values as described in Eq. 8.
Algorithm: Repeat $T$ passes over the training set, where each pass is as follows:
• For example $i$, calculate marginals $\mu_{i,h,m}$ from $\theta_{i,h,m}$ values, and marginals $\mu'_{i,h,m}$ from $\theta'_{i,h,m}$ values (see Eq. 8).
• Update the parameters:
Output: Parameter values $w$
Figure 2: The EG Algorithm for Max-Margin Estimation.
The learning rate $\eta$ is halved each time the dual objective function (see (Bartlett et al., 2004)) fails to increase.
In our experiments we chose $\beta = 9$, which was found to work well during development of the algorithm.
5 Related Work
Global log-linear training has been used in the context of PCFG parsing (Johnson, 2001).
Riezler et al. (2004) explore a similar application of log-linear models to LFG parsing.
Max-margin learning has been applied to PCFG parsing by Taskar et al. (2004b).
They show that this problem has a QP dual of polynomial size, where the dual variables correspond to marginal probabilities of CFG rules.
A similar QP dual may be obtained for max-margin projective dependency parsing.
However, for non-projective parsing, the dual QP would require an exponential number of constraints on the dependency marginals (Chopra, 1989).
Nevertheless, alternative optimization methods like that of Tsochantaridis et al. (2004), or the EG method presented here, can still be applied.
The majority of previous work on dependency parsing has focused on local (i.e., classification of individual edges) discriminative training methods (Yamada and Matsumoto, 2003; Nivre et al., 2004; Y. Cheng, 2005).
Non-local (i.e., classification of entire trees) training methods were used by McDonald et al. (2005a), who employed online learning.
Dependency parsing accuracy can be improved by allowing second-order features, which consider more than one dependency simultaneously.
McDonald and Pereira (2006) define a second-order dependency parsing model in which interactions between adjacent siblings are allowed, and Carreras (2007) defines a second-order model that allows grandparent and sibling interactions.
Both authors give polytime algorithms for exact projective parsing.
By adapting the inside-outside algorithm to these models, partition functions and marginals can be computed for second-order projective structures, allowing log-linear and max-margin training to be applied via the framework developed in this paper.
For higher-order non-projective parsing, however, computational complexity results (McDonald and Pereira, 2006; McDonald and Satta, 2007) indicate that exact solutions to the three inference problems of Section 2.2 will be intractable.
Exploration of approximate second-order non-projective inference is a natural avenue for future research.
Two other groups of authors have independently and simultaneously proposed adaptations of the Matrix-Tree Theorem for structured inference on directed spanning trees (McDonald and Satta, 2007; Smith and Smith, 2007).
There are some algorithmic differences between these papers and ours.
First, we define both multi-root and single-root algorithms, whereas the other papers only consider multi-root parsing.
This distinction can be important as one often expects a dependency structure to have exactly one child attached to the root-symbol, as is the case in a single-root structure.
Second, McDonald and Satta (2007) propose an $O(n^5)$ algorithm for computing the marginals, as opposed to the $O(n^3)$ matrix-inversion approach used by Smith and Smith (2007) and ourselves.
In addition to the algorithmic differences, both groups of authors consider applications of the Matrix-Tree Theorem which we have not discussed.
For example, both papers propose minimum-risk decoding, and McDonald and Satta (2007) discuss unsupervised learning and language modeling, while Smith and Smith (2007) define hidden-variable models based on spanning trees.
In this paper we used EG training methods only for max-margin models (Bartlett et al., 2004).
However, Globerson et al. (2007) have recently shown how EG updates can be applied to efficient training of log-linear models.
6 Experiments on Dependency Parsing
In this section, we present experimental results applying our inference algorithms for dependency parsing models.
Our primary purpose is to establish comparisons along two relevant dimensions: projective training vs. non-projective training, and marginal-based training algorithms vs. the averaged perceptron.
The feature representation and other relevant dimensions are kept fixed in the experiments.
We used data from the CoNLL-X shared task on multilingual dependency parsing (Buchholz and Marsi, 2006).
In our experiments, we used a subset consisting of six languages; Table 1 gives details of the data sets used.2 For each language we created a validation set that was a subset of the CoNLL-X training set for that language.
2 Our subset includes the two languages with the lowest accuracy in the CoNLL-X evaluations (Turkish and Arabic), the language with the highest accuracy (Japanese), the most non-projective language (Dutch), a moderately non-projective language (Slovene), and a highly projective language (Spanish).
All languages but Spanish have multi-root parses in their data.
We are grateful to the providers of the treebanks that constituted the data of our experiments (Hajic et al., 2004; van der Beek et al., 2002; Kawata and Bartels, 2000; Dzeroski et al., 2006; Civit and Marti, 2002; Oflazer et al., 2003).
Table 1: Information for the languages in our experiments.
The 2nd column (%cd) is the percentage of crossing dependencies in the training and validation sets.
The last three columns report the size in tokens of the training, validation and test sets.
The remainder of each training set was used to train the models for the different languages.
The validation sets were used to tune the meta-parameters (e.g., the value of the regularization constant C) of the different training algorithms.
We used the official test sets and evaluation script from the CoNLL-X task.
All of the results that we report are for unlabeled dependency parsing.3
The non-projective models were trained on the CoNLL-X data in its original form.
Since the projective models assume that the dependencies in the data are non-crossing, we created a second training set for each language where non-projective dependency structures were automatically transformed into projective structures.
All projective models were trained on these new training sets.4 Our feature space is based on that of McDonald et al. (2005a).
We performed experiments using three training algorithms: the averaged perceptron (Collins, 2002), log-linear training (via conjugate gradient descent), and max-margin training (via the EG algorithm).
Each of these algorithms was trained using projective and non-projective methods, yielding six training settings per language.
The different training algorithms have various meta-parameters, which we optimized on the validation set for each language/training-setting combination.
3 Our algorithms also support labeled parsing (see Section 3.4).
Initial experiments with labeled models showed the same trend that we report here for unlabeled parsing, so for simplicity we conducted extensive experiments only for unlabeled parsing.
4 The transformations were performed by running the projective parser with score +1 on correct dependencies and -1 otherwise: the resulting trees are guaranteed to be projective and to have a minimum loss with respect to the correct tree.
Note that only the training sets were transformed.
5 It should be noted that McDonald et al. (2006) use a richer feature set that is incomparable to our features.
Table 2: Test data results, with one column group per training algorithm (Perceptron, Max-Margin, Log-Linear).
The p and np columns show results with projective and non-projective training respectively.
Table 3: Results for the three training algorithms on the different languages (P = perceptron, E = EG, L = log-linear models).
AV is an average across the results for the different languages.
The averaged perceptron has a single meta-parameter, namely the number of iterations over the training set.
The log-linear models have two meta-parameters: the regularization constant C and the number of gradient steps T taken by the conjugate-gradient optimizer.
The EG approach also has two meta-parameters: the regularization constant C and the number of iterations, T.6 For models trained using non-projective algorithms, both projective and non-projective parsing was tested on the validation set, and the highest scoring of these two approaches was then used to decode test data sentences.
Table 2 reports test results for the six training scenarios.
These results show that for Dutch, which is the language in our data that has the highest number of crossing dependencies, non-projective training gives significant gains over projective training for all three training methods.
For the other languages, non-projective training gives similar or even improved performance over projective training.
Table 3 gives an additional set of results, which were calculated as follows.
For each of the three training methods, we used the validation set results to choose between projective and non-projective training.
This allows us to make a direct comparison of the three training algorithms.
Table 3 shows the results of this comparison.7 The results show that log-linear and max-margin models both give a higher average accuracy than the perceptron.
For some languages (e.g., Japanese), the differences from the perceptron are small; however for other languages (e.g., Arabic, Dutch or Slovene) the improvements seen are quite substantial.
6 We trained the perceptron for 100 iterations, and chose the iteration which led to the best score on the validation set.
Note that in all of our experiments, the best perceptron results were actually obtained with 30 or fewer iterations.
For the log-linear and EG algorithms we tested a number of values for C, and for each value of C ran 100 gradient steps or EG iterations, finally choosing the best combination of C and T found in validation.
7 Conclusions
This paper describes inference algorithms for spanning-tree distributions, focusing on the fundamental problems of computing partition functions and marginals.
Although we concentrate on log-linear and max-margin estimation, the inference algorithms we present can serve as black-boxes in many other statistical modeling techniques.
Our experiments suggest that marginal-based training produces more accurate models than perceptron learning.
Notably, this is the first large-scale application of the EG algorithm, and shows that it is a promising approach for structured learning.
In line with McDonald et al. (2005b), we confirm that spanning-tree models are well-suited to dependency parsing, especially for highly non-projective languages such as Dutch.
Moreover, spanning-tree models should be useful for a variety of other problems involving structured data.
Acknowledgments
The authors would like to thank the anonymous reviewers for their constructive comments.
In addition, the authors gratefully acknowledge the following sources of support.
Terry Koo was funded by
from NTT, Agmt.
Amir Globerson was supported by a fellowship from the Rothschild Foundation - Yad Hanadiv.
Xavier Carreras was supported by the Catalan Ministry of Innovation, Universities and Enterprise, and a grant from NTT, Agmt. Dtd. 6/21/1998.
Michael Collins was funded by NSF grants 0347631 and DMS-0434222.
7 We ran the sign test at the sentence level to measure the statistical significance of the results aggregated across the six languages.
Out of 2,472 sentences total, log-linear models gave improved parses over the perceptron on 448 sentences, and worse parses on 343 sentences.
The max-margin method gave improved/worse parses for 500/383 sentences.
Both results are significant with p < 0.001.
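The significance claim can be reproduced from the counts given in this footnote. The sketch below computes an exact sign-test p-value with the standard library; treating the test as two-sided and discarding sentences on which the two parsers tie are assumptions, since the footnote does not state these details, and the function name is hypothetical.

```python
from math import comb


def sign_test_p_value(wins, losses):
    """Exact two-sided sign test under the null hypothesis P(win) = 0.5.

    Ties are assumed to have been discarded before counting.
    """
    n = wins + losses
    k = max(wins, losses)
    # Probability of an outcome at least as extreme as k in one tail,
    # computed exactly with integer arithmetic, then doubled.
    tail = sum(comb(n, i) for i in range(k, n + 1))
    return min(1.0, 2 * tail / 2 ** n)


# Counts reported in the footnote (improved vs. worse parses).
print(sign_test_p_value(448, 343))   # log-linear vs. perceptron
print(sign_test_p_value(500, 383))   # max-margin vs. perceptron
```

Exact integer arithmetic is used for the binomial tail so that no normal approximation is involved.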
