One may need to build a statistical parser for a new language, using only a very small labeled treebank together with raw text.
We argue that bootstrapping a parser is most promising when the model uses a rich set of redundant features, as in recent models for scoring dependency parses (McDonald et al., 2005).
Drawing on Abney's (2004) analysis of the Yarowsky algorithm, we perform bootstrapping by entropy regularization: we maximize a linear combination of conditional likelihood on labeled data and confidence (negative Rényi entropy) on unlabeled data.
In initial experiments, this surpassed EM for training a simple feature-poor generative model, and also improved the performance of a feature-rich, conditionally estimated model where EM could not easily have been applied.
We discuss how our feature set could be extended with cross-lingual or cross-domain features, to incorporate knowledge from parallel or comparable corpora during bootstrapping.
1 Motivation
In this paper, we address the problem of bootstrapping new statistical parsers for new languages, genres, or domains.
Why is this problem important?
Many applications of multilingual NLP require parsing in order to extract information, opinions, and answers from text, and to produce improved translations.
Yet an adequate labeled training corpus (a large treebank of manually constructed parse trees of typical sentences) is rarely available and would be prohibitively expensive to develop.
We show how it is possible to train instead from a small hand-labeled treebank in the target domain, together with a large unannotated collection of in-domain sentences.
Additional resources such as parsers for other domains or languages can be integrated naturally.
Dependency parsing is important as a key component in leading systems for information extraction (Weischedel, 2004)1 and question answering (Peng et al., 2005).
These systems rely on edges or paths in dependency parse trees to define their extraction patterns and classification features.
Parsing is also key to the latest advances in machine translation, which translate syntactic phrases (Galley et al., 2006; Marcu et al., 2006; Cowan et al., 2006).
2 Our Approach
Our approach rests on three observations:
• Recent "feature-based" parsing models are an excellent fit for bootstrapping, because the parse is often overdetermined by many redundant features.
• The feature-based framework is flexible enough to incorporate other sources of guidance during training or testing—such as the knowledge contained in a parser for another language or domain.
• Maximizing a combination of likelihood on labeled data and confidence on unlabeled data is a principled approach to bootstrapping.
2.1 Feature-Based Parsing
McDonald et al. (2005) introduced a simple, flexible framework for scoring dependency parses.
Each directed edge $e$ in the dependency tree is described with a high-dimensional feature vector $f(e)$.
The edge's score is the dot product $f(e) \cdot \theta$, where $\theta$ is a learned weight vector.
The overall score of a dependency tree is the sum of the scores of all edges in the tree.
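The edge-factored scoring just described can be sketched in a few lines. This is a minimal illustration, not the MSTParser implementation: the feature names, the sparse dictionary representation, and the helper names `edge_score` and `tree_score` are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int]            # (parent index, child index); 0 is the root
FeatureVector = Dict[str, float]  # sparse vector: feature name -> value


def edge_score(features: FeatureVector, theta: Dict[str, float]) -> float:
    """Dot product f(e) . theta over the features that fire on the edge."""
    return sum(value * theta.get(name, 0.0) for name, value in features.items())


def tree_score(tree: List[Edge],
               edge_features: Dict[Edge, FeatureVector],
               theta: Dict[str, float]) -> float:
    """Score of a dependency tree: the sum of the scores of its edges."""
    return sum(edge_score(edge_features[e], theta) for e in tree)


# Toy sentence "John sleeps" with hypothetical features and weights.
theta = {"parent=VERB,child=NOUN": 1.5, "dist=1": 0.25, "root,child=VERB": 2.0}
edge_features = {
    (0, 2): {"root,child=VERB": 1.0},                     # root -> sleeps
    (2, 1): {"parent=VERB,child=NOUN": 1.0, "dist=1": 1.0},  # sleeps -> John
}
print(tree_score([(0, 2), (2, 1)], edge_features, theta))  # 3.75
```

Features with no learned weight contribute zero, which is why a sparse lookup with a default of 0.0 is enough here.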
1Ralph Weischedel (p.c.) reports that this system's performance degrades considerably when only phrase chunking is available rather than full parsing.
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 667-677, Prague, June 2007.
©2007 Association for Computational Linguistics
Given an $n$-word input sentence, the parser begins by scoring each of the $O(n^2)$ possible edges, and then seeks the highest-scoring legal dependency tree formed by any $n - 1$ of these edges, using an $O(n^3)$ dynamic programming algorithm (Eisner, 1996) for projective trees.
For non-projective parsing, $O(n^3)$, or with some trickery $O(n^2)$, greedy algorithms exist (Chu and Liu, 1965; Edmonds, 1967; Gabow et al., 1986).
The feature function f may pay attention to many properties of the directed edge e. Of course, features may consider the parent and child words connected by e, and their parts of speech.2 But some features used by McDonald et al. (2005) also consider the parts of speech of words adjacent to the parent and child, or between the parent and child, as well as the number of words between the parent and child.
In general, these features are not available in a generative model such as a PCFG.
Although feature-based models are often trained purely discriminatively, we will see in §2.6 how to train them to model conditional probabilities.
2.2 Feature-Based Parsing and Bootstrapping
The above parsing model is robust, thanks to its many features.
On the Penn Treebank WSJ sections 02-21, for example, McDonald's parser extracts 5.5 million feature types from supervised edges alone, with about 120 feature tokens firing per edge.
The highest-scoring parse tree represents a consensus among all features on all prospective edges.
Even if a prospective edge has some discouraging features (i.e., with negative or zero weights), it may still have a relatively high score thanks to its other features.
Furthermore, even if the edge has a low total score, it may still appear in the consensus parse if the alternatives are even worse or are incompatible with other high-scoring edges.
Put another way, the parser is not able to include high-scoring features or edges independently of one another.
Selecting a good feature means accepting all other features on that edge.
It also means rejecting various other edges, because of the global constraints that a legal parse tree must give each word only one parent and must be free of cycles and, in the projective case, crossings.
2Note that since we are not trying to predict parts of speech, we treat the output of one or more automatic taggers as yet more inputs to edge feature functions.
Our observation is that this situation is ideal for so-called "bootstrapping," "co-training," or "minimally supervised" learning methods (Yarowsky, 1995; Blum and Mitchell, 1998; Yarowsky and Wicentowski, 2000).
Such methods should thrive when the right answer is overdetermined owing to redundant features and/or global constraints.
Concretely, suppose we start by training a supervised parser on only 100 examples, using some regularization method to prevent overfitting to this set.
While many features might truly be relevant to the task, only a few appear often enough in this small training set to acquire significantly positive or negative weights.
Even this lightly trained parser may be quite sure of itself on some test sentences in a large unannotated corpus, when one parse scores far higher than all others.
More generally, the parser may be sure about part of a sentence: it may be certain that a particular edge is present (or absent), because that edge tends to be present (or absent) in all high-scoring parses.
Retraining the feature weights $\theta$ on these high-confidence edges can learn about additional features that are correlated with an edge's success or failure.
For example, it may now learn strong weights for lexically specific features that were never observed in the supervised training set.
The retrained parser may now be able to confidently parse even more of the unannotated examples; so we can iterate the process.
Our hope is that the model identifies new good and bad edges at each step, and does so correctly.
The more features and global constraints the model has,
• the more power it will have to discriminate among edges even when $\theta$ is insufficiently trained.
(Some feature weights may be too weak (i.e., too close to zero) because the initial labeled set is small.)
• the more robust it will be against errors even when $\theta$ is incorrectly trained.
(Some feature weights may be too strong or have the wrong sign, because of overfitting or mistaken parses during bootstrapping.)
In the former case, strong features lend their strength to weak ones.
In the latter case, a conflict among strong features weakens the ones that depart from the consensus, or discounts the example sentence if there is no consensus.
Previous work on parser bootstrapping has not been able to exploit this redundancy among features, because it has used PCFG-like models with far fewer features (Steedman et al., 2003).
2.3 Adaptation and Projection via Features
The previous section assumed that we had a small supervised treebank in the target language and domain (plus a large unsupervised corpus).
We now consider other, more dubious, knowledge sources that might supplement or replace this small treebank.
In each case, we can use these knowledge sources to derive features that may (or may not) prove trustworthy during bootstrapping.
Parses from a different domain.
One might have a treebank for a different domain or genre of the target language.
One could simply include these trees in the initial supervised training, and hope that bootstrapping corrects any learned weights that are inappropriate to the target domain, as discussed above.
In fact, McClosky et al. (2006) found a similar technique to be effective—though only in a model with a large feature space ("PCFG + reranking"), as we would predict.
However, another approach is to train a separate out-of-domain parser, and use this to generate additional features on the supervised and unsupervised in-domain data (Blitzer et al., 2006).
Bootstrapping now teaches us where to trust the out-of-domain parser.
If our basic model has 100 features, we could add features 101 through 200, where for example $f_{123}(e) = f_{23}(e) \cdot \log \hat{\Pr}(e)$ and $\hat{\Pr}(e)$ is the posterior edge probability according to the out-of-domain parser.
Learning that this feature has a high weight means learning to trust the out-of-domain parser's decision on edges where in-domain feature 23 fires.
Even more sensibly, we could add features such as $f_{201}(e) = \sum_{i=1}^{10} \hat{f}_i(e) \cdot \hat{\theta}_i$, where $\hat{f}$ and $\hat{\theta}$ are the feature and weight vectors for the out-of-domain parser.
Learning that this feature has a high weight means learning to trust the out-of-domain parser's feature
weights for a particular class of features (those numbered 1 through 10).
This addresses the intuition that some linguistic phenomena remain stable across domains.
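The two kinds of out-of-domain features described above can be sketched as follows. This is an illustrative sketch under stated assumptions: feature vectors are dense Python lists, the out-of-domain posterior is strictly positive, out-of-domain features are grouped into consecutive blocks of ten, and the function name `augmented_features` is hypothetical.

```python
import math
from typing import List


def augmented_features(f: List[float], ood_posterior: float,
                       f_hat: List[float], theta_hat: List[float],
                       block: int = 10) -> List[float]:
    """In-domain features, followed by the two kinds of out-of-domain features.

    f             in-domain feature values for one edge (features 1..100 in the text)
    ood_posterior posterior edge probability from the out-of-domain parser (must be > 0)
    f_hat         out-of-domain feature values for the same edge
    theta_hat     out-of-domain weight vector
    """
    log_post = math.log(ood_posterior)
    # Features 101..200: each in-domain feature gated by the out-of-domain log posterior.
    gated = [fi * log_post for fi in f]
    # Features 201, 202, ...: partial out-of-domain scores, one per block of features.
    block_scores = [
        sum(fh * th for fh, th in zip(f_hat[s:s + block], theta_hat[s:s + block]))
        for s in range(0, len(f_hat), block)
    ]
    return list(f) + gated + block_scores


f = [0.0] * 100
f[22] = 1.0                              # in-domain feature 23 fires on this edge
f_hat = [1.0] * 20
theta_hat = [0.5] * 10 + [-1.0] * 10
features = augmented_features(f, 0.5, f_hat, theta_hat)
print(len(features), features[122], features[200], features[201])
```

Index 122 holds feature 123 of the text (lists are zero-based), and indices 200 and 201 hold the block scores for out-of-domain features 1 through 10 and 11 through 20.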
Parses of translations.
Suppose we have translations into English of some of our supervised or unsupervised sentences.
Good probabilistic dependency parsers already exist for English, so we run one over the English translation.
We can now derive many additional features on candidate edges on the target sentence.
For example, dependency edges in the target language of the form $c \xrightarrow{\text{possessor}} p$ (this denotes a child-to-parent dependency with label possessor) might often correspond to dependency paths in the English translation of the form $p' \xleftarrow{\text{prep}} \textit{of} \xleftarrow{\text{pobj}} c'$. This correspondence can be captured by a feature, equation (1),
where $c', p'$ range over word tokens in the English translation, "of" is a literal English word, and the probabilities are posteriors provided by a probabilistic aligner and a probabilistic English parser.
Note that this is a single feature (not a feature family parameterized by $c, p$).
It scores any candidate edge on the corresponding English $\xleftarrow{\text{prep}} \textit{of} \xleftarrow{\text{pobj}}$ path.
This method is inspired by Hwa et al. (2005), who bootstrapped parsers for Spanish and Chinese by projecting dependencies from English translations and training a new parser on the resulting noisy treebank.
They used only 1-best translations, 1-best alignments, dependency paths of length 1, and no labeled data in Spanish or Chinese.
Hwa et al. (2005) used a manually written postprocessor to correct some of the many incorrect projections.
By contrast, our framework uses the projected dependencies only as one source of features.
They may be overridden by other features in particular cases, and will be given a high weight only if they tend to agree with other features during bootstrapping.
A similar soft projection of dependencies was used in supervised machine translation by Smith and Eisner (2006), who used a source sentence's dependency paths to bias the generation of its translation.
Note that these bilingual features will only fire on those supervised or unsupervised sentences for which we have an English translation.
In particular, they will usually be unavailable on the test set.
However, we hope that they will seed and facilitate the bootstrapping process, by helping us confidently parse some unsupervised sentences that we would not be able to confidently parse without an English translation.
Parses of comparable English sentences.
World knowledge can be useful in parsing.
Suppose you see a French sentence that contains mangeons and pommes, and you know that manger=eat and pomme=apple.
You might reasonably guess that pommes is the direct object of mangeons, because you know that apple is a plausible direct object for eat.
We can discover this last bit of world knowledge from comparable English text.
Translation dictionaries can themselves be induced from comparable corpora (Schafer and Yarowsky, 2002; Schafer, 2006; Klementiev and Roth, 2006), or extracted from bitext or digitized versions of human-readable dictionaries if these are available.
The above inference pattern can be captured by features similar to those in equation (1).
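For example, one can define a feature $j$ by
$$f_j(c \to p) = \Pr\bigl(p' \xleftarrow{\text{prep}} \textit{of} \xleftarrow{\text{pobj}} c' \;\bigm|\; c' \text{ translates into } c,\ p' \text{ translates into } p\bigr) \qquad (2)$$
where each event in the event space is a pair $(c', p')$ of same-sentence tokens in comparable English text, all pairs being equally likely.
Thus, to estimate $\Pr(\cdot \mid \cdot)$, the denominator counts same-sentence token pairs $(c', p')$ in the comparable English corpus that translate into the types $(c, p)$, and the numerator counts such pairs that are also related by a $\xleftarrow{\text{prep}} \textit{of} \xleftarrow{\text{pobj}}$ path.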
Since the lexical translations and dependency paths are typically not labeled in the English corpus, a given pair must be counted fractionally according to its posterior probability of satisfying these conditions, given models of contextual translation and English parsing.3
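The fractional counting described above might look like the following sketch. It assumes, purely for illustration, that each same-sentence English token pair comes with two posteriors (that the pair translates into the target types, and that the pair is related by the dependency path) and that these two events are independent given the models. Neither the input format nor the name `path_probability` comes from the paper.

```python
from typing import Iterable, Tuple


def path_probability(pairs: Iterable[Tuple[float, float]]) -> float:
    """Estimate Pr(path | translation) from fractionally counted token pairs.

    Each item is (p_translation, p_path) for one same-sentence English pair:
      p_translation  posterior that the pair translates into the types (c, p)
      p_path         posterior that the pair is related by the dependency path
    """
    numerator = 0.0
    denominator = 0.0
    for p_translation, p_path in pairs:
        denominator += p_translation           # fractional count of translating pairs
        numerator += p_translation * p_path    # independence assumption (illustrative)
    return numerator / denominator if denominator > 0.0 else 0.0


pairs = [(1.0, 1.0), (1.0, 0.0), (0.5, 0.5)]
print(path_probability(pairs))  # 0.5
```

With hard 0/1 posteriors this reduces to an ordinary ratio of counts.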
2.4 Bootstrapping as Optimization
Section 2.2 assumed a relatively conventional kind of bootstrapping, where each iteration retrains the model on the examples where it is currently most confident.
This kind of "confidence thresholding" has been popular in previous bootstrapping work (as cited in §2.2).
It attempts to maintain high accuracy while gradually expanding coverage.
The assumption is that throughout the training procedure, the parser's confidence is a trustworthy guide to its correctness.
Different bootstrapping procedures use different learners, smoothing methods, confidence measures, and procedures for "forgetting" the labelings from previous iterations.
In his analysis of Yarowsky (1995), Abney (2004) formulates several variants of bootstrapping.
These are shown to increase either the likelihood of the training data, or a lower bound on that likelihood.
In particular, Abney defines a function K that is an upper bound on the negative log-likelihood, and shows his bootstrapping algorithms locally minimize K.
We now present a generalization of Abney's K function and relate it to another semi-supervised learning technique, entropy regularization (Brand, 1999; Grandvalet and Bengio, 2005; Jiao et al., 2006).
Our experiments will tune the feature weight vector, $\theta$, to minimize our function.
We will do so simply by applying a generic function minimization method (stochastic gradient descent), rather than by crafting a new Yarowsky-style or Abney-style iterative procedure for our specific function.
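Suppose we have examples $x_i$ and corresponding possible labelings $y_{i,k}$. We are trying to learn a parametric model $p_\theta(y_{i,k} \mid x_i)$.
If $\tilde{p}(y_{i,k} \mid x_i)$ is a "labeling distribution" that reflects our uncertainty about the true labels, then our expected negative log-likelihood of the model is
$$K = -\sum_i \sum_k \tilde{p}(y_{i,k} \mid x_i) \log p_\theta(y_{i,k} \mid x_i) = \sum_i \bigl[ H(\tilde{p}_i) + D(\tilde{p}_i \,\|\, p_{\theta,i}) \bigr] \qquad (3)$$
where $\tilde{p}_i = \tilde{p}(\cdot \mid x_i)$, $p_{\theta,i} = p_\theta(\cdot \mid x_i)$, $H$ is the entropy, and $D$ is the KL divergence. $K$ is a function not only of $\theta$ but also of the labeling distribution $\tilde{p}$; a learner might be allowed to manipulate either in order to decrease $K$.
3Similarly, Jansche (2005) imputes "missing" trees by using comparable corpora.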
The summands of $K$ in equation (3) can be divided into two cases, according to whether $x_i$ is labeled or not.
For the labeled examples $\{x_i : i \in L\}$, the labeling distribution $\tilde{p}_i$ is a point distribution that assigns all probability to the true, known label $y_i^*$.
Then $H(\tilde{p}_i) = 0$.
The total contribution of these examples to $K$ simplifies to $\sum_{i \in L} -\log p_\theta(y_i^* \mid x_i)$, i.e., just the negative log-likelihood on the labeled data.
But what is the labeling distribution for the unlabeled examples $\{x_i : i \notin L\}$?
Abney simply uses a uniform distribution over labels (e.g., parses), to reflect that the label is unknown.
If his bootstrapping algorithm "labels" $x_i$, then $i$ moves into $L$ and $H(\tilde{p}_i)$ is thereby reduced from maximal to 0.
As a result, a method that labels the most confident examples may reduce $K$, and Abney shows that his method does so.
Our approach is different: we will take the labeling distribution $\tilde{p}_i$ to be our actual current belief $p_{\theta,i}$, and manipulate it through changing $\theta$ rather than $L$. $L$ remains the original set of supervised examples.
The total contribution of the unsupervised examples to $K$ then simplifies to $\sum_{i \notin L} H(p_{\theta,i})$.
We have no reason to believe that these two contributions (supervised and unsupervised) should be weighted equally.
We thus introduce a multiplier $\gamma$ to form the actual objective function that we minimize with respect to $\theta$:4
$$-\sum_{i \in L} \log p_\theta(y_i^* \mid x_i) + \gamma \sum_{i \notin L} H(p_{\theta,i}) \qquad (4)$$
One may regard $\gamma$ as a Lagrange multiplier that is used to constrain the classifier's uncertainty $H$ to be low, as presented in the work on entropy regularization (Brand, 1999; Grandvalet and Bengio, 2005; Jiao et al., 2006).
4This function is not necessarily convex in $\theta$, because of the addition of the entropy term (Jiao et al., 2006).
One might try an annealing strategy: start $\gamma$ at zero (where the function is convex) and gradually increase it, hoping to "ride" the global optimum.
Although we could increase $\gamma$ until the entropy term dominates the minimization and we approach a completely deterministic classifier, it is preferable to use some labeled heldout data to evaluate a stopping criterion.
Conventional bootstrapping retrains on the most confident unsupervised examples, making them more confident.
Gradient descent on equation (4) essentially does the same, since unsupervised examples contribute to (4) only through H, and the shape of the H function means that it is most rapidly decreased by making the most confident unsupervised examples more confident.
Besides favoring models that are self-confident on the unlabeled data, the objective function (4) also explicitly asks the model to continue to get the correct answers on the initial supervised corpus.
$1/\gamma$ controls the strength of this request.
One could obtain a similar effect in conventional bootstrapping by up-weighting the initial labeled corpus when retraining.
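To make the shape of objective (4) concrete, the sketch below evaluates it for toy label distributions. It assumes the form the text describes: negative log-likelihood on labeled examples plus a multiplier (called `gamma` here) times the summed Shannon entropy on unlabeled examples. The function names and the list-based data layout are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def shannon_entropy(p: Sequence[float]) -> float:
    """H(p) in nats; zero-probability labels contribute nothing."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def objective(labeled: List[Tuple[Sequence[float], int]],
              unlabeled: List[Sequence[float]],
              gamma: float) -> float:
    """Negative log-likelihood on labeled data plus gamma times entropy on unlabeled data.

    labeled    list of (model distribution over labels, index of the true label)
    unlabeled  list of model distributions over labels
    """
    nll = -sum(math.log(p[gold]) for p, gold in labeled)
    return nll + gamma * sum(shannon_entropy(p) for p in unlabeled)


labeled = [([0.5, 0.5], 0)]
unsure = objective(labeled, [[0.5, 0.5]], gamma=1.0)
confident = objective(labeled, [[0.99, 0.01]], gamma=1.0)
print(unsure > confident)  # True
```

Setting `gamma=0.0` recovers purely supervised training, since the unlabeled term vanishes.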
Minimizing equation (4) for parsing is more computationally intensive than in many other applications of bootstrapping, such as word sense disambiguation or document classification.
With millions of features, our objective could take many iterations to converge to a local optimum, if we were only to update our parameter vector $\theta$ after each iteration through a large unsupervised corpus.
For many machine learning problems over large datasets, online learning methods such as stochastic gradient descent (SGD) have been empirically observed to converge in fewer iterations (Bottou, 2003).
In SGD, instead of taking an optimization step in the direction of the gradient calculated over all unsupervised training examples, we parse each example, calculate the gradient of the objective function evaluated on that example alone, and then take a small step downhill.
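The update rule is thus
$$\theta^{(t+1)} = \theta^{(t)} - \eta \nabla F^{(t)}(\theta^{(t)}) \qquad (5)$$
where $\theta^{(t)}$ is the parameter vector at time $t$, $F^{(t)}(\theta)$ is the objective function specialized to the time-$t$ example, and $\eta > 0$ is a learning rate that we choose.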
We check for convergence after each pass through the example set.
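The per-example update can be illustrated on a toy log-linear classifier rather than a parser. In this sketch a labeled example contributes the gradient of its negative log-likelihood, and an unlabeled example contributes `gamma` times the gradient of the Shannon entropy, which for a log-linear model is $-\sum_y p(y) \log p(y) (f(y) - E[f])$. The tiny label set, the feature vectors, and all function names are assumptions for illustration.

```python
import math
from typing import List, Optional

Vector = List[float]


def posterior(theta: Vector, feats: List[Vector]) -> Vector:
    """p_theta(y | x) for a log-linear model; feats[y] is the feature vector of label y."""
    scores = [sum(t * f for t, f in zip(theta, fy)) for fy in feats]
    top = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def example_gradient(theta: Vector, feats: List[Vector],
                     gold: Optional[int], gamma: float) -> Vector:
    """Gradient of the objective restricted to one example."""
    p = posterior(theta, feats)
    labels, dims = range(len(feats)), range(len(theta))
    expected = [sum(p[y] * feats[y][d] for y in labels) for d in dims]
    if gold is not None:                     # labeled: gradient of -log p(gold | x)
        return [expected[d] - feats[gold][d] for d in dims]
    # unlabeled: gamma * dH/dtheta = -gamma * sum_y p(y) log p(y) (f(y) - E[f])
    return [-gamma * sum(p[y] * math.log(p[y]) * (feats[y][d] - expected[d])
                         for y in labels)
            for d in dims]


def sgd_step(theta: Vector, grad: Vector, eta: float) -> Vector:
    """One small step downhill: theta - eta * gradient."""
    return [t - eta * g for t, g in zip(theta, grad)]


feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
theta = [0.0, 0.0]
stream = [(feats, 0), (feats, None), (feats, 0), (feats, None)]
for example_feats, gold in stream:
    grad = example_gradient(theta, example_feats, gold, gamma=0.5)
    theta = sgd_step(theta, grad, eta=0.1)
print(posterior(theta, feats))
```

In this toy run the labeled steps raise the probability of label 0, and the entropy steps then sharpen the distribution further around it.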
2.6 Algorithms and Complexity
To evaluate equation (4), we need a conditional model of trees given a sentence $x_i$.
We define one by exponentiating and normalizing the tree scores:
$p_{\theta,i}(y_{i,k}) = \exp\bigl(\sum_{e \in y_{i,k}} f(e) \cdot \theta\bigr) / Z_i$.
With exponentially many parses of $x_i$, does our objective function (4) now have prohibitive computational complexity?
The complexity is actually similar to that of the inside algorithm for parsing.
In fact, the first term of (4) for projective parsing is found by running the $O(n^3)$ inside algorithm on supervised data,5 and its gradient is found by the corresponding $O(n^3)$ outside algorithm.
For non-projective parsing, the analogy to the inside algorithm is the $O(n^3)$ "matrix-tree algorithm," which is dominated asymptotically by a matrix determinant (Smith and Smith, 2007; Koo et al., 2007; McDonald and Satta, 2007).
The gradient of a determinant may be computed by matrix inversion, so evaluating the gradient again has the same $O(n^3)$ complexity as evaluating the function.
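As a sanity check on the matrix-tree idea, the sketch below computes the normalizer over all non-projective dependency trees of a three-word sentence in two ways: by brute-force enumeration and by a determinant of a Laplacian-style matrix. It assumes the variant in which several words may attach to the root; the edge scores, the pure-Python determinant, and all function names are illustrative choices rather than the implementation used in the experiments.

```python
import itertools
import math
from typing import List, Sequence

Matrix = List[List[float]]


def is_tree(heads: Sequence[int]) -> bool:
    """heads[m-1] is the parent of word m; every word must reach root 0 without a cycle."""
    for m in range(1, len(heads) + 1):
        seen, node = set(), m
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True


def brute_force_Z(score: Matrix) -> float:
    """Sum of exp(tree score) over all trees; score[h][m] scores the edge h -> m."""
    n = len(score) - 1
    total = 0.0
    for heads in itertools.product(range(n + 1), repeat=n):
        if is_tree(heads):
            total += math.exp(sum(score[h][m] for m, h in enumerate(heads, start=1)))
    return total


def determinant(matrix: Matrix) -> float:
    """Gaussian elimination with partial pivoting, cubic in the matrix size."""
    a = [row[:] for row in matrix]
    n, det = len(a), 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        if a[pivot][c] == 0.0:
            return 0.0
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            det = -det
        det *= a[c][c]
        for r in range(c + 1, n):
            factor = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= factor * a[c][k]
    return det


def matrix_tree_Z(score: Matrix) -> float:
    """Same normalizer via the determinant of the Laplacian with the root row and column removed."""
    n = len(score) - 1
    w = [[math.exp(s) for s in row] for row in score]
    lap = [[0.0] * n for _ in range(n)]
    for m in range(1, n + 1):
        # Diagonal: total weight of edges entering word m, from the root or another word.
        lap[m - 1][m - 1] = sum(w[h][m] for h in range(n + 1) if h != m)
        for h in range(1, n + 1):
            if h != m:
                lap[h - 1][m - 1] = -w[h][m]
    return determinant(lap)


score = [[0.0, 0.5, -1.0, 0.2],   # row 0: edges from the root
         [0.0, 0.0, 1.2, -0.3],
         [0.0, 0.7, 0.0, 0.4],
         [0.0, -0.5, 0.1, 0.0]]
print(brute_force_Z(score), matrix_tree_Z(score))
```

The enumeration grows exponentially with sentence length, whereas the determinant stays cubic, which is the point of the construction.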
The second term of (4) is the Shannon entropy of the posterior distribution over parses.
Computing this for projective parsing takes $O(n^3)$ time, using a dynamic programming algorithm that is closely related to the inside algorithm (Hwa, 2000).6
For non-projective parsing, unfortunately, the runtime rises to $O(n^4)$, since it requires determinants of $n$ distinct matrices (each incorporating a log factor in a different column; we omit the details).
The gradient evaluation in both cases is again about as expensive as the function evaluation.
A convenient speedup is to replace Shannon entropy with Rényi entropy.
The family of Rényi entropy measures is parameterized by $\alpha$:
$$R_\alpha(p) = \frac{1}{1-\alpha} \log \sum_y p(y)^\alpha \qquad (6)$$
In our setting, where $p = p_{\theta,i}$, the events $y$ are the possible parses of $x_i$.
Observe that under our definition of $p$, $\sum_y p(y)^\alpha = \bigl[\sum_y \exp \sum_{e \in y} f(e) \cdot (\alpha\theta)\bigr] / Z_i^\alpha$.
We already have $Z_i$ from running the inside algorithm, and we can find the numerator by running the inside algorithm again with $\theta$ scaled by $\alpha$. Thus with Rényi entropy, all computations and their gradients are $O(n^3)$, even in the non-projective case.
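The identity behind this speedup can be checked numerically on a small explicit label set. In the sketch below, a list of unnormalized log-scores stands in for the tree scores, and a log-sum-exp stands in for the inside algorithm (or determinant) that would compute the normalizer for real parses; the function names are hypothetical.

```python
import math
from typing import Sequence


def log_partition(scores: Sequence[float], scale: float = 1.0) -> float:
    """log Z when every score is multiplied by `scale` (i.e., theta scaled by alpha)."""
    scaled = [scale * s for s in scores]
    top = max(scaled)
    return top + math.log(sum(math.exp(s - top) for s in scaled))


def renyi_from_partitions(scores: Sequence[float], alpha: float) -> float:
    """R_alpha = (log Z(alpha * theta) - alpha * log Z(theta)) / (1 - alpha), alpha != 1."""
    return (log_partition(scores, alpha) - alpha * log_partition(scores)) / (1.0 - alpha)


def renyi_direct(scores: Sequence[float], alpha: float) -> float:
    """Same quantity computed from the normalized probabilities, for comparison."""
    log_z = log_partition(scores)
    probs = [math.exp(s - log_z) for s in scores]
    return math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)


scores = [1.2, -0.4, 0.3, 2.0]
for alpha in (0.5, 2.0, 5.0):
    print(alpha, renyi_from_partitions(scores, alpha), renyi_direct(scores, alpha))
```

Only two normalizer computations are needed per sentence, one with the original weights and one with the scaled weights.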
Rényi entropy is also a theoretically attractive generalization.
It can be shown that $\lim_{\alpha \to 1} R_\alpha(p)$ is in fact the Shannon entropy $H(p)$ and that $\lim_{\alpha \to \infty} R_\alpha(p) = -\log \max_y p(y)$, i.e. the negative log probability of the modal or "Viterbi" label (Arndt, 2001; Karakos et al., 2007).
The $\alpha = 2$ case, widely used as a measure of purity in decision tree learning, is often called the "Gini index."
Finally, when $\alpha = 0$, we get the log of the number of labels, which equals the $H$(uniform distribution) that Abney used in equation (3).
5The numerator of $p_{\theta,i}(y_i^*)$ (see definition above) is trivial since $y_i^*$ is a single known parse.
But the denominator $Z_i$ is a normalizing constant that sums over all parses; it is found by a dependency-parsing variant of the inside algorithm, following Eisner (1996).
6See also Mann and McCallum (2007) for similar results on conditional random fields.
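The special cases of the Rényi family can be explored with a small function that treats $\alpha = 1$ and $\alpha = \infty$ as the limiting cases named in the text. This is a sketch for explicit distributions given as lists; the function name and the example distribution are assumptions.

```python
import math
from typing import Sequence


def renyi_entropy(p: Sequence[float], alpha: float) -> float:
    """Renyi entropy in nats, with the alpha = 1 and alpha = infinity limits handled explicitly."""
    support = [q for q in p if q > 0.0]
    if alpha == 1.0:                       # limit: Shannon entropy
        return -sum(q * math.log(q) for q in support)
    if math.isinf(alpha):                  # limit: -log of the modal ("Viterbi") probability
        return -math.log(max(support))
    return math.log(sum(q ** alpha for q in support)) / (1.0 - alpha)


p = [0.5, 0.25, 0.125, 0.125]
for alpha in (0.0, 1.0, 2.0, math.inf):
    print(alpha, renyi_entropy(p, alpha))
```

The printed values decrease as $\alpha$ grows, reflecting that larger $\alpha$ concentrates the measure on the highest-probability labels.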
3 Evaluation
For this paper, we performed some initial bootstrapping experiments on small corpora, using the features from (McDonald et al., 2005).
After discussing experimental setup (§3.1), we look at the correlation of confidence with accuracy and with oracle likelihood, and at the fine-grained behavior of models' dependency edge posteriors (§3.2).
Section 3.3 presents overall bootstrapping accuracy.
We then compare our confidence-maximizing bootstrapping to EM, which has been widely used in semi-supervised learning (§3.4).
3.1 Experimental Design
We bootstrapped non-projective parsers for languages assembled for the CoNLL dependency parsing competitions (Buchholz and Marsi, 2006).
We selected German, Spanish, and Czech (Brants et al., 2002; Civit Torruella and Martí Antonín, 2002; Böhmová et al., 2003).
After removing sentences more than 60 words long, we randomly divided each corpus into small seed sets of 100 and 1000 trees; development and test sets of 200 trees each; and an unlabeled training set from the rest.
These treebanks contain strict dependency trees, in the sense that their only nodes are the words and a distinguished root node.
In the Czech dataset, more than one word can attach to the root; also, the trees in German, Spanish, and Czech may be non-projective.
We use the MSTParser implementation described in McDonald et al. (2005) for feature extraction.
Since our seed sets are so small, we extracted features from all edges in both the seed and the unlabeled parts of our training data, not just the edges annotated as correct.
Since this produced many more features, we pruned our features to those with at least 10 occurrences over all edges.
Table 1: Correlation, on development sentences, of Rényi entropy with model accuracy ("Acc.") and with cross-entropy ("Xent.").
Since these are measures of uncertainty, we see a negative correlation.
As $\alpha$ increases, we place more confidence in high-probability parses and correlate better with accuracy.
We used stochastic gradient descent first to minimize equation (4) on the labeled seed sets.
Then we continued to optimize over the labeled and unlabeled data together.
We tested for convergence using accuracy on development data.
3.2 Empirically Evaluating Entropy
Bootstrapping assumes that where the parser is confident, it tends to be correct.
Standard bootstrapping methods retrain directly on confident links; similarly, our approach tries to make the parser even more confident on those links.
Is this assumption really true empirically?
Yes: not only does confidence on unlabeled data correlate with cross-entropy, but both confidence and cross-entropy correlate well with accuracy.
As we will see, some confidence measures correlate better than others.
In particular, measures that are more peaked around the one-best prediction of the parser, as in Viterbi re-estimation, perform well.
If we train a non-projective German parser on small seed sets of only 100 and 1000 trees, how well does its own confidence predict its performance?
For 200 points (labeled development sentences), we measured the linear correlation of various Rényi entropies (6), normalized by sentence length, with tree accuracy (Table 1).
We also measured how these normalized Rényi entropies correlate with the posterior log-probability the model assigns to the true parse (the cross-entropy).
Since Rényi entropy is a measure of uncertainty, we see a negative correlation with accuracy.
This correlation strengthens as we raise $\alpha$ to $\infty$, so we might expect Viterbi re-estimation, or a differentiable objective function with a very high $\alpha$, to perform best on held-out data.
Figure 1: Posterior probability of correct and incorrect edges in German test data under various models.
We show the distribution of posterior probabilities for correct edges, known from an oracle, in black and incorrect edges in gray.
In the upper row, learning on an initial supervised set raises the posterior probability of correct edges while dragging along some incorrect edges.
In the lower row, we see that adding unlabeled data with $R_2$ entropy continues the pattern of the supervised learner.
$R_\infty$ (Viterbi) training induces a second mode in correct posterior probabilities near 1 although it does shift more incorrect edges closer to 1.
Figure 2: Precision-recall curves for selecting edges according to their posterior probabilities: better bootstrapping puts more area under the curve.
Note also that the cross-entropy, which looks at the true labels on the held-out data, does not itself correlate very much better with accuracy than the best unsupervised confidence measures.
Finally, we see that Rényi entropies with higher $\alpha$ are more stable: when calculated for a model trained on more data, they improve their correlation with accuracy.
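The correlation analysis described above amounts to a Pearson correlation between length-normalized entropy and per-sentence accuracy. The sketch below shows that computation on made-up numbers; the data values and function names are purely illustrative and are not results from the experiments.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Linear (Pearson) correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def entropy_accuracy_correlation(entropies: Sequence[float],
                                 lengths: Sequence[int],
                                 accuracies: Sequence[float]) -> float:
    """Correlate per-sentence entropy, normalized by sentence length, with accuracy."""
    normalized = [h / n for h, n in zip(entropies, lengths)]
    return pearson(normalized, accuracies)


# Hypothetical development sentences: parse entropy, length in words, edge accuracy.
entropies = [2.0, 9.0, 4.5, 12.0]
lengths = [10, 15, 9, 12]
accuracies = [0.9, 0.6, 0.7, 0.4]
print(entropy_accuracy_correlation(entropies, lengths, accuracies))
```

A value near -1 would indicate that sentences the model is uncertain about are also the ones it parses poorly.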
From tree confidence, we now turn to edge confidence: what is the posterior probability that a model assigns to each of the $n^2$ edges in the dependency graph?
Figure 1 shows smoothed histograms of true edges (black) and false edges (gray) in held-out data, according to the posterior probabilities we assign to them.
Since there are many more false edges, the figures are cropped to zoom in on the distribution of true edges.
As we start training on the labeled seed set, the posterior probabilities of true edges move towards one; many false edges also get greater mass, but not to the same extent.
As we add unlabeled data, we can see the different learning strategies of different confidence measures.
$R_2$ gradually moves a few true and many fewer false edges towards 1, while $R_\infty$ (Viterbi) learning is so confident as to induce a bimodal distribution in the posteriors of true edges.
Figure 2 visualizes the same data as four precision-recall curves, which show how noisy the highest-confidence edges are, across a range of confidence thresholds.
Although the very high precision end stays stable after 10 iterations on the seed set, the addition of unlabeled data puts more area under the curve.
Again, $R_\infty$ dominates $R_2$.
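A precision-recall curve of this kind can be traced by sorting candidate edges by posterior probability and lowering the confidence threshold one edge at a time. The sketch below assumes at least one correct edge and ignores ties between posteriors; the input lists and the function name are hypothetical.

```python
from typing import List, Sequence, Tuple


def precision_recall_curve(posteriors: Sequence[float],
                           is_correct: Sequence[bool]) -> List[Tuple[float, float]]:
    """(precision, recall) after selecting each edge in decreasing order of posterior."""
    order = sorted(range(len(posteriors)), key=lambda i: posteriors[i], reverse=True)
    total_correct = sum(is_correct)        # assumed to be at least 1
    points, hits = [], 0
    for rank, i in enumerate(order, start=1):
        hits += is_correct[i]
        points.append((hits / rank, hits / total_correct))
    return points


posteriors = [0.9, 0.8, 0.3, 0.1]
is_correct = [True, False, True, False]
for precision, recall in precision_recall_curve(posteriors, is_correct):
    print(round(precision, 3), round(recall, 3))
```

A model whose confident edges are mostly correct keeps precision high as recall grows, which is what more area under the curve reflects.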
3.3 Bootstrapping Results
We performed bootstrapping experiments on the full CoNLL sets for Czech, German, and Spanish using the non-projective model from McDonald et al. (2005).
Performance confirms the results of our analysis above (Table 2).
Adding unlabeled data improves performance over that of the seed set, with the exception of the Czech data with $R_2$ bootstrapping.
As we saw in §3.2, bootstrapping with $R_\infty$ dominates bootstrapping with $R_2$ confidence.
For comparison, we also show the results obtained by supervised training on the combined seed and unlabeled sets.
Recall that we did not use the tree annotations to perform feature selection; models trained with only supported features ought to perform better.
Although we see statistically significant improvements (at the .05 level on a paired permutation test), the quality of the parsers is still quite poor, in contrast to other applications of bootstrapping which "rival supervised methods" (Yarowsky, 1995).
Almost certainly, the CoNLL datasets, comprising at most some tens of thousands of sentences per language, are too small to afford qualitative improvements.
Also, at these relatively small training sizes, our preliminary attempts to leverage comparable English corpora did not improve performance.
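For readers unfamiliar with the paired permutation test mentioned above, here is one common Monte Carlo formulation. The paper does not spell out its exact procedure, so this sketch assumes per-sentence scores for two systems, random sign flips of the paired differences, and a two-sided p-value with add-one smoothing; the function name and default parameters are hypothetical.

```python
import random
from typing import Sequence


def paired_permutation_test(scores_a: Sequence[float], scores_b: Sequence[float],
                            trials: int = 10000, seed: int = 0) -> float:
    """Two-sided p-value for the difference in total score between paired systems."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    for _ in range(trials):
        # Under the null hypothesis, each pair's labels are exchangeable: flip signs at random.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted) >= observed:
            extreme += 1
    return (extreme + 1) / (trials + 1)


# Hypothetical per-sentence accuracies for a seed-only parser and a bootstrapped parser.
seed_only = [0.50, 0.62, 0.40, 0.71, 0.55, 0.48, 0.66, 0.59]
bootstrapped = [0.56, 0.65, 0.47, 0.70, 0.61, 0.55, 0.69, 0.66]
print(paired_permutation_test(bootstrapped, seed_only))
```

A p-value below .05 would correspond to the significance level reported in the text.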
Table 2: Dependency accuracy of the McDonald model on 200 test sentences.
When $\alpha = 0$, training only occurs on the supervised seed data.
As $\alpha$ increases, we train based on confidence in our model's analysis of the unlabeled data.
Boldface results are the best in their rows in a permutation test at the .05 level.
What features were learned, and how dependent is performance on the seed set?
We analyzed the performance of German bootstrapping on a development set (Table 3).
Using the parameters at the last iteration of supervised training on the seed set as a baseline, we tried updating to their bootstrapped values the weights of only those features that occurred in the seed set.
This achieved nearly the same accuracy as updating all the features.
As one would expect, using only the non-seed features' weights performs abysmally.
This might be the case simply because the seed set is likely to contain frequently occurring features.
If, however, we use only the features occurring in an alternate training set of the same size (100 sentences), we get much worse performance.
These results indicate that our bootstrapped parser is still heavily dependent on the features that happened to fire in the seed set; we have not "forgotten" our initial conditions.
Similar experiments show that unlexicalized features contribute the most to bootstrapping performance.
Since in our log-linear models features have been trained to work together, we must not put too much weight on these ablation results.
These experiments do, however, suggest that bootstrapping improved our results by refining the values of known, non-lexicalized features.
[Table 3 body not recoverable; columns: M feat., acc.; surviving row labels: non-seed, non-lexical, non-bilex., bilexical]
Table 3: Using all features, dependency accuracy on German development data rose to 64.3% on bootstrapping.
We show the contribution of different feature splits to the performance of this final model.
For example, although this model was trained by updating all 15.5M feature weights, it performs as well if we then keep only the 1.4M features that appeared at least once in the seed set, zeroing out the weights of the others.
We do as well as the full feature set if we keep only the 3.5M non-lexicalized features.
[Table 4 body not recoverable; surviving labels: % accuracy; Bulg., German, Spanish; supervised; semi-supervised; EM; Conf.]
Table 4: Dependency accuracy of the DMV model (Klein and Manning, 2004).
Maximizing confidence using R_1 (Shannon) entropy improved performance over its own conditional likelihood (CL) baseline and over maximum likelihood (ML).
EM degraded its ML baseline.
Since these models were only trained and tested on sentences of 10 words or fewer, accuracy is much higher than the full results in Table 2.
Perhaps the most popular statistical method for learning from incomplete data is the EM algorithm (Dempster et al., 1977).
Since we cannot try EM on McDonald's conditional model, we ran some pilot experiments using the generative dependency model with valence (DMV) of Klein and Manning (2004).
As in their experiments, and unlike the other experiments in the current paper, we restricted ourselves to sentences of ten words or fewer and to part-of-speech sequences alone, without any lexical information.
Since the DMV models projective trees, we ran experiments on three CoNLL corpora that had augmented their primary non-projective parses with alternate projective annotations: Bulgarian (Simov et al., 2005), German, and Spanish.
We performed supervised maximum likelihood and conditional likelihood estimation on a seed set of 100 sentences for each language.
These models respectively initialized EM and confidence training on unlabeled data.
We see (Table 4) that EM degrades the performance of its ML baseline.
Merialdo (1994) saw a similar degradation over small (and large) seed sets in HMM POS tagging.
We tried fixing and not fixing the feature expectations on the seed set during EM and show the former, better numbers.
Confidence maximization improved over both its own conditional likelihood initializer and also over ML.
We selected optimal smoothing parameters for all models, and optimal α (equation (6)) and γ (equation (4)) for the confidence model, on labeled held-out data.
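As a rough illustration of the kind of objective being tuned here, the sketch below computes a linear combination of conditional log-likelihood on labeled data and confidence (negative Renyi entropy) on unlabeled data. It assumes each unlabeled sentence comes with an explicit list of parse probabilities, which is only feasible for toy examples; real models range over exponentially many trees. The names `renyi_entropy`, `objective`, `weight`, and `order` are hypothetical stand-ins for the combination weight and the entropy parameter, not the notation of our equations.

```python
import math
from typing import Sequence


def renyi_entropy(probs: Sequence[float], order: float) -> float:
    """Renyi entropy of a discrete distribution.
    As order approaches 1, this is the Shannon entropy."""
    support = [p for p in probs if p > 0.0]
    if abs(order - 1.0) < 1e-9:
        return -sum(p * math.log(p) for p in support)
    return math.log(sum(p ** order for p in support)) / (1.0 - order)


def objective(
    gold_parse_probs: Sequence[float],
    unlabeled_dists: Sequence[Sequence[float]],
    weight: float,
    order: float,
) -> float:
    """Conditional log-likelihood of the gold parses on labeled sentences,
    plus `weight` times confidence (negative Renyi entropy) on unlabeled ones."""
    log_likelihood = sum(math.log(p) for p in gold_parse_probs)
    confidence = -sum(renyi_entropy(d, order) for d in unlabeled_dists)
    return log_likelihood + weight * confidence


# Toy distributions over three candidate parses (assumed values).
flat = [1 / 3, 1 / 3, 1 / 3]
peaked = [0.9, 0.05, 0.05]
for order in (0.5, 1.0, 2.0):
    print(order, renyi_entropy(flat, order), renyi_entropy(peaked, order))
```

With `weight` set to zero the objective reduces to supervised conditional likelihood; a peaked distribution has lower entropy than a flat one, so it contributes more confidence.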
4 Future Work
We hypothesize that qualitatively better bootstrapping results will require much larger unlabeled data sets.
In scaling up bootstrapping to larger unlabeled training sets, we must carefully weigh tradeoffs between expanding coverage and introducing noise from out-of-domain data.
We could also better exploit the data we have with richer models of syntax.
In supervised dependency parsing, second-order edge features provide improvements (McDonald and Pereira, 2006; Riedel and Clarke, 2006); moreover, the feature-based approach is not limited to dependency parsing.
Similar techniques could score parses in other formalisms, such as CFG or TAG.
In this case, f extracts features from each of the derivation tree's rewrite rules (CFG) or elementary trees (TAG).
In lexicalized formalisms, f will still be able to score lexical dependencies that are implicitly represented in the parse.
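To make the CFG case concrete, here is a minimal sketch of a feature function that sums features over the rewrite rules of a derivation and scores the derivation with a linear model. The function `extract_features` plays the role of f; its feature templates, the `Rule` encoding, and the toy weights are hypothetical choices for illustration, not a proposal for the actual feature set.

```python
from collections import Counter
from typing import Dict, List, Tuple

# A rewrite rule: (left-hand side, right-hand-side symbols).
Rule = Tuple[str, Tuple[str, ...]]


def extract_features(derivation: List[Rule]) -> Counter:
    """Sum feature counts over all rewrite rules in the derivation."""
    feats: Counter = Counter()
    for lhs, rhs in derivation:
        feats["rule:" + lhs + "->" + " ".join(rhs)] += 1  # full-rule feature
        feats["lhs:" + lhs] += 1  # redundant, coarser backoff feature
    return feats


def score(derivation: List[Rule], theta: Dict[str, float]) -> float:
    """Linear score: dot product of feature counts and weights."""
    return sum(theta.get(name, 0.0) * count
               for name, count in extract_features(derivation).items())


# Toy derivation and weights (assumed values).
derivation = [("S", ("NP", "VP")), ("NP", ("dogs",)), ("VP", ("bark",))]
theta = {"rule:S->NP VP": 1.0, "lhs:NP": 0.5}
total = score(derivation, theta)
```

Because the score decomposes over rules, the same dynamic programs used for parsing could in principle compute the best derivation and the expectations needed for training.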
Finally, we want to investigate whether larger training sets will provide traction for sparser cross-lingual and cross-domain features.
5 Conclusions
Feature-rich dependency models promise to help bootstrapping by providing many redundant features for the learner, and they can also cleanly incorporate cross-domain and cross-language information.
We explored bootstrapping feature-rich non-projective dependency parsers for Czech, German, and Spanish.
Our bootstrapping method maximizes a linear combination of likelihood and confidence.
In initial experiments on small datasets, this surpassed EM for training a simple feature-poor generative model, and also improved the performance of a feature-rich, conditionally estimated model where EM could not easily have been applied.
For our models and training sets, more peaked confidence measures, as defined by Renyi entropy, outperformed smoother ones.
Acknowledgments
The authors thank the anonymous reviewers, Noah A. Smith, and Keith Hall for helpful comments, and Ryan McDonald for making his parsing code publicly available.
This work was supported in part by NSF ITR grant IIS-0313193.
