We present an adaptation of constraint satisfaction inference (Canisius et al., 2006b) for predicting dependency trees.
Three different classifiers are trained to predict weighted soft-constraints on parts of the complex output.
From these constraints, a standard weighted constraint satisfaction problem can be formed, the solution to which is a valid dependency tree.
1 Introduction
Like the CoNLL-2006 shared task, the 2007 shared task focuses on dependency parsing and aims at comparing state-of-the-art machine learning algorithms applied to this task (Nivre et al., 2007).
For our official submission, we used the dependency parser described by Canisius et al. (2006a).
In this paper, we present a novel approach to dependency parsing based on constraint satisfaction.
The method is an adaptation of earlier work using constraint satisfaction techniques for predicting sequential outputs (Canisius et al., 2006b).
We evaluated our approach on all ten data sets of the 2007 shared task1.
In the remainder of this paper, we will present the new constraint satisfaction method for dependency parsing in Section 2.
The method is evaluated in Section 3, in which we will also present a brief error
1Hajic et al. (2004), Aduriz et al. (2003), Marti et al. (2007), Chen et al. (2003), Bohmova et al. (2003), Marcus et al. (1993), Johansson and Nugues (2007), Prokopidis et al. (2005), Csendes et al. (2005), Montemagni et al. (2003), Oflazer et al.
analysis.
Finally, Section 4 presents our main conclusions.
2 Constraint Satisfaction Inference for Dependency Trees
The parsing algorithm we used is an adaptation for dependency trees of the constraint satisfaction inference method for sequential output structures proposed by Canisius et al. (2006b).
The technique uses standard classifiers to predict a weighted constraint satisfaction problem, the solution to which is the complete dependency tree.
Constraints that are predicted each cover a small part of the complete tree, and overlap between them ensures that global output structure is taken into account, even though the classifiers only make local predictions in isolation of each other.
A weighted constraint satisfaction problem (W-CSP) is a tuple (X,D,C,W).
Here, X = [x\, x2,...,xn} is a finite set of variables.
D(x) is a function that maps each variable to its domain, and C is a set of constraints on the values assigned to the variables.
For a traditional (non-weighted) constraint satisfaction problem, a valid solution is an assignment of values to the variables that (1) are a member of the corresponding variable's domain, and (2) satisfy all constraints in the set C. Weighted constraint satisfaction, however, relaxes this requirement to satisfy all constraints.
Instead, constraints are assigned weights that may be interpreted as reflecting the importance of satisfying that constraint.
The optimal solution to a W-CSP is the solution that assigns those values that maximise the sum of the weights of satisfied constraints.
VMOD SBJ VMOD VMOD
Figure 1: Dependency tree for the sentence No it
wasn't Black Monday
To adapt this framework to predicting a dependency tree for a sentence, we construct a constraint satisfaction problem by first introducing one variable Xi for each token of the sentence.
This variable's value corresponds to the dependency relation that token is the modifier of, i.e. it should specify a relation type and a head token.
The constraints ofthe CSP are predicted by a classifier, where the weight for a constraint corresponds to the classifier's confidence estimate for the prediction.
For the current study, we trained three classifiers to predict three different types of constraints.
Cdep(head, modifier, relation), i.e. the resulting dependency tree should have a dependency arc from head to modifier labelled with type relation.
For the example tree in Figure 1, among others the constraint Cdep(head=was, modifier=No, relation=VMOD) should be predicted.
Cdir(modifier, direction), the relative position (i.e. to its left or to its right) of the head of modifier.
The tree in Figure 1 will give rise to constraints such as
Cdir (modifier=Black, direction =RIGHT).
Cmod(head, relation), in the dependency tree, head should be modified by a relation of type relation.
The constraints generated for the word was in Figure 1 would
be Cmod (head=was, relations=SBJ), and Cmod(head=was, relations=VMOD).
Predicting constraints of type Cdep is essentially what is done by Canisius et al. (2006a); a classifier is trained to predict a relation label, or a symbol signalling the absence of a relation, for each
pair of tokens in a sentence2 .
The training data for this classifier consists of positive examples of constraints to generate, e.g. was,No, VMOD, and negative examples, of constraints not to generate, e.g. was, Black,NONE, but also No, was,NONE.
In the aforementioned paper, it is shown that downsam-pling the negative class in the classifier's training data improves the recall for predicted constraints.
The fact that improved recall comes at the cost of a reduced precision is compensated for by our choice for the weighted constraint satisfaction framework: an overpredicted constraint may still be left unsatisfied if other, conflicting constraints outweigh its own weight.
In addition to giving rise to a set of constraints, this classifier differs from the other two in the sense that it is also used to predict the domains of the variables, i.e. any dependency relation not predicted by this classifier will not be considered for inclusion in the output tree.
Whereas the Cdep classifier classifies instances for each pair of words, the classifiers for Cdir and Cmod only classify individual tokens.
The features for these classifiers have been kept simple and the same for both classifiers: a 5-slot wide window of both tokens and part-of-speech tags, centred on the token currently being classified.
The two classifiers differ in the classes they predict.
For Cdir, there are only three possible classes: LEFT, RIGHT, NONE.
Instances classified as LEFT, or RIGHT give rise to constraints, whereas NONE implies that no Cdir constraint is added for that token.
For Cmod there is a rather large class space; a class label reflects all modifying relations for the token, e.g. SBJ+VMOD.
From this label, as many constraints are generated as there are different relation types in the label.
With the above, a weighted constraint satisfaction problem can be formulated that, when solved, describes a dependency tree.
As we formulated our problem as a constraint satisfaction problem, any off-the-shelf W-CSP solver could be used to obtain the best dependency parse.
However, in general such solvers have a time complexity exponential in the
2For reasons of efficiency and to avoid having too many negative instances in the training data, we follow the approach of Canisius et al. (2006a) of limiting the maximum distance between a potential head and modifier.
Table 1: Performance of the system applied to the test data for each language.
The '06 columns show the gain/loss with respect to the parser of Canisius et
al. (2006a).
number of variables, and thus in the length of the sentence.
As a more efficient alternative we chose to use the CKY algorithm for dependency parsing (Eisner, 2000) for computing the best solution, which has only cubic time complexity, but comes with the disadvantage of only considering projective trees as candidate solutions.
3 Results and discussion
We tested our system on all ten languages of the shared task.
The three constraint classifiers have been implemented with memory-based learning.
No language-specific parameter optimisation or feature engineering has been performed, but rather the exact same system has been applied to all languages.
Labelled and unlabelled attachment scores are listed in Table 1.
In addition, we show the increase/decrease in performance when compared with the parser of Canisius et al. (2006a); for all languages but Turkish, there is a consistent increase, mostly somewhere between 1.0 and 2.0 percent in labelled attachment score.
The parser by Canisius et al. (2006a) can be considered a rudimentary implementation of constraint satisfaction inference that only uses Cdep constraints.
The parser described in this paper elaborates this by adding (1) the Cmod and Cdir soft constraints, and (2) projectivity and acyclicity hard constraints, enforced implicitly by the CKY algorithm.
To evaluate the effect of each of these constraints,
Table 2: Performance of the parser by Canisius et al. (2006a) and the performance gain of the constraint satisfaction inference parser with various constraint configurations.
Table 2 shows the labelled attachment scores for several parser configurations; starting with the 2006 parser, i.e. a parser with only Cdep constraints, then the CKY-driven Cdep parser, i.e. with acyclicity and projectivity constraints, then with Cmod, and Cdir separately, and finally, the full parser based on all constraints.
It can be seen that supplementing the Cdep-only parser with hard constraints for acyclicity and projectivity already gives a small performance improvement.
For some languages, such as Italian (+0.2), this improvement is rather small, however for Hungarian 1.9 is gained only by using CKY.
The remaining columns show that adding more constraints improves performance, and that for all languages but Turkish and Hungarian, using all constraints works best.
While in comparison with the system of Canisius et al. (2006a) the addition of extra constraints has clearly shown its use, we expect the Cdep classifier still to be the performance bottleneck of the system.
This is mainly due to the fact that this classifier is also responsible for defining the domains of the CSP variables, i.e. which dependency relations will be considered for inclusion in the output.
For this reason, we performed an error analysis of the output of the Cdep classifier and the effect it has on the performance of the complete system.
In our error analysis, we distinguish three types of errors: 1) label errors, a correct dependency arc was added to the tree, but its label is incorrect, 2) recall
Language
prec.
rec.
Hungarian
Table 3: Columns two and three: precision and recall on dependency predictions by the Cdep classifier.
Columns four and five: percentage of dependency arc precision and recall errors caused by out-of-domain errors.
errors, the true dependency tree contains an arc that is missing from the predicted tree, and 3) precision errors, the predicted tree contains a dependency arc that is not part of the true dependency parse.
Label errors are always a direct consequence of erroneous Cdep predictions.
If the correct arc was predicted, but with an incorrect label, then by definition, the correct arc with the correct label cannot have been predicted at the same time.
In case of the other two types of errors, the correct constraints may well have been predicted, but afterwards outweighed by other, conflicting constraints.
Nevertheless, precision and recall errors may also be caused by the fact that the Cdep classifier simply did not predict a dependency arc where it should have.
We will refer to those errors as out-of-domain errors, since the domain of at least one of the CSP variables does not contain the correct value.
An out-of-domain error is a direct consequence of a recall error made by the Cdep classifier.
To illustrate these interactions, Table 3 shows for all languages the precision and recall of the Cdep classifier, and the percentage of dependency precision and recall errors that are out-of-domain errors.
The table reveals several interesting facts.
For English, which is the language for which our system attains its highest score, the percentage of dependency precision and recall errors caused by Cdep recall er-
rors is the lowest of all languages.
This can directly be related to the 90% recall of the English Cdep classifier.
Apparently, the weak precision (59%), caused by down-sampling the training data, is compensated for in the subsequent constraint satisfaction process.
For Italian, the percentage of out-of-domain-related errors is much higher than for English.
At the same time, the precision and recall of the Cdep classifier are much more in balance, i.e. a higher precision, but a lower recall.
We tried breaking this balance in favour of a higher recall by applying an even stronger down-sampling of negative instances, and indeed the parser benefits from this.
Labelled attachment increases from 77.04% to 78.41%.
The precision and recall of this new Cdep classifier are 58.65% and 87.15%, respectively.
The lowest Cdep precision has been observed for Hungarian (44.71), which unfortunately is not mirrored by a high recall score.
Remarkably however, after English, Hungarian has the lowest percentage of dependency errors due to Cdep recall errors (69.08 and 67.45).
It is therefore hypothesised that not the low recall, but the low precision is the main cause for errors made on Hungarian.
With this in mind, we briefly experimented with weaker down-sampling ratios in order to boost precision, but so far we did not manage to attain better results.
4 Concluding remarks
We have presented a novel dependency parsing method based on a standard constraint satisfaction framework.
First results on a set often different languages have been promising, but so far no extensive optimisation has been performed, which inevitably reflects upon the scores attained by the system.
Future work will focus on tuning the many parameters our system has, as well as on experimenting with different types of constraints to supplement or replace one or more of the three types used in this study.
Acknowledgements
The authors wish to thank Antal van den Bosch for discussions and suggestions.
This research is funded by NWO, the Netherlands Organization for Scientific Research under the IMIX programme.
