In this paper we describe a dependency parser that uses exact search and global learning (Crammer et al., 2006) to produce labelled dependency trees.
Our system integrates the task of learning tree structure and learning labels in one step, using the same set of features for both tasks.
During label prediction, the system automatically selects for each feature an appropriate level of smoothing.
We report on several experiments that we conducted with our system.
In the shared task evaluation, it scored better than average.
1 Introduction
Dependency parsing is a topic that has engendered increasing interest in recent years.
One promising approach is based on exact search and structural learning (McDonald et al., 2005; McDonald and Pereira, 2006).
In this work we also pursue this approach.
Our system makes no provisions for non-projective edges.
In contrast to previous work, we aim to learn labelled dependency trees in one fell swoop.
This is done by maintaining several copies of feature vectors that capture the features' impact on predicting different dependency relations (deprels).
In order to preserve the strength of McDonald et al. (2005)'s approach in terms of unlabelled attachment score, we add feature vectors for generalizations over deprels.
We also employ various reversible transformations to reach treebank formats that better match our feature representation and that reduce the complexity of the learning task.
The paper first presents the methodology used, then describes experiments and results, and finally concludes.
2 Methodology
2.1 Parsing Algorithm
In our approach, we adopt Eisner (1996)'s bottom-up chart-parsing algorithm in McDonald et al. (2005)'s formulation, which finds the best projective dependency tree for an input string (x_1, ..., x_n).
We assume that every possible head-dependent pair (i, j) is described by a feature vector f(i, j) with associated weights w.
Eisner's algorithm achieves optimal tree packing by storing partial structures in two matrices b and L. First the diagonals of the matrices are initialized with 0; then all other cells are filled according to eqs. (1) and (2) and their symmetric variants.
This algorithm only accommodates features for single links in the dependency graph.
We also investigated an extension, McDonald and Pereira (2006)'s second-order model, where more of the parsing history is taken into account, viz. the last dependent assigned to a head i. In the extended model, b is updated as defined in eq. (3); optimal packing requires a third matrix M.
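A minimal sketch of the first-order chart algorithm may help make the packing concrete. This version computes only the best tree score (recovering the tree itself additionally requires back-pointers at each maximization); the function name and the single four-dimensional chart `C` are our simplification and do not reproduce the paper's matrices b and L:

```python
import numpy as np

def eisner_best_score(scores):
    """Best projective tree score via Eisner's first-order algorithm.

    scores[h, d] is the weight of an arc from head h to dependent d;
    index 0 is the artificial root. C[s, t, d, c] stores the best score
    of a span (s, t) headed at t (d=0) or at s (d=1), either still
    awaiting further dependents (c=0, "incomplete") or finished (c=1).
    """
    n = scores.shape[0]
    C = np.full((n, n, 2, 2), float("-inf"))
    C[np.arange(n), np.arange(n), :, 1] = 0.0  # diagonals initialized with 0
    for k in range(1, n):
        for s in range(n - k):
            t = s + k
            # Incomplete spans: join two complete half-spans, add one arc.
            m = max(C[s, r, 1, 1] + C[r + 1, t, 0, 1] for r in range(s, t))
            C[s, t, 0, 0] = m + scores[t, s]  # arc t -> s
            C[s, t, 1, 0] = m + scores[s, t]  # arc s -> t
            # Complete spans: extend an incomplete span with a complete one.
            C[s, t, 0, 1] = max(C[s, r, 0, 1] + C[r, t, 0, 0]
                                for r in range(s, t))
            C[s, t, 1, 1] = max(C[s, r, 1, 0] + C[r, t, 1, 1]
                                for r in range(s + 1, t + 1))
    return C[0, n - 1, 1, 1]
```

The O(n^3) triple loop mirrors eqs. (1) and (2) and their symmetric variants.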
2.2 Feature Representation
Feature templates include features for root words (w, fp, lcp) and second-order templates such as lcp_i fp_j fp_k and w_i fp_j fp_k.
All features but unary token features were optionally extended with the direction of the dependency (left or right) and the binned token distance (|i - j| = 1, 2, 3, 4, > 5, > 10).
2.3 Structural Learning
For determining the feature weights w, we used online passive-aggressive learning (OPAL) (Crammer et al., 2006).
OPAL iterates repeatedly over all training instances, adapting the weights after each parse.
It tries to change the weights as little as possible (passiveness), while ensuring that (1) the correct tree y scores at least as high as the best parse tree and (2) the difference in score between the two rises with the average number of errors in the parse (aggressiveness).
This optimization problem has a closed-form solution for the update step size.
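The update can be sketched as a generic passive-aggressive step; this is a simplification under our own naming, not necessarily the paper's exact variant or loss function:

```python
import numpy as np

def pa_update(w, feat_gold, feat_pred, loss):
    """One passive-aggressive update step (generic PA sketch).

    feat_gold, feat_pred: feature vectors of the gold tree y and the
    current best parse; loss: e.g. the number of wrongly attached tokens.
    """
    delta = feat_gold - feat_pred
    margin = w.dot(delta)         # how far the gold tree already wins
    if margin >= loss:
        return w                  # constraint satisfied: stay passive
    norm = delta.dot(delta)
    if norm == 0.0:
        return w                  # identical feature vectors: nothing to do
    tau = (loss - margin) / norm  # closed-form step size
    return w + tau * delta        # aggressive correction
```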
1Agreement was computed from morphological features, viz. gender, number, person, and case.
In languages with subject-verb agreement, we added a nominative case feature to finite verbs.
In Basque, agreement is case-specific (absolutive, dative, ergative, other case).
Table 1: Performance on devset of Italian treebank.
In parentheses: reduction to non-null features after first iteration.
Having a closed-form solution, OPAL is easier to implement and more efficient than the MIRA algorithm used by McDonald et al. (2005), although it achieves a performance comparable to MIRA's on many problems (Crammer et al., 2006).
2.4 Learning Labels for Dependency Relations
So far, the presented system, which follows closely the approach of McDonald et al. (2005), only predicts unlabelled dependency trees.
To derive a labelling, we departed from their approach: we split each feature along the deprel dimension, so that each deprel is associated with its own copy of the feature vector (cf. eq. (4), where ⊗ denotes the tensor product of the base feature vector with an orthogonal encoding of the deprel).
In parsing, we only consider the best deprel label.
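A sparse sketch of this label splitting, with invented feature-string conventions (the real system works on vectors, per eq. (4)):

```python
def labelled_features(base, deprel, deprels):
    """Sketch of eq. (4): pairing each base feature with a one-hot
    (orthogonal) deprel encoding gives every deprel its own copy of
    every feature. Implemented sparsely with string keys."""
    assert deprel in deprels
    return {f + "&" + deprel: v for f, v in base.items()}

def best_deprel(w, base, deprels):
    """During parsing, only the best-scoring deprel for an arc is kept."""
    def score(d):
        return sum(w.get(k, 0.0) * v
                   for k, v in labelled_features(base, d, deprels).items())
    return max(deprels, key=score)
```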
On its own, this simple approach led to a severe degradation of performance, so we took a step back by re-introducing features for unlabelled trees.
For each treebank's set of deprels, we designed a taxonomy T with a single maximal element (complete abstraction over deprel labels) and one minimal element for each deprel label.
We also included an intermediate layer in T that collects classes of deprels, such as complement, adjunct, marker, punctuation, or coordination deprels, and in this way provides for better smoothing.
Table 2: Figures for Experiments on Treebanks.
The taxonomy translates to an encoding in which a feature for a taxonomy node fires for a deprel iff that node is an ancestor of the deprel in T (Tsochantaridis et al., 2004).
Substituting this encoding for the orthogonal one leads to a massive number of features, so we pruned the taxonomy on a feature-by-feature basis by merging all nodes on a level that only cover deprels that never occur with this feature in the training data.
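The hierarchical encoding can be sketched as follows; the taxonomy below (TOP over one "complement" class over two deprels) and the string-keyed features are hypothetical illustrations:

```python
# Hypothetical mini-taxonomy: one maximal element TOP, one intermediate
# class, and one minimal element per deprel label.
PARENT = {"SBJ": "complement", "OBJ": "complement", "complement": "TOP"}

def ancestors(deprel):
    """Taxonomy nodes dominating a deprel, including the deprel itself."""
    node, out = deprel, []
    while node is not None:
        out.append(node)
        node = PARENT.get(node)
    return out

def taxonomy_features(base, deprel):
    """Hierarchical encoding in the spirit of Tsochantaridis et al. (2004):
    each feature fires for the deprel and all of its ancestor classes, so
    weights are shared (smoothed) across related deprels."""
    return {f + "&" + n: v
            for f, v in base.items()
            for n in ancestors(deprel)}
```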
2.5 Treebank Transformations
Having no explicit feature representation for the information in the morphological features slot (cf. section 2.2), we partially redistributed that information to other slots: verb form and case2 to fp, and semantic classification (e.g. the Turkish affixes "Able", "Ly") to an empty lemma slot.
The balance between fp and w was not always optimal; we used a fine-grained3 classification of punctuation tags, distinguished between prepositions (e.g. in) and preposition-article combinations (e.g. nel) in Italian4 on the basis of number/gender features, and collected definite and indefinite articles under one common fp tag.
When distinctions in deprels are recoverable from context, we removed them: the dichotomy between conjunctive and disjunctive coordination in Italian depends in most cases exclusively on the coordinating conjunction.
2Case was transferred to fp only if important for the determination of the deprel (CA, HU, IT).
3Classes of punctuation are e.g. opening and closing brackets, commas, and punctuation signalling the end of a sentence.
4Prep and PrepArt behave differently syntactically (e.g. an article can only follow a genuine preposition).
The Greek and Czech treebanks have a generic distinction between ordinary deprels and deprels in a coordination, apposition, and parenthesis construction.
In Greek, we got rid of the parenthesis markers on deprels by switching head and dependent, giving the former head (the parenthesis) a unique new deprel.
For Czech, we reduced the number of deprels from 46 to 34 by swapping the deprels of conjuncts, appositions, etc. and their heads (coordination or comma).
Sometimes, multiple conjuncts take different deprels.
We only provided for the clash between "ExD" (ellipsis) and other deprels, in which case we added "ExD", yielding combined labels such as "-Apos:ExD".
Table 3: Results on DevTest and Test Sets compared with the Average Performance in CoNLL'07. LAS = Labelled Attachment Score, UAS = Unlabelled Attachment Score, LAcc = Label Accuracy, AV = Average score.
In Basque, agreement is usually between arguments and auxiliary verbs, so we re-attached5 relevant arguments from the main verb to the auxiliary verb.
Unfortunately, we did not take projectivity into account, so this step resulted in a steep increase of non-projective edges (9.4% of all edges) and a corresponding degradation of our evaluation results in Basque.
The training set for Arabic contains some very long sentences (up to 396 tokens). Since context-free parsing of sentences of this length is tedious, we split up all sentences at final punctuation signs (AuxK). With this trick, we pushed the maximal sentence length down to 196.
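The splitting step can be sketched as follows, assuming CoNLL-style parallel lists of tokens and deprels (heads are omitted here, since each part would additionally need its indices re-based):

```python
def split_at_auxk(tokens, deprels):
    """Split a sentence after every token whose deprel is AuxK
    (sentence-final punctuation), a sketch of the Arabic preprocessing."""
    parts, start = [], 0
    for i, d in enumerate(deprels):
        if d == "AuxK":
            parts.append((tokens[start:i + 1], deprels[start:i + 1]))
            start = i + 1
    if start < len(tokens):  # keep any trailing material as its own part
        parts.append((tokens[start:], deprels[start:]))
    return parts
```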
Unfortunately, we overlooked the fact that in Turkish, the ROOT deprel not only designates root nodes but also attaches some punctuation marks.
This often leads to non-projective structures, which our parser cannot handle, so we scored below average in Turkish.
In after-deadline experiments, we took this feature of the Turkish treebank into account and achieved above-average results by re-linking all ROOT-ed punctuation signs to the immediately preceding token.
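The after-deadline fix can be sketched as follows, assuming 1-based head indices with 0 for the artificial root; the replacement label "Punc" is our own hypothetical choice, not necessarily the label used in the treebank:

```python
def relink_root_punctuation(heads, deprels, is_punct, new_label="Punc"):
    """Re-attach punctuation marks labelled ROOT to the immediately
    preceding token. heads are 1-based with 0 = artificial root; the
    replacement deprel label is a hypothetical choice."""
    heads, deprels = list(heads), list(deprels)
    for i in range(len(heads)):
        if deprels[i] == "ROOT" and is_punct[i] and i > 0:
            heads[i] = i  # 1-based index of the preceding token
            deprels[i] = new_label
    return heads, deprels
```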
3 Experiments and Results
The last column in Table 2 shows the average time needed in a training iteration.
For nearly all languages, our approach achieved a performance better than average (see Table 3).
Only in Turkish and Basque did we score below average.
On closer inspection, we saw that this performance was due to our projectivity assumption and to insufficient exploration of these treebanks.
In its bottom part, Table 3 gives results of improved versions of our approach.
4 Conclusion
We presented an approach to dependency parsing that is based on exact search and global learning.
Special emphasis is laid on an integrated derivation of labelled and unlabelled dependency trees.
We also employed various transformation techniques to reach treebank formats that are better suited to our approach.
The approach scores better than average in (nearly) all languages.
Nevertheless, it is still a long way from cutting-edge performance.
One direction we would like to explore in the future is the integration of dynamic features on deprel labels.
Acknowledgements
We would like to thank the organizing team for once again making a great shared task at CoNLL possible!
