In this paper, we describe a two-stage multilingual dependency parser used for the multilingual track of the CoNLL 2007 shared task.
The system consists of two components: an unlabeled dependency parser using Gibbs sampling which can incorporate sentence-level (global) features as well as token-level (local) features, and a dependency relation labeling module based on Support Vector Machines.
Experimental results show that the global features are useful in all the languages.
1 Introduction
Making use of as many informative features as possible is crucial to obtain high performance in machine learning based NLP.
Recently, several methods for incorporating non-local features have been investigated, though such features often make models complex and thus complicate inference.
Collins and Koo (2005) proposed a reranking method for phrase structure parsing with which any type of global features in a parse tree can be used.
For dependency parsing, McDonald and Pereira (2006) proposed a method which can incorporate some types of global features, and Riedel and Clarke (2006) studied a method using integer linear programming which can incorporate global linguistic constraints.
In this paper, we study dependency parsing using Gibbs sampling which can incorporate any type of global feature in a sentence.
The parser determines unlabeled dependency structures only, and we attach dependency relation labels using Support Vector Machines afterwards.
We participated in the multilingual track of the CoNLL 2007 shared task, and evaluated the system on the data sets of 10 languages (Hajic et al., 2004; Aduriz et al., 2003; Marti et al., 2007; Chen et al., 2003; Bohmova et al., 2003; Marcus et al., 1993; Johansson and Nugues, 2007; Prokopidis et al., 2005; Csendes et al., 2005; Montemagni et al., 2003; Oflazer et al., 2003).
The rest of the paper describes the specification of the system and the evaluation results.
2 Unlabeled Dependency Parsing using Global Features
2.1 Probabilistic Model
The sentence-level model is a log-linear model built on an initial distribution:

P_{Λ,M}(h|w) = (1/Z_{Λ,M}(w)) Q_M(h|w) exp( Σ_{k=1}^{K} λ_k f_k(w, h) ),

where Q_M(h|w) is an initial distribution, f_k(w, h) is the k-th feature function, K is the number of feature functions, λ_k is the weight of the k-th feature, and Z_{Λ,M}(w) is a normalization factor summing over H(w).
H(w) is the set of possible configurations of heads for a given sentence w. Although it would be appropriate for H(w) to be the set of projective trees for projective languages, and the set of non-projective trees (a superset of the set of projective trees) for non-projective languages, in this study we define H(w) to be the set of all possible graphs, which contains |w|^{|w|} elements.
P_{Λ,M}(h|w) and Q_M(h|w) are defined over H(w) (see footnote 1).
The probability distribution P_{Λ,M}(h|w) is a joint distribution of all the heads conditioned on a sentence; therefore we call this model the sentence-level model.
The feature function f_k(w, h) is defined on a sentence w with heads h, and we can use any information in the sentence without independence assumptions for the heads of the tokens; therefore we call f_k(w, h) a sentence-level (global) feature.
1 H(w) is a superset of the set of non-projective trees, and is an unnecessarily large set which contains ill-formed dependency trees such as trees with cycles.
This issue may reduce parsing performance, but we adopt this approach for computational efficiency.
We define the initial distribution Q_M(h|w) as the product of q_M(h|w, t), the probability distribution of the head h of each t-th token, calculated with maximum entropy models:

Q_M(h|w) = Π_{t=1}^{|w|} q_M(h_t|w, t),

q_M(h|w, t) = (1/Z_M(w, t)) exp( Σ_{l=1}^{L} μ_l g_l(w, t, h) ),

where g_l(w, t, h) is the l-th feature function, L is the number of feature functions, μ_l is the weight of the l-th feature, and Z_M(w, t) is a normalization factor. q_M(h|w, t) is a model of the head of a single token, calculated independently of the other tokens; therefore we call q_M(h|w, t) the token-level model, and g_l(w, t, h) a token-level (local) feature.
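As a concrete illustration of the token-level model, the following Python sketch computes q_M(h|w, t) as a normalized exponential of weighted feature sums, and Q_M(h|w) as the product over tokens. The string-keyed weight dictionary and the function names are hypothetical simplifications for illustration, not the actual implementation:

```python
import math

def token_head_probs(weights, feats_per_head):
    """q_M(h|w, t): maximum entropy distribution over the candidate
    heads of one token, given a feature list for each candidate head."""
    scores = [sum(weights.get(f, 0.0) for f in feats) for feats in feats_per_head]
    m = max(scores)  # subtract the max before exponentiating, for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def initial_distribution(weights, sent_feats, heads):
    """Q_M(h|w): product over all tokens of q_M(h_t|w, t)."""
    p = 1.0
    for t, feats_per_head in enumerate(sent_feats):
        p *= token_head_probs(weights, feats_per_head)[heads[t]]
    return p
```

Because Q_M factorizes over tokens, it can be evaluated (and sampled from) cheaply, which is what makes it usable as the initial distribution inside the sentence-level model.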
2.2 Decoding and Parameter Estimation
Let us consider how to find the optimal solution h, given a sentence w, the parameters of the sentence-level model Λ = {λ_1, ..., λ_K}, and the parameters of the token-level model M = {μ_1, ..., μ_L}. Since the probabilistic model contains global features and efficient algorithms such as dynamic programming cannot be used, we use Gibbs sampling to obtain an approximate solution.
Gibbs sampling can efficiently generate samples from high-dimensional probability distributions with complex dependencies among variables (Andrieu et al., 2003), and we assume that R samples {h^(1), ..., h^(R)} are generated from P_{Λ,M}(h|w) using Gibbs sampling.
Then, the marginal distribution of the head of the t-th token given w, P_t(h|w), is approximately calculated as follows:

P_t(h|w) ≈ (1/R) Σ_{r=1}^{R} δ(h, h_t^(r)),

where δ(i, j) is the Kronecker delta.
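The sampling scheme can be sketched in Python as follows: each head variable is resampled in turn from its conditional distribution given all the other heads, and the marginals are estimated as relative frequencies over the retained samples. Here `log_score` is a black-box unnormalized log-probability standing in for log Q_M(h|w) + Σ_k λ_k f_k(w, h); the code is a minimal illustration, not the authors' implementation:

```python
import math
import random

def gibbs_sample_heads(sent_len, log_score, n_samples, burn_in=100, seed=0):
    """Gibbs sampling over head variables. heads[t] is the head of token
    t+1; 0 denotes the artificial root. Each head is resampled from its
    conditional given all other heads, so log_score may contain arbitrary
    global features of the whole configuration."""
    rng = random.Random(seed)
    heads = [0] * sent_len
    samples = []
    for it in range(burn_in + n_samples):
        for t in range(sent_len):
            cands = [h for h in range(sent_len + 1) if h != t + 1]  # no self-head
            logs = []
            for h in cands:
                heads[t] = h
                logs.append(log_score(heads))
            m = max(logs)
            ws = [math.exp(l - m) for l in logs]
            r, acc = rng.random() * sum(ws), 0.0
            for h, wgt in zip(cands, ws):
                acc += wgt
                if r <= acc:
                    heads[t] = h
                    break
        if it >= burn_in:
            samples.append(list(heads))
    return samples

def marginals(samples, sent_len):
    """P_t(h|w) estimated as the relative frequency of head h for token t."""
    counts = [[0] * (sent_len + 1) for _ in range(sent_len)]
    for hs in samples:
        for t, h in enumerate(hs):
            counts[t][h] += 1
    return [[c / len(samples) for c in row] for row in counts]
```

Each sweep costs O(|w|^2) evaluations of `log_score`, so the expense of the global features is paid per sample rather than per possible tree.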
In order to find a solution using the marginal distributions, we adopt the maximum spanning tree (MST) framework proposed by McDonald et al. (2005a).
In this framework, scores for possible edges in dependency graphs are defined, and the optimal dependency tree is found as the MST in which the summation of the edge scores is maximized.
Let s(i, j) denote the score of the edge from a parent node (head) i to a child node (dependent) j. We define s(i, j) as follows:

s(i, j) = log P_j(i|w).

We use the logarithm of the marginal distribution because the MST search algorithms maximize the summation of the edge scores, whereas it is the product of the marginal distributions that should be maximized.
The best projective parse tree is obtained using the Eisner algorithm (Eisner, 1996) with these scores, and the best non-projective one is obtained using the Chu-Liu-Edmonds (CLE) algorithm (McDonald et al., 2005b).
Although the factored score s(i, j) is used to measure the likelihood of dependency trees in this method, the score is calculated taking the whole sentence into consideration using Gibbs sampling.
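Putting the pieces together, the following sketch turns the estimated marginals into edge scores s(i, j) = log P_j(i|w) and searches for the highest-scoring well-formed tree. For brevity, exhaustive enumeration of head assignments stands in for the Eisner and CLE algorithms that would be used in practice; the function names are illustrative only:

```python
import itertools
import math

def is_tree(heads):
    """Well-formedness check: every token must reach the root (node 0)
    without entering a cycle. heads[t] is the head of token t+1."""
    n = len(heads)
    for t in range(1, n + 1):
        seen, node = set(), t
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True

def best_tree(marg):
    """Maximize sum_j s(h_j, j) with s(i, j) = log P_j(i|w) over all
    well-formed trees (exhaustive search in place of Eisner / CLE)."""
    n = len(marg)
    best, best_score = None, -math.inf
    for heads in itertools.product(range(n + 1), repeat=n):
        if not is_tree(heads):
            continue
        score = sum(math.log(marg[t][heads[t]] + 1e-12) for t in range(n))
        if score > best_score:
            best, best_score = list(heads), score
    return best
```

The small constant added inside the logarithm guards against zero marginals for heads that never appeared in the samples.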
The parameters of the token-level model are estimated by maximum a posteriori estimation with Gaussian priors.
We define the following objective function M:
where σ is a hyperparameter of the Gaussian priors.
The optimal parameters which maximize the objective function can be obtained by quasi-Newton methods such as the L-BFGS algorithm, using the objective and its partial derivatives.
The parameters of the sentence-level model Λ = {λ_1, ..., λ_K} can also be estimated in a similar way with the following objective function L, after the parameters of the token-level model have been estimated.
This objective function and its partial derivatives contain summations over all the possible configurations, which are difficult to calculate.
We approximately calculate these values using static Monte Carlo (not MCMC) methods with S fixed samples {h_n^(1), ..., h_n^(S)} generated from Q_M(h|w_n) (see footnote 2):
2 Static Monte Carlo methods become inefficient when the dimension of the probability distribution is high, and more sophisticated methods could be used for more accurate parameter estimation.
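As an illustration of the static Monte Carlo approximation, the sketch below estimates the normalization term Σ_h Q_M(h|w) exp(Σ_k λ_k f_k(w, h)) by averaging exp(Σ_k λ_k f_k(w, h^(s))) over samples h^(s) drawn from Q_M. This is a sketch of the general technique under our reading of the objective, with hypothetical function names, not the paper's exact formula:

```python
import math

def approx_log_partition(samples, lam, feats_fn):
    """Estimate log sum_h Q_M(h|w) exp(sum_k lam_k f_k(w, h)) as
    log( (1/S) * sum_s exp(sum_k lam_k f_k(w, h^(s))) ), where the
    S samples are drawn from Q_M(h|w) once and then held fixed."""
    vals = [sum(lam.get(f, 0.0) for f in feats_fn(h)) for h in samples]
    m = max(vals)  # log-sum-exp trick for numerical stability
    return m + math.log(sum(math.exp(v - m) for v in vals) / len(vals))
```

Because the samples are drawn once from Q_M and reused at every optimization step, the approximated objective stays deterministic, which is what allows a quasi-Newton optimizer to be applied to it.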
The token-level features used in the system are the same as those used in MSTParser version 0.4.2 (see footnote 3).
The features include lexical forms and (coarse and fine) POS tags of parent tokens, child tokens, their surrounding tokens, and tokens between the child and the parent.
The direction and the distance from a parent to its child, and the FEATS fields of the parent and the child which are split into elements and then combined are also included.
Features that appeared less than 5 times in training data are ignored.
Global features can capture any information in dependency trees, and the following nine types of global features are used (in the following, a parent node means a head token, and a child node means a dependent token):
Child Unigram+Parent+Grandparent This feature template is a 4-tuple consisting of (1) a child node, (2) its parent node, (3) the direction from the parent node to the child node, and (4) the grandparent node.
Each node in the feature template is expanded to its lexical form and coarse POS tag in order to obtain actual features.
Features that appeared in four or less sentences are ignored.
The same procedure is applied to the following other features.
Child Bigram+Parent This feature template is a 4-tuple consisting of (1) a child node, (2) its parent node, (3) the direction from the parent node to the child node, and (4) the nearest outer sibling node (the nearest sibling node which exists on the opposite side of the parent node) of the child node.
This feature template is almost the same as the one used by McDonald and Pereira (2006).
Child Bigram+Parent+Grandparent This feature template is a 5-tuple.
The first four elements (1)-(4) are the same as those of the Child Bigram+Parent feature template, and the additional element (5) is the grandparent node.
Child Trigram+Parent This feature template is a 5-tuple.
The first four elements (1)-(4) are the same as those of the Child Bigram+Parent feature template, and the additional element (5) is the next nearest outer sibling node of the child node.
3 http://sourceforge.net/projects/mstparser
Parent+All Children This feature template is a tuple with more than one element.
The first element is a parent node, and the other elements are all of its child nodes.
Parent+All Children+Grandparent This feature template is a tuple with more than two elements.
The elements other than the last one are the same as the Parent+All Children feature template, and the last element is the grandparent node.
Child+Ancestor This feature template is a 2-tuple consisting of (1) a child node and (2) one of its ancestor nodes.
Acyclic This feature type has one of two values: true if the dependency tree is acyclic, or false otherwise.
Projective This feature type has one of two values: true if the dependency tree is projective, or false otherwise.
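The global feature templates above operate on a whole head configuration. The following sketch, with hypothetical function names and a simplified tag-only node representation, shows how a Parent+All Children feature and the Acyclic and Projective indicators can be computed from a head array (heads[t] is the head of token t+1, with 0 as the artificial root):

```python
def parent_all_children(heads, tags):
    """Parent+All Children features: one tuple per parent, consisting of
    the parent's tag followed by the tags of all its children in order."""
    n = len(heads)
    feats = []
    for p in range(1, n + 1):
        kids = [c for c in range(1, n + 1) if heads[c - 1] == p]
        if kids:
            feats.append((tags[p - 1],) + tuple(tags[c - 1] for c in kids))
    return feats

def is_acyclic(heads):
    """Acyclic indicator: no token may lie on a head cycle."""
    for t in range(1, len(heads) + 1):
        seen, node = set(), t
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True

def is_projective(heads):
    """Projective indicator: no two dependency arcs may cross."""
    arcs = [(min(heads[t], t + 1), max(heads[t], t + 1)) for t in range(len(heads))]
    for (a, b) in arcs:
        for (c, d) in arcs:
            if a < c < b < d:
                return False
    return True
```

Since Gibbs sampling only ever needs the score of one full configuration at a time, such whole-tree predicates can be used as features without any factorization.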
3 Dependency Relation Labeling
Dependency relation labeling can be handled as a multi-class classification problem, and we use Support Vector Machines (SVMs), which have been successfully applied to many NLP tasks.
Solving a large-scale multi-class classification problem with SVMs requires substantial computational resources, so we use the revision learning method (Nakagawa et al., 2002).
The revision learning method combines a probabilistic model, which has a smaller computational cost, with a binary classifier, which has a higher generalization capacity.
In this method, the latter classifier revises the output of the former model to conduct multi-class classification with higher accuracy and reasonable computational cost.
In this study, we use maximum entropy (ME) models as the probabilistic model and SVMs with a second-order polynomial kernel as the binary classifier.
The dependency label of each node is determined independently of the labeling of other nodes.
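The control flow of revision learning can be sketched as follows: the ME model ranks the candidate labels by probability, and the binary SVM then accepts or rejects them in that order. The function names and the fallback to the ME model's top label are illustrative assumptions, not details from Nakagawa et al. (2002):

```python
def revise(candidates_with_probs, svm_decision):
    """Examine candidate labels in decreasing order of ME-model
    probability and return the first one the binary classifier accepts;
    if none is accepted, fall back to the ME model's top-ranked label."""
    ranked = sorted(candidates_with_probs, key=lambda x: -x[1])
    for label, _ in ranked:
        if svm_decision(label) > 0:
            return label
    return ranked[0][0]
```

In practice the SVM only needs to be evaluated on a few top-ranked candidates, which is where the computational saving over one-vs-rest multi-class SVMs comes from.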
Table 1: Results of Multilingual Dependency Parsing
Table 2: Unlabeled Attachment Scores in Different Settings (underlined values indicate submitted results, and bold values indicate the highest scores)
and the child tokens of i (the j′-th tokens, where j′ ∈ {j′ | h_{j′} = i}) (see footnote 4).
As the features for the ME models, a subset of these features is used, since the ME models serve only to reduce the search space and do not need as many features.
4 Results and Analysis
In order to tune the system, we split each training data set into two parts, and used the first half for training and the remaining half for testing in development.
The CLE algorithm was used for Basque, Czech, Hungarian and Turkish, and the Eisner algorithm was used for the others.
We used lemmas for Catalan, Czech, Greek and Italian, and word forms for all others.
The values of the fixed parameters were chosen as R = 500, S = 200, σ = 0.25, and σ′ = 0.25.
With these parameter settings, training took 247 hours, and testing took 343 minutes on an Opteron 250 processor.
Table 1 shows the evaluation results on the test sets.
Accuracy was measured with the labeled attachment score (LAS) and the unlabeled attachment score (UAS).
Among the participating systems in the shared task, we obtained the second best average accuracy in the labeled attachment score, and the best average accuracy in the unlabeled attachment score.
Compared with other systems, the gap between our labeled and unlabeled scores is relatively large.
In this study, labeling of dependency relations was performed in a separate post-processing step, and each label was predicted independently.
The labeled scores may be improved if the parsing process and the labeling process are performed at the same time, and dependencies among labels are taken into account.
We conducted experiments with different settings.
Table 2 shows the results measured with the unlabeled attachment score. In the table, Eisner and CLE indicate that the Eisner algorithm and the CLE algorithm, respectively, are used in decoding, and local and +global indicate that local features alone, and local and global features together, are used.
4 Although polynomial kernels of SVMs can implicitly handle combined features, some of the combined features were also included explicitly, because using unnecessarily high-order polynomial kernels decreases performance.
The CLE algorithm performed better than the Eisner algorithm for Basque, Czech, Hungarian, Italian and Turkish.
All of these data sets except Italian contain a relatively large number of non-projective sentences (the percentage of sentences with at least one non-projective relation in the training data is over 20% (Nivre et al., 2007)), though the Greek data set, on which the Eisner algorithm performed better, also contains many non-projective sentences (20.3%).
By using the global features, the accuracy was improved in all the cases except for Turkish with the Eisner algorithm (Table 2).
The increase was rather large in Chinese and Czech.
When the global features were used in these languages, the dependency accuracy for tokens whose heads had conjunctions as parts-of-speech was notably improved: from 80.5% to 86.0% in Chinese (Eisner), and from 73.2% to 77.6% in Czech (CLE).
We investigated the trained global models, and found that Parent+All Children features, whose parents were conjunctions and whose children had compatible classes, had large positive weights, and those whose children had incompatible classes had large negative weights.
A feature with a larger weight is generally more influential.
Riedel and Clarke (2006) suggested using linguistic constraints such as "arguments of a coordination must have compatible word classes," and such constraints seem to be represented by these features in our models.
5 Conclusion
In this study, we applied a dependency parser using global features to multilingual dependency parsing.
Evaluation results showed that the use of global features was effective in obtaining higher accuracy in multilingual dependency parsing.
Improving dependency relation labeling is left for future work.
