Deterministic dependency parsers use parsing actions to construct dependencies.
These parsers do not compute the probability of the whole dependency tree.
They only determine the parsing action of each step with a trained classifier.
To globally model parsing actions of all steps that are taken on the input sentence, we propose two kinds of probabilistic parsing action models that can compute the probability of the whole dependency tree.
The tree with the maximal probability is output.
The experiments are carried out on 10 languages, and the results show that our probabilistic parsing action models outperform the original deterministic dependency parser.
1 Introduction
The target of CoNLL 2007 shared task (Nivre et al., 2007) is to parse texts in multiple languages by using a single dependency parser that has the capacity to learn from treebank data.
Among the parsers participating in the CoNLL 2006 shared task (Buchholz et al., 2006), the deterministic dependency parser showed great time efficiency and comparable performance for multi-lingual dependency parsing (Nivre et al., 2006).
A deterministic parser regards parsing as a sequence of parsing actions that are taken step by step on the input sentence.
Parsing actions construct dependency relations between words.
Unlike most state-of-the-art parsers, deterministic dependency parsers do not score the entire dependency tree; they only choose the most probable parsing action at each step.
In this paper, to globally model the parsing actions of all steps taken on the input sentence, we propose two kinds of probabilistic parsing action models that can compute the probability of the entire dependency tree.
Experiments are carried out on the diverse data sets of 10 languages provided by the CoNLL 2007 shared task (Nivre et al., 2007).
Results show that our probabilistic parsing action models outperform the original deterministic dependency parser.
We also present a general error analysis across a wide set of languages plus a detailed error analysis of Chinese.
Next, we briefly introduce the original deterministic dependency parsing algorithm, which is a basic component of our models.
2 Introduction of Deterministic Dependency Parsing
There are two main representative deterministic dependency parsing algorithms, proposed by Nivre (2003) and by Yamada and Matsumoto (2003) respectively.
Here we briefly introduce Yamada and Matsumoto's algorithm, which is adopted by our models, to illustrate deterministic dependency parsing.
The other representative method, that of Nivre, parses sentences in a similar deterministic manner, but with a different data structure and different parsing actions.
Yamada's method originally focuses on unlabeled dependency parsing.
Three kinds of parsing actions are applied to construct the dependency between two focus words.
The two focus words are the root of the current subtree and the root of the succeeding (right) subtree, given the current parsing state.
Every parsing step results in a new parsing state, which includes all elements of the current partially built tree.
Features are extracted from these two focus words. In the training phase, the features and the corresponding parsing action compose the training data. In the testing phase, the classifier determines which parsing action should be taken based on the features.

[Figure 1. The example of the parsing process of Yamada and Matsumoto's method. The input sentence is "He provides confirming evidence."]
The parsing algorithm ends when no further dependency relation can be made on the whole sentence.
The details of the three parsing actions are as follows:
LEFT: it constructs a dependency in which the right focus word depends on the left focus word.
RIGHT: it constructs a dependency in which the left focus word depends on the right focus word.
SHIFT: it constructs no dependency; it just moves the parsing focus.
That is, the new left focus word is the previous right focus word, and the root of its succeeding subtree becomes the new right focus word.
The three actions and the parsing process are illustrated in figure 1. Note that the focus words are shown as bold black boxes.
We extend the set of parsing actions to do labeled dependency parsing.
LEFT and RIGHT are concatenated with dependency labels, while SHIFT remains the same.
For example in figure 1, the original action sequence "RIGHT -> SHIFT -> RIGHT -> LEFT" becomes "RIGHT-SBJ -> SHIFT -> RIGHT-NMOD -> LEFT-OBJ".
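The parsing process above can be sketched in code. The following is a minimal illustration, not the authors' implementation: the parsing state is simplified to a list of subtrees plus a focus index, and the focus movement after a reduction is our own simplification.

```python
# Sketch of Yamada and Matsumoto's three parsing actions on a list of
# subtrees. The two focus words are the roots of trees[i] and trees[i+1].

class Node:
    def __init__(self, word):
        self.word = word
        self.children = []

def apply_action(trees, i, action):
    """Apply a (possibly labeled) action to the focus pair and return
    the new focus index. LEFT/RIGHT merge two subtrees; SHIFT moves on."""
    left, right = trees[i], trees[i + 1]
    if action.startswith("RIGHT"):      # left focus word depends on right
        right.children.append(left)
        trees.pop(i)
        return max(i - 1, 0)
    if action.startswith("LEFT"):       # right focus word depends on left
        left.children.append(right)
        trees.pop(i + 1)
        return max(i - 1, 0)
    return i + 1                        # SHIFT: no dependency is built

trees = [Node(w) for w in "He provides confirming evidence".split()]
i = 0
for action in ["RIGHT-SBJ", "SHIFT", "RIGHT-NMOD", "LEFT-OBJ"]:
    i = apply_action(trees, i, action)

root = trees[0]
print(root.word)                        # provides
print([c.word for c in root.children])  # ['He', 'evidence']
```

Running the labeled action sequence from figure 1 leaves a single tree rooted at "provides", with "He" and "evidence" as its children and "confirming" under "evidence".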
3 Probabilistic Parsing Action Models
Deterministic dependency parsing algorithms are greedy.
They choose the most probable parsing action at every parsing step given the current parsing state, and do not score the entire dependency tree.
To compute the probability of the whole dependency tree, we propose two kinds of probabilistic models defined on parsing actions: the parsing action chain model (PACM) and the parsing action phrase model (PAPM).
3.1 Parsing Action Chain Model
The parsing process can be viewed as a Markov chain.
At every parsing step, there are several candidate parsing actions.
The objective of this model is to find the most probable sequence of parsing actions by taking the Markov assumption.
As shown in figure 1, the action sequence "RIGHT-SBJ -> SHIFT -> RIGHT-NMOD -> LEFT-OBJ" constructs the correct dependency tree of the example sentence. Choosing this action sequence among all candidate sequences is the objective of this model.
The probability of the dependency tree is defined as:

P(T|S) = P(d_0, d_1, ..., d_n | S)                        (1)
       = \prod_{i=1}^{n} P(d_i | d_0, ..., d_{i-1}, S)    (2)
       = \prod_{i=1}^{n} P(d_i | context_{d_{i-1}})       (3)

where T denotes the dependency tree, S denotes the original input sentence, and d_i denotes the parsing action at time step i. We add an artificial parsing action d_0 as the initial action. We introduce a variable context_{d_i} to denote the parsing state that results when action d_i is taken on context_{d_{i-1}}; context_{d_0} is the original input sentence. Suppose d_0, ..., d_n are taken sequentially on the input sentence S and result in the sequence of parsing states context_{d_0}, ..., context_{d_n}; then P(T|S) defined in equation (1) expands as equation (2). Formula (3) follows from formula (2) under the Markov assumption. Note that the factor in formula (3),

P(d_i | context_{d_{i-1}})                                (4)

is exactly the classifier over parsing actions: it denotes the probability of the parsing action d_i given the parsing state context_{d_{i-1}}. If we train a classifier that can predict with probability output, then we can compute P(T|S) as the product of the probabilities of the parsing actions.
The classifier we use throughout this paper is SVM (Vapnik, 1995).
We adopt Libsvm (Chang and Lin, 2005), which can train multi-class classifiers and supports training and predicting with probability output.
For this model, the objective is to choose the parsing action sequence that constructs the dependency tree with the maximal probability.
Because this model chooses the most probable sequence, not just the most probable parsing action of a single step, it avoids the greedy property of the original deterministic parsers.
We use beam search for the decoding of this model.
We use m to denote the beam size.
Then beam search is carried out as follows.
At every parsing step, all parsing states are ranked (or partially ranked) according to their probabilities. The probability of a parsing state is the product of the probabilities of the actions that generated it. We then keep the m best parsing states for this step, and the next parsing step considers only these m states.
Parsing terminates when the first entire dependency tree is constructed.
To obtain a list of n-best parses, we simply continue parsing until either n trees are found or no further parsing step can be taken.
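The beam-search decoding described above can be sketched as follows. The names here are our own: score_actions stands in for the trained classifier's probability output over candidate actions (Libsvm in the paper) and is stubbed in the toy usage.

```python
# Beam-search decoding for the Parsing Action Chain Model: keep the m
# most probable parsing states per step; a state's probability is the
# product of the probabilities of the actions that generated it.
import heapq

def beam_search(init_state, score_actions, step, is_final, m=8, n=1):
    """Return up to n finished (probability, state) pairs, best first."""
    beam = [(1.0, init_state)]
    finished = []
    while beam and len(finished) < n:
        candidates = []
        for prob, state in beam:
            if is_final(state):          # an entire tree was constructed
                finished.append((prob, state))
                continue
            for action, p in score_actions(state):
                candidates.append((prob * p, step(state, action)))
        beam = heapq.nlargest(m, candidates, key=lambda c: c[0])
    return sorted(finished, key=lambda c: -c[0])[:n]

# Toy usage: states are action strings; a "parse" finishes after 2 actions.
toy_scores = lambda s: [("A", 0.6), ("B", 0.4)]
result = beam_search("", toy_scores,
                     step=lambda s, a: s + a,
                     is_final=lambda s: len(s) == 2,
                     m=2, n=1)
print(result)   # best sequence "AA", probability 0.6 * 0.6
```

In practice the probabilities would be multiplied in log space to avoid underflow on long sentences; the sketch keeps raw products for readability.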
3.2 Parsing Action Phrase Model
In the Parsing Action Chain Model (PACM), actions compete at every parsing step.
Only the m best parsing states produced by the corresponding actions are kept at every step. For the parsing problem, however, it is also reasonable to let actions compete over which phrase should be built next.
For dependency syntax, one phrase consists of the head word and all its children.
Based on this motivation, we propose the Parsing Action Phrase Model (PAPM), which divides parsing actions into two classes: constructing actions and shifting actions.
If a phrase is built after an action is performed, the action is called a constructing action. In the original Yamada algorithm, the constructing actions are LEFT and RIGHT. For example, if LEFT is taken, it indicates that the right focus word has found all its children and becomes the head of this new phrase.
Note that a word with no children can also be viewed as a phrase once its dependency on another word is constructed.
In the extended set of parsing actions for labeled parsing, compound actions, which consist of LEFT and RIGHT concatenated by dependency labels, are constructing actions.
If no phrase is built after an action is performed, the action is called a shifting action. The only such action is SHIFT.
We denote by a_j a constructing action and by b_j a shifting action, where j indexes the time step. Then we introduce a new concept: the parsing action phrase. We use A_i to denote the i-th parsing action phrase: a sequence of parsing actions that constructs the next syntactic phrase. For the example in figure 1, A_1 consists of a constructing action (RIGHT-SBJ), A_2 consists of a shifting action and a constructing action (SHIFT, RIGHT-NMOD), and A_3 consists of a constructing action (LEFT-OBJ). The indexes differ on the two sides of the expansion A_i -> b_{j-k} ... b_{j-1} a_j: A_i is the i-th parsing action phrase, corresponding to the constructing action a_j at time step j together with all its preceding shifting actions. Note that on the right side of the expansion, exactly one constructing action is allowed, always in the last position, while shifting actions can occur several times or not at all. It is parsing action phrases, i.e. sequences of parsing actions, that compete over which phrase should be built next.
The probability of the dependency tree given the input sentence is redefined over parsing action phrases:

P(T|S) = \prod_{i} P(A_i | A_1, ..., A_{i-1}, S)
       = \prod_{i} P(b_{j-k}, ..., b_{j-1}, a_j | context_{A_{i-1}})

where k represents the number of steps for which the shifting action is taken, and context_{A_i} is the parsing state resulting from the action sequence b_{j-k} ... b_{j-1} a_j taken on context_{A_{i-1}}.
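The decomposition of an action sequence into parsing action phrases can be sketched directly: each phrase is zero or more SHIFTs followed by exactly one constructing action, matching the expansion A_i -> b_{j-k} ... b_{j-1} a_j. The function name below is our own.

```python
# Split a flat parsing action sequence into parsing action phrases:
# every constructing action (anything other than SHIFT) closes a phrase.

def to_action_phrases(actions):
    phrases, current = [], []
    for a in actions:
        current.append(a)
        if a != "SHIFT":          # constructing action ends the phrase
            phrases.append(current)
            current = []
    return phrases

seq = ["RIGHT-SBJ", "SHIFT", "RIGHT-NMOD", "LEFT-OBJ"]
print(to_action_phrases(seq))
# → [['RIGHT-SBJ'], ['SHIFT', 'RIGHT-NMOD'], ['LEFT-OBJ']]
```

Applied to the action sequence of figure 1, this yields exactly the phrases A_1, A_2 and A_3 described above.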
As with the parsing action chain model (PACM), we use beam search for the decoding of the parsing action phrase model (PAPM). The difference is that PAPM does not keep the m best parsing states at every parsing step. Instead, PAPM keeps the m best states corresponding to the m best current parsing action phrases (several steps of SHIFT plus a final constructing action).
4 Experiments and Results
Because our algorithm only handles projective parsing, we first projectivize the training data to prepare for training; during testing, deprojectivization is applied to the output of the parser. Three of the languages have a very low percentage of non-projective relations (0.0%, 0.1% and 0.3% respectively), and for them no such processing is needed. For the other languages, we use the projectivization/deprojectivization software provided by Nivre and Nilsson (2005).
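The projectivity property in question can be made concrete with a small check (our own illustration, not the Nivre and Nilsson (2005) pseudo-projective transformation): a tree is projective if, for every arc, every word between the head and the dependent is a descendant of the head.

```python
# Projectivity check for a dependency tree given as a head vector:
# heads[i] is the head index of word i+1, with 0 denoting the root.

def is_projective(heads):
    n = len(heads)
    for dep in range(1, n + 1):
        head = heads[dep - 1]
        lo, hi = sorted((head, dep))
        for k in range(lo + 1, hi):
            # walk up from k; it must reach `head` before reaching the root
            node = k
            while node != 0 and node != head:
                node = heads[node - 1]
            if node != head:
                return False
    return True

# "He provides confirming evidence" with "provides" as root: projective.
print(is_projective([2, 0, 4, 2]))   # → True
# Attaching "confirming" to "He" crosses "provides": non-projective.
print(is_projective([2, 0, 1, 2]))   # → False
```

A sentence whose head vector passes this check needs no transformation; the others are the ones that projectivization rewrites before training.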
For the Libsvm classifier (Chang and Lin, 2005), the features are extracted from the following fields of the data representation: FORM, LEMMA, CPOSTAG, POSTAG, FEATS and DEPREL. We split the values of the FEATS field into their atomic components. We only use the features of the DEPREL field that are available during deterministic parsing.
We use a feature context window similar to that of Yamada's algorithm (Yamada and Matsumoto, 2003). In detail, the size of the feature context window is six: the two subtrees rooted at the focus words, the two subtrees to their left and the two subtrees to their right.
This feature template is used for all 10 languages.
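The splitting of the FEATS field into atomic components can be sketched as follows. The CoNLL format separates morphological features with "|" and uses "_" for an empty field; the feature-name prefix is our own convention, not the authors' template.

```python
# Split a CoNLL FEATS field value into atomic feature strings for the
# classifier, e.g. 'gen=m|num=s' -> one feature per atomic component.

def atomic_feats(feats):
    if feats == "_":                  # empty field in the CoNLL format
        return []
    return ["FEATS=" + f for f in feats.split("|")]

print(atomic_feats("gen=m|num=s"))    # → ['FEATS=gen=m', 'FEATS=num=s']
print(atomic_feats("_"))              # → []
```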
After submitting the testing results of the Parsing Action Chain Model (PACM), we also ran the original deterministic parsing method proposed by Yamada and Matsumoto (2003).
The total results are shown in table 1.
The experimental results are mainly evaluated by labeled attachment score (LAS), unlabeled attachment score (UAS) and labeled accuracy (LA).
Table 1 shows that the Parsing Action Chain Model (PACM) outperforms the original Yamada parsing method for all languages.
The LAS improvements range from 0.60 percentage points to 1.71 percentage points.
Note that even the original Yamada method gives testing results above the officially reported average performance over all languages.
Table 1. The performances of Yamada's method (Yam) and the Parsing Action Chain Model (PACM).
Not every language restricts a sentence to a single root node. Since the Parsing Action Phrase Model (PAPM) only builds dependencies, and a shifting action never ends a parsing action phrase, PAPM always ends with a single root word. This property makes PAPM suitable only for Catalan, Chinese, English and Hungarian, whose sentences have a single root.
The PAPM result for Catalan was not submitted before the deadline due to a shortage of time and computing resources.
We report Catalan's PAPM result together with that of other three languages in table 2.
Table 2. The performance of the Parsing Action Phrase Model (PAPM) for Catalan, Chinese, English and Hungarian.
Compared with the results of PACM shown in table 1, the performance of PAPM differs among different languages.
For Catalan and English, PAPM improves over PACM by 2.31 and 0.86 percentage points respectively, while the improvement for Chinese is marginal and there is a slight decrease for Hungarian.
Hungarian has relatively high percentage of non-projective relations.
If a phrase consists of a head word and non-projective children, the constructing actions, which are the main actions in PAPM, are very difficult to learn, because some non-projective children never appear as focus words simultaneously with their heads.
Although projectivization is also performed for Hungarian, the inherent non-projectivity still has a negative influence on the performance.
5 Error Analysis
In the following we provide a general error analysis across a wide set of languages plus a detailed analysis of Chinese.
5.1 General Error Analysis
One of the main difficulties in dependency parsing is the determination of long distance dependencies.
Although evaluation scores differ dramatically among languages, ranging from 69.91% to 85.83% LAS, there are some general observations reflecting the difficulty of long distance dependency parsing.
We study this difficulty from two aspects of our full PACM submission: the precision of dependencies of different arc lengths, and the precision of root nodes.
For arcs of length 1, all languages give high performance, ranging from 91.62% for Czech (Bohmova et al., 2003) to 96.8% for Catalan (Marti et al., 2007).
As arc lengths grow, performance degrades to varying degrees.
For Catalan, the score for arc length 2 is similar to that for arc length 1, but there are dramatic degradations for longer arcs, from 94.94% at arc length 2 to 85.22% at lengths 3-6.
For English (Johansson and Nugues, 2007) and Italian (Montemagni et al., 2003), there is a graceful degradation across arcs of length 1, 2 and 3-6: 96-91-85 for English and 95-85-75 for Italian.
For the other languages, long arcs also show remarkable degradations that pull down the overall performance.
The precision of root nodes also reflects the performance on long-arc dependencies, because the arcs between the root and its children are often long.
In fact, it is the precision of roots and of arcs longer than 7 that mainly pulls down the overall performance.
Yamada's method is a bottom-up parsing algorithm that builds short distance dependencies at first.
The difficulty of building long-arc dependencies may partly result from errors in short distance dependencies. The deterministic manner causes error propagation, which indirectly indicates that root errors are the final result of error propagation from short distance dependencies.
Chinese is an exception: the root precision is 90.48%, second only to the precision of arcs of length 1.
This phenomenon exists because the sentences in the Chinese data set (Chen et al., 2003) are in fact clauses with an average length of 5.9, rather than entire sentences.
The root words are heads of clauses.
For some languages, such as Turkish (Oflazer et al., 2003), the improvement in root precision is small, but dependencies with arcs longer than 1 obtain better scores.
For PAPM, good performances of Catalan and English also give significant improvements of root precision over PACM.
For Catalan, the root precision improves from 63.86% to 95.21%; for English, from 62.03% to 89.25%.
5.2 Error Analysis of Chinese
There are mainly two sources of errors regarding LAS in Chinese dependency parsing.
One is conjunction words (C), which have a relatively high percentage of wrong heads (about 20%) and consequently 19% wrong dependency labels.
In Chinese, conjunction words often concatenate clauses.
Long distance dependencies between clauses are bridged by conjunction words.
It is difficult for conjunction words to find their heads.
The other source of errors comes from auxiliary words (DE) and preposition words (P).
Unlike conjunction words, auxiliary words and preposition words perform well at finding the right head, but their label accuracy (LA) decreases significantly.
The reason may lie in the large Chinese dependency label set, which consists of 57 kinds of dependency labels.
Moreover, auxiliary words (DE) and preposition words (P) have more possible dependency labels than other coarse POS tags do, which introduces ambiguity for parsers.
The most common POS tags, including nouns and verbs, contribute much to the overall performance of 83% labeled attachment score (LAS).
Adverbs obtain the top score, while adjectives give the worst.
6 Conclusion
We propose two kinds of probabilistic models defined on parsing actions to compute the probability of the entire dependency tree. Compared with Yamada and Matsumoto's original deterministic dependency method, which chooses the most probable parsing action step by step, the two probabilistic models improve the performance for all 10 languages in the CoNLL 2007 shared task.
Through the study of the parsing results, we find that long distance dependencies are hard to determine for all 10 languages.
Further analysis about this difficulty is needed to guide the research direction.
Feature exploration is also necessary to provide more informative features for hard problems.
Acknowledgements
This work was supported by the Hi-tech Research and Development Program of China under grant No. 2006AA01Z144, the Natural Sciences Foundation of China under grant No. 60673042, and the Natural Science Foundation of Beijing under grant Nos. 4052027 and 4073043.
