This paper presents an online algorithm for dependency parsing problems.
We propose an adaptation of the Passive-Aggressive online learning algorithm to the dependency parsing domain.
We evaluate the proposed algorithm on the CONLL 2007 Shared Task and report an error analysis.
Experimental results show that the system score is better than the average score among the participating systems.
1 Introduction
Singer, 2003).
MIRA-based parsing differs from history-based methods in that the MIRA-based parser is trained to maximize the accuracy of the overall tree.
MIRA-based parsing is close to the maximum-margin parsing of Taskar et al. (2004) and Tsochantaridis et al. (2005).
However, unlike maximum-margin parsing, it is not limited to parsing sentences of 15 words or less due to computation time.
MIRA-based parsing achieves state-of-the-art performance on English data (McDonald et al., 2005a; McDonald et al., 2006).
In this paper, we propose a new adaptation of online large-margin learning to the problem of dependency parsing.
Unlike the MIRA parser, our method does not need an optimization procedure in each learning update, but uses only an update equation.
This might lead to faster training time and easier implementation.
The contributions of this paper are two-fold: First, we present a training algorithm called PA learning for dependency parsing, which is as easy to implement as Perceptron, yet competitive with large margin methods.
This algorithm has implications for anyone interested in implementing discriminative training methods for any application.
Second, we evaluate the proposed algorithm on the multilingual data task as well as the domain adaptation task (Nivre et al., 2007).
The remaining parts of the paper are organized as follows: Section 2 proposes our dependency parsing with Passive-Aggressive learning.
Section 3 discusses some experimental results and Section 4 gives conclusions and plans for future work.
2 Dependency Parsing with Passive-Aggressive Learning
This section presents the modification of Passive-Aggressive Learning (PA) (Crammer et al., 2006) for dependency parsing.
We modify the PA algorithm to deal with structured prediction, in which our problem is to learn a discriminant function that maps an input sentence x to a dependency tree y. Figure 1 shows an example of dependency parsing which depicts the relation of each word to another word within a sentence.
Several algorithms can determine the relations between the words of a sentence; for instance, the modified CKY algorithm (Eisner, 1996) can be used to find these relations for a given sentence.
Figure 1: An example of a dependency tree for the sentence "root John hit the ball with the bat".
2.1 Parsing Algorithm
Dependency-tree parsing as the search for the maximum spanning tree (MST) in a graph was proposed by McDonald et al. (2005b).
In this subsection, we briefly describe the parsing algorithms based on the first-order MST parsing.
Due to the limitation of participation time, we only applied the first-order decoding parsing algorithm in CONLL-2007.
However, our algorithm can be used for the second order parsing.
where Φ(i, j) is a high-dimensional binary feature representation of the edge from xwi to xwj.
For example, in Figure 1, we can present an example Φ(i, j) as follows:
A basic question must be answered for models of this form: how do we find the dependency tree y with the highest score for sentence x?
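As a minimal sketch of a model of this form, the Python code below scores a tree as the sum of its edge scores, where each edge score is the dot product of the weight vector with a sparse binary feature vector. The feature templates, the function names, the weight values and the head indices assumed for the sentence of Figure 1 are illustrative assumptions, not the feature set or the implementation of the system described here.

```python
from collections import defaultdict

def edge_features(words, head, child):
    """Names of the binary features that fire for the edge head -> child.

    These three templates are hypothetical and only illustrate the idea.
    """
    return [
        f"hw={words[head]}|cw={words[child]}",  # parent word and child word
        f"hw={words[head]}",                    # parent word alone
        f"cw={words[child]}",                   # child word alone
    ]

def edge_score(weights, words, head, child):
    # With binary features, the dot product reduces to a sum of weights.
    return sum(weights[f] for f in edge_features(words, head, child))

def tree_score(weights, words, heads):
    """Score of a tree given as heads[c] = index of the parent of word c.

    Index 0 is the artificial root, so heads[0] is ignored.
    """
    return sum(edge_score(weights, words, heads[c], c)
               for c in range(1, len(words)))

words = ["root", "John", "hit", "the", "ball", "with", "the", "bat"]
heads = [None, 2, 0, 4, 2, 2, 7, 5]  # assumed tree, for illustration only
weights = defaultdict(float)         # unseen features have weight 0
weights["hw=hit|cw=ball"] = 1.5
weights["cw=John"] = 0.5
score = tree_score(weights, words, heads)
```

Finding the best tree then amounts to searching over all head assignments that form a valid tree, which is what the decoding algorithms do efficiently.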
The two algorithms we employed in our dependency parsing model are Eisner's parsing algorithm (Eisner, 1996) and the Chu-Liu algorithm (Chu and Liu, 1965).
These algorithms are commonly used in other online-learning dependency parsers, such as that of McDonald et al. (2005a).
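For the projective case, first-order decoding can be sketched with the usual dynamic program over complete and incomplete spans. The code below is a simplified illustration that assumes a dense score matrix with finite entries and the hypothetical function name `eisner`; it is not the implementation used in our experiments.

```python
def eisner(scores):
    """First-order projective decoding (a sketch of Eisner's algorithm).

    scores[h][m] is the score of the edge h -> m; index 0 is the root.
    Returns heads, where heads[m] is the parent of word m and heads[0] = -1.
    """
    n = len(scores)
    neg = float("-inf")
    # Last index is the direction: 0 = head at the right end t,
    # 1 = head at the left end s.
    comp = [[[0.0, 0.0] for _ in range(n)] for _ in range(n)]
    inc = [[[neg, neg] for _ in range(n)] for _ in range(n)]
    comp_bp = [[[-1, -1] for _ in range(n)] for _ in range(n)]
    inc_bp = [[-1] * n for _ in range(n)]

    for k in range(1, n):
        for s in range(n - k):
            t = s + k
            # Incomplete span: join two complete spans, then add an arc
            # between s and t.
            best, arg = neg, -1
            for r in range(s, t):
                val = comp[s][r][1] + comp[r + 1][t][0]
                if val > best:
                    best, arg = val, r
            inc[s][t][0] = best + scores[t][s]  # arc t -> s
            inc[s][t][1] = best + scores[s][t]  # arc s -> t
            inc_bp[s][t] = arg
            # Complete span headed by t.
            best, arg = neg, -1
            for r in range(s, t):
                val = comp[s][r][0] + inc[r][t][0]
                if val > best:
                    best, arg = val, r
            comp[s][t][0], comp_bp[s][t][0] = best, arg
            # Complete span headed by s.
            best, arg = neg, -1
            for r in range(s + 1, t + 1):
                val = inc[s][r][1] + comp[r][t][1]
                if val > best:
                    best, arg = val, r
            comp[s][t][1], comp_bp[s][t][1] = best, arg

    heads = [-1] * n

    def backtrack(s, t, d, complete):
        if s == t:
            return
        if complete:
            r = comp_bp[s][t][d]
            if d == 0:
                backtrack(s, r, 0, True)
                backtrack(r, t, 0, False)
            else:
                backtrack(s, r, 1, False)
                backtrack(r, t, 1, True)
        else:
            if d == 0:
                heads[s] = t
            else:
                heads[t] = s
            r = inc_bp[s][t]
            backtrack(s, r, 1, True)
            backtrack(r + 1, t, 0, True)

    backtrack(0, n - 1, 1, True)
    return heads

# Toy example with a root and two words.
toy_scores = [[0, 1, 5],
              [0, 0, 2],
              [0, 4, 0]]
toy_heads = eisner(toy_scores)
```

In the toy example the best tree attaches word 2 to the root and word 1 to word 2, since 5 + 4 exceeds the score of the two alternative trees.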
In the next subsection we address the problem of how to estimate the weight w_i associated with a feature Φ_i from the training data using an online PA learning algorithm.
This section presents a modification of the PA algorithm for structured prediction, and its use in dependency parsing.
The Perceptron-style approach to natural language processing problems, initially proposed by Collins (2002), can provide state-of-the-art results on various domains including text chunking, syntactic parsing, etc. The main drawback of the Perceptron-style algorithm is that it has no mechanism for attaining the maximum margin on the training data.
It may therefore be difficult to obtain high accuracy on hard training data.
The structured support vector machine (Tsochantaridis et al., 2005) and the maximum-margin model (Taskar et al., 2004) can attain a maximum margin for given training data by solving an optimization problem (i.e., quadratic programming).
Using such an optimization algorithm, however, requires much computational time.
For the dependency parsing domain, McDonald et al. (2005a) modified the MIRA learning algorithm (McDonald et al., 2005a) for structured domains, in which the optimization problem can be solved using Hildreth's algorithm (Censor and Zenios, 1997), which is faster than the quadratic programming technique.
In contrast to the previous method, this paper presents an online algorithm for dependency parsing in which we can attain the maximum margin on the training data without using optimization techniques.
It is thus much faster and easier to implement.
The details of the PA algorithm for dependency parsing are presented below.
ciated with a weight value.
The goal of PA learning for dependency parsing is to obtain a parameter w that minimizes the hinge-loss function while keeping a large margin on the training data.
2 Aggressive parameter C
3 Output: the PA learning model
6 Receive a sentence x_t
Algorithm 1: The Passive-Aggressive algorithm for dependency parsing.
Algorithm 1 shows the PA learning algorithm for dependency parsing, whose three variants differ only in their update formulas.
In Algorithm 1, we employ two kinds of argmax algorithms: The first is the decoding algorithm for projective language data and the second one is for non-projective language data.
In Algorithm 1 (line 8), ρ(y, y_t) is a real-valued loss for the tree y_t relative to the correct tree y. We define the loss of a dependency tree as the number of words which have an incorrect parent.
Thus, the largest loss a dependency tree can have is the length of the sentence.
A similar loss function is defined for labeled dependency trees.
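The loss just described can be sketched in a few lines of Python. The representation of a tree as a list of head indices with position 0 reserved for the root, and the exact form of the labeled variant (a word counts as wrong if either its parent or its label is wrong), are assumptions made for illustration.

```python
def unlabeled_loss(gold_heads, pred_heads):
    """Number of words whose predicted parent differs from the gold parent.

    Position 0 is the artificial root and is skipped.
    """
    return sum(1 for g, p in zip(gold_heads[1:], pred_heads[1:]) if g != p)

def labeled_loss(gold_heads, gold_labels, pred_heads, pred_labels):
    """Assumed labeled variant: a word is wrong if its parent or label is wrong."""
    gold = list(zip(gold_heads, gold_labels))[1:]
    pred = list(zip(pred_heads, pred_labels))[1:]
    return sum(1 for g, p in zip(gold, pred) if g != p)

gold_heads = [-1, 2, 0, 4, 2]
pred_heads = [-1, 2, 0, 2, 2]  # word 3 is attached to the wrong parent
loss = unlabeled_loss(gold_heads, pred_heads)
```

The loss is 0 for a perfect prediction and at most the number of words in the sentence.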
Algorithm 1 returns an averaged weight vector: an auxiliary weight vector v is maintained that accumulates the values of w after each iteration, and the returned weight vector is the average of all the weight vectors throughout training.
Averaging has been shown to help reduce overfitting (McDonald et al., 2005a; Collins, 2002).
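A minimal sketch of this averaging scheme is given below. The class name `AveragedWeights` and its methods are hypothetical, and dense Python lists are used only to keep the example short.

```python
class AveragedWeights:
    """Keeps the current weights w and an accumulator v for averaging."""

    def __init__(self, dim):
        self.w = [0.0] * dim  # current weight vector
        self.v = [0.0] * dim  # sum of w over all iterations so far
        self.steps = 0

    def update(self, delta):
        # Apply one learning update to the current weights.
        for i, d in enumerate(delta):
            self.w[i] += d

    def end_iteration(self):
        # Accumulate the current weights after each iteration.
        for i, w_i in enumerate(self.w):
            self.v[i] += w_i
        self.steps += 1

    def average(self):
        # Returned model: the average of all weight vectors seen so far.
        return [v_i / self.steps for v_i in self.v]

model = AveragedWeights(2)
model.update([1.0, 0.0])
model.end_iteration()   # w = [1, 0]
model.update([2.0, 2.0])
model.end_iteration()   # w = [3, 2]
avg = model.average()
```

Here the two weight vectors [1, 0] and [3, 2] average to [2, 1], which is what the parser would use at test time.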
The main difference between the PA algorithms and the Perceptron algorithm (PC) (Collins, 2002) as well as the MIRA algorithm (McDonald et al., 2005a) is in line 9.
The PC algorithm does not need the value τ_t, whereas the MIRA algorithm needs an optimization algorithm to compute τ_t. We also have three update formulations for obtaining τ_t in line 9.
In the scope of this paper, we only focus on using the second update formulation (the PA-I method) for training on dependency parsing data.
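The three step sizes can be sketched as follows, assuming the standard PA, PA-I and PA-II rules of Crammer et al. (2006) and their cost-sensitive hinge loss with the square root of ρ. Whether our system uses exactly this form of the loss is an assumption of the sketch, and the function names and the sparse dictionary representation are hypothetical.

```python
import math

def pa_step_size(loss, sq_norm, C, variant="PA-I"):
    """Step size tau_t for the three PA variants (Crammer et al., 2006)."""
    if loss <= 0.0 or sq_norm == 0.0:
        return 0.0  # passive step: the margin is already satisfied
    if variant == "PA":
        return loss / sq_norm
    if variant == "PA-I":
        return min(C, loss / sq_norm)
    if variant == "PA-II":
        return loss / (sq_norm + 1.0 / (2.0 * C))
    raise ValueError(f"unknown variant: {variant}")

def pa_update(weights, gold_feats, pred_feats, rho, C=0.05, variant="PA-I"):
    """One update of a sparse weight dict, in place. Returns tau_t.

    gold_feats and pred_feats map feature names to counts over the whole
    gold tree and the predicted tree; rho is the tree loss.
    """
    # Difference of the two feature vectors: gold minus predicted.
    diff = dict(gold_feats)
    for f, v in pred_feats.items():
        diff[f] = diff.get(f, 0.0) - v
    margin = sum(weights.get(f, 0.0) * v for f, v in diff.items())
    loss = max(0.0, math.sqrt(rho) - margin)  # assumed cost-sensitive hinge loss
    sq_norm = sum(v * v for v in diff.values())
    tau = pa_step_size(loss, sq_norm, C, variant)
    for f, v in diff.items():
        weights[f] = weights.get(f, 0.0) + tau * v
    return tau

w = {}
tau = pa_update(w, {"a": 1.0}, {"b": 1.0}, rho=4, C=0.05)
```

With C = 0.05, the PA-I rule caps the step at C, so a single hard example cannot move the weights arbitrarily far. Setting every step size to 1 instead would recover a Perceptron-style update.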
Table 3: Feature Set 3: In Between POS Features and Surrounding Word POS Features
The features used in our system are described below.
• Tables 1 and 2 show our basic features.
These features are added for entire words as well as for the 5-gram prefix if the word is longer than 5 characters.
• In addition to the features shown in Table 1, the morphological information for each pair of words, p-word and c-word, is represented.
We also add the conjunction of the morphological information of p-word and c-word.
We do not use the LEMMA and CPOSTAG information in our feature set.
The morphological information is obtained from the FEAT information.
• Table 3 shows our Feature Set 3, which takes the form of a POS trigram: the POS of the parent, of the child, and of a word in between, for all words linearly between the parent and the child.
This feature was particularly helpful for nouns identifying their parent (McDonald et al., 2005a).
• Table 3 also depicts features that take the form of a POS 4-gram: the POS of the parent, child, word before/after parent and word before/after child.
The system also used backoff features with various trigrams where one of the local context POS tags was removed.
• All features are also conjoined with the direction of attachment, as well as the distance between the two words being attached.
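The in-between and surrounding-word POS features listed above can be sketched as follows. The feature-name strings, the padding tag `<none>`, the single backoff template and the function name are illustrative assumptions; the actual templates are those of Table 3.

```python
def pos_context_features(pos, head, child):
    """POS features for the edge head -> child; pos[0] is the root tag."""

    def tag(i):
        # Padding tag for positions outside the sentence (assumed convention).
        return pos[i] if 0 <= i < len(pos) else "<none>"

    lo, hi = min(head, child), max(head, child)
    p, c = pos[head], pos[child]
    feats = []
    # POS trigram: parent, a word in between, child.
    for b in range(lo + 1, hi):
        feats.append(f"btw:{p}|{pos[b]}|{c}")
    # POS 4-grams: parent, child, and the words before/after each of them.
    feats.append(f"ctx:{p}|{tag(head + 1)}|{tag(child - 1)}|{c}")
    feats.append(f"ctx:{tag(head - 1)}|{p}|{tag(child - 1)}|{c}")
    feats.append(f"ctx:{p}|{tag(head + 1)}|{c}|{tag(child + 1)}")
    feats.append(f"ctx:{tag(head - 1)}|{p}|{c}|{tag(child + 1)}")
    # One backoff trigram with a context tag removed (illustrative).
    feats.append(f"ctx-backoff:{p}|{tag(head + 1)}|{c}")
    # Conjoin every feature with attachment direction and distance.
    direction = "R" if head < child else "L"
    distance = hi - lo
    conjoined = [f"{f}&dir={direction}&dist={distance}" for f in feats]
    return feats + conjoined

pos_tags = ["ROOT", "NNP", "VBD", "DT", "NN"]
feats = pos_context_features(pos_tags, head=2, child=4)
```

Each returned string corresponds to one dimension of the binary feature vector of the edge, so the edge score is the sum of the weights of these strings.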
3 Experimental Results and Discussion
Table 4 shows the number of training and testing sentences for these languages.
The table shows that the Arabic data has the longest sentences and the smallest training set.
These factors might affect the accuracy of our proposed algorithm, as we will discuss later.
The training and testing were conducted on a Pentium IV at 4.3 GHz.
Detailed information about the data is given in the CONLL-2007 shared task.
We applied non-projective and projective parsing along with PA learning for the data in CONLL-2007.
Table 5 reports experimental results using the first-order decoding method, in which an MST parsing algorithm (McDonald et al., 2005b) is applied for non-projective parsing and Eisner's method is used for projective language data.
In fact, in our method we applied non-projective parsing for the Italian data, the Turkish data, and the Greek data.
This was because we did not have enough time to train on all the training data using both projective and non-projective parsing.
This is a general problem of discriminative learning methods when applied to a large set of training data.
In addition, to save training time we set the number of best trees k to 1 and the parameter C to 0.05.
Table 5 shows the comparison of the proposed method with the average, and three top systems on the CONLL-2007.
As a result, our method yields results above the average score on the CONLL-2007 shared task (Nivre et al., 2007).
Table 5 also indicates that the Basque results are lower than those for the other data.
We obtained a UA score of 69.11 and an LA score of 58.16.
These are far from the Top3 scores (81.13 and 75.49).
We checked the outputs on the Basque data to understand the main reason for the errors.
We see that our system's outputs usually disagree with the gold data at the labels "ncmod" and "ncsubj".
The main reason might be that applying projective parsing to this data in both training and testing is not suitable.
This was because the proportion of sentences with at least 1 non-projective relation in the data is large (26.1%).
The Arabic score is lower than the scores for the other data because of the following difficulties in our method.
Morphological and sentence length problems are the main factors which affect the accuracy of parsing Arabic data.
In addition, the small training size for Arabic is also a problem for obtaining a good result.
Furthermore, since our work was focused on improving the accuracy on English data, it might be unsuitable for other languages.
This is an imbalance problem in our method.
Table 4: The data used in the multilingual track (Nivre et al., 2007). Columns: Languages, Training size, Tokens size, tokens-per-sent.
NPR means non-projective relations.
AL-1-NPR means at least 1 non-projective relation.
Table 5 also shows the comparison of our system to the average score and the Top3 scores.
It shows that our system is accurate on English data, while it has low accuracy on Basque and Arabic data.
We also evaluate our models in the domain adaptation tasks.
This task is to adapt our model trained on Penn Treebank data to the test data in the biomedical domain.
The pchemtb-closed shared task (Marcus et al., 1993; Johansson and Nugues, 2007; Kulick et al., 2004) is used to illustrate our models.
We do not use any additional unlabeled data in the biomedical domain.
Only the training data in the Penn Treebank is used to train our model.
Afterward, we carefully selected a suitable parameter using the development test set.
After performing several experiments, we set the parameter C to 0.01 and selected non-projective parsing for testing, which gave the highest result on the development data.
After that, the trained model was used to test the data in the biomedical domain.
The score (UA=82.04; LA=79.50) shows that our method yields results above the average score (UA=76.42; LA=73.03).
In addition, it officially ranked 4th out of 12 teams and within 1.5% of the top systems.
The good result of applying our model to another domain suggests that PA learning seems sensitive to noise.
We hope that this problem is solved in future work.
4 Conclusions
This paper presents an online algorithm for the dependency parsing problem, which we have tested on various language data in the CONLL-2007 shared task.
The performance on English data is close to the Top3 score.
We also apply our algorithm to the domain adaptation task, in which we only focus on the training of the source data and select a suitable parameter using the development set.
The result is very good, as it is close to the Top3 score of the participating systems.
Future work will also focus on extending our method to a semi-supervised version that can be learned efficiently from labeled and unlabeled data.
We hope that the application of the PA algorithm to other NLP problems such as semantic parsing will be explored in future work.
Acknowledgments
We would like to thank D. Yuret for his help in checking errors in our parser's outputs.
We would like to thank Vinh-Van Nguyen for his help during the revision process and Mary Ann Mooradian for correcting the paper.
We would like to thank the anonymous reviewers for helpful discussions and comments on the manuscript.
Thanks also to Sebastian Riedel for checking the issues raised in the reviews.
The work on this paper was supported by a Monbukagakusho 21st COE Program.
