Three versions of the Covington algorithm for non-projective dependency parsing have been tested on the ten different languages for the Multilingual track of the CoNLL-X Shared Task.
The results were achieved using only information about heads and daughters as features to guide the parser, which obeys strict incrementality.
1 Introduction
In this paper we focus on two things.
First, we investigate the impact of using different flavours of Covington's algorithm (Covington, 2001) for non-projective dependency parsing on the ten different languages provided for the CoNLL-X Shared Task (Nivre et al., 2007).
Second, we test the performance of a pure grammar-based feature model in a strictly incremental fashion.
The grammar model relies only on the knowledge of heads and daughters of two given words, as well as the words themselves, in order to decide whether they can be linked with a certain dependency relation.
In addition, none of the three parsing algorithms guarantees that the output dependency graph will be projective.
2 Covington's algorithm(s)
In his (2001) paper, Covington presents a "fundamental" algorithm for dependency parsing, which he claims has been known since the 1960s but had, up to the publication of his paper, not been presented systematically in the literature.
We take three of its flavours, which enforce uniqueness (a.k.a. single-headedness) but do not observe projectivity.
The algorithms work one word at a time and attempt to build a connected dependency graph with only a single left-to-right pass through the input.
The three flavours are: Exhaustive Search Head First with Uniqueness (ESHU), Exhaustive Search Dependents First with Uniqueness (ESDU) and List-based Search with Uniqueness (LSU).
The yes/no function HEAD?(w1, w2) checks whether a word w1 can be a head of a word w2 according to a grammar G; it also respects the single-head and no-cycle conditions.
The LINK(w1, w2) procedure links word w1 as the head of word w2 with a dependency relation as proposed by G. When traversing Headlist and Wordlist we start with the last word added.
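The single-head and no-cycle conditions can be checked directly on the partial graph. The sketch below is our own illustration, not the authors' implementation; it assumes the partial graph is represented as a dictionary mapping each dependent to its head:

```python
def violates_constraints(w1, w2, heads):
    """Return True if linking w1 as head of w2 would break the
    single-head or no-cycle condition.

    heads: dict mapping each dependent word to its current head
    (the partially built dependency graph)."""
    # Single-head condition: w2 may not already have a head.
    if w2 in heads:
        return True
    # No-cycle condition: w2 must not be an ancestor of w1,
    # so walk up the head chain starting from w1.
    node = w1
    while node in heads:
        node = heads[node]
        if node == w2:
            return True
    return False
```

For instance, with the partial graph `{'d': 'e'}`, linking f as head of d is rejected because d already has a head, while linking f as head of a headless word is allowed.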
Nivre (2007) describes an optimized version of Covington's algorithm implemented in MaltParser (Nivre, 2006) with a running time of c(n² − n)/2 for an n-word sentence, where c is some constant time in which the LINK operation can be performed.
However, due to time constraints, we do not focus on this version of the algorithm here; see some preliminary remarks on it with respect to our parsing model in Section 6.
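As a quick sanity check on the quadratic bound, one can count the pairwise link attempts a single left-to-right pass makes when every new word is tested against each preceding word once (our illustration, not part of the original experiments):

```python
def link_attempts(n):
    """Number of HEAD? tests in one left-to-right pass when word i
    is tested once against each of its i preceding words."""
    return sum(i for i in range(n))  # 0 + 1 + ... + (n - 1)

# For an n-word sentence this equals (n^2 - n) / 2,
# matching the quadratic running time quoted above.
```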
[Figure omitted: pseudocode for the parsing algorithms, covering the foreach loop over the Wordlist, termination of the loop once a link is made, and the case where no head for W was found.]
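A minimal sketch of the single left-to-right pass is given below. This is our reconstruction in Python, not the authors' C# code, and it assumes "head first" means the parser first looks for a head for the new word before attaching it as head of earlier headless words; `can_head` is a stand-in for the HEAD?/LINK pair:

```python
def parse_eshu(words, can_head):
    """Exhaustive-search, head-first, left-to-right pass with
    uniqueness (ESHU), sketched.

    words:    the input sentence as a list of words
    can_head: callable (head, dependent, heads) -> label or None,
              standing in for the HEAD? test and the grammar G
    Returns a dict mapping each dependent to (head, label)."""
    wordlist = []   # words seen so far, most recent first
    heads = {}      # partial graph: dependent -> (head, label)
    for w in words:
        # 1. Head first: look for a head for W among earlier words.
        for prev in wordlist:
            label = can_head(prev, w, heads)
            if label is not None:
                heads[w] = (prev, label)
                break  # single-head: stop at the first head found
        # 2. Attach W as head of earlier words still lacking a head.
        for prev in wordlist:
            if prev not in heads:
                label = can_head(w, prev, heads)
                if label is not None:
                    heads[prev] = (w, label)
        # Most recent word first (cf. footnote 1: W is added only
        # after all tests on it have been completed).
        wordlist.insert(0, w)
    return heads
```

With a toy oracle that only lets a verb head a noun, `parse_eshu(['N1', 'V', 'N2'], ...)` attaches both nouns to the verb in a single pass.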
3 Classifier as an Instant Grammar
The HEAD? function in the algorithms presented in Section 2 requires an "instant grammar" (Covington, 2001) of some kind, which can tell the parser whether the two words under scrutiny can be linked and with what dependency relation.
To satisfy this requirement, we use TiMBL - a Memory-based learner (Daelemans et al., 2004) - as a classifier to predict the relation (if any) holding between the two words.
Building heavily on the ideas of History-based parsing (Black et al., 1993; Nivre, 2006), training the parser means essentially running the parsing algorithms in a learning mode on the data in order to gather training instances for the memory-based learner.
In a learning mode, the HEAD? function has access to a fully parsed dependency graph.
In the parsing mode, the HEAD? function in the algorithms issues a call to the classifier using features from the parsing history (i.e. a partially built dependency graph PG).
1 Covington adds W to the Wordlist as soon as it has been seen; we have, however, chosen to wait until after all tests have been completed.
The classifier then attempts to map this feature vector to one of the predefined classes.
These are all the dependency relations defined by the treebank, plus the class "NO" for the cases where no link between the two words is possible.
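The classifier call can be pictured as a nearest-neighbour lookup over stored training instances. The sketch below is a strong simplification of TiMBL's IB1 with k = 1: it uses a plain overlap count where the real learner uses feature weighting, and "NO" appears simply as one of the stored classes:

```python
def classify(features, memory):
    """Predict a dependency relation (or "NO") for a feature vector
    by 1-nearest-neighbour lookup with a plain overlap metric.

    memory: list of (feature_vector, class_label) training
    instances gathered while running the parser in learning mode."""
    best_label, best_overlap = "NO", -1
    for stored, label in memory:
        # Count positions where the two vectors agree.
        overlap = sum(a == b for a, b in zip(features, stored))
        if overlap > best_overlap:
            best_overlap, best_label = overlap, label
    return best_label
```

An unseen vector is thus labelled with the class of the most similar stored instance, which is the behaviour the parser relies on at HEAD? time.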
4 The Grammar model
The features used in our history-based model are restricted to the partially built graph PG.
We call this model a pure grammar-based model since the only information the parsing algorithms have at their disposal is extracted from the graph, such as the head and daughters of the current word.
Preceding words not included in the PG as well as words following the current word are not available to the algorithm.
In this respect such a model is very restrictive and suffers from the pitfalls of incremental processing (Nivre, 2004).
The motivation for the chosen model was to approximate a Data Oriented Parsing (DOP) model (e.g. Bod et al., 2003) for Dependency Grammar.
Under DOP, analyses of new sentences are produced by combining previously seen tree fragments.
However, the tree fragments under the original DOP model are static, i.e. we have a corpus of all possible subtrees derived from a treebank.
Under our approach, these tree fragments are built dynamically, as we try to parse the sentence.
Because of the chosen DOP approximation, we have not included information about the preceding and following words of the two words to be linked in our feature model.
To exemplify our approach, (1) shows a partially built graph with all the words encountered so far, and Fig. 1 shows two examples of the tree-building operations for linking words f and d, and f and a.
(1) a b c d e f ...
Given two words i and j to be linked with a dependency relation, such that word j precedes word i, the following features describe the models on which the algorithms have been trained and tested:
ds(i) means any two daughters (if available) of word i, h(i/j) refers to the head of word i or word j, depending on the direction of applying the HEAD? function (see Fig 1) and h(h(i/j)) stands for the head of the head of word i or word j.
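Under these definitions, assembling a feature vector from PG amounts to simple lookups. The sketch below is our own illustration (one direction of h(i/j) only; the names, missing-value encoding and feature ordering are our assumptions, not the paper's exact layout):

```python
def extract_features(i, j, heads, daughters):
    """Build an illustrative feature vector for linking words i and j.

    heads:     dict dependent -> head in the partial graph PG
    daughters: dict head -> list of dependents in PG
    Missing values are encoded as "NONE" (our convention)."""
    def h(w):
        # Head of w in PG, if any.
        return heads.get(w, "NONE")

    def ds(w):
        # Up to two daughters of w, padded with "NONE".
        d = daughters.get(w, [])
        return (list(d) + ["NONE", "NONE"])[:2]

    # Words themselves, their daughters, then h(j) and h(h(j)).
    return [i, j] + ds(i) + ds(j) + [h(j), heads.get(h(j), "NONE")]
```

For the partial graph in which e heads d, linking f and e yields a vector whose only non-empty graph feature is the daughter d of e.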
The basic model, which was used for the largest training data sets of Czech and Chinese, includes only the first four features in every category.
A larger model used for the datasets of Catalan and Hungarian adds the h(j/i) feature from every category.
The enhanced model used for Arabic, Basque, English, Greek, Italian and Turkish uses the full set of features.
This tripartite division of models was motivated only by time- and resource-constraints.
The simplest model, used for Chinese, has only 5 features, while the enhanced model for Arabic, for example, uses a total of 39 features.
5 Results and Setup
Table 1 summarizes the results of testing the three algorithms on the ten different languages.
The parser was written in C#. Training and testing were performed on a Mac OS X 10.4.9 machine with a 2 GHz Intel Core 2 Duo processor and 1 GB of memory, and on a Dell Dimension with a 2.80 GHz Pentium 4 processor and 1 GB of memory running Mepis Linux.
TiMBL was run in client-server mode with default settings (IB1 learning algorithm, extrapolation from the most similar example, i.e. k = 1), initiated with the command "Timbl -S <portnumber> -f <training_file>".
Table 1: Test results for the 10 languages.
LA is the Labelled Attachment Score and UA is the Unlabelled Attachment Score.
Additionally, we attempted to use Support Vector Machines (SVM) as an alternative classifier.
However, due to the long training time, results from using SVMs are not included here, although training an SVM classifier for some of the languages has been started.
6 Discussion
Before we attempt a discussion of the results presented in Table 1, we give a short summary of the basic word-order typology of these languages according to Greenberg (1963).
Table 2 shows whether the languages are SVO (subject-verb-object) or SOV (subject-object-verb), or VSO (verb-subject-object); contain Pr (prepositions) or Po (postpositions); NG (noun precedes genitive) or GN (genitive precedes noun); AN (adjective precedes noun) or NA (noun precedes adjective).
2 Greenberg gave varying classifications for the word-order typology of English.
However, we trusted our own intuition as well as the hint of one of the reviewers.
Table 2: Basic word order typology of the ten languages following Greenberg's Universals
Looking at the data in Table 1, several observations can be made.
One is the different performance of languages from the same language family, i.e. Italian, Greek and Catalan.
However, the head-first (ESHU) algorithm performed better than the dependents-first (ESDU) one in all of these languages.
The SOV languages like Hungarian, Basque and Turkish showed a preference for the dependents-first algorithms (ESDU and LSU).
The ESDU algorithm also fared better with the SVO languages, except for Italian.
However, Greenberg's basic word-order typology cannot shed enough light on the performance of the three parsing algorithms.
One question that pops up immediately is whether a different feature-model using the same parsing algorithms would achieve similar results.
Can the different performance be attributed to the treebank annotation?
Would another classifier fare better than the Memory-based one?
These questions remain for future research though.
Finally, for the Basque data we attempted to test the optimized version of the Covington algorithm (Nivre, 2007) against the three other versions discussed here.
Additionally, since our feature vectors differed from those described in (Nivre, 2007), head-dependent-features vs. j-i-features, we changed them so that all the four algorithms send a similar feature vector, j-i-features, to the classifier.
The preliminary result was that Nivre's version was the fastest, with fewer calls to the LINK procedure and with the smallest training data-set.
However, all four algorithms showed a decrease of about 20% in LA/UA scores.
Our first intuition about the results from the tests done on all 10 languages was that the classification task suffered from a highly skewed class distribution, since the training instances that correspond to a dependency relation are largely outnumbered by instances of the class "NO".
The proportion of positive instances was low, and we expected the classifier to be able to predict more of the required links.
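The skew itself is easy to estimate with back-of-the-envelope arithmetic (our illustration): a pass that considers every word pair of an n-word sentence generates up to n(n − 1)/2 candidate instances, of which at most n − 1 carry a dependency relation; everything else trains the class "NO".

```python
def positive_fraction(n):
    """Fraction of candidate pairs that carry a dependency relation
    for an n-word sentence with n - 1 links and n(n-1)/2 pairs."""
    pairs = n * (n - 1) // 2
    positives = n - 1
    return positives / pairs

# For a 20-word sentence only 10% of the training instances
# are positive; the fraction shrinks as sentences get longer.
```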
However, the results we obtained from additional optimizations performed on Hungarian, following a recommendation from the anonymous reviewers, may lead to a different conclusion: the chosen grammar model, relying only on connecting dynamically built partial dependency graphs, is insufficient to take us over a certain threshold.
7 Conclusion
In this paper we showed the performance of three flavours of Covington's algorithm for non-projective dependency parsing on the ten languages provided for the CoNLL-X Shared Task (Nivre et al., 2007).
The experiment showed that, given the grammar model we have adopted, it does matter which version of the algorithm one uses.
The chosen model, however, showed poor performance and suffered from two major flaws: the use of only partially built graphs and the purely incremental processing.
It remains to be seen how these parsing algorithms will perform in a parser with a much richer feature model, and whether it is worth using different flavours when parsing different languages, or whether the differences among them are insignificant.
Acknowledgements
We would like to thank the two anonymous reviewers for their valuable comments.
We are grateful to Joakim Nivre for discussion on the Covington algorithm, Bertjan Busser for help with TiMBL, Antal van den Bosch for help with paramsearch, Matthew Johnson for providing the necessary functionality in his .NET implementation of SVM, and Patrycja Jablonska for discussion on Greenberg's Universals.
