Following (Blitzer et al., 2006), we present an application of structural correspondence learning to non-projective dependency parsing (McDonald et al., 2005).
To induce the correspondences among dependency edges from different domains, we looked at every two tokens in a sentence and examined whether or not there is a preposition, a determiner or a helping verb between them.
Three binary linear classifiers were trained on unlabeled data to predict the existence of a preposition, a determiner or a helping verb, and we used singular value decomposition to induce new features.
The parser was then trained with these induced features in addition to those described in (McDonald et al., 2005).
We discriminatively trained our parser in an on-line fashion using a variant of the voted perceptron (Collins, 2002; Collins and Roark, 2004; Crammer and Singer, 2003).
1 Introduction
We have recently seen growing popularity of dependency parsing.
It is no longer rare to see dependency relations used as features, in tasks such as machine translation (Ding and Palmer, 2005) and relation extraction (Bunescu and Mooney, 2005).
However, there is one factor that prevents the wider use of dependency parsing: the sparseness of annotated corpora outside the Wall Street Journal.
In many situations we need to parse sentences from a target domain with no labeled data, whose distribution differs from that of a source domain where plentiful labeled training data is available.
In this paper, we investigate the effectiveness of structural correspondence learning (SCL) (Blitzer et al., 2006) in the domain adaptation track of the CoNLL 2007 shared task.
Blitzer et al. (2006) focus on using unlabeled data from both the source and target domains to learn a common feature representation that is meaningful across both domains. They hypothesize that a model trained in the source domain using this common feature representation will generalize better to the target domain.
The paper is structured as follows: in section 2, we review the decoding and learning aspects of (McDonald et al., 2005); in section 3, we describe structural correspondence learning applied to dependency parsing; and in section 4, we describe the experiments and the features needed for the CoNLL 2007 shared task.
2 Non-Projective Dependency Parsing
2.1 Dependency Structure
Let us define x to be a generic sequence of input tokens together with their POS tags and other morphological features, and y to be a generic dependency structure, that is, a set of edges for x.
A labeled edge is a tuple (DEPREL, i → j) where i is the start point of the edge, j is the end point, and DEPREL is the label of the edge.
The token at i is the head of the token at j.
Table 1 shows our formulation of a structured prediction problem.
Given x, the input tokens and their features (columns 2 and 3, Table 1), the task is to predict y, the set of labeled edges (column 4, Table 1).
Table 1: Example Edges (column 4: Labeled Edge)
In this paper we use the common method of factoring the score of the dependency structure as the sum of the scores of all the labeled edges.
A dependency structure is characterized by its labeled edges, and for each labeled edge, we have features and corresponding weights.
The score of a dependency structure is the sum of these weights.
In the upcoming section, we explain a decoding algorithm for the dependency structures, and later we give a method for learning the weight vector used in the decoding.
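The edge-factored scoring described above can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: the feature templates in `edge_features`, the dictionary-based weight vector, and the use of index 0 as an artificial root are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# A labeled edge (DEPREL, i, j): i is the head position, j is the child position.
Edge = Tuple[str, int, int]


def edge_features(tokens: List[str], edge: Edge) -> List[str]:
    """Hypothetical binary feature templates that fire on one labeled edge."""
    label, i, j = edge
    return [
        f"label={label}",
        f"head={tokens[i]}|child={tokens[j]}",
        f"label={label}|head={tokens[i]}",
    ]


def edge_score(weights: Dict[str, float], tokens: List[str], edge: Edge) -> float:
    """Score of one edge: the sum of the weights of its active features."""
    return sum(weights.get(f, 0.0) for f in edge_features(tokens, edge))


def structure_score(weights: Dict[str, float], tokens: List[str],
                    edges: List[Edge]) -> float:
    """Score of a dependency structure: the sum of its edge scores."""
    return sum(edge_score(weights, tokens, e) for e in edges)
```

Because the score decomposes over edges, the decoder only needs a table of edge scores and never has to score whole trees directly.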
2.2 Maximum Spanning Tree Algorithm
As in (McDonald et al., 2005), we use the Chu-Liu-Edmonds (CLE) algorithm (Chu and Liu, 1965; Edmonds, 1967) for decoding.
CLE finds the Maximum Spanning Tree in a directed graph.
The following is a summary given in (McDonald et al., 2005).
Informally, the algorithm has each vertex in the graph greedily select the incoming edge with highest weight.
Note that the edge is coming from the parent to the child.
That is, given a child node word_j, we find the parent, or head, word_i such that the edge (i, j) has the highest weight among all i, i ≠ j.
If a tree results, then this must be the maximum spanning tree.
If not, there must be a cycle.
The procedure identifies a cycle and contracts it into a single vertex and recalculates edge weights going into and out of the cycle.
It can be shown that a maximum spanning tree on the contracted graph is equivalent to a maximum spanning tree in the original graph (Leonidas, 2003).
Hence the algorithm can recursively call itself on the new graph.
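The first two steps of the summary above, greedy head selection and cycle detection, can be sketched as below. This is only a partial illustration under stated assumptions: `scores[i][j]` is assumed to hold the weight of the edge from head i to child j, node 0 is assumed to be an artificial root, and the contraction and recursion steps of the full CLE algorithm are deliberately omitted.

```python
from typing import Dict, List, Optional


def greedy_heads(scores: List[List[float]]) -> Dict[int, int]:
    """For every non-root node j, pick the head i != j with the highest edge weight."""
    n = len(scores)
    heads = {}
    for j in range(1, n):  # node 0 is the assumed root and receives no head
        candidates = [i for i in range(n) if i != j]
        heads[j] = max(candidates, key=lambda i: scores[i][j])
    return heads


def find_cycle(heads: Dict[int, int]) -> Optional[List[int]]:
    """Return the nodes of one cycle in the head assignment, or None if it is a tree."""
    for start in heads:
        path = []
        node = start
        # Follow head pointers until we reach the root or revisit a node.
        while node in heads and node not in path:
            path.append(node)
            node = heads[node]
        if node in path:
            return path[path.index(node):]
    return None
```

If `find_cycle` returns `None`, the greedy choice is already the maximum spanning tree; otherwise the returned nodes are the ones a full implementation would contract into a single vertex.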
structured prediction.
In short, the update is executed when the decoder fails to predict the correct parse, and we compare the correct parse y* and the incorrect parse y' suggested by the decoding algorithm.
The weights of the features in y' will be lowered, and the weights of the features in y* will be increased accordingly.
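The update rule described above can be illustrated with a plain structured perceptron step. Note that this sketch swaps in the simplest form of the update, not the specific variant used for our parser; the function name, the list-of-feature-names representation, and the `learning_rate` parameter are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List


def perceptron_update(weights: Dict[str, float],
                      gold_features: List[str],
                      predicted_features: List[str],
                      learning_rate: float = 1.0) -> None:
    """Update weights in place after a decoding mistake.

    gold_features: features active in the correct parse y*.
    predicted_features: features active in the incorrect parse y'.
    """
    # Net count per feature: positive if it occurs more often in y* than in y'.
    delta = Counter(gold_features)
    delta.subtract(Counter(predicted_features))
    for feature, diff in delta.items():
        if diff != 0:
            weights[feature] = weights.get(feature, 0.0) + learning_rate * diff
```

Features shared equally by both parses cancel out, so only the edges on which the two parses disagree change the weight vector.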
3 Domain Adaptation
Following (Blitzer et al., 2006), we present an application of structural correspondence learning (SCL) to non-projective dependency parsing (McDonald et al., 2005).
SCL is a method for adapting a classifier learned in a source domain to a target domain.
We assume that both domains have unlabeled data, but only the source domain has labeled training data.
SCL works as follows:
1. Define a set of pivot features on the unlabeled data from both domains.
2. Use these pivot features to learn a mapping from the original feature spaces of both domains to a shared, low-dimensional real-valued feature space. A high inner product in this new space indicates a high degree of correspondence.
3. Use both the transformed and original features from the source domain to train the classifier.
4. Again using both the transformed and original features, test the samples from the target domain.
If we learned a good mapping, then the effectiveness of the classifier in the source domain should transfer to the target domain.
To induce the correspondences among dependency edges in the source domain and the target domain, we looked at every two tokens in a sentence and examined whether or not there is a preposition, a determiner or a helping verb between them.
Although no edge is present in unlabeled data, the presence of a preposition indicates that an edge between the tokens, if it existed, would not be a noun modifier (in the English corpus, this label is NMOD).
Thus, this induced feature should correlate with the label of an edge candidate.
We postulate that the label of an edge candidate, if known, may allow the supervised learner to choose the correct edge among the edge candidates in the target domain.
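The pivot observation for a token pair, whether a preposition, a determiner or a helping verb occurs between the two tokens, could be computed along the following lines. This is a sketch under assumptions: the closed-class word lists are illustrative and incomplete, and identifying the three categories by word lists at all is our assumption for the example, motivated only by the absence of POS tags in unlabeled data.

```python
from itertools import combinations
from typing import Dict, List, Tuple

# Assumed closed-class word lists (illustrative, not exhaustive).
PREPOSITIONS = {"of", "in", "on", "at", "by", "for", "with", "from"}
DETERMINERS = {"the", "a", "an", "this", "that", "these", "those"}
HELPING_VERBS = {"is", "are", "was", "were", "be", "been",
                 "has", "have", "had", "will", "can", "may"}


def pivot_labels(tokens: List[str]) -> Dict[Tuple[int, int], Dict[str, bool]]:
    """For every token pair (i, j) with i < j, record which pivots occur strictly between them."""
    labels = {}
    for i, j in combinations(range(len(tokens)), 2):
        between = {t.lower() for t in tokens[i + 1:j]}
        labels[(i, j)] = {
            "prep": bool(between & PREPOSITIONS),
            "det": bool(between & DETERMINERS),
            "hv": bool(between & HELPING_VERBS),
        }
    return labels
```

Each entry supplies the binary targets for the three pivot predictors without requiring any dependency annotation.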
In the first step, we chose the presence of a preposition, a determiner or a helping verb between tokens as pivot features.
We then trained three binary linear classifiers on unlabeled data to predict the existence of a preposition (prep), a determiner (det) and a helping verb (hv), and obtained a weight vector for each classifier.
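$$\mathrm{classifier}_{prep}(e) = \mathrm{sgn}(w_{prep} \cdot \phi(e)), \quad \mathrm{classifier}_{det}(e) = \mathrm{sgn}(w_{det} \cdot \phi(e)), \quad \mathrm{classifier}_{hv}(e) = \mathrm{sgn}(w_{hv} \cdot \phi(e))$$
The input to the above classifiers is an edge e instead of a whole sentence x, and $\phi$ is a mapping from an edge to a feature vector.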
Since POS tags were not available in unlabeled data, for pivot predictors, we took the subset of the features given by an edge.
The features for pivot predictors are listed in Table 2.
The remainder of the features are the same as those used in (McDonald et al., 2005).
Using each weight vector as a column, we created a weight matrix
$$W = [\, w_{prep} \mid w_{det} \mid w_{hv} \,]$$
and ran a singular value decomposition to induce a lower dimensional feature space:
$$W = U \Sigma V^T.$$
We then took the transpose of the resulting unitary matrix, $U^T$, which maps the original data to the space spanned by the principal components, and applied it to the feature vector of every potential edge.
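The original feature vector is
$$f = \begin{pmatrix} f_{subset} \\ f_{remainder} \end{pmatrix}.$$
We augment the feature vector with the additional features induced by $U^T$:
$$\begin{pmatrix} f_{subset} \\ f_{remainder} \\ U^T f_{subset} \end{pmatrix}.$$
The augmented feature vectors were used throughout the training and testing of the dependency parser.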
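The projection and augmentation steps can be sketched with NumPy as follows. This is a minimal illustration rather than our C++ implementation: the function names, the dense array representation, and the use of the thin SVD (`full_matrices=False`) are assumptions made for the example.

```python
import numpy as np


def learn_projection(w_prep: np.ndarray, w_det: np.ndarray,
                     w_hv: np.ndarray) -> np.ndarray:
    """Stack the pivot predictor weights as columns of W and return U^T from the SVD of W."""
    W = np.column_stack([w_prep, w_det, w_hv])        # shape (d, 3)
    U, _, _ = np.linalg.svd(W, full_matrices=False)   # U has shape (d, min(d, 3))
    return U.T


def augment(f_subset: np.ndarray, f_remainder: np.ndarray,
            projection: np.ndarray) -> np.ndarray:
    """Append the induced features U^T f_subset to the original feature vector."""
    induced = projection @ f_subset
    return np.concatenate([f_subset, f_remainder, induced])
```

With three pivot predictors, W has rank at most three, so each edge gains at most three induced real-valued features.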
4 Experiments
Our experiments were conducted on the CoNLL-2007 shared task domain adaptation track (Nivre et al., 2007) using treebanks (Marcus et al., 1993; Johansson and Nugues, 2007; Kulick et al., 2004).
Table 2: Binary Features for Pivot Predictors
4.1 Dependency Relation
The CLE algorithm works on a directed graph with unlabeled edges.
Since the CoNLL shared task requires the labeling of edges, as a preprocessing stage, we created a directed complete graph.
Then we labeled each edge with the highest scoring dependency relation.
This complete graph was given to the CLE algorithm and the edge labels were never altered in the course of finding the maximum spanning tree.
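The preprocessing stage described above can be sketched as follows. The sketch is illustrative only: the `score(label, i, j)` callback, the dictionary-based graph, and the treatment of index 0 as an artificial root with no incoming edges are assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple


def build_labeled_graph(
    n_tokens: int,
    labels: List[str],
    score: Callable[[str, int, int], float],
) -> Dict[Tuple[int, int], Tuple[str, float]]:
    """Build a directed complete graph and keep only the best label for each edge.

    Returns a mapping (i, j) -> (best_label, best_score), where i is the head
    position and j is the child position.
    """
    graph = {}
    for i in range(n_tokens):
        for j in range(1, n_tokens):  # index 0 is the assumed root: no incoming edges
            if i == j:
                continue
            best = max(labels, key=lambda lab: score(lab, i, j))
            graph[(i, j)] = (best, score(best, i, j))
    return graph
```

The resulting graph has exactly one weight per directed edge, so it can be handed to a maximum spanning tree decoder that knows nothing about labels.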
The features we used for pivot predictors to classify each edge (DEPREL, i, j) are shown in Table 2.
The index i is the position of the parent and j is that of the child.
word_j = the word token at position j. pos_j = the coarse part-of-speech at j.
No other features were used beyond the combinations of the word token in Table 2.
The hardware used was an Intel CPU at 3.0 GHz with 32 GB of memory, and the software was written in C++. Although more iterations should help, time constraints prevented us from completing more training.
The parser required a few days to train.
5 Results
Unfortunately, we discovered a bug in our code after submitting our results for the blind tests, and the results reported in (Nivre et al., 2007) were not representative of our approach.
The current results (closed class) are shown in Table 3.
For explanations of Labeled Attachment Score, Unlabeled Attachment Score and Label Accuracy, readers are referred to the shared task introductory paper (Nivre et al., 2007).
WSJ represents the application of the parser without SCL to the source domain test set, and WSJ-SCL the parser with SCL to the same test set.
Similarly, Chem and Chem-SCL represent the application of the parser without SCL and with SCL, respectively, to the target domain test set.
Table 3: Labeled Attachment Score, Unlabeled Attachment Score and Label Accuracy (columns: Domain, LAS, UAS, Label Accuracy)
We did batch learning by running the online algorithm 4 times.
An arrow (→) indicates how the results after the 2nd iteration changed by the end of the 4th iteration.
Contrary to our expectations, we seem to see SCL overfitting to the source domain WSJ in this experiment.
Due to the lack of POS tags in unlabeled data, our feature set for pivot predictors uses tokens extensively, unlike the feature set for the dependency parser.
Since tokens are not as abstract as POS tags, we suspect induced features may have caused overfitting.
6 Conclusion
We presented an application of structural correspondence learning to non-projective dependency parsing.
The effectiveness of SCL for domain adaptation is mixed in this experiment, perhaps due to the mismatch between feature sets.
Future work includes the use of more sophisticated features such as POS tags and other morphological features, possibly a joint domain adaptation of POS tagging and dependency parsing for unlabeled data, as well as a re-examination of the pivot features.
