We describe some challenges of adaptation in the 2007 CoNLL Shared Task on Domain Adaptation.
Our error analysis for this task suggests that a primary source of error is differences in annotation guidelines between treebanks.
Our suspicions are supported by the observation that no team was able to improve target domain performance substantially over a state of the art baseline.
1 Introduction
Dependency parsing, an important NLP task, can be done with high levels of accuracy.
However, adapting parsers to new domains without target domain labeled training data remains an open problem.
This paper outlines our participation in the 2007 CoNLL Shared Task on Domain Adaptation (Nivre et al., 2007).
The goal was to adapt a parser trained on a single source domain to a new target domain using only unlabeled data.
We were given around 15K sentences of labeled text from the Wall Street Journal (WSJ) (Marcus et al., 1993; Johansson and Nugues, 2007) as well as 200K unlabeled sentences.
The development data was 200 sentences of labeled biomedical oncology text (BIO, the ONCO portion of the Penn Biomedical Treebank), as well as 200K unlabeled sentences (Kulick et al., 2004).
The two test domains were a collection of medline chemistry abstracts (pchem, the CYP portion of the Penn Biomedical Treebank) and the Child Language Data Exchange System corpus (CHILDES) (MacWhin-ney, 2000; Brown, 1973).
We used the second order two stage parser and edge labeler of McDonald et al. (2006), which achieved top results in the 2006
CoNLL-X shared task.
Preliminary experiments indicated that the edge labeler was fairly robust to domain adaptation, lowering accuracy by 3% in the development domain as opposed to 2% in the source, so we focused on unlabeled dependency parsing.
Our system did well, officially coming in 3rd place out of 12 teams and within 1% of the top system (Table 1).
1 In unlabeled parsing, we scored 1st and 2nd on CHILDES and pchem respectively.
However, our results were obtained without adaptation.
Given our position in the ranking, this suggests that no team was able to significantly improve performance on either test domain beyond that of a state-of-the-art parser.
After much effort in developing adaptation methods, it is critical to understand the causes of these negative results.
In what follows, we provide an error analysis that attributes domain loss for this task to a difference in annotation guidelines between domains.
We then overview our attempts to improve adaptation.
While we were able to show limited adaptation on reduced training data or with first-order features, no modifications improved parsing with all the training data and second-order features.
2 Parsing Challenges
We begin with an error analysis for adaptation between WSJ and BIO.
We divided the available WSJ data into a train and test set, trained a parser on the train set and compared errors on the test set and BIO.
Accuracy dropped from 90% on WSJ to 84% on BIO.
We then computed the fraction of errors involving each POS tag.
For the most common
1While only 8 teams participated in the closed track with us, our score beat all of the teams in the open track.
pchem ul
childes ul
Table 1: Official labeled (1) and other unlabeled (ul) submitted results for the two test domains (pchem and childes) and development data accuracy (bio).
The parser was trained on the provided WSJ data.
POS types, the loss (difference in source and target error) was: verbs (2%), conjunctions (5%), digits (23%), prepositions (4%), adjectives (3%), determiners (4%) and nouns (9%).
2 Two POS types stand out: digits and nouns.
Digits are less than 4% of the tokens in BIO.
Errors result from the BIO annotations for long sequences of digits which do not appear in WSJ.
Since these annotations are new with respect to the WSJ guidelines, it is impossible to parse these without injecting knowledge of the annotation guidelines.
3 Nouns are far more common, comprising 33% of BIO and 30% of WSJ tokens, the most popular POS tag by far.
Additionally, other POS types listed above (adjectives, prepositions, determiners, conjunctions) often attach to nouns.
To confirm that nouns were problematic, we modified a first-order parser (no second order features) by adding a feature indicating correct noun-noun edges, forcing the parser to predict these edges correctly.
Adaptation performance rose on BIO from 78% without the feature to 87% with the feature.
This indicates that most of the loss comes from missing these edges.
The primary problem for nouns is the difference between structures in each domain.
The annotation guidelines for the Penn Treebank flattened noun phrases to simplify annotation (Marcus et al., 1993), so there is no complex structure to NPs.
Kubler (2006) showed that it is difficult to compare the Penn Treebank to other treebanks with more complex noun structures, such as BIO.
Consider the WSJ phrase "the New York State Insurance Department".
The annotation indicates a flat structure, where ev-
2We measured these drops on several other dependency parsers and found similar results.
ery token is headed by "Department".
In contrast, a similar BIO phrase has a very different structure, pursuant to the BIO guidelines.
For "the detoxi-cation enzyme glutathione transferase P1-1", "enzyme" is the head of the NP, "P1-1" is the head of "transferase", and "transferase" is the head of "glu-tathione".
Since the guidelines differ, we observe no corresponding structure in the WSJ.
It is telling that the parser labels this BIO example by attaching every token to the inal proper noun "P1-1", exactly as the WSJ guidelines indicate.
Unlabeled data cannot indicate that BIO uses a different standard.
Another problem concerns appositives.
For example, the phrase "Howard Mosher, president and chief executive oficer," has "Mosher" as the head of "Howard" and of the appositive NP delimited by commas.
While similar constructions occur in BIO, there are no commas to indicate this.
An example is the above BIO NP, in which the phrase "glutathione transferase P1-1" is an appositive indicating which "enzyme" is meant.
However, since there are no commas, the parser thinks "P1-1" is the head.
However, there are not many right to left attaching nouns.
In addition to a change in the annotation guidelines for NPs, we observed an important difference in the distribution of POS tags.
NN tags were almost twice as likely in the BIO domain (14% in WSJ and 25% in BIO).
NNP tags, which are close to 10% of the tags in WSJ, are nonexistent in BIO (.
24%).
The cause for this is clear when the annotation guidelines are considered.
The proper nouns in WSJ are names of companies, people and places, while in BIO they are names of genes, proteins and chemicals.
However, for BIO these nouns are labeled NN instead of NNP.
This decision effectively removes NNP from the BIO domain and renders all features that depend on the NNP tag ineffective.
In our above BIO NP example, all nouns are labeled NN, whereas the WSJ example contains NNP tags.
The largest tri-gram differences involve nouns, such as NN-NN-NN, NNP-NNP-NNP, NN-IN-NN, and IN-NN-NN.
However, when we examine the coarse POS tags, which do not distinguish between nouns, these differences disappear.
This indicates that while the overall distribution of POS tags is similar between the domains, the ine grained tags differ.
These ine grained tags provide more information than coarse tags; experiments that removed ine grained tags
hurt WSJ performance but did not affect BIO.
Finally, we examined the effect of unknown words.
Not surprisingly, the most signiicant differences in error rates concerned dependencies between words of which one or both were unknown to the parser.
For two words that were seen in the training data loss was 4%, for a single unknown word loss was 15%, and 26% when bothwords were unknown.
Both words were unknown only 5% of the time in BIO, while one of the words being unknown was more common, reflecting 27% of decisions.
Upon further investigation, the majority of unknown words were nouns, which indicates that unknown word errors were caused by the problems discussed above.
Recent theoretical work on domain adaptation (Ben-David et al., 2006) attributes adaptation loss to two sources: the difference in the distribution between domains and the difference in labeling functions.
Adaptation techniques focus on the former since it is impossible to determine the latter without knowledge of the labeling function.
In parsing adaptation, the former corresponds to a difference between the features seen in each domain, such as new words in the target domain.
The decision function corresponds to differences between annotation guidelines between two domains.
Our error analysis suggests that the primary cause of loss from adaptation is from differences in the annotation guidelines themselves.
Therefore, signiicant improvements cannot be made without speciic knowledge of the target domain's annotation standards.
No amount of source training data can help if no relevant structure exists in the data.
Given the results for the domain adaptation track, it appears no team successfully adapted a state-of-the-art parser.
3 Adaptation Approaches
We survey the main approaches we explored for this task.
While some of these approaches provided a modest performance boost to a simple parser (limited data and first-order features), no method added any performance to our best parser (all data and second-order features).
A natural approach to improving parsing is to modify the feature set, both by removing features less likely to transfer and by adding features that are more likely to transfer.
We began with the irst approach and removed a large number of features that we believed transfered poorly, such as most features for noun-noun edges.
We obtained a small improvement in BIO performance on limited data only.
We then added several different types offeatures, specifically designed to improve noun phrase constructions, such as features based on the lexical position of nouns (common position in NPs), frequency of occurrence, and NP chunking information.
For example, trained on in-domain data, nouns that occur more often tend to be heads.
However, none of these features transfered between domains.
A inal type of feature we added was based on the behavior of nouns, adjectives and verbs in each domain.
We constructed a feature representation of words based on adjacent POS and words and clustered words using an algorithm similar to that of Saul and Pereira (1997).
For example, our clustering algorithm grouped irst names in one group and measurements in another.
We then added the cluster membership as a lexical feature to the parser.
None of the resulting features helped adaptation.
Training diversity may be an effective source for adaptation.
We began by adding information from multiple different parsers, which has been shown to improve in-domain parsing.
We added features indicating when an edge was predicted by another parser and if an edge crossed a predicted edge, as well as conjunctions with edge types.
This failed to improve BIO accuracy since these features were less reliable at test time.
Next, we tried instance bagging (Breiman, 1996) to generate some diversity among parsers.
We selected with replacement 2000 training examples from the training data and trained three parsers.
Each parser then tagged the remaining 13K sentences, yielding 39K parsed sentences.
We then shuffled these sentences and trained a final parser.
This failed to improve performance, possibly because of conflicting annotations or because of lack of sufficient diversity.
To address conflicting annota-
tions, we added slack variables to the MIRA learning algorithm (Crammer et al., 2006) used to train the parsers, without success.
We measured diversity by comparing the parses of each model.
The difference in annotation agreement between the three instance bagging parsers was about half the difference between these parsers and the gold annotations.
While we believe this is not enough diversity, it was not feasible to repeat our experiment with a large number of parsers.
3.3 Target Focused Learning
Another approach to adaptation is to favor training examples that are similar to the target.
We irst mod-iied the weight given by the parser to each training sentence based on the similarity of the sentence to target domain sentences.
This can be done by modifying the loss to limit updates in cases where the sentence does not reflect the target domain.
We tried a number of criteria to weigh sentences without success, including sentence length and number of verbs.
Next, we trained a discriminative model on the provided unlabeled data to predict the domain of each sentence based on POS n-grams in the sentence.
Training sentences with a higher probability of being in the target domain received higher weights, also without success.
Further experiments showed that any decrease in training data hurt parser performance.
It would seem that the parser has no dif-iculty learning important training sentences in the presence of unimportant training examples.
A related idea focused on words, weighing highly tokens that appeared frequently in the target domain.
We scaled the loss associated with a token by a factor proportional to its frequency in the target domain.
We found certain scaling techniques obtained tiny improvements on the target domain that, while signiicant compared to competition results, are not statistically signiicant.
We also attempted a similar approach on the feature level.
A very predictive source domain feature is not useful if it does not appear in the target domain.
However, limiting the feature space to target domain features had no effect.
Instead, we scaled each feature's value by a factor proportional to its frequency in the target domain and trained the parser on these scaled feature values.
We obtained small improvements on small amounts of training data.
4 Future Directions
Given our pessimistic analysis and the long list of failed methods, one may wonder if parser adaptation is possible at all.
We believe that it is.
First, there may be room for adaptation with our domains if a common annotation scheme is used.
Second, we have stressed that typical adaptation, modifying a model trained on the source domain, will fail but there may be unsupervised parsing techniques that improve performance after adaptation, such as a rule based NP parser for BIO based on knowledge of the annotations.
However, this approach is unsatisfying as it does not allow general purpose adaptation.
5 Acknowledgments
We thank Joel Wallenberg and Nikhil Dinesh for their informative and helpful linguistic expertise, Kevin Lerman for his edge labeler code, and Koby Crammer for helpful conversations.
Dredze is supported by a NDSEG fellowship; Ganchev and Taluk-
DARPA under Contract No. NBCHD03001.
Any opinions, indings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the DARPA or the Department of Interior-National
Business Center (DOI-NBC).
References
Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira.
Analysis of representations for domain adaptation.
In NIPS.
Leo Breiman.
1996.
Bagging predictors.
Machine Learning, 24(2):123-140.
R. Brown.
1973.
A First Language: The Early Stages.
Harvard University Press.
Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer.
Online passive-aggressive algorithms.
Journal of Machine Learning Research, 7:551-585, Mar.
R. Johansson and P. Nugues.
2007.
Extended constituent-to-dependency conversion for English.
In Proc. of the 16th Nordic Conference on Computational Linguistics (NODALIDA).
Sandra Kubler.
How do treebank annotation schemes influence parsing results? or how not to compare apples and oranges.
In RANLP.
S. Kulick, A. Bies, M. Liberman, M. Mandel, R. McDonald, M. Palmer, A. Schein, and L. Ungar.
2004.
Integrated annotation for biomedical information extraction.
In Proc. of the Human Language Technology Conference and the Annual Meeting of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL).
for Analyzing Talk.
Lawrence Erlbaum.
M. Marcus, B. Santorini, and M. Marcinkiewicz.
1993.
Building a large annotated corpus of English: the Penn Treebank.
Computational Linguistics, 19(2):313-330.
Ryan McDonald, Kevin Lerman, and Fernando Pereira.
2006.
Multilingual dependency parsing with a two-stage discriminative parser.
In Conference on Natural Language Learning (CoNLL).
CoNLL).
Lawrence Saul and Fernando Pereira.
1997.
Aggregate and mixed-order markov models for statistical language modeling.
In EMNLP.
