This paper reports a hybridization experiment, where a baseline ML dependency parser, LingPars, was allowed access to Constraint Grammar analyses provided by a rule-based parser (EngGram) for the same data.
Descriptive compatibility issues and their influence on performance are discussed.
The hybrid system performed considerably better than its ML baseline, and proved more robust than the latter in the domain adaptation task, where it was the best-scoring system in the open class for the chemical test data, and the best overall system for the CHILDES test data.
1 Introduction
LingPars, a language-independent treebank-learner developed in the context of the CoNLL-X 2006 shared task (http://nextens.uvt.nl/~conll/), was inspired by the Constraint Grammar (CG) parsing approach (Karlsson et al. 1995) in the sense that it prioritized the identification of syntactic function over syntactic form, basing the dependency potential of a word on "edge" labels like subject, object etc. rather than the other way around.
The system also used other features typical of CG systems, such as BARRIER conditions, tag chains of variable length, implicit clause boundaries and tag sets (Bick 2006).
For the 2007 task only one such feature was newly introduced - a directedness marker for a few major functions, splitting subject, adverbial and adnominal labels into pairs of left- and
right-attaching labels (e.g. SBJ-L, SBJ-R, NMOD-L, NMOD-R).
Even this small addition, however, increased the memory space requirements of the model to such a degree that only runs with 50-75% of the training data were possible on the available hardware.
The main purpose of the LingPars architecture
• Can an independent, rule-based parser be made to conform to different, data-imposed descriptive conventions without too great a loss in accuracy?
• Does a rules-based dependency parser have a better chance than a machine-learned one to identify long-distance relations and global sentence structure, thus providing valuable arbiter information to the latter?
Obviously, both points rule out a test involving many languages with the same parser (CoNLL task 1).
The domain adaptation task (task 2), however, satisfied the single-language condition and also adressed the descriptive adaptation problem (second hypothesis), involving three English treebanks - Wall Street Journal data from the Penn treebank (PTB, Marcus et al. 1993) for training, and the Pchem (Kulick et al. 2004) and CHILDES (Brown
1973 and MacWhinney 2000) treebanks with biomedical and spoken language data, respectively.
2 Developing and adapting EngGram
A parser with hand-written rules pays a high "labour price" to arrive at deep, linguistically pre-
dictable and versatile analyses.
For CG systems as employed by the author, the cost, from lexicon to dependency, is usually several man years, and results are not language-independent.
One way of increasing development efficiency is to combine modules for different levels of analysis while reusing or adapting the less-language independent ones.
Thus, the development of a new English dependency parser, EngGram, under way for some time, was accelerated for the present project by seeding the syntactic disambiguation grammar with Danish rules from the well-established DanGram parser (http://beta.visl.sdu.dk/ constraint_grammar.html).
By maintaining an identical set of syntactic function tags, it was even possible to use the Danish dependency module (Bick 2005) with only minor adaptations (mainly concerning noun chains and proper nouns).
In order to integrate the output of a CG parser into an ML parser for the shared task data, several levels of compatibility issues have to be addressed.
On the input side, (1) PTB tokenization and (2) word classes (PoS) have to be fed into the CG parser bypassing its own modules of morphological analysis and disambiguation.
On the output side, (3) CG function categories and (4) attachment conventions have to be adapted to match PTB ones.
For example, the manual rules were tuned to a tokenization system that handles expressions such as "a=few", "at=least" and "such=as" as units.
Though amounting to only 1% of running text, they constitute syntactically crucial words, and misanalysis leads to numerous secondary errors.
Even worse is the case of the genitive-s (also with a frequency of 1%), tokenised in the PTB convention, but regarded a morpheme in EngGram.
Since EngGram does not have a word class for the isolated 's', and since ordinary rules disfavour postnominal singel-word attachment, the 's' had to be fused in PTB-to-CG input, creating fewer tokens and thus problems in re-aligning the analysed output.
Also relevant for a full structure parser is the parse window.
Here, in order to match PTB window size, EngGram had to be forced not to regard ; ( ) and : as delimiters, with an arguable loss in annota-
tion accuracy due to rules with global NOT contexts designed for smaller windows.
Finally, PTB convention fuses certain word classes, like subordinating conjunctions and prepositions (IN), and the infititive marker and the preposition "to" (TO).
Though these cases can be treated by letting CG disambiguation override the CoNLL input's pos tag, input pos can then no longer be said to be "known", with some deterioration in recall as a consequence.
Open class categories matched well even at a word-by-word level, closed class tokens were found to sometimes differ for individual words, an error source left largely unchecked.
Treebank error rate is another factor to be considered - in cases where the PoS accuracy of the human-revised treebanks is lower than that of a CG system, the latter should be allowed to always assign its own tags, rather than follow the supposedly fixed input pos.
In the domain adaptation task, the CHILDES data were a case in point.
A separate CG run indicated 6.6% differences in PoS, and manual inspection of part of the cases suggested that while some cases were irrelevant variations (e.g. adjective vs. participle), most were real error on the part of the treebank, and the parser was therefore set to ignore test data annotation and to treat it as pure text.
Errors appeared to be rarer in the training data, but inconsistencies between pos and function label (e.g. IN-preposition and SBJ-subject for "that") prove that errors aren't unknown here either -which is why a hybrid system with independent analysis has the potential benefit of compensating for "mis-learned" patterns in the ML system.
Output conversion from CG to PTB/CoNLL format had to address, besides realignment of tokens (e.g. genitive-s), the disparity in edge (function) labels.
However, since the PTB set was more coarse grained, it was possible to simply lump several EngGram labels into one PTB label, for instance:
Some idiosyncrasies had to be observed here, for instance the treatment of SC (subject complement)
as VMOD for words, but ADV for clauses, or the descriptive decision to tag direct objects in ACI constructions with OA-clausal complements as subjects.
Some cases of label variation, however, could not be solved in a systematic way.
Thus, adverbs within verb chains, always ADVL in EngGram, could not systematically be mapped, since PTB uses both VMOD and ADV in this position.
A certain percentage of mismatches in spite of a correct analysis must therefore be taken into account as part of the "price" for letting the CG system advise the machine learner.
Dependencies were generally used in the same way in both systems, but multi word expressions were problematic, since PTB - without marking them as MWE - appears to attach all elements to a common head even where internal structure (e.g. a PP) is present.
No reliable way was found to predict this behaviour from CG dependency output.
Finally, PTB often uses the adverbial modifier tag (AMOD) for what would logically be the head of an expression:
about (head) 1,200 (AMOD) so (head) totally (AMOD) herbicide (head) resistant (AMOD) EngGram in these examples regards the first element as AMOD modifier, and the second as head.
Since the inversion was so common, it was accepted as either intentional or systematically erroneous, and the CG output inverted accordingly.
It is an open question, for future research, whether the CG and ML systems could have been harmonized better, had the training data been an original dependency treebank rather than a constituent tree-bank, - or at least linguistically revised at the dependency level.
Making the constituent-dependency conversion principles (Johansson & Nugues 2007, forthcoming) public before rather than after the shared task might also have contributed to a better CG annotation transfer.
3 System architecture
quency threshold - the LEMMA or, if absent, FORM tag.
In a first round, LingPars calculates a preference list of functions and dependencies for each word, examining all possible mother-daughter pairs and n-grams in the sentence (or paragraph).
Next, dependencies are adjusted for function, basically summing up the frequency-, distance- and direction-calibrated function->PoS attachment probabilities for all contextually allowed functions for a given word.
Finally, dependency probabilities are weighted using linked probabilities for possible mother-, daughter- and sister-tags in a second pass.
The result are 2 arrays, one for possible daugh-ter->mother pairs, one for word:function pairs.
LingPars then attempts to "effectuate" the dependency (daughter->mother) array, starting with the -in normalized terms - highest value.
If the daughter candidate is as yet unattached, and the dependency does not produce circularities or crossing branches, the corresponding part of the (ordered) word:func-tion array is calibrated for the suggested dependency, and the top-ranking function chosen.
One of the major problems in the original system was uniqueness clashes, and as a special case, root attachment ambiguity, resulting from a conflict between the current best attachment candidate in the pipe and an earlier chosen attachment to the same head.
Originally, the parser tried to resolve these conflicts by assigning penalties to the attachments in question and recalculating "second best" attachments for the tokens in question.
While solving some cases, this method often timed out without finding a globally compatible solution.
In the new version of LingPars, with open resources, the attachment and function label rankings were calibrated using the analysis suggested by the EngGram CG system for the same data, assigning extra weights to readings supported by the rule based analysis, using addition of a weight constant for function, and multiplication with a weight constant for attachments, thus integrating CG information on par with statistical information1.
This was
Experiments suggested that there is a limit beyond which an increase of these weighting constants, for both function and dependency, will actually lead to a decrease in performance, because the positive effect of long-distance attachments from the CG system will be cancelled out by the negative effect of
not, however, thought sufficient to resolve the global syntactic problem of root attachment where (wrong) statistical preferences could be so strong that even 20 rounds of penalties could not weaken them sufficient to be ruled out.
Therefore, root and root attachments supported by the CG trees were fixed in the first pass, without reruns.
The same method was used for another source of global errors - coordination.
Here, the probabilistic system had difficulties learning patterns, because a specific function label (SBJ or OBJ etc) would be associated with a non-specific word class (CC), and a non-specific function (COORD) with a host of different word classes.
Again, adding a first-pass override based on CG-provided coordination links solved many of these cases.
Though limited to 2 types of global dependency (root and coordination), the help provided by the rule based analysis, also had indirect benefits by providing a better point of departure for other attachments, among other things because LingPars exaggerated both good and bad analyses: Good attachments would help weight other attachments through correct n-gram-, mother-, daughter- and sibling contexts, but isolated bad attachments would lead to even worse attachments by triggering, for instance, incorrect BARRIER or crossing branch constraints.
These adverse effects were moderated by getting a larger percentage of global dependencies right in the first place, and also by a new addition to the crossing and BARRIER subroutine invalidating it in the case of CG-supported attachments.
Evaluation
The hybrid LingPars was the best-scoring system in the open section of both domain adaptation tasks2 (Nivre et al. 2007), outperforming its probabilistic core system on all scores, with an improvement of 6.57 LAS percentage points for the
disturbing the application of machine-learned local dependencies.
2 During the test phase, the data set for one of the originally 2 test domains, CHILDES, was withdrawn from the official ranking, though its scores were still computed and admissible for evaluation.
CHILDES attachment score (table 2).
In the former, the effect was slightly more marked for attachment than for label accuracy.
However, whereas results also surpassed those of the top closed class system in the CHILDES domain (by 1.12 percentage points), they fell short of this mark for the pchemtb corpus - by 1.26 percentage points for label accuracy and 1.80 for attachment.
Table 1: Performance, Pchemtb data
Top score
CHILDES closed
CHILDES open
Table 2: Performance, CHILDES data
When compared with runs on (unknown) data from the training domain, cross-domain performance of the closed system was 2 percentage points lower for attachment and 3.5 lower for label accuracy (LA scores of 71.81 and 58.07 for the pchemtb and CHILDES corpus, respectively).
Interestingly, hybrid results for the pchemtb data were only marginally lower than for the training domain (in fact, higher for attachment), suggesting a higher domain robustness for the hybrid than for the probabilistic approach.
3This is the accuracy for the test data used during development.
For the PTB gold test data from track 1, LAS was higher (76.21).
