In this paper, we describe a new algorithm for recovering WH-trace empty nodes.
Our approach combines a set of hand-written patterns together with a probabilistic model.
Because the patterns heavily utilize regular expressions, the pertinent tree structures are covered using a limited number of patterns.
The probabilistic model is essentially a probabilistic context-free grammar (PCFG) approach with the patterns acting as the terminals in production rules.
We evaluate the algorithm's performance on gold trees and parser output using three different metrics.
Our method compares favorably with state-of-the-art algorithms that recover WH-traces.
1 Introduction
In this paper, we describe a new algorithm for recovering WH-trace empty nodes in gold parse trees in the Penn Treebank and, more importantly, in automatically generated parses.
This problem has only been investigated by a handful of researchers and yet it is important for a variety of applications, e.g., mapping parse trees to logical representations and structured representations for language modeling.
For example, SuperARV language models (LMs) (Wang and Harper, 2002; Wang et al., 2003), which tightly integrate lexical features and syntactic constraints, have been found to significantly reduce word error in English speech recognition tasks.
In order to generate SuperARV LM training data, a state-of-the-art parser is used to parse training material and then a rule-based transformer converts the parses to the SuperARV representation.
The transformer is quite accurate when operating on treebank parses; however, trees produced by the parser lack one important type of information - gaps, particularly WH-traces, which are important for more accurate extraction of the SuperARVs.
Approaches applied to the problem of empty node recovery fall into three categories.
Dienes and Dubey (2003) recover empty nodes as a preprocessing step and pass strings with gaps to their parser.
Their performance was comparable to (Johnson, 2002); however, they did not evaluate the impact of the gaps on parser performance.
Collins (1999) directly incorporated wh-traces into his Model 3 parser, but he did not evaluate gap insertion accuracy directly.
Most of the research belongs to the third category, i.e., post-processing of parser output.
Johnson (2002) used corpus-induced patterns to insert gaps into both gold standard trees and parser output.
Campbell (2004) developed a set of linguistically motivated hand-written rules for gap insertion.
Machine learning methods were employed by (Higgins, 2003; Levy and Manning, 2004; Gabbard et al., 2006).
In this paper, we develop a probabilistic model that uses a set of patterns and tree matching to guide the insertion of WH-traces.
We only insert traces of non-null WH-phrases, as they are most relevant for our goals.
Our effort differs from the previous approaches in that we have developed an algorithm for the insertion of gaps that combines a small set of expressive patterns with a probabilistic grammar-based model.
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 620-629, Prague, June 2007.
©2007 Association for Computational Linguistics
We have developed a set of tree-matching patterns that are applied to propagate a gap down a path in a parse tree.
Pattern examples appear in Figure 1.
Each pattern is designed to match a subtree (a root and one or more levels below that root) and used to guide the propagation of the trace into one or more nodes at the terminal level of the pattern (indicated using directed edges).
Since tree-matching patterns are applied in a top-down fashion, multiple patterns can match the same subtree and allow alternative ways to propagate a gap.
Hence, we have developed a probabilistic model to select among the alternative paths.
We have created 24 patterns for WHNP traces, 16 for WHADVP, 18 for WHPP, and 11 for WHADJP.
Figure 1: Examples of tree-matching patterns
Before describing our model, we first introduce some notation.
• $T_{ij}^N$ is a tree dominating the string of words between positions $i$ and $j$, with $N$ being the label of the root.
We assume there are no unary chains like $N - X - \ldots - Y - N$ (which could be collapsed to a single node $N$) in the tree, so that $T_{ij}^N$ uniquely describes the subtree.
• A gap location $g_{ab,N}^{cd}$ is represented as a tuple $(gaptype, ancstr(a, b, N), c, d)$, where $gaptype$ is the type of the gap (e.g., whnp for a WHNP trace), $ancstr(a, b, N)$ is the gap's nearest ancestor, with $a$ and $b$ being its span and $N$ being its label, and $c$ and $d$ indicate where the gap can be inserted.
Note that a gap's location is specified precisely when $c = d$. If the gap is yet to be inserted into its final location but will be inserted somewhere inside $ancstr(a, b, N)$, then we set $c = a$ and $d = b$.
TN.
• $P(g_{ab,N}^{xy} \mid gaptype, T_{ij}^N)$ is the probability that a gap of $gaptype$ is located between $x$ and $y$, with $a$ and $b$ being the span of its ancestor and $i \le a \le x \le y \le b \le j$.
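As an illustration, the gap-location tuple can be pictured as a small record type. The following is a minimal Python sketch under stated assumptions, not the implementation evaluated in this paper; the names `GapLocation`, `is_exact`, and `unresolved` are hypothetical and introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GapLocation:
    """The tuple (gaptype, ancstr(a, b, N), c, d) from the notation above."""
    gaptype: str  # e.g. "whnp" for a WHNP trace
    a: int        # left end of the nearest ancestor's span
    b: int        # right end of the nearest ancestor's span
    label: str    # N, the label of the nearest ancestor
    c: int        # leftmost position where the gap may be inserted
    d: int        # rightmost position where the gap may be inserted

    def __post_init__(self):
        # The insertion window must lie inside the ancestor's span.
        if not (self.a <= self.c <= self.d <= self.b):
            raise ValueError("expected a <= c <= d <= b")

    def is_exact(self) -> bool:
        # The location is specified precisely when c == d.
        return self.c == self.d


def unresolved(gaptype: str, a: int, b: int, label: str) -> GapLocation:
    # Gap not yet placed: it may end up anywhere inside ancstr(a, b, N).
    return GapLocation(gaptype, a, b, label, c=a, d=b)


pending = unresolved("whnp", 2, 8, "S")        # made-up spans
final = GapLocation("whnp", 7, 8, "VP", 8, 8)  # made-up spans
print(pending.is_exact(), final.is_exact())  # False True
```

The range check in `__post_init__` encodes the constraint that the insertion window never leaves the ancestor's span.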
Given this notation, our model is tasked with identifying the best location for the gap in a parse tree among the alternatives, i.e.,
$$\arg\max_{g_{ab,N}^{xx}} \Pr(g_{ab,N}^{xx} \mid T, gaptype)$$
where $g_{ab,N}^{xx}$ represents a gap location in a tree, $T = T_{ij}^N$ is the subtree of the parse tree whose root node is the nearest ancestor node dominating the WH-phrase, excluding the WH-node itself, and $gaptype$ is the type of the gap.
In order to simplify the notation, we will omit the root labels $N$ in $T_{ij}^N$ and $g_{ab,N}^{cd}$, implying that they match where appropriate.
To guide this model, we utilize tree-matching patterns (see Figure 1), which are formally defined as functions:
$$ptrn : \mathcal{T} \times \mathcal{G} \rightarrow \Gamma \cup \{none\}$$
where $\mathcal{T}$ is the space of parse trees, $\mathcal{G}$ is the space of gap types, $\Gamma$ is the space of gaps $g_{ab}^{cd}$, and $none$ is a special value representing failure to match$^1$.
The application of a pattern is defined as:
Because patterns are uniquely associated with specific gap types, we will omit $gaptype$ to simplify the notation.
Application is a function defined for every pair $(ptrn, T_{ij})$ with fixed $gaptype$.
Patterns are applied to the root of $T_{ij}$, not to an arbitrary subtree.
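To make the idea of pattern application concrete, the following Python sketch treats a pattern as a function from a tree to either a gap or `None` (standing in for $none$). It is only an illustration: the two patterns `s_to_vp` and `vp_object`, the `Tree` and `Gap` types, and the toy tree are hypothetical and are not taken from our pattern set, which relies on much richer regular-expression matching.

```python
from typing import Callable, NamedTuple, Optional, Tuple


class Tree(NamedTuple):
    label: str
    start: int
    end: int
    children: Tuple["Tree", ...] = ()


class Gap(NamedTuple):
    a: int      # span of the gap's nearest ancestor
    b: int
    label: str  # label of that ancestor
    c: int      # insertion window; c == d means an exact location
    d: int


Pattern = Callable[[Tree], Optional[Gap]]


def s_to_vp(tree: Tree) -> Optional[Gap]:
    """Hypothetical pattern: at an S root, pass the gap down into its VP child."""
    if tree.label == "S":
        for child in tree.children:
            if child.label == "VP":
                return Gap(child.start, child.end, "VP", child.start, child.end)
    return None  # plays the role of `none`


def vp_object(tree: Tree) -> Optional[Gap]:
    """Hypothetical pattern: in a VP with a verb child, put the gap at its right edge."""
    if tree.label == "VP" and any(c.label == "VB" for c in tree.children):
        return Gap(tree.start, tree.end, "VP", tree.end, tree.end)
    return None


def app(pattern: Pattern, tree: Tree) -> Optional[Gap]:
    # Patterns are applied to the root of the tree only.
    return pattern(tree)


vp = Tree("VP", 1, 2, (Tree("VB", 1, 2),))
s = Tree("S", 0, 2, (Tree("NP", 0, 1), vp))
print(app(s_to_vp, s))     # Gap(a=1, b=2, label='VP', c=1, d=2)
print(app(vp_object, s))   # None
print(app(vp_object, vp))  # Gap(a=1, b=2, label='VP', c=2, d=2)
```

Note that `vp_object` fails on the S root even though a matching VP exists below it, which mirrors the restriction that patterns apply to the root rather than to an arbitrary subtree.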
Consider an example of pattern application shown in Figure 2.
The tree contains a relative clause in which the WHNP phrase that was moved from some location inside the subtree of its sister node S.
$^1$ Modeling conjunction requires an alternative definition for patterns: $ptrn : \mathcal{T} \times \mathcal{G} \rightarrow Powerset(\Gamma) \cup \{none\}$.
For the sake of simplicity, we ignore conjunctions in the following discussion, except in the few places where it matters, since this has little impact on the development of our model.
Figure 2: A pattern application example
Figure 3: Another pattern application example
Suppose that, in addition to the pattern applications shown in Figure 2, there is one more, namely $app(P_5, T_{48}) \rightarrow g_{48}$.
The sequence of patterns $P_1, P_2, P_5$ proposes an alternative grammatically plausible location for the gap, as shown in Figure 3.
Notice that the combination of the two sequences produces a tree of patterns, as shown in Figure 4, and this pattern tree covers much of the structure of the $T_{28}$ subtree.
The number of unique subtrees that contain WH-phrases is essentially infinite; hence, modeling them directly is infeasible.
However, trees with varying details, e.g., optional adverbials, often can be characterized by the same tree of patterns.
Figure 4: Pattern tree
Hence, we can represent the space of trees by utilizing a relatively small set of classes of trees that are determined by their tree of pattern applications.
Let $\Pi$ be the set of all patterns.
We define the set of patterns matching tree $T_{ij}$ as follows:
$$M_{ij} = \{P \mid P \in \Pi \wedge app(P, T_{ij}) \neq none\}$$
We also define a function that maps a tree to the set of pattern chains applicable to it.
The pseudocode for this function, called FindPCs, appears in Figure 5$^2$.
When applied to $T_{ij}$, this function returns the set of all pattern chains whose application would result in concrete gap locations.
The algorithm is guaranteed to terminate as long as trees are of finite depth and each pattern moves the gap location down at least one level in the tree at each iteration.
Using this function, we define the Tree Class (TC) of a tree $T_{ij}$ as $TC(T_{ij}) = FindPCs(T_{ij})$.
$^2$ $list \circ element$ means "append element to list".
Figure 5: Pseudocode for FindPCs
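The recursive enumeration of pattern chains described above can be sketched in Python as follows. This is a hedged illustration, not the pseudocode of Figure 5: it ignores conjunction, and it assumes a simplified pattern interface in which a pattern returns `None`, the subtree into which the gap is propagated, or an exact location. The patterns `P1` to `P4`, the helper `child`, and the toy tree are hypothetical.

```python
from typing import Dict, NamedTuple, Tuple


class Tree(NamedTuple):
    label: str
    start: int
    end: int
    children: Tuple["Tree", ...] = ()


class Exact(NamedTuple):
    a: int  # span of the gap's mother node
    b: int
    x: int  # concrete gap position (c == d == x)


Chain = Tuple[str, ...]


def child(tree: Tree, label: str):
    # First child with the given label, or None.
    return next((c for c in tree.children if c.label == label), None)


def find_pcs(tree: Tree, patterns, chain: Chain = ()) -> Dict[Chain, Exact]:
    """Collect every pattern chain that ends in a concrete gap location."""
    found: Dict[Chain, Exact] = {}
    for name, pattern in patterns.items():
        result = pattern(tree)
        if result is None:                 # pattern does not match this root
            continue
        extended = chain + (name,)         # append the pattern to the chain
        if isinstance(result, Exact):      # exact location: the chain is complete
            found[extended] = result
        else:                              # gap moved down: recurse on the subtree
            found.update(find_pcs(result, patterns, extended))
    return found


# Hypothetical patterns, for illustration only.
PATTERNS = {
    "P1": lambda t: child(t, "VP") if t.label == "S" else None,
    "P2": lambda t: child(t, "S") if t.label == "VP" else None,
    "P3": lambda t: child(t, "VP") if t.label == "VP" else None,
    "P4": lambda t: Exact(t.start, t.end, t.end)
          if t.label == "VP" and child(t, "VB") else None,
}

# Toy tree for a clause like "viewers want to see".
inner_vp = Tree("VP", 3, 4, (Tree("VB", 3, 4),))
to_vp = Tree("VP", 2, 4, (Tree("TO", 2, 3), inner_vp))
main_vp = Tree("VP", 1, 4, (Tree("VB", 1, 2), Tree("S", 2, 4, (to_vp,))))
root = Tree("S", 0, 4, (Tree("NP", 0, 1), main_vp))

for pc, gap in sorted(find_pcs(root, PATTERNS).items()):
    print(pc, gap)
```

On this toy tree the sketch finds two chains, one attaching the gap in the matrix VP and one in the embedded VP. Termination relies on the same assumption as stated above: every non-exact pattern must return a proper subtree of its input.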
In the case of a conjunction, the function FindPCs is slightly more complex.
Recall that in this case $app(P, T_{ij})$ produces a set of gaps or $none$.
The pseudocode for this case appears in Figure 6.
The set of pattern chains constructed by the function FindPCs can be represented as a pattern tree with patterns being the edges.
For example, the pattern tree in Figure 4 corresponds to the tree displayed in Figures 2 and 3.
This pattern tree captures the history of gap propagations beginning at A. Assuming that at that point only pattern $P_1$ is applicable, subtree B is produced.
If $P_2$ yields subtree C, and at that point patterns $P_3$ and $P_5$ can be applied, this yields subtree D and exact location F (which is expressed by the termination symbol \$), respectively.
Finally, pattern $P_4$ matches subtree D and proposes exact gap location E. It is important to note that this pattern tree can be thought of as an automaton, with A, B, C, D, E, and F being the states and the pattern applications being the transitions.
With this representation, we can create a regular grammar using patterns as the terminals and their
The set $app(P, T_{ij})$ must be ordered, so that branches of a conjunction are concatenated in a well-defined order.
Figure 6: Pseudocode for FindPCs in the case of conjunction
which might correspond to something like "that viewers will tune in to expect to see."
Note that this pattern chain belongs to a different tree class, which incidentally would have inserted the gap at a different location (VP see gap).
To overcome this problem we add additional constraints to the grammar to ensure that all parses the grammar generates belong to the same tree class.
One way to do this is to include the start state of a transition as an element of the terminal, e.g., $\frac{P_3}{\{P_3, P_5\}}$.
That is, we extend the terminals to include the left-hand side of the productions they are emitted from,
and the sequence of terminals becomes:
$$\frac{P_1}{\{P_1\}}\ \frac{P_2}{\{P_2\}}\ \frac{P_3}{\{P_3, P_5\}}\ \frac{P_4}{\{P_4\}}\ \$ $$
Note that the grammar is unambiguous. For such a grammar, the question "what is the probability of a parse tree given a string and grammar" does not make sense; however, the question "what is the probability of a string given the grammar" is still valid, and this is essentially what we require to develop a generative model for gap insertion.
• Let $T = \{P \mid P \in \Pi\} \cup \{\$\}$ be the set of terminals, where \$ is a special symbol$^4$.
• Let $N = \{S\} \cup powerset(\Pi)$ be the set of nonterminals, with $S$ being the start symbol.
• Let $\mathcal{P}$ be the set of productions, defined as the union of the following sets, where $\Pi_t \subseteq \Pi$ denotes the terminal patterns$^3$:
$\{S \rightarrow \nu \mid \nu \in powerset(\Pi)\}$.
$\{\nu \rightarrow P\,\mu \mid P \in \Pi - \Pi_t,\ \nu \in pset(P) \text{ and } \mu \in powerset(\Pi)\}$.
These are the nonterminal transitions; note that they emit only non-terminal patterns.
$\{\nu \rightarrow P\,\$ \mid P \in \Pi_t \text{ and } \nu \in pset(P)\}$.
These are the terminal transitions; they emit a terminal pattern and the symbol \$.
$\{\nu \rightarrow P\,\mu_1 \ldots \mu_n \mid P \in \Pi - \Pi_t,\ \nu \in pset(P) \text{ and } \forall i \in [1..n]\ \mu_i \in powerset(\Pi)\}$.
This rule models conjunction with $n$ branches.
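As an illustration of how a pattern chain maps onto productions of this grammar, the following Python sketch converts a chain, together with the set of patterns matching at each step, into the sequence $S \rightarrow \nu$, $\nu \rightarrow P\,\mu$, ..., $\nu \rightarrow P\,\$$. It is a sketch only: conjunction is ignored, and the function names `chain_to_productions` and `show` are hypothetical. The example uses the states of the pattern tree in Figure 4.

```python
from typing import FrozenSet, List, Sequence, Tuple, Union

State = FrozenSet[str]  # a nonterminal: the set of patterns matching at a node
Production = Tuple[Union[str, State], tuple]


def chain_to_productions(states: Sequence[State], chain: Sequence[str]) -> List[Production]:
    """states[k] is the set of patterns matching where chain[k] is applied."""
    if not chain or len(states) != len(chain):
        raise ValueError("need a non-empty chain and one state per pattern")
    productions: List[Production] = [("S", (states[0],))]               # S -> v
    for k, pattern in enumerate(chain):
        if pattern not in states[k]:
            raise ValueError(f"{pattern} does not match at step {k}")
        if k + 1 < len(chain):
            productions.append((states[k], (pattern, states[k + 1])))  # v -> P mu
        else:
            productions.append((states[k], (pattern, "$")))            # v -> P $
    return productions


def show(symbol) -> str:
    # Render a nonterminal set as {P3,P5}; strings are printed as they are.
    return symbol if isinstance(symbol, str) else "{" + ",".join(sorted(symbol)) + "}"


states = [frozenset({"P1"}), frozenset({"P2"}), frozenset({"P3", "P5"}), frozenset({"P4"})]
for lhs, rhs in chain_to_productions(states, ["P1", "P2", "P3", "P4"]):
    print(show(lhs), "->", " ".join(show(sym) for sym in rhs))
# S -> {P1}
# {P1} -> P1 {P2}
# {P2} -> P2 {P3,P5}
# {P3,P5} -> P3 {P4}
# {P4} -> P4 $
```

Because each left-hand side is the full set of matching patterns, two chains that share a pattern sequence but pass through different states yield different production sequences, which is what keeps parses within one tree class.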
Given the grammar defined in the previous subsection, we will define a probabilistic model for gap insertion.
Recall that our goal is to find:
$$\arg\max_{g_{ab}^{xx}} \Pr(g_{ab}^{xx} \mid T)$$
Just as the probability of a sentence is obtained by summing the probabilities of its parses, the probability of the gap being at $g_{ab}^{xx}$ is the sum of the probabilities of all pattern chains that yield $g_{ab}^{xx}$:
$$\Pr(g_{ab}^{xx} \mid T) = \sum_{pc_i \in \Upsilon} \Pr(pc_i \mid T)$$
where $\Upsilon = \{pc \mid app(pc, T) = g_{ab}^{xx}\}$.
Note that $pc_i \in TC(T)$ by definition.
$^3$ Patterns that generate exact position for a gap.
$^4$ Symbol \$ helps to separate branches in strings with conjunction.
For our model, we use two approximations.
First, we collapse a tree $T$ into its Tree Class $TC(T)$, effectively ignoring details irrelevant to gap insertion:
$$\Pr(pc_i \mid T) \approx \Pr(pc_i \mid TC(T))$$
Figure 7: A pattern tree with the pattern chain ABDGM marked using bold lines
Consider the pattern tree shown in Figure 7.
The probability of the pattern chain ABDGM given the pattern tree can be computed as:
$$\Pr(ABDGM \mid TC(T)) = \frac{N_R(ABDGM, TC(T))}{N_R(TC(T))}$$
where $N_R(TC(T))$ is the number of occurrences of the tree class $TC(T)$ in the training corpus and $N_R(ABDGM, TC(T))$ is the number of cases in which the pattern chain ABDGM leads to a correct gap in trees corresponding to the tree class $TC(T)$.
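The relative-frequency estimate described above can be sketched in a few lines of Python. The sketch assumes a tree class is represented as a frozen set of chain names and that training data comes as (tree class, correct chain) pairs; the function name `make_estimator`, the chain names other than ABDGM, and the counts are made up for illustration.

```python
from collections import Counter
from fractions import Fraction
from typing import FrozenSet, Iterable, Optional, Tuple

TreeClass = FrozenSet[str]  # a tree class, written here as a set of chain names


def make_estimator(training: Iterable[Tuple[TreeClass, str]]):
    """training holds (tree class, chain that led to the correct gap) pairs."""
    pairs = list(training)
    class_counts = Counter(tc for tc, _ in pairs)  # occurrences of each tree class
    chain_counts = Counter(pairs)                  # occurrences of (class, chain)

    def prob(chain: str, tc: TreeClass) -> Optional[Fraction]:
        if class_counts[tc] == 0:
            return None  # unseen tree class: the relative frequency is undefined
        return Fraction(chain_counts[(tc, chain)], class_counts[tc])

    return prob


# Made-up counts, purely illustrative.
tc = frozenset({"ABDGM", "ABDH", "ACK"})
prob = make_estimator([(tc, "ABDGM")] * 3 + [(tc, "ACK")])
print(prob("ABDGM", tc))                 # 3/4
print(prob("ABDGM", frozenset({"AB"})))  # None
```

The `None` result for an unseen tree class shows concretely why this direct estimate breaks down when tree classes are rare.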
For many tree classes, $N_R(TC(T))$ may be a small number or even zero; thus, this direct approach cannot be applied to the estimation of $\Pr(pc_i \mid TC(T))$.
Further approximation is required to tackle the sparsity issue.
In the following discussion, XY will denote an edge (pattern) between vertices X and Y in
production $\Pr(B \rightarrow D)$ of a PCFG.
Recall that the meaning assigned to a state in pattern grammar in Section 2.2 is the set of patterns matching at that state.
Thus, according to that semantics, only the edges displayed in bold in Figure 8 are involved in the computation of $\Pr(\{BD, BE, BF\} \rightarrow BD\ \{DG, DH\})$.
Figure 8: The context considered for estimation of the probability of transition from B to D
Pattern trees are fairly shallow (partly because many patterns cover several layers in a parse tree as can be seen in Figures 1 and 2); therefore, the context associated with a production covers a good part of a pattern tree.
Another important observation is that the local configuration of a node, which is described by the set of matching patterns, is the most relevant to the decision of where the gap is to be propagated$^5$.
This is the reason why the states are represented this way.
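Formally, the second approximation we make is as follows:
$$\Pr(pc_i \mid TC(T)) \approx \Pr(pc_i \mid G)$$
where $G$ is a PCFG model based on the grammar described above.
$$\Pr(pc_i \mid G) = \prod_{prod_j \in \mathcal{P}(pc_i)} \Pr(prod_j \mid G)$$
where $\mathcal{P}(pc_i)$ is the parse of the pattern chain $pc_i$, which is a string of terminals of $G$. Combining the formulae:
$$\Pr(g_{ab}^{xx} \mid T) \approx \sum_{pc_i \in \Upsilon} \prod_{prod_j \in \mathcal{P}(pc_i)} \Pr(prod_j \mid G)$$
$^5$ We have evaluated a model that only uses $\Pr(BD \mid \{BD, BE, BF\})$ for the probability of taking BD and found that it performs only slightly worse than the model presented here.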
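The scoring step, in which the probability of a gap location is the sum over its pattern chains of the product of production probabilities, can be sketched as follows. This is a minimal Python illustration under stated assumptions: productions are identified by opaque names such as `r1`, the probabilities are made up, unseen productions receive probability zero (no smoothing), and the function names are hypothetical.

```python
from collections import defaultdict
from math import prod
from typing import Dict, Hashable, Iterable, Sequence, Tuple


def chain_probability(productions: Sequence[str], prob: Dict[str, float]) -> float:
    # Product of the probabilities of the productions in the chain's parse;
    # an unseen production gets probability 0 here (no smoothing).
    return prod(prob.get(p, 0.0) for p in productions)


def best_gap(candidates: Iterable[Tuple[Hashable, Sequence[str]]], prob: Dict[str, float]):
    """candidates holds (gap location, productions of one pattern chain) pairs."""
    totals = defaultdict(float)
    for gap, productions in candidates:
        totals[gap] += chain_probability(productions, prob)  # sum over chains
    return max(totals, key=totals.get), dict(totals)


# Made-up production probabilities and chains, purely illustrative.
prob = {"r1": 0.5, "r2": 0.5, "r3": 0.25, "r4": 0.75}
candidates = [("E", ["r1", "r3"]), ("E", ["r2", "r3"]), ("F", ["r1", "r4"])]
print(best_gap(candidates, prob))  # ('F', {'E': 0.25, 'F': 0.375})
```

In this toy example, location E is reached by two chains whose probabilities are added, yet F still wins; `best_gap` raises `ValueError` if no candidate chain exists.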
To handle conjunction, we must express the fact that pattern chains yield sets of gaps.
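Thus, the goal becomes:
$$\arg\max_{\{g_{a_1b_1}^{x_1x_1}, \ldots, g_{a_nb_n}^{x_nx_n}\}} \Pr(\{g_{a_1b_1}^{x_1x_1}, \ldots, g_{a_nb_n}^{x_nx_n}\} \mid T)$$
where $\Upsilon = \{pc \mid app(pc, T) = \{g_{a_1b_1}^{x_1x_1}, \ldots, g_{a_nb_n}^{x_nx_n}\}\}$.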
The remaining equations are unaffected.
Even for the relatively small number of patterns, the number of nonterminals in the grammar can potentially be large ($2^{|\Pi|}$).
This does not happen in practice, since most patterns are mutually exclusive.
Nonetheless, productions unseen in the training data do occur, and their probabilities have to be estimated.
Rewriting the probability of a transition $\Pr(A \rightarrow a\,B)$ as $P(A, a, B)$, we use the following interpolation:
We estimate the parameters on held-out data (section 24 of WSJ) using a hill-climbing algorithm.
3 Evaluation
We compare our algorithm under a variety of conditions to the work of Johnson (2002) and Gabbard et al. (2006).
We selected these two approaches because of their availability$^6$.
In addition, Gabbard et al. (2006) provide state-of-the-art results.
Since we only model the insertion of WH-traces, all metrics include co-indexation with the correct WH phrases identified by their type and word span.
We evaluate on three metrics.
The first metric, which was introduced by Johnson (2002), has been widely reported by researchers investigating gap insertion.
A gap is scored as correct only when it has the correct type and string position.
The metric has the shortcoming that it does not require correct attachment into the tree.
The second metric, which was developed by Campbell (2004), scores a gap as correct only when it has the correct gap type and its mother node has the correct nonterminal label and word span.
As Campbell points out, this metric does not restrict the position of the gap among its siblings, which in most cases is desirable; however, in some cases (e.g., double object constructions), it does not correctly detect errors in object order.
This metric is also adversely affected by incorrect attachments of optional constituents, such as PPs, due to the span requirement.
To overcome the latter issue with Campbell's metric, we propose to use a third metric that evaluates gaps with respect to correctness of their lexical head, type of the mother node, and the type of the co-indexed wh-phrase.
This metric differs from that used by Levy and Manning (2004) in that it counts only the dependencies involving gaps, and so it represents performance of the gap insertion algorithm more directly.
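The three metrics differ only in which attributes of a gap must match, which the following Python sketch makes explicit. It is an illustration, not our evaluation code: the `Gap` fields, the metric names in `METRICS`, and the example gaps are hypothetical, the sketch scores a single sentence, and it omits the co-indexation check on the WH-phrase span that all metrics include.

```python
from typing import NamedTuple, Tuple


class Gap(NamedTuple):
    gaptype: str                  # e.g. "WHNP"
    position: int                 # string position of the gap
    mother_label: str             # nonterminal label of the mother node
    mother_span: Tuple[int, int]  # word span of the mother node
    head: str                     # lexical head the gap depends on
    wh_type: str                  # type of the co-indexed WH-phrase


# Each metric is a projection of a gap onto the attributes it checks.
METRICS = {
    "johnson": lambda g: (g.gaptype, g.position),
    "campbell": lambda g: (g.gaptype, g.mother_label, g.mother_span),
    "head_dep": lambda g: (g.head, g.mother_label, g.wh_type),
}


def score(gold, predicted, key):
    """Precision, recall and F1 over gaps projected through `key`."""
    g, p = {key(x) for x in gold}, {key(x) for x in predicted}
    correct = len(g & p)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


gold = [Gap("WHNP", 8, "VP", (6, 8), "see", "WHNP")]
# Same gap, but the parser gave the mother VP a different span.
pred = [Gap("WHNP", 8, "VP", (6, 10), "see", "WHNP")]
for name, key in METRICS.items():
    print(name, score(gold, pred, key))
# johnson (1.0, 1.0, 1.0)
# campbell (0.0, 0.0, 0.0)
# head_dep (1.0, 1.0, 1.0)
```

The example reproduces the situation discussed above: a span error in the mother node is penalized by Campbell's metric but not by the string-position or head-dependency metrics.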
We evaluate gap insertion on gold trees from section 23 of the Wall Street Journal Penn Treebank (WSJ) and parse trees automatically produced using the Charniak (2000) and Bikel (2004) parsers.
These parsers were trained using sections 00 through 22 of the WSJ with section 24 as the development set.
Because our algorithm inserts only traces of nonempty WH phrases, to fairly compare to Johnson's and Gabbard's performance on WH-traces alone, we remove the other gap types from both the gold trees and the output of their algorithms.
$^6$ Johnson's source code is publicly available, and Ryan Gabbard kindly provided us with output trees produced by his system.
Note that Gabbard et al.'s algorithm requires the use of function tags, which are produced using a modified version of the Bikel parser (Gabbard et al., 2006) and a separate software tool (Blaheta, 2003) for the Charniak parser output.
For our algorithm, we do not utilize function tags, but we automatically replace the tags of auxiliary verbs in tensed constructions with AUX prior to inserting gaps using tree surgeon (Levy and Andrew, 2006).
We found that Johnson's algorithm more accurately inserts gaps when operating on auxified trees, and so we evaluate his algorithm using these modified trees.
In order to assess robustness of our algorithm, we evaluate it on a corpus of a different genre - Broadcast News Penn Treebank (BN), and compare the result with Johnson's and Gabbard's algorithms.
The BN corpus uses a modified version of annotation guidelines, with some of the modifications affecting gap placement.
Since our algorithms were trained on WSJ, we apply tree transformations to the BN corpus to convert these trees to WSJ style.
We also auxify the trees as described previously.
G), and our (denoted Pres) algorithms on section 23 gold trees, as well as on parses generated by the Charniak and Bikel parsers.
In addition to WHNP and WHADVP results that are reported in the literature, we also present results for WHPP gaps even though there is a small number of them in section 23 (i.e., 22 gaps total).
Since there are only 3 nonempty WHADJP phrases in section 23, we omit them in our evaluation.
Compared to Johnson's and Gabbard's algorithms, our algorithm significantly reduces the error on gold trees (Table 1).
Operating on automatically parsed trees, our system compares favorably on all WH traces, using all metrics, except for two instances: Gabbard's algorithm performs better on WHNP by 0.3% using Campbell's metric and trees generated by the Charniak parser, and on WHADVP by 0.7% using Johnson's metric and trees produced by the Bikel parser.
However, we believe that the dependency metric is more appropriate for evaluation on automatically parsed trees because it enforces the most important aspects of tree structure for evaluating gap insertion.
The relatively poor performance of Johnson's and our algorithms on WHPP gaps compared to that on WHADVP gaps is probably due, at least in part, to the significantly smaller number of WHPP gaps in the training corpus and the relatively wider range of possible attachment sites for the prepositional phrases.
Table 2 displays how well the algorithms trained on WSJ perform on BN.
A large number of the errors are due to FRAGs which are far more common in the speech corpus than in WSJ.
WHPP and WHADJP, although rarer than the other types, are presented for reference.
It is clear from the contrast between the results based on gold standard trees and the automatically produced parses in Table 1 that parse error is a major source of error.
Parse error impacts all of the metrics, but the patterns of errors are different.
For WHNPs, Campbell's metric is lower than the other two across all three algorithms, suggesting that this metric is adversely affected by factors that do not impact the other metrics (most likely the span of the gap's mother node).
For WHADVPs, the metrics show a similar degradation due to parse error across the board.
We are reluctant to draw conclusions for the metrics on WHPPs; however, it should be noted that the position of the PP should be less critical for evaluating these gaps than their correct attachment, suggesting that the head dependency metric would more accurately reflect the performance of the system for these gaps.
Campbell's metric has an interesting property: in parse trees, we can compute the upper bound on recall by simply checking whether the correct WH-phrase and gap's mother node exist in the parse tree.
We present recall results and upper bounds in Table 3.
The algorithms perform close to the upper bound for WHNPs when we take into account the impact of parse errors on this metric.
For the WHPPs, however, there is room for improvement.
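The upper bound on recall under Campbell's metric can be computed with a simple membership test, as the following Python sketch shows. It is an illustration only: nodes are represented as hypothetical (label, start, end) triples for a single sentence, and the names `GoldGap` and `recall_upper_bound` as well as the example spans are made up.

```python
from typing import Iterable, NamedTuple, Set, Tuple

Node = Tuple[str, int, int]  # (label, start, end)


class GoldGap(NamedTuple):
    wh_phrase: Node  # the co-indexed WH-phrase in the gold tree
    mother: Node     # the gap's mother node in the gold tree


def recall_upper_bound(gold_gaps: Iterable[GoldGap], parse_nodes: Set[Node]) -> float:
    """Fraction of gold gaps whose WH-phrase and mother node both exist in the parse."""
    gaps = list(gold_gaps)
    if not gaps:
        return 0.0
    reachable = sum(g.wh_phrase in parse_nodes and g.mother in parse_nodes for g in gaps)
    return reachable / len(gaps)


parse = {("WHNP", 2, 3), ("S", 3, 8), ("VP", 4, 8), ("VP", 7, 8)}
gold = [
    GoldGap(("WHNP", 2, 3), ("VP", 7, 8)),  # both nodes were recovered by the parser
    GoldGap(("WHNP", 2, 3), ("VP", 6, 8)),  # mother span is missing from the parse
]
print(recall_upper_bound(gold, parse))  # 0.5
```

A gap whose mother node is absent from the parse cannot be scored as correct by any insertion algorithm, so this fraction bounds recall from above; for a whole corpus, nodes would additionally need a sentence identifier.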
Table 3: Recall on trees produced by the Charniak and Bikel parsers and their upper bounds (UB)
In addition to parser errors, which naturally have the most profound impact on performance, we found the following sources of errors to have an impact on our results:
• Annotation errors and inconsistency in PTB, which impact not only the training of our system, but also its evaluation.
  ◦ There are some POS labeling errors that confuse our patterns, e.g.,
  ◦ PTB annotation guidelines leave it to annotators to decide whether the gap should be attached at the conjunction level or inside its branches (Bies et al., 1995), leading to inconsistency in attachment decisions for adverbial gaps.
• Lack of coverage: Even though the patterns we use are very expressive, due to their small number some rare cases are left uncovered.
• Model errors: Sometimes, even though one of the applicable pattern chains proposes the correct gap, the probabilistic model chooses otherwise.
We believe that a lexicalized model can eliminate most of these errors.
4 Conclusions and Future Work
The main contribution of this paper is the development of a generative probabilistic model for gap insertion that operates on subtree structures.
Our model achieves state-of-the-art performance, demonstrating results very close to the upper bound on WHNP using Campbell's metric.
Performance for WHADVPs and especially WHPPs, however, has room for improvement.
We believe that lexicalizing the model by adding information about lexical heads of the gaps may resolve some of the errors.
For example:
(VP (VB deliver) ...
These sentences have very similar structure, with two potential places to insert gaps (ignoring reordering with siblings).
The current model inserts the gaps as follows: when Congress (VP wanted (S to know) gap) and when it is (VP expected (S to deliver) gap), making an error in the second case (partly due to the bias towards shorter pattern chains, typical for a PCFG).
However, deliver is more likely to take a temporal modifier than know.
In future work, we will investigate methods for adding lexical information to our model in order to improve the performance on WHADVPs and WHPPs.
In addition, we will investigate methods for automatically inferring patterns from a treebank corpus to support fast porting of our approach to other languages with treebanks.
5 Acknowledgements
We would like to thank Ryan Gabbard for providing us output from his algorithm for evaluation.
We would also like to thank the anonymous reviewers for invaluable comments.
This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-06-C-0023.
Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA.
