This paper proposes the use of the Lexicalized Tree-Adjoining Grammar (LTAG) formalism as an important additional source of features for the Semantic Role Labeling (SRL) task.
Using a set of one-vs-all Support Vector Machines (SVMs), we evaluate these LTAG-based features.
Our experiments show that LTAG-based features can improve SRL accuracy significantly.
Compared with the best known set of features used in state-of-the-art SRL systems, we obtain an improvement in F-score from 82.34% to 85.25%.
1 Introduction
Semantic Role Labeling (SRL) aims to identify and label all the arguments for each predicate occurring in a sentence.
It involves identifying constituents in the sentence that represent the predicate's arguments and assigning pre-specified semantic roles to them.
[A0_seller Ports of Call Inc.] reached agreements to [V_verb sell] [A1_thing its remaining seven aircraft] [A2_buyer to buyers that weren't disclosed].
This is an example of SRL annotation from the PropBank corpus (Palmer et al., 2005), where the subscripted information maps the semantic roles A0, A1, A2 to arguments for the predicate sell as defined in the PropBank Frame Scheme.
For SRL, high accuracy has been achieved by:
(i) proposing new types of features (see Table 1 in Section 3 for previously proposed features),
(iii) dealing with incorrect parser output by using more than one parser (Pradhan et al., 2005b).
Our work in this paper falls into category (i).
We propose several novel features based on Lexicalized Tree Adjoining Grammar (LTAG) derivation trees in order to improve SRL performance.
To show the usefulness of these features, we provide an experimental study comparing LTAG-based features with the standard set of features and kernel methods used in state-of-the-art SRL systems.
The LTAG formalism provides an extended domain of locality in which to specify predicate-argument relationships and also provides the notion of a derivation tree.
These two properties of LTAG make it well suited to address the SRL task.
SRL feature extraction has relied on various syntactic representations of input sentences, such as syntactic chunks (Hacioglu et al., 2004) and full syntactic parses (Gildea and Jurafsky, 2002).
In contrast with features from shallow parsing, previous work (Gildea and Palmer, 2002; Punyakanok et al., 2005b) has shown the necessity of full syntactic parsing for SRL.
In order to generalize the path feature (see Table 1 in Section 3) which is probably the most salient (while being the most data sparse) feature for SRL, previous work has extracted features from other syntactic representations, such as CCG derivations (Gildea and Hockenmaier, 2003) and dependency trees (Hacioglu, 2004) or integrated features from different parsers (Pradhan et al., 2005b).
To avoid explicit feature engineering on trees, (Moschitti, 2004) used convolution kernels on selective portions of syntactic trees.
In this paper, we also compare our work with tree kernel based methods.
Most SRL systems exploit syntactic trees as the main source of features.
We would like to take this one step further and show that using LTAG derivation trees as an additional source of features can improve both argument identification and classification accuracy in SRL.
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 590-599, Prague, June 2007.
©2007 Association for Computational Linguistics
Figure 1: A parse tree schematic, and two plausible LTAG derivation trees for it: derivation tree γ1 uses elementary trees α1 and β1 while γ2 uses α2 and α3.
2 Using LTAG-based Features in SRL
We assume some familiarity with Lexicalized Tree-Adjoining Grammar (LTAG); (Joshi and Schabes, 1997) is a good introduction to this formalism.
An LTAG is defined to be a set of lexicalized elementary trees (etree for short), of which there are two types, initial trees and auxiliary trees.
Typically, etrees are composed into parse trees through two operations: substitution and adjunction.
We use sister adjunction which is commonly used in LTAG statistical parsers to deal with the relatively flat Penn Tree-bank trees (Chiang, 2000).
The tree produced by composing the etrees is the derived/parse tree and the tree that records the history of composition is the derivation tree.
A reasonable way to define SRL features is to provide a strictly local dependency (i.e. within a single etree) between predicate and argument.
There have been many different proposals on how to maintain syntactic locality (Xia, 1999; Chen and Vijay-Shanker, 2000) and SRL locality (Chen and Rambow, 2003; Shen and Joshi, 2005) when extracting LTAG etrees from a Treebank.
These proposed methods are exemplified by the derivation tree γ1 in Fig. 1.
However, in most cases they can only provide a local dependency between predicate and argument for 87% of the argument constituents (Chen and Rambow, 2003), which is too low to provide high SRL accuracy.
In LTAG-based statistical parsers, high accuracy is obtained by using the Magerman-Collins head-percolation rules in order to provide the etrees (Chiang, 2000).
This method is exemplified by the derivation tree γ2 in Fig. 1.
Comparing γ1 with γ2 in Fig. 1 and assuming that join is the predicate and the NP is the potential argument, the path feature as defined over the LTAG derivation tree γ2 is more useful for the SRL task as it distinguishes between main clause and non-finite embedded clause predicates.
This alternative derivation tree also exploits the so-called extended domain of locality (Joshi and Schabes, 1997) (the examples in Section 2.1 show this clearly).
In this paper, we crucially rely on features defined on LTAG derivation trees of the latter kind.
We use polynomial kernels to create combinations of features defined on LTAG derivation trees.
2.1 LTAG-based Feature Extraction
In order to create training data for the LTAG-based features, we convert the Penn Treebank phrase structure trees into LTAG derivations.
First, we prune the Treebank parse tree using certain constraints.
Then we decompose the pruned parse trees into a set of LTAG elementary trees and obtain a derivation tree.
For each constituent in question, we extract features from the LTAG derivation tree.
We combine these features with the standard features used for SRL and train an SVM classifier on the combined LTAG derivation plus SRL annotations from the PropBank corpus.
For the test data, we report on results using the gold-standard Treebank data, and in addition we also report results on automatically parsed data using the Charniak parser (Charniak, 2000) as provided by the CoNLL 2005 shared task.
We did this for three reasons: (i) our results are directly comparable to those of others who have used the Charniak parses distributed with the CoNLL 2005 data-set; (ii) we avoid the possibility of a better parser identifying a larger number of argument constituents and thus leading to better results, which is orthogonal to the discriminative power of our proposed LTAG-based features; and (iii) the quality of LTAG derivation trees depends indirectly on the quality of head dependencies recovered by the parser and it is a well-known folklore result (see Table 3 in (McDonald et al., 2005)) that applying the head-percolation heuristics on parser output produces better dependencies when compared to dependencies directly recovered by the parser (whether the parser is an LTAG parser or a lexicalized PCFG parser).
Given a parse tree, the pruning component identifies the predicate in the tree and then only admits those nodes that are sisters to the path from the predicate to the root.
It is commonly used in the SRL community (cf. (Xue and Palmer, 2004)) and our experiments show that 91% of the SRL targets can be recovered despite this aggressive pruning.
We make two enhancements to the pruned PropBank tree: (i) we enrich the sister nodes with head information, namely a part-of-speech tag and word pair (t, w), and (ii) PP nodes are expanded to include the NP complement of the PP (including head information).
The target SRL node is still the PP.
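The pruning step described above can be sketched as follows. This is a minimal illustration rather than the code used in the experiments: the Node class, the toy tree, and the labels NP-subj and NP-obj are hypothetical, and the head enrichment and PP expansion are omitted.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """Hypothetical parse-tree node with a link back to its parent."""
    label: str
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def prune_candidates(predicate: Node) -> List[Node]:
    """Collect the sisters of every node on the path from the predicate to the root."""
    candidates: List[Node] = []
    current = predicate
    while current.parent is not None:
        for sister in current.parent.children:
            if sister is not current:
                candidates.append(sister)
        current = current.parent
    return candidates


# Toy tree (an assumption for illustration): (S NP-subj (VP VB NP-obj PP))
root = Node("S")
subj = root.add(Node("NP-subj"))
vp = root.add(Node("VP"))
vb = vp.add(Node("VB"))
obj = vp.add(Node("NP-obj"))
pp = vp.add(Node("PP"))

kept = [n.label for n in prune_candidates(vb)]
print(kept)
```

Only the sisters of the predicate and of its ancestors survive; material nested inside those sisters is not considered as a separate candidate.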
Figure 2 is a pruned parse tree for a sentence from the PropBank.
After pruning, the pruned tree is decomposed around the predicate using standard head-percolation based heuristic rules1 to convert a Treebank tree into an LTAG derivation tree.
Figure 3 shows the resulting etrees after decomposition.
Figure 4 is the derivation tree for the entire pruned tree.
Each node in this derivation tree represents an etree in Figure 3.
In our model we make an independence assumption that each SRL is assigned to each constituent independently, conditional only on the path from the predicate etree to the argument etree in the derivation tree.
Different etree siblings in the LTAG derivation tree do not influence each other in our current models.
We defined 5 LTAG feature categories: predicate etree-related features (P for short), argument etree-related features (A), subcategorization-related features (S), topological relation-related features (R), intermediate etree-related features (I).
Since we consider up to 6 intermediate etrees between the predicate and the argument etree, we use I-1 to I-6 to represent these 6 intermediate trees respectively.
Figure 2: The pruned tree for the sentence "Ports of Call Inc. reached agreements to sell its remaining seven aircraft to buyers that weren't disclosed."
1using http://www.isi.edu/~chiang/software/treep/treep.html
Figure 3: Elementary trees after decomposition of the pruned tree.
Category P: Predicate etree & its variants. The predicate etree is an etree with the predicate, such as e0 in Figure 3.
This new feature complements the predicate feature in the standard SRL feature set.
One variant is to remove the predicate lemma.
Another variant is a combination of the predicate tree without predicate lemma and POS, and voice.
In addition, this variant combined with the predicate lemma comprises another new feature.
In the example, these three variants are (VP(VB)), (VP)_active and (VP)_active_sell respectively.
Category A: Argument etree & its variants. Analogous to the predicate etree, the argument etree is an etree with the target constituent and its head.
Similar to the predicate etree related features, we consider the argument etree, the argument etree with the head word removed, and the combination of the argument etree without head POS and head word with the head Named Entity (NE) label (if any).
For example, in Figure 3, these 3 features for e6 are e6, (NP(NNP)) and (NP)_LOC, with head word "Inc." having NE label "LOC".
Figure 4: LTAG derivation tree for Figure 2.
Category S: Index of current argument etree in subcat frame of predicate etree. Sub-categorization is a standard feature that denotes the immediate expansion of the predicate's parent.
For example, it is V_NP_PP for predicate sell in the given sentence.
For argument etree e1 in Figure 3, the index feature value is 1 since it is the very first element in the (ordered) subcat sequence.
Category R:
Relation type between argument etree & predicate etree. This feature is a combination of position and modifying relation.
Position is a binary valued standard feature to describe if the argument is before or after the predicate in a parse tree.
For each argument etree and intermediate etree, we consider three types of modifying relations they may have with the predicate etree: modifying (value 1), modified (value 2) and neither (value 3).
From Figure 4, we can see e1 modifies e0 (predicate tree).
So their modifying relation type value is 1; combining this value with the position value, this feature for e1 is "1_after".
Attachment point of argument etree. This feature describes where the argument etree is sister-adjoined/adjoined to the predicate etree, or the other way around.
For e1 in the example, VP in the predicate tree is the attachment point.
Distance. This feature is the number of intermediate etrees between argument etree and predicate etree in the derivation tree.
In Figure 4, the distance from e4 to the predicate etree is 1 since only one intermediate etree e3 is between them.
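The relation type and distance features can be illustrated on a small derivation-tree fragment. The sketch below is hypothetical: the child-to-parent map PARENT only encodes the two facts stated in the text (e1 attaches to e0, and e3 lies between e4 and e0), and treating any descendant of the predicate etree as "modifying" is an assumption made here for simplicity, not a definition taken from our system.

```python
from typing import Dict, List, Optional

# Hypothetical fragment of a derivation tree, stored as child -> parent links.
PARENT: Dict[str, Optional[str]] = {"e0": None, "e1": "e0", "e3": "e0", "e4": "e3"}


def ancestors(node: str) -> List[str]:
    """Return the ancestors of `node` in the derivation tree, nearest first."""
    chain: List[str] = []
    parent = PARENT[node]
    while parent is not None:
        chain.append(parent)
        parent = PARENT[parent]
    return chain


def intermediate_etrees(pred: str, arg: str) -> List[str]:
    """Etrees strictly between argument and predicate on the derivation-tree path."""
    up_arg = [arg] + ancestors(arg)
    up_pred = [pred] + ancestors(pred)
    common = next(n for n in up_arg if n in up_pred)  # lowest common ancestor
    path = up_arg[: up_arg.index(common) + 1] + up_pred[: up_pred.index(common)][::-1]
    return path[1:-1]


def distance(pred: str, arg: str) -> int:
    """Number of intermediate etrees between the argument and predicate etrees."""
    return len(intermediate_etrees(pred, arg))


def relation_type(pred: str, arg: str) -> int:
    """1 = modifying, 2 = modified, 3 = neither (assumed ancestor-based reading)."""
    if pred in ancestors(arg):
        return 1
    if arg in ancestors(pred):
        return 2
    return 3


def relation_feature(pred: str, arg: str, position: str) -> str:
    """Combine the relation type with the before/after position from the parse tree."""
    return f"{relation_type(pred, arg)}_{position}"


print(relation_feature("e0", "e1", "after"), distance("e0", "e4"))
```

The position value ("before" or "after") comes from the parse tree, so it is passed in rather than computed from the derivation tree.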
Category I:
Intermediate etree related features. Intermediate etrees are those etrees that are located between the predicate etree and argument etrees.
The set of features we propose for each intermediate etree is quite similar to those for argument etrees except we do not consider the named-entity label for head words in this case.
Relation type of intermediate etree & predicate etree.
Attachment point of intermediate etree.
Distance between intermediate etree and predicate etree.
Up to 6 intermediate etrees are considered and the category I features are extracted for each of them (if they exist).
Each etree represents a linguistically meaningful fragment.
The features of relation type, attachment point as well as the distance characterize the topological relations among the relevant etrees.
In particular, the attachment point and distance features can explicitly capture important information hidden in the standard path feature.
The intermediate tree related features can give richer contextual information between predicate tree and argument trees.
We added the subcat index feature to be complementary to the sub-cat and syntactic frame features in the standard feature set.
3 Standard Feature Set
Our standard feature set is a combination of features proposed by (Gildea and Jurafsky, 2002), (Surdeanu et al., 2003; Pradhan et al., 2004; Pradhan et al., 2005b) and (Xue and Palmer, 2004).
All features listed in Table 1 are used for argument classification in our baseline system; features marked with an asterisk are not used for argument identification2.
We compare this baseline SRL system with a system that includes a combination of these features with the LTAG-based features.
Our baseline uses all features that have been used in the state-of-the-art SRL systems and, as our experimental results show, these standard features do indeed obtain state-of-the-art accuracy on the SRL task.
2This is a standard idea in the SRL literature: removing features more useful for classification, e.g. named entity features, makes the classifier for identification more accurate.
Table 1: Standard features adopted by a typical SRL system. Features with asterisk * are not used for argument identification.
• predicate lemma and voice
• phrase type and head word
• path from phrase to predicate 1
• position: phrase relative to predicate: before or after
• sub-cat: records the immediate structure that expands from the predicate's parent
• predicate POS
• first/last word/POS
• POS of word immediately before/after phrase
• LCA (Lowest Common Ancestor) path from phrase to its lowest common ancestor with predicate
• punctuation immediately before/after phrase*
• content word named entity label for PP parent node*
Additional features proposed by (Xue and Palmer, 2004)
• predicate_phrase type
• predicate_head word
• voice_position
• syntactic frame* 2
1 In Fig. 2, NNS↑NP↑S↓VP↓VB is the path from the constituent NNS(agreements) to the predicate VB(sell) and the path length is 4.
2 This feature is different from the frame feature which usually refers to all the semantic participants for the particular predicate.
We will show that adding LTAG-based features can improve the accuracy over this very strong baseline.
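The standard path feature from Table 1 can be computed by walking from the phrase up to the lowest common ancestor and then down to the predicate. The following is a minimal sketch under stated assumptions: the Node class is hypothetical, the toy tree contains only the nodes on the example path, and ↑/↓ mark upward and downward steps.

```python
from typing import List, Optional, Tuple


class Node:
    """Hypothetical parse-tree node that only stores its label and parent."""

    def __init__(self, label: str, parent: Optional["Node"] = None) -> None:
        self.label = label
        self.parent = parent


def chain_to_root(node: Optional[Node]) -> List[Node]:
    """Return the node followed by all of its ancestors up to the root."""
    chain: List[Node] = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain


def path_feature(phrase: Node, predicate: Node) -> Tuple[str, int]:
    """Return the path string from phrase to predicate and its length in steps."""
    up = chain_to_root(phrase)
    down = chain_to_root(predicate)
    lca = next(n for n in up if n in down)  # lowest common ancestor
    up_part = up[: up.index(lca) + 1]
    down_part = down[: down.index(lca)][::-1]
    text = "↑".join(n.label for n in up_part)
    text += "".join("↓" + n.label for n in down_part)
    return text, len(up_part) - 1 + len(down_part)


# Toy tree containing only the nodes on the example path (an assumption).
s = Node("S")
np = Node("NP", s)
nns = Node("NNS", np)
vp = Node("VP", s)
vb = Node("VB", vp)

path, length = path_feature(nns, vb)
print(path, length)
```

Because every distinct sequence of labels yields a distinct feature value, long paths are rarely seen twice in training data, which is the sparsity problem that the derivation-tree features are meant to reduce.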
4 Experiments
4.1 Experimental Settings
Training data (PropBank Sections 2-21) and test data (PropBank Section 23) are taken from the CoNLL-2005 shared task.3 All the necessary annotation information such as predicates, parse trees as well as Named Entity labels is part of the data.
The ar-
3http://www.lsi.upc.es/~srlconll/.
SRL, see (Xue and Palmer, 2004; Moschitti, 2004)).
We chose these labels for our experiments because they have sufficient training/test data for the performance comparison and provide sufficient counts for accurate significance testing.
However, we also provide the evaluation result on the test set for full CoNLL-2005 task (all argument types).
We use SVM-light4 (Joachims, 1999) with a polynomial kernel (degree=3) as our binary classifier for argument classification.
We applied a linear kernel to argument identification because training in this phase is computationally very expensive.
We use 30% of the training samples to fine-tune the regularization parameter c and the loss-function cost parameter j for both stages of argument identification and classification.
With parameter validation experiments, we set c = 0.258 and j = 1 for the argument identification learner and c = 0.1 and j = 4 for the argument classification learner.
The classification performance is evaluated using Precision/Recall/F-score (p/r/f) measures.
We extracted all the gold labels of A0-A4 and AM with the argument constituent index from the original test data as the "gold output".
When we evaluate, we contrast the output of our system with the gold output and calculate the p/r/f for each argument type.
Our evaluation criterion, which is based on predicting the SRL for constituents in the parse tree, follows the evaluation used in (Toutanova et al., 2005).
However, we also predict and evaluate those PropBank arguments which do not have a corresponding constituent in the gold parse tree or the automatic parse tree: the missing constituent case.
We also evaluate discontinuous PropBank arguments using the notation used in the CoNLL-2005 data-set but we do not predict them.
This is in contrast with some previous studies, where the problematic cases have usually been discarded, or the largest constituents in the parse tree that almost capture the missing constituent cases are picked as being the correct answer.
Note that, in addition to the constituent based evaluation, in Section 4.4 we also provide the evaluation of our model on the CoNLL-2005 data-set.
Table 2: Argument identification results on test data (Gold Standard parses vs. Charniak Parser output, including std+ltag).
Because the main focus of this work is to evaluate the impact of the LTAG-based features, we did not consider the frameset or a distribution over the entire argument set or apply any inference/constraints as a post-processing stage as most current SRL systems do.
We focus our experiments on showing the value added by introducing LTAG-based features to the SRL task over and above what is currently used in SRL research.
4.2 Argument Identification
Table 2 shows results on argument identification (a binary classification of constituents into argument or non-argument).
To fully evaluate the influence of the LTAG-based features, we report the identification results on both Gold Standard parses and on Charniak parser output (Charniak, 2000)5.
2005b).
4.3 Argument Classification
Based on the identification results, argument classification will assign the semantic roles to the argument candidates.
For each argument of A0-A4 and AM, a "one-vs-all" SVM classifier is trained on both the standard feature set (std) and the augmented feature set (std+ltag).
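The one-vs-all scheme with a degree-3 polynomial kernel can be sketched as below. This is an illustration only, not SVM-light itself: the constant coef0 = 1, the feature names, and the two toy models with their support vectors, weights and biases are all hypothetical values chosen to make the example run.

```python
from typing import Dict, List, Tuple

Vector = Dict[str, float]                        # sparse feature vector
Model = Tuple[List[Tuple[float, Vector]], float]  # (weighted support vectors, bias)


def poly_kernel(x: Vector, y: Vector, degree: int = 3, coef0: float = 1.0) -> float:
    """Polynomial kernel (x . y + coef0) ** degree over sparse vectors."""
    dot = sum(value * y.get(name, 0.0) for name, value in x.items())
    return (dot + coef0) ** degree


def decision(model: Model, x: Vector) -> float:
    """Binary SVM score: sum_i (alpha_i * y_i) * K(sv_i, x) + b."""
    support_vectors, bias = model
    return sum(weight * poly_kernel(sv, x) for weight, sv in support_vectors) + bias


def predict(models: Dict[str, Model], x: Vector) -> str:
    """One-vs-all: return the label whose binary classifier scores highest."""
    return max(models, key=lambda label: decision(models[label], x))


# Hypothetical toy models, one binary classifier per role label.
models: Dict[str, Model] = {
    "A0": ([(0.5, {"position=before": 1.0, "phrase=NP": 1.0})], -1.0),
    "A1": ([(0.5, {"position=after": 1.0, "phrase=NP": 1.0})], -1.0),
}
candidate: Vector = {"position=after": 1.0, "phrase=NP": 1.0}

label = predict(models, candidate)
print(label)
```

With binary features, expanding the degree-3 polynomial produces products of up to three features, which is how the kernel creates feature combinations without listing them explicitly.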
Table 3 shows the classification results on the Gold-standard parses with the gold argument identification; Tables 4 and 5 show the classification results on the Charniak parser with the gold argument identification and the automatic argument identification respectively.
Scores for multi-class SRL are calculated based on the total number of correctly predicted labels, total number of gold labels and the number of labels in our prediction for this argument set.
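The three counts named above determine precision, recall and F-score in the usual way. A minimal sketch follows; the function name prf and the example counts are illustrative assumptions, not figures from our tables.

```python
from typing import Tuple


def prf(correct: int, predicted: int, gold: int) -> Tuple[float, float, float]:
    """Precision, recall and balanced F-score from label counts.

    correct   -- number of correctly predicted labels
    predicted -- number of labels in the system output
    gold      -- number of labels in the gold standard
    """
    precision = correct / predicted if predicted else 0.0
    recall = correct / gold if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Illustrative counts only.
p, r, f = prf(correct=80, predicted=100, gold=125)
print(f"{p:.2%} {r:.2%} {f:.2%}")
```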
5We use the parses supplied with the CoNLL-2005 shared task for reasons of comparison.
Table 3: Argument classification results on Gold-standard parses with gold argument boundaries
From the results shown in the tables, we can see that by adding the LTAG-based features, the overall performance of the systems is improved both for argument identification and for argument classification.
Tables 3 and 4 show that with the gold argument identification, the classification for each class in {A0, A1, A2, A3, AM} consistently benefits from LTAG-based features.
Especially for A3, LTAG-based features lead to more than 3 percent improvement.
But for A4 arguments, the performance drops 3 percent in both cases.
As shown in Table 5, which presents the argument classification results on Charniak parser output with the automatic argument identification, the prediction accuracy for classes A0, A1, A3, A4 and AM is improved, but drops a little for A2.
Table 4: Argument classification results on Charniak parser output with gold argument boundaries
is not directly comparable since their system used the more accurate n-best parser output of (Charniak and Johnson, 2005).
In addition their system also used global inference.
Our focus in this paper was to propose new LTAG features and to evaluate the impact of these features on the SRL task.
We also compared our proposed feature set against predicate/argument features (PAF) proposed by (Moschitti, 2004).
We conducted an experiment using SVM-light-TK-1.2 toolkit6.
The PAF tree kernel is combined with the standard feature vectors by a linear operator.
With settings of Table 5, its multi-class performance (p/r/f)% is 83.09/80.18/81.61 with linear kernel and 85.36/81.79/83.53 with polynomial kernel (degree=3) over std feature vectors.
Table 5: Argument classification results on Charniak parser output with automatic argument boundaries
4.5 Significance Testing
To assess the statistical significance of the improvements in accuracy we did a two-tailed significance test on the results of both Tables 2 and 5, where Charniak's parser outputs were used.
We chose SIGF7, which is an implementation of a computer-intensive, stratified approximate-randomization test (Yeh, 2000).
The statistical difference is assessed on SRL identification, classification for each class (A0-A4, AM) and the full SRL task (overall performance).
In Tables 2 and 5, we labeled numbers under std+ltag that are statistically significantly better than those under std with an asterisk.
The significance tests show that for identification and the full SRL task, the improvements are statistically significant, with p-values of 0.013 and 0.0001 respectively, at a confidence level of 95%.
The significance test on each class shows that the improvement by adding LTAG-based features is statistically significant for classes A0, A1, A3 and AM.
Even though in Table 5 the performance of A2 appears to be worse, it is not significantly so, and A4 is not significantly better.
In comparison, PAF did not perform significantly better than std, with a p-value of 0.593 at the same confidence level of 95%.
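The idea behind an approximate-randomization test can be sketched as follows. This is a generic paired version written for illustration and is not the SIGF implementation: the per-sentence count triples, the number of trials, the fixed seed and the toy data are all assumptions.

```python
import random
from typing import List, Sequence, Tuple

Counts = Tuple[int, int, int]  # (correct, predicted, gold) for one sentence


def f_score(items: Sequence[Counts]) -> float:
    """Micro-averaged F-score over per-sentence counts."""
    correct = sum(c for c, _, _ in items)
    predicted = sum(p for _, p, _ in items)
    gold = sum(g for _, _, g in items)
    if correct == 0:
        return 0.0
    precision, recall = correct / predicted, correct / gold
    return 2 * precision * recall / (precision + recall)


def approx_randomization(a: Sequence[Counts], b: Sequence[Counts],
                         trials: int = 1000, seed: int = 0) -> float:
    """Two-tailed p-value for the F-score difference between systems a and b."""
    rng = random.Random(seed)
    observed = abs(f_score(a) - f_score(b))
    at_least_as_large = 0
    for _ in range(trials):
        shuffled_a: List[Counts] = []
        shuffled_b: List[Counts] = []
        for item_a, item_b in zip(a, b):
            # Swap the two systems' outputs for this sentence with probability 0.5.
            if rng.random() < 0.5:
                item_a, item_b = item_b, item_a
            shuffled_a.append(item_a)
            shuffled_b.append(item_b)
        if abs(f_score(shuffled_a) - f_score(shuffled_b)) >= observed:
            at_least_as_large += 1
    return (at_least_as_large + 1) / (trials + 1)


# Toy data: system b finds both arguments in every sentence, system a finds one.
system_a = [(1, 2, 2)] * 30
system_b = [(2, 2, 2)] * 30
p_value = approx_randomization(system_a, system_b)
print(p_value < 0.05)
```

The swap is done per sentence so that both systems are always compared on the same inputs; only the assignment of outputs to systems is randomized.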
7http://www.coli.uni-saarland.de/~pado/sigf/index.html
Table 6: Impact of each LTAG feature category (P, R, S, A, I defined in Section 2.1.3) on argument classification and identification on CoNLL-2005 development set (WSJ Section 24). full denotes the full feature set, and we use -α to denote removal of a feature category of type α. For example, full-P is the feature set obtained by removing all P category features. std denotes the standard feature set.
5 Analysis of the LTAG-based features
We analyzed the drop in performance when a particular type of LTAG feature category is removed from the full set of LTAG features (we use the broad categories P, R, S, A, I as defined in Section 2.1.3).
Table 6 shows how much performance is lost (or gained) when a particular type of LTAG feature is dropped from the full set.
These experiments were done on the development set from CoNLL-2005 shared task, using the provided Charniak parses.
All the SVM models were trained using a polynomial kernel with degree 3.
It is clear that the S, A, I category features help in most cases and P category features hurt in most cases, including argument identification.
It is also worth noting that the R and I category features help most for identification.
This vindicates the use of LTAG derivations as a way to generalize long paths in the parse tree between the predicate and argument.
Although it seems LTAG features have negative impact on prediction of A3 arguments on this development set, dropping the P category features can actually improve performance over the standard feature set.
In contrast, for the prediction of A2 arguments, none of the LTAG feature categories seem to help.
Note that since we use a polynomial kernel in the full set, we cannot rule out the possibility that a feature that improves performance when dropped may still be helpful when combined in a non-linear kernel with features from other categories.
However, this analysis on the development set does indicate that overall performance may be improved by dropping the P feature category.
We plan to examine this effect in future work.
6 Related Work
There has been some previous work in SRL that uses LTAG-based decomposition of the parse tree.
(Chen and Rambow, 2003) use LTAG-based decomposition of parse trees (as is typically done for statistical LTAG parsing) for SRL.
Instead of extracting a typical "standard" path feature from the derived tree, (Chen and Rambow, 2003) use the path within the elementary tree from the predicate to the constituent argument.
Under this frame, they only recover semantic roles for those constituents that are localized within a single etree for the predicate, ignoring cases that occur outside the etree.
As stated in their paper, "as a consequence, adjunct semantic roles (ARGM's) are basically absent from our test corpus"; and around 13% of complement semantic roles cannot be found in etrees in the gold parses.
In contrast, we recover all SRLs by exploiting more general paths in the LTAG derivation tree.
A similar drawback can be found in (Gildea and Hockenmaier, 2003) where a parse tree path was defined in terms of Combinatory Categorial Grammar (CCG) types using grammatical relations between predicate and arguments.
The two relations they defined can only capture 77% of the arguments in PropBank and they had to use a standard path feature as a replacement when the defined relations cannot be found in CCG derivation trees.
In our framework, we use intermediate sub-structures from LTAG derivations to capture these relations instead of bypassing this issue.
Compared to (Liu and Sarkar, 2006), we have used a more sophisticated learning algorithm and a richer set of syntactic LTAG-based features in this task.
In particular, in this paper we built a strong baseline system using a standard set of features and did a thorough comparison between this strong baseline and our proposed system with LTAG-based features.
The experiments in (Liu and Sarkar, 2006) were conducted on gold parses and they failed to show any improvements after adding LTAG-based features.
Our experimental results show that LTAG-based features can help improve the performance of SRL systems.
While (Liu and Sarkar, 2006) propose some new features for SRL based on LTAG derivations, they do not show that their features are useful for SRL; in addition, we propose several novel features.
Our approach shares similar motivations with the approach in (Shen and Joshi, 2005) which uses PropBank information to recover an LTAG treebank as if it were hidden data underlying the Penn Treebank.
However their goal was to extract an LTAG grammar using PropBank information from the Treebank, and not the SRL task.
Features extracted from LTAG derivations are different and provide distinct information when compared to predicate-argument features (PAF) or sub-categorization features (SCF) used in (Moschitti, 2004) or even the later use of argument spanning trees (AST) in the same framework.
The adjunction operation of LTAG and the extended domain of locality are not captured by those features, as we have explained in detail in Section 2.
7 Conclusion and Future Work
In this paper we show that LTAG-based features improve on the best known set of features used in current SRL prediction systems: the F-score for argument identification increased from 86.26% to
task.
The analysis of the impact of each LTAG feature category shows that the intermediate etrees are important for the improvement.
In future work we plan to explore the impact that different types of LTAG derivation trees have on this SRL task, and explore the use of tree kernels defined over the LTAG derivation tree.
LTAG derivation tree kernels were previously used for parse re-ranking by (Shen et al., 2003).
Our work also provides motivation to do SRL and LTAG parsing simultaneously.
Acknowledgements
This research was partially supported by NSERC, Canada (RGPIN: 264905).
We would like to thank Aravind Joshi, Libin Shen, and the anonymous reviewers for their comments.
