This paper presents a tree-to-tree transduction method for text rewriting.
Our model is based on synchronous tree substitution grammar, a formalism that allows local distortion of the tree topology and can thus naturally capture structural mismatches.
We describe an algorithm for decoding in this framework and show how the model can be trained discriminatively within a large margin framework.
Experimental results on sentence compression show significant improvements over a state-of-the-art model.
1 Introduction
Recent years have witnessed increasing interest in text-to-text generation methods for many natural language processing applications ranging from text summarisation to question answering and machine translation.
At the heart of these methods lies the ability to perform rewriting operations according to a set of prespecified constraints.
For example, text simplification identifies which phrases or sentences in a document will pose reading difficulty for a given user and substitutes them with simpler alternatives (Carroll et al., 1999).
Sentence compression produces a summary of a single sentence that retains the most important information while remaining grammatical (Jing, 2000).
Ideally, we would like a text-to-text rewriting system that is not application specific.
Given a parallel corpus of training examples, we should be able to learn rewrite rules and how to combine them in order to generate new text.
A great deal of previous work has focused on the rule induction problem (Barzilay and McKeown, 2001; Pang et al., 2003; Lin and Pantel, 2001; Shinyama et al., 2002), whereas relatively little emphasis has been placed on the actual generation task (Quirk et al., 2004).
A notable exception is sentence compression for which end-to-end rewriting systems are commonly developed (Knight and Marcu, 2002; Turner and Charniak, 2005; Galley and McKeown, 2007; Riezler et al., 2003; McDonald, 2006).
The appeal of this task lies in its simplified formulation as a single rewrite operation, namely word deletion (Knight and Marcu, 2002).
Solutions to the compression task have been cast mostly in a supervised learning setting (but see
and Turner and Charniak (2005) for unsupervised methods).
Rewrite rules are learnt from a parsed parallel corpus and subsequently used to find the best compression from the set of all possible compressions for a given sentence.
A common assumption is that the tree structures representing long sentences and their compressions are isomorphic.
Consequently, the models are not generally applicable to other text rewriting problems since they cannot readily handle structural mismatches and more complex rewriting operations such as substitutions or insertions.
A related issue is that the tree structure of the compressed sentences is often poor; most algorithms delete words or constituents without paying too much attention to the structure of the compressed sentence.
However, without an explicit generation mechanism that allows tree transformations, there is no guarantee that the compressions will have well-formed syntactic structures.
And it will not be easy to process them for subsequent generation or analysis tasks.
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 73-82, Prague, June 2007.
©2007 Association for Computational Linguistics
In this paper we present a text-to-text rewriting model that scales to non-isomorphic cases and can thus naturally account for structural and lexical divergences.
Our approach is inspired by synchronous tree substitution grammar (STSG; Eisner, 2003), a formalism that allows local distortion of the tree topology.
We show how such a grammar can be induced from a parallel corpus and propose a large margin model for the rewriting task which can be viewed as a weighted tree-to-tree transducer.
Our learning framework makes use of the algorithm put forward by Tsochantaridis et al. (2005) which efficiently learns a prediction function to minimise a given loss function.
Experiments on sentence compression show significant improvements over the state-of-the-art.
Beyond sentence compression and related text-to-text generation problems (e.g., paraphrasing), our model is generally applicable to tasks involving structural mapping.
Examples include machine translation (Eisner, 2003) or semantic parsing (Zettlemoyer and Collins, 2005).
2 Related Work
Knight and Marcu (2002) proposed a noisy-channel formulation of sentence compression based on synchronous context-free grammar (SCFG).
The latter is a generalisation of the context-free grammar (CFG) formalism to simultaneously produce strings in two languages.
In the case of sentence compression, the grammar rules have two right hand sides, one corresponding to the source (long) sentence and the other to its target compression.
The synchronous derivations are learnt from a parallel corpus and their probabilities are estimated generatively.
Given a long sentence, $l$, the aim is to find the corresponding compressed sentence, $s$, which maximises $P(s)P(l \mid s)$ (here $P(s)$ is the source model and $P(l \mid s)$ the channel model).
Modifications of this model are reported in Turner and Charniak (2005) and Galley and McKeown (2007) with improved results.
The channel model is limited to tree deletion and does not allow any type of tree re-organisation.
Non-isomorphic tree structures are common when translating between languages.
It is therefore not surprising that most previous work on tree rewriting falls within the realm of machine translation.
Proposals include Eisner's (2003) synchronous tree substitution grammar (STSG), Melamed's (2004)
multitext grammar, and Graehl and Knight's (2004) tree-to-tree transducers.
Despite differences in formalism, all these approaches model the translation process using tree-based probabilistic transduction rules.
The grammar induction process requires EM training which can be computationally expensive especially if all synchronous rules are considered.
Our work formulates sentence compression in the framework of STSG (Eisner, 2003).
We propose a novel grammar induction algorithm that does not require EM training and is coupled with a separate large margin training process (Tsochantaridis et al., 2005) for weighting each rule.
McDonald (2006) also presents a sentence compression model that uses a discriminative large margin algorithm.
However, we differ in two important respects.
First, our generation algorithm is more powerful, performing complex tree transformations, whereas McDonald only considers simple word deletion.
Being tree-based, the generation algorithm is better able to preserve the grammaticality of the compressed output.
Second, our model can be tuned to a wider range of loss functions (e.g., tree-based measures).
3 Problem Formulation
We formulate sentence compression as an instance of the general problem of learning a mapping from input patterns $x \in \mathcal{X}$ to discrete structured objects $y \in \mathcal{Y}$. Our training sample consists of a parallel corpus of input (uncompressed) and output (compressed) pairs $(x_1, y_1), \ldots, (x_n, y_n) \in \mathcal{X} \times \mathcal{Y}$ and our task is to predict a target labelled tree $y$ from a source labelled tree $x$.
As we describe below, $y$ is not precisely a target tree, but rather a derivation which generates both the source and the target tree.
We model the dependency between x and y as a weighted STSG.
Grammar rules are of the form $\langle X, Y \rangle \rightarrow \langle \gamma, \alpha, \sim \rangle$ where $\gamma$ and $\alpha$ are elementary trees composed of a mixture of terminals and non-terminals, rooted with non-terminals $X$ and $Y$ respectively, and $\sim$ is a set of variable correspondences between pairs of frontier non-terminals in $\gamma$ and $\alpha$. A grammar rule specifies that we can substitute the trees $\gamma$ and $\alpha$ for corresponding $X$ and $Y$ nodes in the source and target trees respectively.
For example, the rule:
allows adjective phrases to be dropped from the source tree within an NP.
The indices are used to specify the variable correspondences, $\sim$.
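A rule of this kind can be held in a small data structure. The sketch below is only an illustration: the class names `Node` and `Rule`, the helper `variables`, and the concrete rule (an NP that keeps its noun and drops its adjective phrase) are hypothetical and are not taken from the paper's implementation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Node:
    """A node of an elementary tree; a frontier non-terminal carries a variable index."""
    label: str
    children: Tuple["Node", ...] = ()
    var: Optional[int] = None


def variables(node):
    """Collect the variable indices found on the frontier of an elementary tree."""
    found = {node.var} if node.var is not None else set()
    for child in node.children:
        found |= variables(child)
    return found


@dataclass(frozen=True)
class Rule:
    source: Node  # elementary tree rooted in X
    target: Node  # elementary tree rooted in Y

    def correspondences(self):
        """Variables present on both sides, i.e. the linked frontier non-terminals."""
        return variables(self.source) & variables(self.target)

    def dropped(self):
        """Source variables with no counterpart on the target side (deleted material)."""
        return variables(self.source) - variables(self.target)


# Hypothetical rule: <NP, NP> -> <[NP ADJP_1 NN_2], [NP NN_2]>
drop_adjp = Rule(
    source=Node("NP", (Node("ADJP", var=1), Node("NN", var=2))),
    target=Node("NP", (Node("NN", var=2),)),
)
print(drop_adjp.correspondences(), drop_adjp.dropped())
```

Storing the correspondences implicitly as shared indices keeps the two elementary trees independent of each other, which is what permits the local reordering and deletion that the formalism allows.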
Each grammar rule has a score from which the overall score of a compression y for sentence x can be derived.
These scores are learnt discriminatively using the large margin technique proposed by Tsochantaridis et al. (2005).
The synchronous rules are combined using a chart-based parsing algorithm (Eisner, 2003) to generate the derivation (i.e., compressed tree) with the highest score.
We begin by describing our STSG generation algorithm in Section 3.1.
We next explain how a synchronous grammar is induced from a parallel corpus of original sentences and their compressions (Section 3.2) and give the details of our learning framework (Section 3.3).
Generation aims to find the best target tree for a given source tree using the transformations specified by the synchronous grammar.
(We discuss how we obtain this grammar in the following section.)
Formally, given a source tree $x$ we seek:
$$y^* = \arg\max_{y} \mathrm{score}(x, y; w) \qquad (1)$$
where $y$ ranges over all target derivations (and therefore trees), $w$ is a parameter vector and $\mathrm{score}(\cdot)$ is an objective function measuring the quality of the derivation.
In common with many parsing methods, we encounter a problem with spurious ambiguity: i.e., there may be many derivations (sequences of rule applications) which produce the same target tree.
Ideally we would sum up the scores over all these derivations; however, for the sake of tractability we instead take the maximum score.
This allows us to pose the maximisation problem over derivations rather than target trees.
The generation algorithm uses a dynamic program defined over the constituents in the source tree as shown in Figure 1 (see also Eisner (2003)).
The algorithm makes the assumption that the scoring function decomposes with the derivation, such that a partial score can be evaluated at each step, i.e., $\mathrm{score}(x, y; w) = \sum_{r \in y} \mathrm{score}(r; w)$ where $r$ are the rules used in the derivation.
This method builds a chart of the best scoring partial derivation for each source subtree headed by a given target nonterminal.
11: find best derivation using back-pointers from (root, $c_{best}$)
Figure 1: Generation algorithm to find the best derivation. $n_r$ and $n_v$ are the source nodes indexed by the rule's source side (root and variable), while $c_r$ and $c_v$ are the non-terminal categories of the rule's target side (root and variable).
Figure 2: Example of a rule application during generation. The dashed area shows a matching rule for the VP node.
The inductive step is applied recursively bottom-up, and involves applying a grammar rule to a node in the source tree.
Rules with substitution variables in their frontier are scored with reference to the chart for the matching nodes and target nonterminal categories.
Once the process is complete, we can read the best score from the chart cell for the root node, and the best derivation can be constructed by traversing back-pointers also stored in the chart.
This is illustrated in Figure 2, where the rule shown in the dashed area is applied to the top VP node.
The score of the resulting tree would reference the chart to calculate the score for the best target tree at the ADJP node with syntactic category NP.
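The dynamic program described in this section can be sketched as follows. This is a simplified illustration under several assumptions: rules are restricted to depth-one elementary trees, the chart is filled by memoised recursion (which visits cells in the same bottom-up dependency order), and the `Rule` tuple, the `generate` function and the example scores are all hypothetical.

```python
from collections import namedtuple

# Hypothetical simplified rule: the source side is one CFG production; the target
# side is a sequence of words and (child_index, target_category) variables.
Rule = namedtuple("Rule", "src_label src_children tgt_cat tgt_items score")


def generate(tree, rules, root_cat):
    """Return (score, target words) of the best derivation of `tree` as `root_cat`.

    `tree` is a nested tuple (label, child, ...) whose leaves are word strings.
    Returns None if no derivation exists.
    """
    chart = {}  # (source node id, target category) -> best (score, words) or None

    def best(node, cat):
        key = (id(node), cat)
        if key in chart:
            return chart[key]
        label, children = node[0], node[1:]
        child_labels = tuple(c if isinstance(c, str) else c[0] for c in children)
        result = None
        for rule in rules:
            if (rule.src_label, rule.src_children, rule.tgt_cat) != (label, child_labels, cat):
                continue
            total, words, complete = rule.score, [], True
            for item in rule.tgt_items:
                if isinstance(item, str):  # terminal on the target side
                    words.append(item)
                    continue
                child_index, child_cat = item  # variable: consult the chart
                sub = best(children[child_index], child_cat)
                if sub is None:
                    complete = False
                    break
                total += sub[0]
                words.extend(sub[1])
            if complete and (result is None or total > result[0]):
                result = (total, words)
        chart[key] = result
        return result

    return best(tree, root_cat)


tree = ("NP", ("ADJP", ("JJ", "good")), ("NN", "documentation"))
rules = [
    Rule("NP", ("ADJP", "NN"), "NP", ((1, "NN"),), 1.0),                # drop the ADJP
    Rule("NP", ("ADJP", "NN"), "NP", ((0, "ADJP"), (1, "NN")), 0.5),    # keep both
    Rule("ADJP", ("JJ",), "ADJP", ((0, "JJ"),), 0.0),
    Rule("JJ", ("good",), "JJ", ("good",), -2.0),
    Rule("NN", ("documentation",), "NN", ("documentation",), 1.0),
]
print(generate(tree, rules, "NP"))
```

Because the score is a sum over rules, each chart cell only needs to remember its single best entry; the full system would store back-pointers instead of word lists and would match multi-level elementary trees.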
3.2 Grammar Induction
Our induction algorithm automatically finds grammar rules from a word-aligned parsed parallel corpus.
The rules are pairs of elementary trees (i.e., tree fragments) whose leaf nodes are linked by the word alignments.
These leaves can be either terminal or non-terminal symbols.
Initially, the algorithm extracts tree pairs from word-aligned text by choosing aligned constituents in the source and the target.
These pairs are then generalised using subtrees which are also extracted, resulting in synchronous rules with variable nodes.
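The set of aligned tree pairs is extracted using the alignment template method (Och and Ney, 2004), constrained to syntactic constituent pairs:
$$C = \{(n_S, n_T) : (\exists (s, t) \in A : s \in Y(n_S) \wedge t \in Y(n_T)) \wedge (\nexists (s, t) \in A : s \in Y(n_S) \veebar t \in Y(n_T))\} \qquad (2)$$
where $n_S$ and $n_T$ are source and target tree nodes (subtrees), $A = \{(s, t)\}$ is the set of word alignments (pairs of word-indices), $Y(\cdot)$ returns the yield span for a subtree and $\veebar$ is the exclusive-or operator.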
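The constituent-pair constraint can be illustrated with a short check over yield spans. The sketch assumes one reading of the constraint: a pair is kept when at least one alignment link falls inside both yields and no link has exactly one end inside (the exclusive-or case). The function name `aligned_pairs` and the span dictionaries are hypothetical.

```python
def aligned_pairs(source_spans, target_spans, alignments):
    """Pair source and target constituents whose yields agree with the word alignment.

    source_spans / target_spans: dicts mapping a node name to an inclusive
    (start, end) yield span. alignments: set of (s, t) word-index pairs.
    """
    def inside(index, span):
        return span[0] <= index <= span[1]

    pairs = []
    for n_s, s_span in source_spans.items():
        for n_t, t_span in target_spans.items():
            linked = any(inside(s, s_span) and inside(t, t_span) for s, t in alignments)
            # Exclusive-or: a link with exactly one end inside the pair is inconsistent.
            crossing = any(inside(s, s_span) != inside(t, t_span) for s, t in alignments)
            if linked and not crossing:
                pairs.append((n_s, n_t))
    return pairs


# "documentation is very good" -> "documentation is good"
source_spans = {"S": (0, 3), "NP": (0, 0), "VP": (1, 3), "ADJP": (2, 3)}
target_spans = {"S": (0, 2), "NP": (0, 0), "VP": (1, 2), "ADJP": (2, 2)}
alignments = {(0, 0), (1, 1), (3, 2)}
print(aligned_pairs(source_spans, target_spans, alignments))
```

Unaligned words (here "very") do not block a pair, which is what lets deleted material sit inside an aligned constituent.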
The next step is to generalise the candidate pairs by replacing subtrees with variable nodes.
We could fully trust the word alignments and adopt a strategy in which the rules are generalised as much as possible and thus include little lexicalisation.
Figure 3 shows a simple sentence pair and the resulting synchronous rules according to this generalisation strategy.
Alternatively, we could extract every possible rule by including unlexicalised rules, lexicalised rules and their combination.
The downside here is that the total number of possible rules is factorial in the size of the candidate set.
We address this problem by limiting the number of variables and the recursion depth, and by filtering out singleton rules.
There is no guarantee that the induced rules will generalise well to a testing set.
For example, the testing data may have a rule which was not seen in the training set (e.g., a new terminal or non terminal).
In this case no rule can be applied and subsequently generation fails.
For this reason we allow the model to duplicate any CFG production from the source tree, and use a feature to flag that this rule was unseen in training.
These SCFG rules are then merged with the induced rules and fed into the feature detection module (see Section 3.3 for details).
We now describe how the parameters of our STSG generation system are fit to a supervised training set.
For a given source tree, the space of sister target trees implied by the synchronous grammar is often very large, and the majority of these trees are ungrammatical or are poor compressions.
Figure 3: Induced synchronous grammar from a sentence pair using a strategy that extracts general rules.
The training procedure learns weights such that the model can discriminate between these trees and predict a good target tree.
For this we develop a discriminative training process which learns a weighted tree-to-tree transducer.
Our model is based on Tsochantaridis et al.'s (2005) framework for learning Support Vector Machines (SVMs) with structured output spaces, using the SVMstruct implementation.1 We briefly summarise the approach below; for a more detailed description we refer the interested reader to Tsochantaridis et al. (2005).
Traditionally SVMs learn a linear classifier that separates two or more classes with the largest possible margin.
1 http://svmlight.joachims.org/svm_struct.html
Analogously, structured SVMs attempt to separate the correct structure from all other structures with a large margin.
Given an input instance $x$, we search for the optimum output $y$ under the assumption that $x$ and $y$ can be adequately described using a combined feature vector representation $\Psi(x, y)$.
Recall that x are the source trees and y are synchronous derivations which generate both x and a target tree.
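The goal of the training procedure is to find a parameter vector $w$ such that it satisfies the condition:
$$\forall i, \forall y \in \mathcal{Y} \setminus y_i : \langle w, \Psi(x_i, y_i) - \Psi(x_i, y) \rangle \geq 0 \qquad (3)$$
where $x_i, y_i$ are the $i$th training source tree and target derivation.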
To obtain a unique solution — there will be several parameter vectors w satisfying (3) if the training instances are linearly separable — Tsochantaridis et al. (2005) select the w that maximises the minimum distance between yi and the closest runner-up structure.
The framework also incorporates a loss function.
This property is particularly appealing in the context of sentence compression and generally text-to-text generation.
For example, a compression that differs from the gold standard with respect to one or two words should be treated differently from a compression that bears no resemblance to it.
Another important factor is the length of the compression.
Compressions whose length is similar to the gold standard should be preferable to longer or shorter output.
A loss function $\Delta(y_i, y)$ quantifies the accuracy of prediction $y$ with respect to the true output value $y_i$.
We give details of the loss functions we employed for the compression task below.
We are now ready to state the learning objective for the structured SVM.
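We use the soft-margin formulation which allows errors in the training set, via the slack variables $\xi_i$:
$$\min_{w, \xi} \frac{1}{2} \|w\|^2 + \frac{C}{n} \sum_{i=1}^{n} \xi_i, \quad \xi_i \geq 0 \qquad (4)$$
$$\forall i, \forall y \in \mathcal{Y} \setminus y_i : \langle w, \Psi(x_i, y_i) - \Psi(x_i, y) \rangle \geq 1 - \frac{\xi_i}{\Delta(y_i, y)}$$
Slack variables $\xi_i$ are introduced here for each training example $x_i$, $C$ is a constant that controls the trade-off between training error minimisation and margin maximisation.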
The optimisation problem in (4) is approximated using a polynomial time cutting plane algorithm (Tsochantaridis et al., 2005).
This optimisation crucially relies on finding the constraint incurring the maximum cost.
The cost function for slack rescaling can be formulated as:
$$H(y) = (1 - \langle w, \Psi(x_i, y_i) - \Psi(x_i, y) \rangle) \, \Delta(y_i, y) \qquad (5)$$
In order to adapt this framework to our generation problem, we must provide the feature mapping $\Psi(x, y)$, a loss function $\Delta(y_i, y)$, and a maximiser $\hat{y} = \arg\max_{y \in \mathcal{Y}} H(y)$ (see (5)).
The following sections describe how these are instantiated in the sentence compression task.
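To make the role of the maximiser concrete, the sketch below evaluates the slack-rescaling cost of Tsochantaridis et al. (2005), taken here to be H(y) = (1 - <w, Psi(x_i, y_i) - Psi(x_i, y)>) * Delta(y_i, y), over an explicit list of candidates. Enumerating candidates is an assumption made for illustration only; the function names, feature names and numbers are hypothetical.

```python
def dot(weights, features):
    """Inner product between a sparse weight vector and a sparse feature vector."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def cost(weights, psi_gold, psi_candidate, loss):
    """Slack-rescaling cost: (1 - <w, psi_gold - psi_candidate>) * loss."""
    names = set(psi_gold) | set(psi_candidate)
    delta = {n: psi_gold.get(n, 0.0) - psi_candidate.get(n, 0.0) for n in names}
    return (1.0 - dot(weights, delta)) * loss


def most_violated(weights, psi_gold, candidates):
    """Pick the (features, loss) candidate with the highest cost."""
    return max(candidates, key=lambda c: cost(weights, psi_gold, c[0], c[1]))


weights = {"keep": 1.0, "drop": 0.5}
psi_gold = {"keep": 2, "drop": 1}
candidates = [
    ({"keep": 3}, 0.5),  # scores higher than the gold derivation, moderate loss
    ({"drop": 3}, 1.0),  # scores lower than the gold derivation, maximal loss
]
print(most_violated(weights, psi_gold, candidates))
```

The first candidate wins even though its loss is smaller, because the model currently scores it above the gold derivation; this is the sense in which the selected constraint combines a high loss with a high model score.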
Feature Mapping We devised a general feature set suitable for compression and paraphrasing.
Our feature space is defined over source trees (x) and target derivations (y).
All features apply to a single grammar rule; a feature vector for a derivation is expressed as the sum of the feature vectors for each rule in this derivation.
We make use of syntactic, lexical, and compression specific features.
Our simplest syntactic feature is the identity of a synchronous rule.
Specifically, we record its source tree, its target tree and their combination.
We also include rule frequencies $\phi(\text{target} \mid \text{source})$, $\phi(\text{source} \mid \text{target})$ and $\phi(\text{source}, \text{target})$.
Another feature records the frequencies of the CFG productions used in the target side of a rule.
This allows the model to learn the weights of a CFG generation grammar, as a proxy for a language model.
Using scores from a pre-trained CFG grammar or an n-gram language model might be preferable when the training sample is small, however we leave this as future work.
Our last syntactic feature keeps track of the source root and the target root non-terminals.
Our lexical features contain the list of tokens in the source yield, target yield, and both.
We also use words as features.
2 Alternatively, the loss function can be used to rescale the margin. This approach is less desirable as it is not scale invariant (Tsochantaridis et al., 2005). We also found empirically that slack rescaling slightly outperforms margin rescaling on our compression task.
Finally, we have implemented a set of compression-specific features.
These include a feature that detects if the yield of the target side of a synchronous rule is a subset of the yield of its source.
We also take note of the edit operations (i.e., removal, insertion) required to transform the source side into the target.
Edit operations are recorded separately for trees and their yields.
In order to encourage compression, we also count the number of words on the target, the number of rules used in the derivation and the number of dropped variables.
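Since every feature applies to a single rule, the feature vector of a derivation is a sum of per-rule vectors, and a linear score then decomposes over rules in the way the generation algorithm requires. The sketch below illustrates this with two made-up features (rule identity and the number of target words); the rule encoding and all names are hypothetical.

```python
from collections import Counter


def rule_features(rule):
    """Hypothetical per-rule features: rule identity and number of target words."""
    source_label, target_words = rule
    identity = f"rule={source_label}->{' '.join(target_words)}"
    return Counter({identity: 1, "target_words": len(target_words)})


def derivation_features(derivation):
    """Feature vector of a derivation: the sum of its rules' feature vectors."""
    total = Counter()
    for rule in derivation:
        total.update(rule_features(rule))
    return total


def score(weights, features):
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


derivation = [("NP", ("documentation",)), ("VP", ("is", "good"))]
weights = {"target_words": -0.5, "rule=VP->is good": 2.0}

whole = score(weights, derivation_features(derivation))
per_rule = sum(score(weights, rule_features(r)) for r in derivation)
print(whole, per_rule)
```

Both quantities agree, which is why partial scores can be accumulated cell by cell in the chart.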
Loss Functions The large margin configuration sketched above is quite modular and in theory a wide range of loss functions could be specified.
Examples include edit-distance, precision, F-score, BLEU and tree-based measures.
In practice, the loss function should be compatible with our maximisation algorithm which requires the objective function to decompose along the same lines as the tree derivation.3 Given this restriction, we define a loss based on position-independent unigram precision (Prec) which penalises errors in the yield independently for each word.
Although fairly intuitive, this loss is far from ideal.
First, it maximally rewards repeatedly predicting the same word if the latter is in the reference target tree.
Secondly, it may bias towards overly short output which drops core information — one-word compressions will tend to have higher precision than longer output.
To counteract this, we introduce two brevity penalty measures (BP) inspired by BLEU (Papineni et al., 2002) which we incorporate into the loss function using a product, $\text{loss} = 1 - \text{Prec} \cdot \text{BP}$:
$$\text{BP}_1 = \exp(1 - \max(1, r/c)), \qquad \text{BP}_2 = \exp(1 - \max(c/r, r/c)) \qquad (6)$$
where $r$ is the reference length and $c$ is the candidate length.
BP1 is one-sided and, as in BLEU, penalises only output shorter than the reference, whereas the two-sided BP2 takes the value one when $c = r$ and decays towards zero for $c < r$ and $c > r$. In both cases, brevity is assessed against the gold standard target (not the source) to allow the system to learn the correct degree of compression from the training data.
3 Optimising non-decompositional loss functions complicates the objective function, which then cannot be solved efficiently using a dynamic program.
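A precision-based loss with a brevity penalty can be sketched as below. The two penalty forms are assumptions chosen to match the description in the text (a BLEU-style one-sided penalty, and a two-sided penalty equal to one when the lengths match); the function names are hypothetical and a non-empty reference is assumed.

```python
import math


def unigram_precision(candidate, reference):
    """Position-independent precision: a token is correct if it occurs in the reference."""
    if not candidate:
        return 0.0
    vocabulary = set(reference)
    return sum(token in vocabulary for token in candidate) / len(candidate)


def bp_one_sided(c, r):
    """Assumed BLEU-style penalty: only candidates shorter than the reference are penalised."""
    return math.exp(1 - max(1.0, r / c)) if c else 0.0


def bp_two_sided(c, r):
    """Assumed two-sided penalty: one when c == r, decaying for shorter and longer output."""
    return math.exp(1 - max(c / r, r / c)) if c else 0.0


def loss(candidate, reference, penalty=bp_two_sided):
    precision = unigram_precision(candidate, reference)
    return 1.0 - precision * penalty(len(candidate), len(reference))


reference = "the bank boosted reserves".split()
print(loss(reference, reference))                   # identical output
print(loss(["bank"], reference))                    # overly short output
print(loss(["bank"] * 4, reference))                # repeated word, matching length
```

The last call illustrates the weakness noted above: repeating a reference word four times has perfect precision and the right length, so this loss cannot tell it apart from the gold standard.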
Maximisation Algorithm Our algorithm finds the maximising derivation for H(y) in (5).
This derivation will have a high loss and a high score under the model, and therefore represents the most-violated constraint which is then added to the SVM's working set of constraints (see (4)).
The standard generation method from Section 3.1 cannot be used without modification to find the best scoring derivation since it does not account for the loss function or the gold standard derivation.
Instead, we stratify the generation chart with the number of true and false positive tokens predicted, as described in Joachims (2005).
These contingency values allow us to compute the precision and brevity penalty (see (6)) for each complete derivation.
This is then combined with the derivation score and the gold standard derivation score to give H(y).
The gold standard derivation features, $\Psi(x_i, y_i)$, must be calculated from a derivation linking the source tree to the gold target tree.
As there may be many such derivations, we find a unique derivation using the smallest rules possible (for maximum generality).
This is done using a dynamic program, similar to the inside-outside algorithm used in parsing.
Other strategies are also possible; however, we leave this to future work.
Finally, we can find the global maximum $H(y)$ by maximising over all the root chart entries.
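The final maximisation over root chart entries might look like the sketch below. It assumes that each root entry is keyed by its counts of true and false positive tokens and stores the best model score reached with those counts, that the loss uses a two-sided brevity penalty of the form exp(1 - max(c/r, r/c)), and that the cost is the slack-rescaling form (1 - (gold score - model score)) * loss. The function name and the example numbers are hypothetical.

```python
import math


def max_cost_entry(root_cells, gold_score, ref_len):
    """Return the (true_pos, false_pos) key and cost of the highest-cost root entry.

    root_cells maps (true_pos, false_pos) -> best model score with those counts.
    """
    best_key, best_cost = None, -math.inf
    for (true_pos, false_pos), model_score in root_cells.items():
        length = true_pos + false_pos
        if length == 0:
            loss = 1.0  # empty output: treated as maximal loss
        else:
            precision = true_pos / length
            brevity = math.exp(1 - max(length / ref_len, ref_len / length))
            loss = 1.0 - precision * brevity
        cost = (1.0 - (gold_score - model_score)) * loss
        if cost > best_cost:
            best_key, best_cost = (true_pos, false_pos), cost
    return best_key, best_cost


root_cells = {(4, 0): 2.0, (2, 0): 1.5, (3, 3): 2.5}
print(max_cost_entry(root_cells, gold_score=2.0, ref_len=4))
```

Keeping one best score per contingency pair is what allows a loss that does not decompose over rules to be evaluated exactly once the derivation is complete.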
4 Evaluation Set-up
In this section we present our experimental set-up for assessing the performance of the max margin model described above.
We give details of the corpora used, briefly introduce McDonald's (2006) sentence compression model used for comparison with our approach, and explain how system output was evaluated.
Corpora We evaluated our system on two different corpora.
The first is the compression corpus of Knight and Marcu (2002) derived automatically from the document-abstract pairs of the Ziff-Davis corpus.
Previous compression work has almost exclusively used this corpus.
Our experiments follow Knight and Marcu's partition of training, test, and development sets (1,002/36/12 instances).
We also present results on Clarke and Lapata's (2006a) Broadcast News corpus.4 This corpus was created manually (annotators were asked to produce compressions for 50 Broadcast news stories) and poses more of a challenge than Ziff-Davis.
Being a speech corpus, it often contains incomplete and ungrammatical utterances and speech artefacts such as disfluencies, false starts and hesitations.
Furthermore, spoken utterances have varying lengths, some are very wordy whereas others cannot be reduced any further.
Thus a hypothetical compression system trained on this domain should be able to leave some sentences uncompressed.
Again we used Clarke and Lapata's training, test, and development set split (882/410/78 instances).
Comparison with State-of-the-art We evaluated our approach against McDonald's (2006) discriminative model.
This model is a good basis for comparison for several reasons.
First, it achieves competitive performance with Knight and Marcu's (2002) decision tree and noisy channel models.
Second, it also uses large margin learning.
Sentence compression is formulated as a string-to-substring mapping problem with a deletion-based Hamming loss.
Recall that our formulation involves a tree-to-tree mapping.
Third, it uses a feature space complementary to ours.
For example, features are defined between adjacent words, and syntactic evidence is incorporated indirectly into the model.
In contrast our model relies on synchronous rules to generate valid compressions and does not explicitly incorporate adjacency features.
We used an implementation of McDonald (2006) for comparison of results (Clarke and Lapata, 2007).
Evaluation Measures In line with previous work we assessed our model's output by eliciting human judgements.
Participants were presented with an original sentence and its compression and asked to rate the latter on a five-point scale based on the information retained and its grammaticality.
4 The corpus can be downloaded from http://homepages.inf.ed.ac.uk/s0460084/data/.
O: I just wish my parents and my other teachers could be like this teacher, so we could communicate.
M: I wish my teachers could be like this teacher.
S: I wish my teachers could be like this, so we could communicate.
G: I wish my parents and other teachers could be like this, so we could communicate.
O: Earlier this week, in a conference call with analysts, the bank said it boosted credit card reserves by $350 million.
M: Earlier said credit card reserves by $350 million.
S: In a conference call with analysts, the bank boosted card reserves by $350 million.
G: In a conference call with analysts the bank said it boosted credit card reserves by $350 million.
Table 1: Compression examples from the Broadcast news corpus (O: original sentence, M: McDonald (2006), S: STSG, G: gold standard).
We conducted two separate elicitation studies, one for the Ziff-Davis and one for the Broadcast news dataset.
In both cases our materials consisted of 96 source-target sentences.
These included gold standard compressions and the output of our system and McDonald's (2006).
We were able to obtain ratings on the entire Ziff-Davis test set as it has only 32 instances; this was not possible for Broadcast news as the test section consists of 410 instances.
Consequently, we randomly selected 32 source-target sentences to match the size of the Ziff-Davis test set.5 We collected ratings from 60 unpaid volunteers, all self reported native English speakers.
Both studies were conducted over the Internet.
Examples of our experimental items are given in Table 1.
We also report results using F1 computed over grammatical relations (Riezler et al., 2003).
We chose F1 (as opposed to accuracy or edit distance-based measures) as Clarke and Lapata (2006b) show that it correlates reliably with human judgements.
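For illustration, F1 over grammatical relations can be computed as a set overlap between the relations of the system output and those of the gold standard. The sketch assumes relations are given as (relation, head, dependent) triples; how they are extracted from a parser is outside its scope, and the function name and example triples are hypothetical.

```python
def relation_f1(predicted, gold):
    """F1 between two collections of (relation, head, dependent) triples."""
    predicted, gold = set(predicted), set(gold)
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = {
    ("subj", "boosted", "bank"),
    ("obj", "boosted", "reserves"),
    ("mod", "reserves", "card"),
    ("mod", "reserves", "credit"),
}
predicted = {
    ("subj", "boosted", "bank"),
    ("obj", "boosted", "reserves"),
    ("mod", "reserves", "million"),
}
print(relation_f1(predicted, gold))
```

Unlike a string-level measure, this rewards output that preserves who did what to whom even when the surface wording differs.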
5 Experiments
The framework presented in Section 3 is quite flexible.
Depending on the grammar induction strategy, choice of features, loss function and maximisation algorithm, different classes of models can be derived.
Before presenting our results in detail we discuss the specific model employed in our experiments and explain how its parameters were instantiated.
5 A Latin square design ensured that subjects did not see two different compressions of the same sentence.
Figure 4: Compression rate vs. grammatical relations F1 using unigram precision alone and in combination with two brevity penalties.
In order to build a compression model we need a parallel corpus of syntax trees.
We obtained syntactic analyses for source and target sentences with Bikel's (2002) parser.
Our corpora were automatically aligned with Giza++ (Och et al., 1999) in both directions between source and target and symmetrised using the intersection heuristic (Koehn et al., 2003).
Each word in the lexicon was also aligned with itself.
This was necessary in order to inform Giza++ about word identity.
Unparseable sentences and those longer than 50 tokens were removed from the data set.
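The intersection heuristic keeps only those word links that both alignment directions agree on. A minimal sketch follows, assuming each direction is available as a set of index pairs; it does not reproduce the Giza++ file formats or interface, and the function name is hypothetical.

```python
def symmetrise_intersection(source_to_target, target_to_source):
    """Keep only links proposed in both alignment directions.

    source_to_target: set of (s, t) pairs; target_to_source: set of (t, s) pairs.
    """
    flipped = {(s, t) for t, s in target_to_source}
    return set(source_to_target) & flipped


source_to_target = {(0, 0), (1, 1), (2, 2), (3, 2)}
target_to_source = {(0, 0), (1, 1), (2, 3)}
print(sorted(symmetrise_intersection(source_to_target, target_to_source)))
```

Intersection yields sparse but precise links, which suits rule extraction: unaligned words can still be absorbed into larger constituents, whereas a spurious link would block a valid constituent pair.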
We induced a synchronous tree substitution grammar from the Ziff-Davis and Broadcast news corpora using the method described in Section 3.2.
We extracted all maximally general synchronous rules.
These were complemented with more specific rules from conjoining pairs of general rules.
The specific rules were pruned to remove singletons and those rules with more than 3 variables.
Grammar rules were represented by the features described in Section 3.3.
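The pruning of specific rules can be sketched as a frequency and arity filter. The representation of a rule as a `(rule, n_variables)` tuple recorded once per extraction, and the function name, are assumptions made for illustration.

```python
from collections import Counter


def prune_specific_rules(extracted, max_variables=3):
    """Drop rules seen only once and rules with more than `max_variables` variables.

    extracted: list of (rule, n_variables) tuples, one entry per extraction event.
    """
    counts = Counter(extracted)
    return {
        rule
        for (rule, n_variables), seen in counts.items()
        if seen > 1 and n_variables <= max_variables
    }


extracted = [("A", 1), ("A", 1), ("B", 4), ("B", 4), ("C", 2)]
print(prune_specific_rules(extracted))
```

Here rule "B" is removed for having four variables and rule "C" for being a singleton, leaving only "A".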
An important parameter for our compression task is the appropriate choice of loss function.
Ideally, we would like a loss function that encourages compression without overly aggressive information loss.
6 We obtained a similar plot for the Broadcast News corpus but omit it due to lack of space.
Figure 4 plots compression rate against grammatical relations F1 using each of the loss functions presented in Section 3.3 on the Ziff-Davis development set.6 As can be seen, with unigram precision alone (Prec) the system produces overly short output, whereas the one-sided brevity penalty (BP1) achieves the opposite effect.
The two-sided brevity penalty (BP2) seems to strike the right balance: it encourages compression while achieving good F-scores.
This suggests that important information is retained in spite of significant compression.
We also varied the regularisation parameter C (see (4)) over a range of values on the development set and found that setting it to 0.01 yields overall good performance across corpora and loss functions.
We now present our results on the test set.
These were obtained with a model that uses slack rescaling and a precision-based loss function with a two-sided brevity penalty (C = 0.01).
Table 2 shows the average compression rates (CompR) for McDonald (2006) and our model (STSG) as well as their performance according to grammatical relations F1.
The row 'Gold standard' displays human-produced compression rates.
Notice that our model obtains compression rates similar to the gold standard, whereas McDonald tends to compress less on Ziff-Davis and more on Broadcast news.
As far as F1 is concerned, we see that STSG outperforms McDonald on both corpora.
The difference in F1 is statistically significant on Broadcast news but not on Ziff-Davis (which consists solely of 32 sentences).
Table 3 presents the results of our elicitation study.
We carried out an Analysis of Variance (ANOVA) to examine the effect of system type (McDonald06, STSG, Gold standard) on the compression ratings.
The ANOVA revealed a reliable effect on both corpora.
We used post-hoc Tukey tests to examine whether the mean ratings for each system differed significantly.
The Tukey tests showed that STSG is perceived as significantly better than McDonald06.
There is no significant difference between STSG and the gold standard compressions on the Broadcast news; both systems are significantly worse than the gold standard on Ziff-Davis.
These results are encouraging, indicating that our highly expressive framework is a good model for sentence compression.
Under several experimental conditions we obtain better performance than previous work.
Importantly, the model described here is not compression-specific; it could be easily adapted to other tasks, corpora or languages (for which syntactic analysis tools are available).
Being supervised, our model learns to fit the compression rate of the training data.
In this sense, it is somewhat inflexible as it cannot easily adapt to a specific rate given by a user or imposed by an application (e.g., when displaying text on small screens).
Compression rate can be indirectly manipulated by adopting loss functions that encourage or discourage compression (see Figure 4), but admittedly in other frameworks (e.g., Clarke and Lapata (2006a)) the length of the compression can be influenced more naturally.
In our formulation of the compression problem, a derivation is characterised by a single inventory of features.
This entails that the feature space cannot in principle distinguish between derivations that use the same rules, applied in a different order.
Although this situation does not arise often in our dataset, we believe that it can be ameliorated by intersecting a language model with our generation algorithm (Chiang, 2005).
6 Conclusions and Future Work
In this paper we have presented a novel method for sentence compression cast in the framework of structured learning.
We develop a system that generates compressions using a synchronous tree substitution grammar whose weights are discriminatively trained within a large margin model.
We also describe an appropriate algorithm that can be used in both training (i.e., learning the model weights) and decoding (i.e., finding the most plausible compression under the model).
The proposed formulation allows us to capture rewriting operations that go beyond word deletion and can be easily tuned to specific loss functions directly related to the problem at hand.
We empirically evaluate our approach against a state-of-the-art model (McDonald, 2006) and show performance gains on two compression corpora.
Future research will follow three directions.
First, we will extend the framework to incorporate position dependent loss functions.
Examples include the Hamming distance or more sophisticated functions that take the tree structure of the source and target sentences into account.
Such functions can be supported by augmenting our generation algorithm with a beam search.
Secondly, the present paper used a relatively simple feature set.
Our intention was to examine our model's performance without extensive feature engineering.
Nevertheless, improvements should be possible by incorporating features defined over n-grams and dependencies (McDonald, 2006).
Finally, the experiments presented in this work use a grammar acquired from the training corpus.
However, there is nothing inherent in our formalisation that restricts us to this particular grammar.
We therefore plan to investigate the potential of our method with unsupervised or semi-supervised grammar induction techniques for additional rewriting tasks including paraphrase generation and machine translation.
Acknowledgements The authors acknowledge the support of EPSRC (grants GR/T04540/01 and GR/T04557/01).
We are grateful to James Clarke for sharing his implementation of McDonald (2006) with us.
Special thanks to Philip Blunsom for insightful comments and suggestions.
