The paper presents a novel sentence trimmer in Japanese, which combines a non-statistical yet generic tree generation model and Conditional Random Fields (CRFs), to address improving the grammaticality of compression while retaining its relevance.
Experiments found that the present approach outperforms in grammaticality and in relevance a dependency-centric approach (Oguro et al., 2000; Morooka et al., 2004; Yamagata et al., 2006; Fukutomi et al., 2007) - the only line of work in prior literature (on Japanese compression) we are aware of that allows replication and permits a direct comparison.
1 Introduction
For better or worse, much of prior work on sentence compression (Riezler et al., 2003; McDonald, 2006; Turner and Charniak, 2005) turned to a single corpus developed by Knight and Marcu (2002) (K&M, henceforth) for evaluating their approaches.
The K&M corpus is a moderately sized corpus consisting of 1,087 pairs of sentence and compression, which account for about 2% of a Ziff-Davis collection from which it was derived.
Despite its limited scale, prior work in sentence compression relied heavily on this particular corpus for establishing results (Turner and Charniak, 2005; McDonald, 2006; Clarke and Lapata, 2006; Galley and McKe-own, 2007).
It was not until recently that researchers started to turn attention to an alternative approach which does not require supervised data (Turner and
Charniak, 2005).
Our approach is broadly in line with prior work
Clarke and Lapata, 2006), in that we make use of some form of syntactic knowledge to constrain compressions we generate.
What sets this work apart from them, however, is a novel use we make of Conditional Random Fields (CRFs) to select among possible compressions (Lafferty et al., 2001; Sutton and McCallum, 2006).
An obvious benefit of using CRFs for sentence compression is that the model provides a general (and principled) probabilistic framework which permits information from various sources to be integrated towards compressing sentence, a property K&M do not share.
Nonetheless, there is some cost that comes with the straightforward use of CRFs as a discriminative classifier in sentence compression; its outputs are often ungrammatical and it allows no control over the length of compression they generates (Nomoto, 2007).
We tackle the issues by harnessing CRFs with what we might call dependency truncation, whose goal is to restrict CRFs to working with candidates that conform to the grammar.
Thus, unlike McDonald (2006), Clarke and Lap-ata (2006) and Cohn and Lapata (2007), we do not insist on finding a globally optimal solution in the space of 2n possible compressions for an n word long sentence.
Rather we insist on finding a most plausible compression among those that are explicitly warranted by the grammar.
Later in the paper, we will introduce an approach called the 'Dependency Path Model' (DPM) from the previous literature (Section 4), which purports to provide a robust framework for sentence compres-
sion in Japanese.
We will look at how the present approach compares with that of DPM in Section 6.
2 A Sentence Trimmer with CRFs
Our idea on how to make CRFs comply with grammar is quite simple: we focus on only those label sequences that are associated with grammatically correct compressions, by making CRFs look at only those that comply with some grammatical constraints G, and ignore others, regardless of how probable they are.1 But how do we find compressions that are grammatical?
To address the issue, rather than resort to statistical generation models as in the previous literature (Cohn and Lapata, 2007; Galley and McKeown, 2007), we pursue a particular rule-based approach we call a 'dependency truncation,' which as we will see, gives us a greater control over the form that compression takes.
Let us denote a set of label assignments for S that satisfy constraints, by G(S).
2 We seek to solve the following,
There would be a number of ways to go about the problem.
In the context of sentence compression, a linear programming based approach such as Clarke and Lapata (2006) is certainly one that deserves consideration.
In this paper, however, we will explore a much simpler approach which does not require as involved formulation as Clarke and Lapata (2006)
do.
We approach the problem extentionally, i.e., through generating sentences that are grammatical, or that conform to whatever constraints there are.
1 Assume as usual that CRFs take the form,
fi and gt are 'features' associated with edges and vertices, respectively, and k e C, where C denotes a set of cliques in CRFs.
Xj and jit are the weights for corresponding features. w and f are vector representations of weights and features, respectively (Tasker, 2004).
2Note that a sentence compression can be represented as an array of binary labels, one of them marking words to be retained in compression and the other those to be dropped.
Figure 1: Syntactic structure in Japanese
Consider the following.
(3) Mushoku-no John -ga takai kuruma unemployed John sbj expensive car -wo kat-ta.
acc buy past
'John, who is unemployed, bought an expensive car.'
whose grammatically legitimate compressions would include:
(4) (a) John -ga takai kuruma -wo kat-ta.
'John bought an expensive car.'
(c) Mushoku-no John -ga kuruma -wo kat-ta. '
John, who is unemployed, bought a car.
(f) Takai kuruma-wo kat-ta.
(h) Kat-ta.
Figure 2: Compressing an NP chunk
Figure 3: Trimming TDPs
of the Japanese language we need to take into account when generating compressions, is that the sentence, which is free of word order and verb-final, typically takes a left-branching structure as in Figure 1, consisting of an array of morphological units called bunsetsu (BS, henceforth).
A BS, which we might regard as an inflected form (case marked in the case of nouns) of verb, adjective, and noun, could involve one or more independent linguistic elements such as noun, case particle, but acts as a morphological atom, in that it cannot be torn apart, or partially deleted, without compromising the grammaticality.3 Noting that a Japanese sentence typically consists of a sequence of case marked NPs and adjuncts, followed by a main verb at the end (or what would be called 'matrix verb' in linguistics), we seek to compress each of the major chunks in the sentence, leaving untouched the matrix verb, as its removal often leaves the sentence unintelligible.
In particular, starting with the leftmost BS in a major constituent,
3Example 3 could be broken into BSs: /Mushuku -no /John -ga / takai / kuruma -wo / kat-ta /.
we work up the tree by pruning BSs on our way up, which in general gives rise to grammatically legitimate compressions of various lengths (Figure 2).
More specifically, we take the following steps to construct G(S).
Let S = abcde.
Assume that it has a dependency structure as in Figure 3.
We begin by locating terminal nodes, i.e., those which have no incoming edges, depicted as filled circles in Figure 3, and find a dependency (singly linked) path from each terminal node to the root, or a node labeled 'e' here, which would give us two paths pi = a-c-d-e and p2 = b-c-d-e (call them terminating dependency paths, or TDPs).
Now create a set T of all trimmings, or suffixes of each TDP, including an empty string:
{}}, a set of compressions over S based on TDPs.
What is interesting about the idea is that creating G(S) does not involve much of anything that is specific to a given language.
Indeed this could be done on English as well.
Take for instance a sentence at the top of Table 1, which is a slightly modified lead sentence from an article in the New York Times.
Assume that we have a relevant dependency structure as shown in Figure 5, where we have three TDPs, i.e., one with southern, one with British and one with lethal.
Then G(S) would include those listed in Table 1.
A major difference from Japanese lies in the direction in which a tree is branching out: right versus left.4
Having said this, we need to address some language specific constraints: in Japanese, for instance, we should keep a topic marked NP in compression as its removal often leads to a decreased readability; and also it is grammatically wrong to start any compressed segment with sentence nominalizers such as
4We stand in a marked contrast to previous 'grafting' approaches which more or less rely on an ad-hoc collection of transformation rules to generate candidates (Riezler et al.,
2003).
An official was quoted troops in southern Iraq
Table 1: Hedge-clipping English yesterday as accusing Iran of supplying explosive technology used in lethal attacks on British
An official was troops in Iraq An official was troops
An official was An official was An official was An official was An official was
quoted yesterday as accusing Iran of supplying explosive technology used in lethal attacks on British
quoted yesterday as accusing Iran of supplying explosive technology used in lethal attacks on troops quoted yesterday as accusing Iran of supplying explosive technology used in lethal attacks quoted yesterday as accusing Iran of supplying explosive technology used in attacks quoted yesterday as accusing Iran of supplying explosive technology quoted yesterday as accusing Iran of supplying technology
^attafck^
southern
Figure 5: An English dependency structure and TDPs
Figure 4: Combining TDP suffixes
-koto and -no.
In English, we should keep a preposition from being left dangling, as in An official was quoted yesterday as accusing Iran of supplying technology used in.
In any case, we need some extra rules on G(S) to take care of language specific issues (cf. Vandeghinste and Pan (2004) for English).
An important point about the dependency truncation is that for most of the time, a compression it generates comes out reasonably grammatical, so the number of 'extras' should be small.
Finally, in order for CRFs to work with the compressions, we need to translate them into a sequence of binary labels, which involves labeling an element token, bunsetsu or a word, with some label, e.g., 0 for 'remove' and 1 for 'retain,' as in Figure 6.
Consider following compressions yi to y4 for x = P1P2P3P4P5P&.
Pi denotes a bunsetsu (BS). '
0' marks a BS to be removed and '1' that to be retained.
Assume that G(S) = {yi, y2, y3}- Because y4 is not part of G(S), it is not considered a candidate for a compression for y, even if its likelihood may exceed those of others in G(S).
We note that the approach here does not rely on so much of CRFs as a discriminative classiier as CRFs as a strategy for ranking among a limited set of label sequences which correspond to syntactically plausible simpli-ications of input sentence.
Furthermore, we could dictate the length of compression by putbting an additional constraint on out-
Figure 6: Compression in binary representation.
Another point to note is that G(S) is finite and relatively small — it was found, for our domain, G(S) usually runs somewhere between a few hundred and ten thousand in length — so in practice it suffices that we visit each compression in G(S), and select one that gives the maximum value for the objective function.
We will have more to say about the size of the search space in Section 6.
3 Features in CRFs
We use an array of features in CRFs which are either derived or borrowed from the taxonomy that a Japanese tokenizer called juman and knp,6 a Japanese dependency parser (aka Kurohashi-Nagao parser), make use of in characterizing the output they produce: both juman and knp are part of the compression model we build.
Features come in three varieties: semantic, morphological and syntactic.
Semantic features are used for classifying entities into semantic types such as name of person, organization, or place, while syntactic features characterize the kinds of dependency
5It is worth noting that the present approach can be recast into one based on 'constraint relaxation' (Tromble and Eisner, 2006).
relations that hold among BSs such as whether a BS is of the type that combines with the verb (renyou), or of the type that combines with the noun (rental), etc.
A morphological feature could be thought of as something that broadly corresponds to an English POS, marking for some syntactic or morphological category such as noun, verb, numeral, etc. Also we included ngram features to encode the lexical context in which a given morpheme appears.
Thus we might have something like: for some words (morphemes) w1, w2, and w3, fwi-w2 (w3) = 1 if w3 is preceded by w\, w2; otherwise, 0.
In addition, we make use of an IR-related feature, whose job is to indicate whether a given morpheme in the input appears in the title of an associated article.
The motivation for the feature is obviously to identify concepts relevant to, or unique to the associated article.
Also included was a feature on tfidf, to mark words that are conceptually more important than others.
The number of features came to around 80,000 for the corpus we used in the experiment.
4 The Dependency Path Model
where y = [30, [31,..., /3n-i, i.e., a compression consisting of any number of bunsetsu's, or phraselike elements. f (•) measures the relevance of content in y; and g( ) the fluency of text. a is to provide a way of weighing up contributions from each component.
We further define:
7Kikuchi et al. (2003) explore an approach similar to DPM.
Three legged
disappeared
Figure 7: A dependency structure
q(-) is meant to quantify how worthy of inclusion in compression, a given bunsetsu is; and p(3i,3j) represents the connectivity strength of dependency relation between [3i and [3j. s(-) is a linking function that associates with a bunsetsu any one of those that follows it. g(y) thus represents a set of linked edges that, if combined, give the largest probability for y.
Dependency path length (DL) refers to the number of (singly linked) dependency relations (or edges) that span two bunsetsu's.
Consider the dependency tree in Figure 7, which corresponds to a somewhat contrived sentence Three-legged dogs disappeared from sight' Take an English word for a bunsetsu here.
We have
Since dogs is one edge away from three-legged, DL for them is 1; and we have DL of two for three-legged and disappeared, as we need to cross two edges in the direction of arrow to get from the former to the latter.
In case there is no path between words as in the last two cases above, we take the DL to be infinite.
represent bunsestu's that the edge spans.
Cs(3) denotes the class of a bunsetsu where the edge starts and Ce(3) that of a bunsetsu where the edge ends.
What we mean by 'class of bunsetsu' is some sort of a classificatory scheme that concerns linguistic characteristics of bunsetsu, such as a part-of-speech of the head, whether it has an inflection, and if it does, what type of inflection it has, etc. Moreover, DPM uses two separate classificatory schemes for Cs (3) and Ce(/3).
In DPM, we define the connectivity strength p by:
# of t's found in compressions # of triples found in the training data
We complete the DPM formulation with:
pc(3) denotes the probability of having bunsetsu 3 in compression, calculated analogously to Eq.
10,8 and tfidf(3) obviously denotes the tfidf value of 3.
In DPM, a compression of a given sentence can be obtained by finding arg maxy h(y), where y ranges over possible candidate compressions of a particular length one may derive from that sentence.
In the experiment described later, we set a = 0.1 for DPM, following Morooka et al. (2004), who found the best performance with that setting for a.
5 Evaluation Setup
We created a corpus of sentence summaries based on email news bulletins we had received over five to six months from an on-line news provider called Nikkei Net, which mostly deals with finance and politics.9 Each bulletin consists of six to seven news briefs, each with a few sentences.
Since a news brief contains nothing to indicate what its longer version
8DPM puts bunsetsu's into some groups based on linguistic features associated with them, and uses the statistics of the groups for pc rather than that of bunsetsu's that actually appear in text.
Table 2: The rating scale on fluency RATING EXPLANATION
2 only partially intelligible/grammatical
3 makes sense; seriously flawed in grammar
4 makes good sense; only slightly flawed in grammar
5 makes perfect sense; no grammar flaws
Table 3: The rating scale on content overlap
RATING EXPLANATION
1 no overlap with reference
2 poor or marginal overlap w. ref.
3 moderate overlap w. ref.
4 significant overlap w. ref.
5 perfect overlap w. ref.
might look like, we manually searched the news site for a full-length article that might reasonably be considered a long version of that brief.
We extracted lead sentences both from the brief and from its source article, and aligned them, using what is known as the Smith-Waterman algorithm (Smith and Waterman, 1981), which produced 1,401 pairs of summary and source sentence.10 For the ease of reference, we call the corpus so produced 'NICOM' for the rest of the paper.
A part of our system makes use of a modeling toolkit called GRMM (Sutton et al., 2004; Sutton, 2006).
Throughout the experiments, we call our approach 'Generic Sentence Trimmer' or GST.
6 Results and Discussion
We ran DPM and GST on NICOM in the 10-fold cross validation format where we break the data into 10 blocks, use 9 of them for training and test on the remaining block.
In addition, we ran the test at three different compression rates, 50%, 60% and 70%, to learn how they affect the way the models perform.
This means that for each input sentence in NICOM, we have three versions of its compression created, corresponding to a particular rate at which the sentence is compressed.
We call a set of compressions so generated 'NICOM-g.'
In order to evaluate the quality of outputs GST and DPM generate, we asked 6 people, all Japanese natives, to make an intuitive judgment on how each compression fares in fluency and relevance to gold
10The Smith-Waterman algorithm aims at finding a best match between two sequences which may include gaps, such as a-c-d-e and a-b-c-d-e.
The algorithm is based on an idea rather akin to dynamic programming.
standards (created by humans), on a scale of 1 to 5.
To this end, we conducted evaluation in two separate formats; one concerns luency and the other relevance.
The luency test consisted of a set of compressions which we created by randomly selecting 200 of them from NICOM-g, for each model at compression rates 50%, 60%, and 70%; thus we have 200 samples for each model and each compression rate.11 The total number of test compressions came
to 1,200.
The relevance test, on the other hand, consisted of paired compressions along with the associated gold standard compressions.
Each pair contains compressions both from DPM and from GST at a given compression rate.
We randomly picked 200 of them from NICOM-g, at each compression rate, and asked the participants to make a subjective judgment on how much of the content in a compression semantically overlap with that of the gold standard, on a scale of 1 to 5 (Table 3).
Also included in the survey are 200 gold standard compressions, to get some idea of how fluent "ideal" compressions are, compared to those generated by machine.
Tables 4 and 5 summarize the results.
Table 4 looks at the luency of compressions generated by each of the models; Table 5 looks at how much of the content in reference is retained in compressions.
In either table, cr stands for compression rate.
All the results are averaged over samples.
We find in Table 4 a clear superiority of GST over DPM at every compression rate examined, with lu-ency improved by as much as 60% at 60%.
However, GST fell short of what human compressions achieved in fluency — an issue we need to address
Table 5: Semantic (Content) Overlap (Average)
Number of Candidates
in the future.
Since the average CR of gold standard compressions was 60%, we report their luency at that rate only.
Table 5 shows the results in relevance of content.
Again GST marks a superior performance over DPM, beating it at every compression rate.
It is interesting to observe that GST manages to do well in the semantic overlap, despite the cutback on the search space we forced on GST.
As for luency, we suspect that the superior performance of GST is largely due to the dependency truncation the model is equipped with; and its performance in content overlap owes a lot to CRFs.
However, just how much improvement GST achieved over regular CRFs (with no truncation) in luency and in relevance is something that remains to be seen, as the latter do not allow for variable length compression, which prohibits a straightforward comparison between the two kinds of models.
We conclude the section with a few words on the size of \G(S)|, i.e., the number of candidates generated per run of compression with GST.
Figure 8 shows the distribution of the numbers of candidates generated per compression, which looks like the familiar scale-free power curve.
Over 99% of the time, the number of candidates or \G(S)\ is found to be less than 500.
7 Conclusions
This paper introduced a novel approach to sentence compression in Japanese, which combines a syntactically motivated generation model and CRFs, in or-
der to address luency and relevance of compressions we generate.
What distinguishes this work from prior research is its overt withdrawal from a search for global optima to a search for local optima that comply with grammar.
We believe that our idea was empirically borne out, as the experiments found that our approach outperforms, by a large margin, a previously known method called DPM, which employs a global search strategy.
The results on semantic overlap indicates that the narrowing down of compressions we search obviously does not harm their relevance to references.
An interesting future exercise would be to explore whether it is feasible to rewrite Eq.
5 as a linear integer program.
If it is, the whole scheme of ours would fall under what is known as 'Linear Programming CRFs' (Tasker, 2004; Roth and Yih, 2005).
What remains to be seen, however, is whether GST is trans-ferrable to languages other than Japanese, notably, English.
The answer is likely to be yes, but details have yet to be worked out.
