We present a new approach to automatic summarization based on neural nets, called NetSum.
We extract a set of features from each sentence that helps identify its importance in the document.
We apply novel features based on news search query logs and Wikipedia entities.
Using the RankNet learning algorithm, we train a pair-based sentence ranker to score every sentence in the document and identify the most important sentences.
We apply our system to documents gathered from CNN.com, where each document includes highlights and an article.
Our system significantly outperforms the standard baseline in the ROUGE-1 measure on over 70% of our document set.
1 Introduction
Automatic summarization was first studied almost 50 years ago by Luhn (Luhn, 1958) and has continued to be a steady subject of research.
Automatic summarization refers to the creation of a shortened version of a document or cluster of documents by a machine; see (Mani, 2001) for details.
The summary can be an abstraction or extraction.
In an abstract summary, content from the original document may be paraphrased or generated, whereas in an extract summary, the content is preserved in its original form, i.e., sentences.
Both summary types can involve sentence compression, but abstracts tend to be more condensed.
In this paper, we focus on producing fully automated single-document extract summaries of newswire articles.
To create an extract, most automatic systems use linguistic and/or statistical methods to identify key words, phrases, and concepts in a sentence or across single or multiple documents.
Each sentence is then assigned a score indicating the strength of presence of key words, phrases, and so on.
Sentence scoring methods utilize both purely statistical and purely semantic features, for example as in (Vanderwende et al., 2006; Nenkova et al., 2006; Yih et al., 2007; Mihalcea and Radev, 2006).
In 2001-02, the Document Understanding Conference (DUC, 2001) issued the task of creating a 100-word summary of a single news article.
The best performing systems (Hirao et al., 2002; Lal and Ruger, 2002) used various learning and semantic-based methods, although no system could outperform the baseline with statistical significance (Nenkova, 2005).
After 2002, the single-document summarization task was dropped.
In recent years, there has been a decline in studies on automatic single-document summarization, in part because the DUC task was dropped, and in part because producing single-document extracts may be counterintuitively more difficult than multi-document summarization (Nenkova, 2005).
However, with the ever-growing internet and increased information access, we believe single-document summarization is essential to improve quick access to large quantities of information.
Recently, CNN.com (CNN.com, 2007a) added "Story Highlights" to many news articles on its site to allow readers to quickly gather information on stories.
These highlights give a brief overview of the article and appear as three to four related sentences in the form of bullet points rather than a summary paragraph, making them even easier to scan quickly.
Our work is motivated both by the addition of highlights to an extremely visible and reputable online news source and by the inability of past single-document summarization systems to outperform the extremely strong baseline of choosing the first n sentences of a newswire article as the summary (Nenkova, 2005).
Although some recent systems indicate an improvement over the baseline (Mihalcea, 2005; Mihalcea and Tarau, 2005), statistical significance has not been shown.
We show that by using a neural network ranking algorithm and third-party datasets to enhance sentence features, our system, NetSum, can outperform the baseline with statistical significance.
Our paper is organized as follows.
Section 2 describes our two studies: summarization and highlight extraction.
We describe our dataset in detail in Section 3.
Our ranking system and feature vectors are outlined in Section 4.
We present our evaluation measure in Section 5.
Sections 6 and 7 report on our results on summarization and highlight extraction, respectively.
We conclude in Section 8 and discuss future work in Section 9.
2 Problem Description

In this paper, we focus on single-document summarization of newswire documents.
Each document consists of three highlight sentences and the article text.
Each highlight sentence is human-generated, but is based on the article.
In Section 4 we discuss the process of matching a highlight to an article sentence.
The output of our system consists purely of extracted sentences; we do not perform any sentence compression or sentence generation.
We leave such extensions for future work.
We develop two separate problems based on our document set.
First, can we extract three sentences that best "match" the highlights as a whole?
In this task, we concatenate the three sentences produced by our system into a single summary or block, and similarly concatenate the three highlight sentences into a single summary or block.
We then compare our system's block against the highlight block.
Second, can we extract three sentences that best "match" the three highlights, such that ordering is preserved?
In this task, we produce three sentences, where the first sentence is compared against the first highlight, the second sentence is compared against the second highlight, and the third sentence is compared against the third highlight.
Credit is not given for producing three sentences that match the highlights, but are out of order.
The second task considers ordering and compares sentences on an individual level, whereas the first task considers the three chosen sentences as a summary or block and disregards sentence order.
In both tasks, we assume the title has been seen by the reader and will be listed above the highlights.
3 Evaluation Corpus
Our data consists of 1365 news documents gathered from CNN.com (CNN.com, 2007a).
Each document was collected by hand, with a maximum of 50 documents per day, on consecutive days during February 2007.
Each document includes the title, timestamp, story highlights, and article text.
The timestamp on articles ranges from December 2006 to February 2007, since articles remain posted on CNN.com for up to several months.
The story highlights are human-generated from the article text.
Each document has three or four story highlights.
Since all articles include at least 3 story highlights, we consider only the task of extracting three highlights from each article.
4 Description of Our System
Our goal is to extract three sentences from a single news document that best match various characteristics of the three document highlights.
TIMESTAMP: 1:59 p.m. EST, January 31, 2007
TITLE: Nigeria reports first human death from bird flu
HIGHLIGHT 1: Government boosts surveillance after woman dies
HIGHLIGHT 2: Egypt, Djibouti also have reported bird flu in humans
HIGHLIGHT 3: H5N1 bird flu virus has killed 164 worldwide since 2003
ARTICLE:
1. Health officials reported Nigeria's first cases of bird flu in humans on Wednesday, saying one woman had died and a family member had been infected but was responding to treatment.
2. The victim, a 22-year-old woman in Lagos, died January 17, Information Minister Frank Nweke said in a statement.
3. He added that the government was boosting surveillance across Africa's most-populous nation after the infections in Lagos, Nigeria's biggest city.
4. The World Health Organization had no immediate confirmation.
5. Nigerian health officials earlier said 14 human samples were being tested.
6. Nweke made no mention of those cases on Wednesday.
7. An outbreak of H5N1 bird flu hit Nigeria last year, but no human infections had been reported until Wednesday.
8. Until the Nigerian report, Egypt and Djibouti were the only African countries that had confirmed infections among people.
9. Eleven people have died in Egypt.
10. The bird flu virus remains hard for humans to catch, but health experts fear H5N1 may mutate into a form that could spread easily among humans and possibly kill millions in a flu pandemic.
11. Amid a new H5N1 outbreak reported in recent weeks in Nigeria's north, hundreds of miles from Lagos, health workers have begun a cull of poultry.
12. Bird flu is generally not harmful to humans, but the H5N1 virus has claimed at least 164 lives worldwide since it began ravaging Asian poultry in late 2003, according to the WHO.
13. The H5N1 strain had been confirmed in 15 of Nigeria's 36 states.
14. By September, when the last known case of the virus was found in poultry in a farm near Nigeria's biggest city of Lagos, 915,650 birds had been slaughtered nationwide by government veterinary teams under a plan in which the owners were promised compensation.
15. However, many Nigerian farmers have yet to receive compensation in the north of the country, and health officials fear that chicken deaths may be covered up by owners reluctant to slaughter their animals.
16. Since bird flu cases were first discovered in Nigeria last year, Cameroon, Djibouti, Niger, Ivory Coast, Sudan and Burkina Faso have also reported the H5N1 strain of bird flu in birds.
17. There are fears that it has spread even further than is known in Africa because monitoring is difficult on a poor continent with weak infrastructure.
18. With sub-Saharan Africa bearing the brunt of the AIDS epidemic, there is concern that millions of people with suppressed immune systems will be particularly vulnerable, especially in rural areas with little access to health facilities.
19. Many people keep chickens for food, even in densely populated urban areas.
Figure 1: Example document containing highlights and article text.
Sentences are numbered by their position.
Article is from (CNN.com, 2007b).
One way to identify the best sentences is to rank the sentences using a machine learning approach, for example as in (Hirao et al., 2002).
A training set is labeled such that the labels identify the best sentences. A set of features is then extracted from each sentence in the training and test sets, and the training set is used to train the system. The system learns from the training set the distribution of features of the best sentences and outputs a ranked list of sentences for each document; it is then evaluated on the test set.
4.1 Sentence Ranking with RankNet

In this paper, we rank sentences using RankNet (Burges et al., 2005), a pair-based neural network algorithm for ranking a set of inputs, in this case the set of sentences in a given document. From the labels and features for each sentence, we train a model that, when run on a test set of sentences, can infer the proper ranking of sentences in a document based on information gathered during training about sentence characteristics.
The system is trained on pairs of sentences (Si, Sj), such that Si should be ranked higher than or equal to Sj. Pairs are generated between sentences in a single document, not across documents.
Each pair is determined from the input labels.
Since our sentences are labeled using ROUGE (see Section 4.3), if the ROUGE score of Si is greater than the ROUGE score of Sj, then (Si,Sj) is one input pair.
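As an illustration, here is a minimal sketch of this within-document pair generation in Python (function and variable names are ours, not the paper's):

    from itertools import combinations

    def generate_pairs(rouge_labels):
        # One document's sentences, each with a ROUGE label (Section 4.3).
        # Emit (i, j) whenever sentence i should be ranked above sentence j,
        # i.e., whenever its label is strictly greater; ties yield no pair.
        pairs = []
        for i, j in combinations(range(len(rouge_labels)), 2):
            if rouge_labels[i] > rouge_labels[j]:
                pairs.append((i, j))
            elif rouge_labels[j] > rouge_labels[i]:
                pairs.append((j, i))
        return pairs

    # Example: four sentences with ROUGE-1 labels.
    print(generate_pairs([0.10, 0.45, 0.30, 0.05]))
    # -> [(1, 0), (2, 0), (0, 3), (1, 2), (1, 3), (2, 3)]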
The cost function for RankNet is a probabilistic cross-entropy cost function. Training is performed using a modified version of the backpropagation algorithm for two-layer nets (Le Cun et al., 1998), which optimizes the cost function by gradient descent.
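Concretely, for a pair (S_i, S_j) with output difference o_ij = f(S_i) - f(S_j) and target probability that S_i should rank higher, the cost from (Burges et al., 2005) is

$$P_{ij} = \frac{e^{o_{ij}}}{1 + e^{o_{ij}}}, \qquad C_{ij} = -\bar{P}_{ij} \log P_{ij} - (1 - \bar{P}_{ij}) \log(1 - P_{ij}),$$

so gradient descent raises the output for S_i and lowers it for S_j whenever the model underestimates the target probability.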
A similar method of training on sentence pairs in the context of multi-document summarization was recently shown in (Toutanova et al., 2007).
Our system, NetSum, is a two-layer neural net trained using RankNet.
To speed up RankNet training, we implement RankNet in the framework of LambdaRank (Burges et al., 2006).
For details, see (Burges et al., 2006; Burges et al., 2005).
We experiment with between 5 and 15 hidden nodes and with a learning rate between 10^-2 and 10^-7.
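To make the training procedure concrete, here is a minimal NumPy sketch of a two-layer pair-trained scorer in the spirit of NetSum (our own simplification: target pair probability fixed at 1, plain gradient descent, no LambdaRank speed-up):

    import numpy as np

    rng = np.random.default_rng(0)

    class TinyRankNet:
        # 10 sentence features -> n_hidden tanh units -> scalar score.
        def __init__(self, n_features=10, n_hidden=10, lr=0.05):
            self.W1 = rng.normal(0, 0.1, (n_hidden, n_features))
            self.b1 = np.zeros(n_hidden)
            self.w2 = rng.normal(0, 0.1, n_hidden)
            self.b2 = 0.0
            self.lr = lr

        def score(self, x):
            return self.w2 @ np.tanh(self.W1 @ x + self.b1) + self.b2

        def _grads(self, x):
            # Gradient of the score with respect to each parameter.
            h = np.tanh(self.W1 @ x + self.b1)
            dh = (1.0 - h ** 2) * self.w2      # back through tanh
            return np.outer(dh, x), dh, h, 1.0  # dW1, db1, dw2, db2

        def train_pair(self, x_hi, x_lo):
            # Cross-entropy cost with target probability 1 gives
            # dC/do = -1 / (1 + exp(o)), with o = score(x_hi) - score(x_lo).
            o = self.score(x_hi) - self.score(x_lo)
            dC_do = -1.0 / (1.0 + np.exp(o))
            for x, sign in ((x_hi, 1.0), (x_lo, -1.0)):
                dW1, db1, dw2, db2 = self._grads(x)
                step = self.lr * dC_do * sign
                self.W1 -= step * dW1
                self.b1 -= step * db1
                self.w2 -= step * dw2
                self.b2 -= step * db2

    # Toy usage: learn to prefer x_hi over x_lo.
    net = TinyRankNet()
    x_hi, x_lo = rng.random(10), rng.random(10)
    for _ in range(500):
        net.train_pair(x_hi, x_lo)
    print(net.score(x_hi) > net.score(x_lo))  # typically True after training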
We implement four versions of NetSum. The first version, NetSum(b), is trained for our first summarization problem (b indicates block). Its pairs are generated using the maximum ROUGE labels l_i (see Section 4.3). The other three rankers are trained to identify the sentence in the document that best matches highlight n. We train one ranker, NetSum(n), for each highlight n, for n = 1, 2, 3, resulting in three rankers. NetSum(n) is trained using pairs generated from the labels l_i^n, the ROUGE-1 scores between sentence S_i and highlight H_n (see Section 4.3).
4.2 Matching Extracted to Generated Sentences
In this section, we describe how to determine which sentence in the document best matches a given highlight.
Choosing three sentences most similar to the three highlights is very challenging since the highlights include content that has been gathered across sentences and even paragraphs, and furthermore include vocabulary that may not be present in the text.
Jing (2002) showed, for 300 news articles, that 19% of human-generated summary sentences contain no matching article sentence. In addition, only 42% of the summary sentences match the content of a single article sentence, and even then there remain semantic and syntactic transformations between the summary sentence and the article sentence.
Since each highlight is human generated and does not exactly match any one sentence in the document, we must develop a method to identify how closely related a highlight is to a sentence.
We use the ROUGE (Lin, 2004b) measure to score the similarity between an article sentence and a highlight sentence.
We anticipate low ROUGE scores for both the baseline and NetSum due to the difficulty of finding a single sentence to match a highlight.
4.3 Labeling with ROUGE

Recall-Oriented Understudy for Gisting Evaluation (Lin, 2004b), known as ROUGE, measures the quality of a model-generated summary or sentence by comparing it to a "gold-standard", typically human-generated, summary or sentence.
It has been shown that ROUGE is very effective for measuring both single-document summaries and single-document headlines (Lin, 2004a).
ROUGE-N is an N-gram recall measure between a model-generated summary and a reference summary.
We use ROUGE-N, for N = 1, for labeling and evaluation of our model-generated highlights [1]. ROUGE-1 and ROUGE-2 have been shown to be statistically similar to human evaluations and can be used with a single reference summary (Lin, 2004a).
We have only one reference summary, the set of human-generated highlights, per document.
In our work, the reference summary can be a single highlight sentence or the highlights as a block.
We calculate ROUGE-N as

$$\text{ROUGE-N}(S_i) = \frac{\sum_{gram_N \in R} \text{Count}_{match}(gram_N)}{\sum_{gram_N \in R} \text{Count}(gram_N)},$$

where R is the reference summary, S_i is the model-generated summary, N is the length of the N-gram gram_N, Count_match(gram_N) is the number of co-occurrences of gram_N in S_i and R, and Count(gram_N) is the number of occurrences of gram_N in R [2]. The numerator cannot exceed the number of N-grams (non-unique) in R.
We label each sentence S_i by its ROUGE-1 score. For the first problem of matching the highlights as a block, we label each S_i by l_i, the maximum ROUGE-1 score between S_i and each highlight H_n, for n = 1, 2, 3, given by

$$l_i = \max_n R(S_i, H_n).$$

For the second problem of matching three sentences to the three highlights individually, we label each sentence S_i by l_i^n, the ROUGE-1 score between S_i and H_n, given by

$$l_i^n = R(S_i, H_n).$$

The ranker for highlight n, NetSum(n), is passed samples labeled using l_i^n.
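A minimal Python sketch of this ROUGE-1 labeling (tokenization and helper names are ours; the paper's implementation follows (Lin, 2004b) without stemming or stopword removal):

    from collections import Counter

    def rouge_n(candidate, reference, n=1):
        # N-gram recall of `candidate` against `reference` (token lists):
        # clipped matches over the total n-gram count of the reference.
        def ngrams(tokens):
            return Counter(tuple(tokens[k:k + n]) for k in range(len(tokens) - n + 1))
        cand, ref = ngrams(candidate), ngrams(reference)
        total = sum(ref.values())
        if total == 0:
            return 0.0
        matches = sum(min(c, cand[g]) for g, c in ref.items())
        return matches / total

    def block_label(sentence, highlights):
        # l_i: maximum ROUGE-1 over the three highlights (block task).
        return max(rouge_n(sentence, h) for h in highlights)

    def highlight_label(sentence, highlight):
        # l_i^n: ROUGE-1 against one specific highlight (ordered task).
        return rouge_n(sentence, highlight)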
4.4 Features

RankNet takes as input a set of samples, where each sample contains a label and a feature vector.
The labels were previously described in Section 4.3.
In this section, we describe each feature in detail and motivate in part why each feature is chosen.
We generate 10 features for each sentence Si in each document, listed in Table 1.
Each feature is chosen to identify characteristics of an article sentence that may match those of a highlight sentence.
Some of the features, such as position and N-gram frequencies, are commonly used for scoring.
[1] We use an implementation of ROUGE that does not perform stemming or stopword removal.

[2] ROUGE is typically used when the length of the reference summary is equal to the length of the model-generated summary. Our reference summary and model-generated summary are of different lengths, so there is a slight bias toward longer sentences.
Feature Name
Is First Sentence
Sentence Position
SumBasic Score
SumBasic Bigram Score
Title Similarity Score
Average News Query Term Score
News Query Term Sum Score
Relative News Query Term Score
Average Wikipedia Entity Score
Wikipedia Entity Sum Score
Table 1: Features used in our model.
Sentence scoring based on sentence position, terms common with the title, appearance of keyword terms, and other cue phrases is known as the Edmundsonian Paradigm (Edmundson, 1969; Alfonseca and Rodriguez, 2003; Mani, 2001).
We use variations on these features as well as a novel set of features based on third-party data.
Typically, news articles are written such that the first sentence summarizes the article.
Thus, we include a binary feature F(S_i) that equals 1 if S_i is the first sentence of the document: F(S_i) = δ_{i,1}, where δ is the Kronecker delta function.
This feature is used only for NetSum(b) and NetSum(1).
We include sentence position since we found in empirical studies that the sentence that best matches highlight H_1 is on average 10% down the article, the sentence that best matches H_2 is on average 20% down the article, and the sentence that best matches H_3 is on average 31% down the article [3]. We calculate the position of S_i in document D as

$$P(S_i) = \frac{i}{I},$$

where i ∈ {1, ..., I} is the sentence number and I is the number of sentences in D.

[3] Though this is not always the case: the sentence to match H_2 precedes that to match H_1 in 22.03% of documents, and the sentence to match H_3 precedes that to match H_2 in 29.32% of documents and precedes that to match H_1 in 28.81% of documents.
We include the SumBasic score (Nenkova et al., 2006) of a sentence to estimate the importance of a sentence based on word frequency. We calculate the SumBasic score of S_i in document D as

$$SB(S_i) = \frac{\sum_{w \in S_i} p(w)}{|S_i|},$$

where p(w) is the probability of word w and |S_i| is the number of words in sentence S_i. We calculate p(w) as

$$p(w) = \frac{Count(w)}{|D|},$$

where Count(w) is the number of times word w appears in document D and |D| is the number of words in document D. Note that the score of a sentence is the average probability of a word in the sentence. We also include the SumBasic score over bigrams, where w above is replaced by bigrams and we normalize by the number of bigrams in S_i.
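A short sketch of the unigram and bigram SumBasic computation (names are ours):

    from collections import Counter

    def sumbasic(sentence, doc_words, n=1):
        # Average probability of the sentence's n-grams, with
        # p(g) = count of g in the document / total n-grams in the document.
        def grams(words):
            return [tuple(words[k:k + n]) for k in range(len(words) - n + 1)]
        doc_grams = grams(doc_words)
        counts = Counter(doc_grams)
        sent_grams = grams(sentence)
        if not sent_grams or not doc_grams:
            return 0.0
        return sum(counts[g] / len(doc_grams) for g in sent_grams) / len(sent_grams)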
We compute the similarity of a sentence S_i in document D with the title T of D as the relative probability of title terms t ∈ T appearing in S_i:

$$Title(S_i) = \sum_{t \in S_i} p(t),$$

where p(t) = Count(t)/|T| is the number of times term t appears in T over the number of terms in T.
The remaining features we use are based on third-party data sources.
Previously, third-party sources such as WordNet (Fellbaum, 1998), the web (Jagalamudi et al., 2006), or click-through data (Sun et al., 2005) have been used as features.
We propose using news query logs and Wikipedia entities to enhance features.
We base several features on query terms frequently issued to Microsoft's news search engine http://search.live.com/news, and entities [4] found in the online open-source encyclopedia Wikipedia (Wikipedia.org, 2007).
If a query term or Wikipedia entity appears frequently in a CNN document, then we assume highlights should include that term or entity since it is important on both the document and global level.
Sentences containing query terms or Wikipedia entities therefore contain important content.
We confirm the importance of these third-party features in Section 7.
We collected several hundred of the most frequently queried terms in February 2007 from the news query logs.
We took the daily top 200 terms for 10 days.
Our hypothesis is that a sentence with a higher number of news query terms should be a better candidate highlight.
We calculate the average probability of news query terms q in S_i as

$$NQ(S_i) = \frac{\sum_{q \in S_i} p(q)}{|q \in S_i|},$$

where p(q) is the probability of a news query term q and |q ∈ S_i| is the number of news query terms in S_i. We calculate p(q) = Count(q)/|q ∈ D|, where Count(q) is the number of times term q appears in D and |q ∈ D| is the number of news query terms in D.

[4] We define an entity as a title of a Wikipedia page.
We perform term disambiguation on each document using an entity extractor (Cucerzan, 2007).
Terms are disambiguated to a Wikipedia entity only if they match a surface form in Wikipedia.
Wikipedia surface forms are terms that disambiguate to a Wikipedia entity and link to a Wikipedia page with the entity as its title.
For example, "WHO" and "World Health Org." both refer to the World Health Organization, and should disambiguate to the entity "World Health Organization".
Sentences in a CNN document D that contain Wikipedia entities appearing frequently in D are considered important.
We calculate the average Wikipedia entity score for S_i as

$$WE(S_i) = \frac{\sum_{e \in S_i} p(e)}{|e \in S_i|},$$

where p(e) is the probability of entity e in D, computed analogously to p(q). We also include the sum of Wikipedia entity scores, given by

$$WE^{+}(S_i) = \sum_{e \in S_i} p(e).$$
Note that all features except position features are a variant of SumBasic over different term sets.
All features are computed over sentences where every word has been lowercased and punctuation has been removed after sentence breaking.
We examined using stemming, but found stemming to be ineffective.
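Since the query-term and entity features share this SumBasic-like pattern, one hedged sketch can cover their average and sum variants (naming is ours; `term_set` would hold the news query terms or the Wikipedia entities found in the document):

    from collections import Counter

    def term_set_scores(sentence_terms, doc_terms, term_set):
        # p(t) = Count(t in D) / number of term-set occurrences in D,
        # following the news-query and Wikipedia definitions above.
        counts = Counter(t for t in doc_terms if t in term_set)
        norm = sum(counts.values())
        if norm == 0:
            return 0.0, 0.0
        probs = [counts[t] / norm for t in sentence_terms if t in term_set]
        avg = sum(probs) / len(probs) if probs else 0.0
        return avg, sum(probs)  # the "Average ..." and "... Sum" features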
5 Evaluation
We evaluate the performance of NetSum using ROUGE and by comparing against a baseline system.
For the first summarization task, we compare against the baseline of choosing the first three sentences as the block summary.
For the second highlights task, we compare NetSum(n) against the baseline of choosing sentence n (to match highlight n).
Both tasks are novel in attempting to match highlights rather than a human-generated summary.
We consider ROUGE-1 to be the measure of importance and thus train our model on ROUGE-1 (to optimize ROUGE-1 scores) and likewise evaluate our system on ROUGE-1.
We list ROUGE-2 scores for completeness, but do not expect them to be substantially better than the baseline since we did not directly optimize for ROUGE-2 [5].
For every document in our corpus, we compare NetSum's output with the baseline output by computing ROUGE-1 and ROUGE-2 between the highlight block and NetSum and between the highlight block and the block of sentences.
Similarly, for each highlight, we compute ROUGE-1 and ROUGE-2 between highlight n and NetSum(n) and between highlight n and sentence n, for n = 1, 2, 3.
For each task, we calculate the average ROUGE-1 and ROUGE-2 scores of NetSum and of the baseline.
We also report the percent of documents where the ROUGE-1 score of NetSum is equal to or better than the ROUGE-1 score of the baseline.
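In code, this per-document comparison is straightforward (a sketch with our own names, assuming one score per document for each system):

    def compare_systems(netsum_scores, baseline_scores):
        # Returns the two average ROUGE scores and the percent of
        # documents where NetSum's score is equal to or better.
        n = len(netsum_scores)
        pct = 100.0 * sum(s >= b for s, b in zip(netsum_scores, baseline_scores)) / n
        return sum(netsum_scores) / n, sum(baseline_scores) / n, pct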
We perform all experiments using five-fold cross-validation on our dataset of 1365 documents. We divide our corpus into five random sets and train on three combined sets, validate on one set, and test on the remaining set. We repeat this procedure for every combination of train, validation, and test sets. Our results are the micro-averaged results on the five test sets.
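A sketch of one way to realize this protocol (the exact rotation of validation folds is our assumption; the paper does not specify it):

    import random

    def five_fold_splits(num_docs, seed=0):
        # Shuffle document indices into five folds; rotate which fold is
        # the test set, use the next fold for validation, train on the rest.
        idx = list(range(num_docs))
        random.Random(seed).shuffle(idx)
        folds = [idx[k::5] for k in range(5)]
        splits = []
        for t in range(5):
            v = (t + 1) % 5
            train = [i for k, fold in enumerate(folds) if k not in (t, v) for i in fold]
            splits.append((train, folds[v], folds[t]))
        return splits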
For all experiments, Table 3 lists the statistical tests performed and the significance of performance differences between NetSum and the baseline at 95% confidence.
6 Results: Summarization
We first find three sentences that, as a block, best match the three highlights as a block.
NetSum(b) produces a ranked list of sentences for each document.
We create a block from the top 3 ranked sentences.
The baseline is the block of the first three sentences of the document. A similar baseline outperforms all previous systems for news article summarization (Nenkova, 2005) and has been used in the DUC workshops (DUC, 2001).

[5] NetSum can directly optimize for any measure by training on it, such as training on ROUGE-2 or on a weighted sum of ROUGE scores, in which case ROUGE-2 scores could be further improved. We leave such studies for future work.
Table 2: Results on summarization task with standard error at 95% confidence.
Bold indicates significance under paired tests.
Table 3: Paired tests for statistical significance at 95% confidence between baseline and NetSum performance; 1: McNemar, 2: Paired t-test, 3: Wilcoxon signed-rank.
"x" indicates pass, "o" indicates fail.
Since our studies are pair-wise, tests listed here are more accurate than error bars reported in Tables 2-5.
For each block produced by NetSum(b) and the baseline, we compute the ROUGE-1 and ROUGE-2 scores of the block against the set of highlights as a block.
For 73.26% of documents, NetSum(b) produces a block with a ROUGE-1 score that is equal to or better than the baseline score.
The two systems produce blocks of equal ROUGE-1 score for 24.69% of documents.
Under ROUGE-2, NetSum(b) performs equal to or better than the baseline on 73.19% of documents and equal to the baseline on 40.51% of documents.
Table 2 shows the average ROUGE-1 and ROUGE-2 scores obtained with NetSum(b) and the baseline.
NetSum(b) produces a higher quality block on average for ROUGE-1.
Table 4 lists the sentences in the block produced by NetSum(b) and the baseline block, for the article shown in Figure 1.
The NetSum(b) summary achieves a ROUGE-1 score of 0.52, while the baseline summary scores only 0.36.
Table 4: Block results for the block produced by NetSum(b) and the baseline block for the example article.
ROUGE-1 scores computed against the highlights as a block are listed.
7 Results: Highlights
Our second task is to extract three sentences from a document that best match the three highlights in order.
To accomplish this, we train NetSum(n) for each highlight n = 1, 2, 3.
We compare NetSum(n) with the baseline of picking the nth sentence of the document.
We perform five-fold cross-validation across our 1365 documents.
Our results are reported for the micro-average of the test results.
For each highlight n produced by both NetSum(n) and the baseline, we compute the ROUGE-1 and ROUGE-2 scores against the nth highlight.
We expect that beating the baseline for n = 1 is a more difficult task than for n = 2 or 3, since the first sentence of a news article typically acts as a summary of the article and since we expect the first highlight to summarize the article.
NetSum(1), however, produces a sentence with a ROUGE-1 score that is equal to or better than the baseline score for 93.26% of documents.
The two systems produce sentences of equal ROUGE-1 scores for 82.84% of documents.
Under ROUGE-2, NetSum(1) performs equal to or better than the baseline on 94.21% of documents.
Table 5 shows the average ROUGE-1 and ROUGE-2 scores obtained with NetSum(1) and the baseline.
NetSum(1) produces a higher quality sentence on average under ROUGE-1.
The content of highlights 2 and 3 typically comes from later in the document, so we expect the baseline not to perform as well on these tasks.
NetSum(2) outperforms the baseline since it is able to identify sentences from further down the document as important.
For 77.73% of documents, NetSum(2) produces a sentence with a ROUGE-1 score that is equal to or better than the score for the baseline.
The two systems produce sentences of equal ROUGE-1 score for 33.92% of documents.
Under ROUGE-2, NetSum(2) performs equal to or better than the baseline 84.84% of the time.

Table 5: Results on ordered highlights task with standard error at 95% confidence. Bold indicates significance under paired tests.

Table 6: Highlight results for highlight n produced by NetSum(n) and highlight n produced by the baseline for the example article. ROUGE-1 scores computed against highlight n are listed.
For 81.09% of documents, NetSum(3) produces a sentence with a ROUGE-1 score that is equal to or better than the score for the baseline.
The two systems produce sentences of equal ROUGE-1 score for 28.45% of documents.
Under ROUGE-2, NetSum(3) performs equal to or better than the baseline 89.91% of the time.
Table 5 shows the average ROUGE-1 and ROUGE-2 scores obtained with NetSum(2), NetSum(3), and the baseline. Both NetSum(2) and NetSum(3) produce a higher quality sentence on average under both measures.
Table 6 gives highlights produced by NetSum(n) and the highlights produced by the baseline, for the article shown in Figure 1.
The NetSum(n) highlights produce ROUGE-1 scores equal to or higher than the baseline ROUGE-1 scores.
In feature ablation studies, we confirmed that the inclusion of news-based and Wikipedia-based features improves NetSum's performance.
For example, we removed all news-based and Wikipedia-based features in NetSum(3).
The resulting performance moderately declined.
Under ROUGE-1, the baseline produced a better highlight on 22.34% of documents, versus only 18.91% when using third-party features.
Similarly, NetSum(3) produced a summary of equal or better ROUGE-1 score on only 77.66% of documents, compared to 81.09% of documents when using third-party features.
In addition, the average ROUGE-1 score dropped to 0.2182 and the average ROUGE-2 score dropped to 0.0448.
The performance of NetSum with third-party features over NetSum without third-party features is statistically significant at 95% confidence.
However, NetSum still outperforms the baseline even without third-party features, leading us to conclude that RankNet together with simple position and term-frequency features contributes the bulk of the performance gains, while the increased ROUGE-1 and ROUGE-2 scores are a clear benefit of third-party features.
8 Conclusions
We have presented a novel approach to automatic single-document summarization based on neural networks, called NetSum.
Our work is the first to use both neural networks for summarization and third-party datasets, namely Wikipedia and news query logs, for features.
We have evaluated our system on two novel tasks: 1) producing a block of highlights and 2) producing three ordered highlight sentences.
Our experiments were run on previously unstudied data gathered from CNN.com.
Our system shows remarkable performance over the baseline of choosing the first n sentences of the document, where the performance difference is statistically significant under ROUGE-1.
9 Future Work
An immediate future direction is to further explore feature selection.
We found third-party features beneficial to the performance of NetSum, and such sources can be mined further.
In addition, feature selection for each NetSum system could be performed separately since, for example, highlight 1 has different characteristics than highlight 2.
In our experiments, ROUGE scores are fairly low because a highlight rarely matches the content of a single sentence. To improve NetSum's performance, we must consider extracting content across sentence boundaries.
Such work requires a system to produce abstract summaries.
We hope to incorporate sentence simplification and sentence splicing and merging in a future version of NetSum.
Another future direction is the identification of "hard" and "easy" inputs.
Although we report average ROUGE scores, such measures can be misleading since some highlights are simple to match and some are much more difficult. A better system evaluation measure would incorporate the difficulty of the input and weight reported results accordingly.
