We present a new approach to automatic summarization based on neural nets, called NetSum. We extract a set of features from each sentence that helps identify its importance in the document. We apply novel features based on news search query logs and Wikipedia entities. Using the RankNet learning algorithm, we train a pair-based sentence ranker to score every sentence in the document and identify the most important sentences. We apply our system to documents gathered from CNN.com, where each document includes highlights and an article. Our system significantly outperforms the standard baseline in the ROUGE-1 measure on over 70% of our document set.
1 Introduction
Automatic summarization was first studied almost 50 years ago by Luhn (Luhn, 1958) and has continued to be a steady subject of research. Automatic summarization refers to the creation of a shortened version of a document or cluster of documents by a machine; see (Mani, 2001) for details. The summary can be an abstraction or an extraction. In an abstract summary, content from the original document may be paraphrased or generated, whereas in an extract summary, the content is preserved in its original form, i.e., sentences. Both summary types can involve sentence compression, but abstracts tend to be more condensed. In this paper, we focus on producing fully automated single-document extract summaries of newswire articles.

To create an extract, most automatic systems use linguistic and/or statistical methods to identify key words, phrases, and concepts in a sentence or across single or multiple documents. Each sentence is then assigned a score indicating the strength of presence of key words, phrases, and so on. Sentence scoring methods utilize both purely statistical and purely semantic features, for example as in (Vanderwende et al., 2006; Nenkova et al., 2006; Yih et al., 2007).
In 2001-02, the Document Understanding Conference (DUC, 2001) issued the task of creating a 100-word summary of a single news article. The best performing systems (Hirao et al., 2002; Lal and Ruger, 2002) used various learning and semantic-based methods, although no system could outperform the baseline with statistical significance (Nenkova, 2005). After 2002, the single-document summarization task was dropped.
In recent years, there has been a decline in studies on automatic single-document summarization, in part because the DUC task was dropped, and in part because the task of single-document extracts may be counterintuitively more difficult than multi-document summarization (Nenkova, 2005).

Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 448-457, Prague, June 2007. © 2007 Association for Computational Linguistics
However, with the ever-growing internet and increased information access, we believe single-document summarization is essential to improve quick access to large quantities of information. Recently, CNN.com (CNN.com, 2007a) added "Story Highlights" to many news articles on its site to allow readers to quickly gather information on stories. These highlights give a brief overview of the article and appear as 3-4 related sentences in the form of bullet points rather than a summary paragraph, making them even easier to quickly scan.
Our work is motivated both by the addition of highlights to an extremely visible and reputable online news source and by the inability of past single-document summarization systems to outperform the extremely strong baseline of choosing the first n sentences of a newswire article as the summary (Nenkova, 2005). Although some recent systems indicate an improvement over the baseline (Mihalcea, 2005; Mihalcea and Tarau, 2005), statistical significance has not been shown. We show that by using a neural network ranking algorithm and third-party datasets to enhance sentence features, our system, NetSum, can outperform the baseline with statistical significance.
Our paper is organized as follows. Section 2 describes our two studies: summarization and highlight extraction. We describe our dataset in detail in Section 3. Our ranking system and feature vectors are outlined in Section 4. We present our evaluation measure in Section 5. Sections 6 and 7 report on our results on summarization and highlight extraction, respectively. We conclude in Section 8 and discuss future work in Section 9.
In this paper, we focus on single-document summarization of newswire documents. Each document consists of three highlight sentences and the article text. Each highlight sentence is human-generated, but is based on the article. In Section 4 we discuss the process of matching a highlight to an article sentence. The output of our system consists of purely extracted sentences, where we do not perform any sentence compression or sentence generation. We leave such extensions for future work.
We develop two separate problems based on our document set. First, can we extract three sentences that best "match" the highlights as a whole? In this task, we concatenate the three sentences produced by our system into a single summary, or block, and similarly concatenate the three highlight sentences into a single summary, or block. We then compare our system's block against the highlight block. Second, can we extract three sentences that best "match" the three highlights, such that ordering is preserved? In this task, we produce three sentences, where the first sentence is compared against the first highlight, the second sentence is compared against the second highlight, and the third sentence is compared against the third highlight. Credit is not given for producing three sentences that match the highlights but are out of order. The second task considers ordering and compares sentences on an individual level, whereas the first task considers the three chosen sentences as a summary, or block, and disregards sentence order. In both tasks, we assume the title has been seen by the reader and will be listed above the highlights.
3 Evaluation Corpus
Our data consists of 1365 news documents gathered from CNN.com (CNN.com, 2007a). Each document was extracted by hand, where a maximum of 50 documents per day were collected. The documents were hand-collected on consecutive days during the month of February. Each document includes the title, timestamp, story highlights, and article text. The timestamp on articles ranges from December 2006 to February 2007, since articles remain posted on CNN.com for up to several months. The story highlights are human-generated from the article text. The number of story highlights is between 3 and 4. Since all articles include at least 3 story highlights, we consider only the task of extracting three highlights from each article.
4 Description of Our System
Our goal is to extract three sentences from a single news document that best match various characteristics of the three document highlights. One way to identify the best sentences is to rank the sentences using a machine learning approach, for example as in (Hirao et al., 2002).
TIMESTAMP: 1:59 p.m. EST, January 31, 2007
TITLE: Nigeria reports first human death from bird flu
HIGHLIGHT 1: Government boosts surveillance after woman dies
HIGHLIGHT 2: Egypt, Djibouti also have reported bird flu in humans
HIGHLIGHT 3: H5N1 bird flu virus has killed 164 worldwide since 2003
ARTICLE:
1. Health officials reported Nigeria's first cases of bird flu in humans on Wednesday, saying one woman had died and a family member had been infected but was responding to treatment. The victim, a 22-year-old woman in Lagos, died January 17, Information Minister Frank Nweke said in a statement. He added that the government was boosting surveillance across Africa's most-populous nation after the infections in Lagos, Nigeria's biggest city. The World Health Organization had no immediate confirmation. Nigerian health officials earlier said 14 human samples were being tested. Nweke made no mention of those cases on Wednesday. An outbreak of H5N1 bird flu hit Nigeria last year, but no human infections had been reported until Wednesday. Until the Nigerian report, Egypt and Djibouti were the only African countries that had confirmed infections among people. Eleven people have died in Egypt.
10. The bird flu virus remains hard for humans to catch, but health experts fear H5N1 may mutate into a form that could spread easily among humans and possibly kill millions in a flu pandemic.
11. Amid a new H5N1 outbreak reported in recent weeks in Nigeria's north, hundreds of miles from Lagos, health workers have begun a cull of poultry.
12. Bird flu is generally not harmful to humans, but the H5N1 virus has claimed at least 164 lives worldwide since it began ravaging Asian poultry in late 2003, according to the WHO.
13. The H5N1 strain had been confirmed in 15 of Nigeria's 36 states.
14. By September, when the last known case of the virus was found in poultry in a farm near Nigeria's biggest city of Lagos, 915,650 birds had been slaughtered nationwide by government veterinary teams under a plan in which the owners were promised compensation.
15. However, many Nigerian farmers have yet to receive compensation in the north of the country, and health officials fear that chicken deaths may be covered up by owners reluctant to slaughter their animals.
16. Since bird flu cases were first discovered in Nigeria last year, Cameroon, Djibouti, Niger, Ivory Coast, Sudan and Burkina Faso have also reported the H5N1 strain of bird flu in birds.
17. There are fears that it has spread even further than is known in Africa because monitoring is difficult on a poor continent with weak infrastructure.
18. With sub-Saharan Africa bearing the brunt of the AIDS epidemic, there is concern that millions of people with suppressed immune systems will be particularly vulnerable, especially in rural areas with little access to health facilities.
19. Many people keep chickens for food, even in densely populated urban areas.

Figure 1: Example document containing highlights and article text. Sentences are numbered by their position. Article is from (CNN.com, 2007b).
To learn such a ranking, a train set is labeled such that the labels identify the best sentences.
Then a set of features is extracted from each sentence in the train and test sets, and the train set is used to train the system. The system is then evaluated on the test set. The system learns from the train set the distribution of features for the best sentences and outputs a ranked list of sentences for each document.
In this paper, we rank sentences using a neural network algorithm called RankNet (Burges et al., 2005). From the labels and features for each sentence, we train a model that, when run on a test set of sentences, can infer the proper ranking of sentences in a document based on information gathered during training about sentence characteristics.
To accomplish the ranking, we use RankNet (Burges et al., 2005), a pair-based neural network algorithm used to rank a set of inputs, in this case the set of sentences in a given document. The system is trained on pairs of sentences (Si, Sj), such that Si should be ranked higher than or equal to Sj. Pairs are generated between sentences in a single document, not across documents. Each pair is determined from the input labels. Since our sentences are labeled using ROUGE (see Section 4.3), if the ROUGE score of Si is greater than the ROUGE score of Sj, then (Si, Sj) is one input pair.
The cost function for RankNet is the probabilistic cross-entropy cost function. Training is performed using a modified version of the backpropagation algorithm for two-layer nets (Le Cun et al., 1998), which is based on optimizing the cost function by gradient descent. A similar method of training on sentence pairs in the context of multi-document summarization was recently shown in (Toutanova et al., 2007).
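The pair-based training described above can be sketched as follows. This is a minimal illustrative implementation, not the authors' code: the network size, learning rate, and toy data are assumptions, and real RankNet and LambdaRank implementations include many refinements (Burges et al., 2005; Burges et al., 2006).

```python
import numpy as np

# Minimal sketch of a pair-based two-layer ranker in the spirit of
# RankNet (Burges et al., 2005). Hyperparameters are illustrative
# assumptions, not the paper's settings.
class TinyPairRanker:
    def __init__(self, n_features, n_hidden=8, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (n_features, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, n_hidden)
        self.lr = lr

    def score(self, x):
        # two-layer net: tanh hidden layer, linear output score
        return np.tanh(x @ self.W1 + self.b1) @ self.W2

    def train_pair(self, xi, xj):
        # (S_i, S_j) with S_i labeled higher; minimize the pairwise
        # cross-entropy cost C = log(1 + exp(-(s_i - s_j))) by gradient
        # descent, backpropagating through both sentences.
        hi = np.tanh(xi @ self.W1 + self.b1)
        hj = np.tanh(xj @ self.W1 + self.b1)
        lam = -1.0 / (1.0 + np.exp(hi @ self.W2 - hj @ self.W2))  # dC/ds_i
        dhi = lam * (1.0 - hi**2) * self.W2    # gradient at pre-activation i
        dhj = -lam * (1.0 - hj**2) * self.W2   # dC/ds_j = -dC/ds_i
        self.W1 -= self.lr * (np.outer(xi, dhi) + np.outer(xj, dhj))
        self.b1 -= self.lr * (dhi + dhj)
        self.W2 -= self.lr * lam * (hi - hj)

def make_pairs(labels):
    # pairs are generated within a single document, never across documents
    return [(i, j) for i in range(len(labels))
            for j in range(len(labels)) if labels[i] > labels[j]]

def fit(ranker, X, labels, epochs=500):
    for _ in range(epochs):
        for i, j in make_pairs(labels):
            ranker.train_pair(X[i], X[j])
```

On small separable toy data, a few hundred epochs suffice for the learned scores to respect the label order.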
Our system, NetSum, is a two-layer neural net trained using RankNet. To speed up the performance of RankNet, we implement RankNet in the framework of LambdaRank (Burges et al., 2006). For details, see (Burges et al., 2006; Burges et al., 2005). We experiment with between 5 and 15 hidden nodes and with an error rate between 10^-2 and 10^-7.
We implement 4 versions of NetSum. The first version, NetSum(b), is trained for our first summarization problem (b indicates block). The pairs are generated using the maximum ROUGE scores l_i (see Section 4.3). The other three rankers are trained to identify the sentence in the document that best matches highlight n. We train one ranker, NetSum(n), for each highlight n, for n = 1, 2, 3, resulting in three rankers. NetSum(n) is trained using pairs generated from the l_i^n ROUGE scores between sentence Si and highlight Hn (see Section 4.3).
4.2 Matching Extracted to Generated Sentences
In this section, we describe how to determine which sentence in the document best matches a given highlight. Choosing three sentences most similar to the three highlights is very challenging, since the highlights include content that has been gathered across sentences and even paragraphs, and furthermore include vocabulary that may not be present in the text. Jing showed, for 300 news articles, that 19% of human-generated summary sentences contain no matching article sentence (Jing, 2002). In addition, only 42% of the summary sentences match the content of a single article sentence, and even then there are still semantic and syntactic transformations between the summary sentence and the article sentence. Since each highlight is human-generated and does not exactly match any one sentence in the document, we must develop a method to identify how closely related a highlight is to a sentence.
We use the ROUGE (Lin, 2004b) measure to score the similarity between an article sentence and a highlight sentence. We anticipate low ROUGE scores for both the baseline and NetSum due to the difficulty of finding a single sentence to match a highlight.
Recall-Oriented Understudy for Gisting Evaluation (Lin, 2004b), known as ROUGE, measures the quality of a model-generated summary or sentence by comparing it to a "gold-standard", typically human-generated, summary or sentence. It has been shown that ROUGE is very effective for measuring both single-document summaries and single-document headlines (Lin, 2004a). ROUGE-N is an N-gram recall between a model-generated summary and a reference summary.
We use ROUGE-N, for N = 1, for labeling and evaluation of our model-generated highlights.1 ROUGE-1 and ROUGE-2 have been shown to be statistically similar to human evaluations and can be used with a single reference summary (Lin, 2004a). We have only one reference summary, the set of human-generated highlights, per document. In our work, the reference summary can be a single highlight sentence or the highlights as a block.
We calculate ROUGE-N as

ROUGE-N = ( Σ_{gram_N ∈ R} Count_match(gram_N) ) / ( Σ_{gram_N ∈ R} Count(gram_N) ),

where R is the reference summary, Si is the model-generated summary, N is the length of the N-gram gram_N, and Count_match(gram_N) is the maximum number of N-grams co-occurring in Si and R.2 The numerator cannot exceed the number of N-grams (non-unique) in R.
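The N-gram recall above can be computed directly. A small sketch follows; whitespace tokenization is an assumption here, and the real ROUGE toolkit adds options such as stemming, which (per footnote 1) we do not use.

```python
from collections import Counter

def rouge_n(reference, candidate, n=1):
    # N-gram recall of the candidate against the reference: matched
    # N-grams over the total (non-unique) N-grams in the reference.
    def ngrams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    ref, cand = ngrams(reference), ngrams(candidate)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # each reference N-gram is matched at most as often as it occurs in
    # the candidate, so the numerator cannot exceed the denominator
    matched = sum(min(count, cand[g]) for g, count in ref.items())
    return matched / total
```

For example, `rouge_n("the cat sat", "the cat ran")` is 2/3: two of the three reference unigrams appear in the candidate.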
We label each sentence Si by its ROUGE-1 score. For the first problem of matching the highlights as a block, we label each Si by l_i, the maximum ROUGE-1 score between Si and each highlight Hn, for n = 1, 2, 3, given by l_i = max_n R(Si, Hn).
For the second problem of matching three sentences to the three highlights individually, we label each sentence Si by l_i^n, the ROUGE-1 score between Si and Hn, given by l_i^n = R(Si, Hn). The ranker for highlight n, NetSum(n), is passed samples labeled using l_i^n.
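The two labeling schemes can be sketched as follows, using a small ROUGE-1 helper; the sentence and highlight strings in any example are hypothetical stand-ins for real article text.

```python
from collections import Counter

def rouge1(ref, cand):
    # unigram recall of cand against ref (lowercased whitespace tokens)
    r = Counter(ref.lower().split())
    c = Counter(cand.lower().split())
    total = sum(r.values())
    return sum(min(k, c[w]) for w, k in r.items()) / total if total else 0.0

def label_sentences(sentences, highlights):
    # per-highlight labels l_i^n = R(S_i, H_n), one row per highlight,
    # and block labels l_i = max_n R(S_i, H_n)
    per_highlight = [[rouge1(h, s) for s in sentences] for h in highlights]
    block = [max(row[i] for row in per_highlight)
             for i in range(len(sentences))]
    return block, per_highlight
```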
RankNet takes as input a set of samples, where each sample contains a label and a feature vector. The labels were previously described in Section 4.3. In this section, we describe each feature in detail and motivate in part why each feature is chosen. We generate 10 features for each sentence Si in each document, listed in Table 1.
Each feature is chosen to identify characteristics of an article sentence that may match those of a highlight sentence. Some of the features, such as position and N-gram frequencies, are commonly used for scoring.
Sentence scoring based on sentence position, terms common with the title, appearance of keyword terms, and other cue phrases is known as the Edmundsonian Paradigm (Edmundson, 1969; Alfonseca and Rodriguez, 2003; Mani, 2001).

Feature Name
Is First Sentence
Sentence Position
SumBasic Score
SumBasic Bigram Score
Title Similarity Score
Average News Query Term Score
News Query Term Sum Score
Relative News Query Term Score
Average Wikipedia Entity Score
Wikipedia Entity Sum Score

Table 1: Features used in our model.

1 We use an implementation of ROUGE that does not perform stemming or stopword removal.
2 ROUGE is typically used when the length of the reference summary is equal to the length of the model-generated summary. Our reference summary and model-generated summary are different lengths, so there is a slight bias toward longer sentences.
We use variations on these features as well as a novel set of features based on third-party data.
Typically, news articles are written such that the first sentence summarizes the article. Thus, we include a binary feature F(Si) that equals 1 if Si is the first sentence of the document: F(Si) = δ_{i,1}, where δ is the Kronecker delta function. This feature is used only for NetSum(b) and NetSum(1).
We include sentence position since we found in empirical studies that the sentence to best match highlight H1 is on average 10% down the article, the sentence to best match H2 is on average 20% down the article, and the sentence to best match H3 is 31% down the article.3 We calculate the position of Si in document D as

P(Si) = i / I,

where i ∈ {1, . . . , I} is the sentence number and I is the number of sentences in D.
We include the SumBasic score (Nenkova et al., 2006) of a sentence to estimate the importance of a sentence based on word frequency. We calculate the SumBasic score of Si in document D as

SB(Si) = ( Σ_{w ∈ Si} p(w) ) / |Si|,

where p(w) is the probability of word w and |Si| is the number of words in sentence Si. We calculate p(w) as p(w) = Count(w)/|D|, where Count(w) is the number of times word w appears in document D and |D| is the number of words in document D. Note that the score of a sentence is the average probability of a word in the sentence. We also include the SumBasic score over bigrams, where the words w are replaced by bigrams and we normalize by the number of bigrams in Si.

3 Though this is not always the case, as the sentence to match H2 precedes that to match H1 in 22.03% of documents, and the sentence to match H3 precedes that to match H2 in 29.32% of documents and precedes that to match H1 in 28.81% of documents.
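A small sketch of the unigram and bigram SumBasic scores defined above, assuming pre-tokenized input (the tokens in any example are illustrative):

```python
from collections import Counter

def sumbasic(sentence_tokens, doc_tokens):
    # SB(S_i): sum of p(w) over words in the sentence, divided by |S_i|,
    # with p(w) = Count(w) / |D| computed over the whole document
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    if not sentence_tokens or total == 0:
        return 0.0
    return sum(counts[w] / total for w in sentence_tokens) / len(sentence_tokens)

def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

def sumbasic_bigram(sentence_tokens, doc_tokens):
    # the same score with words replaced by bigrams, normalized by the
    # number of bigrams in the sentence
    return sumbasic(bigrams(sentence_tokens), bigrams(doc_tokens))
```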
We compute the similarity of a sentence Si in document D with the title T of D as the relative probability of title terms t ∈ T appearing in Si, where p(t) is the number of times term t appears in T over the number of terms in T, i.e., p(t) = Count(t)/|T|.
The remaining features we use are based on third-party data sources. Previously, third-party sources such as WordNet (Fellbaum, 1998), the web (Jagalamudi et al., 2006), or click-through data (Sun et al., 2005) have been used as features. We propose using news query logs and Wikipedia entities to enhance features.
We base several features on query terms frequently issued to Microsoft's news search engine http://search.live.com/news, and entities4 found in the online open-source encyclopedia Wikipedia (Wikipedia.org, 2007). If a query term or Wikipedia entity appears frequently in a CNN document, then we assume highlights should include that term or entity, since it is important on both the document and global level. Sentences containing query terms or Wikipedia entities therefore contain important content. We confirm the importance of these third-party features in Section 7.
We collected several hundred of the most frequently queried terms in February 2007 from the news query logs. We took the daily top 200 terms for 10 days. Our hypothesis is that a sentence with a higher number of news query terms should be a better candidate highlight.
We calculate the average probability of news query terms q in Si as

NQ(Si) = ( Σ_{q ∈ Si} p(q) ) / |q ∈ Si|,

where p(q) is the probability of a news query term q and |q ∈ Si| is the number of news query terms in Si. We calculate p(q) as p(q) = Count(q)/|q ∈ D|, where Count(q) is the number of times term q appears in D and |q ∈ D| is the number of news query terms in D.

4 We define an entity as a title of a Wikipedia page.
We perform term disambiguation on each document using an entity extractor (Cucerzan, 2007). Terms are disambiguated to a Wikipedia entity only if they match a surface form in Wikipedia. Wikipedia surface forms are terms that disambiguate to a Wikipedia entity and link to a Wikipedia page with the entity as its title. For example, "WHO" and "World Health Org." both refer to the World Health Organization, and should disambiguate to the entity "World Health Organization".
Sentences in CNN document D that contain Wikipedia entities that frequently appear in D are considered important. We calculate the average Wikipedia entity score for Si analogously to the average news query term score, with entities e in place of query terms. We also include the sum of Wikipedia entity scores, given by WE+(Si) = Σ_{e ∈ Si} p(e).
Note that all features except position features are a variant of SumBasic over different term sets. All features are computed over sentences where every word has been lowercased and punctuation has been removed after sentence breaking. We examined using stemming, but found stemming to be ineffective.
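The remaining feature computations follow the same SumBasic-style pattern over different term sets. A hypothetical sketch (the function names and the idea of passing in a precomputed term-probability table are our illustrative assumptions):

```python
def position(i, total_sentences):
    # P(S_i) = i / I for sentence number i in {1, ..., I}
    return i / total_sentences

def is_first(i):
    # binary first-sentence feature, the Kronecker delta delta_{i,1}
    return 1.0 if i == 1 else 0.0

def avg_term_probability(sentence_tokens, term_probs):
    # SumBasic-style average of p(term) over the special terms (title
    # terms, news query terms, or Wikipedia entities) found in the
    # sentence; 0 when the sentence contains none of them
    hits = [term_probs[w] for w in sentence_tokens if w in term_probs]
    return sum(hits) / len(hits) if hits else 0.0

def sum_term_probability(sentence_tokens, term_probs):
    # the corresponding "sum" variants (e.g. WE+) omit the normalization
    return sum(term_probs[w] for w in sentence_tokens if w in term_probs)
```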
5 Evaluation
We evaluate the performance of NetSum using ROUGE and by comparing against a baseline system. For the first summarization task, we compare against the baseline of choosing the first three sentences as the block summary. For the second highlights task, we compare NetSum(n) against the baseline of choosing sentence n (to match highlight n). Both tasks are novel in attempting to match highlights rather than a human-generated summary.
We consider ROUGE-1 to be the measure of importance and thus train our model on ROUGE-1 (to optimize ROUGE-1 scores) and likewise evaluate our system on ROUGE-1. We list ROUGE-2 scores for completeness, but do not expect them to be substantially better than the baseline since we did not directly optimize for ROUGE-2.5 For every document in our corpus, we compare NetSum's output with the baseline output by computing ROUGE-1 and ROUGE-2 between the highlight block and the NetSum block, and between the highlight block and the baseline block of sentences.
Similarly, for each highlight, we compute ROUGE-1 and ROUGE-2 between highlight n and NetSum(n) and between highlight n and sentence n, for n = 1, 2, 3. For each task, we calculate the average ROUGE-1 and ROUGE-2 scores of NetSum and of the baseline.
We also report the percent of documents where the ROUGE-1 score of NetSum is equal to or better than the ROUGE-1 score of the baseline.
We perform all experiments using five-fold cross-validation on our dataset of 1365 documents. We divide our corpus into five random sets and train on three combined sets, validate on one set, and test on the remaining set. We repeat this procedure for every combination of train, validation, and test sets. Our results are the micro-averaged results on the five test sets.
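The cross-validation protocol can be sketched as below. The exact rotation of fold roles is our assumption; the text specifies only three training folds, one validation fold, and one test fold, repeated so that results are micro-averaged over the five test sets.

```python
import random

def five_fold_splits(n_docs, seed=0):
    # split document indices into five random folds and yield
    # (train, validation, test) index lists: three folds for training,
    # one for validation, one for testing, rotating the roles so that
    # every fold serves once as the test set
    idx = list(range(n_docs))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::5] for k in range(5)]
    for t in range(5):
        v = (t + 1) % 5
        train = [i for k in range(5) if k not in (t, v) for i in folds[k]]
        yield train, folds[v], folds[t]
```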
For all experiments, Table 3 lists the statistical tests performed and the significance of performance differences between NetSum and the baseline at 95% confidence.
6 Results: Summarization
We first find three sentences that, as a block, best match the three highlights as a block. NetSum(b) produces a ranked list of sentences for each document. We create a block from the top 3 ranked sentences. The baseline is the block of the first 3 sentences of the document. A similar baseline outperforms all previous systems for news article summarization (Nenkova, 2005) and has been used in the DUC workshops (DUC, 2001).

5 NetSum can directly optimize for any measure by training on it; for example, results could be further improved by training on ROUGE-2 or on a weighted sum of scores. We leave such studies for future work.

Table 2: Results on summarization task with standard error at 95% confidence. Bold indicates significance under paired tests.

Table 3: Paired tests for statistical significance at 95% confidence between baseline and NetSum performance; 1: McNemar, 2: Paired t-test, 3: Wilcoxon signed-rank. "x" indicates pass, "o" indicates fail. Since our studies are pair-wise, tests listed here are more accurate than error bars reported in Tables 2-5.
For each block produced by NetSum(b) and the baseline, we compute the ROUGE-1 and ROUGE-2 scores of the block against the set of highlights as a block. For 73.26% of documents, NetSum(b) produces a block with a ROUGE-1 score that is equal to or better than the baseline score. The two systems produce blocks of equal ROUGE-1 score for 24.69% of documents.
Under ROUGE-2, NetSum(b) performs equal to or better than the baseline on 73.19% of documents and equal to the baseline on 40.51% of documents. Table 2 shows the average ROUGE-1 and ROUGE-2 scores obtained with NetSum(b) and the baseline. NetSum(b) produces a higher quality block on average for ROUGE-1.
Table 4 lists the sentences in the block produced by NetSum(b) and the baseline block, for the article shown in Figure 1. The NetSum(b) summary achieves a ROUGE-1 score of 0.52, while the baseline summary scores only 0.36.

Table 4: Block results for the block produced by NetSum(b) and the baseline block for the example article. ROUGE-1 scores computed against the highlights as a block are listed.
7 Results: Highlights
Our second task is to extract three sentences from a document that best match the three highlights in order. To accomplish this, we train NetSum(n) for each highlight n = 1, 2, 3. We compare NetSum(n) with the baseline of picking the nth sentence of the document. We perform five-fold cross-validation across our 1365 documents. Our results are reported for the micro-average of the test results.
For each highlight n produced by both NetSum(n) and the baseline, we compute the ROUGE-1 and ROUGE-2 scores against the nth highlight. We expect that beating the baseline for n = 1 is a more difficult task than for n = 2 or 3, since the first sentence of a news article typically acts as a summary of the article and since we expect the first highlight to summarize the article.
NetSum(1), however, produces a sentence with a ROUGE-1 score that is equal to or better than the baseline score for 93.26% of documents. The two systems produce sentences of equal ROUGE-1 scores for 82.84% of documents. Under ROUGE-2, NetSum(1) performs equal to or better than the baseline on 94.21% of documents.
Table 5 shows the average ROUGE-1 and ROUGE-2 scores obtained with NetSum(1) and the baseline. NetSum(1) produces a higher quality sentence on average under ROUGE-1.
The content of highlights 2 and 3 is typically from later in the document, so we expect the baseline not to perform as well in these tasks. NetSum(2) outperforms the baseline since it is able to identify sentences from further down the document as important.
For 77.73% of documents, NetSum(2) produces a sentence with a ROUGE-1 score that is equal to or better than the score for the baseline. The two systems produce sentences of equal ROUGE-1 score for 33.92% of documents. Under ROUGE-2, NetSum(2) performs equal to or better than the baseline 84.84% of the time.

Table 5: Results on ordered highlights task with standard error at 95% confidence. Bold indicates significance under paired tests.

Table 6: Highlight results for highlight n produced by NetSum(n) and highlight n produced by the baseline for the example article. ROUGE-1 scores computed against highlight n are listed.
For 81.09% of documents, NetSum(3) produces a sentence with a ROUGE-1 score that is equal to or better than the score for the baseline. The two systems produce sentences of equal ROUGE-1 score for 28.45% of documents. Under ROUGE-2, NetSum(3) performs equal to or better than the baseline 89.91% of the time.
Table 5 shows the average ROUGE-1 and ROUGE-2 scores obtained with NetSum(2), NetSum(3), and the baseline. Both NetSum(2) and NetSum(3) produce a higher quality sentence on average under both measures.
Table 6 gives highlights produced by NetSum(n) and the highlights produced by the baseline, for the article shown in Figure 1. The NetSum(n) highlights produce ROUGE-1 scores equal to or higher than the baseline ROUGE-1 scores.
In feature ablation studies, we confirmed that the inclusion of news-based and Wikipedia-based features improves NetSum's performance. For example, we removed all news-based and Wikipedia-based features in NetSum(3). The resulting performance moderately declined.
Under ROUGE-1, the baseline produced a better highlight on 22.34% of documents, versus only 18.91% when using third-party features. Similarly, NetSum(3) produced a summary of equal or better ROUGE-1 score on only 77.66% of documents, compared to 81.09% of documents when using third-party features. In addition, the average ROUGE-1 score dropped to 0.2182 and the average ROUGE-2 score dropped to 0.0448.
The performance of NetSum with third-party features over NetSum without third-party features is statistically significant at 95% confidence. However, NetSum still outperforms the baseline without third-party features, leading us to conclude that RankNet and simple position and term frequency features contribute the maximum performance gains, but increased ROUGE-1 and ROUGE-2 scores are a clear benefit of third-party features.
8 Conclusions
We have presented a novel approach to automatic single-document summarization based on neural networks, called NetSum. Our work is the first to use both neural networks for summarization and third-party datasets for features, using Wikipedia and news query logs. We have evaluated our system on two novel tasks: 1) producing a block of highlights and 2) producing three ordered highlight sentences. Our experiments were run on previously unstudied data gathered from CNN.com. Our system shows remarkable performance over the baseline of choosing the first n sentences of the document, where the performance difference is statistically significant under ROUGE-1.
9 Future Work
An immediate future direction is to further explore feature selection. We found third-party features beneficial to the performance of NetSum, and such sources can be mined further. In addition, feature selection for each NetSum system could be performed separately, since, for example, highlight 1 has different characteristics than highlight 2.
In our experiments, ROUGE scores are fairly low because a highlight rarely matches the content of a single sentence. To improve NetSum's performance, we must consider extracting content across sentence boundaries. Such work requires a system to produce abstract summaries. We hope to incorporate sentence simplification and sentence splicing and merging in a future version of NetSum.
Another future direction is the identification of "hard" and "easy" inputs. Although we report average ROUGE scores, such measures can be misleading, since some highlights are simple to match and some are much more difficult. A better system evaluation measure would incorporate the difficulty of the input and weight reported results accordingly.
