We present a domain-independent unsupervised topic segmentation approach based on hybrid document indexing. Lexical chains have been successfully employed to evaluate the lexical cohesion of text segments and to predict topic boundaries. Our approach is based on the notion of semantic cohesion. It uses spectral embedding to estimate semantic association between content nouns over a span of multiple text segments. Our method significantly outperforms the baseline on the topic segmentation task and achieves performance comparable to state-of-the-art methods that incorporate domain-specific information.
1 Introduction

The goal of topic segmentation is to discover story boundaries in a stream of text or audio recordings. A story is broadly defined as a segment of text containing topically related sentences. In particular, the task may require segmenting a stream of broadcast news, addressed by the Topic Detection and Tracking (TDT) evaluation project (Wayne, 2000; Allan, 2002). In this case, topically related sentences belong to the same news story. While we consider TDT data sets in this paper, we would like to pose the problem more broadly and consider a domain-independent approach to topic segmentation.
Previous research on topic segmentation has shown that lexical coherence is a reliable indicator of topical relatedness. Therefore, many approaches have concentrated on different ways of estimating the lexical coherence of text segments, such as semantic similarity between words (Kozima, 1993), similarity between blocks of text (Hearst, 1994), and adaptive language models (Beeferman et al., 1999). These approaches use word repetitions to evaluate coherence. Since the sentences covering the same story represent a coherent discourse segment, they typically contain the same or related words. Repeated words build lexical chains that are consequently used to estimate lexical coherence. This can be done either by analyzing the number of overlapping lexical chains (Hearst, 1994) or by building a short-range and long-range language model (Beeferman et al., 1999).
More recently, topic segmentation with lexical chains has been successfully applied to the segmentation of news stories, multi-party conversation, and audio recordings (Galley et al., 2003). When the task is to segment long streams of text containing stories which may continue at a later point in time, for example developing news stories, the building of lexical chains becomes intricate. In addition, word repetitions do not account for synonymy and semantic relatedness between words and therefore may not be able to discover the coherence of segments with little word overlap. Our approach aims at discovering semantic relatedness beyond word repetition.
It is based on the notion of semantic cohesion rather than lexical cohesion. We propose to use a similarity metric between segments of text that takes into account semantic associations between words spanning a number of segments. This method approximates lexical chains by averaging the similarity to a number of previous text segments and accounts for synonymy by using a hybrid document indexing scheme.

[Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 351-359, Prague, June 2007. © 2007 Association for Computational Linguistics]
Our text segmentation experiments show a significant performance improvement over the baseline.

The rest of the paper is organized as follows. Section 2 discusses hybrid indexing. Section 3 describes our segmentation algorithm. Section 4 reviews related approaches. Section 5 reports the experimental results. We conclude in Section 6.
2 Hybrid Document Indexing

For the topic segmentation task we would like to define a similarity measure that accounts for synonymy and semantic association between words. This similarity measure will be used to evaluate the semantic cohesion between text units, and a decrease in semantic cohesion will be used as an indicator of a story boundary. First, we develop a document representation which supports this similarity measure.
Capturing semantic relations between words in a document representation is difficult. Various approaches have tried to overcome the term independence assumption of the bag-of-words representation (Salton and McGill, 1983) by using distributional term clusters (Slonim and Tishby, 2000) and by expanding the document vectors with synonyms, see (Levow et al., 2005). Since content words can be combined into semantic classes, there has been considerable interest in low-dimensional representations. Latent Semantic Analysis (LSA) (Deerwester et al., 1990) is one of the best known dimensionality reduction algorithms. In the LSA space, documents are indexed with latent semantic concepts. LSA maps all words to low-dimensional vectors. However, the notion of semantic relatedness is defined differently for different subsets of the vocabulary. In addition, numerical information, abbreviations, and the documents' style may be very good indicators of their topic, but this information is no longer available after the dimensionality reduction.
We use a hybrid approach to document indexing to address these issues. We keep the notion of latent semantic concepts and also try to preserve the specifics of the document collection. Therefore, we divide the vocabulary into two sets: nouns and the rest of the vocabulary. The set of nouns does not include proper nouns. We use a method of spectral embedding, as described below, and compute a low-dimensional representation for documents using only the nouns. We also compute a tf-idf representation for documents using the other set of words. Since we can treat each latent semantic concept in the low-dimensional representation as part of the vocabulary, we combine the two vector representations for each document by concatenating them.
2.1 Spectral Embedding

A vector space representation for documents and sentences is convenient and makes similarity metrics such as the cosine and distance readily available. However, those metrics are useful only if they have a meaningful linguistic interpretation. Spectral methods comprise a family of algorithms that embed terms and documents in a low-dimensional vector space. These methods use pair-wise relations between the data points encoded in a similarity matrix. The main step is to find an embedding for the data that preserves the original similarities.
GLSA We use Generalized Latent Semantic Analysis (GLSA) (Matveeva et al., 2005) to compute the spectral embedding for nouns. GLSA computes term vectors, and since we would like to use spectral embedding for nouns, it is well-suited for our approach. GLSA extends the ideas of LSA by defining different ways to obtain the similarity matrix, and it has been shown to outperform LSA on a number of applications (Matveeva and Levow, 2006).
GLSA begins with a matrix of pair-wise term similarities S, computes its eigenvectors U, and uses the first k of them to represent terms and documents; for details see (Matveeva et al., 2005). The justification for this approach is the theorem by Eckart and Young (Golub and Reinsch, 1971) stating that the inner product similarities between the term vectors based on the eigenvectors of S represent the best element-wise approximation to the entries in S. In other words, the inner product similarity in the GLSA space preserves the semantic similarities in S. Since our representation will try to preserve the semantic similarities in S, it is important to have a matrix of similarities which is linguistically motivated.
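The eigenvector step just described can be sketched as follows. This is a minimal illustration under our own assumptions (the function name, the toy matrix, and the scaling of eigenvectors by the square roots of the eigenvalues are ours), not the authors' implementation:

```python
# Sketch: given a symmetric matrix S of pair-wise term similarities, use the
# top-k eigenvectors (scaled by sqrt of the eigenvalues) as k-dimensional term
# vectors, so that inner products between term vectors approximate S.
import numpy as np

def spectral_embedding(S, k):
    eigvals, eigvecs = np.linalg.eigh(S)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]      # indices of the top-k eigenvalues
    lam, U = eigvals[order], eigvecs[:, order]
    return U * np.sqrt(np.clip(lam, 0, None))  # rows are the term vectors

# toy similarity matrix over three terms
S = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
T = spectral_embedding(S, k=2)
approx = T @ T.T   # inner products give the best rank-2 approximation to S
```

The Eckart-Young theorem cited above is what guarantees that `T @ T.T` is the best rank-k element-wise approximation of S.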
Nearest Neighbors in GLSA Space
prosecutor, testimony, eyewitness
investment, category
broadcast, television, satellite
surprise, announcement, disappointment, stunning, reaction, astonishment

Table 1: Words' nearest neighbors in the GLSA semantic space.
2.2 Distributional Term Similarity

We use point-wise mutual information (PMI) to measure the similarity between words w_i and w_j:

PMI(w_i, w_j) = log [ P(W_i = 1, W_j = 1) / ( P(W_i = 1) P(W_j = 1) ) ]

Thus, for GLSA, S(w_i, w_j) = PMI(w_i, w_j).
Co-occurrence Proximity An advantage of PMI is its notion of proximity. The co-occurrence statistics for PMI are typically computed using a sliding window. Thus, PMI will be large only for words that co-occur within a small context of fixed size.
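The sliding-window PMI computation can be sketched as follows. This is a hedged toy version: the probability estimates (pair and unigram counts both normalized by the token count) are a simplification we chose for brevity, and all names are ours:

```python
# Sketch: PMI from sliding-window co-occurrence counts. Two words get a large
# PMI only if they co-occur within `window` tokens of each other.
import math
from collections import Counter

def pmi_matrix(tokens, window=8):
    word_counts = Counter(tokens)
    pair_counts = Counter()
    for i, w in enumerate(tokens):
        for v in tokens[i + 1:i + window]:     # co-occurrence within the window
            pair_counts[tuple(sorted((w, v)))] += 1
    n = len(tokens)
    pmi = {}
    for (w, v), c in pair_counts.items():
        p_wv = c / n                            # crude probability estimates
        p_w, p_v = word_counts[w] / n, word_counts[v] / n
        pmi[(w, v)] = math.log(p_wv / (p_w * p_v))
    return pmi

tokens = "the carmaker said the company will recall the minivans".split()
pmi = pmi_matrix(tokens, window=8)
```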
Semantic Association vs. Synonymy Although GLSA was successfully applied to synonymy induction (Matveeva et al., 2005), we would like to point out that GLSA discovers semantic association in a broad sense. Table 1 shows a few words from the TDT2 corpus and their nearest neighbors in the GLSA space.
We can see that for "witness", "finance" and "broadcast" the words are grouped into corresponding semantic classes. The nearest neighbors for "hearing" and "stay" represent their different senses. Interestingly, even for the abstract noun "surprise" the nearest neighbors are meaningful.
2.3 Document Indexing

We have two sets of vocabulary terms: a set of nouns, N, and the other words, T. We compute tf-idf document vectors indexed with the words in T:

d_j^T = ( a_j(w_1), ..., a_j(w_|T|) ), where a_j(w_t) = tf(w_t, d_j) * idf(w_t).

We also compute a k-dimensional representation with latent concepts c_i, as a weighted linear combination of the GLSA term vectors w_t:

d_j^N = sum over w_t in N of a_j(w_t) * w_t.

We concatenate these two representations to generate a hybrid indexing of documents:

d_j = ( d_j^T, d_j^N ).
In our experiments, we compute document and sentence representations using three indexing schemes: the tf-idf baseline, the GLSA representation, and the hybrid indexing. The GLSA indexing computes term vectors for all vocabulary words; document and sentence vectors are generated as linear combinations of term vectors, as shown above.
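The hybrid indexing scheme can be sketched in a few lines. This is a minimal illustration under our assumptions (the function names, toy counts, and two-dimensional embeddings are ours): non-noun words keep their tf-idf weights, nouns are summed into a low-dimensional GLSA vector, and the two parts are concatenated:

```python
# Sketch of hybrid document indexing: tf-idf components for non-noun words,
# a weighted sum of k-dimensional noun embeddings for the GLSA part, and the
# concatenation of both as the final document vector.
def tfidf_vector(doc_counts, idf, other_words):
    """tf-idf components for the non-noun part of the vocabulary."""
    return [doc_counts.get(w, 0) * idf.get(w, 0.0) for w in other_words]

def glsa_vector(doc_counts, term_embeddings, k):
    """Weighted linear combination of k-dimensional noun term vectors."""
    vec = [0.0] * k
    for word, count in doc_counts.items():
        emb = term_embeddings.get(word)
        if emb is not None:
            for i in range(k):
                vec[i] += count * emb[i]
    return vec

def hybrid_index(doc_counts, idf, other_words, term_embeddings, k):
    # concatenate the two representations into one document vector
    return (tfidf_vector(doc_counts, idf, other_words)
            + glsa_vector(doc_counts, term_embeddings, k))

doc = {"recall": 2, "carmaker": 1, "company": 1}
idf = {"recall": 1.5}
embeddings = {"carmaker": [0.9, 0.1], "company": [0.8, 0.2]}
v = hybrid_index(doc, idf, ["recall"], embeddings, k=2)
# v is approximately [3.0, 1.7, 0.3]
```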
2.4 Document Similarity

One can define document similarity at different levels of semantic content. Documents can be similar because they discuss the same people or events, or because they discuss related subjects and contain semantically related words. Hybrid indexing allows us to combine both definitions of similarity.
Each representation supports a different similarity measure. tf-idf uses term matching; the GLSA representation uses semantic association in the latent semantic space computed for all words; and hybrid indexing uses a combination of both: term matching for named entities and content words other than nouns, combined with semantic association for nouns.
In the GLSA space, the inner product between document vectors contains all pair-wise inner products between their words, which allows one to detect semantic similarity beyond term matching: if documents contain words which are different but semantically related, the inner products between the term vectors will contribute to the document similarity, as illustrated with an example in section 5.
When we compare two documents indexed with the hybrid indexing scheme, we compute a combination of similarity measures: document similarity contains the semantic association between all pairs of nouns and uses term matching for the rest of the vocabulary.
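The combined similarity can be sketched as a sum of two inner products, one over the tf-idf part and one over the GLSA part. The equal weighting of the two parts is our assumption, since the exact combination is not spelled out here:

```python
# Sketch of similarity for hybrid-indexed documents: term matching over the
# tf-idf part plus semantic association over the GLSA part.
def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

def hybrid_similarity(doc1, doc2):
    """Each doc is a pair (tfidf_part, glsa_part) of lists."""
    return inner(doc1[0], doc2[0]) + inner(doc1[1], doc2[1])

d1 = ([1.0, 0.0], [0.9, 0.1])   # some tf-idf weights + noun-concept vector
d2 = ([0.0, 2.0], [0.8, 0.2])   # no term overlap, but related nouns
# the term-matching part is 0, yet the GLSA part still contributes
s = hybrid_similarity(d1, d2)   # 0.0 + (0.72 + 0.02), approximately 0.74
```

This mirrors the example discussed in section 5: two sentences with no word overlap can still receive nonzero similarity through semantically related nouns.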
3 Topic Segmentation with Semantic Cohesion

Our approach to topic segmentation is based on the semantic cohesion supported by the hybrid indexing. Topic segmentation approaches use either sentences (Galley et al., 2003) or blocks of words (Hearst, 1994) as text units. We used both variants in our experiments. When using blocks, we computed blocks of a fixed size (typically 20 words) sliding over the documents with a fixed step size (10 or 5 words).
The algorithm predicts a story boundary when the semantic cohesion between two consecutive units drops. Blocks can cross story boundaries, so many predicted boundaries will be displaced with respect to the actual boundary.
Averaged Similarity In our preliminary experiments we used the largest difference in score to predict a story boundary, following the TextTiling approach (Hearst, 1994). We found, however, that in our document collection the word overlap between sentences was often not large, and pair-wise similarity could drop to zero even for sentences within the same story, as will be illustrated below. We could not obtain satisfactory results with this approach.
Therefore, we used the average similarity over a history of fixed size n. The semantic cohesion score at the position between two consecutive text units was computed by averaging the similarity of the current unit t_i to the n preceding units:

score(t_i) = (1/n) * sum over l = 1..n of sim(t_{i-l}, t_i).

Our approach predicts story boundaries at the minima of the semantic cohesion score.
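The boundary-prediction step can be sketched as follows. This is a toy illustration under our assumptions (units as word sets, overlap as the similarity function, and the function names are ours); the actual method uses the hybrid-indexing similarity:

```python
# Sketch: the cohesion score at each gap averages the similarity of the
# current text unit to the n preceding units; boundaries are predicted at
# the lowest-scoring gaps.
def cohesion_scores(units, sim, n=3):
    scores = []
    for i in range(1, len(units)):
        history = units[max(0, i - n):i]
        scores.append(sum(sim(h, units[i]) for h in history) / len(history))
    return scores  # scores[g] is the score for the gap before unit g + 1

def predict_boundaries(units, sim, num_boundaries, n=3):
    scores = cohesion_scores(units, sim, n)
    gaps = sorted(range(len(scores)), key=lambda g: scores[g])
    return sorted(g + 1 for g in gaps[:num_boundaries])

# toy "units" as word sets; similarity = word overlap
units = [{"cuba", "base"}, {"cuba", "clinton"}, {"base", "cuba"},
         {"gm", "recall"}, {"gm", "minivan"}]
b = predict_boundaries(units, lambda a, c: len(a & c), num_boundaries=1)
# b is [3]: the drop in cohesion before the GM story
```

Averaging over a history of n units, rather than comparing only adjacent units, is what keeps occasional zero-overlap sentence pairs within a story from triggering spurious boundaries.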
Approximating Lexical Chains One of the motivations for our cohesion score is that it approximates lexical chains, as for example in (Galley et al., 2003). Galley et al. (2003) define lexical chains R_1, ..., R_N by considering repetitions of terms t_1, ..., t_N and assigning larger weights to short and compact chains.
Then the lexical cohesion score between two text units t_i and t_j is based on the number of chains that overlap both of them:

lexSim(t_i, t_j) = sum over k of w_k(t_i) * w_k(t_j),

where w_k(t_j) = score(R_k) if the chain R_k overlaps t_j and zero otherwise.
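The chain-overlap score just described can be sketched as follows. The cosine normalization of the chain-weight vectors and all names here are our assumptions; LCseg's exact scoring has more parameters:

```python
# Sketch: each lexical chain R_k has a weight score(R_k); w_k(t) = score(R_k)
# if the chain overlaps unit t, else 0. The cohesion between two units is the
# (cosine-normalized) inner product of these chain-weight vectors.
import math

def chain_cohesion(chains, unit_i, unit_j):
    wi = [score if unit_i in spans else 0.0 for score, spans in chains]
    wj = [score if unit_j in spans else 0.0 for score, spans in chains]
    dot = sum(a * b for a, b in zip(wi, wj))
    norm = math.sqrt(sum(a * a for a in wi)) * math.sqrt(sum(b * b for b in wj))
    return dot / norm if norm else 0.0

# chains as (score, set of unit indices the chain overlaps)
chains = [(2.0, {0, 1, 2}),   # e.g. repetitions of "Cuba"
          (1.0, {1, 2}),
          (1.5, {3, 4})]      # e.g. repetitions of "GM"
high = chain_cohesion(chains, 1, 2)   # chains 0 and 1 overlap both units
low = chain_cohesion(chains, 2, 3)    # no shared chain, so cohesion is 0
```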
Our cohesion score takes into account only the chains for words that occur in t_i and have another occurrence within the n previous sentences. Due to this simplification, we can compute the score based on inner products. Once we make the transition to inner products, we can use hybrid indexing and compute a semantic cohesion score beyond term repetition.
4 Related Approaches

We compare our approach to the LCseg algorithm, which uses lexical chains to estimate topic boundaries (Galley et al., 2003). Hybrid indexing allows us to compute a semantic cohesion score rather than a lexical cohesion score based on word repetitions.
Choi et al. used LSA for segmentation (Choi et al., 2001). LSA (Deerwester et al., 1990) is a special case of spectral embedding, and Choi et al. (2001) used all vocabulary words to compute low-dimensional document vectors. We use GLSA (Matveeva et al., 2005) because it computes term vectors, as opposed to the dual document-term representation of LSA, and because it uses a different matrix of pair-wise similarities. Furthermore, Choi et al. (2001) used clustering to predict boundaries, whereas we use averaged similarity scores.
s1: The Cuban news agency Prensa Latina called Clinton's announcement Friday that Cubans picked up at sea will be taken to Guantanamo Bay naval base a "new and dangerous element" in U.S. immigration policy.
s2: The Cuban government has not yet publicly reacted to Clinton's announcement that Cuban rafters will be turned away from the United States and taken to the U.S. base on the southeast tip of Cuba.
s5: The arrival of Cuban emigrants could be an "extraordinary aggravation" to the situation, Prensa Latina said.
s6: It noted that Cuba had already denounced the use of the base as a camp for Haitian refugees, whom it had for many years encouraged to come to the United States.
s8: Cuba considers the land at the naval base, leased to the United States at the turn of the century, to be illegally occupied.
s10: General Motors Corp said Friday it was recalling 5,600 1993-94 model Chevrolet Lumina, Pontiac Trans Sport and Oldsmobile Silhouette minivans equipped with a power sliding door and built-in child seats.
s14: If this occurs, the shoulder belt may not properly retract, the carmaker said.
s15: GM is the only company to offer the power-sliding door.
s16: The company said it was not aware of any accidents or injuries related to the defect.
s17: To correct the problem, GM said dealers will install a modified interior trim piece that will reroute the seat belt.

Table 2: TDT. The first 17 sentences in the first file.
Existing approaches to hybrid indexing used different weights for proper nouns and noun phrase heads and used WordNet synonyms to expand the documents, for example (Hatzivassiloglou et al., 2000; Hatzivassiloglou et al., 2001). Our approach does not require linguistic resources or learning the weights. The semantic associations between nouns are estimated using spectral embedding.
5 Experiments

The first TDT collection is part of the LCseg toolkit1 (Galley et al., 2003), and we used it to compare our approach to LCseg. We used the part of this collection with 50 files of 22 documents each.
We also used the TDT2 collection2 of news articles from six news agencies in 1998. We used only the 9,738 documents that are assigned to one topic and are longer than 50 words.
We used the Lemur toolkit3 with stemming and a stop-word list for the tf-idf indexing; we used Bikel's parser4 to obtain the POS tags and select nouns; and we used the PLAPACK package (Bientinesi et al., 2003) to compute the eigenvalue decomposition.

3 http://www.lemurproject.org/
4 http://www.cis.upenn.edu/dbikel/software.html
Evaluation For the TDT data we use the error metric Pk (Beeferman et al., 1999) and WindowDiff (Pevzner and Hearst, 2002), which are implemented in the LCseg toolkit. We also used the TDT cost metric Cseg5, with the default parameters P(seg) = 0.3, Cmiss = 1, Cfa = 0.3 and a distance of 50 words.
All these measures look at two units (words or sentences) N units apart and evaluate how well the algorithm can predict whether there is a boundary between them. Lower values mean better performance for all measures.
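The Pk metric among these can be sketched compactly. This is a hedged illustration (representing segmentations as a segment id per unit is our choice; k is conventionally set to half the average reference segment length):

```python
# Sketch of the Pk error metric: slide a probe of width k over the text and
# count how often reference and hypothesis disagree about whether the two
# probe endpoints fall in the same segment. Lower is better.
def pk(ref_labels, hyp_labels, k):
    """ref_labels/hyp_labels give a segment id per unit, e.g. [0,0,0,1,1,1]."""
    n = len(ref_labels)
    errors = 0
    for i in range(n - k):
        ref_same = ref_labels[i] == ref_labels[i + k]
        hyp_same = hyp_labels[i] == hyp_labels[i + k]
        errors += ref_same != hyp_same
    return errors / (n - k)

ref = [0, 0, 0, 1, 1, 1]
perfect = pk(ref, ref, k=2)                    # identical segmentation: 0.0
mismatch = pk(ref, [0, 1, 0, 1, 0, 1], k=2)    # alternating hypothesis
```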
Global vs. Local GLSA Similarity To obtain the PMI values we used the TDT2 collection, denoted as GLSA_local. Since co-occurrence statistics based on larger collections give a better approximation to linguistic similarities, we also used 700,000 documents from the English GigaWord collection, denoted as GLSA. We used a window of size 8.
5.2 Topic Segmentation

The first set of experiments was designed to evaluate the advantage of the GLSA representation over the baseline. We compare our approach to the LCseg algorithm (Galley et al., 2003) and use sentences as the segmentation unit. To avoid the issue of parameter setting when the number of boundaries is not known, we provide each algorithm with the actual number of boundaries.

Figure 1: TDT. Pair-wise sentence similarities for tf-idf (left), GLSA (middle); x-axis shows story boundaries. Details for the first 20 sentences, table 2 (right).

Figure 2: TDT. Pair-wise sentence similarities for tf-idf (left), GLSA (middle) averaged over 10 preceding sentences; LCseg lexical cohesion scores (right). X-axis shows story boundaries.
TDT We use the LCseg approach and our approach with the baseline tf-idf representation and the GLSA representation to segment this corpus. Table 2 shows a few sentences. Many content words are repeated, so lexical chains are definitely a sound approach.
As shown in Table 2, in the first story the word "Cuba" or "Cuban" is repeated in every sentence, thus generating a lexical chain. On the topic boundary, the word overlap between sentences is very small. At the same time, the repetition of words may also be interrupted within a story: sentences 5 and 6 and sentences 14, 15, and 16 have little word overlap.
LCseg deals with this by defining several parameters to control chain length and gaps. This simple example illustrates the potential benefit of semantic cohesion.
Table 2 shows that "General Motors" or "GM" is not repeated in every sentence of the second story. However, "GM", "carmaker" and "company" are semantically related. Making this information available to the segmentation algorithm allows it to establish a connection between the sentences of the second story.
We computed pair-wise sentence similarities between pairs of consecutive sentences in the tf-idf and GLSA representations. Figure 1 shows the similarity values plotted for each sentence break. The pair-wise similarities based on term matching are very spiky, and there are many zeros within a story. The GLSA-based similarity makes the dips in the similarities at the boundaries more prominent. The last plot gives the details for the sentences in table 2.
In the tf-idf representation, sentences without word overlap receive zero similarity, but the GLSA representation is able to use the semantic association between "emigrants" and "refugees" for sentences 5 and 6, and also the semantic association between "carmaker" and "company" for sentences 14 and 15. This effect increases as we use the semantic cohesion score as in equation 7.

Table 3: TDT segmentation results.
Figure 2 shows the similarity values for tf-idf and GLSA and also the lexical cohesion scores computed by LCseg. The GLSA-based similarities are not quite as smooth as the LCseg scores, but they correctly discover the boundaries.
LCseg parameters are fine-tuned for this document collection. We used a general TDT2 GLSA representation for this collection, and the only segmentation parameter we used is to avoid placing the next boundary within n = 3 sentences of the previous one. For this reason the predicted boundary may be one sentence off the actual boundary.
These results are summarized in Table 3. The GLSA representation performs significantly better than the tf-idf baseline. Its Pk and WindowDiff scores are worse than those of LCseg with its default parameters.
We attribute this to the fact that we did not fine-tune our method to this collection and that its boundaries are often placed one position off the actual boundary.
TDT2 For this collection we used three different indexing schemes: the tf-idf baseline, the GLSA representation, and the hybrid indexing. Each representation supports a different similarity measure.
Our TDT experiments showed that the semantic cohesion score based on the GLSA representation improves the segmentation results. The variant of the TDT corpus we used is rather small and well-balanced; see (Galley et al., 2003) for details. In the second phase of experiments we evaluate our approach on the larger TDT2 corpus.
The experiments were designed to address the following issues:

• performance comparison between the GLSA and hybrid indexing representations. As mentioned before, GLSA embeds all words in a low-dimensional space. Whereas semantic classes for nouns have a theoretical linguistic justification, it is harder to motivate a latent space representation, for example, for proper nouns. Therefore, we want to evaluate the advantage of using spectral embedding only for nouns.

Table 4: TDT2 segmentation results. Sliding blocks with size 20 and step size 10; similarity averaged over 10 preceding blocks.
• collection dependence of similarities. The similarity matrix S is computed using the TDT2 corpus (GLSA_local) and using the larger GigaWord corpus. The larger corpus provides more reliable co-occurrence statistics. On the other hand, its word distribution is different from that of the TDT2 corpus. We wanted to evaluate whether the semantic similarities are collection independent.
Table 4 shows the performance evaluation. We show the results computed using blocks containing 20 words (after preprocessing) with step size 10. We tried other parameter values but did not achieve better performance, which is consistent with other research (Hearst, 1994; Galley et al., 2003).
We show the results for two settings: predicting a known number of boundaries, and predicting boundaries using a threshold. In our experiments we used the average of the smallest N scores as the threshold, with N = 4000 showing the best results.
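The thresholding variant can be sketched as follows (a minimal illustration; the strict `<` comparison and the function name are our choices):

```python
# Sketch: when the number of boundaries is unknown, set the threshold to the
# average of the N smallest cohesion scores and predict a boundary wherever
# the score falls below it.
def threshold_boundaries(scores, n_smallest):
    smallest = sorted(scores)[:n_smallest]
    threshold = sum(smallest) / len(smallest)
    return [i for i, s in enumerate(scores) if s < threshold]

scores = [0.9, 0.8, 0.05, 0.85, 0.1, 0.9]
b = threshold_boundaries(scores, n_smallest=3)
# threshold = (0.05 + 0.1 + 0.8) / 3, so b is [2, 4]
```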
The spectral-embedding-based representations (GLSA, Hybrid) significantly outperform the baseline. This confirms the advantage of the semantic cohesion score over term matching. Hybrid indexing outperforms the GLSA representation, supporting our intuition that semantic association is best defined for nouns.
We used the GigaWord corpus to obtain the pair-wise word associations for the GLSA and Hybrid representations. We also computed GLSA_local and Hybrid_local using the TDT2 corpus to obtain the pair-wise word associations. The co-occurrence statistics based on the GigaWord corpus provide more reliable estimations of semantic association despite the difference in term distribution. The difference is larger in the GLSA case, where we compute the embedding for all words: GLSA performs better than GLSA_local. Hybrid_local performs only slightly worse than Hybrid.
This seems to support the claim that semantic associations between nouns are largely collection independent. On the other hand, semantic associations for proper names are collection dependent, at least because the collections are static while the semantic relations of proper names may change over time. The semantic space for the name of a president, for example, is different for the period of his presidency and for the time before and after it.
Disappointingly, we could not achieve good results with LCseg on this collection. It tends to split stories into short paragraphs. Hybrid indexing achieved results comparable to state-of-the-art approaches; see (Fiscus et al., 1998) for an overview.
6 Conclusion and Future Work

We presented a topic segmentation approach based on semantic cohesion scores. Our approach is domain independent and does not require training or the use of lexical resources. The scores are computed based on the hybrid document indexing, which uses spectral embedding in the space of latent concepts for nouns and keeps proper nouns and other specifics of the document collection unchanged.
We approximate the lexical chains approach by simplifying the definition of a chain, which allows us to use inner products as the basis for the similarity score. The similarity score takes into account semantic relations between nouns beyond term matching. This semantic cohesion approach showed good results on the topic segmentation task.
We intend to extend the hybrid indexing approach by considering more vocabulary subsets. Syntactic similarity, for example, is more appropriate for verbs than co-occurrence. As a next step, we intend to embed verbs using syntactic similarity. It would also be interesting to use lexical chains for proper names and to learn the weights for the different similarity scores.
