Query
segmentation
is
the
process
of
taking
a
user
's
search-engine
query
and
dividing
the
tokens
into
individual
phrases
or
semantic
units
.
Identification
of
these
query
segments
can
potentially
improve
both
document-retrieval
precision
,
by
first
returning
pages
which
contain
the
exact
query
segments
,
and
document-retrieval
recall
,
by
allowing
query
expansion
or
substitution
via
the
segmented
units
.
We
train
and
evaluate
a
machine-learned
query
segmentation
system
that
achieves
86
%
segmentation-decision
accuracy
on
a
gold
standard
set
of
segmented
noun
phrase
queries
,
well
above
recently
published
approaches
.
Key
enablers
of
this
high
performance
are
features
derived
from
previous
natural
language
processing
work
in
noun
compound
bracketing
.
For
example
,
token
association
features
beyond
simple
N-gram
counts
provide
powerful
indicators
of
segmentation
.
1
Introduction
Billions
of
times
every
day
,
people
around
the
world
communicate
with
Internet
search
engines
via
a
small
text
box
on
a
web
page
.
The
user
provides
a
sequence
of
words
to
the
search
engine
,
and
the
search
engine
interprets
the
query
and
tries
to
return
web
pages
that
not
only
contain
the
query
tokens
,
but
that
are
also
somehow
about
the
topic
or
idea
that
the
query
terms
describe
.
Recent
years
have
seen
a
widespread
recognition
that
the
user
is
indeed
providing
natural
language
text
to
the
search
engine
;
query
tokens
are
not
independent
,
unordered
symbols
to
be
matched
on
a
web
document
but
rather
ordered
words
and
phrases
with
syntactic
relationships
.
For
example
,
Zhai
(
1997
)
pointed
out
that
indexing
on
single-word
symbols
is
not
able
to
distinguish
a
search
for
"
bank
terminology
"
from
one
for
"
terminology
bank
.
"
The
reader
can
submit
these
queries
to
a
current
search
engine
to
confirm
that
modern
indexing
does
recognize
the
effect
of
token
order
on
query
meaning
in
some
way
.
Accurately
interpreting
query
semantics
also
depends
on
establishing
relationships
between
the
query
tokens
.
For
example
,
consider
the
query
"
two
man
power
saw
.
"
There
are
a
number
of
possible
interpretations
of
this
query
,
and
these
can
be
expressed
through
a
number
of
different
segmentations
or
bracketings
of
the
query
terms
:
[
two
]
[
manpower
]
[
saw
]
,
etc.
One
simple
way
to
make
use
of
these
interpretations
in
search
would
be
to
put
quotation
marks
around
the
phrasal
segments
to
require
the
search
engine
to
only
find
pages
with
exact
phrase
matches
.
If
,
as
seems
likely
,
the
searcher
is
seeking
pages
about
the
large
,
mechanically-powered
two-man
saws
used
by
lumberjacks
and
sawyers
to
cut
big
trees
,
then
the
first
segmentation
is
correct
.
Indeed
,
a
phrasal
search
for
"
two
man
power
saw
"
on
Google
does
find
the
device
of
interest
.
So
does
the
second
interpretation
,
but
along
with
other
,
less-relevant
pages
discussing
competitions
involving
"
two-man
handsaw
,
Proceedings
of
the
2007
Joint
Conference
on
Empirical
Methods
in
Natural
Language
Processing
and
Computational
Natural
Language
Learning
,
pp.
819-826
,
Prague
,
June
2007
.
©
2007
Association
for
Computational
Linguistics
two-woman
handsaw
,
power
saw
log
bucking
,
etc.
"
The
top
document
returned
for
the
third
interpretation
,
meanwhile
,
describes
a
man
on
a
rampage
at
a
subway
station
with
two
cordless
power
saws
,
while
the
fourth
interpretation
finds
pages
about
topics
ranging
from
hockey
's
thrilling
two-man
power
play
advantage
to
the
man
power
situation
during
the
Second
World
War
.
Clearly
,
choosing
the
right
segmentation
means
finding
the
right
documents
faster
.
Query
segmentation
can
also
help
if
insufficient
pages
are
returned
for
the
original
query
.
A
technique
such
as
query
substitution
or
expansion
(
Jones
et
al.
,
2006
)
can
be
employed
using
the
segmented
units
.
For
example
,
we
could
replace
the
sexist
"
two
man
"
modifier
with
the
politically-correct
"
two
person
"
phrase
in
order
to
find
additional
relevant
documents
.
Without
segmentation
,
expanding
via
the
individual
words
"
two
,
"
"
man
,
"
"
power
,
"
or
"
saw
"
could
produce
less
sensible
results
.
In
this
paper
,
we
propose
a
data-driven
,
machine-learned
approach
to
query
segmentation
.
Similar
to
previous
segmentation
approaches
described
in
Section
2
,
we
make
a
decision
to
segment
or
not
to
segment
between
each
pair
of
tokens
in
the
query
.
Unlike
previous
work
,
we
view
this
as
a
classification
task
where
the
decision
parameters
are
learned
discriminatively
from
gold
standard
data
.
In
Section
3
,
we
describe
our
approach
and
the
features
we
use
.
Section
4
describes
our
labelled
data
,
as
well
as
the
specific
tools
used
for
our
experiments
.
Section
5
provides
the
results
of
our
evaluation
,
and
shows
the
strong
gains
in
performance
possible
using
a
wide
set
of
features
within
a
discriminative
framework
.
2
Related
Work
Query
segmentation
has
previously
been
approached
in
an
unsupervised
manner
.
Risvik
et
al.
(
2003
)
combine
the
frequency
count
of
a
segment
and
the
mutual
information
(
MI
)
between
pairs
of
words
in
the
segment
in
a
heuristic
scoring
function
.
The
system
chooses
the
segmentation
with
the
highest
score
as
the
output
segmentation
.
Jones
et
al.
(
2006
)
use
MI
between
pairs
of
tokens
as
the
sole
factor
in
deciding
on
segmentation
breaks
.
If
the
MI
is
above
a
threshold
(
optimized
on
a
small
training
set
)
,
the
pair
of
tokens
is
joined
in
a
segment
.
Otherwise
,
a
segmentation
break
is
made
.
Query
segmentation
is
related
to
the
task
of
noun
compound
(
NC
)
bracketing
.
NC
bracketing
determines
the
syntactic
structure
of
an
NC
as
expressed
by
a
binary
tree
,
or
,
equivalently
,
a
binary
bracketing
(
Nakov
and
Hearst
,
2005a
)
.
Zhai
(
1997
)
first
identified
the
importance
of
syntactic
query
/
corpus
parsing
for
information
retrieval
,
but
did
not
consider
query
segmentation
itself
.
In
principle
,
as
N
increases
,
the
number
of
binary
trees
for
an
N-token
compound
is
much
greater
than
the
2^(N-1)
possible
segmentations
.
In
practice
,
empirical
NC
research
has
focused
on
three-word
compounds
.
The
computational
problem
is
thus
deciding
whether
the
three-word
NC
has
a
left
or
right-bracketing
structure
(
Lauer
,
1995
)
.
For
the
segmentation
task
,
analysing
a
three-word
NC
requires
deciding
between
four
different
segmentations
.
For
example
,
there
are
two
bracketings
for
"
used
car
parts
,
"
the
left-bracketing
"
[
[
used
car
]
parts
]
"
and
the
right-bracketing
"
[
used
[
car
parts
]
]
,
"
while
there
are
four
segmentations
,
including
the
case
where
there
is
only
one
segment
,
"
[
used
car
parts
]
"
and
the
base
case
where
each
token
forms
its
own
segment
,
"
[
used
]
[
car
]
[
parts
]
.
"
Query
segmentation
thus
naturally
handles
the
case
where
the
query
consists
of
multiple
,
separate
noun
phrases
that
should
not
be
analysed
with
a
single
binary
tree
.
Despite
the
differences
between
the
tasks
,
it
is
worth
investigating
whether
the
information
that
helps
disambiguate
left
and
right-bracketings
can
also
be
useful
for
segmentation
.
In
particular
,
we
explored
many
of
the
sources
of
information
used
by
Nakov
and
Hearst
(
2005a
)
,
as
well
as
several
novel
features
that
aid
segmentation
performance
and
should
also
prove
useful
for
NC
analysis
researchers
.
Unlike
all
previous
approaches
that
we
are
aware
of
,
we
apply
our
features
in
a
flexible
discriminative
framework
rather
than
a
classification
based
on
a
vote
or
average
of
features
.
NC
analysis
has
benefited
from
the
recent
trend
of
using
web-derived
features
rather
than
corpus-based
counts
(
Keller
and
Lapata
,
2003
)
.
Lapata
and
Keller
(
2004
)
first
used
web-based
co-occurrence
counts
for
the
bracketing
of
NCs
.
Recent
innovations
have
been
to
use
statistics
"
beyond
the
N-gram
,
"
such
as
counting
the
number
of
web
pages
where
a
pair
of
words
w
,
x
participate
in
a
genitive
relationship
(
"
w
's
x
"
)
,
occur
collapsed
as
a
single
phrase
(
"
wx
"
)
(
Nakov
and
Hearst
,
2005a
)
or
have
a
definite
article
as
a
left-boundary
marker
(
"
the
w
x
"
)
(
Nicholson
and
Baldwin
,
2006
)
.
We
show
strong
performance
gains
when
such
features
are
employed
for
query
segmentation
.
NC
bracketing
is
part
of
a
larger
field
of
research
on
multiword
expressions
including
general
NC
interpretation
.
NC
interpretation
explores
not
just
the
syntactic
dependencies
among
compound
constituents
,
but
the
semantics
of
the
nominal
relationships
(
Girju
et
al.
,
2005
)
.
Web-based
statistics
have
also
had
an
impact
on
these
wider
analysis
tasks
,
including
work
on
interpretation
of
verb
nominalisations
(
Nicholson
and
Baldwin
,
2006
)
and
NC
coordination
(
Nakov
and
Hearst
,
2005b
)
.
3
Methodology
3.1
Segmentation
Classification
Consider a query x = {x1, x2, . . ., xN} consisting of N query tokens. Segmentation is a mapping S : x → y ∈ Y_N, where y is a segmentation from the set Y_N.
Since
we
can
either
have
or
not
have
a
segmentation
break
at
each
of
the
N
—
1
spaces
between
the
N
tokens
,
|Y_N| = 2^(N-1).
Supervised
machine
learning
can
be
applied
to
derive
the
mapping
S
automatically
,
given
a
set
of
training
examples
consisting
of
pairs
of
queries
and
their
segmentations
T
=
{
(
xi
,
yi
)
}
.
Typically
this
would
be
done
via
a
set
of
features
Φ(x, y)
for
the
structured
examples
.
A
set
of
weights
w
can
be
learned
discriminatively
such
that
each
training
example
(
xi
,
yi
)
has
a
higher
score
,
Score_w(x, y) = w · Φ(x, y), than alternative query-segmentation pairs, (xi, zi), zi ≠ yi.1
At
test
time
,
the
classifier
chooses
the
segmentation
for
x
that
has
the
highest
score
according
to
the
learned
parameterization
:
ŷ = argmax_y Score_w(x, y).
Unlike
many
problems
in
NLP
such
as
parsing
or
part-of-speech
tagging
,
the
small
cardinality
of
Y_N
makes
enumerating
all
the
alternative
query
segmentations
computationally
feasible
.
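As a concrete sketch of this scheme (ours; the indicator feature map and the hand-set weights below are toy stand-ins for the learned model, not the paper's features), enumerating Y_N and taking the argmax of a linear score:

```python
from itertools import product

def segmentations(tokens):
    """All 2^(N-1) segmentations of the query, as lists of segments."""
    for breaks in product([False, True], repeat=len(tokens) - 1):
        segs, cur = [], [tokens[0]]
        for tok, brk in zip(tokens[1:], breaks):
            if brk:
                segs.append(tuple(cur))
                cur = [tok]
            else:
                cur.append(tok)
        segs.append(tuple(cur))
        yield segs

def phi(segmentation):
    """Toy feature map Phi(x, y): one indicator feature per segment."""
    return {seg: 1.0 for seg in segmentation}

def score(segmentation, w):
    """Score_w(x, y) = w . Phi(x, y)."""
    return sum(w.get(f, 0.0) * v for f, v in phi(segmentation).items())

# hypothetical hand-set weights standing in for the learned w
w = {("star", "wars"): 2.0, ("weapons",): 0.5, ("guns",): 0.5,
     ("weapons", "guns"): -1.0}
best = max(segmentations(["star", "wars", "weapons", "guns"]),
           key=lambda y: score(y, w))
print(best)  # [('star', 'wars'), ('weapons',), ('guns',)]
```

Because |Y_N| = 2^(N-1) is small for typical query lengths, the exhaustive `max` over candidates is exactly the feasible enumeration the text describes.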
In
our
preliminary
experiments
,
we
used
a
Support
Vector
Machine
(
SVM
)
ranker
(
Joachims
,
2002
)
to
learn
the
structured
classifier.2
We also investigated a Hidden Markov Model SVM (Altun et al., 2003) to label the segmentation breaks using information from past segmentation decisions.

1See e.g. Collins (2002) for a popular training algorithm.
2A ranking approach was also used previously by Daume III and Marcu (2004) for the CoNLL-99 nested noun phrase identification task.
Ultimately
,
the
mappings
produced
by
these
approaches
were
not
as
accurate
as
a
simple
formulation
that
creates
a
full
query
segmentation
y
as
the
combination
of
independent
classification
decisions
made
between
each
pair
of
tokens
in
the
query.3
In
the
classification
framework
,
the
input
is
a
query
,
x
,
a
position
in
the
query
,
i
,
where
0
&lt;
i
&lt;
N
,
and
the
output
is
a
segmentation
decision
yes
/
no
.
The
training
set
of
segmented
queries
is
converted
into
examples
of
decisions
between
tokens
and
learning
is
performed
on
this
set
.
At
test
time
,
N
—
1
segmentation
decisions
are
made
for
the
N-length
query
and
an
output
segmentation
y
is
produced
.
Here
,
features
depend
only
on
the
input
query
x
and
the
position
in
the
query
i.
For
a
decision
at
position
i
,
we
use
features
from
tokens
up
to
three
positions
to
the
left
and
to
the
right
of
the
decision
location
.
That
is
,
for
a
decision
between
xL0 and xR0, we extract features from a window of six tokens in the query: {. . ., xL2, xL1, xL0, xR0, xR1, xR2, . . .}.
We
now
detail
the
features
derived
from
this
window
.
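A sketch of this window extraction (our code; `None` stands in for the missing-context indicator that fires when the query ends before the window does):

```python
def feature_window(tokens, i, radius=3):
    """For a decision between tokens[i-1] (xL0) and tokens[i] (xR0),
    return up to `radius` tokens on each side of the boundary,
    None-padded where the query runs out."""
    left = tokens[max(0, i - radius):i]
    left = [None] * (radius - len(left)) + left
    right = tokens[i:i + radius]
    right = right + [None] * (radius - len(right))
    return left + right

# decision between "wars" and "weapons" (boundary index i = 2):
print(feature_window(["star", "wars", "weapons", "guns"], 2))
# [None, 'star', 'wars', 'weapons', 'guns', None]
```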
There are a number of possible indicators of whether a segmentation break occurs between a pair of tokens.
Some
of
these
features
fire
separately
for
each
token
x
in
our
feature
window
,
while
others
are
defined
over
pairs
or
sets
of
tokens
in
the
window
.
We
first
describe
the
features
that
are
defined
for
the
tokens
around
the
decision
boundary
,
xL0
and
xR0
,
before
describing
how
these
same
features
are
extended
to
longer
phrases
and
other
token
pairs
.
Table
1
lists
the
binary
features
that
fire
if
particular
aspects
of
a
token
or
pair
of
tokens
are
present
.
For
example
,
one
of
the
POS-tags
features
will
fire
if
the
pair
's
part-of-speech
tags
are
DT
JJ
,
another
feature
will
fire
if
the position of the pair in the token sequence is 2, etc.

3The structured learners did show large gains over the classification framework on the dev-set when using only the basic features for the decision-boundary tokens (see Section 3.2.1), but not when the full feature set was deployed. Also, features only available to structured learners, e.g. number of segments in query, etc., did improve the performance of the structured approaches, but not above that of the simpler classifier.

Table 1: Indicator features.
is-the: token is "the"
is-free: token is "free"
POS-tags: part-of-speech tags of pair xL0 xR0
fwd-pos: position from beginning, i
rev-pos: position from end, N - i
The
two
lexical
features
(
for
when
the
token
is
"
the
"
and
when
the
token
is
"
free
"
)
fire
separately
for
the
left
and
right
tokens
around
the
decision
boundary
.
They
are
designed
to
add
discrimination
for
these
common
query
words
,
motivated
by
examples
in
our
training
set
.
For
example
,
in
the
training
set
,
"
free
"
often
occurs
in
its
own
segment
when
it
's
on
the
left-hand-side
of
a
decision
boundary
(
e.g.
"
free
"
"
online
"
.
.
.
)
,
but
may
join
into
a
larger
segment
when
it
's
on
the
right-hand-side
of
a
collocation
(
e.g.
"
sulfite
free
"
or
"
sugar
free
"
)
.
The
classifier
can
use
the
feature
weights
to
encourage
or
discourage
segmentation
in
these
specific
situations
.
For
statistical
features
,
previous
work
(
Section
2
)
suggests
that
the
mutual
information
between
the
decision
tokens
xL0
and
xR0
may
be
appropriate
.
The log of the pointwise mutual information (Church and Hanks, 1989) between the decision-boundary tokens xL0, xR0 is:

MI(xL0, xR0) = log [ C(xL0 xR0) K / ( C(xL0) C(xR0) ) ]

This is equivalent to the sum: log C(xL0 xR0) + log K - log C(xL0) - log C(xR0).
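The equivalence of the ratio and sum forms is easy to check numerically. A quick Python verification (the page counts below are made up purely for illustration):

```python
import math

def log_pmi(c_pair, c_left, c_right, K):
    """log PMI between xL0 and xR0 from page counts C(.) and
    total page count K, in ratio form."""
    return math.log(c_pair * K / (c_left * c_right))

def log_pmi_sum(c_pair, c_left, c_right, K):
    """Equivalent sum form: log C(xL0 xR0) + log K - log C(xL0) - log C(xR0)."""
    return math.log(c_pair) + math.log(K) - math.log(c_left) - math.log(c_right)

# illustrative, made-up counts: pair count, unigram counts, total pages
a = log_pmi(500_000, 2_000_000, 3_000_000, 10**10)
b = log_pmi_sum(500_000, 2_000_000, 3_000_000, 10**10)
assert abs(a - b) < 1e-9  # the two forms agree
```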
For
web-based
features
,
the
counts
C
(
.
)
can
be
taken
as
a
search
engine
's
count
of
the
number
of
pages
containing
the
term
.
The
normalizer
K
is
thus
the
total
number
of
pages
on
the
Internet
.
Represented
as
a
summation
,
we
can
see
that
providing
MI
as
the
feature
effectively
ties
the
weights
on
the
logarithmic
counts
C
(
xL0xR0
)
,
C
(
xL0
)
,
and
C
(
xR0
)
.
Another
approach
would
be
to
provide
these
logarithmic
counts
as
separate
features
to
our
learning
algorithm
,
which
can
then
set
the
weights
optimally
for
segmentation
.
We
call
this
set
of
counts
the
"
Basic
"
features
.
In Section 5, we confirm results on our development set that showed using the basic features untied increased segmentation performance by up to 4% over using MI - an important observation for all researchers using association models as features in their discriminative classifiers.

Table 2: Statistical features.
web-count: count of the token on the web
pair-count: count of the pair "xL0 xR0"
definite: count with a definite article as left-boundary marker, "the xL0 xR0"
collapsed: count of the pair collapsed as a single phrase, "xL0xR0"
and-count: count of the coordinated pair, "xL0 and xR0"
genitive: count of the genitive pair, "xL0 's xR0"
Qcounts: counts of "x" in query database
Furthermore
,
with
this
technique
,
we
do
not
need
to
normalize
the
counts
for
the
other
pairwise
statistical
features
given
in
Table
2
.
We
can
simply
rely
on
our
learning
algorithm
to
increase
or
decrease
the
weights
on
the
logarithm
of
the
counts
as
needed
.
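Writing MI as a sum makes the tying explicit: MI fixes the weights on the four logarithmic counts at (+1, +1, -1, -1), while the untied Basic features let the learner set each weight independently. A small sketch of this (our code, illustrative counts):

```python
import math

def basic_features(c_pair, c_left, c_right, K):
    # untied "Basic" features: each logarithmic count is its own feature
    return {"log_pair": math.log(c_pair), "log_K": math.log(K),
            "log_left": math.log(c_left), "log_right": math.log(c_right)}

# MI is the special case where the weights are tied to (+1, +1, -1, -1)
MI_WEIGHTS = {"log_pair": 1.0, "log_K": 1.0,
              "log_left": -1.0, "log_right": -1.0}

def linear_score(feats, w):
    return sum(w[f] * v for f, v in feats.items())

# made-up counts for illustration
feats = basic_features(500_000, 2_000_000, 3_000_000, 10**10)
mi = linear_score(feats, MI_WEIGHTS)  # recovers log PMI exactly
```

A discriminative learner is free to move the weights away from this tied setting, which is what produced the gains reported above.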
To
illustrate
how
the
statistical
features
work
,
consider
a
query
from
our
development
set
:
"
star
wars
weapons
guns
.
"
The
phrase
"
star
wars
"
can
easily
be
interpreted
as
a
phrase
;
there
is
a
high
co-occurrence
count
(
pair-count
)
,
and
many
pages
where
they
occur
as
a
single
phrase
(
collapsed
)
,
e.g.
"
starwars.com
.
"
"
Weapons
"
and
"
guns
,
"
on
the
other
hand
,
should
not
be
joined
together
.
Although
they
may
have
a
high
co-occurrence
count
,
the
coordination
feature
(
and-count
)
is
high
(
"
weapons
and
guns
"
)
showing
these
to
be
related
concepts
but
not
phrasal
constituents
.
Including
this
novel
feature
resulted
in
noticeable
gains
on
the
development
set
.
Since
this
is
a
query-based
segmentation
,
features
that
consider
whether
sets
of
tokens
occurred
elsewhere
in
the
query
database
may
provide
domain-specific
discrimination
.
For
each
of
the
Qcount
features
,
we
look
for
two
quantities
:
the
number
of
times
the
phrase
occurs
as
a
query
on
its
own
and
the
number
of
times
the
phrase
occurs
within
another
query.4
Including
both
of
these
counts
also
resulted
in
performance
gains
on
the
development
set
.
We
also
extensively
investigated
other
corpus-based
features
,
such
as
the
number
of
times
the
phrase
occurred
hyphenated
or
capitalized
,
and
the corpus-based distributional similarity (Lin, 1998) between a pair of tokens.

4We exclude counts from the training, development, and testing queries discussed in Section 4.1.
These
features
are
not
available
from
search-engine
statistics
because
search
engines
disregard
punctuation
and
capitalization
,
and
collecting
page-count-based
distributional
similarity
statistics
is
computationally
infeasible
.
Unfortunately
,
none
of
the
corpus-based
features
improved
performance
on
the
development
set
and
are
thus
excluded
from
further
consideration
.
This
is
perhaps
not
surprising
.
For
such
a
task
that
involves
real
user
queries
,
with
arbitrary
spellings
and
sometimes
exotic
vocabulary
,
gathering
counts
from
web
search
engines
is
the
only
way
to
procure
reliable
and
broad-coverage
statistics
.
Although
the
tokens
at
the
decision
boundary
are
of
paramount
importance
,
information
from
the
neighbouring
tokens
is
also
critical
for
segmentation
decision
discrimination
.
We
thus
include
features
that
take
into
consideration
the
preceding
and
following
tokens
,
xL1
and
xR1
,
as
context
information
.
We
gather
all
the
token
indicator
features
for
each
of
these
tokens
,
as
well
as
all
pairwise
features
between
xL1
and
xL0
,
and
then
xR0
and
xR1
.
If
context
tokens
are
not
available
at
this
position
in
the
query
,
a
feature
fires
to
indicate
this
.
Also
,
if
the
context
features
are
available
,
we
include
trigram
web
and
query-database
counts
of
"
xL1
xL0
xR0
"
and
"
xL0
xR0
xR1
"
,
and
a
fourgram
spanning
both
contexts
.
Furthermore
,
if
tokens
xL2
and
xR2
are
available
,
we
collect
relevant
token-level
,
pairwise
,
trigram
,
and
fourgram
counts
including
these
tokens
as
well
.
In
Section
5
,
we
show
that
context
features
are
very
important
.
They
allow
our
system
to
implicitly
leverage
surrounding
segmentation
decisions
,
which
cannot
be
accessed
directly
in
an
independent
segmentation-decision
classifier
.
For
example
,
consider
the
query
"
bank
loan
amoritization
schedule
.
"
Although
"
loan
amoritization
"
has
a
strong
connection
,
we
may
nevertheless
insert
a
break
between
them
because
"
bank
loan
"
and
"
amoritization
schedule
"
each
have
even
stronger
association
.
Motivated
by
work
in
noun
phrase
parsing
,
it
might
be
beneficial
to
check
if
,
for
example
,
token
xL0
is
more
likely
to
modify
a
later
token
,
such
as
xR1
.
For
example
,
in
"
female
bus
driver
"
,
we
might
not
wish
to
segment
"
female
bus
"
because
"
female
"
has
a
much
stronger
association
with
"
driver
"
than
with
"
bus
"
.
Thus
,
as
features
,
we
include
the
pair-wise
counts
between
xL0
and
xR1
,
and
then
xL1
and
xR0
.
Features
from
longer
range
dependencies
did
not
improve
performance
on
the
development
set
.
Our
dataset
was
taken
from
the
AOL
search
query
database
(
Pass
et
al.
,
2006
)
,
a
collection
of
35
million
queries
submitted
to
the
AOL
search
engine
.
Most
punctuation
has
been
removed
from
the
queries.5
Along
with
the
query
,
each
entry
in
the
database
contains
an
anonymous
user
ID
and
the
domain
of
the
URL
the
user
clicked
on
,
if
they
selected
one
of
the
returned
pages
.
For
our
data
,
we
used
only
those
queries
with
a
click-URL
.
This
subset
has
a
higher
proportion
of
correctly-spelled
queries
,
and
facilitates
annotation
(
described
below
)
.
We
then
tagged
the
search
queries
using
a
maximum
entropy
part-of-speech
tagger
(
Ratnaparkhi
,
1996
)
.
As
our
approach
was
designed
particularly
for
noun
phrase
queries
,
we
selected
for
our
final
experiments
those
AOL
queries
containing
only
determiners
,
adjectives
,
and
nouns
.
We
also
only
considered
phrases
of
length
four
or
greater
,
since
queries
of
these
lengths
are
most
likely
to
benefit
from
a
segmentation
,
but
our
approach
works
for
queries
of
any
length
.
Future
experiments
will
investigate
applying
the
current
approach
to
phrasal
verbs
,
prepositional
idioms
and
segments
with
other
parts
of
speech
.
We
randomly
selected
500
queries
for
training
,
500
for
development
,
and
500
for
final
testing
.
These
were
all
manually
segmented
by
our
annotators
.
Manual
segmentation
was
done
with
improving
search
precision
in
mind
.
Annotators
were
asked
to
analyze
each
query
and
form
an
idea
of
what
the
user
was
searching
for
,
taking
into
consideration
the
click-URL
or
performing
their
own
online
searches
,
if
needed
.
The
annotators
were
then
asked
to
segment
the
query
to
improve
search
retrieval
,
by
forcing
a
search
engine
to
find
pages
with the segments occurring as unbroken units.

5Including, unfortunately, all quotation marks, precluding our use of users' own segmentations as additional labelled examples or feature data for our system.
One
annotator
segmented
all
three
data
sets
,
and
these
were
used
for
all
the
experiments
.
Two
additional
annotators
also
segmented
the
final
test
set
to
allow
inter-annotator
agreement
calculation
.
The
pairwise
agreement
on
segmentation
decisions
(
between
each
pair
of
tokens
)
was
between
84.0
%
and
84.6
%
.
The
agreement
on
entire
queries
was
between
57.6
%
and
60.8
%
.
All
three
agreed
completely
on
219
of
the
500
queries
,
and
we
use
this
"
intersected
"
set
for
a
separate
evaluation
in
our
experiments.6 If
we
take
the
proportion
of
segmentation
decisions
the
annotators
would
be
expected
to
agree
on
by
chance
to
be
50
%
,
the
Kappa
statistic
(
Jurafsky
and
Martin
,
2000
,
page
315
)
is
around
.
69
,
below
the
.
8
considered
to
be
good
reliability
.
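With chance agreement fixed at 50%, the Kappa computation reduces to a one-line formula. A quick check (our code) against the agreement figures above:

```python
def kappa(p_observed, p_chance=0.5):
    """Kappa statistic: observed agreement above chance, normalized
    by the maximum possible agreement above chance."""
    return (p_observed - p_chance) / (1 - p_chance)

# pairwise decision agreement ranged from 84.0% to 84.6%:
print(round(kappa(0.840), 2), round(kappa(0.846), 2))  # 0.68 0.69
```

This reproduces the reported value of around .69, below the .8 reliability threshold.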
This
observed
agreement
was
lower
than
we
anticipated
,
and
reflects
both
differences
in
query
interpretation
and
in
the
perceived
value
of
different
segmentations
for
retrieval
performance
.
Annotators
agreed
that
terms
like
"
real
estate
,
"
"
work
force
,
"
"
west
palm
beach
,
"
and
"
private
investigator
"
should
be
separate
segments
.
These
are
collocations
in
the
linguistics
sense
(
Manning
and
Schutze
,
1999
,
pages
183-187
)
;
we
cannot
substitute
related
words
for
terms
in
these
expressions
nor
apply
syntactic
transformations
or
paraphrases
(
e.g.
we
don
't
say
"
investigator
of
privates
"
)
.
However
,
for
a
query
such
as
"
bank
manager
,
"
should
we
exclude
web
pages
that
discuss
"
manager
of
the
bank
"
or
"
branch
manager
for
XYZ
bank
"
?
If
a
user
is
searching
for
a
particular
webpage
,
excluding
such
results
could
be
harmful
.
However
,
for
query
substitution
or
expansion
,
identifying
that
"
bank
manager
"
is
a
single
unit
may
be
useful
.
We
can
resolve
the
conflicting
objectives
of
our
two
motivating
applications
by
moving
to
a
multi-layer
query
bracketing
scheme
,
first
segmenting
unbreakable
collocations
and
then
building
them
into
semantic
units
with
a
query
segmentation
grammar
.
This
will
be
the
subject
of
future
research
.
All
of
our
statistical
feature
information
was
collected
using
the
Google
SOAP
Search
API.7
For
training
and
classifying
our
data
,
we use the popular Support Vector Machine (SVM) learning package SVMlight (Joachims, 1999).

6All queries and statistical feature information are available at http://www.cs.ualberta.ca/~bergsma/QuerySegmentation/
7http://code.google.com/apis/soapsearch/
SVMs
are
maximum-margin
classifiers
that
achieve
good
performance
on
a
range
of
tasks
.
In
each
case
,
we
learn
a
linear
kernel
on
the
training
set
segmentation
decisions
and
tune
the
parameter
that
trades-off
training
error
and
margin
on
the
development
set
.
We
use
the
following
two
evaluation
criteria
:
Seg-Acc
:
Segmentation
decision
accuracy
:
the
proportion
of
times
our
classifier
's
decision
to
insert
a
segment
break
or
not
between
a
pair
of
tokens
agrees
with
the
gold
standard
decision
.
Qry-Acc
:
Query
segmentation
accuracy
:
the
proportion
of
queries
for
which
the
complete
segmentation
derived
from
our
classifications
agrees
with
the
gold
standard
segmentation
.
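The two criteria can be computed directly from per-boundary decision sequences. A minimal sketch (ours; `True` marks a segmentation break):

```python
def seg_acc(gold, pred):
    """Seg-Acc: proportion of individual break/no-break decisions
    that agree with gold. gold/pred are lists of per-query
    boolean decision sequences."""
    pairs = [(g, p) for gq, pq in zip(gold, pred)
             for g, p in zip(gq, pq)]
    return sum(g == p for g, p in pairs) / len(pairs)

def qry_acc(gold, pred):
    """Qry-Acc: proportion of queries whose complete segmentation
    matches gold exactly."""
    return sum(gq == pq for gq, pq in zip(gold, pred)) / len(gold)

gold = [[True, False, True], [False, True]]
pred = [[True, False, False], [False, True]]
print(seg_acc(gold, pred))  # 0.8 (4 of 5 decisions correct)
print(qry_acc(gold, pred))  # 0.5 (1 of 2 queries fully correct)
```

Qry-Acc is the stricter measure: one wrong decision anywhere in a query costs the whole query.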
5
Results
the
SVM
set
the
threshold
for
MI
on
the
training
set
.
Note
that
the
Basic
,
Decision-Boundary
system
(
Section
3.2.1
)
,
which
uses
exactly
the
same
cooccurrence
information
as
the
MI
system
(
in
the
form
of
the
Basic
features
)
but
allows
the
SVM
to
discriminatively
weight
the
logarithmic
counts
,
immediately
increases
Seg-Acc
performance
by
3.7
%
.
Even
more
strikingly
,
adding
the
Basic
count
information
for
the
Context
tokens
(
Section
3.2.2
)
boosts
performance
by
another
8.5
%
,
increasing
Qry-Acc
by
over
22
%
.
Smaller
,
further
gains
arise
by
adding
Dependency
token
information
(
Section
3.2.3
)
.
Also
,
notice
that
moving
from
Basic
features
for
the
Decision-Boundary
tokens
to
all
of
our
indicator
(
Table
1
)
and
statistical
(
Table
2
)
features
(
referred
to
as
All
features
)
increases
performance
from
71.7
%
to
84.3
%
.
These gains convincingly justify our use of an expanded feature set for this task.

8Statistically significant intra-row differences in Qry-Acc are marked with an asterisk (McNemar's test, p < 0.05).

Table 3: Segmentation Performance (%). Feature spans evaluated: Decision-Boundary; Decision-Boundary, Context; Decision-Boundary, Context, Dependency; each feature type is reported on the Test Set and the Intersection Set.
Including
Context
with
the
expanded
features
adds
another
2
%
,
while
adding
Dependency
information
actually
seems
to
hinder
performance
slightly
,
although
gains
were
seen
when
adding
Dependency
information
on
the
development
set
.
Note
,
however
,
that
these
results
must
also
be
considered
in
light
of
the
low
inter-annotator
agreement
(
Section
4.1
)
.
Indeed, results are lower if we evaluate using the test-set labels from another annotator (necessarily training on the original annotator's labels).
On
the
intersected
set
of
the
three
annotators
,
however
,
results
are
better
still
:
88.7
%
Seg-Acc
and
69.4
%
Qry-Acc
on
the
intersected
queries
for
the
full-featured
system
(
Table
3
)
.
Since
high
performance
is
dependent
on
consistent
training
and
test
labellings
,
it
seems
likely
that
developing
more-explicit
annotation
instructions
may
allow
further
improvements
in
performance
as
within-set
and
between-set
annotation
agreement
increases
.
It
would
also
be
theoretically
interesting
,
and
of
significant
practical
importance
,
to
develop
a
learning
approach
that
embraces
the
agreement
of
the
annotations
as
part
of
the
learning
algorithm
.
Our
initial
ranking
formulation
(
Section
3.1
)
,
for
example
,
could
learn
a
model
that
prefers
segmentations
with
higher
agreement
,
but
still
prefers
any
annotated
segmentation
to
alternative
,
unobserved
structures
.
As
there
is
growing
interest
in
making
maximal
use
of
annotation
resources
within
discriminative
learning
techniques
(
Zaidan
et
al.
,
2007
)
,
developing
a
general
empirical
approach
to
learning
from
ambiguously-labelled
examples
would
be
both
an
important
contribution
to
this
trend
and
a
potentially
helpful
technique
in
a
number
of
NLP
domains
.
6
Conclusion
We
have
developed
a
novel
approach
to
search
query
segmentation
and
evaluated
this
approach
on
actual
user
queries
,
reducing
error
by
56
%
over
a
recent
comparison
approach
.
Gains
in
performance
were
made
possible
by
both
leveraging
recent
progress
in
feature
engineering
for
noun
compound
bracketing
,
as
well
as
using
a
flexible
,
discriminative
incorporation
of
association
information
,
beyond
the
decision-boundary
tokens
.
We
have
created
and
made
available
a
set
of
manually-segmented
user
queries
,
and
thus
provided
a
new
testing
platform
for
other
researchers
in
this
area
.
Our
initial
formulation
of
query
segmentation
as
a
structured
learning
problem
,
and
our
leveraging
of
association
statistics
beyond
the
decision
boundary
,
also
provides
powerful
tools
for
noun
compound
bracketing
researchers
to
both
move
beyond
three-word
compounds
and
to
adopt
discriminative
feature
weighting
techniques
.
The
positive
results
achieved
on
this
important
application
should
encourage
further
inter-disciplinary
collaboration
between
noun
compound
interpretation
and
information
retrieval
researchers
.
For
example
,
analysing
the
semantics
of
multiword
expressions
may
allow
for
more-focused
query
expansion
;
knowing
to
expand
"
bank
manager
"
to
include
pages
describing
a
"
manager
of
the
bank
,
"
but
not
doing
the
same
for
non-compositional
phrases
like
"
real
estate
"
or
"
private
investigator
,
"
requires
exactly
the
kind
of
techniques
being
developed
in
the
noun
compound
interpretation
community
.
Thus
for
query
expansion
,
as
for
query
segmentation
,
work
in
natural
language
processing
has
the
potential
to
make
a
real
and
immediate
impact
on
search-engine
technology
.
The
next
step
in
this
research
is
to
directly
investigate
how
query
segmentation
affects
search
performance
.
For
such
an
evaluation
,
we
would
need
to
know
,
for
each
possible
segmentation
(
including
no
segmentation
)
,
the
document
retrieval
performance
.
This
could
be
the
proportion
of
returned
documents
that
are
deemed
to
be
relevant
to
the
original
query
.
Exactly
such
an
evaluation
was
recently
used
by
Kumaran
and
Allan
(
2007
)
for
the
related
task
of
query
contraction
.
Of
course
,
a
dataset
with
queries
and
retrieval
scores
may
serve
for
more
than
evaluation
;
it
may
provide
the
examples
used
by
the
learning
module
.
That
is
,
the
parameters
of
the
contraction
or
segmentation
scoring
function
could
be
discriminatively
set
to
optimize
the
retrieval
of
the
training
set
queries
.
A
unified
framework
for
query
contraction
,
segmentation
,
and
expansion
,
all
based
on
discriminatively
optimizing
retrieval
performance
,
is
a
very
appealing
future
research
direction
.
In
this
framework
,
the
size
of
the
training
sets
would
not
be
limited
by
human
annotation
resources
,
but
by
the
number
of
queries
for
which
retrieved-document
relevance
judgments
are
available
.
Generating
more
training
examples
would
allow
the
use
of
more
powerful
,
finer-grained
lexical
features
for
classification
.
Acknowledgments
We
gratefully
acknowledge
support
from
the
Natural
Sciences
and
Engineering
Research
Council
of
Canada
,
the
Alberta
Ingenuity
Fund
,
the
Alberta
Ingenuity
Center
for
Machine
Learning
,
and
the
Alberta
Informatics
Circle
of
Research
Excellence
.
