This
paper
is
concerned
with
the
problem
of
question
search
.
In
question
search
,
given
a
question
as
query
,
we
are
to
return
questions
semantically
equivalent
or
close
to
the
queried
question
.
More
specifically
,
we
first
summarize
questions
in
a
data
structure
consisting
of
question
topic
and
question
focus
.
Then
we
model
question
topic
and
question
focus
in
a
language
modeling
framework
for
search
.
We
also
propose
to
use
the
MDL-based
tree
cut
model
for
identifying
question
topic
and
question
focus
automatically
.
Experimental
results
indicate
that
our
approach
of
identifying
question
topic
and
question
focus
for
search
significantly
outperforms
the
baseline
methods
such
as
Vector
Space
Model
(
VSM
)
and
Language
Model
for
Information
Retrieval
(
LMIR
)
.
1
Introduction
Over
the
past
few
years
,
online
services
have
been
building
up
very
large
archives
of
questions
and
their
answers
,
for
example
,
traditional
FAQ
services
and
emerging
community-based
Q
&amp;
A
services
(
e.g.
,
Yahoo
!
Answers1
,
Live
QnA2
,
and
Baidu
Zhidao3
)
.
To
make
use
of
the
large
archives
of
questions
and
their
answers
,
it
is
critical
to
have
functionality
facilitating
users
to
search
previous
answers
.
Typically
,
such
functionality
is
achieved
by
first
retrieving
questions
expected
to
have
the
same
answers
as
a
queried
question
and
then
returning
the
related
answers
to
users
.
For
example
,
given
question
Q1
in
Table
1
,
question
Q2
can
be
re
-
turned
and
its
answer
will
then
be
used
to
answer
Ql
because
the
answer
of
Q2
is
expected
to
partially
satisfy
the
queried
question
Ql.
This
is
what
we
called
question
search
.
In
question
search
,
returned
questions
are
semantically
equivalent
or
close
to
the
queried
question
.
Expected
:
Not
Expected
:
Q3
:
Any
nice
hotels
in
Berlin
or
Hamburg
?
Q4
:
How
long
does
it
take
to
Hamburg
from
Berlin
?
Table
1
.
An
Example
on
Question
Search
Many
methods
have
been
investigated
for
tackling
the
problem
of
question
search
.
For
example
,
Jeon
et
al.
have
compared
the
uses
of
four
different
retrieval
methods
,
i.e.
vector
space
model
,
Okapi
,
language
model
,
and
translation-based
model
,
within
the
setting
of
question
search
(
Jeon
et
al.
,
2005b
)
.
For
example
,
obviously
,
Q2
can
be
considered
semantically
closer
to
Q1
than
Q3-Q5
although
all
questions
(
Q2-Q5
)
are
related
to
Q1
.
The
existing
methods
are
not
able
to
tell
the
difference
between
question
Q2
and
questions
Q3
,
Q4
,
and
Q5
in
terms
of
their
relevance
to
question
Q1
.
We
will
clarify
this
in
the
following
.
In
this
paper
,
we
propose
to
conduct
question
search
by
identifying
question
topic
and
question
focus
.
The
question
topic
usually
represents
the
major
context
/
constraint
of
a
question
(
e.g.
,
Berlin
,
Hamburg
)
which
characterizes
users
'
interests
.
In
contrast
,
question
focus
(
e.g.
,
cool
club
,
cheap
hotel
)
presents
certain
aspect
(
or
descriptive
features
)
of
the
question
topic
.
For
the
aim
of
retrieving
seman-tically
equivalent
(
or
close
)
questions
,
we
need
to
assure
that
returned
questions
are
related
to
the
queried
question
with
respect
to
both
question
topic
and
question
focus
.
For
example
,
in
Table
1
,
Q2
preserves
certain
useful
information
of
Q1
in
the
aspects
of
both
question
topic
(
Berlin
)
and
question
focus
(
fun
club
)
although
it
loses
some
useful
information
in
question
topic
(
Hamburg
)
.
In
contrast
,
questions
Q3-Q5
are
not
related
to
Q1
in
question
focus
(
although
being
related
in
question
topic
,
e.g.
Hamburg
,
Berlin
)
,
which
makes
them
unsuitable
as
the
results
of
question
search
.
We
also
propose
to
use
the
MDL-based
(
Minimum
Description
Length
)
tree
cut
model
for
automatically
identifying
question
topic
and
question
focus
.
Given
a
question
as
query
,
a
structure
called
question
tree
is
constructed
over
the
question
collection
including
the
queried
question
and
all
the
related
questions
,
and
then
the
MDL
principle
is
applied
to
find
a
cut
of
the
question
tree
specifying
the
question
topic
and
the
question
focus
of
each
question
.
In
a
summary
,
we
summarize
questions
in
a
data
structure
consisting
of
question
topic
and
question
focus
.
On
the
basis
of
this
,
we
then
propose
to
model
question
topic
and
question
focus
in
a
language
modeling
framework
for
search
.
We
empirically
conduct
the
question
search
with
questions
about
'
travel
'
and
'
computers
&amp;
internet
'
.
Both
kinds
of
questions
are
from
Yahoo
!
Experimental
results
show
that
our
approach
can
significantly
improve
traditional
methods
(
e.g.
VSM
,
LMIR
)
in
retrieving
relevant
questions
.
The
rest
of
the
paper
is
organized
as
follow
.
In
Section
2
,
we
present
our
approach
to
question
search
which
is
based
on
identifying
question
topic
and
question
focus
.
In
Section
3
,
we
empirically
verify
the
effectiveness
of
our
approach
to
question
search
.
In
Section
4
,
we
employ
a
translation-based
retrieval
framework
for
extending
our
approach
to
fix
the
issue
called
'
lexical
chasm
'
.
Section
5
surveys
the
related
work
.
Section
6
concludes
the
paper
by
summarizing
our
work
and
discussing
the
future
directions
.
2
Our
Approach
to
Question
Search
Our
approach
to
question
search
consists
of
two
steps
:
(
a
)
summarize
questions
in
a
data
structure
consisting
of
question
topic
and
question
focus
;
(
b
)
model
question
topic
and
question
focus
in
a
language
modeling
framework
for
search
.
In
the
step
(
a
)
,
we
employ
the
MDL-based
(
Minimum
Description
Length
)
tree
cut
model
for
automatically
identifying
question
topic
and
question
focus
.
Thus
,
this
section
will
begin
with
a
brief
review
of
the
MDL-based
tree
cut
model
and
then
follow
that
by
an
explanation
of
steps
(
a
)
and
(
b
)
.
[
ftl2
,
n21
,
n22
,
n23J
.
A
straightforward
way
for
determining
a
cut
of
a
tree
is
to
collapse
the
nodes
of
less
frequency
into
their
parent
nodes
.
However
,
the
method
is
too
heuristic
for
it
relies
much
on
manually
tuned
frequency
threshold
.
In
our
practice
,
we
turn
to
use
a
theoretically
well-motivated
method
based
on
the
MDL
principle
.
MDL
is
a
principle
of
data
compression
and
statistical
estimation
from
information
theory
(
Rissanen
,
1978
)
.
Given
a
sample
S
and
a
tree
cut
T
,
we
employ
MLE
to
estimate
the
parameters
of
the
corresponding
tree
cut
model
M
=
(
r
,
0
)
,
where
9
denotes
the
estimated
parameters
.
description
length
L
(
T
)
,
the
parameter
description
length
L
(
9
\
r
)
,
and
the
data
description
length
L
(
S
\
T,9
)
,
i.e.
The
model
description
length
L
(
T
)
is
a
subjective
quantity
which
depends
on
the
coding
scheme
employed
.
Here
,
we
simply
assume
that
each
tree
cut
model
is
equally
likely
a
priori
.
The
parameter
description
length
L
(
9
\
r
)
is
calculated
as
where
\
S
\
denotes
the
sample
size
and
k
denotes
the
number
of
free
parameters
in
the
tree
cut
model
,
i.e.
k
equals
the
number
of
nodes
in
r
minus
one
.
The
data
description
length
L
(
S
\
T,9
)
is
calculated
as
where
C
is
the
class
that
n
belongs
to
and
/
(
C
)
denotes
the
total
frequency
of
instances
in
class
C
in
the
sample
S.
With
the
description
length
defined
as
(
3
)
,
we
wish
to
select
a
tree
cut
model
with
the
minimum
description
length
and
output
it
as
the
result
.
Note
that
the
model
description
length
L
(
r
)
can
be
ignored
because
it
is
the
same
for
all
tree
cut
models
.
The
MDL-based
tree
cut
model
was
originally
introduced
for
handling
the
problem
of
generalizing
case
frames
using
a
thesaurus
(
Li
and
Abe
,
1998
)
.
To
the
best
of
our
knowledge
,
no
existing
work
utilizes
it
for
question
search
.
This
may
be
partially
because
of
the
unavailability
of
the
resources
(
e.g.
,
thesaurus
)
which
can
be
used
for
embodying
the
questions
in
a
tree
structure
.
In
Section
2.2
,
we
will
introduce
a
tree
structure
called
question
tree
for
representing
questions
.
2.2
Identifying
question
topic
and
question
focus
In
principle
,
it
is
possible
to
identify
question
topic
and
question
focus
of
a
question
by
only
parsing
the
question
itself
(
for
example
,
utilizing
a
syntactic
parser
)
.
However
,
such
a
method
requires
accurate
parsing
results
which
cannot
be
obtained
from
the
noisy
data
from
online
services
.
Instead
,
we
propose
using
the
MDL-based
tree
cut
model
which
identifies
question
topics
and
question
foci
for
a
set
of
questions
together
.
More
specifically
,
the
method
consists
of
two
phases
:
1
)
Constructing
a
question
tree
:
represent
the
queried
question
and
all
the
related
questions
in
a
tree
structure
called
question
tree
;
2
)
Determining
a
tree
cut
:
apply
the
MDL
principle
to
the
question
tree
,
which
yields
the
cut
specifying
question
topic
and
question
focus
.
In
the
following
,
with
a
series
of
definitions
,
we
will
describe
how
a
question
tree
is
constructed
from
a
collection
of
questions
.
Let
's
begin
with
explaining
the
representation
of
a
question
.
A
straightforward
method
is
to
represent
a
question
as
a
bag-of-words
(
possibly
ignoring
stop
words
)
.
However
,
this
method
cannot
discern
'
the
hotels
in
Paris
'
from
'
the
Paris
hotel
'
.
Thus
,
we
turn
to
use
the
linguistic
units
carrying
on
more
semantic
information
.
Specifically
,
we
make
use
of
two
kinds
of
units
:
BaseNP
(
Base
Noun
Phrase
)
and
WH-ngram
.
A
BaseNP
is
defined
as
a
simple
and
non-recursive
noun
phrase
(
Cao
and
Li
,
2002
)
.
A
WH-ngram
is
an
ngram
beginning
with
WH-words
.
The
WH-words
that
we
consider
include
'
when
'
,
'
what
'
,
'
where
'
,
'
which
'
,
and
'
how
'
.
We
refer
to
these
two
kinds
of
units
as
'
topic
terms
'
.
With
'
topic
terms
'
,
we
represent
a
question
as
a
topic
chain
and
a
set
of
questions
as
a
question
tree
.
Definition
1
(
Topic
Profile
)
The
topic
profile
9t
of
a
topic
term
t
in
a
categorized
question
collection
is
a
probability
distribution
of
categories
{
p
(
c
\
t
)
}
c6C
where
C
is
a
set
of
categories
.
where
count
(
c
,
t
)
is
the
frequency
of
the
topic
term
t
within
category
c
.
Clearly
,
we
have
T
,
cecP
(
c
\
t
)
=
1
.
By
'
categorized
questions
'
,
we
refer
to
the
questions
that
are
organized
in
a
tree
of
taxonomy
.
For
example
,
at
Yahoo
!
Answers
,
the
question
"
How
do
I
install
my
wireless
router
"
is
categorized
as
"
Computers
&amp;
Internet
Computer
Networking
"
.
Actually
,
we
can
find
categorized
questions
at
other
online
services
such
as
FAQ
sites
,
too
.
Definition
2
(
Specificity
)
The
specificity
s
(
t
)
of
a
topic
term
t
is
the
inverse
of
the
entropy
of
the
topic
profile
9t
.
More
specifically
,
where
s
isa
smoothing
parameter
used
to
cope
with
the
topic
terms
whose
entropy
is
0
.
In
our
experiments
,
the
value
of
s
was
set
0.001
.
We
use
the
term
specificity
to
denote
how
specific
a
topic
term
is
in
characterizing
information
needs
of
users
who
post
questions
.
A
topic
term
of
high
specificity
(
e.g.
,
Hamburg
,
Berlin
)
usually
specifies
the
question
topic
corresponding
to
the
main
context
of
a
question
because
it
tends
to
occur
only
in
a
few
categories
.
A
topic
term
of
low
specificity
is
usually
used
to
represent
the
question
focus
(
e.g.
,
cool
club
,
where
to
see
)
which
is
relatively
volatile
and
might
occur
in
many
categories
.
2
)
s
(
tk
)
&gt;
s
(
tl
)
,
1
&lt;
k
&lt;
l
&lt;
m.
For
example
,
the
topic
chain
of
"
any
cool
clubs
in
Berlin
or
Hamburg
?
"
is
"
Hamburg
-
&gt;
Berlin
-
&gt;
cool
club
"
because
the
specificities
for
'
Hamburg
'
,
'
Berlin
'
,
and
'
cool
club
'
are
0.99
,
0.62
,
and
0.36
.
Definition
4
(
Question
Tree
)
A
question
tree
of
a
question
set
Q
=
{
q
£
}
^Li
is
a
prefix
tree
built
over
the
topic
chains
Qc
=
{
qf
}
^
=
i
of
the
question
set
Q.
Clearly
,
if
a
question
set
contains
only
one
question
,
its
question
tree
will
be
exactly
same
as
the
topic
chain
of
the
question
.
Note
that
the
root
node
of
a
question
tree
is
associated
with
empty
string
as
the
definition
of
prefix
tree
requires
(
Fredkin
,
1960
)
.
Figure
2
.
An
Example
of
a
Question
Tree
Given
the
topic
chains
with
respect
to
the
questions
in
Table
1
as
follow
,
we
can
have
the
question
tree
presented
in
Figure
2
.
According
to
the
definition
of
a
topic
chain
,
the
topic
terms
in
a
topic
chain
of
a
question
are
ordered
by
their
specificity
values
.
Thus
,
a
cut
of
a
topic
chain
naturally
separates
the
topic
terms
of
low
specificity
(
representing
question
focus
)
from
the
topic
terms
of
high
specificity
(
representing
question
topic
)
.
Given
a
topic
chain
of
a
question
consisting
of
m
topic
terms
,
there
exist
(
m
—
1
)
possible
cuts
.
The
question
is
:
which
cut
is
the
best
?
We
propose
using
the
MDL-based
tree
cut
model
for
the
search
of
the
best
cut
in
a
topic
chain
.
Instead
of
dealing
with
each
topic
chain
individually
,
the
proposed
method
handles
a
set
of
questions
together
.
Specifically
,
given
a
queried
question
,
we
construct
a
question
tree
consisting
of
both
the
queried
question
and
the
related
questions
,
and
then
apply
the
MDL
principle
to
select
the
best
cut
of
the
question
tree
.
For
example
,
in
Figure
2
,
we
hope
to
get
the
cut
indicated
by
the
dashed
line
.
The
topic
terms
on
the
left
of
the
dashed
line
represent
the
question
topic
and
those
on
the
right
of
the
dashed
line
represent
the
question
focus
.
Note
that
the
tree
cut
yields
a
cut
for
each
individual
topic
chain
(
each
path
)
within
the
question
tree
accordingly
.
A
cut
of
a
topic
chain
qc
of
a
question
q
separates
the
topic
chain
in
two
parts
:
HEAD
and
TAIL
.
HEAD
(
denoted
as
H
(
qc
)
)
is
the
subsequence
of
the
original
topic
chain
qc
before
the
cut
.
TAIL
(
denoted
as
T
(
qc
)
)
is
the
subsequence
of
qc
after
the
cut
.
Thus
,
qc
=
H
(
qc
)
-
&gt;
T
(
qc
)
.
For
instance
,
given
the
tree
cut
specified
in
Figure
2
,
for
the
topic
chain
of
Q1
"
Hamburg
-
&gt;
Berlin
-
&gt;
cool
club
"
,
the
HEAD
and
TAIL
are
"
Hamburg
-
&gt;
Berlin
"
and
"
cool
club
"
respectively
.
2.3
Modeling
question
topic
and
question
focus
for
search
We
employ
the
framework
of
language
modeling
(
for
information
retrieval
)
to
develop
our
approach
to
question
search
.
In
the
language
modeling
approach
to
information
retrieval
,
the
relevance
of
a
targeted
question
q
to
a
queried
question
q
is
given
by
the
probability
p
(
qjq
)
of
generating
the
queried
question
q
Ql
:
Any
cool
clubs
in
Berlin
or
Hamburg
?
cool
club
from
the
language
model
formed
by
the
targeted
question
q.
The
targeted
question
q
is
from
a
collection
C
of
questions
.
Following
the
framework
,
we
propose
a
mixture
model
for
modeling
question
structure
(
namely
,
question
topic
and
question
focus
)
within
the
process
of
searching
questions
:
In
the
mixture
model
,
it
is
assumed
that
the
process
of
generating
question
topics
and
the
process
of
generating
question
foci
are
independent
from
each
other
.
In
traditional
language
modeling
,
a
single
multinomial
model
p
(
tjq
)
over
terms
is
estimated
for
each
targeted
question
q.
In
our
case
,
two
multinomial
models
p
(
t
|
H
(
q
)
)
and
p
(
t
|
r
(
q
)
)
need
to
be
estimated
for
each
targeted
question
q.
If
unigram
document
language
models
are
used
,
the
equation
(
9
)
can
then
be
re-written
as
,
(
1-2
)
•
nt6r
(
q
)
P
(
t
|
^
(
q
)
)
where
count
(
q
,
t
)
is
the
frequency
of
t
within
q.
To
avoid
zero
probabilities
and
estimate
more
accurate
language
models
,
the
HEAD
and
TAIL
of
questions
are
smoothed
using
background
collection
,
+
{
1-p
)
^p
{
tjC
)
(
12
)
where
p
(
tjH
(
q
)
)
,
p
(
tjT
(
q
)
)
,
and
p
(
tjC
)
are
the
MLE
estimators
with
respect
to
the
HEAD
of
q
,
the
TAIL
of
q
,
and
the
collection
C.
3
Experimental
Results
We
have
conducted
experiments
to
verify
the
effectiveness
of
our
approach
to
question
search
.
Particularly
,
we
have
investigated
the
use
of
identifying
question
topic
and
question
focus
for
search
.
3.1
Dataset
and
evaluation
measures
We
made
use
of
the
questions
obtained
from
Yahoo
!
Answers
for
the
evaluation
.
More
specifically
,
we
utilized
the
resolved
questions
under
two
of
the
top-level
categories
at
Yahoo
!
Answers
,
namely
'
travel
'
and
'
computers
&amp;
internet
'
.
The
questions
include
314,616
items
from
the
'
travel
'
category
and
210,785
items
from
the
'
computers
&amp;
internet
'
category
.
Each
resolved
question
consists
of
three
fields
:
'
title
'
,
'
description
'
,
and
'
answers
'
.
For
search
we
use
only
the
'
title
'
field
.
It
is
assumed
that
the
titles
of
the
questions
already
provide
enough
semantic
information
for
understanding
users
'
information
needs
.
We
developed
two
test
sets
,
one
for
the
category
'
travel
'
denoted
as
'
TRL-TST
'
,
and
the
other
for
'
computers
&amp;
internet
'
denoted
as
'
CI-TST
'
.
In
order
to
create
the
test
sets
,
we
randomly
selected
200
questions
for
each
category
.
To
obtain
the
ground-truth
of
question
search
,
we
employed
the
Vector
Space
Model
(
VSM
)
(
Sal-ton
et
al.
,
1975
)
to
retrieve
the
top
20
results
and
obtained
manual
judgments
.
The
top
20
results
don
't
include
the
queried
question
itself
.
Given
a
returned
result
by
VSM
,
an
assessor
is
asked
to
label
it
with
'
relevant
or
'
irrelevant
'
.
If
a
returned
result
is
considered
semantically
equivalent
(
or
close
)
to
the
queried
question
,
the
assessor
will
label
it
as
'
relevant
'
;
otherwise
,
the
assessor
will
label
it
as
'
irrelevant
'
.
Two
assessors
were
involved
in
the
manual
judgments
.
Each
of
them
was
asked
to
label
100
questions
from
'
TRL-TST
'
and
100
from
'
CI-TST
'
.
In
the
process
of
manually
judging
questions
,
the
assessors
were
presented
only
the
titles
of
the
questions
(
for
both
the
queried
questions
and
the
returned
questions
)
.
Table
2
provides
the
statistics
on
the
final
test
set
.
#
Queries
#
Returned
#
Relevant
Table
2
.
Statistics
on
the
Test
Data
We
utilized
two
baseline
methods
for
demonstrating
the
effectiveness
of
our
approach
,
the
VSM
and
the
LMIR
(
language
modeling
method
for
information
retrieval
)
(
Ponte
and
Croft
,
1998
)
.
We
made
use
of
three
measures
for
evaluating
the
results
of
question
search
methods
.
They
are
MAP
,
R-precision
,
and
MRR
.
3.2
Searching
questions
about
'
travel
'
In
the
experiments
,
we
made
use
of
the
questions
about
'
travel
'
to
test
the
performance
of
our
approach
to
question
search
.
More
specifically
,
we
used
the
200
queries
in
the
test
set
'
TRL-TST
'
to
search
for
'
relevant
'
questions
from
the
314,616
questions
categorized
as
'
travel
'
.
Note
that
only
the
questions
occurring
in
the
test
set
can
be
evaluated
.
We
made
use
of
the
taxonomy
of
questions
provided
at
Yahoo
!
Answers
for
the
calculation
of
specificity
of
topic
terms
.
The
taxonomy
is
organized
in
a
tree
structure
.
In
the
following
experiments
,
we
only
utilized
as
the
categories
of
questions
the
leaf
nodes
of
the
taxonomy
tree
(
regarding
'
travel
'
)
,
which
includes
355
categories
.
We
randomly
divided
the
test
queries
into
five
even
subsets
and
conducted
5-fold
cross-validation
experiments
.
In
each
trial
,
we
tuned
the
parameters
X
,
a
,
and
/
?
in
the
equation
(
10
)
-
(
12
)
with
four
of
the
five
subsets
and
then
applied
it
to
one
remaining
subset
.
The
experimental
results
reported
below
are
those
averaged
over
the
five
trials
.
Methods
I
MAP
I
R-Precision
I
MRR
influential
the
value
of
A
is
on
the
performance
of
question
search
in
terms
of
MRR
.
The
result
was
obtained
with
the
200
queries
directly
,
instead
of
5-fold
cross-validation
.
From
Figure
3
,
we
see
that
our
approach
performs
best
when
A
is
around
0.7
.
That
is
,
our
approach
tends
to
emphasize
question
topic
more
than
question
focus
.
We
also
examined
the
correctness
of
question
topics
and
question
foci
of
the
200
queried
questions
.
The
question
topics
and
question
foci
were
obtained
with
the
MDL-based
tree
cut
model
automatically
.
In
the
result
,
69
questions
have
incorrect
question
topics
or
question
foci
.
Further
analysis
shows
that
the
errors
came
from
two
categories
:
(
a
)
59
questions
have
only
the
HEAD
parts
(
that
is
,
none
of
the
topic
terms
fall
within
the
TAIL
part
)
,
and
(
b
)
10
have
incorrect
orders
of
topic
terms
because
the
specificities
of
topic
terms
were
estimated
inaccurately
.
For
questions
only
having
the
HEAD
parts
,
our
approach
(
equation
(
9
)
)
reduces
to
traditional
language
modeling
approach
.
Thus
,
even
when
the
errors
of
category
(
a
)
occur
,
our
approach
can
still
work
not
worse
than
the
traditional
language
modeling
approach
.
This
also
explains
why
our
approach
performs
best
when
A
is
around
0.7
.
The
error
category
(
a
)
pushes
our
model
to
emphasize
more
in
question
topic
.
Methods
Results
How
cold
does
it
usually
get
in
Charlotte
,
NC
during
winters
?
Rochester
,
NY
?
How
cold
does
the
Mojave
Desert
get
in
the
winter
?
How
cold
is
Alaska
in
March
and
outdoor
activities
?
Table
4
provides
the
TOP-3
search
results
which
approach
)
respectively
.
The
questions
in
bold
are
labeled
as
'
relevant
'
in
the
evaluation
set
.
The
queried
question
seeks
for
the
'
weather
'
information
about
'
Alaska
'
.
Both
VSM
and
LMIR
rank
certain
Table
3
.
Searching
Questions
about
'
Travel
'
In
Table
3
,
our
approach
denoted
by
LMIR-CUT
is
implemented
exactly
as
equation
(
10
)
.
Neither
VSM
nor
LMIR
uses
the
data
structure
composed
of
question
topic
and
question
focus
.
From
Table
3
,
we
see
that
our
approach
outperforms
the
baseline
approaches
VSM
and
LMIR
in
terms
of
all
the
measures
.
Figure
3
.
In
equation
(
9
)
,
we
use
the
parameter
A
to
balance
the
contribution
of
question
topic
and
the
contribution
of
question
focus
.
Figure
3
illustrates
how
'
irrelevant
'
questions
higher
than
'
relevant
'
questions
.
The
'
irrelevant
'
questions
are
not
about
'
Alaska
'
although
they
are
about
'
weather
'
.
The
reason
is
that
neither
VSM
nor
PVSM
is
aware
that
the
query
consists
of
the
two
aspects
'
weather
'
(
how
cold
,
winter
)
and
'
Alaska
'
.
In
contrast
,
our
approach
assures
that
both
aspects
are
matched
.
Note
that
the
HEAD
part
of
the
topic
chain
of
the
queried
question
given
by
our
approach
is
"
Alaska
"
and
the
TAIL
part
is
"
winter
-
&gt;
how
cold
"
.
3.3
Searching
questions
about
'
computers
&amp;
internet
'
In
the
experiments
,
we
made
use
of
the
questions
about
'
computers
&amp;
internet
'
to
test
the
performance
of
our
proposed
approach
to
question
search
.
More
specifically
,
we
used
the
200
queries
in
the
test
set
'
CI-TST
''
to
search
for
'
relevant
'
questions
from
the
210,785
questions
categorized
as
'
computers
&amp;
internet
'
.
For
the
calculation
of
specificity
of
topic
terms
,
we
utilized
as
the
categories
of
questions
the
leaf
nodes
of
the
taxonomy
tree
regarding
'
computers
&amp;
Internet
'
,
which
include
23
categories
.
We
conducted
5-fold
cross-validation
for
the
parameter
tuning
.
The
experimental
results
reported
in
Table
5
are
averaged
over
the
five
trials
.
errors
is
also
similar
to
that
in
Section
3.2
,
which
also
justifies
the
trend
presented
in
Figure
4
.
Table
5
.
Searching
Questions
about
'
Computers
&amp;
Internet
'
Again
,
we
see
that
our
approach
outperforms
the
baseline
approaches
VSM
and
LMIR
in
terms
of
all
the
measures
.
We
conducted
a
significance
test
(
t-test
)
on
the
improvements
of
our
approach
over
VSM
and
LMIR
.
The
result
indicates
that
the
improvements
are
statistically
significant
(
p-value
&lt;
0.05
)
in
terms
of
all
the
evaluation
measures
.
We
also
conducted
the
experiment
similar
to
that
in
Figure
3
.
Figure
4
provides
the
result
.
The
trend
is
consistent
with
that
in
Figure
3
.
We
examined
the
correctness
of
(
automatically
identified
)
question
topics
and
question
foci
of
the
200
queried
questions
,
too
.
In
the
result
,
65
questions
have
incorrect
question
topics
or
question
foci
.
Among
them
,
47
fall
in
the
error
category
(
a
)
and
18
in
the
error
category
(
b
)
.
The
distribution
of
Figure
4
.
Balancing
between
Question
Topic
and
Question
Focus
4
Using
Translation
Probability
In
the
setting
of
question
search
,
besides
the
topic
what
we
address
in
the
previous
sections
,
another
research
topic
is
to
fix
lexical
chasm
between
questions
.
Sometimes
,
two
questions
that
have
the
same
meaning
use
very
different
wording
.
For
example
,
the
questions
"
where
to
stay
in
Hamburg
?
"
and
"
the
best
hotel
in
Hamburg
?
"
have
almost
the
same
meaning
but
are
lexically
different
in
question
focus
(
where
to
stay
vs.
best
hotel
)
.
This
is
the
so-called
'
lexical
chasm
'
.
Jeon
and
Bruce
(
2007
)
proposed
a
mixture
model
for
fixing
the
lexical
chasm
between
questions
.
The
model
is
a
combination
of
the
language
modeling
approach
(
for
information
retrieval
)
and
translation-based
approach
(
for
information
retrieval
)
.
Our
idea
of
modeling
question
structure
for
search
can
naturally
extend
to
Jeon
et
al.
's
model
.
More
specifically
,
by
using
translation
probabilities
,
we
can
rewrite
equation
(
11
)
and
(
12
)
as
follow
:
+P2
•
It'er
(
«
Mt
|
t
'
)
•
p
{
t
'
\
T
{
q
)
)
(
14
)
+
(
1
-
/
?
i
-
/
?
2
)
-
pXt
|
C
)
where
7Y
(
t
|
t
'
)
denotes
the
probability
that
topic
term
t
is
the
translation
of
V.
In
our
experiments
,
to
estimate
the
probability
7Y
(
t
|
t
'
)
,
we
used
the
collections
of
question
titles
and
question
descriptions
as
the
parallel
corpus
and
the
IBM
model
1
(
Brown
et
al.
,
1993
)
as
the
alignment
model
.
Usually
,
users
reiterate
or
paraphrase
their
questions
(
already
described
in
question
titles
)
in
question
descriptions
.
We
utilized
the
new
model
elaborated
by
equation
(
13
)
and
(
14
)
for
searching
questions
about
'
travel
'
and
'
computers
&amp;
internet
'
.
The
new
model
is
denoted
as
'
SMT-CUT
'
.
Table
6
provides
the
evaluation
results
.
The
evaluation
was
conducted
with
exactly
the
same
setting
as
in
Section
3
.
From
Table
6
,
we
see
that
the
performance
of
our
approach
can
be
further
boosted
by
using
translation
probability
.
R-Precision
LMIR-CUT
Table
6
.
Using
Translation
Probability
5
Related
Work
The
major
focus
of
previous
research
efforts
on
question
search
is
to
tackle
the
lexical
chasm
problem
between
questions
.
The
research
of
question
search
is
first
conducted
using
FAQ
data
.
FAQ
Finder
(
Burke
et
al.
,
1997
)
heuristically
combines
statistical
similarities
and
semantic
similarities
between
questions
to
rank
FAQs
.
Conventional
vector
space
models
are
used
to
calculate
the
statistical
similarity
and
WordNet
(
Fellbaum
,
1998
)
is
used
to
estimate
the
semantic
similarity
.
Sneiders
(
2002
)
proposed
template
based
FAQ
retrieval
systems
.
Lai
et
al.
(
2002
)
proposed
an
approach
to
automatically
mine
FAQs
from
the
Web
.
Jijkoun
and
Rijke
(
2005
)
used
supervised
learning
methods
to
extend
heuristic
extraction
of
Q
/
A
pairs
from
FAQ
pages
,
and
treated
Q
/
A
pair
retrieval
as
a
fielded
search
task
.
Harabagiu
et
al.
(
2005
)
used
a
Question
Answer
Database
(
known
as
QUAB
)
to
support
interactive
question
answering
.
They
compared
seven
different
similarity
metrics
for
selecting
related
questions
from
QUAB
and
found
that
the
concept-based
metric
performed
best
.
Recently
,
the
research
of
question
search
has
been
further
extended
to
the
community-based
Q
&amp;
A
data
.
For
example
,
Jeon
et
al.
(
Jeon
et
al.
,
2005a
;
Jeon
et
al.
,
2005b
)
compared
four
different
retrieval
methods
,
i.e.
vector
space
model
,
Okapi
,
language
model
(
LM
)
,
and
translation-based
model
,
for
automatically
fixing
the
lexical
chasm
between
questions
of
question
search
.
They
found
that
the
translation-based
model
performed
best
.
However
,
all
the
existing
methods
treat
questions
just
as
plain
texts
(
without
considering
question
structure
)
.
In
this
paper
,
we
proposed
to
conduct
question
search
by
identifying
question
topic
and
question
focus
.
To
the
best
of
our
knowledge
,
none
of
the
existing
studies
addressed
question
search
by
modeling
both
question
topic
and
question
focus
.
Question
answering
(
e.g.
,
Pasca
and
Harabagiu
,
2001
;
Echihabi
and
Marcu
,
2003
;
Voorhees
,
2004
;
Metzler
and
Croft
,
2005
)
relates
to
question
search
.
Question
answering
automatically
extracts
short
answers
for
a
relatively
limited
class
of
question
types
from
document
collections
.
In
contrast
to
that
,
question
search
retrieves
answers
for
an
unlimited
range
of
questions
by
focusing
on
finding
semanti-cally
similar
questions
in
an
archive
.
6
Conclusions
and
Future
Work
In
this
paper
,
we
have
proposed
an
approach
to
question
search
which
models
question
topic
and
question
focus
in
a
language
modeling
framework
.
The
contribution
of
this
paper
can
be
summarized
in
4-fold
:
(
1
)
A
data
structure
consisting
of
question
topic
and
question
focus
was
proposed
for
summarizing
questions
;
(
2
)
The
MDL-based
tree
cut
model
was
employed
to
identify
question
topic
and
question
focus
automatically
;
(
3
)
A
new
form
of
language
modeling
using
question
topic
and
question
focus
was
developed
for
question
search
;
(
4
)
Extensive
experiments
have
been
conducted
to
evaluate
the
proposed
approach
using
a
large
collection
of
real
questions
obtained
from
Yahoo
!
Answers
.
Though
we
only
utilize
data
from
community-based
question
answering
service
in
our
experiments
,
we
could
also
use
categorized
questions
from
forum
sites
and
FAQ
sites
.
Thus
,
as
future
work
,
we
will
try
to
investigate
the
use
of
the
proposed
approach
for
other
kinds
of
web
services
.
Acknowledgement
We
would
like
to
thank
Xinying
Song
,
Shasha
Li
,
and
Shilin
Ding
for
their
efforts
on
developing
the
evaluation
data
.
We
would
also
like
to
thank
Stephan
H.
Stiller
for
his
proof-reading
of
the
paper
.
