Speech recognition transcripts are far from perfect; they are not of sufficient quality to be useful on their own for spoken document retrieval. This is especially the case for conversational speech. Recent efforts have tried to overcome this issue by using statistics from speech lattices instead of only the 1-best transcripts; however, these efforts have invariably used the classical vector space retrieval model. This paper presents a novel approach to lattice-based spoken document retrieval using statistical language models: a statistical model is estimated for each document, and probabilities derived from the document models are directly used to measure relevance. Experimental results show that the lattice-based language modeling method outperforms both the language modeling retrieval method using only the 1-best transcripts and a recently proposed lattice-based vector space retrieval method.
1 Introduction
Information retrieval (IR) is the task of ranking a collection of documents according to an estimate of their relevance to a query. With the recent growth in the amount of speech recordings in the form of voice mails, news broadcasts, and so forth, the task of spoken document retrieval (SDR) - information retrieval in which the document collection is in the form of speech recordings - is becoming increasingly important.

SDR on broadcast news corpora has been "deemed to be a solved problem", due to the fact that the performance of retrieval engines working on 1-best automatic speech recognition (ASR) transcripts was found to be "virtually the same as their performance on the human reference transcripts" (NIST, 2000). However, this is still not the case for SDR on more challenging data, such as conversational speech in noisy environments, as the 1-best transcripts of these data contain too many recognition errors to be useful for retrieval.

One way to ameliorate this problem is to work with not just one ASR hypothesis for each utterance, but multiple hypotheses presented in a lattice data structure. A lattice is a connected directed acyclic graph in which each edge is labeled with a term hypothesis and a likelihood value (James, 1995); each path through a lattice gives a hypothesis of the sequence of terms spoken in the utterance.

Each lattice can be viewed as a statistical model of the possible transcripts of an utterance (given the speech recognizer's state of knowledge); thus, an IR model based on statistical inference seems to be a more natural and more principled approach to lattice-based SDR. This paper thus proposes a lattice-based SDR method based on the statistical language modeling approach of Song and Croft (1999). In this method, the expected word count - the mean number of occurrences of a word given a lattice's statistical model - is computed for each word in each lattice. Using these expected counts, a statistical language model is estimated for each spoken document, and a document's relevance to a query is computed as a probability under this model.
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 810-818, Prague, June 2007. © 2007 Association for Computational Linguistics
The rest of this paper is organized as follows. In Section 2 we review related work in the areas of speech processing and IR. Section 3 describes our proposed method as well as the baseline methods. Details of the experimental setup are given in Section 4, and experimental results are in Section 5. Finally, Section 6 concludes our discussion and outlines our future work.
2 Related Work

2.1 Lattices for Spoken Document Retrieval
James and Young (1994) first introduced the lattice as a representation for indexing spoken documents, as part of a method for vocabulary-independent keyword spotting. The lattice representation was later applied to the task of spoken document retrieval by James (1995): James counted how many times each query word occurred in each phone lattice with a sufficiently high normalized log likelihood, and these counts were then used in retrieval under a vector space model with tf·idf weighting. Jones et al. (1996) combined retrieval from phone lattices using variations of James' method with retrieval from 1-best word transcripts to achieve better results.

Since then, a number of different methods for SDR using lattices have been proposed. For instance, Siegler (1999) used word lattices instead of phone lattices as the basis of retrieval, and generalized the tf·idf formalism to allow uncertainty in word counts. Chelba and Acero (2005) preprocessed lattices into more compact Position Specific Posterior Lattices (PSPL), and computed an aggregate score for each document based on the posterior probability of edges and the proximity of search terms in the document. Mamou et al. (2006) converted each lattice into a word confusion network (Mangu et al., 2000), and estimated the inverse document frequency (idf) of each word t as the ratio of the total number of words in the document collection to the total number of occurrences of t.

Despite the differences in the details, the above lattice-based SDR methods have all been based on the classical vector space retrieval model with tf·idf weighting.
2.2 Expected Counts from Lattices
A speech recognizer generates a 1-best transcript of a spoken document by considering possible transcripts of the document, and then selecting the transcript with the highest probability. However, unlike a text document, such a 1-best transcript is likely to be inexact due to speech recognition errors. To represent the uncertainty in speech recognition, and to incorporate information from multiple transcription hypotheses rather than only the 1-best, it is desirable to use expected word counts from lattices output by a speech recognizer.

In the context of spoken document search, Siegler (1999) described expected word counts and formulated a way to estimate them from lattices based on the relative ranks of word hypothesis probabilities; Chelba and Acero (2005) used a more explicit formula for computing word counts based on summing edge posterior probabilities in lattices; Saraclar and Sproat (2004) performed word-spotting in speech lattices by looking for word occurrences whose expected counts were above a certain threshold; and Yu et al. (2005) searched for phrases in spoken documents using a similar measure, the expected word relevance.

Expected counts have also been used to summarize the phonotactics of a speech recording represented in a lattice: Hatch et al. (2005) performed speaker recognition by computing the expected counts of phone bigrams in a phone lattice, and estimating an unsmoothed probability distribution of phone bigrams.

Although many uses of expected counts have been studied, the use of statistical language models built from expected word counts has not been well explored.
2.3 Retrieval via Statistical Language Modeling
Finally, the statistical language modeling approach to retrieval was used by Ponte and Croft (1998) for IR with text documents, and it was shown to outperform the tf·idf approach for this task; this method was further improved on in Song and Croft (1999). Chen et al. (2004) applied Song and Croft's method to Mandarin spoken document retrieval using 1-best ASR transcripts; in this task, it was also shown to outperform tf·idf. Thus, the statistical language modeling approach to retrieval has been shown to be superior to the vector space approach for both these IR tasks.
2.4 Contributions of Our Work
The main contributions of our work include:

• extending the language modeling IR approach from text-based retrieval to lattice-based spoken document retrieval; and

• formulating a method for building a statistical language model based on expected word counts derived from lattices.

Our method is motivated by the success of the statistical retrieval framework over the vector space approach with tf·idf for text-based IR, as well as for spoken document retrieval via 1-best transcripts. Our use of expected counts differs from Saraclar and Sproat (2004) in that we estimate probability models from the expected counts. Conceptually, our method is close to that of Hatch et al. (2005), as both methods build a language model to summarize the content of a spoken document represented in a lattice. In practice, our method differs from Hatch et al.'s (2005) in many ways: first, we derive word statistics for representing semantics, instead of phone bigram statistics for representing phonotactics; second, we introduce a smoothing mechanism (Zhai and Lafferty, 2004) to the language model that is specific to information retrieval.
3 Methods
We now describe the formulation of three different SDR methods: a baseline statistical retrieval method which works on 1-best transcripts, our proposed statistical lattice-based SDR method, and a previously published vector space lattice-based SDR method.
3.1 Baseline Statistical Retrieval Method
Our baseline retrieval method is motivated by Song and Croft (1999), and uses the language model smoothing methods of Zhai and Lafferty (2004). This method is used to perform retrieval on the documents' 1-best ASR transcripts and reference human transcripts.

Let C be the collection of documents to retrieve from.
For each document d contained in C, and each query q, the relevance of d to q can be defined as Pr(d|q). This probability cannot be computed directly, but under the assumption that the prior Pr(d) is uniform over all documents in C, we see that ranking documents in decreasing order of Pr(d|q) is equivalent to ranking them in decreasing order of the query likelihood (Berger and Lafferty, 1999), computed under a unigram model of d:

Pr(q|d) = ∏_{w ∈ q} Pr(w|d)^{C(w|q)}    (1)

where C(w|q) is the word count of w in q.
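The query-likelihood ranking just described can be sketched directly. The following is an illustrative implementation, not the paper's code; the document models are assumed to be given as word-to-probability functions (their estimation is described next), and must assign nonzero probability to every word.

```python
import math
from collections import Counter

def query_log_likelihood(query_words, pr_w_given_d):
    """ln Pr(q|d) = sum over distinct query words w of C(w|q) * ln Pr(w|d)."""
    counts = Counter(query_words)                 # C(w|q)
    return sum(c * math.log(pr_w_given_d(w)) for w, c in counts.items())

def rank_documents(query_words, doc_models):
    """Rank document ids by descending query log likelihood."""
    scores = {d: query_log_likelihood(query_words, m) for d, m in doc_models.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy example with two fully smoothed (nonzero-everywhere) document models.
doc_models = {
    "d1": lambda w: {"rain": 0.5, "bank": 0.1}.get(w, 0.01),
    "d2": lambda w: {"rain": 0.1, "bank": 0.5}.get(w, 0.01),
}
print(rank_documents(["rain", "rain"], doc_models))  # d1 ranked first
```

Working in log space avoids underflow when queries contain many words.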
Before using Equation 1, we must estimate a unigram model from d: that is, an assignment of probabilities Pr(w|d) for all w ∈ V. One way to do this is to use a maximum likelihood estimate (MLE) - an assignment of Pr(w|d) for all w which maximizes the probability of generating d. The MLE is given by the equation

Pr_MLE(w|d) = C(w|d) / |d|

where C(w|d) is the number of occurrences of w in d, and |d| is the total number of words in d.
However, using this formula means we will get a value of zero for Pr(q|d) if even a single query word qi is not found in d. To overcome this problem, we smooth the model by assigning some probability mass to such unseen words.
Specifically, we adopt a two-stage smoothing method (Zhai and Lafferty, 2004):

Pr(w|d) = (1 - λ) · (C(w|d) + μ Pr(w|U)) / (|d| + μ) + λ Pr(w|U)    (2)

Here, U denotes a background language model, and μ > 0 and λ ∈ (0, 1) are parameters to the smoothing procedure.
This is a combination of Bayesian smoothing using Dirichlet priors (MacKay and Peto, 1995) and Jelinek-Mercer smoothing (Jelinek and Mercer, 1980).
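The two-stage smoothed estimate can be sketched as follows; this is an illustrative implementation under the definitions above, with the background model U given as a word-to-probability function and a toy uniform vocabulary.

```python
def smoothed_prob(w, doc_counts, doc_len, bg_prob, mu=2000.0, lam=0.1):
    """Two-stage smoothed Pr(w|d): Dirichlet-prior smoothing toward the
    background model U, then Jelinek-Mercer interpolation with U."""
    dirichlet = (doc_counts.get(w, 0) + mu * bg_prob(w)) / (doc_len + mu)
    return (1.0 - lam) * dirichlet + lam * bg_prob(w)

# Toy background model: uniform over a 1,000-word vocabulary.
bg = lambda w: 1.0 / 1000
counts = {"rain": 3, "bank": 1}
p_seen = smoothed_prob("rain", counts, doc_len=4, bg_prob=bg)
p_unseen = smoothed_prob("flood", counts, doc_len=4, bg_prob=bg)
print(p_seen, p_unseen)  # unseen words get a small but nonzero probability
```

Because both smoothing stages interpolate with a proper distribution, the smoothed probabilities still sum to one over the vocabulary.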
The parameter λ can be set empirically according to the nature of the queries. For the parameter μ, we adopt the estimation procedure of Zhai and Lafferty (2004): we maximize the leave-one-out log likelihood of the document collection, namely

ℓ_{-1}(μ|C) = Σ_{d ∈ C} Σ_{w ∈ d} C(w|d) ln( (C(w|d) - 1 + μ Pr(w|U)) / (|d| - 1 + μ) )    (3)

by using Newton's method to solve the equation dℓ_{-1}(μ|C)/dμ = 0.
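This estimation step can be sketched as follows. The code is only an illustration of the leave-one-out objective and a plain Newton iteration on its first derivative (with a numerical second derivative); the toy collection, starting point, and safeguards are choices of this sketch, not of the paper.

```python
def loo_derivative(mu, docs, bg_prob):
    """First derivative of the leave-one-out log likelihood w.r.t. mu."""
    total = 0.0
    for counts in docs:                      # one {word: count} dict per document
        d_len = sum(counts.values())
        for w, c in counts.items():
            p = bg_prob(w)
            total += c * (p / (c - 1 + mu * p) - 1.0 / (d_len - 1 + mu))
    return total

def estimate_mu(docs, bg_prob, mu0=1.0, iters=50, h=1e-3):
    """Solve d(loo log likelihood)/d(mu) = 0 by Newton's method."""
    mu = mu0
    for _ in range(iters):
        f = loo_derivative(mu, docs, bg_prob)
        f2 = (loo_derivative(mu + h, docs, bg_prob) - f) / h   # numeric 2nd deriv.
        if abs(f2) < 1e-12:
            break
        mu = max(1e-3, mu - f / f2)          # keep mu strictly positive
    return mu

docs = [{"rain": 2, "bank": 1}, {"bank": 3}, {"rain": 1, "loan": 2}]
bg = lambda w: 1.0 / 50                       # uniform background model
mu = estimate_mu(docs, bg)
print(mu)
```

On realistic collections μ is typically in the hundreds or thousands; the tiny toy collection here only serves to exercise the fixed-point condition.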
3.2 Our Proposed Statistical Lattice-Based Retrieval Method
We now propose our lattice-based retrieval method. In contrast to the above baseline method, our proposed method works on the lattice representation of spoken documents, as generated by a speech recognizer.

First, each spoken document is divided into M short speech segments. A speech recognizer then generates a lattice for each speech segment. As previously stated, a lattice is a connected directed acyclic graph with edges labeled with word hypotheses and likelihoods.
Thus, each path through the lattice contains a hypothesis of the series of words spoken in this speech segment, t = t1 t2 ··· tN, along with acoustic probabilities Pr(o1|t1), Pr(o2|t2), ···, Pr(oN|tN), where oi denotes the acoustic observations for the time interval of the word ti hypothesized by the speech recognizer. Let o = o1 o2 ··· oN denote the acoustic observations for the entire speech segment.
We then rescore each lattice with an n-gram language model. Effectively, this means multiplying the acoustic probabilities with n-gram probabilities:

Pr(t, o) = ∏_{i=1}^{N} Pr(oi|ti) · Pr(ti | t_{i-n+1} ··· t_{i-1})
This produces an expanded lattice in which paths (hypotheses) are weighted by their posterior probabilities rather than their acoustic likelihoods: specifically, by Pr(t, o) ∝ Pr(t|o) rather than Pr(o|t) (Odell, 1995). The lattice is then pruned, by removing those paths in the lattice whose log posterior probabilities - to be precise, whose γ ln Pr(t|o) - are not within a threshold θ of the best path's log posterior probability (in our implementation, γ = 10000.5).
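The posterior weighting and pruning just described can be sketched over an explicit list of path scores. Real lattices share subpaths and would use dynamic programming (in our experiments this is handled by SRILM and the AT&T FSM Library), so the code below is only a toy model of the computation; the symbols theta and gamma follow the text.

```python
import math

def path_posteriors(joint_log_scores):
    """Normalize joint scores ln Pr(t, o) into posteriors Pr(t|o),
    using the log-sum-exp trick for numerical stability."""
    m = max(joint_log_scores.values())
    z = m + math.log(sum(math.exp(s - m) for s in joint_log_scores.values()))
    return {t: math.exp(s - z) for t, s in joint_log_scores.items()}

def prune(posteriors, theta, gamma=1.0):
    """Keep paths whose scaled log posterior is within theta of the best."""
    best = max(gamma * math.log(p) for p in posteriors.values())
    kept = {t: p for t, p in posteriors.items()
            if gamma * math.log(p) >= best - theta}
    z = sum(kept.values())                    # renormalize the survivors
    return {t: p / z for t, p in kept.items()}

# Three hypothesized transcripts of one segment, with joint log scores.
scores = {("how", "are", "you"): -10.0,
          ("how", "art", "thou"): -12.0,
          ("cow", "are", "you"): -30.0}
post = path_posteriors(scores)
kept = prune(post, theta=10.0)
print(sorted(kept))  # the implausible path is pruned away
```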
The expected count of a word w in document d is then computed from the pruned lattices as

E[C(w|d)] = Σ_{i=1}^{M} Σ_t Pr(t|o^(i)) · C(w|t)

where C(w|t) is the word count of w in the hypothesized transcript t, and o^(i) denotes the acoustic observations of the ith speech segment. We can also analogously compute the expected document length:

E[|d|] = Σ_{i=1}^{M} Σ_t Pr(t|o^(i)) · |t|

where |t| denotes the number of words in t.
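Given path posteriors for each segment, these two quantities can be sketched as follows (illustrative code over explicit path lists, matching the formulas above; the toy posteriors are hand-picked).

```python
from collections import defaultdict

def expected_counts(segment_posteriors):
    """E[C(w|d)] = sum over segments i and paths t of Pr(t|o_i) * C(w|t)."""
    e = defaultdict(float)
    for posteriors in segment_posteriors:     # one {path: Pr(t|o)} per segment
        for path, p in posteriors.items():
            for w in path:                    # each occurrence adds Pr(t|o)
                e[w] += p
    return dict(e)

def expected_length(segment_posteriors):
    """E[|d|] = sum over segments i and paths t of Pr(t|o_i) * |t|."""
    return sum(p * len(path)
               for posteriors in segment_posteriors
               for path, p in posteriors.items())

segs = [{("how", "are", "you"): 0.9, ("how", "art", "thou"): 0.1},
        {("fine",): 1.0}]
ec = expected_counts(segs)
print(ec, expected_length(segs))
```

Note that a word appearing in every surviving path of a segment ("how" above) gets an expected count of one there, while uncertain words get fractional counts.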
Using these expected counts, a smoothed unigram model is then estimated for each document as in Equation 2, with the expected counts and expected length taking the place of the exact counts C(w|d) and |d|:

Pr(w|d) = (1 - λ) · (E[C(w|d)] + μ Pr(w|U)) / (E[|d|] + μ) + λ Pr(w|U)    (4)

In addition, we also modify the procedure for estimating μ, by replacing C(w|d) and |d| in Equation 3 with ⌊E[C(w|d)] + 1/2⌋ and Σ_{w ∈ V} ⌊E[C(w|d)] + 1/2⌋ respectively.

Figure 1: Example of a word confusion network
The probability estimates from Equation 4 can then be substituted into Equation 1 to yield relevance scores.
3.3 Baseline tf·idf Lattice-Based Retrieval Method
As a further comparison, we also implemented Mamou et al.'s (2006) vector space retrieval method (without query refinement via lexical affinities). In this method, each document d is represented as a word confusion network (WCN) (Mangu et al., 2000) - a simplified lattice which can be viewed as a sequence of confusion sets c1, c2, c3, ···.
Each ci corresponds approximately to a time interval in the spoken document and contains a group of word hypotheses, and each word w in this group of hypotheses is labeled with the probability Pr(w|ci, d) - the probability that w was spoken in the time interval of ci. A confusion set may also give a probability Pr(ε|ci, d), the probability that no word was spoken in the time of ci. Figure 1 gives an example of a WCN.
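A WCN can be modeled as a plain list of confusion sets. The sketch below is an illustrative data structure (using the empty string for ε), not the representation used by the toolkits in our experiments.

```python
# Each confusion set maps word hypotheses (or "" for epsilon, i.e. no word
# spoken) to Pr(w|ci, d); the probabilities in each set sum to one.
wcn = [
    {"how": 0.8, "cow": 0.2},
    {"are": 0.7, "art": 0.2, "": 0.1},
    {"you": 1.0},
]

def wcn_word_prob(wcn, i, w):
    """Pr(w|ci, d): probability that w was spoken in the interval of ci."""
    return wcn[i].get(w, 0.0)

def best_path(wcn):
    """The 1-best hypothesis: the top word of each set, with epsilon removed."""
    return [max(c, key=c.get) for c in wcn if max(c, key=c.get) != ""]

assert all(abs(sum(c.values()) - 1.0) < 1e-9 for c in wcn)
print(best_path(wcn))  # ['how', 'are', 'you']
```

Unlike a full lattice, the WCN forces all hypotheses through the same sequence of confusion sets, which is what makes it a lossy simplification.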
Mamou et al.'s retrieval method proceeds as follows. First, the documents are divided into speech segments, lattices are generated from the speech segments, and the lattices are pruned according to the path probability threshold θ, as described in Section 3.2. The lattice for each speech segment is then converted into a WCN according to the algorithm of Mangu et al. (2000); the WCNs of the segments in each document are then concatenated to form a single WCN per document.
Relevance scores are then computed with a tf·idf formula whose term statistics are derived from the WCN word probabilities, including:

• the "average document length" avgdl, computed over the document collection; and

• the "inverse document frequency" idf(w), computed as the ratio of the total number of words in the document collection to the total number of occurrences of w.
4 Experiments

4.1 Document Collection
To evaluate our proposed retrieval method, we performed experiments using the Hub5 Mandarin training corpus released by the Linguistic Data Consortium (LDC98T26). This is a conversational telephone speech corpus which is 17 hours long, and contains recordings of 42 telephone calls corresponding to approximately 600Kb of transcribed Mandarin text. Each conversation has been broken up into speech segments of less than 8 seconds each.

As the telephone calls in LDC98T26 have not been divided neatly into "documents", we had to choose a suitable unit of retrieval which could serve as a "document". An entire conversation would be too long for such a purpose, while a speech segment or speaker turn would be too short. We decided to use half-minute time windows with 50% overlap as retrieval units, following Abberley et al. (1999) and Tuerk et al. (2001). The 42 telephone conversations were thus divided into 4,312 retrieval units ("documents"). Each document comprises multiple consecutive speech segments.
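The windowing can be sketched as follows. The fixed window length and 50% overlap follow the text; grouping whole segments into windows by their start times is an assumption of this illustration.

```python
def retrieval_units(segments, window=30.0, overlap=0.5):
    """Group (start_time, segment_id) pairs into fixed-length time windows
    that advance by window * (1 - overlap) seconds."""
    if not segments:
        return []
    end = max(t for t, _ in segments)
    hop, units, start = window * (1.0 - overlap), [], 0.0
    while start <= end:
        unit = [s for t, s in segments if start <= t < start + window]
        if unit:
            units.append(unit)
        start += hop
    return units

# Toy conversation: one segment starting every 10 seconds, from 0s to 60s.
segs = [(10.0 * i, f"seg{i}") for i in range(7)]
units = retrieval_units(segs)
print(len(units), units[0])
```

With 50% overlap, each segment falls into two consecutive retrieval units (except at the conversation boundaries), which is what inflates 17 hours of speech into 4,312 half-minute units.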
4.2 Queries and Ground Truth Relevance Judgements
We then formulated 18 queries (14 test queries, 4 development queries) to issue on the document collection. Each query comprised one or more written Chinese keywords. We then obtained ground truth relevance judgements by manually examining each of the 4,312 documents to see if it was relevant to the topic of each query. The number of retrieval units relevant to each query was found to range from 4 to 990. The complete list of queries and the number of documents relevant to each query are given in Table 1.
4.3 Preprocessing of Documents and Queries
Next, we processed the document collection with a speech recognizer. For this task we used the Abacus system (Hon et al., 1994), a large vocabulary continuous speech recognizer which contains a triphone-based acoustic system and a frame-synchronized search algorithm for effective word decoding. Each Mandarin syllable was modeled by one to four triphone models. Acoustic models were trained from a corpus of 200 hours of telephony speech from 500 speakers sampled at 8kHz. For each speech frame, we extracted a 39-dimensional feature vector consisting of 12 MFCCs and normalized energy, and their first and second order derivatives. Sentence-based cepstral mean subtraction was applied for acoustic normalization in both training and testing.
Each triphone was modeled by a left-to-right 3-state hidden Markov model (HMM), each state having 16 Gaussian mixture components.

Table 1: List of test and development queries. Test queries: contact information; the weather; housing matters; studies, academia; litigation; raising children; Christian churches; clothing; eating out; playing sports; dealings with banks; computers and software. Development queries: passport and visa matters; Washington D.C.; working life.
In total, we built 1,923 untied within-syllable triphone models for 43 Mandarin phonemes, as well as 3 silence models. The search algorithm was supported by a loop grammar of over 80,000 words.

We processed the speech segments in our collection with this recognizer, generating lattices incorporating acoustic likelihoods but not n-gram model probabilities.
We then rescored the lattices using a backoff trigram language model interpolated in equal proportions from two trigram models:

• a model built from corpora of transcripts of conversations, comprised of a 320Kb subset of the Callhome Mandarin corpus (LDC96T16) and the CSTSC-Flight corpus from the Chinese Corpus Consortium (950Kb)
The unigram counts from this model were also used as the background language model U in Equations 2 and 4.
The reference transcripts, queries, and trigram model training data were all segmented into words using Low et al.'s (2005) Chinese word segmenter, trained on the Microsoft Research (MSR) corpus, with the speech recognizer's vocabulary used as an external dictionary.
The 1-best ASR transcripts were decoded from the rescored lattices. Lattice rescoring, trigram model building, WCN generation, and computation of expected word counts were done using the SRILM toolkit (Stolcke, 2002), while lattice pruning was done with the help of the AT&T FSM Library (Mohri et al., 1998).
We also computed the character error rate (CER) and syllable error rate (SER) of the 1-best transcripts, and the lattice oracle CER, for one of the telephone conversations in the speech corpus (ma_4160). The CER was found to be 69%, the SER 63%, and the oracle CER 29%.
4.4 Retrieval and Evaluation
We then performed retrieval on the document collection using the algorithms in Section 3, using the reference transcripts, the 1-best ASR transcripts, lattices, and WCNs. We set λ = 0.1, which was suggested by Zhai and Lafferty (2004) to give good retrieval performance for keyword queries.
The results of retrieval were checked against the ground truth relevance judgements, and evaluated in terms of the non-interpolated mean average precision (MAP):

MAP = (1/L) Σ_{i=1}^{L} (1/Ri) Σ_{j=1}^{Ri} (j / ri,j)
Table 2: Summary of experimental results (MAP on the development and test queries, by retrieval method and retrieval source: reference transcripts, 1-best transcripts, lattices, and WCNs, for the statistical and vector space tf·idf methods)
where L denotes the total number of queries, Ri the total number of documents relevant to the ith query, and ri,j the position of the jth relevant document in the ranked list output by the retrieval method for query i.
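This metric can be sketched directly from the definition; the example below is illustrative, with ranked lists and relevance sets chosen by hand.

```python
def average_precision(ranked, relevant):
    """Average precision for one query: mean over the relevant documents of
    j / r_{i,j}, i.e. precision at the rank of each relevant document."""
    hits, precisions = 0, []
    for pos, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / pos)     # j / r_{i,j}
    return sum(precisions) / len(relevant)

def mean_average_precision(runs):
    """MAP over queries; each run is a (ranked list, relevant set) pair."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),   # AP = (1/1 + 2/3) / 2
    (["d2", "d1"], {"d1"}),                     # AP = 1/2
]
print(mean_average_precision(runs))
```

Relevant documents that never appear in the ranked list contribute zero to the sum but still count in the 1/Ri normalization, which is what makes the measure non-interpolated.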
For the lattice-based retrieval methods, we performed retrieval with the development queries using several values of θ between 0 and 100,000, and then used the value of θ with the best MAP to do retrieval with the test queries.
5 Experimental Results
The results of our experiments are summarized in Table 2; the MAPs of the two lattice-based retrieval methods, Mamou et al.'s (2006) vector space method and our proposed statistical retrieval method, are shown in Figures 2 and 3 respectively.
The results show that, for the vector space retrieval method, the MAP of the development queries is highest at θ = 27,500, at which point the MAP for the test queries is 0.1599; for our proposed method, the MAP for the development queries is highest at θ = 65,000, at which point the MAP for the test queries reaches 0.2154.
As can be seen, the performance of our statistical lattice-based method shows a marked improvement over the MAP of 0.1364 achieved using only the 1-best ASR transcripts; indeed, a one-tailed Student's t-test shows that this improvement is statistically significant at the 99.5% confidence level.
The statistical method also yields better performance than Mamou et al.'s vector space method; a t-test shows the performance difference to be statistically significant at the 97.5% confidence level.

Figure 2: MAP of retrieval using word probabilities from word confusion networks, for the 4 development queries, at various pruning thresholds θ

Figure 3: MAP of our proposed statistical method for lattice-based retrieval, at various pruning thresholds θ (the maximum log probability difference of paths)
6 Conclusions and Future Work
We have presented a method for performing spoken document retrieval using lattices which is based on a statistical language modeling retrieval framework. Results show that our new method can significantly improve the retrieval MAP compared to using only the 1-best ASR transcripts. Also, our proposed retrieval method has been shown to outperform Mamou et al.'s (2006) vector space lattice-based retrieval method.

Besides the better empirical performance, our method also has other advantages over Mamou et al.'s vector space method. For one, our method computes expected word counts directly from rescored lattices, and does not require an additional step to convert lattices lossily to WCNs.
Furthermore, our method uses all the hypotheses in each lattice, rather than just the top 10 word hypotheses at each time interval. Most importantly, our method provides a more natural and more principled approach to lattice-based spoken document retrieval based on a sound statistical foundation, by harnessing the fact that lattices are themselves statistical models; the statistical approach also means that our method can be more easily augmented with additional statistical knowledge sources in a principled way.
For future work, we plan to test our proposed method on English speech corpora, and with larger-scale retrieval tasks involving more queries and more documents. We would also like to extend our method to other speech processing tasks, such as spoken document classification and example-based spoken document retrieval.
