We
present
a
maximally
streamlined
approach
to
learning
HMM-based
acoustic
models
for
automatic
speech
recognition
.
In
our
approach
,
an
initial
mono-phone
HMM
is
iteratively
refined
using
a
split-merge
EM
procedure
which
makes
no
assumptions
about
subphone
structure
or
context-dependent
structure
,
and
which
uses
only
a
single
Gaussian
per
HMM
state
.
Despite
the
much
simplified
training
process
,
our
acoustic
model
achieves
state-of-the-art
results
on
phone
classification
(
where
it
outperforms
almost
all
other
methods
)
and
competitive
performance
on
phone
recognition
(
where
it
outperforms
standard
CD
triphone
/
subphone
/
GMM
approaches
)
.
We
also
present
an
analysis
of
what
is
and
is
not
learned
by
our
system
.
1
Introduction
Continuous
density
hidden
Markov
models
(
HMMs
)
underlie
most
automatic
speech
recognition
(
ASR
)
systems
in
some
form
.
While
the
basic
algorithms
for
HMM
learning
and
inference
are
quite
general
,
acoustic
models
of
speech
standardly
employ
rich
speech-specific
structures
to
improve
performance
.
For
example
,
it
is
well
known
that
a
monophone
HMM
with
one
state
per
phone
is
too
coarse
an
approximation
to
the
true
articulatory
and
acoustic
process
.
The
HMM
state
space
is
therefore
refined
in
several
ways
.
To
model
phone-internal
dynamics
,
phones
are
split
into
beginning
,
middle
,
and
end
subphones
(
Jelinek
,
1976
)
.
To
model
cross-phone
coarticulation
,
the
states
of
the
HMM
are
refined
by
splitting
the
phones
into
context-dependent
tri-phones
.
These
states
are
then
re-clustered
(
Odell
,
1995
)
and
the
parameters
of
their
observation
distributions
are
tied
back
together
(
Young
and
Woodland
,
1994
)
.
Finally
,
to
model
complex
emission
densities
,
states
emit
mixtures
of
multivariate
Gaus-sians
.
This
standard
structure
is
shown
schematically
in
Figure
1
.
While
this
rich
structure
is
phonetically
well-motivated
and
empirically
successful
,
so
much
structural
bias
may
be
unnecessary
,
or
even
harmful
.
For
example
in
the
domain
of
syntactic
parsing
with
probabilistic
context-free
grammars
(
PCFGs
)
,
a
surprising
recent
result
is
that
automatically
induced
grammar
refinements
can
outperform
sophisticated
methods
which
exploit
substantial
manually
articulated
structure
(
Petrov
et
al.
,
2006
)
.
In
this
paper
,
we
consider
a
much
more
automatic
,
data-driven
approach
to
learning
HMM
structure
for
acoustic
modeling
,
analagous
to
the
approach
taken
by
Petrov
et
al.
(
2006
)
for
learning
PCFGs
.
We
start
with
a
minimal
monophone
HMM
in
which
there
is
a
single
state
for
each
(
context-independent
)
phone
.
Moreover
,
the
emission
model
for
each
state
is
a
single
multivariate
Gaussian
(
over
the
standard
MFCC
acoustic
features
)
.
We
then
iteratively
refine
this
minimal
HMM
through
state
splitting
,
adding
complexity
as
needed
.
States
in
the
refined
HMMs
are
always
substates
of
the
original
HMM
and
are
therefore
each
identified
with
a
unique
base
phone
.
States
are
split
,
estimated
,
and
(
perhaps
)
merged
,
based
on
a
likelihood
criterion
.
Our
model
never
allows
explicit
Gaussian
mixtures
,
though
substates
may
develop
similar
distributions
and
thereby
emulate
such
mixtures
.
In
principle
,
discarding
the
traditional
structure
can
either
help
or
hurt
the
model
.
Incorrect
prior
splits
can
needlessly
fragment
training
data
and
incorrect
prior
tying
can
limit
the
model
's
expressivity
.
On
the
other
hand
,
correct
assumptions
can
increase
the
efficiency
of
the
learner
.
Empirically
,
Proceedings
of
the
2007
Joint
Conference
on
Empirical
Methods
in
Natural
Language
Processing
and
Computational
Natural
Language
Learning
,
pp.
897-905
,
Prague
,
June
2007
.
©
2007
Association
for
Computational
Linguistics
yiûû
.
jxrl
.
/
m.
m
.
/
m.
-
oca
.
JVJVJVJV
JVJVJVJV
JVJVJVJX
.
Figure
1
:
Comparison
of
the
standard
model
to
our
model
(
here
shown
with
k
=
4
subphones
per
phone
)
for
the
word
dad
.
The
dependence
of
subphones
across
phones
in
our
model
is
not
shown
,
while
the
context
clustering
in
the
standard
model
is
shown
only
schematically
.
we
show
that
our
automatic
approach
outperforms
classic
systems
on
the
task
of
phone
recognition
on
the
TIMIT
data
set
.
In
particular
,
it
outperforms
standard
state-tied
triphone
models
like
Young
and
Woodland
(
1994
)
,
achieving
a
phone
error
rate
of
26.4
%
versus
27.7
%
.
In
addition
,
our
approach
gives
state-of-the-art
performance
on
the
task
of
phone
classification
on
the
TIMIT
data
set
,
suggesting
that
our
learned
structure
is
particularly
effective
at
modeling
phone-internal
structure
.
Indeed
,
our
error
rate
of
21.4
%
is
outperformed
only
by
the
recent
structured
margin
approach
of
Sha
and
Saul
(
2006
)
.
It
remains
to
be
seen
whether
these
positive
results
on
acoustic
modeling
will
facilitate
better
word
recognition
rates
in
a
large
vocabulary
speech
recognition
system
.
We
also
consider
the
structures
learned
by
the
model
.
Subphone
structure
is
learned
,
similar
to
,
but
richer
than
,
standard
begin-middle-end
structures
.
Cross-phone
coarticulation
is
also
learned
,
with
classic
phonological
classes
often
emerging
naturally
.
Many
aspects
of
this
work
are
intended
to
simplify
rather
than
further
articulate
the
acoustic
process
.
It
should
therefore
be
clear
that
the
basic
techniques
of
splitting
,
merging
,
and
learning
using
EM
are
not
in
themselves
new
for
ASR
.
Nor
is
the
basic
latent
induction
method
new
(
Matsuzaki
et
al.
,
2005
;
Petrov
et
al.
,
2006
)
.
What
is
novel
in
this
paper
is
(
1
)
the
construction
of
an
automatic
system
for
acoustic
modeling
,
with
substantially
streamlined
structure
,
(
2
)
the
investigation
of
variational
inference
for
such
a
task
,
(
3
)
the
analysis
of
the
kinds
of
structures
learned
by
such
a
system
,
and
(
4
)
the
empirical
demonstration
that
such
a
system
is
not
only
competitive
with
the
traditional
approach
,
but
can
indeed
outperform
even
very
recent
work
on
some
preliminary
measures
.
2
Learning
In
the
following
,
we
propose
a
greatly
simplified
model
that
does
not
impose
any
manually
specified
structural
constraints
.
Instead
of
specifying
structure
a
priori
,
we
use
the
Expectation-Maximization
(
EM
)
algorithm
for
HMMs
(
Baum-Welch
)
to
automatically
induce
the
structure
in
a
way
that
maximizes
data
likelihood
.
In
general
,
our
training
data
consists
of
sets
of
acoustic
observation
sequences
and
phone
level
transcriptions
r
which
specify
a
sequence
of
phones
from
a
set
of
phones
Y
,
but
does
not
label
each
time
frame
with
a
phone
.
We
refer
to
an
observation
sequence
as
x
=
xi
,
.
.
.
,
xT
where
xi
G
R39
are
standard
MFCC
features
(
Davis
and
Mermel-stein
,
1980
)
.
We
wish
to
induce
an
HMM
over
a
set
of
states
S
for
which
we
also
have
a
function
n
:
S
—
Y
that
maps
every
state
in
S
to
a
phone
in
Y.
Note
that
in
the
usual
formulation
of
the
EM
algorithm
for
HMMs
,
one
is
interested
in
learning
HMM
parameters
9
that
maximize
the
likelihood
of
the
observations
P
(
x
|
9
)
;
in
contrast
,
we
aim
to
maximize
the
joint
probability
of
our
observations
and
phone
transcriptions
P
(
x
,
r
|
9
)
or
observations
and
phone
sequences
P
(
x
,
y
|
9
)
(
see
below
)
.
We
now
describe
this
relatively
straightforward
modification
of
the
EM
algorithm
.
For
clarity
of
exposition
we
first
consider
a
simplified
scenario
in
which
we
are
given
hand-aligned
phone
labels
y
=
yi
,
.
.
.
,
yT
for
each
time
t
,
as
is
the
case
for
the
TIMIT
dataset
.
Our
procedure
does
not
require
such
extensive
annotation
of
the
training
data
and
in
fact
gives
better
performance
when
the
exact
transition
point
between
phones
are
not
pre-specified
but
learned
.
We
define
forward
and
backward
probabilities
(
Rabiner
,
1989
)
in
the
following
way
:
the
forward
probability
is
the
probability
of
observing
the
sequence
xi
,
.
.
.
,
xt
with
transcription
yi
,
.
.
.
,
yt
and
previous
Figure
2
:
Iterative
refinement
of
the
/
ih
/
phone
with
1
,
2,4
,
8
substates
.
and
the
backward
probability
is
the
probability
of
observing
the
sequence
xt+i
,
xT
with
transcription
yt+i
,
.
.
.
,
yT
,
given
that
we
start
in
state
s
at
time
t
:
where
X
are
the
model
parameters
.
As
usual
,
we
parameterize
our
HMMs
with
ass
&gt;
,
the
probability
of
transitioning
from
state
s
to
s
'
,
and
bs
(
x
)
—
N
(
ps
,
Es
)
,
the
probability
emitting
the
observation
x
when
in
state
s.
These
probabilities
can
be
computed
using
the
standard
forward
and
backward
recursions
(
Rabiner
,
1989
)
,
except
that
at
each
time
t
,
we
only
consider
states
st
for
which
n
(
st
)
=
yt
,
because
we
have
hand-aligned
labels
for
the
observations
.
These
quantities
also
allow
us
to
compute
the
posterior
counts
necessary
for
the
E-step
of
the
EM
algorithm
.
One
way
of
inducing
arbitrary
structural
annotations
would
be
to
split
each
HMM
state
in
into
m
substates
,
and
re-estimate
the
parameters
for
the
split
HMM
using
EM
.
This
approach
has
two
major
drawbacks
:
for
larger
m
it
is
likely
to
converge
to
poor
local
optima
,
and
it
allocates
substates
uniformly
across
all
states
,
regardless
of
how
much
annotation
is
required
for
good
performance
.
To
avoid
these
problems
,
we
apply
a
hierarchical
parameter
estimation
strategy
similar
in
spirit
to
the
work
of
Sankar
(
1998
)
and
Ueda
et
al.
(
2000
)
,
but
here
applied
to
HMMs
rather
than
to
GMMs
.
Beginning
with
the
baseline
model
,
where
each
state
corresponds
to
one
phone
,
we
repeatedly
split
and
re-train
the
HMM
.
This
strategy
ensures
that
each
split
HMM
is
initialized
"
close
"
to
some
reasonable
maximum
.
Concretely
,
each
state
s
in
the
HMM
is
split
in
two
new
states
si
,
s2
with
n
(
si
)
=
n
(
s2
)
=
n
(
s
)
.
We
initialize
EM
with
the
parameters
of
the
previous
HMM
,
splitting
every
previous
state
s
in
two
and
adding
a
small
amount
of
randomness
e
&lt;
1
%
to
its
transition
and
emission
probabilities
to
break
symmetry
:
and
similarly
for
s2
.
The
incoming
transitions
are
split
evenly
.
We
then
apply
the
EM
algorithm
described
above
to
re-estimate
these
parameters
before
performing
subsequent
split
operations
.
Since
adding
substates
divides
HMM
statistics
into
many
bins
,
the
HMM
parameters
are
effectively
estimated
from
less
data
,
which
can
lead
to
overfitting
.
Therefore
,
it
would
be
to
our
advantage
to
split
sub
-
states
only
where
needed
,
rather
than
splitting
them
all
.
We
realize
this
goal
by
merging
back
those
splits
s
—
sis2
for
which
,
if
the
split
were
reversed
,
the
loss
in
data
likelihood
would
be
smallest
.
We
approximate
the
loss
in
data
likelihood
for
a
merge
si
s2
—
s
with
the
following
likelihood
ratio
(
Petrov
et
al.
,
2006
)
:
ttPW
)
nP
(
x
,
y
)
.
sequences
t
Here
P
(
x
,
y
)
is
the
joint
likelihood
of
an
emission
sequence
x
and
associated
state
sequence
y.
This
quantity
can
be
recovered
from
the
forward
and
backward
probabilities
using
P
(
x
,
y
)
=
Yl
at
(
s
)
•
Pt
(
s
)
.
Pt
(
x
,
y
)
is
an
approximation
to
the
same
joint
likelihood
where
states
si
and
s2
are
merged
.
We
approximate
the
true
loss
by
only
considering
merging
states
si
and
s2
at
time
t
,
a
value
which
can
be
efficiently
computed
from
the
forward
and
backward
probabilities
.
The
forward
score
for
the
merged
state
s
at
time
t
is
just
the
sum
of
the
two
split
scores
:
while
the
backward
score
is
a
weighted
sum
of
the
split
scores
:
where
pi
and
p2
are
the
relative
(
posterior
)
frequencies
of
the
states
si
and
s2
.
Thus
,
the
likelihood
after
merging
si
and
s2
at
time
t
can
be
computed
from
these
merged
forward
and
backward
scores
as
:
where
the
second
sum
is
over
the
other
substates
of
xt
,
i.e.
{
s
'
:
n
(
s
'
)
=
xt
,
s
'
£
{
si
,
s2
}
}
.
This
expression
is
an
approximation
because
it
neglects
interactions
between
instances
of
the
same
states
at
multiple
places
in
the
same
sequence
.
In
particular
,
since
phones
frequently
occur
with
multiple
consecutive
repetitions
,
this
criterion
may
vastly
overestimate
the
actual
likelihood
loss
.
As
such
,
we
also
implemented
the
exact
criterion
,
that
is
,
for
each
split
,
we
formed
a
new
HMM
with
si
and
s2
merged
and
calculated
the
total
data
likelihood
.
This
method
is
much
more
computationally
expensive
,
requiring
a
full
forward-backward
pass
through
the
data
for
each
potential
merge
,
and
was
not
found
to
produce
noticeably
better
performance
.
Therefore
,
all
experiments
use
the
approximate
criterion
.
2.4
The
Automatically-Aligned
Case
It
is
straightforward
to
generalize
the
hand-aligned
case
to
the
case
where
the
phone
transcription
is
known
,
but
no
frame
level
labeling
is
available
.
The
main
difference
is
that
the
phone
boundaries
are
not
known
in
advance
,
which
means
that
there
is
now
additional
uncertainty
over
the
phone
states
.
The
forward
and
backward
recursions
must
thus
be
expanded
to
consider
all
state
sequences
that
yield
the
given
phone
transcription
.
We
can
accomplish
this
with
standard
Baum-Welch
training
.
3
Inference
An
HMM
over
refined
subphone
states
s
£
S
naturally
gives
posterior
distributions
P
(
s
|
x
)
over
sequences
of
states
s.
We
would
ideally
like
to
extract
the
transcription
r
of
underlying
phones
which
is
most
probable
according
to
this
posterior1
.
The
transcription
is
two
stages
removed
from
s.
First
,
it
collapses
the
distinctions
between
states
s
which
correspond
to
the
same
phone
y
=
n
(
s
)
.
Second
,
it
collapses
the
distinctions
between
where
phone
transitions
exactly
occur
.
Viterbi
state
sequences
can
easily
be
extracted
using
the
basic
Viterbi
algorithm
.
On
the
other
hand
,
finding
the
best
phone
sequence
or
transcription
is
intractable
.
As
a
compromise
,
we
extract
the
phone
sequence
(
not
transcription
)
which
has
highest
probability
in
a
variational
approximation
to
the
true
distribution
(
Jordan
et
al.
,
1999
)
.
Let
the
true
posterior
distribution
over
phone
sequences
be
P
(
y
|
x
)
.
We
form
an
approximation
Q
(
y
)
~
P
(
y
|
x
)
,
where
Q
is
an
approximation
specific
to
the
sequence
x
and
factor
-
1
Remember
that
by
"
transcription
"
we
mean
a
sequence
of
phones
with
duplicates
removed
.
Q
(
y
)
=
II
q
(
t
,
xt
,
yt+i
)
.
We
would
like
to
fit
the
values
q
,
one
for
each
time
step
and
state-state
pair
,
so
as
to
make
Q
as
close
to
P
as
possible
:
min
KL
(
P
(
y
|
x
)
|
|
Q
(
y
)
)
.
The
solution
can
be
found
analytically
using
Lagrange
multipliers
:
q
(
t
y
y
'
)
=
P
(
Yt
=
y
,
Yt+i
=
y
'
|
x
)
q
(
,
y
,
y
)
=
P
(
Y
=
yx
)
.
where
we
have
made
the
position-specific
random
variables
Yt
explicit
for
clarity
.
This
approximation
depends
only
on
our
ability
to
calculate
posteriors
over
phones
or
phone-phone
pairs
at
individual
positions
t
,
which
is
easy
to
obtain
from
the
state
posteriors
,
for
example
:
Finding
the
Viterbi
phone
sequence
in
the
approximate
distribution
Q
,
can
be
done
with
the
Forward-Backward
algorithm
over
the
lattice
of
q
values
.
4
Experiments
We
tested
our
model
on
the
TIMIT
database
,
using
the
standard
setups
for
phone
recognition
and
phone
classification
.
We
partitioned
the
TIMIT
data
into
training
,
development
,
and
(
core
)
test
sets
according
to
standard
practice
(
Lee
and
Hon
,
1989
;
Gunawar-dana
et
al.
,
2005
;
Sha
and
Saul
,
2006
)
.
In
particular
,
we
excluded
all
sa
sentences
and
mapped
the
61
phonetic
labels
in
TIMIT
down
to
48
classes
before
training
our
HMMs
.
At
evaluation
,
these
48
classes
were
further
mapped
down
to
39
classes
,
again
in
the
standard
way
.
MFCC
coefficients
were
extracted
from
the
TIMIT
source
as
in
Sha
and
Saul
(
2006
)
,
including
delta
and
delta-delta
components
.
For
all
experiments
,
our
system
and
all
baselines
we
implemented
used
full
covariance
when
parameterizing
emission
Figure
3
:
Phone
recognition
error
for
models
of
increasing
size
models.2
All
Gaussians
were
endowed
with
weak
inverse
Wishart
priors
with
zero
mean
and
identity
covariance.3
4.1
Phone
Recognition
In
the
task
of
phone
recognition
,
we
fit
an
HMM
whose
output
,
with
subsequent
states
collapsed
,
corresponds
to
the
training
transcriptions
.
In
the
TIMIT
data
set
,
each
frame
is
manually
phone-annotated
,
so
the
only
uncertainty
in
the
basic
setup
is
the
identity
of
the
(
sub
)
states
at
each
frame
.
We
therefore
began
with
a
single
state
for
each
phone
,
in
a
fully
connected
HMM
(
except
for
special
treatment
of
dedicated
start
and
end
states
)
.
We
incrementally
trained
our
model
as
described
in
Section
2
,
with
up
to
6
split-merge
rounds
.
We
found
that
reversing
25
%
of
the
splits
yielded
good
overall
performance
while
maintaining
compactness
of
the
model
.
We
decoded
using
the
variational
decoder
described
in
Section
3
.
The
output
was
then
scored
against
the
reference
phone
transcription
using
the
standard
string
edit
distance
.
During
both
training
and
decoding
,
we
used
"
flattened
"
emission
probabilities
by
exponentiating
to
some
0
&lt;
y
&lt;
1
.
We
found
the
best
setting
for
7
to
be
0.2
,
as
determined
by
tuning
on
the
development
set
.
This
flattening
compensates
for
the
non
-
2Most
of
our
findings
also
hold
for
diagonal
covariance
Gaussians
,
albeit
the
final
error
rates
are
2-3
%
higher
.
3Following
previous
work
with
PCFGs
(
Petrov
et
al.
,
2006
)
,
we
experimented
with
smoothing
the
substates
towards
each
other
to
prevent
overfitting
,
but
we
were
unable
to
achieve
any
performance
gains
.
State-Tied
Triphone
HMM
Bayesian
Triphone
HMM
Table
1
:
Phone
recognition
error
rates
on
the
TIMIT
core
test
from
Glass
(
2003
)
.
1
These
results
are
on
a
slightly
easier
test
set
.
independence
of
the
frames
,
partially
due
to
overlapping
source
samples
and
partially
due
to
other
unmodeled
correlations
.
Figure
3
shows
the
recognition
error
as
the
model
grows
in
size
.
In
addition
to
the
basic
setup
described
so
far
(
split
and
merge
)
,
we
also
show
a
model
in
which
merging
was
not
performed
(
split
only
)
.
As
can
be
seen
,
the
merging
phase
not
only
decreases
the
number
of
HMM
states
at
each
round
,
but
also
improves
phone
recognition
error
at
each
round
.
We
also
compared
our
hierarchical
split
only
model
with
a
model
where
we
directly
split
all
states
into
2k
substates
,
so
that
these
models
had
the
same
number
of
states
as
a
a
hierarchical
model
after
k
split
and
merge
cycles
.
While
for
small
k
,
the
difference
was
negligible
,
we
found
that
the
error
increased
by
1
%
absolute
for
k
=
5
.
This
trend
is
to
be
expected
,
as
the
possible
interactions
between
the
substates
grows
with
the
number
of
substates
.
Also
shown
in
Figure
3
,
and
perhaps
unsurprising
,
is
that
the
error
rate
can
be
further
reduced
by
allowing
the
phone
boundaries
to
drift
from
the
manual
alignments
provided
in
the
TIMIT
training
data
.
The
split
and
merge
,
automatic
alignment
line
shows
the
result
of
allowing
the
EM
fitting
phase
to
reposition
each
phone
boundary
,
giving
absolute
improvements
ofupto0.6
%
.
We
investigated
how
much
improvement
in
accuracy
one
can
gain
by
computing
the
variational
approximation
introduced
in
Section
3
versus
extracting
the
Viterbi
state
sequence
and
projecting
that
sequence
to
its
phone
transcription
.
The
gap
varies
,
Error
Rate
Table
2
:
Phone
classification
error
rates
on
the
TIMIT
core
test
.
but
on
a
model
with
roughly
1000
states
(
5
split-merge
rounds
)
,
the
variational
decoder
decreases
error
from
26.5
%
to
25.6
%
.
The
gain
in
accuracy
comes
at
a
cost
in
time
:
we
must
run
a
(
possibly
pruned
)
Forward-Backward
pass
over
the
full
state
space
S
,
then
another
over
the
smaller
phone
space
Y.
In
our
experiments
,
the
cost
of
variational
decoding
was
a
factor
of
about
3
,
which
may
or
may
not
justify
a
relative
error
reduction
of
around
4
%
.
The
performance
of
our
best
model
(
split
and
merge
,
automatic
alignment
,
and
variational
decoding
)
on
the
test
set
is
26.4
%
.
A
comparison
of
our
performance
with
other
methods
in
the
literature
is
shown
in
Table
1
.
Despite
our
structural
simplicity
,
we
outperform
state-tied
triphone
systems
like
Young
and
Woodland
(
1994
)
,
a
standard
baseline
for
this
task
,
by
nearly
2
%
absolute
.
However
,
we
fall
short
of
the
best
current
systems
.
4.2
Phone
Classification
Phone
classification
is
the
fairly
constrained
task
of
classifying
in
isolation
a
sequence
of
frames
which
is
known
to
span
exactly
one
phone
.
In
order
to
quantify
how
much
of
our
gains
over
the
triphone
baseline
stem
from
modeling
context-dependencies
and
how
much
from
modeling
the
inner
structure
of
the
phones
,
we
fit
separate
HMM
models
for
each
phone
,
using
the
same
split
and
merge
procedure
as
above
(
though
in
this
case
only
manual
alignments
are
reasonable
because
we
test
on
manual
segmentations
)
.
For
each
test
frame
sequence
,
we
compute
the
likelihood
of
the
sequence
from
the
forward
probabilities
of
each
individual
phone
HMM
.
The
phone
giving
highest
likelihood
to
the
input
was
selected
.
The
error
rate
is
a
simple
fraction
of
test
phones
classified
correctly
.
Table
2
shows
a
comparison
of
our
performance
with
that
of
some
other
methods
in
the
literature
.
A
minimal
comparison
is
to
a
GMM
with
the
same
number
of
mixtures
per
phone
as
our
model
's
maxi
-
Hypothesis
O
oo
•
o
•
•
.
„
o
•
•
■
•
»
°
OO
°
=
'
o
□
o
o
.
vowels
/
semivowels
nasals
/
flaps
strong
fricatives
weak
fricatives
°
°
.
Figure
4
:
Phone
confusion
matrix
.
76
%
of
the
substitutions
fall
within
the
shown
classes
.
mum
substates
per
phone
.
While
these
models
have
the
same
number
of
total
Gaussians
,
in
our
model
the
Gaussians
are
correlated
temporally
,
while
in
the
GMM
they
are
independent
.
Enforcing
begin-middle-end
HMM
structure
(
see
HMM
Baseline
)
increases
accuracy
somewhat
,
but
our
more
general
model
clearly
makes
better
use
of
the
available
parameters
than
those
baselines
.
Indeed
,
our
best
model
achieves
a
surprising
performance
of
21.4
%
,
greatly
outperforming
other
generative
methods
and
achieving
performance
competitive
with
state-of-the-art
discriminative
methods
.
Only
the
recent
structured
margin
approach
of
Sha
and
Saul
(
2006
)
gives
a
better
performance
than
our
model
.
The
strength
of
our
system
on
the
classification
task
suggests
that
perhaps
it
is
modeling
phone-internal
structure
more
effectively
than
cross-phone
context
.
5
Analysis
While
the
overall
phone
recognition
and
classification
numbers
suggest
that
our
system
is
broadly
comparable
to
and
perhaps
in
certain
ways
superior
to
classical
approaches
,
it
is
illuminating
to
investigate
what
is
and
is
not
learned
by
the
model
.
Figure
4
gives
a
confusion
matrix
over
the
substitution
errors
made
by
our
model
.
The
majority
ofthe
Figure
5
:
Phone
contexts
and
subphone
structure
.
The
/
l
/
phone
after
3
split-merge
iterations
is
shown
.
confusions
are
within
natural
classes
.
Some
particularly
frequent
and
reasonable
confusions
arise
between
the
consonantal
/
r
/
and
the
vocalic
/
er
/
(
the
same
confusion
arises
between
/
l
/
and
/
el
/
,
but
the
standard
evaluation
already
collapses
this
distinction
)
,
the
reduced
vowels
/
ax
/
and
/
ix
/
,
the
voiced
and
voiceless
alveolar
sibilants
/
z
/
and
/
s
/
,
and
the
voiced
and
voiceless
stop
pairs
.
Other
vocalic
confusions
are
generally
between
vowels
and
their
corresponding
reduced
forms
.
Overall
,
76
%
of
the
substitutions
are
within
the
broad
classes
shown
in
the
figure
.
We
can
also
examine
the
substructure
learned
for
the
various
phones
.
Figure
2
shows
the
evolution
of
the
phone
/
ih
/
from
a
single
state
to
8
substates
during
split
/
merge
(
no
merges
were
chosen
for
this
phone
)
,
using
hand-alignment
of
phones
to
frames
.
These
figures
were
simplified
from
the
complete
state
transition
matrices
as
follows
:
(
1
)
adjacent
phones
'
substates
are
collapsed
,
(
2
)
adjacent
phones
are
selected
based
on
frequency
and
inbound
probability
(
and
forced
to
be
the
same
across
figures
)
,
(
3
)
infrequent
arcs
are
suppressed
.
In
the
first
split
,
(
b
)
,
a
sonorant
/
non-sonorant
distinction
is
learned
over
adjacent
phones
,
along
with
a
state
chain
which
captures
basic
duration
(
a
self-looping
state
gives
an
exponential
model
of
duration
;
the
sum
of
two
such
states
is
more
expressive
)
.
Note
that
the
nat
-
ural
classes
interact
with
the
chain
in
a
way
which
allows
duration
to
depend
on
context
.
In
further
refinements
,
more
structure
is
added
,
including
a
two-track
path
in
(
d
)
where
one
track
captures
the
distinct
effects
on
higher
formants
of
r-coloring
and
nasalization
.
Figure
5
shows
the
corresponding
diagram
for
/
l
/
,
where
some
merging
has
also
occurred
.
Different
natural
classes
emerge
in
this
case
,
with
,
for
example
,
preceding
states
partitioned
into
front
/
high
vowels
vs.
rounded
vowels
vs.
other
vowels
vs.
consonants
.
Following
states
show
a
front
/
back
distinction
and
a
consonant
distinction
,
and
the
phone
/
m
/
is
treated
specially
,
largely
because
the
/
lm
/
sequence
tends
to
shorten
the
/
l
/
substantially
.
Note
again
how
context
,
internal
structure
,
and
duration
are
simultaneously
modeled
.
Of
course
,
it
should
be
emphasized
that
post
hoc
analysis
of
such
structure
is
a
simplification
and
prone
to
seeing
what
one
expects
;
we
present
these
examples
to
illustrate
the
broad
kinds
of
patterns
which
are
detected
.
As
a
final
illustration
of
the
nature
of
the
learned
models
,
Table
3
shows
the
number
of
substates
allocated
to
each
phone
by
the
split
/
merge
process
(
the
maximum
is
32
for
this
stage
)
for
the
case
of
hand-aligned
(
left
)
as
well
as
automatically-aligned
(
right
)
phone
boundaries
.
Interestingly
,
in
the
hand-aligned
case
,
the
vowels
absorb
most
of
the
complexity
since
many
consonantal
cues
are
heavily
evidenced
on
adjacent
vowels
.
However
,
in
the
automatically-aligned
case
,
many
vowel
frames
with
substantial
consontant
coloring
are
re-allocated
to
those
adjacent
consonants
,
giving
more
complex
consonants
,
but
comparatively
less
complex
vowels
.
6
Conclusions
We
have
presented
a
minimalist
,
automatic
approach
for
building
an
accurate
acoustic
model
for
phonetic
classification
and
recognition
.
Our
model
does
not
require
any
a
priori
phonetic
bias
or
manual
specification
of
structure
,
but
rather
induces
the
structure
in
an
automatic
and
streamlined
fashion
.
Starting
from
a
minimal
monophone
HMM
,
we
automatically
learn
models
that
achieve
highly
competitive
performance
.
On
the
TIMIT
phone
recognition
task
our
model
clearly
outperforms
standard
state-tied
triphone
models
like
Young
and
Woodland
(
1994
)
.
For
phone
classification
,
our
model
Consonants
Table
3
:
Number
of
substates
allocated
per
phone
.
The
left
column
gives
the
number
of
substates
allocated
when
training
on
manually
aligned
training
sequences
,
while
the
right
column
gives
the
number
allocated
when
we
automatically
determine
phone
boundaries
.
achieves
performance
competitive
with
the
state-of-the-art
discriminative
methods
(
Sha
and
Saul
,
2006
)
,
despite
being
generative
in
nature
.
This
result
together
with
our
analysis
of
the
context-dependencies
and
substructures
that
are
being
learned
,
suggests
that
our
model
is
particularly
well
suited
for
modeling
phone-internal
structure
.
It
does
,
of
course
remain
to
be
seen
if
and
how
these
benefits
can
be
scaled
to
larger
systems
.
