This
paper
presents
a
new
unsupervised
algorithm
(
WordEnds
)
for
inferring
word
boundaries
from
transcribed
adult
conversations
.
Phone
ngrams
before
and
after
observed
pauses
are
used
to
bootstrap
a
simple
discriminative
model
of
boundary
marking
.
This
fast
algorithm
delivers
high
performance
even
on
morphologically
complex
words
in
English
and
Arabic
,
and
promising
results
on
accurate
phonetic
transcriptions
with
extensive
pronunciation
variation
.
Expanding
training
data
beyond
the
traditional
miniature
datasets
pushes
performance
numbers
well
above
those
previously
reported
.
This
suggests
that
WordEnds
is
a
viable
model
of
child
language
acquisition
and
might
be
useful
in
speech
understanding
.
1
Introduction
Words
are
essential
to
most
models
of
language
and
speech
understanding
.
Word
boundaries
define
the
places
at
which
speakers
can
fluently
pause
,
and
limit
the
application
of
most
phonological
rules
.
Words
are
a
key
constituent
in
structural
analyses
:
the
output
of
morphological
rules
and
the
constituents
in
syntactic
parsing
.
Most
speech
recognizers
are
word-based
.
And
,
words
are
entrenched
in
the
writing
systems
of
many
languages
.
Therefore
,
it
is
generally
accepted
that
children
learning
their
first
language
must
learn
how
to
segment
speech
into
a
sequence
of
words
.
Similar
,
but
more
limited
,
learning
occurs
when
adults
hear
speech
containing
unfamiliar
words
.
These
words
must
be
accurately
delimited
,
so
that
they
can
be
added
to
the
lexicon
and
nearby
familiar
words
recognized
correctly
.
Current
speech
recognizers
typically
misinterpret
such
speech
.
This
paper
will
consider
algorithms
which
segment
phonetically
transcribed
speech
into
words
.
For
example
,
Figure
1
shows
a
transcribed
phrase
from
the
Buckeye
corpus
(
Pitt
et
al.
,
2005
;
Pitt
et
al.
,
2007
)
and
the
automatically
segmented
output
.
Like
almost
all
previous
researchers
,
I
use
human-transcribed
input
to
work
around
the
limitations
of
current
speech
recognizers
.
In
most
available
datasets
,
words
are
transcribed
using
standard
dictionary
pronunciations
(
henceforth
"
dictionary
transcriptions
"
)
.
These
transcriptions
are
approximately
phonemic
and
,
more
importantly
,
assign
a
constant
form
to
each
word
.
I
will
also
use
one
dataset
with
accurate
phonetic
transcriptions
,
including
natural
variation
in
the
pronunciation
of
words
.
Handling
this
variation
is
an
important
step
towards
eventually
using
phone
lattices
or
features
produced
by
real
speech
recognizers
.
This
paper
will
focus
on
segmentation
of
speech
between
adults
.
This
is
the
primary
input
for
speech
recognizers
.
Moreover
,
understanding
such
speech
is
the
end
goal
of
child
language
acquisition
.
Models
tested
only
on
simplified
child-directed
speech
are
incomplete
without
an
algorithm
for
upgrading
the
understander
to
handle
normal
adult
speech
.
2
The
task
in
more
detail
This
paper
uses
a
simple
model
of
the
segmentation
task
,
which
matches
prior
work
and
the
available
datasets
.
Possible
enhancements
to
the
model
are
discussed
at
the
end
.
REAL
:
ohlThikidsinner
#
ahrpiyp
@
lThA
?
HAvkids
#
ohrThADurHAviynqkids
DICT
:
ahlThiykidzinTher
#
ahrpiyp
@
lThAtHAvkidz
#
owrThAtahrHAvinqkidz
Figure
1
:
Part
of
Buckeye
corpus
dialog
2101a
,
in
accurate
phonetic
transcription
(
REAL
)
and
dictionary
pronunciations
(
DICT
)
.
Both
use
modified
arpabet
,
with
#
marking
pauses
.
Notice
the
two
distinct
pronunciations
of
"
that
"
in
the
accurate
transcription
.
Automatically
inserted
word
boundaries
are
shown
at
bottom
.
This
paper
considers
only
languages
with
an
established
tradition
of
words
,
e.g.
not
Chinese
.
I
assume
that
the
authors
of
each
corpus
have
given
us
reasonable
phonetic
transcriptions
and
word
boundaries
.
The
datasets
are
informal
conversations
in
which
debatable
word
segmentations
are
rare
.
The
transcribed
data
is
represented
as
a
sequence
of
phones
,
with
neither
prosodic
/
stress
information
nor
feature
representations
for
the
phones
.
These
phone
sequences
are
presented
to
segmentation
algorithms
as
strings
of
ASCII
characters
.
Large
phonesets
may
be
represented
using
capital
letters
and
punctuation
or
,
more
readably
,
using
multicharacter
phone
symbols
.
Well-designed
(
e.g.
easily
decodable
)
multi-character
codes
do
not
affect
the
algorithms
or
evaluation
metrics
in
this
paper
.
Testing
often
also
uses
orthographic
datasets
.
Finally
,
the
transcriptions
are
divided
into
"
phrases
"
at
pauses
in
the
speech
signal
(
silences
,
breaths
,
etc
)
.
These
pause
phrases
are
not
necessarily
syntactic
or
prosodic
constituents
.
Disfluencies
in
conversational
speech
create
pauses
where
you
might
not
expect
them
,
e.g.
immediately
following
the
definite
article
(
Clark
and
Wasow
,
1998
;
Fox
Tree
and
Clark
,
1997
)
.
Therefore
,
I
have
chosen
corpora
in
which
pauses
have
been
marked
carefully
.
2.2
Affixes
and
syllables
A
theory
of
word
segmentation
must
explain
how
affixes
differ
from
free-standing
function
words
.
For
example
,
we
must
explain
why
English
speakers
consider
"
the
"
to
be
a
word
,
but
"
-
ing
"
to
be
an
affix
,
although
neither
occurs
by
itself
in
fluent
prepared
English
.
We
must
also
explain
why
the
Arabic
determiner
"
Al
-
"
is
not
a
word
,
though
its
syntactic
and
semantic
role
seems
similar
to
English
"
the
"
.
Viewed
another
way
,
we
must
show
how
to
estimate
the
average
word
length
.
Conversational
English
has
short
words
(
about
3
phones
)
,
because
most
grammatical
morphemes
are
free-standing
.
Languages
with
many
affixes
have
longer
words
,
e.g.
my
Arabic
data
averages
5.6
phones
per
word
.
Pauses
are
vital
for
deciding
what
is
an
affix
.
Attempts
to
segment
transcriptions
without
pauses
,
e.g.
(
Christiansen
et
al.
,
1998
)
,
have
worked
poorly
.
Claims
that
humans
can
extract
words
without
pauses
seem
to
be
based
on
psychological
experiments
such
as
(
Saffran
,
2001
;
Jusczyk
and
Aslin
,
1995
)
which
conflate
words
and
morphemes
.
Even
then
,
explicit
boundaries
seem
to
improve
performance
(
Seidl
and
Johnson
,
2006
)
.
Another
significant
part
of
this
task
is
finding
syllable
boundaries
.
For
English
,
many
phone
strings
have
multiple
possible
syllabifications
.
Because
words
average
only
1.26
syllables
,
segmenting
pre-syllabified
input
has
a
very
high
baseline
:
100
%
precision
and
80
%
recall
of
boundary
positions
.
2.3
Algorithm
testing
Unsupervised
algorithms
are
presented
with
the
transcription
,
divided
only
at
phrase
boundaries
.
Their
task
is
to
infer
the
phrase-internal
word
boundaries
.
The
primary
worry
in
testing
is
that
development
may
have
biased
the
algorithm
towards
a
particular
language
,
speaking
style
,
and
/
or
corpus
size
.
Addressing
this
requires
showing
that
different
corpora
can
be
handled
with
a
common
set
of
parameter
settings
.
Therefore
a
test
/
training
split
within
one
corpus
serves
little
purpose
and
is
not
standard
.
Supervised
algorithms
are
given
training
data
with
all
word
boundaries
marked
,
and
must
infer
word
boundaries
in
a
separate
test
set
.
Simple
supervised
algorithms
perform
extremely
well
(
Cairns
et
al.
,
1997
;
Teahan
et
al.
,
2000
)
,
but
don
't
address
our
main
goal
:
learning
how
to
segment
.
Notice
that
phrase
boundaries
are
not
randomly
selected
word
boundaries
.
Syntactic
and
communicative
constraints
make
pauses
more
likely
at
certain
positions
than
others
.
Therefore
,
the
"
supervised
"
algorithms
for
this
task
train
on
a
representative
set
of
word
boundaries
whereas
"
unsupervised
"
algorithms
train
on
a
biased
set
of
word
boundaries
.
Moreover
,
supplying
all
the
word
boundaries
for
even
a
small
amount
of
data
effectively
tells
the
supervised
algorithms
the
average
word
length
,
a
parameter
which
is
otherwise
not
easy
to
estimate
.
Standard
evaluation
metrics
include
the
precision
,
recall
and
F-score
(footnote 1)
of
the
phrase-internal
boundaries
(
BP
,
BR
,
BF
)
,
of
the
extracted
word
tokens
(
WP
,
WR
,
WF
)
,
and
of
the
resulting
lexicon
of
word
types
(
LP
,
LR
,
LF
)
.
Outputs
don
't
look
good
until
BF
is
at
least
90
%
.
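The three families of metrics above can be illustrated with a minimal sketch. It assumes each phrase is given as a list of word strings for both the reference and the hypothesized segmentation; the function names (`boundaries`, `spans`, `prf`, `evaluate`) are illustrative and do not come from the paper.

```python
def boundaries(words):
    """Offsets of phrase-internal boundaries for one segmented phrase."""
    cuts, pos = set(), 0
    for w in words[:-1]:          # the phrase end is not phrase-internal
        pos += len(w)
        cuts.add(pos)
    return cuts


def spans(words):
    """(start, end) character spans of the word tokens in one phrase."""
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out


def prf(n_correct, n_predicted, n_gold):
    """Precision, recall and F = 2PR / (P + R)."""
    p = n_correct / n_predicted if n_predicted else 0.0
    r = n_correct / n_gold if n_gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def evaluate(gold_phrases, pred_phrases):
    """Return (P, R, F) triples for boundaries, word tokens and lexicon."""
    b = [0, 0, 0]                 # correct, predicted, gold boundaries
    w = [0, 0, 0]                 # correct, predicted, gold word tokens
    gold_lex, pred_lex = set(), set()
    for gold, pred in zip(gold_phrases, pred_phrases):
        gb, pb = boundaries(gold), boundaries(pred)
        b[0] += len(gb & pb); b[1] += len(pb); b[2] += len(gb)
        gs, ps = spans(gold), spans(pred)
        w[0] += len(gs & ps); w[1] += len(ps); w[2] += len(gs)
        gold_lex.update(gold)
        pred_lex.update(pred)
    lex = prf(len(gold_lex & pred_lex), len(pred_lex), len(gold_lex))
    return {"boundary": prf(*b), "word": prf(*w), "lexicon": lex}
```

For the reference `a fun joke` and the hypothesis `afun joke`, the one hypothesized boundary is correct (BP = 1.0) but only one of the two reference boundaries is found (BR = 0.5), and only one of the two hypothesized word tokens matches a reference token (WP = 0.5).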
3
Previous
work
Learning
to
segment
words
is
an
old
problem
,
with
extensive
prior
work
surveyed
in
(
Batchelder
,
2002
;
Brent
and
Cartwright
,
1996
;
Cairns
et
al.
,
1997
;
Goldwater
,
2006
;
Hockema
,
2006
;
Rytting
,
2007
)
.
There
are
two
major
approaches
.
Phonotactic
methods
model
which
phone
sequences
are
likely
within
words
and
which
occur
primarily
across
or
adjacent
to
word
boundaries
.
Language
modelling
methods
build
word
ngram
models
,
like
those
used
in
speech
recognition
.
Statistical
criteria
define
the
"
best
"
model
fitting
the
input
data
.
In
both
cases
,
details
are
complex
and
variable
.
3.1
Phonotactic
Methods
Supervised
phonotactic
methods
date
back
at
least
to
(
Lamel
and
Zue
,
1984
)
,
see
also
(
Harrington
et
al.
,
1989
)
.
Statistics
of
phone
trigrams
provide
sufficient
information
to
segment
adult
conversational
speech
(
dictionary
transcriptions
with
simulated
phonology
)
with
about
90
%
precision
and
93
%
recall
(
Cairns
et
al.
,
1997
)
,
see
also
(
Hockema
,
2006
)
.
Teahan
et
al.
's
compression-based
model
(
2000
)
achieves
BF
over
99
%
on
orthographic
English
.
Segmentation
by
adults
is
sensitive
to
phonotactic
constraints
(
McQueen
,
1998
;
Weber
,
2000
)
.
To
build
unsupervised
algorithms
,
Brent
and
Cartwright
suggested
(
1996
)
inferring
phonotactic
constraints
from
phone
sequences
observed
at
phrase
boundaries
.
[Footnote 1: $F = \frac{2PR}{P+R}$, where P is the precision and R is the recall.]
However
,
experimental
results
are
poor
.
Early
results
using
neural
nets
by
Cairns
et
al.
(
1997
)
and
Christiansen
et
al.
(
1998
)
are
discouraging
.
Rytting
(
2007
)
seems
to
have
the
best
result
:
61.0
%
boundary
recall
with
60.3
%
precision
(footnote 2)
on
26K
words
of
modern
Greek
data
,
average
word
length
4.4
phones
.
This
algorithm
used
mutual
information
plus
phrase-final
2-phone
sequences
.
He
obtained
similar
results
(
Rytting
,
2004
)
using
phrase-final
3-phone
sequences
.
Word
segmentation
experiments
by
Christiansen
and
Allen
(
1997
)
and
Harrington
et
al.
(
1989
)
simulated
the
effects
of
pronunciation
variation
and
/
or
recognizer
error
.
Rytting
(
2007
)
uses
actual
speech
recognizer
output
.
These
experiments
broke
useful
new
ground
,
but
poor
algorithm
performance
(
BF
<
50
%
even
on
dictionary
transcriptions
)
makes
it
hard
to
draw
conclusions
from
their
results
.
3.2
Language
modelling
methods
So
far
,
language
modelling
methods
have
been
more
effective
.
Brent
(
1999
)
and
Venkataraman
(
2001
)
present
incremental
splitting
algorithms
with
BF
about
82
%
(footnote 3)
on
the
Bernstein-Ratner
(
BR87
)
corpus
of
infant-directed
English
with
disfluencies
and
interjections
removed
(
Bernstein
Ratner
,
1987
;
Brent
,
1999
)
.
Batchelder
(
2002
)
achieved
almost
identical
results
using
a
clustering
algorithm
.
The
most
recent
algorithm
(
Goldwater
,
2006
)
achieves
a
BF
of
85.8
%
using
a
Dirichlet
Process
bigram
model
,
estimated
using
a
Gibbs
sampling
algorithm
(footnote 4)
.
Language
modelling
methods
incorporate
a
bias
towards
re-using
hypothesized
words
.
This
suggests
they
should
systematically
segment
morphologically
complex
words
,
so
as
to
exploit
the
structure
they
share
with
other
words
.
Goldwater
,
the
only
author
to
address
this
issue
explicitly
,
reports
that
her
algorithm
breaks
off
common
affixes
(
e.g.
"
ing
"
,
"
s
"
)
.
Batchelder
reports
a
noticeable
drop
in
performance
on
Japanese
data
,
which
might
relate
to
its
more
complex
words
(
average
4.1
phones
)
.
[Footnote 2: These numbers have been adjusted so as not to include boundaries between phrases.]
[Footnote 3: Numbers are from Goldwater's (2006) replication.]
[Footnote 4: Goldwater numbers are from the December 2007 version of her code, with its suggested parameter values: $\alpha_0 = 3000$, $\alpha_1 = 300$, $p_\# = 0.2$.]
4
The
new
approach
Previous
algorithms
have
modelled
either
whole
words
or
very
short
(
e.g.
2-3
)
phone
sequences
.
The
new
approach
proposed
in
this
paper
,
"
lexicalized
phonotactics
,
"
models
extended
sequences
of
phones
at
the
starts
and
ends
of
word
sequences
.
This
allows
a
new
algorithm
,
called
WordEnds
,
to
successfully
mark
word
boundaries
with
a
simple
local
classifier
.
This
method
models
sequences
of
phones
that
start
or
end
at
a
word
boundary
.
When
words
are
long
,
such
a
sequence
may
cover
only
part
of
the
word
e.g.
a
group
of
suffixes
or
a
suffix
plus
the
end
of
the
stem
.
A
sequence
may
also
include
parts
of
multiple
short
words
,
capturing
some
simple
bits
of
syntax
.
These
longer
sequences
capture
not
only
purely
phonotactic
constraints
,
but
also
information
about
the
inventory
of
lexical
items
.
This
improves
handling
of
complex
,
messy
inputs
.
(
Cf.
Ando
and
Lee
's
(
2000
)
kanji
segmenter
.
)
On
the
other
hand
,
modelling
only
partial
words
helps
the
segmenter
handle
long
,
infrequent
words
.
Long
words
are
typically
created
by
productive
morphology
and
,
thus
,
often
start
and
end
just
like
other
words
.
Only
32
%
of
words
in
Switchboard
occur
both
before
and
after
pauses
,
but
many
of
the
other
68
%
have
similar-looking
beginnings
or
endings
.
Given
an
inter-character
position
in
a
phrase
,
its
right
and
left
contexts
are
the
character
sequences
to
its
right
and
left
.
By
convention
,
phrases
input
to
WordEnds
are
padded
with
a
single
blank
at
each
end
.
So
the
middle
position
of
the
phrase
"
afunjoke
"
has
right
context
"
joke␣
"
and
left
context
"
␣afun
.
"
Since
this
is
a
word
boundary
,
the
right
context
looks
like
the
start
of
a
real
word
sequence
,
and
the
left
context
looks
like
the
end
of
one
.
This
is
not
true
for
the
immediately
previous
position
,
which
has
right
context
"
njoke␣
"
and
left
context
"
␣afu
.
"
Boundaries
will
be
marked
where
the
right
and
left
contexts
look
like
what
we
have
observed
at
the
starts
and
ends
of
phrases
.
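The context definition can be made concrete with a small sketch. It assumes the padding blank is an ordinary space character and that a position is identified by the number of phrase characters to its left; the helper name `contexts` is illustrative only.

```python
PAD = " "  # the single blank added at each end of a phrase


def contexts(phrase, i):
    """Left and right contexts of the inter-character position that has
    `i` characters of the (unpadded) phrase to its left."""
    padded = PAD + phrase + PAD
    k = i + len(PAD)
    return padded[:k], padded[k:]


left, right = contexts("afunjoke", 4)   # the middle position, a true boundary
prev_left, prev_right = contexts("afunjoke", 3)  # one position earlier
```

Here `left` is the blank followed by `afun` and `right` is `joke` followed by the blank, matching the example in the text; at the previous position the contexts are the blank plus `afu` and `njoke` plus the blank.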
4.2
Statistical
model
To
formalize
this
,
consider
a
fixed
inter-character
position
in
a
phrase
.
It
may
be
a
word
boundary
(
b
)
or
not
(
¬b
)
.
Let
r
and
l
be
its
right
and
left
contexts
.
The
input
data
will
(
see
Section
4.3
)
give
us
P
(
b
|
r
)
and
P
(
b
|
l
)
.
Deciding
whether
to
mark
a
boundary
at
this
position
requires
estimating
P
(
b
|
r
,
l
)
.
To
express
P
(
b
|
r
,
l
)
in
terms
of
P
(
b
|
l
)
and
P
(
b
|
r
)
,
I
will
assume
that
r
and
l
are
conditionally
independent
given
b.
This
corresponds
roughly
to
a
unigram
language
model
.
Let
P
(
b
)
be
the
probability
of
a
boundary
at
a
random
inter-character
position
.
I
will
assume
that
the
average
word
length
,
and
therefore
P
(
b
)
,
is
not
absurdly
small
or
large
.
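Under this assumption, Bayes' rule gives
$$P(b \mid r, l) = \frac{P(b \mid r)\,P(b \mid l)}{Q\,P(b)}, \qquad Q = \frac{P(r \mid l)}{P(r)} = \frac{P(b \mid r)\,P(b \mid l)}{P(b)} + \frac{P(\neg b \mid r)\,P(\neg b \mid l)}{P(\neg b)} .$$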
Q
is
typically
not
1
,
because
a
right
and
left
context
often
co-occur
simply
because
they
both
tend
to
occur
at
boundaries
.
Contexts
that
occur
primarily
inside
words
(
e.g.
not
at
a
syllable
boundary
)
often
restrict
the
adjacent
context
,
violating
conditional
independence
given
¬b
.
However
,
in
these
cases
,
P
(
b
|
r
)
and
/
or
P
(
b
|
l
)
will
be
very
low
,
so
P
(
b
|
r
,
l
)
will
be
very
low
.
So
(
correctly
)
no
boundary
will
be
marked
.
Thus
,
we
can
compute
P
(
b
|
r
,
l
)
from
P
(
b
|
r
)
,
P
(
b
|
l
)
,
and
P
(
b
)
.
A
boundary
is
marked
if
P
(
b
|
r
,
l
)
>
0.5
.
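As a worked illustration of this decision rule, the sketch below combines the two context probabilities. It assumes conditional independence of r and l given b and given ¬b, under which Bayes' rule yields P(b | r, l) = A / (A + B) with A = P(b | r) P(b | l) / P(b) and B = P(¬b | r) P(¬b | l) / P(¬b). The function names are illustrative, and the code expects 0 < P(b) < 1.

```python
def boundary_posterior(p_b_r, p_b_l, p_b):
    """P(b | r, l) from P(b | r), P(b | l) and the prior P(b).

    Assumes r and l are conditionally independent given b and given not-b,
    and that 0 < p_b < 1.
    """
    yes = p_b_r * p_b_l / p_b                      # evidence for a boundary
    no = (1 - p_b_r) * (1 - p_b_l) / (1 - p_b)     # evidence against
    q = yes + no                                   # the normalizer Q
    return yes / q if q > 0 else 0.0


def mark_boundary(p_b_r, p_b_l, p_b):
    """Decision rule from the text: mark a boundary if P(b | r, l) > 0.5."""
    return boundary_posterior(p_b_r, p_b_l, p_b) > 0.5
```

With uninformative inputs (all three values 0.5) the posterior stays at 0.5. If either context probability is zero the posterior is zero, and if both are high the posterior is close to 1 largely regardless of the prior, which is the behaviour the bootstrapping phase relies on.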
4.3
Estimating
context
probabilities
Estimation
of
P
(
b
|
r
)
and
P
(
b
|
l
)
uses
a
simple
ngram
backoff
algorithm
.
The
details
will
be
shown
for
P
(
b
|
l
)
.
P
(
b
|
r
)
is
similar
.
Suppose
for
the
moment
that
word
boundaries
are
marked
.
The
left
context
l
might
be
very
long
and
unusual
.
So
we
will
estimate
its
statistics
using
a
shorter
lefthand
neighborhood
l'
.
P
(
b
|
l
)
is
then
estimated
as
the
number
of
times
l
'
occurs
before
a
boundary
,
divided
by
the
total
number
of
times
l
'
occurs
in
the
corpus
.
The
suffix
l
'
is
chosen
to
be
the
longest
suffix
of
l
which
occurs
at
least
10
times
in
the
corpus
,
i.e.
often
enough
for
a
reliable
estimate
in
the
presence
of
noise
(footnote 5)
.
[Table 1: Key parameters for each test dataset include the language, transcription method, number of words (small, medium, large subsets), average phones per word, average words per phrase, and percent of word types that occur only once (hapax). Phones/word is replaced by characters/word for the orthographic corpus.]
l
'
may
cross
word
boundaries
and
,
if
our
position
is
near
a
pause
,
may
contain
the
blank
at
the
lefthand
end
of
the
phrase
.
The
length
of
l
'
is
limited
to
Nmax
characters
to
reduce
overfitting
.
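The backoff estimate for the left context can be sketched as follows, under the supposition made above that word boundaries are marked. The sketch assumes each phrase is a list of word strings, that the phrase end counts as a boundary, and that the blank pad is a space; the function names and the `min_count` parameter name are illustrative, while the value 10 and the single-character fallback come from the text.

```python
from collections import Counter


def left_ngram_counts(phrases, n_max):
    """Count every left-hand ngram (length <= n_max) and how often it
    ends immediately before a boundary."""
    total, before_boundary = Counter(), Counter()
    for words in phrases:
        text = " " + "".join(words)       # blank pad at the left end
        cuts, pos = set(), 1
        for w in words:
            pos += len(w)
            cuts.add(pos)                 # word ends, including the phrase end
        for k in range(2, len(text) + 1): # position with k padded chars to its left
            for n in range(1, min(n_max, k) + 1):
                gram = text[k - n:k]
                total[gram] += 1
                if k in cuts:
                    before_boundary[gram] += 1
    return total, before_boundary


def p_boundary_given_left(left, total, before_boundary, n_max, min_count=10):
    """Estimate P(b | l) from the longest suffix of `left` (at most n_max
    characters) seen at least `min_count` times; a single character is
    used if no suffix is frequent enough."""
    for n in range(min(n_max, len(left)), 0, -1):
        gram = left[-n:]
        if n == 1 or total[gram] >= min_count:
            return before_boundary[gram] / total[gram] if total[gram] else 0.0
    return 0.0
```

The estimate for the right context would be symmetric, using prefixes of r and counts of ngrams that start immediately after a boundary.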
Unfortunately
,
our
input
data
has
boundaries
only
at
pauses
(
#
)
.
So
applying
this
method
to
the
raw
input
data
produces
estimates
of
P
(
#
|
r
)
and
P
(
#
|
l
)
.
Because
phrase
boundaries
are
not
a
representative
selection
of
word
boundaries
,
P
(
#
|
r
)
and
P
(
#
|
l
)
are
not
good
estimates
of
P
(
b
|
r
)
and
P
(
b
|
l
)
.
Moreover
,
initially
,
we
don
't
know
P
(
b
)
.
Therefore
,
WordEnds
bootstraps
the
estimation
using
a
binary
model
of
the
relationship
between
word
and
phrase
boundaries
.
To
a
first
approximation
,
an
ngram
occurs
at
the
end
of
a
phrase
if
and
only
if
it
can
occur
at
the
end
of
a
word
.
Since
the
magnitude
of
P
(
#
|
l
)
isn
't
helpful
,
we
simply
check
whether
it
is
zero
and
,
accordingly
,
set
P
(
b
|
l
)
to
either
zero
or
a
constant
,
very
high
value
.
In
fact
,
real
data
contains
phrase
endings
corrupted
by
disfluencies
,
foreign
words
,
etc.
So
WordEnds
actually
sets
P
(
b
|
l
)
high
only
if
P
(
#
|
l
)
is
above
a
threshold
(
currently
0.003
)
chosen
to
reflect
the
expected
amount
of
corruption
.
In
the
equations
from
Section
4.2
,
if
either
P
(
b
|
r
)
or
P
(
b
|
l
)
is
zero
,
then
P
(
b
|
r
,
l
)
is
zero
.
If
both
values
are
very
high
,
then
Q
is
$P(b \mid r)\,P(b \mid l) / P(b) + \epsilon$
,
with
$\epsilon$
very
small
.
So
P
(
b
|
r
,
l
)
is
close
to
1
.
So
,
in
the
bootstrapping
phase
,
the
test
for
marking
a
boundary
is
independent
of
P
(
b
)
and
reduces
to
testing
whether
P
(
#
|
r
)
and
P
(
#
|
l
)
are
both
over
threshold
.
So
,
WordEnds
estimates
P
(
#
|
r
)
and
P
(
#
|
l
)
from
the
input
data
,
then
uses
this
bootstrapping
method
(
Nmax
=
5
)
(footnote 6)
to
infer
preliminary
word
boundaries
.
[Footnote 5: A single character is used if no suffix occurs 10 times.]
The
preliminary
boundaries
are
used
to
estimate
P
(
b
)
and
to
re-estimate
P
(
b
|
r
)
and
P
(
b
|
l
)
,
using
Nmax
=
4
.
Final
boundaries
are
then
marked
.
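The bootstrapping phase alone can be sketched as below. This is a simplified illustration, not the author's implementation: it assumes phrases are plain strings with pauses already removed, uses a space as the blank pad, handles right contexts by reversing the phrases, and omits the later re-estimation with P(b) and Nmax = 4. The threshold 0.003, the minimum count of 10 and Nmax = 5 are the values quoted in the text; all function names are illustrative.

```python
from collections import Counter

THRESHOLD = 0.003   # pause-probability threshold quoted in the text
MIN_COUNT = 10      # minimum ngram count for a reliable estimate


def pause_counts(phrases, n_max):
    """Count left-hand ngrams and how often each ends right before a pause."""
    total, at_pause = Counter(), Counter()
    for phrase in phrases:
        text = " " + phrase                       # blank pad at the left end
        for k in range(2, len(text) + 1):
            for n in range(1, min(n_max, k) + 1):
                gram = text[k - n:k]
                total[gram] += 1
                if k == len(text):                # ngram is phrase-final
                    at_pause[gram] += 1
    return total, at_pause


def pause_prob(context, total, at_pause, n_max):
    """Backoff estimate of P(# | context) from the longest frequent suffix."""
    for n in range(min(n_max, len(context)), 0, -1):
        gram = context[-n:]
        if n == 1 or total[gram] >= MIN_COUNT:
            return at_pause[gram] / total[gram] if total[gram] else 0.0
    return 0.0


def bootstrap_segment(phrases, n_max=5):
    """Mark a boundary wherever P(# | r) and P(# | l) both exceed THRESHOLD."""
    l_tot, l_pause = pause_counts(phrases, n_max)
    # Reversing each phrase turns phrase-initial ngrams into phrase-final ones.
    r_tot, r_pause = pause_counts([p[::-1] for p in phrases], n_max)
    result = []
    for phrase in phrases:
        words, start = [], 0
        for i in range(1, len(phrase)):
            left = " " + phrase[:i]
            right_reversed = " " + phrase[i:][::-1]
            if (pause_prob(left, l_tot, l_pause, n_max) > THRESHOLD and
                    pause_prob(right_reversed, r_tot, r_pause, n_max) > THRESHOLD):
                words.append(phrase[start:i])
                start = i
        words.append(phrase[start:])
        result.append(words)
    return result
```

On a toy corpus in which `the` and `dog` each occur as whole phrases as well as inside `thedog` and `dogthe`, the only positions whose left context has been seen before a pause and whose right context has been seen after one are the true word boundaries.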
In
a
full
understanding
system
,
output
of
the
word
segmenter
would
be
passed
to
morphological
and
local
syntactic
processing
.
Because
the
segmenter
is
myopic
,
certain
errors
in
its
output
would
be
easier
to
fix
with
the
wider
perspective
available
to
this
later
processing
.
Because
standard
models
of
morphological
learning
don
't
address
the
interaction
with
word
segmentation
,
WordEnds
does
a
simple
version
of
this
repair
process
using
a
placeholder
algorithm
called
Mini-morph
.
Mini-morph
fixes
two
types
of
defects
in
the
segmentation
.
Short
fragments
are
created
when
two
nearby
boundaries
represent
alternative
reasonable
segmentations
rather
than
parts
of
a
common
segmentation
.
For
example
,
"
treestake
"
has
potential
boundaries
both
before
and
after
the
s.
This
issue
was
noted
by
Harrington
et
al.
(
1988
)
who
used
a
list
of
known
very
short
words
to
detect
these
cases
.
See
also
(
Cairns
et
al.
,
1997
)
.
Also
,
surrounding
words
sometimes
mislead
WordEnds
into
undersegmenting
a
phone
sequence
which
has
an
"
obvious
"
analysis
using
well-established
component
words
.
Mini-morph
classifies
each
word
in
the
segmentation
as
a
fragment
,
a
word
that
is
reliable
enough
to
use
in
subdividing
other
words
,
or
unknown
status
.
[Footnote 6: Values for Nmax were chosen empirically. They could be adjusted for differences in entropy rate, but this is very similar across the datasets in this paper.]
Because
it
has
only
a
feeble
model
of
morphology
,
Mini-morph
has
been
designed
to
be
cautious
:
most
words
are
classified
as
unknown
.
To
classify
a
word
,
we
compare
its
frequency
w
as
a
word
in
the
segmentation
to
the
frequencies
p
and
s
with
which
it
occurs
as
a
prefix
and
suffix
of
words
in
the
segmentation
(
including
itself
)
.
The
word
's
fragment
ratio
f
is
$2w / (p + s)$
.
Values
of
f
are
typically
over
0.8
for
freely
occurring
words
,
under
0.1
for
fragments
and
strongly-attached
affixes
,
and
intermediate
for
clitics
,
some
affixes
,
and
words
with
restricted
usage
.
However
,
most
words
haven
't
been
seen
enough
times
for
f
to
be
reliable
.
So
a
word
is
classified
as
a
fragment
if
p
+
s
>
1000
and
f
<
0.2
.
It
is
classified
as
a
reliable
word
if
p
+
s
>
50
and
f
>
0.5
.
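The classification rule can be sketched as follows. The thresholds are those given in the text. The formula for the fragment ratio is an assumption: the sketch takes f = 2w / (p + s), which lies between 0 and 1 because every occurrence as a word is also counted as a prefix and as a suffix. The function names and the token-level counting are likewise illustrative.

```python
def affix_counts(segmentation, cand):
    """Frequencies of `cand` as a word (w), as a prefix of words (p) and as
    a suffix of words (s); p and s include the word itself."""
    w = p = s = 0
    for phrase in segmentation:
        for word in phrase:
            w += word == cand
            p += word.startswith(cand)
            s += word.endswith(cand)
    return w, p, s


def classify(w, p, s):
    """Return 'fragment', 'reliable' or 'unknown' for one candidate word."""
    # ASSUMPTION: fragment ratio f = 2w / (p + s); not confirmed by the text.
    f = 2 * w / (p + s) if p + s else 0.0
    if p + s > 1000 and f < 0.2:
        return "fragment"
    if p + s > 50 and f > 0.5:
        return "reliable"
    return "unknown"          # the cautious default for most words
```

A candidate seen 600 times as a word with p = 600 and s = 650 is classified as reliable, while one seen 10 times as a word but 1300 times at word edges is classified as a fragment.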
To
revise
the
input
segmentation
of
the
corpus
,
Mini-morph
merges
each
fragment
with
an
adjacent
word
if
the
newly-created
merged
word
occurred
at
least
10
times
in
the
input
segmentation
.
When
mergers
with
both
adjacent
words
are
possible
,
the
algorithm
alternates
which
to
prefer
.
Each
word
is
then
subdivided
into
a
sequence
of
reliable
words
,
when
possible
.
Because
words
are
typically
short
and
reliable
words
rare
,
a
simple
recursive
algorithm
is
used
,
biased
towards
using
shorter
words
.
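One possible reading of the recursive subdivision step is sketched below. The text does not spell out how the bias towards shorter words is implemented; this sketch assumes it means trying the shortest reliable prefix first and backtracking when the remainder cannot be covered. The function name and the `reliable` set are illustrative.

```python
def subdivide(word, reliable):
    """Split `word` into a sequence of reliable words, preferring shorter
    leading words; return None if no complete split exists."""
    if not word:
        return []
    for end in range(1, len(word) + 1):        # shortest prefix first
        head = word[:end]
        if head in reliable:
            rest = subdivide(word[end:], reliable)
            if rest is not None:               # remainder fully covered
                return [head] + rest
    return None                                # caller keeps the word intact
```

For example, with `a` and `lot` both reliable, `alot` is split into `a` and `lot`, whereas a word that cannot be covered entirely by reliable words is left as it is.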
WordEnds
calls
Mini-morph
twice
,
once
to
revise
the
preliminary
segmentation
produced
by
the
bootstrapping
phase
and
a
second
time
to
revise
the
final
segmentation
.
6
Test
corpora
WordEnds
was
tested
on
a
diverse
set
of
seven
corpora
,
summarized
in
Table
1
.
Notice
that
the
Arabic
dataset
has
much
longer
words
than
those
used
by
previous
authors
.
Subsets
were
extracted
from
the
larger
corpora
,
to
control
for
training
set
size
.
Goldwater
's
algorithm
,
the
best
performing
of
previous
methods
,
was
also
tested
on
the
small
versions
.
The
first
three
corpora
all
use
dictionary
transcriptions
with
1-character
phone
symbols
.
The
Bernstein-Ratner
(
BR87
)
corpus
was
described
above
(
Section
3.2
)
.
The
Arabic
corpus
was
created
by
removing
punctuation
and
word
boundaries
from
the
Buckwalter
version
of
the
LDC
's
transcripts
of
Gulf
Arabic
Conversational
Telephone
Speech
(
Appen
,
2006
)
.
[Footnote 7: Subdivision is done only once for each word type.]
[Footnote 8: It is too slow to run on the larger ones.]
Filled
pauses
and
foreign
words
were
kept
as
is
.
Word
fragments
were
kept
,
but
the
telltale
hyphens
were
removed
.
The
Spanish
corpus
was
produced
in
a
similar
way
from
the
Callhome
Spanish
dataset
(
Wheatley
,
1996
)
,
removing
all
accents
.
Orthographic
forms
were
used
for
words
without
pronunciations
(
e.g.
foreign
,
fragments
)
.
The
other
two
English
dictionary
transcriptions
were
produced
in
a
similar
way
from
the
Buckeye
corpus
(
Pitt
et
al.
,
2005
;
Pitt
et
al.
,
2007
)
and
Mississippi
State
's
corrected
version
of
the
LDC
's
Switchboard
transcripts
(
Godfrey
and
Holliman
,
1994
;
Deshmukh
et
al.
,
1998
)
.
These
use
a
"
readable
phonetic
"
version
of
arpabet
.
Each
phone
is
represented
with
a
1-2
character
code
,
chosen
to
look
like
English
orthography
and
to
ensure
that
character
sequences
decode
uniquely
into
phone
sequences
.
Buckeye
does
not
provide
dictionary
pronunciations
for
word
fragments
,
so
these
were
transcribed
as
"
X
"
.
Switchboard
was
also
transcribed
using
standard
English
orthography
.
The
Buckeye
corpus
also
provides
an
accurate
phonetic
transcription
of
its
data
,
showing
allophonic
variation
(
e.g.
glottal
stop
,
dental
/
nasal
flaps
)
,
segment
deletions
,
quality
shifts
/
uncertainty
,
and
nasalization
.
Some
words
are
"
massively
"
reduced
(
Johnson
,
2003
)
,
going
well
beyond
standard
phonological
rules
.
We
represented
its
64
phones
using
codes
with
1-3
characters
.
7
Test
results
Table
2
presents
test
results
for
the
small
corpora
.
The
numbers
for
the
four
English
dictionary
and
orthographic
transcriptions
are
very
similar
.
This
confirms
the
finding
of
Batchelder
(
2002
)
that
variations
in
transcription
method
have
only
minor
impacts
on
segmenter
performance
.
Performance
seems
to
be
largely
determined
by
structural
and
lexical
properties
(
e.g.
word
length
,
pause
frequency
)
.
For
the
English
dictionary
datasets
,
the
primary
overall
evaluation
numbers
(
BF
and
WF
)
for
the
two
algorithms
differ
less
than
the
variation
created
by
tweaking
parameters
or
re-running
Goldwater
's
(
randomized
)
algorithm
.
Both
degrade
similarly
on
the
phonetic
version
of
Buckeye
.
The
most
visible
overall
difference
is
speed
.
WordEnds
processes
each
small
dataset
in
around
30-40
seconds
.
Goldwater
requires
around
2000
times
as
long
:
14.5-32
hours
,
depending
on
the
dataset
.
[Table 2: Results for WordEnds and Goldwater on the small test corpora. See Section 2.3 for definitions of metrics.]
[Table 3: Results for WordEnds on the medium and large datasets, also on the medium dataset without Mini-morph. See Table 1 for dataset sizes.]
However
,
WordEnds
keeps
affixes
on
words
whereas
Goldwater
's
algorithm
removes
them
.
This
creates
a
systematic
difference
in
the
balance
between
boundary
recall
and
precision
.
It
also
causes
Goldwater
's
LF
values
to
drop
dramatically
between
the
child-directed
BR87
corpus
and
the
adult-directed
speech
.
For
the
same
reason
,
WordEnds
maintains
good
performance
on
the
Arabic
dataset
,
but
Goldwater
's
performance
(
especially
LF
)
is
much
worse
.
It
is
quite
likely
that
Goldwater
's
algorithm
is
finding
morphemes
rather
than
words
.
Datasets
around
30K
words
are
traditional
for
this
task
.
However
,
a
child
learner
has
access
to
much
more
data
,
e.g.
van
de
Weijer
(
1999
)
measured
1890
words
per
hour
spoken
near
an
infant
.
WordEnds
performs
much
better
when
more
data
is
available
(
Table
3
)
.
Numbers
for
even
the
harder
datasets
(
Buckeye
phonetic
,
Spanish
)
are
starting
to
look
promising
.
The
Spanish
results
show
that
data
with
infrequent
pauses
can
be
handled
in
two
very
different
ways
:
aggressive
model-based
segmentation
(
Goldwater
)
or
feeding
more
data
to
a
more
cautious
segmenter
(
WordEnds
)
.
The
two
calls
to
Mini-morph
sometimes
make
almost
no
difference
,
e.g.
on
the
Arabic
data
.
But
it
can
make
large
improvements
,
e.g.
BF
+6.9
%
,
WF
+10.5
%
,
LF
+5.8
%
on
the
BR87
corpus
.
Table
3
shows
details
for
the
medium
datasets
.
Its
contribution
seems
to
diminish
as
the
datasets
get
bigger
,
e.g.
improvements
of
BF
+4.7
%
,
WF
+9.3
%
,
LF
+3.7
%
on
the
small
dictionary
Switchboard
corpus
but
only
BF
+1.3
%
,
WF
+3.3
%
,
LF
+3.4
%
on
the
large
one
.
8
Some
specifics
of
performance
Examining
specific
mistakes
confirms
that
WordEnds
does
not
systematically
remove
affixes
on
English
dictionary
data
.
On
the
large
Switchboard
corpus
,
"
-
ed
"
is
never
removed
from
its
stem
and
"
-
ing
"
is
removed
only
16
times
.
The
Mini-morph
postprocessor
misclassifies
,
and
thus
segments
off
,
some
affixes
that
are
homophonous
with
free-standing
words
,
such
as
"
-
en
"
/
"
in
"
and
"
-
es
"
/
"
is
"
.
A
smarter
model
of
morphology
and
local
syntax
could
probably
avoid
this
.
There
is
a
visible
difference
between
English
"
the
"
and
the
Arabic
determiner
"
Al
-
"
.
The
English
determiner
is
almost
always
segmented
off
.
From
the
medium-sized
Switchboard
corpus
,
only
434
lexical
items
are
posited
with
"
the
"
attached
to
a
following
word
.
Arabic
"
Al
"
is
sometimes
attached
and
sometimes
segmented
off
.
In
the
medium
Arabic
dataset
,
the
correct
and
computed
lexicons
contain
similar
numbers
of
words
starting
with
Al
(
4873
and
4608
)
,
but
there
is
only
partial
overlap
(
2797
words
)
.
Some
of
this
disagreement
involves
foreign
language
nouns
,
which
the
markup
in
the
original
corpus
separates
from
the
determiner
(footnote 9)
.
Mistakes
on
twenty
specific
items
account
for
24
%
of
the
errors
on
the
large
Switchboard
corpus
.
The
first
two
items
,
accounting
for
over
11
%
of
the
mistakes
,
involve
splitting
"
uhhuh
"
and
"
umhum
"
.
Most
of
the
rest
involve
merging
common
collocations
(
e.g.
"
a
lot
"
)
or
splitting
common
compounds
that
have
a
transparent
analysis
(
e.g.
"
something
"
)
.
9
Discussion
and
conclusions
Performance
of
WordEnds
is
much
stronger
than
previously
reported
results
,
including
good
results
on
Arabic
and
promising
results
on
accurate
phonetic
transcriptions
.
This
is
partly
due
to
good
algorithm
design
and
partly
due
to
using
more
training
data
.
This
sets
a
much
higher
standard
for
models
of
child
language
acquisition
and
also
suggests
that
it
is
not
crazy
to
speculate
about
inserting
such
an
algorithm
into
the
speech
recognition
pipeline
.
Performance
would
probably
be
improved
by
better
models
of
morphology
and
/
or
phonology
.
An
ngram
model
of
morpheme
sequences
(
e.g.
like
Goldwater
uses
)
might
avoid
some
of
the
mistakes
mentioned
in
Section
8
.
Feature-based
or
gestural
phonology
(
Browman
and
Goldstein
,
1992
)
might
help
model
segmental
variation
.
Finite-state
models
(
Belz
,
2000
)
might
be
more
compact
.
Prosody
,
stress
,
and
other
sub-phonemic
cues
might
disambiguate
some
problem
situations
(
Hockema
,
2006
;
Rytting
,
2007
;
Salverda
et
al.
,
2003
)
.
However
,
it
is
not
obvious
which
of
these
approaches
will
actually
improve
performance
.
Additional
phonetic
features
may
not
be
easy
to
detect
reliably
,
e.g.
marking
lexical
stress
in
the
presence
of
contrastive
stress
and
utterance-final
lengthening
.
[Footnote 9: The author does not read Arabic and, thus, is not in a position to explain why the annotators did this.]
The
actual
phonology
of
fast
speech
may
not
be
quite
what
we
expect
,
e.g.
performance
on
the
phonetic
version
of
Buckeye
was
slightly
improved
by
merging
nasal
flap
with
n
,
and
dental
flap
with
d
and
glottal
stop
.
The
sets
of
word
initial
and
final
segments
may
not
form
natural
phonological
classes
,
because
they
are
partly
determined
by
morphological
and
lexical
constraints
(
Rytting
,
2007
)
.
Moreover
,
the
strong
performance
from
the
basic
segmental
model
makes
it
hard
to
rule
out
the
possibility
that
high
performance
could
be
achieved
,
even
on
data
with
phonetic
variation
,
by
throwing
enough
training
data
at
a
simple
segmental
algorithm
.
Finally
,
the
role
of
child-directed
speech
needs
to
be
examined
more
carefully
.
Child-directed
speech
displays
helpful
features
such
as
shorter
phrases
and
fewer
reductions
(
Bernstein
Ratner
,
1996
;
van
de
Weijer
,
1999
)
.
These
features
may
make
segmentation
easier
to
learn
,
but
the
strong
results
presented
here
for
adult-directed
speech
make
it
trickier
to
argue
that
this
help
is
necessary
for
learning
.
Moreover
,
it
is
not
clear
how
learning
to
segment
child-directed
speech
might
make
it
easier
to
learn
to
segment
speech
directed
at
adults
or
older
children
.
It
's
possible
that
learning
child-directed
speech
makes
it
easier
to
learn
the
basic
principles
of
phonology
,
semantics
,
or
higher-level
linguistic
structure
.
This
might
somehow
feed
back
into
learning
segmentation
.
However
,
it
's
also
possible
that
its
only
raison
d'être
is
social
:
enabling
earlier
communication
between
children
and
adults
.
Acknowledgments
Many
thanks
to
the
UIUC
prosody
group
,
Mitch
Marcus
,
Cindy
Fisher
,
and
Sharon
Goldwater
.
