We
present
a
probabilistic
model
of
diachronic
phonology
in
which
individual
word
forms
undergo
stochastic
edits
along
the
branches
of
a
phylogenetic
tree
.
Our
approach
allows
us
to
achieve
three
goals
with
a
single
unified
model
:
(
1
)
reconstruction
of
both
ancient
and
modern
word
forms
,
(
2
)
discovery
of
general
phonological
changes
,
and
(
3
)
selection
among
different
phylogenies
.
We
learn
our
model
using
a
Monte
Carlo
EM
algorithm
and
present
quantitative
results
validating
the
model
.
1
Introduction
Modeling
how
languages
change
phonologically
over
time
(
diachronic
phonology
)
is
a
central
topic
in
historical
linguistics
(
Campbell
,
1998
)
.
The
questions
involved
range
from
reconstruction
of
ancient
word
forms
,
to
the
elucidation
of
phonological
drift
processes
,
to
the
determination
of
phylogenetic
relationships
between
languages
.
However
,
this
problem
has
received
relatively
little
attention
from
the
computational
community
.
What
work
there
is
has
focused
on
the
reconstruction
of
phylogenies
on
the
basis
of
a
Boolean
matrix
indicating
the
properties
of
words
in
different
languages
(
Gray
and
Atkinson
,
2003
;
Evans
et
al.
,
2004
;
Ringe
et
al.
,
2002
;
Nakhleh
et
al.
,
2005
)
.
In
this
paper
,
we
present
a
novel
framework
,
along
with
a
concrete
model
and
experiments
,
for
the
probabilistic
modeling
of
diachronic
phonology
.
We
focus
on
the
case
where
the
words
are
etymological
cognates
across
languages
,
e.g.
French faire and Spanish hacer from Latin facere (to do)
.
Given
this
information
as
input
,
we
learn
a
model
acting
at
the
level
of
individual
phoneme
sequences
,
which
can
be
used
for
reconstruction
and
prediction
.
Our
model
is
fully
generative
,
and
can
be
used
to
reason
about
a
variety
of
types
of
information
.
For
example
,
we
can
observe
a
word
in
one
or
more
modern
languages
,
say
French
and
Spanish
,
and
query
the
corresponding
word
form
in
another
language
,
say
Italian
.
This
kind
of
lexicon-filling
has
applications
in
machine
translation
.
Alternatively
,
we
can
also
reconstruct
ancestral
word
forms
or
inspect
the
rules
learned
along
each
branch
of
a
phylogeny
to
identify
salient
patterns
.
Finally
,
the
model
can
be
used
as
a
building
block
in
a
system
for
inferring
the
topology
of
phylogenetic
trees
.
We
discuss
all
of
these
cases
further
in
Section
4
.
The
contributions
of
this
paper
are
threefold
.
First
,
the
approach
to
modeling
language
change
at
the
phoneme
sequence
level
is
new
,
as
is
the
specific
model
we
present
.
Second
,
we
compiled
a
new
corpus1
and
developed
a
methodology
for
quantitatively
evaluating
such
approaches
.
Finally
,
we
describe
an
efficient
inference
algorithm
for
our
model
and
empirically
study
its
performance
.
While
our
word-level
model
of
phonological
change
is
new
,
there
have
been
several
computational
investigations
into
diachronic
linguistics
which
are
relevant
to
the
present
work
.
1nlp.cs.berkeley.edu/pages/historical.html

Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 887-896, Prague, June 2007. © 2007 Association for Computational Linguistics

The task of reconstructing phylogenetic trees for languages has been studied by several authors.
These
approaches
descend
from
glottochronology
(
Swadesh
,
1955
)
,
which
views
a
language
as
a
collection
of
shared
cognates
but
ignores
the
structure
of
those
cognates
.
This
information
is
obtained
from
manually
curated
cognate
lists
such
as
the
data
of
Dyen
et
al.
(
1997
)
.
As
an
example
of
a
cognate
set
encoding
,
consider
the
meaning
"
eat
"
.
There
would
be
one
column
for
the
cognate
set
which
appears
in
French
as
manger
and
Italian
as
mangiare
since
both
descend
from
the
Latin
mandere
(
to
chew
)
.
There
would
be
another
column
for
the
cognate
set
which
appears
in
both
Spanish
and
Portuguese
as
comer
,
descending
from
the
Latin
comedere
(
to
consume
)
.
If
this
were
the
only
data
,
algorithms
based
on
this
data
would
tend
to
conclude
that
French
and
Italian
were
closely
related
and
that
Spanish
and
Portuguese
were
equally
related
.
However
,
the
cognate
set
representation
has
several
disadvantages
:
it
does
not
capture
the
fact
that
the
cognate
is
closer
between
Spanish
and
Portuguese
than
between
French
and
Spanish
,
nor
do
the
resulting
models
let
us
conclude
anything
about
the
regular
processes
which
caused
these
languages
to
diverge
.
Also
,
the
existing
cognate
data
has
been
curated
at
a
relatively
high
cost
.
In
our
work
,
we
track
each
word
using
an
automatically
obtained
cognate
list
.
While
our
cognates
may
be
noisier
,
we
compensate
by
modeling
phonological
changes
rather
than
boolean
mutations
in
cognate
sets
.
There
has
been
other
computational
work
in
this
broad
domain
.
Venkataraman
et
al.
(
1997
)
describe
an
information
theoretic
measure
of
the
distance
between
two
dialects
of
Chinese
.
Like
our
approach
,
they
use
a
probabilistic
edit
model
as
a
formalization
of
the
phonological
process
.
However
,
they
do
not
consider
the
question
of
reconstruction
or
inference
in
multi-node
phylogenies
,
nor
do
they
present
a
learning
algorithm
for
such
models
.
Finally
,
for
the
specific
application
of
cognate
prediction
in
machine
translation
,
essentially
transliteration
,
there
have
been
several
approaches
,
including
Kondrak
(
2002
)
.
However
,
the
phenomena
of
interest
,
and
therefore
the
models
,
are
extremely
different
.
Kondrak
(
2002
)
presents
a
model
for
learning
"
sound
laws
,
"
general
phonological
changes
governing
two
completely
observed
aligned
cognate
lists
.
His model can be viewed as a special case of ours using a simple two-node topology.

Figure 1: Tree topologies used in our experiments. *Topology 3 and *Topology 4 are incorrect evolutionary trees used for our experiments on the selection of phylogenies (Section 4.4).
Mau and Newton (1997) and
Li
et
al.
(
2000
)
each
independently
presented
a
Bayesian
model
for
computing
posteriors
over
evolutionary
trees
.
A
key
difference
with
our
model
is
that
independence
across
evolutionary
sites
is
assumed
in
their
work
,
while
the
evolution
of
the
phonemes
in
our
model
depends
on
the
environment
in
which
the
change
occurs
.
2
A
model
of
phonological
change
Assume
we
have
a
fixed
set
of
word
types
(
cognate
sets
)
in
our
vocabulary
V
and
a
set
of
languages
L.
Each word type i has a word form w_il in each language l ∈ L, which is represented as a sequence of phonemes and might or might not be observed.
The
languages
are
arranged
according
to
some
tree
topology
T
(
see
Figure
1
for
examples
)
.
One
might
consider
models
that
simultaneously
induce
the
topology
and
cognate
set
assignments
,
but
let
us
fix
both
for
now
.
We
discuss
one
way
to
relax
this
assumption
and
present
experimental
results
in
Section
4.4
.
Our
generative
model
(
Figure
3
)
specifies
a
distribution
over
the
word
forms
{w_il} for each word type i ∈ V and each language l ∈ L.
The
generative
process
starts
at
the
root
language
and
generates
all
the
word
forms
in
each
language
in
a
top-down
manner
.
One
appealing
aspect
about
our
model
is
that
,
at
a
high-level
,
it
reflects
the
actual
phonological
process
that
languages
undergo
.
However
,
important
phenomena
like
lexical
drift
,
borrowing
,
and
other
non-phonological
changes
are
not
modeled
.
Our
generative
model
can
be
summarized
as
follows
:
In
the
remainder
of
this
section
,
we
describe
each
of
the
steps
in
the
model
.
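To make these steps concrete, the following toy sketch (our illustration, not the system's implementation) generates word forms top-down along a phylogeny: a root form is drawn from a bigram phoneme model, and each branch applies stochastic per-phoneme edits. The data structures, names, and distributions here are invented for illustration.

```python
import random

def sample_root(lm_bigrams, rng):
    """Sample a root word form phoneme by phoneme from a bigram model.
    lm_bigrams[prev] maps next phonemes (or "</s>") to probabilities."""
    word, prev = [], "<s>"
    while True:
        phones, probs = zip(*lm_bigrams[prev].items())
        nxt = rng.choices(phones, weights=probs)[0]
        if nxt == "</s>":
            return word
        word.append(nxt)
        prev = nxt

def sample_edit(word, sub_table, rng):
    """Apply one stochastic substitution per phoneme, left to right.
    Phonemes missing from sub_table are copied unchanged."""
    out = []
    for ph in word:
        choices, probs = zip(*sub_table.get(ph, {ph: 1.0}).items())
        out.append(rng.choices(choices, weights=probs)[0])
    return out

def generate_tree(topology, root_lm, branch_subs, rng):
    """Generate word forms top-down; topology maps a node to its children."""
    forms = {"root": sample_root(root_lm, rng)}
    stack = ["root"]
    while stack:
        parent = stack.pop()
        for child in topology.get(parent, []):
            forms[child] = sample_edit(forms[parent],
                                       branch_subs[(parent, child)], rng)
            stack.append(child)
    return forms
```

This sketch only includes substitution edits; the full model of Section 2 also allows deletions, insertions, and context-dependence.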
For
the
distribution
w
~
LanguageModel
,
we
used
a
simple
bigram
phoneme
model
.
The
phonemes
were
partitioned
into
natural
classes
(
see
Section
4
for
details
)
.
A root word form consisting of n phonemes x1 · · · xn is generated with probability

p_lm(x1 | START) · p_lm(x2 | x1) · · · p_lm(xn | x_{n-1}) · p_lm(STOP | xn),

where p_lm is the conditional distribution of the bigram language model.
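This probability can be computed as follows (a minimal sketch; the START/STOP bigram conventions are our assumption, and `lm` is a hypothetical nested dictionary of conditional bigram probabilities):

```python
import math

def root_log_prob(word, lm, start="<s>", stop="</s>"):
    """Log-probability of a root form x1...xn under a bigram phoneme model:
    p_lm(x1|START) * prod_j p_lm(xj|x_{j-1}) * p_lm(STOP|xn)."""
    logp, prev = 0.0, start
    for ph in list(word) + [stop]:
        logp += math.log(lm[prev][ph])  # lm[prev][ph] = p_lm(ph | prev)
        prev = ph
    return logp
```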
The stochastic edit model y ∼ Edit(x, θ) describes how a single old word form x = x1 · · · xn changes along one branch of the phylogeny with parameters θ to produce a new word form y. This process is parameterized by rule probabilities θ_{k→l}, which are specific to branch (k → l).
The
generative
process
is
as
follows
:
for
each
phoneme
xj
in
the
old
word
form
,
walking
from
left
to
right
,
choose
a
rule
to
apply
.
There
are
three
types
of
rules
:
(
1
)
deletion
of
the
phoneme
,
(
2
)
substitution
with
another
phoneme
(
possibly
the
same
one
)
,
or
(
3
)
insertion
of
another
phoneme
,
either
before
or
after
the
existing
one
.
The
probability
of
applying
a
rule
depends on a context (NaturalClass(x_{i-1}), NaturalClass(x_{i+1})).
Figure
2
illustrates
the
edits
on
an
example
.
The
context-dependence
allows
us
to
represent
phenomena
such
as
the
fact
that
s
is
likely
to
be
deleted
only
in
word-final
contexts
.
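A minimal deterministic sketch of such a context-dependent, left-to-right edit pass follows; the natural-class assignments and rule tables are invented examples, and deletion is encoded as an empty output string, insertion as a multi-character output:

```python
# Invented class assignments; "#" marks a word boundary.
NATURAL_CLASS = {"#": "#", "f": "LABIODENTAL", "k": "VELAR", "s": "ALVEOLAR",
                 "o": "ROUNDED", "u": "ROUNDED", "e": "UNROUNDED"}

def apply_edits(word, rules):
    """One left-to-right pass. rules maps
    (left_class, phoneme, right_class) to an output string:
    "" encodes deletion, a multi-character string encodes insertion."""
    padded = ["#"] + list(word) + ["#"]
    out = []
    for j in range(1, len(padded) - 1):
        left = NATURAL_CLASS[padded[j - 1]]
        right = NATURAL_CLASS[padded[j + 1]]
        # Fall back to an identity edit when no rule fires.
        out.append(rules.get((left, padded[j], right), padded[j]))
    return "".join(out)
```

For instance, a rule deleting /s/ only after a rounded vowel in word-final position maps /fokus/ to /foku/, while leaving a medial /s/ untouched.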
The
edit
model
we
have
presented
approximately
encodes
a
limited
form
of
classic
rewrite-driven
segmental
phonology
(
Chomsky
and
Halle
,
1968
)
.
Figure 2: An example of edits that were used to transform the Latin word FOCUS (/fokus/) into the Italian word fuoco (/fwoko/) (fire), along with the context-specific rules that were applied. The figure's columns list the edits applied and the rules used.

One
could
imagine
basing
our
model
on
more
modern
phonological
theory
,
but
the
computational
properties
of
the
edit
model
are
compelling
,
and
it
is
adequate
for
many
kinds
of
phonological
change
.
In
addition
to
simple
edits
,
we
can
model
some
classical
changes
that
appear
to
be
too
complex
to
be
captured
by
a
single
left-to-right
edit
model
of
this
kind
.
For
instance
,
bleeding
and
feeding
arrangements
occur
when
one
phonological
change
introduces
a
new
context
,
which
triggers
another
phonological
change
,
but
the
two
cannot
occur
simultaneously
.
For
example
,
vowel raising e → i / _c might be needed before palatalization t → c / _i.
Instead
of
capturing
such
an
interaction
directly
,
we
can
break
up
a
branch
into
two
segments
joined
at
an
intermediate
language
node
,
conflating
the
concept
of
historically
intermediate
languages
with
the
concept
of
intermediate
stages
in
the
application
of
sequential
rules
.
However
,
many
complex
processes
are
not
well-represented
by
our
basic
model
.
One
problematic
case
is
chained
shifts
such
as
Grimm
's
law
in
Proto-Germanic
or
the
Great
Vowel
Shift
in
English
.
To
model
such
dependent
rules
,
we
would
need
to
use
more complex prior distributions
over
the
edit
parameters
.
Another
difficult
case
is that of
prosodic
changes
,
such
as
unstressed
vowel
neutralizations
,
which
would
require
a
representation
of
suprasegmental
features
.
While
our
basic
model
does
not
account
for
these
phenomena
,
extensions
within
the
generative
framework
could
capture
such
richness
.
3
Learning
and
inference
We
use
a
Monte
Carlo
EM
algorithm
to
fit
the
parameters
of
our
model
.
The algorithm iterates between a stochastic E-step, which computes reconstructions based on the current edit parameters, and an M-step, which updates the edit parameters based on the reconstructions.

Figure 3: The graphical model representation of our model: θ are the parameters specifying the stochastic edits e, which govern how the words w evolve. The plate notation indicates the replication of the nodes corresponding to the evolving words.
The
E-step
needs
to
produce
expected
counts
of
how
many
times
each
edit
(
such
as
o → o
)
was
used
in
each
context
.
An
exact
E-step
would
require
summing
over
all
possible
edits
involving
all
languages
in
the
phylogeny
(
all
unobserved
{
e
}
,
{
w
}
variables
in
Figure
3
)
.
Unfortunately
,
unlike
in
the
case
of
HMMs
and
PCFGs
,
our
model
permits
no
tractable
dynamic
program
to
compute
these
counts
exactly
.
Therefore
,
we
resort
to
a
Monte
Carlo
E-step
,
where
many
samples
of
the
edit
variables
are
collected
,
and
counts
are
computed
based
on
these
samples
.
Samples
are
drawn
using
Gibbs
sampling
(
Geman
and
Geman
,
1984
)
:
for
each
word
form
of
a
particular
language
w_il
,
we
fix
all
other
variables
in
the
model
and
sample
w_il
along
with
its
corresponding
edits
.
In
the
E-step
,
we
fix
the
parameters
,
which
renders
the
word
types
conditionally
independent
,
just
as
in
an
HMM
.
Therefore
,
we
can
process
each
word
type
in
turn
without
approximation
.
First
consider
the
simple
4-language
topology
in
Figure
3
.
Suppose
that
the
words
in
languages
A
,
C
and
D
are
fixed
,
and
we
wish
to
infer
the
word
at
language
B
along
with
the
three
corresponding
sets
of
edits
(
remember
the
edits
fully
determine
the
words
)
.
There
are
an
exponential
number
of
possible
words
/
edits
,
but
it
turns
out
that
we
can
exploit
the
Markov
structure
in
the
edit
model
to
consider
all
such
words
/
edits
using
dynamic
programming
,
in
a
way
broadly
similar
to
the
forward-backward
algorithm
for
HMMs
.
Figure
4
shows
the
lattice
for
the
dynamic
program
.
Each
path
connecting
the
two
shaded
endpoint
states
represents
a
particular
word
form
for
language
B
and
a
corresponding
set
of
edits
.
Each
node
in
the
lattice
is
a
state
of
the
dynamic
program
,
which
is
a
5-tuple (i_A, i_C, i_D, c1, c2)
,
where
i_A, i_C and i_D
are
the
cursor
positions
(
represented
by
dots
in
Figure
4
)
in
each
of
the
word
forms
of
A
,
C
and
D
,
respectively
;
c1
is
the
natural
class
of
the
phoneme
in
the
word
form
for
B
that
was
last
generated
;
and
c2
corresponds
to
the
phoneme
that
will
be
generated
next
.
Each
state
transition
involves
applying
a
rule
to
A
's
current
phoneme
(
which
produces
0-2
phonemes
in
B
)
and
applying
rules
to
B
's
new
0-2
phonemes
.
There
are
three
types
of
rules
(
deletion
,
substitution
,
insertion
)
,
resulting
in
3^0 + 3^2 + 3^4 = 91
types
of
state
transitions
.
For
illustration
,
Figure
4
shows
the
simpler
case
where
B
only
has
one
child
C.
Given
these
rules
,
the
new
state
is
computed
by
advancing
the
appropriate
cursors
and
updating
the
natural
classes
c1
and
c2
.
The
weight
of
each
transition
w(s → t)
is
a
product
of
the
language
model
probability
and
the
rule
probabilities
that
were
chosen
.
For
each
state
s
,
the
dynamic
program
computes
W
(
s
)
,
the
sum
of
the
weights
of
all
paths
leaving
s
,
W(s) = Σ_t w(s → t) W(t).
To
sample
a
path
,
we
start
at
the
leftmost
state
,
choose
the
transition
with
probability
proportional
to
its
contribution
in
the
sum
for
computing
W
(
s
)
,
and
repeat
until
we
reach
the
rightmost
state
.
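The recursion W(s) = Σ_t w(s → t) W(t) and the proportional sampling of a path can be sketched on an abstract weighted acyclic lattice as follows; the state and edge encodings are placeholders, and the real lattice states are the 5-tuples described above:

```python
import random

def path_weights(edges, final):
    """Memoized W(s) = sum_t w(s->t) * W(t), with W(final) = 1.
    edges[s] is a list of (successor, weight) pairs on an acyclic lattice."""
    W = {final: 1.0}
    def weight(s):
        if s not in W:
            W[s] = sum(w * weight(t) for t, w in edges.get(s, []))
        return W[s]
    return weight

def sample_path(edges, start, final, rng):
    """Sample a path, choosing each transition with probability
    proportional to its contribution w(s->t) * W(t)."""
    weight = path_weights(edges, final)
    path, s = [start], start
    while s != final:
        contribs = [(t, w * weight(t)) for t, w in edges[s]]
        r = rng.random() * sum(c for _, c in contribs)
        for t, c in contribs:
            r -= c
            if r <= 0:
                break
        path.append(t)
        s = t
    return path
```

On a toy lattice where one outgoing branch carries three times the downstream weight of the other, the sampler visits it three times as often in expectation.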
We
applied
a
few
approximations
to
speed
up
the
sampling
of
words
,
which
reduced
the
running
time
by
several
orders
of
magnitude
.
For
example
,
we
pruned
rules
with
low
probability
and
restricted
the
state space of the dynamic program by limiting the deviation in cursor positions.

Figure 4: The dynamic program involved in sampling an intermediate word form given one ancient and one modern word form (x: ancient phoneme, y: intermediate, z: modern). One lattice node is expanded to show the dynamic program state (represented by the part not grayed out) and three of the many possible transitions leaving the state. Each transition is labeled with the weight of the transition, which is the product of the relevant model probabilities. At the bottom, the 13 types of state transitions are shown.
In the M-step, each rule probability is re-estimated from its expected count together with α − 1 pseudocounts, where α is the concentration hyperparameter of the Dirichlet prior. The value α − 1 can be interpreted as the number of pseudocounts for a rule.
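A minimal sketch of a pseudocount-smoothed M-step update consistent with this interpretation (a simplification under a symmetric Dirichlet prior; the rule names are invented):

```python
def m_step(expected_counts, alpha):
    """MAP-style update for the rule probabilities of one context under a
    symmetric Dirichlet(alpha) prior: alpha - 1 pseudocounts per rule."""
    total = sum(expected_counts.values()) + (alpha - 1) * len(expected_counts)
    return {rule: (count + alpha - 1) / total
            for rule, count in expected_counts.items()}
```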
4
Experiments
In
this
section
we
show
the
results
of
our
experiments
with
our
model
.
The experimental conditions are summarized in Table 1, with additional information on the specifics of the experiments presented in Section 4.5.

Table 1: Conditions under which each of the experiments presented in this section were performed (columns: Experiment, Topology, ...). The topology indices correspond to those displayed in Figure 1. Note that by conditional independence, the topology used for Spanish reconstruction reduces to a chain. The heldout column indicates how many words, if any, were held out for edit distance evaluation, and from which language.
We
start
with
a
description
of
the
corpus
we
created
for
these
experiments
.
In
order
to
train
and
evaluate
our
system
,
we
compiled
a
corpus
of
Romance
cognate
words
.
The
raw
data
was
taken
from
three
sources
:
the
wiktionary.org
website
,
a
Bible
parallel
corpus
(
Resnik
et
al.
,
1999
)
and
the
Europarl
corpus
(
Koehn
,
2002
)
.
From
an
XML
dump
of
the
Wiktionary
data
,
we
extracted
multilingual
translations
,
which
provide
a
list
of
word
tuples
in
a
large
number
of
languages
,
including
a
few
ancient
languages
.
The
Europarl
and
the
biblical
data
were
processed
and
aligned
in
the
standard
way
,
using
combined
GIZA++
alignments
(
Och
and
Ney
,
2003
)
.
We
performed
our
experiments
with
four
languages
from
the
Romance
family
(
Latin
,
Italian
,
Spanish
,
and
Portuguese
)
.
For
each
of
these
languages
,
we
used
a
simple
in-house
rule-based
system
to
convert
the
words
into
their
IPA
representations.2
After
augmenting
our
alignments
with
the
transitive
closure3
of
the
Europarl
,
Bible
and
Wiktionary
data
,
we
filtered
out
non-cognate
words
by
thresholding
the
ratio
of
edit
distance
to
word
length.4
The
preprocessing
is
constraining
in
that
we require all the elements of a tuple to be cognates, which leaves out a significant portion of the data
(
see
the
row
Full
entries
in
Table
2
)
.
However
,
our
approach
relies
on
this
assumption
,
as
there
is
no
explicit
model
of
non-cognate
words
.
An
interesting
direction
for
future
work
is
the
joint
modeling
of
phonology
with
the
determination
of
the
cognates
,
but
our
simpler
setting
lets
us
focus
on
the
properties
of
the
edit
model
.
Moreover
,
the
restriction
to
full
entries
has
the
side
advantage
that
the
Latin
bottleneck
prevents
the
introduction
of
too
many
neologisms
,
which
are
numerous
in
the
Europarl
data
,
to
the
final
corpus
.
Since
we
used
automatic
tools
for
preparing
our
corpus
rather
than
careful
linguistic
analysis
,
our
cognate
list
is
much
noisier
in
terms
of
the
presence
of
borrowed
words
and
phonemic
transcription
errors
compared
to
the
ones
used
by
previous
approaches
(
Swadesh
,
1955
;
Dyen
et
al.
,
1997
)
.
The
benefit
of
our
mechanical
preprocessing
is
that
more
cognate
data
can
easily
be
made
available
,
allowing
us
to
effectively
train
richer
models
.
We
show
in
the
rest
of
this
section
that
our
phonological
model
can
indeed
overcome
this
noise
and
recover
meaningful
patterns
from
the
data
.
[Table 2, column headers: Name, Languages, Tuples, Word forms; section: Raw sources of data used to create the corpus.]
2The tool and the rules we used are available at nlp.cs.berkeley.edu/pages/historical.html.
3For example, we would infer from an la-es Bible alignment confessionem-confesion (confession) and an es-it Europarl alignment confesion-confessione that the Latin word confessionem and the Italian word confessione are related.
[Table 2, rows: Wiktionary, Europarl; section: Main stages of preprocessing of the corpus; rows: Cognates, Full entries.]
Table
2
:
Statistics
of
the
dataset
we
compiled
for
the
evaluation
of
our
model
.
We
show
the
languages
represented
,
the
number
of
tuples
and
the
number
of
word
forms
found
in
each
of
the
source
of
data
and
pre-processing
steps
involved
in
the
creation
of
the
dataset
we
used
to
test
our
model
.
By
full
entry
,
we
mean
the
number
of
tuples
that
are
jointly
considered
cognate
by
our
preprocessing
system and that have a word form known for each of the languages of interest. This last row forms
the
dataset
used
for
our
experiments
.
[Table 3, column headers: Language, Baseline, Improvement; d is the Levenshtein distance.]
Table
3
:
Results
of
the
edit
distance
experiment
.
The
language
column
corresponds
to
the
language
held-out
for
evaluation
.
We
show
the
mean
edit
distance
across
the
evaluation
examples
.
4.2
Reconstruction
of
word
forms
We
ran
the
system
using
Topology
1
in
Figure
1
to
demonstrate
that the
system
can
propose
reasonable
reconstructions
of
Latin
word
forms
on
the
basis
of
modern
observations
.
Half
of
the
Latin
words
at
the
root
of
the
tree
were
held
out
,
and
the
(
uniform
cost
)
Levenshtein
edit
distance
from
the
predicted
reconstruction
to
the
truth
was
computed
.
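The uniform-cost Levenshtein distance used here is the standard dynamic program, which can be sketched as:

```python
def levenshtein(a, b):
    """Uniform-cost edit distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete ca
                           cur[j - 1] + 1,               # insert cb
                           prev[j - 1] + (ca != cb)))    # substitute
        prev = cur
    return prev[-1]
```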
Our
baseline
is
to
pick
randomly
,
for
each
heldout
node
in
the
tree
,
an
observed
neighboring
word
(
i.e.
copy
one
of
the
modern
forms
)
.
We
stopped
EM
after
15
iterations
,
and
reported
the
result
on
a
Viterbi
derivation
using
the
parameters
obtained
.
Our
model
outperformed
this
baseline
by
a
9
%
relative
reduction
in
average
edit
distance
.
Similarly
,
reconstruction
of
modern
forms
was
also
demonstrated
,
with
an
improvement
of
11
%
(
see
Table
3
)
.
To
give
a
qualitative
feel
for
the
operation
of
the
system
(
good
and
bad
)
,
consider
the
example
in
Figure
5
,
taken
from
this
experiment
.
The
Latin
dentis
/
dentis
/
(
teeth
)
is
nearly
correctly
reconstructed
as
/
dentes
/
,
reconciling
the
appearance
of
the
/
j
/
in the Spanish /djentes/ and the disappearance of the final /s/ in the Italian.

Figure 5: An example of a Latin reconstruction given the Spanish and Italian word forms.
Note
that
the
/
is
/
vs.
/
es
/
ending
is
difficult
to
predict
in
this
context
(
indeed
,
it
was
one
of
the
early
distinctions
to
be
eroded
in
vulgar
Latin
)
.
While
the
uniform-cost
edit
distance
misses
important
aspects
of
phonology
(
all
phoneme
substitutions
are
not
equal
,
for
instance
)
,
it
is
parameter-free
and
still
seems
to
correlate
to
a
large
extent
with
linguistic
quality
of
reconstruction
.
It
is
also
superior
to
held-out
log-likelihood
,
which
fails
to
penalize
errors
in
the
modeling
assumptions
,
and
to
measuring
the
percentage
of
perfect
reconstructions
,
which
ignores
the
degree
of
correctness
of
each
reconstructed
word
.
4.3
Inference
of
phonological
changes
Another
use
of
our
model
is
to
automatically
recover
the
phonological
drift
processes
between
known
or
partially
known
languages
.
To
facilitate
evaluation
,
we
continued
in
the
well-studied
Romance
evolutionary
tree
.
Again
,
the
root
is
Latin
,
but
we
now
add
an
additional
modern
language
,
Portuguese
,
and
two
additional
hidden
nodes
.
One
of
the
nodes
characterizes
the
least
common
ancestor
of
modern
Spanish
and
Portuguese
;
the
other
,
the
least
common
ancestor
of
all
three
modern
languages
.
In
Figure
1
,
Topology
2
,
these
two
nodes
are
labelled
vl
(
Vulgar
Latin
)
and
ib
(
Proto-Ibero
Romance
)
respectively
.
Since
we
are
omitting
many
other
branches
,
these
names
should
not
be
understood
as
referring
to
actual
historical
proto-languages
,
but
,
at
best
,
to
collapsed
points
representing
several
centuries
of
evolution
.
Nonetheless
,
the
major
reconstructed
rules
still
correspond
to
well
known
phenomena
and
the
learned
model
generally
places
them
on
reasonable
branches
.
Figure
6
shows
the
top
four
general
rules
for
each
of
the
evolutionary
branches
in
this
experiment
,
ranked
by
the
number
of
times
they
were
used
in
the
derivations
during
the
last
iteration
of
EM
.
The
la
,
es
,
pt
,
and
it
forms
are
fully
observed
while
the
vl
and
ib
forms
are
automatically
reconstructed
.
Figure
6
also
shows
a
specific
example
of
the
evolution
of
the
Latin
VERBUM
(
word
/
verb
)
,
along
with
the
specific
edits
employed
by
the
model
.
While
quantitative
evaluation
such
as
measuring
edit
distance
is
helpful
for
comparing
results
,
it
is
also
illuminating
to
consider
the
plausibility
of
the
learned
parameters
in
a
historical
light
,
which
we
do
here
briefly
.
In
particular
,
we
consider
rules
on
the
branch
between
la
and
vl
,
for
which
we
have
historical
evidence
.
For
example
,
documents
such
as
the
Appendix
Probi
(
Baehrens
,
1922
)
provide
indications
of
orthographic
confusions
which
resulted
from
the
growing
gap
between
Classical
Latin
and
Vulgar
Latin
phonology
around
the
3rd
and
4th
centuries
AD
.
The
Appendix
lists
common
misspellings
of
Latin
words
,
from
which
phonological
changes
can
be
inferred
.
On
the
la
to
vl
branch
,
rules
for
word-final
deletion
of
classical
case
markers
dominate
the
list
(
rules
ranks
1
and
3
for
deletion
of
final
/
s
/
,
ranks
2
and
4
for
deletion
of
final
/
m
/
)
.
It
is
indeed
likely
that
these
were
generally
eliminated
in
Vulgar
Latin
.
For
the
deletion
of
the
/
m
/
,
the
Appendix
Probi
contains
pairs
such
as
PASSIM
NON
PASSI
and
OLIM
NON
OLI
.
For
the
deletion
of
final
/
s
/
,
this
was
observed
in
early
inscriptions
,
e.g.
CORNELIO
for
CORNE-LIOS
(
Allen
,
1989
)
.
The
frequent
leveling
of
the
distinction
between
/
o
/
and
/
u
/
(
rules
ranked
5
and
6
)
can
be
also
be
found
in
the
Appendix
Probi
:
COLUBER
NON
COLOBER
.
Note
that
in
the
specific
example
shown
,
the
model
lowers
the
original
/
u
/
and
then
re-raises
it
in
the
pt
branch
due
to
a
later
process
along
that
branch
.
Similarly
,
major
canonical
rules
were
discovered
in
other
branches
as
well
,
for
example
,
/
v
/
to
/
b
/
fortition
in
Spanish
,
/
s
/
to
/
z
/
voicing
in
Italian
,
palatalization
along
several
branches
,
and
so
on
.
Of
course
,
the
recovered
words
and
rules
are
not
perfect
.
For
example
,
reconstructed
Ibero
/
trinta
/
to
Spanish
/
treinta
/
(
thirty
)
is
generated
in
an
odd
fashion
using
rules
/
e
/
to
/
i
/
and
/
n
/
to
/
in
/
.
Moreover
,
even
when
otherwise
reasonable
systematic
sound
changes
are
captured
,
the
crudeness
of
our
fixed-granularity
contexts
can
prevent
the
true
context
many
environments
many
environments
initial
or
intervocalic
/
ROUNDED
_
UNROUNDED
m
—
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
u
—
o
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
r
—
r
e
—
e
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
v
—
b
o
—
u
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
S
~
*
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Figure
6
:
The
tree
shows
the
system
's
hypothesised
derivation
of
a
selected
Latin
word
form
,
VERBUM
(
word
/
verb
)
into
the
modern
Spanish
,
Italian
and
Portuguese
pronunciations
.
The
Latin
root
and
modern
leaves
were
observed
while
the
hidden
nodes
as
well
as
all
the
derivations
were
obtained
using
the
parameters
computed
by
our
model
after
15
iterations
of
EM
.
Nontrivial
rules
(
i.e.
rules
that
are
not
identities
)
used
at
each
stage
are
shown
along
the
corresponding
edge
.
The
boxes
display
the
top
four
nontrivial
rules
corresponding
to
each
of
these
evolutionary
branches
,
ordered
by
the
number
of
time
they
were
applied
during
the
last
E
round
of
sampling
.
Note
that
since
our
natural
classes
are
of
fixed
granularity
,
some
rules
must
be
redundantly
discovered
,
which
tends
to
flood
the
top
of
the
rule
lists
with
duplicates
of
the
top
few
rules
.
We
summarized
such
redundancies
in
the
above
tables
.
from
being
captured
,
resulting
in
either
rules
applying
with
low
probability
in
overly
coarse
environments
or
rules
being
learned
redundantly
in
overly
fine
environments
.
4.4
Selection
of
phylogenies
In
this
experiment
,
we
show
that
our
model
can
be
used
to
select
between
various
topologies
of
phylogenies
.
We
first
presented
to
the
algorithm
the
universally
accepted
evolutionary
tree
corresponding
to
the
evolution
of
Latin
into
Spanish
,
Portuguese
and
Italian
(
Topology
2
in
Figure
1
)
.
We
estimated
the
log-likelihood
L
*
of
the
data
under
this
topology
.
Next
,
we
estimated
the
log-likelihood
L
under
two
defective
topologies
(
*
Topology
3
and
*
Topology
4
)
.
We
recorded
the
log-likelihood
ratio
L* − L
after
the
last
iteration
of
EM
.
Note
that
the
two
likelihoods
are
comparable
since
the
complexity
of
the
two
models
is
the
same.5
We
obtained
a
ratio
of
L* − L = −4458 − (−4766) = 307
for
Topology
2
versus
*
Topology
3
,
and
−4877 − (−5125) = 248
for
Topology
2
versus
*
Topology
4
(
the
experimental
setup
is
described
in
Table
1
)
.
As
one
would
hope
,
this
log-likelihood
ratio
is
positive
in
both
cases
,
indicating
that
the
system
prefers
the
true
topology
over
the
wrong
ones
.
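The comparison itself is straightforward once per-word log-likelihoods are available; a sketch, with None marking a word unreachable under a topology (the unreachable-word convention of footnote 5, and a data layout we invent for illustration):

```python
def loglik_ratio(per_word_true, per_word_alt):
    """Sum per-word log-likelihood differences, skipping any word that is
    unreachable (None) under either topology."""
    total = 0.0
    for word, lt in per_word_true.items():
        la = per_word_alt.get(word)
        if lt is None or la is None:
            continue
        total += lt - la
    return total
```

A positive result indicates that the first topology assigns the shared reachable words higher likelihood.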
While
it
may
seem
,
at first glance
,
that
this
result
is
limited
in
scope
,
knowing
the
relative
arrangement of all groups of four nodes is actually sufficient for constructing a full-fledged phylogenetic tree.

5If a word was not reachable in one of the topologies, it was ignored in both models for the computation of the likelihoods.
Indeed
,
quartet-based
methods
,
which
have
been
very
popular
in
the
computational
biology
community
,
are
precisely
based
on
this
fact
(
Erdos
et
al.
,
1996
)
.
There
is
a
rich
literature
on
this
subject
and
approximate
algorithms
exist
which
are
robust
to
misclassification
of
a
subset
of
quartets
(
Wu
et
al.
,
2007
)
.
4.5
More
experimental
details
This
section
summarizes
the
values
of
the
parameters
we
used
in
these
experiments
,
their
interpretation
,
and
the
effect
of
setting
them
to
other
values
.
The
Dirichlet
prior
on
the
parameters
can
be
interpreted
as
adding
pseudocounts
to
the
corresponding
edits
.
It
is
an
important
way
of
infusing
parsimony
into
the
model
by
setting
the
prior
of
the
self-substitution
parameters
much
higher
than
that
of
the
other
parameters
.
We
used
6.0
as
the
prior
on
the
self-substitution
parameters
,
and
for
all
environments
,
1.1
was
divided
uniformly
across
the
other
edits
.
As
long
as
the
prior
on
self-substitution
is
kept
within
this
rough
order
of
magnitude
,
varying
them
has
a
limited
effect
on
our
results
.
We
also
initialized
the
parameters
with
values
that
encourage
self-substitutions
.
Again
,
the
results
were
robust
to
perturbation
of
initialization
as
long
as
the
value
for
self-substitution
dominates
the
other
parameters
.
The
experiments
used
two
natural
classes
for
vowels
(
rounded
and
unrounded
)
,
and
six
natural
classes
for
consonants
,
based
on
the
place
of
articulation
(
alveolar
,
bilabial
,
labiodental
,
palatal
,
postalveolar
,
and
velar
)
.
We
conducted
experiments
to
evaluate
the
effect
of
using
different
natural
classes
and
found
that
finer
ones
can
help
if
enough
data
is
used
for
training
.
We
defer
the
meticulous
study
of
the
optimal
granularity
to
future
work
,
as
it
would
be
a
more
interesting
experiment
under
a
loglinear
model
.
In
such
a
model
,
contexts
of
different
granularities
can
coexist
,
whereas
such
coexistence
is
not
recognized
by
the
current
model
,
giving
rise
to
many
duplicate
rules
.
We
estimated
the
bigram
phoneme
model
on
the
words
in
the
root
languages
that
were
not
heldout
.
Just
as
in
machine
translation
,
the
language
model
was
found
to
contribute
significantly
to
reconstruction
performance
.
We
tried
to
increase
the
weight
of
the
language
model
by
exponentiating
it
to
a
power
,
as
is
often
done
in
NLP
applications
,
but
we
did
not
find
that
it
had
any
significant
impact
on
performance
.
In
the
reconstruction
experiments
,
when
the
data
was
not
reachable
by
the
model
,
the
word
used
in
the
initialization
was
used
as
the
prediction
,
and
the
evolution
of
these
words
were
ignored
when
re-estimating
the
parameters
.
Words
were
initialized
by
picking
at
random
,
for
each
unobserved
node
,
an
observed
node
's
corresponding
word
.
5
Conclusion
We
have
presented
a
novel
probabilistic
model
of
diachronic
phonology
and
an
associated
inference
procedure
.
Our
experiments
indicate
that
our
model
is
able
to
both
produce
accurate
reconstructions
as
measured
by
edit
distance
and
identify
linguistically
plausible
rules
that
account
for
the
phonological
changes
.
We
believe
that
the
probabilistic
framework
we
have
introduced
for
diachronic
phonology
is
promising
,
and
scaling
it
up
to
richer
phylogenies
may
indeed
reveal
something
insightful
about
language
change
.
6
Acknowledgement
We
would
like
to
thank
Bonnie
Chantarotwong
for
her
help
with
the
IPA
converter
and
our
reviewers
for
their
comments
.
This
work
was
supported
by
a
FQRNT
fellowship
to
the
first
author
,
a
NDSEG
fellowship
to
the
second
author
,
NSF
grant
number
BCS-0631518
to
the
third
author
,
and
a
Microsoft
Research
New
Faculty
Fellowship
to
the
fourth
author
.
