We
present
a
nonparametric
Bayesian
model
of
tree
structures
based
on
the
hierarchical
Dirichlet
process
(
HDP
)
.
Our
HDP-PCFG
model
allows
the
complexity
of
the
grammar
to
grow
as
more
training
data
is
available
.
In
addition
to
presenting
a
fully
Bayesian
model
for
the
PCFG
,
we
also
develop
an
efficient
variational
inference
procedure
.
On
synthetic
data
,
we
recover
the
correct
grammar
without
having
to
specify
its
complexity
in
advance
.
We
also
show
that
our
techniques
can
be
applied
to
full-scale
parsing
applications
by
demonstrating
their
effectiveness
in
learning
state-split
grammars
.
1
Introduction
Probabilistic
context-free
grammars
(
PCFGs
)
have
been
a
core
modeling
technique
for
many
aspects
of
linguistic
structure
,
particularly
syntactic
phrase
structure
in
treebank
parsing
(
Charniak
,
1996
;
Collins
,
1999
)
.
An
important
question
when
learning
PCFGs
is
how
many
grammar
symbols
to
allocate
to
the
learning
algorithm
based
on
the
amount
of
available
data
.
The
question
of
"
how
many
clusters
(
symbols
)
?
"
has
been
tackled
in
the
Bayesian
nonparametrics
literature
via
Dirichlet
process
(
DP
)
mixture
models
(
Antoniak
,
1974
)
.
DP
mixture
models
have
since
been
extended
to
hierarchical
Dirichlet
processes
(
HDPs
)
and
HDP-HMMs
(
Teh
et
al.
,
2006
;
Beal
et
al.
,
2002
)
and
applied
to
many
different
types
of
clustering
/
induction
problems
in
NLP
(
Johnson
et
al.
,
2006
;
Goldwater
et
al.
,
2006
)
.
In
this
paper
,
we
present
the
hierarchical
Dirichlet
process
PCFG
(
HDP-PCFG
)
, a
nonparametric
Bayesian
model
of
syntactic
tree
structures
based
on
Dirichlet
processes
.
Specifically
,
an
HDP-PCFG
is
defined
to
have
an
infinite
number
of
symbols
;
the
Dirichlet
process
(
DP
)
prior
penalizes
the
use
of
more
symbols
than
are
supported
by
the
training
data
.
Note
that
"
nonparametric
"
does
not
mean
"
no
parameters
"
;
rather
,
it
means
that
the
effective
number
of
parameters
can
grow
adaptively
as
the
amount
of
data
increases
,
which
is
a
desirable
property
of
a
learning
algorithm
.
As
models
increase
in
complexity
,
so
does
the
uncertainty
over
parameter
estimates
.
In
this
regime
,
point
estimates
are
unreliable
since
they
do
not
take
into
account
the
fact
that
there
are
different
amounts
of
uncertainty
in
the
various
components
of
the
parameters
.
The
HDP-PCFG
is
a
Bayesian
model
which
naturally
handles
this
uncertainty
.
We
present
an
efficient
variational
inference
algorithm
for
the
HDP-PCFG
based
on
a
structured
mean-field
approximation
of
the
true
posterior
over
parameters
.
The
algorithm
is
similar
in
form
to
EM
and
thus
inherits
its
simplicity
,
modularity
,
and
efficiency
.
Unlike
EM
,
however
,
the
algorithm
is
able
to
take
the
uncertainty
of
parameters
into
account
and
thus
incorporate
the
DP
prior
.
Finally
,
we
develop
an
extension
of
the
HDP-PCFG
for
grammar
refinement
(
HDP-PCFG-GR
)
.
Since
treebanks
generally
consist
of
coarsely-labeled
context-free
tree
structures
,
the
maximum-likelihood
treebank
grammar
is
typically
a
poor
model
as
it
makes
overly
strong
independence
assumptions
.
As
a
result
,
many
generative
approaches
to
parsing
construct
refinements
of
the
treebank
grammar
which
are
more
suitable
for
the
modeling
task
.
Lexical
methods
split
each
pre-terminal
symbol
into
many
subsymbols
,
one
for
each
word
,
and
then
focus
on
smoothing
sparse
lexical
statistics. We use our HDP-PCFG-GR
model
to
automatically
learn
the
number
of
subsymbols
for
each
symbol
.
2
Models
based
on
Dirichlet
processes
At
the
heart
of
the
HDP-PCFG
is
the
Dirichlet
process
(
DP
)
mixture
model
(
Antoniak
,
1974
)
,
which
is
the
nonparametric
Bayesian
counterpart
to
the
classical
finite
mixture
model
.
In
order
to
build
up
an
understanding
of
the
HDP-PCFG
,
we
first
review
the
Bayesian
treatment
of
the
finite
mixture
model
(
Section
2.1
)
.
We
then
consider
the
DP
mixture
model
(
Section
2.2
)
and
use
it
as
a
building
block
for
developing
nonparametric
structured
versions
of
the
HMM
(
Section
2.3
)
and
PCFG
(
Section
2.4
)
.
Our
presentation
highlights
the
similarities
between
these
models
so
that
each
step
along
this
progression
reflects
only
the
key
differences
.
2.1
Bayesian
finite
mixture
model
We
begin
by
describing
the
Bayesian
finite
mixture
model
to
establish
basic
notation
that
will
carry
over to the
more
complex
models
we
consider
later
.
Bayesian
finite
mixture
model
The
model
has
K
components
whose
prior
distribution
is
specified
by
β = (β_1, …, β_K)
.
The Dirichlet hyperparameter α controls how uniform this distribution is: as α increases,
it
becomes
increasingly
likely
that
the
components
have
equal
probability
.
For
each
mixture
component
z ∈ {1, …, K}, the parameters of the component φ_z are drawn from some prior G_0.
Given
the
model
parameters
(β, φ)
,
the
data
points
are
generated
i.i.d.
by
first
choosing
a
component
and
then
generating
from
a
data
model
F
parameterized
by
that
component
.
In
document
clustering
,
for
example
,
each
data
point
x_i
is
a
document
represented
by
its
term-frequency
vector
.
Each
component
(
cluster
)
z
has
multinomial parameters φ_z, which specify a distribution F(·; φ_z) over words
.
It
is
customary
to
use
a
conjugate
Dirichlet
prior
G_0 = Dirichlet(α', …, α') over the multinomial parameters, which can be interpreted as adding α' − 1 pseudocounts for each word.
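To make this generative process concrete, the following minimal Python/NumPy sketch simulates the Bayesian finite mixture model for document clustering just described; the sizes and hyperparameter values are illustrative assumptions, not ones used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V = 5, 1000               # number of components and vocabulary size (illustrative)
alpha, alpha_p = 1.0, 1.0    # Dirichlet hyperparameters alpha and alpha'

# Component probabilities: beta ~ Dirichlet(alpha, ..., alpha)
beta = rng.dirichlet(alpha * np.ones(K))

# Component parameters: phi_z ~ G_0 = Dirichlet(alpha', ..., alpha') over words
phi = rng.dirichlet(alpha_p * np.ones(V), size=K)

# Data points are generated i.i.d.: first choose a component, then generate
# words from the multinomial data model F(.; phi_z).
docs = []
for _ in range(100):
    z = rng.choice(K, p=beta)
    words = rng.choice(V, size=50, p=phi[z])
    docs.append((z, words))
```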
2.2 DP mixture model

We
now
consider
the
extension
of
the
Bayesian
finite
mixture
model
to
a
nonparametric
Bayesian
mixture
model
based
on
the
Dirichlet
process
.
We
focus
on
the
stick-breaking
representation
(
Sethuraman
,
1994
)
of
the
Dirichlet
process
instead
of
the
stochastic
process
definition
(
Ferguson
,
1973
)
or
the
Chinese
restaurant
process
(
Pitman
,
2002
)
.
The
stick-breaking
representation
captures
the
DP
prior
most
explicitly
and
allows
us
to
extend
the
finite
mixture
model
with
minimal
changes
.
Later
,
it
will
enable
us
to
readily
define
structured
models
in
a
form
similar
to
their
classical
versions
.
Furthermore
,
an
efficient
variational
inference
algorithm
can
be
developed
in
this
representation
(
Section
2.6
)
.
The
key
difference
between
the
Bayesian
finite
mixture
model
and
the
DP
mixture
model
is
that
the
latter
has
a
countably
infinite
number
of
mixture
components
while
the
former
has
a
predefined
K.
Note
that
if
we
have
an
infinite
number
of
mixture
components
,
it
no
longer
makes
sense
to
consider
a
symmetric
prior
over
the
component
probabilities
;
the
prior
over
component
probabilities
must
decay
in
some
way
.
The
stick-breaking
distribution
achieves
this
as
follows
.
We write β ∼ GEM(α) to mean that β = (β_1, β_2, …) is distributed according to the stick-breaking distribution.
Here
,
the
concentration
parameter
α
controls
the
number
of
effective
components
.
To draw β ∼ GEM(α), we first generate a countably infinite collection of stick-breaking proportions u_1, u_2, …, where each u_z ∼ Beta(1, α).
The stick-breaking weights β are then defined in terms of the stick proportions:

$$\beta_z = u_z \prod_{z'=1}^{z-1} (1 - u_{z'}). \quad (1)$$
The procedure for generating β can be viewed as iteratively breaking off remaining portions of a unit-length stick (Figure 1).

Figure 1: A sample β ∼ GEM(1).
The
component
probabilities
{β_z}
will
decay
exponentially
in
expectation
,
but
there
is
always
some
probability
of
getting
a
smaller
component
before
a
larger
one
.
The parameter α determines the decay of these probabilities: a larger α implies a slower decay and thus more components.
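The stick-breaking construction is easy to simulate. The sketch below draws a truncated approximation to β ∼ GEM(α); the truncation level is an illustrative practical choice, not part of the definition.

```python
import numpy as np

def sample_gem(alpha, trunc, rng):
    """Draw truncated stick-breaking weights beta ~ GEM(alpha)."""
    u = rng.beta(1.0, alpha, size=trunc)   # stick proportions u_z ~ Beta(1, alpha)
    # beta_z = u_z * prod_{z' < z} (1 - u_{z'}): break off a u_z fraction of
    # whatever remains of the unit-length stick (cf. Figure 1).
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - u)[:-1]))
    return u * remaining

rng = np.random.default_rng(0)
beta = sample_gem(alpha=1.0, trunc=20, rng=rng)
# A larger alpha gives a slower decay and thus more effective components.
```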
Given
the
component
probabilities
,
the
rest
of
the
DP
mixture
model
is
identical
to
the
finite
mixture
model
: the component probabilities are simply drawn as β ∼ GEM(α) instead of from a finite Dirichlet (DP mixture model).

2.3 HDP-HMM

The next stop on the way to the HDP-PCFG is the HDP-HMM, in which each hidden state can be thought of as a mixture component.
The
parameters
of
the
mixture
component
are
the
emission
and
transition
parameters
.
The
main
aspect
that
distinguishes
it
from
a
flat
finite
mixture
model
is
that
the
transition
parameters
themselves
must
specify
a
distribution
over
next
states
.
Hence
,
we
have
not
just
one
top-level
mixture
model
over
states
,
but
also
a
collection
of
mixture
models
,
one
for
each
state
.
In
developing
a
nonparametric
version
of
the
HMM
in
which
the
number
of
states
is
infinite
,
we
need
to
ensure
that
the
transition
mixture
models
of
each
state
share
a
common
inventory
of
possible
next
states
.
We
can
achieve
this
by
tying
these
mixture
models
together
using
the
hierarchical
Dirichlet
process
(
HDP
)
(
Teh et al.
,
2006
)
.
The
stick-breaking
representation
of
an
HDP
is
defined
as
follows
:
first
,
the
top-level
stick-breaking
weights
β
are
drawn
according
to
the
stick-breaking
prior
as
before
.
Then
,
a
new
set
of
stick-breaking
weights
β' is generated based on β:
$$\beta' \sim \mathrm{DP}(\alpha', \beta), \quad (2)$$

where the DP distribution can be characterized in terms of the following finite partition property: for all partitions (A_1, …, A_K) of the positive integers into sets,

$$(\beta'(A_1), \ldots, \beta'(A_K)) \sim \mathrm{Dirichlet}(\alpha'\beta(A_1), \ldots, \alpha'\beta(A_K)), \quad (3)$$

where β(A) = Σ_{k ∈ A} β_k.¹
The resulting β' is another distribution over the positive integers whose similarity to β is controlled by a concentration parameter α'.
HDP-HMM
β ∼ GEM(α) [draw top-level state weights]
For each state z ∈ {1, 2, …}:
    φ_z^E ∼ G_0 [draw emission parameters]
    φ_z^T ∼ DP(α', β) [draw transition parameters]
For each time step i ∈ {1, …, n}:
    x_i ∼ Multinomial(φ_{z_i}^E) [emit current observation]
    z_{i+1} ∼ Multinomial(φ_{z_i}^T) [choose next state]
Each state z is associated with emission parameters. In addition, each z is also associated with transition parameters φ_z^T, which specify a distribution over next states.
These
transition
parameters
are
drawn
from
a
DP
centered
on
the
top-level
stick-breaking
weights
β
according
to
Equations
(
2
)
and
(
3
)
.
Assume that z_1 is always fixed to a special START state, so we do not need to generate it.
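For concreteness, the sketch below simulates a truncated HDP-HMM: at truncation level K, each transition draw φ_z^T ∼ DP(α', β) reduces to a finite Dirichlet centered on the shared β, which is what ties the states' transition mixture models together. All sizes and hyperparameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, n = 20, 50, 10              # truncation, vocabulary size, sequence length
alpha, alpha_p, alpha_E = 1.0, 10.0, 1.0

u = rng.beta(1.0, alpha, size=K)  # truncated beta ~ GEM(alpha)
beta = u * np.concatenate(([1.0], np.cumprod(1.0 - u)[:-1]))
beta /= beta.sum()                # renormalize after truncation

# Each state's transition distribution is centered on the *shared* beta,
# so all states agree on a common inventory of possible next states.
phi_T = np.array([rng.dirichlet(alpha_p * beta) for _ in range(K)])
phi_E = rng.dirichlet(alpha_E * np.ones(V), size=K)   # emission parameters

z = rng.choice(K, p=beta)         # stand-in for the special START state
xs = []
for _ in range(n):
    xs.append(rng.choice(V, p=phi_E[z]))   # emit current observation
    z = rng.choice(K, p=phi_T[z])          # choose next state
```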
2.4 HDP-PCFG

We
now
present
the
HDP-PCFG
,
which
is
the
focus
of
this
paper
.
For
simplicity
,
we
consider
Chomsky
normal
form
(
CNF
)
grammars
,
which
have
two
types
of
rules
:
emissions
and
binary
productions
.
We
consider
each
grammar
symbol
as
a
mixture
component
whose
parameters
are
the
rule
probabilities
for
that
symbol
.
In
general
,
we
do
not
know
the
appropriate
number
of
grammar
symbols
,
so
our
strategy
is
to
let
the
number
of
grammar
symbols
be
infinite
and
place
a
DP
prior
over
grammar
symbols
.
¹Note
that
this
property
is
a
specific
instance
of
the
general
stochastic
process
definition
of
Dirichlet
processes
.
HDP-PCFG
β ∼ GEM(α) [draw top-level symbol weights]
For each grammar symbol z ∈ {1, 2, …}:
    φ_z^T ∼ Dirichlet(α^T) [draw rule type parameters]
    φ_z^E ∼ Dirichlet(α^E) [draw emission parameters]
    φ_z^B ∼ DP(α^B, ββ^T) [draw binary production parameters]
For each node i in the parse tree:
    t_i ∼ Multinomial(φ_{z_i}^T) [choose rule type]
    If t_i = EMISSION: x_i ∼ Multinomial(φ_{z_i}^E) [emit terminal symbol]
    If t_i = BINARY-PRODUCTION: (z_{L(i)}, z_{R(i)}) ∼ Multinomial(φ_{z_i}^B) [generate children symbols]

Figure 2: The definition and graphical model of the HDP-PCFG.
Since
parse
trees
have
unknown
structure
,
there
is
no
convenient
way
of
representing
them
in
the
visual
language
of
traditional
graphical
models
.
Instead
,
we
show
a
simple
fixed
example
tree
.
Node
1
has
two
children
,
2
and
3
,
each
of
which
has
one
observed
terminal
child
.
We use L(i) and R(i) to denote the left and right children of node i.
In
the
HMM
,
the
transition
parameters
of
a
state
specify
a
distribution
over
single
next
states
;
similarly
,
the
binary
production
parameters
of
a
grammar
symbol
must
specify
a
distribution
over
pairs
of
grammar
symbols
for
its
children
.
We
adapt
the
HDP
machinery
to
tie
these
binary
production
distributions
together
.
The
key
difference
is
that
now
we
must
tie
distributions
over
pairs
of
grammar
symbols
together
via
distributions
over
single
grammar
symbols
.
Another
difference
is
that
in
the
HMM
,
at
each
time
step
,
both
a
transition
and
an
emission
are
made
,
whereas
in
the
PCFG
either
a
binary
production
or
an
emission
is
chosen
.
Therefore
,
each
grammar
symbol
must
also
have
a
distribution
over
the
type
of
rule
to
apply
.
In
a
CNF
PCFG
,
there
are
only
two
types
of
rules
,
but
this
can
be
easily
generalized
to
include
unary
productions
,
which
we
use
for
our
parsing
experiments
.
To
summarize
,
the parameters of each grammar symbol z consist of (1) a distribution over a finite number of rule types φ_z^T, (2) an emission distribution φ_z^E over terminal symbols, and (3) a binary production distribution φ_z^B over pairs of child grammar symbols.
Figure
2
describes
the
model
in
detail
.
Figure
3
shows
the
generation
of
the
binary
production
distributions
.
We draw φ_z^B from a DP centered on ββ^T, which is the product distribution over pairs of symbols.
The result is a doubly-infinite matrix where most of the probability mass is concentrated in the upper left, just like the top-level distribution ββ^T.

Figure 3: The generation of binary production probabilities given the top-level symbol probabilities β. First, β is drawn from the stick-breaking prior, as in any DP-based model (a). Next, the outer product ββ^T is formed, resulting in a doubly-infinite matrix (b). We use this as the base distribution for generating the binary production distribution from a DP centered on ββ^T (c).
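Concretely, at truncation level K the construction in Figure 3 amounts to drawing each symbol's binary production distribution from a finite Dirichlet whose mean is the K × K slice of ββ^T (cf. footnote 3). A minimal sketch, with illustrative values for K, α, and α^B:

```python
import numpy as np

rng = np.random.default_rng(0)
K, alpha, alpha_B = 10, 1.0, 5.0

u = rng.beta(1.0, alpha, size=K)   # (a): beta ~ GEM(alpha), truncated at K
beta = u * np.concatenate(([1.0], np.cumprod(1.0 - u)[:-1]))
beta /= beta.sum()

base = np.outer(beta, beta)        # (b): the K x K slice of beta beta^T

# (c): phi_B[z] ~ DP(alpha_B, beta beta^T); under truncation this is a
# Dirichlet over the K^2 child pairs, with mass concentrated in the upper
# left just like the base distribution.
phi_B = np.stack([rng.dirichlet(alpha_B * base.ravel()).reshape(K, K)
                  for _ in range(K)])
```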
Note
that
we
have
replaced
the
general (G_0, F) pair with Dirichlet(α^E) and Multinomial(φ^E)
to
specialize
to
natural
language
,
but
there
is
no
difficulty
in
working
with
parse
trees
with
arbitrary
non-multinomial
observations
or
more
sophisticated
word
models
.
In
many
natural
language
applications
,
there
is
a
hard
distinction
between
pre-terminal
symbols
(
those
that
only
emit
a
word
)
and
non-terminal
symbols
(
those
that
only
rewrite
as
two
non-terminal
or
pre-terminal
symbols
)
.
This
can
be
accomplished
by
letting
α^T = (0, 0), which forces a draw φ^T to assign probability 1 to one rule type.
An
alternative
definition
of
an
HDP-PCFG
would
be
as
follows
:
for
each
symbol
z
,
draw
a
distribution
over
left child symbols l_z ∼ DP(β) and an independent distribution over right child symbols r_z ∼ DP(β). Then define the binary production distribution as their cross-product φ_z^B = l_z r_z^T.
This
also
yields
a
distribution
over
symbol
pairs
and
hence
defines
a
different
type
of
nonparametric
PCFG
.
This
model
is
simpler
and
does
not
require
any
additional
machinery
beyond
the
HDP-HMM
.
However
,
the
modeling
assumptions
imposed
by
this
alternative
are
unappealing
as
they
assume
the
left
child
and
right
child
are
independent
given
the
parent
,
which
is
certainly
not
the
case
in
natural
language
.
2.5
HDP-PCFG
for
grammar
refinement
An
important
motivation
for
the
HDP-PCFG
is
that
of
refining
an
existing
treebank
grammar
to
alleviate
unrealistic
independence
assumptions
and
to
improve
parsing
accuracy
.
In
this
scenario
,
the
set
of
symbols
is
known
,
but
we
do
not
know
how
many
subsymbols
to
allocate
per
symbol
.
We
introduce
the
HDP-PCFG
for
grammar
refinement
for
this
task
.
The
essential
difference
is
that
now
we
have
a
collection
of
HDP-PCFG
models
for
each
symbol
s
∈
S
,
each
one
operating
at
the
subsymbol
level
.
While
these
HDP-PCFGs
are
independent
in
the
prior
,
they
are
coupled
through
their
interactions
in
the
parse
trees
.
For
completeness
,
we
have
also
included
unary
productions
,
which
are
essentially
the
PCFG
counterpart
of
transitions
in
HMMs
.
Finally
,
since
each
node
i
in
the
parse
tree
involves
a
symbol-subsymbol pair (s_i, z_i)
,
each
subsymbol
needs
to
specify
a
distribution
over
both
child
symbols
and
subsymbols
.
The
former
can
be
handled
through
a
finite
Dirichlet
distribution
since
all
symbols
are
known
and
observed
,
but
the
latter
must
be
handled
with
the
Dirichlet
process
machinery
,
since
the
number
of
subsymbols
is
unknown
.
2.6
Variational
inference
We
present
an
inference
algorithm
for
the
HDP-PCFG
model
described
in
Section
2.4
,
which
can
also
be
adapted
to
the
HDP-PCFG-GR
model
with
a
bit
more
bookkeeping
.
Most
previous
inference
algorithms
for
DP-based
models
involve
sampling
(
Escobar
and
West
,
1995
;
Teh
et
al.
,
2006
)
.
However
,
we
chose
to
use
variational
inference
(
Blei
and
Jordan
,
2005
)
,
which
provides
a
fast
deterministic
alternative
to
sampling
,
hence
avoiding
issues
of
diagnosing
convergence
and
aggregating
samples
.
Furthermore
,
our
variational
inference
algorithm
establishes
a
strong
link
with
past
work
on
PCFG
refinement
and
induction
,
which
has
traditionally
employed
the
EM
algorithm
.
In
EM
,
the
E-step
involves
a
dynamic
program
that
exploits
the
Markov
structure
of
the
parse
tree
,
and
the
M-step
involves
computing
ratios
based
on
expected
counts
extracted
from
the
E-step
.
Our variational algorithm resembles the EM algorithm in form, but the ratios in the M-step are replaced with weights that reflect the uncertainty in parameter estimates.

[Model box: HDP-PCFG for grammar refinement (HDP-PCFG-GR), specifying the draws for each symbol s ∈ S and for each node i in the parse tree.]

Figure 4: We approximate the true posterior p over parameters θ and latent parse trees z using a structured mean-field distribution q, in which the distribution over parameters is completely factorized but the distribution over parse trees is unconstrained.
Because
of
this
procedural
similarity
,
our
method
is
able
to
exploit
the
desirable
properties
of
EM
such
as
simplicity
,
modularity
,
and
efficiency
.
2.7
Structured
mean-field
approximation
We denote the parameters of the HDP-PCFG as θ = (β, φ), where β denotes the top-level symbol probabilities and φ denotes the rule probabilities.
The
hidden
variables
of
the
model
are
the
training
parse
trees
z.
We
denote
the
observed
sentences
as
x.
The
goal
of
Bayesian
inference
is
to
compute
the
posterior
distribution
p(θ, z | x)
.
The
central
idea
behind
variational
inference
is
to
approximate
this
intractable
posterior
with
a
tractable
approximation
.
In
particular
,
we
want
to
find
the
best
distribution
q* as defined by

$$q^* = \operatorname*{argmin}_{q \in Q} \ \mathrm{KL}\left(q(\theta, z) \,\|\, p(\theta, z \mid x)\right), \quad (4)$$
where
Q
is
a
tractable
subset
of
distributions
.
We
use
a
structured
mean-field
approximation
,
meaning
that
we
only
consider
distributions
that
factorize as follows (Figure 4):

$$q(\theta, z) = q(\beta)\, \Big( \prod_{z=1}^{K} q(\phi_z^T)\, q(\phi_z^E)\, q(\phi_z^B) \Big)\, q(z), \quad (5)$$

where q(β) is a degenerate distribution truncated at K; i.e., β_z = 0 for z > K.
While
the
posterior
grammar
does
have
an
infinite
number
of
symbols
,
the
exponential
decay
of
the
DP
prior
ensures
that
most
of
the
probability
mass
is
contained
in
the
first
few
symbols
(Ishwaran and James, 2001).²
While
our
variational
approximation
q
is
truncated
,
the
actual
PCFG
model
is
not
.
As
K
increases
,
our
approximation
improves
.
2.8
Coordinate-wise
ascent
The
optimization
problem
defined
by
Equation
(
4
)
is
intractable
and
nonconvex
,
but
we
can
use
a
simple
coordinate-ascent
algorithm
that
iteratively
optimizes
each
factor
of
q
in
turn
while
holding
the
others
fixed
.
The
algorithm
turns
out
to
be
similar
in
form
to
EM
for
an
ordinary
PCFG
:
optimizing q(z) is the analogue of the E-step, and optimizing q(φ) is the analogue of the M-step; however, optimizing q(β)
has
no
analogue
in
EM
.
We
summarize
each
of
these
updates
below
(
see
(
Liang
et
al.
,
2007
)
for
complete
derivations
)
.
Parse trees q(z): The distribution over parse trees q(z) can be summarized by the expected sufficient statistics (rule counts), which we denote as C(z → z_l z_r) for binary productions and C(z → x) for emissions.
We
can
compute
these
expected
counts
using
dynamic
programming
as
in
the
E-step
of
EM
.
While
the
classical
E-step
uses
the
current
rule
probabilities
φ, our mean-field approximation involves an entire distribution q(φ).
Fortunately
,
we
can
still
handle
this
case
by
replacing
each
rule
probability
with
a
weight
that
summarizes
the
uncertainty
over
the
rule
probability
as
represented
by
q.
We
define
this
weight
in
the
sequel
.
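For intuition, here is a minimal sketch of the inside pass of this dynamic program for a CNF grammar, written directly in terms of rule weights rather than probabilities; the expected counts then follow from the standard inside-outside recurrences. The data structures and names are our own illustrative choices, not those of an existing implementation.

```python
from collections import defaultdict

def inside_pass(sent, K, emit_w, binary_w):
    """CKY-style inside scores using rule weights in place of probabilities.

    emit_w[z][word] is the weight of the emission z -> word, and
    binary_w[z][(zl, zr)] the weight of the binary production z -> zl zr.
    Returns inside[(i, j, z)]: the total weight of all trees rooted at
    symbol z spanning words i..j (inclusive).
    """
    n = len(sent)
    inside = defaultdict(float)
    for i, word in enumerate(sent):
        for z in range(K):
            inside[(i, i, z)] = emit_w[z].get(word, 0.0)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for z in range(K):
                total = 0.0
                for k in range(i, j):                     # split point
                    for (zl, zr), w in binary_w[z].items():
                        total += w * inside[(i, k, zl)] * inside[(k + 1, j, zr)]
                inside[(i, j, z)] = total
    return inside
```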
It
is
a
common
perception
that
Bayesian
inference
is
slow
because
one
needs
to
compute
integrals
.
Our
mean-field
inference
algorithm
is
a
counterexample
:
because
we
can
represent
uncertainty
over
rule
probabilities
with
single
numbers
,
much
of
the
existing
PCFG
machinery
based
on
EM
can
be
modularly
imported
into
the
Bayesian
framework
.
Rule probabilities q(φ): For an ordinary PCFG, the M-step simply involves taking ratios of expected counts; for example, for binary rules,

$$\phi_z^B(z_l, z_r) = \frac{C(z \to z_l z_r)}{\sum_{z_l', z_r'} C(z \to z_l' z_r')}. \quad (6)$$
²In
particular
,
the
variational
distance
between
the
stick-breaking
distribution
and
the
truncated
version
decreases
exponentially
as
the
truncation
level
K
increases
.
For the variational HDP-PCFG, the optimal q(φ) is given by the standard posterior update for Dirichlet distributions:³

$$q(\phi_z^B) = \mathrm{Dirichlet}\big(\phi_z^B;\ \alpha^B \beta\beta^T + C(z)\big), \quad (7)$$

where C(z) is the matrix of counts of rules with left-hand side z.
These distributions can then be summarized with multinomial weights, which are the only necessary quantities for updating q(z) in the next iteration:

$$W_z^B(z_l, z_r) = \frac{\exp \Psi\big(C(z \to z_l z_r) + \alpha^B \beta_{z_l} \beta_{z_r}\big)}{\exp \Psi\big(\sum_{z_l', z_r'} C(z \to z_l' z_r') + \alpha^B\big)}, \quad (9)$$

where Ψ(·) is the digamma function.
The
emission
parameters
can
be
defined
similarly
.
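The weight computation itself is a small amount of code. The sketch below applies the exp(Ψ(·)) transformation of Equation (9) to a vector of expected counts plus Dirichlet pseudocounts; the numerical values are illustrative.

```python
import numpy as np
from scipy.special import digamma

def multinomial_weights(counts, pseudocounts):
    """Mean-field multinomial weights: exp(psi(count + prior)) / exp(psi(total))."""
    post = counts + pseudocounts
    return np.exp(digamma(post) - digamma(post.sum()))

counts = np.array([50.0, 5.0, 0.5])          # expected rule counts for one LHS
weights = multinomial_weights(counts, np.full(3, 0.1))
# The weights sum to less than one, and small counts are reduced by a larger
# fraction than large counts (Figure 5): a rich-get-richer effect.
```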
Inspection
of
Equations
(
6
)
and
(
9
)
reveals
that
the
only
difference
between
the
maximum
likelihood
and
the
mean-field
update
is
that
the
latter
applies
the
exp(Ψ(·))
function
to
the
counts
(
Figure
5
)
.
When the truncation K is large, α^B β_{z_l} β_{z_r} is near 0 for most right-hand sides (z_l, z_r), so exp(Ψ(·)) has the effect of downweighting counts.
Since this subtraction removes a larger fraction of small counts than of large counts, there is a rich-get-richer effect: rules that already have large counts will be preferred.
Specifically
,
consider
a
set
of
rules
with
the
same
left-hand
side
.
The
weights
for
all
these
rules
only
differ
in
the
numerator
(
Equation
(
9
)
)
,
so
applying
exp(Ψ(·))
creates
a
local
preference
for
right-hand
sides
with
larger
counts
.
Also
note
that
the
rule
weights
are
not
normalized
;
they
always
sum
to
at
most
one
and
are
equal
to
one
exactly
when
q(φ)
is
degenerate
.
This
lack
of
normalization
gives
an
extra
degree
of
freedom
not
present
in
maximum
likelihood
estimation
:
it
creates
a
global
preference
for
left-hand
sides
that
have
larger
total
counts
.
³Because we have truncated the top-level symbol weights, the DP prior on φ_z^B reduces to a finite Dirichlet distribution.
Figure
5
:
The
exp(Ψ(·))
function
,
which
is
used
in
computing
the
multinomial
weights
for
mean-field
inference
.
It
has
the
effect
of
reducing
a
larger
fraction
of
small
counts
than
large
counts
.
Top-level symbol probabilities q(β): Unlike q(φ) and q(z), there is no closed-form expression for the optimal β*, and the objective function (Equation (4)) is not convex in β*.
Nonetheless, we can apply a standard gradient projection method (Bertsekas, 1999) to improve β* to a local maximum.
The
part
of
the
objective
function
in
Equation
(
4
)
that
depends
on
β*
is
as
follows
:
See
Liang
et
al.
(
2007
)
for
the
derivation
of
the
gradient
.
In
practice
,
this
optimization
has
very
little
effect
on
performance
.
We
suspect
that
this
is
because
the
objective
function
is
dominated
by
p
(
x
|
z
)
and
p
(
z
|
0
)
,
while
the
contribution
of
p
(
0
|
ff
)
is
minor
.
3
Experiments
We
now
present
an
empirical
evaluation
of
the
HDP-PCFG(-GR)
model
and
variational
inference
techniques
.
We
first
give
an
illustrative
example
of
the
ability
of
the
HDP-PCFG
to
recover
a
known
grammar
and
then
present
the
results
of
experiments
on
large-scale
treebank
parsing
.
3.1
Recovering
a
synthetic
grammar
In
this
section
,
we
show
that
the
HDP-PCFG-GR
can
recover
a
simple
grammar
while a standard PCFG fails to do so because it has no built-in control over grammar complexity.

Figure 6: (a) A synthetic grammar with a uniform distribution over rules. (b) The grammar generates trees of the form shown on the right.
From
the
grammar
in
Figure
6
,
we
generated
2000
trees
.
The
two
terminal
symbols
always
have
the
same
subscript
,
but
we
collapsed each X_i to X
in
the
training
data
.
We
trained
the
HDP-PCFG-GR
,
with
truncation
K
=
20
,
for
both
S
and
X
for
100
iterations
.
We
set
all
hyperparameters
to
1
.
Figure
7
shows
that
the
HDP-PCFG-GR
recovers
the
original
grammar
,
which
contains
only
4
subsymbols
,
leaving
the
other
16
subsymbols
unused
.
The
standard
PCFG
allocates
all
the
subsymbols
to
fit
the
exact
co-occurrence
statistics
of
left
and
right
terminals
.
Recall
that
a
rule
weight
,
as
defined
in
Equation
(
9
)
,
is
analogous
to
a
rule
probability
for
standard
PCFGs
.
We say a rule is effective if its weight is at least 10⁻⁶ and its left-hand side also has posterior probability at least 10⁻⁶.
In general, rules with weight smaller than 10⁻⁶ can be safely pruned without affecting parsing accuracy.
The
standard
PCFG
uses
all
20
subsymbols
of
both
S
and
X
to
explain
the
data
,
resulting
in
8320
effective
rules
;
in
contrast
,
the
HDP-PCFG
uses
only
4
subsymbols
for
X
and
1
for
S
,
resulting
in
only
68
effective
rules
.
If
the
threshold
is
relaxed
from
10⁻⁶ to 10⁻³
,
then
only
20
rules
are
effective
,
which
corresponds
exactly
to
the
true
grammar
.
3.2
Parsing
the
Penn
Treebank
In
this
section
,
we
show
that
our
variational
HDP-PCFG
can
scale
up
to
real-world
data
sets
.
We
ran
experiments
on
the
Wall
Street
Journal
(
WSJ
)
portion
of
the
Penn
Treebank
.
We
trained
on
sections
2-21
,
used
section
24
for
tuning
hyperparameters
,
and
tested
on
section
22
.
We
binarize
the
trees
in
the
treebank
as
follows
:
for
each
non-terminal
node
with
symbol
X
,
we introduce a right-branching cascade of new nodes with symbol X. The end result is that each node has at most two children.

Figure 7: The posteriors over the subsymbols of the standard PCFG are roughly uniform, whereas the posteriors of the HDP-PCFG are concentrated on four subsymbols, which is the true number of symbols in the grammar.
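A minimal sketch of the binarization just described, with trees represented as nested (symbol, children) pairs; this representation is an illustrative choice, not the one used in our implementation.

```python
def binarize(tree):
    """Right-branching binarization: introduce a cascade of new nodes labeled
    with the parent symbol X until every node has at most two children."""
    symbol, children = tree
    if isinstance(children, str):          # terminal node: (tag, word)
        return tree
    children = [binarize(child) for child in children]
    while len(children) > 2:
        children = children[:-2] + [(symbol, children[-2:])]
    return (symbol, children)

tree = ("NP", [("DT", "the"), ("JJ", "big"), ("NN", "dog")])
# binarize(tree) == ('NP', [('DT', 'the'),
#                           ('NP', [('JJ', 'big'), ('NN', 'dog')])])
```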
To
cope
with
unknown
words
,
we
replace
any
word
appearing
fewer
than
5
times
in
the
training
set
with
one
of
50
unknown
word
tokens
derived
from
10
word-form
features
.
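A sketch of this preprocessing step; the particular word-form features below are illustrative stand-ins for the kind of features meant, not the actual set of 10 used in our experiments.

```python
from collections import Counter

def unk_token(word):
    """Map a rare word to an unknown-word token via crude word-form features
    (illustrative; the paper derives 50 tokens from 10 features)."""
    feats = []
    if word[0].isupper():
        feats.append("CAP")
    if any(ch.isdigit() for ch in word):
        feats.append("NUM")
    if "-" in word:
        feats.append("DASH")
    if word.endswith("ing") or word.endswith("ed"):
        feats.append("SUF")
    return "UNK" + ("-" + "-".join(feats) if feats else "")

def replace_rare(sentences, min_count=5):
    counts = Counter(w for sent in sentences for w in sent)
    return [[w if counts[w] >= min_count else unk_token(w) for w in sent]
            for sent in sentences]
```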
Our
goal
is
to
learn
a
refined
grammar
,
where
each
symbol
in
the
training
set
is
split
into
K
subsymbols
.
We
compare
an
ordinary
PCFG
estimated
with
maximum
likelihood
(
Matsuzaki
et
al.
,
2005
)
and
the
HDP-PCFG
estimated
using
the
variational
inference
algorithm
described
in
Section
2.6
.
To
parse
new
sentences
with
a
grammar
,
we
compute
the
posterior
distribution
over
rules
at
each
span
and
extract
the
tree
with
the
maximum
expected
correct
number
of
rules
(
Petrov
and
Klein
,
2007
)
.
There
are
six
hyperparameters
in
the
HDP-PCFG-GR
model
,
which
we
set
in
the
following
manner
:
α = 1, α^T = 1 (uniform distribution over unaries versus binaries), α^E = 1 (uniform distribution over terminal words), and α^U(s) = α^B(s) = …, where N(s) is the number of different unary (binary) right-hand sides of rules with left-hand side s in the treebank grammar.
The
two
most
important
hyperparameters
are
α^U and α^B
,
which
govern
the
sparsity
of
the
right-hand
side
for
unary
and
binary
rules
.
We
set
α^U = α^B,
although
more
performance
could
probably
be
gained
by
tuning
these
individually
.
It
turns
out
that
there
is
not
a
single
α^B
that
works
for
all
truncation
levels
,
as
shown
in
Table
1
.
If the top-level distribution β is uniform, the value of α^B corresponding to a uniform prior over pairs of children subsymbols is K². Interestingly, the optimal α^B appears to be superlinear but subquadratic in K. We used these values of α^B in the following experiments.

Table 1: For each truncation level K, we report the α^B that yielded the highest F₁ score on the development set. [Columns: truncation K, uniform α^B, best α^B.]

Table 2: Development F₁ and grammar sizes (the number of effective rules) for the PCFG, the smoothed PCFG, and the HDP-PCFG as we increase the truncation K.
The
regime
in
which
Bayesian
inference
is
most
important
is
when
training
data
is
scarce
relative
to
the
complexity
of
the
model
.
We
train
on
just
section
2
of
the
Penn
Treebank
.
Table
2
shows
how
the
HDP-PCFG-GR
can
produce
compact
grammars
that
guard
against
overfitting
.
Without
smoothing
,
ordinary
PCFGs
trained
using
EM
improve
as
K
increases
but
start
to
overfit
around
K
=
4
.
Simple
add-1.01
smoothing
prevents
overfitting
but
at
the
cost
of
a
sharp
increase
in
grammar
sizes
.
The
HDP-PCFG
obtains
comparable
performance
with
a
much
smaller
number
of
rules
.
We
also
trained
on
sections
2-21
to
demonstrate
that
our
methods
can
scale
up
and
achieve
broadly
comparable
results
to
existing
state-of-the-art
parsers
.
When
using
a
truncation
level
of
K
=
16
,
the
standard
PCFG
with
smoothing
obtains
an
F₁
score
of
88.36
using
706157
effective
rules
while
the
HDP-PCFG-GR
obtains
an
F₁
score
of
87.08
using
428375
effective
rules
.
We
expect
to
see
greater
benefits
from
the
HDP-PCFG
with
a
larger
truncation
level
.
4
Related
work
The
question
of
how
to
select
the
appropriate
grammar
complexity
has
been
studied
in
earlier
work
.
It
is
well
known
that
more
complex
models
necessarily
have
higher
likelihood
and
thus
a
penalty
must
be
imposed
for
more
complex
grammars
.
Examples
of
such
penalized
likelihood
procedures
include
Stolcke
and
Omohundro
(
1994
)
,
which
used
an
asymptotic
Bayesian
model
selection
criterion
and
Petrov
et
al.
(
2006
)
,
which
used
a
split-merge
algorithm
that
procedurally
determines
when
to
switch
between
grammars
of
various
complexities
.
These
techniques
are
model
selection
techniques
that
use
heuristics
to
choose
among
competing
statistical
models
;
in
contrast
,
the
HDP-PCFG
relies
on
the
Bayesian
formalism
to
provide
implicit
control
over
model
complexity
within
the
framework
of
a
single
probabilistic
model
.
Johnson
et
al.
(
2006
)
also
explored
nonparametric
grammars
,
but
they
do
not
give
an
inference
algorithm
for
recursive
grammars
,
e.g.
,
grammars
including
rules
of
the
form
A → B C and B → D A
.
Recursion
is
a
crucial
aspect
of
PCFGs
and
our
inference
algorithm
does
handle
it
.
Finkel
et
al.
(
2007
)
independently
developed
another
nonparametric
model
of
grammars
.
Though
their
model
is
also
based
on
hierarchical
Dirichlet
processes
and
is
similar
to
ours
,
they
present
a
different
inference
algorithm
which
is
based
on
sampling
.
Kurihara
and
Sato
(
2004
)
and
Kurihara
and
Sato
(
2006
)
applied
variational
inference
to
PCFGs
.
Their
algorithm
is
similar
to
ours
,
but
they
did
not
consider
nonparametric
models
.
5
Conclusion
We
have
presented
the
HDP-PCFG
,
a
nonparametric
Bayesian
model
for
PCFGs
,
along
with
an
efficient
variational
inference
algorithm
.
While
our
primary
contribution
is
the
elucidation
of
the
model
and
algorithm
,
we
have
also
explored
some
important
empirical
properties
of
the
HDP-PCFG
and
also
demonstrated
the
potential
of
variational
HDP-PCFGs
on
a
full-scale
parsing
task
.
