One
may
need
to
build
a
statistical
parser
for
a
new
language
,
using
only
a
very
small
labeled
treebank
together
with
raw
text
.
We
argue
that
bootstrapping
a
parser
is
most
promising
when
the
model
uses
a
rich
set
of
redundant
features
,
as
in
recent
models
for
scoring
dependency
parses
(
McDonald
et
al.
,
2005
)
.
Drawing
on
Abney
's
(
2004
)
analysis
of
the
Yarowsky
algorithm
,
we
perform
bootstrapping
by
entropy
regularization
:
we
maximize
a
linear
combination
of
conditional
likelihood
on
labeled
data
and
confidence
(
negative
Renyi
entropy
)
on
unlabeled
data
.
In
initial
experiments
,
this
surpassed
EM
for
training
a
simple
feature-poor
generative
model
,
and
also
improved
the
performance
of
a
feature-rich
,
conditionally
estimated
model
where
EM
could
not
easily
have
been
applied
.
We
discuss
how
our
feature
set
could
be
extended
with
cross-lingual
or
cross-domain
features
,
to
incorporate
knowledge
from
parallel
or
comparable
corpora
during
bootstrapping
.
1
Motivation
In
this
paper
,
we
address
the
problem
of
bootstrapping
new
statistical
parsers
for
new
languages
,
genres
,
or
domains
.
Why
is
this
problem
important
?
Many
applications
of
multilingual
NLP
require
parsing
in
order
to
extract
information
,
opinions
,
and
answers
from
text
,
and
to
produce
improved
translations
.
Yet
an
adequate
labeled
training
corpus
—
a
large
treebank
of
manually
constructed
parse
trees
of
typical
sentences
—
is
rarely
available
and
would
be
prohibitively
expensive
to
develop
.
We
show
how
it
is
possible
to
train
instead
from
a
small
hand-labeled
treebank
in
the
target
domain
,
together
with
a
large
unannotated
collection
of
in-domain
sentences
.
Additional
resources
such
as
parsers
for
other
domains
or
languages
can
be
integrated
naturally
.
Dependency
parsing
is
important
as
a
key
component
in
leading
systems
for
information
extraction
(
Weischedel
,
2004
)
1
and
question
answering
(
Peng
et
al.
,
2005
)
.
These
systems
rely
on
edges
or
paths
in
dependency
parse
trees
to
define
their
extraction
patterns
and
classification
features
.
Parsing
is
also
key
to
the
latest
advances
in
machine
translation
,
which
translate
syntactic
phrases
(
Galley
et
al.
,
2006
;
Marcu
et
al.
,
2006
;
Cowan
et
al.
,
2006
)
.
2
Our
Approach
Our
approach
rests
on
three
observations
:
•
Recent
"
feature-based
"
parsing
models
are
an
excellent
fit
for
bootstrapping
,
because
the
parse
is
often
overdetermined
by
many
redundant
features
.
•
The
feature-based
framework
is
flexible
enough
to
incorporate
other
sources
of
guidance
during
training
or
testing
—
such
as
the
knowledge
contained
in
a
parser
for
another
language
or
domain
.
•
Maximizing
a
combination
of
likelihood
on
labeled
data
and
confidence
on
unlabeled
data
is
a
principled
approach
to
bootstrapping
.
2.1
Feature-Based
Parsing
McDonald
et
al.
(
2005
)
introduced
a
simple
,
flexible
framework
for
scoring
dependency
parses
.
Each
directed
edge
e
in
the
dependency
tree
is
described
with
a
high-dimensional
feature
vector
f
(
e
)
.
The
edge
's
score
is
the
dot product $f(e) \cdot \theta$, where $\theta$ is a learned weight vector.
The
overall
score
of
a
dependency
tree
is
the
sum
of
the
scores
of
all
edges
in
the
tree
.
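The edge-factored scoring just described can be sketched in a few lines. This is a minimal illustration, not the MSTParser implementation: the dense feature lists, the `(parent, child)` edge keys, and the function names `edge_score` and `tree_score` are assumptions made for the example.

```python
def edge_score(features, theta):
    """Score one edge: dot product of its feature vector with the weights."""
    return sum(f * w for f, w in zip(features, theta))

def tree_score(tree_edges, edge_features, theta):
    """Score a tree as the sum of the scores of its edges."""
    return sum(edge_score(edge_features[e], theta) for e in tree_edges)

# Toy example with 3 hypothetical features; edges are (parent, child) pairs.
theta = [1.0, -0.5, 2.0]
edge_features = {
    (0, 1): [1.0, 0.0, 1.0],
    (1, 2): [0.0, 1.0, 0.0],
    (0, 2): [1.0, 1.0, 0.0],
}
tree = [(0, 1), (1, 2)]
total = tree_score(tree, edge_features, theta)
```

In a real parser the feature vectors are sparse and very high-dimensional, so a sparse representation would replace the dense lists used here.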
1 Ralph Weischedel (p.c.) reports that this system's performance degrades considerably when only phrase chunking is available rather than full parsing.

Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 667-677, Prague, June 2007. © 2007 Association for Computational Linguistics

Given an $n$-word input sentence, the parser begins by scoring each of the $O(n^2)$ possible edges, and then seeks the highest-scoring legal dependency tree formed by any $n - 1$ of these edges, using an $O(n^3)$ dynamic programming algorithm (Eisner, 1996) for projective trees. For non-projective parsing, $O(n^3)$, or with some trickery $O(n^2)$, greedy algorithms exist (Chu and Liu, 1965; Edmonds, 1967; Gabow et al., 1986).
The
feature
function
f
may
pay
attention
to
many
properties
of
the
directed
edge
e.
Of
course
,
features
may
consider
the
parent
and
child
words
connected
by
e
,
and
their
parts
of
speech.2
But
some
features
used
by
McDonald
et
al.
(
2005
)
also
consider
the
parts
of
speech
of
words
adjacent
to
the
parent
and
child
,
or
between
the
parent
and
child
,
as
well
as
the
number
of
words
between
the
parent
and
child
.
In
general
,
these
features
are
not
available
in
a
generative
model
such
as
a
PCFG
.
Although
feature-based
models
are
often
trained
purely
discriminatively
,
we
will
see
in
§
2.6
how
to
train
them
to
model
conditional
probabilities
.
2.2
Feature-Based
Parsing
and
Bootstrapping
The
above
parsing
model
is
robust
,
thanks
to
its
many
features
.
On
the
Penn
Treebank
WSJ
sections
02-21
,
for
example
,
McDonald
's
parser
extracts
5.5
million
feature
types
from
supervised
edges
alone
,
with
about
120
feature
tokens
firing
per
edge
.
The
highest-scoring
parse
tree
represents
a
consensus
among
all
features
on
all
prospective
edges
.
Even
if
a
prospective
edge
has
some
discouraging
features
(
i.e.
,
with
negative
or
zero
weights
)
,
it
may
still
have
a
relatively
high
score
thanks
to
its
other
features
.
Furthermore
,
even
if
the
edge
has
a
low
total
score
,
it
may
still
appear
in
the
consensus
parse
if
the
alternatives
are
even
worse
or
are
incompatible
with
other
high-scoring
edges
.
Put
another
way
,
the
parser
is
not
able
to
include
high-scoring
features
or
edges
independently
of
one
another
.
Selecting
a
good
feature
means
accepting
all
other
features
on
that
edge
.
It
also
means
rejecting
various
other
edges
,
because
of
the
global
constraints
that
a
legal
parse
tree
must
give
each
word
only
one
parent
and
must
be
free
of
cycles
and
,
in
2Note
that
since
we
are
not
trying
to
predict
parts
of
speech
,
we
treat
the
output
of
one
or
more
automatic
taggers
as
yet
more
inputs
to
edge
feature
functions
.
the
projective
case
,
crossings
.
Our
observation
is
that
this
situation
is
ideal
for
so-called
"
bootstrapping
,
"
"
co-training
,
"
or
"
minimally
supervised
"
learning
methods
(
Yarowsky
,
1995
;
Blum
and
Mitchell
,
1998
;
Yarowsky
and
Wicentowski
,
2000
)
.
Such
methods
should
thrive
when
the
right
answer
is
overdetermined
owing
to
redundant
features
and
/
or
global
constraints
.
Concretely
,
suppose
we
start
by
training
a
supervised
parser
on
only
100
examples
,
using
some
regularization
method
to
prevent
overfitting
to
this
set
.
While
many
features
might
truly
be
relevant
to
the
task
,
only
a
few
appear
often
enough
in
this
small
training
set
to
acquire
significantly
positive
or
negative
weights
.
Even
this
lightly
trained
parser
may
be
quite
sure
of
itself
on
some
test
sentences
in
a
large
unannotated
corpus
,
when
one
parse
scores
far
higher
than
all
others
.
More
generally
,
the
parser
may
be
sure
about
part
of
a
sentence
:
it
may
be
certain
that
a
particular
edge
is
present
(
or
absent
)
,
because
that
edge
tends
to
be
present
(
or
absent
)
in
all
high-scoring
parses
.
Retraining
the
feature
weights
$\theta$
on
these
high-confidence
edges
can
learn
about
additional
features
that
are
correlated
with
an
edge
's
success
or
failure
.
For
example
,
it
may
now
learn
strong
weights
for
lexically
specific
features
that
were
never
observed
in
the
supervised
training
set
.
The
retrained
parser
may
now
be
able
to
confidently
parse
even
more
of
the
unannotated
examples
;
so
we
can
iterate
the
process
.
Our
hope
is
that
the
model
identifies
new
good
and
bad
edges
at
each
step
,
and
does
so
correctly
.
The
more
features
and
global
constraints
the
model
has
,
•
the
more
power
it
will
have
to
discriminate
among
edges
even
when
$\theta$
is
insufficiently
trained
.
(
Some
feature
weights
may
be
too
weak
(
i.e.
,
too
close
to
zero
)
because
the
initial
labeled
set
is
small
.
)
•
the
more
robust
it
will
be
against
errors
even
when
$\theta$
is
incorrectly
trained
.
(
Some
feature
weights
may
be
too
strong
or
have
the
wrong
sign
,
because
of
overfitting
or
mistaken
parses
during
bootstrapping
.
)
In
the
former
case
,
strong
features
lend
their
strength
to
weak
ones
.
In
the
latter
case
,
a
conflict
among
strong
features
weakens
the
ones
that
depart
from
the
consensus
,
or
discounts
the
example
sentence
if
there
is
no
consensus
.
Previous
work
on
parser
bootstrapping
has
not
been
able
to
exploit
this
redundancy
among
features
,
because
it
has
used
PCFG-like
models
with
far
fewer
features
(
Steedman
et
al.
,
2003
)
.
2.3
Adaptation
and
Projection
via
Features
The
previous
section
assumed
that
we
had
a
small
supervised
treebank
in
the
target
language
and
domain
(
plus
a
large
unsupervised
corpus
)
.
We
now
consider
other
,
more
dubious
,
knowledge
sources
that
might
supplement
or
replace
this
small
treebank
.
In
each
case
,
we
can
use
these
knowledge
sources
to
derive
features
that
may
—
or
may
not
—
prove
trustworthy
during
bootstrapping
.
Parses
from
a
different
domain
.
One
might
have
a
treebank
for
a
different
domain
or
genre
of
the
target
language
.
One
could
simply
include
these
trees
in
the
initial
supervised
training
,
and
hope
that
bootstrapping
corrects
any
learned
weights
that
are
inappropriate
to
the
target
domain
,
as
discussed
above
.
In
fact
,
McClosky
et
al.
(
2006
)
found
a
similar
technique
to
be
effective
—
though
only
in
a
model
with
a
large
feature
space
(
"
PCFG
+
reranking
"
)
,
as
we
would
predict
.
However
,
another
approach
is
to
train
a
separate
out-of-domain
parser
,
and
use
this
to
generate
additional
features
on
the
supervised
and
unsupervised
in-domain
data
(
Blitzer
et
al.
,
2006
)
.
Bootstrapping
now
teaches
us
where
to
trust
the
out-of-domain
parser
.
If
our
basic
model
has
100
features
,
we
could
add
features
101
through
200
,
where
for
example
$f_{123}(e) = f_{23}(e) \cdot \log \Pr(e)$ and $\Pr(e)$
is
the
posterior
edge
probability
according
to
the
out-of-domain
parser
.
Learning
that
this
feature
has
a
high
weight
means
learning
to
trust
the
out-of-domain
parser
's
decision
on
edges
where
in-domain
feature
23
fires
.
Even
more
sensibly
,
we
could
add
features
such
as
$f_{201}(e) = \sum_{i=1}^{10} \hat{f}_i(e) \cdot \hat{\theta}_i$, where $\hat{f}$ and $\hat{\theta}$
are
the
feature
and
weight
vectors
for
the
out-of-domain
parser
.
Learning
that
this
feature
has
a
high
weight
means
learning
to
trust
the
out-of-domain
parser
's
feature
weights
for
a
particular
class
of
features
(
those
numbered
1
through
10
)
.
This
addresses
the
intuition
that
some
linguistic
phenomena
remain
stable
across
domains
.
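The two kinds of out-of-domain features described above might be assembled as follows. This is only a sketch under assumptions: the function name `augment`, the dense list representation, and the argument names are hypothetical, and the out-of-domain parser is assumed to expose a posterior edge log-probability together with its own feature and weight vectors.

```python
import math

def augment(in_feats, ood_log_prob, ood_feats, ood_theta, class_ids):
    """Extend in-domain edge features with out-of-domain guidance.

    in_feats:     in-domain feature values for one edge
    ood_log_prob: log posterior of the edge under the out-of-domain parser
    ood_feats, ood_theta: that parser's feature values and weights (assumed available)
    class_ids:    indices of one class of out-of-domain features
    """
    # One conjoined feature per in-domain feature: f_k(e) * log Pr(e).
    conjoined = [f * ood_log_prob for f in in_feats]
    # One summary feature: the out-of-domain partial score over a feature class.
    partial = sum(ood_feats[i] * ood_theta[i] for i in class_ids)
    return list(in_feats) + conjoined + [partial]

example = augment([1.0, 0.0], math.log(0.5), [1.0, 1.0, 0.0], [0.2, -0.1, 3.0], [0, 1])
```

A high learned weight on a conjoined feature then means trusting the out-of-domain parser where the corresponding in-domain feature fires, while a high weight on the summary feature means trusting its weights for that whole class.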
Parses
of
translations
.
Suppose
we
have
translations
into
English
of
some
of
our
supervised
or
unsupervised
sentences
.
Good
probabilistic
dependency
parsers
already
exist
for
English
,
so
we
run
one
over
the
English
translation
.
We
can
now
derive
many
additional
features
on
candidate
edges
on
the
target
sentence
.
For
example
,
dependency
edges
in
the
target
language
of
the
form
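$c \xrightarrow{\text{possessor}} p$ (this denotes a child-to-parent dependency with label possessor) might often correspond to dependency paths in the English translation of the form $c' \xleftarrow{\text{prep}} \text{of} \xleftarrow{\text{pobj}} p'$,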
where
c
'
,
p
'
range
over
word
tokens
in
the
English
translation
,
"
of
"
is
a
literal
English
word
,
and
the
probabilities
are
posteriors
provided
by
a
probabilistic
aligner
and
a
probabilistic
English
parser
.
Note
that
this
is
a
single
feature
(
not
a
feature
family
parameterized
by
c
,
p
)
.
It scores any candidate edge according to how well it corresponds to an English $\xleftarrow{\text{prep}} \text{of} \xleftarrow{\text{pobj}}$ path.
This
method
is
inspired
by
Hwa
et
al.
(
2005
)
,
who
bootstrapped
parsers
for
Spanish
and
Chinese
by
projecting
dependencies
from
English
translations
and
training
a
new
parser
on
the
resulting
noisy
treebank
.
They
used
only
1-best
translations
,
1-best
alignments
,
dependency
paths
of
length
1
,
and
no
labeled
data
in
Spanish
or
Chinese
.
Hwa
et
al.
(
2005
)
used
a
manually
written
postprocessor
to
correct
some
of
the
many
incorrect
projections
.
By
contrast
,
our
framework
uses
the
projected
dependencies
only
as
one
source
of
features
.
They
may
be
overridden
by
other
features
in
particular
cases
,
and
will
be
given
a
high
weight
only
if
they
tend
to
agree
with
other
features
during
bootstrapping
.
A
similar
soft
projection
of
dependencies
was
used
in
supervised
machine
translation
by
Smith
and
Eisner
(
2006
)
,
who
used
a
source
sentence
's
dependency
paths
to
bias
the
generation
of
its
translation
.
Note
that
these
bilingual
features
will
only
fire
on
those
supervised
or
unsupervised
sentences
for
which
we
have
an
English
translation
.
In
particular
,
they
will
usually
be
unavailable
on
the
test
set
.
However
,
we
hope
that
they
will
seed
and
facilitate
the
bootstrapping
process
,
by
helping
us
confidently
parse
some
unsupervised
sentences
that
we
would
not
be
able
to
confidently
parse
without
an
English
translation
.
Parses
of
comparable
English
sentences
.
World
knowledge
can
be
useful
in
parsing
.
Suppose
you
see
a
French
sentence
that
contains
mangeons
and
pommes
,
and
you
know
that
manger
=
eat
and
pomme
=
apple
.
You
might
reasonably
guess
that
pommes
is
the
direct
object
of
mangeons
,
because
you
know
that
apple
is
a
plausible
direct
object
for
eat
.
We
can
discover
this
last
bit
of
world
knowledge
from
comparable
English
text
.
Translation
dictionaries
can
themselves
be
induced
from
comparable
corpora
(
Schafer
and
Yarowsky
,
2002
;
Schafer
,
2006
;
Klementiev
and
Roth
,
2006
)
,
or
extracted
from
bitext
or
digitized
versions
of
human-readable
dictionaries
if
these
are
available
.
The
above
inference
pattern
can
be
captured
by
features
similar
to
those
in
equation
(
1
)
.
For
example
,
one
can
define
a
feature
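$f_j$ by

$$f_j(c \to p) = \Pr\left(c' \xleftarrow{\text{prep}} \text{of} \xleftarrow{\text{pobj}} p' \;\middle|\; c' \text{ translates into } c,\ p' \text{ translates into } p\right) \tag{2}$$
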
where
each
event
in
the
event
space
is
a
pair
(
c
'
,
p
'
)
of
same-sentence
tokens
in
comparable
English
text
,
all
pairs
being
equally
likely
.
Thus
,
to
estimate
$\Pr(\cdot \mid \cdot)$
,
the
denominator
counts
same-sentence
token
pairs
(
c
'
,
p
'
)
in
the
comparable
English
corpus
that
translate
into
the
types
(
c
,
p
)
,
and
the
numerator
counts
such
pairs
that
are
also
related
by
a
$\xleftarrow{\text{prep}} \text{of} \xleftarrow{\text{pobj}}$ path
.
Since
the
lexical
translations
and
dependency
paths
are
typically
not
labeled
in
the
English
corpus
,
a
given
pair
must
be
counted
fractionally
according
to
its
posterior
probability
of
satisfying
these
conditions
,
given
models
of
contextual
translation
and
English
parsing.3
2.4
Bootstrapping
as
Optimization
Section
2.2
assumed
a
relatively
conventional
kind
of
bootstrapping
,
where
each
iteration
retrains
the
model
on
the
examples
where
it
is
currently
most
confident
.
This
kind
of
"
confidence
thresholding
"
has
been
popular
in
previous
bootstrapping
work
(
as
cited
in
§
2.2
)
.
It
attempts
to
maintain
high
accuracy
while
gradually
expanding
coverage
.
The
assumption
is
that
throughout
the
training
procedure
,
the
parser
's
confidence
is
a
trustworthy
guide
to
its
correctness
.
Different
bootstrapping
procedures
use
different
learners
,
smoothing
methods
,
confidence
measures
,
and
procedures
for
"
forgetting
"
the
labelings
from
previous
iterations
.
In
his
analysis
of
Yarowsky
(
1995
)
,
Abney
(
2004
)
formulates
several
variants
of
bootstrapping
.
These
are
shown
to
increase
either
the
likelihood
of
the
training
data
,
or
a
lower
bound
on
that
likelihood
.
In
particular
,
Abney
defines
a
function
K
that
is
an
upper
bound
on
the
negative
log-likelihood
,
and
shows
his
bootstrapping
algorithms
locally
minimize
K.
We
now
present
a
generalization
of
Abney
's
K
function
and
relate
it
to
another
semi-supervised
learning
technique
,
entropy
regularization
(
Brand
,
1999
;
Grandvalet
and
Bengio
,
2005
;
Jiao
et
al.
,
2006
)
.
Our
experiments
will
tune
the
feature
weight
vector
,
$\theta$
,
to
minimize
our
function
.
We
will
do
so
simply
by
applying
a
generic
function
minimization
method
(
stochastic
gradient
descent
)
,
rather
than
by
crafting
a
new
Yarowsky-style
or
Abney-style
iterative
procedure
for
our
specific
function
.
Suppose
we
have
examples
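$x_i$ and corresponding possible labelings $y_{i,k}$. We are trying to learn a parametric model $p_\theta(y_{i,k} \mid x_i)$. If $\tilde{p}(y_{i,k} \mid x_i)$ is a "labeling distribution" that reflects our uncertainty about the true labels, then our expected negative log-likelihood of the model is

$$K = -\sum_i \sum_k \tilde{p}(y_{i,k} \mid x_i) \log p_\theta(y_{i,k} \mid x_i) = \sum_i \left[ D(\tilde{p}_i \,\|\, p_{\theta,i}) + H(\tilde{p}_i) \right] \tag{3}$$

where $\tilde{p}_i$ and $p_{\theta,i}$ abbreviate $\tilde{p}(\cdot \mid x_i)$ and $p_\theta(\cdot \mid x_i)$. $K$ is a function both of $\theta$ and of the labeling distribution $\tilde{p}$; a learner might be allowed to manipulate either in order to decrease $K$.

3 Similarly, Jansche (2005) imputes "missing" trees by using comparable corpora.
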
The
summands
of
K
in
equation
(
3
)
can
be
divided
into
two
cases
,
according
to
whether
$x_i$
is
labeled
or
not
.
For the labeled examples $\{x_i : i \in L\}$, the labeling distribution $\tilde{p}_i$ is a point distribution that assigns all probability to the true, known label $y_i^*$. Then $H(\tilde{p}_i) = 0$. The total contribution of these examples to $K$ simplifies to $\sum_{i \in L} -\log p_\theta(y_i^* \mid x_i)$, i.e., just the negative log-likelihood on the labeled data.
But what is the labeling distribution for the unlabeled examples $\{x_i : i \notin L\}$? Abney simply uses a uniform distribution over labels (e.g., parses), to reflect that the label is unknown. If his bootstrapping algorithm "labels" $x_i$, then $i$ moves into $L$ and $H(\tilde{p}_i)$ is thereby reduced from maximal to 0.
As
a
result
,
a
method
that
labels
the
most
confident
examples
may
reduce
K
,
and
Abney
shows
that
his
method
does
so
.
Our approach is different: we will take the labeling distribution $\tilde{p}_i$ to be our actual current belief $p_{\theta,i}$, and manipulate it through changing $\theta$ rather than $L$. $L$ remains the original set of supervised examples. The total contribution of the unsupervised examples to $K$ then simplifies to $\sum_{i \notin L} H(p_{\theta,i})$.
We
have
no
reason
to
believe
that
these
two
contributions
(
supervised
and
unsupervised
)
should
be
weighted
equally
.
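We thus introduce a multiplier $\gamma$ to form the actual objective function that we minimize with respect to $\theta$:4

$$-\sum_{i \in L} \log p_\theta(y_i^* \mid x_i) + \gamma \sum_{i \notin L} H(p_{\theta,i}) \tag{4}$$

One may regard $\gamma$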
as
a
Lagrange
multiplier
that
is
used
to
constrain
the
classifier
's
uncertainty
H
to
be
low
,
as
presented
in
the
work
on
entropy
regularization
(
Brand
,
1999
;
Grandvalet
and
Bengio
,
2005
;
Jiao
et
al.
,
2006
)
.
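To make the shape of this objective concrete, here is a minimal sketch for the simplified case where each example has a small, explicitly enumerated set of candidate labelings. The function names, the `gamma` argument, and the representation of each example as a list of model scores (one per candidate labeling) are assumptions for illustration; a real parser would never enumerate parses this way.

```python
import math

def softmax(scores):
    """Turn candidate scores into a conditional distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def shannon_entropy(p):
    """Shannon entropy in nats."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def objective(labeled, unlabeled, gamma):
    """Negative log-likelihood on labeled data plus gamma times entropy on unlabeled data.

    labeled:   list of (scores, gold_index) pairs
    unlabeled: list of score lists
    """
    nll = -sum(math.log(softmax(s)[g]) for s, g in labeled)
    ent = sum(shannon_entropy(softmax(s)) for s in unlabeled)
    return nll + gamma * ent

labeled = [([2.0, 0.0], 0)]
uncertain = objective(labeled, [[0.0, 0.0]], 1.0)
confident = objective(labeled, [[5.0, 0.0]], 1.0)
```

With `gamma` set to zero the objective reduces to ordinary supervised training; larger values reward models that are more confident on the unlabeled examples.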
Conventional bootstrapping retrains on the most confident unsupervised examples, making them more confident. Gradient descent on equation (4) essentially does the same, since unsupervised examples contribute to (4) only through $H$, and the shape of the $H$ function means that it is most rapidly decreased by making the most confident unsupervised examples more confident.

4 This function is not necessarily convex in $\theta$, because of the addition of the entropy term (Jiao et al., 2006). One might try an annealing strategy: start $\gamma$ at zero (where the function is convex) and gradually increase it, hoping to "ride" the global optimum. Although we could increase $\gamma$ until the entropy term dominates the minimization and we approach a completely deterministic classifier, it is preferable to use some labeled held-out data to evaluate a stopping criterion.

Besides
favoring
models
that
are
self-confident
on
the
unlabeled
data
,
the
objective
function
(
4
)
also
explicitly
asks
the
model
to
continue
to
get
the
correct
answers
on
the
initial
supervised
corpus
.
$1/\gamma$
controls
the
strength
of
this
request
.
One
could
obtain
a
similar
effect
in
conventional
bootstrapping
by
up-weighting
the
initial
labeled
corpus
when
retraining
.
Minimizing
equation
(
4
)
for
parsing
is
more
computationally
intensive
than
in
many
other
applications
of
bootstrapping
,
such
as
word
sense
disambiguation
or
document
classification
.
With
millions
of
features
,
our
objective
could
take
many
iterations
to
converge
to
a
local
optimum
,
if
we
were
only
to
update
our
parameter
vector
$\theta$
after
each
iteration
through
a
large
unsupervised
corpus
.
For
many
machine
learning
problems
over
large
datasets
,
online
learning
methods
such
as
stochastic
gradient
descent
(
SGD
)
have
been
empirically
observed
to
converge
in
fewer
iterations
(
Bottou
,
2003
)
.
In
SGD
,
instead
of
taking
an
optimization
step
in
the
direction
of
the
gradient
calculated
over
all
unsupervised
training
examples
,
we
parse
each
example
,
calculate
the
gradient
of
the
objective
function
evaluated
on
that
example
alone
,
and
then
take
a
small
step
downhill
.
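The update rule is thus

$$\theta^{(t+1)} = \theta^{(t)} - \eta \, \nabla_\theta F^{(t)}(\theta^{(t)}) \tag{5}$$

where $\theta^{(t)}$ is the parameter vector at time $t$, $F^{(t)}(\theta)$ is the objective function specialized to the time-$t$ example, and $\eta > 0$ is a learning rate that we choose.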
We
check
for
convergence
after
each
pass
through
the
example
set
.
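The per-example update loop can be sketched generically as below. The names `sgd` and `grad_fn`, the fixed learning rate `eta`, and the toy quadratic objective are assumptions for illustration; in the setting of this paper the per-example gradient would come from the parsing objective, and a convergence check would run after each pass.

```python
def sgd(theta, examples, grad_fn, eta, passes):
    """Stochastic gradient descent: one small downhill step per example."""
    for _ in range(passes):
        for ex in examples:
            g = grad_fn(theta, ex)  # gradient of the objective on this example alone
            theta = [w - eta * gi for w, gi in zip(theta, g)]
        # A convergence test (e.g., held-out accuracy) could be placed here.
    return theta

# Toy per-example objective F(theta) = 0.5 * (theta - c)^2, whose gradient is theta - c.
def toy_grad(theta, c):
    return [theta[0] - c[0]]

theta_hat = sgd([0.0], [[1.0], [3.0]], toy_grad, eta=0.1, passes=200)
```

With a constant learning rate the iterate hovers near the minimizer of the summed objective (here, near 2) rather than settling on it exactly, which is one reason a decaying rate or a held-out stopping criterion is often used.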
2.6
Algorithms
and
Complexity
To
evaluate
equation
(
4
)
,
we
need
a
conditional
model
of
trees
given
a
sentence
$x_i$. We define one by exponentiating and normalizing the tree scores: $p_{\theta,i}(y_{i,k}) = \exp\left(\sum_{e \in y_{i,k}} f(e) \cdot \theta\right) / Z_i$.
With exponentially many parses of $x_i$, does our objective function (4) now have prohibitive computational complexity? The complexity is actually similar to that of the inside algorithm for parsing. In fact, the first term of (4) for projective parsing is found by running the $O(n^3)$ inside algorithm on supervised data,5 and its gradient is found by the corresponding $O(n^3)$ outside algorithm. For non-projective parsing, the analogy to the inside algorithm is the $O(n^3)$ "matrix-tree algorithm," which is dominated asymptotically by a matrix determinant (Smith and Smith, 2007; Koo et al., 2007; McDonald and Satta, 2007). The gradient of a determinant may be computed by matrix inversion, so evaluating the gradient again has the same $O(n^3)$
complexity
as
evaluating
the
function
.
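As an illustration of the determinant computation, the sketch below evaluates the normalizer over all non-projective trees with the matrix-tree theorem and checks it against brute-force enumeration on a tiny sentence. It assumes the multi-root setting in which any word may attach to an artificial root, and that each edge weight is the exponentiated edge score; the function names and the toy weights are hypothetical.

```python
import itertools

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting, O(n^3)."""
    a = [row[:] for row in matrix]
    n, det = len(a), 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det

def partition_function(root_w, edge_w):
    """Z as the determinant of the Laplacian with the root row and column removed.

    root_w[m] is the weight of root -> m; edge_w[h][m] is the weight of h -> m.
    """
    n = len(root_w)
    lap = [[0.0] * n for _ in range(n)]
    for m in range(n):
        lap[m][m] = root_w[m] + sum(edge_w[h][m] for h in range(n) if h != m)
        for h in range(n):
            if h != m:
                lap[h][m] = -edge_w[h][m]
    return determinant(lap)

def brute_force(root_w, edge_w):
    """Sum tree weights over every legal head assignment (exponential time)."""
    n, total = len(root_w), 0.0
    # heads[m] == n means that word m attaches to the artificial root.
    for heads in itertools.product(range(n + 1), repeat=n):
        if any(heads[m] == m for m in range(n)):
            continue
        legal = True
        for m in range(n):
            seen, cur = set(), m
            while cur != n and cur not in seen:
                seen.add(cur)
                cur = heads[cur]
            legal = legal and cur == n  # every word must reach the root
        if legal:
            w = 1.0
            for m in range(n):
                w *= root_w[m] if heads[m] == n else edge_w[heads[m]][m]
            total += w
    return total

root_w = [1.0, 2.0, 0.5]
edge_w = [[0.0, 1.5, 0.3], [0.7, 0.0, 2.0], [1.1, 0.4, 0.0]]
z_det = partition_function(root_w, edge_w)
z_enum = brute_force(root_w, edge_w)
```

The enumeration is there only to validate the determinant on a three-word example; it is exponential in sentence length, whereas the determinant is cubic.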
The
second
term
of
(
4
)
is
the
Shannon
entropy
of
the
posterior
distribution
over
parses
.
Computing
this
for
projective
parsing
takes
$O(n^3)$ time, using a dynamic programming algorithm that is closely related to the inside algorithm (Hwa, 2000).6 For non-projective parsing, unfortunately, the runtime rises to $O(n^4)$
,
since
it
requires
determinants
of
n
distinct
matrices
(
each
incorporating
a
log
factor
in
a
different
column
;
we
omit
the
details
)
.
The
gradient
evaluation
in
both
cases
is
again
about
as
expensive
as
the
function
evaluation
.
A
convenient
speedup
is
to
replace
Shannon
entropy
with
Renyi
entropy
.
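The family of Renyi entropy measures is parameterized by $\alpha$:

$$R_\alpha(p) = \frac{1}{1 - \alpha} \log \sum_y p(y)^\alpha \tag{6}$$

In our setting, where $p = p_{\theta,i}$, the events $y$ are the possible parses of $x_i$. Observe that under our definition of $p$, $\sum_y p(y)^\alpha = \left[\sum_y \exp \sum_{e \in y} f(e) \cdot (\alpha\theta)\right] / Z_i^\alpha$.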
We already have $Z_i$ from running the inside algorithm, and we can find the numerator by running the inside algorithm again with $\theta$ scaled by $\alpha$. Thus with Renyi entropy, all computations and their gradients are $O(n^3)$
—
even
in
the
non-projective
case
.
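The identity above means that Renyi entropy needs only two log-normalizers: one at the original weights and one at the weights scaled by alpha. The sketch below checks this numerically on a tiny, explicitly enumerated set of tree scores. The function names are hypothetical, and `log_z` stands in for whatever algorithm (inside or matrix-tree) computes the log-normalizer in a real parser.

```python
import math

def log_z(scores, scale=1.0):
    """Log-normalizer of the scores after multiplying each by `scale`."""
    m = max(scale * s for s in scores)
    return m + math.log(sum(math.exp(scale * s - m) for s in scores))

def renyi_entropy(scores, alpha):
    """R_alpha = (log Z(alpha * theta) - alpha * log Z(theta)) / (1 - alpha), alpha != 1."""
    return (log_z(scores, alpha) - alpha * log_z(scores)) / (1.0 - alpha)

def renyi_direct(scores, alpha):
    """Same quantity computed directly from the normalized probabilities."""
    z = sum(math.exp(s) for s in scores)
    return math.log(sum((math.exp(s) / z) ** alpha for s in scores)) / (1.0 - alpha)

# Each score plays the role of a total tree score, i.e. a sum of edge scores.
tree_scores = [1.0, 0.2, -0.5]
r2 = renyi_entropy(tree_scores, 2.0)
```

Scaling the weight vector by alpha scales every tree score by alpha, which is why the second call to `log_z` simply passes `alpha` as the scale.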
Renyi
entropy
is
also
a
theoretically
attractive
generalization
.
It can be shown that $\lim_{\alpha \to 1} R_\alpha(p)$ is in fact the Shannon entropy $H(p)$ and that $\lim_{\alpha \to \infty} R_\alpha(p) = -\log \max_y p(y)$, i.e., the negative log probability of the modal or "Viterbi" label (Arndt, 2001; Karakos et al., 2007). The $\alpha = 2$ case, widely used as a measure of purity in decision tree learning, is often called the "Gini index." Finally, when $\alpha = 0$, we get the log of the number of labels, which equals the $H$(uniform distribution) that Abney used in equation (3).

5 The numerator of $p_{\theta,i}(y_i^*)$ (see definition above) is trivial since $y_i^*$ is a single known parse. But the denominator $Z_i$ is a normalizing constant that sums over all parses; it is found by a dependency-parsing variant of the inside algorithm, following Eisner (1996).

6 See also Mann and McCallum (2007) for similar results on conditional random fields.

3
Evaluation
For
this
paper
,
we
performed
some
initial
bootstrapping
experiments
on
small
corpora
,
using
the
features
from
(
McDonald
et
al.
,
2005
)
.
After discussing
experimental
setup
(
§
3.1
)
,
we
look
at
the
correlation
of
confidence
with
accuracy
and
with
oracle
likelihood
,
and
at
the
fine-grained
behaviour
of
models
'
dependency
edge
posteriors
(
§
3.2
)
.
We
then
compare
our
confidence-maximizing
bootstrapping
to
EM
,
which
has
been
widely
used
in
semi-supervised
learning
(
§
3.4
)
.
Section
3.3
presents
overall
bootstrapping
accuracy
.
3.1
Experimental
Design
We
bootstrapped
non-projective
parsers
for
languages
assembled
for
the
CoNLL
dependency
parsing
competitions
(
Buchholz
and
Marsi
,
2006
)
.
We
selected
German
,
Spanish
,
and
Czech
(
Brants
et
al.
,
2002
;
Civit Torruella and Martí Antonín, 2002; Böhmová
et
al.
,
2003
)
.
After
removing
sentences
more
than
60
words
long
,
we
randomly
divided
each
corpus
into
small
seed
sets
of
100
and
1000
trees
;
development
and
test
sets
of
200
trees
each
;
and
an
unlabeled
training
set
from
the
rest
.
These
treebanks
contain
strict
dependency
trees
,
in
the
sense
that
their
only
nodes
are
the
words
and
a
distinguished
root
node
.
In
the
Czech
dataset
,
more
than
one
word
can
attach
to
the
root
;
also
,
the
trees
in
German
,
Spanish
,
and
Czech
may
be
non-projective
.
We
use
the
MSTParser
implementation
described
in
McDonald
et
al.
(
2005
)
for
feature
extraction
.
Since
our
seed
sets
are
so
small
,
we
extracted
features
from
all
edges
in
both
the
seed
and
the
unlabeled
parts
of
our
training
data
,
not
just
the
edges
annotated
as
correct
.
Since
this
produced
many
more
features
,
we
pruned
our
features
to
those
with
at
least
10
occurrences
over
all
edges
.
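The count-based pruning step could look like the following. This is a sketch, not the MSTParser code: the function name `prune_features`, the string feature names, and the representation of each edge as a list of feature names are assumptions for the example.

```python
from collections import Counter

def prune_features(edge_feature_lists, min_count=10):
    """Keep feature types that occur at least `min_count` times over all edges."""
    counts = Counter(f for feats in edge_feature_lists for f in feats)
    return {f for f, c in counts.items() if c >= min_count}

# Hypothetical feature names; every candidate edge contributes, not just gold edges.
edges = [["pos=NN"]] * 10 + [["word=rare", "pos=NN"]] * 3
kept = prune_features(edges)
```

Because the counts run over all candidate edges in both the seed and unlabeled data, no tree annotations are needed to decide which features survive.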
Table 1 (columns: correlation with Acc. and with Xent.; rows: Renyi entropies, including the Shannon and Viterbi cases): Correlation, on development sentences, of Renyi entropy with model accuracy and with cross-entropy ("Xent."). Since these are measures of uncertainty, we see a negative correlation. As $\alpha$ increases, we place more confidence in high-probability parses and correlate better with accuracy.

We
used
stochastic
gradient
descent
first
to
minimize
equation
(
4
)
on
the
labeled
seed
sets
.
Then
we
continued
to
optimize
over
the
labeled
and
unlabeled
data
together
.
We
tested
for
convergence
using
accuracy
on
development
data
.
3.2
Empirically
Evaluating
Entropy
Bootstrapping
assumes
that
where
the
parser
is
confident
,
it
tends
to
be
correct
.
Standard
bootstrapping
methods
retrain
directly
on
confident
links
;
similarly
,
our
approach
tries
to
make
the
parser
even
more
confident
on
those
links
.
Is
this
assumption
really
true
empirically
?
Yes
:
not
only
does
confidence
on
unlabeled
data
correlate
with
cross-entropy
,
but
both
confidence
and
cross-entropy
correlate
well
with
accuracy
.
As
we
will
see
,
some
confidence
measures
correlate
better
than
others
.
In
particular
,
measures
that
are
more
peaked
around
the
one-best
prediction
of
the
parser
,
as
in
Viterbi
re-estimation
,
perform
well
.
If
we
train
a
non-projective
German
parser
on
small
seed
sets
of
100
and
1000
trees
,
only
,
how
well
does
its
own
confidence
predict
its
performance
?
For
200
points
—
labeled
development
sentences
—
we
measured
the
linear
correlation
of
various
Renyi
entropies
(
6
)
,
normalized
by
sentence
length
,
with
tree
accuracy
(
Table
1
)
.
We
also
measured
how
these
normalized
Renyi
entropies
correlate
with
the
posterior
log-probability
the
model
assigns
to
the
true
parse
(
the
cross-entropy
)
.
Since
Renyi
entropy
is
a
measure
of
uncertainty
,
we
see
a
negative
correlation
with
accuracy
.
This correlation strengthens as we raise $\alpha$ to $\infty$, so we might expect Viterbi re-estimation, or a differentiable objective function with a very high $\alpha$, to perform best on held-out data. Note also that the cross-entropy, which looks at the true labels on the held-out data, does not itself correlate very much better with accuracy than the best unsupervised confidence measures. Finally, we see that Renyi entropies with higher $\alpha$ are more stable: when calculated for a model trained on more data, they improve their correlation with accuracy.

Figure 1: Posterior probability of correct and incorrect edges in German test data under various models. We show the distribution of posterior probabilities for correct edges, known from an oracle, in black and incorrect edges in gray. In the upper row, learning on an initial supervised set raises the posterior probability of correct edges while dragging along some incorrect edges. In the lower row, we see that adding unlabeled data with $R_2$ entropy continues the pattern of the supervised learner. $R_\infty$ (Viterbi) training induces a second mode in correct posterior probabilities near 1, although it does shift more incorrect edges closer to 1.

Figure 2: Precision-recall curves for selecting edges according to their posterior probabilities: better bootstrapping puts more area under the curve.

From
tree
confidence
,
we
now
turn
to
edge
confidence
:
what
is
the
posterior
probability
that
a
model
assigns
to
each
of
the
$n^2$
edges
in
the
dependency
graph
?
Figure
1
shows
smoothed
histograms
of
true
edges
(
black
)
and
false
edges
(
gray
)
in
held-out
data
,
according
to
the
posterior
probabilities
we
assign
to
them
.
Since
there
are
many
more
false
edges
,
the
figures
are
cropped
to
zoom
in
on
the
distribution
of
true
edges
.
As
we
start
training
on
the
labeled
seed
set
,
the
posterior
probabilities
of
true
edges
move
towards
one
;
many
false
edges
also
get
greater
mass
,
but
not
to
the
same
extent
.
As
we
add
unlabeled
data
,
we
can
see
the
different
learning
strategies
of
different
confidence
measures
.
$R_2$
gradually
moves
a
few
true
and
many
fewer
false
edges
towards
1
,
while
$R_\infty$
(
Viterbi
)
learning
is
so
confident
as
to
induce
a
bimodal
distribution
in
the
posteriors
of
true
edges
.
Figure
2
visualizes
the
same
data
as
four
precision-recall
curves
,
which
show
how
noisy
the
highest-confidence edges are, across a range of confidence
thresholds
.
Although
the
very
high
precision
end
stays
stable
after
10
iterations
on
the
seed
set
,
the
addition
of
unlabeled
data
puts
more
area
under
the
curve
.
Again
,
$R_\infty$ dominates $R_2$
.
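A precision-recall point for one confidence threshold can be computed as in the sketch below; sweeping the threshold traces out a curve like those in Figure 2. The function name, the parallel-list inputs, and the toy numbers are assumptions for illustration, and the convention that precision is 1.0 when nothing is selected is a choice made here.

```python
def precision_recall(posteriors, is_correct, threshold):
    """Precision and recall of the edges whose posterior is at least `threshold`."""
    selected = [c for p, c in zip(posteriors, is_correct) if p >= threshold]
    true_total = sum(is_correct)
    true_selected = sum(selected)
    precision = true_selected / len(selected) if selected else 1.0
    recall = true_selected / true_total if true_total else 0.0
    return precision, recall

# Toy edge posteriors with oracle correctness labels (hypothetical values).
posteriors = [0.9, 0.8, 0.6, 0.3, 0.1]
is_correct = [True, True, False, True, False]
curve = [precision_recall(posteriors, is_correct, t) for t in (0.7, 0.2)]
```

Lowering the threshold admits more edges, which typically raises recall while letting in more false edges.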
3.3
Bootstrapping
Results
We
performed
bootstrapping
experiments
on
the
full
CoNLL
sets
for
Czech
,
German
,
and
Spanish
using
the
non-projective
model
from
McDonald
et
al.
(
2005
)
.
Performance
confirms
the
results
of
our
analysis
above
(
Table
2
)
.
Adding
unlabeled
data
improves
performance
over
that
of
the
seed
set
,
with
the
exception
of
the
Czech
data
with
$R_2$
bootstrapping
.
As
we
saw
in
§
3.2
,
bootstrapping
with
$R_\infty$ dominates bootstrapping with $R_2$ confidence
.
For
comparison
,
we
also
show
the
results
obtained
by
supervised
training
on
the
combined
seed
and
unlabeled
sets
.
Recall
that
we
did
not
use
the
tree
annotations
to
perform
feature
selection
;
models
trained
with
only
supported
features
ought
to
perform
better
.
Although
we
see
statistically
significant
improvements
(
at
the
.05
level
on
a
paired
permutation
test
)
,
the
quality
of
the
parsers
is
still
quite
poor
,
in
contrast
to
other
applications
of
bootstrapping
which
"
rival
supervised
methods
"
(
Yarowsky
,
1995
)
.
Almost
certainly
,
the
CoNLL
datasets
,
comprising
at
most
some
tens
of
thousands
of
sentences
per
language
,
are
too
small
to
afford
qualitative
improvements
.
Also
,
at
these
relatively
small
training
sizes
,
our
preliminary
attempts
to
leverage
comparable
English
corpora
did
not
improve
performance
.
Table 2 (accuracy by number of seed trees): Dependency accuracy of the McDonald model on 200 test sentences. When $\alpha = 0$, training only occurs on the supervised seed data. As $\alpha$ increases, we train based on confidence in our model's analysis of the unlabeled data. Boldface results are the best in their rows in a permutation test at the .05 level.

What features were learned, and how dependent is performance on the seed set? We analyzed the performance of German bootstrapping on a development set (Table 3).
Using
the
parameters
at
the
last
iteration
of
supervised
training
on
the
seed
set
as
a
baseline
,
we
tried
updating
to
their
bootstrapped
values
the
weights
of
only
those
features
that
occurred
in
the
seed
set
.
This
achieved
nearly
the
same
accuracy
as
updating
all
the
features
.
As
one
would
expect
,
using
only
the
non-seed
features
'
weights
performs
abysmally
.
This
might
be
the
case
simply
because
the
seed
set
is
likely
to
contain
frequently
occurring
features
.
If
,
however
,
we
use
only
the
features
occurring
in
an
alternate
training
set
of
the
same
size
(
100
sentences
)
,
we
get
much
worse
performance
.
These
results
indicate
that
our
bootstrapped
parser
is
still
heavily
dependent
on
the
features
that
happened
to
fire
in
the
seed
set
;
we
have
not
"
forgotten
"
our
initial
conditions
.
Similar
experiments
show
that
unlexicalized
features
contribute
the
most
to
bootstrapping
performance
.
Since
in
our
log-linear
models
features
have
been
trained
to
work
together
,
we
must
not
put
too
much
weight
on
these
ablation
results
.
These
experiments
do
,
however
,
suggest
that
bootstrapping
improved
our
results
by
refining
the
values
of
known
,
non-lexicalized
features
.
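The partial-update ablation described above (take bootstrapped weights for a chosen subset of features and leave every other weight at its seed-trained baseline value) can be sketched as follows. This is a minimal illustration under assumed data structures: weight vectors are plain dictionaries, and the feature names and the helper `partially_update` are hypothetical.

```python
def partially_update(baseline, bootstrapped, selected):
    """Use bootstrapped weights for `selected` features, baseline otherwise.

    Features missing from a weight dictionary are treated as weight 0.0.
    """
    features = set(baseline) | set(bootstrapped)
    return {
        f: bootstrapped.get(f, 0.0) if f in selected else baseline.get(f, 0.0)
        for f in features
    }


if __name__ == "__main__":
    # Toy weights; the feature strings are invented for illustration.
    baseline = {"pos:NN<VB": 1.0, "dist=1": 0.5}
    bootstrapped = {"pos:NN<VB": 1.4, "dist=1": 0.3, "word:Hund<bellt": 0.9}
    seed_features = set(baseline)

    # Update only the features that occurred in the seed set.
    print(partially_update(baseline, bootstrapped, seed_features))
    # Update only the features that did not occur in the seed set.
    non_seed = set(bootstrapped) - seed_features
    print(partially_update(baseline, bootstrapped, non_seed))
```

Each resulting weight vector would then be evaluated on the development set with the same parser. Swapping in a different `selected` set gives the other splits (for example, non-lexicalized features only).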
Perhaps
the
most
popular
statistical
method
for
learning
from
incomplete
data
is
the
EM
algorithm
(
Dempster
et
al.
,
1977
)
.
Since
we
cannot
try
EM
on
McDonald
's
conditional
model
,
we
ran
some
pilot
experiments
using
the
generative
dependency
model
with
valence
(
DMV
)
of
Klein
and
Manning
(
2004
)
.
As
in
their
experiments
,
and
unlike
the
other
experiments
in
the
current
paper
,
we
restricted
ourselves
to
sentences
of
ten
words
or
fewer
and
to
part-of-speech
sequences
alone
,
without
any
lexical
information
.
Table
3
(
columns
:
M
feat.
,
acc.
;
feature
splits
:
non-seed
,
non-lexical
,
non-bilex.
,
bilexical
)
:
Using
all
features
,
dependency
accuracy
on
German
development
data
rose
to
64.3
%
on
bootstrapping
.
We
show
the
contribution
of
different
feature
splits
to
the
performance
of
this
final
model
.
For
example
,
although
this
model
was
trained
by
updating
all
15.5M
feature
weights
,
it
performs
as
well
if
we
then
keep
only
the
1.4M
features
that
appeared
at
least
once
in
the
seed
set
,
zeroing
out
the
weights
of
the
others
.
We
do
as
well
as
the
full
feature
set
if
we
keep
only
the
3.5M
non-lexicalized
features
.
Table
4
:
Dependency
accuracy
of
the
DMV
model
(
Klein
and
Manning
,
2004
)
.
Maximizing
confidence
using
R1
(
Shannon
)
entropy
improved
performance
over
its
own
conditional
likelihood
(
CL
)
baseline
and
over
maximum
likelihood
(
ML
)
.
EM
degraded
its
ML
baseline
.
Since
these
models
were
only
trained
and
tested
on
sentences
of
10
words
or
fewer
,
accuracy
is
much
higher
than
the
full
results
in
Table
2
.
Since
the
DMV
models
projective
trees
,
we
ran
experiments
on
three
CoNLL
corpora
that
had
augmented
their
primary
non-projective
parses
with
alternate
projective
annotations
:
Bulgarian
(
Simov
et
al.
,
2005
)
,
German
,
and
Spanish
.
We
performed
supervised
maximum
likelihood
and
conditional
likelihood
estimation
on
a
seed
set
of
100
sentences
for
each
language
.
These
models
respectively
initialized
EM
and
confidence
training
on
unlabeled
data
.
We
see
(
Table
4
)
that
EM
degrades
the
performance
of
its
ML
baseline
.
Merialdo
(
1994
)
saw
a
similar
degradation
over
small
(
and
large
)
seed
sets
in
HMM
POS
tagging
.
We
tried
fixing
and
not
fixing
the
feature
expectations
on
the
seed
set
during
EM
and
show
the
former
,
better
numbers
.
Confidence
maximization
improved
over
both
its
own
conditional
likelihood
initializer
and
also
over
ML
.
We
selected
optimal
smoothing
parameters
for
all
models
and
optimal
α
(
equation
(
6
)
)
and
γ
(
equation
(
4
)
)
for
the
confidence
model
on
labeled
held-out
data
.
4
Future
Work
We
hypothesize
that
qualitatively
better
bootstrapping
results
will
require
much
larger
unlabeled
data
sets
.
In
scaling
up
bootstrapping
to
larger
unlabeled
training
sets
,
we
must
carefully
weigh
tradeoffs
between
expanding
coverage
and
introducing
noise
from
out-of-domain
data
.
We
could
also
better
exploit
the
data
we
have
with
richer
models
of
syntax
.
In
supervised
dependency
parsing
,
second-order
edge
features
provide
improvements
(
McDonald
and
Pereira
,
2006
;
Riedel
and
Clarke
,
2006
)
;
moreover
,
the
feature-based
approach
is
not
limited
to
dependency
parsing
.
Similar
techniques
could
score
parses
in
other
formalisms
,
such
as
CFG
or
TAG
.
In
this
case
,
f
extracts
features
from
each
of
the
derivation
tree
's
rewrite
rules
(
CFG
)
or
elementary
trees
(
TAG
)
.
In
lexicalized
formalisms
,
f
will
still
be
able
to
score
lexical
dependencies
that
are
implicitly
represented
in
the
parse
.
Finally
,
we
want
to
investigate
whether
larger
training
sets
will
provide
traction
for
sparser
cross-lingual
and
cross-domain
features
.
5
Conclusions
Feature-rich
dependency
models
promise
to
help
bootstrapping
by
providing
many
redundant
features
for
the
learner
,
and
they
can
also
cleanly
incorporate
cross-domain
and
cross-language
information
.
We
explored
bootstrapping
feature-rich
non-projective
dependency
parsers
for
Czech
,
German
,
and
Spanish
.
Our
bootstrapping
method
maximizes
a
linear
combination
of
likelihood
and
confidence
.
In
initial
experiments
on
small
datasets
,
this
surpassed
EM
for
training
a
simple
feature-poor
generative
model
,
and
also
improved
the
performance
of
a
feature-rich
,
conditionally
estimated
model
where
EM
could
not
easily
have
been
applied
.
For
our
models
and
training
sets
,
more
peaked
measures
of
confidence
,
measured
by
Renyi
entropy
,
outperformed
smoother
ones
.
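To make the contrast between peaked and smooth confidence measures concrete, the sketch below computes the textbook Rényi entropy of a small discrete distribution for several orders. It assumes the standard definition, with order 1 taken as the Shannon limit and infinite order as min-entropy; it is not a reproduction of the paper's own equations, and the toy distribution over candidate parses is invented for illustration.

```python
import math


def renyi_entropy(probs, alpha):
    """Renyi entropy (in nats) of a discrete distribution.

    alpha == 1 is handled as the Shannon limit, and alpha == math.inf
    as min-entropy, which depends only on the most probable outcome.
    """
    probs = [p for p in probs if p > 0.0]
    if alpha == 1:
        return -sum(p * math.log(p) for p in probs)
    if alpha == math.inf:
        return -math.log(max(probs))
    return math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)


if __name__ == "__main__":
    # Hypothetical posterior over four candidate parses of one sentence.
    posterior = [0.7, 0.1, 0.1, 0.1]
    for alpha in (1, 2, math.inf):
        # Confidence is taken here as negative entropy.
        print(alpha, -renyi_entropy(posterior, alpha))
```

Higher orders put more of the measure on the most probable parse, so they reward a peaked posterior more directly than Shannon entropy does.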
Acknowledgments
The
authors
thank
the
anonymous
reviewers
,
Noah
A.
Smith
,
and
Keith
Hall
for
helpful
comments
,
and
Ryan
McDonald
for
making
his
parsing
code
publicly
available
.
This
work
was
supported
in
part
by
NSF
ITR
grant
IIS-0313193
.
