In morphologically rich languages, should morphological and syntactic disambiguation be treated sequentially or as a single problem? We describe several efficient, probabilistically-interpretable ways to apply joint inference to morphological and syntactic disambiguation using lattice parsing. Joint inference is shown to compare favorably to pipeline parsing methods across a variety of component models. State-of-the-art performance on Hebrew Treebank parsing is demonstrated using the new method. The benefits of joint inference are modest with the current component models, but appear to increase as components themselves improve.
1 Introduction

As the field of statistical NLP expands to handle more languages and domains, models appropriate for standard benchmark tasks do not always work well in new situations. Take, for example, parsing the Wall Street Journal Penn Treebank, a longstanding task for which highly accurate context-free models stabilized by the year 2000 (Collins, 1999; Charniak, 2000). On this task, the Collins model achieves 90% F1-accuracy. Extended for new languages by Bikel (2004), it achieves only 75% on Arabic and 72% on Hebrew.1

It should come as no surprise that Semitic parsing lags behind English. The Collins model was carefully designed and tuned for WSJ English. Many of the features in the model depend on English syntax or Penn Treebank annotation conventions. Inherent in its crafting is the assumption that a million words of training text are available. Finally, for English, it need not handle morphological ambiguity. Indeed, the figures cited above for Arabic and Hebrew are achieved using gold-standard morphological disambiguation and part-of-speech tagging.

* The authors acknowledge helpful feedback from the anonymous reviewers, Sharon Goldwater, Rebecca Hwa, Alon Lavie, and Shuly Wintner.
1 Compared to the Penn Treebank, the Arabic Treebank (Maamouri et al., 2004) has 60% as many word tokens, and the Hebrew Treebank (Sima'an et al., 2001) has 6%. Given only surface words, Arabic performance drops by 1.5 F1 points. The Hebrew Treebank (unlike Arabic) is built over morphemes, a convention we view as sensible, though it complicates parsing.

This paper considers parsing for morphologically rich languages, with Hebrew as a test case. Morphology and syntax are two levels of linguistic description that interact. This interaction, we argue, can affect disambiguation, so we explore here the matter of joint disambiguation. This involves the comparison of a pipeline (where morphology is inferred first and syntactic parsing follows) with joint inference. We present a generalization of the two, and show new ways to do joint inference for this task that do not involve a computational blow-up.
The paper is organized as follows. §2 describes the state of the art in NLP for Hebrew and some phenomena it exhibits that motivate joint inference for morphology and syntax. §3 describes our approach to joint inference using lattice parsing, and gives three variants of weighted lattice parsing with their probabilistic interpretations. The different factor models and their stand-alone performance are given in §4. §5 presents experiments on Hebrew parsing and explores the benefits of joint inference.

2 Background

In this section we discuss prior work on statistical morphological and syntactic processing of Hebrew and motivate the joint approach. Wintner (2004) reviews work in Hebrew NLP, emphasizing that the challenges stem from the writing system, rich morphology, the unique word formation process of roots and patterns, and the relative lack of annotated corpora. We know of no publicly available statistical parser designed specifically for Hebrew.
[Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 208-217, Prague, June 2007. © 2007 Association for Computational Linguistics.]

Figure 1: (a.) A sentence in Hebrew (to be read right to left), with (b.) one morphological analysis, (c.) English glosses, and (d.) the natural translation "the shepherd in the big green distant meadow who shepherds there is beautiful"; and (e.) a different morphological analysis, (f.) English glosses, and (g.) the less natural translation "the shepherdess in the big green distant meadow is lying there nicely". (h.) shows a morphological "sausage" lattice that encodes the morpheme-sequence analyses L(x) possible for a shortened sentence (unmodified "meadow"). Shaded states are word boundaries, white states are intra-word morpheme boundaries; in practice we add POS tags to the arcs, to permit disambiguation. According to both native speakers we polled, both interpretations are grammatical; note the long-distance agreement required for grammaticality.

Sima'an et al. (2001) built a Hebrew Treebank of 88,747 words (4,783 sentences) and parsed it using a probabilistic model.
However, they assumed that the input to the parser was already (perfectly) morphologically disambiguated. This assumption is very common in multilingual parsing (see, for example, Cowan et al., 2005, and Buchholz et al., 2006).
In NLP, the separation of syntax and morphology is understandable when the latter is impoverished, as in English. When both involve high levels of ambiguity, this separation becomes harder to justify, as argued by Tsarfaty (2006). To our knowledge, that is the only study to move toward joint inference of syntax and morphology, presenting joint models and testing approximations of these models with two parsers: one a pipeline (segmentation → tagging → parsing), the other involving joint inference of segmentation and tagging, with the result piped to the parser. The latter was slightly more accurate. Tsarfaty discussed but did not carry out joint inference.
In a morphologically rich language, the different morphemes that make up a word can play a variety of different syntactic roles. A reasonable linguistic analysis might not make such morphemes immediate sisters in the tree. Indeed, the convention of the Hebrew Treebank is to place morphemes (rather than words) at the leaves of the parse tree, allowing morphemes of a word to attach to different nonterminal parents.2 Generating parse trees over morphemes requires the availability of morphological information when parsing. Because this analysis is not in general reducible to sequence labeling (tagging), the problem is different from POS tagging. Figure 1 gives an example from Hebrew that illustrates the interaction between morphology and syntax.

2 The Arabic Treebank, by contrast, annotates words morphologically but keeps the morphemes together as a single node tagged with a POS sequence. In Bikel's Arabic parser, complex POS tags are projected to a small atomic set; it is unclear how much information is lost.
In this example, we show two interpretations of the surface text, with the first being the more common, natural analysis of the sentence. The first and third-to-last words' analyses depend on each other if the resulting analysis is to be the more natural one: for this analysis the first seven words have to be a noun phrase, while for the less common analysis ("lying there nicely") only the first six words compose a noun phrase, with the last two words composing a verb phrase. Consistency depends on a long-distance dependency that a finite-state morphology model cannot capture, but a model that involves syntactic information can. Disambiguating the syntax aids in disambiguating the morphology, suggesting that a joint model will perform both tasks more accurately.

In sum, joint inference of morphology and syntax is expected to allow decisions of both kinds to influence each other, to enforce adherence to constraints at both levels, and to diminish the propagation of errors inherent in pipelines.
3 Joint Inference of Morphology and Syntax

We now formalize the problem and supply the necessary framework for performing joint morphological disambiguation and syntactic parsing.

3.1 Notation and Morphological Sausages
Let X be the language's word vocabulary and M be its morpheme inventory. The set of valid analyses for a surface word is defined using a morphological lexicon L, which defines L(x) ⊆ M+ for a word x. L(x) ⊆ (M+)+ (a sequence of sequences) is the set of whole-sentence analyses for a sentence x = (x1, x2, ..., xn), produced by concatenating elements of L(xi) in order. L(x) can be represented as an acyclic lattice with a "sausage" shape familiar from speech recognition (Mangu et al., 1999) and machine translation (Lavie et al., 2004). Fig. 1h shows a sausage lattice for a sentence in Hebrew. We use m to denote an element of L(x) and mj to denote an element of L(xj); in general, m = (m1, m2, ..., mn).
We use DG(m) ⊆ T to denote the set of valid trees under a grammar G (here, a PCFG with terminal alphabet M) for a morpheme sequence m. To be precise, f(x) selects a mutually consistent morphological and syntactic analysis from GEN(x). Our mapping f(x) is based on a joint probability model p(t, m | x) which combines two probability models: pG(t, m) (a PCFG built on the grammar G) and pL(m | x) (a morphological disambiguation model built on the lexicon L).
Factoring the joint model into sub-models simplifies training, since we can train each model separately, and inference (parsing), as we will see later in this section. Factored estimation has been quite popular in NLP of late (Klein and Manning, 2003b; Smith and Smith, 2004; Smith et al., 2005a, inter alia).

The most obvious joint parser uses pG as a conditional model over trees given morphemes and maximizes the joint likelihood:

    f(x) = argmax over (t, m) of pG(t | m) · pL(m | x)

This is not straightforward, because it involves summing up the trees for each m to compute pG(m), which calls for the O(|m|^3) Inside algorithm to be called on each m. Instead, we use the joint, pG(t, m), which, strictly speaking, makes the model deficient ("leaky"), but permits a dynamic programming solution.
Our models will be parametrized using either unnormalized weights (a log-linear model) or multinomial distributions. Either way, both models define scores over parts of analyses, and it may be advantageous to give one model relatively greater strength, especially since we often ignore constant normalizing factors. This is known as a product of experts (Hinton, 1999), where a new combined distribution over events is defined by multiplying component distributions together and renormalizing. In the product-of-experts decoder,

    f_poe,α(x) = argmax over (t, m) of pG(t, m) · pL(m | x)^α / Z(x, α),

where Z(x, α) need not be computed (since it is a constant in m and t). α tunes the relative weight of the morphology model with respect to the parsing model. The higher α is, the more we trust the morphology model over the parser to correctly disambiguate the sentence. We might trust one model more than the other for a variety of reasons: it could be more robustly or discriminatively estimated, or it could be known to come from a more appropriate family.
This formulation also generalizes two more naive parsing methods. If α = 0, the morphology is modeled only through the PCFG and pL is ignored except as a constraint on which analyses L(x) are allowed (i.e., on the definition of the set GEN(x)). At the other extreme, as α → +∞, pL becomes more important. Because pL does not predict trees, pG still "gets to choose" the syntax tree, but in the limit it must find a tree for argmax over m ∈ L(x) of pL(m | x). This is effectively the morphology-first pipeline.3
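To make the role of α concrete, here is a minimal sketch of the product-of-experts decision rule in the log domain, over a toy, fully enumerated GEN(x); the candidate pairs and their scores are invented for illustration. Because the rule is an argmax, the normalizer Z(x, α) can simply be dropped.

```python
import math

def f_poe(candidates, alpha):
    """Select the (tree, analysis) pair maximizing
    log pG(t, m) + alpha * log pL(m | x); the normalizer Z(x, alpha)
    is constant in (t, m), so it is dropped."""
    return max(candidates,
               key=lambda tm: math.log(candidates[tm][0])
                              + alpha * math.log(candidates[tm][1]))

# Toy GEN(x) with invented scores: the grammar prefers (t1, m1),
# while the morphology model strongly prefers m2.
GEN = {
    ("t1", "m1"): (0.20, 0.10),   # (pG(t, m), pL(m | x))
    ("t2", "m2"): (0.05, 0.90),
}

print(f_poe(GEN, 0.0))    # alpha = 0: morphology ignored -> ('t1', 'm1')
print(f_poe(GEN, 10.0))   # large alpha: morphology dominates -> ('t2', 'm2')
```

At α = 0 only the grammar score matters; as α grows, the morphology model's preference overwhelms the grammar's, matching the two extremes described above.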
3.3 Parsing Algorithms

To parse, we apply a dynamic programming algorithm in the (max, +) semiring to solve the f_poe,α problem shown above. If pL is a unigram-factored model, such that for some single-word morphological model u we have

    pL(m | x) = u(m1 | x1) · u(m2 | x2) · ... · u(mn | xn),

then we can implement morpho-syntactic parsing by weighting the sausage lattice. Let the weight of each arc that starts an analysis in L(xj) be equal to α log u(mj | xj), and let other arcs have weight 0. In the parsing algorithm, the weight on an arc is summed in when the arc is first used to build a constituent.
In general, we would like to define a joint model that assigns (unnormalized) probabilities to elements of GEN(x). If pG is a PCFG and pL can be described as a weighted finite-state transducer, then this joint model is their weighted composition, which is a weighted CFG; call the composed grammar I and its (unnormalized) distribution pI. Compared to G, I will have many more nonterminals if pL has a Markov order greater than 0 (unigram, as above). Because parsing runtime depends heavily on the grammar constant (at best, quadratic in the number of nonterminals), parsing with pI is not computationally attractive.4

3 There is a slight difference. If no parse tree exists for the pL-best morphological analysis, then a less probable m may be chosen. So as α → +∞, we can view f_poe,α as finding the best grammatical m and its best tree, which is not exactly a pipeline.
f_poe,α is not, then, a scalable solution when we wish to use a morphology model pL that can make interdependent decisions about different words in x in context. We propose two new, efficient dynamic programming solutions for joint parsing. The first replaces each word analysis's weight with a posterior that depends on all of x.
Similar methods were applied by Matsuzaki et al. (2005) and Petrov and Klein (2007) for parsing under a PCFG with nonterminals with latent annotations. Their approach was variational, approximating the true posterior over coarse parses using a sentence-specific PCFG on the coarse nonterminals, created directly out of the true fine-grained PCFG. In our case, we approximate the full distribution over morphological analyses for the sentence by a simpler, sentence-specific unigram model that assumes each word's analysis is to be chosen independently of the others. Note that our model (pL) does not make such an assumption; only the approximate model p'L does, and the approximation is per-sentence. The idea resembles a mean-field variational approximation for graphical models.

Turning to implementation, we can solve for pL(mj | x) exactly using the forward-backward algorithm. We will call this method f_vari,α (see Eq. 5).
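As an illustration of where these posteriors come from, the sketch below runs forward-backward over a sausage lattice. It is a toy, first-order stand-in (the CRF described later is second-order, and real arcs carry feature-based weights); the candidate analyses and bigram scores are invented.

```python
def sausage_posteriors(options, bigram):
    """Compute posteriors p(M_j = m_j | x) in a 'sausage' lattice where
    position j offers candidate analyses options[j], scored by an
    unnormalized weight on each adjacent pair of analyses (a toy,
    first-order stand-in for a CRF's scores)."""
    n = len(options)
    fwd = [{m: 1.0 for m in options[0]}]                      # forward pass
    for j in range(1, n):
        fwd.append({m: sum(fwd[j - 1][p] * bigram[p, m] for p in options[j - 1])
                    for m in options[j]})
    bwd = [None] * n                                          # backward pass
    bwd[n - 1] = {m: 1.0 for m in options[n - 1]}
    for j in range(n - 2, -1, -1):
        bwd[j] = {m: sum(bigram[m, s] * bwd[j + 1][s] for s in options[j + 1])
                  for m in options[j]}
    Z = sum(fwd[n - 1][m] for m in options[n - 1])            # total path weight
    return [{m: fwd[j][m] * bwd[j][m] / Z for m in options[j]}
            for j in range(n)]

post = sausage_posteriors([["a", "b"], ["c"]],
                          {("a", "c"): 3.0, ("b", "c"): 1.0})
print(post[0])   # {'a': 0.75, 'b': 0.25}
```

Each per-word posterior sums over all paths through the lattice, so it depends on the whole sentence even though the resulting approximate model scores words independently.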
A closely related method, applied by Goodman (1996), is minimum-risk decoding; Goodman called it "maximum expected recall" when applying it to parsing. In the HMM community it is sometimes called "posterior decoding." Minimum-risk decoding is attributable to Goel and Byrne (2000).

4 In prior work involving factored syntax models, lexicalized (Klein and Manning, 2003b) and bilingual (Smith and Smith, 2004), f_poe,1 was applied, and the asymptotic runtime went to O(n^5) and O(n^7), respectively.
Applied to a single model, minimum-risk decoding factors the parsing decision by penalizable errors, and chooses the solution that minimizes the risk (the expected number of errors under the model). This factors into a sum of expectations, one per potential mistake. This method is expensive for parsing models (since it requires the Inside algorithm to compute expected recall mistakes), but entirely reasonable for sequence labeling models.
The idea is to score each word-analysis mj in the morphological lattice by the expected value (under pL) of mj being present in the final analysis m. This is, of course, pL(Mj = mj | x), the same quantity computed for f_vari,α, except that the score of a path in the lattice is now a sum of posteriors rather than a product. Our second approximate joint parser thus tries to maximize the probability of the parse (as before) and at the same time to minimize the risk of the morphological analysis. See f_risk,α in Eq. 6; the only difference between f_risk,α and f_vari,α is whether posteriors are added (f_risk,α) or multiplied (f_vari,α).
To summarize this section, f_vari,α and f_risk,α are two approximations to the expensive-in-general f_poe,α that boil down to parsing over weighted lattices. The only difference between them is how the lattice is weighted: with α log pL(mj | x) for f_vari,α or with α · pL(mj | x) for f_risk,α.5 In the case of a unigram pL, f_poe,α is equivalent to f_vari,α; otherwise f_poe,α is likely to be too expensive.
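The contrast between the two weighting schemes fits in a few lines; this sketch assumes the posteriors pL(mj | x) have already been computed (for example, by forward-backward).

```python
import math

def arc_weight(posterior, alpha, method):
    """Weight placed on the first arc of a word analysis m_j, given its
    posterior p = pL(m_j | x). f_vari multiplies posteriors, i.e. adds
    alpha * log p in the log domain; f_risk adds posteriors directly,
    a bounded bonus of alpha * p."""
    if method == "vari":
        return alpha * math.log(posterior)   # unbounded below: can veto
    if method == "risk":
        return alpha * posterior             # bonus in [0, alpha]
    raise ValueError(method)

# A near-zero posterior is a (soft) veto under f_vari but costs almost
# nothing under f_risk:
print(arc_weight(1e-6, 1.0, "vari"))   # large negative number
print(arc_weight(1e-6, 1.0, "risk"))   # close to 0
```

The unbounded log term is what lets f_vari,α act as a filter on parses, while f_risk,α can only hand out bounded bonuses, a distinction that matters in the experiments of §5.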
To parse the weighted lattices using f_vari,α and f_risk,α, we use lattice parsing. Lattice parsing is a straightforward generalization of string parsing that indexes constituents by states in the lattice rather than by word interstices. At parsing time, a (max, +) lattice parser finds the best combined parse tree and path through the lattice. Importantly, the data structures that are used in chart parsing need not change in order to accommodate lattices. The generalization over classic Earley or CKY parsing is simple: keep in the parsing chart constituents created over a pair of start state and end state (instead of start position and end position), and (if desired) factor in weights on lattice arcs; see Hall (2005).

5 Until now, we have talked about weighting word analyses, which may cover several arcs, rather than weighting individual arcs. In practice we apply the weight to the first arc of a word analysis, and weight the remaining arcs of that analysis with 0 (no cost or benefit), giving the desired effect.
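A minimal sketch of this generalization follows: Viterbi scores only, with no backpointers; binary rules plus preterminal rules; and a naive fixpoint loop standing in for a properly ordered agenda. The grammar, lattice states, and weights are hypothetical, not taken from the paper's models.

```python
from collections import defaultdict

def lattice_cky(arcs, start, final, rules, unary):
    """(max, +) CKY generalized to a lattice: chart items are indexed by
    pairs of lattice states instead of word interstices. `arcs` holds
    (from_state, to_state, terminal, arc_weight); `unary` maps terminal
    -> [(preterminal, weight)]; `rules` maps (B, C) -> [(A, weight)].
    Returns the best score of an S spanning start..final."""
    chart = defaultdict(lambda: float("-inf"))
    for i, j, term, w in arcs:          # lattice arc weights folded in here
        for label, lw in unary.get(term, []):
            chart[i, j, label] = max(chart[i, j, label], w + lw)
    changed = True                      # naive fixpoint instead of an agenda
    while changed:
        changed = False
        items = list(chart.items())
        for (i, k, b), sb in items:
            for (k2, j, c), sc in items:
                if k2 != k:
                    continue
                for a, rw in rules.get((b, c), []):
                    s = sb + sc + rw
                    if s > chart[i, j, a]:
                        chart[i, j, a] = s
                        changed = True
    return chart[start, final, "S"]

# A tiny sausage: "the dog" segmented as two arcs, or one unsegmented arc.
arcs = [(0, 1, "the", 0.0), (1, 2, "dog", 0.0), (0, 2, "thedog", -1.0)]
unary = {"the": [("D", 0.0)], "dog": [("N", 0.0)], "thedog": [("N", 0.0)]}
rules = {("D", "N"): [("S", 0.0)]}
print(lattice_cky(arcs, 0, 2, rules, unary))   # 0.0
```

Note that the chart indexing is the only change relative to string CKY; the unsegmented arc simply yields an alternative constituent over the same state pair, and the best-scoring combined tree-and-path wins.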
4 Factored Models

A fair comparison of joint and pipeline parsing must make some attempt to control for the component models. We describe here two PCFGs we used for pG(t, m) and two finite-state morphological models we used for pL(m | x). We show how these models perform in stand-alone evaluations.

For all experiments, we used the Hebrew Treebank (Sima'an et al., 2001). After removing traces and removing functional information from the nonterminals, we had 3,770 sentences in the training set, 371 sentences in the development set (used primarily to select the value of α), and 370 sentences in the test set.
Our first syntax model is an unbinarized PCFG trained using relative frequencies. Preterminal (POS tag → morpheme) rules are smoothed using backoff to a model that predicts the morpheme length and letter sequence. This grammar is remarkably good, given the limited effort that went into it. The rules in the training set had high coverage with respect to the development set: an oracle experiment in which we maximized the number of recovered gold-standard constituents (on the development set) gave F1 accuracy of 93.7%. In fact, its accuracy surpasses that of more complex, lexicalized models: given gold-standard morphology, it achieves 81.2% (compared to 72.0% by Bikel's parser, with head rules specified by a native speaker). This is probably attributable to the dataset's size, which makes training with highly-parameterized lexicalized models precarious and prone to overfitting. With first-order vertical markovization (i.e., annotating each nonterminal with its parent, as in Johnson, 1998), accuracy is also at 81.2%. Tuning the horizontal markovization of the grammar rules (Klein and Manning, 2003a) had a small, adverse effect on this dataset. Since the PCFG model was relatively successful compared to lexicalized models, and is faster to run, we decided to use a vanilla PCFG, denoted Gvan, and a parent-annotated version of that PCFG (Johnson, 1998), denoted Gv=2.
Both of our morphology models use the same morphological lexicon L, which we describe first. In this work, a morphological analysis of a word is a sequence of morphemes, possibly with a tag for each morpheme. There are several available analyzers for Hebrew, including Yona and Wintner (2005) and Segal (2000). We use instead an empirically-constructed generative lexicon that has the advantage of matching the Treebank data and conventions. If the Treebank is enriched, this would then directly benefit the lexicon and our models.

Starting with the training data from the Hebrew Treebank, we first create a set of prefixes Mp ⊆ M; this set includes any morpheme seen in a non-final position within any word. We also create a set of stems Ms ⊆ M that includes any morpheme seen in a final position in a word. This effectively captures the morphological analysis convention in the Hebrew Treebank, where a stem is prefixed by a relatively dominant, low-entropy sequence of 0-5 prefix morphemes. For example, MHKLB ("from the dog") is analyzed as M+H+KLB, with prefixes M ("from") and H ("the"), and KLB ("dog") as the stem. In practice, |Mp| = 124 (including some conventions for numerals) and |Ms| = 13,588.
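The prefix and stem inventories described above can be collected in a few lines; the function name and toy data below are ours, not the paper's.

```python
def build_inventories(training_analyses):
    """Collect prefix morphemes (seen word-internally in non-final
    position) and stems (seen word-finally), following the Treebank
    convention described above. `training_analyses` is a list of
    per-word morpheme sequences, e.g. [["M", "H", "KLB"], ["KLB"]]."""
    prefixes, stems = set(), set()
    for morphemes in training_analyses:
        prefixes.update(morphemes[:-1])   # all non-final positions
        stems.add(morphemes[-1])          # the final position
    return prefixes, stems

M_p, M_s = build_inventories([["M", "H", "KLB"], ["KLB"], ["L", "H", "XDR"]])
print(sorted(M_p), sorted(M_s))   # ['H', 'L', 'M'] ['KLB', 'XDR']
```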
The morphological lexicon is then defined to permit any analysis consisting of a prefix sequence over Mp followed by a stem in Ms, where mk denotes (m1, ..., mk) and count(mk, x) denotes the number of occurrences of x disambiguated as mk in the training set. Note that L(x) also includes any analysis of x observed in the training data. This permits the memorization of any observed analysis that is more involved than simple segmentation (4% of word tokens in the training set; e.g., LXDR ("to the room") is analyzed as L+H+XDR). This will have an effect on evaluation (see §5.1). On the development data, L has 98.6% coverage.
The baseline morphology model, pL^uni, first defines a joint distribution (over the prefix sequence and the stem); the word model factors out when we conditionalize to form pL^uni((m1, ..., mk) | x). The prefix sequence model is a multinomial estimated by MLE. The stem model (conditioned on the prefix sequence) is smoothed to permit any stem that is a sequence of Hebrew characters. On the development data, pL^uni is 88.8% accurate (by word).
The second morphology model, pL^crf, which is based on the same morphological lexicon L, uses a second-order conditional random field (Lafferty et al., 2001) to disambiguate the full sentence by modeling local contexts (Kudo et al., 2004; Smith et al., 2005b). Space does not permit a full description; the model uses all the features of Smith et al. (2005b) except the "lemma" portion of the model, since the Hebrew Treebank does not provide lemmas. The weights are trained to maximize the probability of the correct path through the morphological lattice, conditioned on the lattice. This is therefore a discriminative model that defines pL(m | x) directly, though we ignore the normalization factor in parsing.
Until now we have described pL as a model of morphemes, but this CRF is trained to predict POS tags as well. We can either use the tags (i.e., label the morphological lattice with tag/morpheme pairs, so that the lattice parser finds a parse that is consistent under both models), or sum the tags out and let the parser do the tagging. One subtlety is the tagging of words not seen in the training data; for such words an unsegmented hypothesis with the tag UNKNOWN is included in the lattice and may therefore be selected by the CRF.
On the development data, pL^crf is 89.8% accurate on morphology, with 74.9% fine-grained POS-tagging F1-accuracy (see §5.1).
Note on generative and discriminative models. The reader may be skeptical of our choice to combine a generative PCFG with a discriminative CRF. We point out that both are used to define conditional distributions over desired "output" structures given "input" sequences. Notwithstanding the fact that the factors can be estimated in very different ways, combining them in an exact or approximate product of experts is a reasonable and principled approach.
5 Experiments

In this section we evaluate parsing performance, but an evaluation issue is resolved first.

5.1 Evaluation Measures

The "Parseval" measures (Black et al., 1991) are used to evaluate a parser's phrase-structure trees against a gold standard. They compute precision and recall of constituents, each indexed by a label and two endpoints. As pointed out by Tsarfaty (2006), joint parsing of morphology and syntax renders this indexing inappropriate, since it assumes the yields of the trees are identical; that assumption is violated if there are any errors in the hypothesized m. Tsarfaty (2006) instead indexed by non-whitespace character positions, to deal with segmentation mismatches. In general (and in this work) that is still insufficient, since L(x) may include m that are not simply segmentations of x (see §4.2.1).
Roark et al. (2006) propose an evaluation metric for comparing a parse tree over a sentence generated by a speech recognizer to a gold-standard parse. As in our case, the hypothesized tree could have a different yield than the original gold-standard parse tree, because of errors made by the speech recognizer. The metric is based on an alignment between the hypothesized sentence and the gold-standard sentence. We used a similar evaluation metric, which also takes into account information about parallel word boundaries, a piece of information that does not appear naturally in speech recognition.

Given the correct m* and the hypothesis m, we use dynamic programming to find an optimal many-to-many monotonic alignment between the atomic morphemes in the two sequences. The algorithm penalizes each violation (by a morpheme) of a one-to-one correspondence,6 and each character edit required to transform one side of a correspondence into the other (without whitespace). Word boundaries are (here) known and included as index positions.
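A sketch of this alignment cost follows, over morpheme sequences only; the word-boundary index positions and the downstream constituent scoring are omitted, and the function names are ours.

```python
from functools import lru_cache

def edit(a, b):
    """Plain character-level Levenshtein distance."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def align_cost(gold, hyp):
    """Minimum cost of a monotonic many-to-many alignment of two morpheme
    sequences: an a-to-b correspondence costs (a + b - 2) for the
    one-to-one violations, plus the character edits between the joined
    (whitespace-free) morphemes."""
    @lru_cache(maxsize=None)
    def best(i, j):
        if i == len(gold) and j == len(hyp):
            return 0
        if i == len(gold) or j == len(hyp):
            return float("inf")
        return min((a + b - 2)
                   + edit("".join(gold[i:i + a]), "".join(hyp[j:j + b]))
                   + best(i + a, j + b)
                   for a in range(1, len(gold) - i + 1)
                   for b in range(1, len(hyp) - j + 1))
    return best(0, 0)

# An unsegmented hypothesis: H+XDR aligned 2-to-1 costs 1 (= 2 + 1 - 2).
print(align_cost(["L", "H", "XDR"], ["L", "HXDR"]))   # 1
```

When the two sequences are identical, every correspondence is one-to-one with zero edits, so the cost is zero and the metric collapses to ordinary Parseval indexing.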
In the case where m = m* (or equal up to whitespace), the method is identical to Parseval (and also to Tsarfaty, 2006). POS tag accuracy is evaluated the same way, for the same reasons; we report F1-accuracy for tagging and parsing.
5.2 Experimental Comparison

In our experiment we vary four settings:

• Decoding method: f_vari,α or f_risk,α (§3.3).
• Syntax model: Gvan or Gv=2 (§4.1).
• Morphology model: pL^uni or pL^crf (§4), and, for the CRF, whether its POS tags are kept or summed out.7
• The weight α: α = 0 ignores pL, and as α → +∞ a morphology-first pipeline is approached.

6 That is, in a correspondence of a morphemes in one string with b in the other, the penalty is a + b − 2, since the morpheme on each side is not in violation.

7 One subtlety is that any arc with the unknown POS tag can be relabeled, to any other tag, by the syntax model, whose preterminal rules are smoothed. This was crucial for α → +∞ (pipeline) parsing with pL^crf as the morphology model, since the parser does not recognize UNKNOWN as a tag.

Table 1: Results of experiments on Hebrew (test data, max. length 40). This table shows the performance of joint parsing (finite α; left) and a pipeline (α → +∞; right). Joint parsing with a non-unigram morphology model is too expensive (marked *). Morphological analysis accuracy (by word), fine-grained (full tags) and coarse-grained (only parts of speech) POS tagging accuracy (F1), and generalized constituent accuracy (F1) are reported; α was tuned for each of these separately. Boldface denotes figures that were significantly better than their counterparts in the same row, under a binomial sign test (p < 0.05). † marks the best overall accuracy and figures that are not significantly worse (binomial sign test, p < 0.05).
We measured four outcome values: segmentation accuracy (the fraction of word tokens segmented correctly), fine- and coarse-grained tagging accuracy,8 and parsing accuracy. For tagging and parsing, F1-measures are given, according to the generalized evaluation measure described in §5.1.

Tab. 1 compares parsing with tuned α values to the pipeline. The best results were achieved using f_vari,α with the CRF and joint disambiguation. Without the CRF (using pL^uni), the difference between the decoding algorithms is less apparent, suggesting an interaction between the sophistication of the components and the best way to decode with them.
These results suggest that f_vari,α, which permits pL to "veto" any structure involving a morphological analysis for any word that is a posteriori unlikely (note that log pL(mj | x) can be an arbitrarily large negative number), is beneficial as a "filter" on parses.9

8 Although the Hebrew Treebank is small, the size of its POS tagset is large (four times larger than the Penn Treebank's), because the tags encode morphological features (gender, person, and number). These features have either been ignored in prior work or encoded differently. In order for our POS-tagging figures to be reasonably comparable to previous work, we include accuracy for coarse-grained tags (only the core part of speech) as well as for the detailed Hebrew Treebank tags.
f_risk,α, on the other hand, is only allowed to give "bonuses" of up to α to each morphological analysis that pL believes in; its influence is therefore weaker. This result is consistent with the findings of Petrov et al. (2007) for another approximate parsing task. The advantage of the parent-annotated PCFG is also more apparent when the CRF is used for morphology, and when α is tuned. All other things equal, then, pL^crf led to higher accuracy all around. Letting the CRF help predict the POS tags helped tagging accuracy but not parsing accuracy.
While the gains over the pipeline are modest, the segmentation, fine POS, and parsing accuracy scores achieved by joint disambiguation with f_vari,α and the CRF are significantly better than any of the pipeline conditions. Interestingly, if we had not tested with the CRF, we might have reached a very different conclusion about the usefulness of tuning α as opposed to a pipeline. With the unigram morphology model, joint parsing frequently underperforms the pipeline, sometimes even significantly.
The explanation, we believe, has to do with the ability of the unigram model to estimate a good distribution over analyses.

9 Another way to describe this combination is to call it a product of |x| + 1 experts: one for the morphological analysis of each word, plus the grammar. The morphology experts (softly) veto any analysis that is dubious based on surface criteria, and the grammar (softly) vetoes less-grammatical parses.

Table 2: Oracle results of experiments on Hebrew (test data, max. length 40). This table shows the performance of morphological segmentation, part-of-speech tagging, coarse part-of-speech tagging, and parsing when using an oracle to select the best α for each sentence. The notation and interpretation of the numbers are the same as in Tab. 1.
While
the
unigram
model
is
nearly
as
good
as
the
CRF
at
picking
the
right
segmentation
for
a
word
,
joint
parsing
demands
much
more
.
In
case
the
best
segmentation
does
not
lead
to
a
grammatical
morpheme
sequence
(
under
the
syntax
model
)
,
the
morphology
model
needs
to
be
able
to
give
relative
strengths
to
the
alternatives
.
The
unigram
model
is
less
able
to
do
this
,
because
it
ignores
the
context
of
the
word
,
and
so
the
benefit
of
joint
parsing
is
lost
.
Most commonly, the tuned value of α is around 10 (not shown, to preserve clarity). Because of ignored normalization constants, this does not mean that morphology is "10× more important than syntax," but it does mean that, for a particular pL and pG, tuning their relative importance in decoding can improve accuracy. In Tab. 2 we show how performance would improve if the oracle value of α were selected for each test-set sentence; this further highlights the potential impact of perfecting the tradeoff between the models. Of course, selecting α automatically at test time, per sentence, is an open problem.
To our knowledge, the parsers we have described represent the state of the art in Modern Hebrew parsing. The closest result is Tsarfaty (2006), which we have not directly replicated. Tsarfaty's model is essentially a pipeline application of f_poe,∞ with a grammar like Gvan. Her work focused more on the interplay between the segmentation and POS tagging models and the amount of information passed to the parser. Some key differences preclude direct comparison: we modeled fine-grained tags (though we report both kinds of tagging accuracy), we employed a richer morphological lexicon (permitting analyses that are not just segmentations), and we used a different training/test split and length filter (we used longer sentences). Nonetheless, our conclusions support the argument in Tsarfaty (2006) for more integrated parsing methods.
We conclude that tuning the relative importance of the two models, rather than pipelining to give one infinitely more importance, can improve segmentation, tagging, and parsing accuracy. This suggests that future parsing efforts for languages with rich morphology might continue to assume separately-trained (and separately-improved) morphology and syntax components, which would stand to gain from joint decoding. In our experiments, better morphological disambiguation was crucial to getting any benefit from joint decoding. Our result also suggests that exploring new, fully-integrated models (and training methods for them) may be advantageous.
6 Conclusion

We showed that joint morpho-syntactic parsing can improve the accuracy of both kinds of disambiguation. Several efficient parsing methods were presented, using factored state-of-the-art morphology and syntax models for the language under consideration. We demonstrated state-of-the-art performance on, and consistent improvements across many settings for, Modern Hebrew, a morphologically-rich language with a relatively small treebank.
