To date, work on Non-Local Dependencies (NLDs) has focused almost exclusively on English, and it is an open research question how well these approaches migrate to other languages.
This paper surveys non-local dependency constructions in Chinese as represented in the Penn Chinese Treebank (CTB) and provides an approach for generating proper predicate-argument-modifier structures, including NLDs, from surface context-free phrase structure trees. Our approach recovers non-local dependencies at the level of Lexical-Functional Grammar f-structures, using automatically acquired subcategorisation frames and f-structure paths linking antecedents and traces in NLDs.
Currently our algorithm achieves a 92.2% f-score for trace insertion and 84.3% for antecedent recovery when evaluated on gold-standard CTB trees, and 64.7% and 54.7%, respectively, on CTB-trained state-of-the-art parser output trees.
1
Introduction
A substantial number of linguistic phenomena, such as topicalisation, relativisation, coordination and raising &amp; control constructions, permit a constituent in one position to bear the grammatical role associated with another position.
These relationships are referred to as Non-Local Dependencies (NLDs); the surface location of the constituent is called the antecedent, and the site where the antecedent should be interpreted semantically is called the trace.
Capturing
non-local
dependencies
is
crucial
to
the
accurate
and
complete
determination
of
semantic
interpretation
in
the
form
of
predicate-argument-modifier
structures
or
deep
dependencies
.
However
,
with
few
exceptions
(
Model
3
of
Collins
,
1999
;
Schmid
,
2006
)
,
output
trees
produced
by
state-of-the-art
broad
coverage
statistical
parsers
(
Charniak
,
2000
;
Bikel
,
2004
)
are
only
surface
context-free
phrase
structure
trees
(
CFG-trees
)
without
empty
categories
and
coindexation
to
represent
displaced
constituents
.
Because
of
the
importance
of
non-local
dependencies
in
the
proper
determination
of
predicate-argument
structures
,
recent
years
have
witnessed
a
considerable
amount
of
research
on
reconstructing
such
hidden
relationships
in
CFG-trees
.
Three strategies have been proposed: (i) post-processing the output of a parser that does not represent NLDs (Johnson, 2002; Campbell, 2004; Levy and Manning, 2004); (ii) integrating non-local dependency recovery into the parser by enriching a simple PCFG model with GPSG-style gap features (Collins, 1999; Schmid, 2006);
(iii) pre-processing the input sentence with a finite-state trace tagger which detects empty nodes before parsing, and identifying the antecedents in the parser output using the gap information (Dienes and Dubey, 2003a; Dienes and Dubey, 2003b).
In addition to CFG-oriented approaches, a number of richer treebank-based grammar acquisition and parsing methods based on HPSG (Miyao et al.), LFG (Riezler et al., 2002; Cahill et al., 2004) and Dependency Grammar (Nivre and Nilsson, 2005) incorporate non-local dependencies into their deep syntactic or semantic representations.
(Jijkoun, 2003; Jijkoun and Rijke, 2004) also describe post-processing methods to recover NLDs, which are applied to syntactic dependency structures converted from CFG-trees.

Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 257-266, Prague, June 2007. © 2007 Association for Computational Linguistics

A common characteristic of all these approaches is that, to date, the research has focused almost entirely on English,2 despite the disparity in the type and frequency of non-local dependencies across languages.
In this paper, we address the recovery of non-local dependencies for Chinese, a language drastically different from English, whose special features, such as the lack of morphological inflection, make NLD recovery more challenging.
Inspired by the methodology of (Cahill et al., 2004), originally designed for English and the Penn-II treebank, our approach to Chinese non-local dependency recovery is based on Lexical-Functional Grammar (LFG), a formalism that involves both phrase structure trees and predicate-argument structures.
NLDs
are
recovered
in
LFG
f-structures
using
automatically
acquired
subcategorisation
frames
and
finite
approximations
of
functional
uncertainty
equations
describing
NLD
paths
at
the
level
of
f-structures
.
The
paper
is
structured
as
follows
:
in
Section
2
we
outline
the
distinguishing
features
of
Chinese
nonlocal
dependencies
compared
to
English
.
In Section 3 we review the method of (Cahill et al., 2004) for recovering English NLDs in treebank-based LFG approximations.
In
Section
4
,
we
describe
how
we
modify
and
substantially
extend
the
previous
method
to
recover
all
types
of
NLDs
for
Chinese
data
.
We
present
experiments
and
provide
a
dependency-based
evaluation
in
Section
5
.
Finally
we
conclude
and
summarise
future
work
.
2
Non-Local
Dependencies
in
Chinese
In
the
Penn
Chinese
Treebank
(
CTB
)
(
Xue
et
al.
,
2002
)
non-local
dependencies
are
represented
in
terms
of
empty
categories
(
ECs
)
and
(
for
some
of
them
)
coindexation
with
antecedents
,
as
exemplified
in
Figure
1
.
Following
previous
work
for
English
and
the
CTB
annotation
scheme
,
we
use
nonlocal
dependencies
as
a
cover
term
for
all
missing
or
dislocated
elements
represented
in
the
CTB
as
an
empty
category
(with or without coindexation/antecedent)
,
and
our
use
of
the
term
remains
agnostic
about
fine-grained
distinctions
between
nonlocal
dependencies
drawn
in
the
theoretical
linguistics
literature
.
In order to give an overview of the characteristics of Chinese non-local dependencies, Table 1 lists the distribution of the most frequent types of empty categories and their antecedents in CTB5.1.

2(Levy and Manning, 2004) is the only approach we are aware of that has been applied to both English and German.

(Figure 1 example gloss:) not / want / look-for / train / have / potential / DE / new / writer — '(People) don't want to look for and train new writers who have potential.'
Figure
1
:
Example
of
non-local
annotations
in
CTB
,
including
dropped
subject
(
*
pro
*
)
,
control
subject
(
*
PRO
*
)
,
relative
clause
(
*
T
*
)
,
and
coordination
(
*
RNR
*
)
.
According to their different linguistic properties,
we
divide
the
empty
nodes
listed
in
Table
1
into
three
major
types
:
null
relative
pronouns
,
locally
mediated
dependencies
,
and
long-distance
dependencies
.
Null
Relative
Pronouns
(
lines
2
,
7
)
themselves
are
local
dependencies
,
and
thus
are
not
coindexed
with
an
antecedent
.
But
they
mediate
non-local
dependencies
by
functioning
as
antecedents
for
the
dislocated constituent inside a relative clause.4

3An extensive description of the types of empty categories and the use of coindexation in CTB can be found in Section VI of the bracketing guidelines.

Description: Pro-drop situations (e.g. *pro* / not / ever / encounter / DE / problem); Empty relative pronouns (e.g. *OP* / population / dense / area); Raising &amp; passive constructions (e.g. we / BEI / exclude / outside); Coordinations (e.g. encourage / *RNR* / and / support / investment)

Table 1: The distribution of the most frequent types of empty categories and their antecedents in CTB5.1. Types with frequency less than 30 are ignored.
Locally
Mediated
Dependencies
are
non-local
as
they
are
projected
through
a
third
lexical
item
(
such
as
a
control
or
raising
verb
)
which
involves
a
dependency
between
two
adjacent
levels
and
they
are
therefore
bounded
.
This
type
encompasses
:
(
line
8
)
raising
constructions
,
and
short-bei
constructions
(
passivisation
)
;
(line 3) control constructions, which include two different types:
a generic *PRO* with an arbitrary reading (roughly equivalent to the unexpressed subjects of to-infinitives and gerunds in English); and a *PRO* with definite reference (subject or object control).
Long-Distance
Dependencies
(
LDDs
)
differ
from
locally
mediated
dependencies
,
in
that
the path linking the antecedent and trace might be unbounded (hence such dependencies are also called unbounded or long-range dependencies).
LDDs
include
the
following
phenomena
:
Wh-traces
in
relative
clauses
,
where
an
argument
(
line
1
)
or
adjunct
(
line
6
)
"moves" and is coindexed with the extraction site.
Topicalisation
(
lines
5
,
11
)
is
one
of
the
typical
LDDs
in
English
,
whereas
in
Chinese
not
all
topics
involve
displacement
,
for instance (2):

(2) Beijing / autumn / most / beautiful — 'Autumn is the most beautiful in Beijing.'
4Null relative pronouns are used in the CTB annotation to distinguish relative clauses, in which an argument or adjunct of the embedded verb is missing, from complement (appositive) clauses, which do not involve non-local dependencies.
5However, in this case the CTB annotation does not coindex the locus (trace) with its controller (antecedent).
Coordination
is
divided
into
two
groups
:
right
node
raising
of
an
NP
phrase
which
is
an
argument
shared
by
the
coordinate
predicates
(
line
9
)
;
and
the
coordination
of
quantifier
phrases
(
line
10
)
and
verbal
phrases
(
3
)
,
in
which
the
antecedent
and
trace
are
both
predicates
and
possibly
take
their
own
arguments
or
adjuncts
.
(3) I / and / he / respectively / go-to / company / and / *RNR* / hospital — 'I went to the company and he went to the hospital respectively.'
Pro-drop
situations
(
line
4
)
are
prominent
in
Chinese
because
subject and object are only semantically, but not syntactically, required.
Nevertheless
we
also
treat
pro-drop
as
a
long-distance
dependency
as
in
principle
the
dropped
subjects
can
be
determined
from
the
general
(
often
inter-sentential
)
context
.
Table
2
gives
a
quantitative
comparison
of
NLDs
between
Chinese
data
in
CTB5.1
and
English
in
Penn-II
.
The
data
reveals
that
:
first
,
NLDs
in
Chinese
are
much
more
frequent
than
in
English
(
by
nearly
1.5
times
)
;
and
moreover
69
%
are
not
explicitly
linked
to
an
antecedent
,
compared
to
43
%
for
English
,
due
to
the
high
prevalence
of
pro-drop
in
Chinese
.
Table 2: Comparison of NLDs between Chinese data in CTB5.1 and English in Penn-II.
Figure 2: (a) the CTB tree; (b) LFG c-structure with functional equations; (c) corresponding f-structure. (↑) in the functional annotations refers to the f-structure associated with the mother node and (↓) to that of the local node.
3
NLD
Recovery
in
LFG
Approximations
3.1
Lexical
Functional
Grammar
Lexical
Functional
Grammar
(Kaplan and Bresnan, 1982)
is
a
constraint-based
grammar
formalism
which
minimally
involves
two
levels
of
syntactic
representation
:
c(onstituent)-structure and f(unctional)-structure.
C-structure
takes
the
form
of
CFG-trees
and
captures
surface
grammatical
configurations
.
F-structure
encodes
more
abstract
grammatical functions (GFs) such as SUBJ(ect), OBJ(ect), COMP(lement), ADJ(unct), TOPIC, etc., in the form of Attribute-Value Matrices which approximate basic predicate-argument-adjunct structures or dependency relations.
C-structures
are
related
to
f-structures
by
functional
annotations
(
cf.
Figure
2
(
b
)
&amp;
(
c
)
)
.
In LFG, non-local dependencies are captured at the f-structure level in terms of reentrancies, indicated by [1] for the topicalisation and [2] for the control construction in Figure 2(c), obviating the need for the traces and coindexation used in CTB trees (Figure 2(a)) in the c-structure (Figure 2(b)).
LFG
uses
functional
uncertainty
(
FU
)
equations
(
regular
expressions
)
to
specify
paths
in
f-structures
between
the
trace
and
its
antecedent
.
To account for the reentrancy [1] in the f-structure, an FU equation of the form ↑TOPIC = ↑COMP* OBJ is required (as the length of the dependency might be unbounded).
The equation states that the value of the TOPIC attribute is token-identical to the value of the final OBJ argument along a path through the immediately enclosing f-structure via zero or more COMP attributes.
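The path described by this FU equation can be illustrated with a small sketch, assuming f-structures are modelled as nested Python dicts and reentrancy as object identity (the function names and dict encoding are illustrative, not from the paper):

```python
def resolve_fu_path(f):
    """Collect the values reachable via COMP* OBJ from f-structure f,
    i.e. the candidate right-hand sides of the equation TOPIC = COMP* OBJ."""
    results = []
    current = f
    while True:
        if "obj" in current:
            results.append(current["obj"])   # OBJ after zero or more COMPs
        if "comp" in current:
            current = current["comp"]        # descend one COMP level
        else:
            break
    return results

def topic_satisfies_equation(f):
    """True iff TOPIC is token-identical (same object) to some OBJ
    reachable along the COMP* path."""
    return any(obj is f.get("topic") for obj in resolve_fu_path(f))
```

Token identity is tested with `is` rather than `==`, mirroring the LFG notion that the reentrant values are one and the same f-structure, not merely equal ones.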
In addition to FU equations, subcategorisation information is also a significant ingredient in LFG's account of non-local dependencies.
Subcategorisation frames (subcat frames) specify the governable grammatical functions (i.e. arguments) required by a particular predicate.
In Figure 2(c) each predicate in the f-structure is followed by its subcat frame.
3.2
F-Structure
Based
NLD
Recovery
(Cahill et al., 2004) presented an NLD recovery algorithm operating at the level of LFG f-structures for treebank-based LFG approximations.
The method automatically converts Penn-II treebank trees with traces and coindexation into proper f-structures, in which the traces and coindexation of the treebank trees (Figure 2(a)) are represented as corresponding reentrancies (Figure 2(c)). From these f-structures it automatically extracts subcat frames by collecting all arguments of the local predicate at each level of the f-structure, and further acquires finite approximations of FU equations by extracting the paths linking the reentrancies occurring in the f-structures.
The approach of (Cahill et al., 2004) for English resolves three LDD types in parser output trees without traces and coindexation (Figure 2(b)), i.e. topicalisation (TOPIC), wh-movement in relative clauses (TOPIC_REL) and interrogatives (FOCUS).
Given a set of subcat frames s for lemma w with probabilities P(s|w), and a set of paths p linking reentrancies conditioned on the triggering antecedent a (TOPIC, TOPIC_REL or FOCUS) with probabilities P(p|a), the core algorithm recursively traverses an f-structure f to insert a trace GF g at a sub-f-structure h such that:

- all GFs specified in the subcat frame s except g are present at h (completeness condition)
- no other governable GFs present at h are specified in s (coherence condition)

and to rank resolution candidates according to the product of subcat frame and NLD path probabilities (Eq. 1).
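The completeness and coherence checks and the ranking step can be sketched as follows; the set of governable GFs and the candidate encoding are assumptions for illustration, not the paper's implementation:

```python
# An illustrative set of governable grammatical functions.
GOVERNABLE = {"subj", "obj", "obj2", "comp", "xcomp", "obl"}

def can_insert_trace(h, g, subcat_frame):
    """A trace GF g may be inserted at sub-f-structure h iff h plus g is
    complete and coherent relative to the local predicate's subcat frame."""
    present = {gf for gf in h if gf in GOVERNABLE}
    complete = set(subcat_frame) - {g} <= present   # all other required GFs present at h
    coherent = present <= set(subcat_frame)         # no ungoverned governable GFs at h
    return g not in present and complete and coherent

def rank_candidates(candidates, p_frame, p_path):
    """Rank resolution candidates by P(s|w) * P(p|a), as in Eq. (1)."""
    return sorted(candidates,
                  key=lambda c: p_frame[c["frame"]] * p_path[c["path"]],
                  reverse=True)
```

Completeness and coherence are the standard LFG well-formedness conditions; here they simply become subset tests over the governable GFs at h.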
4 Recovering Chinese Non-Local Dependencies

4.1 Automatic F-Structure Generation
Our
NLD
recovery
is
done
at
the
level
of
LFG
f-structures
.
Inspired
by
(
Cahill
et
al.
,
2004
;
Burke
et
al.
,
2004
)
,
we
have
implemented
an
f-structure
annotation
algorithm
to
automatically
obtain
f-structures
from
CFG-trees
in
the
CTB5.1
.
The
f-structure
annotation
algorithm
,
described
below
,
is
applied
both
to
the
original
CTB
trees
providing
functional
tags
,
traces
and
coindexation
to
generate
the
training
corpus
,
and
to
the
parser
output
trees
without
traces
and
coindexation
to
provide
the
f-structure
input
for
NLD
recovery
.
Each
local
subtree
of
depth
one
is
partitioned
by
the
head
into
left
and
right
context
.
Left-right
context
rules
exploiting
configurational
,
categorial
and
CTB
functional
tag
information
are
used
to
assign
each
left
and
right
constituent
with
appropriate
functional
equations
.
Empty nodes and coindexation in the CTB trees are automatically captured as corresponding reentrancies at the f-structure level via functional equations.
All
the
functional
equations
are
collected
and
then
passed
to
a
constraint
solver
to
generate
f-structures
.
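The annotation procedure can be sketched roughly as below; the rule table is a toy stand-in for the paper's left-right context rules, not the actual rule set:

```python
# Illustrative left-right context rules: (parent category, child category,
# position relative to the head) -> functional equation.
RULES = {
    ("IP", "NP", "left"): "up-subj = down",
    ("IP", "VP", "head"): "up = down",
    ("VP", "VV", "head"): "up = down",
    ("VP", "NP", "right"): "up-obj = down",
}

def annotate(parent, children, head_index):
    """Partition a depth-one subtree by its head into left and right context
    and assign each constituent a functional equation; unmatched constituents
    default to adjuncts (a simplification for this sketch)."""
    equations = []
    for i, cat in enumerate(children):
        pos = "head" if i == head_index else ("left" if i < head_index else "right")
        eq = RULES.get((parent, cat, pos), "up-adjunct = down")
        equations.append((cat, eq))
    return equations
```

In the full pipeline the equations collected this way would then be handed to the constraint solver that builds the f-structures.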
4.2
Adaptation
to
Chinese
The algorithm of (Cahill et al., 2004) (Section 3.2) only resolves certain NLDs with known types of antecedents (TOPIC, TOPIC_REL and FOCUS) at the f-structure level.
However
,
as
illustrated
in
Section
2
,
except
for
relative
clauses
,
the
antecedents
in
Chinese
NLDs
do
not
systematically
correspond
to
types
of
grammatical
function
.
Furthermore, nearly 70% of all empty categories are not coindexed with an antecedent.
In order to resolve all Chinese NLDs represented in the CTB, we modify and substantially extend the (Cahill et al., 2004) (henceforth C04) algorithm as follows:
Given
the
set
of
subcat
frames
s
for
the
word
w
,
and
a
set
of
paths
p
for
the
trace
t
,
the
algorithm
traverses
the
f-structure
f
to
:
- predict a dislocated argument t at a sub-f-structure h by comparing the local PRED:w to w's subcat frames s
- t can be inserted at h if h together with t is complete and coherent relative to the subcat frame
- traverse f starting from t along the path p
- link t to its antecedent a if p's ending GF a exists in a sub-f-structure within f; or leave t without an antecedent if an empty path for t exists
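The last two steps, following a path from the trace and linking it to an antecedent (or leaving it unresolved on the empty path), might look like this sketch, where a path is encoded as a tuple of GFs and None stands for the empty path (an assumed encoding):

```python
def resolve_antecedent(f, path):
    """Follow path (a tuple of GFs) through the nested-dict f-structure f
    to the antecedent; None encodes the empty path (e.g. pro-drop), in
    which case the trace is left without an antecedent."""
    if path is None:
        return None                 # empty path: no antecedent
    node = f
    for gf in path:
        if not isinstance(node, dict) or gf not in node:
            return None             # the path's GF does not exist in f
        node = node[gf]
    return node
```

Returning None covers both outcomes named above: a genuinely antecedent-less trace and a candidate path that does not exist in this f-structure.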
In the modified algorithm, we condition the probability of NLD path p (including the empty path without an antecedent) on the GF associated with the trace t, rather than on the antecedent a as in C04. The path probability P(p|t) is estimated as:

P(p|t) = count(p, t) / Σ_{i=1}^{n} count(p_i, t)
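The estimate of P(p|t) is a relative frequency over paths observed in the training f-structures; a sketch, with an assumed encoding of observations as (trace GF, path) pairs and the empty path as None:

```python
from collections import Counter

def estimate_path_probs(observations):
    """MLE of P(p|t): observations is a list of (trace_gf, path) pairs
    extracted from the training f-structures; None is the empty path
    (trace left without an antecedent)."""
    joint = Counter(observations)                  # count(p, t)
    marginal = Counter(t for t, _ in observations) # sum over paths for t
    return {(t, p): c / marginal[t] for (t, p), c in joint.items()}
```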
In contrast even to English, Chinese has very little morphological information. As a result, every word in Chinese has a unique form regardless of its syntactic distribution.
For this reason we use additional syntactic features w_feats, besides the word form, to discriminate between appropriate subcat frames s. For a given word w, w_feats includes:

- w_gf: the grammatical function of w
As more conditioning features may cause severe sparse-data problems, in order to increase the coverage of the automatically acquired subcat frames, the subcat frame frequencies count(s, w, w_feats) are smoothed by backing off to w's part-of-speech w_pos according to Eq. (4). P(s|w_pos) is estimated according to Eq. (5) and weighted by a parameter θ.
The lexical subcat frame probabilities are then estimated from the smoothed frequencies:

P(s|w, w_feats) = count_bk(s, w, w_feats) / Σ_{i=1}^{n} count_bk(s_i, w, w_feats)
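The backoff smoothing can be sketched as follows; this is an illustrative reading of the scheme, and the value of the backoff weight θ is an assumption, not the paper's tuned value:

```python
THETA = 0.5  # backoff weight (illustrative value; the paper tunes a parameter)

def smoothed_frame_probs(frame_word_counts, frame_pos_counts):
    """count_bk(s, w, w_feats) = count(s, w, w_feats) + THETA * P(s | w_pos);
    the lexical probabilities P(s | w, w_feats) are the normalised smoothed
    counts. A sketch of the backoff scheme described above."""
    pos_total = sum(frame_pos_counts.values())
    count_bk = {}
    for s in set(frame_word_counts) | set(frame_pos_counts):
        p_s_pos = frame_pos_counts.get(s, 0) / pos_total   # P(s | w_pos)
        count_bk[s] = frame_word_counts.get(s, 0) + THETA * p_s_pos
    total = sum(count_bk.values())
    return {s: c / total for s, c in count_bk.items()}
```

The backoff term gives non-zero mass to frames seen with the word's part-of-speech but never with the word itself, which is exactly the coverage gain the smoothing is after.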
Finally, NLD resolutions are ranked according to the product of the subcat frame and NLD path probabilities.
Table
3
shows
the
types
of
NLD
that
can
be
recovered
by
C04
and
by
the
algorithm
presented
in
Section
4.2
.
Table
3
shows
that
a
hybrid
methodology
is
required
to
resolve
all
types
of
NLDs
in
the
CTB
.
The
hybrid
method
involves
four
strategies
:
•
Applying
a
few
simple
heuristic
rules
to
insert
the
empty
PRED
for
coordinations
and
null
relative
pronouns
for
relative
constructions
.
The
former
is
done
by
comparing
the
part-of-speech
of
the
local
predicates
and
their
arguments
in
each
coordinate
;
and
the
latter
is
triggered
by
GF
ADJUNCT_REL
in
our
system
.
• Inserting an empty node with GF SUBJ for short-bei, control and raising constructions, and relating it to the upper-level SUBJ or OBJ accordingly.
•
Exploiting
the
C04
algorithm
to
resolve
the
wh-trace
in
relativisation
,
including
ungovernable
GFs
topic
and
adjunct
.
• Using
our
modified
algorithm
(
Section
4.2
)
to
resolve
the
remaining
types
,
viz
.
long-distance
dependencies
in
Chinese
.
As, apart from the maximum number of arguments in a subcat frame, there is no a priori limit on the number of dislocated arguments in a local f-structure, we rank resolutions by the product of the path probabilities of each of the m missing argument(s).
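This ranking over m missing arguments can be sketched as below; the path_probs mapping from (trace GF, path) pairs to P(p|t) is an assumed encoding:

```python
from math import prod

def score_resolution(missing_args, path_probs):
    """Score a candidate resolution by the product of the NLD path
    probabilities P(p|t) of each of the m missing arguments.
    missing_args is a list of (trace_gf, path) pairs."""
    return prod(path_probs[(t, p)] for t, p in missing_args)
```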
4.3
A
Hybrid
Fine-Grained
Strategy
As
described
in
Section
2
,
there
are
three
types
of
NLDs
in
the
CTB
,
and
their
different
linguistic
properties
may
require
fine-grained
recovery
strategies
.
Furthermore, as the NLD recovery method described in Section 4.2 is triggered by missing subcategorisable grammatical functions, a few cases of NLDs in which the trace is not an argument in the f-structure, e.g. an ADJUNCT or TOPIC in relative clauses or a null PRED in verbal coordination, cannot be recovered by that algorithm.

Table 3: Comparison of the ability to recover Chinese NLDs between C04 and our algorithm.
5 Experiments and Evaluation

We evaluate our NLD recovery algorithm in two settings: first, on gold-standard CTB trees stripped of empty nodes and coindexation; second, on the output trees of Bikel's parser (Bikel, 2004). The complete list of double-annotated files can be found in the documentation of CTB5.1.
The
evaluation
metric
adopted
by
most
previous
work
used
the
label
and
string
position
of
the
trace
and
its
antecedent
(
Johnson
,
2002
)
.
As
pointed
out
by
(
Campbell
,
2004
)
,
this
metric
is
insensitive
to
the
correct
attachment
of
the
EC
into
the
parse
tree
,
and
more
importantly
it
is
not
clear
whether
it
adequately
measures
performance
in
predicate-argument
structure
recovery
.
Therefore
,
we
use
a
predicate-argument
based
evaluation
method
instead
.
The NLD recovery is represented as a triple of the form REL(pred:loc, gf:loc), where REL is the relation between the dislocated gf and the pred.
In the evaluation of trace insertion, the gf is represented by the empty category, and in the evaluation of antecedent recovery, the gf is realised by the predicate of the antecedent, e.g. OBJ(use:3, money:1) in Figure 2(c). The antecedent and pred are both numbered with their string position in the input sentence. Precision, recall and f-score are calculated for the evaluation.
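The dependency-based evaluation then reduces to set comparison over such triples; a sketch, encoding each triple as a hashable tuple:

```python
def evaluate_triples(gold, test):
    """Precision, recall and f-score over sets of REL(pred:loc, gf:loc)
    triples."""
    gold, test = set(gold), set(test)
    correct = len(gold & test)
    precision = correct / len(test) if test else 0.0
    recall = correct / len(gold) if gold else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score
```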
5.1
CTB-Based
F-Structure
and
NLD
Resources
Acquisition
5.1.1
Automatically
Acquired
F-Structures
As
described
in
Section
4.1
,
we
automatically
generate
LFG
f-structures
from
the
CTB
trees
to
obtain
the
training
data
and
generate
f-structures
from
the
parser
output
trees
,
on
which
the
NLDs
will
be
recovered
.
To
evaluate
the
performance
of
the
automatic
f-structure
annotation
algorithm
,
we
randomly
selected
200
sentences
from
the
test
set
and
manually
annotated
the
f-structures
to
generate
a
gold
standard
.
The
evaluation
metric
is
the
same
as
for
NLD
recovery
in
terms
of
predicate-argument
relations
.
Table
4
reports
the
results
against
the
200-sentence
gold
standard
given
the
original
CTB
trees
and
trees
output
by
Bikel
's
parser
.
Dependencies / Precision / Recall / F-Score

Table 4: Evaluation of f-structure annotation.
From
the
automatically
generated
f-structure
training
data
,
we
extract
144,119
different
lexical
subcat
frames
and
178
paths
linking
traces
and
antecedents
for
NLD
recovery
.
Tables
5
&amp;
6
show
some
examples
of
the
automatically
extracted
subcat frames
and
NLD
paths
respectively
.
Word:POS-GF (Subcat Frames)

Table 5: Examples of subcat frames

Prob. — adjunct(up-adjunct:down-topic_rel); adjunct(NULL); obj(up-obj:down-topic_rel); subj(up-subj:down-topic_rel)

Table 6: Examples of NLD paths
The
basic
algorithm
described
in
Section
4.2
can
be
used
to
indiscriminately
resolve
almost
all
NLD
types
for
Chinese
including
locally
mediated
dependencies
with
few
exceptions
(traces with modifier GFs, which account for about 1.5% of all NLDs in CTB5.1).
Table
7
shows
the
results
of
the
basic
algorithm
for
trace
insertion
and
antecedent
recovery
on
both
stripped
CTB
trees
and
parser
output
trees
.
For
comparison
,
we
implemented
the
C04
algorithm
on
our
data
and
evaluated
the
result
.
Since the basic algorithm focuses on argument traces, results for arguments only are given separately.
Table
7
shows
that
the
C04
algorithm
achieves
a
high
precision
but
as
expected
a
low
recall
due
to
its
limitation
to
certain
types
of
NLDs
.
By
contrast
,
our
basic
algorithm
scored
higher
recall
but
lower
precision
,
which
is
understandable
as
the
C04
algorithm
identifies
the
trace
given
a
known
antecedent
,
whereas
our
algorithm
tries
to
identify
both
the
trace
and
antecedent
.
Table 7: Evaluation of trace insertion and antecedent recovery for the C04 algorithm, our basic algorithm, and the basic algorithm with the subject path constraint, on stripped CTB trees and parser output trees ('args_only' restricts evaluation to argument traces).

Table 8: Breakdown of trace insertion and antecedent recovery results on stripped CTB trees for the hybrid model by major grammatical functions.

Compared to trace insertion, the general results for antecedent identification are rather poor.
Examining
the
development
data
,
we
found
that
most
recovery
errors
were
due
to
wrongly
treating
missing
subjs
as
a
PRO
(
using
empty
NLD
paths
)
.
Since
the
subject
in
Chinese
has
a
very
strong
tendency
to
be
omitted
if
it
can
be
inferred
from
context
,
the
empty
NLD
path
(
without
any
antecedent
)
has
the
greatest
probability
in
all
resolution
paths
conditioned
on
subj
,
and
prevents
the
subj
from
finding
a
proper
antecedent
in
certain
cases
.
To
test
the
effect
of
the
empty
path
on
subj
,
we
weighted
non-empty
paths
for
subj
so
as
to
suppress
the
empty
path
.
After
testing
on
the
development
set
,
the
optimal
weight
was
found
to
be
1.9
.
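The subject path constraint amounts to multiplying the probabilities of non-empty SUBJ paths by the weight before ranking; a sketch, with None denoting the empty path (renormalisation is included here purely for illustration):

```python
SUBJ_PATH_WEIGHT = 1.9  # the optimal weight found on the development set

def apply_subject_path_constraint(subj_path_probs):
    """Boost non-empty paths conditioned on SUBJ so that the empty path
    (pro-drop, encoded as None) no longer dominates every resolution."""
    weighted = {p: pr * (SUBJ_PATH_WEIGHT if p is not None else 1.0)
                for p, pr in subj_path_probs.items()}
    z = sum(weighted.values())
    return {p: pr / z for p, pr in weighted.items()}
```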
The subject path constraint model shows a dramatic improvement of 12.9% and 8.1% in the overall result of antecedent recovery on CTB trees and parser output trees, respectively.
5.3
The
Hybrid
Fine-Grained
Model
As
proposed
in
Section
4.3
,
we
implemented
a
more
fine-grained
strategy
to
capture
specific
linguistic
properties
of
different
NLD
types
in
the
CTB
.
We also combine our basic algorithm (Section 4.2) with the C04 algorithm
in
order
to
resolve
the
modifier-function
traces
.
The
two
algorithms
may
conflict
due
to
(
i
)
inserting
the
same
trace
at
the
same
site
but
related
to
different
antecedents
or
(
ii
)
resolving
the
same
antecedent
to
different
traces
.
We
keep
the
traces
inserted
by
the
C04
algorithm
and
abandon
those
inserted
by
our
algorithm
in
case
of
conflict
,
as
the
results
in
Section
5.2
suggest
that
C04
has
a
higher
precision
than
ours
.
Table
8
reports
the
results
of
trace
insertion
and
antecedent
recovery
,
respectively
,
on
stripped
CTB
trees
,
broken
down
by
major
GFs
.
The fine-grained hybrid model allows us to recover NLDs whose traces bear modifier functions and, more importantly, it is sensitive to the particular linguistic properties of different NLD types.
As
the
hybrid
model
separates
the
locally
mediated
dependencies
from
other
long-distance
dependencies
,
it
increases
the
f-score
by
8.7
%
for
antecedent
recovery
compared
with
the
basic
model
.
Table
9
reports
the
results
of
the
hybrid
model
on
parser
output
trees
,
which
shows
an increase of 3.6% for antecedent recovery (compared with Table 7).
Table
9
:
Evaluation
of
hybrid
model
for
trace
insertion
and
antecedent
recovery
on
parser
output
trees
.
5.4
Better
Training
for
Parser
Output
Our
experiments
show
that
although
our
NLD
recovery
algorithm
performs
well
on
stripped
CTB
trees
,
it
is
sensitive
to
the
noise
in
parser
output
trees
,
with
a
performance
drop
of
about
30
%
.
This
is
in
contrast
to
English
data
,
on
which
(
Johnson
,
2002
)
reports
a
drop
of
7-9
%
moving
from
treebank
trees
to
parser
output
trees
.
No
doubt
this
is
partially
due
to
the
poor
performance
of
the
parser
on
Chinese
data
.
It
is
widely
accepted
that
parsing
Chinese
is
more
difficult
than
parsing
other
more
configurational
or
richer
morphological
languages
,
such
as
English.7
Our
NLD
recovery
algorithm
runs
on
automatically
generated
LFG
f-structures
.
The
f-structure
annotation
algorithm
is
highly
tailored
to
the
CTB
bracketing
scheme
(
using
configurational
,
categorial
and
functional
tag
information
)
,
and
suffers
considerably
from
errors
produced
by
the
parser
.
Table 4 shows that the performance of the f-structure annotation decreases sharply (by about 22%) for the parser output trees
and
this
contributes
to
the
eventual
trace
insertion
and
antecedent
recovery
performance
drop
.
Since
the
f-structures
automatically
generated
from
parser
output
trees
are
substantially
different
from
those
generated
from
the
original
CTB
trees
,
our
method
to
obtain
the
NLD
resolution
training
data
suffers
from
a
serious
drawback
:
the
training
data
come
from
perfect
CTB
trees
,
whereas
test
data
are
derived
from
imperfect
parser
output
trees
.
This is particularly problematic for machine-learning-based approaches such as ours: ideally, instances seen during training should be similar to unseen test data.
To
make
training
examples
more
similar
to
test
instances
,
we
reparse
the
training
set
to
obtain
better
training
data
.
To avoid parsing the training data with a parser trained on the same data,
we carried out 10-fold cross-training, dividing the training data into 10 parts and parsing each part in turn with the parser trained on the remaining 9 parts.

7(Bikel, 2004) reports an 89% f-score for English parsing of Penn-II treebank data and a 79% f-score for Chinese parsing on CTB version 3.
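The cross-training setup can be sketched as below; train_and_parse is a stand-in for training Bikel's parser on one portion and parsing another, since any real integration would call the parser externally:

```python
def cross_parse(sentences, train_and_parse, k=10):
    """Divide the training data into k parts and parse each part with a
    parser trained on the remaining k-1 parts, so the reparsed training
    data resemble real parser output rather than perfect treebank trees."""
    folds = [sentences[i::k] for i in range(k)]
    reparsed = []
    for i, held_out in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        reparsed.extend(train_and_parse(train, held_out))
    return reparsed
```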
The
reparsed
training
data
are
more
similar
to
the
test
data
than
the
original
perfect
CTB
trees
.
We
then
converted
both
the
reparsed
training
data
and
the
original
CTB
trees
into
f-structures
,
and
by
comparing
with
the
f-structures
generated
from
the
original
CTB
trees
,
we
recovered
the
empty
nodes
and
coindexation
on
the
f-structures
generated
from
the
reparsed
training
data
.
We
used
parser
output
based
f-structures
to
train
our
NLD
recovery
model
and
recovered
NLDs
for
parser
output
trees
from
the
test
data
.
Table
10
presents
the
results
for
trace
insertion
and
antecedent
recovery
on
parser
output
trees
using
the
improved
training
method
,
which
shows
a
clear
increase
in
precision
and
almost
the
same
recall
over
the
normal
training
(
Table
9
)
.
Insertion / Recovery

Table 10: Evaluation of the hybrid model for trace insertion and antecedent recovery on parser output trees with better training.
6
Conclusion
We
have
presented
an
algorithm
for
recovering
nonlocal
dependencies
for
Chinese
.
Our method revises and considerably extends the approach of (Cahill et al., 2004), originally designed for English, and, to the best of our knowledge, is the first NLD recovery algorithm for Chinese.
The evaluation shows that our algorithm considerably outperforms that of (Cahill et al., 2004) on Chinese data.
In
future
work
,
we will refine and extend the conditioning features in our models to discriminate subcat frames, and explore the possibility of using the Chinese Propbank and HowNet to supplement our automatically acquired subcat frames.
We will investigate ways of closing the gap between performance on gold-standard and parser output trees, including improving parsing results for Chinese.
We
also
plan
to
adapt
other
NLD
recovery
methods
(Jijkoun and Rijke, 2004; Schmid, 2006)
to
Chinese
and
compare
them
with
the
current
results
.
Acknowledgements
This
research
is
funded
by
Science
Foundation
Ireland
grant
04
/
LN
/
I527
.
