The
Conference
on
Computational
Natural
Language
Learning
features
a
shared
task
,
in
which
participants
train
and
test
their
learning
systems
on
the
same
data
sets
.
In
2007
,
as
in
2006
,
the
shared
task
has
been
devoted
to
dependency
parsing
,
this
year
with
both
a
multilingual
track
and
a
domain
adaptation
track
.
In
this
paper
,
we
define
the
tasks
of
the
different
tracks
and
describe
how
the
data
sets
were
created
from
existing
treebanks
for
ten
languages
.
In
addition
,
we
characterize
the
different
approaches
of
the
participating
systems
,
report
the
test
results
,
and
provide
a
first
analysis
of
these
results
.
1
Introduction
Previous
shared
tasks
of
the
Conference
on
Computational
Natural
Language
Learning
(
CoNLL
)
have
been
devoted
to
chunking
(
1999,2000
)
,
clause
identification
(
2001
)
,
named
entity
recognition
(
2002
,
2003
)
,
and
semantic
role
labeling
(
2004
,
2005
)
.
In
2006
the
shared
task
was
multilingual
dependency
parsing
,
where
participants
had
to
train
a
single
parser
on
data
from
thirteen
different
languages
,
which
enabled
a
comparison
not
only
of
parsing
and
learning
methods
,
but
also
of
the
performance
that
can
be
achieved
for
different
languages
(
Buchholz
and
Marsi
,
2006
)
.
In
dependency-based
syntactic
parsing
,
the
task
is
to
derive
a
syntactic
structure
for
an
input
sentence
by
identifying
the
syntactic
head
of
each
word
in
the
sentence
.
This
defines
a
dependency
graph
,
where
the
nodes
are
the
words
of
the
input
sentence
and
the
arcs
are
the
binary
relations
from
head
to
dependent
.
Often
,
but
not
always
,
it
is
assumed
that
all
words
except
one
have
a
syntactic
head
,
which
means
that
the
graph
will
be
a
tree
with
the
single
independent
word
as
the
root
.
In
labeled
dependency
parsing
,
we
additionally
require
the
parser
to
assign
a
specific
type
(
or
label
)
to
each
dependency
relation
holding
between
a
head
word
and
a
dependent
word
.
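To make these definitions concrete, a labeled dependency tree can be represented by giving each token its head index and relation label; the following sketch (with invented words and labels, not drawn from any of the treebanks used in the task) also checks the single-root tree property:

```python
# A toy labeled dependency tree: each entry is (FORM, HEAD, DEPREL),
# where HEAD is a 1-based token index and 0 marks the root.
sentence = [
    ("Economic", 2, "NMOD"),
    ("news",     3, "SBJ"),
    ("sent",     0, "ROOT"),
    ("shares",   3, "OBJ"),
    ("lower",    3, "ADV"),
]

def is_single_rooted_tree(sent):
    """Check that exactly one token has HEAD = 0 and that every token
    reaches the root without running into a cycle."""
    heads = [h for _, h, _ in sent]
    if heads.count(0) != 1:
        return False
    for i in range(1, len(sent) + 1):
        seen, j = set(), i
        while j != 0:
            if j in seen:            # cycle detected
                return False
            seen.add(j)
            j = heads[j - 1]
    return True

print(is_single_rooted_tree(sentence))  # True
```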
In
this
year
's
shared
task
,
we
continue
to
explore
data-driven
methods
for
multilingual
dependency
parsing
,
but
we
add
a
new
dimension
by
also
introducing
the
problem
of
domain
adaptation
.
This was done by having
two
separate
tracks
:
a
multilingual
track
using
essentially
the
same
setup
as
last
year
,
but
with
partly
different
languages
,
and
a
domain
adaptation
track
,
where
the
task
was
to
use
machine
learning
to
adapt
a
parser
for
a
single
language
to
a
new
domain
.
In
total
,
test
results
were
submitted
for
twenty-three
systems
in
the
multilingual
track
,
and
ten
systems
in
the
domain
adaptation
track
(
six
of
which
also
participated
in
the
multilingual
track
)
.
Not
everyone
submitted
papers
describing
their
system
,
and
some
papers
describe
more
than
one
system
(
or
the
same
system
in
both
tracks
)
,
which
explains
why
there
are
only
(
!
)
twenty-one
papers
in
the
proceedings
.
In
this
paper
,
we
provide
task
definitions
for
the
two
tracks
(
section
2
)
,
describe
data
sets
extracted
from
available
treebanks
(
section
3
)
,
report
results
for
all
systems
in
both
tracks
(
section
4
)
,
give
an
overview
of
approaches
used
(
section
5
)
,
provide
a
first
analysis
of
the
results
(
section
6
)
,
and
conclude
with
some
future
directions
(
section
7
)
.
2
Task
Definition
In
this
section
,
we
provide
the
task
definitions
that
were
used
in
the
two
tracks
of
the
CoNLL
2007
Shared
Task
,
the
multilingual
track
and
the
domain
adaptation
track
,
together
with
some
background
and
motivation
for
the
design
choices
made
.
First
of
all
,
we
give
a
brief
description
of
the
data
format
and
evaluation
metrics
,
which
were
common
to
the
two
tracks
.
2.1
Data
Format
and
Evaluation
Metrics
The
data
sets
derived
from
the
original
treebanks
(
section
3
)
were
in
the
same
column-based
format
as
for
the
2006
shared
task
(
Buchholz
and
Marsi
,
2006
)
.
In
this
format
,
sentences
are
separated
by
a
blank
line
;
a
sentence
consists
of
one
or
more
tokens
,
each
one
starting
on
a
new
line
;
and
a
token
consists
of
the
following
ten
fields
,
separated
by
a
single
tab
character
:
ID
:
Token
counter
,
starting
at
1
for
each
new
sentence
.
FORM
:
Word
form
or
punctuation
symbol
.
LEMMA
:
Lemma
or
stem
of
word
form
,
or
an
underscore
if
not
available
.
CPOSTAG
:
Coarse-grained
part-of-speech
tag
,
where
the
tagset
depends
on
the
language
.
POSTAG
:
Fine-grained
part-of-speech
tag
,
where
the
tagset
depends
on
the
language
,
or
identical
to
the
coarse-grained
part-of-speech
tag
if
not
available
.
FEATS
:
Unordered
set
of
syntactic
and
/
or
morphological
features
(
depending
on
the
particular
language
)
,
separated
by
a
vertical
bar
(
|
)
,
or
an
underscore
if
not
available
.
HEAD
:
Head
of
the
current
token
,
which
is
either
a
value
of
ID
or
zero
(
0
)
.
Note
that
,
depending
on
the
original
treebank
annotation
,
there
may
be
multiple
tokens
with
HEAD
=
0
.
DEPREL
:
Dependency
relation
to
the
HEAD
.
The
set
of
dependency
relations
depends
on
the
particular
language
.
Note
that
,
depending
on
the
original
treebank
annotation
,
the
dependency
relation
when
HEAD
=
0
may
be
meaningful
or
simply
ROOT
.
PHEAD
:
Projective
head
of
current
token
,
which
is
either
a
value
of
ID
or
zero
(
0
)
,
or
an
underscore
if
not
available
.
PDEPREL
:
Dependency
relation
to
the
PHEAD
,
or
an
underscore
if
not
available
.
The
PHEAD
and
PDEPREL
were
not
used
at
all
in
this
year
's
data
sets
(
i.e.
,
they
always
contained
underscores
)
but
were
maintained
for
compatibility
with
last
year
's
data
sets
.
This
means
that
,
in
practice
,
the
first
six
columns
can
be
considered
as
input
to
the
parser
,
while
the
HEAD
and
DEPREL
fields
are
the
output
to
be
produced
by
the
parser
.
Labeled
training
sets
contained
all
ten
columns
;
blind
test
sets
only
contained
the
first
six
columns
;
and
gold
standard
test
sets
(
released
only
after
the
end
of
the
test
period
)
again
contained
all
ten
columns
.
All
data
files
were
encoded
in
UTF-8
.
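As an illustration, the column format described above can be read with a few lines of code; this is a hypothetical sketch, not one of the official shared-task tools:

```python
FIELDS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
          "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL"]

def read_conll(lines):
    """Yield one sentence at a time as a list of dicts: tokens are
    tab-separated lines, sentences are separated by a blank line.
    Blind test files carry only the first six columns, so HEAD and
    DEPREL may be absent."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            if sentence:
                yield sentence
                sentence = []
            continue
        token = dict(zip(FIELDS, line.split("\t")))
        token["ID"] = int(token["ID"])
        if "HEAD" in token:
            token["HEAD"] = int(token["HEAD"])
        sentence.append(token)
    if sentence:                  # in case the file lacks a final blank line
        yield sentence

data = ("1\tJohn\t_\tN\tNN\t_\t2\tSBJ\t_\t_\n"
        "2\tsleeps\t_\tV\tVB\t_\t0\tROOT\t_\t_\n\n")
for sent in read_conll(data.splitlines(True)):
    print([(t["FORM"], t["HEAD"], t["DEPREL"]) for t in sent])
# [('John', 2, 'SBJ'), ('sleeps', 0, 'ROOT')]
```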
The
official
evaluation
metric
in
both
tracks
was
the
labeled
attachment
score
(
LAS
)
,
i.e.
,
the
percentage
of
tokens
for
which
a
system
has
predicted
the
correct
HEAD
and
DEPREL
,
but
results
were
also
reported
for
unlabeled
attachment
score
(
UAS
)
,
i.e.
,
the
percentage
of
tokens
with
correct
HEAD
,
and
the
label
accuracy
(
LA
)
,
i.e.
,
the
percentage
of
tokens
with
correct
DEPREL
.
One
important
difference
compared
to
the
2006
shared
task
is
that
all
tokens
were
counted
as
"
scoring
tokens
"
,
including
in
particular
all
punctuation
tokens
.
The
official
evaluation
script
,
eval07.pl
,
is
available
from
the
shared
task
website.1
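The three metrics can be computed directly from parallel gold and predicted (HEAD, DEPREL) pairs; a minimal sketch (not the official eval07.pl), in which every token is a scoring token:

```python
def evaluate(gold, pred):
    """Compute (LAS, UAS, LA) as percentages over parallel lists of
    (HEAD, DEPREL) pairs, counting all tokens including punctuation."""
    assert len(gold) == len(pred)
    las = uas = la = 0
    for (gh, gd), (ph, pd) in zip(gold, pred):
        if gh == ph:
            uas += 1            # correct head
            if gd == pd:
                las += 1        # correct head and label
        if gd == pd:
            la += 1             # correct label, head may be wrong
    n = len(gold)
    return 100.0 * las / n, 100.0 * uas / n, 100.0 * la / n

gold = [(2, "SBJ"), (0, "ROOT"), (2, "OBJ"), (3, "NMOD")]
pred = [(2, "SBJ"), (0, "ROOT"), (2, "ADV"), (2, "NMOD")]
print(evaluate(gold, pred))  # (50.0, 75.0, 75.0)
```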
2.2
Multilingual
Track
The
multilingual
track
of
the
shared
task
was
organized
in
the
same
way
as
the
2006
task
,
with
annotated
training
and
test
data
from
a
wide
range
of
languages
to
be
processed
with
one
and
the
same
parsing
system
.
This
system
must
therefore
be
able
to
learn
from
training
data
,
to
generalize
to
unseen
test
data
,
and
to
handle
multiple
languages
,
possibly
by
adjusting
a
number
of
hyper-parameters
.
Participants
in
the
multilingual
track
were
expected
to
submit
parsing
results
for
all
languages
involved
.
1http://depparse.uvt.nl/depparse-wiki/SoftwarePage
One
of
the
claimed
advantages
of
dependency
parsing
,
as
opposed
to
parsing
based
on
constituent
analysis
,
is
that
it
extends
naturally
to
languages
with
free
or
flexible
word
order
.
This
explains
the
interest
in
recent
years
in
multilingual
evaluation
of
dependency
parsers
.
Even
before
the
2006
shared
task
,
the
parsers
of
Collins
(
1997
)
and
Charniak
(
2000
)
,
originally
developed
for
English
,
had
been
adapted
for
dependency
parsing
of
Czech
,
and
the
parsing
methodology
proposed
by
Kudo
and
Matsumoto
(
2002
)
and
Yamada
and
Matsumoto
(
2003
)
had
been
evaluated
on
both
Japanese
and
English
.
The
parser
of
McDonald
and
Pereira
(
2006
)
had
been
applied
to
English
,
Czech
and
Danish
,
and
the
parser
of
Nivre
et
al.
(
2007
)
to
ten
different
languages
.
But
by
far
the
largest
evaluation
of
multilingual
dependency
parsing
systems
so
far
was
the
2006
shared
task
,
where
nineteen
systems
were
evaluated
on
data
from
thirteen
languages
(Buchholz and Marsi, 2006). One
of
the
conclusions
from
the
2006
shared
task
was
that
parsing
accuracy
differed
greatly
between
languages
and
that
a
deeper
analysis
of
the
factors
involved
in
this
variation
was
an
important
problem
for
future
research
.
In
order
to
provide
an
extended
empirical
foundation
for
such
research
,
we
tried
to
select
the
languages
and
data
sets
for
this
year
's
task
based
on
the
following
desiderata
:
•
The
selection
of
languages
should
be
typologically
varied
and
include
both
new
languages
and
old
languages
(
compared
to
2006
)
.
•
The
creation
of
the
data
sets
should
involve
as
little
conversion
as
possible
from
the
original
treebank
annotation
,
meaning
that
preference
should
be
given
to
treebanks
with
dependency
annotation
.
The
final
selection
included
data
from
Arabic
,
Basque
,
Catalan
,
Chinese
,
Czech
,
English
,
Greek
,
Hungarian
,
Italian
,
and
Turkish
.
The treebanks from which the data sets were extracted are described in section 3.

2The reason for having an upper bound on the training set size was that, in 2006, some participants could not train on all the data for some languages because of time limitations. Similar considerations also led to the decision to have a smaller number of languages this year (ten, as opposed to thirteen).
2.3
Domain
Adaptation
Track
One
well-known
characteristic
of
data-driven
parsing
systems
is
that
they
typically
perform
much
worse
on
data
that
does
not
come
from
the
training
domain
(
Gildea
,
2001
)
.
Due
to
the
large
overhead
in
annotating
text
with
deep
syntactic
parse
trees
,
the
need
to
adapt
parsers
from
domains
with
plentiful
resources
(
e.g.
,
news
)
to
domains
with
little
resources
is
an
important
problem
.
This
problem
is
commonly
referred
to
as
domain
adaptation
,
where
the
goal
is
to
adapt
annotated
resources
from
a
source
domain
to
a
target
domain
of
interest
.
Almost
all
prior
work
on
domain
adaptation
assumes
one
of
two
scenarios
.
In
the
first
scenario
,
there
are
limited
annotated
resources
available
in
the
target
domain
,
and
many
studies
have
shown
that
this
may
lead
to
substantial
improvements
.
This
includes
the
work
of
Roark
and
Bacchiani
(
2003
)
,
Florian
et
al.
(
2004
)
,
Chelba
and
Acero
(
2004
)
,
Daume
and
Marcu
(
2006
)
,
and
Titov
and
Henderson
(
2006
)
.
Of
these
,
Roark
and
Bacchiani
(
2003
)
and
Titov
and
Henderson
(
2006
)
deal
specifically
with
syntactic
parsing
.
The
second
scenario
assumes
that
there
are
no
annotated
resources
in
the
target
domain
.
This
is
a
more
realistic
situation
and
is
considerably
more
difficult
.
Recent
work
by
McClosky
et
al.
(
2006
)
and
Blitzer
et
al.
(
2006
)
has
shown
that
the
existence
of
a
large
unlabeled
corpus
in
the
new
domain
can
be
leveraged
in
adaptation
.
For
this
shared task, we assume the latter setting:
no
annotated
resources
in
the
target
domain
.
Obtaining
adequate
annotated
syntactic
resources
for
multiple
languages
is
already
a
challenging
problem
,
which
is
only
exacerbated
when
these
resources
must
be
drawn
from
multiple
and
diverse
domains
.
As
a
result
,
the
only
language
that
could
be
feasibly
tested
in
the
domain
adaptation
track
was
English
.
The
setup
for
the
domain
adaptation
track
was
as
follows
.
Participants
were
provided
with
a
large
annotated
corpus
from
the
source
domain
,
in
this
case
sentences
from
the
Wall
Street
Journal
.
Participants
were
also
provided
with
data
from
three
different
target
domains
:
biomedical
abstracts
(
development
data
)
,
chemical
abstracts
(
test
data
1
)
,
and
parent-child
dialogues
(
test
data
2
)
.
Additionally
,
a
large
unlabeled
corpus
for
each
data
set
(
training
,
development
,
test
)
was
provided
.
The
goal
of
the
task
was
to
use
the
annotated
source
data
,
plus
any
unlabeled
data
,
to
produce
a
parser
that
is
accurate
for
each
of
the
test
sets
from
the
target
domains.3
Participants
could
submit
systems
in
either
the
"
open
"
or
"
closed
"
class
(
or
both
)
.
The
closed
class
requires
a
system
to
use
only
those
resources
provided
as
part
of
the
shared
task
.
The
open
class
allows
a
system
to
use
additional
resources
provided
those
resources
are
not
drawn
from
the
same
domain
as
the
development
or
test
sets
.
An
example
might
be
a
part-of-speech
tagger
trained
on
the
entire
Penn
Treebank
and
not
just
the
subset
provided
as
training
data
,
or
a
parser
that
has
been
hand-crafted
or
trained
on
a
different
training
set
.
3
Treebanks
In
this
section
,
we
describe
the
treebanks
used
in
the
shared
task
and
give
relevant
information
about
the
data
sets
created
from
them
.
3.1
Multilingual
Track
Arabic
The
analytical
syntactic
annotation
of
the
Prague
Arabic
Dependency
Treebank
(
PADT
)
(
Hajic
et
al.
,
2004
)
can
be
considered
a
pure
dependency
annotation
.
The
conversion
,
done
by
Otakar
Smrz
,
from
the
original
format
to
the
column-based
format
described
in
section
2.1
was
therefore
relatively
straightforward
,
although
not
all
the
information
in
the
original
annotation
could
be
transferred
to
the
new
format
.
PADT
was
one
of
the
treebanks
used
in
the
2006
shared
task
but
then
only
contained
about
54,000
tokens
.
Since
then
,
the
size
of
the
treebank
has
more
than
doubled
,
with
around
112,000
tokens
.
In
addition
,
the
morphological
annotation
has
been
made
more
informative
.
It
is
also
worth
noting
that
the
parsing
units
in
this
treebank
are
in
many
cases
larger
than
conventional
sentences
,
which
partly
explains
the
high
average
number
of
tokens
per
"
sentence
"
(Buchholz and Marsi, 2006).

3Note that annotated development data for the target domain was only provided for the development domain, biomedical abstracts. For the two test domains, chemical abstracts and parent-child dialogues, the only annotated data sets were the gold standard test sets, released only after test runs had been submitted.
Basque
For
Basque
,
we
used
the
3LB
Basque
treebank
(
Aduriz
et
al.
,
2003
)
.
At
present
,
the
treebank
consists
of
approximately
3,700
sentences
,
334
of
which
were
used
as
test
data
.
The
treebank
comprises
literary
and
newspaper
texts
.
It
is
annotated
in
a
dependency
format
and
was
converted
to
the
CoNLL
format
by
a
team
led
by
Koldo
Gojenola
.
Catalan
The
Catalan
section
of
the
CESS-ECE
Syntactically
and
Semantically
Annotated
Corpora
(
Martí
et
al.
,
2007
)
is
annotated
with
,
among
other
things
,
constituent
structure
and
grammatical
functions
.
A
head
percolation
table
was
used
for
automatically
converting
the
constituent
trees
into
dependency
trees
.
The
original
data
only
contains
functions
related
to
the
verb
,
and
a
function
table
was
used
for
deriving
the
remaining
syntactic
functions
.
The
conversion
was
performed
by
a
team
led
by
Lluís Màrquez and Antònia Martí
.
Chinese
The
Chinese
data
are
taken
from
the
Sinica
treebank
(
Chen
et
al.
,
2003
)
,
which
contains
both
syntactic
functions
and
semantic
functions
.
The
syntactic
head
was
used
in
the
conversion
to
the
CoNLL
format
,
carried
out
by
Yu-Ming
Hsieh
and
the
organizers
of
the
2006
shared
task
,
and
the
syntactic
functions
were
used
wherever possible
.
The
training
data
used
is
basically
the
same
as
for
the
2006
shared
task
,
except
for
a
few
corrections
,
but
the
test
data
is
new
for
this
year
's
shared
task
.
It
is
worth
noting
that
the
parsing
units
in
this
treebank
are
sometimes
smaller
than
conventional
sentence
units
,
which
partly
explains
the
low
average
number
of
tokens
per
"
sentence
"
(
Buchholz
and
Marsi
,
2006
)
.
Czech
The
analytical
syntactic
annotation
of
the
Prague
Dependency
Treebank
(
PDT
)
(
Bohmova
et
al.
,
2003
)
is
a
pure
dependency
annotation
,
just
as
for
PADT
.
It
was
also
used
in
the
shared
task
2006
,
but
there
are
two
important
changes
compared
to
last
year
.
First
,
version
2.0
of
PDT
was
used
instead
of
version
1.0
,
and
a
conversion
script
was
created
by
Zdenek
Zabokrtsky
,
using
the
new
XML-based
format
of
PDT
2.0
.
Secondly
,
due
to
the
upper
bound
on
training
set
size
,
only
sections
1-3
of
PDT
constitute
the
training
data
,
which
amounts
to
some
450,000
tokens
.
The
test
data
is
a
small
subset
of
the
development
test
set
of
PDT
.
English
For
English
we
used
the
Wall
Street
Journal
section
of
the
Penn
Treebank
(
Marcus
et
al.
,
1993
)
.
In
particular
,
we
used
sections
2-11
for
training
and
a
subset
of
section
23
for
testing
.
As
a
preprocessing
stage
we
removed
many
function
tags
from
the
non-terminals
in
the
phrase
structure
representation
to
make
the
representations
more
uniform
with
out-of-domain
test
sets
for
the
domain
adaptation
track
(
see
section
3.2
)
.
The
resulting
data
set
was
then
converted
to
dependency
structures
using
the
procedure
described
in
Johansson
and
Nugues
(
2007a
)
.
This
work
was
done
by
Ryan
McDonald
.
Greek
The
Greek
Dependency
Treebank
(
GDT
)
(
Prokopidis
et
al.
,
2005
)
adopts
a
dependency
structure
annotation
very
similar
to
those
of
PDT
and
PADT
,
which
means
that
the
conversion
by
Prokopis
Prokopidis
was
relatively
straightforward
.
GDT
is
one
of
the
smallest
treebanks
in
this
year
's
shared
task
(
about
65,000
tokens
)
and
contains
sentences
of
Modern
Greek
.
Just
like
PDT
and
PADT
,
the
treebank
contains
more
than
one
level
of
annotation
,
but
we
only
used
the
analytical
level
of
GDT
.
Hungarian
For
the
Hungarian
data
,
the
Szeged
treebank
(
Csendes
et
al.
,
2005
)
was
used
.
The
treebank
is
based
on
texts
from
six
different
genres
,
ranging
from
legal
newspaper
texts
to
fiction
.
The
original
annotation
scheme
is
constituent-based
,
following
generative
principles
.
It
was
converted
into
dependencies
by
Zoltan
Alexin
based
on
heuristics
.
Italian
The
data
set
used
for
Italian
is
a
subset
of
the
balanced
section
of
the
Italian
Syntactic-Semantic
Treebank
(
ISST
)
(
Montemagni
et
al.
,
2003
)
and
consists
of
texts
from
the
newspaper
Corriere
della
Sera
and
from
periodicals
.
A
team
led
by
Giuseppe
Attardi
,
Simonetta
Montemagni
,
and
Maria
Simi
converted
the
annotation
to
the
CoNLL
format
,
using
information
from
two
different
annotation
levels
,
the
constituent
structure
level
and
the
dependency
structure
level
.
Turkish For Turkish, we used all the approximately 65,000 tokens of the original treebank for training.
The
rich
morphology
of
Turkish
requires
the
basic
tokens
in
parsing
to
be
inflectional
groups
(
IGs
)
rather
than
words
.
IGs
of
a
single
word
are
connected
to
each
other
deterministically
using
dependency
links
labeled
DERIV
,
referred
to
as
word-internal
dependencies
in
the
following
,
and
the
FORM
and
the
LEMMA
fields
may
be
empty
(
they
contain
underscore
characters
in
the
data
files
)
.
Sentences
do
not
necessarily
have
a
unique
root
;
most
internal
punctuation
and
a
few
foreign
words
also
have
HEAD
=
0
.
3.2
Domain
Adaptation
Track
As
mentioned
previously
,
the
source
data
is
drawn
from
a
corpus
of
news
,
specifically
the
Wall
Street
Journal
section
of
the
Penn
Treebank
(
Marcus
et
al.
,
1993
)
.
This
data
set
is
identical
to
the
English
training
set
from
the
multilingual
track
(
see
section
3.1
)
.
For
the
target
domains
we
used
three
different
labeled
data
sets
.
The
first
two
were
annotated
as
part
of
the
PennBioIE
project
(
Kulick
et
al.
,
2004
)
and
consist
of
sentences
drawn
from
either
biomedical
or
chemical
research
abstracts
.
Like
the
source
WSJ
corpus
,
this
data
is
annotated
using
the
Penn
Treebank
phrase
structure
scheme
.
To
convert
these
sets
to
dependency
structures
we
used
the
same
procedure
as
before
(
Johansson
and
Nugues
,
2007a
)
.
Additional
care
was
taken
to
remove
sentences
that
contained
non-WSJ
part-of-speech
tags
or
non-terminals
(
e.g.
,
HYPH
part-of-speech
tag
indicating
a
hyphen
)
.
Furthermore
,
the
annotation
scheme
for
gaps
and
traces
was
made
consistent
with
the
Penn
Treebank
wherever
possible
.
As
already
mentioned
,
the
biomedical
data
set
was
distributed
as
a
development
set
for
the
training
phase
,
while
the
chemical
data
set
was
only
used
for
final
testing
.
The
third
target
data
set
was
taken
from
the
CHILDES
database
(
MacWhinney
,
2000
)
,
in
particular
the
EVE
corpus
(
Brown
,
1973
)
,
which
has
been
annotated
with
dependency
structures
.
Unfortunately
the
dependency
labels
of
the
CHILDES
data
were
inconsistent
with
those
of
the
WSJ
,
biomedical
and
chemical
data
sets
,
and
we
therefore
opted
to
only
evaluate
unlabeled
accuracy
for
this
data
set
.
Furthermore
,
there
was
an
inconsistency
in
how
main
and
auxiliary
verbs
were
annotated
for
this
data
set
relative
to
others
.
As a result of this, submitting results for the CHILDES data was considered optional.

[Table 1: Characteristics of the data sets for the 10 languages of the multilingual track and the development set and the two test sets of the domain adaptation track.]
Like
the
chemical
data
set
,
this
data
set
was
only
used
for
final
testing
.
Finally
,
a
large
corpus
of
unlabeled
in-domain
data
was
provided
for
each
data
set
and
made
available
for
training
.
This
data
was
drawn
from
the
WSJ
,
PubMed.com
(
specific
to
biomedical
and
chemical
research
literature
)
,
and
the
CHILDES
data
base
.
The
data
was
tokenized
to
be
as
consistent
as
possible
with
the
WSJ
training
set
.
Table
1
describes
the
characteristics
of
the
data
sets
.
For
the
multilingual
track
,
we
provide
statistics
over
the
training
and
test
sets
;
for
the
domain
adaptation
track
,
the
statistics
were
extracted
from
the
development
set
.
Following
last
year
's
shared
task
practice
(
Buchholz
and
Marsi
,
2006
)
,
we
use
the
following
definition
of
projectivity
:
An
arc
(
i
,
j
)
is
projective
iff
all
nodes
occurring
between
i
and
j
are
dominated
by
i
(
where
dominates
is
the
transitive
closure
of
the
arc
relation
)
.
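This definition can be checked mechanically; a sketch where a tree is given as a 0-indexed list of 1-based head indices (0 for the root):

```python
def dominates(heads, i, k):
    """True iff node i dominates node k, i.e. i is reachable from k by
    repeatedly following head links (transitive closure of the arcs)."""
    while k != 0:
        k = heads[k - 1]      # heads is a 0-indexed list; token ids are 1-based
        if k == i:
            return True
    return False

def non_projective_arcs(heads):
    """Return every arc (head, dependent) for which some node between
    the endpoints is not dominated by the head."""
    bad = []
    for j, i in enumerate(heads, start=1):    # token j with head i
        if i == 0:
            continue
        lo, hi = sorted((i, j))
        if any(not dominates(heads, i, k) for k in range(lo + 1, hi)):
            bad.append((i, j))
    return bad

print(non_projective_arcs([2, 0, 2, 3]))  # []        (projective tree)
print(non_projective_arcs([2, 0, 4, 1]))  # [(1, 4)]  (a crossing arc)
```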
In
the
table
,
the
languages
are
abbreviated
to
their
first
two
letters
.
Language
families
are
:
Semitic
,
Isolate
,
Romance
,
Sino-Tibetan
,
Slavic
,
Germanic
,
Hellenic
,
Finno-Ugric
,
and
Turkic
.
The
type
of
the
original
annotation
is
either
constituents
plus
(
some
)
functions
(
c+f
)
or
dependencies
(
d
)
.
For
the
training
data
,
the
number
of
words
and
sentences
are
given
in
multiples
of
thousands
,
and
the
average
length
of
a
sentence
in
words
(
including
punctuation
tokens
)
.
The
following
rows
contain
information
about
whether
lemmas
are
available
,
the
number
of
coarse
-
and
fine-grained
part-of-speech
tags
,
the
number
of
feature
components
,
and
the
number
of
dependency
labels
.
Then
information
is
given
on
how
many
different
dependency
labels
can
co-occur
with
HEAD
=
0
,
the
percentage
of
HEAD
=
0
dependencies
,
and
the
percentage
of
heads
preceding
(
left
)
or
succeeding
(
right
)
a
token
(
giving
an
indication
of
whether
a
language
is
predominantly
head-initial
or
head-final
)
.
This
is
followed
by
the
average
number
of
HEAD
=
0
dependencies
per
sentence
and
the
percentage
of
non-projective
arcs
and
sentences
.
The
last
two
rows
show
whether
punctuation
tokens
are
attached
as
dependents
of
other
tokens
(
A
=
Always
,
S
=
Sometimes
)
and
specify
the
number
of
dependency
labels
that
exist
for
punctuation
tokens
.
Note
that
punctuation
is
defined
as
any
token
belonging
to
the
UTF-8
category
of
punctuation
.
This
means
,
for
example
,
that
any
token
having
an
underscore
in
the
FORM
field
(
which
happens
for
word-internal
IGs
in
Turkish
)
is
also
counted
as
punctuation
here
.
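This punctuation test corresponds to the Unicode general categories whose code starts with "P"; a sketch using Python's unicodedata module:

```python
import unicodedata

def is_punctuation(token):
    """True iff every character of the token belongs to a Unicode
    punctuation category (Pc, Pd, Ps, Pe, Pi, Pf, Po)."""
    return all(unicodedata.category(ch).startswith("P") for ch in token)

print(is_punctuation(","))    # True
print(is_punctuation("_"))    # True: underscore has category Pc, which is
                              # why Turkish word-internal IGs count here
print(is_punctuation("dog"))  # False
```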
For
the
test
sets
,
the
number
of
words
and
sentences
as
well
as
the
ratio
of
words
per
sentence
are
listed
,
followed
by
the
percentage
of
new
words
and
lemmas
(
if
applicable
)
.
For
the
domain
adaptation
sets
,
the
percentage
of
new
words
is
computed
with
regard
to
the
training
set
(
Penn
Treebank
)
.
4
Submissions
and
Results
As
already
stated
in
the
introduction
,
test
runs
were
submitted
for
twenty-three
systems
in
the
multilingual
track
,
and
ten
systems
in
the
domain
adaptation
track
(
six
of
which
also
participated
in
the
multilingual
track
)
.
In
the
result
tables
below
,
systems
are
identified
by
the
last
name
of
the
team
member
listed
first
when
test
runs
were
uploaded
for
evaluation
.
In
general
,
this
name
is
also
the
first
author
of
a
paper
describing
the
system
in
the
proceedings
,
but
there
are
a
few
exceptions
and
complications
.
First
of
all
,
for
four
out
of
twenty-seven
systems
,
no
paper
was
submitted
to
the
proceedings
.
This
is
the
case
for
the
systems
of
Jia
,
Maes
et
al.
,
Nash
,
and
Zeman
,
which
is
indicated
by
the
fact
that
these
names
appear
in
italics
in
all
result
tables
.
Secondly
,
two
teams
submitted
two
systems
each
,
which
are
described
in
a
single
paper
by
each
team
.
Thus
,
the
systems
called
"
Nilsson
"
and
"
Hall
,
J.
"
are
both
described
in
Hall
et
al.
(
2007a
)
,
while
the
systems
called
"
Duan
(
1
)
"
and
"
Duan
(
2
)
"
are
both
described
in
Duan
et
al.
(
2007
)
.
Finally, note that there are two teams where the first author's last name is Hall; we use "Hall, J." and "Hall, K." to disambiguate between the teams involving Johan Hall (Hall et al., 2007a) and Keith Hall (Hall et al., 2007b), respectively.
Tables
2
and
3
give
the
scores
for
the
multilingual
track
in
the
CoNLL
2007
shared
task
.
The
Average
column
contains
the
average
score
for
all
ten
languages
,
which
determines
the
ranking
in
this
track
.
Table
4
presents
the
results
for
the
domain
adaptation
track
,
where
the
ranking
is
determined
based
on
the
PCHEM
results
only
,
since
the
CHILDES
data
set
was
optional
.
Note also that there are no labeled attachment scores for the CHILDES data set, for reasons explained in section 3.2.

[Table 2: Labeled attachment score (LAS) for the multilingual track in the CoNLL 2007 shared task.]

[Table 3: Unlabeled attachment scores (UAS) for the multilingual track in the CoNLL 2007 shared task. A star next to a score in the Average column indicates a statistically significant difference with the next lower rank.]

[Table 4: Labeled (LAS) and unlabeled (UAS) attachment scores for the closed (-c) and open (-o) classes of the domain adaptation track in the CoNLL 2007 shared task. Teams are denoted by the last name of their first member, with italics indicating that there is no corresponding paper in the proceedings. A star next to a score in the PCHEM columns indicates a statistically significant difference with the next lower rank.]
The
number
in
parentheses
next
to
each
score
gives
the
rank
.
A
star
next
to
a
score
indicates
that
the
difference
with
the
next
lower
rank
is
significant
at
the
5
%
level
using
a
z-test
for
proportions
.
A
more
complete
presentation
of
the
results
,
including
the
significance
results
for
all
the
tasks
and
their
p-values
,
can
be
found
on
the
shared
task
website.4
Looking
first
at
the
results
in
the
multilingual
track
,
we
note
that
there
are
a
number
of
systems
performing
at
almost
the
same
level
at
the
top
of
the
ranking
.
For
the
average
labeled
attachment
score
,
the
difference
between
the
top
score
(
Nilsson
)
and
the
fifth
score
(
Hall
,
J.
)
is
no
more
than
half
a
percentage
point
,
and
there
are
generally
very
few
significant
differences
among
the
five
or
six
best
systems
,
regardless
of
whether
we
consider
labeled
or
unlabeled
attachment
score
.
For
the
closed
class
of
the
domain
adaptation
track
,
we
see
a
very
similar
pattern
,
with
the
top
system
(
Sagae
)
being
followed
very
closely
by
two
other
systems
.
For
the
open
class
,
the
results
are
more
spread
out
,
but
then
there
are
very
few
results
in
this
class
.
It
is
also
worth
noting
that
the
top
scores
in
the
closed
class
,
somewhat
unexpectedly
,
are
higher
than
the
top
scores
in
the open class.

4http://nextens.uvt.nl/depparse-wiki/AllScores
But
before
we
proceed
to
a
more
detailed
analysis
of
the
results
(
section
6
)
,
we
will
make
an
attempt
to
characterize
the
approaches
represented
by
the
different
systems
.
5
Approaches
In
this
section
we
give
an
overview
of
the
models
,
inference
methods
,
and
learning
methods
used
in
the
participating
systems
.
For
obvious
reasons
the
discussion
is
limited
to
systems
that
are
described
by
a
paper
in
the
proceedings
.
But
instead
of
describing
the
systems
one
by
one
,
we
focus
on
the
basic
methodological
building
blocks
that
are
often
found
in
several
systems
although
in
different
combinations
.
For
descriptions
of
the
individual
systems
,
we
refer
to
the
respective
papers
in
the
proceedings
.
Section
5.1
is
devoted
to
system
architectures
.
We
then
describe
the
two
main
paradigms
for
learning
and
inference
,
in
this
year
's
shared
task
as
well
as
in
last
year
's
,
which
we
call
transition-based
parsers
(
section
5.2
)
and
graph-based
parsers
(
section
5.3
)
,
adopting
the
terminology
of
McDonald
and
Nivre
(
2007
)
.5
Finally
,
we
give
an
overview
of
the
domain
adaptation
methods
that
were
used
(
section
5.4
)
.
5This
distinction
roughly
corresponds
to
the
distinction
made
by
Buchholz
and
Marsi
(
2006
)
between
"
stepwise
"
and
"
all-pairs
"
approaches
.
5.1
Architectures
Most
systems
perform
some
amount
of
pre
-
and
post-processing
,
making
the
actual
parsing
component
part
of
a
sequential
workflow
of
varying
length
and
complexity
.
For
example
,
most
transition-based
parsers
can
only
build
projective
dependency
graphs
.
For
languages
with
non-projective
dependencies
,
graphs
therefore
need
to
be
projectivized
for
training
and
deprojectivized
for
testing
(
Hall
et
al.
,
2007a
;
Johansson
and
Nugues
,
2007b
;
Titov
and
Henderson
,
2007
)
.
Instead
of
assigning
HEAD
and
DEPREL
in
a
single
step
,
some
systems
use
a
two-stage
approach
for
attaching
and
labeling
dependencies
(
Chen
et
al.
,
2007
;
Dredze
et
al.
,
2007
)
.
In
the
first
step
unlabeled
dependencies
are
generated
,
in
the
second
step
these
are
labeled
.
This
is
particularly
helpful
for
factored
parsing
models
,
in
which
label
decisions
cannot
be
easily
conditioned
on
larger
parts
of
the
structure
due
to
the
increased
complexity
of
inference
.
One
system
(
Hall
et
al.
,
2007b
)
extends
this
two-stage
approach
to
a
three-stage
architecture
where
the
parser
and
labeler
generate
an
n-best
list
of
parses
which
in
turn
is
reranked.6
In
ensemble-based
systems
several
base
parsers
provide
parsing
decisions
,
which
are
added
together
for
a
combined
score
for
each
potential
dependency
arc
.
The
tree
that
maximizes
the
sum
of
these
combined
scores
is
taken
as
the
final
output
parse
.
This
technique
is
used
by
Sagae
and
Tsujii
(
2007
)
and
in
the
Nilsson
system
(
Hall
et
al.
,
2007a
)
.
It
is
worth
noting
that
both
these
systems
combine
transition-based
base
parsers
with
a
graph-based
method
for
parser
combination
,
as
first
described
by
Sagae
and
Lavie
(
2006
)
.
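As a simplified illustration of such arc-score combination, the sketch below lets each base parser vote for head attachments; note that the systems cited above search for the best full tree (e.g., with a maximum spanning tree algorithm) rather than choosing each head independently:

```python
from collections import Counter

def combine(parses):
    """Each base parse is a list of head indices, one per token. Every
    parser contributes a vote of 1 to each of its (token, head) arcs;
    each token then keeps its highest-scoring head. Unlike a maximum
    spanning tree search over the summed scores, this per-token choice
    is not guaranteed to yield a well-formed tree."""
    votes = [Counter() for _ in parses[0]]
    for heads in parses:
        for tok, head in enumerate(heads):
            votes[tok][head] += 1
    return [v.most_common(1)[0][0] for v in votes]

# Three hypothetical base parses for a four-token sentence:
print(combine([[2, 0, 2, 3],
               [2, 0, 4, 3],
               [2, 0, 2, 1]]))  # [2, 0, 2, 3]
```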
Data-driven
grammar-based
parsers
,
such
as
those of Bick (2007) and Watson and Briscoe (2007)
,
need
pre
-
and
post-processing
in
order
to
map
the
dependency
graphs
provided
as
training
data
to
a
format
compatible
with
the
grammar
used
,
and
vice
versa
.
5.2
Transition-Based
Parsers
Transition-based
parsers
build
dependency
graphs
by
performing
sequences
of
actions
,
or
transitions
.
Both
learning
and
inference
are conceptualized in terms of predicting the correct transition based on the current parser state and/or history.

6They also flip the order of the labeler and the reranker.
We
can
further
subclassify
parsers
with
respect
to
the
model
(
or
transition
system
)
they
adopt
,
the
inference
method
they
use
,
and
the
learning
method
they
employ
.
The
most
common
model
for
transition-based
parsers
is
one
inspired
by
shift-reduce
parsing
,
where
a
parser
state
contains
a
stack
of
partially
processed
tokens
and
a
queue
of
remaining
input
tokens
,
and
where
transitions
add
dependency
arcs
and
perform
stack
and
queue
operations
.
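The shift-reduce model described above can be made concrete with a small sketch. This follows the common arc-standard variant (stack plus input buffer; LEFT-ARC attaches the second-topmost stack token under the topmost, RIGHT-ARC the reverse); the transition names and the example sequence are illustrative, not taken from any particular system.

```python
# Minimal sketch of a shift-reduce (arc-standard) transition system:
# a stack of partially processed tokens, a queue (buffer) of remaining
# input tokens, and transitions that shift tokens or add dependency arcs.

def run_transitions(words, transitions):
    stack, buffer, arcs = [], list(range(len(words))), []
    for t in transitions:
        if t == "SHIFT":
            stack.append(buffer.pop(0))
        elif t == "LEFT-ARC":              # second-top becomes dependent of top
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))  # (head, dependent)
        elif t == "RIGHT-ARC":             # top becomes dependent of second-top
            dep = stack.pop()
            arcs.append((stack[-1], dep))
    return sorted(arcs)

words = ["the", "cat", "sleeps"]
seq = ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "LEFT-ARC"]
print(run_transitions(words, seq))  # → [(1, 0), (2, 1)]
```

In a trained parser, the transition sequence is not given; a classifier predicts the next transition from features of the current stack and buffer.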
This
type
of
model
is
used
by
the
majority
of
transition-based
parsers (Hall et
al.
,
2007a
;
Johansson
and
Nugues
,
2007b
;
Mannem
,
2007
;
Titov
and
Henderson
,
2007
;
Wu
et
al.
,
2007
)
.
Sometimes
it
is
combined
with
an
explicit
probability
model
for
transition
sequences
,
which
may
be
conditional
(
Duan
et
al.
,
2007
)
or
generative
(
Titov
and
Henderson
,
2007
)
.
An
alternative
model
is
based
on
the
list-based
parsing
algorithm
described
by
Covington
(
2001
)
,
which
iterates
over
the
input
tokens
in
a
sequential
manner
and
evaluates
for
each
preceding
token
whether
it
can
be
linked
to
the
current
token
or
not
.
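The list-based iteration just described can be sketched as follows. The link predicate is a hand-written stand-in (real systems consult a trained classifier), and the attachment policy shown (nearest candidate first, one head per token) is one simplified reading of Covington's algorithm, not a full implementation.

```python
# Sketch of Covington-style list-based parsing: iterate left to right over
# the tokens and, for each new token, test every preceding token as a
# potential head or dependent.

def covington_parse(tokens, can_link):
    heads = {}  # dependent index -> head index
    for j in range(len(tokens)):
        for i in range(j - 1, -1, -1):        # preceding tokens, nearest first
            if j not in heads and can_link(tokens, i, j):
                heads[j] = i                  # attach j under i
            elif i not in heads and can_link(tokens, j, i):
                heads[i] = j                  # attach i under j
    return heads

# Toy predicate: nouns attach to verbs.
def toy_link(tokens, head, dep):
    return tokens[head][1] == "VERB" and tokens[dep][1] == "NOUN"

tokens = [("cats", "NOUN"), ("chase", "VERB"), ("mice", "NOUN")]
print(covington_parse(tokens, toy_link))  # → {0: 1, 2: 1}
```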
This
model
is
used
by
Marinov
(
2007
)
and
in
component
parsers
of
the
Nilsson
ensemble
system
(
Hall
et
al.
,
2007a
)
.
Finally
,
two
systems
use
models
based
on
LR
parsing
(
Sagae
and
Tsujii
,
2007
;
Watson
and
Briscoe
,
2007
)
.
The
most
common
inference
technique
in
transition-based
dependency
parsing
is
greedy
deterministic
search
,
guided
by
a
classifier
for
predicting
the
next
transition
given
the
current
parser
state
and
history
,
processing
the
tokens
of
the
sentence
in
sequential
left-to-right
order
(
Hall
et
al.
,
2007a
;
Mannem, 2007). In one system, multiple
passes
over
the
input
are
conducted
until
no
tokens
are
left
unattached
(
Attardi
et
al.
,
2007
)
.
As
an
alternative
to
deterministic
parsing
,
several
parsers
use
probabilistic
models
and
maintain
a
heap
or
beam
of
partial
transition
sequences
in
order
to
pick
the
most
probable
one
at
the
end
of
the
sentence (Duan et al., 2007; Johansson and Nugues, 2007b; Sagae and Tsujii, 2007; Titov and Henderson, 2007). (Footnote 7: For diversity in parser ensembles, right-to-left parsers are also used.)
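Maintaining a beam of partial transition sequences can be sketched as below. The per-step action distributions are a toy stand-in for a classifier conditioned on the parser state; log-probabilities are summed, which corresponds to multiplying classifier probabilities over the sequence.

```python
import math

# Sketch of beam search over transition sequences: each hypothesis keeps its
# partial action sequence and accumulated log-probability; at every step the
# beam is pruned to the k most probable hypotheses.

def beam_search(step_probs, beam_size=2):
    beam = [([], 0.0)]  # (action sequence, log-probability)
    for dist in step_probs:           # one toy distribution per time step
        expanded = [
            (seq + [action], logp + math.log(p))
            for seq, logp in beam
            for action, p in dist.items()
        ]
        expanded.sort(key=lambda h: h[1], reverse=True)
        beam = expanded[:beam_size]   # prune to the k best
    return beam[0][0]                 # most probable complete sequence

steps = [{"SHIFT": 0.7, "ARC": 0.3}, {"SHIFT": 0.4, "ARC": 0.6}]
print(beam_search(steps))  # → ['SHIFT', 'ARC']
```

Note that greedy search (beam size 1) would commit to SHIFT, SHIFT here; the beam recovers the globally more probable SHIFT, ARC sequence.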
Transition-based
parsers
either
maintain
a
classifier
that
predicts
the
next
transition
or
a
global
probabilistic
model
that
scores
a
complete
parse
.
To
train
these
classifiers
and
probabilistic
models
several
approaches
were
used
:
SVMs
(
Duan
et
al.
,
2007
;
Hall
et
al.
,
2007a
;
Sagae
and
Tsujii
,
2007
)
,
modified
finite
Newton
SVMs
(
Wu
et
al.
,
2007
)
,
maximum
entropy
models
(
Sagae
and
Tsujii
,
2007
)
,
multiclass
averaged
perceptron
(
Attardi
et
al.
,
2007
)
and
maximum
likelihood
estimation
(
Watson
and
Briscoe
,
2007
)
.
In
order
to
calculate
a
global
score
or
probability
for
a
transition
sequence
,
two
systems
used
a
Markov
chain
approach
(
Duan
et
al.
,
2007
;
Sagae
and
Tsujii
,
2007
)
.
Here
probabilities
from
the
output
of
a
classifier
are
multiplied
over
the
whole
sequence
of
actions
.
This
results
in
a
locally
normalized
model
.
Two
other
entries
used
MIRA
(
Mannem
,
2007
)
or
online
passive-aggressive
learning
(
Johansson
and
Nugues
,
2007b
)
to
train
a
globally
normalized
model
.
Titov
and
Henderson
(
2007
)
used
an
incremental
sigmoid
Bayesian
network
to
model
the
probability
of
a
transition
sequence
and
estimated
model
parameters
using
neural
network
learning
.
5.3 Graph-Based Parsers
While
transition-based
parsers
use
training
data
to
learn
a
process
for
deriving
dependency
graphs
,
graph-based
parsers
learn
a
model
of
what
it
means
to
be
a
good
dependency
graph
given
an
input
sentence
.
They
define
a
scoring
or
probability
function
over
the
set
of
possible
parses
.
At
learning
time
they
estimate
parameters
of
this
function
;
at
parsing
time
they
search
for
the
graph
that
maximizes
this
function
.
These
parsers
mainly
differ
in
the
type
and
structure
of
the
scoring
function
(
model
)
,
the
search
algorithm
that
finds
the
best
parse
(
inference
)
,
and
the
method
to
estimate
the
function
's
parameters
(
learning
)
.
The
simplest
type
of
model
is
based
on
a
sum
of
local
attachment
scores
,
which
themselves
are
calculated
based
on
the
dot
product
of
a
weight
vector
and
a
feature
representation
of
the
attachment
.
This
type
of
scoring
function
is
often
referred
to
as
a
first-order
model.
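A first-order score of this kind can be written down in a few lines. The features and weights below are illustrative, not from any cited system; the point is only that the tree score decomposes into a sum of per-arc dot products.

```python
# Sketch of a first-order (arc-factored) scoring function: the score of a
# tree is the sum of per-arc scores, and each arc score is the dot product
# of a weight vector with a sparse feature representation of the arc.

def arc_features(words, head, dep):
    return {f"hw={words[head]}|dw={words[dep]}": 1.0,
            f"dist={head - dep}": 1.0}

def arc_score(weights, words, head, dep):
    feats = arc_features(words, head, dep)
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())

def tree_score(weights, words, heads):
    # heads maps each word index to its head index; index 0 is the root symbol
    return sum(arc_score(weights, words, h, d) for d, h in heads.items())

words = ["<root>", "cats", "sleep"]
weights = {"hw=sleep|dw=cats": 2.0, "hw=<root>|dw=sleep": 1.5, "dist=1": 0.5}
print(tree_score(weights, words, {1: 2, 2: 0}))  # → 4.0
```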
Several
systems
participating
in
this
year
's
shared
task
used
first-order
models
(
Schiehlen
and
Spranger
,
2007
;
Nguyen
et
al.
,
2007
;
Shimizu and Nakagawa, 2007). Canisius and Tjong Kim Sang (2007)
cast
the
same
type
of
arc-based
factorization
as
a
weighted
constraint
satisfaction
problem
.
Carreras
(
2007
)
extends
the
first-order
model
to
incorporate
a
sum
over
scores
for
pairs
of
adjacent
arcs
in
the
tree
,
yielding
a
second-order
model
.
In
contrast
to
previous
work
where
this
was
constrained
to
sibling
relations
of
the
dependent
(
McDonald
and
Pereira
,
2006
)
,
here
head-grandchild
relations
can
be
taken
into
account
.
In
all
of
the
above
cases
the
scoring
function
is
decomposed
into
functions
that
score
local
properties
(
arcs
,
pairs
of
adjacent
arcs
)
of
the
graph
.
By
contrast
,
the
model
of
Nakagawa
(
2007
)
considers
global
properties
of
the
graph
that
can
take
multiple
arcs
into
account
,
such
as
multiple
siblings
and
children
of
a
node
.
Searching
for
the
highest
scoring
graph
(
usually
a
tree
)
in
a
model
depends
on
the
factorization
chosen
and
whether
we
are
looking
for
projective
or
non-projective
trees
.
Maximum
spanning
tree
algorithms
can
be
used
for
finding
the
highest
scoring
non-projective
tree
in
a
first-order
model
(
Hall
et
al.
,
2007b
;
Nguyen
et
al.
,
2007
;
Canisius
and
Tjong
Kim
Sang
,
2007
;
Shimizu
and
Nakagawa
,
2007
)
,
while
Eisner
's
dynamic
programming
algorithm
solves
the
problem
for
a
first-order
factorization
in
the
projective
case
(
Schiehlen
and
Spranger
,
2007
)
.
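The search problem for a first-order model can be demonstrated with brute force on a tiny sentence: enumerate all head assignments, keep those that form a tree rooted in the artificial node 0, and return the best by summed arc scores. This is a sketch for illustration only; the toy score function is made up, and real systems use Chu-Liu/Edmonds (non-projective) or Eisner's dynamic program (projective) instead of enumeration.

```python
from itertools import product

# Exhaustive search for the highest-scoring dependency tree under a
# first-order (arc-factored) model, feasible only for toy sentence lengths.

def is_tree(heads):
    # every token must reach the artificial root 0 without revisiting a node
    for node in heads:
        seen, cur = set(), node
        while cur != 0:
            if cur in seen:
                return False
            seen.add(cur)
            cur = heads[cur]
    return True

def best_tree(score, n):
    """n real tokens (indices 1..n); score(head, dep) is the arc score."""
    best = None
    for assignment in product(range(n + 1), repeat=n):
        heads = {dep: head for dep, head in enumerate(assignment, start=1)}
        if any(h == d for d, h in heads.items()) or not is_tree(heads):
            continue
        total = sum(score(h, d) for d, h in heads.items())
        if best is None or total > best[0]:
            best = (total, heads)
    return best

toy_score = lambda h, d: {(0, 2): 3.0, (2, 1): 2.0, (2, 3): 1.5}.get((h, d), 0.0)
print(best_tree(toy_score, 3))  # → (6.5, {1: 2, 2: 0, 3: 2})
```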
Carreras
(
2007
)
employs
his
own
extension
of
Eisner
's
algorithm
for
the
case
of
projective
trees
and
second-order
models
that
include
head-grandparent
relations
.
(Footnote 8: A first-order model is also known as an edge-factored model.)
The
methods
presented
above
are
mostly
efficient
and
always
exact
.
However
,
for
models
that
take
global
properties
of
the
tree
into
account
,
they
cannot
be
applied
.
Instead
Nakagawa
(
2007
)
uses
Gibbs
sampling
to
obtain
marginal
probabilities
of
arcs
being
included
in
the
tree
using
his
global
model
and
then
applies
a
maximum
spanning
tree
algorithm
to
maximize
the
sum
of
the
logs
of
these
marginals
and
return
a
valid
cycle-free
parse
.
Most
of
the
graph-based
parsers
were
trained
using
an
online
inference-based
method
such
as
passive-aggressive
learning
(
Nguyen
et
al.
,
2007
;
Schiehlen
and
Spranger
,
2007
)
,
averaged
perceptron
(
Carreras, 2007), while
some
systems
instead
used
methods
based
on
maximum
conditional
likelihood
(
Nakagawa
,
2007
;
Hall
et
al.
,
2007b
)
.
5.4
Domain
Adaptation
One
way
of
adapting
a
learner
to
a
new
domain
without
using
any
unlabeled
data
is
to
only
include
features
that
are
expected
to
transfer
well
(
Dredze
et
al.
,
2007
)
.
In
structural
correspondence
learning
a
transformation
from
features
in
the
source
domain
to
features
of
the
target
domain
is
learnt
(
Shimizu
and
Nakagawa
,
2007
)
.
The
original
source
features
along
with
their
transformed
versions
are
then
used
to
train
a
discriminative
parser
.
Dredze
et
al.
(
2007
)
trained
a
diverse
set
of
parsers
in
order
to
improve
cross-domain
performance
by
incorporating
their
predictions
as
features
for
another
classifier
.
Similarly
,
two
parsers
trained
with
different
learners
and
search
directions
were
used
in
the
co-learning
approach
of
Sagae
and
Tsujii
(
2007
)
.
Unlabeled
target
data
was
processed
with
both
parsers
.
Sentences
that
both
parsers
agreed
on
were
then
added
to
the
original
training
data
.
This
combined
data
set
served
as
training
data
for
one
of
the
original
parsers
to
produce
the
final
system
.
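The agreement-based selection step described above can be sketched as follows. The parser outputs are stand-in head sequences, and the exact-match agreement criterion is one simple reading of the approach; real systems parse full target-domain corpora with two independently trained parsers.

```python
# Sketch of agreement-based data selection for domain adaptation: parse
# unlabeled target-domain sentences with two different parsers and keep
# only the sentences on which their outputs agree exactly.

def select_agreed(sentences, parse_a, parse_b):
    agreed = []
    for sent in sentences:
        analysis_a, analysis_b = parse_a(sent), parse_b(sent)
        if analysis_a == analysis_b:
            agreed.append((sent, analysis_a))  # add to extra training data
    return agreed

# Hypothetical parser outputs keyed by sentence (head index sequences).
out_a = {"a cat sleeps": [2, 3, 0], "cats chase mice": [2, 0, 2]}
out_b = {"a cat sleeps": [2, 3, 0], "cats chase mice": [2, 0, 1]}

extra = select_agreed(list(out_a), out_a.get, out_b.get)
print(extra)  # → [('a cat sleeps', [2, 3, 0])]
```

The selected sentences are then appended to the original training data, and one of the parsers is retrained on the combined set.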
In
a
similar
fashion
,
Watson
and
Briscoe
(
2007
)
used
a
variant
of
self-training
to
make
use
of
the
unlabeled
target
data
.
Attardi
et
al.
(
2007
)
learnt
tree
revision
rules
for
the
target
domain
by
first
parsing
unlabeled
target
data
using
a
strong
parser
;
this
data
was
then
combined
with
labeled
source
data
;
a
weak
parser
was
applied
to
this
new
dataset
;
finally
tree
correction
rules
are
collected
based
on
the
mistakes
of
the
weak
parser
with
respect
to
the
gold
data
and
the
output
of
the
strong
parser
.
Another
technique
used
was
to
filter
sentences
of
the
out-of-domain
corpus
based
on
their
similarity
to
the
target
domain
,
as
predicted
by
a
classifier
(
Dredze
et
al.
,
2007
)
.
Only
if
a
sentence
was
judged
similar
to
target
domain
sentences
was
it
included
in
the
training
set
.
Bick
(
2007
)
used
a
hybrid
approach
,
where
a
data-driven
parser
trained
on
the
labeled
training
data
was
given
access
to
the
output
of
a
Constraint
Grammar
parser
for
English
run
on
the
same
data
.
Finally
,
Schneider
et
al.
(
2007
)
learnt
collocations
and
relational
nouns
from
the
unlabeled
target
data
and
used
these
in
their
parsing
algorithm
.
6
Analysis
Having
discussed
the
major
approaches
taken
in
the
two
tracks
of
the
shared
task
,
we
will
now
return
to
the
test
results
.
For
the
multilingual
track
,
we
compare
results
across
data
sets
and
across
systems
,
and
report
results
from
a
parser
combination
experiment
involving
all
the
participating
systems
(
section
6.1
)
.
For
the
domain
adaptation
track
,
we
sum
up
the
most
important
findings
from
the
test
results
(
section
6.2
)
.
The
average
LAS
over
all
systems
varies
from
68.07
for
Basque
to
80.95
for
English
.
Top
scores
vary
from
76.31
for
Greek
to
89.61
for
English
.
In
general
,
there
is
a
good
correlation
between
the
top
scores
and
the
average
scores
.
For Greek, Italian, and Turkish, the distance between the top score and the average score is smaller than the average such distance across languages, while for Czech, it is larger.
The
languages
that
produced
the
most
stable
results
in
terms
of
system
ranks
with
respect
to
LAS
are
Hungarian
and
Italian
.
For
UAS
,
Catalan
also
falls
into
this
group
.
The language that produced the most unstable results with respect to LAS is Turkish.

Table 5: A comparison of the LAS top scores from 2006 and 2007. Official scoring conditions in boldface. For Turkish, scores with punctuation also include word-internal dependencies.
In
comparison
to
last
year
's
languages
,
the
languages
involved
in
the
multilingual
track
this
year
can
be
more
easily
separated
into
three
classes
with
respect
to
top
scores
:
Catalan, Chinese, English, and Italian in the first class; Czech, Hungarian, and Turkish in the second; and Arabic, Basque, and Greek in the third.
It
is
interesting
to
see
that
the
classes
are
more
easily
definable
via
language
characteristics
than
via
characteristics
of
the
data
sets
.
The
split
goes
across
training
set
size
,
original
data
format
(
constituent
vs.
dependency
)
,
sentence
length
,
percentage
of
unknown
words
,
number
of
dependency
labels
,
and
ratio
of
(
C
)
POSTAGS
and
dependency
labels
.
The
class
with
the
highest
top
scores
contains
languages
with
a
rather
impoverished
morphology
.
Medium
scores
are
reached
by
the
two
agglutinative
languages
,
Hungarian
and
Turkish
,
as
well
as
by
Czech
.
The
most
difficult
languages
are
those
that
combine
a
relatively
free
word
order
with
a
high
degree
of
inflection
.
Based
on
these
characteristics
,
one
would
expect
to
find
Czech
in
the
last
class
.
However
,
the
Czech
training
set
is
four
times
the
size
of
the
training
set
for
Arabic
,
which
is
the
language
with
the
largest
training
set
of
the
difficult
languages
.
However
,
it
would
be
wrong
to
assume
that
training
set
size
alone
is
the
deciding
factor
.
A
closer
look
at
table
1
shows
that
while
Basque
and
Greek
in
fact
have
small
training
data
sets
,
so
do
Turkish
and
Italian
.
Another
factor
that
may
be
associated
with
the
above
classification
is
the
percentage
of
new
words
(
PNW
)
in
the
test
set
.
Thus
,
the
expectation
would
be
that
the
highly
inflecting
languages
have
a
high
PNW
while
the
languages
with
little
morphology
have
a
low
PNW
.
But
again
,
there
is
no
direct
correspondence
.
Arabic
,
Basque
,
Catalan
,
English
,
and
Greek
agree
with
this
assumption
:
Catalan
and
English
have
the
smallest
PNW
,
and
Arabic
,
Basque
,
and
Greek
have
a
high
PNW
.
But
the
PNW
for
Italian
is
higher
than
for
Arabic
and
Greek
,
and
this
is
also
true
for
the
percentage
of
new
lemmas
.
Additionally
,
the
highest
PNW
can
be
found
in
Hungarian
and
Turkish
,
which
reach
higher
scores
than
Arabic
,
Basque
,
and
Greek
.
These
considerations
suggest
that
highly
inflected
languages
with
(
relatively
)
free
word
order
need
more
training
data
,
a
hypothesis
that
will
have
to
be
investigated
further
.
There
are
four
languages
which
were
included
in
the
shared
tasks
on
multilingual
dependency
parsing
both
at
CoNLL
2006
and
at
CoNLL
2007
:
Arabic
,
Chinese
,
Czech
,
and
Turkish
.
For
all
four
languages
,
the
same
treebanks
were
used
,
which
allows
a
comparison
of
the
results
.
However
,
in
some
cases
the
size
of
the
training
set
changed
,
and
at
least
one
treebank
,
Turkish
,
underwent
a
thorough
correction
phase
.
Table
5
shows
the
top
scores
for
LAS
.
Since
the
official
scores
excluded
punctuation
in
2006
but
included
it
in
2007
,
we
give
results
both
with
and
without
punctuation
for
both
years
.
For
Arabic
and
Turkish
,
we
see
great improvements of approximately 9 and 6 percentage points, respectively.
For
Arabic
,
the
number
of
tokens
in
the
training
set
doubled
,
and
the
morphological
annotation
was
made
more
informative
.
The
combined
effect
of
these
changes
can
probably
account
for
the
substantial
improvement
in
parsing
accuracy
.
For
Turkish
,
the
training
set
grew
in
size
as
well
,
although
only
by
600
sentences
,
but
part
of
the
improvement
for
Turkish
may
also
be
due
to
continuing
efforts
in
error
correction
and
consistency
checking
.
We
see
that
the
choice
to
include
punctuation
or
not
makes
a
large
difference
for
the
Turkish
scores
,
since
non-final
IGs
of
a
word
are
counted
as
punctuation
(
because
they
have
the
underscore
character
as
their
FORM
value
)
,
which
means
that
word-internal
dependency
links
are
included
if
punctuation
is
included.
However
,
regardless
of
whether
we
compare
scores
with
or
without
punctuation
,
we
see
a
genuine
improvement
of
approximately
6
percentage
points
.
For
Chinese
,
the
same
training
set
was
used
.
Therefore
,
the
drop
from
last
year
's
top
score
to
this
year
's
is
surprising
.
However
,
last
year
's
top
scoring
system
for
Chinese
(
Riedel
et
al.
,
2006
)
,
which
did
not
participate
this
year
,
had
a
score
that
was
more
than
3
percentage
points
higher
than
the
second
best
system
for
Chinese
.
Thus
,
if
we
compare
this
year
's
results
to
the
second
best
system
,
the
difference
is
approximately
2
percentage
points
.
This
final
difference
may
be
attributed
to
the
properties
of
the
test
sets
.
While
last
year
's
test
set
was
taken
from
the
treebank
,
this
year
's
test
set
contains
texts
from
other
sources
.
The
selection
of
the
textual
basis
also
significantly
changed
average
sentence
length
:
The
Chinese
training
set
has
an
average
sentence
length
of
5.9
.
Last
year
's
test
set
also
had
an
average
sentence
length
of
5.9
.
However
,
this
year
,
the
average
sentence
length
is
7.5
tokens
,
which
is
a
significant
increase
.
Longer
sentences
are
typically
harder
to
parse
due
to
the
increased
likelihood
of
ambiguous
constructions
.
Finally
,
we
note
that
the
performance
for
Czech
is
almost
exactly
the
same
as
last
year
,
despite
the
fact
that
the
size
of
the
training
set
has
been
reduced
to
approximately
one
third
of
last
year
's
training
set
.
It
is
likely
that
this
in
fact
represents
a
relative
improvement
compared
to
last
year
's
results
.
(Footnote 9: The decision to include word-internal dependencies in this way can be debated on the grounds that they can be parsed deterministically. On the other hand, they typically correspond to regular dependencies captured by function words in other languages, which are often easy to parse as well. It is therefore unclear whether scores are more inflated by including word-internal dependencies or deflated by excluding them.)
This year's system rankings across the languages show considerably more variation than last year's.
Buchholz
and
Marsi
(
2006
)
report
that
"
[
f
]
or
most
parsers
,
their
ranking
differs
at
most
a
few
places
from
their
overall
ranking
"
.
This
year
,
for
all
of
the
ten
best
performing
systems
with
respect
to
LAS
,
there
is
at
least
one
language
for
which
their
rank
is
at
least
5
places
different
from
their
overall
rank
.
The
most
extreme
case
is
the
top
performing
Nilsson
system
(
Hall
et
al.
,
2007a
)
,
which
reached
rank
1
for
five
languages
and
rank
2
for
two
more
languages
.
Their
only
outlier
is
for
Chinese
,
where
the
system
occupies
rank
14
,
with
a
LAS
approximately
9
percentage
points
below
the
top
scoring
system
for
Chinese
(
Sagae
and
Tsujii
, 2007). However, it turned out that the
official
results
for
Chinese
contained
a
bug
,
and
the
true
performance
of
their
system
was
actually
much
higher
.
The
greatest
improvement
of
a
system
with
respect
to
its
average
rank
occurs
for
English
,
for
which
the
system
by
Nguyen
et
al.
(
2007
)
improved
from
the
average
rank
15
to
rank
6
.
Two
more
outliers
can
be
observed
in
the
system
of
Johansson
and
Nugues
(
2007b
)
,
which
improves
from
its
average
rank
12
to
rank
4
for
Basque
and
Turkish
.
The
authors
attribute
this
high
performance
to
their
parser
's
good
performance
on
small
training
sets
.
However
,
this
hypothesis
is
contradicted
by
their
results
for
Greek
and
Italian
,
the
other
two
languages
with
small
training
sets
.
For
these
two
languages
,
the
system
's
rank
is
very
close
to
its
average
rank
.
6.1.3
An
Experiment
in
System
Combination
Having
the
outputs
of
many
diverse
dependency
parsers
for
standard
data
sets
opens
up
the
interesting
possibility
of
parser
combination
.
To
combine
the
outputs
of
each
parser
we
used
the
method
of
Sagae
and
Lavie
(
2006
)
.
This
technique
assigns
to
each
possible
labeled
dependency
a
weight
that
is
equal
to
the
number
of
systems
that
included
the
dependency
in
their
output
.
This
can
be
viewed
as
an
arc-based
voting
scheme
.
Using
these
weights
it
is
possible
to
search
the
space
of
possible
dependency
trees
using
directed
maximum
spanning
tree
algorithms
(
McDonald
et
al.
,
2005
)
.
The
maximum
spanning
tree
in
this
case
is
equal
to
the
tree
that
on
average
contains
the
labeled
dependencies
that
most
systems
voted
for
.
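The arc-based voting step can be sketched in a few lines. Only the weighting is shown here; a maximum spanning tree algorithm would then be run over these weights to extract the final tree. The example arcs are invented.

```python
from collections import Counter

# Sketch of arc-based voting for parser combination: each candidate labeled
# dependency (dependent, head, label) is weighted by the number of systems
# that proposed it.

def vote_arcs(system_outputs):
    """Each output is a list of (dependent, head, label) arcs for one sentence."""
    weights = Counter()
    for arcs in system_outputs:
        for arc in arcs:
            weights[arc] += 1   # weight = number of systems voting for the arc
    return weights

outputs = [
    [(1, 2, "SBJ"), (2, 0, "ROOT")],
    [(1, 2, "SBJ"), (2, 0, "ROOT")],
    [(1, 2, "OBJ"), (2, 0, "ROOT")],
]
votes = vote_arcs(outputs)
print(votes[(2, 0, "ROOT")], votes[(1, 2, "SBJ")])  # → 3 2
```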
It
is
worth
noting
that
variants
of
this
scheme
were
used
in
two
of
the
participating
systems, the Nilsson system (Hall et al., 2007a) and the system of Sagae and Tsujii (2007).

[Figure 1: System Combination. Unlabeled and labeled accuracy plotted against the number of systems combined.]
Figure
1
plots
the
labeled
and
unlabeled
accuracies
when
combining
an
increasing
number
of
systems
.
The
data
used
in
the
plot
was
the
output
of
all
competing
systems
for
every
language
in
the
multilingual
track
.
The
plot
was
constructed
by
sorting
the
systems
based
on
their
average
labeled
accuracy
scores
over
all
languages
,
and
then
incrementally
adding
each
system
in
descending
order.
We
can
see
that
both
labeled
and
unlabeled
accuracy
are
significantly
increased
,
even
when
just
the
top
three
systems
are
included
.
Accuracy
begins
to
degrade
gracefully
after
about
ten
different
parsers
have
been
added
.
Furthermore
,
the
accuracy
never
falls
below
the
performance
of
the
top
three
systems
.
6.2
Domain
Adaptation
Track
For
this
task
,
the
results
are
rather
surprising
.
A
look
at
the
LAS
and
UAS
for
the
chemical
research
abstracts
shows
that
there
are
four
closed
systems
that
outperform
the
best
scoring
open
system
.
The
best
system
(
Sagae
and
Tsujii
,
2007
)
reaches
an
LAS
of
81.06
(
in
comparison
to
their
LAS
of
89.01
for
the
English
data
set
in
the
multilingual
track
)
.
Considering
that
approximately
one
third
of
the
words
of
the
chemical
test
set
are
new
,
the
results
are
noteworthy
.
The
next
surprise
is
to
be
found
in
the
relatively
low
UAS
for
the
CHILDES
data
.
At
a
first
glance
,
this
data
set
has
all
the
characteristics
of
an
easy
set; the average sentence is short (12.9 words), and the percentage of new words is also small (6.10%). (Footnote 10: The reason that there is no data point for two parsers is that the simple voting scheme adopted only makes sense with at least three parsers voting.)
Despite
these
characteristics
,
the
top
UAS
reaches
62.49
and
is
thus
more
than
10
percentage
points
below
the
top
UAS
for
the
chemical
data
set
.
One
major
reason
for
this
is
that
auxiliary
and
main
verb
dependencies
are
annotated
differently
in
the
CHILDES
data
than
in
the
WSJ
training
set
.
As
a
result
of
this
discrepancy
,
participants
were
not
required
to
submit
results
for
the
CHILDES
data
.
The
best
performing
system
on
the
CHILDES
corpus
is
an
open
system
(
Bick
,
2007
)
,
but
the
distance
to
the
top
closed
system
is
approximately
1
percentage
point
.
In
this
domain
,
it
seems
more
feasible
to
use
general
language
resources
than
for
the
chemical
domain
.
However
,
the
results
suggest
that
the
extra
effort
may
be
unnecessary
.
7
Conclusion
Two
years
of
dependency
parsing
in
the
CoNLL
shared
task
have
brought
an
enormous
boost
to
the
development
of
dependency
parsers
for
multiple
languages
(
and
to
some
extent
for
multiple
domains
)
.
But
even
though
nineteen
languages
have
been
covered
by
almost
as
many
different
parsing
and
learning
approaches
,
we
still
have
only
vague
ideas
about
the
strengths
and
weaknesses
of
different
methods
for
languages
with
different
typological
characteristics
.
Increasing
our
knowledge
of
the
multi-causal
relationship
between
language
structure
,
annotation
scheme
,
and
parsing
and
learning
methods
probably
remains
the
most
important
direction
for
future
research
in
this
area
.
The
outputs
of
all
systems
for
all
data
sets
from
the
two
shared
tasks
are
freely
available
for
research
and
constitute
a
potential
gold
mine
for
comparative
error
analysis
across
languages
and
systems
.
For
domain
adaptation
we
have
barely
scratched
the
surface
so
far
.
But
overcoming
the
bottleneck
of
limited
annotated
resources
for
specialized
domains
will
be
as
important
for
the
deployment
of
human
language
technology
as
being
able
to
handle
multiple
languages
in
the
future
.
One
result
from
the
domain
adaptation
track
that
may
seem
surprising
at
first
is
the
fact
that
closed
class
systems
outperformed
open
class
systems
on
the
chemical
abstracts
.
However
,
it
seems
that
the
major
problem
in
adapting
pre-existing
parsers
to
the
new
domain
was
not
the
domain
as
such
but
the
mapping
from
the
native
output
of
the
parser
to
the
kind
of
annotation
provided
in
the
shared
task
data
sets
.
Thus
,
finding
ways
of
reusing
already
invested
development
efforts
by
adapting
the
outputs
of
existing
systems
to
new
requirements
,
without
substantial
loss
in
accuracy
,
seems
to
be
another
line
of
research
that
may
be
worth
pursuing
.
Acknowledgments
First
and
foremost
,
we
want
to
thank
all
the
people
and
organizations
that
generously
provided
us
with
treebank
data
and
helped
us
prepare
the
data
sets
and
without
whom
the
shared
task
would
have
been
literally
impossible
:
Otakar
Smrz
,
Charles
University
,
and
the
LDC
(
Arabic
)
;
Maxux
Aranzabe
,
Kepa
Bengoetxea
,
Larraitz
Uria
,
Koldo
Gojenola
,
and
the
University
of
the
Basque
Country
(
Basque
)
;
Ma. Antònia Martí Antonín
,
Lluís Màrquez
,
Manuel
Bertran
,
Mariona
Taule
,
Difda
Monterde
,
Eli
Comelles
,
and
CLiC-UB
(
Catalan
)
;
Shih-Min
Li
,
Keh-Jiann
Chen
,
Yu-Ming
Hsieh
,
and
Academia
Sinica
(
Chinese
)
;
Jan
Hajic
,
Zdenek
Zabokrtsky
,
Charles
University
,
and
the
LDC
(
Czech
)
;
Brian
MacWhinney
,
Eric
Davis
,
the
CHILDES
project
,
the
Penn
BiolE
project
,
and
the
LDC
(
English
)
;
Prokopis
Prokopidis
and
ILSP
(
Greek
)
;
Csirik
Janos
and
Zoltán
Alexin
(
Hungarian
)
;
Giuseppe
Attardi
,
Simonetta
Montemagni
,
Maria
Simi
,
Isidoro
Barraco
,
Patrizia
Topi
,
Kiril
Ribarov
,
Alessandro
Lenci
,
Nicoletta
Calzolari
,
ILC
,
and
ELRA
(
Italian
)
;
Gülşen Eryiğit
,
Kemal
Oflazer
,
and
Ruket
Çakıcı
(
Turkish
)
.
Secondly
,
we
want
to
thank
the
organizers
of
last
year
's
shared
task
,
Sabine
Buchholz
,
Amit
Dubey
,
Erwin
Marsi
,
and
Yuval
Krymolowski
,
who
solved
all
the
really
hard
problems
for
us
and
answered
all
our
questions
,
as
well
as
our
colleagues
who
helped
review
papers
:
Jason
Baldridge
,
Sabine
Buchholz
,
James
Clarke
,
Gülşen Eryiğit
,
Kilian
Evang
,
Julia
Hockenmaier
,
Yuval
Krymolowski
,
Erwin
Marsi
,
Beata
Megyesi
,
Yannick
Versley
,
and
Alexander
Yeh
.
Special
thanks
to
Bertjan
Busser
and
Erwin
Marsi
for
help
with
the
CoNLL
shared
task
website
and
many
other
things
,
and
to
Richard
Johansson
for
letting
us
use
his
conversion
tool
for
English
.
Thirdly
,
we
want
to
thank
the
program
chairs
Kudo
,
the
publications
chair
,
Eric
Ringger
,
the
SIGNLL
officers
,
Antal
van
den
Bosch
,
Hwee
Tou
Ng
,
and
Erik
Tjong
Kim
Sang
,
and
members
of
the
LDC
staff
,
Tony
Castelletto
and
Ilya
Ahtaridis
,
for
great
cooperation
and
support
.
Finally
,
we
want
to
thank
the
following
people
,
who
in
different
ways
assisted
us
in
the
organization
of
the
CoNLL
2007
shared
task
:
Giuseppe
Attardi
,
Eckhard
Bick
,
Matthias
Buch-Kromann
,
Xavier
Carreras
,
Tomaz
Erjavec
,
Svetoslav
Marinov
,
Wolfgang
Menzel
,
Xue
Nianwen
,
Gertjan
van
Noord
,
Petya
Osenova
,
Florian
Schiel
,
Kiril
Simov
,
Zdenka
Uresova
,
and
Heike
Zinsmeister
.
