This
paper
discusses
automatic
determination
of
case
in
Arabic
.
This
task
is
a
major
source
of
errors
in
full
diacritization
of
Arabic
.
We
use
a
gold-standard
syntactic
tree
,
and
obtain
an
error
rate
of
about
4.2
%
,
with
a
machine
learning
based
system
outperforming
a
system
using
hand-written
rules
.
1
Introduction
In
Modern
Standard
Arabic
(
MSA
)
,
all
nouns
and
adjectives
have
one
of
three
cases
:
nominative
(
Nom
)
,
accusative
(
Acc
)
,
or
genitive
(
Gen
)
.
What
sets
case
in
MSA
apart
from
case
in
other
languages
is
most
saliently
the
fact
that
it
is
usually
not
marked
in
the
orthography
,
as
it
is
written
using
diacritics
which
are
normally
omitted
.
In
fact
,
in
a
recent
paper
on
diacritization
,
Habash
and
Rambow
(
2007
)
report
that
word
error
rate
drops
9.4
%
absolute
(
to
5.5
%
)
if
the
word-final
diacritics
(
which
include
case
)
need
not
be
predicted
.
Similar
drops
have
been
observed
by
other
researchers
(
Nelken
and
Shieber
,
2005
;
Zitouni
et
al.
,
2006
)
.
Thus
,
we
can
deduce
that
tagging-based
approaches
to
case
identification
are
limited
in
their
usefulness
,
and
if
we
need
full
diacritization
for
subsequent
processing
in
a
natural
language
processing
(
NLP
)
application
(
say
,
language
modeling
for
automatic
speech
recognition
(
Vergyri
and
Kirchhoff
,
2004
)
)
,
we
need
to
perform
more
complex
syntactic
processing
to
restore
case
diacritics
.
Options
include
using
the
output
of
a
parser
in
determining
case
.
An
additional
motivation
for
investigating
case
in
Arabic
comes
from
treebanking
.
Native
speakers
of
Arabic
in
fact
are
native
speakers
of
one
of
the
Arabic
dialects
,
all
of
which
have
lost
case
(
Holes
,
2004
)
.
They
learn
MSA
in
school
,
and
have
no
native-speaker
intuition
about
case
.
Thus
,
determining
case
in
MSA
is
a
hard
problem
for
everyone
,
including
treebank
annotators
.
A
tool
to
catch
case-related
errors
in
treebanking
would
be
useful
.
In
this
paper
,
we
investigate
the
problem
of
determining
case
of
nouns
and
adjectives
in
syntactic
trees
.
We
use
gold
standard
trees
from
the
Arabic
Treebank
(
ATB
)
.
We
see
our
work
using
gold
standard
trees
as
a
first
step
towards
developing
a
system
for
restoring
case
to
the
output
of
a
parser
.
The
complexity
of
the
task
justifies
an
initial
investigation
based
on
gold
standard
trees
.
And
of
course
,
the
use
of
gold
standard
trees
is
justified
for
our
other
objective
,
helping
quality
control
for
treebanking
.
The
study
presented
in
this
paper
shows
the
importance
of
what
has
been
called
"
feature
engineering
"
and
the
issue
of
representation
for
machine
learning
.
Our
initial
machine
learning
experiments
use
features
that
can
be
read
off
the
ATB
phrase
structure
trees
in
a
straightforward
manner
.
The
literature
on
case
in
MSA
(
prescriptive
and
descriptive
sources
)
reveals
that
case
assignment
in
Arabic
does
not
always
follow
standard
assumptions
about
predicate-argument
structure
,
which
is
what
Proceedings
of
the
2007
Joint
Conference
on
Empirical
Methods
in
Natural
Language
Processing
and
Computational
Natural
Language
Learning
,
pp.
1084-1092
,
Prague
,
June
2007
.
©
2007
Association
for
Computational
Linguistics
the
ATB
annotation
is
based
on
.
Therefore
,
we
transform
the
ATB
so
that
the
new
representation
is
based
entirely
on
case
assignment
,
not
predicate-argument
structure
.
The
features
for
machine
learning
that
can
now
be
read
off
from
the
new
representation
yield
much
better
results
.
Our
results
show
that
we
can
determine
case
with
an
error
rate
of
4.2
%
.
However
,
our
results
would
have
been
impossible
without
a
deeper
understanding
of
the
linguistic
phenomenon
of
case
and
a
transformation
of
the
representation
oriented
towards
this
phenomenon
.
Using
either
underlying
representation
,
machine
learning
performs
better
than
hand-written
rules
.
However
,
a
closer
look
at
the
errors
made
by
the
machine
learning-derived
classifier
and
the
handwritten
rules
reveals
that
most
errors
are
in
fact
treebank
errors
(
between
69
%
and
86
%
of
all
errors
for
the
machine
learning-derived
classifier
and
the
hand-written
rules
,
respectively
)
.
Furthermore
,
the
machine
learning
classifier
agrees
more
often
with
treebank
errors
than
the
hand-written
rules
do
.
This
fact
highlights
the
problem
ofmachine
learning
(
garbage
in
,
garbage
out
)
,
but
holds
out
the
prospect
for
improvement
in
the
machine
learning
based
classifier
as
the
treebank
is
checked
for
errors
and
re-released
.
In
the
next
section
,
we
describe
all
relevant
linguistic
facts
of
case
in
Arabic
.
Section
3
details
the
resources
used
in
this
research
.
Section
4
describes
the
preprocessing
done
to
extract
the
relevant
linguistic
features
from
the
ATB
.
Sections
5
and
6
detail
the
two
systems
we
compare
.
Sections
7
and
8
present
results
and
an
error
analysis
of
the
two
systems
.
And
we
conclude
with
a
discussion
of
our
findings
in
Section
9
.
2
Linguistic
Facts
All
Arabic
nominals
(
common
nouns
,
proper
nouns
,
adjectives
and
adverbs
)
are
inflected
for
case
,
which
has
three
values
in
Arabic
:
nominative
(
Nom
)
,
accusative
(
Acc
)
or
genitive
(
Gen
)
.
We
know
this
from
case
agreement
facts
,
even
though
the
morphology
and
/
or
orthography
do
not
necessarily
always
make
the
case
realization
overt
.
We
discuss
morphological
and
syntactic
aspects
ofcase
in
MSA
in
turn
.
2.1
Morphological
Realization
of
Case
The
realization
of
nominal
case
in
Arabic
is
complicated
by
its
orthography
,
which
uses
optional
diacritics
to
indicate
short
vowel
case
morphemes
,
and
by
its
morphology
,
which
does
not
always
distinguish
between
all
cases
.
Additionally
,
case
realization
in
Arabic
interacts
heavily
with
the
realization
of
definiteness
,
leading
to
different
realizations
depending
on
whether
the
nominal
is
indefinite
,
i.e.
,
receiving
nunation
(
^y3
)
,
definite
through
the
determiner
Al+
(
+Jl
)
or
definite
through
being
the
governor
of
an
idafa
possessive
construction
(
&lt;
jL
»
J
)
.
Most
details
of
this
interaction
are
outside
the
scope
of
this
paper
,
but
we
discuss
it
as
much
as
it
helps
clarify
issues
of
case
.
Buckley
(
2004
)
describes
eight
different
classes
of
nominal
case
expression
,
which
we
briefly
review
.
We
first
discuss
the
realization
of
case
in
morphologically
singular
nouns
(
including
broken
,
i.e.
,
irregular
,
plurals
)
.
Triptotes
are
the
basic
class
which
expresses
the
three
cases
in
the
singular
using
the
three
short
vowels
of
Arabic
:
Nom
is
i
+U,1
Acc
is
r
+a
,
and
Gen
is
_
+i
.
The
corresponding
nunated
forms
for
these
three
diacritics
are
:
i
+U
for
Nom
,
;
+a
for
Acc
,
and
_
+1
for
Gen.
Nominals
not
ending
with
Ta
Marbuta
(
5
h
)
or
Alif
Hamza
(
A
A
'
)
receive
an
extra
Alif
in
the
accusative
indefinite
case
(
e.g
,
Ufcf
kitAbAa
'
book
'
versus
&lt;
u
kfkitAbaha
'
writing
'
)
.
Diptotes
are
like
triptotes
except
that
when
they
are
indefinite
,
they
do
not
express
nunation
and
they
use
the
r
+a
suffix
for
both
Acc
and
Gen.
The
class
of
diptotes
is
lexically
specific
.
It
includes
nominals
with
specific
meanings
or
morphological
patterns
(
colors
,
elatives
,
specific
broken
plurals
,
some
proper
names
with
Ta
Marbuta
ending
or
location
names
devoid
of
the
definite
article
)
.
Examples
include
Cj^ju
bayruwt
'
Beirut
'
and
ijjj
1
Aazraq
'
All
Arabic
transliterations
are
provided
in
the
Habash-Soudi-Buckwalter
transliteration
scheme
(
Habash
et
al.
,
2007
)
.
This
scheme
extends
Buckwalter
's
transliteration
scheme
(
Buckwalter
,
2002
)
to
increase
its
readability
while
maintaining
the
1-to-1
correspondence
with
Arabic
orthography
as
represented
in
standard
encodings
of
Arabic
,
i.e.
,
Unicode
,
CP-1256
,
etc.
The
following
are
the
only
differences
from
Buckwalter
's
D
Ji
(
Z
)
,
s
*
(
E
)
,
7
£
(
g
)
,
yj
;
(
Y
)
,
(
F
)
,
u
-
(
N
)
,
i
.
(
K
)
.
'
blue
'
.
The
next
three
classes
are
less
common
.
The
invariables
show
no
case
in
the
singular
(
e.g.
nomi-nals
ending
in
long
vowels
:
l^jj
-
»
suwryA
'
Syria
'
or
J
"
i
dikray
'
memoir
'
)
.
The
indéclinables
always
use
the
r
+a
suffix
to
express
case
in
the
singular
and
allow
for
nunation
(
J^
*
-
*
maçnayà
'
meaning
'
)
.
The
defective
nominals
,
which
are
derived
from
roots
with
a
final
radical
glide
(
y
or
w
)
,
look
like
triptotes
except
that
they
collapse
Nom
and
Gen
into
the
Gen
form
,
which
also
includes
loosing
their
final
glide
:
Je
\
3qAD1
(
Nom
,
Gen
)
versus
\
Li
&gt;
\
3qADiyAa
(
Acc
)
'
a
judge
'
.
For
the
dual
and
sound
plural
,
the
situation
is
simpler
,
as
there
are
no
lexical
exceptions
.
The
duals
and
masculine
sound
plurals
express
number
,
case
and
gender
jointly
in
single
morphemes
that
are
identifiable
even
if
undia-critized
:
jj
£
&gt;
^
"
kAtib+uwna
'
writersmasC
;
Pl
'
(
Nom
)
,
jlsK
"
kAtib+Ani
'
writersmasC
;
du
'
(
Nom
)
,
juajK
"
kAtib+atAni
'
writersfem
,
du
'
(
Nom
)
.
The
AcC
and
Gen
forms
are
identical
,
e.g.
,
jxJ8
"
'
kAtib+iyna
'
writersmasC
;
Pl
'
(
Acc
,
Gen
)
.
Finally
,
the
dual
and
masculine
sound
plural
do
not
express
nunation
.
On
the
other
hand
,
the
feminine
sound
plural
marks
nunation
explicitly
,
and
all
of
its
case
morphemes
are
written
only
as
diacritics
,
e.g.
,
kAtib+At+u
'
writersfem
,
pi
'
(
Nom
)
.
Traditional
Arabic
grammar
makes
a
distinction
between
verbal
clauses
(
LL
*
i
J
-
*
*
"
)
and
nominal
clauses
(
&lt;
Lr
"
l
J
-
*
*
)
.
Verbal
clauses
are
verb-initial
sentences
,
and
we
(
counter
to
the
Arabic
grammatical
tradition
)
include
copula-initial
clauses
in
this
group
.
The
copula
is
kAn
to
be
'
or
one
of
her
sisters
.
Nominal
clauses
begin
with
a
topic
(
which
is
always
a
nominal
)
,
and
continue
with
a
complement
which
is
either
a
verbal
clause
,
a
nominal
predicate
,
or
a
prepositional
predicate
.
If
the
complement
of
a
topic
is
a
verbal
clause
,
an
inflectional
subject
morpheme
or
a
resumptive
object
clitic
pronoun
replace
the
argument
which
has
become
the
topic
.
Arabic
case
system
falls
within
the
class
of
nominative-accusative
languages
(
as
opposed
to
ergative-absolutive
languages
)
.
Some
of
the
common
behavior
of
case
in
Arabic
with
other
languages
includes
:
2
•
Nom
is
assigned
to
subjects
of
verbal
clauses
,
as
well
as
other
nominals
in
headings
,
titles
and
quotes
.
•
Acc
is
assigned
to
(
direct
and
indirect
)
objects
of
verbal
clauses
,
verbal
nouns
,
or
active
participles
;
to
subjects
of
small
clauses
governed
by
other
verbs
(
i.e.
,
"
exceptional
case
marking
"
or
"
raising
to
object
"
contexts
;
we
remain
agnostic
on
the
proper
analysis
)
;
adverbs
;
and
certain
interjections
,
such
as
Ijdi
sukrAa
'
Thank
you
'
.
•
Gen
is
assigned
to
objects
of
prepositions
and
to
possessors
in
idafa
(
possessive
)
construction
.
•
There
is
a
distinction
between
case-by-assignment
and
case-by-agreement
.
In
case-by-assignment
,
a
specific
case
is
assigned
to
a
nominal
by
its
case
assigner
;
whereas
in
case-by-agreement
,
the
modifying
or
conjoined
nominal
copies
the
case
of
its
governor
.
Arabic
case
differs
from
case
in
other
languages
in
the
following
conditions
,
which
relate
to
nominal
clauses
and
numbers
.
•
The
topic
(
independently
of
its
grammatical
function
)
is
Acc
if
it
follows
the
subordinating
conjunction
jj
Ain^a
(
or
any
of
her
"
sisters
"
:
jl
)
liAan^a
,
ij^kaAan^a
,
lakin^a
,
etc.
)
.
Otherwise
,
the
topic
is
Nom
.
•
Nominal
predicates
are
Acc
if
they
are
governed
by
the
overt
copula
.
They
are
also
Acc
if
they
are
objects
of
verbs
that
take
small
clause
complements
(
such
as
to
consider
'
)
,
unless
the
predicate
is
introduced
by
a
subordinating
conjunction
.
In
all
other
cases
,
they
are
Nom
.
•
In
constructions
involving
a
nominal
and
a
number
(
IIK
"
"
jjJcLt
Eisruwna
kAtibAa
twenty
writers
'
)
,
the
head
of
the
phrase
for
case
assignment
is
the
number
,
which
receives
whichever
case
the
context
assigns
.
The
case
of
the
nominal
depends
on
the
number
.
If
the
number
is
between
11
and
99
,
the
nominal
is
2Buckley
(
2004
)
describes
in
detail
the
conditions
for
each
of
the
three
cases
in
Arabic
.
He
considers
NOM
to
be
the
default
case
.
He
specifies
seven
conditions
for
NOM
,
25
for
ACC
and
two
for
GEN
.
Our
summary
covers
the
same
ground
as
his
description
except
that
we
omit
the
vocative
use
of
nominals
.
Acc
by
tamiyz
(
Jrur
-
lit
.
"
specification
"
)
.
Otherwise
,
the
nominal
is
Gen
by
idafa
.
We
use
the
third
section
of
the
current
version
of
the
Arabic
Treebank
released
by
the
Linguistic
Data
Consortium
(
LDC
)
(
Maamouri
et
al.
,
2004
)
.
We
use
the
division
into
training
and
devtest
corpora
proposed
by
Zitouni
et
al.
(
2006
)
,
further
dividing
their
devtest
set
into
two
equal
parts
to
give
us
a
development
and
a
test
set
.
The
training
set
has
approximately
367,000
words
,
and
the
development
and
test
sets
each
have
about
33,000
words
.
In
our
training
data
,
of
133,250
case-marked
nominals
,
66.4
%
are
Gen
,
18.5
%
Acc
,
and
15.1
%
Nom
.
The
ATB
annotation
in
principle
indicates
for
each
nominal
its
case
and
the
corresponding
realization
(
including
diacritics
)
.
The
only
systematic
exception
is
that
invariables
are
not
marked
at
all
with
their
unrealized
case
,
and
are
marked
as
having
NO-CASE
.
We
exclude
all
nominals
marked
NOCASE
from
our
evaluations
,
as
we
believe
that
these
nom-inals
actually
do
have
case
,
it
is
just
not
marked
in
the
treebank
,
and
we
do
not
wish
to
predict
the
morphological
realization
,
only
the
underlying
case
.
In
reporting
results
,
we
use
accuracy
on
the
number
of
nominals
whose
case
is
given
in
the
treebank
.
While
the
ATB
does
not
contain
explicit
information
about
headedness
in
its
phrase
structure
,
we
can
say
that
the
syntactic
annotations
in
the
ATB
are
roughly
based
on
predicate-argument
structure
.
For
example
,
for
the
structure
shown
in
Figure
1
,
the
"
natural
"
interpretation
is
that
the
head
is
AHtrAqu
'
burning
'
,
with
a
modifier
mnzlAd
'
house
'
,
which
in
turn
is
modified
by
a
QP
whose
head
is
(
presumably
)
the
number
20
,
which
is
modified
by
\
Ak9ri
'
more
'
and
mn
'
than
'
.
This
dependency
structure
is
shown
on
the
left
in
Figure
2
.
Another
annotation
detail
relevant
to
this
paper
is
that
the
ATB
marks
the
topic
of
a
nominal
clause
as
"
SBJ
"
(
i.e.
,
as
a
subject
)
except
when
the
predicate
is
a
verbal
clause
;
then
it
is
marked
as
TPC
.
We
consider
these
two
cases
to
be
the
same
case
and
relabel
all
such
cases
as
TPC
.
'
burning
mnzlAa
'
house
Figure
1
:
The
representation
of
numbers
in
the
Arabic
Treebank
,
for
a
subject
NP
meaning
'
the
burning
of
more
than
20
houses
4
Determining
the
Case
Assigner
Case
assignment
is
a
relationship
between
two
words
:
one
word
(
the
case
governor
or
assigner
)
assigns
case
to
the
other
word
(
the
case
assignee
)
.
Because
case
assignment
is
a
relationship
between
words
,
we
switch
to
a
dependency-based
version
of
the
treebank
.
There
are
many
possible
ways
to
transform
a
phrase
structure
representation
into
a
dependency
representation
;
we
explore
two
such
conversions
in
the
context
of
this
paper
.
Note
that
if
we
had
used
the
Prague
Arabic
Dependency
Tree-bank
(
Smrz
and
Hajic
,
2006
)
instead
of
the
ATB
,
we
would
not
have
had
to
convert
to
dependency
,
but
we
still
would
have
had
to
analyze
whether
the
dependencies
are
the
ones
we
need
for
modeling
case
assignment
,
possibly
having
to
restructure
the
dependencies
.
For
determining
the
dependency
relations
that
determine
case
assignment
,
we
start
out
by
using
a
standard
head
percolation
algorithm
with
the
following
parameters
:
Verbs
head
all
the
arguments
in
VPs
;
prepositions
head
the
PP
arguments
;
and
the
first
nominal
in
an
NP
or
ADJP
heads
those
structures
.
Non-verbal
predicates
(
NPs
,
ADJPs
or
PPs
)
head
their
subjects
(
topics
)
.
The
subordinating
conjunction
jl
Ain^a
is
governed
by
what
follows
it
.
The
overt
copula
kAn
governs
both
topic
and
predicate
.
Conjunctions
are
headed
by
what
they
follow
and
head
what
they
precede
(
with
the
exception
of
the
common
sentence
initial
conjunction
w+
'
and
'
,
which
is
headed
by
the
sentence
it
introduces
)
.
We
will
call
the
result
of
this
algorithm
the
Basic
Case
Assigner
Identification
Algorithm
,
or
Basic
Representation
for
short
.
After
initial
experiments
with
both
hand-written
rules
and
machine
learning
,
we
extend
the
Basic
Representation
in
order
to
account
for
the
special
case
assigning
properties
of
numbers
in
Arabic
by
adding
additional
head
percolation
parameters
and
restructuring
rules
to
handle
the
structure
of
NPs
in
the
ATB
.
This
is
because
the
current
ATB
representation
is
not
useful
in
some
cases
for
representing
case
assignment
.
Consider
the
structure
in
Figure
1
.
Here
,
the
head
of
the
NP
is
the
noun
ijLfe
&gt;
\
AHtrAqu
'
burning
'
,
which
has
Nom
because
the
NP
is
a
subject
(
the
verb
is
not
shown
)
.
The
QP
s
irst
member
,
AkOri
'
more
'
is
Gen
because
it
is
in
an
idafa
construction
with
the
noun
AHtrAqu
.
AkOri
is
modified
by
the
preposition
^
mn
'
than
'
which
assigns
Gen
to
the
number
20
(
which
is
written
in
Arabic
numerals
and
thus
does
not
show
any
case
at
all
)
.
The
noun
Mjru
mnzlAa
'
house
'
is
in
a
tamyiz
relation
with
the
number
20
which
governs
it
,
and
thus
it
is
Acc
.
It
is
clear
that
the
phrase
structure
chosen
for
the
ATB
does
not
represent
these
case-assignment
relations
in
a
direct
manner
.
To
create
the
appropriate
head
relations
for
case
determination
,
we
flatten
all
QPs
and
use
a
set
of
simple
deterministic
rules
to
create
the
more
appropriate
structure
which
expresses
the
chain
ofcase
assignments
.
In
our
development
set
,
5.8
%
of
words
get
a
new
head
using
this
new
head
assignment
.
We
call
this
new
representation
the
Revised
Representation
.
Figure
2
shows
the
dependency
representation
corresponding
to
the
phrase
structure
in
Figure
1
.
We
make
use
of
all
dash-tags
provided
by
the
ATB
as
arc
labels
and
we
extend
the
label
set
to
explicitly
mark
objects
of
prepositions
(
POBJ
)
,
possessors
in
idafa
construction
(
IDAFA
)
,
conjuncts
(
CONJ
)
and
conjunctions
(
CC
)
,
and
the
accusative
speciier
,
tamyiz
(
TMZ
)
.
All
other
modiications
receive
the
label
(
MOD
)
.
5
Hand
Written
Rules
Our
irst
system
is
based
on
hand-written
rules
(
henceforth
,
we
refer
to
this
system
as
the
rule-based
system
)
.
We
add
two
features
to
nominals
in
the
tree
:
(
1
)
we
identify
if
a
word
governs
a
subordinating
conjunction
jj
Ain^a
or
any
of
its
sisters
;
and
(
2
)
we
also
identify
if
a
topic
of
a
nominal
sentence
has
an
Ain^a
sibling
.
The
following
are
the
simple
hand
written
rules
we
use
:
•
RULE
1
:
The
default
case
assigned
is
Acc
for
all
words
.
•
RULE
2
:
Assign
Nom
to
nominals
heading
the
tree
and
those
labeled
HLN
(
headline
)
or
TTL
(
title
)
.
•
RULE
3
:
Assign
Gen
to
nominals
with
the
labels
POBJ
or
IDAFA
.
•
RULE
4
:
Assign
Nom
to
nominals
with
the
label
PRD
if
NOT
headed
by
a
verbal
(
verb
or
deverbal
noun
)
or
if
it
has
an
Ain^a
child
.
•
RULE
5
:
Assign
Nom
to
nominal
topics
that
do
not
have
an
Ain^a
sibling
.
•
RULE
6
:
All
case-unassigned
children
ofnom-inal
parents
(
and
conjunctions
)
,
whose
label
is
MOD
,
CONJ
or
CC
,
copy
the
case
of
their
parent
.
Conjunctions
carry
the
case
temporarily
to
pass
on
agreement
.
Verbs
do
not
pass
on
agreement
.
The
irst
rule
is
applied
to
all
nodes
.
The
second
to
ifth
rules
are
case-by-assignment
rules
applied
in
an
if-else
fashion
(
no
overwriting
is
done
)
.
The
last
rule
is
a
case-by-agreement
rule
.
All
non-nominals
receive
the
case
NA
.
6
Machine
Learning
Experiments
:
The
Statistical
System
Our
second
system
uses
statistical
machine
learning
.
This
system
consists
of
a
core
model
and
an
agreement
model
,
both
of
which
are
linear
classifiers
trained
using
the
maximum
entropy
technique
.
We
implement
this
system
using
the
MALLET
toolbox
(
McCallum
,
2002
)
.
The
core
model
is
used
to
classify
all
words
whose
label
in
the
dependency
representation
is
not
MOD
(
case-by-assignment
)
;
whereas
,
the
agreement
model
is
used
to
classify
all
words
VERB
VERB
mnzlAa
'
house
Acc
Figure
2
:
Two
possible
dependency
trees
for
the
phrase
structure
tree
in
Figure
1
,
meaning
'
burning
ofmore
than
20
houses
;
the
tree
on
the
left
,
our
Basic
Representation
,
represents
a
standard
predicate-argument-modiication
style
tree
,
while
the
tree
on
the
rightrepresents
the
chain
ofcase
assignment
and
is
our
Revised
Representation
whose
label
is
MOD
(
case-by-agreement
)
.
We
handle
conjunctions
in
the
statistical
system
differently
from
the
rule-based
system
:
we
resolve
conjunctions
so
that
conjoined
words
are
labeled
exactly
the
same
.
For
example
,
in
John
and
Mary
went
to
the
store
,
both
John
and
Mary
would
have
the
subject
label
,
even
though
Mary
has
a
conjunction
label
in
the
raw
dependency
tree
.
Both
models
are
trained
only
on
those
words
which
are
marked
for
case
in
the
treebank
.
The
core
model
uses
the
following
features
of
a
word
:
•
the
conjunction
of
the
word
s
POS
tag
and
its
arc
label
;
•
the
word
s
last
length-one
and
length-two
suf-ixes
(
to
model
written
case
morphemes
)
;
•
if
the
word
is
the
object
of
a
preposition
,
the
preposition
it
is
the
object
of
;
•
whether
the
word
is
a
PRD
child
ofa
verb
(
with
the
identity
of
that
verb
conjoined
if
so
)
;
•
if
the
word
has
a
sister
which
is
a
subordinating
conjunction
,
and
if
so
,
that
conjunction
conjoined
with
its
arc
label
;
•
whether
the
word
is
in
an
embedded
clause
conjoined
with
its
arc
label
under
the
verb
of
the
embedded
clause
;
•
the
word
s
left
sister
s
POS
tag
conjoined
with
this
word
s
arc
label
and
its
sister
s
arc
label
;
•
whether
the
word
s
sister
depends
on
the
word
or
something
else
;
•
and
the
left
sister
s
terminal
symbol
.
Arabic
words
which
do
not
overtly
show
case
are
still
determined
for
purposes
of
resolving
agreement
.
The
classiier
is
applied
to
these
cases
at
runtime
anyway
.
The
agreement
model
uses
the
following
features
of
a
word
:
•
and
the
conjunction
of
the
word
s
POS
tag
and
the
case
of
what
it
agrees
with
.
Since
words
may
get
their
case
by
agreement
with
other
words
which
themselves
get
their
case
by
agreement
,
the
agreement
model
is
applied
repeatedly
until
case
has
been
determined
for
all
words
.
Rule-based
Statistical
Table
1
:
Accuracies
of
various
approaches
on
the
test
set
in
both
basic
and
revised
dependency
representations
.
7
Results
The
performance
of
our
two
systems
on
the
test
data
set
is
shown
in
table
1
.
There
are
three
points
to
note
:
irst
,
even
in
the
basic
representation
,
the
statistical
system
reduces
error
over
the
rule-based
system
by
7.7
%
.
Second
,
the
revised
representation
helps
tremendously
,
resulting
in
a
13.8
%
reduction
in
error
for
the
rule-based
system
and
30
%
for
the
statistical
system
.
Finally
,
the
statistical
system
gains
much
more
than
the
rule-based
system
from
the
improved
representation
,
increasing
the
gap
between
them
to
a
25
%
reduction
in
error
.
8
Error
Analysis
We
took
a
sample
of
105
sentences
(
around
10
%
)
from
our
development
data
prepared
in
the
revised
representation
.
Our
rule-based
system
accuracy
for
the
sample
is
about
94.1
%
and
our
statistical
system
accuracy
is
96.2
%
.
Table
2
classifies
the
different
types
of
errors
found
.
The
irst
and
second
rows
list
the
errors
made
by
the
statistical
and
rule-based
systems
,
respectively
.
The
third
row
lists
errors
made
by
the
statistical
system
only
.
The
fourth
row
lists
errors
made
by
the
rule-based
system
only
.
And
the
ifth
row
lists
errors
made
by
both
.
The
second
column
indicates
the
count
of
all
errors
.
The
rest
of
the
columns
specify
the
error
types
as
:
system
errors
,
gold
POS
errors
or
gold
tree
errors
.
The
gold
POS
and
tree
errors
are
treebank
errors
that
misguide
our
systems
.
They
represent
69
%
ofall
statistical
system
errors
and
86
%
of
all
rule-based
system
errors
.
Gold
POS
errors
represent
around
35-40
%
of
all
gold
errors
.
They
most
commonly
include
the
wrong
POS
tag
or
the
wrong
case
.
One
example
of
such
errors
is
the
mis-annotation
of
the
Acc
case
to
a
Gen
for
a
diptote
nominal
(
which
are
indistinguishable
out
of
context
)
.
Gold
tree
errors
are
primarily
errors
in
the
dash-tags
used
(
or
missing
)
in
the
treebank
or
attachment
errors
that
are
inconsistent
with
the
gold
POS
tag
.
The
rule-based
system
errors
involve
various
constructions
that
were
not
addressed
in
our
study
,
e.g.
flat
adjectival
phrases
or
non
S
constructions
at
the
highest
level
in
a
tree
(
e.g.
FRAG
or
NP
)
.
The
majority
of
the
statistical
system
errors
involve
agreement
decisions
and
incorrect
choice
of
case
despite
the
presence
ofthe
dash-tags
.
The
ratio
ofsystem
errors
for
the
statistical
system
is
31
%
(
twice
as
much
as
those
of
the
rule-based
system
s
14
%
)
.
Thus
,
it
seems
that
the
statistical
system
manages
to
learn
some
of
the
erroneous
noise
in
the
treebank
.
9
Discussion
9.1
Accomplishments
We
have
developed
a
system
that
determines
case
for
nominals
in
MSA
.
This
task
is
a
major
source
of
errors
in
full
diacritization
ofArabic
.
We
use
a
goldstandard
syntactic
tree
,
and
obtain
an
error
rate
of
about
4.2
%
,
with
a
machine
learning
based
system
outperforming
a
system
using
hand-written
rules
.
A
careful
error
analysis
suggests
that
when
we
account
for
annotation
errors
in
the
gold
standard
,
the
error
rate
drops
to
0.8
%
,
with
the
hand-written
rules
outperforming
the
machine
learning-based
system
.
We
can
draw
several
general
conclusions
from
our
experiments
.
•
The
features
relevant
for
the
prediction
ofcom-plex
linguistic
phenomena
cannot
necessarily
be
easily
read
off
from
the
given
representation
of
the
data
.
Sometimes
,
due
to
data
sparseness
and
/
or
limitations
in
the
machine
learning
paradigm
used
,
we
need
to
extract
features
from
the
available
representation
in
a
manner
that
profoundly
changes
the
representation
(
as
is
done
in
bilexical
parsing
(
Collins
,
1997
)
)
.
Such
transformations
require
a
deep
understanding
of
the
linguistic
phenomena
on
the
part
of
the
researchers
.
•
Researchers
developing
hand-written
rules
may
follow
an
empirical
methodology
in
natural
language
processing
if
they
use
data
sets
to
develop
and
test
the
rules
—
the
only
true
methodological
difference
between
machine
learning
and
this
kind
of
hand-writing
of
rules
GOLD
POS
GOLD
TREE
All
Statistical
All
Rule-based
Statistical
only
Rule-based
only
Statistical
f
]
Rule-based
Table
2
:
Results
of
Error
Analysis
is
the
type
of
learning
(
human
or
machine
)
.
For
certain
phenomena
,
machine
learning
may
result
in
only
a
small
or
no
improvement
in
performance
over
hand-written
rules
.
•
Error
analysis
remains
a
crucial
part
of
any
empirical
work
in
natural
language
processing
.
Not
only
does
it
contribute
insight
into
how
the
system
can
be
improved
,
it
also
reveals
problems
with
the
underlying
data
.
Sometimes
the
problems
are
just
part
of
the
noise
in
the
data
,
but
sometimes
the
problems
can
be
fixed
.
Annotations
on
data
are
not
themselves
naturally
occurring
data
and
thus
may
be
subject
to
critique
.
Note
that
an
error
analysis
requires
a
good
understanding
of
the
linguistic
phenomena
and
of
the
data
.
Our
work
was
motivated
in
two
ways
:
to
help
tree-banking
,
and
to
develop
tools
for
automatic
case
determination
from
unannotated
text
.
For
the
first
goal
,
our
error
analysis
has
shown
that
86
%
of
the
errors
found
by
our
hand-written
rules
are
in
fact
treebank
errors
.
Furthermore
,
we
suspect
that
the
hand-written
rules
have
very
few
false
positives
(
i.e.
,
cases
in
which
the
treebank
has
been
annotated
in
error
but
our
rules
predict
exactly
that
error
)
.
Thus
we
believe
that
our
tool
can
serve
an
important
function
in
improving
the
treebank
annotation
.
For
our
second
motivation
,
the
next
step
will
be
to
adapt
our
feature
extraction
to
work
on
the
output
of
parsers
,
which
typically
exclude
dash-tags
.
We
note
that
for
many
contexts
,
we
do
not
currently
rely
on
dash-tags
but
rather
identify
the
relevant
structures
on
our
own
(
such
as
idafa
,
tamyiz
,
and
so
on
)
.
We
suspect
that
the
machine
learning-based
approach
will
outperform
the
hand-written
rules
,
as
it
can
learn
typical
errors
the
parser
makes
.
As
the
treebank
will
soon
be
revised
and
hand-checked
,
we
will
postpone
this
work
until
the
new
release
of
the
treebank
,
which
will
allow
us
to
train
better
parsers
as
the
data
will
be
more
consistent
.
Acknowledgements
The
research
presented
here
was
supported
by
the
Defense
Advanced
Research
Projects
Agency
(
DARPA
)
under
ContractNos
.
HR0011-06-C-0023
,
HR0011-06-C-0022
and
HR0011-06-1-0003
.
Any
opinions
,
findings
and
conclusions
or
recommendations
expressed
in
this
paper
are
those
of
the
authors
and
do
not
necessarily
reflect
the
views
of
DARPA
.
