We present in this paper methods to improve HMM-based part-of-speech (POS) tagging of Mandarin. We model the emission probability of an unknown word using all the characters in the word, and enrich the standard left-to-right trigram estimation of word emission probabilities with a right-to-left prediction of the word by making use of the current and next tags. In addition, we utilize the RankBoost-based reranking algorithm to rerank the N-best outputs of the HMM-based tagger using various n-gram, morphological, and dependency features. Two methods are proposed to improve the generalization performance of the reranking algorithm. Our reranking model achieves an accuracy of 94.68% using n-gram and morphological features on the Penn Chinese Treebank 5.2, and is able to further improve the accuracy to 95.11% with the addition of dependency features.
1 Introduction

Part-of-speech (POS) tagging is potentially helpful for many advanced natural language processing tasks, for example, named entity recognition, parsing, and sentence boundary detection. Much research has been done to improve tagging performance for a variety of languages. The state-of-the-art systems have achieved an accuracy of 97% for English on the Wall Street Journal (WSJ) corpus (which contains 4.5M words) using various models (Brants, 2000; Ratnaparkhi, 1996; Thede and Harper, 1999). Lower accuracies have been reported in the literature for Mandarin POS tagging (Tseng et al., 2005; Xue et al., 2002). This is, in part, due to the relatively small size and the different annotation guidelines (e.g., granularity of the tag set) for the annotated corpus of Mandarin. Xue et al. (2002) and Tseng et al. (2005) reported accuracies of 93% and 93.74% on CTB-I (Xue et al., 2002) (100K words) and CTB 5.0 (500K words), respectively, each using a Maximum Entropy approach.

The characteristics of Mandarin make it harder to tag than English. Chinese words tend to have greater POS tag ambiguity than English. Tseng et al. (2005) reported that 29.9% of the words in CTB have more than one POS assignment, compared to 19.8% of the English words in WSJ. Moreover, the morphological properties of Chinese words complicate the prediction of POS type for unknown words. These challenges for Mandarin POS tagging suggest the need to develop more sophisticated methods.
In this paper, we investigate the use of a discriminative reranking approach to increase Mandarin tagging accuracy. Reranking approaches (Charniak and Johnson, 2005; Chen et al., 2002; Collins and Koo, 2005; Ji et al., 2006; Roark et al., 2006) have been successfully applied to many NLP applications, including parsing, named entity recognition, sentence boundary detection, etc. To the best of our knowledge, reranking approaches have not been used for POS tagging, possibly due to the already high levels of accuracy for English, which leave little room for further improvement. However, the relatively poorer performance of existing methods on Mandarin POS tagging makes reranking a much more compelling technique to evaluate.
In this paper, we use reranking to improve tagging performance of an HMM tagger adapted to Mandarin.

[Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 1093-1102, Prague, June 2007. © 2007 Association for Computational Linguistics]
Hidden Markov models are simple and effective, but unlike discriminative models, such as Maximum Entropy models (Ratnaparkhi, 1996) and Conditional Random Fields (Lafferty et al., 2001), they have more difficulty utilizing a rich set of conditionally dependent features. This limitation can be overcome by utilizing reranking approaches, which are able to make use of the features extracted from the tagging hypotheses produced by the HMM tagger. Reranking also has advantages over MaxEnt and CRF models. It is able to use any features extracted from entire labeled sentences, including those that cannot be incorporated into MaxEnt and CRF models due to inference difficulties. In addition, reranking methods are able to utilize the information provided by N-best lists. Finally, the decoding phase of reranking is much simpler.
The rest of the paper is organized as follows. We describe the HMM tagger in Section 2. We discuss the modifications to better handle unknown words in Mandarin and to enrich the word emission probabilities through the combination of bi-directional estimations. In Section 3, we first describe the reranking algorithm and then propose two methods to improve its performance. We also describe the features that will be used for Mandarin POS reranking in Section 3. Experimental results are given in Section 4. Conclusions and future work appear in Section 5.
2.1 Porting English Tagger to Mandarin

The best tag sequence t(w_1^N) can be determined efficiently using the Viterbi algorithm.
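The trigram Viterbi search can be sketched as follows; this is a minimal illustration, not the paper's implementation, with `trans` and `emit` as hypothetical toy probability tables, keeping state over the last two tags:

```python
import math

def viterbi_trigram(words, tags, trans, emit):
    """Trigram Viterbi: states are (prev_tag, tag) pairs.

    trans[(t2, t1, t)] -> P(t | t2 t1)   (hypothetical toy tables)
    emit[(t1, t, w)]   -> P(w | t1 t)
    Boundary tags '<s>' are assumed implicitly, as in the paper.
    """
    B = '<s>'
    # best[(t1, t)] = (log-prob, tag path) after consuming current word
    best = {(B, B): (0.0, [])}
    for w in words:
        nxt = {}
        for (t2, t1), (lp, path) in best.items():
            for t in tags:
                p_t = trans.get((t2, t1, t), 0.0)
                p_w = emit.get((t1, t, w), 0.0)
                if p_t <= 0.0 or p_w <= 0.0:
                    continue
                cand = lp + math.log(p_t) + math.log(p_w)
                if (t1, t) not in nxt or cand > nxt[(t1, t)][0]:
                    nxt[(t1, t)] = (cand, path + [t])
        best = nxt
    return max(best.values())[1] if best else []
```

Keeping a state per tag pair gives the usual O(N·T³) decoding cost for a trigram transition model.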
For estimating emission probabilities of unknown words (i.e., words that do not appear in the training data) in English (and similarly for other inflected languages), a weighted sum of suffix-based estimates P(t_j | s_k) (with k up to four) was used as an approximation, where s_k is the suffix of length k of word w_j (e.g., s_1 is the last character of word w_j). The suffix information and three binary features (i.e., whether the word is capitalized, whether the word is hyphenated, and whether the word contains numbers) are combined to estimate the emission probabilities of unknown words. The interpolation weights for smoothing transition, emission, and suffix probabilities were estimated using the log-based Thede smoothing method (Thede and Harper, 1999), in which each weight is a log-based function f of the corresponding n-gram count in the ML estimation. We assume that boundary symbols exist implicitly for boundary conditions.
While porting the HMM-based English POS tagger to Mandarin is fairly straightforward for words seen in the training data, some thought is required to handle unknown words due to the morphology differences between the two languages. First, in Mandarin, there is no capitalization and no hyphenation. Second, although Chinese has morphology, it is not the same as in English; words tend to contain far fewer characters than inflected words in English, so word endings will tend to be short, say one or two characters long. Hence, in our baseline model (denoted HMM baseline), we simply utilize word endings of up to two characters in length, along with a binary feature of whether the word contains numbers or not. In the next two subsections, we describe two ways in which we enhance this simple HMM baseline model.
2.2 Improving the Mandarin Unknown Word Model

Chinese words are quite different from English words, and the word formation process for Chinese words can be quite complex (Packard, 2000). Indeed, the last characters in a Chinese word are, in some cases, most informative of the POS type, while for others, it is the characters at the beginning. Furthermore, it is not uncommon for a character in the middle of a word to provide some evidence for the POS type of the word. Hence, we chose to employ a rather simple but effective method to estimate the emission probability, P(w_i | t_{i-1}, t_i), of an unknown word w_i. We use the geometric average² of the emission probabilities of the characters in the word, i.e., P(c_k | t_{i-1}, t_i) with c_k being the k-th character in the word. Since some of the characters in w_i may not have appeared in any word tagged as t_i in that context in the training data, only characters that are observed in this context are used in the computation of the geometric average, as shown below:

P(w_i | t_{i-1}, t_i) ≈ ( ∏_k P(c_k | t_{i-1}, t_i) )^(1/m)   (3)

where the product and the count m range over the characters of w_i observed in this tag context.
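This estimate can be sketched as a short function; `char_emit` is a hypothetical lookup table of character emission probabilities per tag pair:

```python
import math

def unknown_word_emission(word, char_emit, tag_pair):
    """Geometric average of character emissions P(c_k | t_prev, t),
    restricted to characters observed with this tag context in
    training. char_emit: (t_prev, t, char) -> probability.
    Returns 0.0 if no character of the word was seen in this context.
    """
    t_prev, t = tag_pair
    probs = [char_emit[(t_prev, t, c)] for c in word
             if (t_prev, t, c) in char_emit]
    if not probs:
        return 0.0
    # Geometric mean computed in log space for numerical stability.
    return math.exp(sum(math.log(p) for p in probs) / len(probs))
```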
2.3 Bi-directional Word Probability Estimation

In Equation 2, the word emission probability P(w_i | t_{i-1} t_i) is a left-to-right prediction that depends on the current tag t_i associated with w_i, as well as its previous tag t_{i-1}. Although the interaction between w_i and the next tag t_{i+1} is captured to some extent when t_{i+1} is generated by the model, this implicit interaction may not be as effective as adding the information more directly to the model. Hence, we chose to apply the constraint explicitly in our HMM framework by replacing P(w_i | t_{i-1} t_i) in Equation 2 with P^λ(w_i | t_{i-1} t_i) · P^{1-λ}(w_i | t_i t_{i+1}) for both known and unknown words, with t(w_1^N) determined by:

t(w_1^N) = argmax ∏_i P(t_i | t_{i-2} t_{i-1}) P^λ(w_i | t_{i-1} t_i) P^{1-λ}(w_i | t_i t_{i+1})   (4)
This corresponds to a mixture model of two generation paths, one from the left and one from the right, to approximate t(w_1^N) in Equation 1 in a different way.
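The per-word combination above reduces to a weighted sum in log space; a minimal sketch, with the two emission probabilities passed in directly:

```python
import math

def mixed_emission_logprob(p_lr, p_rl, lam):
    """log of P^lam(w_i | t_{i-1} t_i) * P^(1-lam)(w_i | t_i t_{i+1}).

    p_lr: left-to-right emission probability P(w_i | t_{i-1} t_i)
    p_rl: right-to-left emission probability P(w_i | t_i t_{i+1})
    lam : interpolation weight (lam=1.0 -> standard HMM only;
          lam=0.0 -> purely right-to-left), tuned on the dev set.
    """
    return lam * math.log(p_lr) + (1.0 - lam) * math.log(p_rl)
```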
In this case, the decoding process involves the computation of three local probabilities, i.e., P(t_i | t_{i-2} t_{i-1}), P(w_i | t_{i-1} t_i), and P(w_i | t_i t_{i+1}). By using a simple manipulation that shifts the time index of P(w_i | t_i t_{i+1}) in Equation 4 by two time slices³ (i.e., by replacing P(w_i | t_i t_{i+1}) with P(w_{i-2} | t_{i-2} t_{i-1})), we are able to compute t(w_1^N) in Equation 4 with the same asymptotic time complexity of decoding as in Equation 2.
3 Discriminative Reranking

In this section, we describe our use of the RankBoost-based (Freund and Schapire, 1997; Freund et al., 1998) discriminative reranking approach that was originally developed by Collins and Koo (2005) for parsing. It provides an additional avenue for improving tagging accuracy, and also allows us to investigate the impact of various features on Mandarin tagging performance.
The reranking algorithm takes as input a list of candidates produced by some probabilistic model, in our case the HMM tagger, and reranks these candidates based on a set of features. We first introduce Collins' reranking algorithm in Subsection 3.1, and then describe two modifications in Subsections 3.2 and 3.3 that were designed to improve the generalization performance of the reranking algorithm for our POS tagging task. The reranking features that are used for POS tagging are then described in Subsection 3.4.
3.1 Collins' Reranking Algorithm

²Based on preliminary testing, the geometric average provided greater tag accuracy than the arithmetic average.
³Replacing P(w_i | t_i t_{i+1}) with P(w_{i-1} | t_{i-1} t_i) also gives the same solution.

Each candidate is associated with the log-probability L(x_{i,j}) produced by the HMM tagger.
Each tagging candidate x_{i,j} in the training data has a "goodness" score Score(x_{i,j}) that measures the similarity between the candidate and the gold reference. For tagging, we use tag accuracy as the similarity measure. Without loss of generality, we assume that x_{i,1} has the highest score, i.e., Score(x_{i,1}) > Score(x_{i,j}) for j = 2, ..., n_i. To summarize, the training data consists of a set of examples {x_{i,j} : i = 1, ..., n; j = 1, ..., n_i}, each along with a "goodness" score Score(x_{i,j}) and a log-probability L(x_{i,j}).
Each indicator function h_k is associated with a real-valued weight parameter α_k. In addition, a weight parameter α_0 is associated with the log-probability L(x_{i,j}). The ranking function of candidate x_{i,j} is

F(x_{i,j}) = α_0 L(x_{i,j}) + Σ_k α_k h_k(x_{i,j}),

and the training loss is

Loss(ᾱ) = Σ_i Σ_{j≥2} S_{i,j} exp(−M_{i,j}(ᾱ)),

where S_{i,j} is the weight function that gives the importance of each example, and M_{i,j}(ᾱ) is the margin:

M_{i,j}(ᾱ) = F(x_{i,1}) − F(x_{i,j}).
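The ranking function and margin can be sketched directly; here a candidate is a hypothetical pair of its HMM log-probability and its set of active binary feature ids:

```python
def ranking_score(logprob, features, alpha0, alphas):
    """F(x) = alpha0 * L(x) + sum_k alpha_k * h_k(x), where the binary
    indicator features h_k are given as a set of active feature ids."""
    return alpha0 * logprob + sum(alphas.get(k, 0.0) for k in features)

def margin(best, cand, alpha0, alphas):
    """M_ij = F(x_i1) - F(x_ij): score gap between the gold-best
    candidate and a competitor; positive means correctly ranked.
    best and cand are (logprob, feature-set) pairs."""
    return (ranking_score(*best, alpha0, alphas)
            - ranking_score(*cand, alpha0, alphas))
```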
All of the α_k's are initially set to zero. The value of α_0 is determined first to minimize the loss function and is kept fixed afterwards. Then a greedy sequential⁴ optimization method is used in each iteration (i.e., a boosting round) to select the feature/update pair that most reduces the loss. The new loss after adding the update parameter δ to the parameter α_k is shown below:

Loss(ᾱ, k, δ) = W_{k+} e^{−δ} + W_{k−} e^{+δ} + (terms not involving h_k),   (5)

where W_{k+} (respectively W_{k−}) is the sum of S_{i,j} e^{−M_{i,j}(ᾱ)} over the examples with h_k(x_{i,1}) − h_k(x_{i,j}) = 1 (respectively −1). The minimizing update is δ = ½ log(W_{k+}/W_{k−}) (Equation 6), smoothed in practice as δ* = ½ log((W_{k+} + εZ)/(W_{k−} + εZ)) (Equation 7), where Z = Σ_{i,j} S_{i,j} e^{−M_{i,j}(ᾱ)}.

⁴Parallel optimization algorithms exist and have comparable performance according to (Collins et al., 2002).
The value of ε plays an important role in this formula. If ε is set too small, the smoothing factor εZ does not prevent setting δ* to a potentially overly large absolute value, resulting in over-fitting. If ε is set too large, then the opposite condition of under-training could result. The value of ε is determined based on a development set.
Collins' method allows multiple updates to the weight of a feature based on Equations 5 and 7. We found that for those features for which either W_+ or W_− equals zero, the update formula in Equation 7 can only increase their weight (in absolute value) in one direction. Although these features are strong and useful, setting their weights too large can be undesirable in that it limits the use of other features for reducing the loss.
Based on this analysis, we have developed and evaluated an update-once method, in which we use the update formula in Equation 7 but limit weight updates so that once a feature is selected on a certain iteration and its weight parameter is updated, it cannot be updated again. Using this method, the weights of the strong features are not allowed to prevent additional features from being considered during the training phase.
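One boosting round under the update-once constraint can be sketched as follows; `stats` is a hypothetical precomputed table of the margin-weighted masses W+ and W− per feature, and the gain proxy |√W+ − √W−| follows Collins and Koo's feature selection criterion:

```python
import math

def update_once_round(stats, Z, eps, alphas, used):
    """One boosting round with the update-once constraint.

    stats: hypothetical dict feature -> (W_plus, W_minus).
    Picks the unused feature with the largest gain proxy
    |sqrt(W+) - sqrt(W-)|, applies the smoothed update
    0.5 * log((W+ + eps*Z) / (W- + eps*Z)), then freezes it.
    Returns the selected feature, or None if all are frozen.
    """
    best_k, best_gain = None, 0.0
    for k, (wp, wm) in stats.items():
        if k in used:
            continue
        gain = abs(math.sqrt(wp) - math.sqrt(wm))
        if gain > best_gain:
            best_k, best_gain = k, gain
    if best_k is None:
        return None
    wp, wm = stats[best_k]
    alphas[best_k] = (alphas.get(best_k, 0.0)
                      + 0.5 * math.log((wp + eps * Z) / (wm + eps * Z)))
    used.add(best_k)   # never updated again
    return best_k
```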
3.3 Regularized Reranking

Although the update-once method may attenuate over-fitting to some extent, it also prevents adjusting the value of any weight parameter that is initially set too high or too low in an earlier boosting round. In order to design a more sophisticated weight update method that allows multiple updates in both directions while penalizing overly large weights, we have also investigated the addition of a regularization term R(ᾱ), an exponential function of ᾱ, to the loss function:

Loss_R(ᾱ) = Σ_i Σ_{j≥2} S_{i,j} exp(−M_{i,j}(ᾱ)) + Σ_k p_k (e^{−α_k} + e^{α_k} − 2),
where p_k is the penalty weight of parameter α_k. The reason that we chose this form of regularization is that (e^{−α_k} + e^{α_k} − 2) is a symmetric, monotonically increasing function of |α_k|, and, more importantly, it provides a closed analytical expression of the weight update formula similar to Equations 5 and 6. Hence, the best feature/update pair for the regularized loss function is defined as follows:

(k*, δ*) = argmin_{k,δ} [ W_{k+} e^{−δ} + W_{k−} e^{+δ} + p_k (e^{−(α_k+δ)} + e^{α_k+δ}) ].
There are many ways of choosing p_k, the penalty weight of α_k. In this paper, we use the values of θ · (W_{k+} + W_{k−}) at the beginning of the first iteration (after α_0 is determined) for p_k, where θ is a weighting parameter to be tuned on the development set.
The regularized weight update formula has many advantages. It is always well defined no matter what values W_{k+} and W_{k−} take, in contrast to Equation 6. For all features, even in the case when either W_{k+} or W_{k−} equals zero, the regularized update formula allows weight updates in two directions. If the weight is small, W_{k+} and W_{k−} have more impact on determining the weight update direction; however, when the weight becomes large, the regularization factors p_k e^{−α_k} and p_k e^{+α_k} favor reducing the weight.
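Setting the derivative of the penalized per-feature loss above to zero gives a closed-form update; this is a sketch of that derivation, under the assumption that only feature k's terms change:

```python
import math

def regularized_update(wp, wm, alpha, pk):
    """Closed-form update delta* for the loss with the penalty
    pk * (exp(-alpha_k) + exp(alpha_k) - 2):

        delta* = 0.5 * log((W+ + pk*e^{-alpha}) / (W- + pk*e^{+alpha}))

    Well defined even when W+ or W- is zero; the pk terms pull a
    large |alpha| back toward zero, as described above.
    """
    return 0.5 * math.log((wp + pk * math.exp(-alpha))
                          / (wm + pk * math.exp(alpha)))
```

With W+ = W− = 0 and a current weight of 1.0, for example, the update is exactly −1.0, returning the weight to zero.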
3.4 Reranking Features

A reranking model has the flexibility of incorporating any type of feature extracted from the N-best candidates. For the work presented in this paper, we examine three types of features. For each window of three word/tag pairs, we extract all the n-grams, except those that are comprised of only one word/tag pair, or only tags, or only words, or that do not include either the word or the tag of the center word/tag pair. These constitute the n-gram feature set.
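A simplified sketch of this extraction (not the paper's exact template list): for each of the three positions we pick the word, the tag, both, or nothing, then apply the exclusion rules above.

```python
from itertools import product

def window_ngrams(window):
    """N-gram features from a window of three (word, tag) pairs.

    Drops features drawn from a single pair, features with only tags
    or only words, and features that use neither the center word nor
    the center tag.
    """
    feats = set()
    choices = [(), ('w',), ('t',), ('w', 't')]
    for pick in product(choices, repeat=3):
        pairs_used = [i for i, p in enumerate(pick) if p]
        if len(pairs_used) < 2:          # only one word/tag pair
            continue
        items = [(i, kind) for i, p in enumerate(pick) for kind in p]
        kinds = {k for _, k in items}
        if kinds in ({'w'}, {'t'}):      # only words or only tags
            continue
        if not pick[1]:                  # center pair not involved
            continue
        feat = tuple(window[i][0] if k == 'w' else window[i][1]
                     for i, k in items)
        feats.add(('ngram',) + feat)
    return feats
```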
In order to better handle unknown words, we also extract the two most important types of morphological features⁵ that were utilized in (Tseng et al., 2005) for those words that appear no more than seven times (following their convention) in the training set:

Affixation features: we use character n-gram prefixes and suffixes for n up to 4. For example, for the word/tag pair D14</NN (information-bag, i.e., folder), we add the following features: (prefix1, D, NN), (prefix2, D4, NN), (prefix3, D4<, NN), (suffix1, <, NN), (suffix2, ™<, NN), (suffix3, D4<, NN).
AffixPOS features⁶: we used the training set to build a prefix/POS and a suffix/POS dictionary associating possible tags with each prefix and suffix in the training set. The AffixPOS features indicate the set of tags a given affix could have. For the same example D44</NN, D occurred as a prefix in both NN and VV words in the training data. So we add the following features based on the prefix D: (prefix, D, NN, 1, NN), (prefix, D, VV, 1, NN), and (prefix, D, X, 0, NN) for every tag X not in {NN, VV}, where 1 and 0 are indicator values. Features are extracted in a similar way for the suffix <.

⁵Tseng et al. also used other morphological features that require additional resources to which we do not have access.
⁶AffixPOS features are somewhat different from the CTB-Morph features used in (Tseng et al., 2005), where a morpheme/POS dictionary with the possible tags for all morphemes in the training set was used instead of two separate dictionaries for prefix and suffix. AffixPOS features perform slightly better in our task than the CTB-Morph features.
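The AffixPOS extraction for a length-1 prefix can be sketched as follows; `prefix_tags` is a hypothetical dictionary built from the training set:

```python
def affix_pos_features(word, tag, prefix_tags, all_tags):
    """AffixPOS features for the length-1 prefix of a rare word.

    prefix_tags: hypothetical dict prefix-char -> set of tags it was
    seen with in training. Emits (kind, affix, X, indicator, tag) for
    every tag X, with indicator 1 iff the affix occurred with X.
    """
    c = word[0]
    seen = prefix_tags.get(c, set())
    return {('prefix', c, x, 1 if x in seen else 0, tag)
            for x in all_tags}
```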
The n-gram and morphological features are easy to compute; however, they have difficulty capturing the long-distance information related to syntactic relationships that might help POS tagging accuracy. In order to examine the effectiveness of utilizing syntactic information in tagging, we have also experimented with dependency features that are extracted based on automatic parse trees.
First, a bracketing parser (the Charniak parser (Charniak, 2000) in our case) is used to generate the parse tree of a sentence; then the const2dep tool developed by Hwa was utilized to convert the bracketing tree to a dependency tree based on the head percolation table developed by the second author. The dependency tree is comprised of a set of dependency relations among word pairs.
A dependency relation is a triple (word-a, word-b, relation), in which word-a is governed by word-b with the grammatical relation denoted as relation. For example, in the sentence "HM (Tibet) ^N (economy) Hii (construction) M— (achieves) >W (significant) ^^ (accomplishments)", one example dependency relation is (M—, mm, mod). Given these dependency relations, we then extract dependency features (in total 36 features for each relation) by examining the POS tags of the words for each tagging candidate of a sentence. The relative positions of the word pairs are also taken into account for some features.
For example, if M— and mm in the above sentence are tagged as VV and NN respectively in one candidate, then two example dependency features are (dep-1, M—, VV, mm, NN, mod) and (dep-14, M—, VV, NN, right, mod), in which dep-1 and dep-14 are feature types and right indicates that word-b (M—) is to the right of word-a (mm).
4 Experiments

The most recently released Penn Chinese Treebank (version 5.2) is used in our experiments. It contains 500K words, 800K characters, 18K sentences, and 900 data files, including articles from the Xinhua news agency (Mainland China), the Information Services Department of HKSAR (Hong Kong), and Sinorama magazine (Taiwan). Its format is similar to the English WSJ Penn Treebank, and it was carefully annotated. There are 33 POS tags used, to which we add tags to discriminate among punctuation types. The original POS tag for punctuation was PU; we created new POS tags for each distinct punctuation type (e.g., PU-?).
The CTB corpus was collected during different time periods from different sources with a diversity of articles. In order to obtain a representative split of training, development, and test sets, we divide the whole corpus into blocks of 10 files in sorted order. For each block, the first file is used for development, the second file is used for test, and the remaining 8 files are used for training.
Table 1 gives the basic statistics on the data. The development set is used to determine the parameter λ in Equation 4, the smoothing parameter ε in Equation 7, the weight parameter θ described in Section 3.3, and the number of boosting rounds in the reranking model.
In order to train the reranking model, the method in (Collins and Koo, 2005) is used to prepare the N-best training examples. We divided the training set into 20 chunks, with each chunk N-best tagged by the HMM model trained on the combination of the other 19 chunks. The development set is N-best tagged by the HMM model trained on the training set, and the test set is N-best tagged by the HMM model trained on the combination of the training set and the development set.
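The jackknife pairing of held-out chunks with their training material can be sketched as follows (a sketch of the recipe, not the actual tagging pipeline):

```python
def jackknife_chunks(training, n_chunks=20):
    """Pair each chunk with the material used to train the model that
    N-best tags it: chunk i is tagged by a model trained on the other
    n_chunks - 1 chunks, following Collins and Koo's recipe."""
    size = (len(training) + n_chunks - 1) // n_chunks
    chunks = [training[i:i + size] for i in range(0, len(training), size)]
    return [(chunks[i],
             [s for j, c in enumerate(chunks) if j != i for s in c])
            for i in range(len(chunks))]
```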
Table 1: The basic statistics on the data.
In the following subsections, we will first examine the HMM models alone to determine the best HMM configuration to use to generate the N-best candidates, and then evaluate the reranking models. Finally, we compare our performance with previous work. In this paper, we use the sign test with p < 0.01 to evaluate the statistical significance of the difference between the performances of two models.
The baseline HMM model ported directly from the English tagger, as described in Subsection 2.1, has an overall tag accuracy of 93.12% on the test set, which is fairly low compared to the 97% accuracy of many state-of-the-art taggers on WSJ for English. By approximating the unknown word emission probability using the characters in the word as in Equation 3, the performance of the HMM tagger improves significantly to 93.43%, suggesting that characters in different positions of a Chinese word help to disambiguate the word class of the entire word, in contrast to English, for which suffixes are most helpful.
Figure 1 depicts the impact of combining the left-to-right and right-to-left word emission models using different weighting values (i.e., λ) on the development set. Note that the emission probabilities of unknown words are estimated based on characters using the same λ for combination. When λ = 1.0, the model uses only the standard left-to-right prediction of words, while when λ = 0, it uses only the right-to-left estimation. It is interesting to note that the right-to-left estimation results in greater accuracy than the left-to-right estimation. This might be because there is stronger interaction between a word and its next tag. Also, as shown in Figure 1, the estimations in the two directions are complementary to each other, with λ = 0.5 performing best.
The performance of the HMM taggers on the test set is given in Table 2 for the best operating point, as well as the two other extreme operating points, to compare the left-to-right and right-to-left constraints. Our best HMM tagger further improves the tag accuracy significantly from 93.43% (λ = 1.0) to 94.01% (λ = 0.5).
Figure 1: The accuracy of the HMM tagger on the development set with various λ values for combining the word emission probabilities.
Table 2: The performance of various HMM taggers on the test set.
4.3 Results of the Reranking Models

The HMM tagger with the best accuracy (i.e., the one with λ = 0.5 in Table 2) is used to generate the N-best tagging candidates, with a maximum of 100 candidates. As shown in Table 3, a maximum of 100-best provides a reasonable margin for improvement in the reranking task.
We first test the performance of the reranking methods using only the n-gram feature set, which contains around 18 million features. Later, we will investigate the addition of morphological features and dependency features.
The smoothing parameter ε (for Collins' method and the update-once method) and the weight parameter θ (for the regularization method) both have great impact on reranking performance. We trained various reranking models with ε values of 0.0001 × {1, 2.5, 5, 7.5, 10, 25, 50, 75, 100}, and θ values of {0.1, 0.25, 0.5, 0.75, 1}.
For all these parameter values, 600,000 rounds of iterations were executed on the training set. The development set was used to determine the early stopping point in training. If not mentioned explicitly, all the results reported are based on the best parameters tuned on the development set.
Table 3: The oracle tag accuracies of the 1-best, 50-best, and 100-best candidates in the training, development, and test sets for the reranking experiments. Note that the tagging candidates are prepared using the method described in Subsection 4.1.
Table 4 reports the performance of the best HMM tagger and the three reranking taggers on the test set. All three reranking methods improve the HMM tagger significantly. Also, the update-once and regularization methods both outperform Collins' original training method significantly.

Table 4: The performance on the test set of the HMM tagger and the reranking methods using the n-gram features.
Table 6: The performance on the test set of the HMM tagger and the reranking methods using n-gram and morphological features.
We observed that no matter which value the smoothing parameter ε takes, there are only about 10,000 non-zero features finally selected by Collins' original method. In contrast, the two new methods select substantially more features, as shown in Table 5.
As mentioned before, there are some strong features that only appear in positive or negative samples, i.e., either W_+ or W_− equals zero. Although introducing the smoothing parameter ε in Equation 7 prevents infinite weight values, the update to the feature weights is no longer optimal (in terms of minimizing the error function). Since the update is not optimal, subsequent iterations may still focus on these features (and thus ignore other weaker but informative features) and always increase their weights in one direction, leading to biased training.
The update-once method at each iteration selects a new feature that has the most impact in reducing the training loss function. It has the advantage of preventing increasingly large weights from being assigned to the strong features, enabling the update of other features.
The regularization method allows multiple updates and also penalizes large weights. Once a feature is selected and has its weight updated, no matter how strong the feature is, the weight value is optimal in terms of the current weights of the other features, so the training algorithm will choose another feature to update. A previously selected feature may be selected again if it becomes suboptimal due to a change in the weights of other features.
Table 5: The number of iterations (for the best performance), the number of selected features, and the percentage of selected features, by Collins' method, the update-once method, and the regularization method on the development set.
We next add morphological features to the n-gram features selected by the reranking methods⁷. As can be seen by comparing Table 6 to Table 4, morphological features improve the tagging accuracy of unknown words. It should be noted that the improvement made by both the update-once and regularization methods is statistically significant over using n-gram features alone; however, the improvement by Collins' original method is not significant. This suggests that the two new methods are able to utilize a greater variety of features than the original method.
We trained several Charniak parsers using the same method as for the HMM taggers to generate automatic parse trees for the training, development, and test data. The update-once method is used to evaluate the effectiveness of dependency features for reranking, as shown in Table 7. The parser has an overall tagging accuracy that is greater than that of the best HMM tagger, but worse than that of the reranking models using n-gram and morphological features.
It is interesting to note that reranking with the dependency features alone improves the tagging accuracy significantly, outperforming the reranking models using n-gram and morphological features. This suggests that the long-distance features based on the syntactic structure of the sentence are very beneficial for POS tagging of Mandarin. Moreover, the n-gram and morphological features are complementary to the dependency features, with their combination performing the best. The n-gram features improve the accuracy on known words, while the morphological features improve the accuracy on unknown words. The best accuracy of 95.11% is an 18% relative reduction in error compared to the best HMM tagger.
⁷Because the size of the combined feature set of all n-gram features and morphological features is too large to be handled by our server, we chose to add morphological features to the n-gram features selected by the reranking methods, and then retrain the reranking model.
Table 7: The tagging performance of the parser and the update-once reranking models with dependency features and their combination with n-gram and morphological features.
4.4 Comparison to Previous Work

How does our performance compare to previous work? When working on the same training/test data (CTB 5.0 with the same pre-processing procedures) as in (Tseng et al., 2005), our HMM model obtained an accuracy of 93.72%, as compared to their 93.74% accuracy. Our reranking model⁸ using n-gram and morphological features improves the accuracy to 94.16%.
Note that we did not use all the morphological features as in (Tseng et al., 2005), which would probably provide additional improvement. The dependency features are expected to further improve the performance, although they are not included here in order to provide a relatively fair comparison.
5 Conclusions and Future Work

We have shown that the characters in a word are informative of the POS type of the entire word in Mandarin, reflecting the fact that individual Chinese characters carry POS information to some degree. The syntactic relationship among characters may provide further information, which we leave as future work.
We have also shown that the additional right-to-left estimation of word emission probabilities is useful for HMM tagging of Mandarin. This suggests that explicit modeling of bidirectional interactions captures more sequential information, which could possibly help in other sequential modeling tasks.
We have also investigated using the reranking algorithm in (Collins and Koo, 2005) for the Mandarin POS tagging task, and found it quite effective in improving tagging accuracy.

⁸Tseng et al.'s training/test split uses up the entire CTB corpus, leaving no development data for tuning parameters. In order to roughly measure reranking performance, we use the update-once method to train the reranking model for 600,000 rounds with the other parameters tuned in Section 4. This sacrifices performance to some extent.
The original algorithm has a tendency to focus on a small subset of strong features and to ignore some of the other useful features. We were able to improve the performance of the reranking algorithm by utilizing two different methods that make better use of more features. Both are simple and yet effective.
The effectiveness of dependency features suggests that syntax-based long-distance features are important for improving part-of-speech tagging performance in Mandarin. Although parsing is computationally more demanding than tagging, we hope to identify related features that can be extracted more efficiently.
In future efforts, we plan to extract additional reranking features utilizing more explicitly the characteristics of Mandarin. We also plan to extend our work to speech transcripts for Broadcast News and Broadcast Conversation corpora, and to explore semi-supervised training methods for reranking.
Acknowledgments

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-06-C-0023. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA. We gratefully acknowledge the comments from the anonymous reviewers.
