We consider the problem of Chinese named entity (NE) identification using a statistical language model (LM). In this research, word segmentation and NE identification have been integrated into a unified framework that consists of several class-based language models. We also adopt a hierarchical structure for one of the LMs so that nested entities in organization names can be identified. Our experiments further demonstrate the improvement obtained after seamlessly integrating linguistic heuristic information, a cache-based model, and NE abbreviation identification.
1 Introduction

NE identification is a key technique in many applications such as information extraction, question answering, and machine translation. English NE identification has achieved great success. For Chinese, however, NE identification is very different: there is no space to mark word boundaries and no standard definition of words in Chinese. Chinese NE identification and word segmentation are therefore interactional in nature. This paper presents a unified approach that integrates these two steps using a class-based LM, and applies Viterbi search to select the globally optimal solution.
The class-based LM consists of two sub-models, namely the context model and the entity model. The context model estimates the probability of generating an NE given a certain context, and the entity model estimates the probability of a sequence of Chinese characters given a certain kind of NE.
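This two-part decomposition can be sketched in Python. The class names, the toy probabilities, and the `score` helper below are illustrative assumptions, not the paper's actual parameters; a candidate class sequence is scored as log P(C) + log P(S|C).

```python
import math

def score(class_seq, context_model, entity_model):
    """Score a candidate class sequence C for a sentence S as
    log P(C) + log P(S|C): context model times entity model."""
    log_p = 0.0
    prev = "<BOS>"
    for cls, surface in class_seq:
        # context model: probability of this class given its history
        log_p += math.log(context_model[(prev, cls)])
        # entity model: probability of the surface string given the class
        log_p += math.log(entity_model[(cls, surface)])
        prev = cls
    return log_p

# Toy, made-up probabilities for illustration only.  Ordinary words
# act as their own classes, as in the |V|+4-class scheme.
context_model = {("<BOS>", "PER"): 0.1, ("PER", "visited"): 0.4,
                 ("visited", "LOC"): 0.3}
entity_model = {("PER", "Li Dapeng"): 0.01, ("visited", "visited"): 1.0,
                ("LOC", "Beijing"): 0.05}
cand = [("PER", "Li Dapeng"), ("visited", "visited"), ("LOC", "Beijing")]
print(score(cand, context_model, entity_model))
```

In decoding, this score would be computed for every candidate class sequence and the highest-scoring one selected.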
In this study, we are interested in the three kinds of Chinese NE that are most commonly used, namely person names (PER), location names (LOC), and organization names (ORG). We have also adopted a variety of approaches to improving the LM.
In addition, a hierarchical structure for the organization LM is employed so that nested PER and LOC within ORG can be identified. The evaluation is conducted on a large test set in which NEs have been manually tagged. The experimental results show consistent improvements over existing methods.
Our experiments further demonstrate the improvement obtained after integrating linguistic heuristic information, a cache-based model, and NE abbreviation identification. The precision of PER, LOC, and ORG on the test set is 79.86%, 80.88%, and 76.63%, respectively; the recall is 87.29%, 82.46%, and 56.54%, respectively.
2 Related Work

Recently, research on English NE identification has focused on machine-learning approaches, including the hidden Markov model (HMM), the maximum entropy model, decision trees, transformation-based learning, etc. (Bikel et al., 1997; Borthwick et al., 1999; Sekine et al., 1998). Some systems have been applied in real applications. Research on Chinese NE identification is, however, still at an early stage. Some studies apply methods of English NE identification to Chinese. Yu et al. (1997) applied the HMM approach, in which NE identification is formulated as a tagging problem solved with the Viterbi algorithm.

1 This work was done while the author was visiting Microsoft Research Asia.
In general, current approaches to NE identification (e.g. Chen, 1997) usually contain two separate steps: word segmentation and NE identification. Word segmentation errors will inevitably lead to errors in the NE identification results. Zhang (2001) put forward a class-based LM for Chinese NE identification. We further develop this idea with some new features, which leads to a new framework. In this framework, we integrate Chinese word segmentation and NE identification into a unified framework using a class-based language model (LM).
3 Class-based LM for NE Identification

The n-gram LM is a stochastic model that predicts the next word given the previous n-1 words by estimating the conditional probability P(wn | w1 ... wn-1).
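The maximum-likelihood estimate of such an n-gram probability can be sketched with simple counting; `train_trigram` and the toy sentences below are illustrative assumptions, not the paper's training procedure.

```python
from collections import Counter

def train_trigram(sentences):
    """MLE trigram model:
    P(w_n | w_{n-2} w_{n-1}) = count(w_{n-2} w_{n-1} w_n) / count(w_{n-2} w_{n-1})."""
    tri, bi = Counter(), Counter()
    for s in sentences:
        words = ["<s>", "<s>"] + s + ["</s>"]  # pad the history
        for i in range(2, len(words)):
            tri[tuple(words[i - 2:i + 1])] += 1
            bi[tuple(words[i - 2:i])] += 1
    def prob(w, h2, h1):
        return tri[(h2, h1, w)] / bi[(h2, h1)] if bi[(h2, h1)] else 0.0
    return prob

p = train_trigram([["a", "b", "c"], ["a", "b", "d"]])
print(p("c", "a", "b"))  # "c" follows the history "a b" in 1 of 2 cases
```

In practice such raw MLE counts would be smoothed; this sketch omits smoothing for brevity.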
In practice, the trigram approximation P(wi | wi-2 wi-1) is widely used, assuming that the word wi depends only on the two preceding words wi-2 and wi-1. Brown et al. (1992) put forward and discussed n-gram models based on classes of words. In this section, we describe how to use a class-based trigram model for NE identification.
Each kind of NE (including PER, LOC, and ORG) is defined as a class in the model. In addition, we differentiate transliterated person names (FN) from Chinese person names (PN), since they have different constitution patterns. The four classes of NE used in our model are shown in Table 1. All other words are also defined as individual classes themselves (i.e. one word as one class). Consequently, there are |V|+4 classes in our model, where |V| is the size of the vocabulary.
Table 1: Classes defined in the class-based model

Class  Description
PN     Chinese person name
FN     Transliterated person name
LOC    Location name
ORG    Organization name
3.1 The Language Modeling

Given a Chinese character sequence S = s1 ... sn, the task of Chinese NE identification is to find the class sequence C that maximizes the posterior probability P(C|S), i.e.

C* = argmax_C P(C) P(S|C)    (1)

The class-based model consists of two sub-models: the context model P(C) and the entity model P(S|C).
The context model indicates the probability of generating an NE class given the (previous) context. P(C) is a prior probability, which is computed according to Equation (2):

P(C) = Π_i P(ci | ci-2 ci-1)    (2)

P(C) can be estimated using an NE-labeled corpus. The entity model can be parameterized by Equation (3):

P(S|C) = Π_j P(s(cj-start) ... s(cj-end) | cj)    (3)
The entity model estimates the generative probability of the Chinese character sequence within a square-bracket pair (i.e. from cj-start to cj-end) given the specific NE class. For different classes, we define different entity models. For the class PER (including PN and FN), the entity model is a character-based trigram model, as shown in Equation (4):

P(s(cj-start) ... s(cj-end) | PER) = Π_i P(si | si-2 si-1, PER)    (4)
where s can be any character occurring in a person name. For example, the generative probability of the character sequence 李大鹏 (Li Dapeng) given PER is much larger than that of a sequence meaning "many years", since 李 (Li) is a commonly used family name, and 大 (Da) and 鹏 (Peng) are commonly used given-name characters. The probabilities can be estimated with the person name list.
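The character-based entity model of Equation (4) can be sketched as an MLE character trigram estimated from a name list; `char_trigram_model`, the padding symbols, and the toy list below are illustrative assumptions rather than the paper's exact estimator.

```python
from collections import Counter

def char_trigram_model(name_list):
    """Entity model for PER: P(s_1 .. s_m | PER) as a product of
    character trigrams estimated from a person-name list."""
    tri, bi = Counter(), Counter()
    for name in name_list:
        chars = ["<b>", "<b>"] + list(name) + ["<e>"]
        for i in range(2, len(chars)):
            tri[tuple(chars[i - 2:i + 1])] += 1
            bi[tuple(chars[i - 2:i])] += 1
    def prob(name):
        chars = ["<b>", "<b>"] + list(name) + ["<e>"]
        p = 1.0
        for i in range(2, len(chars)):
            h, t = tuple(chars[i - 2:i]), tuple(chars[i - 2:i + 1])
            p *= tri[t] / bi[h] if bi[h] else 0.0
        return p
    return prob

p = char_trigram_model(["abc", "abd"])
# character sequences resembling listed names get non-zero probability
print(p("abc"), p("xyz"))
```

Without smoothing, any trigram unseen in the name list zeroes out the whole product, which is why the unsmoothed sketch assigns 0.0 to "xyz".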
For the class LOC, the entity model is a word-based trigram model, as shown in Equation (5), where W = w1 ... wt is a possible segmentation result of the character sequence s(cj-start) ... s(cj-end).
For the class ORG, the construction is much more complicated, because an ORG often contains a PER and/or a LOC. For example, the ORG 中国国际航空公司 (Air China Corporation) contains the LOC 中国 (China).
It is beneficial to applications such as question answering and information extraction if nested NEs can be identified as well.
In order to identify nested PER and LOC in ORG (2), we further adopt a class-based LM for ORG, which contains three sub-models: one is the class generative model, and the others are entity models, namely the person name model and the location name model within ORG. Therefore, the entity model of ORG is given by Equation (6), which has almost the same form as Equation (1).
The Chinese PER and the transliterated PER share the same context class model when computing the probability.
As discussed in Section 3.1.1, there are two kinds of probabilities to be estimated: P(C) and P(S|C). Both are estimated using Maximum Likelihood Estimation (MLE) on the annotated training corpus.
The parser NLPWin (3) was used to tag the training corpus. As a result, the corpus was annotated with NE marks. Four lists were extracted from the annotated corpus, each list corresponding to one NE class. The context model P(C) was trained on the annotated corpus, and the four entity models were trained on the corresponding NE lists. Figure 1 shows the training process.
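This training step can be sketched as collapsing each labeled NE span in an annotated sentence to one class token, which yields the class sequences the context model is counted over; `to_class_sequence`, its label names, and the toy example are hypothetical.

```python
def to_class_sequence(tagged):
    """Collapse each labeled NE span to its class token; every other
    word remains its own class, as in the |V|+4-class scheme.
    Note: adjacent same-class entities would merge in this sketch."""
    seq = []
    for token, label in tagged:
        if label in ("PN", "FN", "LOC", "ORG"):
            if not seq or seq[-1] != label:  # one class token per NE span
                seq.append(label)
        else:
            seq.append(token)
    return seq

tagged = [("Li", "PN"), ("Dapeng", "PN"), ("visited", "O"), ("Beijing", "LOC")]
print(to_class_sequence(tagged))
```

Counting n-grams over these class sequences gives the context model, while the collapsed spans themselves feed the per-class entity lists.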
Figure 1: Example of the Training Process (begin-of-sentence (BOS) and end-of-sentence (EOS) marks are added; the figure shows a Chinese sentence, its class list, and the corresponding English sentence).
Based on the context model and the entity models, we can compute the probability P(C|S) and obtain the optimal class sequence.
2 For simplification, only nested person and location names in organizations are identified. Nested person names in locations are not identified because of their low frequency.
Given a sequence of Chinese characters, the decoding process consists of the following three steps:

Step 1: All possible word segmentations are generated using a Chinese lexicon containing 120,050 entries.
The lexicon is used only for segmentation; it contains no NE tags, even for words that are themselves a PER, LOC, or ORG. For example, 北京 (Beijing) is not tagged as LOC in the lexicon.

3 The NLPWin system is a natural language processing system developed by Microsoft Research.
Step 2: NE candidates are generated from any one or more segmented character strings, and the corresponding generative probability of each candidate is computed using the entity models described in Equations (4)-(7).

Step 3: Viterbi search is used to select the hypothesis with the highest probability as the best output.
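The three steps can be sketched as a Viterbi-style dynamic program over the segmentation lattice; `viterbi_segment` and its `log_prob` argument (a stand-in for the combined context/entity model scores) are simplified assumptions, not the paper's exact two-pass decoder.

```python
import math

def viterbi_segment(sentence, lexicon, log_prob):
    """Enumerate lexicon matches as lattice edges (Step 1-2), then pick
    the highest-scoring path by dynamic programming (Step 3)."""
    n = len(sentence)
    best = [(-math.inf, None)] * (n + 1)  # best[i] = (score, backpointer)
    best[0] = (0.0, None)
    for i in range(n):
        if best[i][0] == -math.inf:
            continue
        for j in range(i + 1, n + 1):
            w = sentence[i:j]
            if w in lexicon:
                s = best[i][0] + log_prob(w)
                if s > best[j][0]:
                    best[j] = (s, (i, w))
    out, k = [], n  # backtrace from the end of the sentence
    while k > 0:
        i, w = best[k][1]
        out.append(w)
        k = i
    return out[::-1]

lex = {"北", "京", "北京", "市"}
# toy score that favors longer words; a real system would use the LM
print(viterbi_segment("北京市", lex, lambda w: math.log(len(w))))
```

In the full system, NE candidates are extra lattice edges scored by the entity models, so segmentation and NE identification are decided jointly by the same search.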
Furthermore, in order to identify nested named entities, a two-pass Viterbi search is adopted. The inner Viterbi search corresponds to Equation (6) and the outer one to Equation (1). After the two passes, the word segmentation and the named entities (including nested ones) are obtained.
There are some problems with this framework of NE identification using the class-based LM. First, redundant NE candidates are generated in the decoding process, which results in a very large search space. Second, data sparseness seriously degrades performance. Finally, abbreviations of NEs cannot be handled effectively. In the following three subsections, we provide solutions to these three problems.
In order to overcome the redundant candidate generation problem, heuristic information is introduced into the class-based LM. The following resources were used: (1) a Chinese family name list, containing 373 entries (e.g. 张 (Zhang), 王 (Wang)); (2) a transliterated name character list, containing 618 characters (e.g. the characters pronounced shi and dun); and (3) an ORG keyword list, containing 1,355 entries (e.g. 大学 (university), 公司 (corporation)).
The heuristic information is used to constrain the generation of NE candidates. For PER (PN), only PER candidates beginning with a family name are considered. For PER (FN), a candidate is generated only if all of its composing characters belong to the transliterated name character list. For ORG, a candidate is excluded if it does not contain an ORG keyword. We do not use LOC keywords to generate LOC candidates, because many LOCs do not end with keywords.
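These filters can be sketched as a single predicate over candidates; `keep_candidate` and its arguments are hypothetical names, and the tiny sets below stand in for the 373-entry, 618-character, and 1,355-entry resources.

```python
def keep_candidate(cls, text, family_names, translit_chars, org_keywords):
    """Heuristic filters on NE candidates: PN must start with a family
    name, FN must consist only of transliteration characters, and ORG
    must contain an organization keyword."""
    if cls == "PN":
        return any(text.startswith(f) for f in family_names)
    if cls == "FN":
        return all(ch in translit_chars for ch in text)
    if cls == "ORG":
        return any(k in text for k in org_keywords)
    return True  # LOC candidates are not keyword-filtered

# toy checks with stand-in resource lists
print(keep_candidate("ORG", "北京大学", set(), set(), {"大学", "公司"}))
print(keep_candidate("PN", "张三", {"张"}, set(), set()))
```

Filtering candidates before the Viterbi search is what shrinks the lattice, which explains both the decoding speedup and the precision gain reported later.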
The cache entity model addresses the data sparseness problem by adjusting the parameters continually as NE identification proceeds. The basic idea is to accumulate the Chinese character or word n-grams that have appeared so far in the document and use them to create a local dynamic entity model, which is interpolated with the static entity model, where λ1, λ2 ∈ [0,1] are interpolation weights determined on a held-out data set.
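The interpolation of the static and cache models can be sketched as follows; for simplicity this uses a single weight `lam` in place of the two weights λ1 and λ2, so it is an illustrative assumption rather than the paper's exact formula.

```python
def interpolated_prob(static_p, cache_p, lam):
    """Linear interpolation of the static entity model with the
    document-local cache model: P = (1 - lam) * P_static + lam * P_cache."""
    assert 0.0 <= lam <= 1.0
    return (1.0 - lam) * static_p + lam * cache_p

# an n-gram rare in training (0.2) but frequent in this document (0.8)
print(interpolated_prob(0.2, 0.8, 0.25))
```

As the document is processed, the cache probabilities grow for locally repeated n-grams, pulling the interpolated estimate toward in-document usage.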
We found that many errors result from the occurrence of abbreviations of person, location, and organization names. Therefore, different strategies are adopted to deal with the abbreviations of the different kinds of NEs.
For PER, if a Chinese surname is followed by a title, the surname is tagged as PER. For example, in the phrase meaning "President Zuo", the surname 左 (Zuo) is tagged as <PER>左</PER>, followed by the title.
For LOC, if at least two location abbreviations occur consecutively, each individual location abbreviation is tagged as LOC. For example, 中日关系 (Sino-Japan relations) is tagged as <LOC>中</LOC><LOC>日</LOC>关系.
For ORG, if an organization abbreviation is followed by a LOC, which is in turn followed by an organization keyword, the three units are tagged together as one ORG. For example, 中共北京市委 (the Chinese Communist Party Committee of Beijing) is tagged as <ORG>中共<LOC>北京</LOC>市委</ORG>.
At present, we have collected 112 organization abbreviations and 18 location abbreviations.
4 Experiments

4.1 Evaluation Metric

We conduct evaluations in terms of precision (P) and recall (R):

P = number of correctly identified NEs / number of identified NEs
R = number of correctly identified NEs / number of all NEs
We also used the F-measure, which is defined as a weighted combination of precision and recall, as in Equation (11):

F = (β² + 1) · P · R / (β² · P + R)    (11)

where β is the relative weight of precision and recall.

Table 2: Statistics of the Open Test
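These metrics can be computed directly from the three counts; `evaluate` is a hypothetical helper implementing the precision, recall, and β-weighted F-measure definitions above.

```python
def evaluate(n_correct, n_identified, n_gold, beta=1.0):
    """Precision, recall, and the beta-weighted F-measure."""
    p = n_correct / n_identified          # correct / identified
    r = n_correct / n_gold                # correct / all gold NEs
    f = (beta * beta + 1) * p * r / (beta * beta * p + r)
    return p, r, f

# toy counts: 80 correct out of 100 identified, 90 NEs in the gold standard
p, r, f = evaluate(80, 100, 90)
print(round(p, 3), round(r, 3), round(f, 3))
```

With beta=1 this reduces to the familiar harmonic mean F1 = 2PR / (P + R).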
There are two differences between the MET evaluation and ours. First, we include nested NEs in our evaluation, whereas MET does not. Second, in our evaluation, only NEs with both the correct boundary and the correct type label are counted as correct identifications. In MET, the evaluation is somewhat more flexible: for example, an NE may be scored as partially correct if the label is correct but the boundary is wrongly detected.
The annotated corpus was obtained after the raw corpus was parsed with NLPWin. We built the wide-coverage test data according to guidelines (4) that are the same as those of the 1999 IEER.
The test set (as shown in Table 2) contains half a million Chinese characters; it is a balanced test set covering 11 domains. It contains 11,844 sentences, 49.84% of which contain at least one NE. The characters in NEs account for 8.448% of all Chinese characters.
We can see that the test data is much larger than the MET test data and the IEER data.

4 The difference between IEER's guidelines and ours is that nested person and location names in organizations are tagged in our guidelines.
(Table 2 lists the number of NE tokens for each domain, including Computer, Entertainment, Literature, and Politics.)

4.3 Training Data Preparation
The training data produced by NLPWin contains some noise, for two reasons. First, the NE guideline used by NLPWin is different from the one we used. For example, NLPWin tags 北京市 (Beijing City) as <LOC>北京</LOC>市, whereas 北京市 should be one LOC in our definition. Second, there are some errors in the NLPWin results. We utilized 18 rules to correct the frequent errors. The following shows an example:
LN + Location Keyword
Table
4
shows
the
quality
of
our
training
corpus
.
Table
4
Quality
of
Training
Corpus
We conducted the following four experiments incrementally: (1) the class-based LM, whose results we view as the baseline performance; (2) integrating heuristic information with (1); (3) integrating the cache-based LM with (2); and (4) integrating NE abbreviation processing with (3).
Based on the basic class-based models estimated from the training data, we obtained the baseline performance shown in Table 5. Comparing Table 4 and Table 5, we found that the baseline performance is better than the quality of the training data.

Table 5: Baseline Performance
4.4.2 Integrating Heuristic Information

In this part, we examine the effects of using heuristic information. The results are shown in Table 6. In the experiments, we found that by integrating the heuristic information, we not only achieved more efficient decoding but also obtained higher NE identification precision.
For example, the precision of PER increases from 65.70% to 77.63%, and the precision of ORG increases from 56.55% to 81.23%. The reason is that the heuristic information reduces the influence of noise.
However, we noticed that the recall of PER and LOC decreased slightly, for two reasons. First, organization names without organization ending keywords were not marked as ORG. Second, Chinese names without surnames were also missed.
Table 6: Results of Heuristic Information Integrated into the Class-based LM
4.4.3 Integrating the Cache-based LM

Table 7 shows the evaluation results after the cache-based LM was integrated. From Table 6 and Table 7, we found that almost all of the precision and recall figures for PER, LOC, and ORG obtained slight improvements.
4.4.4 Integrating NE Abbreviation Processing

In this experiment, we integrated NE abbreviation processing. As shown in Table 8, the recall of PER, LOC, and ORG increases to 87.29%, 82.46%, and 56.54%, respectively (from 81.27% and 36.65% before abbreviation processing).

Table 8: Results of our system
From the above data, we observe that: (1) the class-based SLM performs better than the training data automatically produced by the parser; (2) distinct improvements are achieved by using heuristic information; and (3) our method of dealing with abbreviations increases the recall of NEs. In addition, the cache-based LM does not increase performance by much.
The reason is as follows. The cache-based LM is based on the hypothesis that a word used in the recent past is much more likely to be used again soon than its overall frequency in the language, or a trigram model, would suggest (Kuhn, 1990). However, we found that the same NE often varies its morphemes within the same document.
For example, the variant forms 中共北京市委 (the Chinese Communist Party Committee of Beijing), 北京市委 (the Committee of Beijing City), and 市委 (the Committee) of the same NE occur in order.
Furthermore, we notice that the segmentation dictionary has an important impact on the performance of NE identification. We do not think that adding more words to the dictionary is necessarily better. For example, because 中国人 (Chinese) is in our dictionary, it is quite possible that 中国 (China) within 中国人 is missed.
5 Evaluation with MET2 and IEER Test Data

We also evaluated on the MET2 test data and the IEER test data. The results are shown in Table 9. Our results on MET2 are lower than the best reported results of MUC7 (PER: precision 66%, recall 92%; LOC: precision 89%, recall 91%; ORG: precision 89%, recall 88%; http://www.itl.nist.gov).
We speculate on the reasons as follows. The main reason is that our class-based LM was estimated from a general-domain corpus, which is quite different from the domain of the MUC data. Moreover, we did not use an NE dictionary. Another reason is that our NE definitions are slightly different from those of MET2.

Table 9: Results on the MET2 Data and the IEER Data
6 Conclusions & Future Work

In this research, Chinese word segmentation and NE identification have been integrated into one framework using class-based language models (LMs). We adopted a hierarchical structure in the ORG model so that nested entities in organization names can be identified.
Another characteristic is that our NE identification does not use an NE dictionary during decoding. The evaluation on a large test set shows consistent improvements.
The integration of heuristic information improves the precision and recall of our system. The cache-based LM increases the recall of NE identification to some extent. Moreover, the rules dealing with abbreviations of NEs have increased the performance dramatically.
The precision of PER, LOC, and ORG is 79.86%, 80.88%, and 76.63%, respectively; the recall is 87.29%, 82.46%, and 56.54%, respectively.
In our future work, we will focus more on NE coreference using the language model. Second, we intend to extend our model to include a part-of-speech tagging model to improve performance. At present, the class-based LM is based on the general domain, and we may need to fine-tune the model for a specific domain.
ACKNOWLEDGEMENT

I would like to thank Ming Zhou, Jianfeng Gao, Changning Huang, Andi Wu, Hang Li, and other colleagues from Microsoft Research for their help. I especially want to thank Lei Zhang from Tsinghua University for his help in developing these ideas.
