In
this
paper
,
we
address
a
unique
problem
in
Chinese
language
processing
and
report
on
our
study
on
extending
a
Chinese
thesaurus
with
region-specific
words
,
mostly
from
the
financial
domain
,
from
various
Chinese
speech
communities
.
With
the
larger
goal
of
automatically
constructing
a
Pan-Chinese
lexical
resource
,
this
work
aims
at
taking
an
existing
semantic
classi-ficatory
structure
as
leverage
and
incorporating
new
words
into
it
.
In
particular
,
it
is
important
to
see
if
the
classification
could
accommodate
new
words
from
heterogeneous
data
sources
,
and
whether
simple
similarity
measures
and
clustering
methods
could
cope
with
such
variation
.
We
use
the
cosine
function
for
similarity
and
test
it
on
automatically
classifying
120
target
words
from
four
regions
,
using
different
datasets
for
the
extraction
of
feature
vectors
.
The
automatic
classification
results
were
evaluated
against
human
judgement
,
and
the
performance
was
encouraging
,
with
accuracy
reaching
over
85
%
in
some
cases
.
Thus
while
human
judgement
is
not
straightforward
and
it
is
difficult
to
create
a
PanChinese
lexicon
manually
,
it
is
observed
that
combining
simple
clustering
methods
with
the
appropriate
data
sources
appears
to
be
a
promising
approach
toward
its
automatic
construction
.
1
Introduction
Large-scale
semantic
lexicons
are
important
resources
for
many
natural
language
processing
(
NLP
)
tasks
.
For
a
significant
world
language
such
as
Chinese
,
it
is
especially
critical
to
capture
the
substantial
regional
variation
as
an
important
part
of
the
lexical
knowledge
,
which
will
be
useful
for
many
NLP
applications
,
including
natural
language
understanding
,
information
retrieval
,
and
machine
translation
.
Existing
Chinese
lexical
resources
,
however
,
are
often
based
on
language
use
in
one
particular
region
and
thus
lack
the
desired
comprehensiveness
.
Toward
this
end
,
Tsou
and
Kwong
(
2006
)
proposed
a
comprehensive
Pan-Chinese
lexical
resource
,
based
on
a
large
and
unique
synchronous
Chinese
corpus
as
an
authentic
source
for
lexical
acquisition
and
analysis
across
various
Chinese
speech
communities
.
To
allow
maximum
versatil-itty
and
portability
,
it
is
expected
to
document
the
core
and
universal
substances
of
the
language
on
the
one
hand
,
and
also
the
more
subtle
variations
found
in
different
communities
on
the
other
.
Different
Chinese
speech
communities
might
share
lexical
items
in
the
same
form
but
with
different
meanings
.
For
instance
,
the
word
jUM
refers
to
general
housing
in
Mainland
China
but
specifically
to
housing
under
the
Home
Ownership
Scheme
in
Hong
Kong
;
and
while
the
word
fijf
is
similar
to
jUM
to
mean
general
housing
in
Mainland
China
,
it
is
rarely
seen
in
the
Hong
Kong
context
.
Hence
,
the
current
study
aims
at
taking
an
existing
Chinese
thesaurus
,
namely
the
Tongyici
Cilin
,
as
leverage
and
extending
it
with
lexical
items
specific
to
individual
Chinese
speech
communities
.
In
particular
,
the
feasibility
depends
on
the
following
issues
:
(
1
)
Can
lexical
items
from
various
Chinese
speech
communities
,
that
is
,
from
such
heterogeneous
sources
,
be
classified
as
effectively
with
methods
shown
to
work
for
clustering
Proceedings
of
the
2007
Joint
Conference
on
Empirical
Methods
in
Natural
Language
Processing
and
Computational
Natural
Language
Learning
,
pp.
325-333
,
Prague
,
June
2007
.
©
2007
Association
for
Computational
Linguistics
closely
related
words
from
presumably
the
same
,
or
homogenous
,
source
?
(
2
)
Could
existing
semantic
classificatory
structures
accommodate
concepts
and
expressions
specific
to
individual
Chinese
speech
communities
?
Measuring
similarity
will
make
sense
only
if
the
feature
vectors
of
the
two
words
under
comparison
are
directly
comparable
.
There
is
usually
no
problem
if
both
words
and
their
contextual
features
are
from
the
same
data
source
.
Since
Tongyici
Cilin
(
or
simply
Cilin
hereafter
)
is
based
on
the
vocabulary
used
in
Mainland
China
,
it
is
not
clear
how
often
these
words
will
be
found
in
data
from
other
places
,
and
even
if
they
are
found
,
how
well
the
feature
vectors
extracted
could
reflect
the
expected
usage
or
sense
.
Our
hypothesis
is
that
it
will
be
more
effective
to
classify
new
words
from
Mainland
China
with
respect
to
Cilin
categories
,
than
to
do
the
same
on
new
words
from
regions
outside
Mainland
China
.
Furthermore
,
if
this
hypothesis
holds
,
one
would
need
to
consider
separate
mechanisms
to
cluster
heterogeneous
region-specific
words
in
the
Pan-Chinese
context
.
Thus
in
the
current
study
we
sampled
30
target
words
specific
to
each
of
Beijing
,
Hong
Kong
,
Singapore
,
and
Taipei
,
from
the
financial
domain
;
and
used
the
cosine
similarity
function
to
classify
them
into
one
or
more
of
the
semantic
categories
in
Cilin
.
The
automatic
classification
results
were
compared
with
a
simple
baseline
method
,
against
human
judgement
as
the
gold
standard
.
In
general
,
an
accuracy
of
up
to
85
%
could
be
reached
with
the
top
15
candidates
considered
.
It
turns
out
that
our
hypothesis
is
supported
by
the
Taipei
test
data
,
whereas
the
data
heterogeneity
effect
is
less
obvious
in
Hong
Kong
and
Singapore
test
data
,
though
the
effect
on
individual
test
items
varies
.
In
Section
2
,
we
will
briefly
review
related
work
and
highlight
the
innovations
of
the
current
study
.
In
Sections
3
and
4
,
we
will
describe
the
materials
used
and
the
experimental
setup
respectively
.
Results
will
be
presented
and
discussed
with
future
directions
in
Section
5
,
followed
by
a
conclusion
in
Section
6
.
2
Related
Work
To
build
a
semantic
lexicon
,
one
has
to
identify
the
relation
between
words
within
a
semantic
hierarchy
,
and
to
group
similar
words
together
into
a
class
.
Previous
work
on
automatic
methods
for
building
semantic
lexicons
could
be
divided
into
two
main
groups
.
One
is
automatic
thesaurus
acquisition
,
that
is
,
to
identify
synonyms
or
topically
related
words
from
corpora
based
on
various
measures
of
similarity
(
e.g.
Riloff
and
Shepherd
,
1997
;
Thelen
and
Riloff
,
2002
)
.
For
instance
,
Lin
(
1998
)
used
dependency
relation
as
word
features
to
compute
word
similarities
from
large
corpora
,
and
compared
the
thesaurus
created
in
such
a
way
with
WordNet
and
Roget
classes
.
Caraballo
(
1999
)
selected
head
nouns
from
conjunctions
and
appositives
in
noun
phrases
,
and
used
the
cosine
similarity
measure
with
a
bottom-up
clustering
technique
to
construct
a
noun
hierarchy
from
text
.
Curran
and
Moens
(
2002
)
explored
a
new
similarity
measure
for
automatic
thesaurus
extraction
which
better
compromises
with
the
speed
/
performance
tradeoff
.
You
and
Chen
(
2006
)
used
a
feature
clustering
method
to
create
a
thesaurus
from
a
Chinese
newspaper
corpus
.
Another
line
of
research
,
which
is
more
closely
related
with
the
current
study
,
is
to
extend
existing
thesauri
by
classifying
new
words
with
respect
to
their
given
structures
(
e.g.
Tokunaga
et
al.
,
1997
;
Pekar
,
2004
)
.
An
early
effort
along
this
line
is
Hearst
(
1992
)
,
who
attempted
to
identify
hypo-nyms
from
large
text
corpora
,
based
on
a
set
of
lexico-syntactic
patterns
,
to
augment
and
critique
the
content
of
WordNet
.
Ciaramita
(
2002
)
compared
several
models
in
classifying
nouns
with
respect
to
a
simplified
version
of
WordNet
and
signified
the
gain
in
performance
with
morphological
features
.
For
Chinese
,
Tseng
(
2003
)
proposed
a
method
based
on
morphological
similarity
to
assign
a
Cilin
category
to
unknown
words
from
the
Sinica
corpus
which
were
not
in
the
Chinese
Electronic
Dictionary
and
Cilin
;
but
somehow
the
test
data
were
taken
from
Cilin
,
and
therefore
could
not
really
demonstrate
the
effectiveness
with
unknown
words
found
in
the
Sinica
corpus
.
The
current
work
attempts
to
classify
new
words
with
an
existing
thesaural
classificatory
structure
.
However
,
the
usual
practice
in
past
studies
is
to
test
with
a
portion
of
data
from
the
thesaurus
itself
and
evaluate
the
results
against
the
original
classification
of
those
words
.
This
study
is
thus
different
in
the
following
ways
:
(
1
)
The
test
data
(
i.e.
the
target
words
to
be
classified
)
were
not
taken
from
the
thesaurus
,
but
extracted
from
corpora
and
these
words
were
unknown
to
the
thesaurus
.
(
2
)
The
target
words
were
not
limited
to
nouns
.
(
3
)
Automatic
classification
results
were
compared
with
a
baseline
method
and
with
the
manual
judgement
of
several
linguistics
students
constituting
the
gold
standard
.
(
4
)
In
view
of
the
heterogeneous
nature
of
the
Pan-Chinese
context
,
we
experimented
with
extracting
feature
vectors
from
different
datasets
.
3
Materials
The
Tongyici
Cilin
(
l^iiH^HBpffiO
(
Mei
et
al.
,
1984
)
is
a
Chinese
synonym
dictionary
,
or
more
often
known
as
a
Chinese
thesaurus
in
the
tradition
of
the
Roget
's
Thesaurus
for
English
.
The
Roget
's
Thesaurus
has
about
1,000
numbered
semantic
heads
,
more
generally
grouped
under
higher
level
semantic
classes
and
subclasses
,
and
more
specifically
differentiated
into
paragraphs
and
semicolon-separated
word
groups
.
Similarly
,
some
70,000
Chinese
lexical
items
are
organized
into
a
hierarchy
of
broad
conceptual
categories
in
Cilin
.
Its
classification
consists
of
12
top-level
semantic
classes
,
94
subclasses
,
1,428
semantic
heads
and
3,925
paragraphs
.
It
was
first
published
in
the
1980s
and
was
based
on
lexical
usages
mostly
of
post-1949
Mainland
China
.
The
Appendix
shows
some
example
subclasses
.
In
the
following
discussion
,
we
will
mainly
refer
to
the
subclass
level
and
semantic
head
level
.
3.2
The
LIVAC
Synchronous
Corpus
LIVAC
(
http
:
/
/
www.livac.org
)
stands
for
Linguistic
Variation
in
Chinese
Speech
Communities
.
It
is
a
synchronous
corpus
developed
and
dynamically
maintained
by
the
Language
Information
Sciences
Research
Centre
of
the
City
University
of
Hong
Kong
since
1995
(
Tsou
and
Lai
,
2003
)
.
The
corpus
consists
of
newspaper
articles
collected
regularly
and
synchronously
from
six
Chinese
speech
communities
,
namely
Hong
Kong
,
Beijing
,
Taipei
,
Singapore
,
Shanghai
,
and
Macau
.
Texts
collected
cover
a
variety
of
domains
,
including
front
page
news
stories
,
local
news
,
international
news
,
editorials
,
sports
news
,
entertainment
news
,
and
financial
news
.
Up
to
December
2006
,
the
corpus
has
already
accumulated
over
200
million
character
tokens
which
,
upon
automatic
word
segmentation
and
manual
verification
,
amount
to
over
1.2
million
word
types
.
For
the
present
study
,
we
made
use
of
the
sub-corpora
collected
over
the
9-year
period
1995-2004
from
Beijing
(
BJ
)
,
Hong
Kong
(
HK
)
,
Singapore
(
SG
)
,
and
Taipei
(
TW
)
.
In
particular
,
we
made
use
of
the
financial
news
sections
in
these
subcorpora
,
from
which
we
extracted
feature
vectors
for
comparing
similarity
between
a
given
target
word
and
a
thesaurus
class
,
which
is
further
explained
in
Section
4.3
.
Table
1
shows
the
sizes
of
the
subcorpora
.
Instead
of
using
a
portion
of
Cilin
as
the
test
data
,
we
extracted
unique
lexical
items
from
the
various
subcorpora
above
,
and
classified
them
with
respect
to
the
Cilin
classification
.
Kwong
and
Tsou
(
2006
)
observed
that
among
the
unique
lexical
items
found
from
the
individual
subcorpora
,
only
about
30-40
%
are
covered
by
Cilin
,
but
not
necessarily
in
the
expected
senses
.
In
other
words
,
Cilin
could
in
fact
be
enriched
with
over
60
%
of
the
unique
items
from
various
regions
.
In
the
current
study
,
we
sampled
the
most
frequent
30
words
from
each
of
these
unique
item
lists
for
testing
.
Classification
was
based
on
their
similarity
with
each
of
the
Cilin
subclasses
,
compared
by
the
cosine
measure
,
as
discussed
in
Section
4.3
.
Subcorpus
Size
of
Financial
News
Sections
Word
Token
Word
Type
Table
1
Sizes
of
Individual
Subcorpora
4
Experiments
Three
undergraduate
linguistics
students
and
one
research
student
on
computational
linguistics
from
the
City
University
of
Hong
Kong
were
asked
to
do
the
task
.
The
undergraduate
students
were
raised
in
Hong
Kong
and
the
research
student
in
Mainland
China
.
They
were
asked
to
assign
what
they
consider
the
most
appropriate
Cilin
category
(
up
to
the
semantic
head
level
,
i.e.
third
level
in
the
Cilin
structure
)
to
each
of
the
120
target
words
.
The
inter-annotator
agreement
was
measured
by
the
Kappa
statistic
(
Siegel
and
Castellan
,
1988
)
,
at
both
the
subclass
and
semantic
head
levels
.
Results
on
the
human
judgement
are
discussed
in
Section
5.1
.
4.2
Creating
Gold
Standard
The
"
gold
standard
"
was
set
at
both
the
subclass
level
and
semantic
head
level
.
For
each
level
,
we
formed
a
"
strict
"
standard
for
which
we
considered
all
categories
assigned
by
at
least
two
judges
to
a
word
;
and
a
"
loose
"
standard
for
which
we
considered
all
categories
assigned
by
one
or
more
judges
.
For
evaluating
the
automatic
classification
in
this
study
,
however
,
we
only
experimented
with
the
loose
standard
at
the
subclass
level
.
4.3
Automatic
Classification
Each
target
word
was
automatically
classified
with
respect
to
the
Cilin
subclasses
based
on
the
similarity
between
the
target
word
and
each
subclass
.
We
compute
the
similarity
by
the
cosine
between
the
two
corresponding
feature
vectors
.
The
feature
vector
of
a
given
target
word
contains
all
its
co-occurring
content
words
in
the
corpus
within
a
window
of
±5
words
(
excluding
many
general
adjectives
and
adverbs
,
and
numbers
and
proper
names
were
all
ignored
)
.
The
feature
vector
of
a
Cilin
subclass
is
based
on
the
union
of
the
features
(
i.e.
co-occurring
words
in
the
corpus
)
from
all
individual
members
in
the
subclass
.
The
cosine
of
two
feature
vectors
is
computed
as
In
view
of
the
difference
in
the
feature
space
of
a
target
word
and
a
whole
class
of
words
,
and
thus
the
potential
difference
in
the
number
of
occurrence
of
individual
features
,
we
experimented
with
two
versions
of
the
cosine
measurement
,
namely
binary
vectors
and
real-valued
vectors
.
In
addition
,
as
mentioned
in
previous
sections
,
we
also
experimented
with
the
following
conditions
:
whether
feature
vectors
for
the
Cilin
subclasses
were
extracted
from
the
subcorpus
where
a
given
target
word
originates
,
or
from
the
Beijing
subcorpus
which
is
assumed
to
be
representative
of
language
use
in
Mainland
China
.
All
automatic
classification
results
were
evaluated
against
the
gold
standard
based
on
human
judgement
.
To
evaluate
the
effectiveness
of
the
automatic
classification
,
we
adopted
a
simple
baseline
measure
by
ranking
the
94
subclasses
in
descending
order
of
the
number
of
words
they
cover
.
In
other
words
,
assuming
the
bigger
the
subclass
size
,
the
more
likely
it
covers
a
new
term
,
thus
we
compared
the
top-ranking
subclasses
with
the
classifications
obtained
from
the
automatic
method
using
the
cosine
measure
.
5
Results
and
Discussion
5.1
Response
from
Human
Judges
All
human
judges
reported
difficulties
in
various
degrees
in
assigning
Cilin
categories
to
the
target
words
.
The
major
problem
comes
from
the
regional
specificity
and
thus
the
unfamiliarity
of
the
judges
with
the
respective
lexical
items
and
contexts
.
For
instance
,
students
grown
up
in
Hong
Kong
were
most
familiar
with
the
Hong
Kong
data
,
and
slightly
less
so
with
the
Beijing
data
,
but
more
often
had
the
least
ideas
for
the
Taipei
and
Singapore
data
.
The
research
student
from
Mainland
China
had
no
problem
with
Beijing
data
and
the
lexical
items
in
Cilin
,
but
had
a
hard
time
figuring
out
the
meaning
for
words
from
Hong
Kong
,
Taipei
and
Singapore
.
For
example
,
all
judges
reported
problem
with
the
term
§
$
t
,
one
of
the
target
words
from
Singapore
referring
to
EllfljIlxrfJ
(
CLOB
in
the
Singaporean
stock
market
)
,
which
is
really
specific
to
Singapore
.
The
demand
on
cross-cultural
knowledge
thus
poses
a
challenge
for
building
a
Pan-Chinese
lexical
resource
manually
.
Cilin
,
for
instance
,
is
quite
biased
in
language
use
in
Mainland
China
,
and
it
requires
experts
with
knowledge
of
a
wide
variety
of
Chinese
terms
to
be
able
to
manually
classify
lexical
items
specific
to
other
Chinese
speech
communities
.
It
is
therefore
even
more
important
to
devise
robust
ways
for
automatic
acquisition
of
such
a
resource
.
Notwithstanding
the
difficulty
,
the
inter-annotator
agreement
was
quite
satisfactory
.
At
the
subclass
level
,
we
found
K
=
0.6870
.
At
the
semantic
head
level
,
we
found
K
=
0.5971
.
Both
figures
are
statistically
significant
.
As
mentioned
,
we
set
up
a
loose
standard
and
a
strict
standard
at
both
the
subclass
and
semantic
head
level
.
In
general
,
the
judges
managed
to
reach
some
consensus
in
all
cases
,
except
for
two
words
from
Singapore
.
For
these
two
cases
,
we
considered
all
categories
assigned
by
any
of
the
judges
for
both
standards
.
The
gold
standards
were
verified
by
the
authors
.
Although
in
several
cases
the
judges
did
not
reach
complete
agreement
with
one
another
,
we
found
that
their
decisions
reflected
various
possible
perspectives
to
classify
a
given
word
with
respect
to
the
Cilin
classification
;
and
the
judges
'
assignments
,
albeit
varied
,
were
nevertheless
reasonable
in
one
way
or
another
.
5.3
Evaluating
Automatic
Classification
In
the
following
discussion
,
we
will
refer
to
the
various
testing
conditions
for
each
group
of
target
words
with
labels
in
the
form
of
Cos
-
&lt;
Vector
Type
&gt;
-
&lt;
Target
Words
&gt;
-
&lt;
Cilin
Feature
Source
&gt;
.
Thus
the
label
Cos-Bin-hk-hk
means
testing
on
Hong
Kong
target
words
with
binary
vectors
and
extracting
features
for
the
Cilin
words
from
the
Hong
Kong
subcorpus
;
and
the
label
Cos-RV-sg-bj
means
testing
on
Singapore
target
words
with
real-valued
vectors
and
extracting
features
for
the
Cilin
words
from
the
Beijing
subcorpus
.
For
each
target
word
,
we
evaluated
the
automatic
classification
(
and
the
baseline
ranking
)
by
matching
the
human
decisions
with
the
top
N
candidates
.
If
any
of
the
categories
suggested
by
the
human
judges
is
covered
,
the
automatic
classification
is
considered
accurate
.
The
results
are
shown
in
Figure
1
for
test
data
from
individual
regions
.
Overall
speaking
,
the
results
are
very
encouraging
,
especially
in
view
of
the
number
of
categories
(
over
90
)
we
have
at
the
subclass
level
.
An
accuracy
of
80
%
or
more
is
obtained
in
general
if
the
top
15
candidates
were
considered
,
which
is
much
higher
than
the
baseline
result
in
all
cases
.
Table
2
shows
some
examples
with
appropriate
classification
within
the
Top
3
candidates
.
The
two-letter
codes
in
the
"
Top
3
"
column
in
Table
2
refer
to
the
subclass
labels
,
and
the
code
in
bold
is
the
one
matching
human
judgement
.
In
terms
of
the
difference
between
binary
vectors
and
real-valued
vectors
in
the
similarity
measurement
,
the
latter
almost
always
gave
better
re
-
sults
.
This
was
not
surprising
as
we
expected
by
using
real-valued
vectors
we
could
be
less
affected
by
the
potential
huge
difference
in
the
feature
space
and
the
number
of
occurrence
of
the
features
for
a
Cilin
subclass
and
a
target
word
.
As
for
extracting
features
for
Cilin
subclasses
from
the
Beijing
subcorpus
or
other
subcorpora
,
the
difference
is
more
obvious
for
the
Singapore
and
Taipei
target
words
.
We
will
discuss
the
results
for
each
group
of
target
words
in
detail
below
.
5.4
Performance
on
Individual
Sources
Target
words
from
Beijing
were
expected
to
have
a
relatively
higher
accuracy
because
they
are
homogenous
with
the
Cilin
content
.
It
turned
out
,
however
,
the
accuracy
only
reached
73
%
with
top
15
candidates
and
83
%
with
top
20
candidates
even
under
the
Cos-RV-bj-bj
condition
.
Words
like
#H
(
SARS
)
,
ffjjc
(
save
water
)
,
MMit
(
industrialize
/
industrialization
)
,
pj#r^P
(
passing
rate
)
and
flftf
'
(
multi-level
marketing
)
could
not
be
successfully
classified
.
Results
were
surprisingly
good
for
target
words
from
the
Hong
Kong
subcorpus
.
Under
the
Cos-RV-hk-hk
condition
,
the
accuracy
was
87
%
with
top
15
candidates
and
even
over
95
%
with
top
20
candidates
considered
.
Apart
from
this
high
accuracy
,
another
unexpected
observation
is
the
lack
of
significant
difference
between
Cos-RV-hk-hk
and
Cos-RV-hk-bj
.
One
possible
reason
is
that
the
relatively
larger
size
of
the
Hong
Kong
subcorpus
might
have
allowed
enough
features
to
be
extracted
even
for
the
Cilin
words
.
Nevertheless
,
the
similar
results
from
the
two
conditions
might
also
suggest
that
the
context
in
which
Cilin
words
are
used
might
be
relatively
similar
in
the
Hong
Kong
subcorpus
and
the
Beijing
subcorpus
,
as
compared
with
other
communities
.
Similar
trends
were
observed
from
the
Singapore
target
words
.
Looking
at
Cos-RV-sg-sg
and
Cos-RV-sg-bj
,
it
appears
that
extracting
feature
vectors
for
the
Cilin
words
from
the
Singapore
subcorpus
leads
to
better
performance
than
extracting
them
from
the
Beijing
subcorpus
.
It
suggests
that
although
the
Singapore
subcorpus
shares
those
words
in
Cilin
,
the
context
in
which
they
are
used
might
be
slightly
different
from
their
use
in
Mainland
China
.
Thus
extracting
their
contextual
features
from
the
Singapore
subcorpus
might
better
reflect
their
usage
and
makes
it
more
comparable
with
the
unique
target
words
from
Singapore
.
Such
possible
difference
in
contextual
features
with
shared
lexical
items
between
different
Chinese
speech
communities
would
require
further
investigation
,
and
will
form
part
of
our
future
work
as
discussed
below
.
Despite
the
above
observation
from
the
accuracy
figures
,
the
actual
effect
,
however
,
seems
to
vary
on
individual
lexical
items
.
Table
3
shows
some
examples
of
target
words
which
received
similar
(
with
white
cells
)
and
very
different
(
with
shaded
cells
)
classification
respectively
under
the
two
conditions
.
It
appears
that
the
region-specific
but
common
concepts
like
Sf^JJI
(
office
)
,
|
M
!
t
(
apartment
)
,
S
,
^E
(
private
residence
)
,
which
relate
to
building
or
housing
,
were
affected
most
.
Taipei
data
,
on
the
contrary
,
seems
to
be
more
affected
by
the
different
testing
conditions
.
Cos
-
Bin-tw-bj
and
Cos-RV-tw-bj
produced
similar
results
,
and
both
conditions
showed
better
results
than
Cos-RV-tw-tw
.
This
supports
our
hypothesis
that
the
effect
of
data
heterogeneity
is
so
apparent
that
it
is
much
harder
to
classify
target
words
unique
to
Taipei
with
respect
to
the
Cilin
categories
.
In
addition
,
as
Kwong
and
Tsou
(
2006
)
observed
,
Beijing
and
Taipei
data
share
the
least
number
of
lexical
items
,
among
the
four
regions
under
investigation
.
Hence
,
words
in
Cilin
might
not
have
the
appropriate
contextual
feature
vectors
extracted
from
the
Taipei
subcorpus
.
The
different
results
for
individual
regions
might
be
partly
due
to
the
endocentric
and
exocen-tric
nature
of
influence
in
lexical
innovation
(
e.g.
Tsou
,
2001
)
especially
with
respect
to
the
financial
domain
and
the
history
of
capitalism
in
individual
regions
.
This
factor
is
worth
further
investigation
.
Classification
Accuracy
for
BJ
Data
Classification
Accuracy
for
SG
Data
Baseline
Classification
Accuracy
for
HK
Data
Classification
Accuracy
for
TW
Data
Figure
1
Classification
Results
with
Top
N
Candidates
Table
2
Examples
of
Correct
Classification
(
Top
3
)
1
5.5
General
Discussions
and
Future
Work
As
mentioned
in
a
previous
section
,
the
test
data
in
this
study
were
not
taken
from
the
thesaurus
itself
,
but
were
unknown
words
to
the
thesaurus
.
They
were
extracted
from
corpora
,
and
were
not
limited
to
nouns
.
We
found
in
this
study
that
the
simple
cosine
measure
,
which
used
to
be
applied
for
clustering
contextually
similar
words
from
homogenous
sources
,
performs
quite
well
in
general
for
classifying
these
unseen
words
with
respect
to
the
Cilin
subclasses
.
The
automatic
classification
results
were
compared
with
the
manual
judgement
of
several
linguistics
students
.
In
addition
to
providing
a
gold
standard
for
evaluating
the
automatic
classification
results
in
this
study
,
the
human
1
English
gloss
:
1-restoring
agricultural
lands
for
afforestation
,
2-material
,
3-coal
mine
,
4-to
seize
(
an
opportunity
)
,
5-unemployed
,
6-sales
performance
,
7-broadband
,
8-red
chip
,
9-interest
rate
,
10-property
stocks
,
11-financial
year
,
12-sell
short
,
13-proposal
,
14-sell
,
15-brigadier
general
,
16-financial
holdings
,
17-individual
stocks
,
18-property
market
,
19-cash
card
,
20-stub
.
judgement
on
the
one
hand
proves
that
the
Cilin
classificatory
structure
could
accommodate
region-specific
lexical
items
;
but
on
the
other
hand
also
suggests
how
difficult
it
would
be
to
construct
such
a
Pan-Chinese
lexicon
manually
as
rich
cultural
and
linguistic
knowledge
would
be
required
.
Moreover
,
we
started
with
Cilin
as
the
established
semantic
classification
and
attempted
to
classify
words
specific
to
Beijing
,
Hong
Kong
,
Singapore
,
and
Taipei
respectively
.
The
heterogeneity
of
sources
did
not
seem
to
hamper
the
similarity
measure
on
the
whole
,
provided
appropriate
datasets
are
used
for
feature
extraction
,
although
the
actual
effect
seemed
to
vary
on
individual
lexical
items
.
No.
Ranking
of
1st
appropriate
class
Cos-RV-hk-hk
,
etc.
Cos-RV-hk-bj
,
etc.
Table
3
Different
Impact
on
Individual
Items2
Despite
the
encouraging
results
with
the
top
15
candidates
in
the
current
study
,
it
is
desirable
to
improve
the
accuracy
for
the
system
to
be
useful
in
2
English
gloss
:
1-sales
performance
,
2-broadband
,
3-red
chip
,
4-add
(
supply
to
market
)
,
5-low
level
,
6-office
,
7-financial
year
,
8-sell
short
,
9-rights
issue
,
10-apartment
,
11-holding
space
rate
,
12-private
residence
,
13-stub
,
14-individual
stocks
,
15-financial
holdings
,
16-investment
trust
,
17-growth
rate
,
18-cash
card
.
practice
.
Hence
our
next
step
is
to
expand
the
test
data
size
and
to
explore
alternative
methods
such
as
using
a
nearest
neighbour
approach
.
In
addition
,
we
plan
to
further
the
investigation
in
the
following
directions
.
First
,
we
will
experiment
with
the
automatic
classification
at
the
Cilin
semantic
head
level
,
which
is
much
more
fine-grained
than
the
subclasses
.
The
fine-grainedness
might
make
the
task
more
difficult
,
but
at
the
same
time
the
more
specialized
grouping
might
pose
less
ambiguity
for
classification
.
Second
,
we
will
further
experiment
with
classifying
words
from
other
special
domains
like
sports
,
as
well
as
the
general
domain
.
Third
,
we
will
study
the
classification
in
terms
of
the
part-of-speech
of
the
target
words
,
and
their
respective
requirements
on
the
kinds
of
features
which
give
best
classification
performance
.
The
current
study
only
dealt
with
presumably
Modern
Standard
Chinese
in
different
communities
,
and
it
could
potentially
be
expanded
to
handle
various
dialects
within
a
common
resource
,
eventually
benefiting
speech
lexicons
and
applications
at
large
.
6
Conclusion
In
this
paper
,
we
have
reported
our
study
on
a
unique
problem
in
Chinese
language
processing
,
namely
extending
a
Chinese
thesaurus
with
new
words
from
various
Chinese
speech
communities
,
including
Beijing
,
Hong
Kong
,
Singapore
and
Taipei
.
The
critical
issues
include
whether
the
existing
classificatory
structure
could
accommodate
concepts
and
expressions
specific
to
various
Chinese
speech
communities
,
and
whether
the
difference
in
textual
sources
might
pose
difficulty
in
using
conventional
similarity
measures
for
the
automatic
classification
.
Our
experiments
,
using
the
cosine
function
to
measure
similarity
and
testing
with
various
sources
for
extracting
contextual
vectors
,
suggest
that
the
classification
performance
might
depend
on
the
compatibility
between
the
words
in
the
thesaurus
and
the
sources
from
which
the
target
words
are
drawn
.
Evaluated
against
human
judgement
,
an
accuracy
of
over
85
%
was
reached
in
some
cases
,
which
were
much
higher
than
the
baseline
and
were
very
encouraging
in
general
.
While
human
judgement
is
not
straightforward
and
it
is
difficult
to
create
a
Pan-Chinese
lexicon
manually
,
combining
simple
classification
methods
with
the
appropriate
data
sources
seems
to
be
a
promising
approach
toward
its
automatic
construction
.
Acknowledgements
The
work
described
in
this
paper
was
supported
by
a
grant
from
the
Research
Grants
Council
of
the
Hong
Kong
Special
Administrative
Region
,
China
(
Project
No.
CityU
1317
/
03H
)
.
Appendix
The
following
table
shows
some
examples
of
the
Cilin
subclasses
:
Subclasses
(
Abstract
entities
)
(
Characteristics
)
(
Psychological
activities
)
(
Association
)
(
Auxiliary
words
)
(
Respectful
expressions
)
