This
paper
presents
the
strategy
and
design
of
a
highly
efficient
semiautomatic
method
for
labelling
the
semantic
features
of
common
nouns
,
using
semantic
relationships
between
words
,
and
based
on
the
information
extracted
from
an
electronic
monolingual
dictionary
.
The
method
,
which
uses
genus
data
,
specific
relators
and
synonymy
information
,
obtains
an
accuracy
of
over
99
%
and
a
scope
of
68.2
%
with
regard
to
all
the
common
nouns
contained
in
a
real
corpus
of
over
1
million
words
,
after
the
manual
labelling
of
only
100
nouns
.
1
Introduction
Semantic
information
is
essential
in
many
NLP
applications
.
In
our
case
,
the
feature
[
±animate
]
is
necessary
to
disambiguate
between
the
possible
Basque
translations
for
the
English
preposition
"
of
"
and
the
Spanish
preposition
"
de
"
,
when
referring
to
location
or
possession
.
This
ambiguity
appears
very
often
when
translating
to
Basque
[
Diaz
de
Ilarraza
et
al.
,
2000
]
.
A
complete
manual
labelling
of
semantic
information
would
prove
extremely
expensive
.
This
study
aims
to
outline
the
strategy
and
design
of
a
semiautomatic
method
for
labelling
semantic
features
of
common
nouns
in
Basque
,
expanding
and
improving
the
idea
outlined
in
[
Diaz
de
Ilarraza
et
al.
2000
]
.
Due
to
the
poor
results
obtained
,
this
study
dismissed
the
possibility
of
an
initial
approach
aimed
at
extracting
the
information
corresponding
to
the
[
±animate
]
feature
automatically
from
a
corpus
.
Instead
,
an
alternative
idea
was
proposed
,
i.e.
that
of
using
semantic
relationships
between
words
extracted
from
the
Basque
monolingual
dictionary
Euskal
Hiztegia
(
Sarasola
1996
)
.
In
this
context
,
we
used
genus
data
and
specific
relators
,
together
with
a
few
manually
labelled
words
,
to
extract
the
information
corresponding
to
the
[
±animate
]
feature
.
The
results
obtained
were
very
promising
:
8,439
common
nouns
were
labelled
automatically
after
the
manual
labelling
of
just
100
.
This
paper
describes
the
work
carried
out
with
the
aim
of
expanding
this
idea
through
the
inclusion
of
information
about
synonymy
,
repeating
the
automatic
process
iteratively
in
order
to
obtain
better
results
,
and
monitoring
the
reliability
of
the
labelling
of
each
individual
noun
.
After
studying
the
ideal
relationship
between
the
manual
part
of
the
operation
and
the
scope
of
the
automatic
process
,
we
generalised
the
process
in
order
to
adapt
it
to
other
semantic
features
.
We
obtained
very
satisfactory
results
considering
the
labelling
of
common
nouns
contained
in
the
dictionary
:
for
the
[
±animate
]
feature
,
we
labelled
12,308
nouns
with
an
accuracy
of
99.2
%
,
after
the
manual
labelling
of
only
100
.
This
paper
is
organised
as
follows
:
section
2
presents
the
semantic
relationships
between
words
extracted
from
the
Basque
monolingual
dictionary
,
and
used
by
our
semiautomatic
labelling
method
.
The
method
itself
is
described
in
section
3
.
The
experiments
carried
out
with
the
aim
of
optimising
the
efficiency
of
the
method
are
described
in
section
4
,
and
section
5
outlines
the
accuracy
and
scope
of
the
labelling
process
for
the
[
±animate
]
semantic
feature
.
Finally
,
section
6
describes
how
the
method
was
generalised
to
cover
other
semantic
features
.
The
study
finishes
by
underlining
the
results
obtained
and
suggesting
future
research
.
2
Superficial
semantic
relationships
between
words
in
dictionaries
According
to
Smith
and
Maxwell
,
there
are
three
basic
methods
for
defining
a
lexical
entry
[
Smith
and
Maxwell
,
1980
]
:
•
By
means
of
a
synonym
:
a
word
with
the
same
sense
as
the
lexical
entry
.
finish
.
conclude
(
syn
)
,
terminate
(
syn
)
•
By
means
of
a
classical
definition
:
'
genus
+
differentia
'
.
The
genus
is
the
generic
term
or
hyperonym
,
and
the
lexical
entry
a
more
specific
term
or
hyponym
.
aeroplane
.
vehicle
(
genus
)
that
can
fly
(
differentia
)
•
By
means
of
specific
relators
,
that
will
often
determine
the
semantic
relationship
between
the
lexical
entry
and
the
core
of
the
definition
.
horsefly
.
Name
given
to
(
relator
)
certain
insects
(
related
term
)
of
the
Tabanidae
family
One
method
for
identifying
the
semantic
relationship
that
exists
between
different
words
is
to
extract
the
information
from
monolingual
dictionaries
.
Agirre
et
al.
(
2000
)
applied
this
method
to
Basque
,
using
the
definitions
contained
in
the
monolingual
dictionary
Euskal
Hiztegia
.
For
our
research
,
we
use
the
information
about
genus
,
specific
relators
and
synonymy
extracted
by
them
.
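As a rough illustration of the three definition styles listed above, the extracted information could be held in a structure like the following. This is only a minimal sketch: the class name `Sense`, its fields and the helper `definition_style` are assumptions made for illustration, and they do not reproduce the format used by Agirre et al. (2000). The entries are the English examples given in this section.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Sense:
    """One dictionary sense and the superficial relations extracted from it.

    All field names are hypothetical; they only mirror the three
    definition styles described by Smith and Maxwell.
    """
    genus: Optional[str] = None      # hyperonym of a 'genus + differentia' definition
    relator: Optional[str] = None    # specific relator, e.g. 'name given to'
    synonyms: List[str] = field(default_factory=list)


# Toy dictionary: headword -> list of senses (examples taken from the text).
dictionary: Dict[str, List[Sense]] = {
    "aeroplane": [Sense(genus="vehicle")],
    "horsefly": [Sense(relator="name given to")],
    "finish": [Sense(synonyms=["conclude", "terminate"])],
}


def definition_style(sense: Sense) -> str:
    """Classify a sense by the kind of information it carries."""
    if sense.genus is not None:
        return "genus"
    if sense.relator is not None:
        return "relator"
    if sense.synonyms:
        return "synonym"
    return "unknown"
```

Keeping one record per sense, rather than per headword, matches the way the labelling method described later treats each sense separately.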
3
Semiautomatic
labelling
using
genus
,
specific
relators
and
synonymy
In
order
to
label
the
common
nouns
that
appear
in
the
dictionary
,
we
used
the
definitions
of
the
26,461
senses
of
the
16,380
common
nouns
defined
by
means
of
genus
/
relators
(
14,569
)
or
synonyms
(
11,892
)
.
The
experiment
was
carried
out
as
follows
:
firstly
,
we
used
the
information
relative
to
genus
and
specific
relators
to
extract
the
information
regarding
the
[
±animate
]
feature
(
3.1
)
.
Subsequently
,
we
also
incorporated
the
information
relative
to
synonymy
(
3.2
)
.
Finally
,
we
repeated
the
automatic
process
iteratively
in
order
to
obtain
better
results
(
3.3
)
.
An
example
of
the
whole
process
is
given
in
section
3.4
.
3.1
Labelling
using
information
relative
to
genus
and
specific
relators
Our
strategy
consisted
of
manually
labelling
the
semantic
feature
for
a
small
number
of
words
that
appear
most
frequently
in
the
dictionary
as
genus
/
relators
.
We
used
these
words
to
infer
the
value
of
this
feature
for
as
many
other
words
as
possible
.
This
inference
is
possible
because
in
the
hyperonymy
/
hyponymy
relationship
,
which
characterises
the
genus
,
semantic
attributes
are
inherited
.
For
example
,
if
'
langile
'
(
worker
)
has
the
[
+animate
]
feature
,
all
its
hyponyms
(
or
in
other
words
,
all
the
words
whose
hyperonym
is
'
langile
'
)
will
have
the
same
[
+animate
]
feature
.
Certain
genus
terms
are
ambiguous
,
since
they
contain
senses
with
opposing
semantic
features
.
For
example
'
buru
'
(
head
/
boss
)
has
the
[
-animate
]
feature
when
it
means
'
head
'
and
the
[
+animate
]
feature
when
it
means
'
boss
'
.
The
semantic
feature
of
the
sense
defined
can
also
be
deduced
from
some
specific
relators
.
In
this
way
,
the
semantic
feature
of
words
whose
relator
is
'
nolakotasuna
'
(
quality
)
would
be
[
-
animate
]
,
such
as
in
the
case
of
'
aitatasuna
'
(
paternity
)
,
for
example
.
There
are
also
certain
relators
that
offer
no
information
,
such
as
'
mota
'
(
type
)
,
'
izena
'
(
name
)
,
and
'
banako
'
(
unit
,
individual
)
.
We
used
four
types
of
labels
during
the
manual
operation
:
[
+
]
,
[
-
]
,
[
?
]
and
[
x
]
.
[
+
]
and
[
-
]
for
the
positive
and
negative
values
of
the
feature
;
[
?
]
for
ambiguous
cases
;
and
[
x
]
for
relators
that
do
not
offer
information
regarding
this
semantic
feature
.
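The choice of words to label manually, described at the start of this section, amounts to a frequency count over the genus and relator slots of the dictionary. The following is a minimal sketch under that reading; the function name `most_frequent_heads` and the toy input list are assumptions for illustration, not the authors' implementation or data.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def most_frequent_heads(heads: Iterable[str], n: int) -> List[Tuple[str, int]]:
    """Return the n words used most often as genus or relator, with counts."""
    return Counter(heads).most_common(n)


# Toy input (made up): the genus/relator found in each sense of a tiny dictionary.
toy_heads = ["nolakotasun", "multzo", "nolakotasun", "tresna", "nolakotasun", "multzo"]

# The selected words are the ones a human annotator would then label
# with '+', '-', '?' (ambiguous) or 'x' (no information).
seed_candidates = most_frequent_heads(toy_heads, 2)
```

Labelling the most frequent heads first means that each manual decision is inherited by as many dictionary senses as possible.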
procedure Labelling_of_the_dictionary {
    foreach (common Noun of the dictionary) {
        Find_its_label (Noun)
    }
}

procedure Find_its_label (Noun) {
    foreach (Sense with Noun Genus/Relator) {
        if (Genus/Relator labelled) {
            Sense.Label = Genus/Relator.Label
            Sense.Reliability = Genus/Relator.Reliability
        }
    }
    Noun.Reliability = Σ Reliability of labelled senses / number of senses
    return (Noun.Label, Noun.Reliability)
}
Figure
1
.
Implementation
of
the
automatic
process
using
genus
and
relator
information
In
order
to
establish
the
reliability
of
the
automatic
labelling
process
for
a
particular
noun
,
we
considered
the
number
of
senses
labelled
,
taking
into
account
the
reliability
of
the
labels
of
the
genus
(
or
relator
)
that
provided
the
information
.
The
result
was
calculated
as
follows
:
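$$\mathrm{Reliability}(noun) = \frac{\sum_{s \in \mathrm{labelled\ senses}(noun)} \mathrm{Reliability}(s)}{\mathrm{number\ of\ senses}(noun)}$$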
During
manual
labelling
,
we
assigned
reliability
value
1
to
all
labels
,
since
all
the
senses
of
these
nouns
are
taken
into
account
.
Figure
1
shows
the
algorithm
used
.
For
each
common
noun
defined
in
the
dictionary
,
we
take
,
one
by
one
,
all
its
senses
containing
genus
or
relator
,
assigning
in
each
case
the
first
label
associated
to
a
genus
or
relator
in
the
hierarchy
of
hyperonyms
.
When
all
the
labels
have
the
same
sign
,
we
use
it
to
label
the
entry
;
otherwise
,
we
use
the
label
[
?
]
.
In
all
cases
,
the
reliability
is
calculated
.
When
we
detect
a
cycle
,
the
search
is
interrupted
and
the
sense
to
be
tagged
remains
unlabelled
.
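The procedure described in this section can be sketched in Python as follows. This is an illustrative reading of the prose, not the authors' implementation: the function name `find_label`, the data layout (one genus or relator per sense, `None` when a sense has neither) and the English toy entries are all assumptions. In particular, climbing the hierarchy by recursion and treating a sense whose head carries the [x] label as unlabelled are interpretations of the text.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

Labelled = Tuple[str, float]  # (label, reliability); label is '+', '-', '?' or 'x'


def find_label(
    noun: str,
    heads: Dict[str, List[Optional[str]]],
    seeds: Dict[str, Labelled],
    path: FrozenSet[str] = frozenset(),
) -> Optional[Labelled]:
    """Label a noun from the genus/relator of each of its senses."""
    if noun in seeds:                 # manually labelled word: reliability 1
        return seeds[noun]
    if noun in path:                  # cycle detected: stop the search
        return None
    senses = heads.get(noun, [])
    found: List[Labelled] = []
    for head in senses:
        if head is None:              # sense without genus or relator
            continue
        result = find_label(head, heads, seeds, path | {noun})
        if result is not None and result[0] != "x":
            found.append(result)
    if not found:
        return None                   # no sense could be labelled
    labels = {label for label, _ in found}
    label = labels.pop() if len(labels) == 1 else "?"
    # Sum of the reliabilities of the labelled senses over ALL senses.
    reliability = sum(r for _, r in found) / len(senses)
    return label, reliability


# Toy data (hypothetical English entries, one head per sense).
seeds = {"person": ("+", 1.0), "thing": ("-", 1.0)}
heads = {
    "worker": ["person"],
    "mason": ["worker"],
    "head": ["thing", "person"],   # senses with opposing labels
    "partial": ["thing", None],    # only one of two senses has a genus
    "a": ["b"],
    "b": ["a"],                    # cycle
}
```

With this toy data, `mason` inherits [+] through `worker`, `head` receives [?] because its two senses disagree, `partial` receives [-] with reliability 0.5, and the cyclic pair `a`/`b` stays unlabelled.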
3.2
Labelling
using
synonymy
information
Labelling
using
genus
and
relators
can
be
expanded
by
using
synonymy
.
Since
the
synonymy
relationship
shares
semantic
features
,
we
can
deduce
the
semantic
label
of
a
sense
if
we
know
the
label
of
its
synonyms
.
Therefore
,
the
information
obtained
during
the
previous
phase
can
now
be
used
to
label
new
nouns
.
It
also
serves
to
increase
the
reliability
of
nouns
that
have
already
been
labelled
thanks
to
the
genus
information
of
some
of
their
senses
.
If
the
synonymy
information
provided
corroborates
the
genus
information
,
the
noun
's
reliability
rating
increases
.
If
,
on
the
other
hand
,
the
new
label
does
not
coincide
with
the
previous
one
,
the
special
label
[
?
]
is
assigned
to
the
noun
indicating
this
ambiguity
.
The
automatic
process
using
synonymy
was
implemented
in
the
same
way
as
in
the
previous
process
.
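The effect of synonymy on a noun that may already carry a label can be sketched as below. This is a hedged illustration: the function name `add_synonym_evidence` is hypothetical, and the assumption that each newly labelled sense adds its synonym's reliability divided by the number of senses is inferred from the reliability formula of section 3.1, not stated explicitly in the text. How the reliability behaves when the labels conflict is likewise an assumption.

```python
from typing import Optional, Tuple

Labelled = Tuple[str, float]  # (label, reliability)


def add_synonym_evidence(
    previous: Optional[Labelled], synonym: Labelled, n_senses: int
) -> Labelled:
    """Update a noun's label with one sense labelled through a synonym."""
    syn_label, syn_reliability = synonym
    contribution = syn_reliability / n_senses
    if previous is None:
        # Noun not labelled by genus/relators: the synonym provides the label.
        return syn_label, contribution
    prev_label, prev_reliability = previous
    # Agreement corroborates the label; disagreement yields the '?' label.
    label = prev_label if prev_label == syn_label else "?"
    return label, prev_reliability + contribution
```

For instance, a two-sense noun with one sense labelled [-] by genus (reliability 0.5 so far) and a synonym labelled [-] with reliability 1 ends up as [-] with reliability 1, whereas a synonym labelled [+] would turn the noun into [?].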
3.3
Iterative
repetition
of
the
automatic
process
Our
next
idea
was
to
repeat
the
process
,
since
the
information
gathered
so
far
using
synonymy
may
also
be
applied
hereditarily
through
the
genus
'
hyperonymy
relationship
.
We
therefore
repeated
the
process
from
the
beginning
,
trying
to
label
all
the
senses
of
the
nouns
that
had
not
been
fully
labelled
during
the
initial
operations
,
by
using
the
information
contained
in
the
senses
of
the
nouns
that
had
been
fully
labelled
(
reliability
1
)
.
As
with
the
initial
operation
,
we
first
used
information
about
genus
and
relators
,
and
then
,
synonymy
.
This
process
can
be
repeated
any
number
of
times
,
thereby
labelling
more
and
more
words
while
increasing
the
reliability
of
the
labelling
itself
.
However
,
repetition
of
the
process
also
increases
the
number
of
words
labelled
as
ambiguous
[
?
]
,
since
more
senses
are
labelled
during
each
iteration
,
thereby
increasing
the
chances
of
inconsistencies
.
As
we
shall
see
,
this
iterative
process
improves
the
results
logarithmically
up
to
a
certain
number
of
repetitions
,
after
which
it
has
no
further
advantageous
effects
.
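The repetition described in this section is a fixed-point iteration: passes are applied until one of them changes nothing. The sketch below shows that control loop only. The names `iterate_until_stable`, `toy_pass` and `toy_heads` are hypothetical, and the toy pass (which copies a label from a head that was labelled in an earlier pass) stands in for the full genus/relator and synonymy passes.

```python
from typing import Callable, Dict, Tuple

Labels = Dict[str, Tuple[str, float]]  # noun -> (label, reliability)


def iterate_until_stable(
    labels: Labels,
    one_pass: Callable[[Labels], Labels],
    max_iterations: int = 10,
) -> Tuple[Labels, int]:
    """Repeat one_pass until the labels stop changing.

    Returns the final labels and the number of passes that changed something.
    """
    for iteration in range(1, max_iterations + 1):
        updated = one_pass(labels)
        if updated == labels:          # stabilised: nothing new was inferred
            return labels, iteration - 1
        labels = updated
    return labels, max_iterations


# Toy chain of hyperonyms (made up): d -> c -> b -> a.
toy_heads = {"b": "a", "c": "b", "d": "c"}


def toy_pass(labels: Labels) -> Labels:
    """Label every noun whose head was already labelled before this pass."""
    updated = dict(labels)
    for noun, head in toy_heads.items():
        if noun not in labels and head in labels:
            updated[noun] = labels[head]
    return updated


final_labels, productive_passes = iterate_until_stable({"a": ("+", 1.0)}, toy_pass)
```

In the toy chain each pass reaches one more level of the hierarchy, so three passes are productive and the fourth confirms that the result is stable.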
3.4
Example
of
semiautomatic
labelling
for
the
[
±animate
]
feature
The
100
words
that
are
most
frequently
used
as
genus
(
g
)
or
relators
(
r
)
were
labelled
manually
for
the
[
±animate
]
feature
,
as
shown
in
table
2
(
tables
3
,
4
and
5
contain
the
Basque
words
processed
during
the
operation
explained
below
)
.
Noun | ±anim | Freq | Gen / rel
nolakotasun (quality) | -
multzo (collection) | -
txikigarri (diminutive) | x
tresna (instrument) | -
Table 2 .
We
shall
now
trace
the
implementation
of
the
automatic
labelling
process
for
certain
nouns
.
Table
3
shows
the
results
of
the
first
labelling
process
using
information
about
genus
and
relators
.
The
words
printed
in
bold
in
the
results
column
are
nouns
that
were
labelled
during
the
manual
labelling
process
.
We
can
see
how
the
noun
'
babesgarri
'
(
protector
)
is
labelled
as
[
-
]
thanks
to
the
information
provided
by
the
relator
of
its
only
sense
,
which
was
manually
labelled
.
Rel.
babesgarri (protector)
gertaera (event)
espetxe (jail) : (construction) , (place)
adiskide (friend)
filosofia (philosophy) : (knowledge) , (collection)
Table 3 . Result of automatic labelling using genus and relator information
The
reliability
rating
obtained
for
'
zinismo
'
was
therefore
0.87
(
f
=
(
1+0.75
)
/
2
=
0.87
)
.
Table
4
shows
some
examples
of
the
process
using
synonym
information
.
As
we
can
see
,
'
iturburu
'
(
spring
)
,
which
the
previous
process
had
not
managed
to
tag
,
is
now
labelled
as
[
-
]
thanks
to
the
synonymy
information
associated
to
one
of
the
two
senses
.
The
resulting
reliability
rating
is
0.06
.
For a noun that had previously been labelled as [ + ] on the basis of genus information , we see that the synonyms of the two senses that use synonymy information are labelled as [ - ] .
Genus lab. | Results of the process using synonymy
iturburu (spring)
gertakuntza (event)
lagun (companion)
jateko (food)
giltzape (prison)
Table 4 . Results of automatic labelling using synonymy information
Result of process using genus and relators | Lab. | Relia.
armadura (armour)
adiskidetzako (friend)
apio (celery)
ikusgune (viewpoint)
jarrera (attitude)
zinismo (cynicism)
Table 5 . Results of the 2nd iteration of automatic labelling using genus and relator information
Due
to
this
inconsistency
,
the
word
is
now
labelled
as
[
?
]
.
The
terms
'
gertakuntza
'
(
event
)
,
'
lagun
'
(
companion
)
and
'
jateko
'
(
food
)
,
which
previously
only
had
one
sense
,
are
now
labelled
thanks
to
synonym
information
.
The
words
'
giltzape
'
(
prison
)
and
'
ikusgune
'
(
viewpoint
)
,
which
had
had
one
sense
labelled
on
the
basis
of
genus
,
now
have
both
senses
labelled
.
The
reliability
rating
for
'
ikusgune
'
is
calculated
as
f
=
(
1+0.33
)
/
2
=
0.66
.
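The two reliability figures quoted in this section can be checked with a few lines of code. This is a small worked sketch; the helper names `noun_reliability` and `truncate2` are hypothetical, and the observation that the reported values match truncation (rather than rounding) to two decimals is an inference from the numbers, not something the text states.

```python
import math
from typing import Sequence


def noun_reliability(sense_reliabilities: Sequence[float], n_senses: int) -> float:
    """Sum of the reliabilities of the labelled senses over the number of senses."""
    return sum(sense_reliabilities) / n_senses


def truncate2(value: float) -> float:
    """Cut a value to two decimals without rounding up."""
    return math.floor(value * 100) / 100


zinismo = noun_reliability([1.0, 0.75], 2)    # exact mean is 0.875
ikusgune = noun_reliability([1.0, 0.33], 2)   # mean is approximately 0.665

reported_zinismo = truncate2(zinismo)    # matches the 0.87 given in the text
reported_ikusgune = truncate2(ikusgune)  # matches the 0.66 given in the text
```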
We
then
repeated
the
process
using
first
the
genus
/
relator
information
(
table
4
)
followed
by
the
synonymy
information
(
table
5
)
.
The
aim
of
this
repetition
was
to
label
only
those
words
that
had
not
been
fully
labelled
,
using
the
information
provided
by
the
terms
that
had
been
fully
labelled
and
that
had
a
reliability
rating
of
1
,
such
as
'
babesgarri
'
,
'
gertaera
'
,
'
espetxe
'
,
'
adiskide
'
,
'
filosofia
'
,
'
ama
'
,
'
gertakuntza
'
,
'
lagun
'
,
'
jateko
'
and
'
giltzape
'
(
tables
4
and
5
)
.
This
process
succeeded
in
labelling
the
senses
of
'
armadura
'
(
armour
)
,
'
adiskidetzako
'
(
friend
)
and
'
apio
'
(
celery
)
,
previously
left
unlabelled
,
since
their
genus
'
soineko
'
(
garment
)
,
'
lagun
'
(
companion
)
and
'
jateko
'
(
food
)
had
been
fully
labelled
using
the
synonym
information
.
On
the
other
hand
,
'
ikusgune
'
(
viewpoint
)
,
'
jarrera
'
(
attitude
)
and
'
zinismo
'
(
cynicism
)
,
did
not
benefit
from
this
repetition
.
Following
this
process
,
we
applied
the
synonymy
information
,
thus
completing
the
second
iteration
.
The
process
may
be
repeated
as
many
times
as
desired
.
4
Experiments
for
optimising
the
efficiency
of
the
method
We
carried
out
a
number
of
different
tests
for
the
[
±animate
]
semantic
feature
,
manually
labelling
the
2
,
5
,
10
,
50
,
100
,
125
and
150
words
most
frequently
used
as
genus
/
relators
,
and
repeating
the
whole
process
(
using
both
genus
and
relator
and
synonymy
information
)
1
,
2
and
3
times
.
The first 5 terms that appear most frequently as genus/relators are also the most productive during the automatic labelling process. From here on, the rate of increase gradually falls: on average, the first 2 nouns each enabled 1840 terms to be labelled, the next 3 enabled 1112, while the next 5 enabled only 250. After the hundredth noun, this average dropped to just 7 new terms labelled automatically for every term labelled manually. These results are illustrated in figure 2.
Figure 2. Manual labelling versus automatic labelling and relative increase.
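The diminishing return reported above can be made concrete with a short calculation. The sketch below assumes that each of the three averages is a per-word figure (the text says so explicitly only for the first band), so the cumulative totals it produces are an illustration of the trend, not results reported by the study. The names `bands` and `cumulative_yield` are hypothetical.

```python
from typing import List, Tuple

# (manually labelled words in the band, average automatic labels per word),
# taken from the averages quoted in the text.
bands: List[Tuple[int, int]] = [(2, 1840), (3, 1112), (5, 250)]


def cumulative_yield(bands: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Return (words labelled manually so far, terms labelled automatically so far)."""
    total_manual, total_auto = 0, 0
    rows = []
    for count, per_word in bands:
        total_manual += count
        total_auto += count * per_word
        rows.append((total_manual, total_auto))
    return rows


totals = cumulative_yield(bands)
```

Under this reading the first ten manual labels already account for several thousand automatic ones, which is why the marginal gain of 7 terms per word after the hundredth noun was taken as the cut-off.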
For
efficiency
reasons
,
we
decided
that
when
labelling
other
semantic
features
,
we
would
manually
label
the
100
nouns
most
frequently
used
as
genus
/
relators
.
In
order
to
decide
the
number
of
iterations
required
for
optimum
results
,
we
compared
the
results
obtained
after
1
to
10
iterations
after
manually
labelling
100
nouns
(
Figure
3
)
.
Although
no
increase
was
recorded
for
the
number
of
nouns
with
reliability
rating
1
(
i.e.
with
all
senses
labelled
)
after
the
3rd
iteration
,
the
results
for
other
reliability
ratings
continued
to
increase
up
until
the
8th
iteration
,
since
as
more
and
more
information
is
gathered
,
new
contradictions
are
generated
and
the
number
of
ambiguous
labels
increases
.
When
the
results
stabilise
,
we
can
affirm
that
all
the
available
information
has
been
used
and
the
most
accurate
results
possible
with
this
manual
labelling
operation
have
been
obtained
.
It
is
important
to
check
that
the
process
does
indeed
stabilise
,
and
that
it
does
so
after
a
fairly
low
number
of
iterations
(
in
this
case
,
after
8
)
.
The
repetition
of
the
process
does
not
significantly
increase
execution
time
.
10
iterations
of
the
automatic
labelling
process
for
the
[
±animate
]
feature
take
just
11
minutes
33
seconds
using
the
total
capacity
of
the
CPU
of
a
Sun
Sparc
10
machine
with
512
Megabytes
of
memory
running
at
360
MHz
.
We
can
therefore
conclude
that
the
method
is
viable
and
that
,
in
the
automatic
process
for
other
semantic
features
,
the
necessary
iterations
should
be
carried
out
until
the
results
are
totally
stabilised
.
5
Accuracy
and
scope
of
the
labelling
process
for
the
[
±animate
]
feature
In
order
to
calculate
the
accuracy
of
the
automatic
labelling
process
,
we
took
1
%
of
the
labelled
words
as
a
sample
and
checked
them
manually
.
The
results
are
shown
in
table
6
.
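The evaluation procedure, drawing a 1 % sample of the labelled words and checking it by hand, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `sample_accuracy`, the fixed random seed and the `is_correct` callback (standing in for the human check) are all hypothetical, and the text does not say how the sample was drawn.

```python
import random
from typing import Callable, Dict, Tuple


def sample_accuracy(
    labels: Dict[str, str],
    is_correct: Callable[[str, str], bool],
    fraction: float = 0.01,
    seed: int = 0,
) -> Tuple[int, float]:
    """Check a random fraction of the labels and return (sample size, accuracy)."""
    rng = random.Random(seed)                      # fixed seed: reproducible sample
    size = max(1, round(len(labels) * fraction))   # at least one word is checked
    sample = rng.sample(sorted(labels), size)
    correct = sum(is_correct(noun, labels[noun]) for noun in sample)
    return size, correct / size


# Toy data (made up): 1000 nouns, all labelled '-', all judged correct.
toy_labels = {f"noun{i}": "-" for i in range(1000)}
size, accuracy = sample_accuracy(toy_labels, lambda noun, label: label == "-")
```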
Reliability | Accuracy
Table 6 . Accuracy of automatic labelling
Although
we
initially
planned
to
use
only
the
labels
with
a
reliability
rating
of
1
,
after
seeing
the
accuracy
of
the
others
,
we
decided
to
use
all
the
labels
obtained
during
the
process
,
thereby
achieving
an
overall
accuracy
rating
of
99.2
%
.
We
can
affirm
that
the
semiautomatic
process
designed
and
implemented
here
is
very
efficient
.
Fig . 3 . Automatic labelling according to number of iterations carried out
8 iterations | Labelling | lab.
Table 7 . Scope of the dictionary
Appearances in the corpus | Different nouns | Labelled
Table 8 . Scope of labelling within the corpus
6
Generalisation
for
use
with
other
semantic
features
Given
the
process
's
efficiency
,
it
can
be
generalised
for
use
with
other
semantic
features
.
To
this
end
,
we
have
adapted
its
implementation
to
enable
the
automatic
process
to
be
carried
out
on
the
basis
of
the
manual
labelling
of
any
semantic
feature
.
So
far
,
we
have
carried
out
the
labelling
process
for
the
[
±animate
]
,
[
±human
]
and
[
±concrete
]
semantic
features
.
Table
12
shows
the
corresponding
results
.
±animate
±concrete
Table
12
.
Labelling
data
for
different
semantic
features
Conclusions
We
have
presented
a
highly
efficient
semiautomatic
method
for
labelling
the
semantic
features
of
common
nouns
,
using
the
study
of
genus
,
relators
and
synonymy
as
contained
in
the
Euskal
Hiztegia
dictionary
.
The
results
obtained
have
been
excellent
,
with
an
accuracy
of
over
99
%
and
a
scope
of
68.2
%
with
regard
to
all
the
common
nouns
contained
in
a
real
corpus
of
over
1
million
words
,
after
the
manual
labelling
of
only
100
nouns
.
As
far
as
we
know
,
no
such
method
of
semantic
feature
labelling
has
been
described
in
the
literature
,
although
many
authors
[
Pustejovsky
,
2000
;
Sheremetyeva
&
Nirenburg
,
2000
]
stress
the
significance
of
semantic
features
in
general
,
and
[
animacy
]
in
particular
,
for
NLP
systems
.
One
of
the
possible
applications
of
these
experiments
is
to
enrich
the
Basque
Lexical
Database
,
EDBL
,
using
the
semantic
information
obtained
.
Acknowledgements
The
Basque
Government
Department
of
Education
,
Universities
and
Research
sponsored
this
study
.
Bibliography
Agirre
E.
,
Ansa
O.
,
Arregi
X.
,
Artola
X.
,
Diaz
de
Ilarraza
A.
,
Lersundi
M.
,
Martinez
D.
,
Sarasola
K.
,
Urizar
R.
,
2000
,
"
Extraction
of
semantic
relations
from
a
Basque
monolingual
dictionary
using
Constraint
Grammar
"
,
EURALEX'2000
.
Diaz
de
Ilarraza
A.
,
Lersundi
M.
,
Mayor
A.
,
Sarasola
K.
,
2000
.
"
Etiquetado
semiautomático
del
rasgo
semántico
de
animicidad
para
su
uso
en
un
sistema
de
traducción
automática
"
.
SEPLN'2000
.
Vigo
.
Diaz
de
Ilarraza
A.
,
Mayor
A.
,
Sarasola
K.
,
2000
.
"
Reusability
of
Wide-Coverage
Linguistic
Resources
in
the
Construction
of
a
Multilingual
MT
System
"
.
MT
2000
.
Exeter
.
UK
.
Pustejovsky
J.
,
2000
.
"
Syntagmatic
Processes
"
.
Handbook
of
Lexicology
and
Lexicography
.
de
Gruyter
,
2000
.
Sheremetyeva
S.
and
Nirenburg
S.
,
2000
.
"
Towards
A
Universal
Tool
for
NLP
Resource
Acquisition
"
.
LREC2000
.
Greece
.
Smith
,
R.N.
,
Maxwell
,
E.
,
1980
,
"
An
English
dictionary
for
computerised
syntactic
and
semantic
processing
systems
"
,
Proceedings
of
the
International
Conference
on
Computational
Linguistics
.
1980
.
