Lexical
analogies
occur
frequently
in
text
and
are
useful
in
various
natural
language
processing
tasks
.
In
this
study
,
we
present
a
system
that
generates
lexical
analogies
automatically
from
text
data
.
Our
system
discovers
semantically
related
pairs
of
words
by
using
dependency
relations
,
and
applies
novel
machine
learning
algorithms
to
match
these
word-pairs
to
form
lexical
analogies
.
Empirical
evaluation
shows
that
our
system
generates
valid
lexical
analogies
with
a
precision
of
70
%
,
and
produces output of reasonable quality,
although
not
at
the
level
of
the
best
human-generated
lexical
analogies
.
1
Introduction
Analogy
discovery
and
analogical
reasoning
are
active
research
areas
in
a
multitude
of
disciplines
,
including
philosophy
,
psychology
,
cognitive
science
,
linguistics
,
and
artificial
intelligence
.
A
type
of
analogy
that
is
of
particular
interest
in
natural
language
processing
is
lexical
analogy
.
A
lexical
analogy
is
a
pair
of
word-pairs
that
share
a
similar
semantic
relation
.
For
example
,
the
word-pairs
(
dalmatian
,
dog
)
and
(
trout
,
fish
)
form
a
lexical
analogy
because
dalmatian
is
a
subspecies
of
dog
just
as
trout
is
a
subspecies
of
fish
,
and
the
word-pairs
(
metal
,
electricity
)
and
(
air
,
sound
)
form
a
lexical
analogy
because
in
both
cases
the
initial
word
serves
as
a
conductor
for
the
second
word
.
Lexical
analogies
occur
frequently
in
text
and
are
useful
in
various
natural
language
processing
tasks
.
For
example
,
understanding
metaphoric
language
such
as
"
the
printer
died
"
requires
the
recognition
of
implicit
lexical
analogies
,
in
this
case
between
(
printer
,
malfunction
)
and
(
person
,
death
)
.
Lexical
analogies
also
have
applications
in
word
sense
disambiguation
,
information
extraction
,
question-answering
,
and
semantic
relation
classification
(
see
(
Turney
,
2006
)
)
.
In
this
study
,
we
present
a
novel
system
for
generating
lexical
analogies
directly
from
a
text
corpus
without
relying
on
dictionaries
or
other
semantic
resources
.
Our
system
uses
dependency
relations
to
characterize
pairs
of
semantically
related
words
,
then
compares
the
similarity
of
their
semantic
relations
using
two
machine
learning
algorithms
.
We
also
present
an
empirical
evaluation
that
shows
our
system
generates
valid
lexical
analogies
with
a
precision
of
70
%
.
Section
2
provides
a
list
of
definitions
,
notations
,
and
necessary
background
materials
.
Section
3
describes
the
methods
used
in
our
system
.
Section
4
presents
our
empirical
evaluation
.
Section
5
reviews
selected
related
work
.
Finally, Section 6 suggests future work and concludes the paper.
2
Definitions
A
word-pair
is
a
pair
of
entities
,
where
each
entity
is
a
single
word
or
a
multi-word
named
entity
.
The
underlying
relations
of
a
word-pair
(
w1
,
w2
)
are
the
semantic
relations1
between
w1
and
w2
.
(Footnote 1: Here 'semantic relations' include both classical relations such as synonymy and meronymy, and non-classical relations as defined by Morris and Hirst (2004).)

Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 561-570, Prague, June 2007. © 2007 Association for Computational Linguistics

For example,
the
underlying
relations
of
(
poet
,
poem
)
include
produces
,
writes
,
enjoys
,
and
understands
.
A
lexical
analogy
is
a
pair
of
word-pairs
that
share
at
least
one
identical
or
similar
underlying
relation
.
A
key
linguistic
formalism
we
use
is
dependency
grammar
(
Tesniere
,
1959
)
.
A
dependency
grammar
describes
the
syntactic
structure
of
a
sentence
in
a
manner
similar
to
the
familiar
phrase-structure
grammar
.
However
,
unlike
phrase-structure
grammars
which
associate
each
word
of
a
sentence
to
the
syntactic
phrase
in
which
the
word
is
contained
,
a
dependency
grammar
associates
each
word
to
its
syntactic
superordinate
as
determined
by
a
set
of
rules
.
Each
pair
of
depending
words
is
called
a
dependency
.
Within
a
dependency
,
the
word
being
depended
on
is
called
the
governor
,
and
the
word
depending
on
the
governor
is
called
the
dependent
.
Each
dependency
is
also
labelled
with
the
syntactic
relation
between
the
governor
and
the
dependent
.
Dependency
grammars
require
that
each
word
of
a
sentence
have
exactly
one
governor
,
except
for
one
word
called
the
head
word
which
has
no
governor
at
all
.
A
preposition
p
that
is
governor
to
exactly
one
word
w1
and
dependent
of
exactly
one
word
w2
is
often
collapsed
(
Lin
and
Pantel
,
2001
)
;
that
is
,
the
two
dependencies
involving
p
are
replaced
by
a
single
dependency
between
w1
and
w2
labelled
p.
The
dependency
structure
of
a
sentence
can
be
concisely
represented
by
a
dependency
tree
,
in
which
each
word
is
a
node
,
each
dependent
is
a
child
of
its
governor
,
and
the
head
word
is
the
root
.
A
dependency
path
is
an
undirected
path
through
a
dependency
tree
,
and
a
dependency
pattern
is
a
dependency
path
with
both
ends
replaced
by
slots
(
Lin
and
Pantel
,
2001
)
.
Figure
1
illustrates
various
dependency
structures
of
the
sentence
"rebels fired rockets at a military convoy",
after
each
word
is
lemmatized
.
3
Methods
We
consider
lexical
analogy
generation
as
a
sequence
of
two
key
problems
:
data
extraction
and
relation-matching
.
Data
extraction
involves
the
identification
and
extraction
of
pairs
of
semantically
related
words
,
as
well
as
features
that
characterize
their
relations
.
Relation-matching
involves
matching
word-pairs
with
similar
features
to
form
lexical
analogies
.
We
describe
our
methods
for
solving
these
two
problems
in
the
following
subsections
.
3.1
Data
Extraction
Extracting
Word-Pairs
To
identify
semantically
related
words
,
we
rely
on
the
assumption
that
highly
syntactically
related
words
also
tend
to
be
semantically
related
—
a
hypothesis
that
is
supported
by
works
such
as
Levin
's
(
1993
)
study
of
English
verbs
.
As
such
,
the
dependency
structure
of
a
sentence
can
be
used
to
approximate
the
semantic
relatedness
between
its
constituent
words
.
Our
system
uses
a
dependency
parser
to
parse
the
input
text
into
a
set
of
dependency
trees
,
then
searches
through
these
trees
to
extract
dependency
paths
satisfying
the
following
constraints
:
The
path
must
be
of
the
form
noun-verb-noun
.
One
of
the
nouns
must
be
the
subject
of
the
clause
to
which
it
belongs
.
Each
of
these
paths
is
then
turned
into
a
word-pair
by
taking
its
two
nouns
.
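The extraction step just described admits a compact sketch. This is our simplification with a toy part-of-speech table; the authors' system instead walks the trees produced by a full dependency parser.

```python
# Sketch of word-pair extraction: from each dependency tree, keep
# noun-verb-noun paths in which one noun is the subject, and emit the
# two nouns as a word-pair.

def extract_word_pairs(tree):
    """tree: list of (governor, label, dependent) triples; nouns and
    verbs are distinguished by a toy POS lookup for illustration."""
    pos = {"rebel": "N", "rocket": "N", "convoy": "N",
           "military": "A", "fire": "V"}
    pairs = []
    for g1, l1, d1 in tree:
        if pos.get(g1) != "V" or l1 != "subj" or pos.get(d1) != "N":
            continue  # the first noun must be the subject of the verb
        for g2, l2, d2 in tree:
            # any relation (obj, at, with, ...) may link the verb to
            # the second noun -- more flexible than a strict SVO pattern
            if g2 == g1 and l2 != "subj" and pos.get(d2) == "N":
                pairs.append((d1, d2))
    return pairs

tree = [("fire", "subj", "rebel"), ("fire", "obj", "rocket"),
        ("fire", "at", "convoy"), ("convoy", "mod", "military")]
assert extract_word_pairs(tree) == [("rebel", "rocket"), ("rebel", "convoy")]
```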
The
path
constraints
that
we
use
are
suggested
by
the
subject-verb-object
(
SVO
)
pattern
commonly
used
in
various
relation
extraction
algorithms
.
However
,
our
constraints
allow
significantly
more
flexibility
than
the
SVO
pattern
in
two
important
aspects
.
First
,
our
constraints
allow
an
arbitrary
relation
between
the
verb
and
the
second
noun
,
not
just
the
object
relation
.
Hence
,
word-pairs
can
be
formed
from
a
clause
's
subject
and
its
location
,
time
,
instrument
,
and
other
arguments
,
which
are
clearly
semantically
related
to
the
subject
.
Secondly
,
searching
in
the
space
of
dependency
trees
instead
of
raw
text
data
means
that
we
are
able
to
find
semantically
related
words
that
are
not
necessarily
adjacent
to
each
other
in
the
sentence
.
It
is
important
to
note
that
,
although
these
constraints
improve
the
precision
of
our
system
and
tend
to
effectively identify
the
most
relevant
word-pairs
,
they
are
not
strictly
necessary
.
Our
system
would
be
fully
functional
using
alternative
sets
of
constraints
tailored
for
specific
applications
,
or
even
with
no
constraints
at
all
.
Using
the
sentence
in
Figure
1
as
an
example
,
our
system
would
extract
the
dependency
paths
"rebel -subj- fire -obj- rocket" and "rebel -subj- fire -at- convoy",
and
would
thus
generate
the
word-pairs
(
rebel
,
rocket
)
and
(
rebel
,
convoy
)
.
Figure
1
:
Dependency
structures
of
"
rebels
fired
rockets
at
a
military
convoy
"
after
lemmatization
Extracting
Features
Recall
that
each
word-pair
originates
from
a
dependency
path
.
The
path
,
and
in
particular
the
middle
verb
,
provides
a
connection
between
the
two
words
of
the
word-pair
,
and
hence
is
a
good
indication
of
their
semantic
relation
.
Therefore
,
for
each
word-pair
extracted
,
we
also
extract
the
dependency
pattern
derived
from
the
word-pair
's
dependency
path
as
a
feature
for
the
word-pair
.
We
further
justify
this
choice
of
feature
by
noting
that
the
use
of
dependency
patterns
has
previously
been
shown
to
be
effective
at
characterizing
lexico-syntactic
relations
(
Lin
and
Pantel
,
2001
;
Snow
et
al.
,
2004
)
.
Continuing the example from Figure 1, the patterns "_ -subj- fire -obj- _" and "_ -subj- fire -at- _" are extracted as features of (rebel, rocket) and (rebel, convoy), respectively.
Filtering
Word-pairs
and
features
extracted
using
only
dependency
relations
tend
to
be
crude
in
several
aspects
.
First
,
they
contain
a
significant
amount
of
noise
,
such
as
word-pairs
that
have
no
meaningful
underlying
relations
.
Noise
comes
from
grammatical
and
spelling
mistakes
in
the
original
input
data
,
imperfect
parsing
,
as
well
as
the
fact
that
dependency
structure
only
approximates
semantic
relatedness
.
Secondly
,
some
of
the
extracted
word-pairs
contain
underlying
relations
that
are
too
general
or
too
obscure
for
the
purpose
of
lexical
analogy
generation
.
For
example
,
consider
the
word-pair
(
company
,
right
)
from
the
sentence
"
the
company
exercised
the
right
to
terminate
his
contract
"
.
The
two
words
are
clearly
semantically
related
,
however
the
relation
(
have
or
entitled-to
)
is
very
general
and
it
is
difficult
to
construct
satisfying
lexical
analogies
from
the
word-pair
.
Lastly, some features are simply uninformative. A dependency pattern built around the common verb "say", for example,
has
very
little
characterization
power
because
almost
any
pair
of
words
can
occur
with
this
feature
.
In
order
to
retain
only
the
most
relevant
word-pairs
and
features
,
we
employ
a
series
of
refining
filters
.
All
of
our
filters
rely
on
the
occurrence
statistics
of
the
word-pairs
and
features
.
Let W = {wp1, wp2, ..., wpn} be the set of all word-pairs and F = {f1, f2, ..., fm} the set of all features.
Let
Fwp
be
the
set
of
features
of
word-pair
wp
,
and
let
Wf
be
the
set
of
word-pairs
associated
with
feature
f.
Let
O
(
wp
)
be
the
total
number
of
occurrences
of
word-pair
wp
,
O
(
f
)
be
the
total
number
of
occurrences
of
feature
f
,
and
O
(
wp
,
f
)
be
the
number
of
occurrences
of
word-pair
wp
with
feature
f.
The
following
filters
are
used
:
Occurrence
filter
:
Eliminate
word-pair
wp
if
O
(
wp
)
is
less
than
some
constant
Kf1
,
and
eliminate
feature
f
if
O
(
f
)
is
less
than
some
constant
Kf2
.
This
filter
is
inspired
by
the
simple
observation
that
valid
word-pairs
and
features
tend
to
occur
repeatedly
.
Generalization
filter
:
Eliminate
feature
f
if
|
Wf
|
is
greater
than
some
constant
Kf3
.
This
filter
ensures
that
features
associated
with
too
many
word-pairs
are
not
kept
.
A
feature
that
occurs
with
many
word-pairs
tends
to
describe
overly
general
relations
.
One such feature in our experiment occurred with several thousand word-pairs, while most features occurred with fewer than a hundred.
Data
sufficiency
filter
:
Eliminate
word-pair
wp
if
|
Fwp
|
is
less
than
some
constant
Kf4
.
This
filter
ensures
that
all
word-pairs
have
sufficient
features
to
be
compared
meaningfully
.
Entropy
filter
:
Eliminate
word-pair
wp
if
its
normalized
entropy
is
greater
than
some
constant
Kf5
.
We
compute
a
word-pair
's
entropy
by
considering
it
as
a
distribution
over
features
,
in
a
manner
that
is
analogous
to
the
feature
entropy
defined
in
(
Turney
,
2006
)
.
Specifically
,
the
normalized
entropy
of
a
word-pair
wp
is
:

H(wp) = - (1 / log |Fwp|) Σ_{f ∈ Fwp} p(f | wp) log p(f | wp)

where p(f | wp) = O(wp, f) / O(wp) is the conditional probability of f occurring in the context of wp.
The
normalized
entropy
of
a
word-pair
ranges
from
zero
to
one
,
and
is
at
its
highest
when
the
distribution
of the
word-pair
's
occurrences
over
its
features
is
the
most
random
.
The
justification
behind
this
filter
is
that
word-pairs
with
strong
underlying
relations
tend
to
have
just
a
few
dominant
features
that
characterize
those
relations
,
whereas
word-pairs
that
have
many
non-dominant
features
tend
to
have
overly
general
underlying
relations
that
can
be
characterized
in
many
different
ways
.
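The four filters above could be implemented roughly as follows. This is a minimal sketch of our own; the parameters K1-K5 stand in for the paper's tuned constants Kf1-Kf5, and the occurrence counts are assumed to arrive as a mapping from (word-pair, feature) to count.

```python
# Sketch of the occurrence, generalization, data sufficiency, and
# normalized-entropy filters over word-pair/feature statistics.

import math
from collections import Counter

def apply_filters(occ, K1=2, K2=2, K3=3, K4=2, K5=0.99):
    """occ: mapping (word_pair, feature) -> occurrence count O(wp, f)."""
    O_wp = Counter(); O_f = Counter()
    for (wp, f), c in occ.items():
        O_wp[wp] += c; O_f[f] += c
    # Occurrence filter: rarely seen word-pairs and features are noise.
    wps = {wp for wp in O_wp if O_wp[wp] >= K1}
    fs = {f for f in O_f if O_f[f] >= K2}
    # Generalization filter: drop features seen with too many word-pairs.
    wp_of_f = {f: {wp for (wp, g) in occ if g == f and wp in wps} for f in fs}
    fs = {f for f in fs if len(wp_of_f[f]) <= K3}
    # Data sufficiency filter: keep word-pairs with enough features.
    f_of_wp = {wp: {f for (p, f) in occ if p == wp and f in fs} for wp in wps}
    wps = {wp for wp in wps if len(f_of_wp[wp]) >= K4}
    # Entropy filter: drop word-pairs whose occurrences are spread too
    # uniformly over their features (normalized entropy close to 1).
    kept = set()
    for wp in wps:
        feats = f_of_wp[wp]
        total = sum(occ[(wp, f)] for f in feats)
        if len(feats) < 2:
            kept.add(wp); continue
        h = -sum((occ[(wp, f)] / total) * math.log(occ[(wp, f)] / total)
                 for f in feats)
        if h / math.log(len(feats)) <= K5:
            kept.add(wp)
    return kept, fs
```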
3.2
Relation-Matching
Central
to
the
problem
of
relation-matching
is
that
of
a
relational
similarity
function
:
a
function
that
computes
the
degree
of
similarity
between
two
word-pairs
'
underlying
relations
.
Given
such
a
function
,
relation-matching
reduces
to
simply
computing
the
relational
similarity
between
every
pair
of
word-pairs
,
and
outputting
the
pairs
scoring
higher
than
some
threshold
Kth
as
lexical
analogies
.
Our
system
incorporates
two
relational
similarity
functions
,
as
discussed
in
the
following
subsections
.
Latent
Relational
Analysis
The
baseline
algorithm
that
we
use
to
compute
relational
similarity
is
a
modified
version
of
Latent
Relational
Analysis
(
LRA
)
(
Turney
,
2006
)
,
that
consists
of
the
following
steps
:
Construct
an
n-by-m
matrix
A
such
that
the
i-th row maps to word-pair wpi, the j-th column maps to feature fj, and Aij = O(wpi, fj).
Reduce
the
dimensionality
of
A
to
a
constant
Ksvd
using
Singular
Value
Decomposition
(
SVD
)
(
Golub
and
van
Loan
,
1996
)
.
SVD produces a matrix Â of rank Ksvd that is the best approximation of A among all matrices of rank Ksvd.
The
use
of
SVD
to
compress
the
feature
space
was
pioneered
in
Latent
Semantic
Analysis
(
Deerwester
et
al.
,
1990
)
and
has
become
a
popular
technique
in
feature-based
similarity
computation
.
The
compressed
space
is
believed
to
be
a
semantic
space
that
minimizes
artificial
surface
differences
.
The
relational
similarity
between
two
word-pairs
is
the
cosine
measure
of
their
corresponding
row
vectors
in
the
reduced
feature
space
.
Specifically, let Âi denote the i-th row vector of Â; then the relational similarity between word-pairs wpi1 and wpi2 is:

sim(wpi1, wpi2) = cos(Âi1, Âi2) = (Âi1 · Âi2) / (||Âi1|| ||Âi2||)
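A minimal sketch of this computation, assuming a dense count matrix and using NumPy's SVD; the real system operates on far larger (and sparse) matrices, and the function name is ours.

```python
# Sketch of LRA-S: build the word-pair x feature count matrix,
# truncate it with SVD, and compare row vectors by cosine.

import numpy as np

def lra_s_similarities(A, k):
    """A: n x m count matrix (rows = word-pairs, cols = features).
    Returns the n x n matrix of pairwise cosine similarities of the
    rows in the rank-k compressed space."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_hat = U[:, :k] * s[:k]          # row coordinates in the rank-k space
    norms = np.linalg.norm(A_hat, axis=1, keepdims=True)
    norms[norms == 0] = 1.0           # guard against empty rows
    R = A_hat / norms
    return R @ R.T                    # cosine of every pair of rows
```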
The
primary
difference
between
our
algorithm
and
LRA
is
that
LRA
also
includes
each
word
's
synonyms
in
the
computation
.
Synonym
inclusion
greatly
increases
the
size
of
the
problem
space
,
which
leads
to
computational
issues
for
our
system
as
it
operates
at
a
much
larger
scale
than
previous
work
in
relational
similarity
.
Turney
's
(
2006
)
extensive
evaluation
of
LRA
on
SAT
verbal
analogy
questions
,
for
example
,
involves
roughly
ten
thousand
relational
similarity
computations2
.
In
contrast
,
our
system
typically
requires
millions
of
relational
similarity
computations
because
every
pair
of
extracted
word-pairs
needs
to
be
compared
.
We
call
our
algorithm
LRA-S
(
LRA
Without
Synonyms
)
to
differentiate
it
from
the
original
LRA
.
Similarity
Graph
Traversal
While
LRA
has
been
shown
to
perform
well
in
computing
relational
similarity
,
it
suffers
from
two limitations.

(Footnote 2: The study evaluated 374 SAT questions, each involving 30 pairwise comparisons, for a total of 11220 relational similarity computations.)
First
,
the
use
of
SVD
is
difficult
to
interpret
from
an
analytical
point
of
view
as
there
is
no
formal
analysis
demonstrating
that
the
compressed
space
really
corresponds
to
a
semantic
space
.
Secondly
,
even
LRA-S
does
not
scale
up
well
to
large
data
sets
due
to
SVD
being
an
expensive
operation
—
computing
SVD
is
in
general
O
(
mn
•
min
(
m
,
n
)
)
(
Koyuturk
et
al.
,
2005
)
,
where
m
,
n
are
the
number
of
matrix
rows
and
columns
,
respectively
.
To
counter
these
limitations
,
we
propose
an
alternative
algorithm
for
computing
relational
similarity
—
Similarity
Graph
Traversal
(
SGT
)
.
The
intuition
behind
SGT
is
as
follows
.
Suppose
we
know
that
wp1
and
wp2
are
relationally
similar
,
and
that
wp2
and
wp3
are
relationally
similar
.
Then
,
by
transitivity
,
wp1
and
wp3
are
also
likely
to
be
relationally
similar
.
In
other
words
,
the
relational
similarity
between
two
word-pairs
can
be
reinforced
by
other
word-pairs
through
transitivity
.
The
actual
algorithm
involves
the
following
steps
:
Construct
a
similarity
graph
as
follows
.
Each
word-pair
corresponds
to
a
node
in
the
graph
.
An
edge
exists
from
wp1
to
wp2
if
and
only
if
the
cosine
measure
of
the
two
word-pairs
'
feature
vectors
is
greater
than
or
equal
to
some
threshold
Ksgt
,
in
which
case
,
the
cosine
measure
is
assigned
as
the
strength
of
the
edge
.
Define
a
similarity
path
of
length
k
,
or
k-path
,
from
wp1
to
wp2
to
be
a
directed
acyclic
path
of
length
k
from
wp1
to
wp2
,
and
define
the
strength
s
(
p
)
of
a
path
p
to
be
the
product
of
the
strength
of
all
of
the
path
's
edges
.
Denote
the
set
of
all
k-paths
from
wp1
to
wp2
as
P
(
k
,
wp1
;
wp2
)
,
and
denote
the
sum
of
the
strength
of
all
paths
in
P
(
k
,
wp1
,
wp2
)
as
S
(
k
,
wp1
,
wp2
)
.
The relational similarity between word-pairs wp1 and wp2 is:

sim(wp1, wp2) = Σ_{k=1..K} αk · S(k, wp1, wp2)

where the weights αk are learned using least-squares regression on a small set of hand-labelled lexical analogies.
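The SGT steps can be sketched as below. The helper names are ours, and the brute-force enumeration of acyclic paths stands in for whatever optimized traversal the real system uses.

```python
# Sketch of SGT: nodes are word-pairs, weighted edges connect pairs
# whose feature vectors have cosine >= Ksgt, and relational similarity
# is a learned weighted sum of total k-path strengths.

def path_strengths(edges, src, dst, K):
    """edges: dict (u, v) -> strength. Returns [S(1), ..., S(K)], where
    S(k) sums the strength products of all acyclic k-paths src -> dst."""
    def walk(node, length, strength, visited):
        if length == 0:
            return strength if node == dst else 0.0
        total = 0.0
        for (u, v), w in edges.items():
            if u == node and v not in visited:
                total += walk(v, length - 1, strength * w, visited | {v})
        return total
    return [walk(src, k, 1.0, {src}) for k in range(1, K + 1)]

def sgt_similarity(edges, wp1, wp2, alphas):
    """alphas: per-length weights learned by least-squares regression."""
    S = path_strengths(edges, wp1, wp2, len(alphas))
    return sum(a * s for a, s in zip(alphas, S))
```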
A
natural
concern
for
SGT
is
that
relational
similarity
is
not
always
transitive
,
and
hence
some
paths
may
be
invalid
.
For
example
,
although
(
teacher
,
student
)
is
relationally
similar
to
both
(
shepherd
,
sheep
)
and
(
boss
,
employee
)
,
the
latter
two
word-pairs
are
not
relationally
similar
.
The
reason
that
this
is
not
a
problem
for
SGT
is
because
truly
similar
word-pairs
tend
to
be
connected
by
many
transitive
paths
,
while
invalid
paths
tend
to
occur
in
isolation
.
As
such
,
while
a
single
path
may
not
be
indicative
,
a
collection
of
many
paths
likely
signifies
a
true
common
relation
.
The
weights
in
step
3
ensure
that
SGT
assigns
a
high
similarity
score
to
two
word-pairs
only
if
there
are
sufficiently
many
transitive
paths
(
which
are
sufficiently
strong
)
between
them
.
Analogy
Filters
As
a
final
step
in
both
LRA-S
and
SGT
,
we
filter
out
lexical
analogies
of
the
form
(
w1
,
w2
)
and
(
w1
,
w3
)
,
as
such
lexical
analogies
tend
to
express
the
near-synonymy
between
w2
and
w3
more
than
they
express
the
relational
similarity
between
the
two
word-pairs
.
We
also
keep
only
one
permutation
of
each
lexical
analogy
:
(
w1
,
w2
)
and
(
w3
,
w4
)
,
(
w3
,
w4
)
and
(
w1
,
w2
)
,
(
w2
,
w1
)
and
(
w4
,
w3
)
,
and
(
w4
,
w3
)
and
(
w2
,
w1
)
are
different
permutations
of
the
same
lexical
analogy
.
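Both analogy filters reduce to a canonicalization pass over the candidate analogies, sketched here with illustrative names of our own:

```python
# Sketch of the two final analogy filters: discard analogies whose
# word-pairs share a first word (these mostly signal near-synonymy of
# the second words), and keep one canonical representative of each
# permutation class.

def canonical(wp1, wp2):
    """All four permutations of an analogy map to the same minimal form:
    swapping the two word-pairs, and reversing both pairs together."""
    forms = [(wp1, wp2), (wp2, wp1),
             (wp1[::-1], wp2[::-1]), (wp2[::-1], wp1[::-1])]
    return min(forms)

def filter_analogies(analogies):
    """analogies: iterable of ((w1, w2), (w3, w4)) candidates."""
    seen, kept = set(), []
    for wp1, wp2 in analogies:
        if wp1[0] == wp2[0]:
            continue  # (w1, w2) with (w1, w3): near-synonymy, not analogy
        key = canonical(wp1, wp2)
        if key not in seen:
            seen.add(key)
            kept.append((wp1, wp2))
    return kept
```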
4
Evaluation
Our
evaluation
consisted
of
two
parts
.
First
,
we
evaluated
the
performance
of
the
system
,
using
LRA-S
for
relation-matching
.
Then
,
we
evaluated
the
SGT
algorithm
,
in
particular
,
how
it
compares
to
LRA-S
.
4.1
System
Evaluation
Experimental
Setup
We
implemented
our
system
in
Sun
JDK
1.5
.
We
also
used
MXTerminator
(
Reynar
and
Ratnaparkhi
,
1997
)
for
sentence
segmentation
,
MINIPAR
(
Lin
,
1993
)
for
lemmatization
and
dependency
parsing
,
and
MATLAB3
for
SVD
computation
.
The
experiment
was
conducted
on
a
2.1
GHz
processor
,
with
the
exception
of
SVD
computation
which
was
carried
out
in
MATLAB
running
on
a
single
2.4
GHz
processor
within
a
64-processor
cluster
.
The
input
corpus
consisted
of
the
following
collections
in
the
Text
Retrieval
Conference
Dataset4
:
AP
Newswire
1988-1990
,
LA
Times
1989-1990
,
and
San
Jose
Mercury
1991
.
In
total
,
1196
megabytes
of
text
data
were
used
for
the
experiment
.
Table
1
summarizes
the
running
times
of
the
experiment
.
Table 1: Experiment Running Times (sentence segmentation, dependency parsing, data extraction, relation-matching)
The
parameter
values
selected
for
the
experiment
are
listed
in
Table
2
.
The
filter
parameters
were
selected
mostly
through
trial-and-error
—
various
parameter
values
were
tried
and
filtration
results
examined
.
We
used
a
threshold
value
Kth
=
0.80
to
generate
the
lexical
analogies
,
but
the
evaluation
was
performed
at
ten
different
thresholds
from
0.98
to
0.80
in
0.02
decrements
.
Table
2
:
Experiment
Parameter
Values
Evaluation
Protocol
An
objective
evaluation
of
our
system
is
difficult
for
two
reasons
.
First
,
lexical
analogies
are
by
definition
subjective
;
what
constitutes
a
'
good
'
lexical
analogy
is
debatable
.
Secondly
,
there
is
no
gold
standard
of
lexical
analogies
to
which
we
can
compare
.
For
these
reasons
,
we
adopted
a
subjective
evaluation
protocol
that
involved
human
judges
rating
the
quality
of
the
lexical
analogies
generated
.
Such
a
manual
evaluation
protocol
,
however
,
meant
that
it
was
impractical
to
evaluate
the
entire
output
set
(
which
was
well
in
the
thousands
)
.
Instead
,
we
evaluated
random
samples
from
the
output
and
interpolated
the
results
.
In
total
,
22
human
judges
participated
in
the
evaluation
.
All
judges
were
graduate
or
senior
undergraduate
students
in
English
,
Sociology
,
or
Psychology
,
and
all
were
highly
competent
English
speakers
.
Each
judge
was
given
a
survey
containing
105
lexical
analogies
,
100
of
which
were
randomly
sampled
from
our
output
,
and
the
remaining
five
were
sampled
from
a
control
set
of
ten
human-generated
lexical
analogies
.
All
entries
in
the
control
set
were
taken
from
the
Verbal
Analogy
section
of
the
Scholastic
Aptitude
Test5
and
represented
the
best
possible
lexical
analogies
.
The
judges
were
instructed
to
grade
each
lexical
analogy
with
a
score
from
zero
to
10
,
with
zero
representing
an
invalid
lexical
analogy
(
i.e.
,
when
the
two
word-pairs
share
no
meaningful
underlying
relation
)
and
ten
representing
a
perfect
lexical
analogy
.
To
minimize
inter-judge
subjectivity
,
all
judges
were
given
detailed
instructions
containing
the
definition
and
examples
of
lexical
analogies
.
In
all
,
1000
samples
out
of
the
8373
generated
were
graded
,
each
by
at
least
two
different
judges
.
We
evaluated
the
output
at
ten
threshold
values
,
from
0.98
to
0.80
in
0.02
decrements
.
For
each
threshold
,
we
collected
all
samples
down
to
that
threshold
and
computed
the
following
metrics
:
Coverage
:
The
number
of
lexical
analogies
generated
at
the
current
threshold
over
the
number
of
lexical
analogies
generated
at
the
lowest
threshold
(
8373
)
.
Precision
:
The
proportion
of
samples
at
the
current
threshold
that
scored
higher
than
three
.
These
are
considered
valid
lexical
analogies
.
Note
that
this
is
significantly
more
conservative
than
the
survey
scoring
.
We
want
to
ensure
very
poor
lexical
analogies
were
excluded
,
even
if
they
were
'
valid
'
according
to
the
judges
.
Quality
:
The
average
score
of
all
samples
at
the
current
threshold
,
divided
by
ten
to
be
in
the
same
scale
as
the
other
metrics
.
Goodness
:
The
proportion
of
samples
at
the
current
threshold
that
scored
within
10
%
of
the
average
score
of
the
control
set
.
These
are
considered
human
quality
.
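The four metrics can be computed from the graded samples roughly as follows. The data layout is ours, and the "within 10% of the control average" goodness test is interpreted here as scoring at least 90% of that average.

```python
# Sketch of the evaluation metrics at one threshold.

def metrics(samples, generated_here, generated_total, control_avg):
    """samples: list of 0-10 judge scores for output at this threshold.
    generated_here / generated_total: analogy counts at this threshold
    and at the lowest threshold; control_avg: mean control-set score."""
    coverage = generated_here / generated_total
    precision = sum(s > 3 for s in samples) / len(samples)  # valid if > 3
    quality = (sum(samples) / len(samples)) / 10.0          # rescale to [0, 1]
    goodness = sum(s >= 0.9 * control_avg for s in samples) / len(samples)
    return coverage, precision, quality, goodness
```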
(Footnote 5: http://www.collegeboard.com/)
Note
that
recall
was
not
an
evaluation
metric
because
there
does
not
exist
a
method
to
determine
the
true
number
of
lexical
analogies
in
the
input
corpus
.
Table
3
summarizes
the
result
of
the
control
set
,
and
Figure
2
:
Left
summarizes
the
result
of
the
lexical
analogies
our
system
generated
.
Table
4
lists
some
good
and
some
poor
lexical
analogies
our
system
generated
,
along
with
some
of
their
shared
features
.
Table 3: Result of the Control Set
As
Figure
2
shows
,
our
system
performed
fairly
well
,
generating
valid
lexical
analogies
with
a
precision
around
70
%
.
The
quality
of
the
generated
lexical
analogies
was
reasonable
,
although
not
at
the
level
of
human generation
.
On
the
other
hand
,
a
small
portion
(
19
%
at
the
highest
threshold
)
of
our
output
was
of
very
high
quality
,
comparable
to
the
best
human-generated
lexical
analogies
.
Our
result
also
showed
that
there
was
a
correspondence
between
the
score
our
system
assigned
to
each
generated
lexical
analogy
and
its
quality
.
Precision
,
quality
,
and
goodness
all
declined
steadily
toward
lower
thresholds
:
precision
0.70-0.66
,
quality
0.54-0.49
,
and
goodness
0.19-0.14
.
Error
Analysis
Despite
our
aggressive
filtration
of
irrelevant
word-pairs
and
features
,
noise
was
still
the
most
significant
problem
in
our
output
.
Most
low-scoring
samples
contained
at
least
one
word-pair
that
did
not
have
a
meaningful
and
clear
underlying
relation
;
for
example
,
(
guy
,
ball
)
and
(
issue
,
point
)
.
As
mentioned
,
noise
originated
from
mistakes
in
the
input
data
,
errors
in
sentence
segmentation
and
parsing
,
as
well
as
mismatches
between
dependencies
and
semantic
relatedness
.
An
example
of
the
latter
involved
the
frequent
usage
of
the
preposition
"
of
"
in
various
constructs
.
In
the
sentence
"
the
company
takes
advantage
of the
new
legislation
"
,
for
example
,
the
dependency
structure
associates
company
with
advantage
,
whereas
the
semantic
relation
clearly
lies
between
company
and
legislation
.
All
three
of
our
evaluation
metrics
(
precision
,
quality
,
and
goodness
)
were
negatively
affected
by
noise
.
Polysemic
words
,
as
well
as
words
which
were
heavily
context-dependent
,
also
posed
a
problem
.
For
example
,
one
of
the
lexical
analogies
generated
in
the
experiment
was
(
resolution
,
house
)
and
(
legislation
,
senate
)
.
This
lexical
analogy
only
makes
sense
if
"
house
"
is
recognized
as
referring
to
the
House
of
Representatives
,
which
is
often
abbreviated
as
"
the
House
"
in
news
articles
.
Polysemy
also
negatively
affected
all
three
of
our
evaluation
metrics
,
although
to
a
lesser
extent
for
precision
.
Finally
,
our
system
had
difficulties
differentiating
semantic
relations
of
different
granularity
.
The
underlying
relations
of
(
relation
,
country
)
and
(
tie
,
United States
)
,
for
example
,
are
similar
,
yet
they
do
not
form
a
good
lexical
analogy
because
the
relations
are
at
different
levels
of
granularity
(
countries
in
general
in
the
former
,
and
a
particular
country
in
the
latter
)
.
Undifferentiated
granularity
affected
quality
and
goodness
,
but
it
did
not
have
a
significant
effect
on
precision
.
To
evaluate
how
SGT
compares
to
LRA-S
,
we
repeated
the
experiment
using
SGT
for
relation-matching
.
We
set
K
(
maximum
path
length
)
to
3
,
and
Ksgt
(
cosine
threshold
)
to
0.2
;
these
values
were
again
determined
largely
through
trial-and-error
.
To
train
SGT
,
we
used
90
lexical
analogies
graded
by
human
judges
from
the
previous
experiment
.
In
order
to
facilitate
a
fair
comparison
to
LRA-S
,
we
selected
Kth
values
that
allowed
SGT
to
generate
the
same
number
of
lexical
analogies
as
LRA-S
did
at
each
threshold
interval
.
Running
on
the
same
2.1
GHz
processor
,
SGT
finished
in
just
over
eight
minutes
,
which
is
almost an order of magnitude
faster
than
LRA-S
'
65
minutes
.
SGT
also
used
significantly
less
memory
,
as
the
similarity
graph
was
efficiently
stored
in
an
adjacency
list
.
The
sets
of
lexical
analogies
generated
by
the
two
algorithms
were
quite
similar
,
overlapping
approximately
50
%
at
all
threshold
levels
.
The
significant
overlap
between
SGT
and
LRA-S
'
outputs
allowed
us
to
evaluate
SGT
using
the
samples
collected
from
the
previous
surveys
instead
of
conducting
a
new
round
of
human
grading
.
Specifically
,
we
identified
previously
graded
samples
that
had also been generated by SGT, and used these samples as the evaluation data points for SGT.

Figure 2: System Evaluation Results

Table 4: Examples of Good and Poor Lexical Analogies Generated. Good examples: (building, office) and (museum, collection); (researcher, experiment) and (doctor, surgery). Poor example: (president, change) and (bush, legislation), with shared features such as "_ -subj- be down" and "_ -subj- be up".
At
the
lowest
threshold
(
where
8373
lexical
analogies
were
generated
)
,
we
were
able
to
reuse
533
samples
out
of
the
original
1000
samples
.
Figure
2
:
Right
summarizes
the
performance
of
the
system
using
SGT
for
relation-matching
.
As
the
figure
shows
,
SGT
performed
very
similarly
to
LRA-S
.
Both
SGT
's
precision
and
quality
scores
were
slightly
higher
than
LRA-S
,
but
the
differences
were
very
small
and
hence
were
likely
due
to
sample
variation
.
The
goodness
scores
between
the
two
algorithms
were
also
comparable
.
In
the
case
of
SGT
,
however
,
the
score
fluctuated
instead
of
monotonically
decreased
.
We
attribute
the
fluctuation
to
the
smaller
sample
size
.
As
the
samples
were
drawn
exclusively
from
the
portion
of
SGT
's
output
that
overlapped
with
LRA-S
'
output
,
we
needed
to
ensure
that
the
samples
were
not
strongly
biased
and
that
the
reported
result
was
not
better
than
SGT
's
actual
performance
.
To
validate
the
result
,
we
conducted
an
additional
experiment
involving
a
single
human
judge
.
The
judge
was
given
a
survey
with
50
lexical
analogies
,
25
of
which
were
sampled
from
the
overlapping
portion
of
SGT
and
LRA-S
'
outputs
,
and
25
from
lexical
analogies
generated
only
by
SGT
.
Table
5
summarizes
the
result
of
this
experiment
.
As
the
table
demonstrates
,
the
results
from
the
two
sets
were
comparable
with
small
differences
.
Moreover
,
the
differences
were
in
favour
of
the
SGT-only
portion
.
Therefore
,
either
there
was
no
sampling
bias
at
all
,
or
the
sampling
bias
negatively
affected
the
result
.
As
such
,
SGT
's
actual
performance
was
at
least
as
good
as
reported
,
and
may
have
been
slightly
higher
.
Table 5: Precision and Goodness of the Overlapping and SGT-Only Samples
We
conclude
that
SGT
is
indeed
a
viable
alternative
to
LRA-S
.
SGT
generates
lexical
analogies
that
are
of
the
same
quality
as
LRA-S
,
while
being
significantly
faster
and
more
scalable
.
On
the
other
hand
,
an
obvious
limitation
of
SGT
is
that
it
is
a
supervised
algorithm
requiring
manually
labelled
training
data
.
We
claim
this
is
not
a
severe
limitation
because
there
are
only
a
few
variables
to
train
(
i.e.
,
the
weights
)
,
hence
only
a
small
set
of
training
data
is
required
.
Moreover
,
a
supervised
algorithm
can
be
advantageous
in
some
situations
;
for
example
,
it
is
easier
to
tailor
SGT
to
a
particular
input
corpus
.
5
Related
Work
The
study
of
analogy
in
the
artificial
intelligence
community
has
historically
focused
on
computational
models
of
analogy-making
.
French
(
2002
)
and
Hall
(
1989
)
provide
two
of
the
most
complete
surveys
of
such
models
.
Veale
(
2004
;
2005
)
generates
lexical
analogies
from
WordNet
(
Fellbaum
,
1998
)
and
HowNet
(
Dong
,
1988
)
by
dynamically
creating
new
type
hierarchies
from
the
semantic
information
stored
in
these
lexicons
.
Unlike
our
corpus-based
generation
system
,
Veale
's
algorithms
are
limited
by
the
lexicons
in
which
they
operate
,
and
generally
are
only
able
to
generate
near-analogies
such
as
(
Christian
,
Bible
)
and
(
Muslim
,
Koran
)
.
Turney
's
(
2006
)
Latent
Relational
Analysis
is
a
corpus-based
algorithm
that
computes
the
relational
similarity
between
word-pairs
with
remarkably
high
accuracy
.
However
,
LRA
is
focused
solely
on
the
relation-matching
problem
,
and
by
itself is
insufficient
for
lexical
analogy
generation
.
6
Conclusion
and
Future
Work
We
have
presented
a
system
that
is
,
to
the
best
of
our
knowledge
,
the
first
system
capable
of
generating
lexical
analogies
from
unstructured
text
data
.
Empirical
evaluation
shows
that
our
system
performed
fairly
well
,
generating
valid
lexical
analogies
with
a
precision
of
about
70
%
.
The
quality
of
the
generated
lexical
analogies
was
reasonable
,
although
not
at
the
level
of
human
performance
.
As
part
of
the
system
,
we
have
also
developed
a
novel
algorithm
for
computing
relational
similarity
that
rivals
the
performance
of
the
current
state-of-the-art
while
being
significantly
faster
and
more
scalable
.
One
of
our
immediate
tasks
is
to
complement
dependency
patterns
with
additional
features
.
In
particular
,
we
expect
semantic
features
such
as
word
definitions
from
machine-readable
dictionaries
to
improve
our
system
's
ability
to
differentiate
between
different
senses
of
polysemic
words
,
as
well
as
different
granularities
of
semantic
relations
.
We
also
plan
to
take
advantage
of
our
system
's
flexibility
and
relax
the
constraints
on
dependency
paths
so
as
to
generate
more-varied
lexical
analogies
,
e.g.
,
analogies
involving
verbs
and
adjectives
.
A
potential
application
of
our
system
,
and
the
original
inspiration
for
this
research
,
would
be
to
use
the
system
to
automatically
enrich
ontologies
by
spreading
semantic
relations
between
lexical
analogues
.
For
example
,
if
words
w1
and
w2
are
related
by
relation
r
,
and
(
w1
,
w2
)
and
(
w3
,
w4
)
form
a
lexical
analogy
,
then
it
is
likely
that
w3
and
w4
are
also
related
by
r.
A
dictionary
of
lexical
analogies
therefore
would
allow
an
ontology
to
grow
from
a
small
set
of
seed
relations
.
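The proposed enrichment could be sketched as follows; the helper is hypothetical and assumes, as the text does, that a known relation simply transfers to its analogue.

```python
# Sketch of relation spreading: propagate each known seed relation to
# analogous word-pairs until no new relations can be inferred.

def spread_relations(seed, analogies):
    """seed: dict (w1, w2) -> relation; analogies: list of word-pair
    pairs forming lexical analogies."""
    inferred = dict(seed)
    changed = True
    while changed:          # iterate so relations flow along chains
        changed = False
        for wp1, wp2 in analogies:
            for a, b in ((wp1, wp2), (wp2, wp1)):
                if a in inferred and b not in inferred:
                    inferred[b] = inferred[a]
                    changed = True
    return inferred
```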
In
this
way
,
lexical
analogies
become
bridges
through
which
semantic
relations
flow
in
a
sea
of
ontological
concepts
.
Acknowledgments
We
thank
the
reviewers
of
EMNLP
2007
for
valuable
comments
and
suggestions
.
This
work
was
supported
in
part
by
the
Ontario
Graduate
Scholarship
Program
,
Ontario
Innovation
Trust
,
Canada
Foundation
for
Innovation
,
and
the
Natural
Science
and
Engineering
Research
Council
of
Canada
.
