This paper describes ETK (Ensemble of Transformation-based Keys), a new algorithm for inducing search keys for name filtering. ETK has the low computational cost and ability to filter by phonetic similarity characteristic of phonetic keys such as Soundex, but is adaptable to alternative similarity models. The accuracy of ETK in a preliminary empirical evaluation suggests that it is well-suited for phonetic filtering applications such as recognizing alternative cross-lingual transliterations.
1 Introduction
The task of name matching — recognizing when two orthographically distinct strings are likely to denote the same individual — occurs in a wide variety of important applications, including law enforcement, national security, and maintenance of government and commercial records. Coreference resolution, speech understanding, and detection of aliases and duplicate names all require name matching. The orthographic variations that give rise to the name-matching task can result from a variety of factors, including transcription and OCR errors and spelling variations. In many applications, cross-lingual transliterations are a particularly important source of variation. For example, romanized Arabic names are phonetic transcriptions of sounds that have no direct equivalent in English; e.g., "Mohamed" and "Muhammet" are two of many possible transliterations of the same Arabic name.
Name matching can be viewed as a type of range query in which the input is a set of patterns (such as names on an immigration-control watch list), a collection of text strings (such as a passenger list), a distance metric for calculating the degree of relevant dissimilarity between pairs of strings, and a match threshold expressing the maximum allowable distance between matching names. The goal is to find all text/pattern pairs whose distance under the metric is less than or equal to the threshold. In the simplest case, patterns and the text strings with which they are matched are both individual words. In the general case, the text may not be segmented into strings corresponding to possible names.
Distance metrics for name matching are typically computationally expensive. For example, determining the edit distance between two strings of length n and m requires, in the general case, nm steps. Metrics based on algorithms that learn from examples of strings that should match (Bilenko et al., 2003; Ristad and Yianilos, 1998) and metrics that use phonetic similarity criteria, e.g., (Kondrak, 2000), are no less expensive than edit distance.
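The quadratic cost is visible directly in the dynamic-programming formulation of edit distance, sketched below (illustrative Python, not from the paper; the function name is ours). The table has (n+1) × (m+1) cells, each filled in constant time, hence nm steps:

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance between a and b.

    Fills an (len(a)+1) x (len(b)+1) table one row at a time,
    which is what makes per-pair comparison expensive at scale.
    """
    n, m = len(a), len(b)
    prev = list(range(m + 1))  # distances from the empty prefix of a
    for i in range(1, n + 1):
        curr = [i]
        for j in range(1, m + 1):
            curr.append(min(prev[j] + 1,                        # deletion
                            curr[j - 1] + 1,                    # insertion
                            prev[j - 1] + (a[i - 1] != b[j - 1])))  # substitution
        prev = curr
    return prev[m]
```

For the pairs discussed later in this paper, `edit_distance("LAYTON", "LEIGHTON")` is 4 even though the two names are phonetically identical, while the phonetically dissimilar pair `"BOUGH"`/`"ROUGH"` is at distance 1.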
The computational expense of distance metrics means that tractable name matching on large texts typically requires an inexpensive, high-recall filtering step to find a subset of the original text to which the expensive similarity metric will be applied. Desiderata for filtering include the following:
High recall. The recall of the entire name-matching process is bounded by the recall of the filtering step, so high filtering recall is essential.
Efficiency. Filtering is useful only to the extent that it requires less computational expense than applying the similarity metric to each pattern/text pair. The computational expense of a filtering algorithm itself must therefore be less than the cost of the metric calls eliminated through the filtering process. Typically, the cost of the metric is so much higher than the filtering cost that the latter can be neglected. Under these circumstances, precision is a satisfactory proxy for efficiency.

Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 906-914, Prague, June 2007. © 2007 Association for Computational Linguistics
Adaptability to specific distance metrics. High precision and recall are achievable in filtering only if the filtering criterion corresponds to the distance metric. For example, if a distance metric is based on phonetic differences between strings, a filtering algorithm that selects candidate text strings based on orthographic differences may perform poorly. Similarly, poor performance may result from use of a filtering algorithm based on phonetic differences if the distance metric is based on orthographic differences. For example, "LAYTON" and "LEIGHTON" differ by a large edit distance but are phonetically identical (in most dialects), whereas "BOUGH" and "ROUGH" are orthographically similar but phonetically dissimilar. An ideal filtering algorithm should be adaptable to any particular distance metric.
This paper describes ETK (Ensemble of Transformation-based Keys), a new algorithm for inducing filters that satisfy the three criteria above. ETK is similar to phonetic search key algorithms such as Soundex and shares phonetic search key algorithms' low computational expense and ability to filter by phonetic similarity. However, ETK has the advantage that it is adaptable to alternative distance metrics and is therefore applicable to a wider range of circumstances than static key algorithms.
The next section describes previous work in name filtering. Section 3 describes the ETK algorithm in detail, and a preliminary evaluation on English and German surnames is set forth in Section 4.
2 Previous Work
The division of the retrieval task into an inexpensive, high-recall filtering stage followed by a more expensive high-precision stage emerged independently in a variety of different areas of computer science. This approach is termed two-stage retrieval in the Information Retrieval literature (Shin and Zhang, 1998), MAC/FAC by some researchers in analogy (Gentner and Forbus, 1991), blocking in the statistical record linkage literature (Cohen et al., 2003), and filtering in the approximate string matching literature (Navarro, 2001). The two most common approaches to filtering that have been applied to name matching are indexing by phonetic keys and indexing by ngrams. Two less well known filtering algorithms that often have higher recall than filtering by phonetic keys or ngrams are pivot-based retrieval and partition filtering.
Phonetic Key Indexing. In phonetic key indexing, names are indexed by a phonetic representation created by a key function that maps sequences of characters to phonetic categories. Such key functions partition the name space into equivalence classes of names having identical phonetic representations. Each member of a partition is indexed by the shared phonetic representation. Numerous alternatives to Soundex have been proposed (… 2000; Hodge and Austin, 2001; Christen, 2006). While each of these alternatives has some advantages over Soundex, none is adaptable to alternative distance metrics.
For purposes of comparison, Phonex (Gadd, 1990) was included in the evaluation below because it was found to be the most accurate phonetic key for last names in an evaluation by (Christen, 2006).
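As a concrete example of a static phonetic key, a simplified Soundex function can be sketched as follows (illustrative Python; this sketch omits the special H/W separator rule of full American Soundex, which does not affect the examples below):

```python
# Classic Soundex consonant classes: letters sharing a digit sound similar.
_CODES = {c: d for d, letters in {
    "1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
    "4": "L", "5": "MN", "6": "R"}.items() for c in letters}

def soundex(name: str) -> str:
    """Simplified Soundex key: first letter plus three consonant-class digits.

    Vowels (and H, W, Y here) are skipped and runs of the same digit are
    collapsed, so similar-sounding names map to the same key.
    """
    name = name.upper()
    key = name[0]
    prev = _CODES.get(name[0], "")
    for c in name[1:]:
        code = _CODES.get(c, "")
        if code and code != prev:  # skip vowels, collapse adjacent duplicates
            key += code
        prev = code
    return (key + "000")[:4]  # pad or truncate to the standard 4 characters
```

Under this key, the transliterations MOHAMED and MUHAMMET from the introduction both map to `M530` and thus land in the same equivalence class, illustrating why such keys are effective phonetic filters despite their low cost.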
Ngram Filtering. The second common filtering algorithm for names is ngram indexing, under which each pattern string is indexed by every n-element substring, i.e., every sequence of n contiguous characters occurring in the pattern string (typically, the original string is padded with special leading and trailing characters to distinguish the start and end of the name). The candidates for each target string are retrieved using the ngrams in the target as indices (Cohen et al., 2003). Typical values for n are 3 or 4.
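A minimal sketch of ngram indexing and retrieval (illustrative Python; the function names are ours):

```python
from collections import defaultdict

def ngrams(s: str, n: int = 3) -> set:
    """All n-grams of s, padded so prefixes and suffixes are distinguished."""
    padded = "^" * (n - 1) + s + "$" * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def build_index(patterns, n=3):
    """Map each n-gram to the set of patterns containing it."""
    index = defaultdict(set)
    for p in patterns:
        for g in ngrams(p, n):
            index[g].add(p)
    return index

def candidates(index, text, n=3):
    """Union of patterns sharing at least one n-gram with the text string."""
    out = set()
    for g in ngrams(text, n):
        out |= index.get(g, set())
    return out
```

For example, querying an index over {LAYTON, LEIGHTON, ROUGH} with CREIGHTON retrieves LEIGHTON (shared trigrams such as EIG and IGH) but not ROUGH, whose trigrams do not overlap with the query's.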
Pivot-Based Retrieval. Pivot-based retrieval techniques are applicable to domains, such as name matching, in which entities are not amenable to vector representation but for which the distance metric satisfies the triangle inequality1 (Chavez et al., 2001). The key idea is to organize the index around a small group of elements, called pivots.
In retrieval, the distance between the query probe q and any element e can be estimated based on the distances of each to one or more pivots. There are numerous pivot-based metric space indexing algorithms. An instructive survey of these algorithms is set forth in (Chavez et al., 2001). One of the oldest, and often best-performing, pivot-based indices is Burkhard-Keller Trees (BKT) (Burkhard and Keller, 1973; Baeza-Yates and Navarro, 1998). BKT is suitable for discrete-valued distance metrics.
Construction of a BKT starts with selection of an arbitrary element as the root of the tree. The ith child of the root consists of all elements of distance i from the root. A new BKT is recursively constructed for each child until the number of elements in a child falls below a predefined bucket size.
A range query on a BKT with distance metric d, probe q, range k, and pivot p is performed as follows. If the BKT is a leaf node, then the distance metric d is applied between q and each element of the leaf node, and those elements e for which d(q, e) ≤ k are returned. Otherwise, all subtrees with index i for which |d(q, p) − i| ≤ k are recursively searched.
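The construction and range-query procedures just described can be sketched as follows (illustrative Python with a compact unit-cost edit distance as the discrete metric; all names are ours):

```python
from collections import defaultdict

def edit(a, b):
    """Compact unit-cost edit distance, used as the discrete metric."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def build_bkt(elements, dist, bucket_size=2):
    """Burkhard-Keller tree: the i-th child of a node holds all elements
    at distance i from that node's pivot."""
    if len(elements) <= bucket_size:
        return ("leaf", list(elements))
    root, *rest = elements
    children = defaultdict(list)
    for e in rest:
        children[dist(root, e)].append(e)
    return ("node", root,
            {i: build_bkt(group, dist, bucket_size)
             for i, group in children.items()})

def range_query(tree, q, k, dist):
    """All elements within k of q.  Recall is guaranteed: by the triangle
    inequality, pruning subtrees with |d(q, pivot) - i| > k is safe."""
    if tree[0] == "leaf":
        return [e for e in tree[1] if dist(q, e) <= k]
    _, pivot, children = tree
    d = dist(q, pivot)
    hits = [pivot] if d <= k else []
    for i, sub in children.items():
        if abs(d - i) <= k:
            hits += range_query(sub, q, k, dist)
    return hits
```

Note that each internal node visited costs one metric call, which is why the precision accounting for BKT in the evaluation below includes internal calls.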
While all names within k of a query are guaranteed to be retrieved by a BKT (i.e., recall is 100%), there are no guarantees on precision.

1 Edit distance satisfies the triangle inequality because any string A can be transformed into another string C by first transforming A to any other string B, then transforming B into C. Thus, edit-distance(A, C) cannot be greater than edit-distance(A, B) + edit-distance(B, C) for any strings A, B, and C.
During search, one application of the distance metric is required at each internal node traversed, and a distance-metric application is required for each candidate element in leaf nodes reached during the traversal. The number of nodes searched is exponential in k (Chavez et al., 2001).
Partition Filtering. Partition filtering (Wu and Manber, 1991; Navarro and Baeza-Yates, 1999) is an improvement over ngram filtering that relies on the observation that if a pattern string P of length m is divided into k + 1 segments of length ⌊m/(k+1)⌋, then any string that matches P with at most k errors must contain an exact match for at least one of the segments (intuitively, it would take at least k + 1 errors, e.g., edit operations, to alter all of these segments). Strings indexed by ⌊m/(k+1)⌋-length segments can be retrieved by an efficient exact string matching algorithm, such as suffix trees or Aho-Corasick trees. This is necessary because partitions, unlike ngrams, vary in length.
Partition filtering differs from ngram filtering in two respects. First, ngrams overlap, whereas partition filtering involves partitioning each string into non-overlapping segments. Second, the choice of n in ngram filtering is typically independent of k, whereas the size of the segments in partition filtering is chosen based on k. Since in most applications n is independent of k, ngram retrieval, like phonetic key indexing, lacks any guaranteed lower bound on recall, whereas partition filtering guarantees 100% recall when the distance metric is edit distance.
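The pigeonhole argument behind partition filtering can be sketched as follows (illustrative Python; a production index would store the segments in an Aho-Corasick tree rather than scanning the text, and the function names are ours):

```python
def partitions(pattern: str, k: int) -> list:
    """Split the pattern into k+1 contiguous, non-overlapping segments.

    If a string matches the pattern with at most k errors, at least one
    segment survives unchanged: k errors cannot touch all k+1 pieces.
    """
    m = len(pattern)
    bounds = [round(i * m / (k + 1)) for i in range(k + 2)]
    return [pattern[bounds[i]:bounds[i + 1]] for i in range(k + 1)]

def partition_filter(patterns, text, k):
    """High-recall candidate set: patterns with at least one segment
    occurring verbatim in the text."""
    return [p for p in patterns
            if any(seg and seg in text for seg in partitions(p, k))]
```

For k = 1, LEIGHTON splits into LEIG and HTON, so any text containing either segment exactly is retained as a candidate; candidates are then verified with the full distance metric.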
3 The ETK Algorithm

Any key function partitions the universe of strings into equivalence classes of strings that share a common key. If a key function is to serve as a filter, matching names must be members of the same equivalence class. However, no single partition can produce equivalence classes that both include all matching pairs and exclude all non-matching pairs.2
A search key that creates partitions in which there is a low probability that non-matching pairs share a common equivalence class will have high precision, although possibly low recall. However, the recall of an ensemble of search keys, each having non-zero recall and each being independent of the others, can be expected to be greater than the recall of any individual key.
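This independence argument can be made concrete: a matching pair is missed only when every key in the ensemble misses it, so under an idealized independence assumption (an illustration of the argument, not part of ETK itself) the ensemble's expected recall is one minus the product of the individual miss rates:

```python
def ensemble_recall(recalls):
    """Expected recall of a union of independent filters: a matching pair
    is missed only if every key function misses it."""
    miss = 1.0
    for r in recalls:
        miss *= 1.0 - r
    return 1.0 - miss
```

For example, four independent keys each with recall 0.5 yield an expected ensemble recall of 1 − 0.5⁴ = 0.9375, far above any individual key.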
A high-precision and high-recall index can therefore be constructed if one can find, for a given similarity metric and match threshold, a sufficiently large set of key functions that (1) are independent, (2) each have high precision under the metric and threshold, and (3) have non-zero recall.
The objective of ETK is to learn a set of independent, high-precision key functions from training data consisting of equivalence classes of names that satisfy the matching criteria. The similarity metric and threshold are implicit in the training data. Thus, under this approach a key function can be learned even if the similarity model is unknown, provided that sufficient equivalence classes are available.
For each equivalence class, ETK attempts to find the shortest transformation rules capable of converting all members of the equivalence class into an identical orthographic representation. The entire collection of transformation rules for all equivalence classes, which in general has many inconsistencies, is then partitioned into separate consistent subsets. Each subset of transformation rules constitutes an independent key function. Each pattern name is indexed by each key produced by applying a key function to it, and the candidate matches for a new name consist of all pattern names that share at least one key.
The equivalence classes of matching names can be obtained either through some a priori source (such as alias lists or manual construction) or by applying the similarity metric to pairs in a training set, e.g., repeated leave-one-out retrievals with a known distance metric. In the former case, the keys are purely empirical; in the latter, the key functions are in effect a way of compiling the distance metric to speed retrieval.
2 … and C into a different equivalence class, a query on C would require a partition that puts B and C in the same equivalence class and A in a different equivalence class, and a query on B would require a partition in which all three were in the same equivalence class. Thus, three independent keys would be needed to satisfy all three queries while excluding non-matching names.
Inducing Transformation Rules. The inductive process starts with a collection of equivalence classes under a given distance metric and match threshold k. A collection of transformation rules is derived from these equivalence classes as follows. For each equivalence class EC:
• The element of EC with the least mean pairwise edit distance to the other class members (breaking ties by preferring shorter elements) is selected as the centroid. For example, if EC is {LEIGHTON, LAYTON, SLEIGHTON}, then LEIGHTON would be the centroid because it has a smaller edit distance to the other elements than they do to each other.
• For each element E other than the centroid, dynamic programming is used to find an alignment of E with the centroid that maximizes the number of corresponding identical characters.3 For example, the alignment of LEIGHTON and LAYTON would be:

LAY--TON
LEIGHTON
• For each character c of the centroid, all windows of characters in E of length from 1 to some constant maxWindow, centered on the character in the source corresponding to c, are found, skipping blank characters. Each mapping from a window to c constitutes a rule. For example, for maxWindow 7 and the alignment above, the transformation rules for the E in LEIGHTON would be:
3 See (Damper et al., 2004) for details on alignment by dynamic programming. The approach taken here assigns a slightly higher association weight to aligned identical consonants than to aligned identical vowels, so that, ceteris paribus, consonant alignment is preferred to vowel alignment, and assigns a slightly higher association weight to non-identical letters that are both vowels or both consonants than to vowel/consonant alignments.
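The centroid-selection and rule-extraction steps above can be sketched as follows (illustrative Python; we restrict windows to odd lengths so they center on a character, use '-' for alignment gaps, and all function names are ours):

```python
def edit(a, b):
    """Compact unit-cost edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def centroid(equivalence_class, dist):
    """Member with the least mean pairwise distance to the other members,
    breaking ties by preferring shorter elements."""
    def mean_dist(e):
        others = [x for x in equivalence_class if x != e]
        return sum(dist(e, x) for x in others) / len(others)
    return min(equivalence_class, key=lambda e: (mean_dist(e), len(e)))

def extract_rules(aligned_variant, aligned_centroid, max_window=7):
    """Rules mapping a window of the variant, centered on the character
    aligned with centroid position i, to the centroid character there.
    A '-' on the centroid side yields a deletion rule."""
    rules = set()
    for i, c in enumerate(aligned_centroid):
        for w in range(1, max_window + 1, 2):  # odd lengths center on i
            lo, hi = i - w // 2, i + w // 2 + 1
            if lo < 0 or hi > len(aligned_variant):
                continue
            window = aligned_variant[lo:hi].replace("-", "")  # skip blanks
            if window:
                rules.add((window, c))
    return rules
```

For the LAY--TON / LEIGHTON alignment, the rules for the E in LEIGHTON include A → E and LAY → E, matching the subdivision by LHS discussed below.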
Transformation rules derived from multiple equivalence classes typically have many inconsistencies, i.e., rules with identical left-hand sides (LHSs) but different right-hand sides (RHSs). All RHSs for a given LHS are grouped together and ranked by frequency of occurrence in the training data. For example, the frequency of alternative rules for the middle characters LAY and LEI for the U.S. name pronunciation set with k = 1 discussed below is:
Key Formation. The transformation rules are subdivided in two different ways: by LHS, e.g., separating rules for LAY from those for LEI, and by RHS frequency, e.g., separating LAY → E (the most frequent rule for LAY) from LAY → A (the next most frequent). The highest-frequency RHS rules from the example above are:

and the next most frequent are:
If rules are divided into l LHS subsets, and each subset is further subdivided by taking the r highest ranked RHSs (with RHSs ranked lower than r ignored), the result is a total of lr subsets. Each of these lr subsets defines a key function.
For each position in a word to which the key function is to be applied (padded with leading and trailing markers), the rule with the longest (i.e., most specific) LHS that matches the window centered at that position is used to determine the corresponding character in the key. If no rules apply, the character in the key is the same as that in the original word.
For example, suppose that the word to which the key is to be applied is CREIGHTON and the transformations include LEIGHTO → -, EIGHT → -, and IGH → G. The character in the key corresponding to the G in CREIGHTON would be - (i.e., a deletion) because EIGHT is the longest LHS matching at that position.
The key consists of the concatenation of the RHSs produced by successively applying the key function to each position in the original word.
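The key-generation step can be sketched as follows (illustrative Python; the padding with boundary markers is omitted, windows are restricted to odd lengths, and `apply_key` is our name for the procedure):

```python
def apply_key(word: str, rules: dict, max_window: int = 7) -> str:
    """Apply a consistent rule subset as a key function: at each position,
    the longest matching LHS window determines the output character.
    A '-' RHS denotes deletion."""
    out = []
    for i, c in enumerate(word):
        rhs = c                             # default: copy the character
        for w in range(max_window, 0, -2):  # longest (most specific) first
            lo, hi = i - w // 2, i + w // 2 + 1
            if 0 <= lo and hi <= len(word) and word[lo:hi] in rules:
                rhs = rules[word[lo:hi]]
                break
        if rhs != "-":
            out.append(rhs)
    return "".join(out)
```

With the rules from the CREIGHTON example, `apply_key("CREIGHTON", {"LEIGHTO": "-", "EIGHT": "-", "IGH": "G"})` yields `CREIHTON`: the G is deleted because EIGHT is the longest LHS matching at that position.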
This procedure is similar to window-based pronunciation-learning algorithms, e.g., (Sejnowski and Rosenberg, 1987; Bakiri and Dietterich, 1999), but differs in that the objective is not determining a correct pronunciation, but is instead transforming words that are similar under a given metric into a single, consistent orthographic representation.
The lr subsets of transformation rules induced from a given set of equivalence classes define an ensemble of key functions. To filter potential matches with this ensemble, each pattern is added to a hash table indexed by each key generated by a key function. Candidate matches to a text string consist of all patterns indexed by the keys generated from the text by the ensemble of key functions.
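The ensemble index itself is an ordinary hash table, sketched below (illustrative Python; the toy key functions in the usage example stand in for ETK's learned rule subsets, and, as in the evaluation below, the string itself is added as an extra key):

```python
from collections import defaultdict

def build_ensemble_index(patterns, key_functions):
    """Hash table mapping each key produced by any key function to the
    patterns that generate it."""
    index = defaultdict(set)
    for p in patterns:
        for kf in key_functions:
            index[kf(p)].add(p)
        index[p].add(p)  # the original string as an additional key
    return index

def filter_candidates(index, text, key_functions):
    """Patterns sharing at least one key with the text string."""
    keys = {kf(text) for kf in key_functions} | {text}
    out = set()
    for k in keys:
        out |= index.get(k, set())
    return out
```

With toy key functions such as `lambda s: s.replace("AW", "OW")` and `lambda s: s.replace("OLL", "OWL")`, the patterns ROLLINS and ROWLAND can be indexed and the text RAWLINS retrieves only ROLLINS, via the shared key ROWLINS, mirroring the example below.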
For example, suppose that (as is the case with the rule sets for American names, pronunciation distance, and k = 0) patterns ROLLINS and ROWLAND have keys that include {ROWLINS, ROLINS} and {RONLLAND, ROLAN}, respectively, and that text RAWLINS has keys that include {ROWLINS, RALINS}. Then ROLLINS, but not ROWLAND, would be retrieved, because it is indexed by the shared key ROWLINS.4
4 Evaluation
The retrieval accuracy of ETK was compared to that of BKT, filtering by partition, ngram filtering, Phonex, and Soundex on sets of U.S. and German names. The U.S. name set consisted of the 5,000 most common last names identified during the most recent U.S. Census5 which have pronunciations in cmudict, the CMU pronouncing dictionary.6 The German name set consisted of the first 5,963 entries in the HADI-BOMP collection7 whose part of speech is NAM.
The filtering algorithms were compared with respect to two alternative distance metrics. The first was pronunciation distance, which consists of edit distance between pronunciations represented using the cmudict phoneme set for U.S. names and the HADI-BOMP phoneme set for German names. Stress values were removed from cmudict pronunciations, and syllable divisions were removed from HADI-BOMP pronunciations. When there were multiple pronunciations for a name in cmudict, the first was used. In cmudict, for example, MEUSE and MEWES have pronunciation distance of 0 because both have pronunciation M Y UW Z. In HADI-BOMP, HELGARD and HERBART have pronunciation distance 2 because their pronunciations are h E l g a r t and h E r b a r t.

4 In the evaluation below, the original string itself is added as an additional index key. This addition slightly increases both recall and precision.

5 The names were taken from the 1990 U.S. Census collection of 88,799 last names at http://www.census.gov/genealogy/names/names_files.html.

6 http://www.speech.cs.cmu.edu/cgi-bin/cmudict.

7 http://www.ikp.uni-bonn.de/dt/forsch/phonetik/bomp.
The second distance metric was edit distance with unit weights for insertions, deletions, and substitutions. In practice, appropriate distance metrics might be Jaro (Jaro, 1995), Winkler (Winkler, 1999), or some metric specialized to a particular phonetic or error model. Pronunciation and edit distance were chosen as representative of phonetic and non-phonetic metrics.
Training data for ETK for a given language, match threshold k, and similarity metric consisted of all sets of at least 2 names in which every element was within k of some element of the set under the metric.
These training sets were created by performing a retrieval on every name in each collection using BKT, which has 100% recall. For each retrieval, the true positives from BKT's return set were determined by applying the similarity metric between each return set element and the query. If there were at least 2 true positives (including the query itself), the set of true positives was included in the training set.8
was
tested
using
cross
validation
,
so
that
names
in
the
training
set
and
those
in
the
testing
set
were
disjoint
.
Specifically
,
all
names
in
the
testing
set
were
removed
from
each
collection
in
the
training
set
.
If
at
least
2
names
remained
,
the
collection
was
retained
.
ETK
's
maxWindow
size
was
7
,
as
in
the
examples
above
.
In BKT, the bucket size (maximum number of elements in any leaf node) was 2, and the longest element (rather than a random element) was selected as the root of each subtree. The rationale for this choice is that there is typically more variance in distance from a longer word than from a shorter word, and greater variance increases the branching factor in BKT, reducing tree depth and therefore the number of nodes visited during search.

8 Note that each set of true positives is a cluster having the query as its centroid and radius k under the distance metric. The triangle inequality guarantees that the maximum distance between any pair of names in the collection is no greater than 2k.
Since the importance of precision in filtering is that it determines the number of calls to the similarity metric required for a given level of recall, precision figures for BKT include internal calls to the similarity metric, that is, calls during indexing. Thus, the precision of BKT is the number of true positives divided by the number of all positives plus the number of internal metric calls.
In Soundex and Phonex indexing, each name was indexed by its Soundex (Phonex) key. Similarly, in ngram filtering each name was indexed by all its ngrams, with special leading and trailing characters added. Retrieval was performed by finding the Soundex or Phonex encoding or the ngrams of each query and retrieving every name indexed by the Soundex or Phonex encoding or any ngram. Precision was measured with duplicates removed.
In partition filtering, each name was indexed by each of its k + 1 partitions, and the partitions themselves were organized in an Aho-Corasick tree (Gusfield, 1999). Retrieval was performed by applying the Aho-Corasick tree to the query to determine all partitions occurring in the query and retrieving the names corresponding to each partition, removing duplicates.
4.1 Optimizing LHS and RHS Subdivisions
The first experiment was performed to clarify the optimal sizes of l, the number of LHS subdivisions, and r, the number of RHS ranks. ETK was tested on the U.S. name set with k = 1, pronunciation distance as the similarity metric, and 10-fold cross-validation for l ∈ {1, 2, 4, 8, 16, 32} and r ∈ {1, 2}.
As shown in Table 1, when l = 1, r = 2 has higher f-measure than r = 1, but when l is 2 or greater, the best value for r is 1. Overall, the highest f-measure is obtained with l = 8 and r = 1. In the experiments below, the value of 16 was used for l because this leads to slightly higher recall at a small cost in decreased f-measure.
4.2 Comparison of ETK to Other Filter Algorithms
The retrieval accuracy of ETK was compared to that of BKT, partition, ngram, Phonex, and Soundex on the U.S. and German name sets for pronunciation distance with k ∈ {0, 1, 2} and for edit distance with k ∈ {1, 2}.
In tests involving pronunciation distance, BKT was tested under two conditions: with the pronunciation distance function available to BKT during indexing and retrieval, and with the distance function unavailable, so that BKT indexing and retrieval were performed on the surface form even though the actual similarity metric was pronunciation distance. This is intended to simulate the situation in which examples of matching names are available but the underlying similarity metric is unknown.
Ngram and partition filtering were performed on letters only.
Tables 2 and 3 show recall, precision, and f-measure for pronunciation distance on U.S. and German names, respectively, with k ∈ {0, 1, 2}, l = 16, and r = 1. ETK has the highest f-measure under all conditions because its precision is consistently higher than that of the other algorithms. This is because each key function in ETK applies only transformations representing orthographic differences between names in the same equivalence class. Thus, the transformations are very conservative.
BKT always has recall of 1.0 when the pronunciation model is available, but in many cases a model may be unavailable. When no model is available, no single algorithm consistently has the highest recall. Ngrams, partition, Phonex, and BKT each had the highest recall in at least one language/error-threshold combination.
Table 2: Recall, precision, and f-measure for pronunciation distance on U.S. surnames. BKT-NM is BKT without the pronunciation model. Best results are shown in bold, including the highest recall in addition to BKT.

For edit distance, partition filtering is guaranteed 100% recall (recall 1.0). Again, ETK has the highest f-measure because of its consistently high precision.
The sensitivity of ETK to training set size was tested by performing 50-fold cross-validation with training sets for pronunciation distance on U.S. names of sizes in {48, 96, 191, 381, 762, 1524, 3047}, drawn from the 3,047 equivalence classes in the 5,000 U.S. names with pronunciation distance and k = 1. As shown in Figure 1, the learning curve rises steeply for the entire range of training set sizes considered in this experiment.
5 Conclusion
The experimental results demonstrate the feasibility of basing search keys on transformation rules acquired from examples. If sufficient examples of names that match under a given distance metric and error threshold are available, keys can be induced that lead to good performance in comparison to alternative filtering algorithms.
Moreover, the results involving pronunciation distance illustrate how phonetic keys can be learned that are specific to individual match criteria.

Table 3: Recall, precision, and f-measure for pronunciation distance on German names. K is the maximum permitted error. Best results are shown in bold.

Table 4: Recall, precision, and f-measure for edit distance on U.S. surnames.

Table 5: Recall, precision, and f-measure for edit distance on German names.

Figure 1: F-measure for U.S. names for training sets containing varying numbers of collections, with k = 1, l = 16, and r = 1. Each training instance consists of all names within k of some centroid under the metric.
In filtering under pronunciation distance, ETK's f-measure for German names was similar to its f-measure for U.S. names (actually higher for k ∈ {0, 1}), whereas Soundex and Phonex were approximately an order of magnitude lower.
Although ETK consistently had the highest f-measure in this experiment, it does not follow that ETK is necessarily the most desirable name filter for any particular application. In many applications recall may be much more important than precision. In such cases, it may be essential to choose the highest-recall algorithm notwithstanding a lower f-measure. However, the highest-recall algorithms can lead to a very large number of distance-metric applications.
For example, in some data sets the number of nodes examined by BKT during retrieval is a significant proportion of the entire pattern set.
ETK has the disadvantage of requiring a large set of training examples consisting of equivalence sets of strings that match under the metric and maximum allowable error. Where such large numbers of equivalence sets are unavailable, it may be better to use simpler and less-informed filters.
A number of variations of ETK are possible. For example, keys could consist of finite-state transducers trained from consistent subsets of mappings rather than transformation rules. There are also many possible alternatives to ETK's window-based approach to deriving mappings from examples.
In summary, this work has demonstrated that ensembles of keys induced from equivalence classes of names under a specific distance metric and maximum allowable error can filter names with high f-measure. The experimental results illustrate the benefits both of acquiring keys that are adapted to specific similarity criteria and of indexing with multiple independent keys.
