Unknown
words
are
a
well-known
hindrance
to
natural
language
applications
.
In
particular
,
they
drastically
impact
machine
translation
quality
.
An
easy
way
out
commercial
translation
systems
usually
offer
their
users
is
the
possibility
to
add
unknown
words
and
their
translations
into
a
dedicated
lexicon
.
Recently
,
Stroppa
and
Yvon
(
2005
)
have
shown
how
analogical
learning
alone
deals
nicely
with
morphology
in
different
languages
.
In
this
study
we
show
that
analogical
learning
also
offers
an
elegant
and
effective
solution
to
the
problem
of
identifying
potential
translations
of
unknown
words
.
1
Introduction
Analogical
reasoning
has
received
some
attention
in
cognitive
science
and
artificial
intelligence
(
Gentner
et
al.
,
2001
)
.
It
has
long
been
a
faculty
assessed
in
the
so-called
SAT
Reasoning
tests
used
in
the
application
process
to
colleges
and
universities
in
the
United
States
.
Turney
(
2006
)
has
shown
that
it
is
possible
to
compute
relational
similarities
in
a
corpus
in
order
to
solve
56
%
of
typical
analogical
tests
quizzed
in
SAT
exams
.
The
interested
reader
can
find
in
(
Lepage
,
2003
)
a
particularly
dense
treatment
of
analogy
,
including
a
fascinating
chapter
on
the
history
of
the
notion
of
analogy
.
The
concept
of
proportional
analogy
,
denoted
[
A
:
B
=
C
:
D
]
,
is
a
relation
between
four
entities
which
reads
:
"
A
is
to
B
as
C
is
to
D
"
.
Among
proportional
analogies
,
we
distinguish
formal
analogies
,
that
is
,
ones
that
arise
at
the
graphical
level
,
such
as
[
fournit
:
fleurit
=
fournie
:
fleurie
]
in
French
or
[
believer
:
unbelievable
=
doer
:
undoable
]
in
English
.
Formal
analogies
are
often
good
indices
for
deeper
analogies
(
Stroppa
and
Yvon
,
2005
)
.
Lepage
and
Denoual
(
2005
)
presented
the
system
ALEPH
,
an
intriguing
example-based
system
entirely
built
on
top
of
an
automatic
formal
analogy
solver
.
This
system
has
achieved
state-of-the-art
performance
on
the
IWSLT
task
(
Eck
and
Hori
,
2005
)
,
despite
its
striking
purity
.
As
a
matter
of
fact
,
ALEPH
requires
no
distances
between
examples
,
nor
any
threshold.1
It
does
not
even
rely
on
a
tokenization
device
.
One
reason
for
its
success
probably
lies
in
the
specificity
of
the
BTEC
corpus
:
short
and
simple
sentences
of
a
narrow
domain
.
It
is
doubtful
that
ALEPH
would
still
behave
adequately
on
broader
tasks
,
such
as
translating
news
articles
.
Stroppa
and
Yvon
(
2005
)
propose
a
very
helpful
algebraic
description
of
a
formal
analogy
and
describe
the
theoretical
foundations
of
analogical
learning
which
we
will
recap
shortly
.
They
show
both
its
elegance
and
efficiency
on
two
morphological
analysis
tasks
for
three
different
languages
.
Recently
,
Moreau
et
al.
(
2007
)
showed
that
formal
analogies
of
a
simple
kind
(
those
involving
suffixation
and
/
or
prefixation
)
offer
an
effective
way
to
extend
queries
for
improved
information
retrieval
.
In
this
study
,
we
show
that
analogical
learning
can
be
used
as
an
effective
method
for
translating
unknown
words
or
phrases
.
We
found
that
our
approach
has
the
potential
to
propose
a
valid
translation
for
80
%
of
ordinary
unknown
words
,
that
is
,
words
that
are
not
proper
names
,
compound
words
,
or
numerical
expressions
.
Specific
solutions
have
been
proposed
for
those
token
types
(
Chen
et
al.
,
1998
;
Al-Onaizan
and
Knight
,
2002
;
Koehn
and
Knight
,
2003
)
.
The
paper
is
organized
as
follows
.
1 Some heuristics are applied for speeding up the system.
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 877-886, Prague, June 2007. © 2007 Association for Computational Linguistics
We first recall in Section 2 the
principle
of
analogical
learning
and
describe
how
it
can
be
applied
to
the
task
of
enriching
a
bilingual
lexicon
.
In
Section
3
,
we
present
the
corpora
we
used
in
our
experiments
.
We
evaluate
our
approach
over
two
translation
tasks
in
Section
4
.
We
discuss
related
work
in
Section
5
and
give
perspectives
of
our
work
in
Section
6
.
2
Analogical
Learning
2.1
Principle
Our
approach
to
bilingual
lexical
enrichment
is
an
instance
of
analogical
learning
described
in
(
Stroppa
and
Yvon
,
2005
)
.
A
learning
set
$\mathcal{L} = \{L_1, \ldots, L_N\}$
gathers
N
observations
.
A
set
of
features
computed
on
an
incomplete
observation
X
defines
an
input
space
.
The
inference
task
consists
in
predicting
the
missing
features
which
belong
to
an
output
space
.
We
denote
I
(
X
)
(
resp
.
O
(
X
)
)
the
projection
of
X
into
the
input
(
resp
.
output
)
space
.
The
inference
procedure
involves
three
steps
:
1. Building $\mathcal{E}_I(X) = \{(A, B, C) \in \mathcal{L}^3 \mid [I(A) : I(B) = I(C) : I(X)]\}$, the set of input stems (footnote 2) of X, that is, the set of triplets (A, B, C) which form with X an analogical equation.
2. Building $\mathcal{E}_O(X) = \{Y \mid [O(A) : O(B) = O(C) : Y],\ \forall (A, B, C) \in \mathcal{E}_I(X)\}$, the set of solutions to the analogical equations obtained by projecting the stems of $\mathcal{E}_I(X)$ into the output space.
3. Selecting candidates among $\mathcal{E}_O(X)$.
This
inference
procedure
shares
similarities
with
the
K-nearest-neighbor
(
k-NN
)
approach
.
In
particular
,
since
no
model
of
the
training
material
is
being
learned
,
the
training
corpus
needs
to
be
stored
in
order
to
be
queried
.
Contrary
to
k-NN
,
however
,
the
search
for
closest
neighbors
does
not
require
any
distance
,
but
instead
relies
on
relational
similarities
.
This
purity
has
a
cost
:
while
in
k-NN
inference
,
neighbors
can
be
found
in
time
linear
to
the
training
size
,
in
analogical
learning
,
this
operation
requires
a
computation
time
cubic
in
N
,
the number of observations.
2 In Turney's work (Turney, 2006), a stem designates the first two words of a proportional analogy.
In
many
applications
of
interest
,
including
the
one
we
tackle
here
,
this
is
simply
impractical
and
heuristics
must
be
applied
.
The
first
and
second
steps
of
the
inference
procedure
rely
on
the
existence
of
an
analogical
solver
,
which
we
sketch
in
the
next
section
.
One
important
thing
to
note
at
this
stage
,
is
that
an
analogical
equation
may
have
several
solutions
,
some
being
legitimate
word-forms
in
a
given
language
,
others
being
not
.
Thus
,
it
is
important
to
select the generated solutions wisely, hence Step 3.
In
practice
,
the
inference
procedure
involves
the
computation
of
many
analogical
equations
,
and
a
statistic
as
simple
as
the
frequency
of
a
solution
often
suffices
to
separate
good
from
spurious
solutions
.
2.2
Analogical
Solver
Lepage
(
1998
)
proposed
an
algorithm
for
computing
the
solutions
of
a
formal
analogical
equation
[
A
:
B
=
C
:
?
]
.
We
implemented
a
variant
of
this
algorithm
which
requires
computing
two
edit-distance
tables
,
one
between
A
and
B
and
one
between
A
and
C.
Since
we
are
looking
for
subsequences
of
B
and
C
not
present
in
A
,
insertion
cost
is
null
.
Once
this
is
done
,
the
algorithm
synchronizes
the
alignments
defined
by
the
paths
of
minimum
cost
in
each
table
.
Intuitively
,
the
synchronization
of
two
alignments
(
one
between
A
and
B
,
and
one
between
A
and
C
)
consists
in
composing
in
the
correct
order
subsequences
of
the
strings
B
and
C
that
are
not
in
A.
We
refer
the
reader
to
(
Lepage
,
1998
)
for
the
intricacies
of
this
process
which
is
illustrated
in
Figure
1
for
the
analogical
equation
[
even
:
usual
=
unevenly
:
?
]
.
In
this
example
,
there
are
681
different
paths
that
align
even
and
usual
(
with
a
cost
of
4
)
,
and
1
path
which
aligns
even
with
unevenly
(
with
a
cost
of
0
)
.
This
results
in
681
synchronizations
which
generate
15
different
solutions
,
among
which
only
unusually
is
a
legitimate
word-form
.
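The two costs quoted for this example can be checked with a small dynamic-programming table. The sketch below is only an illustration: it assumes that deletions and substitutions each cost 1 (the text only states that insertions are free), and the function names are ours rather than taken from (Lepage, 1998). It does not enumerate the minimum-cost paths or synchronize them.

```python
def free_insertion_table(a, b):
    """Edit-distance table between a and b in which insertions cost nothing.

    Assumption (not stated in the text): deleting or substituting a
    character of a costs 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i  # the first i characters of a must all be deleted
    # Row 0 stays at 0: inserting characters of b is free.
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = table[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else 1)
            delete = table[i - 1][j] + 1
            insert = table[i][j - 1]  # free insertion
            table[i][j] = min(diagonal, delete, insert)
    return table


def alignment_cost(a, b):
    """Minimum cost of aligning a with b, read from the last cell of the table."""
    return free_insertion_table(a, b)[-1][-1]


print(alignment_cost("even", "usual"))     # 4: no character of "even" occurs in "usual"
print(alignment_cost("even", "unevenly"))  # 0: "even" is a subsequence of "unevenly"
```

Under these assumed costs the value is simply the number of characters of A that cannot be matched, in order, in the other string.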
Figure
1
:
The
top
table
reports
the
edit-distance
tables
computed
between
even
and
usual
(
left
part
)
,
and
even
and
unevenly
(
right
part
)
.
The
bottom
part
of
the
figure
shows
2
of
the
681
synchronizations
computed
while
solving
the
equation
[
even
:
usual
=
unevenly
:
?
]
.
The
first
one
corresponds
to
the
path
marked
in
bold
italics
and
leads
to
a
spurious
solution
;
the
second
leads
to
a
legitimate
solution
and
corresponds
to
the
path
shown
as
squares
.
and
the
second
one
corresponds
to
the
maximum
time
needed
to
synchronize
them
.
|
X
|
denotes
the
length
,
counted
in
characters
of
the
string
X
,
whilst
ins
(
B
,
C
)
stands
for
the
number
of
characters
of
B
and
C
not
belonging
to
A.
Given
the
typical
length
of
the
strings
we
consider
in
this
study
,
our
solver
is
quite
efficient.3
Stroppa
and
Yvon
(
2005
)
described
a
generalization
of
this
algorithm
which
can
solve
a
formal
analogical
equation
by
composing
two
finite-state
transducers
.
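To make the notion of solving [A : B = C : ?] concrete, here is a deliberately crude stand-in for a solver. It is not the algorithm of Lepage (1998) nor the transducer-based one: it only handles analogies in which A and B share one contiguous core and differ by a prefix and/or a suffix, and it returns a single solution instead of a set. The function name and the use of `difflib` are our own choices for illustration.

```python
from difflib import SequenceMatcher
from typing import Optional


def solve_affix_analogy(a: str, b: str, c: str) -> Optional[str]:
    """Solve [a : b = c : ?] by swapping the affixes that distinguish a from b.

    Simplifying assumption: a = p_a + core + s_a and b = p_b + core + s_b,
    where core is the longest common substring of a and b.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    if m.size == 0:
        return None  # a and b share nothing: outside the scope of this sketch
    a_prefix, a_suffix = a[:m.a], a[m.a + m.size:]
    b_prefix, b_suffix = b[:m.b], b[m.b + m.size:]
    # c must carry the same affixes as a, without overlap between them.
    if len(c) < len(a_prefix) + len(a_suffix):
        return None
    if not (c.startswith(a_prefix) and c.endswith(a_suffix)):
        return None
    core = c[len(a_prefix):len(c) - len(a_suffix)]
    return b_prefix + core + b_suffix


print(solve_affix_analogy("believer", "unbelievable", "doer"))  # undoable
print(solve_affix_analogy("show", "showing", "eating"))         # eatinging (spurious)
print(solve_affix_analogy("even", "usual", "unevenly"))         # None
```

The last call shows the limit of the shortcut: the equation illustrated in Figure 1 needs a genuine formal analogy solver.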
2.3
Application
to
Lexical
Enrichment
Analogical
inference
can
be
applied
to
the
task
of
extending
an
existing
bilingual
lexicon
(
or
transfer
table
)
with
new
entries
.
In
this
study
,
we
focus
on
a
particular
enrichment
task
:
the
one
of
translating
valid
words
or
phrases
that
were
not
encountered
at
training
time
.
A
simple
example
of
how
our
approach
translates
unknown
words
is
illustrated
in
Figure
2
for
the
(unknown) French word futilité.
[Figure 2 diagram, as far as it can be recovered:
source (French) stems: [activités : activité = futilités : futilité], [hostilités : hostilité = futilités : futilité]
projection by lexicon look-up: activités ↔ actions, hostilités ↔ hostilities, futilités ↔ trivialities, gimmicks, hostilité ↔ hostility, activité ↔ action
target (English) resolution
selection of target candidates: (triviality, 2), (gimmick, 1), ...]
Figure 2: Illustration of the analogical inference procedure applied to the translation of the unknown French word futilité.
In
this
example
,
translations
are
inferred
by
commuting
plural
and
singular
words
.
The
inference
process
lazily
captures
the
fact
that
English
plural
nouns
ending
in
-ies usually correspond to singular nouns ending in -y.
Given
an
unknown
source
word-form
S
,
Step
1
of
the
inference
process
consists
in
identifying
source
stems
which
have
S
as
a
solution
:
$\mathcal{E}_I(S) = \{(i, j, k) \in [1, N]^3 \mid [S_i : S_j = S_k : S]\}$ (see footnote 4).
During
Step
2a
,
each
source
stem
belonging
to
$\mathcal{E}_I(S)$
is
projected
form
by
form
into
(
potentially
several
)
stems
in
the
output
space
,
thanks
to
an
operator
proj
that
will
be
defined
shortly
:
$(U, V, W) \in \left(\mathrm{proj}_{\mathcal{L}}(S_i) \times \mathrm{proj}_{\mathcal{L}}(S_j) \times \mathrm{proj}_{\mathcal{L}}(S_k)\right)$.
3 Several thousands of equations solved within one second.
4 All strings in a stem must be different; otherwise, it can be shown that all source words would be considered.
During
Step
2b
,
each
solution
to
those
output
stems
is
collected
in
$\mathcal{E}_O(S)$ along with its associated frequency: $\mathcal{E}_O(S) = \bigcup_{(i, j, k) \in \mathcal{E}_I(S)} \{T \mid [U : V = W : T]\}$.
Step
3
selects
from
$\mathcal{E}_O(S)$ one or several solutions. We use frequency as the criterion
to
sort
the
generated
solutions
.
The
projection
mechanism
we
resort
to
in
this
study
simply
is
a
lexicon
look-up
:
$\mathrm{proj}_{\mathcal{L}}(S) = \{T \mid (S, T) \in \mathcal{L}\}$.
There
are
several
situations
where
this
inference
procedure
will
introduce
noise
.
First
,
both
source
and
target
analogical
equations
can
lead
to
spurious
solutions
.
For
instance
,
[
show
:
showing
=
eating
:
?
]
will
erroneously
produce
eatinging
.
Second
,
an
error
in
the
original
lexicon
may
also
introduce
erroneous
target
word-forms
.
For
instance
,
when
translating
the
German
word
proklamierung
,
by
making
use
of
the
analogy
[
formalisiert
:
formalisierung
=
proklamiert
:
proklamierung
]
,
the
English
equation
[
formalised
:
formalized
=
sets
:
?
]
will
be
considered
if
it
happens
that
proklamiert ↔ sets
belongs
to
L
;
in
which
case
,
zets
will
be
erroneously
produced
.
We
control
noise
in
several
ways
.
The
source
word-forms
we
generate
are
filtered
by
imposing
that
they
belong
to
the
input
space
.
We
also
use
a
(
large
)
target
vocabulary
to
eliminate
spurious
target
word-forms
(
see
Section
3
)
.
More
importantly
,
since
we
consider
many
analogical
equations
when
translating
a
word-form
,
spurious
analogical
solutions
tend
to
appear
less
frequently
than
ones
arising
from
paradigmatic
commutations
.
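The whole procedure can be sketched on the toy lexicon of Figure 2. This is an illustration under strong assumptions, not the system evaluated here: `solve` is a crude affix-swapping solver of our own that returns at most one solution per equation, the stems are enumerated exhaustively (cubic in the lexicon size, only viable for a toy example), and the target vocabulary is a hypothetical four-word set.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import permutations, product


def solve(a, b, c):
    """Toy solver for [a : b = c : ?]: swap the affixes that differ between a and b."""
    m = SequenceMatcher(None, a, b, autojunk=False).find_longest_match(
        0, len(a), 0, len(b))
    if m.size == 0:
        return None
    a_pre, a_suf = a[:m.a], a[m.a + m.size:]
    b_pre, b_suf = b[:m.b], b[m.b + m.size:]
    if len(c) < len(a_pre) + len(a_suf):
        return None
    if not (c.startswith(a_pre) and c.endswith(a_suf)):
        return None
    return b_pre + c[len(a_pre):len(c) - len(a_suf)] + b_suf


def translate(unknown, lexicon, vocabulary):
    """Return target candidates for an unknown source word, most frequent first."""
    counts = Counter()
    # Step 1: source stems made of three different known forms that
    # yield the unknown word as a solution.
    for s_i, s_j, s_k in permutations(lexicon, 3):
        if solve(s_i, s_j, s_k) != unknown:
            continue
        # Step 2a: project each form of the stem by lexicon look-up.
        for u, v, w in product(lexicon[s_i], lexicon[s_j], lexicon[s_k]):
            # Step 2b: solve the target equation; keep in-vocabulary solutions.
            t = solve(u, v, w)
            if t in vocabulary:
                counts[t] += 1
    # Step 3: rank the candidates by frequency.
    return counts.most_common()


# Entries read off Figure 2; the vocabulary below is a hypothetical filter.
LEXICON = {
    "activités": ["actions"],
    "activité": ["action"],
    "hostilités": ["hostilities"],
    "hostilité": ["hostility"],
    "futilités": ["trivialities", "gimmicks"],
}
VOCABULARY = {"triviality", "gimmick", "action", "hostility"}

print(translate("futilité", LEXICON, VOCABULARY))
# [('triviality', 2), ('gimmick', 1)]
```

In this toy run the spurious form trivialitie is also generated once (from [actions : action = trivialities : ?]) and is discarded by the vocabulary filter, which is the role the target vocabulary plays in the text.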
2.4
Practical
Considerations
Searching
for
$\mathcal{E}_I(S)$
is
an
operation
which
requires
solving
a
number
of
(
source
)
analogical
equations
cubic
in
the
size
of
the
input
space
.
In
many
settings
of
interest
,
including
ours
,
this
is
simply
not
practical
.
We
therefore
resort
to
two
strategies
to
reduce
computation
time
.
The
first
one
consists
in
using
the
analogical
equations
in
a
generative
mode
.
Instead
of
searching
through
the
set
of
stems
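$(S_i, S_j, S_k)$ that have the unknown source word-form S as a solution, we search, for all pairs $(S_i, S_j)$, for the solutions of $[S_j : S_i = S : ?]$ that are valid word-forms; this is equivalent because $[S_i : S_j = S_k : S]$ holds exactly when $[S_j : S_i = S : S_k]$ does.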
This
leaves
us
with
a
quadratic
computation
time
which
is
still
intractable
in
our
case
.
Therefore
,
we
apply
a
second
strategy
which
consists
in
computing
the
analogical
equations
$[S_j : S_i = S : ?]$ only for words $S_i$ and $S_j$ close enough to S. More precisely, we enforce that $S_i \in v_\delta(S)$ and that $S_j \in v_\delta(S_i)$ for a neighborhood function $v_\delta(A)$ of the form: $v_\delta(A) = \{B \mid f(B, A) \le \delta\}$
where
f
is
a
distance
;
we
used
the
edit-distance
in
this
study
(
Levenshtein
,
1966
)
.
Note
that
the
second
strategy
we
apply
is
only
a
heuristic
.
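The pruning heuristic can be sketched as follows, assuming the neighborhood of a word is the set of known words within edit distance δ of it. The threshold value, the function names, and the word list (which includes the toy entry utilité) are illustrative assumptions, not settings reported in this study.

```python
def levenshtein(a, b):
    """Classical edit distance (unit cost for insertion, deletion, substitution)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def neighborhood(word, known_words, delta):
    """Known words whose edit distance to word is at most delta."""
    return {w for w in known_words if levenshtein(w, word) <= delta}


known = {"futilités", "utilité", "hostilité", "activité"}
print(sorted(neighborhood("futilité", known, delta=2)))
# ['futilités', 'utilité']
```

Only pairs drawn from such neighborhoods are then used to form equations, which trades completeness for speed: analogies whose members are far apart in edit distance are never tried.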
3
Resources
In
this
work
,
we
are
concerned
with
one
concrete
problem
a
machine
translation
system
must
face
:
the
one
of
translating
unknown
words
.
We
are
further
focusing
on
the
shared
task
of
the
workshop
on
Statistical
Machine
Translation
,
which
took
place
last
year
(
Koehn
and
Monz
,
2006
)
and
consisted
in
translating
Spanish
,
German
,
and
French
texts
from
and
to
English
.
For
some
reasons
,
we
restricted
ourselves
to
translating
only
into
English
.
The
training
material
available
comes
from
the
Europarl
corpus
.
The
test
material
was
divided
into
two
parts.5
The
first
one
(
hereafter
called
test-in
)
is
composed
of
2
000
sentences
from
European
parliament
debates
.
The
second
part
(
called
test-out
)
gathers
1
064
sentences6
collected
from
editorials
of
the
Project
Syndicate
website.7
The
main
statistics
pertinent
to
our
study
are
summarized
in
Table
1
.
5 The participants were not aware of this.
6 We removed 30 sentences which had encoding problems.
7 http://www.project-syndicate.com
Table
1
:
Number
of
different
(
source
)
test
words
not
seen
at
training
time
,
and
out-of-vocabulary
rate
expressed
as
a
percentage
(
oov
%
)
.
words
)
,
7
words
are
acronyms
,
and
4
are
tokenization
problems
.
The
238
other
words
(
54
%
)
are
ordinary
words
.
We
considered
different
lexicons
for
testing
our
approach
.
These
lexicons
were
derived
from
the
training
material
of
the
shared
task
by
training
with
Giza++
(
Och
and
Ney
,
2000
)
—
default
settings
—
two
transfer
tables
(
source-to-target
and
the
reverse
)
that
we
intersected
to
remove
some
noise
.
In
order
to
investigate
how
sensitive
our
approach
is
to
the
amount
of
training
material
available
,
we
varied
the
size
of
our
lexicon
LT
by
considering
different
portions
of
the
training
corpus
(
t
=
5
000
,
10
000
,
100
000
,
200
000
,
and
500
000
pairs
of
sentences
)
.
The
lexicon
trained
on
the
full
training
material
(
688
000
pairs
of
sentences
)
,
called
Lref
hereafter
,
is
used
for
validation
purposes
.
We
kept
(
at
most
)
the
20
best
associations
of
each
source
word
in
these
lexicons
.
In
practice
,
because
we
intersect
two
models
,
the
average
number
of
translations
kept
for
each
source
word
is
lower
(
see
Table
2
)
.
Last
,
we
collected
from
various
target
texts
(
English
here
)
we
had
at
our
disposal
,
a
vocabulary
set
V
gathering
466
439
words
,
that
we
used
to
filter
out
spurious
word-forms
generated
by
our
approach
.
4
Experiments
4.1
Translating
Unknown
Words
For
the
three
translation
directions
(
from
Spanish
,
German
,
and
French
into
English
)
,
we
applied
the
analogical
reasoning
to
translate
the
(
non-numerical
)
source
words
of
the
test
material
,
absent
from
LT
.
Examples
of
translations
produced
by
analogical
inference
are
reported
in
Figure
3
,
sorted
by
decreasing
order
of
times
they
have
been
generated
.
Figure
3
:
Candidate
translations
inferred
from
L200000
and
their
frequency
.
The
candidates
reported
are
those
that
have
been
intersected
with
V.
Translations
in
bold
are
clearly
erroneous
.
We
devised
two
baselines
against
which
we
compared
our
approach
(
hereafter
analog
)
.
The
first
one
,
base1
,
simply
proposes
as
translations
the
target
words
in
the
lexicon
LT
which
are
the
most
similar
(
in
the
sense
of
the
edit-distance
)
to
the
unknown
source
word
.
Naturally
,
this
approach
is
only
appropriate
for
pairs
of
languages
that
share
many
cognates
(
e.g.
,
docteur
—
doctor
)
.
The
second
baseline
,
base2
,
is
more
sensible
and
more
closely
corresponds
to
our
approach
.
We
first
collect
a
set
of
source
words
that
are
close
enough
(
according
to
the
edit-distance
)
to
the
unknown
word
.
Those
source
words
are
then
projected
into
the
output
space
by
simple
bilingual
lexicon
look-up
.
So
for
instance
,
the
French
word
demanda
will
be
translated
into
the
English
word
request
if
the
French
word
demande
is
in
LT
and
request
is
one
of
its
sanctioned
translations
.
Each
of
these
baselines
is
tested
in
two
variants
.
The
first
one
(
id
)
,
which
allows
a
direct
comparison
,
proposes
as
many
translations
as
analog
does
.
The
second
one
(
10
)
proposes
the
first
10
translations
of
each
unknown
word
.
Evaluating the quality of translations requires inspecting lists of words each time we want to test a variant of our approach. This cumbersome process not only requires understanding the source language, but also happens to be a delicate task in practice.
[Table 2: contents not recovered; surviving labels: test-out, analog, base1id, base2id]
Table 2: Performance of the different approaches on the French-to-English direction as a function of the number T of pairs of sentences used for training LT. A pair [n, t] in lines labeled by unk stands for the number of words to translate, and the average number of their translations in Lref.
We
therefore
decided
to
resort
to
an
automatic
evaluation
procedure
which
relies
on
Lref
,
a
bilingual
lexicon
whose
entries
are
considered
correct
.
We
translated
all
the
words
of
Lref
absent
from
LT
.
We
evaluated
the
different
approaches
by
computing
response
and
precision
rates
.
The
response
rate
is
measured
as
the
percentage
of
words
for
which
we
do
have
at
least
one
translation
produced
(
correct
or
not
)
.
The
precision
is
computed
in
our
case
as
the
percentage
of
words
for
which
at
least
one
translation
is
sanctioned
by
Lref
.
Note
that
this
way
of
measuring
response
and
precision
is
clearly
biased
toward
translation
systems
that
can
hypothesize
several
candidate
translations
for
each
word
,
as
statistical
systems
usually
do
.
This choice was, however, guided by the lack of precision we anticipated in the reference,
a
point
we
discuss
in
Section
4.1.3
.
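The two rates can be computed as in the sketch below. One point is an assumption on our part: precision is taken over the words that received at least one candidate, which the definition above does not state explicitly. The data are toy values, and the function name is ours.

```python
def response_and_precision(candidates, reference):
    """Return (response rate, precision rate) as percentages.

    candidates: word -> list of proposed translations (possibly empty)
    reference:  word -> list of translations sanctioned by the reference lexicon
    Assumption: precision is measured over answered words only.
    """
    if not candidates:
        return 0.0, 0.0
    answered = [word for word, proposed in candidates.items() if proposed]
    response = 100.0 * len(answered) / len(candidates)
    correct = sum(
        1 for word in answered
        if set(candidates[word]) & set(reference.get(word, ()))
    )
    precision = 100.0 * correct / len(answered) if answered else 0.0
    return response, precision


# Toy data: two of four words are answered, one answer is sanctioned.
candidates = {
    "futilité": ["triviality", "gimmick"],
    "contournant": ["circumventing"],
    "bush": [],
    "munere": [],
}
reference = {
    "futilité": ["triviality"],
    "contournant": ["skirting", "bypassing", "by-pass", "overcoming"],
}
print(response_and_precision(candidates, reference))  # (50.0, 50.0)
```

The toy case of contournant shows how a plausible candidate that the reference does not list counts as an error under this automatic measure.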
The
figures
for
the
French-to-English
direction
are
reported
in
Table
2
.
We
observe
that
the
ratio
of
unknown
words
that
get
a
translation
by
analog
is
clearly
impacted
by
the
size
of
the
lexicon
LT
we
use
for
computing
analogies
:
the
larger
the
better
.
This
was
expected
since
the
larger
a
lexicon
is
,
the
higher
the
number
of
source
analogies
that
can
be
made
and
consequently
,
the
higher
the
number
of
analogies
that
can
be
projected
onto
the
output
space
.
The
precision
of
analog
is
rather
stable
across
variants
and
ranges
between
50
%
and
60
%
.
The
second
observation
we
make
is
that
the
baselines
perform
worse
than
analog
in
all
but
the
L500000
cases
.
Since
our
baselines
propose
translations
to
each
source
word
,
their
response
rate
is
maximum
.
Their
precision
,
however
,
is
an
issue
.
Expectedly
,
base1
is
the
worst
of
the
two
baselines
.
If
we
arbitrarily
fix
the
response
rate
of
base2
to
the
one
of
analog
,
the
former
approach
shows
a
far
lower
precision
(
e.g.
,
34.4
against
59.4
for
L200000
)
.
This
not
only
indicates
that
analogical
learning
is
handling
unknown
words
better
than
base2
,
but
also
that
a
combination
of
both
approaches
could
potentially
yield
further
improvements
.
A
last
observation
concerns
the
fact
that
analog
performs
equally
well
on
the
out-domain
material
.
This
is
very
important
from
a
practical
point
of
view
and
contrasts
with
some
related
work
we
discuss
in
Section
5
.
At
first
glance
,
the
fact
that
base2
outperforms
analog
on
the
larger
training
size
is
disappointing
.
After
investigations
,
we
came
to
the
conclusion
that
this
is
mainly
due
to
two
facts
.
First
,
the
number
of
unknown
words
on
which
both
systems
were
tested
is
rather
low
in
this
particular
case
(
e.g.
,
34
for
the
in-domain
corpus
)
.
Second
,
we
noticed
a
deficiency
of
the
reference
lexicon
Lref
for
many
of
those
words
.
After
all
,
this
is
not
surprising
since
the
words
unseen
in
the
500
000
pairs
of
training
sentences
,
but
encountered
in
the
full
training
corpus
(
688
000
pairs
)
are
likely
to
be
observed
only
a
few
times
,
therefore
weakening
the
associations
automatically
acquired
for
these
entries
.
We
estimate
that
a
third
of
the
reference
translations
were
wrong
in
this
setting
,
which
clearly
raises
some
doubts
on
our
automatic
evaluation
procedure
in
this
case
.
The
performance
of
analog
across
the
three
language
pairs
is
reported
in
Table
3
.
We
observe
a
drop
of
performance
of
roughly
10
%
(
both
in
precision
and
response
)
for
the
German-to-English
translation
direction
.
This
is
likely
due
to
the
heuristic
procedure
we
apply
during
the
search
for
stems
,
which
is
not
especially
well
suited
for
handling
compound
words
that
are
frequent
in
German
.
We
observe
that
for
Spanish
-
and
German-to-English
translation
directions
,
the
precision
rate
tends
to
decrease
for
larger
values
of
t.
One
explanation
for
that
is
that
we
consider
all
analogies
equally
likely
in
this
work
,
while
we
clearly
noted
that
some
are
spurious
ones
.
With
larger
training
material
,
spurious
analogies
become
more
likely
.
Table
3
:
Performance
across
language
pairs
measured
on
test-in
.
The
number
t
of
pairs
of
sentences
used
for
training
LT
is
reported
in
thousands
.
We
measured
the
impact
the
translations
produced
by
analog
have
on
a
state-of-the-art
phrase-based
translation
engine
,
which
is
described
in
(
Patry
et
al.
,
2006
)
.
For
that
purpose
,
we
extended
a
phrase-table
with
the
first
translation
proposed
by
analog
or
base2
for
each
unknown
word
of
the
test
material
.
Results
in
terms
of
word-error-rate
(
wer
)
and
bleu
score
(
Papineni
et
al.
,
2002
)
are
reported
in
Table
4
for
those
sentences
that
contain
at
least
one
unknown
word
.
Small
but
consistent
improvements
are
observed
for
both
metrics
with
analog
.
This
was
expected
,
since
the
original
system
simply
leaves
the
unknown
words
untranslated
.
What
is
more
surprising
is
that
the
base2
version
slightly
underperforms
the
baseline
.
The
reason
is
that
some
unknown
words
that
should
appear
unmodified
in
a
translation
,
often
get
an
erroneous
translation
by
base2
.
Forcing
base2
to
propose
a
translation
for
the
same
words
for
which
analog
found
one
,
slightly
improves
the
figures
(
base2id
)
.
wer
bleu
sentences
Table
4
:
Translation
quality
produced
by
our
phrase-based
SMT
engine
(
base
)
with
and
without
the
first
translation
produced
by
analog
,
base2
,
or
base2id
for
each
unknown
word
.
As
we
already
mentioned
,
the
lexicon
used
as
a
reference
in
our
automatic
evaluation
procedure
is
not
perfect
,
especially
for
low
frequency
words
.
We
further
noted
that
several
words
receive
valid
translations
that
are
not
sanctioned
by
Lref
.
This
is
for
instance
the
case
of
the
examples
in
Figure
4
,
where
circumventing
and
fellow
are
arguably
legitimate
translations
of
the
French
words
contournant
and
concitoyen
,
respectively
.
Note
that
in
the
second
example
,
the
reference
translation
is
in
the
plural
form
while
the
French
word
is
not
.
Therefore
,
we
conducted
a
manual
evaluation
of
the
translations
produced
from
L100000
by
analog
and
base2
on
the
127
French
words
of
the
corpus
test-in (footnote 8) that are unknown to
Lref
.
Those
are
the
non-numerical
unknown
words
the
participating
systems
in
the
shared
task
had
to
face
in
the in-domain part of the test material.
8 We did not notice important differences between test-in and test-out.
[Figure 4: candidate lists not recovered. contournant (49 candidates), Lref ↔ skirting, bypassing, by-pass, overcoming; concitoyen (24 candidates), Lref ↔ fellow-citizens]
Figure 4: 10 best ranked candidate translations produced by analog from L200000 for two unknown words and their sanctioned translations in Lref. Words in bold are present in both the candidate and the reference lists.
75
(
60
%
)
of
those
words
received
at
least
one
valid
translation
by
analog
while
only
63
(
50
%
)
did
by
base2
.
Among
those
words
that
received
(
at
least
)
one
valid
translation
,
61
(
81
%
)
were
ranked
first
by
analog
against
only
22
(
35
%
)
by
base2
.
We
further
observed
that
among
the
52
words
that
did
not
receive
a
valid
translation
by
analog
,
38
(
73
%
)
did
not
receive
a
translation
at
all
.
Those
untranslated
words
are
mainly
proper
names
(
bush
)
,
foreign
words
(
munere
)
,
and
compound
words
(
rhenanie-du-nord-westphalie
)
,
for
which
our
approach
is
not
especially
well
suited
.
We
conclude
from
this
informal
evaluation
that
80
%
of
ordinary
unknown
words
received
a
valid
translation
in
our
French-to-English
experiment
,
and
that
roughly
the
same
percentage
had
a
valid
translation
proposed
in
the
first
place
by
analog
.
4.2
Translating
Unknown
Phrases
Our
approach
is
not
limited to translating unknown words
,
but
might
serve
as
well
to
enrich
existing
entries
in
a
lexicon
.
For
instance
,
low-frequency
words
,
often
poorly
handled
by
current
statistical
methods
,
could
receive
useful
translations
.
This
is
illustrated
in
Figure
5
where
we
report
the
best
candidates
produced
by
analog
for
the
French
word
invitées
,
which
appears
7
times
in
the
200
000
first pairs of the training corpus.
Figure 5: 10 best candidates produced by analog for the low-frequency French word invitées and its translations in L200000.
Interestingly
,
analog
produced
the
candidate
guest
which
corresponds
to
a
legitimate
meaning
of
the
French
word
that
was
absent
in
the
training
data
.
Because
it
can
treat
separators
as
any
other
character
,
analog
is
not
limited to translating words
.
As
a
proof
of
concept
,
we
applied
analogical
reasoning
to
translate
those
source
sequences
of
at
most
5
words
in
the
test
material
that
contain
an
unknown
word
.
Since
there
are
many
more
sequences
than
there
are
words
,
the
input
space
in
this
experiment
is
far
larger
,
and
we
had
to
resort
to
a
much
more
aggressive
pruning
technique
to
find
the
stems
of
the
sequences
to
be
translated
.
Figure
6
:
Examples
of
translations
produced
by
analog
where
the
input
(
resp
.
output
)
space
is
defined
by
the
set
of
source
(
resp
.
target
)
word
sequences
.
Words
in
bold
are
unknown
.
We
applied
the
automatic
evaluation
procedure
described
in
Section
4.1.2
for
the
French-to-English
translation
direction
,
with
a
reference
lexicon
being
this
time
the
phrase
table
acquired
on
the
full
training
material.9
The
response
rate
in
this
experiment
is
particularly
low
since
only
a
tenth
of
the
sequences
received (at least) a translation by analog.
9 This model contains 1.5 million pairs of phrases.
Those
are
short
sequences
that
contain
at
most
three
words
,
which
clearly
indicates
the
limitation
of
our
pruning
strategy
.
Among
those
sequences
that
received
at
least
one
translation
,
the
precision
rate
is
55
%
,
which
is
consistent
with
the
rate
we
measured
while
translating
words
.
Examples
of
translations
are
reported
in
Figure
6
.
We observe that single words are no longer constrained to be translated by a single word. This makes it possible to capture 1:n relations such as dépasseront ↔ will exceed,
where
the
future
tense
of
the
French
word
is
adequately
rendered
by
the
modal
will
in
English
.
5
Related
Work
We
are
not
the
first
to
consider
the
translation
of
unknown
words
or
phrases
.
Several
authors
have
for
instance
proposed
approaches
for
translating
proper
names
and
named
entities
(
Chen
et
al.
,
1998
;
Al-Onaizan
and
Knight
,
2002
)
.
Our
approach
is
complementary
to
those
ones
.
Recently
and
more
closely
related
to
the
approach
we
described
,
Callison-Burch
et
al.
(
2006
)
proposed
to
replace
an
unknown
phrase
in
a
source
sentence
by
a
paraphrase
.
Paraphrases
in
their
work
are
acquired
thanks
to
a
word
alignment
computed
over
a
large
external
set
of
bitexts
.
One
important
difference
between
their
work
and
ours
is
that
our
approach
does
not
require
additional
material.10
Indeed
,
they
used
a
rather
idealistic
set
of
large
,
homogeneous
bitexts
(
European
parliament
debates
)
to
acquire
paraphrases
from
.
Therefore
we
feel
our
approach
is
more
suited
for
translating
"
low
density
"
languages
and
languages
with
a
rich
morphology
.
Several
authors
considered
as
well
the
translation
of
new
words
by
relying
on
distributional
collocational
properties
computed
from
a
huge
non-parallel
corpus
(
Rapp
,
1999
;
Fung
and
Yee
,
1998
;
Takaaki
if
admittedly
non-parallel
corpora
are
easier
to
acquire
than
bitexts
,
this
line
of
work
is
still
heavily
dependent
on
huge
external
resources
.
Most
of
the
analogies
made
at
the
word
level
in
our
study
are
capturing
morphological
information
.
10 We
do
use
a
target
vocabulary
list
to
filter
out
spurious
analogies
,
but
we
believe
we
could
do
without
.
The
frequency
with
which
we
generate
a
string
could
serve
to
decide
upon
its
legitimacy
.
The
use
of
morphological
analysis
in
(
statistical
)
machine
translation
has
been
the
focus
of
several
studies
,
(
Nießen
,
2002
)
among
the
first
.
Depending
on
the
pairs
of
languages
considered
,
gains
have
been
reported
when
the
training
material
is
of
modest
size
(
Lee
,
2004
;
Popovic
and
Ney
,
2004
;
Goldwater
and
McClosky
,
2005
)
.
Our
approach
does
not
require
any
morphological
knowledge
of
the
source
,
the
target
,
or
both
languages
.
Admittedly
,
several
unsupervised
morphological
induction
methodologies
have
been
proposed
,
e.g.
,
the
recent
approach
in
Freitag
(
2005
)
.
In
any
case
,
as
we
have
shown
,
analog
is
not
limited to treating words
,
which
we
believe
to
be
at
our
advantage
.
6
Discussion
and
Future
Work
In
this
paper
,
we
have
investigated
the
appropriateness
of
analogical
learning
to
handle
unknown
words
in
machine
translation
.
Contrary
to
several
lines
of
work
,
our
approach
does
not
rely
on
massive
additional
resources
but
capitalizes
instead
on
information that is inherent
to
the
language
.
We
measured
that
roughly
80
%
of
ordinary
unknown
French
words
can
receive
a
valid
translation
into
English
with
our
approach
.
This
work
is
currently
being
developed
in
several
directions
.
First
,
we
are
investigating
why
our
approach
remains
silent
for
some
words
or
phrases
.
This
will
allow
us
to
better
characterize
the
limitations
of
analog
and
will
hopefully
lead
us
to
design
a
better
strategy
for
identifying
the
stems
of
a
given
word
or
phrase
.
Second
,
we
are
investigating
how
a
systematic
enrichment
of
a
phrase-transfer
table
will
impact
a
phrase-based
statistical
machine
translation
engine
.
Last
,
we
want
to
investigate
the
training
of
a
model
that
can
learn
regularities
from
the
analogies
we
are
making
.
This
would
relieve
us
from
requiring
the
training
material
while
translating
,
and
would
allow
us
to
compare
our
approach
with
other
methods
proposed
for
unsupervised
morphology
acquisition
.
Acknowledgement
We
are
grateful
to
the
anonymous
reviewers
for
their
useful
suggestions
and
to
Pierre
Poulin
for
his
fruitful
comments
.
This
study
has
been
partially
funded
by
NSERC
.
