We
consider
the
problem
of
learning
to
parse
sentences
to
lambda-calculus
representations
of
their
underlying
semantics
and
present
an
algorithm
that
learns
a
weighted
combinatory
categorial
grammar
(
CCG
)
.
A
key
idea
is
to
introduce
non-standard
CCG
combinators
that
relax
certain
parts
of
the
grammar
—
for
example
allowing
flexible
word
order
,
or
insertion
of
lexical
items
—
with
learned
costs
.
We
also
present
a
new
,
online
algorithm
for
inducing
a
weighted
CCG
.
Results
for
the
approach
on
ATIS
data
show
86
%
F-measure
in
recovering
fully
correct
semantic
analyses
and
95.9
%
F-measure
by
a
partial-match
criterion
,
a
more
than
5
%
improvement
over
the
90.3
%
partial-match
figure
reported
by
He
and
Young
(
2006
)
.
1
Introduction
Recent
work
(
Mooney
,
2007
;
He
and
Young
,
2006
;
Zettlemoyer
and
Collins
,
2005
)
has
developed
learning
algorithms
for
the
problem
of
mapping
sentences
to
underlying
semantic
representations
.
In
one
such
approach
(
Zettlemoyer
and
Collins
,
2005
)
(
ZC05
)
,
the
input
to
the
learning
algorithm
is
a
training
set
consisting
of
sentences
paired
with
lambda-calculus
expressions
.
For
instance
,
the
training
data
might
contain
the
following
example
:
Sentence : list flights to boston
Logical Form : λx.flight(x) ∧ to(x, boston)
In
this
case
the
lambda-calculus
expression
denotes
the
set
of
all
flights
that
land
in
Boston
.
In
ZC05
it
is
assumed
that
training
examples
do
not
include
additional
information
,
for
example
parse
trees
or other derivations.
b) show me information on american airlines from fort worth texas to philadelphia
λx.airline(x, american_airlines) ∧ …
c) okay that one's great too now we're going to go on april twenty second dallas to washington the latest nighttime departure one way
Figure 1: Three sentences from the ATIS domain.
The
output
from
the
learning
algorithm
is
a
combinatory
categorial
grammar
(
CCG
)
,
together
with
parameters
that
define
a
log-linear
distribution
over
parses
under
the
grammar
.
Experiments
show
that
the
approach
gives
high
accuracy
on
two
database-query
problems
,
introduced
by
Zelle
and
Mooney
(
1996
)
and
Tang
and
Mooney
(
2000
)
.
The
use
of
a
detailed
grammatical
formalism
such
as
CCG
has
the
advantage
that
it
allows
a
system
to
handle
quite
complex
semantic
effects
,
such
as
coordination
or
scoping
phenomena
.
In
particular
,
it
allows
us
to
leverage
the
considerable
body
of
work
on
semantics
within
these
formalisms
,
for
example
see
Carpenter
(
1997
)
.
However
,
a
grammar
based
on
a
formalism
such
as
CCG
can
be
somewhat
rigid
,
and
this
can
cause
problems
when
a
system
is
faced
with
spontaneous
,
unedited
natural
language
input
,
as
is
commonly
seen
in
natural
language
interface
applications
.
For
example
,
consider
the
sentences
shown
in
figure
1
,
which
were
taken
from
the
ATIS
travel-planning
domain
(
Dahl
et
al.
,
1994
)
.
These
sentences
exhibit
characteristics
which
present
significant
challenges
to
the
approach
of
ZC05
.
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 678-687, Prague, June 2007. © 2007 Association for Computational Linguistics
For
example
,
the
sentences
have
quite
flexible
word
order
,
and
include
telegraphic
language
where
some
words
are
effectively
omitted
.
In
this
paper
we
describe
a
learning
algorithm
that
retains
the
advantages
of
using
a
detailed
grammar
,
but
is
highly
effective
in
dealing
with
phenomena
seen
in
spontaneous
natural
language
,
as
exemplified
by
the
ATIS
domain
.
A
key
idea
is
to
extend
the
approach
of
ZC05
by
allowing
additional
nonstandard
CCG
combinators
.
These
combinators
relax
certain
parts
of
the
grammar
—
for
example
allowing
flexible
word
order
,
or
insertion
of
lexical
items
—
with
learned
costs
for
the
new
operations
.
This
approach
has
the
advantage
that
it
can
be
seamlessly
integrated
into
CCG
learning
algorithms
such
as
the
algorithm
described
in
ZC05
.
A
second
contribution
of
the
work
is
a
new
,
online
algorithm
for
CCG
learning
.
The
approach
involves
perceptron
training
of
a
model
with
hidden
variables
.
In
this
sense
it
is
related
to
the
algorithm
of
Liang
et
al.
(
2006
)
.
However
it
has
the
additional
twist
of
also
performing
grammar
induction
(
lexical
learning
)
in
an
online
manner
.
In
our
experiments
,
we
show
that
the
new
algorithm
is
considerably
more
efficient
than
the
ZC05
algorithm
;
this
is
important
when
training
on
large
training
sets
,
for
example
the
ATIS
data
used
in
this
paper
.
Results
for
the
approach
on
ATIS
data
show
86
%
F-measure
accuracy
in
recovering
fully
correct
semantic
analyses
,
and
95.9
%
F-measure
by
a
partial-match
criterion
described
by
He
and
Young
(
2006
)
.
The
latter
figure
contrasts
with
a
figure
of
90.3
%
for
the
approach
reported
by
He
and
Young
(
2006
)
.
¹
Results
on
the
Geo880
domain
also
show
an
improvement
in
accuracy
,
with
88.9
%
F-measure
for
the
new
approach
,
compared
to
87.0
%
F-measure
for
the
method
in
ZC05
.
Training
examples
in
our
approach
consist
of
sentences
paired
with
lambda-calculus
expressions
.
We
use
a
version
of
the
lambda
calculus
that
is
closely
related
to
the
one
presented
by
Carpenter
(
1997
)
.
There
are
three
basic
types
:
t
,
the
type
of
truth
values; e, the type for entities; and r, the type for real numbers.
¹ He and Young (2006) do not give results for recovering fully correct parses.
Functional
types
are
defined
by
specifying
their
input
and
output
types
,
for
example
(
e
,
t
)
is
the
type
of
a
function
from
entities
to
truth
values
.
In
general
,
declarative
sentences
have
a
logical
form
of
type
t.
Question
sentences
generally
have
functional
types.²
Each
expression
is
constructed
from
constants
,
logical
connectors
,
quantifiers
and
lambda
functions
.
2.2
Combinatory
Categorial
Grammars
Combinatory
categorial
grammar
(
CCG
)
is
a
syntactic
theory
that
models
a
wide
range
of
linguistic
phenomena
(
Steedman
,
1996
;
Steedman
,
2000
)
.
The
core
of
a
CCG
grammar
is
a
lexicon
Λ.
For
example
,
consider
the
lexicon
Each
entry
in
the
lexicon
is
a
pair
consisting
of
a
word
and
an
associated
category
.
The
category
contains
both
syntactic
and
semantic
information
.
For
example
,
the
first
entry
states
that
the
word
flights
can
have
the
category
N
:
λx.flight(x)
.
This
category
consists
of
a
syntactic
type
N
,
together
with
the
semantics
λx.flight(x)
.
In
general
,
the
semantic
entries
for
words
in
the
lexicon
can
consist
of
any
lambda-calculus
expression
.
Syntactic
types
can
either
be
simple
types
such
as
N
,
NP
,
or
S
,
or
can
be
more
complex
types
that
make
use
of
slash
notation
,
for
example
(
N
\
N
)
/
NP
.
CCG
makes
use
of
a
set
of
combinators
which
are
used
to
combine
categories
to
form
larger
pieces
of
syntactic
and
semantic
structure
.
The
simplest
such
rules
are
the
functional
application
rules
:
The
first
rule
states
that
a
category
with
syntactic
type
A
/
B
can
be
combined
with
a
category
to
the
right
of
syntactic
type
B
to
create
a
new
category
of
type
A.
It
also
states
that
the
new
semantics
will
be
formed
by
applying
the
function
f
to
the
expression
g.
The
second
rule
handles
arguments
to
the
left
.
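The two application rules just described can be written in standard CCG notation as follows. This is a reconstruction from the prose, with $f$ and $g$ denoting the semantics of the functor and its argument as in the text; the rule labels $(>)$ and $(<)$ follow common CCG usage.

```latex
\begin{align*}
A/B : f \quad B : g \;&\Rightarrow\; A : f(g) && (>) \\
B : g \quad A\backslash B : f \;&\Rightarrow\; A : f(g) && (<)
\end{align*}
```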
Using
these
rules
,
we
can
parse
the
following phrase to create a new category of type N : flights to boston
² For example, many question sentences have semantics of type (e, t), as in λx.flight(x) ∧ to(x, boston).
The
top-most
parse
operations
pair
each
word
with
a
corresponding
category
from
the
lexicon
.
The
later
steps
are
labeled
-> (for each instance of forward application) or -<
(
for
backward
application
)
.
A
second
set
of
combinators
in
CCG
grammars
are
the
rules
of
functional
composition
:
These
rules
allow
for
an
unrestricted
notion
of
constituency
that
is
useful
for
modeling
coordination
and
other
linguistic
phenomena
.
As
we
will
see
,
they
also
turn
out
to
be
useful
when
modeling
constructions
with
relaxed
word
order
,
as
seen
frequently
in
domains
such
as
ATIS
.
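For reference, the standard functional composition rules can be stated as below. This is the usual textbook formulation, given as a sketch; the labels $(>\!B)$ and $(<\!B)$ are the conventional names and are an assumption about the notation used here.

```latex
\begin{align*}
A/B : f \quad B/C : g \;&\Rightarrow\; A/C : \lambda x.\, f(g(x)) && (>\!B) \\
B\backslash C : g \quad A\backslash B : f \;&\Rightarrow\; A\backslash C : \lambda x.\, f(g(x)) && (<\!B)
\end{align*}
```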
In
addition
to
the
application
and
composition
rules
,
we
will
also
make
use
of
type
raising
and
coordination
combinators
.
A
full
description
of
these
combinators
goes
beyond
the
scope
of
this
paper
.
Steedman
(
1996
;
2000
)
presents
a
detailed
description
of
CCG
.
2004
;
Taskar
et
al.
,
2004
)
.
We
will
write
x
to
denote
a
sentence
,
and
y
to
denote
a
CCG
parse
for
a
sentence
.
We use $\mathrm{GEN}(x; \Lambda)$ to refer to all possible CCG parses for $x$ under some CCG lexicon $\Lambda$. We will define $f(x, y) \in \mathbb{R}^d$ to be a $d$-dimensional feature vector that represents a parse tree $y$ paired with an input sentence $x$. In principle, $f$ could include features that are sensitive to arbitrary substructures within the pair $(x, y)$. We will define $w \in \mathbb{R}^d$ to be a parameter vector. The optimal parse for a sentence $x$ under parameters $w$ and lexicon $\Lambda$ is then defined as
$$y^*(x) = \arg\max_{y \in \mathrm{GEN}(x; \Lambda)} w \cdot f(x, y).$$
Assuming sufficiently local features³ in $f$, search for $y^*$ can be achieved using dynamic-programming-style algorithms, typically with some form of beam search.⁴ Training a model of this form involves learning the parameters $w$ and potentially also the lexicon $\Lambda$. This paper focuses on a method for learning a $(w, \Lambda)$ pair from a training set of sentences paired with lambda-calculus expressions.
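The linear decision rule above can be illustrated with a minimal sketch. It assumes the candidate parses have already been enumerated and that features are sparse count dictionaries; the paper instead searches a CKY-style chart with a beam. The names `score` and `best_parse`, and the feature names in the usage note, are hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple

Features = Dict[str, float]


def score(w: Features, f: Features) -> float:
    """Sparse dot product w . f(x, y); missing weights count as zero."""
    return sum(w.get(name, 0.0) * value for name, value in f.items())


def best_parse(w: Features,
               candidates: Iterable[Tuple[str, Features]]) -> Optional[str]:
    """Return the id of the highest-scoring candidate parse, or None if empty."""
    best_id, best_score = None, float("-inf")
    for parse_id, feats in candidates:
        s = score(w, feats)
        if s > best_score:
            best_id, best_score = parse_id, s
    return best_id
```

With weights such as `{"lex:flights": 1.0, "comb:relaxed_app": -0.5}`, a candidate that uses a relaxed combinator scores lower than an otherwise identical candidate that does not.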
We
now
give
a
description
of
the
approach
of
Zettlemoyer
and
Collins
(
2005
)
.
This
method
will
form
the
basis
for
our
approach
,
and
will
be
one
of
the
baseline
models
for
the
experimental
comparisons
.
The
input
to
the
ZC05
algorithm
is
a
set
of
training
examples
$(x_i, z_i)$ for $i = 1 \ldots n$. Each $x_i$ is a sentence, and each $z_i$
is
a
corresponding
lambda-expression
.
The
output
from
the
algorithm
is
a
pair
(
w
,
Λ
)
specifying
a
set
of
parameter
values
,
and
a
CCG
lexicon
.
Note
that
for
a
given
training
example
$(x_i, z_i)$, there may be many possible parses $y$ which lead to the correct semantics $z_i$.⁵
For
this
reason
the
training
problem
is
a
hidden-variable
problem
,
where
the
training
examples
contain
only
partial
information
,
and
the
CCG
lexicon
and
parse
derivations
must
be
learned
without
direct
supervision
.
A
central
part
of
the
ZC05
approach
is
a
function
GENLEX
(
x
,
z
)
which
maps
a
sentence
x
together
with
semantics
z
to
a
set
of
potential
lexical
entries
.
The
function
GENLEX
is
defined
through
a
set
of
rules
—
see
figure
2
—
that
consider
the
expression
z
,
and
generate
a
set
of
categories
that
may
help
in
building
the
target
semantics
z.
An
exhaustive
set
of
lexical
entries
is
then
generated
by
taking
all
categories
generated
by
the
GENLEX
rules
,
and
pairing
them
with
all
possible
sub-strings
of
the
sentence
x.
Note
that
our
lexicon
can
contain
multi-word
entries
,
where
a
multi-word
string
such
as
New
York
can
be
paired
with
a
CCG
category
.
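The pairing step just described can be sketched as follows. The categories would come from the GENLEX rules applied to the logical form; here they are simply passed in as a precomputed set of strings, which is an assumption made for illustration. The helper names `substrings` and `genlex_pairs` are hypothetical.

```python
from typing import List, Set, Tuple


def substrings(words: List[str]) -> List[str]:
    """All contiguous word spans of the sentence, joined by single spaces."""
    n = len(words)
    return [" ".join(words[i:j]) for i in range(n) for j in range(i + 1, n + 1)]


def genlex_pairs(sentence: str, categories: Set[str]) -> Set[Tuple[str, str]]:
    """Pair every contiguous substring with every candidate category."""
    return {(span, cat)
            for span in substrings(sentence.split())
            for cat in categories}
```

A sentence of n words has n(n+1)/2 contiguous spans, so the candidate set grows quickly, which is consistent with the observation that most generated entries are spurious.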
³ For example, features which count the number of lexical entries of a particular type, or features that count the number of applications of a particular CCG combinator.
⁴ In our experiments we use a parsing algorithm that is similar to a CKY-style parser with dynamic programming. Dynamic programming is used but each entry in the chart maintains a full semantic expression, preventing a polynomial-time algorithm; beam search is used to make the approach tractable.
⁵ This problem is compounded by the fact that the lexicon is unknown, so that many of the possible hidden derivations involve completely spurious lexical entries.
Example categories produced from the logical form arg max(λx.flight(x) ∧ from(x, boston), λx.cost(x))
Input Trigger (Output Category entries not shown):
constant c
arity one predicate p
arity one predicate p1
literal with arity two predicate p2 and constant second argument c
arity two predicate p2
an arg max / min with second argument arity one function f
arity one function f
no trigger
Figure 2: Rules used in GENLEX. Each row represents a rule. The first column lists the triggers that identify some sub-structure within a logical form. The second column lists the category that is created. The third column lists categories that are created when the rule is applied to the logical form at the top of this column. We use the 10 rules described in ZC05 and add two new rules, listed in the last two rows above. The first new rule is instantiated for greater than (>) and less than (<) comparisons. The second new rule has no trigger; it is always applied. It generates categories that are used to learn lexical entries for semantically vacuous sentence prefixes such as the phrase show me information on in the example in figure 1(b).
The
final
output
from
GENLEX
(
x
,
z
)
is
a
large
set
of
potential
lexical
entries
,
with
the
vast
majority
of
those
entries
being
spurious
.
The
algorithm
in
ZC05
embeds
GENLEX
within
an
overall
learning
approach
that
simultaneously
selects
a
small
subset
of
all
entries
generated
by
GENLEX
and
estimates
parameter
values
w.
Zettlemoyer
and
Collins
(
2005
)
present
more
complete
details
.
In
section
4.2
we
describe
a
new
,
online
algorithm
that
uses
GENLEX
.
3
Parsing
Extensions
:
Combinators
This
section
describes
a
set
of
CCG
combinators
which
we
add
to
the
conventional
CCG
combinators
described
in
section
2.2
.
These
additional
combinators
are
natural
extensions
of
the
forward
application
,
forward
composition
,
and
type-raising
rules
seen
in
CCG
.
We
first
describe
a
set
of
combinators
that
allow
the
parser
to
significantly
relax
constraints
on
word
order
.
We
then
describe
a
set
of
type-raising
rules
which
allow
the
parser
to
cope
with
telegraphic
input
(
in
particular
,
missing
function
words
)
.
In
both
cases
these
additional
rules
lead
to
significantly
more
parses
for
any
sentence
x
given
a
lexicon
Λ.
Many
of
these
parses
will
be
suspect
from
a
linguistic
perspective
;
broadening
the
set
of
CCG
combinators
in
this
way
might
be
considered
a
dangerous
move
.
However
,
the
learning
algorithm
in
our
approach
can
learn
weights
for
the
new
rules
,
effectively
allowing
the
model
to
learn
to
use
them
only
in
appropriate
contexts
;
in
the
experiments
we
show
that
the
rules
are
highly
effective
additions
when
used
within
a
weighted
CCG
.
3.1
Application
and
Composition
Rules
The
first
new
combinators
we
consider
are
the
relaxed
functional
application
rules
:
These
are
variants
of
the
original
application
rules
,
where
the
slash
direction
on
the
principal
categories
(
A
/
B
or
A
\
B
)
is
reversed.⁶
These
rules
allow
simple
reversing
of
regular
word
order
,
for example in the fragment flights one way .
Note
that
we
can
recover
the
correct
analysis
for
this
fragment
,
with
the
same
lexical
entries
as
those
used
for
the
conventional
word
order
,
one-way
flights
.
A
second
set
of
new
combinators
are
the
relaxed
functional
composition
rules
:
These
rules
are
variations
of
the
standard
functional
composition
rules
,
where
the
slashes
of
the
principal
categories
are
reversed
.
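The relaxed rules described in this subsection can be written out as below. This is a reconstruction from the statement that the slash on the principal category is reversed relative to the standard application and composition rules; it is a sketch under that reading, not a copy of the original display, and no rule labels are assumed.

```latex
\begin{align*}
\text{Relaxed application:}\quad
A\backslash B : f \quad B : g \;&\Rightarrow\; A : f(g) \\
B : g \quad A/B : f \;&\Rightarrow\; A : f(g) \\[4pt]
\text{Relaxed composition:}\quad
A\backslash B : f \quad B/C : g \;&\Rightarrow\; A/C : \lambda x.\, f(g(x)) \\
B\backslash C : g \quad A/B : f \;&\Rightarrow\; A\backslash C : \lambda x.\, f(g(x))
\end{align*}
```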
⁶ Rules
of
this
type
are
non-standard
in
the
sense
that
they
violate
Steedman
's
Principle
of
Consistency
(
2000
)
;
this
principle
states
that
rules
must
be
consistent
with
the
slash
direction
of
the
principal
category
.
Steedman
(
2000
)
only
considers
rules
that
do
not
violate
this
principle
—
for
example
,
crossed
composition
rules
,
which
we
consider
later
,
and
which
Steedman
also
considers
,
do
not
violate
this
principle
.
An
important
point
is
that
these
new
composition
and
application
rules
can
deal
with
quite
flexible
word
orders
.
For
example
,
take
the
fragment
to
washington
the
latest
flight
.
In
this
case
the
parse
is
to
washington
the
latest
flight
Note
that
in
this
case
the
substring
the
latest
has
category
NP
/
N
,
and
this
prevents
a
naive
parse
where
the
latest
first
combines
with
flight
,
and
to
washington
then
combines
with
the
latest
flight
.
The
functional
composition
rules
effectively
allow
the
latest
to
take
scope
over
flight
and
to
washington
,
in
spite
of
the
fact
that
the
latest
appears
between
the
two
other
sub-strings
.
Examples
like
this
are
quite
frequent
in
domains
such
as
ATIS
.
We
add
features
in
the
model
which
track
the
occurrences
of
each
of
these
four
new
combinators
.
Specifically
,
we
have
four
new
features
in
the
definition
of
f
;
each
feature
tracks
the
number
of
times
one
of
the
combinators
is
used
in
a
CCG
parse
.
The
model
learns
parameter
values
for
each
of
these
features
,
allowing
it
to
learn
to
penalize
these
rules
to
the
correct
extent
.
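The four count features can be sketched as follows, assuming a parse is summarized by the sequence of rule names used in its derivation. The rule names and the function `combinator_features` are hypothetical labels chosen for illustration.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical names for the four relaxed combinators.
RELAXED = ("relaxed_bwd_app", "relaxed_bwd_comp",
           "relaxed_fwd_app", "relaxed_fwd_comp")


def combinator_features(rule_sequence: List[str]) -> Dict[str, int]:
    """One count feature per relaxed combinator; other rules are ignored."""
    counts = Counter(r for r in rule_sequence if r in RELAXED)
    return {f"count:{name}": counts.get(name, 0) for name in RELAXED}
```

A learned negative weight on one of these features then acts as a per-use cost for the corresponding rule.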
3.2
Additional
Rules
of
Type-Raising
We
now
describe
new
CCG
operations
designed
to
deal
with
cases
where
words
are
in
some
sense
missing
in
the
input
.
For
example
,
in
the
string
flights
Boston
to
New
York
,
one
style
of
analysis
would
assume
that
the
preposition
from
had
been
deleted
from
the
position
before
Boston
.
The
first
set
of
rules
is
generated
from
the
following
role-hypothesising
type
shifting
rules
template
:
This
rule
can
be
applied
to
any
NP
with
semantics
c
,
and
any
arity-two
function
p
such
that
the
second
argument
of
p
has
the
same
type
as
c.
By
"
any
"
arity-two
function
,
we
mean
any
of
the
arity-two
functions
seen
in
training
data
.
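One form of the role-hypothesising template that is consistent with this description is sketched below. It assumes that the shifted NP behaves like a noun post-modifier of type $N\backslash N$, which matches the treatment of prepositional phrases elsewhere in this section; the exact output category is an assumption.

```latex
\[
\mathit{NP} : c \;\Rightarrow\; N\backslash N : \lambda f.\lambda x.\, f(x) \wedge p(x, c)
\]
```

For instance, with $c = \mathit{boston}$ and $p = \mathit{from}$, the shifted category would carry the semantics $\lambda f.\lambda x.\, f(x) \wedge \mathit{from}(x, \mathit{boston})$, as if the preposition had been present.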
We
define
features
within
the
feature-vector
f
that
are
sensitive
to
the
number
of
times
these
rules
are
applied
in
a
parse
;
a
separate
feature
is
defined
for
each
value
of
p.
In
practice
,
in
our
experiments
most
rules
of
this
form
have
p
as
the
semantics
of
some
preposition
,
for
example
from
or
to
.
A
typical
example
of
a
use
of
this
rule
would
be
the
following
:
flights
boston
to
new
york
The
second
rule
we
consider
is
the
null-head
type
shifting
rule
:
This
rule
allows
parses
of
fragments
such
as
American
Airlines
from
New
York
,
where
there
is
again
a
word
that
is
in
some
sense
missing
(
it
is
straightforward
to
derive
a
parse
for
American
Airlines
flights
from
New
York
)
.
The
analysis
would
be
as
follows
:
American
Airlines
from
New
York
The
new
rule
effectively
allows
the
prepositional
phrase
from
New
York
to
type-shift
to
an
entry
with
syntactic
type
N
and
semantics
λx.from(x, new_york), representing the set of all things from New York.⁷
We
introduce
a
single
additional
feature
which
counts
the
number
of
times
this
rule
is
used
.
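A rule with the effect just described can be sketched as follows. It assumes the prepositional phrase has category $N\backslash N$ and that the missing head noun is replaced by the trivially true predicate; both are assumptions chosen so that the result matches the semantics stated in the text.

```latex
\begin{align*}
N\backslash N : f \;&\Rightarrow\; N : f(\lambda x.\,\mathit{true}) \\[4pt]
f &= \lambda g.\lambda x.\, g(x) \wedge \mathit{from}(x, \mathit{new\_york}) \\
f(\lambda x.\,\mathit{true}) &= \lambda x.\,\mathit{true} \wedge \mathit{from}(x, \mathit{new\_york})
  \;\equiv\; \lambda x.\,\mathit{from}(x, \mathit{new\_york})
\end{align*}
```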
3.3
Crossed
Composition
Rules
Finally
,
we
include
crossed
functional
composition
rules
:
These
rules
are
standard
CCG
operators
but
they
were
not
used
by
the
parser
described
in
ZC05
.
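The crossed composition rules in their usual textbook form are given below as a reference sketch; the labels $(>\!B_\times)$ and $(<\!B_\times)$ are the conventional names and are an assumption about the notation used here.

```latex
\begin{align*}
A/B : f \quad B\backslash C : g \;&\Rightarrow\; A\backslash C : \lambda x.\, f(g(x)) && (>\!B_\times) \\
B/C : g \quad A\backslash B : f \;&\Rightarrow\; A/C : \lambda x.\, f(g(x)) && (<\!B_\times)
\end{align*}
```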
When
used
in
unrestricted
contexts
,
they
can
significantly
relax
word
order
.
Again, we address this problem by introducing features that count the number of times they are used in a parse.⁸
⁷ Note that we do not analyze this prepositional phrase as having the semantics λx.flight(x) ∧ from(x, new_york), although in principle this is possible, as the flight(x) predicate is not necessarily implied by this utterance.
Figure 3: A parse with the flexible parser.
As
a
final
point
,
to
see
how
these
rules
can
interact
in
practice
,
see
figure
3
.
This
example
demonstrates
the
use
of
the
relaxed
application
and
composition
rules
,
as
well
as
the
new
type-raising
rules
.
4
Learning
This
section
describes
an
approach
to
learning
in
our
model
.
We
first
define
the
features
used
and
then
describe
a
new
online
learning
algorithm
for
the
task
.
Section
2.3
described
the
use
of
a
function
f
(
x
,
y
)
which
maps
a
sentence
x
together
with
a
CCG
parse
y
to
a
feature
vector
.
As
described
in
section
3
,
we
introduce
features
for
the
new
CCG
combinators
.
In
addition
,
we
follow
ZC05
in
defining
features
which
track
the
number
of
times
each
lexical
item
in
Λ
is
used
.
For
example
,
we
would
have
one
feature
tracking
the
number
of
times
the
lexical
entry
flights := N : λx.flight(x) is used in a parse, and similar features for all other members of Λ.
Finally
,
we
introduce
new
features
which
directly
consider
the
semantics
of
a
parse
.
For
each
predicate
f
seen
in
training
data
,
we
introduce
a
feature
that
counts
the
number
of
times
f
is
conjoined
with
itself
at
some
level
in
the
logical
form
.
For
example
,
the
expression
λx.flight(x) ∧ from(x, new_york) ∧ from(x, boston) would trigger the new feature for the from predicate, signaling that the logical form describes flights with more than one origin city. We introduce similar features which track disjunction as opposed to conjunction.
⁸ In general, applications of the crossed composition rules can be lexically governed, as described in work on Multi-Modal CCG (Baldridge, 2002). In the future we would like to incorporate more fine-grained lexical distinctions of this type.
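The repeated-conjunct feature can be illustrated with a small sketch. It assumes a single, flat conjunction represented as a list of (predicate, arguments) literals; a full implementation would apply the same check at every level of a nested logical form. The function name and feature-name prefix are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Literal = Tuple[str, Tuple[str, ...]]


def repeated_conjunct_features(conjuncts: List[Literal]) -> Dict[str, int]:
    """Fire one feature per predicate that occurs more than once in a conjunction."""
    counts = Counter(pred for pred, _ in conjuncts)
    return {f"conj_repeat:{pred}": 1 for pred, c in counts.items() if c > 1}
```

For the expression in the example above, the only feature that fires is the one for the `from` predicate.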
4.2
An
Online
Learning
Algorithm
Figure 4 shows a learning algorithm that takes a training set of $(x_i, z_i)$ pairs as input, and returns a weighted CCG (i.e., a pair $(w, \Lambda)$) as its output. The algorithm is online, in that it visits each example in turn, and updates both $w$ and $\Lambda$ if necessary. In Step 1 on each example, the input $x_i$ is parsed. If it is parsed correctly, the algorithm immediately moves to the next example. In Step 2, the algorithm temporarily introduces all lexical entries seen in GENLEX$(x_i, z_i)$, and finds the highest scoring parse that leads to the correct semantics $z_i$. A small subset of GENLEX$(x_i, z_i)$ (namely, only those lexical entries that are contained in the highest scoring parse) is added to $\Lambda$. In Step 3, a simple perceptron update (Collins, 2002) is performed. The hypothesis is parsed again with the new lexicon, and an update to the parameters $w$ is made if the resulting parse does not have the correct logical form.
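The three steps can be sketched in code as follows. This is a minimal illustration, not the authors' implementation: the parser is abstracted as a caller-supplied function `gen(x, lexicon)` that returns candidate parses, each summarized as a (logical form, features, lexical entries used) triple, and `genlex(x, z)` returns candidate lexical entries. Both callables, the `Parse` representation, and the choice to skip examples with no correct parse are assumptions; initial weights for newly added entries are also omitted.

```python
from typing import Callable, Dict, Hashable, List, Optional, Set, Tuple

Features = Dict[str, float]
# A parse is summarized as (logical form, feature counts, lexical entries used).
Parse = Tuple[Hashable, Features, Set[str]]


def dot(w: Features, f: Features) -> float:
    return sum(w.get(k, 0.0) * v for k, v in f.items())


def best(w: Features, parses: List[Parse]) -> Optional[Parse]:
    """Highest-scoring parse under w, or None when there are no candidates."""
    return max(parses, key=lambda p: dot(w, p[1]), default=None)


def train(examples: List[Tuple[str, Hashable]],
          lexicon: Set[str],
          w: Features,
          T: int,
          genlex: Callable[[str, Hashable], Set[str]],
          gen: Callable[[str, Set[str]], List[Parse]]) -> Tuple[Features, Set[str]]:
    lexicon, w = set(lexicon), dict(w)
    for _ in range(T):
        for x, z in examples:
            # Step 1: parse with the current lexicon; move on if already correct.
            y = best(w, gen(x, lexicon))
            if y is not None and y[0] == z:
                continue
            # Step 2: temporarily add GENLEX entries and keep only those used
            # by the best parse that yields the correct logical form.
            temp = lexicon | genlex(x, z)
            y_star = best(w, [p for p in gen(x, temp) if p[0] == z])
            if y_star is None:
                continue  # assumption: skip examples with no correct parse
            lexicon |= y_star[2]
            # Step 3: re-parse with the new lexicon; perceptron update if wrong.
            y_prime = best(w, gen(x, lexicon))
            if y_prime is not None and y_prime[0] != z:
                for k, v in y_star[1].items():
                    w[k] = w.get(k, 0.0) + v
                for k, v in y_prime[1].items():
                    w[k] = w.get(k, 0.0) - v
    return w, lexicon
```

Because Step 1 returns early on correctly parsed examples, the expensive parse over the temporarily enlarged lexicon is only run when the current model fails.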
This
algorithm
differs
from
the
approach
in
ZC05
in
a
couple
of
important
respects
.
First
,
the
ZC05
algorithm
performed
learning
of
the
lexicon
Λ
at
each
iteration
in
a
batch
method
,
requiring
a
pass
over
the
entire
training
set
.
The
new
algorithm
is
fully
online
,
learning
both
Λ
and
w
in
an
example-by-example
fashion
.
This
has
important
consequences
for
the
efficiency
of
the
algorithm
.
Second
,
the
parameter
estimation
method
in
ZC05
was
based
on
stochastic
gradient
descent
on
a
log-likelihood
objective
function
.
The new algorithm makes use of perceptron updates, which are simpler and cheaper to compute.
Inputs: Training examples $\{(x_i, z_i) : i = 1 \ldots n\}$ where each $x_i$ is a sentence, each $z_i$ is a logical form. An initial lexicon $\Lambda_0$. Number of training iterations, $T$.
Definitions: GENLEX$(x, z)$ takes as input a sentence $x$ and a logical form $z$ and returns a set of lexical items as described in section 2.4. $\mathrm{GEN}(x; \Lambda)$ is the set of all parses for $x$ with lexicon $\Lambda$. $\mathrm{GEN}(x, z; \Lambda)$ is the set of all parses for $x$ with lexicon $\Lambda$, which have logical form $z$. The function $f(x, y)$ represents the features described in section 4.1. The function $L(y)$ maps a parse tree $y$ to its associated logical form.
Initialization: Set parameters $w$ to initial values described in section 6.2. Set $\Lambda = \Lambda_0$.
Algorithm: For $t = 1 \ldots T$, $i = 1 \ldots n$:
Step 1:
• Let $y^* = \arg\max_{y \in \mathrm{GEN}(x_i; \Lambda)} w \cdot f(x_i, y)$.
• If $L(y^*) = z_i$, go to the next example.
Step 2:
• Set $\lambda = \Lambda \cup \mathrm{GENLEX}(x_i, z_i)$.
• Let $y^* = \arg\max_{y \in \mathrm{GEN}(x_i, z_i; \lambda)} w \cdot f(x_i, y)$.
• Define $\lambda_i$ to be the set of lexical entries in $y^*$.
• Set $\Lambda = \Lambda \cup \lambda_i$.
Step 3:
• Let $y' = \arg\max_{y \in \mathrm{GEN}(x_i; \Lambda)} w \cdot f(x_i, y)$.
• If $L(y') \neq z_i$: Set $w = w + f(x_i, y^*) - f(x_i, y')$.
Output: Lexicon $\Lambda$ together with parameters $w$.
Figure 4: An online learning algorithm.
As
in
ZC05
,
the
algorithm
assumes
an
initial
lexicon
$\Lambda_0$
that
contains
two
types
of
entries
.
First
,
we
compile
entries
such
as
Boston
:
=
NP
:
boston
for
entities
such
as
cities
,
times
and
month-names
that
occur
in
the
domain
or
underlying
database
.
In
practice
it
is
easy
to
compile
a
list
of
these
atomic
entities
.
Second
,
the
lexicon
has
entries
for
some
function
words
such
as
wh-words
,
and
determiners.⁹
5
Related
Work
There
has
been
a
significant
amount
of
previous
work
on
learning
to
map
sentences
to
underlying
semantic
representations
.
A
wide
variety
⁹ Our
assumption
is
that
these
entries
are
likely
to
be
domain
independent
,
so
it
is
simple
enough
to
compile
a
list
that
can
be
reused
in
new
domains
.
Another
approach
,
which
we
may
consider
in
the
future
,
would
be
to
annotate
a
small
subset
of
the
training
examples
with
full
CCG
derivations
,
from
which
these
frequently
occurring
entries
could
be
learned
.
ideas
from
string
kernels
and
support
vector
machines
(
Kate
and
Mooney
,
2006
;
Nguyen
et
al.
,
2006
)
.
In
our
experiments
we
compare
to
He
and
Young
(
2006
)
on
the
ATIS
domain
and
Zettlemoyer
and
Collins
(
2005
)
on
the
Geo880
domain
,
because
these
systems
currently
achieve
the
best
performance
on
these
problems
.
The
approach
of
Zettlemoyer
and
Collins
(
2005
)
was
presented
in
section
2.4
.
He
and
Young
(
2005
)
describe
an
algorithm
that
learns
a
probabilistic
push-down
automaton
that
models
hierarchical
dependencies
but
can
still
be
trained
on
a
data
set
that
does
not
have
full
treebank-style
annotations
.
This
approach
has
been
integrated
with
a
speech
recognizer
and
shown
to
be
robust
to
recognition
errors
(
He
and
Young
,
2006
)
.
There
is
also
related
work
in
the
CCG
literature
.
Clark
and
Curran
(
2003
)
present
a
method
for
learning
the
parameters
of
a
log-linear
CCG
parsing
model
from
fully
annotated
normal-form
parse
trees
.
Watkinson
and
Manandhar
(
1999
)
present
an
unsupervised
approach
for
learning
CCG
lexicons
that
does
not
represent
the
semantics
of
the
training
sentences
.
Bos
et
al.
(
2004
)
present
an
algorithm
that
learns
CCG
lexicons
with
semantics
but
requires
fully-specified
CCG
derivations
in
the
training
data
.
Bozsahin
(
1998
)
presents
work
on
using
CCG
to
model
languages
with
free
word
order
.
In
addition
,
there
is
related
work
that
focuses
on
modeling
child
language
learning
.
Siskind
(
1996
)
presents
an
algorithm
that
learns
word-to-meaning
mappings
from
sentences
that
are
paired
with
a
set
of
possible
meaning
representations
.
Villavicencio
(
2001
)
describes
an
approach
that
learns
a
categorial
grammar
with
syntactic
and
semantic
information
.
Both
of
these
approaches
use
sentences
from
child-directed
speech
,
which
differ
significantly
from
the
natural
language
interface
queries
we
consider
.
Finally
,
there
is
work
on
manually
developing
parsing
techniques
to
improve
robustness
(
Carbonell
and
Hayes
,
1983
;
Seneff
,
1992
)
.
In
contrast
,
our
approach
is
integrated
into
a
learning
framework
.
6
Experiments
The
main
focus
of
our
experiments
is
on
the
ATIS
travel
planning
domain
.
For
development
,
we
used
4978
sentences
,
split
into
a
training
set
of
4500
examples
,
and
a
development
set
of
478
examples
.
For
test
,
we
used
the
ATIS
NOV93
test
set
which
contains
448
examples
.
To
create
the
annotations
,
we
created
a
script
that
maps
the
original
SQL
annotations
provided
with
the
data
to
lambda-calculus
expressions
.
He
and
Young
(
2006
)
previously
reported
results
on
the
ATIS
domain
,
using
a
learning
approach
which
also
takes
sentences
paired
with
semantic
annotations
as
input
.
In
their
case
,
the
semantic
structures
resemble
context-free
parses
with
semantic
(
as
opposed
to
syntactic
)
non-terminal
labels
.
In
our
experiments
we
have
used
the
same
split
into
training
and
test
data
as
He
and
Young
(
2006
)
,
ensuring
that
our
results
are
directly
comparable
.
He
and
Young
(
2006
)
report
partial
match
figures
for
their
parser
,
based
on
precision
and
recall
in
recovering
attribute-value
pairs
.
(
For
example
,
the
sentence
flights
to
Boston
would
have
a
single
attribute-value
entry
,
namely
destination
=
Boston
.
)
It
is
simple
for
us
to
map
from
lambda-calculus
expressions
to
attribute-value
entries
of
this
form
;
for
example
,
the
expression
to
(
x
,
Boston
)
would
be
mapped
to
destination
=
Boston
.
He
and
Young
(
2006
)
gave
us
their
data
and
annotations
,
so
we
can
directly
compare
results
on
the
partial-match
criterion
.
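The partial-match scoring can be sketched as precision, recall and F-measure over sets of attribute-value pairs. The function below scores a single sentence; how scores are aggregated over a test set (per sentence or pooled) is not specified here and is left as an assumption. The name `prf` is hypothetical.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]


def prf(predicted: Set[Pair], gold: Set[Pair]) -> Tuple[float, float, float]:
    """Precision, recall and F-measure over attribute-value pairs."""
    correct = len(predicted & gold)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f
```

A parse that recovers `destination = Boston` but misses a second gold pair gets full precision and half recall, which is why partial-match figures are higher than exact-match figures.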
We
also
report
accuracy
for
exact
matches
of
lambda-calculus
expressions
,
which
is
a
stricter
criterion
.
In
addition
,
we
report
results
for
the
method
on
the
Geo880
domain
.
This
allows
us
to
compare
directly
to
the
previous
work
of
Zettlemoyer
and
Collins
(
2005
)
,
using
the
same
split
of
the
data
into
training
and
test
sets
of
sizes
600
and
280
respectively
.
We
use
cross-validation
of
the
training
set
,
as
opposed
to
a
separate
development
set
,
for
optimization
of
parameters
.
The
simplest
approach
to
the
task
is
to
train
the
parser
and
directly
apply
it
to
test
sentences
.
In
our
experiments
we
will
see
that
this
produces
results
which
have
high
precision
,
but
somewhat
lower
recall
,
due
to
some
test
sentences
failing
to
parse
(
usually
due
to
words
in
the
test
set
which
were
never
observed
in
training
data
)
.
A
simple
strategy
to
alleviate
this
problem
is
as
follows
.
If
the
sentence
fails
to
parse
,
we
parse
the
sentence
again
,
this
time
allowing
parse
moves
which
can
delete
words
at
some
cost
.
The
cost
of
this
deletion
operation
is
optimized
on
development
data
.
This
approach
can
significantly
improve
F-measure
on
the
partial-match
criterion
in
particular
.
We
report
results
both
with
and
without
this
second
pass
strategy
.
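The two-pass strategy amounts to a simple fallback, sketched below. The two parser callables are placeholders supplied by the caller: `parse` is the regular parser and `parse_with_deletion` is the variant that may skip words at a cost. Both names are hypothetical, and a failed parse is assumed to be signalled by `None`.

```python
from typing import Callable, Optional, TypeVar

R = TypeVar("R")


def two_pass_parse(sentence: str,
                   parse: Callable[[str], Optional[R]],
                   parse_with_deletion: Callable[[str], Optional[R]]) -> Optional[R]:
    """Try the strict parser first; fall back to word deletion only on failure."""
    result = parse(sentence)
    if result is not None:
        return result
    return parse_with_deletion(sentence)
```

Since the second pass runs only when the first fails, it can raise recall without changing the output for sentences that already parse.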
6.2
Parameters
in
the
Approach
The algorithm in figure 4 has a number of parameters, the set $\{T, \alpha, \beta, \gamma\}$, which we now describe. The values of these parameters were chosen to optimize the performance on development data. $T$ is the number of passes over the training set, and was set to be 4. Each lexical entry in the initial lexicon $\Lambda_0$ has an associated feature which counts the number of times this entry is seen in a parse. The initial parameter value in $w$ for all features of this form was chosen to be some value $\alpha$. Each of the new CCG rules (the application, composition, crossed-composition, and type-raising rules described in section 3) has an associated parameter. We set all of these parameters to the same initial value $\beta$. Finally, when new lexical entries are added to $\Lambda$ (in step 2 of the algorithm), their initial weight is set to some value $\gamma$. In practice, optimization on development data led to a positive value for $\alpha$, and negative values for $\beta$ and $\gamma$.
Table
1
shows
accuracy
for
the
method
by
the
exact-match
criterion
on
the
ATIS
test
set
.
The
two
pass
strategy
actually
hurts
F-measure
in
this
case
,
although
it
does
improve
recall
of
the
method
.
Table
2
shows
results
under
the
partial-match
criterion
.
The
results
for
our
approach
are
higher
than
those
reported
by
He
and
Young
(
2006
)
even
without
the
second
,
high-recall
,
strategy
.
With
the
two-pass
strategy
our
method
has
more
than
halved
the
F-measure
error
rate
,
giving
improvements
from
90.3
%
F-measure
to
95.9
%
F-measure
.
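The "more than halved" claim follows directly from the two F-measure figures quoted above, treating $100 - F$ as the error rate:

```latex
\begin{align*}
\text{error (He and Young, 2006)} &= 100 - 90.3 = 9.7 \\
\text{error (two-pass strategy)} &= 100 - 95.9 = 4.1 \\
\frac{4.1}{9.7} &\approx 0.42 \quad\text{(a relative error reduction of about 58\%)}
\end{align*}
```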
Table 1: Exact-match accuracy on the ATIS test set.
Table 2: Partial-credit accuracy on the ATIS test set.
The
new
method
gives
improvements
in
performance
both
with
and
without
the
two
pass
strategy
,
showing
that
the
new
CCG
combinators
,
and
the
new
learning
algorithm
,
give
some
improvement
on
even
this
domain
.
The
improved
performance
comes
from
a
slight
drop
in
precision
which
is
offset
by
a
large
increase
in
recall
.
Table
4
shows
ablation
studies
on
the
ATIS
data
,
where
we
have
selectively
removed
various
aspects
of
the
approach
,
to
measure
their
impact
on
performance
.
It
can
be
seen
that
accuracy
is
seriously
degraded
if
the
new
CCG
rules
are
removed
,
or
if
the
features
associated
with
these
rules
(
which
allow
the
model
to
penalize
these
rules
)
are
removed
.
Finally
,
we
report
results
concerning
the
efficiency
of
the
new
online
algorithm
as
compared
to
the
ZC05
algorithm
.
We
compared
running
times
for
the
new
algorithm
,
and
the
ZC05
algorithm
,
on
the
geography
domain
,
with
both
methods
making
4
passes
over
the
training
data
.
The
new
algorithm
took
less
than
4
hours
,
compared
to
over
12
hours
for
the
ZC05
algorithm
.
The
main
explanation
for
this
improved
performance
is
that
on
many
training
examples,¹⁰
in
step
1
of
the
new
algorithm
a
correct
parse
is
found
,
and
the
algorithm
immediately
moves
on
to
the
next
example
.
Thus
GENLEX
is
not
required
,
and
in
particular
parsing
the
example
with
the
large
set
of
entries
generated
by
GENLEX
is
not
required
.
7
Discussion
We
presented
a
new
,
online
algorithm
for
learning
a
combinatory
categorial
grammar
(
CCG
)
,
together
with
parameters
that
define
a
log-linear
parsing
model
.
We showed that the use of non-standard CCG combinators is highly effective for parsing sentences with the types of phenomena seen in spontaneous, unedited natural language. The resulting system achieved significant accuracy improvements in both the ATIS and Geo880 domains.
¹⁰ Measurements on the Geo880 domain showed that in the 4 iterations, 83.3% of all parses were successful at step 1.
Table 3: Exact-match accuracy on the Geo880 test set. Rows: Single-Pass Parsing, Two-Pass Parsing.
Table 4: Exact-match accuracy on the ATIS development set for the full algorithm and restricted versions of it. Rows: Full Online Method, Without control features, Without relaxed word order, Without word insertion; recovered column header: Precision. The second row reports results of the approach without the features described in section 3 that control the use of the new combinators. The third row presents results without the combinators from section 3.1 that relax word order. The fourth row reports experiments without the type-raising combinators presented in section 3.2.
Acknowledgements
We
would
like
to
thank
Yulan
He
and
Steve
Young
for
their
help
with
obtaining
the
ATIS
data
set
.
We
also
acknowledge
the
support
for
this
research
.
Luke
Zettlemoyer
was
funded
by
a
Microsoft
graduate
research
fellowship
and
Michael
Collins
was
supported
by
the
National
Science
Foundation
under
grants
0347631
and
DMS-0434222
.
