We
present
a
simple
history-based
model
for
sentence
generation
from
LFG
f-structures
,
which
improves
on
the
accuracy
of
previous
models
by
breaking
down
PCFG
independence
assumptions
so
that
more
f-structure
conditioning
context
is
used
in
the
prediction
of
grammar
rule
expansions
.
In
addition
,
we
present
work
on
experiments
with
named
entities
and
other
multi-word
units
,
showing
a
statistically
significant
improvement
of
generation
accuracy
.
Tested
on
section
23
of
the
Penn
Wall
Street
Journal
Treebank
,
the
techniques
described
in
this
paper
improve
BLEU
scores
from
66.52
to
68.82
,
and
coverage
from
98.18
%
to
99.96
%
.
1
Introduction
Sentence
generation
,
or
surface
realisation
,
is
the
task
of
generating
meaningful
,
grammatically
correct
and
fluent
text
from
some
abstract
semantic
or
syntactic
representation
of
the
sentence
.
It
is
an
important
and
growing
field
of
natural
language
processing
with
applications
in
areas
such
as
transfer-based
machine
translation
(
Riezler
and
Maxwell
,
2006
)
and
sentence
condensation
(
Riezler
et
al.
,
2003
)
.
While
recent
work
on
generation
in
restricted
domains
,
such
as
(
Belz
,
2007
)
,
has
shown
promising
results, there remains much room for improvement, particularly
for
broad
coverage
and
robust
generators
,
like
those
of
Nakanishi
et
al.
(
2005
)
and
Cahill
and
van
Genabith
(
2006
)
,
which
do
not
rely
on
handcrafted
grammars
and
thus
can
easily
be
ported
to
new
languages
.
This
paper
is
concerned
with
sentence
generation
from
Lexical-Functional
Grammar
(
LFG
)
f-structures
(
Kaplan
,
1995
)
.
We
present
improvements
in
previous
LFG-based
generation
models
firstly
by
breaking
down
PCFG
independence
assumptions
so
that
more
f-structure
conditioning
context
is
included
when
predicting
grammar
rule
expansions
.
This
history-based
approach
has
worked
well
in
parsing
(
Collins
,
1999
;
Charniak
,
2000
)
and
we
show
that
it
also
improves
PCFG-based
generation
.
We
also
present
work
on
utilising
named
entities
and
other
multi-word
units
to
improve
generation
results
for
both
accuracy
and
coverage
.
There
has
been
a
limited
amount
of
exploration
into
the
use
of
multi-word
units
in
probabilistic
parsing
,
for example in LFG parsing (Kaplan and King, 2003) and dependency parsing (Nivre and Nilsson, 2004).
We
are
not
aware
of
any
similar
work
on
generation
.
In
the
LFG-based
generation
algorithm
presented
by
Cahill
and
van
Genabith
(
2006
)
complex
named
entities
(
i.e.
those
consisting
of
more
than
one
word
token
)
and
other
multi-word
units
can
be
fragmented
in
the
surface
realization
.
We
show
that
the
identification
of
such
units
may
be
used
as
a
simple
measure
to
constrain
the
generation
model
's
output
.
We
take
the
generator
of
(
Cahill
and
van
Genabith
,
2006
)
as
our
baseline
generator
.
When
tested
on
f-structures
for
all
sentences
from
Section
23
of
the
Penn
Wall
Street
Journal
(
WSJ
)
treebank
(
Marcus et al., 1993),
the
techniques
described
in
this
paper
improve
BLEU
score
from
66.52
to
68.82
.
In
addition
,
coverage
is
increased
from
98.18
%
to
almost
100
%
(
99.96
%
)
.
The
remainder
of
the
paper
is
structured
as
follows
:
in
Section
2
we
review
related
work
on
statistical
sentence
generation
.
Section
3
describes
the
baseline
generation
model
and
in
Section
4
we
show
how
the
new
history-based
model
improves
over
the
baseline
.
In
Section
5
we
describe
the
source
of
the
multi-word
units
(
MWU
)
used
in
our
experiments
and
the
various
techniques
we
employ
to
make
use
of
these
MWUs
in
the
generation
process
.
Section
6
gives
experimental
details
and
results
.
2
Related
Work
on
Statistical
Generation
In
(
statistical
)
generators
,
sentences
are
generated
from
an
abstract
linguistic
encoding
via
the
application
of
grammar
rules
.
These
rules
can
be
handcrafted
grammar
rules
,
such
as
those
of
(
Langkilde-Geary
,
2002
;
Carroll
and
Oepen
,
2005
)
,
created
semi-automatically
(
Belz
,
2007
)
or
,
alternatively
,
extracted
fully
automatically
from
treebanks
(
Bangalore
and
Rambow
,
2000
;
Nakanishi
et
al.
,
2005
;
Cahill
and
van
Genabith
,
2006
)
.
Another
feature
which
characterises
statistical
generators
is
the
probability
model
used
to
select
the
most
probable
sentence
from
among
the
space
of
all
possible
sentences
licensed
by
the
grammar
.
One
generation
technique
is
to
first
generate
all
possible
sentences
,
storing
them
in
a
word
lattice
(
Langkilde
and
Knight
,
1998
)
or
,
alternatively
,
a
generation
forest
,
a
packed
representation of alternate
trees
proposed
by
the
generator
(
Langkilde
,
2000
)
,
and
then
select
the
most
probable
sequence
of
words
via
an
n-gram
language
model
.
Increasingly,
syntax-based
information
is
being
incorporated
directly
into
the
generation
model
.
For
example
,
Carroll
and
Oepen
(
2005
)
describe
a
sentence
realisation
process
which
uses
a
hand-crafted
HPSG
grammar
to
generate
a
generation
forest
.
A
selective
unpacking
algorithm
allows
the
extraction
of
an
n-best
list
of
realisations
where
realisation
ranking
is
based
on
a
maximum
entropy
model
.
This
unpacking
algorithm
is
used
in
(
Velldal
and
Oepen
,
2005
)
to
rank
realisations
with
features
defined
over
HPSG
derivation
trees
.
They
achieved
the
best
results
when
combining
the
tree-based
model
with
an
n-gram
language
model
.
Nakanishi
et
al.
(
2005
)
describe
a
treebank-extracted
HPSG-based
chart
generator
.
Importing
techniques
developed
for
HPSG
parsing
,
they
apply
a
log
linear
model
to
a
packed
representation
of
all
alternative
derivation
trees
for
a
given
input
.
They
found
that
a
model
which
included
syntactic
information
outperformed
a
bigram
model
as
well
as
a
combination
of
bigram
and
syntax
model
.
The
probability
model
described
in
this
paper
also
incorporates
syntactic
information
;
however
,
unlike
the
discriminative
HPSG
models
just
described
,
it
is
a
generative
history-
and
PCFG-based
model
.
While some earlier systems mention the use of contextual features for the rules in their generation models, they provide neither details nor a formal probability model.
To
the
best
of
our
knowledge
this
is
the
first
paper
providing
a
probabilistic
generative
,
history-based
generation
model
.
3
Surface
Realisation
from
f-Structures
Cahill
and
van
Genabith
(
2006
)
present
a
probabilistic
surface
generation
model
for
LFG
(
Kaplan
,
1995
)
.
LFG
is
a
constraint-based
theory
of
grammar
,
which
analyses
strings
in
terms
of
c(onstituency)-structure and f(unctional)-structure (Figure 1).
C-structure
is
defined
in
terms
of
CFGs
,
and
f-structures
are
recursive
attribute-value
matrices
which
represent
abstract
syntactic
functions
(such as SUBJect, OBJect, OBLique, COMPlement (sentential), ADJunct), agreement, control, long-distance dependencies
and
some
semantic
information
(
e.g.
tense
,
aspect
)
.
C-structures
and
f-structures
are
related
in
a
projection
architecture
in
terms
of
a
piecewise
correspondence
φ.¹

¹ Our formalisation follows (Kaplan, 1995).

Figure 1: C- and f-structures with φ links for the sentence Susan contacted her.

The correspondence is indicated in terms of the curvy arrows pointing from c-structure nodes to f-structure components in Figure 1.
Given
a
c-structure
node
n_i, the corresponding f-structure component f_j is φ(n_i).
F-structures
and
the
c-structure
/
f-structure
correspondence
are
described
in
terms
of
functional
annotations
on
c-structure
nodes
(
CFG
grammar
rules
)
.
An equation of the form (↑ F) = ↓ states that the f-structure associated with the mother of the current c-structure node (↑) has an attribute (grammatical function) F, whose value is the f-structure of the current node (↓).
The up-arrows and down-arrows are shorthand: an annotation ↑=↓ on a c-structure node n_i abbreviates φ(M(n_i)) = φ(n_i), where n_i is the c-structure node annotated with the equation.²

² M is the mother function on CFG tree nodes.
The generation model of (Cahill and van Genabith, 2006) maximises the probability of a tree given an f-structure, and the string generated is the yield of the highest probability tree:

Tree_best = argmax_Tree P(Tree | F-Str)   (1)

The generation process is guided by purely local information in the input f-structure: f-structure annotated CFG rules (LHS → RHS) are conditioned on their LHSs and on the set of features/attributes Feats = {a_i | ∃v_j : (φ(LHS) a_i) = v_j}³ of the f-structure φ-linked to the LHS (Eqn. 2).

³ In words, Feats is the set of top-level features/attributes (those attributes a_i for which there is a value v_i) of the f-structure φ-linked to the LHS.
Table
1
shows
a
generation
grammar
rule
and
conditioning
features
extracted
from
the
example
in
Figure
1
.
The probability of a tree is decomposed into the product of the probabilities of the f-structure annotated rules (conditioned on the LHS and local Feats) contributing to the tree:

P(Tree | F-Str) = ∏_{LHS→RHS ∈ Tree} P(RHS | LHS, Feats)   (2)
Conditional
probabilities
are
estimated
using
maximum
likelihood
estimation
.
grammar rule | local conditioning features
S(↑=↓) → NP(↑SUBJ=↓) VP(↑=↓) | S, {subj, obj, pred, tense}
Table
1
:
Example
grammar
rule
(
from
Figure
1
)
.
Cahill
and
van
Genabith
(
2006
)
note
that
conditioning
f-structure
annotated
generation
rules
on
local
features
(
Eqn
.
2
)
can
sometimes
cause
the
model
to
make
inappropriate
choices
.
Consider
the
following
scenario
where
in
addition
to
the
c-/f-structure in Figure 1, the training set contains the c-/f-structure
displayed
in
Figure
2
.
From
Figures
1
and
2
,
the
model
learns
(
among
others
)
the
generation
rules
and
conditional
probabilities
displayed
in
Tables
2
and
3
.
Table
2
:
A
sample
of
internal
grammar
rules
extracted
from
Figures
1
and
2
.
Given
the
input
f-structure
(
for
She
accepted
)
in
Figure
3
(and assuming suitable generation rules for intransitive VPs and accepted), the
model
would
produce
the
inappropriate
highest
probability
tree
of
Figure
4
with
an
incorrect
case
for
the
pronoun
in
subject
position
.
Figure 2: C- and f-structures with φ links for the sentence She hired her.
Table
3
:
A
sample
of
lexical
item
rules
extracted
from
Figures
1
and
2
.
Figure 3: Input f-structure for She accepted.
To
solve
the
problem
,
Cahill
and
van
Genabith
(
2006
)
apply
an
automatic
generation
grammar
transformation
to
their
training
data
:
they
automatically
label
CFG
nodes
with
additional
case
information
and
the
model
now
learns
the
new
improved
generation
rules
of
Tables
4
and
5
.
Note
how
the
additional
case
labelling
subverts
the
problematic
independence
assumptions
of
the
probability
model
and
communicates
the
fact
that
a
subject
NP
has
to
be
realised
as
nominative
case
from
the
S → NP-nom VP production, via the intermediate NP-nom → PRP-nom, down to the lexical production PRP-nom → she.
The
labelling
guarantees
that
,
given
the
example
f-structure
in
Figure
3
,
the
model
generates
the
correct
string
she
accepted
.
Figure 4: Inappropriate output: her accepted.
Table
4
:
Internal
grammar
rules
with
case
markings
.
Table
5
:
Lexical
item
rules
with
case
markings.
4
A
History-Based
Generation
Model
The
automatic
generation
grammar
transform
presented
in
(
Cahill
and
van
Genabith
,
2006
)
provides
a
solution
to
coarse-grained
and
(
in
fact
)
inappropriate
independence
assumptions
in
the
basic
generation
model
.
However
,
there
is
a
sense
in
which
the
proposed
cure
treats
the
symptoms
,
but
not
the
cause
of
the
problem
:
it
weakens
independence
assumptions
by
multiplying
and
hence
increasing
the
specificity
of
conditioning
CFG
category
labels
.
There
is
another
option
available
to
us
,
and
that
is
the
option
we
will
explore
in
this
paper
:
instead
of
applying
a
generation
grammar
transform
,
we
will
improve
the
f-structure-based
conditioning
of
the
generation
rule
probabilities
.
In
the
original
model
,
rules
are
conditioned
on
purely
local
f-structure
context
:
the
set
of
features
/
attributes
φ-linked
to
the
LHS
of
a
grammar
rule
.
As
a
direct
consequence
of
this
,
the
conditioning
(
and
hence
the
model
)
cannot
distinguish
between
NP
,
PRP
and
NNP
rules
appropriate
to
e.g.
subject
(
subj
)
or
object
contexts
(
obj
)
in
a
given
input
f-structure
.
However
,
the
required
information
can
easily
be
incorporated
into
the
generation
model
by
uniformly
conditioning
generation
rules
on
their
parent
(
mother
)
grammatical
function
,
in
addition
to
the
local
φ-linked
feature
set
.
This
additional
conditioning
has
the
effect
of
making
the
choice
of
generation
rules
sensitive
to
the
history
of
the
generation
process
,
and
,
we
argue
,
provides
a
simpler
,
more
uniform
,
general
,
intuitive
and
natural
probabilistic
generation
model
obviating
the
need
for
CFG-grammar
transforms
in
the
original
proposal
of
(
Cahill
and
van
Genabith
,
2006
)
.
In
the
new
model
,
each
generation
rule
is
now
conditioned
on
the
LHS
rule
CFG
category
,
the
set
of
features φ-linked to the LHS and the parent grammatical function of the f-structure φ-linked to the LHS.
In a given c-/f-structure pair, for a CFG node n, the parent grammatical function of the f-structure φ-linked to n is that grammatical function GF which, if we take the f-structure φ-linked to the mother M(n) and apply it to GF, returns the f-structure φ-linked to n: (φ(M(n)) GF) = φ(n).
The
basic
idea
is
best
explained
by
way
of
an
example
.
Consider
again
Figure
1
.
The mother grammatical function of the f-structure f_2 associated with node NP (↑SUBJ=↓) and its daughter NNP (↑=↓) (via the ↑=↓ functional annotation) is SUBJ, as (φ(M(n_2)) SUBJ) = φ(n_2), or equivalently (f_1 SUBJ) = f_2.
Given
Figures
1
and
2
as
training
set
,
the
improved
model
learns
the
generation
rules
(
the
mother
grammatical
function
of
the
outermost
f-structure
is
assumed
to
be
a
dummy
top
grammatical
function
)
of
Tables
6
and
7
.
Table
6
:
Grammar
rules
with
extra
feature
extracted
from
f-structures
.
Note that for our example the uniform additional conditioning on mother grammatical function has the same effect as the generation grammar transform of (Cahill and van Genabith, 2006), but without the need for the grammar transform.

Table 7: Lexical item rules (columns: F-Struct Feats | Grammar Rules).
Given
the
input
f-structure
in
Figure
3
,
the
model
will
generate
the
correct
string
she
accepted
.
In
addition
,
uniform
conditioning
on
mother
grammatical
function
is
more
general
than
the
case-phenomena
specific
generation
grammar
transform
of
(
Cahill
and
van
Genabith
,
2006
)
,
in
that
it
applies
to
each
and
every
sub-part
of
a
recursive
input
f-structure
driving
generation
,
making
available
relevant
generation
history
(
context
)
to
guide
local
generation
decisions
.
The new history-based probabilistic generation model is defined as:

P(Tree | F-Str) = ∏_{LHS→RHS ∈ Tree} P(RHS | LHS, Feats, GF)   (3)

where GF is the parent grammatical function of the f-structure φ-linked to the LHS.
Note
that
the
new
conditioning
feature
,
the
f-structure
mother
grammatical
function
,
GF
,
is
available
from
structure
previously
generated
in
the
c-structure
tree
.
As
such
,
it
is
part
of
the
history
of
the
tree
,
i.e.
it
has
already
been
generated
in
the
top-down
derivation
of
the
tree
.
In
this
way
,
the
generation
model
resembles
history-based
models
for
parsing
(
Black
et
al.
,
1992
;
Collins
,
1999
;
Charniak
,
2000
)
.
Unlike
,
say
,
the
parent
annotation
for
parsing
of
(
Johnson
,
1998
),
the
parent
GF
feature
for
a
particular
node
expansion
is
not
merely
extracted
from
the
parent
node
in
the
c-structure
tree
,
but
is
sometimes
extracted
from
an
ancestor
node
further
up
the
c-structure
tree
via
intervening
↑=↓
functional
annotations
.
Section
6
provides
evaluation
results
for
the
new
model
on
section
23
of
the
Penn
treebank
.
5
Multi-Word
Units
In
another
effort
to
improve
generator
accuracy
over
the
baseline
model
we
explored
the
use
of
multiword
units
in
generation
.
We
expect
that
the
identification
of
MWUs
may
be
useful
in
imposing
word-order
constraints
and
reducing
the
complexity
of
the
generation
task
.
Figure 5: Three different f-structure formats. From left to right: the original f-structure format; the MWU chunk format; the MWU mark-up format.

Take, for example, the following two sentences
which
show
the
gold
version
of
a
sentence
followed
by
the
version
of
the
sentence
produced
by
the
generator
:
Gold: By this time, in New York, it was 4:30 a.m., and Mr. Smith fielded a call from a New York customer wanting an opinion on the British stock market, which had been having troubles of its own even before Friday's New York market break.

Test: By this time, in New York, it was 4:30 a.m., and Mr. Smith fielded a call from New a customer York, wanting an opinion on the market British stock which had been having troubles of its own even before Friday's New York market break.
The
gold
version
of
the
sentence
contains
a
multiword
unit
,
New
York
,
which
appears
fragmented
in
the
generator
output
.
If
multi-word
units
were
either
treated
as
one
token
throughout
the
generation
process
,
or
,
alternatively
,
if
a
constraint
were
imposed
on
the
generator
such
that
multi-word
units
were
always
generated
in
the
correct
order
,
then
this
should
help
improve
generation
accuracy
.
In
Section
5.1
we
describe
the
various
techniques
that
were
used
to
incorporate
multi-word
units
into
the
generation
process
and
in
5.2
we
detail
the
different
types
and
sources
of
multi-word
units
used
in
the
experiments
.
Section
6
provides
evaluation
results
on
test
and
development
sets
from
the
WSJ
treebank
.
5.1
Incorporating
MWUs
into
the
Generation
Process
We
carried
out
three
types
of
experiment
which
,
in
different
ways
,
enabled
the
generation
process
to
respect
the
restrictions
on
word-order
provided
by
multi-word
units
.
For
the
first
experiments
(
type
1
)
,
the
WSJ
treebank
training
and
test
data
were
altered
so
that
multi-word
units
are
concatenated
into
single
words
(
for
example
,
New
York
becomes
New.York
)
.
As in (Cahill and van Genabith, 2006), f-structures are generated from the (now altered) treebank, and from this data, along with the treebank trees, the PCFG-based grammar used for training the generation model is extracted.
Similarly
,
the
f-structures
for
the
test
and
development
sets
are
created
from
Penn
Treebank
trees
which
have
been
modified
so
that
multi-word
units
form
single
units
.
The
leftmost
and
middle
f-structures
in
Figure
5
show
an
example
of
an
original
f-structure
format
and
a
named-entity
chunked
format
,
respectively
.
Strings
output
by
the
generator
are
then
post-processed
so
that
the concatenated words are converted back into the original word sequences
.
In
the
second
experiment
(
type
2
)
only
the
test
data
was
altered
with
no
concatenation
of
MWUs
carried
out
on
the
training
data
.
In
the
final
experiments
(
type
3
)
,
instead
of
concatenating
named
entities
,
a
constraint
is
introduced
to
the
generation
algorithm
which
penalises
the
generation
of
sequences
of
words
which
violate
the
internal
word
order
of
named
entities
.
The
input
is
marked-up
in
such
a
way
that
,
although
named
entities
are
no
longer
chunked
together
to
form
single
words
,
the
algorithm
can
read
which
items
are
part
of
named
entities
.
See
the
rightmost
f-structure
in
Figure
5
for
an
example
of
an
f-structure
marked-up
in
this
way
.
The
tag
NE1J
,
for
example
,
indicates
that
the
sub-f-structure
is
part
of
a
named
entity
with
id
number
1
and
that
the
item
corresponds
to
the
first
word
of
the
named
entity
.
The
baseline
generation
algorithm
,
following
Kay's (1996) work
on
chart
generation
,
already
contains
the
hard
constraint
that
when
combining
two
chart
edges
they
must
cover
disjoint
sets
of
words
.
We
added
an
additional
constraint
which
prevents
edges
from
being
combined
if
this
would
result
in
the
generation
of
a
string
which
contained
a
named
entity
which
was
either
incomplete
or
where
the
words
in
the
named
entity
were
generated
in
the
wrong
order
.
5.2
Types
of
MWUs
used
in
Experiments
We
carry
out
experiments
with
multi-word
units
from
three
different
sources
.
First
,
we
use
the
output
of
the
maximum
entropy-based
named
entity
recognition
system
of
(
Chieu
and
Ng
,
2003
)
.
This
system
identifies
four
types
of
named
entity
:
person
,
organisation
,
location
,
and
miscellaneous
.
Additionally
we
use
a
dictionary
of
candidate
multi-word
expressions
based
on
a
list
from
the
Stanford
Multiword
Expression
Project⁴
.
Finally
,
we
also
carry
out
experiments
with
multi-word
units
extracted
from
the
BBN
Pronoun
Coreference
and
Entity
Type
Corpus
(
Weischedel
and
Brunstein
,
2005
)
.
This
supplements
the
Penn
WSJ
treebank
's
one
million
words
of
syntax-annotated
Wall
Street
Journal
text
with
additional
annotations
of
23
named
entity
types
,
including
nominal-type
named
entities
such
as
person
,
organisation
,
location
,
etc.
as
well
as
numeric
types
such
as
date
,
time
,
quantity
and
money
.
Since
the
BBN
corpus
data
is
very
comprehensive
and
is
hand-annotated, we take this to be
a
gold
standard
,
representing
an
upper
bound
for
any
gains
that
might
be
made
by
identifying
complex
named
entities
in
our
experiments.⁵
Table
8
gives
examples
of
the
various
types
of
MWUs
identified
by
the
three
sources
.
For
our
purposes
we
are
not
concerned
with
the
distinctions
between
different
types
of
named
entities
;
we
are
merely
exploiting
the
fact
that
they
may
be
treated
as
atomic
units
in
the
generation
model
.
In
all
cases
we
disregard
multi-word
units
that
cross
the
original
syntactic
bracketing
of
the
WSJ
treebank
.
An
overview
of the various types of multi-word
units
used
in
our
experiments
is
presented
in
Table
9
.
6
Experimental
Evaluation
All
experiments
were
carried
out
on
the
WSJ
treebank
with
sections
02-21
for
training
,
section
24
for
development
and
section
23
for
final
test
results
.
The
LFG
annotation
algorithm
of
(
Cahill
et
al.
,
2004
)
was
used
to
produce
the
f-structures
for
development
,
test
and
training
sets
.
⁴ mwe.stanford.edu
⁵ It is possible that there are other types of MWUs more suitable to the task than the named entities identified by BBN, so further gains might be possible.
MWU type | Examples
Persons | Martha Matthews; Yoshio Hatakeyama
Organisations | Rolls-Royce Motor Cars Inc.; Washington State University
Locations | New York City; New Zealand
Time expressions | two years ago
Quantities |
Prepositional expressions | at the time; on average

Table 8: Examples of some of the types of MWU from the three different sources.
Table 9: Average number of MWUs per sentence and average MWU length in the WSJ treebank grouped by MWU source (sources include the Stanford MWE Project and the BBN Corpus; columns: average number, average length).
Table
10
shows
the
final
results
for
section
23
.
For
each
test
we
present
BLEU
score
results
as
well
as
String
Edit
Distance
and
coverage
.
We
measure
statistical
significance
using
two
different
tests
.
First
we
use
a
bootstrap
resampling
method
,
popular
for
machine
translation
evaluations
,
to
measure
the
significance
of
improvements
in
BLEU
scores
,
with
a
resampling
rate
of
1000.⁶
We
also
calculated
the
significance
of
an
increase
in
String
Edit
Distance
by
carrying
out
a
paired
t-test
on
the
mean
difference
of
the
String
Edit
Distance
scores
.
In
Table
10
,
≫ means significant at level 0.005 and > means significant at level 0.05.
In
Table
10
,
Baseline
gives
the
results
of
the
generation
algorithm
of
(
Cahill
and
van
Genabith
,
2006
)
.
HB
Model
refers
to
the
improved
model
with
the
increased
history
context
,
as
described
in
Section
4
.
The
results
,
where
for
example
the
BLEU
score
rises
from
66.52
to
67.24
,
show
that
even
increasing
the
conditioning
context
by
a
limited
amount increases the accuracy of the system significantly for both BLEU and String Edit Distance.

⁶ Scripts for running the bootstrapping method carried out in our evaluation are available for download at projectile.is.cs.cmu.edu/research/public/tools/bootStrap/tutorial.htm

Table 10: Results on Section 23 for all sentence lengths (columns: BLEU, Bootstrap Signif, StringEd, Paired T-Test).
In
addition
,
coverage
goes
up
from
98.18
%
to
99.88
%
.
+MWU
Best
Automatic
displays
our
best
results
using
automatically
identified
named
entities
.
These
were
achieved
using
experiment
type
2
,
described
in
Section
5
,
with
the
MWUs
produced
by
(
Chieu
and
Ng
,
2003
)
.
Results
displayed
in
Table
10
up
to
this
point
are
cumulative
.
The
final
row
in
Table
10
,
MWU
BBN
,
shows
the
best
results
with
BBN
MWUs
:
the
history-based
model
with
BBN
multiword
units
incorporated
in
a
type
1
experiment
.
We
now
discuss
the
various
MWU
experiments
in
more
detail
.
See
Table
11
for
a
breakdown
of
the
MWU
experiment
results
on
the
development
set
,
WSJ
section
24
.
Our
baseline
for
these
experiments
is
the
history-based
generator
presented
in
Section
4
.
For
each
experiment
type
described
in
Section
5.1
we
ran
three
experiments
,
varying
the
source
of
MWUs
.
First
,
MWUs
came
from
the
automatic
NE
recogniser
of
(
Chieu
and
Ng
,
2003
)
,
then
we
added
the
MWUs
from
the
Stanford
list
and
finally
we
ran
tests
with
MWUs
extracted
from
the
BBN
corpus
.
Our
first
set
of
experiments
(
type
1
)
,
where
both
training
data
and
development
set
data
were
MWU-chunked
,
produced
the
worst
results
for
the
automatically
chunked
MWUs
.
BLEU
score
accuracy
actually
decreased
for
the
automatically
chunked
MWU
experiments
.
In
an
error
analysis
of
type
1
experiments
with
(
Chieu
and
Ng
,
2003
)
concatenated
MWUs
,
we
inspected
those
sentences
where
accuracy
had
decreased
from
the
baseline
.
We
found
that
for
over
half
(
51.5
%
)
of
these
sentences
,
the
input
f-structures
contained
no
multi-word
units
at
all
.
The
problem
for
these
sentences
therefore
lay
with
the
probabilistic
grammar
extracted
from
the
MWU-chunked
training
data
.
When
the
source
of
MWU
for
the
type
1
experiments
was
the
BBN
,
however
,
accuracy
improved
significantly
over
the
baseline
and
the
result
is
the
highest
accuracy
achieved
over
all
experiment
types
.
One
possible
reason
for
the
low
accuracy
scores
in
the
type
1
experiments
with
the
(
Chieu
and
Ng
,
2003
)
MWU
chunked
data
could
be
noisy
MWUs
which
negatively
affect
the
grammar
.
For
example
,
the
named
entity
recogniser
of
(
Chieu
and
Ng
,
2003
)
achieves
an
accuracy
of
88.3
%
on
section
23
of
the
Penn
Treebank
.
In
order
to
avoid
changing
the
grammar
through
concatenation
of
MWU
components
(
as
in
experiment
type
1
)
and
thus
risking
side-effects
which
cause
some
heretofore
likely
constructions
to become
less
likely
and
vice
versa
,
we
ran
the
next
set
of
experiments
(
type
2
)
which
leave
the
original
grammar
intact
and
alter
the
input
f-structures
only
.
These
experiments
were
more
successful
overall
and
we
achieved
an
improvement
over
the
baseline
for
both
BLEU
and
String
Edit
Distance
scores
with
all
MWU
types
.
As
can
be
seen
from
Table
11
the
best
score
for
automatically
chunked
MWUs
decreases
marginally
when
we
added
the
Stanford
MWUs
.
In
our
final
set
of
experiments
(
type
3
)
although
the
accuracy
for
all
three
types
of
MWUs
improves
over
the
baseline
,
accuracy
is
a
little
below
the
type
2
experiments
.
Table 11: Results on Section 24, all sentence lengths (rows: HB Model, then training and test data chunked (type 1), test data chunked (type 2), internal generation constraint (type 3), each with +Stanford MWEs variants; columns include StringEd and Coverage).
Langkilde-Geary's (2002) HALogen system reports 82.7% coverage and a BLEU score of 75.7 on the same test set with the 'permute, no dir' type input.
Langkilde-Geary
(
2002
)
report
results
for
experiments
with
varying
levels
of
linguistic
detail
in
the
input
given
to
the
generator
.
As
with
Nakanishi
et
al.
(
2005
)
we
find
the
'
permute
,
no
dir
'
type
of
input
is
most
comparable
to
the
level
of
information
contained
in
our
input
f-structures
.
Finally
,
the
symbolic
generator
of
Callaway
(
2003
)
reports
a
Simple
String
Accuracy
score
of
88.84
and
coverage
of
98.7
%
on
section
23
for
all
sentence
lengths
.
7
Conclusion
and
Future
Work
We
have
presented
techniques
which
improve
the
accuracy
of
an
already
state-of-the-art
surface
generation
model
.
We
found
that
a
history-based
model
that
increases
conditioning
context
in
PCFG
style
rules
by
simply
including
the
grammatical
function
of
the
f-structure
parent
improves
generator
accuracy
.
In
the
future
we
will
experiment
with
increasing
conditioning
context
further
and
using
more
sophisticated
smoothing
techniques
to
avoid
sparse
data
problems
when
conditioning
is
increased
.
We
have
also
demonstrated
that
automatically
acquired
multi-word
units
can
bring
about
moderate
,
but
significant
,
improvements
in
generator
accuracy
.
For
automatically
acquired
MWUs
,
we
found
that
this
could
best
be
achieved
by
concatenating
input
items
when
generating
the
f-structure
input
to
the
generator
,
while
training
the
input
generation
grammar
on
the
original
(
i.e.
non-MWU
concatenated
)
sections
of
the
treebank
.
Relying
on
the
BBN
corpus
as
a
source
of
multi-word
units
,
we
gave
an
upper
bound
to
the
potential
usefulness
of
multi-word
units
in
generation
and
showed
that
automatically
acquired
multi-word
units
,
encouragingly
,
give
results
not
far
below
the
upper
bound
.
