Previous
work
on
statistical
language
generation
has
primarily
focused
on
grammaticality
and
naturalness
,
scoring
generation
possibilities
according
to
a
language
model
or
user
feedback
.
More
recent
work
has
investigated
data-driven
techniques
for
controlling
linguistic
style
without
overgeneration
,
by
reproducing
variation
dimensions
extracted
from
corpora
.
Another
line
of
work
has
produced
handcrafted
rule-based
systems
to
control
specific
stylistic
dimensions
,
such
as
politeness
and
personality
.
This
paper
describes
a
novel
approach
that
automatically
learns
to
produce
recognisable
variation
along
a
meaningful
stylistic
dimension
—
personality
—
without
the
computational
cost
incurred
by
overgeneration
techniques
.
We
present
the
first
evaluation
of
a
data-driven
generation
method
that
projects
multiple
personality
traits
simultaneously
and
on
a
continuous
scale
.
We
compare
our
performance
to
a
rule-based
generator
in
the
same
domain
.
1
Introduction
Over
the
last
20
years
,
statistical
language
models
(
SLMs
)
have
been
used
successfully
in
many
tasks
in
natural
language
processing
,
and
the
data
available
for
modeling
has
steadily
grown
(
Lapata
and
Keller
,
2005
)
.
Langkilde
and
Knight
(
1998
)
first
applied
SLMs
to
statistical
natural
language
generation
(
SNLG
)
,
showing
that
high
quality
paraphrases
can
be
generated
from
an
underspecified
representation
of
meaning
,
by
first
applying
a
very
underconstrained
,
rule-based
overgeneration
phase
,
whose
outputs
are
then
ranked
by
an
SLM
scoring
phase
.
Since
then
,
research
in
SNLG
has
explored
a
range
of
models
for
both
dialogue
and
text
generation
.
One
line
of
work
has
primarily
focused
on
grammaticality
and
naturalness
,
scoring
the
overgeneration phase with an SLM
,
and
evaluating
against
a
gold-standard
corpus
,
using
string
or
tree-match
metrics
(
Langkilde-Geary
,
2002
;
Bangalore
and
Rambow
,
2000
;
Chambers
and
Allen
,
2004
;
Belz
,
2005
;
Isard et al.
,
2006
)
.
Another
thread
investigates
SNLG
scoring
models
trained
using
higher-level
linguistic
features
to
replicate
human
judgments
of
utterance
quality ( Stent
and
Guo
,
2005
)
.
The
error
of
these
scoring
models
approaches
the
gold-standard
human
ranking
with
a
relatively
small
training
set
.
A
third
SNLG
approach
eliminates
the
overgeneration
phase
(
Paiva
and
Evans
,
2005
)
.
It
applies
factor
analysis
to
a
corpus
exhibiting
stylistic
variation
,
and
then
learns
which
generation
parameters
to
manipulate
to
correlate
with
factor
measurements
.
The
generator
was
shown
to
reproduce
intended
factor
levels
across
several
factors
,
thus
modelling
the
stylistic
variation
as
measured
in
the
original
corpus
.
Our
goal
is
a
generation
technique
that
can
target
multiple
stylistic
effects
simultaneously
and
over
a
continuous
scale
,
controlling
stylistic
dimensions
that
are
commonly
understood
and
thus
meaningful
to
users
and
application
developers
.
Our
intended
applications
are
output
utterances
for
intelligent
training
or
intervention
systems
,
video
game
characters
,
or
virtual
environment
avatars
.
In
previous
work
,
we
presented
PERSONAGE
,
a
psychologically-informed
rule-based
generator
based
on
the
Big
Five
personality
model
,
and
we
showed
that
Personage
can
project
extreme
personality
on
the
extraversion
scale
,
i.e.
both
introverted
and
extraverted
personality
types
(
Mairesse
and
Walker
,
2007
)
.
We
used
the
Big
Five
model
to
develop
Personage
for
several
reasons
.
First, the Big Five has been shown in psychology to explain much of the variation in human perceptions of personality differences. Second, we believe that the adjectives used to develop the Big Five model provide an intuitive, meaningful definition of linguistic style.

Extraversion (high): warm, assertive, sociable, excitement seeking, active, spontaneous, optimistic, talkative
Extraversion (low): shy, quiet, reserved, passive, solitary, moody
Emotional stability (high): calm, even-tempered, reliable, peaceful, confident
Emotional stability (low): neurotic, anxious, depressed, self-conscious
Agreeableness (high): trustworthy, considerate, friendly, generous, helpful
Agreeableness (low): unfriendly, selfish, suspicious, uncooperative, malicious
Conscientiousness (high): competent, disciplined, dutiful, achievement striving
Conscientiousness (low): disorganised, impulsive, unreliable, forgetful
Openness to experience (high): creative, intellectual, curious, cultured, complex
Openness to experience (low): narrow-minded, conservative, ignorant, simple

Table 1: Example adjectives associated with extreme values of the Big Five trait scales.
Table
1
shows
some
of
the
trait
adjectives
associated
with
the
extremes
of
each
Big
Five
trait
.
Third
,
there
are
many
studies
linking
personality
to
linguistic
variables
(
Pennebaker
and
King
,
1999
;
Mehl
et
al.
,
2006
,
inter
alia
)
.
See
(
Mairesse
and
Walker
,
2007
)
for
more
detail
.
In
this
paper
,
we
further
test
the
utility
of
basing
stylistic
variation
on
the
Big
Five
personality
model
.
The
Big
Five
traits
are
represented
by
scalar
values
that
range
from
1
to
7
,
with
values
normally
distributed
among
humans
.
While
our
previous
work
targeted
extreme
values
of
individual
traits
,
here
we
show
that
we
can
target
multiple
personality
traits
simultaneously
and
over
the
continuous
scales
of
the
Big
Five
model
.
Section
2
describes
a
novel
parameter-estimation
method
that
automatically
learns
to
produce
recognisable
variation
for
all
Big
Five
traits
,
without
overgeneration
,
implemented
in
a
new
SNLG
called
Personage-PE
.
We
show
that
Personage-PE
generates
targets
for
multiple
personality
dimensions
,
using
linear
and
non-linear
parameter
estimation
models
to
predict
generation
parameters
directly
from
the
scalar
targets
.
Section
3.2
shows
that
humans
accurately
perceive
the
intended
variation
,
and
Section
3.3
compares
Personage-PE
(
trained
)
with
Personage
(
rule-based
;
Mairesse
and
Walker
,
2007
)
.
We
delay
a
detailed
discussion
of
related
work
to
Section
4
,
where
we
summarize
and
discuss
future
work
.
2
Parameter
Estimation
Models
The
data-driven
parameter
estimation
method
consists
of
a
development
phase
and
a
generation
phase
(
Section
3
)
.
The development phase:
1. uses a base generator to produce multiple utterances by randomly varying its parameters;
2. collects human judgments rating the personality of each utterance;
3. trains statistical models to predict the parameters from the personality judgments;
4. selects the best model for each parameter via cross-validation.

[Figure 1: Distribution of average agreeableness ratings from the 2 expert judges for 160 random utterances.]
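The per-parameter model selection in the final step can be sketched end to end. The following is a minimal pure-Python illustration with invented data and two stand-in model families (a mean baseline and a univariate linear fit), not the paper's actual Weka setup:

```python
import random
import statistics

def kfold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]

def mean_model(xs, ys):
    """Baseline family: always predict the training mean."""
    m = statistics.mean(ys)
    return lambda x: m

def linear_model(xs, ys):
    """Univariate least-squares fit y = a * x + b."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    var = sum((x - mx) ** 2 for x in xs) or 1e-9
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    b = my - a * mx
    return lambda x: a * x + b

def cv_error(fit, xs, ys, k=5):
    """Mean absolute error of a model family under k-fold cross-validation."""
    errors = []
    for train, test in kfold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors += [abs(model(xs[i]) - ys[i]) for i in test]
    return statistics.mean(errors)

random.seed(0)
trait = [random.uniform(1, 7) for _ in range(160)]       # e.g. a trait rating
param = [0.1 * t + random.gauss(0, 0.1) for t in trait]  # a parameter tracking it

scores = {fit.__name__: cv_error(fit, trait, param)
          for fit in (mean_model, linear_model)}
best = min(scores, key=scores.get)
print(best)
```

Because the synthetic parameter values track the trait ratings, cross-validation selects the linear family over the baseline here; with real judgments the winner varies per parameter.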
We
make
minimal
assumptions
about
the
input
to
the
generator
to
favor
domain
independence
.
The
input
is
a
speech
act
,
a
potential
content
pool
that
can
be
used
to
achieve
that
speech
act
,
and
five
scalar
personality
parameters
(
1
.
.
.
7
)
,
specifying
values
for
the
continuous
scalar
dimensions
of
each
trait
in
the
Big
Five
model
.
See
Table
1
.
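As a purely illustrative sketch, the input just described could be represented as follows; the class and field names are our own, not Personage-PE's actual interface:

```python
from dataclasses import dataclass

# Hypothetical container for the generator input described above: a speech
# act, a content pool, and five scalar Big Five targets on the 1...7 scale.
@dataclass
class GeneratorInput:
    speech_act: str          # e.g. "recommend" or "compare"
    content_pool: list       # propositions usable to achieve the speech act
    extraversion: float = 4.0        # 1 = introverted ... 7 = extraverted
    emotional_stability: float = 4.0
    agreeableness: float = 4.0
    conscientiousness: float = 4.0
    openness: float = 4.0

    def __post_init__(self):
        # Enforce the 1...7 range of the Big Five scalar dimensions.
        for trait in ("extraversion", "emotional_stability", "agreeableness",
                      "conscientiousness", "openness"):
            value = getattr(self, trait)
            if not 1.0 <= value <= 7.0:
                raise ValueError(f"{trait} must lie in [1, 7], got {value}")

inp = GeneratorInput(
    speech_act="recommend",
    content_pool=["Chanpen Thai has great service", "Chanpen Thai has nice decor"],
    extraversion=7.0,
    agreeableness=1.0,
)
print(inp.speech_act, inp.extraversion, inp.agreeableness)
```

Unspecified traits default to the neutral midpoint (4), mirroring the neutral targets used in the evaluation.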
This
requires
a
base
generator
that
generates
multiple
outputs
expressing
the
same
input
content
by
varying
linguistic
parameters
related
to
the
Big
Five
traits
.
We
start
with
the
Personage
generator
(
Mairesse
and
Walker
,
2007
)
,
which
generates
recommendations
and
comparisons
of
restaurants
.
We
extend
Personage
with
new
parameters
for
a
total
of
67
parameters
in
Personage-PE
.
See
Table
2
.
These
parameters
are
derived
from
psychological
studies
identifying
linguistic
markers
of
the
Big
Five
traits
(
Pennebaker
and
King
,
1999
;
Mehl
et
al.
,
2006
,
inter
alia
)
.
As Personage's input parameters are domain-independent, most parameters range continuously between 0 and 1, while pragmatic marker insertion parameters are binary, except for the subject implicitness, stuttering and pronominalization parameters.

Content parameters:
- Verbosity: control the number of propositions in the utterance
- Restatements: paraphrase an existing proposition, e.g. 'Chanpen Thai has great service, it has fantastic waiters'
- Repetitions: repeat an existing proposition
- Content polarity: control the polarity of the propositions expressed, i.e. referring to negative or positive attributes
- Repetitions polarity: control the polarity of the restated propositions
- Concessions: emphasise one attribute over another, e.g. 'even if Chanpen Thai has great food, it has bad service'
- Concessions polarity: determine whether positive or negative attributes are emphasised
- Polarisation: control whether the expressed polarity is neutral or extreme
- Positive content first: determine whether positive propositions, including the claim, are uttered first

Syntactic template selection parameters:
- Self-references: control the number of first person pronouns
- Claim complexity: control the syntactic complexity (syntactic embedding)
- Claim polarity: control the connotation of the claim, i.e. whether positive or negative affect is expressed

Aggregation operations:
- Period: leave two propositions in their own sentences, e.g. 'Chanpen Thai has great service. It has nice decor.'
- Relative clause: aggregate propositions with a relative clause, e.g. 'Chanpen Thai, which has great service, has nice decor'
- With cue word: aggregate propositions using with, e.g. 'Chanpen Thai has great service, with nice decor'
- Conjunction: join two propositions using a conjunction, or a comma if more than two propositions
- Merge: merge the subject and verb of two propositions, e.g. 'Chanpen Thai has great service and nice decor'
- Also cue word: join two propositions using also, e.g. 'Chanpen Thai has great service, also it has nice decor'
- Contrast - cue word: contrast two propositions using while, but, however, on the other hand, e.g. 'While Chanpen Thai has great service, it has bad decor', 'Chanpen Thai has great service, but it has bad decor'
- Justify - cue word: justify a proposition using because, since, so, e.g. 'Chanpen Thai is the best, because it has great service'
- Concede - cue word: concede a proposition using although, even if, but/though, e.g. 'Although Chanpen Thai has great service, it has bad decor', 'Chanpen Thai has great service, but it has bad decor though'
- Merge with comma: restate a proposition by repeating only the object, e.g. 'Chanpen Thai has great service, nice waiters'
- Conjunction with ellipsis: restate a proposition after replacing its object by an ellipsis, e.g. 'Chanpen Thai has ..., it has great service'

Pragmatic markers:
- Subject implicitness: make the restaurant implicit by moving the attribute to the subject, e.g. 'the service is great'
- Negation: negate a verb by replacing its modifier by its antonym, e.g. 'Chanpen Thai doesn't have bad service'
- Softener hedges: insert syntactic elements (sort of, kind of, somewhat, quite, around, rather, I think that, it seems that, it seems to me that) to mitigate the strength of a proposition, e.g. 'Chanpen Thai has kind of great service' or 'It seems to me that Chanpen Thai has rather great service'
- Emphasizer hedges: insert syntactic elements (really, basically, actually, just) to strengthen a proposition, e.g. 'Chanpen Thai has really great service' or 'Basically, Chanpen Thai just has great service'
- Stuttering: duplicate the first letters of a restaurant's name, e.g. 'Ch-ch-anpen Thai is the best'
- Confirmation: begin the utterance with a confirmation of the restaurant's name, e.g. 'did you say Chanpen Thai?'
- In-group marker: refer to the hearer as a member of the same social group, e.g. pal, mate and buddy
- Pronominalization: replace occurrences of the restaurant's name by pronouns
- Acknowledgments, filled pauses, exclamation, expletives, near-expletives, competence mitigation, tag question, initial rejection

Lexical choice parameters:
- Lexical frequency: control the average frequency of use of each content word, according to BNC frequency counts
- Word length: control the average number of letters of each content word
- Verb strength: control the strength of the selected verbs, e.g. 'I would suggest' vs. 'I would recommend'

Table 2: The 67 generation parameters whose target values are learned. Aggregation cue words, hedges, acknowledgments and filled pauses are learned individually (as separate parameters), e.g. kind of is modeled differently than somewhat in the SOFTENER HEDGES category. Parameters are detailed in previous work (Mairesse and Walker, 2007).
2.2
Random
Sample
Generation
and
Expert
Judgments
We
generate
a
sample
of
160
random
utterances
by
varying
the
parameters
in
Table
2
with
a
uniform
distribution
.
This
sample
is
intended
to
provide
enough
training
material
for
estimating
all
67
parameters
for
each
personality
dimension
.
Following
Mairesse
and
Walker
(
2007
)
,
two
expert
judges
(
not
the
authors
)
familiar
with
the
Big
Five
adjectives
(
Table
1
)
evaluate
the
personality
of
each
utterance
using
the
Ten-Item
Personality
Inventory
(
TIPI
;
Gosling
et
al.
,
2003
)
,
and
also
judge
the
utterance
's
naturalness
.
Thus
11
judgments
were
made
for
each
utterance
for
a
total
of
1760
judgments
.
The
TIPI
outputs
a
rating
on
a
scale
from
1
(
low
)
to
7
(
high
)
for
each
Big
Five
trait
.
The
expert
judgments
are
approximately normally distributed
;
Figure
1
shows
the
distribution
for
agreeableness
.
2.3
Statistical
Model
Training
Training
data
is
created
for
each
generation
parameter
—
i.e.
the
output
variable
—
to
train
statistical
models
predicting
the
optimal
parameter
value
from
the
target
personality
scores
.
The
models
are
thus
based
on
the
simplifying
assumption
that
the
generation
parameters
are
independent
.
Any
personality
trait
whose
correlation
with
a
generation
decision
is
below
0.1
is
removed
from
the
training
data
.
This
has
the
effect
of
removing
parameters
that
do
not
correlate
strongly
with
any
trait
,
which
are
set
to
a
constant
default
value
at
generation
time
.
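This filtering step can be sketched with a small pure-Python example; the 0.1 threshold is the one stated above, while the data and function names are illustrative:

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    if sx == 0 or sy == 0:
        return 0.0
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (sx * sy)

def filter_traits(decisions, trait_ratings, threshold=0.1):
    """Keep only traits whose |correlation| with a generation decision reaches
    the threshold; a decision left with no trait falls back to a constant
    default value at generation time."""
    return [name for name, ratings in trait_ratings.items()
            if abs(pearson(decisions, ratings)) >= threshold]

# Toy data: a decision that tracks extraversion but not conscientiousness.
extraversion      = [1, 2, 3, 4, 5, 6, 7, 2, 5, 6]
conscientiousness = [5, 4, 4, 4, 4, 4, 5, 4, 4, 4]
decision = [0.1, 0.2, 0.3, 0.45, 0.5, 0.65, 0.7, 0.2, 0.5, 0.6]

kept = filter_traits(decision, {"extraversion": extraversion,
                                "conscientiousness": conscientiousness})
print(kept)
```

Only extraversion survives the cut in this toy example, so the decision would be predicted from extraversion alone.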
Since
the
input
parameter
values
may
not
be
satisfiable
depending
on
the
input
content
,
the
actual
generation
decisions
made
for
each
utterance
are
recorded
.
For
example
,
the
concessions
decision
value
is
the
actual
number
of
concessions
produced
in
the
utterance
.
To
ensure
that
the
models
'
output
can
control
the
generator
,
the
generation
decision
values
are
normalized
to
match
the
input
range
(
0
.
.
.
1
)
of
Personage-PE
.
Thus
the
dataset
consists
of
160
utterances
and
the
corresponding
generation
decisions
,
each
associated
with
5
personality
ratings
averaged
over
both
judges
.
Parameter
estimation
models
are
trained
to
predict
either
continuous
(
e.g.
verbosity
)
or
binary
(
e.g.
exclamation
)
generation
decisions
.
We
compare
various
learning
algorithms
using
the
Weka
toolkit
(
with
default
values
unless
specified
;
Witten
and
Frank
,
2005
)
.
Continuous
parameters
are
modeled
with
a
linear
regression
model
(
LR
)
,
an
M5
'
model
tree
(
M5
)
,
and
a
model
based
on
support
vector
machines
with
a
linear
kernel
(
SVM
)
.
As
regression
models
can
extrapolate
beyond
the
[
0,1
]
interval
,
the
output
parameter
values
are
truncated
if
needed
—
at
generation
time
—
before
being
sent
to
the
base
generator
.
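The truncation step can be made concrete with a sketch; the linear weights below are invented for illustration (loosely echoing a model dominated by agreeableness), not learned coefficients:

```python
def clip01(value):
    """Truncate a regression output to the generator's [0, 1] input range."""
    return max(0.0, min(1.0, value))

def predict_parameter(weights, bias, traits):
    """Hypothetical linear parameter-estimation model: a weighted sum of the
    five Big Five target scores (1...7), truncated at generation time."""
    raw = bias + sum(w * t for w, t in zip(weights, traits))
    return clip01(raw)

# Invented weights in trait order (E, ES, A, C, O): agreeableness dominates.
weights = [0.01, 0.02, 0.15, 0.02, 0.01]
bias = -0.2

high = predict_parameter(weights, bias, [4, 4, 7, 4, 4])  # agreeable target
low = predict_parameter(weights, bias, [4, 4, 1, 4, 4])   # disagreeable target
print(high, low)
```

The agreeable target drives the raw prediction above 1 and is clipped to 1.0 before being sent to the base generator, while the disagreeable target stays inside the range.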
Binary
parameters
are
modeled
using
classifiers
that
predict
whether
the
parameter
is
enabled
or
disabled
.
We
test
a
Naive
Bayes
classifier
(
NB
)
,
a
J48
decision
tree
(
J48
)
,
a
nearest-neighbor
classifier
using
one
neighbor
(
NN
)
,
a
Java
implementation
of
the
RIPPER
rule-based
learner
(
JRIP
)
,
the
AdaBoost
boosting
algorithm
(
ADA
)
,
and
a
support
vector
machines
classifier
with
a
linear
kernel
(
SVM
)
.
Figures
2
,
3
and
4
show
the
models
learned
for
the
exclamation
(
binary
)
,
stuttering
(
continuous
)
,
and
content
polarity
(
continuous
)
parameters
in
Table
2
.
The models predict generation parameters from input personality scores; note that sometimes the best performing model is non-linear.

[Figure 2: AdaBoost model predicting the exclamation parameter. Given input trait values, the model outputs the class yielding the largest sum of weights for the rules returning that class. Class 0 = disabled, class 1 = enabled.]

[Figure 3: SVM model with a linear kernel predicting the content polarity parameter.]
Given
input
trait
values
,
the
AdaBoost
model
in
Figure
2
outputs
the
class
yielding
the
largest
sum
of
weights
for
the
rules
returning
that
class
.
For
example
,
the
first
rule
of
the
exclamation
model
shows
that
an
extraversion
score
above
6.42
out
of
7
would
increase
the
weight
of
the
enabled
class
by
1.81
.
The
fifth
rule
indicates
that
a
target
agreeableness
above
5.13
would
further
increase
the
weight
by
.
42
.
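The weighted vote can be made concrete with a sketch using only the two rules quoted above; the full model contains more rules, and we assume for illustration that a rule that does not fire contributes its weight to the disabled class:

```python
# Each rule: (trait name, threshold, weight). When the trait exceeds the
# threshold the weight votes for 'enabled', otherwise for 'disabled'.
# Only the two rules quoted in the text are included here.
RULES = [
    ("extraversion", 6.42, 1.81),
    ("agreeableness", 5.13, 0.42),
]

def exclamation_enabled(traits):
    """Boosted-rule vote: return True when the summed weights of rules voting
    'enabled' exceed those voting 'disabled'."""
    enabled = disabled = 0.0
    for trait, threshold, weight in RULES:
        if traits[trait] > threshold:
            enabled += weight
        else:
            disabled += weight
    return enabled > disabled

print(exclamation_enabled({"extraversion": 7.0, "agreeableness": 4.0}))
print(exclamation_enabled({"extraversion": 4.0, "agreeableness": 6.0}))
```

A strongly extraverted target enables the exclamation mark even when agreeableness is neutral, because the extraversion rule carries most of the weight.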
The
stuttering
model
tree
in
Figure
4
lets
us
calculate
that
a
low
emotional
stability
(
1.0
)
together
with
a
neutral
conscientiousness
and
openness
to
experience
(
4.0
)
yield
a
parameter
value
of
.
62
(
see
LM2
)
,
whereas
a
neutral
emotional
stability
decreases
the
value
down
to
.
17
.
Figure
4
also
shows
how
personality
traits
that
do
not
affect
the
parameter
are
removed
,
i.e.
emotional
stability
,
conscientiousness
and
openness
to
experience
are
the
traits
that
affect
stuttering
.
The
linear
model
in
Figure
3
shows
that
agreeableness
has
a
strong
effect
on
the
content
polarity
parameter
(
.
97
weight
)
,
but
emotional
stability
,
conscientiousness
and
openness
to
experience
also
have
an
effect
.
The
final
step
of
the
development
phase
identifies
the
best
performing
model
(
s
)
for
each
generation
parameter
via
cross-validation
.
[Figure 4: M5' model tree predicting the stuttering parameter.]

Table 3: Pearson's correlation between parameter model predictions and continuous parameter values, for different regression models. Aggregation operations are associated with a rhetorical relation (e.g. INFER). The continuous parameters evaluated are: content parameters (verbosity, restatements, repetitions, content polarity, repetitions polarity, concessions, concessions polarity, polarisation); syntactic template selection (claim complexity, claim polarity); aggregation operations (Infer - with cue word, Infer - also cue word, Justify - since cue word, Justify - so cue word, Justify - period, Contrast - period, Restate - merge with comma, Concede - although cue word, Concede - even if cue word); subject implicitness, stuttering insertion and pronominalization; and lexical choice parameters (lexical frequency, word length).

For continuous parameters, Table 3 evaluates modeling accuracy by comparing the correlations between the model's predictions and the actual parameter values in the test folds.
Table
4
reports
results
for
binary
parameter
classifiers
,
by
comparing
the
F-measures
of
the
enabled
class
.
Best
performing
models
are
identified
in
bold
;
parameters
that
do
not
correlate
with
any
trait
or
that
produce
a
poor
modeling
accuracy
are
omitted
.
[Table 4: F-measure of the enabled class for classification models of binary parameters. Parameters that do not correlate with any trait are omitted. Results are averaged over a 10-fold cross-validation. JRIP models are not shown as they never perform best. The binary pragmatic marker parameters evaluated are: softener hedges, emphasizer hedges, acknowledgments, filled pauses, exclamation, expletives, in-group marker, tag question and confirmation.]

The content polarity parameter is modeled most accurately, with the SVM model in Figure 3 producing a correlation of .47 with the true parameter values.
Models
of
the
period
aggregation
operation
also
perform
well
,
with
a
linear
regression
model
yielding
a
correlation
of
.
36
when
realizing
a
justification
,
and
.
27
when
contrasting
two
propositions
.
Claim
complexity
and
verbosity
are
also
modeled
successfully
,
with
correlations
of
.
33
and
.
26
using
a
model
tree
.
The
model
tree
controlling
the
stuttering
parameter
illustrated
in
Figure
4
produces
a
correlation
of
.
23
.
For
binary
parameters
,
Table
4
shows
that
the
Naive
Bayes
classifier
is
generally
the
most
accurate
,
with
F-measures
of
.
40
for
the
in-group
marker
parameter
,
and
.
32
for
both
the
insertion
of
filled
pauses
(
err
)
and
tag
questions
.
The
AdaBoost
algorithm
best
predicts
the
exclamation
parameter
,
with
an
F-measure
of
.
38
for
the
model
in
Figure
2
.
Table 5: Example outputs controlled by the parameter estimation models for a comparison (#1) and a recommendation (#2), with the average judges' ratings (Rating) and naturalness (Nat). Ratings are on a scale from 1 to 7, with 1 = very low (e.g. neurotic or introvert) and 7 = very high on the dimension (e.g. emotionally stable or extraverted).

1.a Extraversion high, Agreeableness high: 'Radio Perfecto's price is 25 dollars but Les Routiers provides adequate food. I imagine they're alright!'
1.b Emotional stability high, Conscientiousness high: 'Let's see, Les Routiers and Radio Perfecto ... You would probably appreciate them. Radio Perfecto is in the East Village with kind of acceptable food. Les Routiers is located in Manhattan. Its price is 41 dollars.'
2.a Extraversion low, Agreeableness low: 'Err... you would probably appreciate Trattoria Rustica, wouldn't you? It's in Manhattan, also it's an italian restaurant. It offers poor ambience, also it's quite costly.'
2.b Emotional stability low, Openness to experience low: 'Trattoria Rustica isn't as bad as the others. Err... even if it's costly, it offers kind of adequate food, alright? It's an italian place.'
3
Evaluation
Experiment
The
generation
phase
of
our
parameter
estimation
SNLG
method
consists
of
the
following
steps
:
1. Use the best performing models to predict parameter values from the desired personality scores;
2. Generate the output utterance using the predicted parameter values.
We
then
evaluate
the
output
utterances
using
naive
human
judges
to
rate
their
perceived
personality
and
naturalness
.
3.1
Evaluation
Method
Given
the
best
performing
model
for
each
generation
parameter
,
we
generate
5
utterances
for
each
of
5
recommendation
and
5
comparison
speech
acts
.
Each
utterance
targets
an
extreme
value
for
two
traits
(
either
1
or
7
out
of
7
)
and
neutral
values
for
the
remaining
three
traits
(
4
out
of
7
)
.
The
goal
is
for
each
utterance
to
project
multiple
traits
on
a
continuous
scale
.
To
generate
a
range
of
alternatives
,
a
Gaussian
noise
with
a
standard
deviation
of
10
%
of
the
full
scale
is
added
to
each
target
value
.
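A sketch of this target construction follows. Two points are assumptions of the sketch, not claims from the paper: '10% of the full scale' is read as a standard deviation of 0.6 on the six-point-wide 1-7 scale, and noisy targets are clamped back onto the scale:

```python
import random

TRAITS = ["extraversion", "emotional_stability", "agreeableness",
          "conscientiousness", "openness"]

def make_targets(high_traits, low_traits, noise_sd=0.6, rng=random):
    """Build a five-trait target vector: extreme values (7 or 1) for the
    selected traits, neutral (4) elsewhere, plus Gaussian noise, clamped
    to the 1...7 scale (clamping is this sketch's assumption)."""
    targets = {}
    for trait in TRAITS:
        base = 7.0 if trait in high_traits else 1.0 if trait in low_traits else 4.0
        noisy = base + rng.gauss(0.0, noise_sd)
        targets[trait] = min(7.0, max(1.0, noisy))
    return targets

random.seed(1)
t = make_targets(high_traits={"extraversion"}, low_traits={"agreeableness"})
print(t)
```

Each call yields a slightly different target vector around the same extremes, which is how a range of alternative utterances is obtained.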
Subjects
were
24
native
English
speakers
(
12
male
and
12
female
graduate
students
from
a
range
of
disciplines
from
both
the
U.K.
and
the
U.S.
)
.
Subjects
evaluate
the
naturalness
and
personality
of
each
utterance
using
the
TIPI
(
Gosling
et
al.
,
2003
)
.
To
limit
the
experiment
's
duration
,
only
the
two
traits
with
extreme
target
values
are
evaluated
for
each
utterance
.
Subjects
thus
answered
5
questions
for
50
utterances
,
two
from
the
TIPI
for
each
extreme
trait
and
one
about
naturalness
(
250
judgments
in
total
per
subject
)
.
Subjects
were
not
told
that
the
utterances
were
intended
to
manifest
extreme
trait
values
.
Table
5
shows
several
sample
outputs
and
the
mean
personality
ratings
from
the
human
judges
.
For
example
,
utterance
1.a
projects
a
high
extraversion
through
the
insertion
of
an
exclamation
mark
based
on
the
model
in
Figure
2
,
whereas
utterance
2.a
conveys
introversion
by
beginning
with
the
filled
pause
err
.
The
same
utterance
also
projects
a
low
agreeableness
by
focusing
on
negative
propositions
,
through
a
low
content
polarity
parameter
value
as
per
the
model
in
Figure
3
.
This
evaluation
addresses
a
number
of
open
questions
discussed
below
.
Q1: Is the personality projected by models trained on ratings from a few expert judges recognised by a larger sample of naive judges? (Section 3.2)

Q2: Can a combination of multiple traits within a single utterance be detected by naive judges? (Section 3.2)

Q3: How does Personage-PE compare to Personage, a psychologically-informed rule-based generator, for projecting extreme personality? (Section 3.3)

Q4: Does the parameter estimation SNLG method produce natural utterances? (Section 3.4)
3.2
Parameter
Estimation
Evaluation
Table
6
shows
that
extraversion
is
the
dimension
modeled
most
accurately
by
the
parameter
estimation
models
,
producing
a
.
45
correlation
with
the
subjects
'
ratings
(p < .01).
Emotional
stability
,
agreeableness
,
and
openness
to
experience
ratings
also
correlate
strongly
with
the
target
scores
,
with
correlations
of
.
39
,
.
36
and
.
17
respectively
(p < .01).
Additionally
,
Table
6
shows
that
the
magnitude
of
the
correlation
increases
when
considering
the
perception
of
a
hypothetical
average
subject
,
i.e.
smoothing
individual
variation
by
averaging
the
ratings
over
all
24
judges
,
producing
a
correlation
ravg
up
to
.
80
for
extraversion
.
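This smoothing effect is easy to reproduce with simulated judges; the numbers below are invented, but the averaging logic is the same:

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (sx * sy)

random.seed(2)
targets = [random.uniform(1, 7) for _ in range(20)]  # 20 target scores
# 24 simulated judges: each rating is the target plus individual noise.
judges = [[t + random.gauss(0.0, 2.0) for t in targets] for _ in range(24)]

# Correlation over all individual ratings (targets repeated per judge).
r_individual = pearson(targets * 24, [r for ratings in judges for r in ratings])

# Correlation against the hypothetical average judge.
avg_ratings = [statistics.mean(j[i] for j in judges) for i in range(len(targets))]
r_avg = pearson(targets, avg_ratings)

print(r_individual < r_avg)
```

Averaging over the 24 simulated judges divides the rating noise by roughly the square root of 24, so the correlation against the average judge is markedly higher than against individual ratings.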
These
correlations
are
unexpectedly
high
;
in
corpus
analyses
,
significant
correlations
as
low
as
.
05
to
.
10
are
typically
observed
between
personality
and
linguistic
markers
(
Pennebaker
and
King
,
1999
;
Mehl
et
al.
,
2006
)
.
Conscientiousness
is
the
only
dimension
whose
ratings
do
not
correlate
with
the
target
scores
.
The
comparison
with
rule-based
results
in
section
3.3
suggests
that
this
is
not
because
conscientiousness
cannot
be
exhibited
in
our
domain
or
manifested
in
a
single
utterance
,
so
perhaps
this
arises
from
differing
perceptions
of
conscientiousness
between
the
expert
and
naive
judges
.
Table
6
:
Pearson
's
correlation
coefficient
r
and
mean
absolute
error
e
between
the
target
personality
scores
and
the
480
judges
'
ratings
(
20
ratings
per
trait
for
24
judges
)
;
ravg
is
the
correlation
between
the
personality
scores
and
the
average
judges
'
ratings
.
Table
6
shows
that
the
mean
absolute
error
varies
between
1.89
and
2.79
on
a
scale
from
1
to
7
.
Such
large
errors
result
from
the
decision
to
ask
judges
to
answer
just
the
TIPI
questions
for
the
two
traits
that
were
the
extreme
targets
(
See
Section
3.1
)
,
because
the
judges
tend
to
use
the
whole
scale
,
with
approximately
normally
distributed
ratings
.
This
means
that
although
the
judges
make
distinctions
leading
to
high
correlations
,
they
do
so
on
a
compressed
scale
.
This
explains
the
large
correlations
despite
the
magnitude
of
the
absolute
error
.
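The compressed-scale point can be checked numerically: if judges' ratings are a compressed linear function of the targets, the correlation is perfect even though the absolute error is large. A toy example:

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (sx * sy)

targets = [1, 7, 1, 7, 1, 7]               # extreme trait targets
ratings = [3.0, 5.0, 3.0, 5.0, 3.0, 5.0]   # the same distinctions, compressed

r = pearson(targets, ratings)
mae = statistics.mean(abs(t - j) for t, j in zip(targets, ratings))
print(r, mae)
```

Here the judges separate low from high targets perfectly (correlation 1.0) while every rating is still 2 points from its target, matching the pattern reported above.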
Table
7
shows
results
evaluating
whether
utterances
targeting
the
extremes
of
a
trait
are
perceived
differently
.
The
ratings
differ
significantly
for
all
traits
but
conscientiousness
(p < .001).
Thus
parameter
estimation
models
can
be
used
in
applications
that
only
require
discrete
binary
variation
.
[Table 7: Average personality ratings for the utterances generated with the low and high target values for each trait, on a scale from 1 to 7. Statistically significant differences are marked.]
It
is
important
to
emphasize
that
generation
parameters
were
predicted
based
on
5
target
personality
values
.
Thus
,
the
results
show
that
individual
traits
are
perceived
even
when
utterances
project
other
traits
as
well
,
confirming
that
the
Big
Five
theory
models
independent
dimensions
and
thus
provides
a
useful
and
meaningful
framework
for
modeling
variation
in
language
.
Additionally
,
although
we
do
not
directly
evaluate
the
perception
of
mid-range
values
of
personality
target
scores
,
the
results
suggest
that
mid-range
personality
is
modeled
correctly
because
the
neutral
target
scores
do
not
affect
the
perception
of
extreme
traits
.
3.3
Comparison
with
Rule-Based
Generation
Personage
is
a
rule-based
personality
generator
based
on
handcrafted
parameter
settings
derived
from
psychological
studies
.
Mairesse
and
Walker
(
2007
)
show
that
this
approach
generates
utterances
that
are
perceptibly
different
along
the
extraversion
dimension
.
Table
8
compares
the
mean
ratings
of
the
utterances
generated
by
Personage-PE
with
ratings
of
20
utterances
generated
by
Personage
for
each
extreme
of
each
Big
Five
scale
(
40
for
extraversion
,
resulting
in
240
handcrafted
utterances
in
total
)
.
Table
8
shows
that
the
handcrafted
parameter
settings
project
a
significantly
more
extreme
personality
for
6
traits
out
of
10
.
However
,
the
learned
parameter
models
for
neuroticism
,
disagreeableness
,
unconscientiousness
and
openness
to
experience
do
not
perform
significantly
worse
than
the
handcrafted
generator
.
These
findings
are
promising
as
we
discuss
further
in
section
4
.
[Table 8: Pair-wise comparison between the ratings of the utterances generated using Personage-PE with extreme target values (Learned parameters) and the ratings for utterances generated with Mairesse and Walker's rule-based Personage generator (Rule-based), for extraversion, emotional stability, agreeableness, conscientiousness and openness to experience. Ratings are averaged over all judges; markers indicate a significant increase or decrease of the variation range.]
3.4
Naturalness
Evaluation
The
naive
judges
also
evaluated
the
naturalness
of
the
outputs
of
our
trained
models
.
Table
9
shows
that
the
average
naturalness
is
3.98
out
of
7
,
which
is
significantly
lower
(p < .05)
than
the
naturalness
of
handcrafted
and
randomly
generated
utterances
reported
by
Mairesse
and
Walker
(
2007
)
.
It
is
possible
that
the
differences
arise
from
judgments
of
utterances
targeting
multiple
traits
,
or
that
the
naive
judges
are
more
critical
.
[Table 9: Average naturalness ratings for utterances generated using (1) Personage, the rule-based generator, (2) the random utterances (expert judges), and (3) the outputs of Personage-PE using the parameter estimation models (Learned, naive judges). The means differ significantly at the p < .05 level (two-tailed independent sample t-test).]
4
Conclusion
We
present
a
new
method
for
generating
linguistic
variation
projecting
multiple
personality
traits
continuously
,
by
combining
and
extending
previous
research
in
statistical
natural
language
generation
(
Paiva
and
Evans
,
2005
;
Rambow
et
al.
,
2001
;
Isard
et
al.
,
2006
;
Mairesse
and
Walker
,
2007
)
.
While
handcrafted
rule-based
approaches
are
limited
to
variation
along
a
small
number
of
discrete
points
(
Hovy
,
1988
;
Walker
et al.
,
1997
;
Lester
et al.
,
1997
;
Power
et
al.
,
2003
;
Cassell
and
Bickmore
,
2003
;
Piwek
,
2003
;
Mairesse
and
Walker
,
2007
;
Rehm
and
Andre
,
in
press
)
,
we
learn
models
that
predict
parameter
values
for
any
arbitrary
value
on
the
variation
dimension
scales
.
Additionally
,
our
data-driven
approach
can
be
applied
to
any
dimension
that
is
meaningful
to
human
judges
,
and
it
provides
an
elegant
way
to
project
multiple
dimensions
simultaneously
,
by
including
the
relevant
dimensions
as
features
of
the
parameter
models
'
training
data
.
Isard
et
al.
(
2006
)
and
Mairesse
and
Walker
(
2007
)
also
propose
a
personality
generation
method
,
in
which
a
data-driven
personality
model
selects
the
best
utterance
from
a
large
candidate
set
.
Isard
et
al.
's
technique
has
not
been
evaluated
,
while
Mairesse
and
Walker
's
overgenerate
and
score
approach
is
inefficient
.
Paiva
and
Evans
'
technique
does
not
overgenerate
(
2005
)
,
but
it
requires
a
search
for
the
optimal
generation
decisions
according
to
the
learned
models
.
Our
approach
does
not
require
any
search
or
overgeneration
,
as
parameter
estimation
models
predict
the
generation
decisions
directly
from
the
target
variation
dimensions
.
This
technique
is
therefore
beneficial
for
real-time
generation
.
Moreover
the
variation
dimensions
of
Paiva
and
Evans
'
data-driven
technique
are
extracted
from
a
corpus
:
there
is
thus
no
guarantee
that
they
can
be
easily
interpreted
by
humans
,
and
that
they
generalise
to
other
corpora
.
Previous
work
has
shown
that
modeling
the
relation
between
personality
and
language
is
far
from
trivial
(
Pennebaker
and
King
,
1999
;
Argamon
et
al.
,
2005
;
Oberlander
and
Nowson
,
2006
;
Mairesse
et
al.
,
2007
)
,
suggesting
that
the
control
of
personality
is
a
harder
problem
than
the
control
of
data-driven
variation
dimensions
.
We
present
the
first
human
perceptual
evaluation
of
a
data-driven
stylistic
variation
method
.
In
terms
of
our
research
questions
in
Section
3.1
,
we
show
that
models
trained
on
expert
judges
to
project
multiple
traits
in
a
single
utterance
generate
utterances
whose
personality
is
recognized
by
naive
judges
.
There
is
only
one
other
similar
evaluation
of
an
SNLG
(
Rambow
et
al.
,
2001
)
.
Our
models
perform
only
slightly
worse
than
a
handcrafted
rule-based
generator
in
the
same
domain
.
These
findings
are
promising
as
(
1
)
parameter
estimation
models
are
able
to
target
any
combination
of
traits
over
the
full
range
of
the
Big
Five
scales
;
(
2
)
they
do
not
benefit
from
psychological
knowledge
,
i.e.
they
are
trained
on
randomly
generated
utterances
.
This
work
also
has
several
limitations
that
should
be
addressed
in
future
work
.
Even
though
the
parameters
of
Personage-PE
were
suggested
by
psychological
studies
(
Mairesse
and
Walker
,
2007
)
,
some
of
them
are
not
modeled
successfully
by
our
approach
,
and
thus
omitted
from
Tables
3
and
4
.
This
could
be
due
to
the
relatively
small
development
dataset
size
(
160
utterances
to
optimize
67
parameters
)
,
or
to
the
implementation
of
some
parameters
.
The
strong
parameter-independence
assumption
could
also
be
responsible
,
but
we
are
not
aware
of
any
state
of
the
art
implementation
for
learning
multiple
dependent
variables
,
and
this
approach
could
further
aggravate
data
sparsity
issues
.
In
addition
,
it
is
unclear
why
Personage
performs
better
for
projecting
extreme
personality
and
produces
more
natural
utterances
,
and
why
Personage-PE
fails
to
project
conscientiousness
correctly
.
It
might
be
possible
to
improve
the
parameter
estimation
models
with
a
larger
sample
of
random
utterances
at
development
time
,
or
with
additional
extreme
data
generated
using
the
rule-based
approach
.
Such
hybrid
models
are
likely
to
perform
better
for
extreme
target
scores
,
as
they
are
trained
on
more
uniformly
distributed
ratings
(
e.g.
compared
to
the
normal
distribution
in
Figure
1
)
.
In
addition
,
we
have
only
shown
that
personality
can
be
expressed
by
information
presentation
speech-acts
in
the
restaurant
domain
;
future
work
should
assess
the
extent
to
which
the
parameters
derived
from
psychological
findings
are
culture
,
domain
,
and
speech
act
dependent
.
