This paper proposes a framework for semi-supervised structured output learning (SOL), specifically for sequence labeling, based on a hybrid generative and discriminative approach. We define the objective function of our hybrid model, which is written in log-linear form, by discriminatively combining discriminative structured predictor(s) with generative model(s) that incorporate unlabeled data. Then, unlabeled data is used in a generative manner to increase the sum of the discriminant functions for all outputs during parameter estimation. Experiments on named entity recognition (CoNLL-2003) and syntactic chunking (CoNLL-2000) data show that our hybrid model significantly outperforms the state-of-the-art performance obtained with supervised SOL methods such as conditional random fields (CRFs).
1 Introduction
Structured output learning (SOL) methods, which attempt to optimize an interdependent output space globally, are important methodologies for certain natural language processing (NLP) tasks such as part-of-speech tagging, syntactic chunking (Chunking) and named entity recognition (NER), which are also referred to as sequence labeling tasks. When we consider the nature of these sequence labeling tasks, a semi-supervised approach appears to be more natural and appropriate. This is because the number of features and parameters typically becomes extremely large, and labeled examples can only sparsely cover the parameter space, even if thousands of labeled ex- […] Scheffer, 2006).
With the generative approach, we can easily incorporate unlabeled data into probabilistic models with the help of expectation-maximization (EM) algorithms (Dempster et al., 1977). For example, the Baum-Welch algorithm is a well-known algorithm for training a hidden Markov model (HMM) for sequence learning. Generally, however, with sequence learning tasks such as NER and Chunking, we cannot expect generative models to obtain better performance than that obtained using discriminative approaches in supervised learning settings.
In contrast to the generative approach, with the discriminative approach it is not obvious how unlabeled training data can be naturally incorporated into a discriminative training criterion. For example, the effect of unlabeled data will be eliminated from the objective function if the unlabeled data is directly used in traditional i.i.d. conditional-probability models.
Nevertheless, several attempts have recently been made to incorporate unlabeled data in the discriminative approach. An approach based on pairwise similarities, which encourages nearby data points to have the same class label, has been proposed as a way of incorporating unlabeled data discriminatively (Zhu et al., 2003; Altun et al., …). However, this approach generally requires joint inference over the whole data set for prediction, which is not practical as regards the large data sets used for standard sequence labeling tasks in NLP.
Another discriminative approach to semi-supervised SOL involves the incorporation of an entropy regularizer (Grandvalet and Bengio, 2004).

Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 791-800, Prague, June 2007. © 2007 Association for Computational Linguistics
Semi-supervised conditional random fields (CRFs) based on a minimum entropy regularizer (SS-CRF-MER) have been proposed in (Jiao et al., 2006). With this approach, the parameters are estimated to maximize the likelihood of the labeled data and the negative conditional entropy of the unlabeled data. Therefore, the structured predictor is trained by parameter estimation to separate the unlabeled data well under the entropy criterion.
In contrast to these previous studies, this paper proposes a semi-supervised SOL framework based on a hybrid generative and discriminative approach. A hybrid approach was first proposed in a supervised learning setting for text classification (Raina et al., 2003). (Fujino et al., 2005) developed a semi-supervised approach by discriminatively combining a supervised classifier with generative models that incorporate unlabeled data. We extend this framework to the structured output domain, specifically for sequence labeling tasks. Moreover, we re-formalize the objective function to allow the incorporation of discriminative models (structured predictors) trained from labeled data, since the original framework only considers the combination of generative classifiers.
As a result, our hybrid model can significantly improve on the state-of-the-art performance obtained with supervised SOL methods, such as CRFs, even if a large amount of labeled data is available, as shown in our experiments on CoNLL- […] In addition, compared with SS-CRF-MER, our hybrid model has several good characteristics, including a low calculation cost and robust optimization with respect to hyper-parameter sensitivity. This is described in detail in Section 5.3.
2 Supervised SOL: CRFs
This paper focuses solely on sequence labeling tasks, such as named entity recognition (NER) and syntactic chunking (Chunking), as SOL problems. Thus, let x = (x_1, ..., x_S) be an input sequence and y = (y_0, ..., y_{S+1}) be a particular output sequence, where y_0 and y_{S+1} are special fixed labels that represent the beginning and end of a sequence.
As regards supervised sequence learning, CRFs are recently introduced methods that constitute flexible and powerful models for structured predictors, based on undirected graphical models that have been globally conditioned on a set of inputs (Lafferty et al., 2001).
Let λ be a parameter vector and f(y_{s-1}, y_s, x) be a (local) feature vector obtained from the corresponding position s given x. CRFs define the conditional probability, p(y|x), as being proportional to a product of potential functions on the cliques. That is, p(y|x) on a (linear-chain) CRF can be defined as follows:

$$p(y|x; \lambda) = \frac{1}{Z(x)} \prod_{s=1}^{S+1} \exp\big(\lambda \cdot f(y_{s-1}, y_s, x)\big).$$

$Z(x) = \sum_{y \in \mathcal{Y}} \prod_{s=1}^{S+1} \exp(\lambda \cdot f(y_{s-1}, y_s, x))$ is a normalization factor over all output values, $\mathcal{Y}$, and is also known as the partition function.
For parameter estimation (training), given labeled data $\mathcal{D}_l = \{(x^k, y^k)\}_{k=1}^{K}$, Maximum a Posteriori (MAP) parameter estimation, namely maximizing log p(λ|D_l), is now the most widely used CRF training criterion. Thus, we maximize the following objective function to obtain the optimal λ:

$$\mathcal{L}_{\mathrm{CRF}}(\lambda) = \sum_{k} \Big[ \sum_{s} \lambda \cdot f(y^k_{s-1}, y^k_s, x^k) - \log Z(x^k) \Big] + \log p(\lambda). \quad (1)$$
Calculating the expectation $E_{p(y|x^k;\lambda)}[\cdot]$ that appears in the gradient, as well as the partition function Z(x), is not always tractable. However, for linear-chain CRFs, a dynamic programming algorithm similar in nature to the forward-backward algorithm in HMMs has already been developed for an efficient calculation (Lafferty et al., 2001). For prediction, the most probable output, that is, $\hat{y} = \arg\max_{y \in \mathcal{Y}} p(y|x; \lambda)$, can be efficiently obtained by using the Viterbi algorithm.
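To make the linear-chain CRF computations above concrete, here is a minimal sketch (not the paper's implementation) of the forward recursion for Z(x), the conditional probability p(y|x; λ), and Viterbi decoding. The log-potentials phi[s][y_prev][y] stand in for λ·f(y_{s-1}, y_s, x), with the fixed beginning-of-sequence label mapped to row 0 of phi[0]; all function and variable names are ours.

```python
import itertools
import math

def log_sum_exp(vals):
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def seq_score(phi, y):
    # Unnormalized log-score: sum_s lambda . f(y_{s-1}, y_s, x);
    # the fixed BOS label is mapped to row 0 of phi[0].
    total, prev = 0.0, 0
    for s, cur in enumerate(y):
        total += phi[s][prev][cur]
        prev = cur
    return total

def log_partition(phi, n_labels):
    # Forward recursion: alpha[y] = log-sum of scores of all prefixes ending in y.
    alpha = [phi[0][0][y] for y in range(n_labels)]
    for s in range(1, len(phi)):
        alpha = [log_sum_exp([alpha[yp] + phi[s][yp][y] for yp in range(n_labels)])
                 for y in range(n_labels)]
    return log_sum_exp(alpha)

def crf_prob(phi, y, n_labels):
    # p(y|x; lambda) = exp(score(y)) / Z(x)
    return math.exp(seq_score(phi, y) - log_partition(phi, n_labels))

def viterbi(phi, n_labels):
    # Most probable output sequence, argmax_y p(y|x; lambda).
    delta = [phi[0][0][y] for y in range(n_labels)]
    back = []
    for s in range(1, len(phi)):
        new_delta, ptrs = [], []
        for y in range(n_labels):
            best = max(range(n_labels), key=lambda yp: delta[yp] + phi[s][yp][y])
            new_delta.append(delta[best] + phi[s][best][y])
            ptrs.append(best)
        delta, back = new_delta, back + [ptrs]
    y = max(range(n_labels), key=lambda k: delta[k])
    path = [y]
    for ptrs in reversed(back):
        y = ptrs[y]
        path.append(y)
    return path[::-1]
```

The toy potentials used below are arbitrary numbers; the probabilities over all output sequences sum to one, and Viterbi agrees with a brute-force argmax.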
3 Hybrid Generative and Discriminative Approach to Semi-Supervised SOL
In this section, we describe our formulation of a hybrid approach to SOL and a parameter estimation method for sequence predictors. We assume that we have a set of unlabeled data, $\mathcal{D}_u = \{x^m\}_{m=1}^{M}$.
Let us assume that we have I units of discriminative models, p^D, and J units of generative models, p^G. Our hybrid model for a structured predictor is designed by the discriminative combination of several joint probability densities of x and y, p(x, y). That is, the posterior probability of our hybrid model is defined by providing the log-values of p(x, y) as the features of a log-linear model, such that:

$$R(y|x; \Lambda, \Theta, \Gamma) = \frac{\exp\big(\sum_i \gamma_i \log p^D_i(x, y; \lambda_i) + \sum_j \gamma_j \log p^G_j(x, y; \theta_j)\big)}{\sum_{y} \exp\big(\sum_i \gamma_i \log p^D_i(x, y; \lambda_i) + \sum_j \gamma_j \log p^G_j(x, y; \theta_j)\big)}$$
$$= \frac{\prod_i [p^D_i(x, y; \lambda_i)]^{\gamma_i} \prod_j [p^G_j(x, y; \theta_j)]^{\gamma_j}}{\sum_{y} \prod_i [p^D_i(x, y; \lambda_i)]^{\gamma_i} \prod_j [p^G_j(x, y; \theta_j)]^{\gamma_j}}$$
$$= \frac{\prod_i [p^D_i(y|x; \lambda_i)]^{\gamma_i} \prod_j [p^G_j(x, y; \theta_j)]^{\gamma_j}}{\sum_{y} \prod_i [p^D_i(y|x; \lambda_i)]^{\gamma_i} \prod_j [p^G_j(x, y; \theta_j)]^{\gamma_j}}, \quad (2)$$

where $\Gamma = \{\{\gamma_i\}_{i=1}^{I}, \{\gamma_j\}_{j=1}^{J}\}$ represents the discriminative combination weights of the individual models, with each γ ∈ [0, 1]. Moreover, $\Lambda = \{\lambda_i\}_{i=1}^{I}$ and $\Theta = \{\theta_j\}_{j=1}^{J}$ represent the model parameters of the individual models, estimated from labeled and unlabeled data, respectively.
Using p^D_i(x, y) = p^D_i(y|x) p^D_i(x), we can derive the third line of Equation (2) from the second line, where the factors p^D_i(x; λ_i)^{γ_i} for all i cancel out. Thus, our hybrid model is constructed by combining discriminative models, p^D_i(y|x; λ_i), with generative models, p^G_j(x, y; θ_j).
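A minimal sketch of this discriminative combination (ours, not the paper's code): given toy component models p^D_i(y|x) and p^G_j(x, y) supplied as plain functions, R(y|x) is the log-linear combination of their log-values, normalized over all candidate outputs. The function and argument names are assumptions for illustration.

```python
import math

def hybrid_posterior(x, y, disc_models, gen_models, gamma_d, gamma_g, all_ys):
    # R(y|x) with log g(x,y) = sum_i gamma_i log p^D_i(y|x) + sum_j gamma_j log p^G_j(x,y);
    # the p^D_i(x) factors cancel between numerator and denominator, as in Eq. (2).
    def log_g(cand):
        s = sum(g * math.log(pd(cand, x)) for g, pd in zip(gamma_d, disc_models))
        return s + sum(g * math.log(pg(x, cand)) for g, pg in zip(gamma_g, gen_models))
    logs = {cand: log_g(cand) for cand in all_ys}
    m = max(logs.values())
    denom = sum(math.exp(v - m) for v in logs.values())
    return math.exp(logs[y] - m) / denom
```

With all generative weights set to zero, R reduces to the (renormalized) discriminative model alone, mirroring the LOP-CRF special case discussed below.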
Hereafter, let us assume that our hybrid model consists of CRFs for the discriminative models, p^D, and HMMs for the generative models, p^G, shown in Equation (2), since this paper focuses solely on sequence modeling.
For HMMs, we consider a first-order HMM defined by the following equation:

$$p(x, y; \theta) = \prod_{s=1}^{S+1} \theta_{y_{s-1} y_s}\, \theta_{y_s x_s},$$

where $\theta_{y_{s-1} y_s}$ and $\theta_{y_s x_s}$ represent the transition probability between states $y_{s-1}$ and $y_s$ and the symbol emission probability at the s-th position of the corresponding input sequence, respectively, and where $\theta_{y_{S+1}, x_{S+1}} = 1$.
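The first-order HMM joint probability above can be sketched directly (our illustrative code, with dict-based transition and emission tables; the 'BOS'/'EOS' label names are our convention for the fixed y_0 and y_{S+1} labels):

```python
def hmm_joint(x, y, trans, emit):
    # p(x, y; theta) = prod_{s=1}^{S+1} theta_{y_{s-1} y_s} * theta_{y_s x_s}.
    # y = (y_0, ..., y_{S+1}) includes the fixed begin/end labels; the emission
    # at the end position is fixed to 1, i.e. theta_{y_{S+1}, x_{S+1}} = 1.
    p = 1.0
    for s in range(1, len(y)):
        p *= trans[y[s - 1]][y[s]]
        if s <= len(x):
            p *= emit[y[s]][x[s - 1]]
    return p
```

For a two-symbol input the joint is just the product of two transitions, two emissions, and the final transition into the end label.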
It can be seen that the formalization of the log-linear combination in our hybrid model is very similar to that of LOP-CRFs (Smith et al., 2005). In fact, if we only use a combination of discriminative models (CRFs), which is equivalent to setting γ_j = 0 for all j, we obtain essentially the same objective function as that of LOP-CRFs. Thus, our framework can also be seen as an extension of LOP-CRFs that enables us to incorporate unlabeled data.
3.1 Discriminative Combination
For estimating the parameter Γ, let us assume that we already have discriminatively trained models on labeled data, p^D_i(y|x; λ_i). We maximize the following objective function for estimating the parameter Γ under a fixed Θ:

$$\mathcal{L}_{\mathrm{HySOL}}(\Gamma|\Theta) = \sum_{k} \log R(y^k|x^k; \Lambda, \Theta, \Gamma) + \log p(\Gamma), \quad (3)$$

where p(Γ) is a prior probability distribution of Γ. The value of Γ providing a global maximum of $\mathcal{L}_{\mathrm{HySOL}}(\Gamma|\Theta)$ is guaranteed under an arbitrary fixed value in the Θ domain, since $\mathcal{L}_{\mathrm{HySOL}}(\Gamma|\Theta)$ is a concave function of Γ. Thus, we can easily maximize Equation (3) by using a gradient-based optimization algorithm such as (bound constrained) L-BFGS (Liu and Nocedal, 1989).
3.2 Incorporating Unlabeled Data
We cannot directly incorporate unlabeled data into discriminative training criteria such as Equation (3), since the correct outputs y for unlabeled data are unknown. On the other hand, generative approaches can easily deal with unlabeled data as incomplete data (data with the missing variable y) by using a mixture model. A well-known way to achieve this incorporation is to maximize the log likelihood of the unlabeled data with respect to the marginal distribution of the generative models, $\mathcal{L}(\Theta) = \sum_m \log \sum_y p(x^m, y; \Theta)$.
In fact, (Nigam et al., 2000) reported that using unlabeled data with a mixture model can improve text classification performance.
According to Bayes' rule, p(y|x; Θ) ∝ p(x, y; Θ), the discriminant functions of generative classifiers are provided by the generative models p(x, y; Θ). Therefore, we can regard L(Θ) as the logarithm of the sum of discriminant functions over all missing variables y of the unlabeled data.
Following this view, we can directly incorporate unlabeled data into our hybrid model by maximizing the discriminant functions g of our hybrid model in the same way as for a mixture model, as explained above. Thus, we maximize the following objective function for estimating the model parameters Θ of the generative models on unlabeled data:

$$\mathcal{G}(\Theta|\Gamma) = \sum_{m} \log \sum_{y} g(x^m, y; \Lambda, \Theta, \Gamma) + \log p(\Theta), \quad (4)$$

where p(Θ) is a prior probability distribution of Θ.
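The objective in Equation (4) is simple to compute once the log-discriminant log g(x, y) is available; the following sketch (ours, with illustrative names) accumulates the log of the sum of discriminant functions over all candidate outputs, in a numerically stable way:

```python
import math

def objective_G(unlabeled_xs, all_ys, log_g, log_prior=0.0):
    # G(Theta|Gamma) = sum_m log sum_y g(x_m, y) + log p(Theta), as in Eq. (4):
    # the log of the sum of discriminant functions over all candidate outputs y.
    total = log_prior
    for x in unlabeled_xs:
        vals = [log_g(x, y) for y in all_ys]
        m = max(vals)  # log-sum-exp shift for numerical stability
        total += m + math.log(sum(math.exp(v - m) for v in vals))
    return total
```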
Here, the discriminant function g of output y given input x in our hybrid model can be obtained from the numerator of the third line of Equation (2), since the denominator does not affect the determination of y, that is,

$$g(x, y; \Lambda, \Theta, \Gamma) = \prod_i [p^D_i(y|x; \lambda_i)]^{\gamma_i} \prod_j [p^G_j(x, y; \theta_j)]^{\gamma_j}.$$

Equation (4) can be maximized by an EM-style procedure with the Q-function

$$Q(\Theta'', \Theta'; \Gamma) = \sum_m \sum_y R(y|x^m; \Lambda, \Theta', \Gamma) \log g(x^m, y; \Lambda, \Theta'', \Gamma) + \log p(\Theta''). \quad (5)$$
Since Q(Θ', Θ'; Γ) is independent of Θ'', we can improve the value of G(Θ|Γ) by computing the Θ'' that maximizes Q(Θ'', Θ'; Γ). We can obtain an estimate of Θ by iteratively performing this update while G(Θ|Γ) is hill climbing.
As shown in Equation (5), R is used for estimating the parameter Θ. The intuitive effect of maximizing Equation (4) is similar to performing 'soft-clustering'. That is, unlabeled data is clustered with respect to the R distribution, which also includes information about the labeled data, under the constraint of the generative model structures.
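The 'soft-clustering' effect can be sketched as posterior-weighted counting (our illustration, not the paper's code): each unlabeled sequence contributes fractional emission counts, weighted by the hybrid posterior R(y|x), which the M-step would then normalize into new generative parameters.

```python
from collections import defaultdict

def soft_cluster_counts(unlabeled_xs, all_ys, resp):
    # resp(x, y) = R(y|x): posterior responsibilities that softly assign each
    # unlabeled sequence to labelings, accumulating expected emission counts.
    counts = defaultdict(float)
    for x in unlabeled_xs:
        for y in all_ys:
            w = resp(x, y)
            for sym, lab in zip(x, y):
                counts[(lab, sym)] += w
    return counts
```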
3.3 Parameter Estimation Procedure
According to our definition, the Θ and Γ estimations are mutually dependent. That is, the parameters of the hybrid model, Γ, should be estimated using Equation (3) with a fixed Θ, while the parameters of the generative models, Θ, should be estimated using Equation (4) with a fixed Γ.

4. Perform the following until the change in G(Θ|Γ) falls below ε:
   4.1. Estimate Θ^(t+1) under fixed Γ^(t) and Λ, using D_u.
   4.2. Estimate Γ^(t+1) under fixed Θ^(t+1) and Λ, using D''_l.
   4.3. t ← t + 1.
5. Output a structured predictor R(y|x; Λ, Θ^(t), Γ^(t)).

Figure 1: Algorithm of learning model parameters used in our hybrid model.
As a solution to our parameter estimation problem, we search for the Θ and Γ that maximize $\mathcal{L}_{\mathrm{HySOL}}(\Gamma|\Theta)$ and G(Θ|Γ) simultaneously. For this search, we compute Θ and Γ by maximizing the objective functions shown in Equations (4) and (3) iteratively and alternately. We summarize the algorithm for estimating these model parameters in Figure 1.
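The alternating procedure of Figure 1 can be sketched generically as coordinate ascent (our skeleton; the two update functions stand in for the maximizations of Equations (4) and (3)):

```python
def alternate_maximize(update_theta, update_gamma, objective, theta, gamma,
                       eps=1e-12, max_iter=200):
    # Procedures 4.1 and 4.2 of Figure 1: maximize G(theta|gamma) and
    # L_HySOL(gamma|theta) alternately until the objective change falls below eps.
    prev = objective(theta, gamma)
    for _ in range(max_iter):
        theta = update_theta(gamma)   # 4.1: fix gamma, re-estimate theta
        gamma = update_gamma(theta)   # 4.2: fix theta, re-estimate gamma
        cur = objective(theta, gamma)
        if abs(cur - prev) < eps:
            break
        prev = cur
    return theta, gamma
```

On a toy concave objective with closed-form coordinate maximizers, the loop converges to the joint maximum.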
Note that during the Γ estimation (procedure 4.2 in Figure 1), Γ can be over-fitted to the labeled training data if we use the same labeled training data as used for the Λ estimation. There are several possible ways to reduce this over-fitting. In this paper, we select one of the simplest: we divide the labeled training data D_l into two distinct sets, D'_l and D''_l. Then, D'_l and D''_l are used for estimating Λ and Γ, respectively. In our experiments, we divide the labeled training data D_l so that 4/5 is used for D'_l and the remaining 1/5 for D''_l.
3.4 Efficient Parameter Estimation Algorithm
Let N_R(x) represent the denominator of Equation (2), that is, the normalization factor of R. We can rearrange Equation (2) as follows:

$$R(y|x; \Lambda, \Theta, \Gamma) = \frac{1}{Z_R(x)} \prod_{s=1}^{S+1} \prod_i [V^D_{i,s}]^{\gamma_i} \prod_j [V^G_{j,s}]^{\gamma_j}, \quad (6)$$

where $V^D_{i,s}$ represents the potential function at the s-th position of the sequence in the i-th CRF and $V^G_{j,s}$ represents the probability of the s-th position in the j-th HMM, that is, $V^D_{i,s} = \exp(\lambda_i \cdot f_s)$ and $V^G_{j,s} = \theta_{y_{s-1} y_s} \theta_{y_s x_s}$, respectively. See the Appendix for the derivation of Equation (6) from Equation (2).
To estimate Γ^(t+1), namely procedure 4.2 in Figure 1, we employ the derivatives with respect to γ_i and γ_j in Equation (6), which are the weights of the discriminative and generative models, respectively. Thus, we obtain the following derivative with respect to γ_i:

$$\frac{\partial \mathcal{L}_{\mathrm{HySOL}}}{\partial \gamma_i} = \sum_n \sum_s \log V^D_{i,s}(y^n_{s-1}, y^n_s) - \sum_n \log Z_i(x^n) - \sum_n E_{R(y|x^n; \Lambda, \Theta, \Gamma)}\Big[\sum_s \log V^D_{i,s}\Big].$$

The first and second terms are constant during iterative procedure 4 in our optimization algorithm shown in Figure 1. Thus, we only need to calculate these values once, at the beginning of procedure 4.
Let α_s(y) and β_s(y) represent the forward and backward state costs at position s with output y for the corresponding input x. Let V_s(y, y') represent the product of the total values of the transition costs between positions s−1 and s with labels y and y' in the corresponding input sequence, that is, $V_s(y, y') = \prod_i [V^D_{i,s}(y, y')]^{\gamma_i} \prod_j [V^G_{j,s}(y, y')]^{\gamma_j}$.
The third term, which indicates the expectation of the potential functions, can be rewritten in the form of a forward-backward algorithm, that is,

$$E_{R(y|x; \Lambda, \Theta, \Gamma)}\Big[\sum_s \log V^D_{i,s}\Big] = \sum_s \sum_{y, y'} \frac{\alpha_{s-1}(y)\, V_s(y, y')\, \beta_s(y')}{Z_R(x)} \log V^D_{i,s}(y, y'), \quad (7)$$

where Z_R(x) represents the partition function of our hybrid model, that is, $Z_R(x) = N_R(x) \prod_i [Z_i(x)]^{\gamma_i}$. Hence, the calculation of the derivatives with respect to γ_i is tractable, since we can incorporate the same forward-backward algorithm as that used in a standard CRF.
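The forward-backward computation of such an expectation can be sketched as follows (our illustration, in log space for stability): phi holds the merged log-potentials log V_s(y, y'), h holds the per-edge quantity whose expectation is wanted (e.g. log V^D_{i,s}), and the begin-of-sequence label is mapped to row 0 of phi[0].

```python
import itertools
import math

def log_sum_exp(vals):
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def expected_potential(phi, h, n_labels):
    # E_{R(y|x)}[ sum_s h_s(y_{s-1}, y_s) ] via one forward-backward pass
    # over the merged log-potentials phi (BOS mapped to row 0 of phi[0]).
    S = len(phi)
    alpha = [[phi[0][0][y] for y in range(n_labels)]]
    for s in range(1, S):
        alpha.append([log_sum_exp([alpha[s - 1][yp] + phi[s][yp][y]
                                   for yp in range(n_labels)])
                      for y in range(n_labels)])
    beta = [[0.0] * n_labels for _ in range(S)]
    for s in range(S - 2, -1, -1):
        beta[s] = [log_sum_exp([phi[s + 1][y][yn] + beta[s + 1][yn]
                                for yn in range(n_labels)])
                   for y in range(n_labels)]
    logZ = log_sum_exp(alpha[S - 1])
    # edge from BOS at position 0, then interior edges
    exp_val = sum(math.exp(alpha[0][y] + beta[0][y] - logZ) * h[0][0][y]
                  for y in range(n_labels))
    for s in range(1, S):
        for yp in range(n_labels):
            for y in range(n_labels):
                marg = math.exp(alpha[s - 1][yp] + phi[s][yp][y] + beta[s][y] - logZ)
                exp_val += marg * h[s][yp][y]
    return exp_val
```

On a tiny instance the result matches the brute-force expectation over all output sequences.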
Then, the derivatives with respect to γ_j, which are the weights of the generative models, can be written as follows:

$$\frac{\partial \mathcal{L}_{\mathrm{HySOL}}}{\partial \gamma_j} = \sum_n \log p^G_j(x^n, y^n; \theta_j) - \sum_n E_{R(y|x^n; \Lambda, \Theta, \Gamma)}\Big[\sum_s \log V^G_{j,s}\Big].$$

Again, the second term, which indicates the expectation of the transition probabilities and symbol emission probabilities, can be rewritten in the form of a forward-backward algorithm in the same manner as for γ_i, where the only difference is that $V^D_{i,s}$ is substituted by $V^G_{j,s}$ in Equation (7).
To estimate Θ^(t+1), which is procedure 4.1 in Figure 1, the same forward-backward algorithm as used in standard HMMs is available, since the form of our Q-function shown in Equation (5) is the same as that of standard HMMs. The only difference is that our method uses marginal probabilities given by R instead of the p(x, y; Θ) of standard HMMs. Therefore, only a forward-backward algorithm is required for the efficient calculation of our parameter estimation process.
Note that even though our hybrid model supports the use of a combination of several generative and discriminative models, we only need to run the forward-backward algorithm once for each sample during optimization procedures 4.1 and 4.2. This means that the required number of executions of the forward-backward algorithm for our parameter estimation is independent of the number of models used in the hybrid model.
In addition, after training, we can easily merge all the parameter values into a single parameter vector. This means that we can simply employ the Viterbi algorithm for evaluating unseen samples, just as with standard CRFs, without any additional cost.
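The merge step amounts to collapsing all component models into one table of per-position log-potentials (a sketch under our naming assumptions), after which a single Viterbi pass over the merged table suffices at test time:

```python
def merge_log_potentials(disc_logV, gen_logV, gamma_d, gamma_g):
    # log V_s(y, y') = sum_i gamma_i log V^D_{i,s}(y, y')
    #                + sum_j gamma_j log V^G_{j,s}(y, y').
    # Each model is a table indexed [position][y_prev][y_cur].
    n = len(disc_logV[0][0])
    merged = []
    for s in range(len(disc_logV[0])):
        table = [[sum(g * m[s][yp][y] for g, m in zip(gamma_d, disc_logV)) +
                  sum(g * m[s][yp][y] for g, m in zip(gamma_g, gen_logV))
                  for y in range(n)] for yp in range(n)]
        merged.append(table)
    return merged
```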
4 Experiments
We examined our hybrid model (HySOL) by applying it to two sequence labeling tasks, named entity recognition (NER) and syntactic chunking (Chunking). We used the same Chunking and 'English' NER data as those used for the shared tasks of CoNLL-2000 (Tjong Kim Sang and Buchholz, 2000) and CoNLL-2003 (Tjong Kim Sang and Meulder, 2003), respectively.
As the baseline method, we trained a conditional random field (CRF) using exactly the same training procedure as described in (Sha and Pereira, 2003), with L-BFGS. Moreover, LOP-CRF (Smith et al., 2005) is also compared with our hybrid model, since the formalism of our hybrid model can be seen as an extension of LOP-CRFs, as described in Section 3.
For the CRF, we used a Gaussian prior as the second term on the RHS of Equation (1), where σ² represents the hyper-parameter of the Gaussian prior. In contrast, for LOP-CRF and HySOL, we used Dirichlet priors as the second terms on the RHS of Equations (3) and (4), where ξ and η are the hyper-parameters of each Dirichlet prior.

Table 1: Features used in NER experiments
4.1 Named Entity Recognition Experiments
The English NER data consists of 203,621, 51,362 and 46,435 words from 14,987, 3,466 and 3,684 sentences in the training, development and test data, respectively, with four named entity tags, PERSON, LOCATION, ORGANIZATION and MISC, plus the 'O' tag. The unlabeled data consists of 17,003,926 words from 1,029,122 sentences. These data sets are exactly the same as those provided for the shared task of CoNLL-2003.
We slightly extended the feature set of the supplied data by adding feature types such as 'word type', and word prefixes and suffixes. Examples of 'word type' include whether the word is capitalized, contains a digit or contains punctuation, which basically follows the baseline features of (Sutton et al., 2006) without regular expressions. Note that, unlike several previous studies, we did not employ additional information from external resources such as gazetteers. All our features can be automatically extracted from the supplied data.
For LOP-CRF and HySOL, we used four base discriminative models trained by CRFs with different feature sets. Table 1 shows the feature sets we used for training these models. The design of these feature sets was derived from a suggestion in (Smith et al., 2005), which exhibited the best performance among the several feature divisions. Note that the CRF used as the comparison method was trained by using all the feature types, namely the same set as Λ_4.

Table 2: Features used in Chunking experiments
As we explained in Section 3.3, for training HySOL, the parameters of the four discriminative models, Λ, were trained on 4/5 of the labeled training data, and Γ was trained on the remaining 1/5. For the features of the generative models, we used all of the feature types shown in Table 1. Note that one feature type corresponds to one HMM. Thus, each HMM consists of a non-overlapping feature set, since each feature type only generates one symbol per state.
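The one-feature-type-per-HMM design can be sketched as a simple regrouping of per-token feature dictionaries into per-HMM symbol sequences (our illustration; the feature-type names are hypothetical examples):

```python
def features_per_hmm(feature_types, token_features):
    # One feature type corresponds to one HMM: each HMM emits exactly one
    # symbol per state, so its feature set never overlaps another HMM's.
    # token_features: list over positions of dicts {feature_type: symbol}.
    return {t: [feats[t] for feats in token_features] for t in feature_types}
```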
4.2 Syntactic Chunking Experiments
The CoNLL-2000 Chunking data was obtained from the Wall Street Journal (WSJ) corpus: sections 15-18 as training data (8,936 sentences and 211,727 words) and section 20 as test data (2,012 sentences and 47,377 words), with 11 different chunk tags, such as NP and VP, plus the 'O' tag, which represents the region outside any target chunk.
For LOP-CRF and HySOL, we again used four base discriminative models trained by CRFs with different feature sets. Table 2 shows the feature sets we used in the Chunking experiments. We used the feature set of the supplied data without any extension to additional feature types.
To train HySOL, we used the same unlabeled data as used in our NER experiments (17,003,926 words from the Reuters corpus). Moreover, the division of the labeled training data and the feature sets of the generative models were derived in the same manner as in our NER experiments (see Section 4.1). That is, we divided the labeled training data into 4/5 for estimating Λ and 1/5 for estimating Γ; one feature type shown in Table 2 is assigned to one generative model.
5 Results and Discussion
We evaluated the performance in terms of the F_{β=1} score, which is the evaluation measure used in CoNLL-2000 and 2003, and in terms of sentence accuracy, since all the methods in our experiments optimize sequence loss.
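For reference, the F_{β} score used in the CoNLL evaluations combines precision and recall over predicted chunks/entities; a minimal sketch of the computation (ours, with illustrative count-based arguments):

```python
def f_beta(num_correct, num_predicted, num_gold, beta=1.0):
    # F_{beta} = (1 + beta^2) * P * R / (beta^2 * P + R), with
    # P = correct/predicted and R = correct/gold; beta=1 gives the CoNLL F_{beta=1}.
    p = num_correct / num_predicted if num_predicted else 0.0
    r = num_correct / num_gold if num_gold else 0.0
    if p + r == 0.0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```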
Tables 3 and 4 show the results of the NER and Chunking experiments, respectively. The 'F_{β=1}' and 'Sent' columns show the performance evaluated using the F_{β=1} score and sentence accuracy, respectively.
σ², ξ and η, which are the hyper-parameters of the Gaussian and Dirichlet priors, were selected from fixed sets of values by using a development set¹, that is, σ² ∈ {0.01, 0.1, 1, 10, 100, 1000}, ξ − 1 ∈ {0.01, 0.1, 1, 10} and η − 1 ∈ {0.00001, 0.0001, 0.001, 0.01}.
The second rows of CRF in Tables 3 and 4 represent the performance of the base discriminative models used in HySOL with all the features, which were trained on 4/5 of the labeled training data. The third rows, of HySOL, show the performance obtained without using the generative models (unlabeled data). The model itself is essentially the same as LOP-CRFs. However, the performance in the third HySOL rows was consistently lower than that of LOP-CRF, since the discriminative models in HySOL are trained on 4/5 of the labeled data.
¹ The Chunking (CoNLL-2000) data has no common development set. Thus, in a preliminary examination we used 4/5 of the labeled training data for training, with the remaining 1/5 as development data, to determine the hyper-parameter values.
Figure 2: Changes in the performance and the convergence condition value (procedure 4 in Figure 1) of HySOL.

HySOL significantly improved on the performance of the supervised methods, CRF and LOP-CRF, in both the NER and Chunking experiments.
5.1 Impact of Incorporating Unlabeled Data
The contribution provided by incorporating unlabeled data in our hybrid model can be seen by comparing the performance of the first and third rows of HySOL: namely, a 2.64 point F-score gain and a 2.96 point sentence accuracy gain in the NER experiments, and a 0.46 point F-score gain and a 1.99 point sentence accuracy gain in the Chunking experiments.
We believe there are two key ideas that enable the unlabeled data in our approach to yield this improvement over the state-of-the-art performance provided by discriminative models in supervised settings. First, unlabeled data is only used for optimizing Equation (4), to obtain an effect similar to 'soft-clustering', which can be calculated without information about the correct outputs.
Second, by using a combination of generative models, we can enhance the flexibility of the feature design for unlabeled data. For example, we can handle arbitrary overlapping features, similar to those used in discriminative models, for unlabeled data by assigning one feature type to one generative model, as in our experiments.
5.2 Impact of Iterative Parameter Estimation
Figure 2 shows the changes in the performance and in the convergence condition value of HySOL during the parameter estimation iterations in our NER and Chunking experiments, respectively. As shown in the figure, HySOL was able to reach the convergence condition in a small number of iterations in our experiments. Moreover, the performance remains quite stable during the iterations.
However, our optimization procedure is theoretically not guaranteed to converge in the Γ and Θ space, since the optimization of Θ has local maxima. Even if we were unable to meet the convergence condition, we could easily obtain model parameters by performing a sufficiently large fixed number of iterations and then selecting the parameters for which Equation (4) attained its maximum objective value.
5.3 Comparison with SS-CRF-MER

When we consider semi-supervised SOL methods, SS-CRF-MER (Jiao et al., 2006) is the most competitive with HySOL, since both methods are defined based on CRFs. We planned to compare its performance with that of SS-CRF-MER in our NER and Chunking experiments. Unfortunately, we failed to implement SS-CRF-MER, since it requires the use of a somewhat complicated algorithm called the 'nested' forward-backward algorithm.
Although we cannot compare the performance directly, our hybrid approach has several good characteristics compared with SS-CRF-MER. First, SS-CRF-MER requires a higher-order algorithm, namely the 'nested' forward-backward algorithm, for the parameter estimation on unlabeled data; its time complexity is O(L³S²) for each unlabeled example, where L and S represent the output label set size and the unlabeled sample length, respectively. Our hybrid approach is thus more scalable with respect to the size of the unlabeled data, since HySOL only needs a standard forward-backward algorithm, whose time complexity is O(L²S).
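The asymptotic gap above is easy to quantify with purely illustrative operation counts (the example values for L and S are our assumptions, not figures from the paper):

```python
def nested_fb_cost(L, S):
    # 'Nested' forward-backward (SS-CRF-MER): O(L^3 * S^2) per unlabeled sequence.
    return L ** 3 * S ** 2

def fb_cost(L, S):
    # Standard forward-backward (HySOL): O(L^2 * S) per unlabeled sequence.
    return L ** 2 * S

def speedup(L, S):
    # Ratio of the two costs: grows as L * S, so the gap widens with
    # longer sentences and larger label sets.
    return nested_fb_cost(L, S) / fb_cost(L, S)
```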
In fact, we still question whether SS-CRF-MER is really scalable in practical time to as large an amount of unlabeled data as used in our experiments, which is about 680 times larger than that of (Jiao et al., 2006).
Scalability with respect to unlabeled data will become very important in the future, as it will be natural to use millions or billions of unlabeled examples for further improvement.
Second, SS-CRF-MER has a sensitive hyper-parameter in its objective function, which controls the influence of the unlabeled data. In contrast, our objective function only has the hyper-parameters of the prior distributions, which are widely used for standard MAP estimation.
Moreover, the experimental results shown in Tables 3 and 4 indicate that HySOL is rather robust with respect to these hyper-parameters, since we can obtain fairly good performance even without a prior distribution.
5.4 Comparison with Previous Top Systems
With respect to the performance of the NER and Chunking tasks, the current best performance is reported in (Ando and Zhang, 2005), which we refer to as 'ASO-semi', as shown in Tables 5 and 6. ASO-semi also incorporates unlabeled data solely as a source of additional information, in the same way as our method.
Unfortunately, our results could not reach their level of performance, although the size and source of the unlabeled data are not the same, for certain reasons. First, (Ando and Zhang, 2005) does not describe the unlabeled data used in their NER experiments in detail, and second, we are not licensed to use the TREC corpus, including the WSJ unlabeled data, that they used for their Chunking experiments (the training and test data for Chunking are derived from WSJ).
Therefore, we simply used the supplied unlabeled data of the CoNLL-2003 shared task for both NER and Chunking.
If we consider the advantages of our approach, our hybrid model incorporating generative models seems rather intuitive, since it is sometimes difficult to design effective auxiliary problems for the target problem.
Interestingly, the additional information obtained from unlabeled data appears to differ between the two methods.

Table 7: The HySOL performance with the F-score optimization technique and some additional resources in the NER (CoNLL-2003) experiments

Table 8: The HySOL performance with the F-score optimization technique in the Chunking (CoNLL-2000) experiments
ASO-semi uses unlabeled data for constructing auxiliary problems, in order to find the 'shared structures' of the auxiliary problems that are expected to improve the performance of the main problem.
Moreover, it is possible to combine both methods, for example by incorporating the features obtained with their method into our base discriminative models, and then constructing a hybrid model with our method. Therefore, this simple combination may offer the possibility of further performance improvements.
In NER, most of the top systems other than ASO-semi boost performance by employing external hand-crafted resources such as large gazetteers. This is why their results are superior to those obtained with HySOL. In fact, if we simply add the gazetteers included in the CoNLL-2003 supplied data as features, HySOL achieves an F-score of 88.14.
5.5 Applying F-score Optimization Technique
In addition, we can simply apply the F-score optimization technique for sequence labeling tasks proposed in (Suzuki et al., 2006) to boost HySOL's performance, since the base discriminative models p^D(y|x) and the discriminative combination, namely Equation (3), in our hybrid model basically use the same optimization procedure as CRFs.
Tables 7 and 8 show the F-score gain obtained when we apply the F-score optimization technique. As shown in these tables, the F-score optimization technique can easily improve the (F-score) performance without any additional resources or feature engineering.
In NER, we also examined HySOL with additional resources to observe the performance gain. The third row represents the performance when we add approximately 10M words of unlabeled data (27M words in total)², derived from the 1996/11/15-30 articles in the Reuters corpus.
Then, the fourth and fifth rows represent the performance when we add the supplied gazetteers in the CoNLL-2003 data as features, and when we add the development data to the training data for Γ. In this case, HySOL achieved performance comparable to that of the current best system, ASO-semi, in both the NER and Chunking experiments, even though the NER experiment is not a fair comparison, since we added additional resources (gazetteers and the development set) that ASO-semi does not use in training.
6 Conclusion and Future Work
We proposed a framework for semi-supervised SOL based on a hybrid generative and discriminative approach. Experimental results showed that incorporating unlabeled data in a generative manner has the power to further improve on the state-of-the-art performance provided by supervised SOL methods such as CRFs, with the help of our hybrid approach, which discriminatively combines the generative models with discriminative models.
In future work we intend to investigate more appropriate model and feature designs for unlabeled data, which may further improve on the performance achieved in our experiments.
Appendix
² In order to keep the POS tags consistent, we reattached the POS tags of the supplied data set and the new 10M words of unlabeled data using a POS tagger trained on the WSJ corpus.
