This paper presents a simple yet, in practice, very efficient technique for the automatic detection of positions in a part-of-speech tagged corpus where an error is to be suspected. The approach is based on the idea of learning and subsequently applying "negative bigrams", i.e. on the search for pairs of adjacent tags which constitute an incorrect configuration in a text of a particular language (in English, e.g., the bigram ARTICLE - FINITE VERB). Further, the paper describes the generalization of "negative bigrams" into "negative n-grams", for any natural n, which indeed provides a powerful tool for error detection in a corpus. The implementation is also discussed, as well as an evaluation of the results of the approach when used for error detection in the NEGRA® corpus of German, and the general implications for the quality of the results of statistical taggers. The illustrative examples in the text are taken from German, and hence at least a basic command of this language will be helpful for understanding them; due to the complexity of the necessary accompanying explanation, the examples are neither glossed nor translated. However, the central ideas of the paper should be understandable even without any knowledge of German.
1 Errors in PoS-Tagged Corpora
The importance of the correctness (error-freeness) of language resources in general, and of tagged corpora in particular, can hardly be overestimated. However, the definition of what constitutes an error in a tagged corpus depends on the intended usage of this corpus.
1.1 If we consider the quite typical case of a Part-of-Speech (PoS) tagged corpus used for training statistical taggers, then an error is naturally defined as any deviation from the regularities which the system is expected to learn; in this particular case, this means that the corpus should contain neither errors in the assignment of PoS-tags nor ungrammatical constructions in the corpus body1, since if either of the two is present in the corpus, the learning process necessarily:
• gets a distorted view of the probability distribution of configurations (e.g., trigrams) in a correct text
• gets positive evidence also about configurations (e.g., trigrams) which should not occur as the output of tagging linguistically correct texts, while simultaneously getting less evidence about the correct configurations.
1.2 If we consider PoS-tagged corpora intended for testing NLP systems, then obviously they should not contain any errors in tagging (since this would be detrimental to the validity of the test results), but on the other hand they should contain a certain amount of ungrammatical constructions, in order to test the behaviour of the tested system on realistic input.
Both these cases share the tacit presupposition that the tagset used is linguistically adequate, i.e. that it is sufficient for an unequivocal and consistent assignment of tags to the source text2.
1 In this paper we deliberately do not distinguish between "genuine" ungrammaticality, i.e. ungrammaticality which was present already in the source text, and ungrammaticality which came into being as a result of a faulty conversion of the source into the corpus-internal format, e.g., incorrect tokenization, OCR errors, etc.
2 This problem might be illustrated, in a very simplified form, on the example of a tagset introducing tags for nouns and verbs only, which is then used to tag the sentence John walks slowly: whichever tag is assigned to the word slowly, it is obviously an incorrect one. Natural as this requirement might seem, it is in fact not met fully satisfactorily in any tagset we know of; for more, cf. Kveton and Oliva (in prep.).
1.3 As for the use of annotated corpora for linguistic research, it seems that even inadequacies of the tagset are tolerable provided they are marked off properly; in fact, these spots in the corpus might well be quite an important source for linguistic investigation since, more often than not, they constitute direct pointers to occurrences of linguistically "interesting" (or at least "difficult") constructions in the text.
2 Automatic Detection of PoS-Tagging Errors
In the following, we shall concentrate on the first case mentioned above, i.e. on methods and techniques for generating "completely error-free" corpora, or, more precisely, on the possibilities of (semi-)automatic detection (and hence correction) of errors in a PoS-tagged corpus. Given this aim of achieving an "error-free" corpus, we shall not distinguish between errors due to incorrect tagging, faulty conversion or ill-formed input, and we shall treat them all on a par.
The approach, as well as its impact on the correctness of the resulting corpus, will be demonstrated on version 2 of the NEGRA® corpus of German (for the corpus itself see www.coli.uni-sb.de/sfb378/negra-corpus, for a description cf. Skut et al. (1997)). However, we believe that the solutions developed and presented in this paper are not bound to correcting this particular corpus, or to German, but hold generally.
The error search we use has several phases which differ in the amount of context that has to be taken into consideration during the error detection process. Put plainly, the extent of the context mirrors the linguistic complexity of the detection; in other words, by the time the objective is to search for "complex" errors, the "simple(r)" errors should already have been eliminated. The first, preliminary phase is thus the search for errors which are detectable absolutely locally, i.e. without any context at all.
2.1 Preliminary Phase: Trivial Errors

When aiming at the correction of errors in general, the basic condition to be met is that the local assignment of PoS-tags is, if not correct, then at least (morphologically) plausible.
In particular, the first errors to be corrected are those where the assignment of PoS-tags violates morphological (and possibly other local, e.g., phonological) laws of the language. The important point, however, is that only the error detection is strictly local; for the correction, a wider context might be (and as a rule is) needed.
From this it follows that the first phase should be the search for "impossible unigrams", i.e. for tags which are assigned in conflict with morphological or lexical information.
A simple (but unrealistic) example from English would be the case where the word table is assigned the tag PLURAL-NOUN or the tag PREPOSITION. As realistic examples from NEGRA®, one can put forward: tagging the (German!) word die as a masculine singular form of an article, tagging ein as a definite article, assigning a verbal tag to a word which starts with a capital letter and stands in a non-first position of a sentence (typically, such a word is a verbal noun), or tagging bis as a preposition requiring the dative case3.
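To make the unigram check concrete, the following minimal sketch (ours, not the original implementation; the lexicon entries and tag names are illustrative only) flags every position whose assigned tag is not among the tags licensed for the wordform by a morphological lexicon:

# Minimal sketch of the "impossible unigram" check. The lexicon maps
# a wordform to the set of morphologically plausible tags for it;
# all entries and tag names are illustrative, not NEGRA's.
lexicon = {
    "die": {"ART-FEM-SG-NOM", "ART-FEM-SG-ACC", "ART-PL", "PRELS"},
    "ein": {"ART-INDEF", "ADV", "PTKVZ"},
}

def impossible_unigrams(positions, lexicon):
    """Yield corpus positions whose tag conflicts with the lexicon."""
    for i, (word, tag) in enumerate(positions):
        plausible = lexicon.get(word.lower())
        if plausible is not None and tag not in plausible:
            yield i, word, tag

corpus = [("die", "ART-MASC-SG-NOM"), ("ein", "ART-DEF")]
for i, word, tag in impossible_unigrams(corpus, lexicon):
    print(f"suspect position {i}: {word}/{tag}")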
A particular case of locally recognizable errors is constituted by numerals denoting round thousands, written in digits with a blank before the last three zeroes (e.g., 12 000), which in NEGRA® are systematically tokenized/segmented as two cardinal numbers following each other; e.g., 12 000 is segmented as

<position> 12 tag = CARD <end of position>
<position> 000 tag = CARD <end of position>

while it obviously is to be segmented as a single numeral, i.e. as one position:

<position> 12 000 tag = CARD <end of position>
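Such mis-segmentations can be repaired mechanically; below is a sketch (ours; it assumes the corpus is available as a list of (word, tag) positions) that merges a digit-only CARD position with an immediately following CARD position consisting of three zeroes:

def merge_split_thousands(positions):
    """Merge e.g. ('12', 'CARD'), ('000', 'CARD') into one position
    ('12 000', 'CARD'), undoing the faulty segmentation."""
    merged = []
    for word, tag in positions:
        if (merged and tag == "CARD" and word == "000"
                and merged[-1][1] == "CARD" and merged[-1][0].isdigit()):
            prev_word, _ = merged[-1]
            merged[-1] = (prev_word + " " + word, "CARD")
        else:
            merged.append((word, tag))
    return merged

print(merge_split_thousands([("12", "CARD"), ("000", "CARD")]))
# -> [('12 000', 'CARD')]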
2.2 Medium Phase: Impossible Bigrams
The errors described in the previous section were cases of incorrect morphological analysis (e.g., die tagged as masculine singular), errors in lexical analysis (the case of the preposition bis tagged as requiring the dative case) and diverse errors in lemmatization, conversion and segmentation, and if discussed on their own, they had better be classified as such. In fact, calling these kinds of errors "impossible unigrams" (as above) makes little sense apart from serving as a motivation for error detection based on the search for "impossible n-grams", i.e. n-tuples (n ∈ N) of tags which, if occurring as the tags of adjacent words in a text of a particular language, constitute a violation of the (syntactic) rules of this language.

3 The corrections to be performed are not presented, since they might differ from case to case, depending on the particular context.
The starting point for the application of this idea is the search for "impossible bigrams". These, as a rule, do occur in a realistic large-scale PoS-tagged corpus, for the following reasons:
• in a hand-tagged corpus, an "impossible bigram" results from (and unmistakably signals) either an ill-formed text in the corpus body (including wrong conversion) or a human error in tagging
• in a corpus tagged by a statistical tagger, an "impossible bigram" may result also from an ill-formed source text, as above, and further either from incorrect tagging of the training data (i.e. the error was seen as a "correct configuration (bigram)" in the training data, and was hence learned by the tagger) or from the process of so-called "smoothing", i.e. the assignment of non-zero probabilities also to configurations (bigrams, in the case discussed) which were not seen in the learning phase4.
For learning the process of detecting errors in PoS-tagging, let us make the provisional and in practice unrealistic assumption (which we shall correct immediately) that we have at our disposal an error-free and representative (wrt. bigrams) corpus of sentences of a certain language. By saying error-free and representative, we have in mind that, for the case of bigrams:
• any sentence in the set of sentences constituting the corpus is a grammatical sentence of the language in question (error-freeness wrt. source)
• a bigram can occur in a grammatical sentence of the language if and only if it occurs at least once in the corpus; i.e. if a bigram is a possible bigram in the language, it occurs in the corpus (representativity), and if a bigram is an "impossible bigram", it does not occur in the corpus (error-freeness wrt. tagging).
4 This "smoothing" is necessary since - put very simply - otherwise configurations (bigrams) which were not seen during the learning phase cannot be processed if they occur in the text to be tagged.
Given such a (hypothetical) corpus, all the bigrams in the corpus are to be collected into a set CB (correct bigrams), and then the complement of CB with respect to the set of all possible bigrams is to be computed; let this set be called IB (incorrect bigrams).
The idea is now that if any element of IB occurs in a PoS-tagged corpus whose correctness is to be checked, then the two adjacent corpus positions where this happens must contain an error (which can then be corrected).
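Under these (hypothetical) assumptions, both learning CB/IB and flagging suspect spots reduce to a few set operations. A minimal sketch (ours, not the original implementation; sentences are assumed to be given as lists of tags, and the tagset is illustrative):

from itertools import product

def learn_cb_ib(sentences, tagset):
    """CB = bigrams attested in the learning corpus;
    IB = its complement wrt. all tag pairs over the tagset."""
    cb = {(s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1)}
    ib = set(product(tagset, repeat=2)) - cb
    return cb, ib

def suspect_spots(sentence, ib):
    """Yield indices i such that (tag[i], tag[i+1]) is an impossible bigram."""
    for i in range(len(sentence) - 1):
        if (sentence[i], sentence[i + 1]) in ib:
            yield i

tagset = {"ART", "NOUN", "VFIN", "PREP"}
learning = [["ART", "NOUN", "VFIN"], ["PREP", "ART", "NOUN"]]
cb, ib = learn_cb_ib(learning, tagset)
print(list(suspect_spots(["ART", "VFIN", "NOUN"], ib)))  # -> [0, 1]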
When implementing this approach to error detection, it is first of all necessary to realize that the learning of "impossible bigrams" is extremely sensitive both to the error-freeness and to the representativity of the learning corpus:
• the presence of an erroneous bigram in the CB set means that the respective error cannot be detected in the corpus whose correctness is to be checked (even a single occurrence of a bigram in the learning corpus implies the correctness of the bigram),
• the absence of a correct bigram from the CB set causes this bigram to occur in IB, and hence any of its occurrences in the checked corpus to be marked as a possible error (the absence of a bigram from the learning corpus implies the incorrectness of the bigram).
However, the available corpora are neither error-free nor representative. Therefore, in practice, these deficiencies have to be compensated for by appropriate means.
When applying the approach to NEGRA®, we employed:
• bootstrapping, for achieving correctness
• manual pruning of the CB and IB sets, for achieving representativity.
We started by a very careful hand-cleaning of errors in a very small sub-corpus of about 80 sentences (about 1.200 words). From this small corpus, we generated the CB set and pruned it manually, using linguistic knowledge (as well as linguistic imagination) about German syntax.
Based on the resulting CB set, we generated the corresponding IB set and pruned it manually again. The resulting IB set was then used for the automatic detection of "suspect spots" in a sample of the next 500 sentences from the corpus, and for the hand-elimination of errors in this sample where appropriate (obviously, not all IB violations were genuine errors!).
Thus we arrived at a cleaned sample of 580 sentences, which we used in just the same way: generating a CB set, pruning it, generating an IB set and pruning this set, arriving at an IB set which we used for the detection of errors in the whole body of the corpus (about 20.500 sentences, 350.000 positions). The procedure was then re-applied to the whole corpus.
For this purpose, we divided the corpus into four parts of approximately 5.000 sentences each. Then, proceeding in four rounds, the IB set was first generated (without manual checking) from 15.000 sentences and then applied to the rest of the corpus (the respective 5.000-sentence partition).
The corrections based on these results improved the corpus to such an extent that we performed a final round, this time dividing the corpus into 20 partitions of approximately 1.000 sentences each and re-applying the whole process 20 times.
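A sketch of one such round-based pass (ours; it reuses learn_cb_ib and suspect_spots from the sketch above, and leaves the hand-checking of the reported spots to the annotator):

def bootstrap_round(sentences, tagset, n_parts):
    """Split the corpus into n_parts; for each part, learn IB from the
    remaining parts and report suspect spots inside the held-out part
    (cf. the 4- and 20-partition rounds described above)."""
    size = (len(sentences) + n_parts - 1) // n_parts
    parts = [sentences[i:i + size] for i in range(0, len(sentences), size)]
    for k, held_out in enumerate(parts):
        rest = [s for j, part in enumerate(parts) if j != k for s in part]
        _, ib = learn_cb_ib(rest, tagset)
        for s_idx, sent in enumerate(held_out):
            for i in suspect_spots(sent, ib):
                print(f"part {k}, sentence {s_idx}, position {i}: "
                      f"{sent[i]} {sent[i + 1]}")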
2.3 Advanced Phase: Variable-length n-grams
The "impossible bigrams" are a powerful tool for checking the correctness of a corpus; however, they work on a very local scale only, since they are able to detect solely those errors which show up as deviations from the set of possible pairs of adjacent tags. Thus, obviously, quite a number of errors remain undetected by such a strategy.
As an example of such an as yet "undetectable" error in German, we might take the configuration where two words tagged as finite verbs are separated from each other by a string consisting of nouns, adjectives, articles and prepositions only. Such a configuration is erroneous because the rules of German orthography require that some kind of clause separator (comma, dash, coordinating conjunction) occur between two finite verbs5.
5 At stake are true regular finite forms; exempted are words occurring in fixed collocations which do not function as heads of clauses. As an example of such a usage of a finite verb form, one might take the collocation wie folgt, e.g., in the sentence Diese Übersicht sieht wie folgt aus: ... Note that in this sentence, the verb folgt has no subject, which is impossible with any active finite verb form of a German verb subcategorizing for a subject (and possible only marginally with passive forms, e.g., in Gestern wurde getanzt, or - obviously - with verbs which do not subcategorize for a subject, such as frieren, grauen in Mich friert, Mir graut vor Statistik).
In order to be able to detect also this kind of error, the above "impossible bigrams" have to be extended substantially.
In searching for the generalization needed, it is first of all necessary to take a linguistic view of the "impossible bigrams", in other words, to gain a deeper insight into why it is impossible for a certain pair of PoS-tags to occur immediately following each other in any linguistically correct and correctly tagged sentence.
The point is that this indeed does not happen by chance: any "impossible bigram" comes into being as a violation of certain - predominantly syntactic6 - rule(s) of the language. Viewed in more detail, these violations might be of the following nature:
• Violation of constituency. The occurrence of an "impossible bigram" in the text signals that - if the tagging were correct - a basic constituency relation would be violated (resulting in the occurrence of the "impossible bigram"); as an example of such a configuration, we might consider the bigram PREPOSITION - FINITE VERB (a possible German example string: ... für-PREP reiche-VFIN ...). From this it follows that either there is indeed an error in the source text (in our example, probably a missing word, e.g., Der Sprecher der UNO-Hilfsorganisation teilte mit, für Arme reiche diese Hilfe nicht.) or a tagging error has been detected (in the example, e.g., an error as in the sentence ... für reiche Leute ist solche Hilfe nicht nötig ...). The source of the error is in both cases a violation of the linguistic rule postulating that, in German, a preposition must always be followed by a corresponding noun (NP), or at least by an adjectival remnant of this NP7.
6 Examples of other such violations are rare and relate mainly to phonological rules. In English, relevant cases would be the word pairs an table, a apple, provided the tagset were fine-grained enough to express such a distinction; better examples are to be found in other languages, e.g., the case of the ambiguous Czech word se, cf. Oliva (to appear).

7 Unlike English, (standard) German has no preposition stranding or similar phenomena; we disregard colloquial examples like Da weiss ich nix von.

• Violation of feature cooccurrence rules (such as agreement, subcategorization, etc.). The point here is that there exist configurations in which two wordforms (words with certain morphological features), if they occur next to each other, necessarily stand in a particular grammatical relation. This relation, in turn, poses further requirements on the (morphological) features of the two wordforms, and if these requirements are not met, the tags of the two wordforms result in an "impossible bigram".
Let us take an example again, this time with tags expressing also morphological characteristics: if the words ... Staaten schickt ... are tagged as Staaten-NOUN-MASC-PL-NOM and schickt-MAINVERB-PRES-ACT-SG, then the respective tags NOUN-MASC-PL-NOM and MAINVERB-PRES-ACT-SG (in this order) create an "impossible bigram".
The reason why this bigram is impossible is the following: if a noun in the nominative case occurs in a German clause headed by a finite main verb other than sein/werden (which, however, are not tagged as main verbs in the STTS tagset used in NEGRA®), then either this noun must be the verb's subject, which in turn requires that the noun and the verb agree in number, or the noun is part of a coordinated subject, in which case the verb must be in the plural. The configuration from the example meets neither of these conditions, and hence it generates an "impossible bigram".
The central observation is now that the property of being an impossible configuration is often retained even after the components of the "impossible bigram" get separated by material occurring between them.
Thus, for example, in both our examples the property of being an impossible configuration is preserved if an adverb is placed in between, thus creating an "impossible trigram". In particular, in the first example, the configuration PREP ADV VFIN cannot be a valid trigram, for exactly the same reason for which PREP VFIN was not a valid bigram: ADV is not a valid NP remnant.
In the second case, the configuration NOUN-MASC-PL-NOM ADV MAINVERB-PRES-ACT-SG is not a valid trigram either, since obviously the presence (or absence) of an adverb in the sentence does not change the subject-verb relation in the sentence.
In fact, due to the recursivity of language, two, three, and indeed any number of adverbs would not make the configurations grammatical, and hence would not disturb the error detection potential of the "extended impossible bigrams" from the examples.
These linguistic considerations have a straightforward practical impact. Provided an error-free and representative (in the above sense) corpus is available, it is possible to construct the IB set.
Then, for each bigram [First, Second] from this set, it is possible to collect all trigrams of the form [First, Between, Second] occurring in the corpus, and to collect all the possible tags Between in the set Possible_Inner_Tags.
Furthermore, given the impossible bigram [First, Second] and the respective set Possible_Inner_Tags, the learning corpus is to be searched for all tetragrams [First, Middle_1, Middle_2, Second].
In case one of the tags Middle_1, Middle_2 already occurs in the set Possible_Inner_Tags, no action is to be taken; but in case the set Possible_Inner_Tags contains neither of Middle_1, Middle_2, both tags are to be added to the set Possible_Inner_Tags.
The same action is then to be repeated for pentagrams, hexagrams, etc., until the maximal sentence length in the learning corpus prevents any further prolongation of the n-grams and the process terminates.
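A sketch of this collection step (ours; the names follow the text, sentences are again lists of tags, and the tetragram rule of the text is generalized to longer spans as "add the intervening tags only if none of them is licensed already"):

def collect_possible_inner_tags(first, second, sentences):
    """For the impossible bigram [first, second], widen the window from
    trigrams upwards and collect the licensed intervening tags."""
    possible_inner = set()
    longest = max((len(s) for s in sentences), default=0)
    for gap in range(1, max(longest - 1, 1)):   # number of tags in between
        for s in sentences:
            for i in range(len(s) - gap - 1):
                if s[i] == first and s[i + gap + 1] == second:
                    between = s[i + 1:i + gap + 1]
                    # add the span's tags only if none is in the set yet
                    if not possible_inner.intersection(between):
                        possible_inner.update(between)
    return possible_inner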
If the set Impossible_Inner_Tags is now constructed as the complement of Possible_Inner_Tags relative to the whole tagset, then any n-gram consisting of the tag First, followed by any number of tags from the set Impossible_Inner_Tags, and finally by the tag Second is very likely to be an n-gram impossible in the language; hence, if it occurs in the corpus whose correctness is to be checked, it is to be signalled as a "suspect spot".
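The detection side then amounts to scanning for the pattern First (Impossible_Inner_Tags)* Second; a minimal sketch (ours, with illustrative tag names):

def ngram_suspects(sentence, first, second, impossible_inner):
    """Yield spans (i, j) with sentence[i] == first, sentence[j] == second,
    and every tag strictly between them in Impossible_Inner_Tags."""
    for i, tag in enumerate(sentence):
        if tag != first:
            continue
        for j in range(i + 1, len(sentence)):
            if sentence[j] == second:
                yield i, j                      # suspect span found
            if sentence[j] not in impossible_inner:
                break                           # span no longer extendable

impossible_inner = {"ADV"}   # e.g., ADV is not a valid NP remnant
print(list(ngram_suspects(["PREP", "ADV", "ADV", "VFIN"],
                          "PREP", "VFIN", impossible_inner)))  # -> [(0, 3)]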
Obviously, this idea is again based on the assumption of the error-freeness and representativity of the learning corpus, so that for training on a realistic corpus, the correctness of the resulting "impossible n-grams" has to be hand-checked. This, however, is well worth the effort, since the resulting "impossible n-grams" are an extremely efficient tool for error detection.
The implementation of the idea is a straightforward extension of the above approach to "impossible bigrams".
The above approach does not guarantee, however, that all "impossible n-grams" are considered. In particular, an "impossible trigram" [First, Second, Third] cannot be detected as such (i.e. as impossible) if [First, Second], [Second, Third] and [First, Third] are all possible bigrams (i.e. they all belong to the set CB).
Such an "impossible trigram" in German is, e.g., [nominative-noun, mainverb, nominative-noun]: this trigram is impossible8 since no German verb apart from sein/werden (which, as said above, are not tagged as main verbs in NEGRA®) can occur in a context where a nominative noun stands both to its right and to its left; however, all the respective bigrams occur quite commonly (e.g., Johann schläft, Jetzt schläft Johann, König Johann schläft).
Here, an obvious generalization of the approach from "impossible bigrams" to "impossible trigrams" (and "impossible tetragrams", etc.) is possible; however, we did not perform this in full, due to the number of possible trigrams as well as the data sparseness problem, which, taken together, would make the manual checking of the results unfeasible in practice.
Instead, we applied only about 20 "impossible trigrams" and 6 "impossible tetragrams" stemming from "linguistic invention" (such as the trigram discussed above).
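Such hand-invented patterns can be checked in the same spirit; a sketch (ours, with illustrative predicates over STTS-like tag strings) for the trigram discussed above:

def handcrafted_suspects(sentence, patterns):
    """Report occurrences of hand-written impossible n-grams, given as
    lists of predicates over tags, e.g. [nom-noun, mainverb, nom-noun].
    (Metalinguistic contexts, cf. footnote 8, would need lexical
    exceptions on top of this.)"""
    for pattern in patterns:
        n = len(pattern)
        for i in range(len(sentence) - n + 1):
            if all(p(t) for p, t in zip(pattern, sentence[i:i + n])):
                yield i, n

is_nom_noun = lambda t: t.startswith("NOUN") and t.endswith("NOM")
is_mainverb = lambda t: t.startswith("MAINVERB")
patterns = [[is_nom_noun, is_mainverb, is_nom_noun]]

sent = ["NOUN-MASC-SG-NOM", "MAINVERB-PRES-ACT-SG", "NOUN-FEM-SG-NOM"]
print(list(handcrafted_suspects(sent, patterns)))  # -> [(0, 3)]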
3 Evaluation of the Results
By means of the error-detection techniques described above, we were able to correct 2.661 errors in the NEGRA® corpus. These errors were of all the sorts mentioned in Sect. 1; however, the prevailing part consisted of incorrect tagging (less than 8 % were genuine source errors, and about 26 % were errors in segmentation). Altogether, this resulted in changes on 3.774 lines of the corpus; the rectification of the segmentation errors reduced the number of corpus positions by over 700, from 355.096 to 354.3549.
After finishing the corrections, we experimented with training and testing the TnT tagger (Brants, 2000) on the "old" and on the "corrected" version of the corpus, dividing each version into ten parts, each part having parallel starting and end positions in each of the versions, and then running the system ten times, each time training on nine parts and testing on the tenth part, and finally computing the mean of the quality results.

8 Exempted are quotations and other metalinguistic contexts, such as Der Fluss heisst Donau, Peter übersetzte Faust - eine Tragödie ins Englische als Fist - one tragedy, which, however, are as a rule lexically specific and hence can be coped with as such.

9 Which is a much nicer number than 355.096, and thus an additional motivation for correcting corpora.
In doing so, we arrived at the following results:
• in the most interesting final experiment, the training was performed on the "old" and the testing on the "corrected" NEGRA®; as a result, the tags assigned by TnT differed from the hand-assigned tags in the test sections on (together) 12.075 positions (out of the total of 354.354), yielding an error rate of 3,41 %.
These results show that there was only a negligible (and, according to the C test, statistically insignificant) difference between the results for the case when the tagger was both trained and tested on the "old" corpus and the case when it was both trained and tested on the "corrected" corpus.
However, the difference in the error rate when the tagger was trained once on the "old" and once on the "corrected" version, and in both cases tested on the "corrected" version10, amounted to a relative error improvement of 9,97 %. This improvement documents the old and hardly surprising truth that - apart from its size - also the correctness of the training data is absolutely essential for the results of a statistical tagger.
10 We did not perform training on the "corrected" corpus and testing on the "old" one, because it is not clear how the results of such an experiment should be evaluated: in particular, in such a case it is to be expected that the tags assigned by the tagger and the ones in the "pyrite standard" (since it can hardly be called "golden", then) would often differ due to an error in the "standard", and hence measuring the accuracy of the tagger's results is problematic at best within such an architecture.
4 Conclusions
