phrase                 score
word distribution      0.003038
word error             0.003022
topic model            0.002955
topic words            0.002932
same word              0.002770
document word          0.002764
high word              0.002683
word recognition       0.002594
single word            0.002573
word distributions     0.002571
word types             0.002568
word type              0.002566
top word               0.002561
word tokens            0.002553
original word          0.002512
low word               0.002510
word sim               0.002506
long word              0.002505
word vectors           0.002482
whole word             0.002472
true word              0.002472
word occurrence        0.002469
account word           0.002465
word identities        0.002464
lectn word             0.002464
ocr data               0.002431
different words        0.002416
error model            0.002381
text data              0.002126
such data              0.002079
topic models           0.002078
only words             0.002031
ing words              0.002019
training topic         0.001960
lda model              0.001959
function words         0.001955
english words          0.001938
noise model            0.001934
model quality          0.001921
model inference        0.001915
words distributions    0.001907
data figure            0.001905
top words              0.001897
words tokens           0.001889
corrupted words        0.001880
first topic            0.001868
ment model             0.001868
real words             0.001850
document topic         0.001843
long words             0.001840
performance topic      0.001837
known words            0.001833
multinomials model     0.001832
trained model          0.001829
individual words       0.001820
unknown words          0.001812
probable words         0.001807
topic analysis         0.001806
quently words          0.001801
sufficient words       0.001801
noisy data             0.001782
corrupted data         0.001779
final data             0.001777
ocred data             0.001765
enron data             0.001754
topic modeling         0.001752
work topic             0.001752
language models        0.001746
eisenhower data        0.001742
diverse data           0.001738
clean data             0.001737
synthetic data         0.001733
reuters data           0.001725
newsgroups data        0.001700
unprocessed data       0.001698
heldout data           0.001698
groups data            0.001698
topic labels           0.001680
lda topic              0.001679
topic quality          0.001641
supervised topic       0.001641
model                  0.001617
words                  0.001594
particular topic       0.001591
pervised topic         0.001578
topic vectors          0.001561
narrow topic           0.001545
assigned topic         0.001543
ocr errors             0.001435
ocr algorithm          0.001346
topic                  0.001337
such models            0.001327
ocr corpus             0.001285
ocr output             0.001275
ocr noise              0.001256
language processing    0.001242
other topics           0.001240
language mod           0.001232
noisy ocr              0.001229
ocr engine             0.001221