same tokenization 0.0020863
different sentence 0.002004924
different fragments 0.001975302
sentence tokenization 0.001937994
large corpus 0.001926326
newswire corpus 0.001702126
corpus evidences 0.001672068
same source 0.001670161
corpus investigation 0.001666911
concrete corpus 0.001656229
corpus verification 0.001653348
different grammar 0.001642965
english word 0.00160334
same set 0.001602333
other information 0.001583434
critical tokenization 0.001574446
different sentences 0.001527045
possible tokenization 0.001516785
such fragment 0.001516331
different context 0.001485247
critical fragment 0.001483824
tokenization performance 0.001479518
tokenization accuracy 0.001472211
character fragment 0.0014646
english sentence 0.001462384
different tokenizations 0.001447511
first fragment 0.001441177
corpus 0.00142667
tokenization dictionary 0.00139874
correct tokenization 0.001391765
different people 0.001387094
different contexts 0.001386748
chinese fragment 0.001378449
maximum tokenization 0.001375347
explicit word 0.001372264
critical fragments 0.001369518
different readings 0.001361751
different okenizations 0.001354063
tokenization ambiguity 0.001353638
tokenization criteria 0.001352448
tokenization disambiguation 0.001340624
overall tokenization 0.001317421
other sentences 0.001315767
manual tokenization 0.001314048
tokenization errors 0.001308036
fragment forms 0.001305241
chinese language 0.001304958
tokenization strategies 0.001291987
tokenization practice 0.001288736
tokenization theory 0.001288634
tokenization consistencies 0.001282647
english words 0.001262652
sentence tree 0.00126224
same clarity 0.001257573
word delimiters 0.001251581
ambiguous fragment 0.001220965
other types 0.001218541
linguistic words 0.001210743
fragment hat 0.001194976
language processing 0.001192886
fragment abcd 0.00119264
fragment abc 0.00119264
fragment okenization 0.00119264
underlying sentence 0.001177658
other occurrences 0.001160466
arbitrary sentence 0.001146792
natural language 0.001134211
entry fragments 0.001121314
sentence segment 0.001110132
sentence formation 0.001107493
ambiguous fragments 0.001106659
single source 0.001099195
controversial fragments 0.001092573
fragments hat 0.00108067
english character 0.001079612
questionable fragments 0.001077856
tokenization 0.00105665
multiple languages 0.001054547
good results 0.001036572
comparable results 0.000981441
fragment 0.000966028
multiple tokens 0.000962303
mutual information 0.000946045
representative corpora 0.000934378
english dictionary 0.00092313
single tokens 0.000916612
data problem 0.000911056
chinese characters 0.000910067
large number 0.00089664
language 0.000892537
many ways 0.000882592
sentence 0.000881344
character string 0.000881261
source noticing 0.000866562
fragments 0.000851722
preceding results 0.000843176
critical tokenizations 0.000841727
token score 0.000834736
data the 0.000831911
machine translation 0.000831089
