same system 0.002559328
same test 0.002328338
different scores 0.002199253
system results 0.0021822359999999997
task test 0.002166119
test set 0.002144229
rte entailment 0.002059694
system ranking 0.002005162
same score 0.001976668
system rankings 0.0019168549999999999
system rank 0.001879625
system behavior 0.0018736039999999999
system responses 0.0018638729999999998
system internals 0.00185958
system devel 0.00185753
system developer 0.0018547589999999999
system builders 0.0018547589999999999
test pairs 0.001845174
test pair 0.001817613
different judges 0.0017998250000000001
test cases 0.001784716
entailment task 0.0017574729999999998
same set 0.001749247
test sets 0.001730138
individual test 0.001690355
many rte 0.0016689860000000001
test case 0.001661528
evaluation task 0.001648874
different nist 0.00163127
evaluation set 0.0016269840000000002
balanced test 0.001622371
main rte 0.0016162530000000002
other evaluation 0.001608767
human agreement 0.001594961
system 0.00159265
different assessors 0.0015569400000000001
human judges 0.001554105
different justifications 0.0015241500000000002
different choice 0.001512756
different organizations 0.001509547
accuracy scores 0.001496276
score pair 0.0014659430000000002
main entailment 0.001462587
score value 0.00144051
entailment pairs 0.0014365279999999999
textual entailment 0.0014350679999999998
correctness score 0.0014193
high score 0.001416816
new annotation 0.001410307
entailment pair 0.001408967
user task 0.0014042479999999999
rte annotators 0.0014014510000000002
task results 0.001394045
mean score 0.0013936170000000002
particular score 0.0013892460000000002
understandability score 0.0013879920000000002
rte organizers 0.0013722860000000001
several systems 0.001364861
correctness scores 0.0013617430000000001
human understanding 0.001355321
first set 0.0013494190000000001
annotation rules 0.0013493350000000001
mean scores 0.00133606
reference systems 0.0013305489999999999
understandability scores 0.001330435
same order 0.001321615
main task 0.0013140320000000001
entailment decision 0.001305723
human reading 0.001304301
human users 0.001299935
human annotators 0.001295871
human judgments 0.0012924030000000001
entailment decisions 0.001286615
correct entailment 0.00128556
evaluation tasks 0.001282338
ond score 0.001277705
same justification 0.001277307
racy score 0.0012750980000000001
derstandability score 0.001271905
integer score 0.001271905
rectness score 0.001271905
human tweaking 0.001263646
annotation rule 0.001256314
separate scores 0.0012546950000000001
nlp phenomena 0.001230924
ness scores 0.0012294200000000002
same thing 0.00122895
same organization 0.00122895
racy scores 0.001217541
second set 0.001216694
candidate entailment 0.001215264
task reference 0.001196177
effective evaluation 0.00118961
task nist 0.001188909
decision task 0.001157168
justification evaluation 0.001155044
unknown annotation 0.001148435
such reports 0.001124788
such assertions 0.001124788
evaluation design 0.001123817
