=== Skip and tie rates per subtask ===
  a-en: n=9343, skip=3105 (33.2%), tie=1133 (12.1%)
  a-es: n=2643, skip=336 (12.7%), tie=380 (14.4%)
  a-zh: n=2256, skip=153 (6.8%), tie=333 (14.8%)
  b1: n=1458, skip=236 (16.2%), tie=273 (18.7%)
  b2: n=1177, skip=111 (9.4%), tie=212 (18.0%)

=== Pairwise agreement on overlapping items ===
  a-en: items=126, raw agreement=0.440, Cohen's kappa=0.114, Krippendorff's alpha=0.137
  a-es: items=64, raw agreement=0.516, Cohen's kappa=0.207, Krippendorff's alpha=0.213
  a-zh: items=34, raw agreement=0.588, Cohen's kappa=0.344, Krippendorff's alpha=0.354
  b1: items=38, raw agreement=0.450, Cohen's kappa=0.127, Krippendorff's alpha=0.146
  b2: items=26, raw agreement=0.357, Cohen's kappa=-0.062, Krippendorff's alpha=-0.040
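
  Note: the gap between observed agreement and kappa above comes from
  the chance correction. The sketch below shows the textbook two-rater
  Cohen's kappa, assuming each rater gives one categorical label per
  item (for example "A", "B", or "tie"). The function name and the
  label encoding are illustrative assumptions, not the script that
  produced this report.

```python
from collections import Counter


def cohens_kappa(labels_1, labels_2):
    """Two-rater Cohen's kappa for paired categorical labels.

    Hypothetical helper: assumes labels_1[i] and labels_2[i] are the
    two raters' labels for the same item.
    """
    n = len(labels_1)
    if n == 0 or n != len(labels_2):
        raise ValueError("need two non-empty label lists of equal length")

    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_1, labels_2)) / n

    # Expected agreement if the raters labeled independently,
    # each following their own marginal label distribution.
    counts_1, counts_2 = Counter(labels_1), Counter(labels_2)
    p_e = sum(counts_1[k] * counts_2[k] for k in counts_1) / (n * n)

    if p_e == 1.0:
        return float("nan")  # both raters used a single identical label
    return (p_o - p_e) / (1.0 - p_e)
```

  Kappa is negative whenever observed agreement falls below the
  chance-expected level, which is the situation the b2 row reports.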

=== BT model vs majority human vote ===
  Pairs compared: 10408; BT-majority agreement: 0.631
  Spearman(BT margin, majority margin): 0.322

=== Split-half system-ranking reliability ===
  Split each subtask's votes into two halves, compute per-system win
  rates in each half, and report Spearman rho between the two rankings.
  a-en: systems=31, Spearman(half_A, half_B) win-rate rho=0.762
  a-es: systems=16, Spearman(half_A, half_B) win-rate rho=0.938
  a-zh: systems=20, Spearman(half_A, half_B) win-rate rho=0.721
  b1: systems=11, Spearman(half_A, half_B) win-rate rho=0.900
  b2: systems=10, Spearman(half_A, half_B) win-rate rho=0.479
  Mean Spearman rho across subtasks: 0.760
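
  Note: a minimal sketch of the split-half procedure described above,
  using only the standard library. Several details are assumptions
  rather than facts about the original script: votes are
  (system_a, system_b, winner) tuples with winner=None for a tie, a
  tie counts as half a win for each side, and the split is a seeded
  random shuffle. All function names are hypothetical.

```python
import random
from collections import defaultdict


def win_rates(votes):
    """Per-system win rate; a tie (winner=None) counts as half a win (assumption)."""
    wins = defaultdict(float)
    games = defaultdict(int)
    for sys_a, sys_b, winner in votes:
        games[sys_a] += 1
        games[sys_b] += 1
        if winner is None:
            wins[sys_a] += 0.5
            wins[sys_b] += 0.5
        else:
            wins[winner] += 1.0
    return {s: wins[s] / games[s] for s in games}


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rho computed as the Pearson correlation of average ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # a constant ranking has no defined correlation
    return cov / (vx * vy) ** 0.5


def split_half_rho(votes, seed=0):
    """Shuffle votes, split in two, and correlate the halves' win rates."""
    shuffled = list(votes)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    rates_a = win_rates(shuffled[:mid])
    rates_b = win_rates(shuffled[mid:])
    shared = sorted(set(rates_a) & set(rates_b))  # systems seen in both halves
    return spearman([rates_a[s] for s in shared], [rates_b[s] for s in shared])
```

  A single random split is noisy when a subtask has few systems or
  few votes; averaging rho over several seeds is one way to stabilize
  the estimate.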

=== Split-by-annotator ranking reliability ===
  Randomly assign each annotator to one of two disjoint pools and
  compute per-system win rates from each; report Spearman rho between
  the two pools' rankings.
  a-en: systems=31, votes=2618/3620 per pool, Spearman(pool_A, pool_B) win-rate rho=0.793
  a-es: systems=16, votes=1358/949 per pool, Spearman(pool_A, pool_B) win-rate rho=0.900
  a-zh: systems=20, votes=924/1179 per pool, Spearman(pool_A, pool_B) win-rate rho=0.791
  b1: systems=11, votes=586/636 per pool, Spearman(pool_A, pool_B) win-rate rho=0.900
  b2: systems=10, votes=537/529 per pool, Spearman(pool_A, pool_B) win-rate rho=0.588
  Mean Spearman rho across subtasks: 0.794
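
  Note: a minimal sketch of the annotator-level split described above.
  It assumes each vote is a tuple whose first field is an annotator ID
  and that annotators are shuffled and divided into two equal-sized
  groups; the original assignment rule (for example an independent
  coin flip per annotator) may differ. The function name and tuple
  layout are hypothetical.

```python
import random


def split_by_annotator(votes, seed=0):
    """Partition votes into two pools so that no annotator appears in both.

    Assumes votes[i][0] is the annotator ID (illustrative layout).
    """
    annotators = sorted({vote[0] for vote in votes})
    random.Random(seed).shuffle(annotators)

    # First half of the shuffled annotators forms pool A, the rest pool B.
    pool_a_ids = set(annotators[: len(annotators) // 2])

    pool_a = [vote for vote in votes if vote[0] in pool_a_ids]
    pool_b = [vote for vote in votes if vote[0] not in pool_a_ids]
    return pool_a, pool_b
```

  Because annotators contribute different numbers of votes, the two
  pools generally end up with unequal vote counts even when the
  annotators themselves are divided evenly. Per-system win rates and
  Spearman rho are then computed from each pool's votes.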
