Hak Hyun Kim


2026

Cross-lingual bias benchmarks such as JBBQ and KoBBQ translate English bias probes and compare scores across languages, assuming the translated probe measures the same construct. We test this assumption at the representation and behavioral levels using 13B-parameter models matched on architecture but differing in language-training regime. A multi-anchor logit lens shows that an English-centric model (Llama 2) processes Japanese and Korean inputs predominantly through English-script predictions in its middle layers, even where Centered Kernel Alignment (CKA) between languages is high: geometric convergence masks English-hub routing. Matched continual-adaptation comparisons show that target-language adaptation reduces this English-script mass, from 0.77 to 0.56 after Japanese adaptation (Swallow) and from 0.78 to 0.71 after Korean adaptation (koen), while balanced bilingual pretraining (LLM-jp) lowers it further to 0.19. Behaviorally, every model is more stereotype-biased in English than in Japanese, with gaps of 0.13 to 0.14, but this asymmetry depends on the target language: against Korean it is weak, and it disappears after Korean adaptation, leaving Korean nearly as stereotype-leaning as English. Yet patching English hub states into target-language processing does not transplant this bias. Cross-lingual bias scores thus reflect genuine language-specific behavior, not an English-pivot artifact, even though the underlying representations are not comparable. We distill this dissociation between representation and behavior into a four-step audit protocol for translated bias benchmarks.
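
As a rough illustration of what an English-script mass could measure, the sketch below sums the share of a next-token distribution that falls on Latin-script tokens. The abstract does not define the metric, so this is an assumption, not the paper's procedure: the function names `is_latin_token` and `latin_script_mass`, the use of Unicode character names to detect script, and the toy distribution are all hypothetical. In a logit-lens setting, `probs` would plausibly be the decoded distribution at one intermediate layer.

```python
import unicodedata


def is_latin_token(token: str) -> bool:
    """Return True if the token has at least one letter and every letter is Latin-script.

    Non-letter characters (digits, punctuation, tokenizer markers such as the
    SentencePiece word-boundary symbol) are ignored when deciding the script.
    """
    letters = [ch for ch in token if ch.isalpha()]
    return bool(letters) and all(
        "LATIN" in unicodedata.name(ch, "") for ch in letters
    )


def latin_script_mass(probs: dict[str, float]) -> float:
    """Share of total probability mass assigned to Latin-script tokens.

    `probs` maps token strings to (possibly unnormalized) probabilities;
    this input format is an assumption made for illustration.
    """
    total = sum(probs.values())
    if total == 0:
        return 0.0
    latin = sum(p for tok, p in probs.items() if is_latin_token(tok))
    return latin / total


# Toy distribution (hypothetical values, not results from the paper).
toy = {"the": 0.5, "猫": 0.25, "고양이": 0.125, "cat": 0.125}
mass = latin_script_mass(toy)
```

Under this reading, a value near 1 at a middle layer would mean the intermediate predictions are almost entirely Latin-script even though the input is Japanese or Korean, and a lower value would mean more mass on target-script tokens.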