Goody Ben Horin
2026
Scaling Sentence Similarity for Classical Tibetan with Automatic Annotations
Shay Cohen | Jingyi Yang | Gal Rabinovitz | Sonam Choden | Ofir Shtrosberg | Nicola Bajetta | Goody Ben Horin | Rebecca Sundén | Omri Drori | Sonam Jamtsho | Dorji Wangchuk | Kfir Bar | Orna Almogi | Shai Fine
Proceedings of the 6th International Conference on Natural Language Processing for the Digital Humanities
Identifying intertextual parallels is central to philology, traditionally requiring labor-intensive manual analysis. While digitized historical corpora enable automated approaches using semantic sentence embeddings, training such models requires large annotated datasets, which are scarce for low-resource languages. We address this challenge by introducing a scalable automatic annotation pipeline for training semantic embedding models for Classical Tibetan. Our method combines unsupervised contrastive bootstrapping with iterative pair mining, generating silver-standard similarity labels through two complementary annotation strategies: (1) an ensemble of embedding models and rerankers, and (2) an LLM-as-a-judge committee using best–worst scaling. When combined with a domain-specific, gold-standard annotated dataset for sequential fine-tuning, the resulting text-similarity model achieves a state-of-the-art Spearman correlation of 0.864 on the STS task. This enables effective semantic search in Classical Tibetan and provides a framework for automatic supervision in low-resource languages used in digital humanities. We will make our code, dataset, and trained model publicly available upon publication.
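As a rough illustration of the best–worst scaling step mentioned in the abstract, the sketch below uses the standard counting estimator: each judge sees a small tuple of candidate sentence pairs and names the most and least similar one, and an item's score is (times chosen best minus times chosen worst) divided by the number of times it was shown. The abstract does not describe the aggregation the authors actually use, so the function name `best_worst_scores`, the tuple format, and the pair identifiers `p1` to `p4` are all hypothetical.

```python
from collections import defaultdict


def best_worst_scores(judgments):
    """Aggregate best-worst judgments into scores in [-1, 1].

    `judgments` is an iterable of (items, best, worst) tuples, where
    `items` is the tuple of candidates shown to one judge and `best` /
    `worst` are the judge's picks. This format is an assumption made
    for illustration, not the paper's data schema.
    """
    best = defaultdict(int)
    worst = defaultdict(int)
    seen = defaultdict(int)
    for items, b, w in judgments:
        if b not in items or w not in items or b == w:
            raise ValueError("best and worst must be distinct members of items")
        for item in items:
            seen[item] += 1
        best[b] += 1
        worst[w] += 1
    # Counting estimator: (#best - #worst) / #appearances
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}


# Three hypothetical judges rate the same 4-tuple of candidate pairs.
judgments = [
    (("p1", "p2", "p3", "p4"), "p1", "p4"),
    (("p1", "p2", "p3", "p4"), "p1", "p3"),
    (("p1", "p2", "p3", "p4"), "p2", "p4"),
]
scores = best_worst_scores(judgments)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Scores of this kind can serve as silver-standard similarity labels after rescaling to whatever range the downstream model expects; that rescaling is likewise not specified in the abstract.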