Sonam Choden
2026
Scaling Sentence Similarity for Classical Tibetan with Automatic Annotations
Shay Cohen | Jingyi Yang | Gal Rabinovitz | Sonam Choden | Ofir Shtrosberg | Nicola Bajetta | Goody Ben Horin | Rebecca Sundén | Omri Drori | Sonam Jamtsho | Dorji Wangchuk | Kfir Bar | Orna Almogi | Shai Fine
Proceedings of the 6th International Conference on Natural Language Processing for the Digital Humanities
Identifying intertextual parallels is central to philology, traditionally requiring labor-intensive manual analysis. While digitized historical corpora enable automated approaches using semantic sentence embeddings, training such models requires large annotated datasets, which are scarce for low-resource languages. We address this challenge by introducing a scalable automatic annotation pipeline for training semantic embedding models for Classical Tibetan. Our method combines unsupervised contrastive bootstrapping with iterative pair mining, generating silver-standard similarity labels through two complementary annotation strategies: (1) an ensemble of embedding models and rerankers, and (2) an LLM-as-a-judge committee using best–worst scaling. When combined with a domain-specific, gold-standard annotated dataset for sequential fine-tuning, the resulting text-similarity model achieves a state-of-the-art Spearman correlation of 0.864 on the STS task. This enables effective semantic search in Classical Tibetan and provides a framework for automatic supervision in low-resource languages used in digital humanities. We will make our code, dataset, and trained model publicly available upon publication.
2025
DharmaBench: Evaluating Language Models on Buddhist Texts in Sanskrit and Tibetan
Kai Golan Hashiloni | Shay Cohen | Asaf Shina | Jingyi Yang | Orr Meir Zwebner | Nicola Bajetta | Guy Bilitski | Rebecca Sundén | Guy Maduel | Ryan Conlon | Ari Barzilai | Daniel Mass | Shanshan Jia | Aviv Naaman | Sonam Choden | Sonam Jamtsho | Yadi Qu | Harunaga Isaacson | Dorji Wangchuk | Shai Fine | Orna Almogi | Kfir Bar
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
We assess the capabilities of large language models on tasks involving Buddhist texts written in Sanskrit and Classical Tibetan—two typologically distinct, low-resource historical languages. To this end, we introduce DharmaBench, a benchmark suite comprising 13 classification and detection tasks grounded in Buddhist textual traditions: six in Sanskrit and seven in Tibetan, with four shared across both. The tasks are curated from scratch, tailored to the linguistic and cultural characteristics of each language. We evaluate a range of models, from proprietary systems like GPT-4o to smaller, domain-specific open-weight models, analyzing their performance across tasks and languages. All datasets and code are publicly released, under the CC-BY-4 License and the Apache-2.0 License respectively, to support research on historical language processing and the development of culturally inclusive NLP systems.
Co-authors
- Orna Almogi 2
- Nicola Bajetta 2
- Kfir Bar 2
- Shai Fine 2
- Sonam Jamtsho 2
- Rebecca Sundén 2
- Dorji Wangchuk 2
- Jingyi Yang 2
- Ari Barzilai 1
- Guy Bilitski 1
- Shay Cohen 1
- Shay Cohen 1
- Ryan Conlon 1
- Omri Drori 1
- Kai Golan Hashiloni 1
- Goody Ben Horin 1
- Harunaga Isaacson 1
- Shanshan Jia 1
- Guy Maduel 1
- Daniel Mass 1
- Aviv Naaman 1
- Yadi Qu 1
- Gal Rabinovitz 1
- Asaf Shina 1
- Ofir Shtrosberg 1
- Orr Meir Zwebner 1