Nicola Bajetta
2026
Automatic Segmentation of Classical Tibetan Texts into Autochthonous and Allochthonous Regions
Guy Bilitski | Lev Shechter | Sonam Jamtsho | Nir Marciano | Nicola Bajetta | Rebecca Sunden | Omri Drori | Kai Golan Hashiloni | Orr Zwebner | Asaf Shina | Orna Almogi | Dorji Wangchuk | Kfir Bar
Proceedings of the Fifteenth Language Resources and Evaluation Conference
We introduce a new computational framework for segmenting Classical Tibetan texts into autochthonous and allochthonous regions, distinguishing between indigenous Tibetan compositions and translated materials, primarily from Sanskrit sources. To support this task, we release the first annotated Tibetan corpus for ALLO/AUTO segmentation and evaluate several multilingual encoders, including mBERT and XLM-R, fine-tuned for sequence labeling. Our best model achieves strong alignment with expert annotations, showing that multilingual representations can effectively capture philological boundaries in low-resource settings. This work contributes new resources and methods for computational philology and sheds light on the linguistic markers that trace the intercultural transmission of Buddhist thought in Tibet.
2025
DharmaBench: Evaluating Language Models on Buddhist Texts in Sanskrit and Tibetan
Kai Golan Hashiloni | Shay Cohen | Asaf Shina | Jingyi Yang | Orr Meir Zwebner | Nicola Bajetta | Guy Bilitski | Rebecca Sundén | Guy Maduel | Ryan Conlon | Ari Barzilai | Daniel Mass | Shanshan Jia | Aviv Naaman | Sonam Choden | Sonam Jamtsho | Yadi Qu | Harunaga Isaacson | Dorji Wangchuk | Shai Fine | Orna Almogi | Kfir Bar
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
We assess the capabilities of large language models on tasks involving Buddhist texts written in Sanskrit and Classical Tibetan, two typologically distinct, low-resource historical languages. To this end, we introduce DharmaBench, a benchmark suite comprising 13 classification and detection tasks grounded in Buddhist textual traditions: six in Sanskrit and seven in Tibetan, with four shared across both. The tasks are curated from scratch and tailored to the linguistic and cultural characteristics of each language. We evaluate a range of models, from proprietary systems like GPT-4o to smaller, domain-specific open-weight models, analyzing their performance across tasks and languages. All datasets and code are publicly released, under the CC BY 4.0 license and the Apache-2.0 license respectively, to support research on historical language processing and the development of culturally inclusive NLP systems.