Stefan Krsteski
2026
Valid Survey Simulations with Limited Human Data: The Roles of Prompting, Fine-Tuning, and Rectification
Stefan Krsteski | Giuseppe Russo | Serina Chang | Robert West | Kristina Gligori\'c
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Stefan Krsteski | Giuseppe Russo | Serina Chang | Robert West | Kristina Gligori\'c
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Surveys provide valuable insights into public opinion and behavior, but their execution is costly and slow. Large language models (LLMs) have been proposed as a scalable, low-cost substitute for human respondents, but their outputs are often biased and yield invalid estimates. We study the interplay between synthesis methods that use LLMs to generate survey responses and rectification methods that debias population estimates, and explore how human responses are best allocated between them. Using two panel surveys with questions on nutrition, politics, and economics, we find that synthesis alone introduces substantial bias (24–86%), whereas combining it with rectification reduces bias below 5% and increases effective sample size by up to 14%. Overall, we challenge the common practice of using all human responses for fine-tuning, showing that under a fixed budget, allocating most to rectification results in far more effective estimation.
2025
Towards Open Foundation Language Model and Corpus for Macedonian: A Low-Resource Language
Stefan Krsteski | Borjan Sazdov | Matea Tashkovska | Branislav Gerazov | Hristijan Gjoreski
Proceedings of the 10th Workshop on Slavic Natural Language Processing (Slavic NLP 2025)
Stefan Krsteski | Borjan Sazdov | Matea Tashkovska | Branislav Gerazov | Hristijan Gjoreski
Proceedings of the 10th Workshop on Slavic Natural Language Processing (Slavic NLP 2025)
The increase in technological adoption worldwide comes with demands for novel tools to be used by the general population. Large Language Models (LLMs) provide a great opportunity in this respect, but their capabilities remain limited for low-resource languages, restricting applications in countries where such languages are spoken. We create several resources to facilitate the adoption of LLMs and to support research advancements for Macedonian. We collect the largest Macedonian corpus to date, consisting of 40GB of textual data and totaling 3.5B words. To support conversational applications, we collect a 106k-instance instruction dataset, carefully built to be culturally grounded. For evaluation, we construct a Macedonian evaluation suite covering seven benchmarks. Finally, we train domestic-yak, a state-of-the-art 8B-parameter model, on our curated datasets and evaluate it against eight baseline models using the newly constructed benchmark suite. Our model outperforms all existing models in the 8B parameter range across all benchmarks, and achieves performance comparable to models up to 10× larger. Furthermore, a qualitative analysis with native speakers reveals that our model is preferred over larger counterparts, receiving higher ratings for grammatical correctness and cultural appropriateness. All datasets, code, and model weights are openly released, setting a foundation for advancing LLMs in similarly underrepresented languages. These resources are publicly available at https://github.com/LVSTCK for source code, and at https://huggingface.co/LVSTCK for pretrained model weights and data.