Saee Saadat


2026

A wide array of NLP/NLU models have been developed for the Persian language and have shown promising results. However, the performance of such models drops significantly when applied to the colloquial form of Persian. This challenge arises from the substantial differences between colloquial and formal Persian and the lack of parallel data facilitating the robustness of the model to the colloquial data or to transform the data to formal Persian. In addressing this gap, our research is dedicated to the development of the HarfoSokhan dataset, a large-scale colloquial to formal Persian parallel dataset of 6M sentence pairs. Our proposed dataset is a critical resource for training models that can effectively bridge the linguistic variations between colloquial and formal Persian. To illustrate the utility of our dataset, we used it to train a GPT2 model, which exhibited remarkable proficiency in colloquial to formal text style transfer, outperforming both OpenAI’s GPT-3.5-turbo model and a leading rule-based system in this task. This conclusion is supported by our proposed ranking-based human evaluation. The results underscore the significance of the HarfoSokhan dataset in enhancing the performance of natural language processing models in the challenging task of colloquial to formal Persian conversion.