Mohammad Hossein Shalchian
2026
PersianAnonymizer: Evaluating LLM-Labeled Training for Efficient NER-based Anonymization in Persian
Mohammad Hossein Shalchian | Mostafa Amiri | Amir Mahdi Sadeghzadeh
Proceedings of the Fifteenth Language Resources and Evaluation Conference
We target practical anonymization of Persian customer chats by training a compact NER model from LLM-labeled supervision and selecting the best labeler for deployment. We compare three instruction-tuned LLMs—DEEPSEEKV3-0324, GPT-OSS-120B, and QWEN3-235B-A22B-INSTRUCT-2507—as labelers, each producing span annotations under a shared JSON protocol and yielding four corpora (OSS_ZeroShot, Qwen_ZeroShot, Qwen_FewShot, DeepSeek_FewShot). A MATINAROBERTA-based token classifier is trained per corpus and evaluated with token-level Precision/Recall/F1 (overall and per-class). We also report Label Coverage Recall (LCR), the proportion of gold non-O tokens predicted as non-O, and quantify cross-labeler agreement via a token-level Venn analysis of test annotations. Finally, we contrast the LLMs’ test-set annotation latency on H200 nodes with the trained NER’s test-time labeling on a single RTX 3090. Results show that supervision from OSS_ZeroShot yields the strongest macro-F1 and LCR, while the resulting NER labels an entire 40K-message test set in ∼2 minutes on one consumer GPU. This establishes a practical path to high-quality, low-cost anonymization for Persian industrial data.
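The abstract defines Label Coverage Recall (LCR) as the proportion of gold non-O tokens that the model predicts as non-O, irrespective of whether the predicted entity class matches. A minimal sketch of that computation, assuming aligned BIO tag sequences (the tag names below are illustrative, not from the paper):

```python
def label_coverage_recall(gold_tags, pred_tags):
    """Token-level LCR: fraction of gold non-O tokens predicted as non-O."""
    gold_non_o = [(g, p) for g, p in zip(gold_tags, pred_tags) if g != "O"]
    if not gold_non_o:
        return 0.0
    # A class mismatch (e.g. gold LOC, predicted ORG) still counts as covered,
    # since LCR only asks whether the token was flagged as an entity at all.
    covered = sum(1 for _, p in gold_non_o if p != "O")
    return covered / len(gold_non_o)

gold = ["O", "B-PER", "I-PER", "O", "B-LOC"]
pred = ["O", "B-PER", "O",     "O", "B-ORG"]
print(label_coverage_recall(gold, pred))  # 2 of 3 gold entity tokens covered
```

For an anonymization system this is a natural headline metric: a token that is flagged with the wrong entity class is still masked, whereas a missed entity token leaks personal data.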
Advancing Retrieval-Augmented Generation for Persian: Development of Language Models, Comprehensive Benchmarks, and Best Practices for Optimization
Sara Bourbour Hosseinbeigi | Mohammad Hossein Shalchian | Sina Asghari | Mohammad Ali Seif Kashani | Mohammad Amin Abbasi
Proceedings of the Fifteenth Language Resources and Evaluation Conference
This paper examines the specific obstacles to constructing Retrieval-Augmented Generation (RAG) systems in low-resource languages, with a focus on Persian’s complex morphology and flexible syntax. The research aims to improve retrieval and generation accuracy by introducing Persian-specific models, namely MatinaRoberta (a masked language model) and MatinaSRoberta (a fine-tuned Sentence-BERT), along with a comprehensive benchmarking framework. Three datasets—general knowledge (PQuad), scientifically specialized texts, and organizational reports—were used to assess these models after they were trained on a varied corpus of 73.11 billion Persian tokens. The methodology involved extensive pretraining, fine-tuning with tailored loss functions, and systematic evaluation using both traditional metrics and the Retrieval-Augmented Generation Assessment (RAGAS) framework. The results show that MatinaSRoberta outperformed previous embeddings, achieving superior contextual relevance and retrieval accuracy across datasets. Temperature tuning, chunk-size adjustment, and document summary indexing were explored to enhance RAG setups. Larger models like Llama-3.1 (70B) consistently demonstrated the highest generation accuracy, while smaller models struggled with domain-specific and formal contexts. The findings underscore the potential for developing RAG systems in Persian through customized embeddings and retrieval-generation settings, and highlight the benefit for NLP applications such as search engines and legal document analysis in low-resource languages.