PersianAnonymizer: Evaluating LLM-Labeled Training for Efficient NER-based Anonymization in Persian
Mohammad Hossein Shalchian, Mostafa Amiri, Amir Mahdi Sadeghzadeh
Abstract
We target practical anonymization of Persian customer chats by training a compact NER model from LLM-labeled supervision and selecting the best labeler for deployment. We compare three instruction-tuned LLMs (DEEPSEEKV3-0324, GPT-OSS-120B, and QWEN3-235B-A22B-INSTRUCT-2507) as producers of span annotations under a shared JSON protocol, yielding four corpora (OSS_ZeroShot, Qwen_ZeroShot, Qwen_FewShot, DeepSeek_FewShot). A MATINAROBERTA-based token classifier is trained per corpus and evaluated with token-level Precision/Recall/F1 (overall and per-class). We also report Label Coverage Recall (LCR), the proportion of gold non-O tokens predicted as non-O, and quantify cross-labeler behavior via a token-level Venn diagram over the test annotations. Finally, we contrast the test-set annotation latency of the LLMs on H200 nodes with the trained NER model's test-time labeling on a single RTX 3090. Results show that supervision from OSS_ZeroShot yields the strongest macro-F1 and LCR, while the resulting NER model labels an entire 40K-message test set in ∼2 minutes on one consumer GPU. This establishes a practical path to high-quality, low-cost anonymization for Persian industrial data.
- Anthology ID:
- 2026.lrec-main.352
- Volume:
- Proceedings of the Fifteenth Language Resources and Evaluation Conference
- Month:
- May
- Year:
- 2026
- Address:
- Palma de Mallorca, Spain
- Editors:
- Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
- Venue:
- LREC
- Publisher:
- ELRA Language Resource Association
- Pages:
- 4497–4506
- URL:
- https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.352/
- Cite (ACL):
- Mohammad Hossein Shalchian, Mostafa Amiri, and Amir Mahdi Sadeghzadeh. 2026. PersianAnonymizer: Evaluating LLM-Labeled Training for Efficient NER-based Anonymization in Persian. International Conference on Language Resources and Evaluation, main:4497–4506.
- Cite (Informal):
- PersianAnonymizer: Evaluating LLM-Labeled Training for Efficient NER-based Anonymization in Persian (Shalchian et al., LREC 2026)
- PDF:
- https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.352.pdf
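
The abstract defines Label Coverage Recall (LCR) as the proportion of gold non-O tokens that are predicted as non-O. A minimal sketch of that definition follows, assuming gold and predicted tags are aligned token by token; the function name, the tag names, and the choice to return 0.0 when there are no gold entity tokens are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def label_coverage_recall(
    gold: Sequence[str], pred: Sequence[str], outside: str = "O"
) -> float:
    """Share of gold non-O tokens that the model also tags as non-O.

    The predicted entity type is ignored: a token counts as covered as
    long as it is not left as `outside`.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of tokens")

    # Positions where the gold annotation marks some entity.
    gold_entity_positions = [i for i, tag in enumerate(gold) if tag != outside]
    if not gold_entity_positions:
        # Assumption: report 0.0 when there is nothing to cover.
        return 0.0

    covered = sum(1 for i in gold_entity_positions if pred[i] != outside)
    return covered / len(gold_entity_positions)


# Hypothetical tags for a five-token message.
gold = ["O", "B-PER", "I-PER", "O", "B-PHONE"]
pred = ["O", "B-PER", "O", "O", "B-LOC"]
lcr = label_coverage_recall(gold, pred)
print(f"LCR = {lcr:.3f}")
```

In this example the last token is tagged with the wrong type but still counts as covered, since under the definition above only the non-O decision matters; the missed `I-PER` token is what lowers the score.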