PersianAnonymizer: Evaluating LLM-Labeled Training for Efficient NER-based Anonymization in Persian

Mohammad Hossein Shalchian; Mostafa Amiri; Amir Mahdi Sadeghzadeh

Abstract

We target practical anonymization of Persian customer chats by training a compact NER model on LLM-labeled supervision and selecting the best labeler for deployment. We compare three instruction-tuned LLMs (DEEPSEEKV3-0324, GPT-OSS-120B, and QWEN3-235B-A22B-INSTRUCT-2507) as span annotators under a shared JSON protocol, yielding four corpora (OSS_ZeroShot, Qwen_ZeroShot, Qwen_FewShot, DeepSeek_FewShot). A MATINAROBERTA-based token classifier is trained on each corpus and evaluated with token-level Precision/Recall/F1 (overall and per class). We also report Label Coverage Recall (LCR), the proportion of gold non-O tokens predicted as non-O, and quantify cross-labeler behavior via a token-level Venn diagram on test annotations. Finally, we contrast the test-set annotation latency of the LLMs on H200 nodes with the trained NER model's test-time labeling on a single RTX 3090. Results show that supervision from OSS_ZeroShot yields the strongest macro-F1 and LCR, while the resulting NER model labels the entire 40K-message test set in ∼2 minutes on one consumer GPU. This establishes a practical path to high-quality, low-cost anonymization for Persian industrial data.
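As an illustration of the LCR definition in the abstract (the proportion of gold non-O tokens predicted as non-O), here is a minimal Python sketch. The function name `label_coverage_recall`, the tag names in the example, and the choice to return 0.0 when there are no gold entity tokens are assumptions made for illustration and are not taken from the paper.

```python
from typing import Sequence


def label_coverage_recall(
    gold: Sequence[str], pred: Sequence[str], outside: str = "O"
) -> float:
    """Fraction of gold non-O tokens whose prediction is also non-O.

    The predicted entity type does not have to match the gold type;
    only coverage (entity vs. non-entity) is measured.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    # Positions where the gold annotation marks some entity.
    gold_entity_positions = [i for i, tag in enumerate(gold) if tag != outside]
    if not gold_entity_positions:
        # Assumption: define LCR as 0.0 when no gold entity tokens exist.
        return 0.0
    covered = sum(1 for i in gold_entity_positions if pred[i] != outside)
    return covered / len(gold_entity_positions)


# Hypothetical tags, for illustration only.
gold = ["O", "B-PER", "I-PER", "O", "B-PHONE"]
pred = ["O", "B-PER", "O", "O", "B-ORG"]
lcr = label_coverage_recall(gold, pred)
print(f"LCR = {lcr:.3f}")
```

In this toy example the last token is covered even though its predicted type differs from the gold type. That is why a coverage measure of this kind is informative for anonymization: a token masked under the wrong label is still masked.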

Anthology ID: 2026.lrec-main.352
Volume: Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month: May
Year: 2026
Address: Palma de Mallorca, Spain
Editors: Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue: LREC
Publisher: ELRA Language Resource Association
Pages: 4497–4506
URL: https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.352/
Cite (ACL): Mohammad Hossein Shalchian, Mostafa Amiri, and Amir Mahdi Sadeghzadeh. 2026. PersianAnonymizer: Evaluating LLM-Labeled Training for Efficient NER-based Anonymization in Persian. International Conference on Language Resources and Evaluation, main:4497–4506.
Cite (Informal): PersianAnonymizer: Evaluating LLM-Labeled Training for Efficient NER-based Anonymization in Persian (Shalchian et al., LREC 2026)
PDF: https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.352.pdf
