Validator-Guided Hard Negative Mining for Masked Language Modeling in Low-Resource Ancient Languages

Andrei Voinea

Abstract

Masked language modeling for low-resource ancient languages remains challenging because pre-trained multilingual models lack exposure to these languages. We investigate rule-based linguistic constraints and hard negative mining for Sumerian, a language isolate not included in multilingual BERT’s training data. We build a hierarchical validator that captures subword, word, and part-of-speech patterns from 4,545 annotated sequences, and use it to filter candidates and identify hard negatives for fine-tuning. Vanilla mBERT achieves 18.0% hit@10 accuracy. The validator alone improves this to 72.8%, while hard negative fine-tuning reaches 78.3%. Combining both yields 86.7%, a 68.7 percentage point improvement over vanilla mBERT. A temporal generalization evaluation on tablets from 600 years earlier shows that hard negative mining and the validator each improve performance on their own, but the combined approach underperforms due to the validator’s period-specific rules. These findings demonstrate that hard negative mining transfers across periods, while explicit rule-based constraints provide strong in-domain improvements but limited cross-period generalization.

Anthology ID: 2026.acl-srw.69
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Santosh T.Y.S.S., Juan Diego Rodriguez, Ona de Gibert
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 779–790
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-srw.69/
Cite (ACL): Andrei Voinea. 2026. Validator-Guided Hard Negative Mining for Masked Language Modeling in Low-Resource Ancient Languages. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), pages 779–790, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Validator-Guided Hard Negative Mining for Masked Language Modeling in Low-Resource Ancient Languages (Voinea, ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-srw.69.pdf
