Restoring Missing Spaces in Scraped Hebrew Social Media

Avi Shmidman, Shaltiel Shmidman


Abstract
A formidable challenge in scraped social media corpora is the omission of whitespace, which causes pairs of words to be conflated into a single token. For the text to be properly parsed and analyzed, these missing spaces must be detected and restored. However, whitespace restoration is particularly hard in languages such as Hebrew, which are written without vowels, because a conflated form can often be split into several different pairs of valid words. Thus, a simple dictionary lookup is not feasible. In this paper, we present and evaluate a series of neural approaches to restoring missing spaces in scraped Hebrew social media. Our best all-around method involves pretraining a new character-based BERT model for Hebrew and then fine-tuning a space restoration model on top of it. This method is very fast, high-performing, and open for unrestricted use, providing a practical way to process huge Hebrew social media corpora on a consumer-grade GPU. We release the new BERT model and the fine-tuned space restoration model to the NLP community.
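To make the ambiguity concrete, the following minimal sketch enumerates every way a conflated form can be split into two words from a lexicon. The four-word lexicon, the example token, and the function names are illustrative assumptions and are not taken from the paper. The second helper shows one plausible way to frame space restoration as per-character labeling (1 if a space followed the character in the original text); the abstract does not specify the labeling scheme the authors actually use.

```python
from typing import List, Set, Tuple


def candidate_splits(token: str, lexicon: Set[str]) -> List[Tuple[str, str]]:
    """Return every (left, right) split of `token` whose halves are both in `lexicon`."""
    return [
        (token[:i], token[i:])
        for i in range(1, len(token))
        if token[:i] in lexicon and token[i:] in lexicon
    ]


def to_char_labels(spaced_text: str) -> Tuple[str, List[int]]:
    """Strip spaces and return (unspaced text, labels).

    labels[k] is 1 if a space followed character k in the original text.
    This labeling scheme is an assumption made for illustration.
    """
    chars: List[str] = []
    labels: List[int] = []
    for ch in spaced_text:
        if ch == " ":
            if labels:
                labels[-1] = 1  # mark the character that precedes the space
        else:
            chars.append(ch)
            labels.append(0)
    return "".join(chars), labels


# Hypothetical toy lexicon: "the boy", "walked", "the girl", "to you".
BOY, WALKED, GIRL, TO_YOU = "הילד", "הלך", "הילדה", "לך"
TOY_LEXICON = {BOY, WALKED, GIRL, TO_YOU}

# The same conflated string arises from two different word pairs.
conflated = BOY + WALKED
splits = candidate_splits(conflated, TOY_LEXICON)

if __name__ == "__main__":
    for left, right in splits:
        print(left, right)
    print(to_char_labels(BOY + " " + WALKED))
    print(to_char_labels(GIRL + " " + TO_YOU))
```

Both splits are valid dictionary entries, so a lexicon alone cannot decide between them; the choice depends on the surrounding context, which is what a contextual model can supply.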
Anthology ID:
2025.wnut-1.3
Volume:
Proceedings of the Tenth Workshop on Noisy and User-generated Text
Month:
May
Year:
2025
Address:
Albuquerque, New Mexico, USA
Editors:
JinYeong Bak, Rob van der Goot, Hyeju Jang, Weerayut Buaphet, Alan Ramponi, Wei Xu, Alan Ritter
Venues:
WNUT | WS
Publisher:
Association for Computational Linguistics
Pages:
16–25
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.wnut-1.3/
Cite (ACL):
Avi Shmidman and Shaltiel Shmidman. 2025. Restoring Missing Spaces in Scraped Hebrew Social Media. In Proceedings of the Tenth Workshop on Noisy and User-generated Text, pages 16–25, Albuquerque, New Mexico, USA. Association for Computational Linguistics.
Cite (Informal):
Restoring Missing Spaces in Scraped Hebrew Social Media (Shmidman & Shmidman, WNUT 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.wnut-1.3.pdf