Time-Aware Word Embeddings for Three Lebanese News Archives

Jad Doughman, Fatima Abu Salem, Shady Elbassuoni


Abstract
Word embeddings have proven to be an effective method for capturing semantic relations among distinct terms within a large corpus. In this paper, we present a set of word embeddings learnt from three large Lebanese news archives, which collectively consist of 609,386 scanned newspaper images and span a combined 151 years of coverage between 1933 and 2011. The ideological diversity of the news archives, together with the temporal variability of the embeddings, offers a rare glimpse into how word representations vary across the left-right political spectrum. To train the word embeddings, Google’s Tesseract 4.0 OCR engine was employed to transcribe the scanned news archives, and various archive-level as well as decade-level word embeddings were learnt. To evaluate the accuracy of the learnt word embeddings, a benchmark of analogy tasks was used. Finally, we demonstrate an interactive system that lets the end user visualize, for a given word of interest, how the top-k closest words in the embedding space vary over time and across news archives, using an animated scatter plot.
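As a rough illustration of the two operations the abstract mentions (top-k closest words per decade, and analogy-based evaluation), the following minimal Python sketch uses cosine similarity over hand-made toy vectors. Everything in it is an assumption made for illustration: the words, the 3-dimensional vectors, the decade labels, and the helper names `cosine`, `top_k`, and `solve_analogy` are hypothetical and are not taken from the paper, its released embeddings, or its code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(embeddings, query, k=2):
    """Return the k words closest to `query` by cosine similarity."""
    q = embeddings[query]
    scored = [(w, cosine(q, vec)) for w, vec in embeddings.items() if w != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [w for w, _ in scored[:k]]

def solve_analogy(embeddings, a, b, c):
    """Answer 'a is to b as c is to ?' with the vector offset b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [(w, cosine(target, vec))
                  for w, vec in embeddings.items() if w not in (a, b, c)]
    return max(candidates, key=lambda pair: pair[1])[0]

# Hypothetical toy embeddings, one small vocabulary per decade.
toy_by_decade = {
    "1950s": {
        "king":  [0.9, 0.8, 0.1],
        "queen": [0.9, 0.1, 0.8],
        "man":   [0.1, 0.9, 0.1],
        "woman": [0.1, 0.1, 0.9],
        "paper": [0.5, 0.5, 0.0],
    },
    "1960s": {
        "king":  [0.8, 0.3, 0.6],
        "queen": [0.9, 0.1, 0.8],
        "man":   [0.1, 0.9, 0.1],
        "woman": [0.1, 0.1, 0.9],
        "paper": [0.5, 0.5, 0.0],
    },
}

# Neighbours of one word of interest, decade by decade.
for decade, vectors in toy_by_decade.items():
    print(decade, top_k(vectors, "king", k=2))

# One analogy question scored against the 1950s toy vectors.
print(solve_analogy(toy_by_decade["1950s"], "man", "king", "woman"))
```

In an analogy benchmark, accuracy would be the fraction of such questions for which the predicted word matches the expected one; the vector-offset scoring used here is one common choice and is not necessarily the one used by the authors.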
Anthology ID:
2020.lrec-1.580
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
4717–4725
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.580
Cite (ACL):
Jad Doughman, Fatima Abu Salem, and Shady Elbassuoni. 2020. Time-Aware Word Embeddings for Three Lebanese News Archives. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4717–4725, Marseille, France. European Language Resources Association.
Cite (Informal):
Time-Aware Word Embeddings for Three Lebanese News Archives (Doughman et al., LREC 2020)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2020.lrec-1.580.pdf
Code
 record/3538880