Abstract
Word embeddings have proven to be an effective method for capturing semantic relations among distinct terms within a large corpus. In this paper, we present a set of word embeddings learnt from three large Lebanese news archives, which collectively consist of 609,386 scanned newspaper images and span a total of 151 years, ranging from 1933 to 2011. The diverse ideological nature of the news archives, together with the temporal variability of the embeddings, offers a rare glimpse into the variation of word representation across the left-right political spectrum. To train the word embeddings, Google’s Tesseract 4.0 OCR engine was employed to transcribe the scanned news archives, and various archive-level as well as decade-level word embeddings were learnt. To evaluate the accuracy of the learnt word embeddings, a benchmark of analogy tasks was used. Finally, we demonstrate an interactive system that allows the end user to visualize, for a given word of interest, the variation of the top-k closest words in the embedding space as a function of time and across news archives using an animated scatter plot.
- Anthology ID:
- 2020.lrec-1.580
- Volume:
- Proceedings of the Twelfth Language Resources and Evaluation Conference
- Month:
- May
- Year:
- 2020
- Address:
- Marseille, France
- Venue:
- LREC
- Publisher:
- European Language Resources Association
- Pages:
- 4717–4725
- Language:
- English
- URL:
- https://aclanthology.org/2020.lrec-1.580
- Cite (ACL):
- Jad Doughman, Fatima Abu Salem, and Shady Elbassuoni. 2020. Time-Aware Word Embeddings for Three Lebanese News Archives. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4717–4725, Marseille, France. European Language Resources Association.
- Cite (Informal):
- Time-Aware Word Embeddings for Three Lebanese News Archives (Doughman et al., LREC 2020)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/2020.lrec-1.580.pdf
- Code
- record/3538880
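
The abstract mentions two operations on the learnt embeddings: retrieving the top-k closest words for a word of interest in each time slice, and scoring analogy tasks. The following is a minimal sketch of both, assuming plain cosine similarity and the common vector-offset formulation of analogies; it is not the authors' implementation. The decade labels, the English placeholder words, the three-dimensional toy vectors, and the function names `top_k` and `analogy` are all hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(word, embeddings, k=3):
    """Return the k words closest to `word` by cosine similarity."""
    query = embeddings[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in embeddings.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:k]]


def analogy(a, b, c, embeddings):
    """Answer 'a is to b as c is to ?' with the vector offset b - a + c."""
    target = [
        vb - va + vc
        for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    # Exclude the three query words, as is usual in analogy benchmarks.
    candidates = {w: v for w, v in embeddings.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))


# Hypothetical toy embeddings for two time slices (not real data).
EMBEDDINGS_BY_DECADE = {
    "1970s": {
        "king": [1.0, 1.0, 0.0],
        "queen": [1.0, 0.0, 1.0],
        "man": [0.0, 1.0, 0.0],
        "woman": [0.0, 0.0, 1.0],
        "river": [0.0, 0.1, -1.0],
    },
    "2000s": {
        "king": [1.0, 0.2, 0.9],  # the word's vector has drifted in this slice
        "queen": [1.0, 0.0, 1.0],
        "man": [0.0, 1.0, 0.0],
        "woman": [0.0, 0.0, 1.0],
        "river": [0.0, 0.1, -1.0],
    },
}

if __name__ == "__main__":
    # Show how the neighbourhood of one word changes across time slices.
    for decade, embeddings in EMBEDDINGS_BY_DECADE.items():
        print(decade, top_k("king", embeddings, k=2))
    print(analogy("man", "king", "woman", EMBEDDINGS_BY_DECADE["1970s"]))
```

Running the same `top_k` query against each decade-level (or archive-level) model is the kind of comparison that an animated scatter plot could then display over time.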