Basreh or Basra? Geoparsing Historical Locations in the Svoboda Diaries

Jolie Zhou, Camille Cole, Annie Chen


Abstract
Geoparsing, the task of assigning coordinates to locations extracted from free text, is invaluable in enabling us to place locations in time and space. In the historical domain, many geoparsing corpora are from large news collections. We examine the Svoboda Diaries, a small historical corpus written primarily in English, with many location names in transliterated Arabic. We develop a pipeline employing named entity recognition for geotagging, and a map-based generate-and-rank approach incorporating candidate name augmentation and clustering of location context words for geocoding. Our system outperforms existing map-based geoparsers in terms of accuracy, mean distance error, and number of locations correctly identified. As location names may vary from those in knowledge bases, we find that augmented candidate generation is instrumental in the system’s performance. Among our candidate generation methods, the generation of transliterated names contributed the most to increased location matches in the knowledge base. Our main contribution is an integrated pipeline for geoparsing historical corpora using augmented candidate location name generation and clustering methods, an approach that can be generalized to other texts with foreign or non-standard spellings.
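To make the generate-and-rank idea in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation. The gazetteer is a hypothetical toy with approximate coordinates, the spelling rules in `augment` are illustrative guesses rather than the paper's transliteration method, and ranking by distance to the centroid of context locations is a crude stand-in for the paper's clustering of location context words.

```python
import math
from difflib import get_close_matches

# Hypothetical toy gazetteer: name -> list of (lat, lon) in degrees (approximate).
GAZETTEER = {
    "basra": [(30.51, 47.78)],
    "baghdad": [(33.31, 44.37)],
    "alexandria": [(31.20, 29.92), (38.80, -77.05)],  # Egypt; Virginia, USA
}

def augment(name):
    """Generate spelling variants of a name (illustrative rules, assumed)."""
    n = name.lower().strip()
    variants = {n}
    if n.endswith("eh"):
        variants.add(n[:-2] + "a")    # e.g. basreh -> basra
        variants.add(n[:-2] + "ah")   # e.g. basreh -> basrah
    variants.add(n.replace("gd", "ghd"))  # e.g. bagdad -> baghdad
    return variants

def candidates(name):
    """Look up every variant in the gazetteer, exactly or by close fuzzy match."""
    found = []
    for variant in augment(name):
        if variant in GAZETTEER:
            keys = [variant]
        else:
            keys = get_close_matches(variant, list(GAZETTEER), n=1, cutoff=0.85)
        for key in keys:
            for coord in GAZETTEER[key]:
                if (key, coord) not in found:
                    found.append((key, coord))
    return found

def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def geocode(name, context_coords):
    """Return the (gazetteer name, coordinate) candidate closest to the context."""
    cands = candidates(name)
    if not cands:
        return None
    if not context_coords:
        return cands[0]
    # Arithmetic mean of lat/lon: a rough centroid, adequate for nearby points.
    centroid = (
        sum(lat for lat, _ in context_coords) / len(context_coords),
        sum(lon for _, lon in context_coords) / len(context_coords),
    )
    return min(cands, key=lambda c: haversine_km(c[1], centroid))
```

In this sketch, `geocode("Basreh", [])` resolves through the generated variant "basra", and an ambiguous name such as "Alexandria" is assigned to whichever gazetteer entry lies nearest the locations already resolved in the surrounding text.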
Anthology ID:
2024.acl-srw.33
Volume:
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Xiyan Fu, Eve Fleisig
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
377–390
URL:
https://aclanthology.org/2024.acl-srw.33
DOI:
10.18653/v1/2024.acl-srw.33
Cite (ACL):
Jolie Zhou, Camille Cole, and Annie Chen. 2024. Basreh or Basra? Geoparsing Historical Locations in the Svoboda Diaries. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop), pages 377–390, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
Basreh or Basra? Geoparsing Historical Locations in the Svoboda Diaries (Zhou et al., ACL 2024)
PDF:
https://preview.aclanthology.org/nschneid-patch-5/2024.acl-srw.33.pdf