Abstract
Geoparsing, the task of assigning coordinates to locations extracted from free text, is invaluable in enabling us to place locations in time and space. In the historical domain, many geoparsing corpora are drawn from large news collections. We examine the Svoboda Diaries, a small historical corpus written primarily in English, with many location names in transliterated Arabic. We develop a pipeline that employs named entity recognition for geotagging and, for geocoding, a map-based generate-and-rank approach incorporating candidate name augmentation and clustering of location context words. Our system outperforms existing map-based geoparsers in terms of accuracy, mean distance error, and number of locations correctly identified. As location names may vary from those in knowledge bases, we find that augmented candidate generation is instrumental in the system’s performance. Among our candidate generation methods, the generation of transliterated names contributed the most to increased location matches in the knowledge base. Our main contribution is an integrated pipeline for geoparsing historical corpora using augmented candidate location name generation and clustering methods, an approach that can be generalized to other texts with foreign or non-standard spellings.
- Anthology ID:
- 2024.acl-srw.33
- Volume:
- Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)
- Month:
- August
- Year:
- 2024
- Address:
- Bangkok, Thailand
- Editors:
- Xiyan Fu, Eve Fleisig
- Venue:
- ACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 377–390
- URL:
- https://aclanthology.org/2024.acl-srw.33
- DOI:
- 10.18653/v1/2024.acl-srw.33
- Cite (ACL):
- Jolie Zhou, Camille Cole, and Annie Chen. 2024. Basreh or Basra? Geoparsing Historical Locations in the Svoboda Diaries. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop), pages 377–390, Bangkok, Thailand. Association for Computational Linguistics.
- Cite (Informal):
- Basreh or Basra? Geoparsing Historical Locations in the Svoboda Diaries (Zhou et al., ACL 2024)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-5/2024.acl-srw.33.pdf