Abstract
Much previous work on geoparsing has focused on identifying and resolving individual toponyms in text like Adrano, S.Maria di Licodia or Catania. However, geographical locations occur not only as individual toponyms, but also as compositions of reference geolocations joined and modified by connectives, e.g., “. . . between the towns of Adrano and S.Maria di Licodia, 32 kilometres northwest of Catania”. Ideally, a geoparser should be able to take such text, and the geographical shapes of the toponyms referenced within it, and parse these into a geographical shape, formed by a set of coordinates, that represents the location described. But creating a dataset for this complex geoparsing task is difficult and, if done manually, would require a huge amount of effort to annotate the geographical shapes of not only the geolocation described but also the reference toponyms. We present an approach that automates most of the process by combining Wikipedia and OpenStreetMap. As a result, we have gathered a collection of 360,187 uncurated complex geolocation descriptions, from which we have manually curated 1,000 examples intended to be used as a test set. To accompany the data, we define a new geoparsing evaluation framework along with a scoring methodology and a set of baselines.