Introduction
Starting with the seminal projects CYC and WordNet, a great deal of research and development has gone into the construction of large knowledge bases (KBs). While older KBs were constructed mostly manually by trained knowledge engineers, the emphasis in the last ten years has been on their automatic construction. This has been facilitated by the availability of techniques for scalable information extraction from the Web and from large curated knowledge sources such as Wikipedia. Well-known KBs include DBpedia, Freebase, Yago, KnowItAll, NELL, BabelNet, KnowledgeVault, Wikidata and DeepDive. Most of these KBs represent knowledge using a simple data model based on triples of the form SPO, where S is the subject, P is the predicate and O is the object (i.e., as in RDF). Equivalently, these KBs can be viewed as directed graphs, hence the alternative term knowledge graphs.
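The correspondence between SPO triples and directed graphs can be illustrated with a minimal Python sketch. The triples and predicate names below (capitalOf, sharesBorderWith, locatedIn) are hypothetical and are not taken from any of the KBs listed above; the helper functions build_graph and objects are likewise our own illustrative names.

```python
from collections import defaultdict

# Illustrative triples only; the predicate names are hypothetical.
triples = [
    ("Athens", "capitalOf", "Greece"),
    ("Greece", "sharesBorderWith", "Bulgaria"),
    ("Athens", "locatedIn", "Greece"),
]


def build_graph(triples):
    """Index triples as a directed, edge-labelled graph: subject -> [(predicate, object)]."""
    graph = defaultdict(list)
    for s, p, o in triples:
        graph[s].append((p, o))
    return graph


def objects(graph, subject, predicate):
    """Return every object reached from `subject` by an edge labelled `predicate`."""
    return [o for p, o in graph.get(subject, []) if p == predicate]


graph = build_graph(triples)
print(objects(graph, "Greece", "sharesBorderWith"))
```

Each subject and object becomes a node and each predicate becomes the label of a directed edge, so a lookup such as objects(graph, "Greece", "sharesBorderWith") is simply a traversal of the outgoing edges of one node.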
Geographic knowledge has been studied for many years by researchers in Geography, Geographic Information Systems (GIS), Databases, Artificial Intelligence and the Semantic Web, and there is a wealth of research results concerning representation, querying and inference for geographic knowledge. In GIS terminology, a geographic feature (or simply feature) is an abstraction of a real-world phenomenon and can have various attributes that describe its thematic and spatial characteristics. For example, the country Greece is a feature, its name and population are thematic attributes, while its location on Earth, in terms of polar coordinates, is a spatial attribute. Knowledge about the spatial attributes of a feature can be given in a quantitative or qualitative way. For example, the fact that the distance between Athens and Salonika is 502 km is quantitative knowledge, while the fact that the river Evros crosses Bulgaria and Turkey and is at the border of Greece with Turkey is qualitative knowledge. Quantitative geographic knowledge is usually represented using geometries (e.g., points, lines and polygons on the Cartesian plane), while qualitative geographic knowledge is captured by qualitative binary relations between the geometries of features. The most important classes of qualitative binary relations between geometries that have been studied in the literature are topological relations (e.g., the well-known frameworks of Egenhofer and RCC-8) and cardinal direction relations. Important research results of this area in the last 30 years have found their way into commercial products, most notably GIS software such as ArcGIS or QGIS, and DBMSs such as PostGIS and Oracle Spatial and Graph. In the context of the Web, the well-known community activity schema.org, which defines vocabularies for annotating Web pages with semantic information, will be adding, in pending extensions, classes and relations that model geographic features and their binary topological relations.
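To make the link between quantitative geometries and qualitative topological relations concrete, the following Python sketch derives the RCC-8 relation that holds between two geometries. It is a simplification under a strong assumption: geometries are restricted to axis-aligned rectangles with positive area, whereas real features are arbitrary polygons. The names Box and rcc8 are our own and do not come from any GIS library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle with positive area on the Cartesian plane."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def _contains(outer, inner):
    """True if `inner` lies inside `outer`, boundaries allowed to touch."""
    return (outer.xmin <= inner.xmin and inner.xmax <= outer.xmax
            and outer.ymin <= inner.ymin and inner.ymax <= outer.ymax)


def _strictly_contains(outer, inner):
    """True if `inner` lies inside `outer` without touching its boundary."""
    return (outer.xmin < inner.xmin and inner.xmax < outer.xmax
            and outer.ymin < inner.ymin and inner.ymax < outer.ymax)


def rcc8(a, b):
    """Return the RCC-8 relation of rectangle `a` with respect to rectangle `b`."""
    # DC: the closed rectangles share no point at all.
    if a.xmax < b.xmin or b.xmax < a.xmin or a.ymax < b.ymin or b.ymax < a.ymin:
        return "DC"
    # EC: they share boundary points, but their interiors do not overlap.
    interiors_overlap = (a.xmin < b.xmax and b.xmin < a.xmax
                         and a.ymin < b.ymax and b.ymin < a.ymax)
    if not interiors_overlap:
        return "EC"
    if a == b:
        return "EQ"
    if _contains(b, a):
        return "NTPP" if _strictly_contains(b, a) else "TPP"
    if _contains(a, b):
        return "NTPPi" if _strictly_contains(a, b) else "TPPi"
    return "PO"
```

For instance, rcc8(Box(0, 0, 1, 1), Box(1, 0, 2, 1)) yields EC because the two rectangles share only an edge, while rcc8(Box(1, 1, 2, 2), Box(0, 0, 3, 3)) yields NTPP because the first rectangle lies strictly inside the second.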
Geographic knowledge is useful for many human activities, hence one would expect that large KBs like DBpedia, Freebase, etc. would contain a wealth of such knowledge. However, although all these KBs contain a lot of knowledge about features and their thematic attributes, they contain very little knowledge about the spatial attributes of features. For example, DBpedia contains latitude/longitude coordinates of the center of cities, towns, etc. extracted from Wikipedia. In addition, DBpedia contains knowledge about some thematic attributes that can be used to infer knowledge about spatial attributes of features. For example, for each country, the neighboring countries are given, and for each city, the country to which the city belongs is given. In this way, one can infer knowledge about the corresponding spatial attributes of features, e.g., “the geometry of Greece externally connects with the geometry of Bulgaria” or “the geometry of Athens is a non-tangential proper part of the geometry of Greece”, using the vocabulary of RCC-8. Recently, DBpedia has been attempting to add cardinal direction knowledge (e.g., Athens is north of Crete) via properties dbp:north, dbp:south, etc. However, to the best of our knowledge, this information is currently incomplete and sometimes wrong.
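The inference from thematic attributes to RCC-8 relations described above can be sketched as a simple rule table in Python. The predicate names neighbor and country are hypothetical stand-ins rather than actual DBpedia property names, and mapping city membership to NTPP follows the Athens example only; a city lying on a border or coastline would strictly be a tangential proper part (TPP) instead.

```python
# Hypothetical mapping from thematic predicates to RCC-8 relations,
# following the two examples given in the text.
THEMATIC_TO_RCC8 = {
    "neighbor": "EC",   # neighboring countries: externally connected
    "country": "NTPP",  # city belongs to country: non-tangential proper part (simplification)
}


def infer_spatial(triples):
    """Derive RCC-8 triples from thematic triples; unknown predicates are skipped."""
    return [(s, THEMATIC_TO_RCC8[p], o)
            for s, p, o in triples
            if p in THEMATIC_TO_RCC8]


thematic = [
    ("Greece", "neighbor", "Bulgaria"),
    ("Athens", "country", "Greece"),
    ("Greece", "population", "10.4 million"),  # illustrative value, carries no spatial rule
]
print(infer_spatial(thematic))
```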
Yago2 is the first large knowledge base that has made an effort to increase its geographic knowledge by harvesting geographic information not just from Wikipedia but also from GeoNames. GeoNames is a gazetteer that currently has information about 11,000,000 place names. In order to represent geographic knowledge, Yago2 introduces the class yagoGeoEntity and the special data type yagoGeoCoordinates for representing geographical coordinates. Yago2 uses GeoNames to obtain alternate names for locations (e.g., Athina and Athens) together with hierarchical part-of information (e.g., Athens is part of Greece). Yago2 also provides a data type, yagoDate, for the representation of temporal information. The dates in Yago2 follow the ISO 8601 (YYYY-MM-DD) format and represent time points. If we want to model intervals in Yago2, e.g., the lifetime of an entity such as a person, we can use pairs of properties, e.g., wasBornOnDate and diedOnDate, each of which connects an entity with a date. To represent geospatial and temporal knowledge, Yago2 uses the SPOTL data model, which extends the SPO model for knowledge base triples discussed above: T stands for Time, L stands for Location, and S, P and O are as in any other knowledge base. The SPOTL model allows not only temporal and geospatial relations between entities, but also temporal and geospatial relations between facts. For example, the fact that Barack Obama was inaugurated as president of the United States of America can be associated with a place (Washington D.C.) and a date (2009-01-20).
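As an illustration of how SPOTL extends SPO, the following Python sketch models a fact as a triple with optional time and location components. This is not the representation used internally by Yago2; the class name SPOTLFact, its field names and the predicate wasInauguratedAs are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class SPOTLFact:
    """An SPO triple extended with an optional time point (T) and location (L)."""
    subject: str
    predicate: str
    obj: str
    time: Optional[date] = None      # ISO 8601 time point, as in yagoDate
    location: Optional[str] = None   # name of the place associated with the fact


# The inauguration example from the text; the predicate name is hypothetical.
inauguration = SPOTLFact(
    subject="Barack Obama",
    predicate="wasInauguratedAs",
    obj="President of the United States of America",
    time=date.fromisoformat("2009-01-20"),
    location="Washington D.C.",
)

# A plain SPO fact simply leaves T and L unset.
part_of = SPOTLFact("Athens", "isPartOf", "Greece")

print(inauguration.time.isoformat(), inauguration.location)
```

Keeping T and L as optional fields of the fact, rather than as separate triples about the subject, reflects the point that time and location qualify the fact itself and not only the entities involved in it.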
Wikidata is an open and free knowledge base which is developed collaboratively by members of its community. It is an activity of the Wikimedia Foundation and it is used to serve many other Wikimedia projects. The users of Wikidata are able to add new knowledge to the underlying graph and also to modify its schema. Wikidata is a multilingual knowledge base: unlike Wikipedia, which has a different version for every language, the information about the entities of Wikidata is translated into multiple languages and is part of the same graph. When it comes to quantitative geospatial information, Wikidata provides two data types: Globe Coordinate and Geographic Shape. The coordinates of an entity can be obtained using the property coordinate location for that entity. There are currently over 7 million triples that contain this property (i.e., over 7 million entities for which Wikidata knows their coordinates). The data type Geographic Shape has the property geoshape which can be used to associate a knowledge graph entity (e.g., the entity for Athens) with a geometry. Geometries in Wikidata are encoded using the GeoJSON format, which can be used to encode geometries of the following types: Point, MultiPoint, LineString, MultiLineString, Polygon and MultiPolygon. Currently, Wikidata contains fewer than 1500 geometries, which are mostly Polygons and MultiPolygons. For example, for the Wikidata entity for the city of Athens, we can have the following GeoJSON representation: {"type": "Feature", "geometry": {"type": "Point", "coordinates": [23.7161, 37.9794]}, "properties": {"name": "Athens"}}. Apart from quantitative geospatial information, Wikidata also contains rich qualitative geospatial information, which is represented with various properties such as “shares border with” and “country”. Using the first property, one can represent the knowledge that Greece borders Bulgaria. In Wikidata, temporal knowledge can be represented using points and intervals.
Points are encoded using values of the Time data type. Similarly to Yago2, time intervals are essentially pairs of points and can be encoded by using two properties (e.g., date of birth and date of death).
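The two Wikidata encodings discussed above can be sketched in Python using only the standard library. The first part parses the GeoJSON example given for Athens; the second builds a time interval from a pair of date-valued properties. The dictionary layout of the entity record and the helper name interval are assumptions made for illustration and do not reflect the actual Wikidata API or its JSON export format.

```python
import json
from datetime import date

# The GeoJSON feature for Athens, as given in the text.
feature_text = (
    '{"type": "Feature", '
    '"geometry": {"type": "Point", "coordinates": [23.7161, 37.9794]}, '
    '"properties": {"name": "Athens"}}'
)

feature = json.loads(feature_text)
# GeoJSON lists coordinates as [longitude, latitude].
lon, lat = feature["geometry"]["coordinates"]


def interval(entity, start_prop, end_prop):
    """Build a (start, end) pair from two date-valued properties; end is None if absent."""
    start = date.fromisoformat(entity[start_prop])
    end_raw = entity.get(end_prop)
    end = date.fromisoformat(end_raw) if end_raw else None
    return start, end


# Hypothetical flat record for an entity, keyed by property label.
person = {
    "label": "Alan Turing",
    "date of birth": "1912-06-23",
    "date of death": "1954-06-07",
}

born, died = interval(person, "date of birth", "date of death")
print(feature["properties"]["name"], lon, lat)
print(born.isoformat(), died.isoformat())
```

Note that the coordinate order matters: reading the pair as latitude followed by longitude would place the point far from Athens.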