Research

The GeoQA project focuses on the Scalable Answering of Questions Expressed in Natural Language over Large Geographic Knowledge Bases.

Introduction

Starting with the seminal projects CYC and WordNet, a lot of research and development has gone into the construction of large knowledge bases (KBs). While older KBs were constructed mostly manually by trained knowledge engineers, the emphasis in the last ten years has been on their automatic construction. This has been facilitated by the availability of techniques for scalable information extraction from the Web and from large curated knowledge sources such as Wikipedia. Well-known KBs include DBpedia, Freebase, Yago, KnowItAll, NELL, BabelNet, KnowledgeVault, Wikidata and DeepDive. Most of these KBs represent knowledge using a simple data model based on triples of the form SPO, where S is the subject, P is the predicate and O is the object (i.e., as in RDF). Equivalently, these KBs can be viewed as directed graphs, hence the equivalent term knowledge graphs.
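The equivalence between SPO triples and a directed, edge-labelled graph can be sketched in a few lines of Python. This is only an illustration: the triples and the helper names build_graph and objects are placeholders introduced here, not identifiers from any of the KBs mentioned above.

```python
from collections import defaultdict

# Illustrative triples; the identifiers are placeholders, not taken from a real KB.
TRIPLES = [
    ("Athens", "locatedIn", "Greece"),
    ("Greece", "sharesBorderWith", "Bulgaria"),
    ("Greece", "capital", "Athens"),
]


def build_graph(triples):
    """Return an adjacency list: subject -> list of (predicate, object) edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph


def objects(graph, subject, predicate):
    """Follow all edges labelled `predicate` that leave the node `subject`."""
    return [o for p, o in graph.get(subject, []) if p == predicate]


GRAPH = build_graph(TRIPLES)
print(objects(GRAPH, "Greece", "sharesBorderWith"))
```

Each subject and object becomes a node, and each predicate becomes the label of an edge from subject to object.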

Geographic knowledge has been studied for many years by researchers in Geography, Geographic Information Systems (GIS), Databases, Artificial Intelligence and the Semantic Web, and there is a wealth of research results concerning representation, querying and inference for geographic knowledge. In GIS terminology, a geographic feature (or simply feature) is an abstraction of a real-world phenomenon and can have various attributes that describe its thematic and spatial characteristics. For example, the country Greece is a feature, its name and population are thematic attributes, while its location on Earth, in terms of polar coordinates, is a spatial attribute. Knowledge about the spatial attributes of a feature can be given in a quantitative or qualitative way. For example, the fact that the distance between Athens and Salonika is 502 km is quantitative knowledge, while the fact that the river Evros crosses Bulgaria and Turkey and is at the border of Greece with Turkey is qualitative knowledge. Quantitative geographic knowledge is usually represented using geometries (e.g., points, lines and polygons on the Cartesian plane), while qualitative geographic knowledge is captured by qualitative binary relations between the geometries of features. The most important classes of qualitative binary relations between geometries that have been studied in the literature are topological relations (e.g., the well-known frameworks of Egenhofer and RCC-8) and cardinal directions. Important research results of this area in the last 30 years have found their way into commercial products, most notably GIS like ArcGIS or QGIS, and DBMSs such as PostGIS and Oracle Spatial and Graph. In the context of the Web, the well-known community activity schema.org, which defines vocabularies for annotating Web pages with semantic information, will be adding, in pending extensions, classes and relations that model geographic features and their binary topological relations.
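To make the link between quantitative geometries and qualitative topological relations concrete, the following minimal sketch computes the RCC-8 relation between two features. It makes a strong simplifying assumption: each geometry is an axis-aligned rectangle (xmin, ymin, xmax, ymax) with positive area, whereas real features are arbitrary polygons. The function name rcc8 is introduced here for illustration only.

```python
def rcc8(a, b):
    """Return the RCC-8 relation of rectangle a to rectangle b.

    Rectangles are (xmin, ymin, xmax, ymax) tuples with positive area.
    """
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    if tuple(a) == tuple(b):
        return "EQ"  # equal
    # Closed rectangles share at least one point (boundary included).
    closures_meet = ax1 <= bx2 and bx1 <= ax2 and ay1 <= by2 and by1 <= ay2
    if not closures_meet:
        return "DC"  # disconnected
    # Open interiors share at least one point.
    interiors_meet = ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2
    if not interiors_meet:
        return "EC"  # externally connected: only boundaries touch
    a_in_b = bx1 <= ax1 and ax2 <= bx2 and by1 <= ay1 and ay2 <= by2
    b_in_a = ax1 <= bx1 and bx2 <= ax2 and ay1 <= by1 and by2 <= ay2
    if a_in_b:
        strict = bx1 < ax1 and ax2 < bx2 and by1 < ay1 and ay2 < by2
        return "NTPP" if strict else "TPP"  # (non-)tangential proper part
    if b_in_a:
        strict = ax1 < bx1 and bx2 < ax2 and ay1 < by1 and by2 < ay2
        return "NTPPi" if strict else "TPPi"  # inverse relations
    return "PO"  # partially overlapping


print(rcc8((0, 0, 1, 1), (1, 0, 2, 1)))
print(rcc8((1, 1, 2, 2), (0, 0, 4, 4)))
```

Two rectangles that share only an edge are externally connected, while a rectangle lying strictly inside another is a non-tangential proper part of it.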

Geographic knowledge is useful for many human activities, hence one would expect that large KBs like DBpedia, Freebase, etc. would contain a wealth of such knowledge. However, although all these KBs contain lots of knowledge about features and their thematic attributes, they contain very little knowledge about the spatial attributes of features. For example, DBpedia contains latitude/longitude coordinates of the center of cities, towns, etc., extracted from Wikipedia. In addition, DBpedia contains knowledge about some thematic attributes that can be used to infer knowledge about spatial attributes of features. For example, for each country, the neighboring countries are given, and for each city, the country to which the city belongs is given. In this way, one can infer knowledge about the corresponding spatial attributes of features, e.g., “the geometry of Greece externally connects with the geometry of Bulgaria” or “the geometry of Athens is a non-tangential proper part of the geometry of Greece”, using the vocabulary of RCC-8. Recently, DBpedia has been attempting to add cardinal direction knowledge (e.g., Athens is north of Crete) via properties dbp:north, dbp:south, etc. However, to the best of our knowledge, this information is currently incomplete and sometimes wrong.
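The inference from thematic attributes to RCC-8 relations described above can be sketched as a small rule table. The property names neighbour and country are hypothetical stand-ins for the corresponding DBpedia properties. Without geometries, a "city belongs to country" fact only tells us that the city is a proper part of the country, so the sketch returns the set {TPP, NTPP} rather than committing to one of them.

```python
# Hypothetical thematic properties mapped to the RCC-8 relations they license.
# A set with more than one element is a disjunction of possible relations.
RULES = {
    "neighbour": frozenset({"EC"}),         # neighbouring countries touch
    "country": frozenset({"TPP", "NTPP"}),  # a city is a proper part of its country
}

THEMATIC_TRIPLES = [
    ("Greece", "neighbour", "Bulgaria"),
    ("Athens", "country", "Greece"),
    ("Greece", "name", "Greece"),  # purely thematic, no spatial consequence
]


def infer_rcc8(triples, rules=RULES):
    """Return (subject, possible RCC-8 relations, object) for matching triples."""
    return [(s, rules[p], o) for s, p, o in triples if p in rules]


for subject, relations, obj in infer_rcc8(THEMATIC_TRIPLES):
    print(subject, sorted(relations), obj)
```

Deciding that Athens is specifically a non-tangential proper part of Greece requires the additional knowledge that Athens does not touch the boundary of Greece.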

Yago2 is the first large knowledge base that has made an effort to increase its geographic knowledge by harvesting geographic information not just from Wikipedia but also from GeoNames. GeoNames is a gazetteer that currently has information about 11,000,000 place names. In order to represent geographic knowledge, Yago2 introduces the class yagoGeoEntity and the special data type yagoGeoCoordinates for representing geographical coordinates. Yago2 uses GeoNames to obtain alternate names for locations (e.g., Athina and Athens) together with hierarchical part-of information (e.g., Athens is part of Greece). Yago2 also provides a data type, yagoDate, for the representation of temporal information. The dates in Yago2 follow the ISO 8601 (YYYY-MM-DD) format and represent time points. If we want to model intervals in Yago2, e.g., the lifetime of an entity such as a person, we can use pairs of properties, e.g., wasBornOnDate and diedOnDate, each of which connects an entity with a date. To represent geospatial and temporal knowledge, Yago2 uses the SPOTL data model, which extends the SPO model for knowledge base triples discussed above: T stands for Time, L stands for Location, and S, P and O are as in any other knowledge base. The SPOTL model allows not only temporal and geospatial relations between entities, but also temporal and geospatial relations between facts. For example, the fact that Barack Obama was inaugurated as president of the United States of America can be associated with a place (Washington D.C.) and a date (2009-01-20).
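A minimal sketch of a SPOTL fact, assuming a plain Python data class, is given below. The class name SPOTLFact, its field names and the identifiers used for the subject, predicate and object are illustrative choices made here; they do not reproduce the actual Yago2 serialization.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class SPOTLFact:
    """An SPO triple optionally annotated with a Time and a Location."""
    subject: str
    predicate: str
    obj: str
    time: Optional[date] = None      # T: a time point (ISO 8601 date)
    location: Optional[str] = None   # L: an entity denoting a place


# The inauguration example from the text; identifiers are placeholders.
inauguration = SPOTLFact(
    subject="Barack_Obama",
    predicate="wasInauguratedAs",
    obj="President_of_the_United_States",
    time=date.fromisoformat("2009-01-20"),
    location="Washington_D.C.",
)

print(inauguration.time.isoformat(), inauguration.location)
```

A plain SPO triple is simply a fact whose time and location are left unset.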

Wikidata is an open and free knowledge base which is developed collaboratively by members of its community. It is an activity of the Wikimedia Foundation and it is used to serve many other Wikimedia projects. The users of Wikidata are able to add new knowledge to the underlying graph but also to modify its schema. Wikidata is a multilingual knowledge base: unlike Wikipedia, which has a different version for every language, the information about the entities of Wikidata is translated into multiple languages and is part of the same graph. When it comes to quantitative geospatial information, Wikidata provides two data types: Globe Coordinate and Geographic Shape. The coordinates of an entity can be obtained using the property coordinate location for that entity. There are currently over 7 million triples that contain this property (i.e., over 7 million entities for which Wikidata knows their coordinates). The data type Geographic Shape has the property geoshape, which can be used to associate a knowledge graph entity (e.g., the entity for Athens) with a geometry. Geometries in Wikidata are encoded using the GeoJSON format, which can encode geometries of the following types: Point, MultiPoint, Polygon, MultiPolygon, LineString and MultiLineString. Currently, Wikidata contains fewer than 1,500 geometries, which are mostly Polygons and MultiPolygons. For example, for the Wikidata entity for the city of Athens, we can have the following GeoJSON representation: {"type": "Feature", "geometry": {"type": "Point", "coordinates": [23.7161, 37.9794]}, "properties": {"name": "Athens"}}. Apart from quantitative geospatial information, Wikidata also contains rich qualitative geospatial information, which is represented with various properties, such as “shares border with” and “country”. Using the first property, one can represent the knowledge that Greece borders Bulgaria. In Wikidata, temporal knowledge can be represented using points and intervals. Points are encoded using values of the Time data type. Similarly to Yago2, time intervals are essentially pairs of points and can be encoded by using two properties (e.g., date of birth and date of death).
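The GeoJSON representation of Athens given above can be read with Python's standard json module, as in the sketch below. Note that GeoJSON stores positions in longitude, latitude order. The helper name point_lon_lat is an illustrative choice.

```python
import json

# The GeoJSON feature for Athens, exactly as quoted in the text.
ATHENS_GEOJSON = (
    '{"type": "Feature", '
    '"geometry": {"type": "Point", "coordinates": [23.7161, 37.9794]}, '
    '"properties": {"name": "Athens"}}'
)


def point_lon_lat(feature_text):
    """Parse a GeoJSON Feature and return (longitude, latitude) of its Point."""
    feature = json.loads(feature_text)
    geometry = feature["geometry"]
    if geometry["type"] != "Point":
        raise ValueError("expected a Point, got " + geometry["type"])
    longitude, latitude = geometry["coordinates"][:2]
    return longitude, latitude


lon, lat = point_lon_lat(ATHENS_GEOJSON)
print(f"Athens: longitude {lon}, latitude {lat}")
```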

Objectives and Contributions of GeoQA

In parallel with the development of large KBs, the problem of answering questions expressed in natural language over such KBs, or simply question answering (QA), has been studied. In addition to academic research, companies have started developing their own QA systems, e.g., Wolfram Alpha, Google Knowledge Graph, Facebook Graph Search, Microsoft Satori and Amazon Evi. There is also work on domain-specific knowledge bases in business, finance, life sciences, etc.

Search engines like Google are quickly moving into this interesting application area and are today able to answer factoid geographic questions. A question is called factoid if it can be answered by a single fact in the knowledge graph of the QA system. For example, if we ask Google today “Which river crosses the city of Larissa in Greece?” we will get the precise answer “Pinios River”. The success of search engines like Google in answering such factoid questions is due to the fact that the relevant knowledge is encoded as a triple in the knowledge graph which powers the search engine. But if we try the more complex non-factoid question “What is the length of the river which crosses the city of Larissa in Greece?”, the limitations of search engines like Google immediately become apparent. Now Google simply returns a set of links to Web pages with information about Larissa, Thessaly, Volos, rivers of Greece, etc. By visiting these Web pages, a user will eventually find that the length of the Pinios river is 216 km. Readers are invited to try for themselves the even more challenging question “What is the length of the river that crosses the city of Thessaly in which the team that won the football championship of the period 1987-88, in the category now called Super League, plays?”. The links that Google now gives for this question point to Web pages that do not contain anything about the river Pinios, but rather information about various football players like Vassilis Karapialis, who played in the team Athlitiki Enosi Larissa F.C., which actually won the Greek football championship in the first division (now called Super League) in the period 1987-88.
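The difference between the two Larissa questions can be sketched over a toy knowledge graph that holds only the two facts mentioned in the text. The predicate names crosses and lengthKm and the helper functions are hypothetical. The factoid question is a single triple lookup, while the non-factoid one chains two lookups.

```python
# Toy knowledge graph with the two facts from the text; predicate names are made up.
TRIPLES = [
    ("Pinios", "crosses", "Larissa"),
    ("Pinios", "lengthKm", 216),
]


def subjects(triples, predicate, obj):
    """All subjects s such that (s, predicate, obj) is in the graph."""
    return [s for s, p, o in triples if p == predicate and o == obj]


def objects(triples, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the graph."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def rivers_crossing(city):
    """Factoid question: answered by a single triple."""
    return subjects(TRIPLES, "crosses", city)


def length_of_rivers_crossing(city):
    """Non-factoid question: join the 'crosses' fact with the 'lengthKm' fact."""
    return [
        length
        for river in rivers_crossing(city)
        for length in objects(TRIPLES, river, "lengthKm")
    ]


print(rivers_crossing("Larissa"))
print(length_of_rivers_crossing("Larissa"))
```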

The detailed scientific contributions of GeoQA will be the following:

Contribution C1. We will extend the well-known knowledge graph Yago2 discussed above with geographic knowledge as found in the datasets GADM and OpenStreetMap and in national geospatial datasets from selected European countries and selected European institutions (e.g., the European Environment Agency). The new geo-knowledge graph will be called GeoYago and, when it becomes available, it will be the point of reference for knowledge graph research and development in the area of geospatial knowledge and related applications. GeoYago will be the richest knowledge graph in terms of geographic knowledge for the European Union. It will be able to power any European search engine such as QWANT, which is currently fighting hard to obtain a piece of the search engine market dominated by Google and other search engine companies based in the United States. GeoYago will be made publicly available on the Web site of the project so that it can also be used by other researchers and practitioners. Compared with other knowledge graphs, which currently have only point coordinates for entities (e.g., DBpedia, Yago2, Wikidata) and a small number of entities with more complex geometries (Wikidata), GeoYago will contain millions of entities with point coordinates but also millions of entities with more complex geometries.

Contribution C2. We will develop scalable techniques and software for keeping GeoYago up-to-date when its data sources (e.g., OpenStreetMap) change. This problem has been studied for existing knowledge bases but there is no work in this area for geo-knowledge graphs. Our techniques will work incrementally and will scale to very large knowledge graphs.

Contribution C3. We will develop knowledge graph embeddings using deep learning techniques. Knowledge graph embeddings will encode latent information present in the entities, properties and literals of the knowledge graph GeoYago, and will be utilized subsequently in the development of our question answering engine. Our contribution in this area is original since there is currently no work on knowledge graph embeddings for knowledge graphs with geospatial information.

Contribution C4. We will develop question answering techniques that go beyond the factoid geographic questions that Google can answer today and beyond our preliminary work in this area. The questions that interest us are complex non-factoid ones such as “Which rivers cross cities of Greece and have a length of more than 20 km?” or “Which cities of Greece are within 10 km of a Natura2000 protected area?” or “What is the total land area of the Greek islands?” or the two complex questions regarding the Pinios river presented above. The techniques will be based partially on deep learning architectures and the knowledge graph embeddings of Contribution C3. They will be implemented in a prototype question answering engine which will also allow for the intuitive visualization of query answers using our linked data visualization tool Sextant. The resulting question answering engine will be the first such engine internationally to be able to answer complex non-factoid questions.
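As an illustration of what a question answering engine might have to generate for the question “Which cities of Greece are within 10 km of a Natura2000 protected area?”, the sketch below builds a GeoSPARQL query string. The geo:, geof: and uom: prefixes and the geof:distance function come from the OGC GeoSPARQL standard. Everything under the ex: prefix (the classes ex:City and ex:Natura2000Area, the property ex:country and the entity ex:Greece) is a hypothetical vocabulary assumed for this example and is not the GeoYago schema.

```python
from string import Template

# GeoSPARQL query template; $metres is filled in by within_distance_query().
# The ex: vocabulary is hypothetical and used for illustration only.
QUERY_TEMPLATE = Template("""\
PREFIX geo:  <http://www.opengis.net/ont/geosparql#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
PREFIX uom:  <http://www.opengis.net/def/uom/OGC/1.0/>
PREFIX ex:   <http://example.org/ontology#>

SELECT DISTINCT ?city WHERE {
  ?city a ex:City ;
        ex:country ex:Greece ;
        geo:hasGeometry/geo:asWKT ?cityWkt .
  ?area a ex:Natura2000Area ;
        geo:hasGeometry/geo:asWKT ?areaWkt .
  FILTER (geof:distance(?cityWkt, ?areaWkt, uom:metre) <= $metres)
}
""")


def within_distance_query(distance_km):
    """Return the query text for a distance threshold given in kilometres."""
    metres = int(round(distance_km * 1000))
    return QUERY_TEMPLATE.substitute(metres=metres)


print(within_distance_query(10))
```

The sketch only produces the query text. Executing it requires a GeoSPARQL-capable store loaded with suitable data.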

Contribution C5. We will develop a gold standard corpus of geographic questions and their answers. The corpus will contain at least 1000 questions and it will be used to train the deep learning algorithms to be developed, and also to test the effectiveness and efficiency of the question answering techniques implemented in GeoQA. We will make this dataset freely available on the Web site http://geoqa.di.uoa.gr/, which contains our preliminary efforts in this area and was recently presented at the 12th Workshop on Geographic Information Retrieval.

Contribution C6. We will study the efficiency and scalability of our question answering techniques, an issue that has been ignored in the question answering literature so far. An important component of a question answering engine that affects the efficiency of the engine is the query execution component. Such a component typically executes 5-100 SPARQL queries that run on the knowledge graph to retrieve resources that are then passed to the user as answers. These queries are very similar, hence very amenable to what is called multiple query optimization in database research. We will solve the multiple query optimization problem for the GeoSPARQL query language, for which there is currently no such research.
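To give a flavour of why near-identical queries are amenable to joint optimization, the sketch below shows one very simple way of sharing work. Candidate queries that use the same graph pattern and differ only in one constant are merged into a single query by means of the SPARQL 1.1 VALUES clause. This is an assumption-laden illustration and not the technique to be developed in GeoQA. The function name merge_candidates and the example.org IRIs are hypothetical.

```python
from collections import defaultdict

# Each candidate is (graph pattern over ?e and ?answer, IRI to bind to ?e).
# The IRIs are placeholders under example.org.
CANDIDATES = [
    ("?e <http://example.org/length> ?answer .", "http://example.org/Pinios"),
    ("?e <http://example.org/length> ?answer .", "http://example.org/Evros"),
    ("?e <http://example.org/crosses> ?answer .", "http://example.org/Pinios"),
]


def merge_candidates(candidates):
    """Group candidates by pattern and emit one VALUES-based query per group."""
    grouped = defaultdict(list)
    for pattern, iri in candidates:
        if iri not in grouped[pattern]:  # skip exact duplicates
            grouped[pattern].append(iri)
    merged = []
    for pattern, iris in grouped.items():
        values = " ".join(f"<{iri}>" for iri in iris)
        merged.append(
            f"SELECT ?e ?answer WHERE {{ VALUES ?e {{ {values} }} {pattern} }}"
        )
    return merged


for query in merge_candidates(CANDIDATES):
    print(query)
```

Here three candidate queries collapse into two, so the shared pattern is evaluated once per group instead of once per candidate. Genuine multiple query optimization would also have to exploit partially overlapping patterns and spatial filters.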