Expanding the YAGO knowledge base

Abstract

A knowledge base is a set of facts about the world. YAGO was one of the first large-scale knowledge bases to be constructed automatically. This thesis focuses on extending the YAGO knowledge base along two axes: extraction and preprocessing. The first main contribution of this thesis is increasing the number of facts about people. The thesis describes algorithms and heuristics for extracting more facts about birth and death dates, gender, and place of residence. The thesis also shows how to use these data for studies in Digital Humanities. The second main contribution consists of two algorithms for automatically repairing a regular expression so that it matches a given set of words. Experiments on various datasets show the effectiveness and generality of these algorithms. Both algorithms improve the recall of the initial regular expression while achieving similar or better precision. The last contribution is a system for translating database queries into Bash scripts. This approach allows preprocessing large tabular datasets and knowledge bases by executing Datalog and SPARQL queries, without installing any software beyond a Unix-like operating system. Experiments show that the performance of our system is comparable to that of state-of-the-art systems.

Publication
PhD thesis