Data Quality of Knowledge Bases

Knowledge bases store general knowledge in machine-readable form, which makes it usable for improving search results, for automatic question answering, and for many other applications. We develop innovative methods to assess and improve the data quality of these knowledge bases.

Members: Stefan Heindorf

Contact: Stefan Heindorf

Cooperation: Bauhaus-Universität Weimar


Knowledge bases such as DBpedia, Wikidata, and YAGO are used in a wide range of applications, e.g., quick answer boxes in search engines, personal assistants like Siri, and question answering systems like IBM Watson. However, all major knowledge bases currently suffer from quality problems. In particular, crowdsourced knowledge bases like Wikidata are prone to vandalism, i.e., malicious edits by some of their volunteers.

As a first step, we compiled a vandalism corpus containing over 100,000 cases of vandalism, and we are currently developing a machine learning-based approach to detect such vandalism automatically. In the future, we plan to tackle further quality dimensions such as consistency, completeness, and contextual data quality.