Knowledge bases such as Wikidata are used for a wide range of applications, e.g., quick answer boxes in search engines (e.g., Google, Bing), personal assistants (e.g., Siri, Google), or question answering systems (e.g., IBM Watson). However, today’s knowledge bases suffer from quality problems. For example, Wikidata currently reports over 10 million constraint violations. In traditional databases, data that violates a constraint is simply rejected. This approach is not applicable to real-world, large-scale knowledge bases: for almost every constraint there is an exception in the real world, and strictly enforcing constraints prevents agile and flexible development of the knowledge base. Nevertheless, constraint violations often point to quality problems. To overcome this dilemma, we envision a semi-automatic approach: constraint violations are ranked by the severity of their consequences, enabling the volunteers of the knowledge base to manually review and fix the most important violations first.
Description of the Task
- Investigate some examples of constraint violations in Wikidata, and manually order them by the severity of their consequences
- Develop systematic criteria to rank constraint violations in knowledge bases
(a master’s thesis is expected to develop more or better criteria)
- Develop a prototype for automatically ranking the constraint violations
- Evaluate your prototype by comparing its results with your initial manual ranking (for a master’s thesis, possibly also by performing a crowdsourcing experiment)
- For the most common and severe types of constraint violations, suggest how to fix them (semi-)automatically
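
The ranking and evaluation steps above could be prototyped roughly as follows. This is only a minimal sketch under stated assumptions: the severity criteria (page views, incoming links, a weight per constraint type), the weight values, the item identifiers, and the sample data are all hypothetical placeholders, since developing the actual criteria is part of the task. The comparison against the manual ranking uses Kendall's tau, which is one possible agreement measure among several.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Violation:
    item_id: str         # placeholder identifier, not a real Wikidata item
    constraint: str      # name of the violated constraint type
    page_views: int      # hypothetical popularity signal for the item
    incoming_links: int  # hypothetical number of statements pointing at the item


# Hypothetical weights per constraint type; real values would come out of
# the systematic criteria developed in the thesis.
CONSTRAINT_WEIGHT = {"single value": 1.0, "format": 0.5, "type": 2.0}


def severity(v: Violation) -> float:
    """Assumed score: constraint weight times a simple exposure estimate."""
    weight = CONSTRAINT_WEIGHT.get(v.constraint, 1.0)
    return weight * (v.page_views + 10 * v.incoming_links)


def rank(violations: list[Violation]) -> list[Violation]:
    """Order violations from most to least severe."""
    return sorted(violations, key=severity, reverse=True)


def kendall_tau(order_a: list[str], order_b: list[str]) -> float:
    """Kendall's tau between two rankings of the same items (no ties)."""
    pos_a = {x: i for i, x in enumerate(order_a)}
    pos_b = {x: i for i, x in enumerate(order_b)}
    concordant = discordant = 0
    for x, y in combinations(order_a, 2):
        s = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    pairs = len(order_a) * (len(order_a) - 1) // 2
    return (concordant - discordant) / pairs if pairs else 1.0


# Made-up sample data for illustration only.
violations = [
    Violation("item-A", "format", page_views=1000, incoming_links=5),
    Violation("item-B", "type", page_views=200, incoming_links=50),
    Violation("item-C", "single value", page_views=300, incoming_links=2),
]

automatic = [v.item_id for v in rank(violations)]
manual = ["item-B", "item-C", "item-A"]  # hypothetical manual ranking
tau = kendall_tau(automatic, manual)
print(automatic, round(tau, 3))
```

A tau close to 1 would indicate that the automatic ranking largely agrees with the manual one, while a value near 0 or below would suggest the criteria need revisiting.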