Billions of people use the internet every day, producing quintillions of bytes of data. Artificial intelligence (AI) makes it possible to gain structured insights from these huge amounts of data. Companies that base business-critical decisions on data benefit from this in particular. The problem: Although data is now available in a variety of languages, there is a lack of multilingual data sets such as knowledge graphs, which model information in a structured way and are the basis for many AI applications. In a new research project, scientists from the Data Science group at the Institute of Computer Science at the University of Paderborn are working with partners from industry to enable end users to query large amounts of multilingual text data using knowledge graphs. This key component should make the use of AI-supported solutions in companies more efficient, for example in question answering (QA) systems in the form of chatbots or in enterprise search, i.e., internal company search engines.
The project entitled "Polylingual Hybrid Question Answering" (PORQUE) is being funded for the next three years by the German Federal Ministry of Education and Research (BMBF) as part of the "Eurostars" funding programme with a total of 1.2 million euros. The project partners include the Semantic Web Company (consortium leader) and the software developer SiteFusion.
New platform unites multilingual data
"Our project aims to further develop polylingual, i.e., multilingual, conversational AI to enable users to query a variety of multilingual data sources. This should enable companies to use globally available data to make informed business-critical decisions," says Prof. Dr Axel-Cyrille Ngonga Ngomo, head of the "Data Science" department at the Institute of Computer Science at the University of Paderborn.
The challenge lies in answering complex questions across multiple languages, based on large amounts of heterogeneous data. "The innovation of our approach lies in the combination of automatic machine translation and knowledge graphs," explains the computer scientist.
"Knowledge graphs are the foundation without which many AI applications and assistants would not work today: They are used in information retrieval solutions and QA systems," says Artem Revenko, Director Research, PoolParty Semantic Suite. For example, knowledge graphs sit behind the blocks of information that Google displays for search queries before the user even visits a page, and Amazon uses them to answer questions put to Alexa.
"Besides the difficulty that a person can ask a question in many different ways, there is a lack of knowledge graphs in languages other than English, even though just under half of all information on the web is not available in English," explains Ngonga Ngomo. "Although a great deal of effort has already been made to make knowledge graphs available across languages, most popular knowledge graphs, e.g. DBpedia, are most comprehensive in their English versions. This lack of multilingual datasets limits the transfer of machine learning-based models, such as QA systems, to different languages," he continues.
Answering cross-language questions from the European market
The novel platform for multilingual question answering is intended to be a hybrid solution, Ngonga Ngomo explains. "Our platform will include translation and cross-lingual enrichment of knowledge graphs coupled with information from texts on the web. Once a knowledge graph is enriched with multilingual content, we want to use it as background knowledge for creating and improving the quality of polylingual QA systems." This is particularly relevant in the European context, he says, as data in this space is available in a variety of languages.
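To make the idea of cross-lingual enrichment more concrete, the following is a minimal sketch and not the project's actual implementation. It assumes a toy knowledge graph in which each entity carries labels keyed by language code, and it uses a small lookup table as a hypothetical stand-in for a machine translation system. The names `Entity`, `translate`, `enrich_labels` and the `ex:` identifiers are illustrative assumptions.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for a machine translation system: a tiny lookup table.
TOY_TRANSLATIONS = {
    ("Germany", "de"): "Deutschland",
    ("Germany", "fr"): "Allemagne",
    ("Paderborn", "de"): "Paderborn",
    ("Paderborn", "fr"): "Paderborn",
}


def translate(text, target_lang):
    """Return a translation, or None when the toy table has no entry."""
    return TOY_TRANSLATIONS.get((text, target_lang))


@dataclass
class Entity:
    uri: str
    labels: dict = field(default_factory=dict)  # language code -> label


def enrich_labels(entities, source_lang, target_langs):
    """Fill in missing labels by translating the source-language label.

    Existing labels are never overwritten. Returns the number of labels added.
    """
    added = 0
    for entity in entities:
        source_label = entity.labels.get(source_lang)
        if source_label is None:
            continue  # nothing to translate from
        for lang in target_langs:
            if lang in entity.labels:
                continue  # keep labels that are already present
            translated = translate(source_label, lang)
            if translated is not None:
                entity.labels[lang] = translated
                added += 1
    return added


# Illustrative data only: one entity already has a German label, one does not.
graph = [
    Entity("ex:Germany", {"en": "Germany", "de": "Deutschland"}),
    Entity("ex:Paderborn", {"en": "Paderborn"}),
]
added_count = enrich_labels(graph, "en", ["de", "fr"])
```

In this sketch, labels that already exist are left untouched and only the gaps are filled by translation, which mirrors the idea of enriching an English-centric graph with content in other languages.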
So far, there are very few solutions that link entity names (such as names of people or places) contained in texts with polylingual domain knowledge to answer questions, says Ngonga Ngomo. He adds, "Commercial applications that enable multilingual QA have so far relied heavily on humans to do some of the quality assurance of the data, which is time-consuming and costly. By combining machine translation as an automated system with specific language processing techniques, we enable end users to ask multilingual questions and get accurate answers automatically."
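As a rough illustration of linking entity names in a question to multilingual domain knowledge, the sketch below looks up words of a question in an index built from labels in several languages and then reads the answer from a small fact table. It is a simplified assumption for illustration, not the PORQUE system: the data, the `ex:` identifiers and the function names are hypothetical, the relation ("capital") is hard-coded, and real systems would need far more robust mention detection and question understanding.

```python
# Illustrative multilingual labels and facts (hypothetical identifiers).
LABELS = {
    "ex:Germany": {"en": "Germany", "de": "Deutschland", "fr": "Allemagne"},
    "ex:France": {"en": "France", "de": "Frankreich", "fr": "France"},
}
FACTS = {
    ("ex:Germany", "ex:capital"): "ex:Berlin",
    ("ex:France", "ex:capital"): "ex:Paris",
}


def build_label_index(labels):
    """Map every label, in every language, to its entity identifier."""
    index = {}
    for uri, by_lang in labels.items():
        for label in by_lang.values():
            index[label.casefold()] = uri
    return index


def link_entities(text, index):
    """Return identifiers of entities whose label appears as a word in text."""
    tokens = [tok.strip(".,;:!?\"'").casefold() for tok in text.split()]
    found = []
    for tok in tokens:
        uri = index.get(tok)
        if uri is not None and uri not in found:
            found.append(uri)
    return found


def answer_capital(question, index, facts):
    """Answer a 'capital of X' question; the relation is fixed in this sketch."""
    for uri in link_entities(question, index):
        answer = facts.get((uri, "ex:capital"))
        if answer is not None:
            return answer
    return None


index = build_label_index(LABELS)
german_answer = answer_capital("Was ist die Hauptstadt von Deutschland?", index, FACTS)
french_answer = answer_capital("Quelle est la capitale de la France ?", index, FACTS)
unknown_answer = answer_capital("What is the capital of Spain?", index, FACTS)
```

Because the index contains labels in all available languages, the same question can be asked in German or French and resolve to the same entity. A question about an entity missing from the graph returns no answer, which is the gap that multilingual enrichment is meant to narrow.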