Assessing Large-Scale, Cross-Domain Knowledge Bases for Semantic Search

Semantic search refers to a set of approaches that use Semantic Web technologies for information retrieval in order to make the process machine understandable and to fetch precise results. Knowledge Bases (KBs) act as the backbone of semantic search approaches by providing machine-interpretable information for query processing and retrieval of results. These KBs include Resource Description Framework (RDF) datasets and populated ontologies. This paper presents an assessment of the largest cross-domain KBs that are exploited in large-scale semantic search and are freely available on the Linked Open Data Cloud. Analyzing these datasets is a prerequisite for modeling effective semantic search approaches, because each dataset suits particular applications. Only large-scale, cross-domain datasets with more than 10 million RDF triples are considered. The sizes of the datasets, in triple counts, are surveyed along with the triple serialization format(s) they support, both of which are significant for developing effective semantic search models.


INTRODUCTION
Retrieval of specific information from available repositories based on an input query is called search. The information that is retrieved, also known as the result set for the specified query, may or may not be relevant to the user because the machine lacks an understanding of context. That is, the result set may contain highly irrelevant responses if the intent of the query is not understood by the underlying search mechanisms. Semantic search refers to search mechanisms that consider the meaning of the query terms and the context of the query as a whole. To make the transition towards semantic search, information retrieval mechanisms are exploiting Semantic Web technologies along with NLP (Natural Language Processing) techniques to process the query in a machine-understandable way. RDF-based representation of data, along with schema description using ontologies, is transforming traditional information processing into knowledge processing.

1 University School of Information, Communication & Technology, Guru Gobind Singh Indraprastha University, Delhi, India. Email: a aatif1992@gmail.com (Corresponding Author), b sdmalik@hotmail.com
To process queries intelligently, machines require properly formatted data, large Knowledge Bases, and powerful ambiguity resolution techniques (for multi-meaning terms used in the query). RDF representation of data, often with XML (eXtensible Markup Language) formatting (collectively referred to as RDF/XML), makes information machine interpretable. Also, with the availability of web-scale knowledge repositories such as DBpedia (a KB extracted from Wikimedia projects) and the Google Knowledge Graph, approaches are being actively developed to exploit this global range of knowledge for processing specific information needs. For ambiguity resolution, effective techniques such as Word Sense Disambiguation (WSD) are advancing rapidly in the NLP domain. WSD resolves the multi-meaning mappings of a query term by considering the overall context of the query and deriving the best meaning from an analysis of the remaining query terms [1].
Data present in KBs needs to be valid across multiple domains to approach a true web-scale semantic search. This ensures that the knowledge vocabulary of one domain does not collide with that of another. For example, the query term "mean" has differing interpretations in the Linguistics and Mathematics domains. In the former it corresponds to "meaning", and in the latter it represents the average of a set of numbers. Thus, there is a need for the development and utilization of cross-domain KBs.
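The context-based resolution of the term "mean" can be sketched with a simplified gloss-overlap heuristic in the spirit of the Lesk algorithm. This is only an illustrative sketch, not the WSD technique of [1]: the two glosses, the stop-word list and the function names are assumptions made up for this example.

```python
import re
from typing import Dict, Set

# Hypothetical glosses for two senses of "mean"; not taken from any real lexicon.
SENSES: Dict[str, str] = {
    "mean (Linguistics)": "the meaning or sense that a word or sentence conveys in a language",
    "mean (Mathematics)": "the average value obtained by dividing the sum of numbers by their count",
}

# Small, hand-picked stop-word list (an assumption for this sketch).
STOPWORDS: Set[str] = {
    "the", "a", "an", "of", "or", "in", "by", "that", "this",
    "these", "their", "what", "does", "to",
}


def tokens(text: str) -> Set[str]:
    """Lower-case the text and return its set of non-stop-word tokens."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def disambiguate(query: str, senses: Dict[str, str]) -> str:
    """Pick the sense whose gloss shares the most words with the query context."""
    context = tokens(query)
    # On a tie, max() keeps the first sense in dictionary order.
    return max(senses, key=lambda sense: len(context & tokens(senses[sense])))


math_sense = disambiguate("compute the mean of these numbers", SENSES)
ling_sense = disambiguate("what does this word mean in the French language", SENSES)
print(math_sense)
print(ling_sense)
```

The first query shares the word "numbers" with the Mathematics gloss, while the second shares "word" and "language" with the Linguistics gloss, so each query resolves to a different sense of the same term.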
This paper presents an assessment of large-scale, cross-domain KBs, motivated by the need for a formal comparative analysis of such datasets with respect to their suitability for semantic search applications. This study is essentially a prerequisite for modeling and developing effective semantic search approaches of global scale, as these KBs are the backbone for deriving knowledge in ways machines can understand. To keep the discussion compact and useful, we have shortlisted only the largest KBs in terms of data size, i.e. only the 25 largest datasets with more than 10 million semantic triples are considered. Section 2 introduces semantic search and discusses its necessity in new-age information retrieval, as well as the need for KBs in the semantic search process. Section 3 presents a technical discussion of KBs on the Linked Open Data Cloud and of the various RDF serialization formats. Section 4 concisely tabulates descriptions of the KBs from the perspective of their usage in information retrieval, and specifically in semantic search applications. Two key parameters, KB size and support for serialization formats, are then surveyed, analyzed and depicted with pictorial representations. Finally, Section 5 concludes the paper and outlines future work in this direction.

Contributions
The availability of Linked Open Data (LOD) is a practical measure of progress towards the realization of the Semantic Web (Web 3.0). Existing research and literature analyze the comparative growth of LOD as a whole over time (e.g. growth in the number of datasets and in triple counts over the years) [2]. However, to the best of our knowledge, no effort has yet been made to assess the LOD datasets individually. The significant contributions of this article are as follows.
• It surveys and concisely summarizes two key parameters, Knowledge Size and Knowledge Representation, for the 25 largest cross-domain KBs on LOD.
• It analyzes all the available RDF triple formats for their scope of usability in specific applications.
• It guides application developers in selecting and exploiting a particular RDF serialization to overcome constraints such as storage, network bandwidth, Universal Character Set support and web application support. It further lists the KBs that support each serialization.
• It expands the research prospects towards some less popular but global-scale KBs by summarizing the nature of the knowledge present in them.

SEMANTIC SEARCH
Search approaches in which machines are capable of analyzing the meaning of the query and of the information are referred to as semantic search approaches. Typically, semantic search uses Semantic Web technologies such as RDF and ontologies as knowledge repositories in order to make content machine interpretable, effectively improving the efficiency of search. NLP techniques such as Part-of-Speech Tagging and Named Entity Recognition are also used to preprocess the search query. Semantic search differs from keyword-based search in that it actually analyzes the concepts behind the query and its context, whereas keyword-based search relies only on the effectiveness of string matching algorithms. In the literature, keyword-based search is also referred to as navigational search, and search with conceptual clarity as research search [3].

Need of Semantic Search
The World Wide Web (WWW) introduced search on the internet with approaches based on keyword matching. As the web expanded, these approaches were improved in terms of efficiency but retained the keyword-centric methodology. In the present Big Data age, however, the web is overloaded with information and suffers from inconsistency and redundancy. If keyword-based approaches are used alone, the precision of the result set suffers. Hence, modern web search providers (including the search engine giants Google and Bing) have started to use semantic search elements as additional parameters in their web search offerings. Table 1 tabulates the issues with keyword-based approaches to web-scale information retrieval and their remedies with semantic search.

Knowledge Bases for Semantic Search
Semantic search approaches utilize the machine-interpretable knowledge contained in KBs to process the query; derive context from the query; use the derived context to search for conceptually similar information in target repositories; and finally, present the retrieved results. Most KBs contain knowledge in the form of RDF triples, which makes it machine understandable. As RDF triples are represented in <subject, predicate, object> form, the predicate provides additional formal metadata from which machines can derive conceptual relationships among query terms and other concepts. By matching the target results conceptually, semantic search yields increased precision and, hence, relevancy in the result set for the specified search query.
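The <subject, predicate, object> form and the role of the predicate in relating concepts can be sketched as follows. This is a minimal illustration under stated assumptions: the facts and the "ex:" identifiers are made up for this example, and a real KB would be queried through a triple store rather than a Python list.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    object: str


# Hypothetical facts, written with made-up "ex:" identifiers for illustration.
KB = [
    Triple("ex:Delhi", "ex:isCapitalOf", "ex:India"),
    Triple("ex:Delhi", "rdf:type", "ex:City"),
    Triple("ex:India", "rdf:type", "ex:Country"),
]


def match(
    kb: Iterable[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield the triples that agree with every position of the pattern that is not None."""
    for triple in kb:
        if subject is not None and triple.subject != subject:
            continue
        if predicate is not None and triple.predicate != predicate:
            continue
        if obj is not None and triple.object != obj:
            continue
        yield triple


# "Which resource is the capital of India?" expressed as a triple pattern.
capital_facts = list(match(KB, predicate="ex:isCapitalOf", obj="ex:India"))
# Everything the KB states about ex:Delhi.
delhi_facts = list(match(KB, subject="ex:Delhi"))
print(capital_facts)
print(delhi_facts)
```

Leaving a position unset turns it into a wildcard, so the same function answers questions about a subject, a relationship or an object.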

KNOWLEDGE BASES
KBs are data repositories containing machine-interpretable information, i.e. knowledge. KBs utilizing Semantic Web technologies represent data in the form of RDF triples, which are often structured in XML, and also in other serialization formats as shown in Section 3.2.

Table 1. Issues with keyword-based approaches to web-scale information retrieval and their remedies with semantic search.

Information Overload
Issue with keyword-based approaches: Keyword-based information retrieval produces low-precision results because of the tremendous amount of information available on the ever-growing web.
Remedy with semantic search: Semantic search does not depend on the size of the target information repositories; instead, it analyzes the concepts in the search query.

Inconsistent Information
Issue with keyword-based approaches: Inconsistent information at multiple sources raises the need to establish the trustworthiness of information sources.
Remedy with semantic search: Semantic search relies on the facts available in the underlying KBs. Hence, it leaves very little scope for knowledge inconsistencies.

Redundant Information
Issue with keyword-based approaches: The availability of similar information at multiple sources does not improve the quality of the result set; it just increases its size.
Remedy with semantic search: Similar information available at multiple sources is not semantically different; it resolves to the very same concepts.

Usage of Ambiguous Terms in Queries
Issue with keyword-based approaches: This is the key cause of irrelevant results in the result set. Machines often fail to interpret the correct conceptual usage of terms that map to multiple meanings in different contexts (e.g. "mean").
Remedy with semantic search: Resolving the ambiguity among concepts is a primary step in semantic search processing.

Dataset Formats
RDF triples are represented in various data serialization formats depending on constraints on storage and processing power. The triple formats specified by the World Wide Web Consortium (W3C) are tabulated in Table 2, which also lists popular KBs that represent their data using these formats.

Significance of RDF Triples Serialization Formats
RDF triples are represented in various formats because different applications need to process huge amounts of data efficiently. These formats limit the nature of the applications that may use the datasets. The major factors include the size of the data, Unicode support, bandwidth requirements and web application support. Storing and processing such huge amounts of data is a constraint for any system (e.g. the Freebase KB, the fifth largest dataset in our list, has over 220 GB (gigabytes) of data). Hence, to access such amounts of data, developers may build web applications and utilize the web-friendly XML and JSON-LD formats for efficient processing. Further, some formats are human readable, which makes it easier for developers to debug their application code.
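The effect of the serialization on data size can be illustrated with a small sketch that writes the same triples in N-Triples, which spells out every full URI, and in Turtle, which permits prefixes. The example.org URIs, the facts and the function names are hypothetical, and the hand-written serializers cover only triples whose three terms are all URIs with simple alphanumeric local names; a real application would rely on an RDF library.

```python
from typing import Dict, List, Tuple

UriTriple = Tuple[str, str, str]

# Hypothetical triples with made-up example.org URIs.
TRIPLES: List[UriTriple] = [
    ("http://example.org/resource/Delhi", "http://example.org/ontology/capitalOf", "http://example.org/resource/India"),
    ("http://example.org/resource/Paris", "http://example.org/ontology/capitalOf", "http://example.org/resource/France"),
    ("http://example.org/resource/Tokyo", "http://example.org/ontology/capitalOf", "http://example.org/resource/Japan"),
]

# Hypothetical prefix declarations for the Turtle output.
PREFIXES: Dict[str, str] = {
    "res": "http://example.org/resource/",
    "ont": "http://example.org/ontology/",
}


def to_ntriples(triples: List[UriTriple]) -> str:
    """Write one triple per line, every term as a full URI in angle brackets."""
    return "".join(f"<{s}> <{p}> <{o}> .\n" for s, p, o in triples)


def shorten(uri: str, prefixes: Dict[str, str]) -> str:
    """Replace a known namespace with its prefix; otherwise keep the full URI."""
    for name, base in prefixes.items():
        if uri.startswith(base):
            return f"{name}:{uri[len(base):]}"
    return f"<{uri}>"


def to_turtle(triples: List[UriTriple], prefixes: Dict[str, str]) -> str:
    """Declare each prefix once, then write the triples with prefixed names."""
    header = "".join(f"@prefix {name}: <{base}> .\n" for name, base in prefixes.items())
    body = "".join(
        f"{shorten(s, prefixes)} {shorten(p, prefixes)} {shorten(o, prefixes)} .\n"
        for s, p, o in triples
    )
    return header + body


ntriples = to_ntriples(TRIPLES)
turtle = to_turtle(TRIPLES, PREFIXES)
print(len(ntriples.encode("utf-8")), "bytes as N-Triples")
print(len(turtle.encode("utf-8")), "bytes as Turtle")
```

The prefix declarations are a fixed overhead, so the saving grows with the number of triples that share a namespace; for a handful of triples the difference is small, while for dumps with millions of triples it becomes a storage and bandwidth consideration.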
For most applications, RDF/XML is preferred because it is supported by most programming languages; it also reduces the size of triples by using namespaces instead of full Uniform Resource Identifiers (URIs). Turtle is more developer friendly in terms of readability and, hence, debugging. It is also much more efficient than RDF/XML over low-bandwidth connections and supports the Unicode character set. JSON-LD is the most convenient and efficient format for processing in JavaScript web applications. N-Triples is easily for ...

Of the two parameters surveyed, knowledge size and knowledge representation, the former increases the domain and range of search, and the latter is required for designing and developing applications under storage and processing constraints. Fig. 1 depicts the comparative sizes of all 25 surveyed KBs. DBpedia is indeed the most valuable dataset, being generic as well as multilingual. The Data.gov catalogue, the second largest in size, provides its data in separate sets as categorized by the US government. WikiData is the third largest cross-domain KB and is heavily used in real-world search applications because of its close proximity to human-readable Wiki articles.

DBpedia
It is a community-driven dataset populated by extracting structured information available in multiple Wikimedia projects. It is the largest multilingual cross-domain KB and is actively exploited by a range of semantic search approaches because of its generic vocabulary as well as its large domain-specific data collection [4].

Data.gov
US government documents converted to RDF and categorized into 417 datasets pertaining to different aspects. It is the largest Open Government Dataset [5].

WikiData
KB focused on structuring and linking data extracted from Wikipedia, the free encyclopedia. It maintains facts drawn from the data present in Wikipedia articles [6].

Source Code Ecosystem
KB of collected facts about source code from open source projects on the web. Facts are extracted at different levels of the syntax and semantics of the code [7].

Freebase
KB designed as a wiki for structured content on the web. Its data has since been migrated to WikiData, but its last data dump is still one of the largest KBs available. Hence, applications use it for knowledge that is time invariant [8].

Open Library
It contains structured data about most of the books ever published globally.

Catalogue of authority files pertaining to people, corporations, geographic information, works, events etc. It is derived from the German Integrated Authority File and holds data from the German National Library on these subjects [11].

Linked Open Numbers
KB containing billions of facts about numerals. These include numeral usage in multiple languages and relations with other number systems (binary, hex etc.) [12].

EPA (FRS, RCRA, SRS and TRI)
It contains datasets about biomedical chemicals manufactured and their recorded effects, for the protection of human health and the environment. This KB is mostly used in the medical domain but also has cross-linkages to other global KBs.

YAGO
It is a massive semantic repository of people, organization and geographic data.

LinkedDrugs
Structured data about medicines (drugs) from 23 countries [13].

LinkLion
Central KB for storing links among resources available on Linked Open Data [14].

Muninn World War 1 Dataset
Multi-disciplinary and multi-national KB with millions of investigation records from World War 1 archives.

DBkWik
Single consolidated KB derived from thousands of Wikipedia articles [15].

Influence Tracker
Social networking knowledge repository for tracking the influence of individual users on the Twitter microblogging website [16].

Product Types Ontology
Repository providing definitions for 0.3 million products described in various Wikipedia articles [17].

WarSampo
LOD KB produced by transforming Finnish World War 2 data archives [18].

CONCLUSION AND FUTURE WORK
KBs are the backbone for deriving semantic relationships among the keywords used during web search. This paper presents a concise assessment of the 25 largest cross-domain knowledge bases available as Linked Open Data. The surveyed datasets are analyzed and compared across two key parameters: Knowledge Size (in triple counts) and Knowledge Representation (RDF triple serialization format).
Knowing the nature of the available data and its efficiency constraints may help application developers target their applications at suitable formats and datasets. The survey results are analyzed and depicted with pictorial representations in Section 4. The DBpedia KB is found to be the most valuable for web-scale semantic search applications, being the largest and having the maximum number of linkages from other KBs. The WikiData KB also has wide application support owing to the availability of its multi-format data dumps.
The LOD Cloud is not limited to cross-domain knowledge bases; it also has linkages with datasets pertaining to specialized domains such as geography, government, life sciences, linguistics, media, social networking and publications. As future work, this study can be expanded into a comprehensive survey of all the knowledge bases available and linked on the LOD Cloud. It may also be extended to study various large-scale semantic search applications in order to analyze state-of-the-art research towards global semantic search solutions.