Generating synthetic keywords using natural keywords for cross-repository search
Publication Date: 2015-Mar-12
The IP.com Prior Art Database
Disclosed is an enhanced information discovery system that is capable of searching across disconnected content repositories by automatically extracting the relationship between the words from the search engine. The novelty in this approach is that it utilizes the normalized information distance algorithm to generate synthetic keywords from a given set of natural keywords. The synthetic keywords are further used to enhance the search and identify related datasets which would otherwise not be found in the search.
Disconnected repositories are a collection of repositories that contain information pertaining to the same entity but use different keywords (such as part numbers) to represent it. This paper presents a system that identifies relationships between different keywords that represent the same entity. It further proposes that these relationships can be used for enhanced information discovery across the repositories.
In today's industry, this problem manifests in different ways. Many core industrial sectors, such as manufacturing, oil and natural gas, and energy and utilities, have large volumes of structured and unstructured information stored in disconnected repositories. When this data is produced by different parties or over different periods of time, there is no common metadata model that binds the information stored in these repositories. It is not possible to search across these repositories seamlessly, because the taxonomies used by the various data sources and repositories differ, which makes it very difficult to connect the metadata originating from different sources of content. This restricts the value that can be derived from the large volumes of data.
Existing solutions allow users to drill down through the data, create a metadata model, and manually correlate the part numbers using that model. However, these solutions have severe limitations:
1. Exploring the large volume of data and building a metadata model is a very tedious process.
2. Cross-linking elements from the metadata manually is error-prone.
3. The data needs to be structured or semi-structured so that the metadata can be extracted from the corpus of data.
The solution proposed here allows users to identify the keywords for a set of repositories given a set of user-specified search keywords (natural keywords). The keywords generated by the system are called synthetic keywords. The solution also proposes an algorithm to generate the synthetic keywords from the natural keywords.
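As a rough illustration of how a synthetic keyword might be derived from a natural keyword, the sketch below uses the count-based approximation of normalized information distance (often called the normalized Google distance). It is computed from how many documents contain each term and how many contain both. This section does not give the exact formula used by the disclosed system, so the formula choice, the function names `normalized_distance` and `synthetic_keywords`, the `threshold` parameter, the sample documents, and the part numbers are all assumptions made for illustration.

```python
import math


def normalized_distance(fx, fy, fxy, n):
    """Count-based approximation of normalized information distance.

    fx, fy: number of documents containing each term
    fxy:    number of documents containing both terms
    n:      total number of documents indexed
    """
    if fx == 0 or fy == 0 or fxy == 0:
        return float("inf")  # terms never co-occur: treat as unrelated
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    denom = math.log(n) - min(lx, ly)
    if denom == 0:
        return 0.0  # both terms appear in every document
    return (max(lx, ly) - lxy) / denom


def synthetic_keywords(natural, candidates, documents, threshold=0.5):
    """Rank candidate terms by their distance to one natural keyword.

    Hypothetical helper: a plain list of strings stands in for the
    search engine index, and whitespace splitting stands in for its
    tokenizer.
    """
    docs = [set(d.lower().split()) for d in documents]
    n = len(docs)

    def count(*terms):
        wanted = [t.lower() for t in terms]
        return sum(1 for d in docs if all(t in d for t in wanted))

    scored = []
    for cand in candidates:
        dist = normalized_distance(
            count(natural), count(cand), count(natural, cand), n
        )
        if dist <= threshold:
            scored.append((cand, dist))
    return sorted(scored, key=lambda pair: pair[1])


# Invented sample data: two repositories refer to the same part
# as "PN-1001" and "VX-77".
documents = [
    "fuel pump PN-1001 specification VX-77",
    "VX-77 fuel pump test results PN-1001",
    "landing gear PN-2002 inspection",
    "PN-1001 maintenance log VX-77",
    "cabin lighting PN-3003 manual",
    "landing gear LG-9 PN-2002 overhaul",
]
print(synthetic_keywords("PN-1001", ["VX-77", "LG-9", "PN-2002"], documents))
```

In this toy corpus only `VX-77` is retained as a synthetic keyword for `PN-1001`, because the two terms always appear together, while the other candidates never co-occur with it. A real deployment would obtain the counts from the search engine rather than from an in-memory list.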
The problem of cross-repository search arises in many industries, with use cases in manufacturing, energy and utilities, and oil and natural gas, among others. The following examples illustrate the problem.
Let's consider a scenario from the manufacturing industry. In this example, an aircraft manufacturer has an issue with a particular aircraft, which is grounded due to a faulty component. The component was not manufactured by the aircraft manufacturer but supplied by a third-party vendor. All the technical documentation related to the part, such as engineering specifications, operating parameters, crash tests carried out, and the test results, is now available with the aircraft manufacturer. These documents are stored in a repository called TDMS (Technical Document Management System). The...