Browse Prior Art Database

System and method to progressively find the largest most occuring statements across large database which contain high duplicacy

IP.com Disclosure Number: IPCOM000243957D
Publication Date: 2015-Nov-02
Document File: 13 page(s) / 63K

Publishing Venue

The IP.com Prior Art Database

Abstract

This algorithm find the largest most occuring statements from large set of data and runs very fast if there is high duplicacy ie multiple copies of large sequenced data is present. It is a memory efficient and time saving algorithm.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 24% of the total text.

Page 01 of 13

System and method to progressively find the largest most occuring statements across large database which contain high duplicacy

The problem with the existing method is that it creates a large dataset since it finds out each and every combination of steps. Also, it takes a lot of time to complete each combination even when there are complete duplicate scripts present in the system.. The below algorithm works faster for large set of data and much faster if multiple copies of large sequenced data is present. It reaches to the output in very less operations as compared to the algorithm proposed earlier. It is memory efficient algorithm also as at every step only relevant data is created.

Below algorithm provides faster results for finding large common sequence in many different scripts .It performs very specific and less operations to get to the required result. So it is time saving as well as memory efficient algorithm compared to the existing prior arts.

The algorithm works in two passes.

In this algorithm we will consider each script's statements as a directed list of nodes as shown in detailed section [1]. While watched individually, they would appear as ordered linked lists however once we start to super-impose the linked lists, the statements will transform into a graph/forest of statements with common sequence or ordered statements overlapping in the forest/graph. This information will be stored in a data structure mentioned in detailed section [2].

Next we use this data structure and iterate over each node and update the global list. The updated is based on the no. of steps, No. of scripts and the order of source and destinations within Global list mapping the destination & source value [2] Based on the updates made in the Global list, the list will start including the common set of steps in increasing order of number of common steps and reference the scripts which have those.

In this algorithm ,the tables (T(#)) where # is 1,2 ,3.....so on are formed by

1) Mapping entries present within the same table ie the table created the most recently (T(#-1) if the Destination column of a row in (T#-1) matches a Source column of a row in T(#-1) and comparing the scriptIds present in the rows OR

2) If (a) was not a perfect match, by mapping T(#-1) to T(#-2) T(#-3)....T(1) ie. Mapping T(#-1) one by one to earlier created tables maintaining the same conditions for comparison ie. match scriptIds and destination of T(#-1) must be same as source in the matching tables T(#-2) or in T(#-3)…..

Note: 2 is followed incase there is at-least a single script which did not have a valid match in T(#-1)

Here the table formed contains possible sequences of any length which can be formed by the combination of already existing tables (T). It also removes formation of irrelevant sequences and does not progress linearly as proposed herein.

The overall approach is to maintain the scripts in a graph/forest format in the database (DB) and

1


Page 02 of 13

ut...