Browse Prior Art Database

A Skip List based approach to evaluate structural similarity in XML documents

IP.com Disclosure Number: IPCOM000198180D
Publication Date: 2010-Jul-29
Document File: 4 page(s) / 102K

Publishing Venue

The IP.com Prior Art Database

Abstract

XML has become a popular method of data representation both on the web and in databases in recent years. One of the reasons for the popularity of XML has been its ability to encode structural information about data records. However, this structural characteristic of data sets also makes it a challenging problem for a variety of data mining problems. One such problem is that of clustering, in which the structural aspects of the data result in a high implicit dimensionality of the data representation. As a result, it becomes more difficult to cluster the data in a meaningful way. Many Clustering algorithms typically require the computation of similarity between different objects as subroutine. The problem of computing similarity between XML documents has itself been known to be a difficult problem.

This text was extracted from a PDF file.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 43% of the total text.

Page 1 of 4

A Skip List based approach to evaluate structural similarity in XML documents

3.1. The Skip Tree structure

We adopt Skip List to build the tree structure because it can preserve the structural relationship between nodes, that means we can skip from or to parent nodes and child nodes quickly and conveniently when we need to find their relationship while computing the tree similarity

Same as Skip List, Skip Tree is a layered structure. But rather than determined the nodes in each level by some fixed probability in Skip List, whether or not a node should appear in level i (i>0) depends on its depth in the XML tree. The top level of the Skip Tree contains only the root node , and in level h-1, the child nodes of the root is inserted to the list. Skip Tree will continue this process, until reached the bottom level. Figure 1 shows the Skip Trees of XML tree; we can see that if the levels in Skip Tree are equal to the depth of the XML tree.

.

Figure 1 A skip tree example

From Skip Tree, we can conclude the structure information of nodes and edges in the XML document tree.The node n's depth in the XML tree d = h - l , where h is the levels of the Skip Tree, l is the top level that node n exists in the Skip Tree. For example, in figure 2, the levels of Skip Tree is 3, the top level node B exists is level 2, and node B's depth in the XML tree is:3-2=1. And we define edge's depth in the XML tree the same with the level of child node it connects. For example, in figure 5, the level of edge AB is 1.

Skip Tree can be also regarded as a probabilistic skip list in a way that parent nodes and ancestor nodes of depth lower than or equal to i has the probability 1 in level i, while nodes of key less than i (i>0) has the probability 0 in level i.

         0 of the Skip Tree is a preorder sequence of the rooted labeled tree. So the similarity algorithms proposed on the traditional rooted labeled tree are available here.

Note that level

1

[This page contains 1 picture or other non-text object]

Page 2 of 4

Figure 2. An example of the keys

Each node in the Skip Tree has a key, key is a sequence number represented in "0" and "1",the value of the key is generated in a way that all the children of the same parent have the prefix same with their parents. The bit sequence is determined by the children number. If a parent has three nodes, we should add 2 digits to represent different nodes, and if it has one or two node, only a digit is needed. Figure 2 gives an example of the keys.

The characteristics of Skip Tree help us to originate search of nodes .As shown in figure 2 if we originate a query to find the destination of 'H'' s parent or ancestor, the query message is started at the rightmost nodes in the list of level 0, find its key value "0110" here. Then another search from the top 0 to 0110 started, it first routed to the level 2 node "C", which has value "01", then goes to the fist level find "011",which points to the node "G", the search continued until it wen...