Browse Prior Art Database

A Skip List based approach to evaluate structural similarity in XML documents

IP.com Disclosure Number: IPCOM000198180D
Publication Date: 2010-Jul-29
Document File: 4 page(s) / 101K

Publishing Venue

The IP.com Prior Art Database

Abstract

XML has become a popular method of data representation both on the web and in databases in recent years. One of the reasons for the popularity of XML has been its ability to encode structural information about data records. However, this structural characteristic of data sets also makes it a challenging problem for a variety of data mining problems. One such problem is that of clustering, in which the structural aspects of the data result in a high implicit dimensionality of the data representation. As a result, it becomes more difficult to cluster the data in a meaningful way. Many Clustering algorithms typically require the computation of similarity between different objects as subroutine. The problem of computing similarity between XML documents has itself been known to be a difficult problem. [1] etc Modeling XML documents as a labeled tree, and using the ”edit distance” between tree structures as the similarity function between documents. Tree edit distance between two trees is defined by the minimum cost required to transform from one tree to the other. This cost is computed by summing up the cost of primitive operations (ie node insertion, node deletion, node renaming, subtree insertion and subtree deletion) involved in the transformation. The cost of computing tree distance on XML documents is very high. In [2] it is O (|A|.|B|), where |A| and |B| are their respective sizes. [3] has done a survey on many kinds of tree edit distance algorithms, which shows that the complexity is high among all these methods. Computation of the edit distance for each document pair is required by clustering algorithm, so the cost of this approach is too high for practical applications. [4] etc model XML document as a structured graph, and the distance between two documents is defined according to the number of common element-sub element relationships, which can capture better structural similarity relationships than the tree edit distance in some cases. Usually Jaccard Coefficient based methods are used for document clustering, such as in [4] etc. The similarity in the Jaccard Coefficient is based on the rate of common edges against the maximum edges of two trees: However, when it applies to the tree structure, this algorithm can not distinguish the structural similarities of XML trees in the cases below: First, different levels of differences in tree structures have different impact to the similarity, which is not considered by the Jaccard Coefficient like algorithm. Second, the edge based Jaccard coefficient algorithm does not take the tree relationship such as embedded trees in to consideration. We present an approach for detecting structural similarity between XML documents. Our proposal roughly consists of modeling the structure of each XML document, by representing it as a Skip Tree and, then, comparing such structures through the analysis their structural relationships. Our algorithm is based on the Jaccard coefficient method, but we taking into account XML issues such as embedded trees, and hierarchical weight of impact to the similarity。 [1] A. Nierman and H. V. Jagadish. "Evaluating structural similarity in XML documents". In Proceedings of the 5th International Workshop on the Web and Databases (WebDB 2002), Madison, Wisconsin, USA, June 2002. [2] C.C. Aggarwal, N. Ta, J. Wang, J. Feng, M.J. Zaki. “XProj: A Framework for Projected Structural Clustering of XML Documents”. SIGKDD'07. [3] P. Bille. “A Survey on Tree Edit Distance and Related Problems”. Theoretical Computer Science, 337 (1-3):217–239, 2003. [4] W. Lian, D.W. Cheung, N. Mamoulis, S. Yiu. “An Efficient and Scalable Algorithm for Clustering XML Documents by Structure”. IEEE Transactions on Knowledge and Data Engineering, Vol 16, No. 1, 2004.

This text was extracted from a PDF file.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 43% of the total text.

Page 1 of 4

A Skip List based approach to evaluate structural similarity in XML documents

3.1. The Skip Tree structure

We adopt Skip List to build the tree structure because it can preserve the structural relationship between nodes, that means we can skip from or to parent nodes and child nodes quickly and conveniently when we need to find their relationship while computing the tree similarity

Same as Skip List, Skip Tree is a layered structure. But rather than determined the nodes in each level by some fixed probability in Skip List, whether or not a node should appear in level i (i>0) depends on its depth in the XML tree. The top level of the Skip Tree contains only the root node , and in level h-1, the child nodes of the root is inserted to the list. Skip Tree will continue this process, until reached the bottom level. Figure 1 shows the Skip Trees of XML tree; we can see that if the levels in Skip Tree are equal to the depth of the XML tree.

.

Figure 1 A skip tree example

From Skip Tree, we can conclude the structure information of nodes and edges in the XML document tree.The node n's depth in the XML tree d = h - l , where h is the levels of the Skip Tree, l is the top level that node n exists in the Skip Tree. For example, in figure 2, the levels of Skip Tree is 3, the top level node B exists is level 2, and node B's depth in the XML tree is:3-2=1. And we define edge's depth in the XML tree the same with the level of child node it connects. For example, in figure 5, the level of edge AB is 1.

Skip Tree can be also regarded as a probabilistic skip list in a way that parent nodes and ancestor nodes of depth lower than or equal to i has the probability 1 in level i, while nodes of key less than i (i>0) has the probability 0 in level i.

         0 of the Skip Tree is a preorder sequence of the rooted labeled tree. So the similarity algorithms proposed on the traditional rooted labeled tree are available here.

Note that level

1

[This page contains 1 picture or other non-text object]

Page 2 of 4

Figure 2. An example of the keys

Each node in the Skip Tree has a key, key is a sequence number represented in "0" and "1",the value of the key is generated in a way that all the children of the same parent have the prefix same with their parents. The bit sequence is determined by the children number. If a parent has three nodes, we should add 2 digits to represent different nodes, and if it has one or two node, only a digit is needed. Figure 2 gives an example of the keys.

The characteristics of Skip Tree help us to originate search of nodes .As shown in figure 2 if we originate a query to find the destination of 'H'' s parent or ancestor, the query message is started at the rightmost nodes in the list of level 0, find its key value "0110" here. Then another search from the top 0 to 0110 started, it first routed to the level 2 node "C", which has value "01", then goes to the fist level find "011",which points to the node "G", the search continued until it wen...