
Method To Collect And Use Distribution Statistics Over XML Wildcard Paths
Disclosure Number: IPCOM000201620D
Publication Date: 2010-Nov-16
Document File: 5 page(s) / 35K

Publishing Venue

The Prior Art Database


Disclosed is a method for developing and implementing a Wildcard Histogram, a proportional representation of the wildcard Important Path (IP), to address the lack of distribution statistics over XML wildcard IPs.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 27% of the total text.


Method To Collect And Use Distribution Statistics Over XML Wildcard Paths

Extensible Markup Language (XML) documents can contain data that is arbitrarily complex; therefore, analysts often consider statistical information about their contents useful.

Among other purposes, analysts use this statistical information to efficiently summarize the contents of an arbitrarily large set of XML documents, or to choose the best method for retrieving information from these documents.

One useful representation of such statistical information is a histogram: a frequency distribution of a set of data, often shown in graphical form. A typical graphical form places rectangles along the X-axis, one per class of objects, with heights corresponding to the number of times that class of objects occurs in the data.
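As a minimal sketch of the frequency distribution described above, the following Python counts how often each class of values occurs and renders one text bar per class. The function names and the sample values are hypothetical and chosen only for illustration; the bars are drawn horizontally for simplicity rather than as rectangles rising from the X-axis.

```python
from collections import Counter


def frequency_histogram(values):
    """Count how many times each distinct value (class of objects) occurs."""
    return Counter(values)


def render_text_histogram(hist):
    """Return one line per class; the bar length equals the class frequency."""
    return [f"{key}: {'#' * count}" for key, count in sorted(hist.items())]


# Hypothetical values, as if extracted from a path such as '/Customer/city'.
sample_values = ["Austin", "Boston", "Austin", "Chicago", "Austin", "Boston"]
hist = frequency_histogram(sample_values)
lines = render_text_histogram(hist)

for line in lines:
    print(line)
```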

Histograms can be collected primarily via two methods: by scanning the data, or by scanning an index that has already been created over the data. Indexes are a common way to access data faster; many technologies, including search and databases, use them to speed up certain operations. These indexes contain properties by which they can be sorted, which facilitates search and retrieval.

XML indexes may be based on absolute paths or wildcards. Examples of an absolute path are '/Customer/id' and '/a/b/c'; examples of wildcard paths are '//id', '/Customer/*', and '/a/b/*'.

Absolute-path XML indexes, by definition, are sorted by the value they are indexing. Distribution statistics, such as histograms, can easily be collected over absolute-path XML indexes by scanning the XML index sequentially and constructing the histograms.
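The sequential scan described above can be sketched as follows. The text does not say which kind of histogram is built, so this example assumes an equi-depth histogram (buckets holding roughly equal numbers of entries), a common choice for value-sorted data. The function name, the bucket layout, and the sample values are assumptions made for illustration.

```python
def equi_depth_histogram(sorted_values, num_buckets):
    """Build buckets in one sequential pass over values sorted ascending.

    Each bucket is a (low, high, count) tuple. This is an assumed bucket
    layout, not one taken from the disclosure.
    """
    total = len(sorted_values)
    if total == 0 or num_buckets <= 0:
        return []
    depth = -(-total // num_buckets)  # ceiling division: target bucket size
    buckets = []
    low = high = None
    count = 0
    for value in sorted_values:
        if count == 0:
            low = value  # first value of a new bucket
        high = value
        count += 1
        if count == depth:
            buckets.append((low, high, count))
            count = 0
    if count:
        buckets.append((low, high, count))  # final, possibly smaller bucket
    return buckets


# Hypothetical values from an absolute-path index, already sorted by value.
index_values = [1, 2, 2, 3, 5, 8, 13, 21, 34, 55]
buckets = equi_depth_histogram(index_values, 3)
```

Because the index is already sorted by value, each bucket boundary is known as soon as the scan reaches it, so no additional sorting or second pass is needed.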

XML wildcard indexes extend regular XML indexes in that they can contain multiple paths that match a particular pattern. For instance, the path pattern '/a/b/*' can match '/a/b/c', '/a/b/x', and '/a/b/z'. In addition, indexes that contain path patterns such as '//c' (which have a descendant-or-self axis '//') are also considered XML wildcard indexes.
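To illustrate how one pattern can cover several concrete paths, here is a small sketch that translates a simplified path pattern into a regular expression. It assumes only a subset of XPath: '/' child steps, '*' for exactly one step of any name, and '//' for any number of intermediate steps. The function names are hypothetical, and a real XML index would use its own path-matching machinery.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a simplified path pattern into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("//", i):
            parts.append(r"/(?:[^/]+/)*")  # zero or more intermediate steps
            i += 2
        elif pattern[i] == "/":
            parts.append("/")
            i += 1
        elif pattern[i] == "*":
            parts.append(r"[^/]+")  # exactly one step with any name
            i += 1
        else:
            j = i
            while j < len(pattern) and pattern[j] not in "/*":
                j += 1
            parts.append(re.escape(pattern[i:j]))  # literal step name
            i = j
    return re.compile("".join(parts))


def matches(pattern, path):
    """Return True if the whole path matches the pattern."""
    return wildcard_to_regex(pattern).fullmatch(path) is not None


paths = ["/a/b/c", "/a/b/x", "/a/b/z", "/a/b/c/d", "/c"]
under_ab = [p for p in paths if matches("/a/b/*", p)]
any_c = [p for p in paths if matches("//c", p)]
```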

XML wildcard indexes are typically sorted first by path and then by value. This means that for the pattern '/a/b/*', all the '/a/b/c' entries appear together, then the '/a/b/d' entries, etc.
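The sort order described above can be sketched with a few hypothetical (path, value) entries for the pattern '/a/b/*'. The entry layout and the sample data are assumptions made for illustration. Sorting by path and then by value makes the entries for each path contiguous, but the values read in index order no longer form a single ascending sequence.

```python
from itertools import groupby

# Hypothetical (path, value) index entries for the pattern '/a/b/*'.
entries = [
    ("/a/b/d", 7),
    ("/a/b/c", 30),
    ("/a/b/c", 10),
    ("/a/b/d", 2),
    ("/a/b/c", 20),
]

# Sort first by path, then by value.
index_order = sorted(entries, key=lambda e: (e[0], e[1]))

# Entries for the same path are now contiguous, so one pass can group them.
runs = {
    path: [value for _, value in group]
    for path, group in groupby(index_order, key=lambda e: e[0])
}

# Values as a sequential scan of the whole index would see them.
values_in_index_order = [value for _, value in index_order]
```

Within each path the values are ascending, while the combined sequence restarts at every path boundary.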


• XML Important Path (IP): A path in a collection of XML documents that occurs more heavily than other paths in the collection

• Wildcard Collection: The set of data that is represented by an XML wildcard Important Path

• PathID: A numeric representation of a path. For example, representing the path '/a/b/c' by the pathID 201
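A PathID mapping can be sketched as a simple lookup table. The class below is hypothetical: the text does not say how pathIDs are allocated, so sequential assignment is an assumption, and the starting value 201 is chosen only so that '/a/b/c' receives the ID used in the example above.

```python
class PathIdTable:
    """Assign a stable numeric ID to each distinct path (assumed scheme)."""

    def __init__(self, first_id=201):
        self._ids = {}      # path -> pathID
        self._paths = {}    # pathID -> path
        self._next_id = first_id

    def path_id(self, path):
        """Return the ID for a path, allocating the next free ID if it is new."""
        if path not in self._ids:
            self._ids[path] = self._next_id
            self._paths[self._next_id] = path
            self._next_id += 1
        return self._ids[path]

    def path(self, path_id):
        """Return the path registered under the given ID."""
        return self._paths[path_id]


table = PathIdTable()
id_c = table.path_id("/a/b/c")
id_d = table.path_id("/a/b/d")
```

Storing a compact number in place of the full path string keeps index entries small and makes comparing two paths a numeric comparison.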

Analysts tend to use XML Important Paths (IP) more in search terms or queries,


can have significant consequences if the quantity of data being accessed or retrieved is misestimated (in terms of cardinality/fanout).

A reasonable conclusion, therefore, is that most IPs have XML indexes on them, as indexes are generally used to improve performance, and crea...