Method for generating statistical summary of document structure to facilitate data mart model generation

IP.com Disclosure Number: IPCOM000199141D
Publication Date: 2010-Aug-26
Document File: 3 page(s) / 36K

Publishing Venue

The IP.com Prior Art Database

Abstract

Extensible Markup Language (XML) is a popular representation for semi-structured data because of its great flexibility. However, processing and exchange of documents are greatly facilitated by enforcing a common structure or schema, and schemas for many domains have been defined using the XML Schema Definition language (XSD). Sometimes, however, the schema for a domain is very complex and allows for a great deal of variation in document structure to accommodate cases that are rare or occur only in particular contexts. In these situations, the schema as described by an XSD specification may be overly general for a specific use case, and may not reflect the documents actually received. A user attempting to understand the information contained in a collection of documents can become lost among the alternatives permitted by the schema, whereas the structure of the available documents is much more tightly constrained. Furthermore, the documents themselves often contain cues that signal additional constraints on their structure. The values of certain elements, which we call discriminators, imply that the structure and/or content of the portion of the document that contains them will be constrained in particular ways. In effect, an element that contains a discriminator with a specific value corresponds to a constrained subtype of the containing element's type as defined by the schema. Discriminators also typically supply semantic information about the nature or intent of a document portion that is more specific than the information supplied by the name of the corresponding element or type in the XML schema. Given a set of discriminators, the invention describes how to build a statistical summary of document structures, called a Semantic Data Guide, that captures the structure and variation of documents in the repository and helps users understand and identify the information relevant to their purpose.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 46% of the total text.

This invention addresses the problems of overly complex and general XML schemas in two ways. Firstly, we provide a statistical summary of the structure of documents in a collection. We call this summary a Semantic Data Guide (SDG). The SDG is simpler than the XML schema associated with the documents, because it only includes alternatives that actually occur, rather than all those that are hypothetically possible, and the SDG includes statistical information specific to the collection, including the prevalence of constructs in the collection as a whole and in particular contexts. Secondly, our invention makes use of discriminators to better describe the documents in a collection. A discriminator is a rule which describes how the structure and/or content of the document is constrained depending on other values in the document. A discriminator also typically supplies semantic information about the nature or intent of a document portion that is more specific than the information supplied by the name of the corresponding element or type in the XML schema. In the SDG, a document element containing a discriminator is split into multiple elements, each of which is bound to a particular discriminator value and represents a different semantic purpose and its correspondingly constrained document structure.
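The splitting of a discriminated element can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and does not come from the disclosure: the `DISCRIMINATORS` table, the element names `record`, `observation`, and `code`, and the `tag[child=value]` label syntax are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical discriminator table: element tag -> child tag whose text
# value selects the subtype. Neither name comes from the disclosure.
DISCRIMINATORS = {"observation": "code"}


def label(elem):
    """Return the guide label for an element, split by discriminator value."""
    child_tag = DISCRIMINATORS.get(elem.tag)
    if child_tag is not None:
        value = elem.findtext(child_tag)
        if value is not None:
            # Bind the element to its discriminator value.
            return f"{elem.tag}[{child_tag}={value.strip()}]"
    return elem.tag


def labeled_paths(elem, prefix=""):
    """Return the set of labeled root-to-node paths at and below elem."""
    path = f"{prefix}/{label(elem)}"
    paths = {path}
    for child in elem:
        paths |= labeled_paths(child, path)
    return paths


doc = ET.fromstring(
    "<record>"
    "<observation><code>BP</code><systolic>120</systolic></observation>"
    "<observation><code>WT</code><kg>70</kg></observation>"
    "</record>"
)
for p in sorted(labeled_paths(doc)):
    print(p)
```

In this sketch the two `observation` elements end up under separate labels, so the structure beneath each discriminator value is summarized separately rather than merged into one generic `observation` node.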

The most closely related prior work is a paper by Goldman and Widom [1] that describes the Data Guide, which provides a summary description of a LORE database. The LORE data model is similar to XML but more general, which makes their algorithm unnecessarily complex for XML documents. The statistics they collect are limited and oriented toward query optimization, not clarification of document structure, and they do not make use of discriminators to refine the document structure as defined by the schema.

At a high level, the algorithm works as follows:
Parse the input XML document.
Generate the document-level Semantic Data Guide.
Merge the document-level Semantic Data Guide into the global Semantic Data Guide.
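
The three steps above can be sketched in Python as follows. This is an illustrative outline only, not the disclosed implementation: it assumes the document-level guide is simply the set of distinct element paths in one document, and the names `document_guide` and `GlobalGuide` are hypothetical. Discriminator handling is omitted to keep the sketch short.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def document_guide(xml_text):
    """Steps 1 and 2: parse one document and collect its distinct paths."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(elem, prefix):
        path = f"{prefix}/{elem.tag}"
        paths.add(path)
        for child in elem:
            walk(child, path)

    walk(root, "")
    return paths


class GlobalGuide:
    """Step 3: accumulate document-level guides across the collection."""

    def __init__(self):
        self.doc_count = 0               # documents merged so far
        self.docs_with_path = Counter()  # path -> number of documents containing it

    def merge(self, doc_paths):
        self.doc_count += 1
        # Each path counts once per document because doc_paths is a set.
        self.docs_with_path.update(doc_paths)


guide = GlobalGuide()
for text in ["<a><b><c/></b></a>", "<a><b/></a>", "<a><d/></a>"]:
    guide.merge(document_guide(text))

print(guide.doc_count, dict(guide.docs_with_path))
```

Building a per-document set before merging keeps the global counts in units of documents, which is the unit the occurrence statistics are defined in.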

Along with the path context and label information, each node in the hierarchy of the Semantic Data Guide also contains the following statistics, collected from the documents in a collection:

1. Total occurrences: percentage of documents in the collection that contain the path corresponding to this node. For example, if the path /a/b/c occurs in 30 out of 100 documents in the warehouse, total occurrences(/a/b/c) = 30/100 = 30%.

2. Contextual occurrences: percentage of documents in the collection that contain the immediate prefix of the path corresponding to this node that also contain the path corresponding to the node itself. For example, if 60 of the 100 documents above contain the path /a/b, then contextual occurrences(/a/b/c) = 30/60 = 50%.

3. Maximum and minim... of this child in documents in the collection that match the immediate prefix of this path.
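
The total and contextual occurrence statistics can be checked with a short Python sketch that reuses the figures from the examples above (100 documents, 60 containing /a/b, 30 containing /a/b/c). The function names and the dictionary of per-path document counts are hypothetical, and the count of 100 for /a is an assumption added so the table is complete.

```python
def parent_path(path):
    """Immediate prefix of a path, e.g. '/a/b/c' -> '/a/b'."""
    return path.rsplit("/", 1)[0]


def total_occurrences(docs_with_path, doc_count, path):
    """Fraction of all documents that contain the path."""
    return docs_with_path.get(path, 0) / doc_count


def contextual_occurrences(docs_with_path, doc_count, path):
    """Fraction of documents containing the immediate prefix that also contain the path."""
    prefix = parent_path(path)
    # Assumption: a root path has no prefix, so every document is its context.
    context = docs_with_path.get(prefix, 0) if prefix else doc_count
    return docs_with_path.get(path, 0) / context if context else 0.0


# Document counts per path; the value for "/a" is assumed, the others
# are taken from the worked examples in the text.
counts = {"/a": 100, "/a/b": 60, "/a/b/c": 30}

print(total_occurrences(counts, 100, "/a/b/c"))       # 0.3
print(contextual_occurrences(counts, 100, "/a/b/c"))  # 0.5
```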