Extracting Hierarchical Metadata from Data

IP.com Disclosure Number: IPCOM000209502D
Publication Date: 2011-Aug-08
Document File: 2 page(s) / 19K

Publishing Venue

The IP.com Prior Art Database

Abstract

Disclosed is a method for scanning scientific data in order to extract metadata that can act as a summary of the data values for each field. Employing this method allows scientists to more readily locate the information they need.

This is the abbreviated version, containing approximately 51% of the total text.

Many environmental sensors provide a continuous stream of observations. Each individual observation consists of one or more sensor measurements, a geographic location, and a time. Environmental simulation models add to the vast stores of data. With billions of data points stored in diverse databases and in thousands of datasets, scientists have difficulty finding data relevant to their research interests.

The accepted solution is to describe the contents of each dataset with descriptive metadata. Currently, most metadata for scientific data is written manually by the scientist who created the dataset; automatically extracted metadata tends to be limited to file system information such as the creation date and the size of the dataset. Once created, the metadata can be browsed or searched by other scientists. The metadata is generally provided at a single level: whether a metadata record describes a small or large amount of data, or whether the data covers a small or large range of values, is left entirely to the individual scientist. Data access and visualization approaches to searching for relevant data instead provide tools that let a scientist scan each dataset individually for relevant content. These approaches do not meet scientists' needs once the number of datasets to review grows into the hundreds or thousands.

Environmental data, whether collected from sensors or produced by environmental simulation models, generally contains one record for each individual observation or data point. Each record consists of multiple dimensions or fields, with each field representing a specific data item. Such data is stored in a variety of formats and storage systems, including databases, NetCDF, and formatted files.
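
As an illustration only, a single observation record of the kind described above might be modeled as follows. The class name and the field names (time, latitude, longitude, measurements) are assumptions chosen for this sketch and do not come from the disclosure.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Observation:
    """One observation record; all field names here are hypothetical."""
    time: datetime
    latitude: float
    longitude: float
    # Sensor measurements keyed by a (hypothetical) measurement name.
    measurements: dict[str, float] = field(default_factory=dict)


def field_names(obs: Observation) -> list[str]:
    """List the fields of a record, flattening the measurement names."""
    return ["time", "latitude", "longitude", *sorted(obs.measurements)]


sample = Observation(
    time=datetime(2011, 8, 8, 12, 0, tzinfo=timezone.utc),
    latitude=45.5,
    longitude=-122.7,
    measurements={"salinity": 31.2, "temperature": 11.8},
)
```

Here each measurement name acts as one field of the record, alongside the location and time fields.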

In the disclosed method, the scientific data is scanned to extract metadata that can act as a summary of the data values for each field; examples of such a summary are the minimum and maximum values or a histogram of values found in each field. (If desired, only a subset of available fields may be processed.) During this processing the values may be transformed into another, more compact representation (e.g., a series of times may be transformed into an interval, or a series of points representing locations of a mobile platform may be approximated by a polyline representing the platform's path). This combination...
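
The per-field summaries mentioned above (minimum, maximum, histogram) and the collapse of a time series into an interval can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: records are assumed to be plain dictionaries, the function names are hypothetical, and the histogram uses equal-width bins, a choice the disclosure does not specify.

```python
from collections import Counter
from datetime import datetime


def summarize_field(values, bins=4):
    """Return min, max, and an equal-width histogram for numeric values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 when all values are equal
    # Clamp the top edge so that the maximum value lands in the last bin.
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    return {"min": lo, "max": hi,
            "histogram": [counts.get(i, 0) for i in range(bins)]}


def summarize_records(records, fields=None):
    """Summarize each requested numeric field; fields=None means every field."""
    names = fields if fields is not None else sorted(records[0])
    return {name: summarize_field([r[name] for r in records]) for name in names}


def time_interval(times):
    """Collapse a series of timestamps into a compact (start, end) interval."""
    return min(times), max(times)


# Hypothetical sample data: four hourly observations.
records = [
    {"time": datetime(2011, 8, 8, 0), "temperature": 10.0, "depth": 1.0},
    {"time": datetime(2011, 8, 8, 1), "temperature": 12.0, "depth": 1.0},
    {"time": datetime(2011, 8, 8, 2), "temperature": 14.0, "depth": 1.0},
    {"time": datetime(2011, 8, 8, 3), "temperature": 18.0, "depth": 1.0},
]

# Only a subset of the available fields is processed here.
summary = summarize_records(records, fields=["temperature", "depth"])
interval = time_interval([r["time"] for r in records])
```

Passing an explicit list of fields mirrors the option of processing only a subset of the available fields. The time field is handled separately because it is reduced to an interval rather than a histogram.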