Representation of Life Sciences publication data in relational form

IP.com Disclosure Number: IPCOM000030587D
Original Publication Date: 2004-Aug-18
Included in the Prior Art Database: 2004-Aug-18
Document File: 2 page(s) / 76K

Publishing Venue

IBM

Abstract

Life sciences data exist in a number of different forms. One such form, for publications, is PubMed. This invention describes a representation of PubMed data in relational form, so that it may be queried using relational tools such as DB2.

This invention consists of two pieces: a relational schema and a mapping from the hierarchical PubMed schema to the relational one. Specifically not covered by this invention are the hierarchical PubMed schema itself and the mechanism that uses the mapping to perform the conversion. Aspects of these two items are relevant to the operation of the invention and are referenced below.

For the purposes of this discussion, the representations of the two schemas and the mapping are somewhat arbitrary. For simplicity, the representations chosen are closely related to the way that these items are implemented. The hierarchical schema is described by a set of Extensible Markup Language (XML) Document Type Definitions (DTDs). The relational schema is represented by a set of SQL CREATE TABLE statements. Finally, the mappings are represented by a table relating a target table and column to an XML XPath expression; this expression describes the source of the relational data as it appears in the hierarchical (XML) schema.
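As a rough illustration of such a mapping table, the sketch below pairs a target table and column with an XPath expression and applies each row to one XML record. The table name, column names, XPath expressions and sample record are hypothetical and chosen only for illustration; they are not the mapping defined by this disclosure, and the conversion mechanism itself is outside its scope.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping rows: (target table, target column, XPath expression).
# Each XPath is evaluated relative to a single article element.
MAPPING = [
    ("Article", "PMID", "./MedlineCitation/PMID"),
    ("Article", "Title", "./MedlineCitation/Article/ArticleTitle"),
]


def map_article(article, mapping):
    """Return {table: {column: value}} for one article element."""
    rows = {}
    for table, column, xpath in mapping:
        node = article.find(xpath)
        # A missing source element becomes None (a SQL NULL on insert).
        value = node.text if node is not None else None
        rows.setdefault(table, {})[column] = value
    return rows


# Invented sample record, used only to exercise the sketch.
SAMPLE = """<PubmedArticle>
  <MedlineCitation>
    <PMID>12345</PMID>
    <Article><ArticleTitle>An example title</ArticleTitle></Article>
  </MedlineCitation>
</PubmedArticle>"""

rows = map_article(ET.fromstring(SAMPLE), MAPPING)
```

Keeping the mapping as data rather than code means a new target column can be added by appending a row, without changing the function that applies it.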

The mapping operation takes place in two distinct parts. First, a transformation is applied to the XML data; this converts certain elements into canonical forms and simplifies the actual mapping. The second part is the mapping itself.

Four transformations may take place. These are for general item lists, author names (and lists of author names), dates and pagination.

Unless otherwise noted, lists of items that are de-normalized into a single column will have individual items separated by a semicolon and a single space. For instance, if a particular entry contained the keywords "dnaA gene", "dnaN gene" and "orf187", the corresponding Keywords column would contain the value 'dnaA gene; dnaN gene; orf187'.
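The list convention can be sketched in a few lines. The function name `join_items` is an assumption made for this example; only the separator, a semicolon followed by a single space, comes from the text above.

```python
def join_items(items, separator="; "):
    """De-normalize a list of items into one column value."""
    # Strip surrounding whitespace so the separator is the only padding.
    return separator.join(item.strip() for item in items)


keywords = ["dnaA gene", "dnaN gene", "orf187"]
keywords_column = join_items(keywords)
print(keywords_column)  # dnaA gene; dnaN gene; orf187
```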

Names in the PubMed schema consist of a requi...