
Method for distributing and validating data files in a clustered computing environment

IP.com Disclosure Number: IPCOM000015918D
Original Publication Date: 2002-May-27
Included in the Prior Art Database: 2003-Jun-21
Document File: 3 page(s) / 71K

Publishing Venue

IBM

Abstract

Data distribution in clustered computing environments has been an issue in high performance computing for many years. However, two recent developments have introduced new issues in this problem domain:

* The recent rise in "commodity clusters", primarily built with Linux on x86 hardware
* Data growth in the Life Sciences industry, particularly as it relates to genomic and proteomic data

Linux clusters, commonly known as "Beowulf" clusters, are increasingly being sought as cost-effective solutions to problems requiring high performance computing systems. The nature of a Linux cluster is to interconnect a group of servers (typically x86-based servers connected via 10/100 Mbps Ethernet) and to run parallel or "embarrassingly parallel" applications across the cluster. A considerable challenge in this environment is moving data across the cluster in an efficient, reliable, and dynamic manner. Further, the problem grows linearly with the size of the cluster.


The Life Sciences industry has embraced Linux clusters for many purposes, but one common use is running distributed "similarity searches" of genomic and proteomic data across the cluster. The main goal of these applications is to search through large databases for matches to a specific sequence. The data is typically retrieved from online databases, available via HTTP or FTP, to a "staging" node, and then processed and distributed to the computational (or "compute") nodes in the cluster, which run the searching program. The databases in this scenario are typically text-based and non-relational, and they are growing at an accelerating rate as genomic and proteomic research expands (e.g., GenBank growth statistics: http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html).
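
As an illustration of the retrieval step described above, the following Python sketch streams a database release over HTTP or FTP to local storage on the staging node. The URL and file paths are hypothetical placeholders, not part of the disclosure.

    import shutil
    import urllib.request

    SOURCE_URL = "ftp://ftp.example.org/db/nr.gz"  # hypothetical database release
    STAGING_PATH = "/staging/nr.gz"                # local copy on the staging node

    def fetch_to_staging(url, dest):
        # Stream the remote file to disk in 1 MiB chunks so that
        # multi-gigabyte databases never have to fit in memory.
        with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
            shutil.copyfileobj(response, out, length=1024 * 1024)

    fetch_to_staging(SOURCE_URL, STAGING_PATH)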

One of the key challenges is the efficient integration of large database files across a clustered environment. Additionally, an efficient method is needed for keeping this data synchronized and updated across the clustered system. Because of the size and rate of growth of the database files encountered in these clusters, comparing the data files byte-for-byte has significant performance implications, which in turn have significant business implications (e.g., increased time-to-market in drug discovery). A more efficient method is to generate a statistical profile of each data file and then compare these digital "fingerprints," as sketched below.
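
The disclosure does not spell out what the statistical profile contains; a minimal sketch, assuming the fingerprint combines the file size with a digest of a few evenly spaced sample blocks, might look like this in Python:

    import hashlib
    import os

    BLOCK = 64 * 1024  # bytes hashed at each sample point
    SAMPLES = 8        # evenly spaced sample points across the file

    def fingerprint(path):
        # Hash the file size plus SAMPLES sampled blocks rather than
        # the whole file; MD5 is used only as a fast digest here.
        size = os.path.getsize(path)
        digest = hashlib.md5(str(size).encode())
        with open(path, "rb") as f:
            for i in range(SAMPLES):
                f.seek((size * i) // SAMPLES)
                digest.update(f.read(BLOCK))
        return digest.hexdigest()

    # Two replicas are considered synchronized when fingerprints match:
    # fingerprint("/staging/nr.gz") == fingerprint("/data/nr.gz")

Because the number of sampled blocks is fixed, the cost of computing such a fingerprint stays roughly constant as the database grows, whereas a whole-file checksum scales with file size.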

This method provides a framework for addressing a growing problem in clustered computing systems by supplying data file retrieval, distribution, and validation functionality. The framework differs from other approaches in its applied methods, which emphasize performance and scalability by using light-weight statistical "fingerprints" for comparative analysis. Additionally, this method provides a distribution and validation sol...
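
To tie the pieces together, here is an illustrative sketch of a retrieve/distribute/validate cycle, reusing fetch_to_staging() and fingerprint() from the sketches above; the node names, the fingerprint.py wrapper, and the use of ssh/rsync are assumptions made for this example, not part of the disclosed method:

    import subprocess

    COMPUTE_NODES = ["node01", "node02", "node03"]  # hypothetical cluster nodes
    DB_FILE = "/staging/nr.gz"

    def remote_fingerprint(node, path):
        # Run the fingerprint routine on a compute node over ssh; assumes a
        # hypothetical fingerprint.py wrapper is installed on each node.
        cmd = ["ssh", node, "python3", "/usr/local/bin/fingerprint.py", path]
        return subprocess.run(cmd, capture_output=True, text=True).stdout.strip()

    def synchronize(path):
        # Compute the reference fingerprint once on the staging node, then
        # copy the file only to nodes whose local copy no longer matches.
        expected = fingerprint(path)
        for node in COMPUTE_NODES:
            if remote_fingerprint(node, path) != expected:
                subprocess.run(["rsync", path, node + ":" + path], check=True)

    synchronize(DB_FILE)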