A method to codify Amino Acid Representations

IP.com Disclosure Number: IPCOM000019239D
Original Publication Date: 2003-Sep-08
Included in the Prior Art Database: 2003-Sep-08
Document File: 1 page(s) / 41K

Publishing Venue

IBM

Abstract

A Method to Codify Genomic Sequencing Representations

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 55% of the total text.

  Strings of letters are used to represent the nucleotide sequences of DNA and RNA chains. DNA uses four letters, A, G, C and T, which stand for the nucleotides adenine, guanine, cytosine and thymine, respectively. Likewise, RNA uses four letters to represent its four nucleotides: adenine, guanine, cytosine and uracil. The basic motivation for representing nucleotides as letters is to save time and space. All bioinformatics applications assume input in the form of one-letter strings, so each nucleotide consumes 8 bits of memory. When bioinformatics applications reside in a remote processor farm and XML must be used to send input strings over the wire, very long DNA and RNA sequences cause the XML document to become extremely large. For example, the rice genome contains 45,000 to 56,000 genes, each of which is 4,500 base pairs long, whereas the human genome has 30,000 to 40,000 genes, each of which is 72,000 base pairs long.
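
The size argument can be made concrete with a little arithmetic. The sketch below uses only the gene counts and gene lengths quoted in this paragraph and assumes one 8-bit character per base. The helper name sequence_bytes is hypothetical, and the 2-bit figure is only the minimum needed to tell four letters apart, shown for comparison; it is not presented here as the encoding this disclosure proposes.

```python
# Illustrative arithmetic only; all genome figures are the ones quoted in the text.
BITS_PER_CHAR = 8  # a one-letter code stored as one 8-bit character


def sequence_bytes(genes, base_pairs_per_gene, bits_per_base=BITS_PER_CHAR):
    """Return the bytes needed to hold every gene at the given bits per base."""
    total_bits = genes * base_pairs_per_gene * bits_per_base
    return (total_bits + 7) // 8  # round up to whole bytes


rice_low = sequence_bytes(45_000, 4_500)
rice_high = sequence_bytes(56_000, 4_500)
human_low = sequence_bytes(30_000, 72_000)
human_high = sequence_bytes(40_000, 72_000)

print(f"rice:  {rice_low:,} to {rice_high:,} bytes at 8 bits per base")
print(f"human: {human_low:,} to {human_high:,} bytes at 8 bits per base")

# Assumption for comparison: four symbols can be distinguished with 2 bits each.
rice_low_2bit = sequence_bytes(45_000, 4_500, bits_per_base=2)
print(f"rice (low estimate) at 2 bits per base: {rice_low_2bit:,} bytes")
```

These totals cover the raw sequence characters only; any XML markup wrapped around them would add to the document size.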

    Similarly, amino acids are represented by three-letter abbreviations, such as Ala for alanine and Arg for arginine. Scientists use these abbreviations to represent long chains of amino acid sequences, which are, in turn, used to define proteins and genes. All bioinformatics applications, such as blast, clustalw and genescan, use these abbreviations as their inputs and outputs. Therefore, such programs treat the homology search tools as simple string match algorit...
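
As a rough illustration of what a three-letter representation costs, the following sketch splits a concatenated chain into residues and runs a plain substring search over it. The sample chain, the motif and the helper name split_residues are hypothetical, and the concatenated "AlaArgGly" layout is an assumption made for the example rather than a format taken from this disclosure.

```python
def split_residues(chain):
    """Split a concatenated three-letter chain such as 'AlaArgGly' into residues."""
    if len(chain) % 3 != 0:
        raise ValueError("chain length must be a multiple of 3")
    return [chain[i:i + 3] for i in range(0, len(chain), 3)]


chain = "AlaArgGlyAla"  # hypothetical sample: alanine, arginine, glycine, alanine
residues = split_residues(chain)
size_in_bytes = len(chain.encode("ascii"))  # three 8-bit characters per residue

print(residues)
print(size_in_bytes, "bytes for", len(residues), "residues")

# A plain string match works on characters, so the result is a character
# offset; dividing by 3 converts it to a residue index when it is in frame.
motif = "ArgGly"
offset = chain.find(motif)
print("character offset:", offset, "residue index:", offset // 3)
```

Because the search operates on raw characters, a match could in principle begin in the middle of a residue, so a caller would need to check that the offset is a multiple of three.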