A novel method for pathway classification based on gene expression data and known gene networks

IP.com Disclosure Number: IPCOM000029900D
Original Publication Date: 2004-Jul-16
Included in the Prior Art Database: 2004-Jul-16
Document File: 3 page(s) / 27K

Publishing Venue

IBM

Abstract

Disclosed is a description of a novel method for pathway classification, which combines information extracted from gene expression (microarray) data with knowledge from well-understood gene regulatory pathways (networks). In this method, a compendium of pathways is created based on perturbed cellular states; the compendium is then used in conjunction with an augmented graph-matching algorithm to classify uncharacterized pathways.

This is the abbreviated version, containing approximately 48% of the total text.

1. Description

This section provides a concise description of the methodology.

Inputs: The primary inputs are gene expression data and gene regulatory pathways. Gene expression data provide a landscape of the mRNA (messenger ribonucleic acid) expression of thousands of genes at the same time, while gene regulatory networks provide clues about the regulation and expression of genes in a given cellular state.

Pathway construction and scoring: The gene regulatory pathways for a particular organism are collected from pathway databases and validated against the current scientific literature. For each directed relationship of type A → B between two entities (genes, transcription factors, proteins, protein complexes) in the pathway, a score is calculated from the gene expression values of the two entities. The scoring function uses measures based on correlation and mutual information. While correlation detects linear relationships among gene expression patterns, mutual information uncovers non-linear relationships between genes. These measures are combined linearly to yield a composite scoring function that is used to score all the relationships in a pathway.
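The composite score described above can be illustrated with a minimal sketch. The disclosure does not specify the estimators or the weights, so everything below is an assumption made for illustration: the function names (pearson, mutual_information, edge_score), the use of the absolute Pearson correlation, the equal-width binning used to estimate mutual information, the normalization by log2(bins), and the weight alpha.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant profile carries no linear signal
    return cov / (sx * sy)


def discretize(values, bins):
    """Map values to equal-width bin indices (an assumed discretization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]


def mutual_information(x, y, bins=3):
    """Histogram estimate of mutual information, in bits."""
    n = len(x)
    bx, by = discretize(x, bins), discretize(y, bins)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    mi = 0.0
    for (i, j), count in joint.items():
        mi += (count / n) * math.log2(count * n / (px[i] * py[j]))
    return mi


def edge_score(x, y, alpha=0.5, bins=3):
    """Linear combination of |correlation| and normalized mutual information.

    alpha is a hypothetical weight; the source only states that the two
    measures are combined linearly.
    """
    mi_norm = mutual_information(x, y, bins) / math.log2(bins)
    return alpha * abs(pearson(x, y)) + (1 - alpha) * mi_norm
```

For a profile pair such as x = [-2, -1, 0, 1, 2] and y = [4, 1, 0, 1, 4], the correlation term is zero while the mutual information term is positive, which is the kind of non-linear relationship the second measure is meant to capture.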

Data structure for compendium: The scored pathways are called model graphs (GM), and together they form the compendium. The model graphs are stored as attributed relational graphs (ARGs). An ARG is a directed graph in which nodes and edges are assigned labels.
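As a rough illustration of such a structure, the sketch below stores node labels and edge scores in dictionaries. The class name ARG, its fields, and the example entity names are hypothetical; the disclosure does not describe a concrete layout.

```python
from dataclasses import dataclass, field


@dataclass
class ARG:
    """Directed graph with labeled nodes and scored edges (illustrative only)."""
    name: str
    nodes: dict = field(default_factory=dict)  # node id -> label, e.g. entity type
    edges: dict = field(default_factory=dict)  # (source, target) -> edge score

    def add_node(self, node_id, label):
        self.nodes[node_id] = label

    def add_edge(self, source, target, score):
        # Edges are directed, so (A, B) and (B, A) are distinct keys.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges[(source, target)] = score


# Hypothetical model graph with two entities and one scored relationship.
model = ARG(name="example_pathway")
model.add_node("A", "transcription factor")
model.add_node("B", "gene")
model.add_edge("A", "B", 0.82)
```

A compendium could then be kept as a plain mapping from class label to ARG instance.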

Classification strategy: The problem addressed here is how to classify a putative pathway (GI), given a compendium of cellular pathways (GM). The problem is modeled abstractly as a graph-matching problem, since a graph is a natural data structure for representing a cellular network. The input pathway (GI) and the model graphs (GM) are stored as ARGs. A modified version of the "inexact graph matching algorithm" [1] is used to find the best possible match between two graphs. The input pathway is assigned the class label of the best-matching model graph (pathway). To compare the input graph with the compendium of model graphs and decide which model is most similar to the input, it is necessary to define a distance measure for graphs.
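The nearest-model classification step can be sketched as follows. This is not the modified inexact graph matching algorithm of [1], whose details are not given here; it is a deliberately simplified stand-in that assumes nodes in the input and model graphs share identifiers, so no search over node correspondences is needed. The distance definition, the missing_penalty parameter, and all names are assumptions.

```python
def graph_distance(input_edges, model_edges, missing_penalty=1.0):
    """Simplified distance between two graphs given as {(source, target): score}.

    Shared edges contribute the absolute difference of their scores; an edge
    present in only one graph contributes a fixed (assumed) penalty.
    """
    total = 0.0
    for edge in set(input_edges) | set(model_edges):
        if edge in input_edges and edge in model_edges:
            total += abs(input_edges[edge] - model_edges[edge])
        else:
            total += missing_penalty
    return total


def classify(input_edges, compendium):
    """Return the label of the model graph closest to the input graph."""
    return min(
        compendium,
        key=lambda label: graph_distance(input_edges, compendium[label]),
    )


# Hypothetical compendium of two model graphs and one uncharacterized pathway.
compendium = {
    "pathway_1": {("A", "B"): 0.9, ("B", "C"): 0.8},
    "pathway_2": {("A", "C"): 0.7, ("C", "D"): 0.6},
}
query = {("A", "B"): 0.85, ("B", "C"): 0.7}
label = classify(query, compendium)
```

In this toy example the query shares both of its edges with pathway_1 and none with pathway_2, so it receives the label pathway_1.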

Given two ARGs GM and GI, the goal is to find the best matching between their nodes that leads to the smallest matching...