# Optimal Splitting Algorithm for Arbitrary Random Predictors in the Two-Class Problem

Original Publication Date: 1989-Nov-01

Included in the Prior Art Database: 2005-Jan-29

## Publishing Venue

IBM

## Related People

Nadas, A: AUTHOR [+1]

## Abstract

Disclosed is an algorithm for finding the optimal split in the sample space of a continuous random variable. If (X,Y) has a distribution such that Y has two possible values and X is continuous (e.g., a multivariate Gaussian mixture), then for predicting Y the best question about X cannot be found either by exhaustive search or by sorting the conditional probabilities P(Y = 1 X = x) as in the case of X with finitely many values. An algorithm is given to define the optimal split (question) in the continuous case.

**This text was extracted from a PDF file.**

**At least one non-text object (such as an image or picture) has been suppressed.**

**This is the abbreviated version, containing approximately 52% of the total text.**

__Page 1 of 3__**Optimal Splitting Algorithm for Arbitrary Random Predictors in the Two-
Class Problem **

Disclosed is an algorithm for finding the optimal split in the sample space of a continuous random variable. If (X,Y) has a distribution such that Y has two possible values and X is continuous (e.g., a multivariate Gaussian mixture), then for predicting Y the best question about X cannot be found either by exhaustive search or by sorting the conditional probabilities P(Y = 1 X = x) as in the case of X with finitely many values. An algorithm is given to define the optimal split (question) in the continuous case.

If (X,Y) has a distribution such that Y (class index) has two possible values and X is continuous (e.g., multivariate or scalar Gaussian or mixtures), then for predicting Y the best question about X cannot be found either by exhaustive search or by sorting the conditional probabilities P(Y = 1 X = x) as in the case of X with finitely many values. An algorithm is given to define the optimal split (question) in the continuous case based on the conditional probability density functions p(x Y = 1), p(x Y = 2) given the classes. When incorporated into our iterative splitting algorithms, this allows the approximation of the most informative question about an arbitrary random variable in the N - class problem as well and hence allows construction of decision trees for the general classification problem with arbitrary random variables as predictors.

Let (X,Y) have a bivariate distribution with Y having marginal probabilities pi = Prob(Y = 1) for i = 1, ..., N and X having conditional probability densities pi(x) for i = 1, ...N and x e S, where S is an arbitrary sample space such as k dimensional Euclidean space. A problem of great interest to growers of decision trees is this: what question of the form "Is X e B?" is most informative about the random class index Y? In other words, how is one to find a subset B of the sample space S which maximizes the Shannon mutual information (or similar measures based on entropy-like convex functions) I(w(X e B); Y) where w(A) is one or zero accordingly as A occurs or not. There is no published algorithm available for this except for the special case wherein S is a finite set of size m and N = 2 (see Theorem 4.6 of [*]). In this case the probability densities are with respect to counting measure so they are in fact probabilities. The algorithm is to sort the p1(x1), i = 1, ..., m and compute the information in each question defined by an initial segment of the sorted list; the theorem guarantees the best question among these m-1 questions.

THE ALGORITHM: We consider the case of arbitrary S and N = 2 STEP 0: INITIALIZE:

Let X1, ..., Xm be a small sample from the marginal distribution of X (e.g., a random subsample of a larger sample). Let define the conditional class probabilities given...