**Optimal Splitting Algorithm for Arbitrary Random Predictors in the Two-Class Problem**

Disclosed is an algorithm for finding the optimal split in the sample space of
a continuous random variable. If (X,Y) has a distribution such that Y has two
possible values and X is continuous (e.g., a multivariate Gaussian mixture), then
for predicting Y the best question about X cannot be found either by exhaustive
search or by sorting the conditional probabilities P(Y = 1 | X = x), as is done in
the case of X with finitely many values. An algorithm is given to define the
optimal split (question) in the continuous case.

If (X,Y) has a distribution such that Y (class index) has two possible values
and X is continuous (e.g., multivariate or scalar Gaussian or mixtures), then for
predicting Y the best question about X cannot be found either by exhaustive
search or by sorting the conditional probabilities P(Y = 1 | X = x), as is done in
the case of X with finitely many values. An algorithm is given to define the optimal
split (question) in the continuous case based on the conditional probability density
functions p(x | Y = 1), p(x | Y = 2) given the classes. When incorporated into our
iterative splitting algorithms, this allows the approximation of the most informative
question about an arbitrary random variable in the N-class problem as well and
hence allows construction of decision trees for the general classification problem
with arbitrary random variables as predictors.

Let (X,Y) have a bivariate distribution with Y having marginal probabilities pi =
Prob(Y = i) for i = 1, ..., N and X having conditional probability densities pi(x) for
i = 1, ..., N and x ∈ S, where S is an arbitrary sample space such as k-dimensional
Euclidean space. A problem of great interest to growers of decision trees is this:
what question of the form "Is X ∈ B?" is most informative about the random class
index Y? In other words, how is one to find a subset B of the sample space S
which maximizes the Shannon mutual information (or similar measures based on
entropy-like convex functions) I(w(X ∈ B); Y), where w(A) is one or zero
according as A occurs or not. There is no published algorithm available for this
except for the special case wherein S is a finite set of size m and N = 2 (see
Theorem 4.6 of [*]). In this case the probability densities are with respect to
counting measure, so they are in fact probabilities. The algorithm is to sort the
p1(xi), i = 1, ..., m, and compute the information in each question defined by an
initial segment of the sorted list; the theorem guarantees the best question is
among these m-1 questions.
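The finite-case sorting procedure can be sketched as follows. This is an illustrative reconstruction, not the disclosure's own code: the function names, the use of Shannon entropy in bits, and the example numbers are all assumptions.

```python
import math

def entropy(probs):
    """Shannon entropy (bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def split_information(joint):
    """Mutual information I(w(X in B); Y) from the 2x2 joint table
    joint[b][y] = P(w(X in B) = b, Y = y)."""
    p_b = [sum(row) for row in joint]          # marginal of the split indicator
    p_y = [sum(col) for col in zip(*joint)]    # marginal of the class index
    return entropy(p_b) + entropy(p_y) - entropy([p for row in joint for p in row])

def best_initial_segment_split(p_x, p_y1_given_x):
    """Finite case (|S| = m, N = 2): sort the m values of X by P(Y=1 | X=x)
    and score the m-1 questions given by initial segments of the sorted list.
    Returns (best subset B as a set of indices, its mutual information)."""
    m = len(p_x)
    order = sorted(range(m), key=lambda i: p_y1_given_x[i], reverse=True)
    best_B, best_I = None, -1.0
    for k in range(1, m):                      # the m-1 initial segments
        inside, outside = order[:k], order[k:]
        joint = [
            [sum(p_x[i] * p_y1_given_x[i] for i in inside),
             sum(p_x[i] * (1 - p_y1_given_x[i]) for i in inside)],
            [sum(p_x[i] * p_y1_given_x[i] for i in outside),
             sum(p_x[i] * (1 - p_y1_given_x[i]) for i in outside)],
        ]
        info = split_information(joint)
        if info > best_I:
            best_B, best_I = set(inside), info
    return best_B, best_I
```

For example, with four equally likely values of X and P(Y = 1 | X = xi) = 0.9, 0.8, 0.1, 0.2, the best of the three initial-segment questions separates the first two values from the last two.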

THE ALGORITHM: We consider the case of arbitrary S and N = 2.

STEP 0: INITIALIZE:

Let X1, ..., Xm be a small sample from the marginal distribution of X (e.g., a
random subsample of a larger sample). Define the conditional class
probabilities given...
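The text of STEP 0 is cut off here, but the setup it describes can be sketched under stated assumptions: a scalar X whose two class-conditional densities p1(x), p2(x) are Gaussian, with the priors, means, standard deviations, and sample size all chosen purely for illustration. The conditional class probability at each sample point follows from Bayes' rule.

```python
import math
import random

# Illustrative model (not from the disclosure): two Gaussian class densities.
PRIOR = [0.5, 0.5]           # pi = Prob(Y = i)
MEANS, SDS = [0.0, 2.0], [1.0, 1.0]

def density(x, i):
    """Class-conditional density pi(x) for class i (0-based here)."""
    z = (x - MEANS[i]) / SDS[i]
    return math.exp(-0.5 * z * z) / (SDS[i] * math.sqrt(2 * math.pi))

def conditional_class_prob(x):
    """P(Y = 1 | X = x) by Bayes' rule from the class densities and priors."""
    num = PRIOR[0] * density(x, 0)
    return num / (num + PRIOR[1] * density(x, 1))

# STEP 0: draw a small sample X1, ..., Xm from the marginal of X and attach
# the conditional class probability to each point.
random.seed(0)
sample = []
for _ in range(20):
    i = 0 if random.random() < PRIOR[0] else 1
    sample.append(random.gauss(MEANS[i], SDS[i]))
probs = [conditional_class_prob(x) for x in sample]
```

With equal priors and equal variances, the conditional probability is exactly 0.5 at the midpoint of the two means, which gives a quick sanity check on the Bayes computation.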