Browse Prior Art Database

# Maximum Likelihood Training of Probabilistic Linear Functions (Probabilistic Perceptrons)

IP.com Disclosure Number: IPCOM000121130D
Original Publication Date: 1991-Jul-01
Included in the Prior Art Database: 2005-Apr-03
Document File: 7 page(s) / 241K

IBM

## Related People

Bakis, R: AUTHOR [+3]

## Abstract

We devise an EM algorithm for maximum likelihood (or Bayes) training of perceptrons (linear threshold functions) and single layer neural nets having vector inputs and bitstring outputs.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 38% of the total text.

Maximum Likelihood Training of Probabilistic Linear Functions (Probabilistic
Perceptrons)

We devise an EM algorithm for maximum likelihood (or
Bayes) training of perceptrons (linear threshold functions) and
single layer neural nets having vector inputs and bitstring outputs.

A linear threshold function, or "perceptron" is a parametric
binary function from d-dimensional Euclidean space to a two point set
(1)
given by
(2)
where   denotes transposition and   is the unit step
(3)

By a single layer neural net we mean a parametric function from
d-dimensional Euclidean space to the space of all strings of length k
on a two-point set:
(4)
defined by requiring that each coordinate of the output is computed
by a perceptron.  Thus, if z = fk(x), then
(5)

Our problem is to learn the pair (b,c) (or the pairs (bj,cj))
in a supervised way from the training data
(6)

It suffices to discuss the problem in the case of k = 1 (the
perceptron); we henceforth restrict ourselves to this case.

In the unlikely event that the two sets in
(7)
happen to be linearly separable, a linear function that achieves this
separation (i.e., zero error rate on the training data) can be
obtained by the celebrated perceptron training algorithm.  In the
more likely situation when these sets are not linearly separable,
then one can define some criterion or other and choose b and c by
optimization. Assuming that the two sets
(8)
are samples from two Gaussian distributions with a common nonsingular
covariance matrix S and mean vectors mi, R. A. Fisher's linear
discriminant function where
(9)
and
(10)
is an early solution to this problem.  In practice the required two
moments are replaced ("plug-in" rule) by the centroids of the two
sets and by a poled estimate of the common covariance matrix.  The
current approach seems to be to avoid the problem as posed and solve
instead the simpler problem which can be solved by an optimization of
a differentiable objective function.  This is done by replacing the
unit step function U by some differentiable approximation S
("sigmoid").

We are interested in solving the original perceptron training
problem without resorting to the "Sigmoid Solution."  Viewed from
this point of view, the Sigmoid Solution has certain unattractive
features.  These features include
\$    The use of minimum mean square error criterion where the error
is the difference between a bit z (= the actual data) and a real
number f(x) (= the output corresponding to the latest estimate of
(b,c)).  Real numbers are not bits and this Sigmoid Solution in a
sense pretends that they are.
\$    The use of mean square error criterion in a setup where the data
cannot possibly have a Gaussian distribution (except in the zero
variance degenerate case when no error is possible) or anything even
remotely resembling a Gaussian distribution.  Thus, the Sigmoid
Solution in a sense pretends that the distribut...