
FOOLPROOF CONVERGENCE IN MULTICHAIN POLICY ITERATION

IP.com Disclosure Number: IPCOM000148677D
Original Publication Date: 1976-Mar-09
Included in the Prior Art Database: 2007-Mar-30

Publishing Venue

Software Patent Institute

Related People

Schweitzer, P.J.: AUTHOR [+3]

Abstract

FOOLPROOF CONVERGENCE IN MULTICHAIN POLICY ITERATION. P. J. Schweitzer, IBM Thomas J. Watson Research Center, Yorktown Heights, New York 10598, and A. Federgruen, Mathematical Center, Amsterdam, Holland. ABSTRACT: An example for undiscounted multichain Markov Renewal Programming shows that policies may exist such that the Policy Iteration Algorithm (PIA) can converge to these policies for some (but not all) choices of the additive constants in the relative values, and as a consequence that the PIA may cycle if the relative values are improperly determined. A class of rules for choosing the additive constants is given sufficient to guarantee the convergence of the PIA, as well as necessary and sufficient conditions for a policy to have the property that the PIA can converge to it for any relative value vector. Finally we give some properties of the policies that exhibit this foolproof convergence.

This text was extracted from a PDF file.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 18% of the total text.

Page 1 of 16

FOOLPROOF CONVERGENCE IN MULTICHAIN POLICY ITERATION
P. J. Schweitzer
IBM Thomas J. Watson Research Center
Yorktown Heights, New York 10598
and
A. Federgruen
Mathematical Center
Amsterdam, Holland
ABSTRACT: An example for undiscounted multichain Markov
Renewal Programming shows that policies may exist such
that the Policy Iteration Algorithm (PIA) can converge to
these policies for some (but not all) choices of the additive
constants in the relative values, and as a consequence that
the PIA may cycle if the relative values are improperly
determined.

    A class of rules for choosing the additive constants is
given sufficient to guarantee the convergence of the PIA, as
well as necessary and sufficient conditions for a policy to
have the property that the PIA can converge to it for any
relative value vector. Finally we give some properties of the
policies that exhibit this foolproof convergence.

RC 5894 (#25479)  3/9/76  Mathematics
13 pages



LIMITED DISTRIBUTION NOTICE


This report has been submitted for publication elsewhere and has been issued as a Research Report for early dissemination
of its contents. As a courtesy to the intended publisher, it should not be widely distributed until after the date of outside publication.

Copies may be requested from:
IBM Thomas J. Watson Research Center
Post Office Box 218
Yorktown Heights, New York 10598



                        Introduction
We consider a Markov Renewal Program, with Ω = {1,...,N} as state
space and K(i) as the finite set of alternatives in state i (1 ≤ i ≤ N).
Let q_i^k denote the one-step expected reward and P_ij^k ≥ 0 the transition
probability to state j, when alternative k ∈ K(i) is used in state i
(Σ_j P_ij^k = 1). Finally, τ_ij^k > 0 denotes the expected holding time, when
alternative k is used in state i, given j is the next state of the system.

    A (stationary) randomized policy f is characterized by a tableau [f_ik]
satisfying f_ik ≥ 0 and Σ_{k∈K(i)} f_ik = 1 for all i ∈ Ω, where f_ik denotes
the probability that the kth alternative is chosen in state i. We let S_R
indicate the set of all randomized policies and S_P the subset of all pure
(non-randomized) policies, i.e., for f ∈ S_P each f_ik = 0 or 1. For f ∈ S_P,
f(i) ∈ K(i) denotes the single alternative used in state i. Associated
with each f ∈ S_R are the N-component reward vector q(f), holding time vector
T(f), and two matrices P(f) and H(f):

    P(f)_ij = Σ_{k∈K(i)} f_ik P_ij^k;    H(f)_ij = Σ_{k∈K(i)} f_ik P_ij^k τ_ij^k    (1 ≤ i,j ≤ N)
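As an illustration of these definitions (not from the paper), the mixing of alternatives into P(f) and H(f) can be sketched in Python; the two-state data below, and the function name, are made up for the example:

```python
# Illustrative sketch: building P(f)_ij = sum_k f_ik P_ij^k and
# H(f)_ij = sum_k f_ik P_ij^k tau_ij^k for a randomized policy f.
# f[i][k] is the probability of choosing alternative k in state i;
# P[i][k][j] and tau[i][k][j] are the transition probabilities and
# expected holding times (hypothetical numbers throughout).

def policy_matrices(f, P, tau):
    """Return the N x N matrices P(f) and H(f) for policy tableau f."""
    N = len(f)
    Pf = [[sum(f[i][k] * P[i][k][j] for k in range(len(f[i])))
           for j in range(N)] for i in range(N)]
    Hf = [[sum(f[i][k] * P[i][k][j] * tau[i][k][j] for k in range(len(f[i])))
           for j in range(N)] for i in range(N)]
    return Pf, Hf

# Two states, two alternatives each; all numbers are invented.
P = [[[0.5, 0.5], [1.0, 0.0]],
     [[0.0, 1.0], [0.8, 0.2]]]
tau = [[[1.0, 2.0], [1.0, 1.0]],
       [[3.0, 1.0], [1.0, 2.0]]]
f = [[0.5, 0.5],     # randomized in state 1
     [1.0, 0.0]]     # pure in state 2

Pf, Hf = policy_matrices(f, P, tau)
print(Pf)  # each row of P(f) still sums to 1
print(Hf)
```

Note that each row of P(f) sums to 1 because the f_ik are a probability distribution over K(i) and each P_i·^k is a probability distribution over states.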

Denote by n(f) the number of subchains (closed, irreducible sets of states)
of P(f). In the system of 2N equations:

    [equation display suppressed in the PDF extraction]

the vector g = g(f) is uniquely determined, whereas the vector v is
determined up to n(f) additive constants, i.e., one constant per subchain
of P(f) (cf. Lemma 2.1).
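The display carrying the 2N evaluation equations did not survive the extraction. In the standard multichain Markov Renewal Programming formulation (an assumption here, not recovered from this copy), they read:

```latex
% Presumed standard multichain evaluation equations for a policy f
g_i = \sum_{j=1}^{N} P(f)_{ij}\, g_j , \qquad 1 \le i \le N ,
\\
v_i = q(f)_i - \sum_{j=1}^{N} H(f)_{ij}\, g_j + \sum_{j=1}^{N} P(f)_{ij}\, v_j ,
\qquad 1 \le i \le N .
```

In the unichain case this reduces to the familiar v_i = q(f)_i − g·T(f)_i + Σ_j P(f)_ij v_j, since the row sums of H(f) are the expected holding times.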

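To make the "determined up to n(f) additive constants" remark concrete, here is a hedged sketch (not the paper's algorithm) of policy evaluation for a pure unichain policy, where n(f) = 1 and a single normalization such as v_N = 0 pins down the relative values. All data and the normalization choice are illustrative:

```python
# Policy evaluation for a unichain policy: solve
#   g*T_i + v_i - sum_j P_ij v_j = q_i   (1 <= i <= N),  with v_{N} = 0,
# where T_i is the expected holding time in state i under the policy.
# Unknowns: x = (g, v_1, ..., v_{N-1}); plain Gaussian elimination.

def evaluate_unichain(P, q, T, pivot_tol=1e-12):
    """Return (g, v) with the illustrative normalization v[-1] = 0."""
    N = len(P)
    A = [[0.0] * N for _ in range(N)]
    b = list(q)
    for i in range(N):
        A[i][0] = T[i]                          # coefficient of g
        for j in range(N - 1):                  # v_{N-1} is fixed at 0
            A[i][j + 1] = (1.0 if i == j else 0.0) - P[i][j]
    for col in range(N):                        # forward elimination
        piv = max(range(col, N), key=lambda r: abs(A[r][col]))
        if abs(A[piv][col]) < pivot_tol:
            raise ValueError("singular system")
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, N):
            m = A[r][col] / A[col][col]
            for c in range(col, N):
                A[r][c] -= m * A[col][c]
            b[r] -= m * b[col]
    x = [0.0] * N                               # back substitution
    for r in range(N - 1, -1, -1):
        s = b[r] - sum(A[r][c] * x[c] for c in range(r + 1, N))
        x[r] = s / A[r][r]
    return x[0], x[1:] + [0.0]

# Two-state example with unit holding times (numbers made up):
P = [[0.5, 0.5], [0.5, 0.5]]
q = [2.0, 0.0]
T = [1.0, 1.0]
g, v = evaluate_unichain(P, q, T)
print(g, v)  # gain 1.0, relative values [2.0, 0.0]
```

Choosing a different additive constant (e.g., v_1 = 0 instead of v_N = 0) shifts v uniformly without changing g; the paper's point is that in the multichain case there is one such free constant per subchain, and the PIA's behavior can depend on how they are chosen.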