UDC 519.217
CYBERNETICS AND CONTROL THEORY
E. S. USACHEV
ASYMPTOTIC PROPERTIES AND APPROXIMATION OF STOCHASTIC MODELS OF LEARNING
(Presented by Academician A. A. Dorodnitsyn on 26 I 1968)
Many of the simplest procedures of learning (skill formation) consist of a series of consecutive experiments \(E_1, \ldots, E_t, \ldots\). Each experiment begins with the appearance of a conditional stimulus \(C_\alpha \in C\) (\(C\) is the set of conditional stimuli) with probability \(p_t(C_\alpha)\). If the subject (or learner) responds to the stimulus by a reaction \(R_\beta \in R\) (\(R\) is the set of reactions), then with probability \(p_t(S_\gamma / C_\alpha R_\beta)\) an unconditioned stimulus \(S_\gamma \in S\) appears (\(S\) is the set of unconditioned stimuli). Everything that the subject can learn from experiment \(E_t\) is contained in the event \(\omega_i = \{C_\alpha R_\beta S_\gamma\} \in C \times R \times S = \Omega\). The pair of finite sets \((C, S)\) and conditional probabilities \(p_t(C_\alpha)\), \(p_t(S_\gamma / C_\alpha R_\beta)\) will be called the environment in which the experiments \(E_t\) take place.
We make the following three assumptions:
a) In experiment \(E_t\) the subject (learner) performs reaction \(R_\beta\) with probability \(p_t(R_\beta / C_\alpha)\). The matrix of conditional probabilities \(\|p_t(R_\beta/C_\alpha)\|\) will henceforth be called the subject’s behavior and denoted by \(\bar p_t(\beta/\alpha)\).
b) All the experience accumulated by the subject (learner) is completely contained in its behavior \(\bar p_t(\beta/\alpha)\). Formally this is expressed as follows: if in experiment \(E_t\) the event \(\omega_{i_t} \in \Omega\) occurred, then the new behavior will be
\[
\bar p_{t+1}(\beta/\alpha)=\varphi(\bar p_t(\beta/\alpha),\omega_{i_t}),
\]
where \(\bar p_t(\beta/\alpha)\) is the behavior of the subject in \(E_t\).
c) The rule for changing behaviors \(\varphi(\bar p(\beta/\alpha),\omega)\), defined on the set of all behaviors \(\hat P=\{\|p(R_\beta/C_\alpha)\|\}\), is a contracting homeomorphic mapping with contraction coefficient \(a_\omega\) and fixed point \(\bar p_\omega(\beta/\alpha)\). This fixed point is naturally called the behavior taught by the event \(\omega\); \(a_\omega\) is the learning rate.
Learning according to scheme a)–c) will be called a behavioristic model, or briefly BM (for more on BM see \((^8)\)). A BM for which the transformation is linear, i.e.,
\[
\varphi(\bar p(\beta/\alpha),\omega)=a_\omega\bar p(\beta/\alpha)+(1-a_\omega)\bar p_\omega(\beta/\alpha),
\]
was proposed and tested in many experiments in \((^{1-3})\). This model will henceforth be denoted MBM (the Bush–Mosteller model). A BM can be represented by an automaton with an uncountable set of states (see the addenda in \((^3)\)).
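As an illustration, the linear operator of the MBM can be sketched in Python for a behavior matrix stored as a list of rows (one row per stimulus \(C_\alpha\), one column per reaction \(R_\beta\)). The names `linear_update` and `max_abs_diff`, the rate 0.8 and the example matrices are assumptions made for this sketch; they do not come from \((^{1-3})\).

```python
def linear_update(behavior, taught, rate):
    """Apply phi(p, w) = a_w * p + (1 - a_w) * p_w entry by entry."""
    return [
        [rate * p + (1.0 - rate) * q for p, q in zip(row_p, row_q)]
        for row_p, row_q in zip(behavior, taught)
    ]


def max_abs_diff(m1, m2):
    """Largest entrywise distance between two behavior matrices."""
    return max(abs(x - y) for r1, r2 in zip(m1, m2) for x, y in zip(r1, r2))


# Hypothetical example: one stimulus, two reactions.
behavior = [[0.5, 0.5]]   # current behavior
taught = [[1.0, 0.0]]     # behavior taught by the event w (fixed point)
rate = 0.8                # learning rate a_w (assumed value)

start_gap = max_abs_diff(behavior, taught)
for _ in range(10):
    behavior = linear_update(behavior, taught, rate)
end_gap = max_abs_diff(behavior, taught)
```

In this sketch each repetition of the same event shrinks the distance to the taught behavior by the factor `rate`, and every row remains a probability distribution because the update is a convex combination.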
We shall call an approximation of a BM of order \(k\) (briefly ABM \(k\)) a learning model possessing properties a), b) and the new property c′):
c′) The rule for changing behaviors
\[
\tilde\varphi(\bar p_{\omega_{i_1}\ldots\omega_{i_k}}(\beta/\alpha),\omega_j)
=
\bar p_{\omega_{i_2}\ldots\omega_{i_k}\omega_j}(\beta/\alpha)
\]
is defined on the finite set
\[
\bar P_k=\{\|\bar p_{\omega_{i_1}\ldots\omega_{i_k}}(\beta/\alpha)\|\}
\]
of all possible behaviors that are fixed points of \(k\)-fold superpositions of the mappings
\[
\varphi(\varphi(\varphi\ldots(\varphi(\bar p(\beta/\alpha),\omega_{i_1}),\omega_{i_2}),\ldots),\omega_{i_k}),
\quad \omega_{i_j}\in\Omega.
\]
An ABM \(k\) can be represented by a combination of a random-signal generator and a finite automaton. In particular, an ABM \(k\) is realizable in the form of a homogeneous network of formal neurons and neurons with spontaneous activity.
Beginning with the second experiment \(E_2\), the behavior of a BM or an ABM \(k\) forms a random sequence \(\bar p_t(\beta/\alpha)\) that is a Markov chain. The study of the asymptotic properties of the BM thus reduces to the investigation of the asymptotic properties of the chain \(\bar p_t(\beta/\alpha)\).
Consider a Markov chain \(\xi(t)\), whose set of states is the complete metric space \(X\) with metric \(\rho(x,y)\) and
\[
\max_{x,y\in X}\rho(x,y)=d<\infty.
\]
If \(\xi(t)=x\in X\) at time \(t\), then with probability \(p_t(\omega_i/\xi(t)=x)\) a contracting homeomorphic mapping \(\psi(x,\omega_i)\), with contraction coefficient \(a_{\omega_i}\leq a<1\) and fixed point \(x_{\omega_i}\), is applied to the point \(x\). The mapping \(\psi(x,\omega_i)\) takes the chain from the state \(\xi(t)=x\) to
\[
\xi(t+1)=\psi(\xi(t),\omega_i)=\psi(x,\omega_i).
\]
In the case of a BM, \(X=\hat P\), \(\xi(t)=\bar p_t(\beta/\alpha)\), and
\[
p(\omega_i=(C_\alpha R_\beta S_\gamma)/\xi(t)=\bar p_t(\beta/\alpha))
=p_t(C_\alpha)\,p_t(R_\beta/C_\alpha)\,p_t(S_\gamma/C_\alpha R_\beta).
\]
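The factorization of the event probability into stimulus, reaction and unconditioned-stimulus factors can be checked numerically. The following Python sketch is only an illustration: the function name `event_probabilities` and all numerical values (two stimuli, two reactions, two unconditioned stimuli) are hypothetical.

```python
def event_probabilities(p_stimulus, behavior, p_outcome):
    """Return {(alpha, beta, gamma): p(C_a) * p(R_b/C_a) * p(S_g/C_a R_b)}."""
    probs = {}
    for alpha, p_c in enumerate(p_stimulus):
        for beta, p_r in enumerate(behavior[alpha]):
            for gamma, p_s in enumerate(p_outcome[alpha][beta]):
                probs[(alpha, beta, gamma)] = p_c * p_r * p_s
    return probs


# Hypothetical environment and behavior.
p_stimulus = [0.5, 0.5]                      # p_t(C_alpha)
behavior = [[0.25, 0.75], [0.5, 0.5]]        # p_t(R_beta / C_alpha)
p_outcome = [                                # p_t(S_gamma / C_alpha R_beta)
    [[1.0, 0.0], [0.5, 0.5]],
    [[0.25, 0.75], [0.0, 1.0]],
]

events = event_probabilities(p_stimulus, behavior, p_outcome)
total = sum(events.values())
```

Since each factor is a probability distribution over its own index, the products sum to one over \(\Omega=C\times R\times S\), so `events` is a distribution from which the next event \(\omega_{i_t}\) could be sampled.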
Denote by: \(\Omega^t\) the set of all sequences \((\omega_{i_1}\ldots \omega_{i_t})\) of length \(t\); \(I^t\) a proper subset of \(\Omega^t\); \(\Omega^{t-l}I^l\) the set of all possible sequences of events \((\omega_{i_1}\ldots \omega_{i_{t-l}},\omega_{i_{t-l+1}}\ldots \omega_{i_t})\) such that \((\omega_{i_{t-l+1}}\ldots \omega_{i_t})\in I^l\), \(l\leq t\); \(p(I^l/\xi(t)=x)\) the probability that, from time \(t+1\) to time \(t+l\), the sequence \((\omega_{i_1}\ldots \omega_{i_l})\in I^l\) will appear, under the condition that \(\xi(t)=x\); \(p(\xi(t')/\xi(t'')=x)\) the probability distribution of the states of the chain \(\xi(t)\) at time \(t'\geq t''\), under the condition \(\xi(t'')=x\).
Theorem 1. In the space of conditional measures on \(X\), the weak distance* satisfies
\[
L\bigl(p(\xi(t)/\xi(0)=x),\; p(\xi(t)/\xi(0)=y)\bigr)
\leq \max(a^l d,\;\tilde\mu_{t-l}),
\]
where \(l\leq t\),
\[
\tilde\mu_{t-l}=\sup_{\Omega^{t-l}I^l}
\left|p(\Omega^{t-l}I^l/\xi(0)=x)-p(\Omega^{t-l}I^l/\xi(0)=y)\right|.
\]
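The geometric term in the bound of Theorem 1 can plausibly be read as follows: if two copies of the chain are driven by the same last \(l\) events, each event contracts their distance, so after \(l\) steps they are within the \(l\)-th power of the contraction bound times \(d\). The Python sketch below illustrates this under assumptions of our own: \(X=[0,1]\) (so \(d=1\)), three hypothetical linear contractions, and a fixed random seed.

```python
import random


def make_map(rate, fixed_point):
    """Linear contraction x -> rate * x + (1 - rate) * fixed_point on [0, 1]."""
    return lambda x: rate * x + (1.0 - rate) * fixed_point


# Hypothetical maps; every contraction coefficient is at most a = 0.7.
maps = [make_map(0.5, 0.0), make_map(0.7, 1.0), make_map(0.6, 0.3)]
a = 0.7   # common bound on the contraction coefficients
d = 1.0   # diameter of X = [0, 1]

rng = random.Random(0)
x, y = 0.0, 1.0          # two initial states at maximal distance
gaps = []
for step in range(1, 21):
    psi = rng.choice(maps)     # the same event is applied to both copies
    x, y = psi(x), psi(y)
    gaps.append((step, abs(x - y)))

# The distance after l common events never exceeds a**l * d.
bound_holds = all(gap <= a ** step * d + 1e-12 for step, gap in gaps)
```

The remaining term \(\tilde\mu_{t-l}\) accounts for the fact that the two chains need not see the same events with the same probabilities; the sketch does not model that part.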
Theorem 2. If
\[
\sup_{\substack{\omega\in\Omega\\ t'\in[t,t+\tau]}}
\left|p(\omega/\xi(t')=x)-p(\omega/\xi(t')=y)\right|
\leq C_0\rho(x,y),
\]
then
\[
\left|p(I^\tau/\xi(t)=x)-p(I^\tau/\xi(t)=y)\right|
\leq C_0\rho(x,y)/(1-a).
\]
Let \(I_{x_\omega}^1\) be the set of all those \(\omega_i\) for which the mappings \(\psi(x,\omega_i)\) have one and the same fixed point \(x_\omega\).
Theorem 3. Suppose:
1) For each time \(t\) there exists a set \(I_{x_\omega}^1\) (generally speaking, different for different \(t\)) such that
\[
p(I_{x_\omega}^1/\xi(t)=x)\geq \delta>0;
\]
2) the conditions of Theorem 2 are satisfied.
Then for every \(l\leq t\),
\[
\left|p(\Omega^{t-l}I^l/\xi(0)=x)-p(\Omega^{t-l}I^l/\xi(0)=y)\right|
\leq \frac{a^m C_0 d}{1-a}+q_{m,t},
\]
where \(q_{m,t}\) is the probability of the absence of a run of \(m\) successes in a sequence of \(t\) Bernoulli trials with probability of success \(\delta^2\).**
From Theorem 3, in particular, follow conditions for the existence and uniqueness of a distribution \(\mu\), to which \(p(\xi(t)/\xi(0)=x)\) converges weakly for all \(x\). For the case when the BM has only two reactions, these conditions were established in \((^4,^5)\). The proof of Theorem 3 proceeds analogously to the corresponding proof in \((^5)\).
* For the definition of the weak distance, see \((^6)\), p. 147.
** It can be shown (see \((^7)\), pp. 326–327) that
\[
q_{m,t}\sim
\frac{1-\delta^2\lambda}{(m+1-m\lambda)(1-\delta^2)}\,
\frac{1}{\lambda^{t+1}},
\]
where \(\lambda\approx 1+(1-\delta^2)\delta^{2m}+(m+1)\bigl(\delta^{2m}(1-\delta^2)\bigr)^2\).
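The quantity \(q_{m,t}\) can also be computed exactly by a simple recursion over the length of the current run of successes, and compared with Feller's asymptotic expression, in which the number of trials \(t\) enters through the power \(t+1\) of \(\lambda\). The Python sketch below is illustrative only: the function names are ours, the argument `p` plays the role of \(\delta^2\), and \(\lambda\) is obtained by fixed-point iteration of \(\lambda = 1+(1-p)p^m\lambda^{m+1}\) rather than from the truncated expansion.

```python
def prob_no_run(m, t, p):
    """Exact probability of no run of m successes in t Bernoulli(p) trials."""
    # state[j]: probability that no run of length m has occurred so far and
    # the current trailing run of successes has length j (0 <= j < m).
    state = [1.0] + [0.0] * (m - 1)
    for _ in range(t):
        new_state = [0.0] * m
        new_state[0] = (1.0 - p) * sum(state)   # a failure resets the run
        for j in range(m - 1):
            new_state[j + 1] = p * state[j]     # a success extends the run
        state = new_state                       # runs reaching m are dropped
    return sum(state)


def prob_no_run_asymptotic(m, t, p, iterations=200):
    """Feller-type asymptotic approximation of the same probability."""
    q = 1.0 - p
    lam = 1.0
    for _ in range(iterations):                 # smallest root above 1
        lam = 1.0 + q * p ** m * lam ** (m + 1)
    return (1.0 - p * lam) / ((m + 1 - m * lam) * q) / lam ** (t + 1)


# Hypothetical values: runs of m = 2 successes, t = 10 trials, p = 0.5.
exact = prob_no_run(2, 10, 0.5)
approx = prob_no_run_asymptotic(2, 10, 0.5)
```

For these values the two numbers agree to several decimal places, which suggests that the asymptotic formula is already usable for moderate \(t\).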
To determine how closely the approximating ABM \(k\) imitates the BM, we associate with the chain \(\xi(t)\) the chain \(\xi_k(t)\). The states of \(\xi_k(t)\) are the fixed points of the mappings \(\psi(\ldots\psi(\psi(x,\omega_{i_1}),\omega_{i_2})\ldots,\omega_{i_k})\). The chain \(\xi_k(t)\) has a finite set of states
\[ X_k=\{x_{\omega_{i_1}\ldots \omega_{i_k}}\} \]
and passes from the state \(x_{\omega_{i_1}\ldots \omega_{i_k}}\) at time \(t\) to the state \(x_{\omega_{i_2}\ldots \omega_{i_k}\omega_j}\) with probability
\[ p_t(\omega_j/\xi(t)=x_{\omega_{i_1}\ldots \omega_{i_k}}). \]
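The construction of \(X_k\) and of the shift transition can be sketched in Python for linear maps on \(X=[0,1]\). Everything concrete here is an assumption made for illustration: the two events `w1`, `w2`, their rates and fixed points, and the value \(k=3\). The sketch also measures how far the state of the original chain can be from the corresponding state of \(\xi_k(t)\) when both have seen the same last \(k\) events.

```python
from itertools import product

# Hypothetical linear maps psi(x, w) = a_w * x + (1 - a_w) * x_w on X = [0, 1].
MAPS = {"w1": (0.5, 0.0), "w2": (0.7, 1.0)}   # event -> (a_w, x_w)


def apply_sequence(x, events):
    """Apply the maps for the given events in order."""
    for w in events:
        rate, fixed = MAPS[w]
        x = rate * x + (1.0 - rate) * fixed
    return x


def fixed_point_of_sequence(events):
    """Fixed point of the composite map, which is affine: x -> A * x + B."""
    b = apply_sequence(0.0, events)        # value at x = 0
    a = apply_sequence(1.0, events) - b    # slope, product of the rates
    return b / (1.0 - a)


def abm_states(k):
    """All states x_{w_i1 ... w_ik}, indexed by the window of k events."""
    return {seq: fixed_point_of_sequence(seq) for seq in product(MAPS, repeat=k)}


def abm_step(window, event):
    """Shift transition (w_i1 ... w_ik), w_j -> (w_i2 ... w_ik, w_j)."""
    return window[1:] + (event,)


k = 3
states = abm_states(k)

# Largest distance between the original chain (started anywhere in [0, 1])
# and the ABM state after the same window of k events.
worst_gap = max(
    abs(apply_sequence(x0, window) - states[window])
    for window in states
    for x0 in (0.0, 1.0)
)
```

Because the composite of \(k\) contractions has coefficient at most the \(k\)-th power of the largest rate, `worst_gap` cannot exceed `0.7 ** k` in this example, which gives an informal picture of why the approximation improves as \(k\) grows.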
Theorem 4. Let \(\xi(t)\) be the chain generated by the random behaviors of an MBM in a stationary environment, and let \(\xi_k(t)\) be the corresponding chain generated by the random behaviors of an ABM\(k\) in the same stationary environment, with \(k \ge 2\). Then:
1) If at least one of the chains \(\xi_k(t)\), \(\xi(t)\) has no absorbing sets, then the chains \(\xi_k(t)\) are ergodic, \(p(\xi(t)/\xi(0)=x)\) converges weakly to an invariant distribution \(\mu\) independent of \(x\), and the limiting distributions \(\mu_k\) of the chains \(\xi_k(t)\) converge weakly to \(\mu\) as \(k\to\infty\).
2) The chains \(\xi_k(t)\) and \(\xi(t)\) either both have, or both do not have, absorbing sets. In the former case the number of absorbing sets of \(\xi_k(t)\) and \(\xi(t)\) is the same, and the absorbing sets of \(\xi_k(t)\) are proper subsets of the absorbing sets of \(\xi(t)\).
3) The chains \(\xi(t)\) and \(\xi_k(t)\) either both have, or both do not have, absorbing states. These absorbing states are conditional fixed points \(\|p_\omega(\beta/\alpha)\|\) (i.e., simple matrices \(\|p_\omega(\beta/\alpha)\|\)).
Some cases in which a direct analytic computation of the limiting distributions of an MBM is possible are considered in \((^8)\).
The author expresses deep gratitude to V. G. Sragovich for assistance in the work and to B. G. Sushkov for a number of useful comments.
Computing Center
Academy of Sciences of the USSR
Received 26 I 1968
REFERENCES
1. R. R. Bush, F. Mosteller, Psych. Rev., 58, 313 (1951).
2. R. R. Bush, F. Mosteller, Ann. Math. Stat., 24, 559 (1953).
3. R. Bush, F. Mosteller, Stochastic Models for Learning, Moscow, 1962.
4. R. Bellman, T. Harris, H. N. Shapiro, Studies on Functional Equations Occurring in Decision Processes, 1952.
5. S. Karlin, Pacific J. Math., 3, 725 (1953).
6. Yu. V. Prokhorov, Yu. A. Rozanov, Probability Theory, “Nauka,” 1967.
7. W. Feller, An Introduction to Probability Theory and Its Applications, 2nd ed., Vol. I, Moscow, 1964.
8. E. S. Usachev, A Stochastic Model of Learning and Its Properties, in: Studies in the Theory of Self-Adjusting Systems, Moscow, Proceedings of the Computing Center of the Academy of Sciences of the USSR, 1967.