UDC 621.398+621.52[016.3]
CYBERNETICS
Submitted 1967-01-01 | RussiaRxiv: ru-196701.55551 | Translated from Russian

CYBERNETICS
AND CONTROL THEORY

Corresponding Member of the USSR Academy of Sciences V. S. PUGACHEV

OPTIMAL LEARNING SYSTEMS

In (1–3) a statistical theory of optimal learning algorithms for automatic systems was developed under the assumption that the probability distribution of the input signal \(Z\) and of the required output signal \(W\) is specified to within a finite number of parameters, which remain constant during learning and the subsequent operation of the system. Here a generalization of the theory of optimal learning systems is given for the case when the unknown parameters of the distribution of the input and required output signals vary randomly during learning and the subsequent operation of the system.

Suppose that all possible realizations \(z\) of the input signal \(Z\) are elements of some set \(A\), and all possible realizations \(w\) of the required output signal \(W\) are elements of some set \(B\). Let \(\eta(\Delta z,\Delta w;\theta\mid\lambda)\), \(\Delta z \subset A\), \(\Delta w \subset B\), be a family of probability measures defined on the Cartesian product \(A \times B\), depending on a parameter \(\lambda\), which takes values in the set \(L\), and on a numerical parameter \(\theta\)*. Let \(\Lambda(\theta)\) be a random function of the parameter \(\theta\), taking values in the set \(L\). We shall assume that the input and required output signals \(Z, W\) in each cycle of operation of the system are random variables distributed in \(A \times B\) in accordance with the probability measure \(\eta(\Delta z,\Delta w;\theta\mid\Lambda)\) for some value \(\theta\) and for a value \(\Lambda\) of the parameter \(\lambda\) equal to the value of the random function \(\Lambda(\theta)\) at the corresponding value of \(\theta\). In different cycles of operation of the system the parameter \(\theta\) has different values (in particular, \(\theta\) may be the number of the cycle or the moment of its beginning). Consequently, the parameter \(\Lambda\) has different random values in different cycles of operation of the system.

The learning of the system consists in the fact that, during \(N\) cycles corresponding to the values \(\theta_1,\ldots,\theta_N\) of the parameter \(\theta\), certain signals statistically related to \(\Lambda\)—the learning signals—are introduced into it. The system is required, by statistical processing of the learning signals, to produce an optimal estimate \(W^*\) of the required output signal \(W\) in the first cycle of operation after learning, ensuring fulfillment of the condition

\[ M[l(W,W^*\mid\Lambda)] = \min, \tag{1} \]

where \(l(W,W^*\mid\Lambda)\) is a loss function, which may depend on the unknown parameter \(\Lambda\), and the probabilistic averaging is performed over all possible realizations of the learning signals, of the random function \(\Lambda(\theta)\), and of the input, required output, and actual output signals \(Z, W, W^*\) in the first cycle after learning.

* By probability measures defined on a set we mean, of course, measures defined on the corresponding \(\sigma\)-algebra of subsets of that set.

Let us denote by \(\hat W\) the actual output signal of any learning system (not necessarily optimal), and by \(\Xi\) the totality of all learning signals (regarded as random variables) together with the input signal \(Z\) in the first cycle after learning. Then we shall have

\[ R(\delta)=M\,[l(W,\hat W\mid \Lambda)] =\int_L d\lambda\int_{\bar L} d_{\bar\lambda}A(\lambda,\bar\lambda)\int_X d\sigma(\xi\mid \lambda,\bar\lambda)\int_B d\delta(\hat w\mid \xi)\int_B l(w,\hat w\mid \lambda)\,d\varkappa(w;\theta\mid z,\lambda), \tag{2} \]

where \(A(\Delta\lambda,\Delta\bar\lambda)\) is the joint probability measure of the random parameter \(\Lambda=\Lambda(\theta)\) in the first cycle of operation of the system after training and of the aggregate \(\bar\Lambda\) of random parameters \(\Lambda_1=\Lambda(\theta_1),\ldots,\Lambda_N=\Lambda(\theta_N)\), \(\Delta\lambda\subset L\), \(\Delta\bar\lambda\subset \bar L=L^N\); \(\sigma(\Delta\xi\mid \lambda,\bar\lambda)\) is the conditional probability measure of the random variable \(\Xi\) for given values \(\lambda,\bar\lambda\) of the random variables \(\Lambda,\bar\Lambda\), \(\Delta\xi\subset X\); \(\delta(\Delta\hat w\mid \xi)\) is the decision function of the system in the first cycle after training, representing the conditional probability measure of its output signal \(\hat W\) for a given value \(\xi\) of the variable \(\Xi\), \(\Delta\hat w\subset B\); and \(\varkappa(\Delta w;\theta\mid z,\lambda)\) is the conditional probability measure of the required output signal \(W\) in the first cycle after training for given values \(z,\lambda\) of the input signal \(Z\) and the parameter \(\Lambda\):

\[ \varkappa(\Delta w;\theta\mid z,\lambda)= \frac{d\eta(z,\Delta w;\theta\mid\lambda)}{d\gamma(z;\theta\mid\lambda)} = \frac{d\eta(z,\Delta w;\theta\mid\lambda)}{d\eta(z,B;\theta\mid\lambda)}. \tag{3} \]

In the case of self-learning, \(\Xi\) is the aggregate of input signals \(Z_1,\ldots,Z_N,Z\) obtained by the system during the training period and in the first cycle after training, and \(X=A^{N+1}\). In this case, if \(Z_1,\ldots,Z_N,Z\) are conditionally independent for any values \(\lambda,\bar\lambda\) of the variables \(\Lambda,\bar\Lambda\), then the conditional probability measure of the variable \(\Xi\) is determined by the formula

\[ \sigma(\Delta\xi\mid\lambda,\bar\lambda) = \gamma(\Delta z;\theta\mid\lambda) \prod_{i=1}^{N}\gamma(\Delta z_i;\theta_i\mid\lambda_i), \tag{4} \]

where \(\gamma(\Delta z;\theta\mid\lambda)=\eta(\Delta z,B;\theta\mid\lambda)\) is the probability measure of the input signal \(Z\).
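As an illustration of formula (4), the following minimal Python sketch evaluates the likelihood \(\sigma\) at a single realization for a hypothetical discrete model. The Bernoulli form of `gamma`, the omission of the dependence on \(\theta\), and all names and numbers are assumptions made only for this example; they do not come from the theory itself.

```python
from math import prod


def gamma(z, lam):
    """Assumed input model: P(Z = 1 | lambda) = lam for a binary input z."""
    return lam if z == 1 else 1.0 - lam


def sigma(training_inputs, z, lam_bar, lam):
    """Formula (4) at a point: gamma(z | lam) times the product of gamma(z_i | lam_i)."""
    if len(training_inputs) != len(lam_bar):
        raise ValueError("need one parameter value per training cycle")
    training_part = prod(
        gamma(z_i, lam_i) for z_i, lam_i in zip(training_inputs, lam_bar)
    )
    return gamma(z, lam) * training_part


# Three training cycles with a drifting parameter, then the first operating cycle.
likelihood = sigma([1, 0, 1], 1, [0.8, 0.8, 0.5], 0.5)
```

The product form relies on the conditional independence of \(Z_1,\ldots,Z_N,Z\) stated above; without it the likelihood would not factor over the cycles.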

In the case of training the system by a teacher by means of demonstration (1), \(\Xi\) is the aggregate of the input signals \(Z_1,\ldots,Z_N,Z\) and the teacher’s output signals \(\widetilde W_1,\ldots,\widetilde W_N\) introduced into the system in the course of training, and \(X=A^{N+1}\times \widetilde B^N\), where \(\widetilde B\) is the space of all possible realizations of the teacher’s output signal \(\widetilde W\) (it may, in particular, coincide with \(B\)). In this case, if the triples of signals \((Z_1,W_1,\widetilde W_1),\ldots,(Z_N,W_N,\widetilde W_N)\) and \(Z\) are conditionally independent for any values \(\lambda,\bar\lambda\) of the variables \(\Lambda,\bar\Lambda\) (here \(W_1,\ldots,W_N\) are the required output signals corresponding to the training input signals \(Z_1,\ldots,Z_N\)), then the conditional probability measure of the random variable \(\Xi\) is determined by the formulas

\[ \sigma(\Delta\xi\mid\lambda,\bar\lambda) = \gamma(\Delta z;\theta\mid\lambda) \prod_{i=1}^{N} \int_{\Delta z_i} \pi(\Delta\widetilde w_i;\theta_i\mid z_i,\lambda_i)\, d\gamma(z_i;\theta_i\mid\lambda_i), \tag{5} \]

\[ \pi(\Delta\widetilde w;\theta\mid z,\lambda) = \int_B \delta_y(\Delta\widetilde w;\theta\mid z,w,\lambda)\, d\varkappa(w;\theta\mid z,\lambda), \tag{6} \]

where \(\delta_y(\Delta\widetilde w;\theta\mid z,w,\lambda)\) is the teacher’s decision function, i.e., the conditional probability measure of his output signal \(\widetilde W\) for given values \(z,w,\lambda\) of the variables \(Z,W,\Lambda=\Lambda(\theta)\), \(\Delta\widetilde w\subset \widetilde B\).
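Formula (6) can be made concrete for finite \(B\) and \(\widetilde B\), where the integral becomes a sum over \(w\). The sketch below is only an illustration under assumed models: a binary required output with \(P(W=1\mid z,\lambda)=\lambda\), and a hypothetical teacher who reports \(w\) correctly except with a fixed error probability. The names `kappa`, `teacher_decision`, and `pi` are chosen for this example.

```python
def kappa(w, lam):
    """Assumed model of the required output: P(W = 1 | z, lambda) = lam."""
    return lam if w == 1 else 1.0 - lam


def teacher_decision(w_teacher, w, error_rate):
    """Assumed teacher: reports w correctly except with probability error_rate."""
    return 1.0 - error_rate if w_teacher == w else error_rate


def pi(w_teacher, lam, error_rate, outputs=(0, 1)):
    """Formula (6) for finite B: sum over w of delta_y(w_teacher | w) * kappa(w | lam)."""
    return sum(
        teacher_decision(w_teacher, w, error_rate) * kappa(w, lam)
        for w in outputs
    )


p_one = pi(1, lam=0.9, error_rate=0.1)
p_zero = pi(0, lam=0.9, error_rate=0.1)
```

The required output \(W\) itself never reaches the learning system; only its effect on the teacher's signal remains after summing it out, which is why \(\pi\) in (5) is conditioned on \(z\) and \(\lambda\) alone.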

In the case of training the system by a teacher by means of evaluating its actions, \(\Xi\) is the aggregate of training input signals \(Z_1,\ldots,Z_N\), the corresponding output signals of the system \(\hat W_1,\ldots,\hat W_N\), the teacher’s evaluations (his output signals) \(\widetilde W_1,\ldots,\widetilde W_N\), and the input signal \(Z\) in the first cycle after training, and \(X=A^{N+1}\times B^N\times \widetilde B^N\). In this case, if the quadruples of signals \((Z_1,W_1,\hat W_1,\widetilde W_1),\ldots,(Z_N,W_N,\hat W_N,\widetilde W_N)\) and \(Z\) are conditionally independent for any values \(\lambda,\bar\lambda\) of the variables \(\Lambda,\bar\Lambda\), then the conditional probability measure of the variable \(\Xi\) is determined by the formulas

\[ \sigma(\Delta \xi \mid \lambda, \bar{\lambda}) = \gamma(\Delta z; \theta \mid \lambda) \prod_{i=1}^{N} \int_{\Delta z_i} \delta^{i}(\Delta \hat{w}_i \mid z_i)\, \pi(\Delta \widetilde{w}_i; \theta_i \mid z_i,\hat{w}_i,\lambda_i)\, d\gamma(z_i;\theta_i \mid \lambda_i), \tag{7} \]

\[ \pi(\Delta \widetilde{w};\theta \mid z,\hat{w},\lambda) = \int_{B} \delta_y(\Delta \widetilde{w};\theta \mid z,w,\hat{w},\lambda)\, d\varkappa(w;\theta \mid z,\lambda), \tag{8} \]

where \(\delta^{1}(\Delta \hat{w}\mid z),\ldots,\delta^{N}(\Delta \hat{w}\mid z)\) are the decision functions of the system in the learning cycles, which do not depend on the values \(\lambda_1,\ldots,\lambda_N,\lambda\) of the parameters \(\Lambda(\theta_1),\ldots,\Lambda(\theta_N),\Lambda(\theta)\).

Representing (2) in the form

\[ R(\delta)= \int_{X} d\beta(\xi) \int_{B} \rho(\xi,\hat{w})\,d\delta(\hat{w}\mid \xi), \tag{9} \]

\[ \rho(\xi,\hat{w})= \int_{L} d\Omega(\lambda\mid \xi) \int_{B} l(w,\hat{w}\mid \lambda)\, d\varkappa(w;\theta \mid z,\lambda), \tag{10} \]

where

\[ \beta(\Delta \xi)= \int_{L} d\lambda \int_{\bar{L}} \sigma(\Delta \xi \mid \lambda,\bar{\lambda})\, d_{\bar{\lambda}}A(\lambda,\bar{\lambda}) \tag{11} \]

is the unconditional probability measure of the random variable \(\Xi\), and

\[ \Omega(\Delta\lambda\mid \xi)= \int_{\Delta\lambda} d\lambda \int_{\bar{L}} \frac{d\sigma(\xi\mid \lambda,\bar{\lambda})}{d\beta(\xi)} \,d_{\bar{\lambda}}A(\lambda,\bar{\lambda}) \tag{12} \]

is the conditional (posterior) probability measure of the parameter \(\Lambda=\Lambda(\theta)\) for the given realization \(\xi\) of the random variable \(\Xi\), and assuming that for every realization \(\xi\) of the random variable \(\Xi\) there exists a set \(C_\xi \subset B\) of values \(w^*\) of the variable \(\hat{w}\) satisfying the condition

\[ \rho(\xi,w^*) \leq \rho(\xi,\hat{w}) \quad \text{for all } \hat{w}\in B, \tag{13} \]

we arrive at the conclusion that the optimal decision function of the learning system \(\delta_{\mathrm{opt}}(\Delta\hat{w}\mid \xi)\) will be an arbitrary conditional probability measure completely concentrated on the set \(C_\xi\). In the particular case when, for every \(\xi\), there exists a unique value \(w^*=B\xi\) (\(B\) is a deterministic operator) satisfying (13), the algorithm of the optimal learning system proves to be deterministic.
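For finite \(L\) and \(B\) the minimization in (10) and (13) reduces to comparing finitely many sums. The following minimal Python sketch shows this under assumed inputs: the posterior \(\Omega(\lambda\mid\xi)\) is taken as already computed and given as a dictionary, the dependence on \(z\) and \(\theta\) is suppressed, and the Bernoulli model `kappa` and the zero-one loss are hypothetical choices made for the example.

```python
def posterior_risk(w_hat, posterior, kappa, loss, outputs):
    """Formula (10) for finite L and B: the loss averaged over w and lambda."""
    return sum(
        p_lam * sum(kappa(w, lam) * loss(w, w_hat, lam) for w in outputs)
        for lam, p_lam in posterior.items()
    )


def bayes_decisions(posterior, kappa, loss, outputs, tol=1e-12):
    """Return the minimizing set C_xi of condition (13) and the minimal risk."""
    risks = {
        w_hat: posterior_risk(w_hat, posterior, kappa, loss, outputs)
        for w_hat in outputs
    }
    best = min(risks.values())
    return {w_hat for w_hat, risk in risks.items() if risk <= best + tol}, best


def kappa(w, lam):
    """Assumed model of the required output: P(W = 1 | z, lambda) = lam."""
    return lam if w == 1 else 1.0 - lam


def zero_one_loss(w, w_hat, lam):
    """Assumed loss: 1 for a wrong estimate, 0 for a correct one."""
    return 0.0 if w == w_hat else 1.0


# Hypothetical posterior over two parameter values after learning.
posterior = {0.2: 0.25, 0.9: 0.75}
c_xi, min_risk = bayes_decisions(posterior, kappa, zero_one_loss, outputs=(0, 1))
```

When `c_xi` contains a single element the optimal rule is deterministic; when it contains several, any probability distribution over them attains the same risk.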

A learning system that produces its output signal \(W^*\) (the optimal estimate of the required output signal \(W\)) in the first cycle after learning by minimizing expression (10) is a Bayes optimal learning system. No learning system that has received the same collection of realizations of the learning signals and the same realization of the input signal in the first cycle after learning can be better than the Bayes optimal learning system, i.e., can give a better estimate of the required output signal in the first cycle after learning. Consequently, the quality of the Bayes optimal learning system, determined by the value \(R(\delta_{\mathrm{opt}})\), can serve as a characteristic of the limiting possibilities of learning—the potential learnability of automatic systems.

The basic property of the Bayes optimal learning system, as is clear from (4), (5), (7), (11), and (12), is that, in order to determine the posterior distribution of the parameter \(\Lambda\), it uses not only the realizations of the learning signals but also the realization of the input signal obtained after learning. In other words, the Bayes optimal learning system continues to improve after learning by means of self-training in the course of operation. In the particular case when there is no training period, such a system continuously computes, during the operating cycle, the a posteriori distribution of the parameter \(\Lambda\) from the segment of the realization of the input signal obtained up to the given moment \(t\), and uses this distribution in producing the value of its output signal at the moment \(t\), i.e., it carries out self-training. In this particular case the Bayesian optimal learning system is a Bayesian optimal adaptive (self-adjusting) system.

To evaluate the effectiveness of the optimal training process it is necessary to compare the quantity \(R(\delta_{\mathrm{opt}})\), corresponding to the Bayesian optimal learning system, with the quantity \(R(\delta_\lambda)\), attainable with the aid of a Bayesian optimal system with complete information about the parameter \(\Lambda\), producing its output signal \(W_\lambda^*\) by minimizing the expression

\[ \rho_\lambda(\xi,\hat{w})=\int_B l(w,\hat{w}\mid \lambda)\,d\varkappa(w;\theta\mid z,\lambda) \tag{14} \]

for the exactly known value \(\lambda\) of the parameter \(\Lambda=\Lambda(\theta)\). Since \(R(\delta_{\mathrm{opt}})\ge R(\delta_\lambda)\) always holds, whenever \(R(\delta_\lambda)>0\) the quantity

\[ \varepsilon(\delta_{\mathrm{opt}})= \frac{R(\delta_{\mathrm{opt}})-R(\delta_\lambda)}{R(\delta_\lambda)} \tag{15} \]

may serve as a measure of the limiting attainable (potential) closeness of the learning system to the optimal system with complete information about the distribution of the input and required output signals \(Z, W\) (3).
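The measure (15) can be computed exactly in a small toy case. The sketch below is an illustration only, under assumptions that are not part of the general theory: the parameter is constant and takes two values with equal prior probability, one binary training signal and the binary required output are both Bernoulli with success probability \(\Lambda\), there is no input signal in the operating cycle, and the loss is zero-one.

```python
PRIOR = {0.2: 0.5, 0.9: 0.5}  # assumed two-point prior on the constant parameter


def bernoulli(x, lam):
    """P(X = x | lambda) for a binary x with success probability lam."""
    return lam if x == 1 else 1.0 - lam


def risk_full_information():
    """R(delta_lambda): with lambda known, guess the likelier value of W."""
    return sum(p * min(lam, 1.0 - lam) for lam, p in PRIOR.items())


def risk_bayes_learning():
    """R(delta_opt): after one training observation z1, guess the likelier W."""
    total = 0.0
    for z1 in (0, 1):
        joint = {
            w: sum(p * bernoulli(z1, lam) * bernoulli(w, lam) for lam, p in PRIOR.items())
            for w in (0, 1)
        }
        total += min(joint.values())  # probability mass of the value not chosen
    return total


def relative_excess_risk(risk_learning, risk_full_info):
    """Formula (15); defined only when the full-information risk is positive."""
    if risk_full_info <= 0.0:
        raise ValueError("formula (15) requires a positive full-information risk")
    return (risk_learning - risk_full_info) / risk_full_info


r_full = risk_full_information()
r_opt = risk_bayes_learning()
epsilon = relative_excess_risk(r_opt, r_full)
```

In this toy case a single training observation leaves the Bayes risk well above the full-information risk, so \(\varepsilon\) is far from zero; longer training would be expected to reduce it.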

From the formulas obtained, which determine the a posteriori measure \(\Omega(\Delta\lambda\mid \xi)\), it is seen that in the learning model studied here, which is considerably more general than that of (1–3), all the basic properties of optimal training processes derived in (1–3) are preserved, except for the recursive character of the computation of \(\Omega(\Delta\lambda\mid \xi)\). The latter property is preserved only if the random function \(\Lambda(\theta)\) is a Markov random process, i.e., if the values \(\Lambda(\theta_1),\ldots,\Lambda(\theta_N),\Lambda(\theta)\) of the parameter \(\Lambda\) in the various cycles form a simple Markov chain.
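The recursive computation available in the Markov case can be sketched for a finite parameter set and self-learning. The code below is a minimal illustration, not the method of (1–3): it assumes a two-state Markov chain with a hypothetical transition table, binary inputs that are conditionally independent given the parameters as in (4), and an input model `gamma` chosen for the example. Each cycle needs only the previous posterior, one prediction step through the transition table, and one Bayes update with the new input.

```python
def gamma(z, lam):
    """Assumed input model: P(Z = 1 | lambda) = lam for a binary input z."""
    return lam if z == 1 else 1.0 - lam


def predict(belief, transition):
    """Carry the posterior from one cycle to the next through the Markov chain."""
    return {
        nxt: sum(belief[cur] * transition[cur][nxt] for cur in belief)
        for nxt in belief
    }


def update(belief, z, likelihood):
    """Bayes update of the current-cycle posterior with the observed input z."""
    unnormalized = {lam: p * likelihood(z, lam) for lam, p in belief.items()}
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observed input has zero probability under the model")
    return {lam: value / total for lam, value in unnormalized.items()}


def filter_posterior(prior, transition, likelihood, inputs):
    """Posterior of the parameter in the last cycle given all inputs so far."""
    belief = dict(prior)  # distribution of the parameter in the first cycle
    for index, z in enumerate(inputs):
        if index > 0:
            belief = predict(belief, transition)
        belief = update(belief, z, likelihood)
    return belief


PRIOR = {0.2: 0.5, 0.9: 0.5}
TRANSITION = {0.2: {0.2: 0.9, 0.9: 0.1}, 0.9: {0.2: 0.1, 0.9: 0.9}}

# Inputs z_1, z_2 from training followed by z from the first operating cycle.
posterior = filter_posterior(PRIOR, TRANSITION, gamma, [1, 1, 1])
```

Without the Markov property the posterior for a new cycle cannot in general be obtained from the previous posterior alone, and the full integration over \(\bar\lambda\) in (12) has to be repeated.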

The theory of optimal training processes for automatic systems, developed in (1–3), is obtained as a particular case of the general theory set forth here, when \(L\) is a finite-dimensional Euclidean space, the signals \(Z, W, \hat{W}\), and \(\widetilde{W}\) are finite-dimensional random vectors or vector random functions, and the random parameter \(\Lambda\) retains a constant value during the training process and after training,

\[ \Lambda=\Lambda(\theta_1)=\ldots=\Lambda(\theta_N)=\Lambda(\theta). \]

Received 17 V 1966

CITED LITERATURE

  1. V. S. Pugachev, DAN, 172, No. 5 (1967).
  2. V. S. Pugachev, Report at the Third All-Union Conference on Automatic Control (Technical Cybernetics), Odessa, 20–25 IX 1965.
  3. V. S. Pugachev, Report at the Third IFAC International Congress on Automatic Control, London, 20–25 VII 1966.
