Full Text
UDC 519.95
MATHEMATICS
V. A. SLEPIAN
ON THE NUMBER OF DEAD-END TESTS AND ON MEASURES OF THE INFORMATIVENESS OF A COLUMN FOR ALMOST ALL BINARY TABLES
(Presented by Academician S. L. Sobolev on 2 VII 1969)
- Let \(T_{ln}\) be a binary table having \(l\) rows and \(n\) columns, and let \(T_{lk}\) be a table composed of \(k\) \((k \le n)\) columns of the table \(T_{ln}\).
Definition 1. We shall call the table \(T_{lk}\) a test of the table \(T_{ln}\) if it consists of distinct rows.
Definition 2. A test will be called dead-end if, after the deletion of any column from it, it ceases to be a test. The notion of a test and of a dead-end test is given in paper \((^1)\).
With the table \(T_{ln}\) we associate the quantities \(N(T_{ln})\) and \(N^i(T_{ln})\)—the number of all dead-end tests and the number of dead-end tests into which the \(i\)-th column enters, respectively.
In test algorithms for pattern recognition \((^2)\), the information weight of a column is taken as the measure of informativeness of the column,
\[ I = N^i(T_{ln}) / N(T_{ln}). \tag{1} \]
For the so-called “voting” algorithms, Yu. I. Zhuravlev proposed, as a measure of informativeness, the quantity
\[ J = \sum_{i=1}^{l-1} \sum_{j=i+1}^{l} \left(C_{n-\rho_{ij}}^{k} - C_{n-1-\tilde{\rho}_{ij}}^{k}\right) \Bigg/ \sum_{i=1}^{l-1} \sum_{j=i+1}^{l} C_{n-\rho_{ij}}^{k}, \tag{2} \]
where \(\rho_{ij}\) is the Hamming distance between rows \(i\) and \(j\) in the table \(T_{ln}\), and \(\tilde{\rho}_{ij}\) is the same distance after the deletion of one of the columns.
- We shall now regard \(T_{ln}\) as a random table, namely a table each element of which takes the values 0 and 1 with probability \(1/2\). Then \(N(T_{ln})\), \(N^i(T_{ln})\), \(I\), and \(J\) are random variables.
In the present note, for the class of tables satisfying the relation
\[ n \ll 2^{l^\alpha} \quad (\alpha < 1/2) \tag{3} \]
the asymptotics of the mean (as \(n, l \to \infty\)) of the random variable \(N(T_{ln})\) is found, and for the class of tables satisfying the condition
\[ \log l / \alpha \le n \le l^\beta \tag{4} \]
\[ (\beta < 2,\ \alpha = 1/2 - \lambda_0 / \log l,\ \lambda_0 \to \infty,\ \lambda_0 / \log l \to 0), \]
it is proved that
\[ N/\overline{N} \xrightarrow{p} 1, \qquad \ln / 2 \log l \xrightarrow{p} 1 \qquad (l, n \to \infty). \]
For tables \(T_{ln}\) of arbitrary sizes, provided only that one of the relations
\[ k \ll \sqrt[3]{n}, \qquad l^2 2^{-k} \to \infty, \]
is fulfilled, the formula
\[ Jn/k \to 1 \qquad (l, n \to \infty) \]
is valid.
Here the symbol \(\xrightarrow{p}\) denotes, as usual, convergence in probability.
3. Let us estimate the probabilities of a dead-end test
Consider the table \(T_{lk}\). We give several definitions.
Definition 3. Two unequal rows of the table \(T_{lk}\) form a connection on the \(i\)-th column if, after deletion of the \(i\)-th column, they become equal.
We shall say that the table \(T_{lk}\) has a connection on the \(i\)-th column if at least one pair of rows of \(T_{lk}\) forms a connection on the \(i\)-th column; the table \(T_{lk}\) has a connection on \(m\) columns if it has a connection on each of the \(m\) columns.
Introduce the notation: \(T_{lk}^*, T_{lk}'\) are, respectively, a test and a dead-end test of the table \(T_{lk}\); \(R_l^0, R_l^1\) are, respectively, the set of tests and the set of tests having a connection on the first column; \(r_l^i\) is the set of tests having \(i\) and only \(i\) connections on the first column; \(\mu(A)\) is the cardinality of the set \(A\).
Let \(Q_0\) and \(Q\) be some sets whose elements are, respectively, the tables \(T_{lk}, T_{(l-1)k}\), and let \(\alpha_0\) and \(\alpha_1\) be integers. By the notation \(\alpha_0 Q_0 \to \alpha_1 Q_1\) we shall agree to understand the following: from the \(\alpha_0\) sets \(Q_0\), by deleting one row from each table, one can construct \(\alpha_1\) sets \(Q_1\). For the sets \(r_l^i, r_{l-1}^i, r_{l-1}^{i-1}\) it is not difficult to prove the validity of the relations
\[ 2 i r_l^i \to (l - 2i + 1) r_{l-1}^{i-1}, \tag{5} \]
\[ (l - 2i) r_l^i \to 2(2^{k-1} - l + i + 1) r_{l-1}^i . \tag{6} \]
The introduced notation allows us to write the equalities
\[ R_{l-1}^0 = \sum_{i=0}^{[(l-1)/2]} r_{l-1}^i, \qquad R_l^1 = \sum_{i=0}^{[l/2]} r_l^i . \tag{7} \]
On the basis of (5) and (6) one can always find an integer \(A_l > 0\) such that the representation
\[ A_l(\alpha_i + \beta_i) r_l^i \to m_i r_{l-1}^{i-1} \cup n_i r_{l-1}^i \qquad (i = 0, 1, \ldots, [l/2]), \]
holds, where \(\alpha_i, \beta_i \ge 0\); \(\alpha_i + \beta_i = 1\); \(m_i, n_i\) are positive integers. It is proved, moreover, that \(\alpha_i, \beta_i\) satisfy the equation
\[ \alpha_i \mu(r_l^i) + \beta_{i-1}\mu(r_l^{i-1}) = \frac{\mu(R_l^1)}{\mu(R_{l-1}^0)} \, \mu(r_{l-1}^{i-1}). \]
Hence and from formulas (7) it follows:
Lemma 1. There exist positive integers \(A_l, B_l\) such that
\[ A_l R_l^1 \to B_l R_{l-1}^0 . \]
A consequence of Lemma 1 is
Lemma 2. The probability of a dead-end test satisfies the inequality
\[ p(T_{lk}') \ge p(T_{lk}^*) \prod_{i=1}^{k} p_i, \qquad p_i = 1 - \frac{2^{l-i+1} C_{2^{k-1}}^{\,l-i+1}}{C_{2^k}^{\,l-i+1}} . \]
Let the table \(T_{lk}\) form a test. Fix \(m\) columns.
Definition 4. We shall call a garland a group of \(i+1\) rows that form connections among themselves on \(i\) columns belonging to the \(m\) specified columns.
Definition 5. We shall call a maximal garland on the \(j\)-th column a garland that forms connections on the maximal number of columns among the \(m\) fixed columns, including the \(j\)-th.
Denote by \(X_i^m\) a test \(T_{lk}^*\) having a connection on \(m\) fixed columns, among which the first is included, and by \(Y_{i\max}\) a test \(T_{lk}^*\) containing a maximal garland of \(i+1\) rows on the first column.
Then
\[ p\left(X_l^m/R_l^0\right)=\sum_{i=1}^{m}p\left(Y_i\max X_l^{m-i}/R_l^0\right)\quad (X_l^0\equiv 1). \tag{8} \]
From the assertion of Lemma 1 and the fact that the probability of a join on \(m\) columns does not increase when the number of rows of the table is decreased, the probabilities entering the right-hand side of (8) are estimated from above. As a result we have
\[ p\left(X_l^m/R_l^0\right)\leq p_1p\left(X_l^{m-1}/R_l^0\right)+ \sum_{i=2}^{m}2^{k-1}C_{m-1}^{i-1}i!(i-1)!(l2^{-k})^{i+1}p\left(X_l^{m-i}/R_l^0\right). \]
Further, by the method of mathematical induction, one proves
Lemma 3. The probability of a dead-end test satisfies the inequality
\[ p(T_{lk}')\leq p(T_{lk}^*)p_1^k(1+\delta)^k, \]
where
\[ \delta=\frac{l^3k}{2^{2k}p_1^2}\, \frac{1-(lk'/2^kp_1)^k}{1-lk'/2^kp_1}. \]
A consequence of Lemmas 2 and 3 is
Theorem 1. If \(k^2l^{-1}\to 0\), then
\[ p(T_{lk}')\sim p(T_{lk}^*)\left(1-\exp(-l^22^{-k-1})\right). \]
For those \(k\) for which the condition of Theorem 1 is not satisfied, we give the following upper estimate for the probability of a dead-end test:
\[ p(T'_{l(k+1)})\leq p(\overline{T}_{lk}^*)\,p(T'_{lk})/p(T_{lk}^*), \tag{9} \]
4. The mean number of dead-end tests can be represented by the sum
\[ \overline{N}= \sum_{k=[\log l]}^{\min(n,l-1)}\overline{N}_k, \qquad \overline{N}_k=C_n^k p(T'_{lk}). \tag{10} \]
where \(\overline{N}_k\) is the mean number of dead-end tests of length \(k\) (the length of a test is the number of columns in it). Using Theorem 1 and formulas (9) and (10), one proves
Theorem 2. 1) If \(n\leq l^2(\log l)^\gamma\) \((\gamma<3)\), then
\[ \overline{N}\sim \overline{N}_{k_0}+\overline{N}_{k_1}. \]
2) If \(n\leq 2^{l^\alpha}\) \((\alpha<1/2)\), then
\[ \overline{N}\sim \overline{N}_{k_0} \left( 1+ \sum_{i=1}^{[l^{\alpha_0}]-k_0}(\eta_{k_0}^{-1}a^{(i-1)/2})^i + \sum_{i=1}^{k_0-[\log l]}(\eta_{k_0-1}a^{(i-1)/2})^i \right), \]
where
\[ \overline{N}_k\sim \widetilde{N}_k = C_n^k p(T_{lk}^*)(1-x)^k; \quad x=\exp(-l^22^{-k-1}); \quad \eta_k=\widetilde{N}_k/\widetilde{N}_{k+1}; \]
\(k_0\) is the root of the equation \(\eta_k=1\); \(k_1=k_0-1\) or \(k_0+1\); \(\eta_{k_0}\geq 1\), \(\eta_{k_0-1}\leq 1\); \(0<a<a_0<1/2\); \(0\leq a\leq 1/4\).
In particular: 1) for \(2\log l\leq n\leq l^\beta\) \((\beta<2)\)
\[ k_0=[2\log l-\log(-\ln(n^{1/2}\log l-1))-2]; \]
2) for \(l^\varphi\leq n\leq 2^{l^\alpha}\) \((\varphi>2)\), \(a=1/4\),
\[ k_0=[\,1/2\log(n/l^2)-1/2\log\log(n/l^2)-1/2\,]. \]
Consider the set of tables \(Q_n=\{T_{ln}\}\), where \(2\log l\leq n\leq l^\beta\), \(l\in(l_1,(1+a)l_1)\), \(a>0\).
Theorem 3. In almost all tables of the set \(Q_n\)
\[ \overline{N}\sim \overline{N}_{k_0}\qquad (l\to\infty). \]
5. In the table \(T_{ln}\) we number the columns \((1,2,\ldots,n)\). Consider two tables composed of \(k\) columns of the table \(T_{ln}\). Let each of them contain \(m\) and only \(m\) columns with identical numbers. Denote by \(p(m)\), \(p'(m)\) the probability that both tables simultaneously are tests, dead-end tests respectively. For them the following is valid
Lemma 4. 1) \(p(m)\) is a nondecreasing function of \(m\).
2) \(p'(m) \leq p(m) \leq \exp\{-l2^{2-k}+\beta_m\}\), where
\[ \beta_m= \frac{l^2}{2^{2k-m+1}}+ \frac{l}{2^{m+1}}+ \frac{l^3}{2^{2m+1}} \left( \frac{1}{1-l2^{-m}}+ \frac{2\exp(l\cdot 2^{-m+1})}{1-q} \right) \quad (q<1). \]
3) \(p'(m) \leq p^* \sim p(T'_{lk})(1-x)^k(1-x^{2^m})^{k-m}\) \((k^2l^{-1}\to 0,\ l\to\infty)\).
Denote by \(DN_k\) the variance of the number of dead-end tests of length \(k\). Then, by \((3)\), the formula holds
\[ V_k=\frac{DN_k}{\overline{N}_k^{\,2}} = \sum_{m=\max(0,\,2k-n)}^{k} \frac{C_k^m C_{n-k}^{k-m}}{C_n^k} \frac{p'(m)}{p^2(T'_{lk})} -1. \tag{14} \]
Using Lemma 4, one can show that
Lemma 5. For the class of tables \((4)\), \(V_{k_0}\to 0\) \((l,n\to\infty)\).
In what follows we shall consider only those tables of class \((4)\) for which \(\overline{N}\sim \overline{N}_{k_0}\) \((l\to\infty)\). From Chebyshev’s inequality \((4)\) and Lemma 5 it follows that
Theorem 4. \(N/\overline{N}\xrightarrow{p}1\) \((l,n\to\infty)\).
For the random variable \(N^i(T_{ln})\) one can prove a theorem analogous to Theorem 4, and show that \(\overline{N}^{\,i}\sim 2\log l\,\overline{N}/n\) \((n,l\to\infty)\). Hence, and from Theorem 4, it follows that
Theorem 5. \(I_n/2\log l\xrightarrow{p}1\) \((n,l\to\infty)\).
6. Consider the table \(T_{ln}\), whose dimensions are arbitrary. For the random variable \(J\) the following holds:
Theorem 6. If \(k\ll \sqrt[3]{n}\) or \(l2^{2-k}\to\infty\), then \(J_n/k\xrightarrow{p}1\) \((l,n\to\infty)\).
Institute of Mathematics
Siberian Branch of the Academy of Sciences of the USSR
Novosibirsk
Received
2 VII 1969
References
- I. A. Chegis, S. V. Yablonskii, Trudy Mat. Inst. im. V. A. Steklova AN SSSR, 51, 270 (1958).
- A. N. Dmitriev, Yu. I. Zhuravlev, F. P. Krendelev, Diskretnyi analiz, no. 7, 3 (1966).
- V. A. Slepian, Diskretnyi analiz, no. 12, 50 (1968).
- B. V. Gnedenko, A Course in Probability Theory, Moscow, 1961.