Hello Sarge and Joachim. Thanks for your messages. I will respond to both comments at once. Following Joachim's timid initiative, I will try to supplement my arguments with symbols. This is the compiled LaTeX version.

The question of the data being noisy is very important. First of all, the data must be assumed to be noisy. A neural network can deal with noisy data if it recognizes the data with margins; see the precise definition below.

Let the training [or learning] data be a pair of non-empty, finite, disjoint subsets $A$, $B$ of vectors in Euclidean space, $A \subset \mathbb{R}^n$, $B \subset \mathbb{R}^n$, and assume that the neural network is of the following kind: feedforward, classical, discontinuous, $n$-input, binary-valued, multilayer (with $k$ layers, $k > 0$) perceptron neural network [PNN]. These PNNs in general transform a data $n$-vector, $x \in \mathbb{R}^n$, into a binary value, $P(x) \in \{0, 1\}$, and therefore correspond to functions

$$P : \mathbb{R}^n \to \{0, 1\}$$

Any such function $P$ splits the data space $\mathbb{R}^n$ into two disjoint complementary sets, $X_1$, $X_0$, one being the inverse image of 1, the other being the inverse image of 0:

$$X_1 = P^{-1}(1) \qquad X_0 = P^{-1}(0) \qquad X_1 \cap X_0 = \emptyset \qquad X_1 \cup X_0 = \mathbb{R}^n$$
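As a concrete illustration of such a function P, here is a minimal Python sketch of a discontinuous multilayer perceptron built from step units. The names `heaviside`, `layer_output` and `pnn`, the convention that a unit fires when its weighted sum reaches its threshold, and the toy weights are all assumptions made for illustration; they are not taken from the software mentioned in this thread.

```python
from typing import List, Sequence, Tuple

# One processing unit = (weights, threshold); it outputs 1 when w.x >= threshold.
Unit = Tuple[Sequence[float], float]
Layer = List[Unit]

def heaviside(s: float) -> int:
    """Discontinuous step activation (assumed convention: 1 when s >= 0)."""
    return 1 if s >= 0 else 0

def layer_output(layer: Layer, x: Sequence[float]) -> List[int]:
    """Apply every unit of one layer to the input vector x."""
    return [heaviside(sum(w_i * x_i for w_i, x_i in zip(w, x)) - t)
            for w, t in layer]

def pnn(layers: List[Layer], x: Sequence[float]) -> int:
    """Evaluate P(x) in {0, 1}; the last layer must have a single unit."""
    out: Sequence[float] = x
    for layer in layers:
        out = layer_output(layer, out)
    return int(out[0])

# Toy 2-layer network on R^2 (illustrative values only):
# X1 is the strip 0 <= x[0] <= 1, an intersection of two half spaces.
toy = [
    [((1.0, 0.0), 0.0), ((-1.0, 0.0), -1.0)],  # x0 >= 0 and x0 <= 1
    [((1.0, 1.0), 2.0)],                       # AND of the two units above
]
```

Here `pnn(toy, x)` returns 1 exactly on the strip, so the strip plays the role of X1 and its complement the role of X0.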

For any PNN the sets $X_1$ and $X_0$ are polyhedra [a polyhedron is a finite union of convex cells, with each convex cell being a finite intersection of linear half spaces]. By definition a PNN $P$ cognizes $A$ if $A$ is contained in the first partition member, equivalently if $P(a) = 1$ for all $a \in A$; and $P$ uncognizes $B$ if $B$ is contained in the second one, that is, if $P(b) = 0$ for all $b \in B$:

$$A \subset X_1 \qquad B \subset X_0$$

When both conditions are met we will say that $P$ (perfectly) recognizes the pair $A$, $B$ (of data sets). The recognition of $A$ and $B$ by $P$ is stable if $A$ is contained in the (topological) interior of $X_1$ and $B$ is contained in the interior of $X_0$.

But metric conditions are more revealing. Let $r_A$, $r_B$ be numbers, each to be called a margin. By definition $P$ (perfectly) recognizes $A$ and $B$ with margins $r_A$ and $r_B$ if every $a \in A$ is at distance at least $r_A$ from the boundary of $X_1$ and every $b \in B$ is at distance at least $r_B$ from the boundary of $X_0$. Note that if this happens then the sum of the margins is a lower bound for the distance from $A$ to $B$:

$$r_A + r_B < d(a, b) \quad \text{for all } a \in A \text{ and all } b \in B$$

Recall that, by the traditional definition, the distance $D$ from set $A$ to set $B$ (these being non-empty and finite) is the smallest of the numbers $d(a, b)$. The inequality then says that the distance from $A$ to $B$ is larger than the sum of the margins. Instantaneous learning can now be stated as the following theorem.


Theorem: For data sets $A$, $B$ in $\mathbb{R}^n$, at distance $D > 0$ from each other, and for any pair $r_A$, $r_B$ of margins such that $r_A + r_B < D$, a 3-layer PNN $P : \mathbb{R}^n \to \{0, 1\}$ can be calculated that perfectly recognizes $A$, $B$ with margins $r_A$, $r_B$.

The theorem implies that all data vectors $x \in \mathbb{R}^n$ at distance less than $r_A$ from some $a \in A$ are cognized by $P$:

if $d(x, a) < r_A$ for some $a \in A$, then $P(x) = 1$

while all data vectors $x \in \mathbb{R}^n$ at distance less than $r_B$ from $B$ are uncognized by $P$:

if $d(x, b) < r_B$ for some $b \in B$, then $P(x) = 0$

In particular noisy data [data + small perturbations] are tolerable as usual, that is, if the noise is not too large. In the context of the theorem it is not clear how to make sense of zero tolerance. That is, for $A$, $B$, $r_A$, $r_B$ and $P$ related to each other as in the theorem, what meaning does "zero tolerance" have? Perhaps this clears up Sarge's objection.

The nearest neighbor algorithm [NNA] mentioned by Joachim can be realized as a PNN. This could be expressed as

$$\mathrm{NNA} \subset \mathrm{PNN}$$

Beware, however, that when NNAs are translated into PNNs they turn out to be very special cases. A general PNN that realizes the NNA for a data set $A$ with $p$ data vectors, $|A| = p$, and a data set $B$ with $q$ data vectors, $|B| = q$, seems to require a first layer with $pq$ processing units [PUs] and a second layer with $p$ PUs. The third layer will have only one unit. The numerical values of the weights will depend, as can be expected, on the numerical values of the data vectors. In my opinion $pq + p$ processing units is, for large data sets, too many. Just think of $n$, $p$ and $q$ as large integers, say $n = 60$, $p = 100$ and $q = 100$ (so that $pq + p = 10{,}100$), similar to the Mines vs Rocks benchmark problem. Furthermore the margins are, in the case of the NNA, always equal to half the distance from $A$ to $B$, or $D/2$, an inconvenient restriction.
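The unit count for the NNA can be made concrete. The following Python sketch builds one possible 3-layer realization of the nearest neighbor rule: a first layer with one half-space unit per pair (a, b), a second layer that ANDs the q units belonging to each a, and a single OR unit on top. This is a standard construction assumed here for illustration, and not necessarily the translation intended above; the function names and the tie-breaking convention (ties go to A) are likewise assumptions.

```python
from typing import List, Sequence

Vec = Sequence[float]

def dot(u: Vec, v: Vec) -> float:
    return sum(ui * vi for ui, vi in zip(u, v))

def step(s: float) -> int:
    """Heaviside activation; the value 1 at s == 0 is an assumed convention."""
    return 1 if s >= 0 else 0

def build_nna_pnn(A: List[Vec], B: List[Vec]):
    """Return (first_layer, p, q) for a 3-layer PNN realizing nearest neighbor.

    First layer: p*q units. Unit (i, j) fires when x is at least as close
    to a_i as to b_j, i.e. on the a_i side of the bisecting hyperplane
    (a_i - b_j) . x >= (|a_i|^2 - |b_j|^2) / 2.
    """
    first = []
    for a in A:
        for b in B:
            w = [ai - bi for ai, bi in zip(a, b)]
            t = (dot(a, a) - dot(b, b)) / 2.0
            first.append((w, t))
    return first, len(A), len(B)

def evaluate(net, x: Vec) -> int:
    first, p, q = net
    h = [step(dot(w, x) - t) for w, t in first]        # layer 1: p*q units
    # Layer 2: p AND units, each over the q half spaces of one a_i.
    cells = [step(sum(h[i * q:(i + 1) * q]) - q) for i in range(p)]
    return step(sum(cells) - 1)                        # layer 3: one OR unit

def nearest_neighbor(A: List[Vec], B: List[Vec], x: Vec) -> int:
    """Direct nearest neighbor rule, used only as a cross-check."""
    def d2(u: Vec) -> float:
        return sum((ui - xi) ** 2 for ui, xi in zip(u, x))
    return 1 if min(d2(a) for a in A) <= min(d2(b) for b in B) else 0

# Illustrative data (assumed): p = 2 and q = 2, so p*q = 4 first-layer units.
A = [(0.0, 0.0), (1.0, 0.0)]
B = [(4.0, 0.0), (4.0, 3.0)]
net = build_nna_pnn(A, B)
```

On this toy data `evaluate(net, x)` agrees with `nearest_neighbor(A, B, x)`, and the first layer has exactly pq units, which is the growth that becomes a burden for large p and q.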
In actual practice it is advantageous to be able to select any pair of margins with sum less than $D$; the reason is that, depending on the empirical problem at hand, the number of false positives to be encountered down the road, or of false negatives, can then be minimized. Similarly, the use of real-time updating of the instantaneous PNN could result in a large number of data vectors being employed. Then the already mentioned number $pq$ of first-layer processing units would, for the NNA, grow excessively.

About the input-output comment, also by Joachim, note that by hypothesis $A$ and $B$ are disjoint data sets. This condition means that the data is consistent. Otherwise you will have data vectors that are simultaneously assigned both values 1 and 0. Or, as a concrete case, an X-ray image would have to be simultaneously recognized as "cancer" and "non-cancer", a most annoying situation.
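The hypotheses used throughout (disjoint data sets, the distance D, and margins whose sum is less than D) are easy to check before any network is built. The helper names below are hypothetical, the Euclidean metric is assumed, and the requirement that margins be non-negative is an added assumption; this is only a sketch of the bookkeeping, not of the network construction itself.

```python
import math
from typing import List, Sequence

Vec = Sequence[float]

def is_consistent(A: List[Vec], B: List[Vec]) -> bool:
    """True when no vector is labeled both 1 (in A) and 0 (in B)."""
    return not (set(map(tuple, A)) & set(map(tuple, B)))

def set_distance(A: List[Vec], B: List[Vec]) -> float:
    """D = smallest Euclidean distance d(a, b) over all pairs (a, b)."""
    return min(math.dist(a, b) for a in A for b in B)

def margins_allowed(r_a: float, r_b: float, D: float) -> bool:
    """Margin condition: r_A + r_B < D (non-negativity is an assumption)."""
    return r_a >= 0 and r_b >= 0 and r_a + r_b < D

# Illustrative data (assumed); the closest pair is (1, 0) and (4, 0).
A = [(0.0, 0.0), (1.0, 0.0)]
B = [(4.0, 0.0), (4.0, 3.0)]
D = set_distance(A, B)
```

With these values an asymmetric pair such as `margins_allowed(2.5, 0.4, D)` passes the check, which is the kind of freedom in choosing margins that the NNA, with its fixed D/2, does not offer.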

Instant learning makes the separation of data into 'learning' and 'test' sets useless. Instead, the PNN can be updated in real time. The analogy for instantaneous learning is the whiz kid who immediately learns everything. If he makes a mistake in some examination then, as soon as you tell him, the new knowledge will be assimilated and corrections will take place.

If you download and install the zipped file linked in my original question you will see instantaneous learning, or equivalently extremely fast training, in action.

Thanks again for your interest.

Most cordially,
Daniel Crespin
