Connectionist Knowledge Representation By Generic Rules Extraction from Trained Feedforward Neural Networks

Richi Nayak, Ross Hayward and Joachim Diederich
Neurocomputing Research Centre
Queensland University of Technology
2 George St, Brisbane 4001, Qld, Australia
Email: {nayak, hayward, [email protected]}

Rule extraction from trained neural networks has previously been used to generate propositional rule sets. The extraction of "generic" rules or objects from trained feedforward networks is clearly desirable and sufficient for many applications. We present several approaches to generate a knowledge base that includes rules, facts and an is-a hierarchy, and that enables greater explanatory capability by allowing user interaction. The approaches are: (1) construct two feedforward neural networks with the cascade correlation algorithm [Fahlman & Lebiere, 1991] and the tower algorithm [Gallant, 1990], and extract rules at the level of individual hidden and output units of both networks using the decompositional rule-extraction method "LAP"; (2) train two different feedforward neural networks with the cascade correlation and tower algorithms, and extract rules that map inputs directly into outputs from the examples for each learning algorithm using the pedagogical rule-extraction method "RuleVI"; (3) train a feedforward neural network with constrained error back-propagation [Andrews & Geva, 1994], and extract rules at the level of individual hidden and output units using the decompositional rule-extraction method "RULEX". The extracted symbolic rules are then used to generate a connectionist knowledge base, and the performance of the approach is demonstrated on a number of real-world applications.

1. Introduction

A recognised shortcoming of artificial neural networks (ANNs) is the absence of a capability to explain, in a comprehensible form, the process by which a trained ANN arrives at a specific decision or result. Currently, one of the most promising approaches to overcome this problem is to extract the knowledge embedded in the trained ANN as a set of symbolic rules. Andrews, Diederich & Tickle [1995] developed an overall taxonomy for categorising techniques for extracting rules from ANNs. A total of five primary criteria are proposed, viz. (1) the expressive power of the extracted rules; (2) the translucency of the view taken within the rule extraction technique of the underlying ANN units; (3) the extent to which the underlying ANN incorporates specialised training regimes; (4) the quality of the extracted rules; and (5) the algorithmic complexity of the rule extraction/rule refinement technique. Under classification criterion (2), at one end of the spectrum we have those rule extraction techniques that view the underlying ANN at the maximum level of granularity, i.e. as a set of discrete hidden and output units. Craven & Shavlik [1994] categorised such techniques as "decompositional." The basic motif of decompositional rule extraction techniques is to extract rules at the level of each individual hidden and output unit.

In contrast to the decompositional approaches, the motif in the pedagogical approaches is to view the trained ANN as a "black box". The focus is then on finding rules that map the inputs (i.e. the attribute/value pairs from the problem domain) directly into outputs (e.g. membership of, or exclusion from, some target class). In addition to these two main categories, Andrews et al. [1995] also propose a third category, labelled "eclectic", to accommodate techniques that utilise both decompositional and pedagogical approaches.

In this study, a connectionist knowledge representation system is not used for learning but is the target system for the rule-extraction process. This new approach uses feedforward networks with several learning algorithms and several rule-extraction methods: (1) construct two feedforward neural networks using the cascade correlation algorithm [Fahlman & Lebiere, 1991] and the tower algorithm [Gallant, 1990], and extract rules at the level of individual hidden and output units of both trained networks using the decompositional rule-extraction method "LAP" (Cascade-"LAP" and Tower-"LAP"); (2) train two different feedforward neural networks with the cascade correlation and tower algorithms, and extract rules that map inputs directly into outputs from the examples for both learning algorithms using the pedagogical rule-extraction method "RuleVI" (Cascade-"RuleVI" and Tower-"RuleVI"); (3) train a feedforward neural network with constrained error back propagation [Andrews & Geva, 1994], and extract rules at the level of individual hidden and output units using the decompositional rule-extraction method "RULEX". The extracted symbolic rules are then used to generate a knowledge base for the connectionist knowledge representation system. This knowledge base includes an is-a hierarchy, a collection of facts and rules using generic predicates, as well as an implementation of type restrictions. The extracted rule base can be used for forward and backward reasoning and enables greater explanatory capability by allowing user interaction.

The general objective is to form a symbolic generic representation of a predicate given a set of examples. Each example consists of a number of attributes that may be required to identify the predicate, plus a value for each attribute. The extracted predicates are instantiated and form rule sets including type restrictions.

The paper is organised as follows: section 2 introduces the neural networks used for learning (cascade correlation, rapid backpropagation, the tower algorithm), section 3 describes the rule-extraction techniques (the decompositional algorithms "LAP" and "RULEX", and the pedagogical algorithm "RuleVI"), and section 4 introduces the rule-translation process that generates the connectionist knowledge base consisting of an is-a hierarchy, facts and rules using generic predicates. The final section provides experimental results that present the entire rule-extraction and knowledge-base generation process in detail.

2. Network Architecture

The presented approach is in general independent of the underlying network architecture. However, we utilise different neural networks to support the different rule-extraction methods.

2.1 The BpTower Networks

The tower algorithm [Gallant, 1990] employs single-cell learning to build a tower of cells, where each cell sees the original inputs and the single cell immediately below. The initial network starts with no hidden units, that is, only p input units and a single (output) cell with index p+1; this network is trained using the backpropagation of error algorithm and its weights are then frozen. The tower algorithm then adds a new cell that takes the p inputs plus the activation of the most recently trained cell (immediately below), and trains the p+2 weights of this cell. If the network with the added cell gives improved performance, its coefficients are frozen and the algorithm continues adding and training cells on the newly constructed network; otherwise the last added cell is removed and the network is output.

2.2 The Cascade Correlation Networks

The cascade-type networks [Fahlman & Lebiere, 1991] have particular architectural differences in comparison to multilayer perceptrons. The initial network starts with no hidden units, and only the weights to the outputs are trained. Cascade correlation constructs a network by initially training the output unit to approximate the target function; when training stagnates, a pool of candidate units (a set of new units) is trained with connections from all input and previously inserted units to predict the network error. When training of the candidates stagnates, the one that reduces the error most is inserted into the network by adding a connection to the output unit. The weights into the inserted unit are then frozen and training of the output unit is repeated. This process continues until an acceptable overall network is achieved.

2.3 The Constrained Error Backpropagation Networks

The constrained networks [Andrews & Geva, 1994] make use of some form of locally responsive units in their hidden layers. The rapid backpropagation (RBP) network consists of an input layer, a hidden layer of local basis function units, and an output layer. The hidden units are sigmoid-based locally responsive units that have the effect of partitioning the training data into a set of disjoint regions, each region being represented by a single hidden layer unit. Each local unit is composed of a set of ridges, one ridge for each dimension of the input. A ridge produces appreciable output (a thresholded sum of the activations of the ridges) only if the value presented as input lies within the active range of the ridge. In the ith dimension, the sigmoids of a local basis function unit are parameterised according to the centre, breadth and edge steepness of each ridge; the local response region is created by subtracting the value of one sigmoid from the other. An incremental constructive training algorithm is used, with training involving adjustment of the parameters of the sigmoids - centre, breadth and edge steepness - that define the local response units by gradient descent. During training the output weight is held constant at a value such that the hidden units are prevented from overlapping, i.e., no more than one unit contributes appreciably to the network output.
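All three architectures are built incrementally: train a unit, freeze its weights, add a new unit, and repeat while performance improves. As an illustration, the short Python sketch below shows the tower construction loop of section 2.1; it is a minimal illustration rather than the original implementation, and the helpers train_single_cell and evaluate are hypothetical stand-ins for any single-cell trainer and scoring function.

import numpy as np

def train_tower(X, y, train_single_cell, evaluate, max_cells=10):
    """Build a tower of cells; each new cell sees the p original inputs
    plus the activation of the cell immediately below it."""
    cells = []                 # accepted (frozen) cells, bottom to top
    inputs = X                 # the p original input columns
    best_score = -np.inf
    for _ in range(max_cells):
        cell = train_single_cell(inputs, y)         # train one new cell
        activations = cell(inputs).reshape(-1, 1)   # its column of activations
        score = evaluate(activations, y)
        if score <= best_score:
            break              # no improvement: discard the last cell
        best_score = score
        cells.append(cell)     # freeze the cell's weights
        inputs = np.hstack([X, activations])        # p inputs + newest activation
    return cells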

3. Rule Extraction Algorithm

The rule-extraction methods "LAP", "RuleVI" and "RULEX" are used to extract propositional rule sets after training the corresponding neural networks, namely the cascade correlation networks, the BpTower networks and the RBP networks.

3.1 The "LAP" Rule-Extraction Algorithm

If Heaviside activation functions are used to approximate the sigmoidal function employed by cascade correlation during training, a decompositional rule extraction algorithm can be used to isolate the necessary dependencies between the inputs to each unit in the network and to form a symbolic representation for each unit [Hayward, Tickle & Diederich, 1996]. The "LAP" algorithm assumes that the data presented to the network has been sparsely coded and uses this information as a heuristic to reduce the search space when identifying rules.

Function Enumerate(S1, S2, …, SN) : BOOLEAN is
  /* enumerates recursively all minimal vectors */
  /* returns true if max(S1) + max(S2) + … + max(SN) < unit threshold */
  BOOLEAN belowThreshold;
  belowThreshold := (max(S1) + max(S2) + … + max(SN)) < unit threshold;
  if not belowThreshold then
    BOOLEAN minimal := true;
    for i := 1 to N
      if |Si| > 1 then
        Si := Si - max(Si);
        minimal := Enumerate(S1, S2, …, SN) AND minimal;
      end
    end
    if minimal then
      Save(S1, S2, …, SN);
    end
  end
  return belowThreshold
end Enumerate.

Table 1: The "LAP" algorithm
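To make the search concrete, here is a small runnable Python version of the Enumerate procedure in Table 1. It is a sketch based on one reading of the pseudocode, in which each list holds the candidate weights contributed by one attribute, and combinations whose per-attribute maxima sum past the unit threshold are recorded when no weight can be removed without dropping below the threshold; it is not the authors' implementation.

def enumerate_minimal(sets, threshold, saved):
    """Recursively enumerate minimal weight combinations whose per-set maxima
    sum past the unit threshold; returns True if this combination is below it."""
    below = sum(max(s) for s in sets) < threshold
    if not below:
        minimal = True
        for i, s in enumerate(sets):
            if len(s) > 1:
                reduced = list(sets)
                reduced[i] = sorted(s)[:-1]       # drop the largest weight of set i
                minimal = enumerate_minimal(reduced, threshold, saved) and minimal
        if minimal:
            saved.append([list(s) for s in sets])  # cannot shrink any set and still fire
    return below

# usage: two attributes, each contributing two candidate weights
rules = []
enumerate_minimal([[0.9, 0.2], [0.8, 0.1]], threshold=1.5, saved=rules)
print(rules)      # [[[0.9, 0.2], [0.8, 0.1]]] -- the only minimal firing combination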

The algorithm produces a DNF expression for each perceptron in the network that consists of boolean variables whose value is the result of a test of the set membership of an attribute value for a given instance. The terms of the expression are of the form:

IF xi values for attribute1 ∈ {subset of attribute 1 values}
AND xi values for attribute2 ∈ {subset of attribute 2 values}
...

AND xi values for attributeN ∈ {subset of attribute N values}
THEN perceptron will fire
END

where xi is some instance. If any subset of values for an attribute contains all the possible values, then the test for membership can be dropped, as the variable is trivially true under the assumption that the data is sparsely coded. Each rule may have any number 1...M of values per attribute, where M is the total number of values for that attribute, and any number 1...N of attributes, where N is the total number of attributes.

3.2 The "RuleVI" Algorithm

The core idea behind a pedagogical rule-extraction algorithm is to view rule extraction as a learning task where the target concept is the function computed by the network and the input features are simply the network's input features. "RuleVI" focuses solely on the task of extracting conjunctive rules.

/* Initialise rules for each class */
for each class c
  Rc := NULL
end
repeat
  e := Examples( );
  c := Classify( );
  if e not classified by Rc then
    r := empty rule
    let t be formed from e
    for each input ti
      u := t with ti a range
      if SUBSET(u) = false then
        r := r AND ti;
      else
        t := u
      end
    end
    Rc := Rc OR r
  end
until training instances are exhausted

Table 2: The "RuleVI" algorithm
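The following Python sketch illustrates the pedagogical idea behind Table 2 under the assumption that the trained network is available only as a black-box classifier net(example) -> class. The helper names and the probing strategy (testing every alternative value of an attribute in place of the SUBSET test) are illustrative choices rather than the "RuleVI" implementation.

def extract_conjunctive_rules(examples, attribute_values, net):
    """Learn one conjunctive rule per uncovered example by querying the
    network as an oracle."""
    rules = {}                                  # class -> list of rules (dicts)
    for e in examples:                          # e maps attribute -> value
        c = net(e)
        covered = any(all(e[a] == v for a, v in r.items()) for r in rules.get(c, []))
        if covered:
            continue
        rule = {}
        for a in e:
            # keep the literal only if some other value of `a` changes the class
            if any(net({**e, a: v}) != c for v in attribute_values[a] if v != e[a]):
                rule[a] = e[a]
        rules.setdefault(c, []).append(rule)
    return rules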

The genesis of the "RuleVI" technique is the observation that every symbolic rule in propositional calculus can be expressed as a disjunction of conjunctions [Hayward et al., 1997]. A conjunctive rule holds only when all the antecedents in the rule are true, and hence by changing the truth value of one of the antecedents, the consequent of the rule

changes. Each rule has only one value per attribute and any number 1...N of attributes, where N is the total number of attributes.

IF xi value for attribute1 ∈ {subset of attribute 1 values}
AND xi value for attribute2 ∈ {subset of attribute 2 values}
...
AND xi value for attributeN ∈ {subset of attribute N values}
THEN perceptron will fire
END

where xi is some instance. The algorithm is designed to select only one conjunctive rule per input pattern, but is still able to extract all the rules learned from the patterns.

3.3 The "RULEX" Algorithm

Unlike other decompositional methods, "RULEX" [Andrews & Geva, 1994] is not a search-and-test method. The technique is designed to exploit the manner of construction and consequent behaviour of a particular type of multilayer perceptron, the constrained error back-propagation network. Since "RULEX" operates on local function networks with a single hidden layer of basis function units, which perform function approximation and classification by mapping a local region of input space directly to an output, the individual hidden units or local response units (LRUs) of the trained RBP network can be decompiled into rules of the form:

IF ∀ 1 ≤ i ≤ n : xi ∈ [xi_lower, xi_upper]
THEN Pattern Belongs to the Target Class
END

where xi_lower represents the lower limit of activation of the ith ridge in the LRU and xi_upper represents the upper limit of activation of the ith ridge in the LRU. The "RULEX" algorithm performs rule extraction by direct interpretation of the weight parameters as rules. The number of antecedents per rule is in the range 1...N, where N is the dimensionality of the problem.
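As an illustration of decompiling a local response unit into such an interval rule, the sketch below assumes each ridge is summarised by a centre and breadth; these parameter names are assumptions for the example, and the actual "RULEX" procedure derives the activation limits directly from the trained constrained back-propagation weights.

from dataclasses import dataclass
from typing import List

@dataclass
class Ridge:
    centre: float
    breadth: float          # half-width of the ridge's active range

def lru_to_rule(ridges: List[Ridge], target_class: str) -> str:
    """Interpret each ridge of one local response unit as a closed interval
    on its input dimension and join the intervals into a single rule."""
    antecedents = [
        f"x{i} in [{r.centre - r.breadth:.2f}, {r.centre + r.breadth:.2f}]"
        for i, r in enumerate(ridges, start=1)
    ]
    return "IF " + " AND ".join(antecedents) + f" THEN class = {target_class}"

print(lru_to_rule([Ridge(0.5, 0.2), Ridge(1.0, 0.3)], "target"))
# IF x1 in [0.30, 0.70] AND x2 in [0.70, 1.30] THEN class = target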

4. Rule Translation Process

We are left with a problem corresponding to the quantification of prior determination knowledge, in order to convert the propositional symbolic rules into quantified rules in the form of generic predicates. To demonstrate the rule translation process we use the first monk

example [Thrun et al., 1991] and the decompositional rule extraction method "LAP". Arguably the monk problems (Table 3) are relatively simplistic, but they suffice for an example of the methodology. The robots are classified as monks depending on the values of the attributes. The first monk problem defines the monks as those robots with the same head shape as body shape, or those wearing a red jacket. A subset of instances is used to train the networks defined above, with the resulting network consisting of a single hidden unit and a single output unit.

Attribute      Values
Head_Shape     round, square, octagon
Body_Shape     round, square, octagon
Is_Smiling     true, false
Is_Holding     sword, balloon, flag
Jacket_Color   red, yellow, green, blue
Has_Tie        true, false

Table 3: The monks problem
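For the cascade and BpTower networks the attributes of Table 3 are sparsely (one-of-N) coded, giving 17 binary inputs and hence the 17:1:1 architectures reported in Table 5. A minimal Python sketch of such a coding follows; it is an illustration, and the exact encoding used in the experiments is not specified beyond being sparse.

ATTRIBUTES = {
    "Head_Shape":   ["round", "square", "octagon"],
    "Body_Shape":   ["round", "square", "octagon"],
    "Is_Smiling":   ["true", "false"],
    "Is_Holding":   ["sword", "balloon", "flag"],
    "Jacket_Color": ["red", "yellow", "green", "blue"],
    "Has_Tie":      ["true", "false"],
}

def sparse_code(instance):
    """Map a dict of attribute -> value to a flat 0/1 input vector."""
    return [1 if instance[attr] == value else 0
            for attr, values in ATTRIBUTES.items() for value in values]

x = sparse_code({"Head_Shape": "round", "Body_Shape": "square", "Is_Smiling": "true",
                 "Is_Holding": "flag", "Jacket_Color": "red", "Has_Tie": "false"})
print(len(x))   # 17 binary inputs (3 + 3 + 2 + 3 + 4 + 2)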

The DNF expressions obtained by "LAP" for the case in which the hidden unit (referred to simply as unit) has an activation of one were:

• Head_Shape ∈ {round, square} Λ Body_Shape ∈ {octagon} Λ Jacket_Color ∈ {yellow, green, blue}
• Head_Shape ∈ {square} Λ Body_Shape ∈ {round, octagon} Λ Jacket_Color ∈ {yellow, green, blue}

From the terms of the DNF expression we form ancillary concepts that imply the output of the hidden unit is greater than, in this case, 0.5. Letting X denote the set of Head_Shapes, Y the set of Body_Shapes and Z the set of Jacket_Colors, the ancillary predicates inferring that the hidden unit will fire can be written as {∀ X,Y,Z unit_predicate_a(X,Y,Z)} and {∀ X,Y,Z unit_predicate_b(X,Y,Z)} (based on the two DNF expressions defined above), with their respective associated facts {unit_predicate_a(round or square, octagon, yellow or green or blue)} and {unit_predicate_b(square, round or octagon, yellow or green or blue)}. Each instantiated predicate (fact) contains only one value per attribute and as many attributes as the corresponding DNF term involves. The DNF expressions obtained by "LAP" for the hidden unit having a low output were:

• Jacket_Color ∈ {red}
• Body_Shape ∈ {square}
• Head_Shape ∈ {round, octagon} Λ Body_Shape ∈ {round, square}
• Head_Shape ∈ {octagon}

In a similar fashion we form ancillary concepts from the DNF terms inferring that the hidden unit's output will be below 0.5. These are represented as ∀ X,Y,Z unit_predicate_c(Z), unit_predicate_d(Y), unit_predicate_e(X,Y) and unit_predicate_f(X), with the respective facts unit_predicate_c(red), unit_predicate_d(square), unit_predicate_e(round or octagon, round or square) and unit_predicate_f(octagon). Collecting the dependencies, we form the concept definition for the hidden unit with the rules:

∀ X,Y,Z unit_predicate_a(X,Y,Z) ⇒ unit(X,Y,Z)
∀ X,Y,Z unit_predicate_b(X,Y,Z) ⇒ unit(X,Y,Z)
∀ X,Y,Z unit_predicate_c(Z) ⇒ not unit(X,Y,Z)
∀ X,Y,Z unit_predicate_d(Y) ⇒ not unit(X,Y,Z)
∀ X,Y,Z unit_predicate_e(X,Y) ⇒ not unit(X,Y,Z)
∀ X,Y,Z unit_predicate_f(X) ⇒ not unit(X,Y,Z)

So far we have only considered the symbolic representation for a unit that has connections from the input space; this is not true of all units in cascade networks. Proceeding with the example, we outline how the dependencies between network units are treated. The output unit, robot, is not solely dependent on the input space but also on the output of unit. When the decompositional rule extraction process is applied to the output node, the resulting DNF expression contains an attribute corresponding to unit, with possible values in {0,1} ({false, true} respectively). The hidden unit is named "unit" and the output unit is named "monk". The DNF expression for the monk unit is (we omit the expressions for the case where a robot is not a monk for brevity):

• Jacket_Color ∈ {red} Λ unit ∈ {false}
• Body_Shape ∈ {octagon} Λ unit ∈ {false}
• Head_Shape ∈ {round, square} Λ Body_Shape ∈ {round, octagon} Λ unit ∈ {false}
• Head_Shape ∈ {round, square} Λ Body_Shape ∈ {octagon} Λ Jacket_Color ∈ {red}
• Head_Shape ∈ {square} Λ unit ∈ {false}
• Head_Shape ∈ {square} Λ Body_Shape ∈ {round, octagon} Λ Jacket_Color ∈ {red}

A general predicate for the goal concept of a monk can be expressed as ∀ X,Y,Z monk(X,Y,Z), with the facts from terms four and six above being monk(round or square, octagon, red) and monk(square, round, red). We can further identify the ancillary concepts ∀ Z monk_a(Z) from term one, ∀ Y monk_b(Y) from term two, ∀ X,Y monk_c(X,Y) from term three and ∀ X monk_d(X) from term five. These have the respective facts monk_a(red), monk_b(octagon), monk_c(round or square, round or octagon) and monk_d(square). To complete the definition of a monk we need to introduce rules utilising the definition for unit and the ancillary monk concepts, resulting in the rules:

∀ X,Y,Z monk_a(Z) Λ not unit(X,Y,Z) ⇒ monk(X,Y,Z)
∀ X,Y,Z monk_b(Y) Λ not unit(X,Y,Z) ⇒ monk(X,Y,Z)
∀ X,Y,Z monk_c(X,Y) Λ not unit(X,Y,Z) ⇒ monk(X,Y,Z)
∀ X,Y,Z monk_d(X) Λ not unit(X,Y,Z) ⇒ monk(X,Y,Z)

Having such a knowledge base now allows queries such as monk(round, square, red) to be posed, and hence provides a basis for an explanation of why this classification arose. For our example query an explanation is:

unit_predicate_c(red) ∨ unit_predicate_e(round, square) ⇒ not unit(round, square, red)
not unit(round, square, red) ∧ monk_a(red) ⇒ monk(round, square, red)

The basic idea behind the rule-translation process is to convert the DNF expressions into generic predicate form and facts, so the rule translation process for the rules extracted by "RuleVI" and "RULEX" is very similar to that for "LAP"; the difference lies in the DNF expressions. For example, "RuleVI" proceeds by initialising a DNF expression for each output class to empty; after satisfying certain conditions, a rule is added to the DNF expression for the class. If the rule explained above for "LAP" were instead obtained with "RuleVI", there would not be any intermediate or hidden unit rules; the mapping would be directly from the attributes to the output unit (the classifier). The equivalent rule would be:

unit_predicate_c(red) ∨ unit_predicate_e(round, square) ⇒ monk(round, square, red)

"RULEX" performs rule extraction by direct interpretation of the weight parameters as rules. Since "RULEX" operates on local function networks with a single hidden layer of basis function units, which map a local region of input space directly to an output, rules are extracted from each hidden unit and can be interpreted as a set of attributes together with an output classifier. The rule would be:

unit_predicate_c(red) ∨ unit_predicate_e(round, square) ⇒ monk(round, square, red)
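To show how such a knowledge base can answer the query above, the following Python sketch encodes the unit and monk rules of this section as simple predicate functions and reports which rule fires. The encoding and the returned explanation strings are illustrative assumptions; the authors' connectionist reasoning system is not reproduced here.

def unit_predicate_a(h, b, j):
    return h in {"round", "square"} and b == "octagon" and j in {"yellow", "green", "blue"}

def unit_predicate_b(h, b, j):
    return h == "square" and b in {"round", "octagon"} and j in {"yellow", "green", "blue"}

def unit(h, b, j):
    # the hidden unit fires when either positive ancillary predicate holds
    return unit_predicate_a(h, b, j) or unit_predicate_b(h, b, j)

def monk(h, b, j):
    """Backward chaining over the monk rules: each ancillary monk concept must
    hold together with 'not unit'; the two ground facts are checked directly."""
    if not unit(h, b, j):
        if j == "red":                                              return "monk_a(red)"
        if b == "octagon":                                          return "monk_b(octagon)"
        if h in {"round", "square"} and b in {"round", "octagon"}:  return "monk_c"
        if h == "square":                                           return "monk_d(square)"
    if h in {"round", "square"} and b == "octagon" and j == "red":  return "fact monk(_, octagon, red)"
    if h == "square" and b == "round" and j == "red":               return "fact monk(square, round, red)"
    return None

print(monk("round", "square", "red"))
# 'monk_a(red)': the unit does not fire and the jacket is red, as in the explanation above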

5. Experimental Results and Discussion

The process starts with the training of a supervised ANN. The inputs to the neural network are sparsely coded in the case of the tower and cascade correlation algorithms and represent propositional variables. After training, once the network has been finalised, the rule-extraction method is used to extract a propositional rule set. The extracted rules are transformed into a connectionist knowledge-base representation that includes an is-a hierarchy, a type-token distinction and type restrictions. The process can be summarised as:

Data -> ANN -> Rule-Extraction -> Rule-Translation -> connectionist knowledge base

The approach explained above has been applied to several real-world data sets: the remote sensing data set [Hammadi and Korczak, 1995], to recognise water and natural forest areas from radiometric data contained in a satellite image (3 features, 8k examples each); the congressional voting data set, to classify democratic and republican votes (16 attributes, 435 examples); and the monk problems, to categorise a robot as a monk (6 features, 432 examples). The data sets other than remote sensing are available from ftp://ftp.ics.uci.edu/pub/machine-learning-databases. The remote sensing data (Table 4) has been pre-processed (quantization, elimination of duplicate and conflicting data) to reduce the noise for the training of the networks. The cascade networks and BpTower networks accept the sparse-coded inputs.

The networks' performance in terms of the final architecture and the training and testing error in classification is summarised in Table 5.

Attribute   Values
Red         very_low, low, medium, high, very_high
Green       very_low, low, medium, high, very_high
Blue        very_low, low, medium, high, very_high

Table 4: The remote sensing problem

Data set                   | Cascade network           | BpTower network           | Constrained Error BPN
                           | arch.    train%   test%   | arch.    train%   test%   | arch.    train%   test%
Monk1                      | 17:1:1   0        0       | 17:1:1   0        0       | 6:6:1    0        0.23
Remote Sensing (water)     | 15:0:1   0        0       | 15:0:1   0        0       | 3:2:1    0        0
Remote Sensing (forest)    | 15:2:1   0        3.086   | 15:1:1   2.35     2.47    | 3:5:1    0        1.85
Congressional Voting       | 32:0:1   0.3      1.6     | 32:0:1   0.6      1.84    | 16:11:1  0        2.3

Table 5: Performance of the networks (architecture as inputs:hidden:outputs; training and testing error in classification, %)

All the algorithms that we used for training the networks incrementally build an ANN with localised hidden unit representations and allow for lateral connections between hidden units. These lateral connections can be part of an inference path in the resulting knowledge base. Such constructive learning algorithms are expected to build a near-minimal network for a given problem, because every hidden unit serves as a unique feature detector and no a priori (and possibly inappropriate) guesses about the size or number of hidden layers have to be made. Once a network is trained, the rule extraction technique is applied. The accuracy and fidelity of the different techniques are outlined in Table 6. We applied both rule extraction techniques, the decompositional "LAP" and the pedagogical "RuleVI", to extract rules from the cascade-type networks. "LAP" forms concept representations with a high degree of algorithmic complexity; this was done to gain rules with high fidelity to the underlying network units and to check the consistency of the overall method. Although the human comprehensibility of the rules is poor in some cases, it is improved by converting them into the connectionist knowledge base, which makes them assessable as queries. "RuleVI", being based on a learning framework, does not provide a rule set with such high fidelity to the underlying network, but it covers the set of instances used to obtain the rule set, which is useful when only the observed facts are wanted in the knowledge base. The rule sets generated using cascade-LAP (cascade-RuleVI) and BpTower-LAP (BpTower-RuleVI) are quite different because of the fundamental differences between the training algorithms.

Data set                   | cascade-LAP     | cascade-RuleVI  | BpTower-LAP     | BpTower-RuleVI  | RULEX
                           | Acc.    Fid.    | Acc.    Fid.    | Acc.    Fid.    | Acc.    Fid.    | Acc.    Fid.
Monk1                      | 100     100     | 98      98      | 100     100     | 85      85      | 100     99.7
Remote Sensing (water)     | 100     100     | 100     100     | 100     100     | 100     100     | 100     100
Remote Sensing (forest)    | 76      80      | 97      100     | 97.53   100     | 96.3    98.8    | 98.8    100
Congressional Voting       | 95      96.5    | 92.4    94      | 95.4    97.3    | 93.1    95      | 97.7    98.4

Table 6: Performance of the rule-extraction techniques (accuracy and fidelity, %)

The comprehensibility of the rule set generated by RULEX is better than that of the others. The reason is that RULEX provides mechanisms for removing redundant antecedent clauses (e.g. input dimensions that are not used in classification) from the extracted rules and for removing redundant rules (replacing two rules with a single, more general rule). The knowledge bases generated using the different techniques explained above are quite different from each other for the same data set because of the different learning algorithms and rule extraction techniques used to generate them. This approach therefore gives the user a choice to construct a knowledge base according to the requirements at hand, such as accuracy, fidelity and comprehensibility.

Conclusion

A framework for the integration of rule-extraction from trained neural networks and rule-processing to generate a knowledge base has been proposed. Rule-extraction from trained neural networks is used to form propositional rule sets. The extracted rules can be generalised, and the resulting generic rules can then be processed by a connectionist reasoning system. Furthermore, the method is able to represent the explicit negation of predicates in describing the goal concept, and the rules support type constraints on their arguments. In comparison to rule extraction techniques where a logical expression is derived describing the overall behaviour of a network, the presented methodology provides a much more detailed explanation of why a particular instance is classified as a member of a goal concept. Providing an explanation in terms of rules and facts involving generic predicates is a step forward in the symbolic representation of networks and offers several natural extensions; the resulting system can in turn be used for deduction and analytical learning. The system uses different learning algorithms and rule extraction techniques to construct knowledge bases for the same data set; as a result, the approach gives the user a choice to construct a knowledge base according to the requirements, such as accuracy, fidelity and comprehensibility.

References

Andrews, R.; Diederich, J.; Tickle, A. B.: A survey and critique of techniques for extracting rules from trained artificial neural networks. Knowledge-Based Systems 8 (1995) 6, 373-389.

Andrews, R.; Geva, S.: Rule extraction from a constrained error back propagation MLP. Proc. 5th Australian Conference on Neural Networks, Brisbane, Queensland, Australia (1994) 9-12.

Craven, M. W.; Shavlik, J. W.: Using sampling and queries to extract rules from trained neural networks. Machine Learning: Proceedings of the Eleventh International Conference, San Francisco, CA, USA (1994).

Fahlman, S. E.; Lebiere, C.: The Cascade-Correlation Learning Architecture. In: Touretzky, D. S. (Ed.): Advances in Neural Information Processing Systems 2. San Mateo, CA: Morgan Kaufmann (1990).

Gallant, S.: Perceptron-based learning algorithms. IEEE Transactions on Neural Networks 1 (1990) 2, 179-191.

Hammadi, F. M.; Korczak, J. J.: An Unsupervised Neural Network Classifier and its Application in Remote Sensing. University Louis Pasteur, Strasbourg, France (1995).

Hayward, R.; Ho-Stuart, C.; Diederich, J.: Neural Networks as Oracles for Rule Extraction. QUT Neurocomputing Research Centre (June 1997).

Hayward, R.; Tickle, A.; Diederich, J.: Extracting rules for grammar recognition from cascade-2 networks. In: Wermter, S.; Riloff, E.; Scheler, G. (Eds.): Symbolic, Connectionist and Statistical Approaches to Learning for Natural Language Processing. Berlin: Springer Verlag (1996) 48-60.

Thrun, S.; Bala, J.; Bloedorn, E.; Bratko, I.; Cestnik, B.; Cheng, J.; De Jong, K.; Dzeroski, S.; Fahlman, S. E.; Fisher, D.; Hamann, R.; Kaufman, K.; Keller, S.; Kononenko, I.; Kreuziger, J.; Michalski, R. S.; Mitchell, T.; Pachowicz, P.; Reich, Y.; Vafaie, H.; Van de Welde, K.; Wenzel, W.; Wnek, J.; Zhang, J.: The MONK's problems: a performance comparison of different learning algorithms. Technical Report CMU-CS-91-197, Carnegie Mellon University (December 1991).