FLARE: Induction with Prior Knowledge

Christophe Giraud-Carrier Department of Computer Science, University of Bristol, Bristol, BS8 1UB, England Tel: +44-117-954-5145, Fax: +44-117-954-5208 E-mail: [email protected]

ABSTRACT

This paper discusses a general framework, called FLARE, that integrates inductive learning using prior knowledge together with reasoning in a non-recursive, propositional setting. FLARE learns incrementally by continually revising its knowledge base in the light of new evidence. Prior knowledge is generally given by a teacher and takes the form of pre-encoded rules. Simple defaults, combined with similarity-based reasoning and learning capabilities, enable FLARE to exhibit reasoning that is normally considered non-monotonic. The framework is particularly useful in the context of knowledge acquisition and discovery, as theory and experience are combined. Results of several experiments are reported to demonstrate FLARE's applicability.

1. INTRODUCTION

One of the greatest challenges in the construction of intelligent or expert systems is knowledge acquisition. Traditionally, knowledge acquisition consists of extracting domain knowledge from human experts (via interviews, etc.) and carefully engineering the result into computer-readable rules. Knowledge acquisition is a tedious task that presents many difficulties, both practical and theoretical. Recent successes in machine learning make it possible, however, to increase the effectiveness and efficiency of knowledge acquisition through the use of inductive learning techniques. The main premise is that a system's knowledge base can be, and indeed ought to be, constructed from both rules encoded a priori and rules generated inductively from examples. In this extended context, prior knowledge, induction and, indeed, deduction complement each other naturally:

• The rules engineered from human expertise are essentially prior knowledge.
• Provided examples exist, additional rules can be derived by induction.
• The ability to reason or perform deduction is necessary to make use of prior knowledge.
• Experience can be combined with theory to yield systems capable of confirming, refuting, and modifying their "taught" knowledge, based on empirical evidence.
• Prior knowledge is useful (even necessary) when examples are scarce or atypical.
• The ability to reason may guide the acquisition of new knowledge or learning.

The strong knowledge principle (34) and early work on bias (19) suggest the need for prior knowledge in inductive systems, while the complexity of knowledge acquisition and the inherently changing nature of expertise suggest the need for inductive mechanisms in expert systems. It is this author's contention that the study of the interdependencies between learning and reasoning, and the subsequent integration of induction, prior knowledge and deduction into unified frameworks, may lead to the development of more powerful models. This paper overviews one such recently proposed model, called FLARE (Framework for Learning And REasoning) (14). FLARE combines inductive learning using prior knowledge together with reasoning within the confines of non-recursive, propositional logic. Learning in FLARE is effected incrementally as the system continually adapts to new information. Prior knowledge is generally given by a teacher and takes the form of pre-encoded rules. Simple defaults, combined with similarity-based reasoning and learning capabilities, enable FLARE to exhibit reasoning that is normally considered non-monotonic. The focus here is on the combination of induction and prior knowledge.

The paper is organised as follows. Section 2 discusses related work. Section 3 gives an overview of FLARE. Section 4 reports experimental results on classical datasets and several other applications, including two simple expert systems. Section 5 concludes the paper by summarising the results and discussing further research.

2. RELATED WORK

FLARE follows in the tradition of PDL2 (12) and ILA (13), as it combines inductive learning using prior knowledge together with reasoning in a propositional setting. Unlike PDL2 and ILA, whose reasoning power is limited to classification (i.e., one-step forward inferences only), FLARE supports forward chaining to arbitrary depth. Whereas PDL2's actual operation tends to decouple learning and reasoning (i.e., the system uses distinct mechanisms to perform either one), ILA and FLARE combine them into two-phase algorithms that always reason first and then adapt accordingly. FLARE further extends ILA by providing a more accurate characterisation of conflicting defaults and by supporting the induction of multiple concepts simultaneously.

FLARE bears some similarity to NGE (28). However, because generalisation in FLARE is effected only by dropping conditions, the produced generalisations, or generalised exemplars, are hyperplanes rather than hyperrectangles in the input space. Hence, FLARE implements a form of nearest-hyperplane learning algorithm. FLARE uses static and dynamic priorities to break ties between equidistant generalisations and, unlike NGE, allows overlapping hyperplanes for the purpose of dealing with conflicting defaults. In cases where no generalisations exist (induced or a priori), FLARE degenerates into a restricted form of MBR (30).

Learning in FLARE contrasts with algorithms such as CN2 (4), where all training examples must be available a priori. Rather, FLARE follows an incremental approach similar to that advocated by Elman (7), except that it is the knowledge itself that evolves, rather than the system's structure. In addition, learning in FLARE can be effected continually, as in anytime algorithms (6): any time new information is presented and the target output is known, FLARE can adapt. FLARE's ability to evolve its knowledge base over time is also similar to that found in theory-refinement systems such as RTLS (9), EITHER (22) and KBANN (32).

The integration of prior knowledge into the inductive process is a topic of wide interest in machine learning (e.g., see (2, 5, 8, 19, 23, 27)). The form of prior knowledge used by FLARE consists of domain-specific inference rules. Systems that explicitly combine inductive learning with this kind of prior knowledge include PDLA (10), ScNets (15), ASOCS (17) and ILP (20). ScNets are hybrid symbolic-connectionist models that aim at providing an alternative to knowledge acquisition from experts. Known rules may be pre-encoded and new rules are learned inductively from examples. ASOCS and PDLA are dynamic, self-organizing networks that learn, incrementally, from both examples and explicitly encoded, domain-specific rules. FLARE's approach is more flexible: because the system can reason, domain-specific rules can be deduced automatically from more general rules. ILP models offer the same flexibility. However, ILP takes advantage of the full expressiveness of first-order predicate logic to learn first-order theories from background theories and examples. FLARE's representation language is only as expressive as non-recursive, propositional clauses. However, FLARE handles both nominal and linear (including continuous and numerical) data and, in its simpler, attribute-based setting, FLARE supports evidential reasoning and the prioritisation of rules.

FLARE's reasoning is a combination of rule-based and similarity-based reasoning, similar to CONSYDERR's (31). However, CONSYDERR is strictly concerned with a connectionist approach to concept representation and commonsense reasoning; it does not address the issue of learning. CONSYDERR is also restricted to Boolean features, and a concept's representation is a single conjunction of features. FLARE's concepts generally consist of several (disjunctive) conjunctions of features, each encoding partial and complementary definitions of the concept.

FLARE's limited handling of non-monotonicity differs from the approach taken in logic. Non-monotonic logics typically extend first-order predicate logic while preserving consistency. FLARE's approach consists of tolerating inconsistencies in the knowledge base while providing reasoning mechanisms that ensure that no inconsistent conclusions are ever reached. It essentially consists of using normal defaults for inheritance and an external criterion for cancellation (33). The current criterion relies mostly on a simple counting argument. Though this approach has proven sufficient for simple propositional examples, it is likely to need extending on more sophisticated domains.

3. FLARE

This section describes FLARE's representation language and gives an overview of FLARE's learning and reasoning mechanisms.

3.1. Representation Language and Prior Knowledge

FLARE's representation language is attribute-based. Attributes may range over nominal and bounded linear domains, including closed intervals of continuous numeric values. The basic elements of knowledge are vectors defined over the cross-product of the domains of the attributes. Each vector contains exactly one target-attribute. The target-attribute represents the concept to be learned or, alternatively, the conclusion of the implicit if-then rule encoded by the vector. All domains are extended to contain the special symbols * and ?, which stand for don't-care and don't-know, respectively.
The semantics associated with * and ? are different. An attribute whose value is * is one that is known (or assumed) to be irrelevant in the current context, while an attribute whose value is ? may be relevant but its actual value is currently unknown. Hence, attributes whose value is * may be viewed as adding information to the system's knowledge, while attributes whose value is ? convey a lack of information (i.e., their effect on the conclusion is dependent upon the context). The * symbol is used for the encoding of rules, while the ? symbol accounts for missing attribute values in real-world observations.

An example is a vector in which all attributes are set to either ? or one of their possible values. A rule is a vector in which some of the attributes have become * as a result of generalisation during inductive learning. A precept (see (10)) is similar to a rule but, unlike a rule, it is not induced from examples. Precepts are either given by a teacher or deduced from general knowledge relevant to the domain under study. In the context of a given rule or precept, the * attributes have no effect on the value of the conclusion. Precepts and rules thus represent several examples. Formally, if v is a vector in which attribute A has value *, then v represents all vectors obtained from v by replacing * by any of the possible values of A. Logically, a rule or precept subsumes all of the examples it represents.

FLARE relies on the simple notion of precept to capture some explicit prior knowledge in the form of pre-encoded rules. The prior knowledge captured by precepts is essentially knowledge about the relevance of certain attribute values to the prediction of a given target attribute. That is, for a given target attribute value, a precept encodes which attributes are critical, together with their associated values, and which attributes are irrelevant. Alternatively, from a logical standpoint, a precept encodes the minimum set of sufficient conditions (i.e., the premise) for a particular conclusion to be derived. Within the context of a particular inductive task, precepts also serve as useful learning biases.

By using prior knowledge in the form of precepts together with raw examples, FLARE effectively combines the intensional approach (based on features, expressed here by precepts) and the extensional approach (based on instances, expressed by examples) to learning and reasoning. It is clear that this combination increases flexibility. On the one hand, extensionality accounts for the system's ability to adapt to its environment, i.e., to be more autonomous and self-adaptive. On the other hand, intensionality provides a mechanism through which the system can be taught and thus does not have to suffer unnecessarily from poor or atypical learning environments.

3.2. Algorithmic Overview

FLARE is a self-adaptive, incremental system. It essentially follows the scientific approach to theory formation and revision: available prior knowledge in the form of precepts and experience in the form of examples produce a "theory" that is updated or refined continually by new evidence. Hence, FLARE's knowledge base is interpreted as a "best so far" set of rules for coping with the current application. An overview of FLARE is given in Figure 1.

1. Automatic precept generation
2. Main loop: For each vector presented to the system
   (a) Perform Reasoning
   (b) If there is a target value for the target-attribute, perform Adapting

Fig. 1. FLARE - Algorithmic Overview
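The following minimal Python sketch (illustrative only, not taken from (14); the toy weather-style attributes and all function names are assumptions) shows how vectors containing * and ? might be represented and how the two steps of the main loop fit together:

    # Illustrative sketch of FLARE-style vectors and the main loop of Figure 1.
    # All attribute names and values below are hypothetical toy data.

    DONT_CARE = "*"   # known (or assumed) to be irrelevant in this context
    DONT_KNOW = "?"   # possibly relevant, but its value is currently unknown

    # A knowledge vector is a plain dict of attribute -> value (possibly "*" or "?"),
    # with one designated attribute acting as the target-attribute ("play" here).
    example = {"outlook": "sunny", "windy": "no",  "play": "yes"}   # fully specified example
    rule    = {"outlook": "*",     "windy": "yes", "play": "no"}    # "*" produced by generalisation
    precept = {"outlook": "rain",  "windy": "*",   "play": "no"}    # pre-encoded by a teacher

    def main_loop(stream, kb, target_attr, reason, adapt):
        """Step (2) of Figure 1: for each incoming vector, reason first, then adapt."""
        for v in stream:
            conclusion = reason(kb, v, target_attr)                      # step (2)(a)
            if v.get(target_attr) not in (None, DONT_KNOW, DONT_CARE):   # a target value was supplied
                adapt(kb, v, target_attr)                                # step (2)(b), supervised adaptation
            yield conclusion

Here reason and adapt are placeholders for the mechanisms described in Sections 3.3 and 3.4, and step (1), automatic precept generation, is omitted as it is in the paper.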

The purpose of step (1) is to generate, automatically and a priori, domain-specific precepts by reasoning from potentially available general rules. For succinctness, details of this function are omitted here. Step (2), which is the heart of the algorithm, is, at least conceptually, an infinite loop. It is executed every time new information, in the form of either explicitly taught precepts or examples, is presented to the system. In step (2)(a), FLARE reasons based on the "facts" provided by the input vector and the contents of the current knowledge base. In step (2)(b), FLARE adapts its current knowledge base. Because learning is supervised, FLARE only adapts when a target value for the target-attribute is given explicitly. The combined execution of steps (2)(a) and (2)(b) is referred to as learning. Notice that FLARE's behaviour is akin to what humans often do when they make a tentative decision by reasoning from available information and subsequently adapt or get corrected if necessary.

3.3. Reasoning

FLARE implements a simple form of rule-based reasoning combined with similarity-based reasoning. Deduction in FLARE is applied forward. Facts are coded into a vector in which attributes whose values are known are set accordingly, while all other attributes are ? (i.e., don't-know). One attribute is designated as the target-attribute and, if known, its value is also provided. The Reasoning function is shown in Figure 2.

1. Completion: For each asserted attribute a of v other than the target-attribute, if a is the target-attribute of a definition d and their values are equal, then copy all asserted attributes of d that are ? in v, into v.
2. Forward chaining: If v's target-attribute has not been asserted
   (a) Repeat until no new attribute of v has been asserted
       i. Let w = v.
       ii. For each non-asserted attribute a of v other than the target-attribute, if a rule can be applied to v to assert a, then apply it by asserting a in w.
       iii. Let v = w.
   (b) If a rule can be applied to assert v's target-attribute, then apply it. Otherwise, perform similarity-based assertion.

Fig. 2. Function Reasoning
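A possible rendering of Figure 2 in code is sketched below (again illustrative and simplified, not the reference implementation; the knowledge base is assumed to be a list of (vector, target-attribute) pairs, and distance is the non-symmetric measure D defined in the remainder of this section):

    # Illustrative sketch of the Reasoning function of Figure 2.
    # A knowledge-base entry is a pair (vector, its target-attribute);
    # `distance` is the measure D of Section 3.3 (sketched further below).

    DONT_KNOW, DONT_CARE = "?", "*"

    def asserted(v, a):
        return v.get(a, DONT_KNOW) not in (DONT_KNOW, DONT_CARE)

    def covers(rule, rule_target, v):
        """rule covers v if v satisfies every premise of rule, i.e. D(rule, v) = 0."""
        return all(v.get(a) == val for a, val in rule.items()
                   if a != rule_target and val != DONT_CARE)

    def reason(kb, v, target_attr, distance):
        v = dict(v)
        # 1. Completion: a definition whose conclusion matches an asserted attribute
        #    of v contributes its own asserted attributes to v.
        for d, d_target in kb:
            if d_target != target_attr and asserted(v, d_target) and v[d_target] == d.get(d_target):
                for a, val in d.items():
                    if not asserted(v, a) and val not in (DONT_KNOW, DONT_CARE):
                        v[a] = val
        # 2. Forward chaining, only if the target-attribute is still unasserted.
        if not asserted(v, target_attr):
            changed = True
            while changed:                    # (2)(a): assert reachable subgoals, level by level
                changed, w = False, dict(v)
                for r, r_target in kb:
                    if r_target != target_attr and not asserted(v, r_target) and covers(r, r_target, v):
                        w[r_target] = r[r_target]      # apply the rule by copying its conclusion
                        changed = True
                v = w
            # (2)(b): assert the target-attribute by rule application if possible (D = 0),
            # otherwise by similarity to the closest match with the same target-attribute.
            candidates = [r for r, rt in kb if rt == target_attr]
            if candidates:
                best = min(candidates, key=lambda r: distance(r, v, target_attr))
                v[target_attr] = best[target_attr]
            else:
                v[target_attr] = DONT_KNOW    # no relevant knowledge: conclude "?"
        return v

The relaxation described later in this section (allowing subgoal assertion for matches with D at most TD rather than exact coverage) would replace the covers test in the forward-chaining loop.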

Step (1) applies completion first. In keeping with the classical assumption that what is not known by a learning system is false by default, inductively generated rules lend themselves naturally to the completion principle proposed in (3). Inductively learned rules are inherently definitional, as they essentially encode a concept's description in terms of a set of features. Completion allows the system to reach goals that are not otherwise achievable by existing rules and, even if the top goal is not achieved directly by completion, further reasoning to achieve it is enhanced due to the information gained.

When the target-attribute has not been asserted by completion, step (2) pursues the reasoning process using forward chaining. Each execution of step (2)(a)(ii) corresponds to the achievement of all possible subgoals at a given depth in the inference process. Each iteration uses knowledge acquired in the previous iteration to attempt to derive more new conclusions using existing rules. Step (2)(b) concludes the reasoning phase by asserting the target-attribute. The target-attribute is always asserted, either by rule application or by similarity-based assertion. Hence, FLARE always reaches a conclusion. In the worst case, when there is no information about the target-attribute in the current knowledge base, the value derived for the conclusion must clearly be ?. In all other cases, the validity and accuracy of the derived conclusion depend upon available information.
The two complementary mechanisms used in asserting the target-attribute (i.e., rule application and similarity-based assertion) apply sequentially and are handled uniformly by the following non-symmetric distance function D, adapted from (11, 13). D is defined over n-dimensional vectors. If vector x is stored in the knowledge base and vector y is presented to the system to reason about, then the distance from x to y is:

$$D(x, y) = \frac{\sum_{i=1}^{n} d(x_i, y_i)}{\mathit{num\_asserted}(x)}$$

where, if $x_i^+$, $y_i^+$ denote values of attribute $i$ other than * and ?, then

$$d(*, y_i) = 0, \qquad d(?, y_i) = 0.5, \qquad d(x_i^+, ?) = 0.5, \qquad d(x_i^+, *) = 0.5$$

$$d(x_i^+, y_i^+) = (x_i^+ \neq y_i^+) \quad \text{if attribute } i \text{ is nominal}$$

$$d(x_i^+, y_i^+) = \frac{|x_i^+ - y_i^+|}{\mathit{range}(i)} \quad \text{if attribute } i \text{ is linear}$$

where range(i) is the range of values of attribute i and num_asserted(x) is the number of attributes that are not * in x. The equations for d are consistent with the semantics of * and ? defined in Section 3.1. D(x,y) is meaningful only if x and y have the same target-attribute, and the target-attribute is left out of the computation. D applies to both nominal and linear domains and relies on the corresponding notion of distance between values. In particular, D handles continuous values directly, without need for discretisation. The notion of equality is slightly extended, however, as suggested in (13): two linear values x1 and x2 are equal if and only if |x1 − x2| ≤ δ, for some δ > 0. In the current implementation, δ is some fraction of the range of possible values of each attribute. FLARE currently has no mechanism for individual weighting of features, which may cause performance degradation and increased memory requirements in the presence of a large number of irrelevant features.

In the state of knowledge represented by a vector v, a rule may be applied if it covers v. Informally, w covers v if v satisfies all of the premises of w. The rule w is applied to v by simply copying the value of w's target-attribute into v. Similarity-based assertion, as applied to vector v, consists of asserting the target-attribute of v to the value of that attribute in v's closest match given by D. Note that, since D is not symmetric, w covers v if and only if D(w,v) = 0. Hence, since 0 is the minimum of the distance function, D can be used to apply both reasoning mechanisms in the correct order (i.e., rules first, similarity next), by computing the distance from all the rules in the current knowledge base to v and simply selecting the rule that minimizes D. As it is possible that more than one rule minimizes D, a priority scheme is devised to choose a winner. In order to recognise exceptions and handle cancellation of inheritance, priority is given first to the most specific vector. If all competing vectors have the same specificity, then the system may be faced with conflicting defaults. These are resolved either arbitrarily, when a teacher has provided explicit priorities, or epistemologically, when the evidence gathered so far is in stronger support of one of the defaults than the others. Details are in (14).
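The distance D and the extended notion of equality translate fairly directly into code; the sketch below is illustrative, with the treatment of attribute metadata (which attributes are linear, their ranges, and the fraction used for δ) being assumptions rather than details taken from the paper:

    # Illustrative sketch of the non-symmetric distance D of Section 3.3.
    # x is a stored vector, y is the vector being reasoned about; target_attr is
    # left out of the computation. `linear_ranges` maps each linear attribute to
    # its (min, max) bounds; attributes not listed there are treated as nominal.

    DONT_KNOW, DONT_CARE = "?", "*"
    DELTA_FRACTION = 0.05      # assumed: delta is 5% of an attribute's range

    def d(xi, yi, attr, linear_ranges):
        """Per-attribute distance, following the case analysis above."""
        if xi == DONT_CARE:
            return 0.0                            # d(*, y_i) = 0
        if xi == DONT_KNOW or yi in (DONT_KNOW, DONT_CARE):
            return 0.5                            # d(?, y_i) = d(x_i, ?) = d(x_i, *) = 0.5
        if attr in linear_ranges:                 # linear attribute: normalised difference
            lo, hi = linear_ranges[attr]
            return abs(xi - yi) / (hi - lo)
        return 0.0 if xi == yi else 1.0           # nominal attribute: mismatch test

    def distance(x, y, target_attr, linear_ranges=None):
        """D(x, y): summed per-attribute distances divided by num_asserted(x)."""
        linear_ranges = linear_ranges or {}
        attrs = [a for a in x if a != target_attr]
        num_asserted = sum(1 for a in attrs if x[a] != DONT_CARE)
        if num_asserted == 0:
            return 0.0                            # assumed: an all-'*' vector covers anything
        return sum(d(x[a], y.get(a, DONT_KNOW), a, linear_ranges) for a in attrs) / num_asserted

    def linear_equal(v1, v2, attr, linear_ranges):
        """Extended equality for linear values: |v1 - v2| <= delta."""
        lo, hi = linear_ranges[attr]
        return abs(v1 - v2) <= DELTA_FRACTION * (hi - lo)

With this definition, a rule w covers a vector v exactly when distance(w, v, target_attr) returns 0, so the same call can drive both rule application and similarity-based assertion, as described above.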

Notice that, in forward chaining, the assertion of attributes that are subgoals results from rule application only. As a result, the accuracy of the final goal is increased but the ability to perform approximate reasoning is reduced. It is possible to relax this restriction, thus potentially achieving more subgoals but reducing the confidence in the final result. To this end, the condition in step (2)(a)(ii) can be relaxed to allow not only rules (which are perfect matches, i.e., D=0) but also matches deemed to be "close enough." The measure of closeness can be implemented via a threshold value TD placed on D. The value of TD offers a simple mechanism to increase the level of approximate reasoning.

3.4. Adaptation

The construction of FLARE's knowledge base is effected through incremental, supervised learning. FLARE learns by continually adapting to the information it receives. Training vectors are assumed to become available one at a time, over time. Prior knowledge, in the form of precepts, may be used at any time to augment inductive learning. The set of all examples, rules and precepts that share the same target-attribute can be viewed as a partial function mapping instances into the goal space. In this context, an example maps a single instance to a value in the goal space, while precepts and rules are hyperplanes that map all of their points, or corresponding instances, to the same value in the goal space.

As mentioned above, learning consists of first applying the reasoning scheme and then making adjustments to the current knowledge base to reflect the newly acquired information. The prior application of reasoning allows the system to predict the value of the target-attribute based on information in the current knowledge base. Also, if there are missing attributes in the input vector and the knowledge base contains rules that can be applied to assert these attributes, the rules are applied so that as many of the missing attributes as possible are asserted before the final goal is predicted. Hence, the accuracy of the prediction is increased and generalisation is potentially enhanced, thus enabling FLARE to adapt its knowledge base more effectively.

The system starts with an empty knowledge base. It then adapts to each new vector v, where v is either a precept or an example. If v is the first vector, or there are no vectors in the current knowledge base with the same target-attribute as v, then there cannot be any closest match and v is automatically stored in the current knowledge base (notice how this allows FLARE to naturally support the learning of multiple concepts). Otherwise, reasoning takes place, v is altered into v+ as subgoals are met, and a closest match, say m, is found. The knowledge base is then updated based on the relationship between v+ and m. The possible relationships are summarised below, along with a brief and informal account of the ensuing adaptation.

1. v+ is equal to m (i.e., noise or duplicates).
2. v+ is subsumed by m (i.e., v+ is a special case of m).
3. v+ subsumes m (i.e., v+ is a general case of m).
4. v+ and m can produce a generalisation.
5. all other cases (e.g., v+ is an exception to m, v+ and m are too far apart, etc.).

The first case accounts for the situation in which the state of knowledge encoded by v+ has previously been encountered and stored in the knowledge base. If m and v+ have the same target value, then they are duplicates and no changes are necessary. However, if m and v+ have different target values, then there is an inconsistency, possibly due to noise. To handle this form of noise, FLARE stores, with each vector of its knowledge base, an array of counters containing one entry for each possible value of the target-attribute. All the counters are initialized to 0, except the one corresponding to the vector's target-attribute value, which is initialized to 1. Then, every time a training vector is found to be equal to the stored vector, the counter corresponding to the training vector's target-attribute value is incremented. Using the above notation, the value of m's counter corresponding to v+'s target-attribute value is incremented by 1. The highest counter value represents the statistically "most probable" target-attribute value. In effect, the target-attribute value of a vector is always the one with the highest count. Notice that this value may change over time, as new information becomes available. Also, the extension of the notion of equality discussed in Section 3.3 is used here to produce some generalisation for linear attributes. In effect, the vector m (retained in the knowledge base) acts as a "prototype" and its target-attribute's value is the one most probable among its δ-close neighbours.

The second and third cases are reciprocals of each other. The second case corresponds to the situation in which the target-attribute value of v+ is predicted correctly by deductive inference using the stored rule m. Since the knowledge base already contains sufficient information, there is no need to store v+. However, the evidential strength of m is reinforced. In the third case, it is the stored special case m of v+ that allows the prediction to be made correctly, by a kind of inductive inference. To reflect this inductive leap and account for the additional, more general knowledge provided by v+, m is removed from the knowledge base and replaced by v (rather than v+). The initial evidential strength of v reflects the fact that it subsumes m.

The fourth case is critical in the learning process. It corresponds to the situations in which new generalisations are explicitly induced from, and refined by, examples and precepts. In FLARE, generalisations are constructed simply by dropping conditions (18) (i.e., replacing some attribute's value by *). Vectors m and v+ can produce a generalisation if they have the same target-attribute value, they differ in the value of exactly one attribute, the attribute on which they differ is nominal, the numbers of their attributes not equal to * differ by at most 1, and at least one of them has more than one non-* attribute. Generalisation then consists of setting to * the attribute on which the two vectors differ, in the vector that is most general, as long as that vector has more than one non-* attribute (see (14) for details). Only one of v+ or m is generalised and stored. Notice that this generalisation rule applies only to nominal attributes, as it makes little sense for linear (especially real-valued) domains. For such linear attributes, some generalisation is achieved by the existence of prototypes.

Finally, the fifth case covers all other situations. In these situations, either the current knowledge base produces the correct target value for v+ or it does not. Whether the prediction is correct or not, v+ is added to the knowledge base. The motivation is as follows. If the prediction is incorrect, then the current knowledge base simply does not account for v+ and hence, one way to fix the deficiency is to add v+ to it. In particular, exceptions to existing rules may thus be discovered and stored.
If the prediction is correct, then since none of the four previous cases applies, FLARE deems the prediction not reliable enough to properly account for v+ and adds v+ to the knowledge base. This avoids potential losses of information due to chance and to the incremental nature of learning in FLARE. For example, in the early stages of learning, there may be only one vector in the knowledge base that shares the target-attribute of v+. Hence, that vector would be the best match in relative terms, even if it were not a very good match in absolute terms. There would thus be reasons to be suspicious about the prediction.
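To summarise the five cases operationally, the adaptation step might be sketched as follows (an illustrative reconstruction; the subsumption test, the generalisation pre-conditions and the evidential counters follow the description above, while details such as how the evidential strength of a replacing vector is initialised are simplified assumptions):

    # Illustrative sketch of FLARE's adaptation step (Section 3.4).
    # A knowledge-base entry is (vector, target_attr, counts), where counts maps
    # each observed target value to its evidential support. `distance` is the
    # measure D of Section 3.3; `nominal_attrs` is the set of nominal attributes.

    DONT_KNOW, DONT_CARE = "?", "*"

    def non_star(x, target_attr):
        return [a for a in x if a != target_attr and x[a] != DONT_CARE]

    def subsumes(general, specific, target_attr):
        """general subsumes specific if every non-'*' premise of general matches specific."""
        return all(specific.get(a) == val for a, val in general.items()
                   if a != target_attr and val != DONT_CARE)

    def can_generalise(v, m, target_attr, nominal_attrs):
        """Return the single differing nominal attribute if the pre-conditions of case 4 hold."""
        if v[target_attr] != m[target_attr]:
            return None
        diffs = [a for a in v if a != target_attr and v.get(a) != m.get(a)]
        if len(diffs) != 1 or diffs[0] not in nominal_attrs:
            return None
        nv, nm = len(non_star(v, target_attr)), len(non_star(m, target_attr))
        if abs(nv - nm) > 1 or max(nv, nm) <= 1:
            return None
        return diffs[0]

    def adapt(kb, v, target_attr, distance, nominal_attrs):
        same_target = [e for e in kb if e[1] == target_attr]
        if not same_target:                               # no possible closest match: store v
            kb.append((dict(v), target_attr, {v[target_attr]: 1}))
            return
        m, _, counts = min(same_target, key=lambda e: distance(e[0], v, target_attr))
        same_premises = subsumes(m, v, target_attr) and subsumes(v, m, target_attr)
        same_value = m[target_attr] == v[target_attr]
        if same_premises:                                 # case 1: duplicate or noise
            counts[v[target_attr]] = counts.get(v[target_attr], 0) + 1
        elif subsumes(m, v, target_attr) and same_value:  # case 2: v is a special case of m
            counts[v[target_attr]] += 1                   # reinforce m's evidential strength
        elif subsumes(v, m, target_attr) and same_value:  # case 3: v is a general case of m
            kb.remove((m, target_attr, counts))
            kb.append((dict(v), target_attr, dict(counts)))   # carrying counts over is an assumption
        else:
            dropped = can_generalise(v, m, target_attr, nominal_attrs)
            if dropped is not None:                       # case 4: drop the differing condition
                more_general = m if len(non_star(m, target_attr)) <= len(non_star(v, target_attr)) else v
                general = dict(more_general)
                general[dropped] = DONT_CARE
                kb.remove((m, target_attr, counts))
                kb.append((general, target_attr, dict(counts)))
            else:                                         # case 5: all other cases, store v
                kb.append((dict(v), target_attr, {v[target_attr]: 1}))

In the third and fourth cases the counters are simply carried over, which is a simplification of the evidential-strength bookkeeping described in (14).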

4. EXPERIMENTAL RESULTS

To assess FLARE's performance as an inductive learning system, the standard training set/test set approach was used. Several datasets from the UCI repository (21) were selected. They represent a variety of applications, involving nominal-only attributes, linear-only attributes and mixtures of nominal and linear attributes. FLARE's results were gathered for each application, using 10-way cross-validation. Because FLARE's outcome is dependent upon the ordering of data during learning, each run was repeated 10 times with a new random ordering of the training set. The predictive accuracy for a given run is the average of the 10 corresponding trials and the predictive accuracy for the dataset is the average of the 10 runs. Results are shown in Table 1. In the second and third columns, the first number represents predictive accuracy on the test set and the second number represents the ratio of the size (in number of rules) of the final knowledge base to the number of examples used in learning. This value serves as another measure of the generalisation power of FLARE, as well as an indication of FLARE's memory requirements.

Table 1. Induction Results

Application        No PK          PK
lenses             79.0 - 0.43    80.5 - 0.33
voting-84          92.9 - 0.63    94.5 - 0.25
tic-tac-toe        81.5 - 1.0     88.5 - 0.72
hepatitis          80.0 - 0.94    81.2 - 0.68
zoo                97.4 - 0.36    97.4 - 0.32
iris               94.0 - 0.13    N/A
soybean (small)    100 - 0.98     N/A
segmentation       94.0 - 0.99    N/A
glass              71.8 - 0.22    N/A
breast-cancer      96.6 - 0.47    N/A
sonar              83.8 - 0.77    N/A
Averages           88.3 - 0.63

For the set of selected applications, FLARE's performance with no prior knowledge compares favorably with that of ID3 (24), CN2 (4) and Backpropagation (26), as well as with that of other inductive learning algorithms (e.g., see (1, 35, 36)). In addition, the knowledge base maintained by FLARE is generally significantly smaller than the set of all training vectors.

The first five applications were used to illustrate the effect of prior knowledge on predictive accuracy and knowledge base size. In each case, the set of training examples is augmented by precepts given a priori. Here, the precepts are obtained from domain knowledge provided with the application (voting-84) or generated from the author's common sense (zoo, lenses, hepatitis, tic-tac-toe). For example, in the lenses application, which involves predicting whether a patient should be fitted with lenses, the precept used states that lenses should not be fitted if the patient's tear production rate is low (i.e., the eyes are dry). The results with precepts show an average increase of 2.6% in predictive accuracy and a decrease of 31.3% in the size of the final knowledge base. The decrease in size demonstrates that prior knowledge allows pruning of parts of the input space during learning. Indeed, starting with the same number of training vectors, FLARE ends up with a knowledge base containing about one-third fewer vectors than when precepts are not used. Hence, precepts not only increase generalisation performance, they also reduce memory requirements.
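As an illustration of how such a precept might be encoded, the "dry eyes" precept for the lenses application fixes a single attribute and marks all others as irrelevant (the attribute and value names below follow the usual formulation of the UCI lenses data and are given only as an assumed example, not as the exact encoding used in the experiments):

    # Hypothetical encoding of the lenses precept: if tear production is reduced,
    # no lenses should be fitted, whatever the remaining attributes ("*" everywhere else).
    lenses_precept = {
        "age": "*",
        "spectacle_prescription": "*",
        "astigmatic": "*",
        "tear_production_rate": "reduced",
        "lenses": "none",      # target-attribute value (the conclusion)
    }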

Three experiments with the well-known Nixon Diamond (25) were also conducted to demonstrate FLARE's ability to handle conflicting defaults both intensionally and extensionally:

1. Intensional-only: encode both defaults as precepts, along with a priori relative priorities based on some externally provided information (e.g., religious convictions supersede political affiliations).
2. Extensional-only: neither default is given. Rather, examples of Republicans, Quakers and Republican-Quakers are shown and the system automatically comes up with both the defaults (through induction) and their relative priorities.
3. Intensional and extensional: encode both defaults as precepts without relative priorities. This corresponds to a possibly more natural situation where the system really is in a don't-know state when it comes to deciding on Nixon's dispositions. FLARE adopts an epistemological approach, wherein the conflict is resolved by observing instances of Republican-Quakers. The relative number of pacifists and non-pacifists serves as evidence to lean towards one decision or the other. In other words, it is the system's observation of what seems most common in its environment that creates its belief. This is not unlike the way humans deal with many similar situations.

Finally, two expert system knowledge bases were experimented with. One is called mediadv (16) and is intended to help designers or committees choose the most appropriate media to deliver a training program. It consists of 20 rules, with chains of inference of length at most 2. The other is called health (29) and is intended to predict the longevity of patients based on a variety of factors (e.g., weight, personality type, etc.). In the experiments, 72 of the original 77 rules are used. The chains of inference have length greater than 2.

Experiments with the health knowledge base were aimed only at demonstrating FLARE's ability to perform deduction. The experiments conducted involved chains of inference in which FLARE successively inferred new conclusions until it reached a value for the top goal, longevity. For example, starting with data about an older, unhealthy male (e.g., smoker, alcoholic, aggressive, etc.), the system first deduced that the man's blood pressure and risk of heart disease were higher than average and that the base longevity should be set relatively low (60 years). It then further deduced that the outlook was bleak and that the base longevity should be further reduced. It finally concluded (as expected) that the man's longevity was only 48 years.

The mediadv knowledge base, though less interesting in terms of deduction, was used to show how learning supplements knowledge acquisition. Of particular interest was the case of conflicts that arise because two or more rules may apply to a given situation while implying different goal values. In mediadv, such a conflict exists between the following rules, where X is some fixed conjunction of conditions not shown here:

• rule 13: if (X) and (training-budget=small or training-budget=medium) then media-to-consider=lecture
• rule 14: if (X) and (training-budget=medium) then media-to-consider=lecture-with-slides

It is almost impossible to avoid occurrences of similar conflicts in large knowledge bases elicited from experts. Moreover, even when identified, such conflicts are often difficult to resolve intensionally. Revised rules may be hard to formulate and restoring consistency in one section of the knowledge base may create inconsistencies in another. Through inductive learning, FLARE offers a viable alternative. In the case of mediadv, FLARE may look at various (historical) situations where training-budget was medium and check which media was used then. This information can, in turn, be used to give precedence to one rule over the other.
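To make this concrete, the conflicting branch of the two rules might be encoded as vectors and resolved evidentially along the following lines (an illustrative, hypothetical sketch: the conditions X remain elided as in the rules above, only the training-budget=medium branch of rule 13 is shown, and the observed historical cases are invented for the example; the actual bookkeeping attaches counters to stored vectors, as described in Section 3.4):

    # Hypothetical vector encodings of the conflicting branch of rules 13 and 14
    # (the conditions X are elided, as above). Both cover training-budget=medium
    # but imply different values of the target-attribute media-to-consider.
    rule_13 = {"training-budget": "medium", "media-to-consider": "lecture"}
    rule_14 = {"training-budget": "medium", "media-to-consider": "lecture-with-slides"}

    # Historical cases with training-budget=medium act as evidence: each observed
    # example reinforces the rule whose conclusion it confirms, and the default with
    # the stronger evidential support wins when both rules apply.
    support = {"lecture": 0, "lecture-with-slides": 0}
    for observed_media in ["lecture-with-slides", "lecture-with-slides", "lecture"]:  # assumed data
        support[observed_media] += 1
    preferred = max(support, key=support.get)      # here: lecture-with-slides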

Moreover, this precedence need not become fixed after a certain number of examples has been considered. Indeed, it may evolve over time and even change radically depending on circumstances. In one particular experiment, several additional instances of [(X) and (training-budget=medium)], together with a target value for media-to-consider, were used that effectively gave rule 14 (evidential) precedence over rule 13.

5. CONCLUSION

This paper motivates and overviews a system, called FLARE, that combines inductive learning using prior knowledge together with reasoning within the confines of non-recursive, propositional logic. Reasoning incorporates rules and similarity. Learning is incremental and prior knowledge takes the form of pre-encoded rules. Several important positive conclusions may be drawn from the results of this research. In particular:

• Performance is improved in terms of both memory requirement and predictive accuracy when prior knowledge is used.
• Prior knowledge may be used to reduce the negative effects of poor or atypical learning environments.
• Induction from examples can be used to effectively resolve conflicting defaults extensionally.
• Induction offers a valuable complement to classical knowledge acquisition techniques from experts.

Experiments with FLARE on a variety of applications demonstrate promise. However, much work still remains to be done to achieve a more complete and meaningful integration of learning and reasoning. Areas of future work include the following:

• Designing mechanisms to use reasoning to guide learning.
• Fuzzifying the distance function to better handle uncertainty.
• Improving the induction of explicit generalisations to minimise the size of the knowledge base and increase comprehensibility.
• Further experimenting with larger applications.
• Extending the language to first-order.

REFERENCES

1. AHA, D.W., KIBLER, D. and ALBERT, M.K. Instance-based learning algorithms, Machine Learning, 1991, Vol. 6, pp. 37-66.
2. BUNTINE, W.L. A Theory of Learning Classification Rules, PhD Thesis, 1990, University of Technology, School of Computing Science, Sydney, Australia.
3. CLARK, K.L. Negation as failure, Logic and Databases, 1978, H. Gallaire and J. Minker (Eds.), Plenum Press, pp. 293-322.
4. CLARK, P. and NIBLETT, T. The CN2 induction algorithm, Machine Learning, 1989, Vol. 3, pp. 261-283.
5. COHEN, W. Compiling prior knowledge into an explicit bias, Proceedings of the Ninth International Conference on Machine Learning, 1992, pp. 102-110.
6. DEAN, T.L. and BODDY, M. An analysis of time-dependent planning, Proceedings of the Sixth National Conference on Artificial Intelligence, 1988, pp. 49-54.
7. ELMAN, J.L. Incremental learning, or the importance of starting small, Technical Report CRL 9101, 1991, University of California, San Diego, Center for Research in Language, La Jolla, CA.
8. FLANN, N.S. and DIETTERICH, T.G. A study of explanation-based learning methods for inductive learning, Machine Learning, 1989, Vol. 4, No. 2, pp. 187-226.
9. GINSBERG, A. Theory reduction, theory revision, and retranslation, Proceedings of the Eighth National Conference on Artificial Intelligence, 1990, pp. 777-782.
10. GIRAUD-CARRIER, C. and MARTINEZ, T.R. Using precepts to augment training set learning, Proceedings of the First New Zealand International Two-Stream Conference on Artificial Neural Networks and Expert Systems, 1993, pp. 46-51.
11. GIRAUD-CARRIER, C. and MARTINEZ, T.R. An efficient metric for heterogeneous inductive learning applications in the attribute-value language, Proceedings of the Third Golden West International Conference on Intelligent Systems, 1994, Vol. 1, pp. 341-350 (Kluwer Academic Publishers).
12. GIRAUD-CARRIER, C. and MARTINEZ, T.R. An incremental learning model for commonsense reasoning, Proceedings of the Seventh International Symposium on Artificial Intelligence, 1994, pp. 134-141.
13. GIRAUD-CARRIER, C. and MARTINEZ, T.R. ILA: combining inductive learning with prior knowledge and reasoning, Technical Report CSTR-95-03, 1995, University of Bristol, Department of Computer Science, Bristol, UK.
14. GIRAUD-CARRIER, C. and MARTINEZ, T.R. An integrated framework for learning and reasoning, Journal of Artificial Intelligence Research, 1995, Vol. 3, pp. 147-185.
15. HALL, L.O. and ROMANIUK, S.G. A hybrid connectionist, symbolic learning system, Proceedings of the Eighth National Conference on Artificial Intelligence, 1990, pp. 783-788.
16. HARMON, P. and KING, D. Expert Systems, 1985, John Wiley & Sons, Inc.
17. MARTINEZ, T.R. Adaptive Self-Organizing Networks, PhD Thesis, 1986, University of California, Los Angeles, Tech. Rep. CSD 860093.
18. MICHALSKI, R.S. A theory and methodology of inductive learning, Artificial Intelligence, 1983, Vol. 20, pp. 111-161.
19. MITCHELL, T.M. The need for biases in learning generalisations, Technical Report CNM-TR-5-110, 1980, Rutgers University, New Brunswick, NJ.
20. MUGGLETON, S. and DE RAEDT, L. Inductive logic programming: theory and methods, Journal of Logic Programming, 1994, Vol. 19-20, pp. 629-676.
21. MURPHY, P.M. and AHA, D.W. UCI repository of machine learning databases, 1992, University of California, Irvine, Department of Information and Computer Science.
22. OURSTON, D. and MOONEY, R.J. Theory refinement combining analytical and empirical methods, Artificial Intelligence, 1994, Vol. 66, No. 2, pp. 273-309.
23. PAZZANI, M. Creating high level knowledge structures from simple events, Knowledge Representation and Organization in Machine Learning, 1989, K. Morik (Ed.), pp. 258-287, Springer-Verlag.
24. QUINLAN, J.R. Induction of decision trees, Machine Learning, 1986, Vol. 1, pp. 81-106.
25. REITER, R. and CRISCUOLO, G. On interacting defaults, Proceedings of the Seventh International Joint Conference on Artificial Intelligence, 1981, pp. 270-276.
26. RUMELHART, D.E. and MCCLELLAND, J.L. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, 1986, Vol. 1, MIT Press.
27. SAITTA, L., BOTTA, M., RAVOTTA, S. and SPEROTTO, S. Improving learning using deep models, Proceedings of the First International Workshop on Multistrategy Learning, 1991, pp. 131-143 (George Mason University Press).
28. SALZBERG, S. A nearest hyperrectangle learning method, Machine Learning, 1991, Vol. 6, pp. 251-276.
29. SAWYER, B. and FOSTER, D.L. Programming Expert Systems in Pascal, 1986, John Wiley & Sons, Inc.
30. STANFILL, C. and WALTZ, D. Toward memory-based reasoning, Communications of the ACM, 1986, Vol. 29, No. 12, pp. 1213-1228.
31. SUN, R. A connectionist model for commonsense reasoning incorporating rules and similarities, Knowledge Acquisition, 1992, Vol. 4, pp. 293-321.
32. TOWELL, G.G. and SHAVLIK, J.W. Knowledge-based artificial neural networks, Artificial Intelligence, 1994, Vol. 70, No. 1-2, pp. 119-165.
33. VILAIN, M., KOTON, P. and CHASE, M.P. On analytical and similarity-based classification, Proceedings of the Eighth National Conference on Artificial Intelligence, 1990, pp. 867-874.
34. WATERMAN, D.A. A Guide to Expert Systems, 1986, Addison Wesley.
35. WETTSCHERECK, D. and DIETTERICH, T.G. An experimental comparison of the nearest-neighbor and nearest-hyperrectangle algorithms, Machine Learning, 1994, Vol. 19, pp. 5-28.
36. ZARNDT, F. A Comprehensive Case Study: An Examination of Connectionist and Machine Learning Algorithms, Masters Thesis, 1995, Brigham Young University, Department of Computer Science.