Parallel logic programming systems - Association for Computing

0 downloads 0 Views 3MB Size Report
logic programming languages such as Prolog, whale the family of concurrent ..... selection of a unifying clause in the program when per- forming a resolution.
Parallel

Logic Programming

JACQUES

CHASSIN

IMAG/LGI,

46al~enue

PHILIPPE INRIA,

Systems

DE KERGOMMEAUX Ftilm

Vlallct,

F-38031

Grenoble

Cedex

7, France

CODOGNET

Domawze

de Voluceuu,

Parallelizing because

aims

languages

intrinsic

communication This

article

implementations.

efficient

parallel

art sequential

and

logic Prolog

such

parallel

D.3.2

and

Formal

Key

constraint

of constraint

B.3.2

programmmg;

of parallelism

in order

to obtain

and

most

speedups

also addresses

is, the

the

over state-of-the-

current

and

and the efficiency

parallehsm logic

of existing

and the concurrent

programming

with

logic

parallelism,

[Memory

Design

Multiple

Stream

Techniques]:

Logic

Language

and parallel interpreters;

Mathematical

Data

Concurrent

Languages]:

distributed, Languages]:

Structures]:

Architectures]:

languages;

D.3,4

preprocessors; Lo~c—logzc

F.4.

[Programming

1 [Mathematical

programming

Languages Words

memory

and

Phrases:

AND-parallelism,

constraints,

management,

nondeterminism,

OR-parallelism,

Warren

Machine

Abstract

develops their

implemented,

D. 1.6 [Programming

[Programming

programming,

parallelism,

languages

exploitation

effective

Techniques]:

Processors—compzlers;

Terms:

Additional

the transparent

research

programming

concurrency—that

to be solved

the applicability

C. 1.2 [Processor

Classifications-concurrent, Languages]:

reach

logic

processes—within

been

The article

Descriptors:

raltel

Programming:

General

systems

D. 1.3 [Programming

Programming-pa

have

community,

One

architectures.

memory,

Architectures;

parallel

logic

the

on transparent

solutions

combination

and Subject

Styles—shared

merging

programs.

in existing

to the problems

in extending

in the research

of logic

of concurrent

mainly

solutions

implementations. issues

interest

to express

between

These

as models

use of highly

Logic

family

concentrates

approaches,

Categories

the

programming

research

much

Cedex

of parallelism

programmers

mature

efficient

languages

whale

allowing

the most

prospective

has attracted

and AND-parallelisms

and synchronization

and surveys

Le Ghesna.v

exploitation

as Prolog,

constructs

algorithms.

OR-

at transparent

such

language

systems,

F-78153

logic programming

of the

stream

Rocquencourt,

guard,

hash

multisequential Prolog,

1. INTRODUCTION Logic programs can be computed sequentially or in parallel without changing their declarative semantics. For this reason, they are often considered well suited to programming multiprocessors. At the same time, parallel architectures repre-

bmdmg windows,

arrays,

load

implementation

scheduling

parallel

concurrent

balancing,

massive

techniques,

tasks,

static

analysis,

sent the most promising way to increase the computing power of computers in the future. Since multiprocessors remain difficult to use efficiently, an implicitly parallel programming language offers a very attractive mean of exercising the parallelism of multiprocessors of today and tomorrow.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. 01994 ACM 0360-0300/94/0600-0295 $01.50

ACM

Computing

Surveys,

Vol. 26, No, 3, September

1994

296

-

J. Chassin

de Kergommeaux

and P. Codognet

CONTENTS

1.INTRODUCTION 2,

PARALLELISMS

IN

LOGIC

PROGRAMMING

Programs ‘2 2 Sources of Parallehsm m Logic Programs 23 OR-Parallehsm m Prolog 2.4 AND-Parallehsm m Prolog ISSUES 3. LANGUAGE 3.1 Parallehz]ng Prolog 3.2 Delays, Synchronlzatlon, and Concurrency 3.3 Concurrent Logic Languages 34 Umfymg Prolog and Concurrent Languages 2.1

4

5

Logic

IMPLEMENTATION 41

From

42

Efficient

ISSUES

Early

Models

to Multisequential

Sequential

43

OR-Parallelism

4.4

AND-Parallehsm

45

Scheduhng

SYSTEMS

Systems

Implementation

of Parallel

Techniques

Tasks

EXPLOITING

ONE

TYPE

Exploiting

OR-Parallehsm

OF

PARALLELISM 51

Systems

~ 2 A System

Exploiting

Parallehsm:

Independent

AND-

&-Prolog

53

Systems

54

Performance

Exploiting

Dependent

of Systems

AND-Parallelism

Exploiting

One

Type

of Parallelism 6

SYSTEMS

COMBINING

SEVERAL

TYPES

OF

PARALLELISM 61

Systems

Combming

Independent

AND-

With

OR-Parallelism 62

Systems

Combmmg

Dependent

AND-

With

OR-Parallelism 7

RESEARCH

C~TRRENT

7 1 Statw

?)

Analysis

AND

of Logic

72

Parallelism

and

73

Concurrent

Constraint

74

Use

of Massively

PERSPECTIVES

Programs

Constraint

Logic

Programming

Languages

Parallel

Multiprocessors

CONCLUSION

ACKNOWLEDGMENTS REFERENCES

Logic programming languages are very high-level languages enabling programs to be developed more rapidly and concisely than with imperative languages. However, in spite of important progress in compilation techniques for these languages, they remain less efficient than imperative languages, and their use is mainly constrained to prototyping. Increasing the efficiency of logic programming to the level of imperative languages would certainly enlarge their domain of use and raise the productivity of production programmers.

ACM

Computing

Surveys,

Vol.

26,

No

3, September

1994

These reasons persuaded the ICOT to choose logic programming as the basic programming language of the Fifth Generation Computer Systems project (FGCS) [Furukawa 1992; Shapiro and Warren 1993]. One aim of this project was to produce multiprocessors delivering more than one gigs-lips, one “lip” being one logical inference per second, and one inference being similar to a procedure call of an imperative language. The gigs-lips level of performance seemed out of scope when the FGCS project started in 1982, since the most efficient Prolog systems of that period were limited to several kilo-lips of performance. This is not the case any more, mainly because of the spectacular progress made by the hardware technology. The most efficient Prolog implementations exceed one mega-lip on today’s most powerful RISC microprocessors, and massively parallel multiprocessors can include more than 1000 such processors, However, getting gigs-lip-level performance out of massively parallel multiprocessors will only be possible if parallelizing techniques capture enough of the parallelism in the logic programs while keeping the overhead of parallel execution limited. Additionally, it is not clear whether a large number of logic programs can benefit from massive parallelism. There exist two main schools of thought in the logic programming community: either parallelism should be made explicit, or it should be kept implicit. Explicit parallelism is the approach developed in concurrent logic languages which have already been successfully used in operating systems and parallel simulators [Foster 1990; Shapiro 1989]. The aim of implicit parallelism is instead to speed up existing or future logic programs without troubling programmers with parallelism. Both approaches will be presented in this article, although more emphasis will be placed on systems exploiting parallelism implicitly. Research in parallel logic programming has resulted in the definition of a large number of parallel computational models [Delgado-Rannauro 1992a; 1992b;

Parallel 1992c]. All these models cannot be presented in this article, which will concentrate on the most mature ones, already used in efficient parallel implementations. The organization of the article is the following. The first sections are introductory, defining the different types of parallelism that can be exploited in logic programming, introducing the language issues involved in implicit versus explicit parallelism, and raising implementation issues involved in implementing efficiency logic programming in parallel. The two following sections survey representative systems exploiting one type of parallelism or combining several types of them. Several important research topics and perspectives are then sketched before the conclusion of the article. 2. PARALLELISMS IN LOGIC PROGRAMMING Due to their declarative and logical essence, logic programs naturally exhibit different kinds of parallelism. Indeed, the parallelization of logic programs can be seen as a direct consequence of the key feature advocated by Kowalski [1979] in his well-known motto “programs

= logic

+ control.”

A problem should be decomposed into a logical part (i.e., a specification written as set of logical formulas) which encodes the declarative meaning of the program and a control part (implementation dependent) which provides a way of executing this specification. The most popular instance of the latter, proposed for example by the Prolog language, is sequential depth-first search, but different kinds of parallel execution mechanisms are also possible. Each type of parallelism leads to a specific model of execution that forms the foundations for one of the systems that will be presented later. 2.1 Logic Programs This brief the basic

and intuitive presentation notions of logic programs

of is

Logic

Programming

Systems



297

intended to make the paper self-contained. A more complete introduction to logic programming can be found in Llovd .

[1987i. -

-

A logic program Horn clauses, that form:

A:–

B1,...,

is a set of definite is, formulas of the

Bn.

(1)

A.

(2)

and one query :-

Q1,...,

QP.

(3)

The A,, l?,, and Q, are atomic formuwhere p is a las such as p(tl,,,,,tin), predicate symbol and tl, . . . . t~ are compound terms, constants, or simple variables. In (l), A is the head of the clause while the conjunction of the Bi is called its body. Each of the B, is called a goal. Logic programs enjoy both a declarative, or logical, semantics and an operational, or procedural, semantics. Therefore (1) can be logically read as “if B1 and Bz . . . . and B. are true, then A is true,” and (2) as “A is a fact that is always true.” The declarative interpretation of a logic program P amounts to proving that the query (3) is a logical consequence of the conjunction of the clauses of P. On the other hand, the procedural interpretation of Horn clauses forms the basis of the operational semantics of logic programming. The set of all clauses with the same head predicate symbol p can be considered as the definition of the procedure p. Each goal of the body of a clause can be considered as a procedure call. Parameter passing between the caller and the callee is achieved by unification of the arguments. Unifying two terms consists in finding an assignment of the free variables of both terms which makes the two terms equal. For instance, unifying the two termsl f(X, a, Y) and f(b, Z, T) yields the substitution {X ~ b, Z + a, Y + T}. Hence

1Written with Prolog’s identifiers convention: variables begin with a capital letter, while function symbols and constants begin with a lower-case letter.

ACM

Computing

Surveys,

Vol. 26, No. 3, September

1994

298



J. Chassin

de Kergommeaux

and

parameter passing is multidirectional. At a given step of the execution of a logic program, the set of goals that remain to be executed, i.e., the continuation of the computation, is called the current resoluent. The execution of a logic program starts by taking the query as the initial resolvent and proceeds by transforming the current resolvent into a new one, as described by the following high-level pseudocode: ●

Set the query

s while

current

resolvent

to the

initial

The query is equivalent to the request: “Find a person whose grandfather is Mark.” Let us detail the execution of this program by examining the resolvents produced at each step. During program execution each use of a program clause requires renaming the variables of the clause with fresh variables distinct from those in the resolvent. This will be achieved in the following example by adding subscripts to variables’ names.

(1) The first

the current

—Select

resolvent

any goal

G of the

then

Resolution

*

Select whose

resolvent

*

(2)

a clause of the program head unifies with G of this

* if there solvent tives then if resolvent

(3)

step:

exists a previous rewith untried alterna-

restore

else exit

is empty

Let us illustrate simple example.

this

with

this

resolvent

(2’)

success

machinery

by

a

2.1. Consider the following “genealogy” program:

grandfather :- father father(john,

(X, Z) (X, Y), father

(Y, Z).

(1)

philip).

(2)

father(peter,

andy).

(3)

father(andy,

mark).

(4)

and query :- grandfather(X,

ACM

Computmg

Surveys,

Vol

mark),

26, No

3, September

(5)

1994

mark).

Assume that we select the leftmost goal for continuing the resolution process. It unifies with (head of) clause (2) with substitution {X + john, y, ~ philip], giving the resolvent: mark).

Since no clause head in the program unifies with the goal of the resolvent, backtracking occurs, and the last choice (i.e., second resolvent) is restored: :- father(X,

Example traditional

mark).

Y,), father(Y1,

:- father (philip,

failure

then

query

The (unique) goal of the query (5) unifies with the head of clause (1), that is, grandfather (Xl, Z,), with substitution {Zl + mark, X1 ~ X}, giving the resolvent: :- father(X,

apply the substitution (i. e., variables bindings) resulting from unification to the resolvent

else backtracking

is the initial

:- grandfather(X,

proG

step:

* replace G by the body clause in the resolvent

resolvent

(5):

is nonempty

—if there exists a clause of the gram whose head unifies with



P. Codognet

Y,), father

(Yl, mark).

an alternative uni(3’1 We now consider fying clause to resolve with the leftmost goal, such as clause (3), giving the substitution {X + peter, Y, + and y}, and the resolvent: :- father(andy, (4’)

mark).

This goal unifies with (head of) clause (4), with an empty substitution, giving the empty clause as final resolvent: ❑

(empty

clause).

Parallel

Logic

Program

ming

Systems

w

299

grandfather(X,mark)

x,=x Z1=mark father(X,Yl)

, father(Yl,mark)

,,~,,,,:

,,:,/ father(philip,mark)

father(mark,mark)

father(andy,mark)

fail

success

fail

Figure 1.

Search

tree for tutorial

The execution of this logic program is successful, and the result is the binding of the variable of the query, that is, X = peter, which is called an answer substitution. This computation can be graphically summarized by depicting the associated search tree, shown in Figure 1. The nodes of the tree are the resolvents occurring in the computation, and the arcs are labeled by the set of bindings of the corresponding resolution step. We underline, when necessary, the selected goal in a resolvent. A sequential computation according to Prolog’s strategy simply consists in a depth-first left-to-right search of the tree. Logic programming languages such as Prolog also include several extensions to this “pure” logical setting, as for instance side-effect primitives (e.g., read and write predicates), and the (in)famous cut operator which is used to make determinate parts of the computation. Cut appears syntactically as a special goal (written “ !”) in the body of a clause and will affect, when executed, the part of the computation performed after the entry of the clause. A cut will remove all choicepoints (i.e., backtracking points created when several unifying clauses for a goal exist), that have been created since. No backtracking in this part of the computation is then possible.

example.

Example is replaced cut :

2.2. by

Suppose that clause clause (3) containing

father(peter,

andy) :- !.

(3) a

(3’)

This cut is executed just before the success node of the search tree, and will not affect this success, but will remove the choice-point created for the goal father (X, Y,), which still has a remaining alternative (clause (4)). The cut will remove this alternative, that is, delete on Figure 1 the rightmost branch of the search tree. Here, this has not much effect on the overall computation, since the deleted branch ended with failure, but, had it ended in success, this solution would have been deleted as well. Therefore, using cuts could obviously destroy the logical meaning of a program, because it may delete some success appearing on a branch pruned by a cut. 2.2 Sources

of Parallelism

in Logic Programs

There are two intrinsic sources of parallelism in the above execution model, which correspond to the two choices that have to be made to perform an inherently parallel algorithm sequentially. The first choice is the selection of a goal in the resolvent in order to perform a resolution step. One may envisage selecting several goals of the resolvent and

ACM

Computing

Surveys,

Vol. 26, No. 3, September

1994

300



J. Chassin

de Kergommeaux

and

performing all resolution steps simultaneously. This is called AND-parallelism, since it consists of pursuing in parallel goals that must all be satisfied. The second one is the selection of a unifying clause in the program when performing a resolution step and proceeding to a new resolvent. If there exist several such unifying clauses, it is possible to perform several alternative resolution steps in parallel, creating therefore several new resolvents upon which the computation proceeds. This is called ORparallelism, since it consists of pursuing in parallel alternative goals, any one of which may be satisfied.

2.3

OR-Parallelism

in Prolog

OR-parallelism consists of the simultaneous development of several resolvents that would be computed successively by backtracking in a sequential execution. Examining the search tree (Figure 1’), an OR-parallel search consists of exploring in parallel each branch of the tree, these branches being indeed OR-branches representing alternative computations, In our example, the computation may split into three OR-branches when computing the second resolvent, using the three clauses of the father procedure. Of course, only the second branch succeeds at the next computation step. It is worth noticing that in our definition of OR-parallelism, simultaneous independent execution is not limited to a single resolution step, and thus reduced to a mere database parallelism while searching for a matching clause, but applies to the entire computation of com r.so[.~ents. Thus, if the OR-split plete occurs early enough in the computation, alternative resolvents the remaining might involve large computations, yielding therefore coarse-grain parallelism. The main problem in OR-parallelism is to manage different independent resolvents in parallel, in each of the ORbranches. This is not a problem in the example above where the resolvents are very small, but in general, care must be taken to avoid inefficient copying of large

ACM

Computmg

Surveys,

Vol. 26, No

3, September

1994

P. Codognet

data structures, such as the current set of variables bindings. Sophisticated algorithms and data structures have been designed to overcome this problem, which we present in Section 5. Another issue to be addressed in this “eager evaluation” of alternative choices is tuning the amount of parallelism in order to stop the system from collapsing under too many processes. Efficient solutions mix backtracking and on-demand OR-parallelism (by idle processors) in so-called “multisequential” models.

2.4

AND-ParaHelism

in Prolog

AND-parallelism consists of the simultaneous computation of several goals of a resolvent. Nevertheless, if each subcomputation is followed independently, one needs to ensure the compatibility of the binding sets produced by the parallel branches. Consider again the computation in Example 2.1, and more precisely the second subgoal - father(X, Y,), father(Y1, mark), which has potential AND-parallelism. The execution of the two parallel AND branches results in the following candidate solution sets: Branch

1: goal:

Variable

{X=

:- father(X,

bindings:

Y,)

{X = john, Y, = phlllp},

peter, ‘I’, = andy},

{x = andy,

‘f, =

mark} Branch

2: goal:

Variable

binding:

father(Y1,

mark)

{Yl = andy}

Obviously, only the second set of bindings of the first branch is compatible with the unique binding produced by the second

branch.

One

must

“join”

binding

sets

coming from different AND-branches in order to form a valid solution for the entire resolvent. The main difficulty with AND-parallelism is to obtain coherent bindings for the variables shared by several goals executed in parallel. Run-time checks can be very expensive. This has led to design execution models where goals that may possibly bind shared variables to conflicting values are either serialized or syn-

Parallel chronized. The first solution leads to independent AND-parallelism: only independent goals—those that do not share any variables—are developed in parallel. The second solution leads to dependent AND-parallelism: concurrent execution of goals sharing variables is possible. Such goals are synchronized by producer/consumer relationships on the shared variables. This ammoach has been used in concurrent lo~i; programming languages such as Parfog,- I@l, and Concurrent Prolog. 3. LANGUAGE

Logic more ating

Systems



easily modeled by multiple concurrent processes.

3.1 Parallelizing

301 cooper-

Prolog

Prolog was originally designed with a sequential target machine in mind. However, parallelizing the language without change is an appealing approach for the following reasons: 0

Parallelism can be transparently achieved with “minimal” changes to the initial model, at least at the abstract level, as described in the previous section. All the implementation technology developed for sequential Prolog systems, such as the WAM abstract machine [Warren 1983], can thus be reused. The most successful offsprings of this approach are the multisequential models for both OR and AND parallelism that will be presented in Section 4.

0

Parallel execution does not add any complexity to the programming language; it helps the user to concentrate on declarative statements without bothering with control issues. This view keeps up with the separation of logic and control advocated by Kowalski [1979] in his well-known formula “program = logic + control.”



The corpus of Prolog programs already develop~d can be ex~&te~ without an; modification to the programs’ source code by parallel systems.

ISSUES

The first attempts in the earlv 80s to parallelize Prolog and to define ~ew models of execution for logic programs raised a number of issues, including some key points in language design. The main debate can be summarized by the following question: “Do we need a new language with specific constructs to express concurrency and parallelism, or should we stick to Prolog and exploit parallelism transparently?” Prolog has been in use for nearly two decades and has established itself as a useful tool for a wide range of problems. Declarative languages become more attractive as computing power increases, since it becomes more feasible to displace complexity of programming from the user to the internal computation mechanism. Efficient parallel implementations of existing languages are thus desirable to speed applications that are already developed, and they can moreover lead logic programming a step further in covering effective “real-world” applications. On the other hand, one may sometimes prefer a more flexible control of the search mechanism to the simple but somehow rigid search control of Prolog. This brings up the problem of communication and synchronization in logic programs that calls for new features and changes in language design. Thus was born the new domain of concurrent logic languages, sometimes called committedchoice languages. It has paved the way to a wide range of new applications that are

Programming

Some problems, however, arise when considering the Prolog constructs that are order sensitive. Most side-effect primitives, as for instance read or write, require what could be called ANDsequentiality, and are thus necessary sequentialization points for AND-parallel models. OR-parallel models, however, which keep the sequential order of execution of goals, do not suffer from ANDsequentiality. Nevertheless, side-effect primitives can also cause problems in these models, since programmers usually expect the OR-sequentiality of Prolog, which requires that alternative clauses

ACM

Computing

Surveys,

Vol.

26,

No.

3, September

1994

302



J. Chassin

de Kergommeaux

and

be executed in the textual order of the program. Problems of OR-sequentiality are even stronger when considering “impure” or “nonlogical” features such as the cut operator or the assert/retract primitives which dynamically manipulate program clauses. These constructs will cause problems in OR-parallel execution, because they are sensitive to the order in which OR-branches are explored. Let us consider the cut operator. All the computation corresponding to the development of the alternative clauses that may be pruned by a cut is called speculative work. This work will be pruned if the cut is executed, but will be developed if the branch containing the cut fails before executing it. Scheduling speculative work is therefore a difficult task in OR-parallel models [Hausman 1990], since anticipation may amount to useless computation while conservatism may prevent parallel execution. To avoid the inherent sequentiality of the cut operator, OR-parallel Prolog systems usually introduce a new “symmetric” cut operator: whereas the cut prunes choice-points in the previous alternative clauses of a predicate (with respect to the textual order of the proa symmetric cut will prune gram), choice-points in all the alternative clauses of a predicate (both before and after the clause containing the symmetric cut in the textual order of the program). This new operator will obviously be easier to implement in an OR-parallel system: as soon as one parallel branch reaches a symmetric cut operator, all other branches are killed, without bothering about the order in which they appear in the text of the program. This is of concurclose to the “commit” operator rent logic languages, which will be detailed in the next section.

P. Codognet

alects such as IC-Prolog [Clark and McCabe 1981], Prolog-II [Colmerauer 1986], MU-Prolog [Naish 1984], or SICStus Prolog [Carlsson and Widen 1988] provide primitives to explicitly delay the execution of some goals, and therefore introduce some coroutining mechanism between active goals of the resolvent. Roughly speaking, a predicate can be given a wait declaration specifying that it should not be executed before some of its arguments are instantiated, i.e., bound to nonvariable terms. This corresponds to declaring this goal as a consumer-only goal on those arguments. Such delayed goals in a resolvent can be seen as coroutines that will be woken up by some instantiation of their variables. A more direct and precise control mechanism has been introduced by the family of concurrent logic languages. All these languages have some syntactic way to declare that a process (goal) is either a producer or consumer of some of its arguments. Such declarations are thus used to produce a synchronization mechanism between active processes, i.e., goals of the resolvent. When unification of a goal instantiates a consumed variable, this goal is suspended until that variable is sufficiently instantiated (by some other goal). Let us illustrate this machinery by a simple example. Example 3.1. Consider a very simple program consisting of the three clauses below:

p(a).

(1)

p(b)

(2)

q(b) .

(3)

and a query 3.2

:-p(x),

Delays, Synchronization, and Concurrency

In Prolog, control of forward execution is given by the order of the goals inside a clause and that of the clauses in the program. However, a more flexible control strategy is sometimes needed. Di-

ACM

Computing

Surveys,

Vol

26, No. 3, September

1994

q(x).

(4)

Also suppose that we have stated (we will detail the syntax later on) that p is consumer of its argument (X) while q is producer. Consider an execution that selects goal p in the query: the unification

Parallel with any of clause (1) or (2) amounts to instantiating X, and therefore goal p is suspended. On the other hand, q may be selected, and the unification with clause (3), which amounts to binding X to b, is possible because q is producer of X; p can then be woken up, and unification with clause (2) succeeds without any instantiation, since X is now bound to b. Observe how producer/consumer synchronization is achieved between active goals of the resolvent and results in this example being a purely determinate computation, while a usual Prolog computation would have been nondeterminate: p would have been selected; unification with clause (2) would have produced the binding of X to a; then selection of q would have led to a failure and backtracking, bringing up clause (3) for p; q would then succeed.

3.3

Concurrent

Logic

[Clark and Gregory 1981], the ancestor of (concurrent languages such as Parlog [Clark and Gregory 1986], GHC and KL1 [Ueda and Chikayama 1990], Concurrent Prolog [Shapiro 1983] and FCP [Shapiro 1989], or CP [Saraswat 1987]. The reader should consult Shapiro [ 1989] for a detailed genealogy, history, and complete presentation of this programming paradigm.

Systems



303

The main programming concepts behind those languages can be summarized as follows: ●

The process interpretation of logic programs [Shapiro 1983] replaces the traditional procedural interpretation. Each goal is seen as a distinct process; AND-connected goals are assumed to run concurrently. This form of ANDparallelism is called dependent ANDparallelism, since goals that share variables (hence “linked” or dependent) can run in parallel.



Communication is achieved through logical variables: each variable shared by several goals/processes acts as a communication channel, leading to a very powerful and elegant communication mechanism.



Synchronization is achieved by using a p~oducer/consumer relationship between processes. A process may be blocked during a resolution step until the variables that it consumes are sufficiently instantiated (by some other processes).

Logic Languages

The usual logic programming framework is limited to transformational systems, i.e., systems that, given an original input, transform it to yield some output at the end, and as the end. This framework is not well suited to reactive systems, i.e., “open” systems able to react to a continuous stream of inputs from the “real world,” which are more easily modeled by interacting agents. To address these problems, communication and synchronization techniques derived from the concurrency theory [Dijkstra 1975] have been introduced in logic programming, giving rise to concurrent logic programming. The basic notions for the integration of concurrency into logic programs can be traced back to Relational Language

Programming

Several new major language features are therefore introduced (see example below): (1) Some way of stating a producer/consumer mode for each variable has to be introduced. Each language has its own way to declare the access mode (Read or Write) of a variable in the predicate declaration, which we will not detail here. These modes are used dynamically to induce a producer/ consumer relationship between active processes (goal of the resolvent) on each shared variable, i.e., on each communication channel. This technique amounts to replacing the unification procedure involved at each resolution step by a matching procedure for consumer goals, i.e., allowing binding of the variables in the callee only (head of clause) and not in the caller (goal of the resolvent). This is what happened in Example 3.1: when P(X) was first selected, both clauses (1) and (2) unify, but neither matches,

ACM

Computing

Surveys,

Vol. 26, No 3, September

1994

304



J. Chassin

de Kergommeaux

and

since unification amounts to binding variable X of the goal. This notion in fact can be rephrased and generalized in the Ask-and-Tell mechanism of concurrent constraint languages, which will be presented in Section 7.3. introduced by (2) The concept of guard Dijkstra [1975] and used for imperative languages such as Ada and Occam is adapted to logic programming. A guard is a switch construct of condition/action pairs, which delays execution until one of the conditions is true, and executes the corresponding action as soon as a single condition is verified. This translates to logic programs as follows: each clause is given an additional guard part that appears between the head and the body. A guard is simply a sequence of goals that must be executed successfully before the body of the clause is entered (all body goals are spawned in parallel). Implementation considerations have led the last generation of languages such as KL1 or FCP (Flat Concurrent Prolog) to accept only fZat guards, i.e., guards composed only with built-in predicates. This ensures that no hierarchy of guard systems will ever be created (hence the name) and that guard checking will be reasonably fast. care” re nondetemn[n isrn (3) “Don’t places the traditional “Don’t know” nondeterminism of logic languages. This means that at each resolution step, only one alternative is pursued (and we do not care which) as opposed to pursuing all of them (since we do not know which one to choose). The latter is achieved in a language like Prolog by creating a choice point whenever more than one clause unifies for a resolution step, and by backtracking upon failure of one computation branch. Alternative terminologies are “indeterminism” for “Don’t care” nondeterminism and “angelic nondeterminism” for “Don’t know” nondeterminism. Under “don’t-care” nondeterminism no

ACM

Computing

Surveys,

Vol.

26,

No.

3, September

1994

P. Codognet choice-point is ever created, and no backtracking ever takes place, resulting in a deterministic language. The only nondeterminism lies in the guard check, since all guards are evaluated in parallel. Among all satisfiable guards, one of them is chosen, and to the the computation is committed corresponding clause, neglecting alternative ones. This mechanism is syntactically expressed by the presence of a commit operator between the guard and the body of each clause. The commit indeed corresponds to a generalized and symmetric “cut” operator.

Concurrent logic languages hence depart from traditional logic languages in several ways, most notably by the replacement of unification by matching and by the giving up of “Don’t know” (“angelic”) nondeterminism. The former is the key mechanism for the synchronization mechanism and is the price to pay for a new programming style. The latter is, however, only supported for implementation reasons, since a simple backtracking scheme (such as the chronological backtracking of Prolog) was not easy in a concurrent environment. To give a flavor of what a concurrent logic program is, let us consider the following merge program, written following the syntax of KLl or FCP( l). This program merges the two intmt streams (lists) ;n its first two arguments to produce an output stream in its third argument. L

Example

,3.2

I Out merge ([ Xllnll, ln2, Out) “- true ln2, Out’). [XIOut’], merge(lnl, merge (lnl, [Xlln21, Out) :- true I out [XIOut’1, merge(lnl, in2, Out’), merge([ 1, ln2, Out) :-true I Out = ln2. merge(lnl, [ ], Out) :-truel Old = Inl.

= =

This program is very simple: all guards being empty (true predicates indicate empty guards), synchronization will be enforced by the compound terms present in the head of the clauses. Since unification is replaced by matching, the presence of a compound term in the head of a

Parallel

clause for a given argument place enmode for this arguforces a consumer ment, meaning that this term cannot be bound to a variable of the caller goal but should be checked versus an incoming nonvariable term only. Thus a merge process will wait until some data arrives on one of its input streams (i.e., a “cons” is produced in one input list). When an input stream is reduced to the empty list, merge simply produces what it is fed through the remaining input stream. This program preserves the order of the elements in both input streams, but guarantees nothing about the rate at which each stream will be served. More complex promerger which grams, such as a fair guarantees that every data produced by one input stream will eventually be written on the output stream, are presented in Shapiro [ 1989]. 3.4 Unifying Prolog and Concurrent Logic Languages As described above, parallel concurrent logic languages distinct application areas: Prolog is higher-level terministic, solving.

Prolog address

and two

more suited for declarative programs, such as nondesearch-intensive problem

Concurrent logic lan~ages are more . — suited when explicit control matters and fine-grain coordination are essential, when the problem is more easily modeled by communicating agents. It was thus natural to try to encompass both paradigms in a single language, for the best of both worlds. model was proposed by The Andorra Warren [1988] to combine OR-parallelism and dependent AND-parallelism (cf. Section 2.4), and it has now bred a variety of idioms and extensions developed by different research groups. The essential idea is to execute determinate goals first and in parallel, delaying the execution of nondeterminate goals until no determinate goal can proceed any more. This was inspired by the design of the concurrent language P-Prolog [Yang and

Logic

Programming

Systems



305

Aiso 1986], where synchronization between goals was based on the concept of determinacy of guard systems. However, the roots of such a concept can be traced back further to the early developments of Prolog, as for instance in the sidetrackof Pereira and Porto ing search procedure [1979] which favors the development of goals with fewer alternatives (and hence determinate goals as a special case). This is indeed but another instance of the well-known “first fail” heuristic often used in problem solving, which consists of exploring first the branches that are most likely to fail. An interesting aspect of the Andorra model is its capability to reduce the size of the computation when compared to standard Prolog, since early execution of determinate goals can amount to an a priori pruning of the search space (see Example 3.3 and Section 6.2). The ability of delaying nondeterminate goals as long as determinate ones can proceed amounts to a synchronization mechanism managed by determinacy. The programming style of concurrent logic Iangaages, such as cooperation between a producer process and a consumer process, can thus be somewhat mimicked by making the consumer nondeterminate (and hence blocked) as long as the producer has not produced a value, which would then wake up the consumer by making him determinate; see Costa et al. [ 1991b] for an example. Andorrabased models, such as Andorra-I [Costa et al. 1991a], Andorra Kernel Language (AKL) [Haridi and Janson 1990], and Pandora [Baghat 1993], thus support both Prolog and concurrent logic languages styles of programming. Moreover, the Andorra Kernel Language is an attempt to fully encompass both Prolog and concurrent logic languages. The main new language feature is the introduction of the guard construct borrowed from the concurrent logic paradigm and its associated guard operator. Both ‘

record:



●ther*

prOces*_f

Logic

(Y)

.-------------Y

philip Clloioa

=

point:

father Uaxt

(-tar,

2

andy)

Local Top

o

untxi9d:

fathar

Stack

Trail

stack

trail Aativaticm

Activation

x.cord:

10ng_cOmputatiOn

--------

(P)

P

D

(Y)

---

Y

---------------

;..

record:

process_fathars ”--

-.

unbound

*

point:

Choia* father liaxt

untried:

fatha

Local

Stack

Activation

ToP

-. . ..-.

3

(Y)

. . . . . . . .I

Y

Trail

point:

c2wic0 fathar

,z-

Next untri9d: father (mdy, T-

IWWrd:

10ng_Oouput

-----

I

at ion

--------●

P

mark]

trail“.’’’ .....

activation

andy) ......... ..

o

.IH I

record:

prOcess_fathars

(peter, trail

(P)

stack

*

......

.

i

Figure 4. Backtracking in Prolog implementations. This figure illustrates how locations in the local stack are used to store several bindings (the global stack is not used in the example above, but the situation would be similar If it were). Here, the same location is used to store two of the successive bindings of variable Y, initially bound to philip (1), then unbound when backtracking occurs (2), and then bound to andy (3). The program example is taken from Figure 3.

One problem occurs with stack copying: the worker grabbing work needs to restore its stacks to their state when the OR-node was created. In the example of Figure 3 and Figure 4, the binding of Y a conditional binding, has to to philip, be reset in the copy of the stacks of WI done in the workspace of Wz. Two solutions have been proposed to solve this

problem: the first one, called Kabu-wake2 model, associates to each binding a logical date, such a date being managed by each worker [Masuzawa et al, 1986] (see Figure 5); the second solution uses the

2 Kabu-Wake names a transplanting to grow bonzai trees.

ACM

Computing

Surveys,

Vol.

26,

No

technique

3. September

used

1994

310



J.

Chassin

de Kergommeaux

Stacks Worker

and

Stacks Worker

W, WY

----Local date:

Stack

P. Codognet

W2

- -> Local

WI

Stack

Wz

date: O

O

~ q -------,->

1 ---------

, date:

philip

Y

I -.

-

1

Y

I I

Choice point:

t

process_fathers(Y)

. ..-

. . . . . . ------

-

unbound

I

I

father

I I

Next untried:

I I

father(andy,

I

mark)

I I

date:

O

I I I

,.. ,

. . .. .. .. .. . .. . .

Top trail

I 1 I I !

date:l

Activation

I

record:

long_computation(P) ----------------

I I

-

*

P

II

l==

I I I I I I I

Trail

Stack

—-

*

JVI

I I L———.

. . . . .——

Figure 5. Kabu-Wake model: Use of loglcal dates to ldentlfy conditional variables. The logical date of a worker M the number of choice-points in the stack of this worker. The binding of Y to phzIzp was performed after the creation of the choice-point j!ather and is therefore not valid for Wz. Copying of the stacks of WI to Wz does not interfere with W1’s actiwties. The program example is taken from Figure 3.

trail stack to restore the state of the stack of the worker grabbing work [Ali and Karlsson 19901. In the latter soluWI and Wz synchronize tion, workers during the copy operation so that Wz gets coherent copies of the stacks of W1. Then W2 uses its copy of the trail stack to unbind, in its copies of the stacks, all the conditional bindings performed by WI which are not valid for Wz, reproducing part of the operations done during a backtracking operation (see Figure 4). The first solution results in some time

ACM

Comput,ng

Surveys,

Vol

26,

No.

3. September

1994

and space overhead to associate counter values with each binding while, in the second solution, an idle worker copying the stacks from an active worker slows down the latter since both workers need to synchronize during the copying of the trail stack at least. The overhead of stack copying can be reduced by observing that often an idle worker shares a part of the program search tree with any active worker. In W2 gets general, when an idle worker work from an active one WI, there is no

Parallel

Logic

Programming

*

Systems

311

ROOT

? 1

1 1

: I

? CCP

#,/

,4

-:-----------i---------------” \

/’

W2

,

stack segments

‘\ .

\\

/\,

.

\\

:

Segments

[

copy

‘+

t

-.-i

WCP

to

‘-----

---------------Wz

/

w,



Figure 6. Incremental copying of the stacks. W2 pops its top stack segments until common choice-point and then co~ies on the top of its stacks the stack segments of CCP and WCP.’

need to copy the complete stacks of the active worker WI, since WI and Wz already share a part of the program search tree [Ali and Karlsson 1990]. When Wz takes work from the node (choice-point) to the last WCP of Wl, W2 backtracks common choice-point (CCP) before copying the segments of the stacks of WI that than CCP and older than are younger the choice-point WCP (see Figure 6). Additionally, bindings performed by WI in the common parts of the stacks, between the creations of CCP and WCP, must be installed in the stacks of Wz. This can also be done by using the trail stack of WI or a special data structure recording value trail or such bindings (also named binding

4.3.2

list).

Sharing of Stacks

In the stack-sharing scheme, parts of the local and global stacks are shared; parts are private: workers share the parts of the stacks representing the portions of the proof tree that they have in common. Also, private data structures are used to

it reaches created

W,

the last between

store conflicting bindings to variable locations of the shared parts of the stacks. Two types of data structures have been proposed: binding arrays [Warren 1987] [Borgwardt 19841. and hash windows to A binding array is an array, private each worker, which contains such bindings. Variable locations of the shared part, which could otherwise be bound simultaneously to conflicting values by several workers, do not store values (bindings) as in Prolog systems but integers. These integers index the bindings of the variables within the binding array of each worker sharing these variable locations (see Figare 7). In the alternative solution, bindings performed to the same variable locations are stored in data structures called hash windows and attached to each branch of the search tree. An entry in a hash window is computed by hashing the variable location. All the bindings performed along the path leading from the root of the search tree to a branch of the computation are valid for the worker computing this branch. Therefore, hash windows are

ACM

Computing

Surveys,

Vol.

26,

No.

3, September

1994

312

.

J.

Chassin

de Kergommeaux

and

P. Codognet

Shared Local Stack WORKER

Actwation

record:

proces_fathers(Y)

m . Activation

Y

Binding

‘array

Clwce point:

y -,% ,. ,,: ,: ::

i father

,,’ (:

Next untried: father(andy, ,--

~

(

mark)

1:

,! ,,: ,,

Top trail

,,

,EWORKER

W2

Local Stack

;:.

t!’,

:

,L.:. ;, :, :, :,

record:

10ng_cOmputation(P) -------------------

P



I

Trad slack

1 l+= ‘H Local Stack

!, ,, ,,, ,:

Activation record: 10ng_cOmputat[ On(P)

-----

-----

--------------



‘P

----,

,’

. . ..

,:

=:

Binding

~1

array

W*

I

Trail stack

andy

----

*

Binding

array

t!;

phd)p

‘H

Figure 7. Use of binding arrays. This simple representation of the computational state of the example of Figure 3 exemplifies the use of binding arrays Two workers WI and W2 share the portion of the local stack containing the variable location Y, bound to phL@ by WI and to andy by W2. The global stack is not used in this example.

linked, and variable lookup may imply searching a chain of hash windows of unbounded length (see Figure 8). The concept of “hash window” has led to several implementations [Butler et al. 1986; Chassin de Kergommeaux and Robert 1990; Westphal et al. 1987] (see Section 5.1). An extension of the basic scheme restricts the creation of hash windows to cases where parallelism is actually used [Westphal et al. 1987]. The binding-array scheme adds constant overhead to some variable lookup (dereferencing) operations. Another overhead of this scheme is the updating of the binding array of a worker, when it gets work from another worker. On the contrary, starting a new branch is cheap

ACM

Computmg

Surveys,

Vol.

26.

No.

3, September

1994

the hash window scheme since it is only necessary to create a new hash window and link it to the previous one. However, the cost of a variable lookup is unbound since it may imply the searching of a chain of hash windows of unbounded length.

in

4.3.3 Recomputation

of Stacks

To avoid the overheads associated with the stacks copying and the stacks sharing solutions, it is possible to have workers compute independently complete paths of the search tree [Clocksin and A1shawi 1988]. The parts of the search tree corresponding to portions of the stacks, which would be either copied or

Parallel

Logic

Programming

Systems

313



T~[--------jF3Eq Shared

Local

Activation

Stack

~

record: ,, ,, .,.,.. .1..,... : . . . .

father

I:i

father(andy,

II, (, ,,.

mark)

Previous hash-window

--

: P :

:

------------., q---l

1

i

3’

m--;

Wz

I

record:

10ng_cOmputatiOn(P)

, l;. ,

--1

stack

Actwation

!,:: Next untried:

t

Local

,,:: Choice pint:

andy

1

procw.fathers(y) ------------------Shared

Y

hash(Y)

I

I

‘ash(y)&d Local

stack

;1’,

Activation record: 10ng_cOmputatiOn(P) ------------------*----,.

P

-

n

Figure 8. new

branch

Hash windows. In the original is entered: a new alternative

previous

one

locations

in

the

computation.

in

the

the

path

shared

part

The

program

of the of

the

hash-window taken from

search

tree

stacks

are

example

is

taken

leading done

in

from

shared in one of the previous schemes, are recomputed by each of the active workers. Each worker is given a predetermined path of the search tree described by an “oracle” (path in the search tree) allocated by a specialized worker called “controller,” Programs are rewritten to obtain an arity-2 search tree so that oracles can be efficiently encoded as bit strings (see Figure 9). 4.4

scheme, a new hash a choice-point. This

AND-Parallelism

In systems exploiting AND-parallelism, several workers may bind the same logical variables to conflicting values. To avoid joining dynamically partial ANDsolutions, some systems compute only independent goals in parallel. The other

to the the Figure

All

root.

hash

window

window is created each time a hash window is linked to the

conditional attached

bindings to

the

to

current

variable branch

of

3.

AND-parallel AND-parallelism

systems exploit instead.

4.4.1 Independent

dependent

AA!D-Parallelism

The main problem is the detection of independence between goals. Independence detection can be done during the execution or in advance at compile-time or partly at compile-time and partly during the execution. Run-Time

Independence

Detection

In the AND-OR process model [Conery 1983], dependencies among the goals of each clause are represented by producer/consumer graphs of the variables appearing in the clause, these graphs being updated at run-time by costly op-

ACM

Computmg

Surveys,

Vol.

26,

No.

3, September

1994

314



J.

Chassin

de Kergommeuux

process-fathers(Y)

:- fatherl

10ng-cOmputatiOn(P)

(X, Y),

and

P. Codognet

long.computation(

Y).

:- . ..

fatherl(john,

philip).

fatherl(X,Y)

:- father2(X,Y).

father2(peter,andy). (father2(X,Y)

:- father3(X,Y).

I

father3(andy,mark)

I fatherl(.Y, father-l (john,

Y) :- father2(X,

Bitstring Bitstring

encoding

Y).

phdzp). encoding

1

O

,~73

father2(X,

Y) :- father3(X,

Y).

2

Figure 9. Recomputation of stacks. In order for two workers to compute independently the branches 2 and 3 of this search tree, they will each have to compute the path leading from the root to node II The program example from Figure 3 was rewritten to obtain an arity-2 search tree. The computation path leading to branch 3 can thus be encoded as bit string 11. erations. If several goals have ground (constant) arguments, they can be executed in AND-parallel since their execution is guaranteed not to bind these arguments to conflicting values. In the APEX system [Lin and Kumar 1988], clauses are compiled into a producer/consumer data flow graph. One token is associated with each variable V of a clause. The leftmost goal of the clause with the variable V is a generator of V. A goal becomes executable when it has received the tokens for all its shared variables. This model was implemented with tokens represented by bit vectors, the ith bit representing the ith goal. Another bit vector is associated to each goal so that it is possible to test whether all the generators (other goals in the same clause) of the variables in the clause have completed by a simple bit vector operation. Restricted

AND-Parallelism

To limit the run-time [1984] has proposed

ACM

Computmg

Surveys,

overhead, performing

Vol

26.

No

DeGroot half of

3, September

1994

the independence detection work at compile-time. Programs are compiled into graphs called Conditional Graph Expressimple run-time sions (CGES ), including independence tests such as testing for the groundless of a variable. For example, the clause: f(x)

:- p(x),

will be compiled expression:

q(x),

into

s(x).

the following

graph

f(X) = (ground(X) p (x), (ground(X)

q(X), s(X)).

If the argument of ground (X) is ground, the two following goals are computed in parallel. Otherwise they are executed sequentially. Depending on the instantiation state of X at run-time, three possible execution graphs may occur at run-time (see Figure 10).

Parallel

Logic

X

X ground

Programming

Systems X

not ground

p(x) p(x)

q(x)

/

notground

X not ground

X ground

‘!/

315

P(x)

s(x)

\



“x\

‘(x) s(x)

Figure 10.

Possible

execution

graphs

resulting

from

Conditional

Graph

Expressions.

f(x) is called, all three goals can be computed in parallel. If -X is grounded by the instead, the two last goals q(X) and S(X) can be executed in parallel. Otherwise, executed sequentially,

Conditional Graph Expressions may fail to capture the potential independence between goals sharing the same variables, thus the name “Restricted AND-parallelism.” This occurs most notably in programs where two goals share a variable that is instantiated by only one of them, such as in the quicksort program:

quicksort(

[XIL], Sorted List, Ace)

:- partition(L,

X, Little, Bigs),

quicksort(Littles,

SortedList,

quicksort(Bigs, quicksort(

SortedBigs,

Ace).

[ ], Sorted List, Sorted List).

In this example, no parallelism can be detected automatically between the two in the first recursive calls to quicksort clause, whereas it can be exploited if the example is rewritten as:

quicksort(

[XILI, SortedList,

Ace)

:- partition (L, X, Littles, Bigs), quicksort(Littles, quicksort(Bigs,

Sorted List, [X ITemp] ), SortedBigs,

Ace),

Temp = SortedBIgs. quicksort(

[ ], Sorted List, Sorted List).

Several extensions have been proposed to increase the number of goals eligible

after

of P(X) goals are

for parallel execution [Hermenegildo and Rossi 1990; Winsborough and Waern 1988]. In practice, even “simple” run-time checks can be very expensive, such as testing for groundless of a complex nested Prolog term. To avoid run-time tests, current research is being concentrated on purely static detection of the independence between goals [Muthukumar and Hermenegildo 1990]. Early results indicate that these methods capture almost as much parallelism as the compilation into CGES. 4.4.2

[X ISortedBigs] ),

If X is ground computation all three

Dependent

AND-Parallelism

In the following we will assume that dependent AND-parallelism is only used between “determinate” goals, for which at most one clause matches. This is the case with flat concurrent logic languages and with the Andorra model of computation (see Sections 3.3 and 3.4). Most systems implementing dependent AND-parallelism use the goal process model. A goal process is responsible for trying each candidate clause until one is found that successfully commits and then creating the goal processes for the body goals of the clause. The goal being determinate, at most one candidate clause head will unify with the goal. Tentative unification of a goal process with a clause head will either fail, succeed, or suspend on a variable. Dependent AND-parallel systems need therefore to create, sus-

ACM

Computmg

Surveys,

Vol.

26,

No.

3, September

1994

316



J.

Chassin

de Kergommeaux

and

penal, restart, and kill a potentially high number of fine-grained processes. These basic operations on processes may be the source of considerable run-time overhead and need to be limited to obtain an efficient implementation [Crammond 1988]. Process-switching overhead can be limited by reducing the use of the general scheduling procedure to assign a process to a worker: after completing the last child of a process, a worker continues with the parent; similarly, after spawning processes, a worker continues with the execution of its last child. Another very important problem arising in concurrent logic programming systems is memory consumption. Since these systems do not support the Prolog “don’t nondeterminism, the popping of know” the stacks performed at backtracking in Prolog systems (see Section 4.2 and Figure 4) cannot be done in these systems, resulting in considerable memory consumption and poor data locality. General garbage collection algorithms that compact the stack and the heap require that all workers synchronize on entering and exiting garbage collection [ Crammond 1988]. In the PIM project, an incomplete but faster incremental garbage collection is used instead to increase the locality of references and obtain a better cache-hit ratio. In the Multiple Reference Bit (MRB) algorithm, used inside a processing element or a group of processing elements sharing the same memory (cluster), each pointer has one bit of information to indicate whether it is the only pointer to the referenced data [Chikayama and Kimura 1987]. MRB is supported efficiently in time and space by the hardware of the PIM machine [Taki 1992] while reclaiming more than 60% of cells of stream parallel prothe garbage grams.

4.5

Scheduling

of Parallel

Tasks

The aim of the scheduling of parallel tasks in parallel logic systems is to optimize the global behavior of the parallel system. Therefore, schedulers aim to bal-

ACM

Computmg

Surveys,

Vol.

26,

No.

3, September

1994

P. Codognet

ance the load among workers while limiting as much as possible the overhead arising from task switching. This overhead occurs when workers have exhausted their work and need to move in the computation tree to grab a new task. In order to keep this overhead low, schedulers attempt to limit the number of task switching by maximizing the granularity of parallel tasks and to minimize each task-switching overhead. Additionally, to avoid unnecessary computations, schedulers need to manage speculative work carefully. Maximizing the granularity of parallel tasks would require the ability to estimate either at compile-time or at runtime the computation cost of goals (AND-parallelism) or alternative clauses (OR-parallelism). Current research (see Section 7.1) mainly concentrates on compile-time analysis of granularity [Debray et al. 1990; Tick 1988]. Because of the limited results obtained so far in task granularity analysis, most schedulers use heuristics to select work. One common heuristic used by OR-parallel systems [Baron et al. 1988a; Lusk et al. 1988] is that idle workers are given the highest uncomputed alternative in the computation tree, therefore mixing breadth-first exploration strategy among workers to the depth-first Prolog computation rule within each worker. To limit the number of task switches between workers, several schedulers share the computation of several nodes of the tree among workers [Ali and Karlsson 1991; Beaumont et al. 1991]. These schedulers then attempt to maximize the amount of shared work. In contrast to the granularity of an unexplored alternative (OR-parallelism) or of a goal waiting to be computed, the amount of sharable work in a branch can be easily estimated by the number of unexplored alternatives. Once an active worker has performed sharing of all its available work with an idle one, both workers resume the normal depth-first search of Prolog computation. Workers of independent AND-parallel systems distribute in priority the closest

Parallel

work from their current position in the computation tree: goals waiting to be solved are put on a stack accessible from other workers [Hermenegildo 1986]. In most OR-parallel logic systems, an idle worker switching to a new task needs to install its binding array or to copy parts of the stacks of the active worker providing work. In these systems, the cost of task switching depends on the relative positions of both workers providing and receiving work. Schedulers designed for these systems attempt to minimize the cost of task switching by selecting the closest possible work in the computation tree [Ali and Karlsson 1991; Butler et al. 1988; Calderwood and Szeredi 1989]. This is not necessary for systems designed to provide task-switching costs independent of the relative positions of both workers involved in this operation [Baron et al. 1988a]. Speculative work may be pruned by pending cuts. To avoid performing unnecessary work, a scheduler should give preference to nonspeculative work, and if all available work is speculative, to the least speculative work available [Hausman 1989]. Speculative work is better handled by schedulers performing bottommost dispatching based on sharing [Ali and Karlsson 1991; Beaumont et al. 1991] than topmost dispatching of work [Calderwood and Szeredi 1989], since the control of the former is closer to the depth-first leftmost computation of sequential Prolog systems. 5. SYSTEMS EXPLOITING OF PARALLELISM

ONE TYPE

A large number of models have already been proposed to parallelize Prolog. In this section, we will concentrate on the systems exploiting one type of parallelism that have been efficiently implemented on multiprocessor systems, based on the multisequential approach.

Logic

Programming

Exploiting

OR-Parallelism

The Kabu-Wake system (see Section 4.3.1 and Figure 5) was the first to copy stacks



317

when switching tasks and to use logical dates to discard invalid bindings, apparently without the incremental copying optimization. The Kabu-Wake implementation, on special-purpose hardware, provides nearly linear speedups for large problems computed in parallel. However, it is based on a rather slow interpreter. A more efficient implementation of this model was done on a transputer-based architecture [Briat et al. 1992]. The ANL-WAM system from the Argonne National Laboratories [Butler et al. 1986; Disz et al. 1987] was the first parallel Prolog system based on compiling techniques and implemented on a shared-memory multiprocessor. All trailed logical variables are stored in hash windows (see Section 4.3.2 and Figure 8), even when no parallel computation takes place. Being the first efficient parallel Prolog implementation, the ANL-WAM system has been widely used for experimental purposes. The performance of the ANL-WAM system is limited by the rather low efficiency of the sequential WAM engine used and by the overhead arising from its hash window model. In PEPSys [Baron et al. 1988a; Westphal et al. 1987] hash windows are only used if needed: when distinct logical variables having the same WAM identification are bound simultaneously by several workers. Most accesses to the values of variables are as efficient as in a sequential implementation, but some accesses may involve searching of hash window chains of unbounded length. The cost of task installation is independent of the respective positions of the work and of the idle worker in the search tree. In spite of the sharing of the stacks, PEPSys does not require a global address space and has been simulated on a scalable architecture combining shared memory in clusters and message passing between clusters [Baron et al. 1988b]. PEPSys

5.1 Systems

Systems

mented

has

also

on

a

been

efficiently

shared-memory

implemulti-

a single processor, the WAM-based PEPSYS implementation runs 30 to 4070 slower than SICStus Proprocessor.

ACM

Computing

On

Surveys,

Vol.

26,

No.

3, September

1994

318



J.

Chassin

de Kergommeaux

Table 1. Program[”

Execution

times]

and

Times

1 PE

P. Codognet

and Relatlve

2 PEs

4 PEs

Speedup

8 PEs

of Aurora

10 PES

SICStus 0.6

parsel

*2O

1.09

0.70

0.67

0.71

1.57

1

1.80

280

2.93

276

125

4.87

2.47

1.30

0.70

0.63

382

1

1.97

3.74

692

7.72

1.27

194

1.04

0.59

0.51

2.73

1.93

3.58

629

734

1.37

4.26

2.14

107

0.88

677

199

395

788

9.60

1,25

9.54

4,87

2,53

2.06

13.78

207

406

7.81

959

143

1.97

parse5

db5 *1O

3.74 1 846

8-queensl

1 19,79

tina

1

The benchmark m-ograms were executed by Aurora using the Bristol scheduler and SICSt& implementations on Se&ent Symme;ry. parsel, parse5, and db5 are parsing and database queries of the Chut80 natural language database front-end program. 8-queensl computes all the solutions to the 8-queens problem, and tzrzcI is a tourmm information program. Each entry of the table contains the run-time of the program m seconds on the upper line and tbe relative speedup on the lower line

log, both systems being interpreters of WAM instructions implemented in C. In parallel, it provides almost linear speedups for programs having a large enough search space [Baron et al. 1988a]. Other experimental results [Chassin de Kergommeaux 1989] indicate that long hash window chains are rare and do not compromise the efficiency of the implementation. The Aurora system is based on the SRI model [Warren 1987] where bindings to logical variables sharing the same identification are performed in binding arrays (see Section 4.3.2 and Figure 7). An Aurora prototype, based on SICStus Prolog, was implemented on several commercial shared-memory multiprocessors [Carlsson 1990]. Four different schedulers have been implemented: three of them use various techniques for the idle worker to

ACM

Computmg

Surveys,

Vol.

26,

No.

3, September

1994

find the closest topmost node of the search tree with available work [Brand 1988; Butler et al. 1988; Calderwood and Szeredi 1989]. Since experimental results [Szeredi 1989] indicate that the work installation overhead remains fairly limited, the Bristol scheduler [Beaumont et al. 199 1] performs sharing of work similar to the Muse system, the objective being to maximize the granularity of parallel tasks. All schedulers support the full Prolog languages including side effects. The system has successfully run a large number of Prolog programs, and the performances of some of them have been reported in Beaumont et al. [1991] and Szeredi [1989] (see Table 1). On a single processor, Aurora is slightly more efficient than PEPSys, and on average it also provides slightly better speedups in parallel.

Parallel

In the Muse model [Ali and Karlsson 1990], active workers share, with otherwise idle workers, several choice-points containing unprocessed alternatives. Sharing is performed by an active worker, which creates an image of a portion of its choice-points stack, in a workspace shared with a previously idle worker. The stacks of the active worker are then copied from its local memory to the local memory of the idle worker, most of the copying being performed in parallel by the two workers, using incremental copying of the portions of the stack which are not identical. Unwinding and installation of bindings in the shared section use the trail stack instead of logical dates in Kabu-Wake. Muse was implemented on several commercial multiprocessors with uniform and nonuniform memory addressing (see Section 7.4) and on the BCMachine prototype constructed at SICS, which provides both shared and private memory [Ali and Karlsson 1991]. The Muse implementation is based on SICStus Prolog. It combines a very low sequential overhead over SICStus (5%) with parallel speedups similar to Aurora. K-LEAF [Bosco et al. 1990] is a parallel Prolog system implemented on a transputer-based multiprocessor. In contrast to other multisequential systems, it creates all possible OR-parallel tasks. Combinatorial explosion of the number of tasks is avoided by language constructs that ensure sufficient grain size of parallel tasks. K-LEAF was implemented on an experimental transputer-based multiprocessor providing a virtual global address space. This implementation is based on the WAM, using the binding-array technique. The WAM code is either emulated or expanded into C and then compiled using a commercial C compiler, the latter solution being five times more efficient than the former.

5.2

A System Exploiting Independent AND-parallelism: &-Prolog

The &-Prolog system is the most mature system exploiting independent ANDparallelism, combining a compiler per-

Logic

Programming

Systems



319

forming independent detection with an efficient implementation of AND-parallelism on shared-memory multiprocessors [Hermenegildo and Greene 1990]. The &-Prolog system supports both automatic and user-expressed parallelisms. Parallelism can be expressed explicitly with the &-Prolog language. &-Prolog is very similar to Prolog with the addition of the parallel conjunction operator “&” and of several built-in primitives to perform the independence run-time tests of conditional graph expressions (see Section 4.4. 1). For example, the Prolog program p(x):-q(x),r(x). can be written

in &-Prolog

p(X) :- (ground(X)

-

as

q(X) & r(X)).

parallelism, provided by orAutomatic dinary Prolog programs, can also be exploited since the &-Prolog compiler performs a number of static analyses to transform Prolog programs into &-Prolog programs. The static analyses aim at detecting the independence of literals, even in presence of side effects [Muthukumar and Hermenegildo 1989a; 1989bl. &-Prolog programs are compiled into an extension of the WAM called PWAM [Hermenegildo 1986] whose main difference from the WAM is the addition of a goal stack where idle processors pick work. The PWAM assumes a sharedmemory multiprocessor. To adjust the computing resources to the amount of parallel work available, the &-Prolog scheduler organizes PWAMS in a ring. A worker searches for an idle PWAM in the ring. If no idle PWAM is found, and enough memory is available, the worker creates a new PWAM, links it to the ring, and looks for work in the goal stacks of the other PWAMS. The &-Prolog run-time system is based on SICStus and was implemented on shared-memory multiprocessors. The overhead of the &-Prolog system running sequentially over SICStus arises mainly from the use of the goal stack and remains very limited (less than 5%) for the majority of the benchmarks using uncon-

ACM

Computing

Surveys,

Vol.

26,

No.

3, September

1994

320



J.

Chassin

de Kergommeaux

Table 2.

Execution

Times

Program

and

(Seconds)

1 PE

4 Pcs

P. Codognet

and Relatwe

8 PEs

Speedup

10 PEs

of &-Prolog

Quintus Sun3-110

matrlx(50)-]nt

238

5.98

300

240

3.98

7.93

992

305

1.2

0.66

064

10

2.54

4.62

476

279

2.79

2.79

2.79

10

1.0

10

1.0 qs(1000)-app

qs(1000)-dl-si

1.0 qs(1000)-dl-nsl

annOtatOr(100)

288

095

062

061

10

303

4.65

4.72

091

028

0.17

013

.32.5

535

70

1.0

798

17

1.61

161

062

The benchmark programs were executed by &-Prolog on Sequent Symmetry and Qumtus 2.2 on SUN3-11O. The first benchmark is a matrix multlphcation. The three next benchmarks are different versions of the quzchsort of a list of 1000 elements using the predicate append or difference-llsts exploiting strict independent AND-parallehsm (s1) and nonstrict Independent ANDparallehsm (nsi). The last benchmark annotator is a “reahstlc” program of 1800 lines, the annotator being one of the static analysers used m the &-Prolog compder.

ditional parallelism, that is, without run-time tests from conditional graph expressions. These tests have been shown to be fairly expensive, which explains why an important research activity is dedicated to purely static detection of independence among goals. Parallel speedups depend on the benchmark program, but even when no parallelism is available (see Table 2), the &-Prolog system remains almost as efficient as the SICStus sequential Prolog implementation. 5.3

Systems Exploiting AND-Parallelism

Dependent

The important research activity dedicated to implementing efficiently concurrent logic languages on sequential and parallel computers has already been reported in a survey paper [Shapiro 1989].

ACM

Computmg

Surveys,

Vol.

26,

No

3, September

1994

We only parallel guages. 5.3.1

mention in the implementations

following some of these lan-

Parlog

The JAM abstract machine based on the WAM was designed to support concurrent logic languages and the full Parlog language in particular. It was implemented on a shared-memory multiprocessor [Crammond 1988], reaching half of the efficiency of SICStus Prolog. In parallel, the benchmarks containing important dependent AND-parallelism show speedups of between 12 and 15 on 20 processors while benchmarks containing no dependent AND-parallelism, such as the queens program, run almost at the same speed in parallel as sequentially (see Table 3).

Parallel

Table 3. Speedup

I

Program

qsort

nrev

lqueens

tak

queens

5.3.2 Flat Concurrent

Logic

Programming

Systems



321

Execution Times (Seconds) and Relative of the JAM on Sequent Balance B21.

no. of

1 PE

5 PEE

10 PEs

19 PEs

3.7

0.9

0.6

0.5

1.0

4.1

6.2

7.4

30.0

6.5

3.6

2.4

1.0

4.6

8.3

12.5

378

7.6

3.9

2.4

10

4.97

calls

4708

80601

23531

63609

62082

969

15.75

71.4

15,0

7.9

1.0

4.76

9.03

85.9

87.6

88.7

90.1

1.0

0.98

0.97

0.95

Prolog

FCP was implemented on distributedmemory multiprocessors Intel iPSC 1 [Taylor 1989; Taylor et al. 1987]. The implementation provides limited speedup for fine-grained systolic computations and good speedup for large-grained programs (see Table 4). 5.3.3 GHC and KL 1 A large number of GHC and KLl implementations were done in the framework of the Japanese Fifth Generation Computer Systems Project led by the ICOT. It is only possible to mention a few of them in this article. KLl was implemented on a commercial shared-memory multiprocessor [Sato and Goto 1988], this implementation being 2 to 9 times slower than Aurora, for programs executing the same algorithms [Tick 1989]. The most significant prototypes involved both hardware and software development with the implementation of several highly parallel machines dedicated to the execution of

4.8 1487

the concurrent logic language KL1. These machines, known as Multi~PSI and PIM are presented in Section 7.4.2.

5.3.4

Commercial Concurrent Language: Strand

Logic

Strand (STReam AND-parallelism) is a commercial product that was derived from Parlog and FCP [Foster and Taylor 1989]. In the Strand language, there is no more unification but matching of the head of a clause against a goal. The rule guards reduce to simple tests. An assignment operation can be used in the bodies of the rules, but as in other logical languages, Strand variables are singleassignment variables. With processor allocation pragmas, programmers can control the degree of parallelism of their programs. A foreign interface language enables the calling of Fortran or C modules from a Strand program. Strand can thus be used to coordinate the execution of existing sequential programs on multiprocessors, therefore avoiding the need

ACM

Computing

Surveys,

Vol.

26,

No.

3, September

1994

322



J.

Chassin

Table

de Kergommeaux Execuhon

4.

Times

and Relatwe

Program

Merge

sort

(60

speedup

(10,000)

155.21

sort

16 PEs

(d-4

Hypercube)

61.35

( 10,000)

191.83

(sers)

input

5638

1

multlply

37229

problem)

51.78

(sees)

59.82

(large

problem)

1122

(see.s)

622

5.54 (reins)

(reins)

9.4

1 FCPic

(WCS)

3.40

1

(small

(sees)

2.53

1

X 60)

FCPic

of FCP on Intel IPSC1

(sew.)

input

reversed

Matrix

P. Codognet

Uniprocessor

ordered

Merge

and

(reins)

9.35 (reins) 1

Triangle

6575

12 48.9 (reins)

(reins)

13.4

1

OR-parallel

Prolog

560

58 (reins)

(reins)

97

1

— The

merge

and

matrw

multzply

programs

are systolic

algorithms,

the

other

benchmarks being large-gram computations, FCPZC is a 600-line polygonbased graphic modehng system; trzangle is a puzzle program used to evaluate LMp systems; OR-purallel Prolog IS a parallel interpreter runmng a Prolog program computing all permutations of a hst of slx elements,

to rewrite these programs to parallelize them. The implementation of Strand is based on the Strand Abstract Machine or SAM, designed to limit the target-dependent parts of the implementation. SAM is divided into a reduction component, similar to a simplified concurrent logic language implementation, and a communication component performing matching (read) and assignments (write). All machine-specific features of the implementation are localized into the communication component. Strand was implemented on a variety of sharedand distributed-memory multiprocessors. The Strand uniprocessor implementation

ACM

C!omputmg

Surveys,

VOI

26,

No.

3, September

1994

runs approximately two times faster Parlog and three times faster than on SUN workstations.

5.4

Performance of Systems Type of Parallelism

Exploiting

than FCP

One

The systems described above have demonstrated that it is possible to efficiently exploit ORand independent AND-parallelism on shared-memory multiprocessors containing a limited number (up to several tens) of processors. “Efficiently” means that in programs that exhibit the suitable type of parallelism, almost linear speedup can be obtained, whereas in the worst case, when no par-

Parallel

allelism can be exploited, their speeds remain similar to an efficient sequential Prolog implementation such as SICStus Prolog. Dependent AND-paraIlel systems are slower when they execute the same algorithms, but they enable exploitation of finer-grained parallelism rather than OR- and independent AND-parallel systems. The majority of the parallel systems presented in this section compile logic programs into abstract machine instructions that are usually executed by an interpreter written in C, as is the case in SICStus Prolog. Some commercially available Prolog systems execute abstract machine instructions more efficiently by using an abstract machine interpreter written in assembly language (Quintus Prolog) or by generating target machine code from the abstract machine instructions (BIM Prolog). Parallelizing techniques developed in the systems presented above remain valid to parallelize these efficient commercial systems, although improving the efficiency of the sequential engine will probably decrease the parallel speedup [Maes 1992]. Whether it is cost-effective to use nonsalable shared-memory multiprocessors to solve symbolic problems programmed in logic languages is still questionable. This situation may change when the multiprocessor technology becomes more mature and multiprocessor workstations become as common as uniprocessor ones. Then a parallel logic programming system, providing significant speedup in the best cases and no decrease in speed otherwise, will be a good way of exploiting the computing power delivered by the multiprocessor platform. The use of more scalable highly and massively parallel multiprocessors, to solve larger problems currently intractable sequentially, is discussed in Section 7.4. 6. SYSTEMS

COMBINING OF PARALLELISM

SEVERAL

TYPES

Important benefits can be expected from the combination of several types of parallelism into a single parallel logic

Logic

Programming

Systems



323

programming system implementation: mainly an increase in the number of programs that can benefit from parallelism and a reduction of the search space of some of them. Although systems exploiting one type of parallelism have demonstrated their ability to reach effective speedups over state-of-the-art sequential Prolog systems, a large number of parallel programs cannot benefit from speed improvements: this is the case of deterministic programs in OR-parallel systems and of large search programs in AND-parallel systems. Additionally, while OR- and independent AND-parallel systems mainly exploit coarse-grained parallelism, dependent AND-parallel systems exploit fine-grained parallelism, present in a large number of applications. Combining OR- with AND-parallelism may also result in significant reductions in the program search space when several recomputations of the same independent AND-branch, due to backtracking into a previous AND-branch, can be avoided by “reusing” the solutions already produced in each AND-branch (see Figure 11). However, the combination of ANDwith OR-parallelism raises difficult problems. Control of the execution must guarantee that all solutions to the ANDbranches are cross-produced without introducing too much overhead to store partial solutions or to synchronize workers. Merging of partial solutions during the cross-product of AND-branches is a complex operation whatever binding scheme is being used: stack copying or stack sharing using binding arrays or hash windows. The solutions proposed to solve these problems [Conery 1992; Costa et al. 1991a; Delgado-Rannauro 1992b; Gupta and Jayaraman 1990; Gupta and Hermenegildo 1991a; Ka16 and Ramkumar 1990; 1992; Westphal et al. 1987] are too complex to be presented in this article. Indeed, the idea of “sharing” solutions between OR-branches is not unique to parallel execution models, and also appears in a few sequential Prolog sys-

ACM

Computing

Surveys,

Vol

26,

No

3, September

1994

324

-

J.

Chassin

de Kergommeaux

and_par(Paramet

ers)

:-

left_

and

P. Codognet

hand(Left_Parameters,

Left-Results),

right_hend(Right_Parameters,

Right_Results).

left_hand(X,

Y)

:-

. . .

/*

First

left_hand(X,

Y)

:-

. . .

/*

Second

left_hand(X,

Y)

:-

. . .

/*

Third

right_hand(Z,

T)

:-

long_cornputatiOn(.

branch

*/

branch

branch

*/

*/

..).

Figure Il. Reduction of thesearch space byuseof cross-product. Inasequential system, therlght-hand predicate is computed three times, once for each solution to the left-hand one. In a system combining AND with OR-parallelism, the solution to the right-hand predicate will be computed once and “cross-produced” with the three solutions to the left-hand predicate, thus saving two (long) computations of the right-hand predicate.

terns that attempt to ’’memoize’’ and “reuse” partial solutions in order to avoid recomputations, such as Villemonte dela Clergerie [1992]. Such systems are not yet mature and competitive enough with traditional Prolog systems to serve as a basis for parallel models. Moreover, simulations of a variety of programs [Shen and Hermenegildo 1991] indicate that only a few of them would benefit from reuse of partial solutions of independent AND-branches. 6.1 Systems Combining with OR-Parallelism

independent

AND-

The ROLOG system implements the Reduce OR Process Model (ROPM) [Kale 1987] which combines independent ANDwith OR-parallelism. Programs are compiled into an abstract machine inspired by the WAM [Ramkumar and Ka16 1989]. The implementation uses a machine-independent run-time kernel called the CHARE kernel, which made possible the porting of ROLOG on a variety of sharedand distributed-memory parallel machines. Because of the complexity of the execution model, sequential efficiency of ROLOG is several times lower than efficient parallel logic systems exploiting one type of parallelism. In parallel, it provides linear speedups for benchmarks

ACM

Computmg

Surveys,

Vol

26,

No.

3, September

1994

where programmer annotations ensure sufficient granularity of tasks. Experimental results also indicate that significant reductions of the search space can also be obtained in several programs, by avoiding the recomputation of ANDparallel branches due to backtracking [Ramkumar 1991]. The ACE model [Gupta and Hermenegildo 1991b] aims to combine the independent AND-parallel approach of the &-Prolog system with the OR-parallel approach of the Muse system. No effort is made to reuse results of independent AND-subcomputations in different ORbranches, as described above, but the strength of this model lies in its (re)use of simple techniques whose efficiency has already been proved. 6.2

Systems Combining With OR-Parallelisms

Dependent

AND-

As presented in Section 3.4, execution models based on the Andorra principle have been proposed to combine OR- and dependent AND-parallelism. They intend to subsume both the Prolog and the concurrent programming paradigms. There exist two main streams of research in this area, depending on the language being supported: the first one, based on the Basic Andorra Model, supports the Pro-

Parallel log language, while another one, based on the Andorra Kernel Language, integrates concurrent languages constructs. 6.2.1 Basic Andorra Model The Andorra-I system is an implementation of the Basic Andorra Model on shared-memory multiprocessors [Yang et al. 1993]. The Andorra-I system consists of a compiler, an engine, and a scheduler. The compiler performs a program analysis, based on abstract interpretation techniques, in order to determine, for each procedure, the mode of its arguments, i.e., the possible instantiation types. This information is used, together with the analysis of pruning operators and side effects, to generate sequencing instructions that should be taken into account by the execution model. Specialized code is also generated to test efficiently the determinacy of procedures. The compiler produces Andorra-I Abstract Machine code, that is, an extension of Crammonds JAM (see Section 5,3.1). The engine supports OR-parallelism by borrowing techniques from the Aurora system (binding arrays) and exploits dependent AND-parallelism by using techniques derived from the JAM implementation of PARLOG. Andorra-I programs are executed by teams of workers (abstract processing agents). Each team is associated with a separate OR-branch in the computation tree, and undertakes the parallel execution of determinate goals in that branch, if any. Otherwise a nondeterminate goal is chosen, usually the leftmost goal to follow Prolog’s strategy, and a choicepoint is created. The team will then explore one of the OR-branches. If several teams are available, OR-branches are explored in parallel using the SRI model as in Aurora, whose scheduler is also used to distribute work. Within a team, all workers share the same execution stacks, but manage their own run queue of goals waiting for execution. When a local queue becomes empty, the worker will try to

Logic

Programming

Systems



325

find work within the team, as in the implementation of Parlog. Performance evaluation of Andorra-I on Sequent Symmetry shows that the relative OR-parallel speedup of AndorraI is comparable with Aurora and that the relative AND-parallel speedup of Andorra-I is comparable with the JAM. Sequentially, Andorra-I runs as fast as the JAM and 2.5 times slower than SICStus Prolog in average, that is, approximately two times slower than Aurora. The ability of Andorra to reduce the amount of computation was estimated by measuring the total number of inferences executed for several problems. The reduction can attain one order of magnitude. When both ANDand OR-parallelism are exploited, the overall speedup is greater than that achievable with one kind of parallelism alone. The basic Andorra model was also implemented as an extension of the Parallel NU-Prolog system [Naish 1988; Palmer and Naish 1991], which implements dependent AND-parallelism. A simple variant of the compilation techniques developed for concurrent logic languages [Klinger and Shapiro 1988] is used to construct a decision tree of the conditions under which a goal is determinate. First, parallel experiments were limited to 4 processors, giving speedups ranging from 2 to 3.4 for simple test programs, 6,2.2 Extensions to the Basic Andorra Model Warren [1990] recently proposed a new execution model that extends the Basic Andorra Model by allowing parallel execution between nondeterminate goals as long as they perform “local” computations, i.e., since they do not instantiate nonlocal variables. In this way nondeterminate goals are synchronized (i.e., blocked) only when they try to “guess” the value of some external variable. Observe that such extra parallelism contains independent AND-parallelism as a subcase. The Extended Andorra Model combines thus all three kinds of parallelism: OR-, dependent AND-, and independent AND-parallelism.

ACM

Computmg

Surveys,

Vol.

26,

No,

3, September

1994

326



J. Chassin

de Kergommeaux

and

The IDIOM model displays another combination of the three kinds of parallelism [Gupta et al. 1991c]. It uses Conditional Graph Expressions (CGE) to express independent AND-parallelism as in &-Prolog, and execution proceeds as follows. First, in the dependent ANDparallel phase, all determinate goals are evaluated in parallel. When no more determinate goals exist, the leftmost goal is selected for expansion. If it is a CGE, then an independent AND-parallel phase is entered; otherwise a choice-point is created as in the basic Andorra model. Solution sharing is handled by cross-producing partial solutions of independent subcomputations. The two extended models presented above have not yet been implemented. 6.2.3 Andorra Kernel Language The Andorra Kernel Language (AKL) [Janson and Haridi 1991] extends the Basic Andorra Model by borrowing some concurrent languages constructs (see Section 3.4), The semantics of AKL [Haridi and Janson 1990] is given by a set of rewrite rules on AND/OR trees, which has led to the design of a simple abstract machine and sequential implementation. Parallel implementations are currently under development, based on the integration of a MUSE-like mechanism for handling OR-parallelism. Independently, a new abstract machine was developed to accommodate the new constructs of AKL [Palmer and Naish 1991]. It was inspired by the JAM as the Andorra-I implementation. Another execution model [van Hecker et al. 199 1] exploits mostly OR-parallelism and uses hash windows similar to those of PEPSys, which seem more suited to the OR-parallel execution of AKL than binding arrays. 7. CURRENT RESEARCH AND PERSPECTIVES An important research activity aimed at extending existing models and systems has already been mentioned in previous sections. In this section we summarize

ACM

Computing

Surveys,

Vol

26,

No.

3, September

1994

P. Codognet

the development of static analysis techniques for logic programs. Additionally, this section reviews a research domain we deemed very promising: the combination of constraints and parallelism and the development of concurrent constraint languages. We also consider the challenge of efficiently using the massively parallel computers that have become commercially available. 7.1 Static Analysis

of Logic Programs

An important research area in logic programming is that of static (compile-time) program analysis, which can be used to predict run-time behavior of programs as an aid to optimization. This analysis is usually based on abstract interpretation techniques, introduced by the seminal work of Cousot and Cousot [ 1977] for flowchart languages and developed in the domain of logic programming since the mid-80s [Bruynooghe 1991; Jones and Sondergaard 1987; Mellish 1986]. Abstract interpretation is a general scheme for static data flow analysis of programs. Such analysis consists primarily of executing a program on some special domain called “abstract” because it abstracts only some desired property of the concrete domain of computation. The use of abstract interpretation for parallel execution falls roughly into three major kinds of analysis:

* dependency

analysis, which approximates data dependencies between literals due to shared variables,



granularity analysis, which mates the size of computations,



determinacv deterministic

approxi-

analysis, which identifies par~s o~ programs.

Various abstract domains have been designed that approximate values of logical variables into only some groundless and sharing information. Let us take a simple example to illustrate this. Consider the substitution: {X+

X1, Y+

Z+g(U),

f’(U, X1+

b), a}.

Parallel It can be approximated by a groundless component (which variables are bound to terms that do not contain any free variable ?) {X

is ground,

Xl

is ground},

and a sharing component (which variables share a common variable as subterm?) {Y and Z share}. Obviously we have lost some information with respect to the original substitution (e.g., the structure of terms), but specific information that can be used for optimization has been extracted. This is currently used in independent AND-parallel models such as by Hermenegildo and Greene [1990] in order to find out at compile-time which predicates will run in parallel [Muthukumar and Hermenegildo 1990], thus avoiding costly run-time tests (see Section 4.4. 1). Experimental results indicate that a large part of the potential parallelism of programs is captured, although some potential parallelism may be lost, compared to a method performing precise but costly run-time analysis. In dependent ANDparallel models, the same kind of information can be used to determine an advantageous scheduling order among active processes and to avoid useless process creation and suspension. Another useful application is static detection of deadlocks in concurrent programs [Codognet et al. 1990]. Task granularity analysis, by discriminating large tasks from small tasks not worth parallelization, strives to improve the scheduling policy of ANDand OR-parallel models. Research has only just begun on this important topic [Debray et al. 1990]. Determinacy analysis is required by Andorra-based models, in order to simplify run-time tests. As execution models become more and more complex with the combination of several kinds of parallelism, the need for compile-time analysis increases in order to compile programs more simply and more efficiently.

Logic

Programming

Systems



327

7.2 Parallelism and Constraint Logic Programming Constraint Logic Programming (CLP) languages provide an attractive paradigm that combines the advantages of logic programming (declarative semantics, nondeterminism, logical variables, partial-answer solution) with the efficiency of special-purpose constraint-solving algorithms over specific domains such as reals, rationals, finite domains, or booleans. The key point of this paradigm is to extend Prolog with the power of dealing with domains other than (uninterpreted) terms and to replace the concept of unification with the general concept of constraint solving [Jaffar and Lassez 1987]. Several languages, such as CLP(R) [Jaffar and Lassez 1987], CHIP [Dincbas et al. 1990; Van Hentenryck 1989b], Prolog-III [Colmerauer 1990] show the usefulness of this approach for real industrial applications in various domains: combinatorial problems, scheduling, cutting stock, circuit simulation, diagnosis, finance, etc. The use of constraints indeed leads to a more natural representation of complex problems. A most promising perspective for CLP languages is the parallel execution of such languages, where both OR- and AND-parallelism can be applied. The simultaneous exploration of alternative paths in the search tree provided by ORparallelism can be particularly useful in problems such as optimization problems. Sequential CLP systems use a branchand-bound method that involves searching for one solution and then starting over with the added constraint that the next solution should be better than the first one, ensuring thus an a priori pruning of nonoptimal branches. With ORparallelism, one can move away from a pure depth-first search and mimic a limited breadth-first search,3 since several alternative branches are explored simultaneously by parallel processes. Such a scheme allows for “superlinear” relative

3Limited

ACM

by the

Computing

computing

Surveys,

Vol.

sources

26,

No,

available.

3, September

1994

328



J. Chassin

de Kergommeaux

and P. Codognet

speedups, since the simultaneous search is more likely to find a better solution quickly, therefore improving the pruning of the search space and the overall performance. Experiments were done by combining the finite domains solver of CHIP and the Pepsys OR-parallel system [Van Hentenryck 1989a], and pursued in the ElipSys system [Veron et al. 1993]. They show that impressive speedups with respect to an efficient sequential implementation can be achieved. The ElipSys system shows that “superlinear” relative speedups can be achieved on big graphcoloring problems (up to a factor 100 for 12 processors), and that nearly linear speedups can be achieved for a branchand-bound algorithm on disjunctive scheduling problems (up to a factor 9 for 10 processors). AND-parallelism can also be used for better performance. Independent ANDparallelism corresponds to partitioning the global constraint system into several smaller (independent) subsystems that can be solved more easily. Since problems tackled by constraint systems are usually very large and NP-hard, breaking a global problem into several subproblems that can be solved independently is an important issue. Andorra-based models, by favoring deterministic computations, can also greatly improve CLP. They provide a good heuristic to guide the constraint-solving process. This is the case, for instance, when solving disjunctive constraints. In CLP languages, disjunctive constraints are treated by using the nondeterminism of the underlying Prolog language. A choice-point is created for each disjunctive constraint, and different alternatives are then explored by backtracking. The (static) order in which the disjunctive constraints are stated will be the order in which the choice-points will be created and therefore will direct the overall search mechanism. A poor order could lead to bad performance due to useless backtracking. The Andorra principle, by delaying choice-point creation, will make the problem as deterministic as possible before handling the disjunctive con-

ACM

Computing

Surveys,

Vol

26,

No.

3, September

1994

straints. In this way, computations are factorized between alternatives, and some disjunctions may even be rendered deterministic, thus reducing the complexity of the problem. 7.3 Concurrent

Constraint

Languages

The current CLP framework is based on Prolog-like languages and hence is limited to transformation systems (see Section 3). New research has begun to extend this to reactive systems, by investigating concurrent constraint languages. These languages were recently introduced by Saraswat [1993] and Saraswat and Rinard [1990]. They are based on the ask-and-tell mechanism [Maher 1987] and generalize in an obvious manner both concurrent logic languages and constraint logic languages. The key idea is to use constraints to extend the synchronization and control mechanisms of concurrent logic languages. Briefly, multiple agents (processes) run concurrently and interact by means of a shared store, i.e., a global set of constraints. Each agent can either add a new constraint to the store (Tell operation) or check whether the store entails a given constraint (Ask operation). Synchronization is achieved through a blocking ask: an asking agent may block and suspend if one can state neither that the asked constraint is entailed nor that its negation is ennothing prevents the contailed—i.e., straint from being entailed, but the store does not yet contain enough information. The agent is suspended until other concurrently running agents add (Tell ) new constraints to the store to make it strong enough to decide. Obviously, the instantiation of this framework to unification constraints over terms (i.e., equations of the form tl = tz ) nicely rephrases the usual concurrent logic languages of Section 3. Telling such a constraint corresponds to performing unification, while asking a constraint corresponds to performing matching: Ask( X = a) with respect to a store S will suspend as long as S is not strong enough to entail X = a, i.e., does not contain

Parallel such an instantiation (which would be added—told—by another concurrent agent). Note however that such unification constraints are just one of many possible domains over which the con-current constraint framework can be instantiated: possible alternatives are finite domains, linear arithmetics, booleans, etc. Also, restricting the languages to only Tell operations over constraints reproduces the Prolog-based CLP languages, which perform constraint satisfaction (Tell ) but no synchronization through constraints (Ask ). The efficient implementation of these languages is still under investigation, but current research shows that the implementation technologies and know-how developed for both concurrent languages and constraint languages can be merged and nicely integrated, as for instance in the CC(FD) language of Van Hentenryck et al. [1993] which uses the finite domains constraints introduced by the CHIP CLP language [Van Hentenryck 1989 b], with improved performances.

Logic

Use of Massively Parallel Multiprocessors

Most existing parallel logic programming systems described in the previous sections were implemented on nonsalable shared-memory UMA (Uniform Memory Access) architectures. An ever increasing number of massively parallel multiprocessors (that is, scalable to a maximum number of processing elements on the order of 103, are becoming available. These multiprocessors may provide logically shared memory based on physically distributed memory, access time to the shared memory being nonuniform (NUMA: Non Uniform Memory Access, or COMA: Cache-Only Memory Access). Examples of such multiprocessors include the BBN Butterfly TC2000 (NUMA), composed of up to 128 processing elements, Motorola M881OO, and the Kendall Square Research KSR-1 (COMA), composed of up to 1088 proprietary processors. Most massively parallel multiprocessors have the

Systems

w

329

NORMA (NO Remote Memory Access) architecture, whose interprocessor communication is performed by message passing. The commercial massively parallel multiprocessors appearing on the market include the CS-2 produced by a European consortium of three manufacturers (Meiko, ParSys, and Telmat) with up to 1024 Spare processors, the SP-1 from IBM with up to 128 RS-6000 processors, the CM-5 from Thinking Machines with up to 1024 Spare processors, and the Intel Paragon with up to 4096 processors [Bell 1992; Ramanathan and Oren 1993]. Although intended primarily to solve number-crunching applications, these multiprocessors could quickly help solve large symbolic applications, currently untractable on sequential computers. In order for parallel logic programming systems to efficiently use a large number of processing elements, it is necessary that the techniques used in these systems scale well with the number of processing elements.

7.4.1 7.4

Programming

Processor-Memory

Connection

Issue

Systems performing sharing of computation stacks and designed for UMA shared-memory multiprocessors may not be appropriate for NUMA multiprocessors such as the BBN Butterfly. On this machine, accessing local memory on a node is one order of magnitude faster than accessing shared memory on a remote node. Since cache coherency is not provided, shared memory cannot be cached, and local accesses to shared memory are slower than local accesses to private memory. This is the case for all accesses to the execution stacks, which cannot be cached since they are shared among workers. This problem was exhibited by the Aurora prototype implementation on the Butterfly [Mudambi 1991], which ran sequentially 80% slower than SICStus, in contrast to its 20% slower sequential performance on the UMA shared-memory Sequent Symmetry. In spite of these problems, experiments performed demonstrated the possibility of obtaining near-linear speedup with 80

ACM

Computmg

Surveys,

Vol.

26,

No.

3, September

1994

330



J. Chassin Table 5.

de Kergornrneaux Execution

Times

and

of the MUSE

P. Codognet Prototype

on Butterfly

TC2000

No. of PEs

~ =~:

~

‘o’”:

Execution times design of a circuit

and speedups board.

processing elements computing large problems. Systems performing stack copying are better suited for NUMA multiprocessors. Experimental results of the MUSE prototype running on Butterfly TC2000 [Ali and Karlsson 1992] indicate that the overhead of MUSE running sequentially over SICStus Prolog remains limited to 57. while near-linear speedups are provided up to 40 processors (see Table 5). 7.4.2

Design

of Spec/al/zed

Architectures

Since NUMA architectures are not suitable for models sharing the computation stacks, one track of research is to design a scalable UMA architecture. In the Data Diffusion Machine [Warren and Haridi 1988], the location of a datum is decoupled from its virtual address, and data migrates automatically where it is needed: memory is actually organized like a very large cache (COMA). The hardware organization consists of a hierarchy of buses and data controllers linking proeach having a set-associative cessors, memory. The machine is scalable since of levels in the hierarchy is the number not fixed, and it should be possible to build a multiprocessor including a large number of processing elements with a limited number of levels. The KSR-1 commercial multiprocessor whose architecture is similar to the DDM is now commercially available from Kendall Square Research. No result of parallel Prolog implementation on KSR-1 was available at the time this article was written.

ACM

Computmg

Surveys,

Vol

26,

No

3, September

1994

:

::

of a knowledge-based

system

checking

the

The most important research activity in this area was performed by the ICOT, in the Fifth Generation Computer Systems project. This project delivered the PIM multiprocessor, including up to 512 specialized processors [Kumon et al. 1992; Nakashima and Nakajima 1992; Taki 1992]. PIM is composed of up to 64 clusters, each cluster being similar to a (physically) shared-memory multiprocessor including 8 processing elements. The implementation techniques of KLl on the PIM machine have been used on the Multi-PSI/V2 distributed memory multiprocessor [Nakajima et al. 1989]. The intracluster issues of the PIM machine were originally investigated inside a Multi-PSI/V2 processor while intercluster issues of PIM were investigated between Multi-PSI/V2 processing elements. This was the case, for example, with intracluster garbage collection [Chikayama and Kimura 1987] and intercluster garbage collection [Ichiyoshi et al. 1990] techniques. 7.4.3

Scheduhng

Issues

Several research projects address the scalability issues raised by scheduling and load balancing over massively multiprocessors. In order to avoid bottlenecks that could arise in centralized or distributed systems, multilevel hierarchical load-balancing schemes have been proposed [Briat et al. 1991; Furuichi et al. 1990]. Bottlenecks are expected to arise in centralized systems with a large number of processing elements accessing a central scheduler. Fully distributed sys-

Parallel terns require some shared data and result in bottlenecks to access these shared data together with a large amount of interprocessor communication. Experimental results over 64 PEs Multi-PSI NORMA architectures, reported by Furuichi et al. [1990], indicate that multilevel load balancing increases the speedups when using more than 32 PEs: a speedup of 50, using 64 PEs, was obtained for a puzzle-solving program.

8.

CONCLUSION

Considerable research activity is being dedicated to parallelizing logic programming, mainly because of the “intrinsic” parallelism of logic programming languages and of the computational needs of symbolic problems. Parallel logic programming systems exploit essentially OR-parallelism—simultaneous computation of several resolvents—and ANDparallelism—simultaneous computation of several goals of a clause. AND-parallel execution can be restricted to goals that do not share any variable, in which case it is called independent. Otherwise, it is called dependent AND-parallelism. Two main approaches can be distinguished: the first one exploits parallelism transparently by executing Prolog programs in parallel, while the second one develops language constructs allowing programmers to express explicitly the parallelism of their applications. The second approach is mainly represented by the family of concurrent logic languages, which is intended for supporting the programming of applications not addressed by Prolog, such as reactive systems and operating system programming. A large number of parallel computational models have been proposed to parallelize logic programming. Among them, multisequential computing models can use most of the efficient implementation techniques developed for sequential Prolog implementation. In spite of the inherent parallelism of logic languages, implementing them efficiently in parallel raises difficult problems. Optimizations made possible by the Prolog backtracking

Logic

Programming

Systems



331

cannot be applied to OR-parallel systems. AND-parallel systems need to check that goals executed in parallel assign their variables to coherent bindings. Independent AND-parallel goals cannot bind the same variables. Independence can be tested at compile-time, at runtime, or partly at compile-time and partly at run-time. Most run-time independence tests are time consuming, and the most efficient results are obtained when the goal independence can be computed at compile-time. Dependent AND-paralleI systems only compute determinate goals in parallel, determinacy being enforced by the language in concurrent logic systems or checked partly at compile-time and partly at run-time in the Andorra model of computation. Some of these computational models have been implemented efficiently on multiprocessors. The most mature systems exploit one type of parallelism. Systems exploiting OR- and independent AND-parallelism obtain linear speedups for programs containing enough parallelism while programs containing no exploitable parallelism run almost as efficiently as if executed by an efficient Prolog system. Systems exploiting dependent AND-parallelism are usually several times less efficient when they are used to program the same applications, but they can exploit more parallelism of finer grain in logic programs. Concurrent logic languages, which exploit this type of parallelism, have also proved their potential for developing different types of programs, such as the PIMOS operating system for both multi-PSI and PIM machines, composed of 100,000 lines of KL1 [Furukawa 1992]. Systems combining several types of parallelisms are not for the moment as efficient as systems exploiting a single type of parallelism, but they may result in a reduction of the search space of programs by avoiding useless computations. Also, these systems can speed up more logic programs than systems exploiting one type of parallelism. Current and future research in the domain of logic programming is active in

ACM

Computing

Surveys,

Vol.

26,

No,

3, September

1994

332



J. Chassin

de Kergommeaux

and

several domains. One track of research seeks to improve existing systems, mainly by developing compile-time analysis techniques for logic programs to detect determinacy of and independence among goals, and to anticipate the granularity of resolvents (OR-parallelism) or goals (AND-parallelism). Another fruitful direction is the combination of constraint logic programming with parallelism. The last direction of research in this article is the use of highly and massively parallel multiprocessors. The most important activity in this area was performed by the Japanese Fifth Generation Computer Systems project, where the PIM/p multiprocessor, demonstrated in 1992, delivers approximately one hundred millions of reductions per second peak performance (one reduction is equivalent to one lip). In spite of the excellent performance results achieved by parallel logic systems, the cost effectiveness of parallel logic programming is still questionable. Parallel multiprocessors generally do not use the most recent microprocessors already used in the most powerful uniprocessor workstations, In other words, because of the rapid progress in VLSI technology, a sequential Prolog system running on the most recent type of workstation performs almost as well as a parallel logic programming system running on a modestly parallel multiprocessor based on the previous generation of microprocessors (see figures in Section 5). This situation might change rapidly since several parallel workstations using stateof-the-art microprocessors are becoming commercially available. There are not yet enough applications likely to benefit from highly and massively parallel systems, since the traditional use of logic programming is constrained to making prototypes. This situation may also change with the advent of highly and massively parallel multiprocessors that can address problems too large to be solved by the most efficient sequential logic programming systems. As the performance of parallel systems becomes radically higher than that of uniprocessors, and as the processor tech-

ACM

Computmg

Surveys,

Vol

26,

No.

3, September

1994

P. Codognet

nology used in parallel systems approaches the state of the art, parallel logic programming systems will become increasingly attractive. Existing systems can adaptively exploit parallelism when it is present, with little sacrifice of sequential performance when it is not. Most importantly, logic programming provides logic programming systems with natural opportunities to exploit parallelism and programmers with convenient ways to express it. ACKNOWLEDGMENTS The for

authors the

tions

idea

and

Jacques survey

would

hke

of this

survey

corrections. Cohen

and

in detail

and

to thank

D. H

and

for

We would Paul

also

Hilfinger

proposing

D

helpful hke

for

Warren suggesto thank

reading

numerous

the

improve-

ments.

REFERENCES AIT-KAcI, H. Tutor~al bridge, ALI,

1991. Warren’s Reconstruetton.

Abstract Machzne: A MIT Press, Cam-

Mass.

K. A. M. AND KARLSSON, R. 1992. OR-Parallel speedups in a knowledge based system. On Muse and Aurora. In proceedings of the International Conference on Fifth Generation Computer Systems ICOT, Tokyo.

ALI, K A. M AND KARLSSON, R. 1991. Scheduhng Or-Parallehsm in Muse. In F70ceedmgs of the 8th International Conference on Logic Programming, MIT Press, Cambridge, Mass., 807-821, ALI,

K. A. M. AND KARLSSON, R. 1990. The Muse Or-Parallel Prolog model and Its performance. In proceedings of the 1990 North American Conference on Logic Programming MIT Press, Cambridge, Mass.

BAHGAT, R, 1993, Non-Determmlstlc Logzc Programming in Pandora, tific Publishing, London, U.K.

Concurrent World Scien-

BARON, LT. C., CHASSIN DE KERWMMEAUX, J., HAILPERIN, M., RATCLIFFE, M., ROBERT, P , SYRE, J. C., AND WESTPHAL, H. 1988a. The parallel ECRC Prolog system PEPSys: An overview and evaluation results. In Proceedings of the Internatlonal Conference on Fzfth Generation Cornputer Systems. ICOT, Tokyo. BARON, U. C., ING, B., RATCLIFFE, M., AND ROBERT, P. 1988b. A distributed architecture for the PEPSys parallel logic programming system. In Proceedings of ICPP’88. Pennsylvania State Univ., 410-413. BEAUMONT, A., MUTHURAMAN, WARREN, D. H. D. 1991.

S,, SZEREDI, P., AND Flexible scheduling

Parallel of OR-parallelism in Aurora: The Bristol Scheduler. In Parallel Architectures and Languages Europe, PARLE ’91. Lecture Notes in Computer Science, vol. 506, Springer-Verlag, New York, 403-420. BELL, G. 1992. fore its time.

Ultracomputers: Commun. ACM

A teraflop 35, 8, 26-47.

SOFI, G. gramming tures. In Conference Cambridge, BRAND, P. gigalips

CECCHI, C., MOISO, C., PORTA, M., AND 1990. Logic and functional proon distributed memory architecProceedings of the 7th International on Logic Programming. MIT Press, Mass., 325-339.

1988. Wavefront project, SICS.

scheduling.

Int.

Rep.,

BRIAT, J., FAVRE, M., GEYER, C., AND CHASSIN DE KERGOMMEAUX, J. 1992. OPERA OR-Parallel Prolog system on supernode. In Zmplemen tutions of Distributed Prolog. John Wiley and Sons, New York, 45-63. BRIAT, J., FAVRE, M., GEYER, C., AND CHASSIN DE KERGOMMEAUX, J. 1991. Scheduling of ORparallel Prolog on a scalable, reconfigurable, distributed memory multiprocessor. In Parallel Architectures and Languages Europe, PARLE ’91. Lecture Notes in Computer Science, vol. 506, Springer-Verlag, New York, 385-402. BRUYNOOGHE, M. 1991. A practical framework for the abstract interpretation of logic programs. J. Logic Program. 10, 2, 98-124. BUTLER, R., DISZ, T., LUSK, E., OLSON, R., OVERBIIEK, R. A., M-W STEVENS, R., 1988. Scheduling OR-parallelism: An Argonne perspective. In proceedings of the 5th International Conference and Symposium on Logic Programming. MIT Press, Cambridge, Mass., 1590-1605. BUTLER, R., LUSK, E. L., OLSON, R., AND OVERBEEK, R. A. 1986. ANLWAM: A parallel implementation of the Warren Abstract Machine. Tech. Rep., Argonne National Laboratories. CALDERWOOD, A. AND SZEREDI, P. 1989. Scheduling Or-parallelism in Aurora. In Proceedings of the 6th International Conference on Logic Programming. MIT Press, Cambridge, Mass. CARLSSON, M. ANrJ WIDEN, J. 1988. SICStus log User Manual. Res. Rep., SICS.

Pro-

CHIKAYAMA, T. ANII KIMURA, Y 1987. Multiple reference management in flat GHC. In Proceedings of the 4th International Conference on Logw Mass.,

Programming. 276–293.

MIT

Press,

Cambridge,

S. 1981. A programming.

Systems

CLARK, K. AND MCCABE, cilities Micro burgh

F.

1981.

333



Languages Press, New The

and York.

control

fa-

of IC-Prolog. In Expert Systems in the Electronic Age, D. Mitchie, Ed. EdinUniversity Press, Edinburgh, U. K.

CLOCKSIN, W. F. AND ALSHAWI, H. 1988. A method for efficiently executing Horn clause programs using multiple processors. New Gen. Comput. 5, 361-376. CODOGNET, C., CODOGNET, P., AND CORSINI, M.-M. 1990. Abstract interpretation for concurrent logic languages. In Proceedings of the 1990 North American Conference on Logic Programming. MIT Press, Cambridge, Mass., 215–232, COLMERAUER, A. 1990. An introduction log-111. Commzm. ACM, 33, 7.

to

Pro-

COLMERAUER, A. 1986. Theoretical model of Prolog II. In Logic Programmmg and Its Applicat~on, M. van Caneghem and D. Warren, Eds. Ablex, Norwood, N. J., 3–31. CONERY, J. S. 1992. The OPAL Machine. In plementations of Distributed Prolog. John ley and Sons, New York, 159-185.

ImWi-

CONERY, J. S. 1983. The AND/OR process model for parallel execution of logic programs. Ph.D. thesis, Tech. Rep. 204, Dept. of Computer and Information Science, Univ. of California, Irvine, Calif. COSTA, V. S., WARREN, D. H. D., AND YANG, R. 1991a. The Andorra-I engine: A parallel implementation of the basic Andorra model. In Proceedings of the 8th International Conference on Logic Programming. MIT Press, Cambridge, Mass., 825–839. COSTA, V. S., WARREN, D. H. D., AND YANG, R. 1991b. The Andorra-I preprocessor Supporting full Prolog on the basic Andorra model. In Proceedings of the 8th International Conference on Logic Programming. MIT Press, Cambridge, Mass., 443-456.

COUSOT, P. AND COUSOT, R.

1977. Abstract interpretation: A unified framework for static analysis of programs by approximation of fixpoint. In the 4th ACM Symposium on Principles of Programming Languages. ACM Press, New York.

CRAMMOND, J. 1988. Implementation of committed choice logic languages on shared memory multiprocessors. Ph.D. thesis, Dept. of Computer Science, Herroit-Watt Univ., Edinburgh, U. K. CHASSIN DE KERGOMMEAUX, J. the PEPSys implementation Tech. Rep. CA-44, ECRC.

1989. on

Measures of the MX500.

ParPro-

CHASSIN DE KERGOMMEADX, J. AND ROBERT, P. 1990. An abstract machine to implement efficiently OR-AND parallel Prolog. J. Logic Program. 8, 3.

relational In the

VILLEMONTE DE LA CLERGERIE, E. 1992. Dyalog: Complete evaluation of Horn clauses by dynamic programming. Ph.D. thesis, INRIA.

CLARK, K. ANI) GREGORY, S. 1986. PARLOG: allel programming in logic. ACM Trans. gram, Lang. Syst. 8, L CLARK, K. AND GREGORY, language for parallel

Programming

ACM Conference on Functional Computer Architecture. ACM

be-

BORGWARDT, P. 1984. Parallel prolog using stack segments on shared memory multiprocessors. In the 84 International Symposzum on Logic Programmmg. IEEE, New York, 2-11.

Bosco, P. G.,

Logic

ACM

Computing

Surveys,

Vol.

26,

No.

3, September

1994

334



J. Chassin

de Kergommeaux

and

DEBRAY, S. K., LIN, N.-W., AND HERMENEGILDO, M. 1990. Task granularity analysis in logic programs. In Proceedings of the ACM SIGPLAN’90 Conference on Programmmg Language Design and Implementation. ACM, New York. DEGROOT, D. 1984. Restricted And-parallelism. In Proceedings of the International Con&ence on Fifth Generation Computer Systems 1984. ICOT, Tokyo, 471-478 DELGADO-RANNAURO, S. A. 1992a. OR-parallel lo~c computational models. In Implementations of D~strlbuted Prolog. John Wdey and Sons, New York, 3-26. DELMDO.R.ANNAURO,

S.

A.

1992b

MOOLENAR, R., VAN HECKER, H., AND DEMOEN. 1991. A parallel implementation of AKL. the ILPS 91 Workshop on Implernentatton Parallel Logw Programming Systems. ILPS.

B. In of

HARIDI, S. AND JANSON, S. 1990. Kernel Andorra Prolog and its computation model. In Proceedings of the 7th International Conference on Logic Program mmg. MIT Press, Cambridge, Mass., 31-46.

HAUSMAN, B. 1989. Pruning and scheduling speculative work in Or-parallel systems. In Parallel Architectures and Languages Europe, PARLE’89. Lecture Notes in Computer Science, vol. 366. Springer-Verlag, Berlin.

DIH.GADO-RANN.AURO, S. A. 1992c. Stream ANDparallel logic computational models. In Implementations of Dlstrlbufed Prolog. John Wiley and Sons, New York, 239–257. DIJKSTRA, E. W. 1975. Guarded commands, nondeterminacy, and formal derivation of programs. Commun. ACM 18, 8. DINCBAS, M., SIMONIS, H., AND VAN HENTENRYCL P. 1990. Solving large combinatorial problems in logic programming ,1. Logic Program. 8, 1-2. DISZ, T , LUSK, E., AND OVERBEEK, R. 1987. Experiments with OR-parallel loglc programs. In the 4th International Conference on Logzc Programming. ACM Press, New York, 576-599. Progranmzzng Prentice-Hall

pendent And-, independent And-j and Or-parallelism. In Proceedings of the 1991 International Logtc Programming Symposmm. MIT Press, Cambridge, Mass., 152-166.

HAUSMAN, B. 1990. Pruning and speculative work in OR-Parallel Prolog. Ph.D. thesis, SICS Res. Rep D-90-9001, Royal Inst of Technology, Stockholm, Sweden.

Restricted

ANDand AND/OR-parallel logic computational models. In Implementations of Distrz b uted Prolog. John Wiley and Sons, New York, 121-141.

FOSTER, I. 1990, Systems allel LogLc Languages. tional, U. K.

P. Codognet

in ParInterna-

HERMENEC+ILDO, M. V. 1986. An abstract machine for restricted AND-parallel execution of logic programs In I%oceedzngs of the 3rd International Con ference on Logic Programming. Lecture Notes in Computer Science, SpringerVerlag, New York, 25-39. HERMENEGILDO, M. V. AND GREENE, K. J. 1990. &-Prolog and its performance: Exploiting independent And-parallelism. In Proceedings of the 7th International Conference on Logic Programming. MIT Press, Cambridge, Mass., 253–268. HERMENEGILDO,

M.

AND

ROSSI,

F.

1990.

Non-

1. AND TAYLOR, S. 1989 Strand New Concepts in Parallel Programming F’rent~ceHall, Englewood Cliffs, N J

strict independent And-parallelism. In Proceedings of the 7th International Conference on Logw Prwgrammmg. MIT Press, Cambridge, Mass., 237-252.

FURUICHI, M., TAKI, K., AND ICHIYOSHI, A. N. 1990. A multi-level load balancing scheme for Orparallel exhaustive search programs on the Multl-PSI. ACM SIGPLAN Not. 25, 3, 50-59.

ICHIYOSHI, N., ROKUSAWA, K., NAKAJIMA, K., AND INAMURA, Y. 1990. A new external reference management and distributed umfication for KL1. New Gen. Cornput. 7, 159-177.

FURUIMWA, K. 1992. Logic programming as the integrator of the Fifth Generation Computer Systems Project Commzm. ACM 35, 3, 83-92.

JAFFAR, J AND LASSEZ, J.-L. 1987. Constraint logic programming. In the 13th ACM Symposium on Principles of Programmmg Languages, POPL 87. ACM Press, New York.

FOSTER.

GUPT,A, G. AND HERMENEGILDO, M. 1991a. ACE And/Or-parallel copying-based execution of logic programs. In ICLP’91 Workshop on Parallel Executton of Logbc Program.?. SpringerVerlag, New York. GUPTA, G AND H~RMENE~lLDO, M. And/Or-parallel copying-based logic programs. Tech. Rep. Polit6cnica de Madrid, Spare GUPTA,

G. AND JAYARAMAN,

B.

1991b. ACE: execution of Universidad

1990.

Optimizing

And-Or parallel Implementations. In Proceedings of the 1990 North Amerzcan Conference on Logic Programming. MIT Press, Cambridge, Mass 605-623. GUPTA, G, COSTA, V. S., YANG, R., AND HERMENI? GILDO, M. V 1991 IDIOM Integrating de-

ACM

Computing

Surveys,

Vol

26,

No

3, September

1994

JANSON, S. AND HARIDI, S. 1991. Programming paradigms of the Andorra Kernel Language. In Proceedings of the 1991 International Logic MIT Press, CamProgramnung S’ymposutm. bridge, Mass., 167-186 ,JoN~s, N. AND SONDERGAARD, H. 1987. A semantic-based framework for the abstract interpretation of Prolog. In Abstract Interpretat~on of Declarative Languages, S. Abramsky and C. Hankin, Eds. Ellis Horwood, 123-142. KALfi, L. V. 1987. The REDUCE-OR process model for parallel evaluation of logic programs. In Proceedings of the 4th International Conference on Logic Programming. MIT Press, Cambridge, Mass., 616-632.

Parallel KALfi, L. V. duce-OR

AND RAMIfUMAR, B. 1992. process model for parallel

gramming on non-shared memory Implementations of Distr~buted Wiley

and

Sons,

New

York,

The Relogic pro-

187-212.

KLINGER, S. AND SHAPIRO, E. 1988. A decision tree compilation algorithm for FCP (—, :, ?). In pz’oceedmgs of the 5th International Conference and Sympostum on Logic Programming. MIT Press, Cambridge, Mass., 1315–1336. ing.

Elsevier

1979. Science,

Logic New

for

Problem

Solw

York.

KUMON, K., ASATO, A., ARAI, S., SHINOGI, T., AND HATTORI, A. 1992. Architecture and implementation of PIM/p. In Proceedings of the In ternatLonal Conference on Fifth Generation Computer Systems 1992. ICOT, Tokyo, 414-424, LfN, Y. J. AND KUMAR, V. 1988. AND-parallel execution of logic programs on a shared memory multiprocessor: A summary of results. In Proceedings of the 5th International Conference and Syrnposlum on Logic Programming. MIT Press, Cambridge Mass., 1123-1141. LLOYD, J, W. gramming. Lus~,

1987. Foundation.9 of Logic Springer-Verlag, Berlin.

Pro-

E.,

BUTLER, R., DISZ, T., OLSON, R., OVERR., STEVENS, R., WARREN, D. H. D., CALDERWODD, A., SZER~DI, P., HARID1, S., BRAND, P., CARLSON, M., CIEPIELEWSKI, A., AND HAUSMAN, B.> 1988. The Aurora Or-parallel pro~og system. In Proceedings of the International Conference on Fifth Generation Computer Systems 1988, ICOT, Tokyo. BEEK,

Programming

Systems



335

matic compile-time parallelization of logic programs for independent And-parallelism. In Proceedings of the 7th International Conference MIT Press, Cambridge, on Logic Programming. Mass., 221-236.

machines. In Prolog. John

KALfi, L. V. AND RAMKUMAR, B. 1990. Joining AND parallel solutions in AND/OR parallel systems. In Proceedings of the 1990 North American Conference on Logic Programming. MIT Press, Cambridge, Mass., 624-641.

KOWALSKZ, R. A.

Logic

MUTHUKUMAR, K. AND HERMENEGILDO, M. 1989a. Complete and efficient methods for supporting side-effects in independent/restricted ANDparallelism. In Proceedings of the 6th International Conference on Logic Programming. MIT Press, Cambridge, Mass., 80-97. MUTHUKUMAR, K. AND HERMENEGILDO, M. 1989b. Determination of variable dependence information through abstract interpretation. In Proceedings of the North American Conference on Logic Programming. MIT Press, Cambridge, Mass., 166-188. NAISH, L. 1988. Parallelizing NU-Prolog. In Proceedz ngs of the 5th International Conference and Symposium on Logic Program mmg. MIT Press, Cambridge, Mass., 1546-1564. NAISH, L. 1984. Mu-Prolog 3. ldb Reference ual. Melbourne University, Melbourne, tralia.

ManAus-

NAKAJIMA, K., INAMURA, Y., ICHIYOSHI, N., ROKUSAWA, K., AND CHIKAYAMA, T. 1989. Distributed implementation of KLl on the MultiPSI/V2. In the 6th International Conference on Logic Program ming. NAKASHIMA, H. AND NAKAJIMA, K. 1992. Architecture and implementation of PIM\m. In Proceedings of the In ternatlonal Conference on F~fth Generation Computer Systems 1992. ICOT, Tokyo, 425-435. PALMER, D. AND NAISH, L. 1991. NUA-Prolog: An extension to the WAM for parallel andorra. In Proceedings of the 8th International Conference on Logic Programming. MIT Press, Cambridge, Mass., 429-442.

MAES, L. 1992. Or-parallel speedups in a compiled PROLOG engine: Results of the integration of the Muse scheduler in the ProLog by BIM engine, Draftj BIM, Kwikcstraat 4, B-3078 Everberg, Belgium.

PERCE~OIS, C., SIGNES, N., AND SELLE, L. 1992. An actor-oriented computer for logic and its

MA~ER, M. J. 1987. Logic semantics for a class of committed-choice programs. In Proceedings of the 4th International Conference on Logic Programming, MIT Press, Cambridge, Mass., 858-876.

PER~IRA, L. M. AND PORTZ.), A, 1979. Intelligent backtracking and sidetracking in Horn clause programs. Tech. Rep., CINUL 2/79, Universitade Nuova de Lisboa, Portugal.

MASUZAWA, H. 1986, Kabu Wake parallel inference mechanism and its evaluation. In 1986 FJCC. IEEE, New York, 955–962. MELLISH, C. S. 1986. Abstract interpretation of Prolog Programs. In Proceedings of the 3rd International Conference on Logzc Programming, Springer-Verlag, New York. 463–474.

application. Prolog. 213-235.

In

Implementations of Distributed Wiley and Sons, New York,

John

RAMANATHAN, commercial 21, 3.

G. ANZI OREN, J. 1993. Survey of parallel machines. ACM Szggarch

RAMZiUMAR, B. 1991, Machine independent “AND” and “OR parallel execution of logic programs. Ph.D. thesis, Univ. of Illinois, Urbana-Champaign, Ill.

MUDAMBI, S. 1991. Performances of Aurora on NUMA machmes. In Proceedings of the 8th Intei-natzonal Conference on Logic Programming. MIT Press, Cambridge, Mass., 793–806.

RAMKUMAR, B. AND IQm4, L. V. 1989. Compiled execution of the Reduce-OR process model on multiprocessors. In Proceedings of the North American Conference on Logzc Programming. MIT Press, Cambridge, Mass., 313–331.

MUTHUKUMAR, K. AND HERMENEGILDO, M. V. The DCG, UDG, and MEL methods for

SARASWAT, V. programming

1990. auto-

ACM

Computing

A.

1991. Concurrent languages. In Doctoral

Surveys,

Vol

26,

No,

constraint Disserta-

3, September

1994

336

J. Chassin



tzon

MIT

Award Press,

de Kergornmeaux

and P. Codognet VAN HENTENRYCK, P., SARASWAT, V., AND DEVILLE, Y. 1993. Design, implementation and evaluation of the concurrent constraint language cc(fd). Tech. Rep., Brown Univ., Providence, R. I.

and Logic Programming Serzes. Cambridge. Mass. To be published.

SARASWAT, V. A. 1987. The concurrent logic programming language CP Definition and operational semantics. In the 13th ACM Symposium on Principles of Programming Languages, POPL

87. ACM

Press,

New

VAN ROY, P. AND DES~AIN, A. M. 1992. High-performance logic programming with the Aquarius Prolog compiler. (“ompater 25, 1

York.

SARASWAT, V. A. AND RINARD, M. 1990. Concurrent constraint programmmg. In the 16th ACM Symposlurn on Principles of Programmmg Languages, POPL 90. ACM Press, New York. SATO, M. AND GOTO, A. 1988. Evaluation of the KLl parallel system on shared memory multiprocessor. In Proceedings of the IFIP Working Conference on Parallel Processing. North-Holland, Amsterdam. SHAPIRO, E. 1989. The family of concurrent programming languages. ACM C’omput. 21, 3.

logic Sure.

SHAPIRO, E. 1983. A subset of Concurrent Prolog and its interpreter. Tech. Rep , Weizmann Institute, Rehovot, Israel. SHEN, K. AND HERM~NIS~ILDO, M. V. 1991. A simulation study of Or- and independent And- parallelism. In F’roceedzngs of the 1991 International Logic Prograrnrnmg Symposium. MIT Press, Cambridge, Mass., 135-151. SZEREI)I, P. 1989. Performance analysis of the Aurora OR-parallel Prolog system. In Proceedings of the North Amerzcan Conference on Logic Programrnmg. MIT Press, Cambridge, Mass. TAIL

K. 1992. Parallel Inference Machine PIM. In Proceedings of the International Conference on Fzfth Generat~on Computer Systems 1992. ICOT, Tokyo, 50-72.

TAYLO~ S. 1989. Parallel Techniques. Prentice-Hall

Logtc Programming International, U. K.

199x A TAYLOR, S., SAFRA, S., AND SHAPIRO. E parallel implementation of Flat Concurrent Prolog. J. Parall. Program. 15, 3, 245-275. TICK

E. 1989. Comparmg gramming architectures.

two parallel loglc proIEEE Softw. (July)

TICK,

E. 1988 Compile-time granularity analysis for parallel logic programming languages In Proceedings of the Iuternatzonal Conference orL Fl fth Generation Computer Systems 1988. ICOT, Tokyo, 994-1000.

UEDA, K. AND CHIKAYAMA, T. 1990. Kernel language for the Parallel chine Comput. J. 33, 6. VAN HENTENRYCK, tion in Logic bridge, Mass. VAN

Constraint

MIT

SatisfacPress, Cam-

HENTENRYCK, P. 1989b. Parallel constraint satisfaction in logic programming. Preliminary results of Chip within PEPSYS In Prwceedmgs of the 6th Internatz onal Conference on Logzc Programmmg. MIT Press, Cambridge, Mass., 165-180.

Recewed

ACM

P. 1989a. Programming.

Design of the Inference Ma-

May 1992;

Computmg

final

Surveys,

revmon

accepted

Vol

No.

26,

November

3. September

1994

VERON, A., SCHUERMAN, K., REEVE, M., AND LI, L. 1993. Why and how in the ElipSys OR-parallel CLP system. In the 5th Znternationa[ PARLE Conference. Lecture Notes in Computer vol 694. Springer-Verlag, Berlin, Science, 291–304. WARREN, D. H. D. 1990 The extended Andorra model with Implicit control. In the ICLP 90 Workshop on Parallel Logzc Programming. Presentation slides WARREN, D. H. D. 1988. The Int. Rep., Gigalips Group.

Andorra

principle.

WARREN, D. H. D. 1987. The SRI model for Orparallel execution of Prolog. Abstract design and Implementation issues. In proceedings of the 1987 S.vmposum on Logic Programmmg. IEEE Computer Society Press, Los Alamitos, Cahf., 46-53. WARREN, D. H. D. 1983. An abstract Prolog instruction set. Tech. Rep. tn309, SRI, Menlo Park, Calif. WARREN, D. H. D. AND HARIDI, S. 1988. Data Diffusion Machine—A scalable shared virtual memory multiprocessor. In Proceedings of the International Conference on Fifth Generation Computer Systems 1988. ICOT, Tokyo, 943952. WESTPHAL, H , ROBERT, P., CHASSIN, J., AND SYRE, J.-C. 1987 The PEPSys Model: Combimng backtracking, ANDand OR-parallelism. In Proceedings of the 1987 Symposium on Logzc Programming IEEE Computer Society Press, Los Alamitos, Cahf., 436-448 WINSBOROUGH, W AND WAERN, A. 1988. Transparent And-parallelism in the presence of shared free variables. In Proceedings of the 5th Internatmnal Conference and Symposium on Logzc Program mlng. MIT Press, Cambridge, Mass , 749-764. YANG, R. ANU Amo, H. 1986. P-Prolog: A parallel logic language based on exclusive relatlon. In Proceedings of the 3rd Znternatzonal Conference OTLLogic Program mmg Springer-Verlag, New York, 255–269 YANG, R., BEAUMONT, T., DUTRA, I., COSTA, V. S., AND WARREN, D. H. D. 1993. Performance of the compder-based Andorra-I system. In Proceedz ngs of the 10th International Conference on Logzc Programming. D. S. Warren, Ed., MIT Press, Cambridge, Mass., 150-166.

1993