A Configuration Approach to Parallel Programming

Jeff Magee, Naranker Dulay
Department of Computing, Imperial College of Science, Technology and Medicine,
180 Queen's Gate, London SW7 2BZ, UK.

Abstract

This paper advocates a configuration approach to parallel programming for distributed memory multicomputers, in particular, arrays of transputers. The configuration approach prescribes the rigorous separation of the logical structure of a program from its component parts. In the context of parallel programs, components are processes which communicate by exchanging messages. The configuration defines the instances of these processes which exist in the program and the paths by which they are interconnected. The approach is demonstrated by a toolset (Tonic) which embodies the configuration paradigm. A separate configuration language is used to describe both the logical structure of the parallel program and the physical structure of the target multicomputer. Different logical to physical mappings can be obtained by applying different physical configurations to the same logical configuration. The toolset has been developed from the Conic system for distributed programming. The use of the toolset is illustrated through its application to the development of a parallel program to compute Mandelbrot sets.


1 Introduction

The work described in this paper arose from our interest in applying the principles embodied in Conic [KRA85, MAG89] to a programming environment for multicomputers [ATH88]. The shortcomings we perceived in existing programming environments for multicomputers based on transputer arrays provided additional motivation. The modifications necessary to Conic to enable its efficient use in the transputer environment led to naming the variant Tonic, for obvious reasons.

Conic is a toolkit for constructing distributed systems. It provides two languages: a declarative configuration language, used to describe the structure of a logical node in terms of its constituent process types, process instances and process interconnections; and a programming language, used to program individual process types. The programming language is Pascal augmented with message passing primitives. Distributed systems are constructed in Conic by dynamically assigning instances of logical nodes to physical nodes and interconnecting these instances. Conic embodies the configuration approach in rigorously separating the logical structure of a distributed program from the components which implement its computational function.

The differences between Tonic and Conic arise from the characteristic differences between parallel and distributed programs. We see these as being:

Objective - Distributed programs can be considered as consisting of a number of logically distinct entities which intercommunicate to achieve some overall goal - typically access to geographically distributed resources. Parallel programs are logically one entity whose constituents co-operate to achieve some computational goal - the overall objective of performing the computation in parallel being speedup.

Failure - Failure of one component of a distributed program generally requires that the remaining components continue operation, albeit in degraded mode, whereas failure of one component of a parallel computation can generally be allowed to cause termination of the overall computation. In distributed environments, longevity of execution, together with the probability of communication and node failure, means that software development toolkits must provide programming abstractions to deal with such failures. This is not the case in multicomputers, where we can assume reliable communication and a low probability of node failure during the execution of programs which have as their primary objective speedup rather than continuous execution.

Evolution - A large class of critical distributed programs execute perpetually and for economic or safety reasons require the facility to be modified and updated on-line. The Conic toolkit supports this requirement through its ability to dynamically configure running systems.

On-line evolution is not a requirement for parallel programs, which run for relatively short periods, completing when a result has been computed.

Heterogeneity - Distributed programs are generally designed such that components of the program may run on computers with different processor types. Programming environments for distributed systems deal with this hardware heterogeneity by providing multiple code generators and message datatype conversion facilities. Distributed memory multicomputers, typified by transputer networks, provide hardware homogeneity, although they are usually hosted by a computer with a different processor type. Programming environments for multicomputers should therefore optimise for hardware homogeneity.

In summary, Tonic is optimised to support the development of parallel programs for distributed memory multicomputers where the primary objective is speedup. Tonic does not support dynamic configuration and assumes reliable processors and reliable inter-processor communication. It inherits from Conic the configuration approach in providing a separate language to define logical structure, and extends the Conic configuration facilities by also applying this language to describing the physical structure of the target multicomputer. This physical configuration description is used to drive the logical to physical mapping process.

Currently, the most commonly used toolset for developing parallel programs for transputer based multicomputers is the Occam language [INM88a] and the Transputer Development System (TDS) [INM88b]. While these are efficient tools for developing embedded applications, they have drawbacks when used for developing application programs for the current generation of transputer based multicomputers such as the Meiko Computing Surface and the Supernode. The drawbacks are primarily concerned with the limited flexibility permitted in mapping an arbitrary network of communicating Occam processes to the hardware topology of interconnected transputers. The developer must take into account the limit of four links per transputer and laboriously place logical channels onto physical channels. Explicit multiplexor and demultiplexor processes must be provided where it is necessary to map more than one logical communication channel onto a physical link. Changing the number of processors on which an application runs requires recompilation.

More recent toolsets such as CStools [MEI89] address the problem of making the logical structure independent of the underlying hardware structure through the use of configuration descriptions (termed "par files"); however, these descriptions do not permit runtime parameterisation of the number of processors or flexible logical to physical mapping unless the user resorts to the underlying library routines. The Helios operating system [DSL90] allows a user to describe the hardware configuration (Resource Map) separately from the logical configuration (CDL). These descriptions use different notations and are limited to

compile time parameterisation. Further, the application programmer has little control over the logical to physical mapping, and Helios programmers are limited to Unix style I/O for inter-process communication.

In the following section, the Tonic facilities for developing parallel programs are illustrated through the development of a program to compute the Mandelbrot set. The facilities provided for mapping this program to different hardware topologies are described in Section 3. Section 4 overviews the implementation of Tonic and provides some performance data. Finally, Section 5 evaluates the approach.

2 Program Construction in Tonic (Logical Structure)

The following overviews the programming features offered by Tonic for parallel programming through the example of a program to generate Mandelbrot sets. The program generates a 512 by 512 pixel image where the colour of each pixel is represented by an 8 bit quantity. This quantity is computed as the number of iterations (up to a maximum of 255) of the calculation z = z*z + c before |z| > 2, where z is a complex variable and c a complex constant. If the maximum is reached, c is assumed to be in the Mandelbrot set; otherwise the number of iterations indicates how "close" c is to the set.

The simplistic approach to parallelising this is to divide the image into the same number of chunks as there are processors and hand each chunk to a processor for computation. Since some image areas, far outside the set, require much less computation than others, this approach leads to poor load balancing and thus poor performance. A more sophisticated approach employs a work allocator or supervisor process to hand out smaller chunks to worker or slave processes [MAG91]. A slave process computes a chunk, hands it back to the supervisor for display, and then gets another chunk to compute until none are left. In the following, chunks are the size of one horizontal line of pixels, a scheme sketched below.
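Before looking at the Tonic version, the farming scheme can be sketched on its own. The following Python program is purely illustrative and not part of the toolset: the names run, slave and mandel_line are ours, and Python threads and queues stand in for Tonic tasks and ports (being threads, it demonstrates the scheduling, not actual speedup).

    import threading, queue

    WIDTH = HEIGHT = 128   # smaller than the paper's 512x512 so the demo finishes quickly
    MAXITER = 255

    def mandel_line(lineno, x0=-2.0, y0=2.0, d=4.0):
        """Escape-time count for each pixel of one horizontal line."""
        delta = d / WIDTH
        cy = y0 - (lineno - 1) * delta
        line = bytearray(WIDTH)
        for i in range(WIDTH):
            cx = x0 + i * delta
            zx = zy = 0.0
            n = 0
            while zx * zx + zy * zy <= 4.0 and n < MAXITER:
                zx, zy = zx * zx - zy * zy + cx, 2 * zx * zy + cy
                n += 1
            line[i] = n
        return bytes(line)

    def slave(requests):
        reply_q = queue.Queue()
        lineno, data = 0, b""                       # first request carries no result
        while True:
            requests.put((lineno, data, reply_q))   # "send-wait": ask for the next line
            lineno = reply_q.get()
            if lineno == 0:                         # no lines left
                return
            data = mandel_line(lineno)

    def run(n_slaves):
        requests = queue.Queue()
        workers = [threading.Thread(target=slave, args=(requests,))
                   for _ in range(n_slaves)]
        for w in workers:
            w.start()
        # Supervisor: hand out line numbers on demand and collect results.
        lines, allocated, idle = {}, 0, 0
        while idle < n_slaves:
            lineno, data, reply_q = requests.get()
            if lineno:                              # store a computed line, if any
                lines[lineno] = data
            if allocated < HEIGHT:
                allocated += 1
                reply_q.put(allocated)              # next line to compute
            else:
                reply_q.put(0)                      # tell this slave to stop
                idle += 1
        for w in workers:
            w.join()
        return [lines[i] for i in range(1, HEIGHT + 1)]

    if __name__ == "__main__":
        image = run(4)
        print(len(image), "lines computed")

Because a slow line merely delays one slave's next request while the others keep pulling fresh lines, the work spreads itself out; this is precisely the load-balancing effect described above.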


The logical structure of the program is shown in Figure 1.

[Figure 1: within mandgen, each slave[i] sends a computed line (mandmsg) to the supervisor and receives in reply a new line to compute (integer).]

Figure 1 - Logical structure of the Mandelbrot Generator Program

The types of message exchanged between components of the program, together with program-wide constants, are defined in a definitions unit (cf. Modula-2 modules) as shown below.

 1  define mandbrot:Xmax,Ymax,mandmsg,mandp;
 2  const Xmax=512; Ymax=512;
 3  type mandp = ^mandmsg;
 4       mandmsg = record
 5         lineno:integer;
 6         linebuf:packed array[1..Xmax] of char;
 7       end;
 8  end.

The program unit shown below is the definition of the slave process type - in Tonic, process types are task modules. Tasks¹ communicate with the outside world by sending messages to exitports and receiving messages from entryports. A task has no direct knowledge of which other tasks it will be connected to. This configuration independence greatly facilitates reuse. In this case, the slave task sends a computed line of pixel colour values (type mandmsg) to its exitport result (line 4). The communication primitive used is a send-wait (line 23), which sends a request message to the exitport and then suspends the task while waiting for a reply. The first message from a slave to the supervisor has a zero line number, indicating that the message does not include a computed line - it is merely a request for the first line to compute. Subsequent messages overload a request for a new line with the result of computing the last line.

¹ The terms task and process are used interchangeably throughout the paper.


 1  task module slave(x0,y0,d0:real);
 2  use
 3    mandbrot:Xmax,Ymax,mandmsg;
 4  exitport result:mandmsg reply integer;
 5  var
 6    M:mandmsg;
 7    x1,y1,delta:real;
 8    i:integer;
 9  function mandcalc(cx,cy:real):integer;
10  var i:integer; zx,zy,xx,xy,yy,t:real;
11  begin
12    i:=0; zx:=cx; zy:=cy;
13    repeat
14      xy:=zx*zy; xx:=zx*zx; yy:=zy*zy;
15      zy:=xy+xy+cy; zx:=xx-yy+cx;
16      t:=xx+yy; i:=i+1;
17    until (t>4.0) or (i=256);
18    mandcalc:=i;
19  end;
20  begin
21    M.lineno:=0; delta:=d0/Xmax;
22    loop
23      send M to result wait M.lineno;
24      x1:=x0; y1:=y0-(M.lineno-1)*delta;
25      for i:=1 to Xmax do begin
26        M.linebuf[i]:=chr(mandcalc(x1,y1));
27        x1:=x1+delta;
28      end;
29    end;
30  end.

The supervisor component shown in Figure 1 is implemented by two tasks as shown in Figure 2.


Figure 2 - Supervisor Composite Component


The supervisor composite component is described in the Tonic configuration language by the group module below. Note that the interface to a group module is defined in an identical way to task module interfaces. Thus tasks can be replaced by groups, and vice versa, at any point during program development without affecting the rest of the program. The configuration description consists of four parts: i) the use clause (line 2), which specifies the message and component types used to construct the group; ii) the interface to the component in terms of entryports and exitports (line 6); iii) the instances of component types from which the component is constructed (line 8); and iv) the interconnections between these instances (line 11). Since the master task is critical to the overall performance of the program, its associated pragma (line 10) indicates that this task instance should be run as a high priority transputer process with its workspace in on-chip memory if possible.

 1  group module supervisor;
 2  use
 3    mandbrot:mandmsg;
 4    display;
 5    master;
 6  entryport result:mandmsg reply integer;
 7  create
 8    display;
 9    master
10      {pragma: high priority, workspace on-chip};
11  link
12    display.out to master.out;
13    result to master.result;
14  end.

The program for task module master is given below. This task allocates lines to be computed to slave tasks through replies on the entryport result (line 18) and stores computed lines in the array bufs. Since lines take different amounts of time to compute, computed lines will not be received by master in line order. The guard on the receive from the entryport out (line 25) ensures that the display task receives lines in the correct order. The semantics of the select statement are identical to those of the Conic select statement. It should be noted that the master task neither has nor needs information on the number of slave tasks connected to it. In the following, this will allow us to parameterise the overall program simply by the number of slave tasks.


 1  task module master;
 2  use mandbrot:Xmax,Ymax,mandmsg,mandp;
 3  entryport result:mandmsg reply integer;
 4            out:signaltype reply mandp;
 5  var
 6    written,allocated:integer;
 7    current:mandp;
 8    bufs:array[0..Ymax] of mandp;
 9  begin
10    for written:=0 to Ymax do
11      bufs[written]:=nil;
12    written:=0; allocated:=0;
13    new(current);
14    loop
15      select
16        receive current^ from result =>
17          allocated:=allocated+1;
18          if allocated<=Ymax then reply allocated to result;
19          if current^.lineno<>0 then begin
20            bufs[current^.lineno]:=current;
21            new(current);
22          end;
23      or
24        when bufs[written+1]<>nil
25        receive signal from out =>
26          reply bufs[written+1] to out; written:=written+1;
27      end;
28    end;
29  end.

The remaining task, display, listed below, exists to decouple I/O latencies to the display from the response time of master to requests for lines. Note that this is the only task in the program that terminates. The Tonic termination model simply states that when any one task terminates (correctly or erroneously), the entire program is terminated. In this respect, Tonic differs considerably from its predecessor, which allowed continued operation in the presence of failures. This decision is consistent with the different characteristics of parallel and distributed programs identified in the introduction.


 1  task module display;
 2  use mandbrot:Xmax,Ymax,mandmsg,mandp;
 3  exitport out:signaltype reply mandp;
 4  var
 5    current:mandp;
 6    i,j:integer;
 7    output:text;
 8  begin
 9    for i:=1 to Ymax do begin
10      send signal to out wait current;
11      for j:=1 to Xmax do
12        write(output,current^.linebuf[j]);
13    end;
14  end.

The final step in developing the Mandelbrot program is to describe the overall configuration of the slave and supervisor components, together with an abstract description of how we wish these to be executed on the multicomputer. At this stage, we merely indicate a mapping of components to an abstract machine which consists of maxprocessor identical processors. We are not concerned with the physical details of how these processors are interconnected. In fact, we assume that they are fully interconnected or, as termed in the following, globally interconnected. The configuration description for the program mandgen is given below.

 1  {default parameter values}
 2  group module mandgen(x:real=-2.0; y:real=2.0; d:real=4.0);
 3  use
 4    execpar;
 5    supervisor;
 6    slave;
 7  create
 8    execpar;
 9  create supervisor;
10  create forall k:[1..maxprocessor] at (k)
11    slave[k](x,y,d);
12  link forall k:[1..maxprocessor]
13    slave[k].result to supervisor.result;
14  end.

The replicator forall is used to declare vectors of components (line 10) or links (line 12). The at clause (line 10) specifies the processor at which the component instance is to be located. Any integer expression may follow the keyword at to denote the processor. Components with no at clause are by default allocated to the processor to which their parent group is allocated (in this case, processor 1). More precisely, the rules governing allocation to processor numbers are:


(1) Processors are numbered from 1 to maxprocessor.
(2) Any component instance can be allocated to a processor by an at clause.
(3) The default allocation is to the processor of the parent group (i.e. no at clause is used).
(4) The top-level group is conceptually allocated to processor 1.

These rules mean that an at clause can appear at any level of a configuration description; a small sketch of their effect follows the listing below. For example, the configuration description for the parallel executive component execpar (below) allocates an instance of the component exec800 to each processor. Execpar provides I/O to the host, error reporting and inter-processor communication. It should be noted that maxprocessor is not a compile-time constant - it is initialised at run-time as described in the next section.

 1  group module execpar(buffers:integer=4);
 2  use
 3    exec800;
 4  create forall k:[1..maxprocessor] at (k)
 5    exec800[k](buffers);
 6  end.
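To make the allocation rules concrete, the following small Python sketch (ours, not part of Tonic; the names Component and processor are invented) resolves a component's processor by walking up the configuration hierarchy:

    class Component:
        """A node in the configuration hierarchy."""
        def __init__(self, name, parent=None, at=None):
            self.name, self.parent, self.at = name, parent, at

        def processor(self):
            # Rule (2): an explicit 'at' clause wins.
            if self.at is not None:
                return self.at
            # Rule (3): default to the parent group's allocation.
            if self.parent is not None:
                return self.parent.processor()
            # Rule (4): the top-level group lives on processor 1.
            return 1

    # The mandgen example with maxprocessor = 4 (rule (1): processors 1..4):
    maxprocessor = 4
    mandgen = Component("mandgen")                       # top level -> processor 1
    supervisor = Component("supervisor", mandgen)        # no 'at'   -> processor 1
    slaves = [Component(f"slave[{k}]", mandgen, at=k)    # 'at (k)'  -> processor k
              for k in range(1, maxprocessor + 1)]

    for c in [supervisor] + slaves:
        print(c.name, "->", c.processor())   # supervisor -> 1, slave[k] -> k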

This section has demonstrated how parallel programs are constructed in Tonic. The mandgen program includes examples of request-reply communication and of one-to-one and many-to-one communication (many slave exitports to one supervisor entryport). Tonic also includes unidirectional synchronous communication primitives, a mechanism for responding to requests in a different order from that in which they were received, and a forwarding facility. Description of these is beyond the scope of the present paper. The next section describes how the Mandelbrot program can be executed on many different hardware configurations.

3 Logical to Physical Mapping

The previous section described the logical structure of the Mandelbrot program mandgen. This logical structure is annotated with a mapping of components to an abstract machine consisting of maxprocessor identical, globally interconnected processors. In this section, we outline how this abstract machine is realised on the actual hardware. In our case, the target hardware is a Meiko Computing Surface consisting of a SPARC based host (running Unix) and 32 T800 transputers, each with 4 Megabytes of memory (Figure 3). Via a utility provided by Meiko (svcsd), a user can reserve a variable number of transputers and set up inter-transputer links before downloading an application program. Svcsd provides a bidirectional message passing interface from the host to link 0 of one of the reserved transputers.


Figure 3 - Meiko Computing Surface

The Tonic configuration language is used to describe the desired physical topology of interconnected transputers. This physical configuration description is used to drive the logical to physical mapping process. Figure 4 depicts the configuration view of an individual T800 transputer.

 1  task module t800;
 2  exitport
 3    linkout[0..3]:byte;
 4  entryport
 5    linkin[0..3]:byte;
 6  begin
 7    {never executed}
 8  end.

Figure 4 - Configuration view of a T800 transputer

The T800 transputer type is represented by a Tonic task module. The task has no code since it is never executed. It serves only to provide an interface specification for configuration descriptions. Using this definition, we can now describe physical topologies of transputers. The following group module describes a pipeline where link 1 of each transputer is connected to link 0 of its successor in the pipeline. The pragma (line 7) associates an integer processor identifier with each t800 instance. These processor identifiers are used during the mapping process. A component in the logical configuration will execute at the processor whose identity corresponds to that specified by the component's at clause.

In the interest of conciseness, it is only necessary to define either linkout[m] to linkin[n] or linkout[n] to linkin[m] to specify a hardware connection between two transputers.

 1  group module pipeline(length:integer);
 2  entryport linkin:byte;
 3  use t800;
 4  create
 5    forall k:[1..length]
 6      t800[k]
 7        {pragma: processor identifier k};
 8  link
 9    forall k:[1..length-1]
10      t800[k].linkout[1] to t800[k+1].linkin[0];
11  link linkin to t800[1].linkin[0];
12  end.

The following description of a ternary tree of transputers illustrates some of the more powerful features of the Tonic configuration language, namely guards and recursion. In this example we have omitted pragmas to explicitly associate identities with processors and have relied instead on the default assignment of identities supplied by the underlying system. This definition of a ternary tree is of limited usefulness since it only generates configurations of 1 processor (depth=0), 4 processors (depth=1), 13 processors (depth=2), and so on. In practice, we use a definition of ternarytree which generates balanced ternary trees for any number of processors, as sketched after the listing below.

 1  group module ternarytree(depth:integer);
 2  entryport linkin:byte;
 3  use t800;
 4  create
 5    root:t800;
 6  link
 7    linkin to root.linkin[0];
 8  when depth>0
 9    create
10      forall k:[1..3]
11        child[k]:ternarytree(depth-1);
12  when depth>0
13    link
14      forall k:[1..3]
15        root.linkout[k] to child[k].linkin;
16  end.
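The node counts quoted above follow from the fact that a full ternary tree of depth d contains (3^(d+1) - 1)/2 processors. The paper does not show its balanced-tree definition; one conventional way to balance over an arbitrary processor count is heap-style numbering, where the children of node p are 3p-1, 3p and 3p+1. A small Python sketch of both (ours, purely illustrative):

    def full_tree_size(depth):
        # A full ternary tree of the given depth has (3^(depth+1) - 1) / 2 nodes.
        return (3 ** (depth + 1) - 1) // 2

    def balanced_links(n):
        """Heap-style links for a balanced ternary tree of n nodes (1-based)."""
        # Children of node p are 3p-1, 3p and 3p+1, so node i has parent (i+1) div 3.
        return [((i + 1) // 3, i) for i in range(2, n + 1)]

    print([full_tree_size(d) for d in range(4)])  # [1, 4, 13, 40]
    print(balanced_links(6))  # [(1, 2), (1, 3), (1, 4), (2, 5), (2, 6)]

For n = 13 this numbering reproduces the full depth-2 tree, and for any other n it fills the levels left to right, keeping the boot and communication paths short.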

To complete the hardware configuration description, the connection between the host processor and the target transputer system must be specified, as shown below for both the pipeline and ternary tree topologies. Line 9 specifies the connection between the host, represented by the component gin, and the first transputer in the pipeline (or ternarytree). Gin is the engine which performs most of the work of providing the Globally INterconnected abstract machine required to execute the logical configuration. Its implementation is described in outline in the next section. The name was irresistible.

 1  group module pipe(length:integer);
 2  use
 3    gin;
 4    pipeline;
 5  create
 6    gin;
 7    pipeline(length);
 8  link
 9    gin.linkout to pipeline.linkin;
10  end.

 1  group module ttree(depth:integer);
 2  use
 3    gin;
 4    ternarytree;
 5  create
 6    gin;
 7    ternarytree(depth);
 8  link
 9    gin.linkout to ternarytree.linkin;
10  end.

The above group modules pipe and ttree compile into the host executable files pipe and ttree. The program mandgen described in the previous section compiles into the target executable file mandgen. To execute the Mandelbrot program on a pipeline of four processors, the user types the following command on the host; the resulting logical to physical mapping is depicted in Figure 5.

    pipe 4 mandgen 2.0 2.0 4.0 | pixdisp²

Similarly, to execute the Mandelbrot program on a ternary tree of depth 2 (13 processors), the user types the command:

    ttree 2 mandgen 2.0 2.0 4.0 | pixdisp

Note that ttree and pipe can be applied to any application program; they are not specific to the Mandelbrot example.

[Figure 5: the host connects to transputer 1, which runs the supervisor and slave[1]; transputers 2, 3 and 4 run slave[2], slave[3] and slave[4], chained link 1 to link 0.]

Figure 5 - Mandgen mapped to a pipeline of 4 transputers

² Pixdisp is a program executing on the Unix host which reads from its standard input and displays the bytes read as coloured pixels in an X window.


4 Implementation & Performance

In this section, we give an overview of how Tonic programs are executed on the Meiko Computing Surface. The latter part of the section discusses performance.

Tonic Configuration Language

Each group module in a configuration description compiles into a procedure to elaborate the structure of that group at run-time. The set of these elaboration procedures, when executed at run-time, generates a directed graph in which the nodes are task instances and the arcs are inter-task links. Group modules are not represented in this graph; it is a flat representation of the hierarchical configuration structure [DUL90]. Consequently, no penalty is paid at run-time for using hierarchically structured configuration descriptions.

Bootstrapping the transputer network

When a physical configuration description is executed on the host (e.g. ttree), the graph structure generated is passed to Gin. This graph represents the desired configuration of transputers required to execute the logical configuration. Gin performs the following sequence of actions:

1) The graph is checked to ensure that it represents a legal transputer configuration. That is, all transputer connections are one-to-one, processor identifiers are in a contiguous range, and the graph is fully connected (so that there is a path to boot every processor).

2) Gin then computes a minimum depth spanning tree for the graph. This identifies which transputer links will be used for bootstrapping. The complete graph is recorded in an adjacency matrix AJ[0..maxprocessor, 0..maxlink], where maxlink=3 and AJ[i,j] is the identity of the processor to which link j of processor i is connected. Processor 0 represents the host. The matrix is marked with those links which will be booted.

3) Using the svcsd utility, gin grabs the required number of processors and interconnects them to conform to the graph generated by the physical configuration description.

4) In the next stage, gin bootstraps the first transputer by sending the application program (e.g. mandgen.800) to its link 0. Once the program starts executing, gin sends it four further pieces of information:

a) its processor identifier (in the range 1..maxprocessor);
b) the maximum number of processors, maxprocessor;
c) the adjacency matrix AJ;
d) the command arguments represented as strings (Unix argc & argv). For the example, these would be mandgen.800 2.0 2.0 4.0.

5) At this stage, gin is finished with the bootstrapping process and becomes a server which services I/O requests from the application program. It runs until the program running on the network either reports an error or terminates.

The application program loaded into each transputer continues the bootstrap process. When started, it receives the items a) to d) listed in 4) above. The program then examines the entries in the adjacency matrix row corresponding to its processor identifier. If an entry AJ[self,j] is marked to be bootstrapped, the program sends its code to outgoing link j to bootstrap the processor to which link j is connected. It then sends the processor identifier AJ[self,j], maxprocessor, AJ and the command arguments to complete the bootstrap. Note that exactly the same code is loaded into each processor. The only value which a processor receives to distinguish it from others is its processor identifier. After the first level of the boot spanning tree has been bootstrapped, booting continues in parallel until the leaves of the tree have been bootstrapped.

Initialisation

After completing bootstrapping, each transputer executes the same initialisation code. This code first calculates a minimum distance routing table from the adjacency matrix; both this calculation and the boot spanning tree are sketched below. The routing table is used by execpar at execution time. Initialisation proceeds by invoking the group elaboration procedures. Each transputer node thus has a copy of the complete logical configuration graph. However, the kernel (which is part of execpar) only instantiates tasks which correspond to its processor identifier, i.e. those whose at clause in the logical configuration specified "this" processor. In addition to instantiating tasks, the kernel creates data structures to implement both local and remote inter-task communication. These communication data structures contain initialised transputer channel words. The Tonic communication primitives are implemented using the transputer communication instructions (in, out, alt, etc.).
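Both graph computations above are breadth-first traversals of AJ. The following Python sketch is ours and purely illustrative (Gin's and the kernel's actual code are not given in the paper, and the use of None for an unconnected link is an assumption); it is shown for the pipeline of Figure 5:

    from collections import deque

    def bfs_boot_tree(AJ):
        """Minimum-depth spanning tree rooted at the host (processor 0).

        AJ[i][j] is the processor reached by link j of processor i, or None.
        Returns boot[i] = (parent, parent_link) for each booted processor i.
        """
        boot, frontier = {0: None}, deque([0])
        while frontier:
            p = frontier.popleft()
            for link, q in enumerate(AJ[p]):
                if q is not None and q not in boot:
                    boot[q] = (p, link)   # q is booted down link 'link' of p
                    frontier.append(q)
        return boot

    def routing_table(AJ, src):
        """Next-hop link from 'src' toward every other processor."""
        first_link, frontier = {src: None}, deque([src])
        while frontier:
            p = frontier.popleft()
            for link, q in enumerate(AJ[p]):
                if q is not None and q not in first_link:
                    # Inherit the first link taken out of 'src' (or record it at depth 1).
                    first_link[q] = link if p == src else first_link[p]
                    frontier.append(q)
        return first_link

    # A 4-transputer pipeline with the host on link 0 of transputer 1
    # (link 1 of transputer k is wired to link 0 of transputer k+1):
    AJ = [
        [1, None, None, None],   # processor 0 (host)
        [0, 2, None, None],      # transputer 1
        [1, 3, None, None],      # transputer 2
        [2, 4, None, None],      # transputer 3
        [3, None, None, None],   # transputer 4
    ]
    print(bfs_boot_tree(AJ))     # each transputer booted down the pipe
    print(routing_table(AJ, 2))  # from 2: link 0 toward 0 and 1, link 1 toward 3 and 4

A breadth-first tree from processor 0 is minimum-depth by construction, and the first-link table gives each processor its minimum-distance next hop.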



Loading the entire code for the application at each processor has the disadvantage of wasting storage when task types are loaded but not instantiated. However, this scheme has the following major advantages:

1) It permits the bootstrap to proceed in parallel. This considerably reduces startup time for large numbers of processors.

2) Since each node has a complete copy of the logical configuration graph, the setup of inter-task communication associations at initialisation time does not require remote communication. The initialisation of each transputer proceeds in parallel. Again, this reduces application startup time.

Tonic applications typically take less than a second to start up.

Performance

Figure 6 shows the times required for a request-reply message exchange. The time is measured from the moment the sending task initiates the exchange by a send-wait to the moment the reply message completes the exchange. The receiving task executes a receive followed by a reply.

                                4 byte Request    100 byte Request
                                4 byte Reply      4 byte Reply
    Intra-Processor                 17uS               19uS
    Inter-Processor                174uS              241uS
    + time per additional          144uS              210uS
      intermediate processor

Figure 6 - Request-Reply times

The times above are for a one-to-one request-reply communication, i.e. one exitport connected to one entryport. Where the communication is n to 1 (n exitports connected to 1 entryport, as in the Mandelbrot example), the time for an individual intra-processor request-reply is T + 5*n uS (for n>1), where T is the time for a one-to-one communication. This is because many-to-one communication is implemented using the transputer alt instruction. The receiver task waits on a set of transputer channels representing the set of exitports connected to it. Consequently, the time to receive from an entryport is proportional to the number of exitports connected to it. This represents a considerable performance penalty for large fan-in configurations. For example, with 32 processors, the mandgen program has a 32 to 1

connection to the supervisor's result entryport, so the penalty is 5*32 = 160uS per communication. We are currently re-implementing intra-processor communication using critical regions rather than transputer communication channels to make the receive time independent of the fan-in factor.

[Chart: speedup against number of processors (1 to 32) for the ttree and pipe mappings, with a linear speedup reference line.]

Figure 7 - Mandgen program speedup

Figure 7 plots the speedup for the Mandelbrot program against the number of processors, using a balanced ternary tree hardware configuration (ttree) and a pipeline hardware configuration (pipe). Speedup(n) is measured as the time for 1 processor (i.e. 1 slave & 1 supervisor) divided by the time for n processors (i.e. n slaves & 1 supervisor). Not surprisingly, the ternary tree mapping outperforms the pipeline mapping. The average processing rate for the 32 processor ternary tree mapping is 26 Mflop/s, or 0.8 Mflop/s per transputer. The overall time with 32 processors to complete the computation with the parameters (2.0, 2.0, 4.0) was 3.3 seconds.


5 Discussion & Conclusions

The paper has presented a configuration approach to the construction of parallel programs in which the functional behaviour of individual components is specified by a programming language and the overall parallel program structure is specified by a configuration language. The configuration specification declares the instances of component types and their interconnections. Component instances execute in parallel. This logical configuration is annotated with a mapping to an abstract machine which consists of maxprocessor identical, globally interconnected processors. The physical configuration of the real machine is specified using the same configuration language. This physical description drives the logical to physical mapping process. Both the logical and physical configuration specifications are considerably more flexible than those provided by existing systems [INM88b, DSL90, MEI89]. Running a Tonic program on different physical configurations with different numbers of processors and different inter-processor connections requires no re-compilation. This facilitates both portability and experimentation with different logical to physical mappings.

The configuration approach is similar to that of the MUPPET system [MUH88], which uses a graphical notation to express configurations. However, MUPPET does not clearly separate logical from physical configurations and is limited in the mappings which can be expressed. Tonic also has a graphical notation for expressing configurations and an associated display tool [KRA89]. However, for describing regular structures, we find the power of the textual language more useful.

In concurrent programming terms, Tonic falls into the class of languages which support the process/message-passing paradigm [BAL89]. The best known of these languages are Occam [INM88a] and Ada [DOD83]. Unlike these languages, Tonic incorporates a separate language to describe the structure of concurrent programs in terms of task instances and inter-task message paths (links). These languages also differ in the time at which a program's process structure is fixed: Occam defines the process structure statically at compile time, Tonic at instantiation/initialisation time, and Ada dynamically at run-time. In fairness, it should be noted that Ada semantics do not permit an efficient distributed memory implementation. We are currently experimenting with an approach which would allow changes to the process structure at run-time while retaining a strict separation between programming and configuration [MAG90]. This will permit a limited, but efficient, form of process migration to facilitate dynamic load balancing.

We regard Tonic as a prototype implementation to validate the configuration approach to parallel programming. Its application is restricted by its reliance on one specific programming language - Pascal plus message passing.


Currently, we are engaged in the development of a configuration language (Darwin) [MAG90] and associated tools which will permit the configuration approach to be applied to parallel programs composed of components written in commonly available languages such as C and Fortran.

Despite the limitations expressed above, Tonic is a practical and efficient tool for developing parallel programs. It hides many of the irrelevant details about the underlying hardware which currently harass parallel programmers. Typically, application programmers select a physical configuration from a library rather than programming their own. The library currently includes pipeline, ring, mesh, torus, binary tree, ternary tree, cube connected cycles and WK-Recursive physical topologies. The toolset is used by both research and undergraduate students.

Acknowledgements

The authors would like to acknowledge discussions with our colleagues in the Parallel and Distributed Systems Group during the formulation of these ideas. We gratefully acknowledge the SERC under grants GE/E/62394 (ACME) & GR/G31079, and the CEC in the REX Project (2080) for their financial support.

References

[ATH88] W.C. Athas, C.L. Seitz, "Multicomputers: Message-Passing Concurrent Computers", IEEE Computer, Vol. 21, No. 8, August 1988, pp 9-24.

[BAL89] H. Bal, J. Steiner, A. Tanenbaum, "Programming Languages for Distributed Computing Systems", ACM Computing Surveys, Vol. 21, No. 3, September 1989, pp 261-322.

[DOD83] Department of Defense, U.S.A., "Reference Manual for the Ada Programming Language", ANSI/MIL-STD-1815A, DoD, Washington D.C., Jan 1983.

[DUL90] N. Dulay, "A Configuration Language for Distributed Programming", Ph.D. Thesis, Dept. of Computing, Imperial College, February 1990.

[DSL90] Distributed Software Ltd, "The Helios Parallel Programming Tutorial", 670 Aztec West, Bristol, January 1990.

[INM88a] Inmos Ltd, "OCCAM 2 Reference Manual", Prentice Hall, 1988.

[INM88b] Inmos Ltd, "Transputer Development System", Prentice Hall, 1988.

[KRA85] J. Kramer, J. Magee, "Dynamic Configuration for Distributed Systems", IEEE Transactions on Software Engineering, SE-11 (4), April 1985, pp 424-436.

[KRA89] J. Kramer, J. Magee, K. Ng, "Graphical Configuration Programming", IEEE Computer, Vol. 22, No. 10, pp 53-65.

[MAG89] J. Magee, J. Kramer, and M. Sloman, "Constructing Distributed Systems in Conic", IEEE Transactions on Software Engineering, SE-15 (6), June 1989.

[MAG90] J. Magee, J. Kramer, M. Sloman, and N. Dulay, "An Overview of the REX Software Architecture", Proceedings of the 2nd IEEE Workshop on Future Trends of Distributed Computing Systems, Cairo, Egypt, Sept. 1990, pp 396-402.

[MAG91] J.N. Magee and S.C. Cheung, "Parallel Algorithm Design for Workstation Clusters", Software - Practice and Experience, Vol. 21, March 1991, pp 235-250.

[MEI89] Meiko Ltd, "CS Tools Documentation Guide", 650 Aztec West, Bristol, 1989.

[MUH88] H. Muhlenbein, Th. Schneider, and S. Streitz, "Network Programming with MUPPET", Journal of Parallel and Distributed Computing, Vol. 5, 1988, pp 641-653.