This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination. IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS


Optimizing the Software Architecture for Extensibility in Hard Real-Time Distributed Systems Qi Zhu, Yang Yang, Marco Di Natale, Eelco Scholte, and Alberto Sangiovanni-Vincentelli

Invited Paper

Abstract—We consider a set of control tasks that must be executed on distributed platforms so that end-to-end latencies are within deadlines. We investigate how to allocate tasks to nodes, pack signals into messages, allocate messages to buses, and assign priorities to tasks and messages, so that the design is extensible and robust with respect to changes in task requirements. We adopt a notion of extensibility metric that measures how much the execution times of tasks can be increased without violating end-to-end deadlines. We optimize the task and message design with respect to this metric by adopting a mathematical programming front-end followed by postprocessing heuristics. The proposed algorithm, as applied to industrial-strength test cases, shows its effectiveness in optimizing extensibility and a marked improvement in running time with respect to an approach based on randomized optimization.

Index Terms—Design space exploration, distributed system, extensibility, platform-based design, real-time.

I. INTRODUCTION

Manuscript received November 17, 2009; revised April 01, 2010 and May 31, 2010; accepted June 10, 2010. Paper no. TII-09-11-0335. Q. Zhu is with the Strategic CAD Laboratories, Intel Corporation, Hillsboro, OR 97124 USA (e-mail: [email protected]). Y. Yang and A. Sangiovanni-Vincentelli are with the EECS Department, University of California at Berkeley, Berkeley, CA 94720 USA (e-mail: [email protected]; [email protected]). M. Di Natale is with the Scuola Superiore S. Anna, Pisa 56124, Italy (e-mail: [email protected]). E. Scholte is with the United Technologies Research Center, East Hartford, CT 06118-1127 USA (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TII.2010.2053938

Many complex distributed embedded systems with time and (possibly) reliability constraints are today the result of integrating components or subsystems provided by suppliers in large quantities. Examples can be found in the automotive domain, where production quantities for subsystems are in the range of hundreds of thousands, but are not limited to it: many avionics systems share similar characteristics. Home automation systems (HVAC controls, fire and security) and other control systems (for example, elevators and industrial refrigeration systems) also share a similar architecture framework and similar supply-chain models. All these systems are characterized by the need for careful design and deployment of system-level functions, given the need to satisfy real-time constraints and to cope with tight resource budgets imposed by cost constraints. Also, because of the large production quantities and the complexity of the supply chain, these

systems are characterized by a lifetime of the architecture platform that easily extends over a five-year horizon. Such platforms must therefore accommodate function updates or additions for new features or error fixes over a multiyear product lifetime. In this case, being able to upgrade or adjust the software design incrementally, without undergoing a major redesign cycle, is imperative for competitive advantage. Any major change in the software or hardware architecture that requires the replacement of one or more subsystems means huge losses because of the large quantities involved and the backlogs in the production of these units.

We address the problem of defining the initial design solution so that it is as robust as possible with respect to modifications of existing tasks. To do so, we adopt a robustness measure, or extensibility metric, and then develop an efficient algorithm that optimizes it. In this paper, we focus on hard real-time distributed systems that collect data from a set of sensors, perform computations in a distributed fashion, and, based on the results, send commands to a set of actuators. Extensibility is defined as the amount by which the execution time of tasks can be increased without changing the system configuration while meeting the deadline constraints (as in [9]). With this definition, a design that is optimized for extensibility not only allows adding future functionality with minimum changes, but is also more robust with respect to errors in the estimation of task execution times.

We consider systems based on priority-based scheduling of periodic tasks and messages. Each input datum (generated by a sensor, for instance) is available at one of the system's computational nodes. A periodically activated task on this node reads the input data, computes intermediate results, and writes them to the output buffer, from where they can be read by another task or used for assembling the data content of a message.
Messages—also periodically activated—transmit the data from the output buffer on the current node over the bus to an input buffer on a remote node. Local clocks on different nodes are not synchronized. Tasks may have multiple fan-ins and messages can be multicast. Eventually, task outputs are sent to the system’s output devices or actuators. The extensibility optimization problem can be considered as part of the mapping stage in the Platform-Based Design (PBD) [20] design flow, where the functionality of the design (what the system is supposed to do) and its architecture (how the system does it) are captured separately, and then “joined” together, i.e., the functionality is “mapped” onto the architecture. In the application, function blocks communicate through signals, which represent the data dependencies. The architectural description is a topology of computational nodes connected by buses. In this



paper, buses and nodes can have different transmission and computation speeds. Mapping allocates functional blocks to tasks and tasks to nodes. Correspondingly, signals can be mapped into local communication or packed into messages that are exchanged over the buses. Task and message priorities are assigned, and the mapping is performed in such a way that the end-to-end latency constraints are satisfied in the worst case. Task allocation, signal-to-message packing, message allocation, and priority assignment are the design variables considered in this paper; they are chosen with the objective of optimizing task extensibility.

The literature on extensibility is rich. Sensitivity analysis was studied for priority-scheduled distributed systems in [18], with respect to end-to-end deadlines. The evaluation of extensibility with respect to changes in task execution times, when the system is characterized by end-to-end deadlines, was studied in [21]. The notion of robustness under reduced system load was defined and analyzed in [15], for both preemptive and nonpreemptive systems. That paper highlights possible anomalies (increased response times for shorter task execution times) that would make the evaluation of extensibility quite complex. These papers do not explicitly address system optimization: task allocation, the definition of priorities, and the message configuration are assumed as given. It is also worth mentioning that time anomalies such as those in [15], and others described in several other papers on multiprocessor and distributed scheduling, do not occur for our scheduling and information propagation model. This is because we assume local scheduling by preemption, the passing of information by periodic sampling, and the periodic (not event-based) activation of each task and message. This decouples the scheduling of each task and message from its predecessors and successors, as well as from the scheduling on other resources, and avoids anomalies.
The problem of optimally packing periodic signals into CAN frames, when signals have deadlines and the optimization metric is the minimization of bus utilization, was proven to be NP-hard in [19], where a heuristic solution was provided. Because of its low cost and high reliability, the CAN bus is indeed a quite popular solution for automotive systems; it is also used in aeronautics systems as an auxiliary sensor and actuator bus, as well as in home automation, refrigeration systems, and elevators. A similar problem for the FlexRay time-triggered bus is discussed in [13]. For distributed systems with end-to-end deadlines, the optimization problem was partially addressed in [18], where genetic algorithms were used for optimizing priority and period assignments with respect to a number of constraints, including end-to-end deadlines and jitter. In [6], an algorithm based on geometric programming was proposed for optimizing task and message periods in distributed systems, later extended in [22] to jointly optimize priority assignments and task and message allocations. Both works only consider single-bus systems. In [17], a heuristics-based design optimization algorithm for mixed time-triggered and event-triggered systems was proposed. The algorithm, however, assumed that nodes are synchronized. In [14], a SAT-based approach for task and message placement was proposed. The method provided optimal solutions to the placement and priority assignment problems. However, it did not consider signal packing. In [3], task allocation and priority assignment were


defined with the purpose of optimizing extensibility with respect to changes in task computation times. The proposed solution was based on simulated annealing, and the maximum amount of change that can be tolerated in the task execution times without missing end-to-end deadlines was computed by scaling all task times by a constant factor. Also, a model of event-based activation for tasks and messages was assumed. In [9], [11], and [10], a generalized definition of extensibility over multiple dimensions (including changes in the execution times of tasks, as in our paper, but also period speedups and possibly other metrics) was presented, and a randomized optimization procedure based on a genetic algorithm was proposed to solve the optimization problem. These papers focus on multiparameter Pareto optimization and on how to discriminate among the set of optimal solutions. The main limitation of this approach is the complexity and expected running time of the genetic optimization algorithm. In addition, randomized optimization algorithms are difficult to control and give no guarantee on the quality of the obtained solution. Indeed, in the cited papers, the use of genetic optimization is only demonstrated on small sample cases. In [11], the experiments show the optimization of a sample system with nine tasks and six messages. The search space consists of the priority assignments on all processors and on the interconnecting bus. Hence, task allocation (possibly the most complex step) and signal-to-message packing are not subject to optimization. Yet, a complete robustness optimization takes approximately 900 and 3000 s for the two-dimensional and three-dimensional cases, respectively. In general, the computation time required by randomized optimization approaches for large and complex problems may easily become an issue. In [9], a larger set of 20 tasks and messages is considered, but again, only priority assignment is subject to optimization.
These results, albeit important in their own right, exhibit running times that are clearly out of the question for effective design space exploration. This observation motivated us to develop a two-stage "deterministic" algorithm whose running times are over an order of magnitude faster than those proposed so far in the literature. The first stage of the algorithm is based on mixed integer linear programming (MILP), where task allocation (the most important variable with respect to extensibility) is optimized within deadline and utilization constraints. The second stage features three heuristic steps, which iteratively pack signals into messages, assign priorities to tasks and messages, and explore task reallocation. This algorithm runs much faster than randomized optimization approaches (an order-of-magnitude reduction with respect to simulated annealing in our case studies). Hence, it is applicable to industrial-size systems, as shown by the experimental case studies, which address the typical case of the deployment of additional functionality in a commercial car.

The first two case studies consist of a set of active-safety functions deployed on two candidate vehicle architectures, with 9 ECUs, 41 tasks, and 83 signals. In the first architecture option, all ECUs are connected to a single bus. In the second, two buses are used, connected by a gateway ECU. In both cases, optimization takes less than 1800 s, compared to more than 12 h needed by the randomized optimization method, with results of comparable quality. The third test case is a safety-critical distributed control system deployed within a small truck. The key features of this system are the integration of slow and very fast (power


electronics) control loops using the same communication network. In this example, we are interested in redesigning an existing system to understand the effects of adding computational resources. The shorter running time of the proposed algorithm allows using the method not only for the optimization of a given system configuration, but also for architecture exploration, where the number of system configurations to be evaluated and optimized can be large. A further advantage of an MILP formulation (even if used only for the first stage), with respect to randomized optimization, is the possibility of leveraging mature solver technology: the capability of detecting the actual optimum (when found in reasonable time) or, when the running time is excessive, of computing at any time a lower bound on the cost of the optimum solution, which allows evaluating the quality of the best solution obtained up to that point.

II. REPRESENTATION

The application is represented as a directed graph $\mathcal{G} = (\mathcal{T}, \mathcal{S})$. $\mathcal{T}$ is the set of tasks that perform the computations, and $\mathcal{S}$ is the set of signals that are exchanged between task pairs. $src(\sigma)$ and $dst(\sigma)$ denote the source task and the set of destination tasks of signal $\sigma$, respectively (communication is of multicast type). The application is mapped onto an architecture that consists of a set of computational nodes $\mathcal{N}$ connected through a set of CAN buses $\mathcal{B}$. Each task $\tau_i$ is periodically activated with period $T_i$ and executed with priority $\pi_i$. The periods of communicating tasks are assumed to be harmonic, which is almost always true in practical designs. Tasks are scheduled with preemption according to their priorities, and a total order exists among the task priorities on each node. Computational nodes can be heterogeneous, and tasks can have different execution times on different nodes. We use $C_{i,n}$ to denote the execution time of task $\tau_i$ on node $n$.
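The harmonic-period assumption can be checked mechanically. A minimal sketch (the function name and the integer-period assumption are ours, not from the paper):

```python
def harmonic(periods):
    """True if the periods are pairwise harmonically related, i.e.,
    when sorted, each period exactly divides the next one (so any
    two periods in the set divide one another)."""
    ps = sorted(periods)
    return all(ps[i + 1] % ps[i] == 0 for i in range(len(ps) - 1))
```

For example, periods {2, 4, 8} are harmonic, while {2, 3} are not.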
In the following, the node subscript is dropped whenever the formula refers to tasks on a given node that is implicitly defined, or when the task allocation is (at least temporarily) defined and the node to which the computation time $C_i$ (or its extensibility $\Delta C_i$) refers is known. Finally, $r_i$ denotes the worst-case response time.

For a signal, the computational nodes to which the source and the destination tasks are allocated are called source and destination nodes, respectively. If the source node is the same as all the destination nodes, the signal is local. Otherwise, it is global and must be packed into a message transmitted on the buses between the source node and all its destination nodes. Only signals with the same period, the same source node, and the same communication bus can be packed into the same message. For a message $m$, $T_m$ denotes its period, $\pi_m$ denotes its priority, and $C_m$ denotes its worst-case transmission time on a bus with unit speed. The worst-case transmission time on bus $b$ is $C_m / s_b$, where $s_b$ is the transmission speed of $b$. $r_m$ denotes the worst-case response time of the message. In addition, in complex systems the source and destination tasks may not reside on computation nodes that share the same bus. In this case, a signal exchanged between them will have to go through a gateway node and be forwarded by a gateway task.


We include the gateway concept in our model with a number of restrictive (but hopefully realistic) assumptions.
• Any communication between two computation nodes never needs more than one gateway hop (every bus is connected to all the others through one gateway computation node). This assumption, realistic for small systems, could probably be removed at the price of additional complexity.
• A single gateway node connects any two buses.
• A single task is responsible for signal forwarding on each gateway node. This task is fully determined (there might be other tasks running on the gateway node).

A path on the application graph is an ordered interleaving sequence of tasks and signals, defined as $p = (\tau_1, \sigma_1, \tau_2, \ldots, \sigma_{n-1}, \tau_n)$. $\tau_1$ is the path's source and $\tau_n$ is its sink. Sources are activated by external events, while sinks activate actuators. Multiple paths may exist between each source-sink pair. The worst-case end-to-end latency incurred when traveling a path $p$ is denoted as $l_p$. The path deadline for $p$, denoted by $d_p$, is an application requirement that may be imposed on selected paths.

It may be argued that today it is industrial practice (at least in the automotive domain) to allocate resources to suppliers before implementation parameters (such as worst-case execution times) are known. This is, however, only partly true. A major architecture and functional redesign is often characterized by a significant reuse (or carryover) of preexisting functionality (60% to 70% are typical figures), for which these estimates could be available. In addition, rapid prototyping techniques and the increased use of automatic code generation tools should ease the availability of these implementation-related parameters (or at least estimates) even for newly designed functions.

A. Design Space and Extensibility Metric

The design problem can be defined as follows.
Given a set of design constraints including:
• end-to-end deadlines on selected paths;
• utilization bounds on nodes and buses;
• maximum message sizes;
explore the design space that includes:
• allocation of tasks to computational nodes;
• packing of signals into messages and allocation of messages to buses;
• assignment of priorities to tasks and messages
to maximize task extensibility.

Different definitions can be provided for task extensibility. The main definition used in this paper is the weighted sum of each task's execution-time slack over its period

$$E = \sum_{\tau_i} w_i \frac{\Delta C_i}{T_i} \qquad (1)$$

where a task's execution-time slack $\Delta C_i$ is defined as the maximum possible increase of its execution time $C_i$ without violating the design constraints, assuming the execution times of the other tasks are unchanged. $w_i$ is a preassigned weight that indicates how likely and how much the task's execution time will be increased in future functionality extensions. In practice, however, because of functional dependencies, execution time increases in tasks belonging to a



set might need to be considered jointly. This can be done in several ways. One possibility is in the assignment of the weights, as follows.
1) Identify a set of update scenarios. Each scenario $s$ includes a group of tasks to be extended and is assigned a likelihood probability $p_s$.
2) For each update scenario $s$ and each task $\tau_i$ in it, assign a weight $w_{s,i}$ to represent how much the task's execution time will be increased in this scenario.
3) The final weight of a task is computed as $w_i = \sum_s p_s\, w_{s,i}$.

A more explicit way is to identify groups of tasks that are functionally related, so that their execution-time increases are related (in a way expressed by a simple linear formulation). We identify a set of task groups (each representing an update scenario). Execution times of tasks belonging to the same group are bound to increase together in each update scenario. For each task $\tau_i$, we model the possible additional execution time as

$$\Delta C_i = k_i\, \epsilon_g \qquad (2)$$

where $k_i$ is a constant and $\epsilon_g$ is a common extension variable for the group $g$ to which $\tau_i$ belongs. Equation (2) represents a simple extensibility dependency among tasks belonging to a functional group (more complex relationships can be represented at the price of higher complexity). Based on (2), we define this alternative extensibility metric as follows:

$$E' = \sum_{\tau_i} w_i \frac{k_i\, \epsilon_{g(i)}}{T_i} \qquad (3)$$

Finally, another formulation is to use the execution-time slack over the original execution time, i.e., $\Delta C_i / C_i$, instead of the slack over the period as in (1). The metric in (1) is used in the following discussion of the optimization algorithm. The changes required for adopting the metric in (3) are discussed in Section III-F. The metric that uses slack times relative to the original execution time is discussed in Section VI-A5.
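The scenario-based weighting steps and metric (1) can be sketched together. The data layouts and function names below are our assumptions (each scenario is a probability plus a task-to-weight map; tasks are keyed by id):

```python
def final_weights(scenarios):
    """Steps 1)-3): combine update scenarios into per-task weights,
    w_i = sum over scenarios s of p_s * w_{s,i}.
    scenarios: list of (probability, {task_id: scenario_weight})."""
    weights = {}
    for p_s, task_weights in scenarios:
        for task, w in task_weights.items():
            weights[task] = weights.get(task, 0.0) + p_s * w
    return weights

def extensibility(slack, period, weight):
    """Metric (1): weighted sum of execution-time slack over period,
    with slack, period, and weight keyed by the same task ids."""
    return sum(weight[t] * slack[t] / period[t] for t in slack)
```

For instance, a task appearing in two scenarios accumulates probability-weighted contributions from both before entering the metric.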

B. End-to-End Latency

After tasks are allocated, some signals are local, and their transmission time is assumed to be zero. Others are global and need to be transmitted on the buses through messages. The time needed to transmit a global signal is equal to the transmission time of the corresponding message. Let $r_\sigma$ denote the worst-case response time of a global signal $\sigma$, and let $m$ be its corresponding message; then $r_\sigma = r_m$. The worst-case end-to-end latency can be computed for each path by adding the worst-case response times of all the tasks and global signals on the path, as well as the periods of all the global signals and their destination tasks on the path

$$l_p = \sum_{\tau_i \in p} r_i + \sum_{\sigma_j \in p \cap GS} \left( r_{\sigma_j} + T_{\sigma_j} + T_{dst_p(\sigma_j)} \right) \qquad (4)$$

where GS is the set of all global signals and $dst_p(\sigma_j)$ denotes the destination task of $\sigma_j$ along path $p$. Of course, in the case that gateways are used across buses, the signals to and from the gateway tasks, as well as the response time of the gateway task itself and the associated sampling delays, must be included in the analysis.

We need to include the periods of global signals and their destination tasks because of the asynchronous sampling of communication data. In the worst case, the input global signal arrives immediately after the activation of an instance of the destination task $\tau_i$. The data will be read by the task on its next instance, and the result will be produced after its worst-case response time, that is, $T_i + r_i$ time units after the arrival of the input signal. The same reasoning applies to the execution of all tasks that are destinations of global signals, and to the global signals themselves. For local signals, however, the destination task can be activated with a phase (offset) equal to the worst-case response time of the source task, under our assumption that their periods are harmonic. In this case, we only need to add the response time of the destination task. When a task has more than one local predecessor on a time-critical path, its activation phase (offset) will have to be set to the largest among the completion times of its predecessors. This case could be dealt with, in (4) and in the following formulations, by replacing the response-time contribution of local sender tasks with the maximum among the response times of all the senders for a given task in the path. Similarly, it is sometimes possible to synchronize the queueing of a message for transmission with the execution of the source tasks of the signals packed in that message. This would reduce the worst-case sampling delay for the message transmission and decrease the latency in (4). In this work, we do not consider these possible optimizations and leave them to future extensions.
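The latency computation in (4) can be sketched as a walk over a path. The data layout (a list alternating task and signal ids, with dicts for response times and periods) is our assumption; local signals are expected to carry a zero response time:

```python
def path_latency(path, resp, period, is_global):
    """Worst-case end-to-end latency of a path, in the spirit of (4):
    sum the response times of all tasks and signals on the path, and
    add one period of each global signal and of its destination task
    to account for asynchronous sampling."""
    latency = 0.0
    for i, obj in enumerate(path):
        latency += resp[obj]
        if i % 2 == 1 and is_global[obj]:   # odd positions are signals
            latency += period[obj]          # sampling delay at the message
            latency += period[path[i + 1]]  # sampling delay at the reader
    return latency
```

For a global signal, both the message period and the destination-task period enter the sum; for a local signal, only the response times contribute.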
C. Response Time Analysis

Computing end-to-end latencies requires the computation of task and message response times (signal response times are equal to the response times of the corresponding messages). The analysis in this section summarizes work from [7] and [12].

1) Task Response Times: In a system with preemption and priority-based scheduling, the worst-case response time $r_i$ of a task depends on its computation time $C_i$, as well as on the interference from higher-priority tasks on the same node. In the case of $r_i \le T_i$, it can be calculated using the following recurrence:

$$r_i^{(k+1)} = C_i + \sum_{\tau_j \in hp(i)} \left\lceil \frac{r_i^{(k)}}{T_j} \right\rceil C_j \qquad (5)$$

where $hp(i)$ refers to the set of higher-priority tasks on the same node.

2) Message Response Times: Worst-case message response times are calculated similarly to task response times. The main difference is that message transmissions on the CAN bus are not preemptable. Therefore, a message $m$ may have to wait for a blocking time $B_m$, which is the longest transmission time of any frame in the system. Likewise, the message itself is not subject to preemption from higher-priority messages during its own transmission time $C_m$. The response time can therefore be



calculated with the following recurrence relation, in the case of $r_m \le T_m$:

$$w_m^{(k+1)} = B_m + \sum_{m_j \in hp(m)} \left\lceil \frac{w_m^{(k)}}{T_j} \right\rceil C_j, \qquad r_m = w_m + C_m \qquad (6)$$

where $hp(m)$ is the set of higher-priority messages on the same bus and $w_m$ is the worst-case queueing delay of the message.
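Both recurrences (5) and (6) are fixed-point iterations and can be sketched directly. The function and parameter names are ours; higher-priority sets are passed as lists of (C, T) pairs:

```python
import math

def task_response_time(C, higher_prio, max_iter=1000):
    """Worst-case task response time: fixed point of recurrence (5),
    r = C + sum over hp(i) of ceil(r / T_j) * C_j."""
    r = C
    for _ in range(max_iter):
        r_next = C + sum(math.ceil(r / Tj) * Cj for Cj, Tj in higher_prio)
        if r_next == r:
            return r
        r = r_next
    raise RuntimeError("no convergence: task set likely unschedulable")

def msg_response_time(C, B, higher_prio, max_iter=1000):
    """Worst-case CAN message response time, in the spirit of (6):
    transmission is non-preemptable, so interference is suffered only
    while queued. Queueing delay w = B + sum ceil(w / T_j) * C_j,
    then r = w + C."""
    w = B
    for _ in range(max_iter):
        w_next = B + sum(math.ceil(w / Tj) * Cj for Cj, Tj in higher_prio)
        if w_next == w:
            return w + C
        w = w_next
    raise RuntimeError("no convergence: bus likely overloaded")
```

The iteration starts from the task's own computation time (or the blocking time, for messages) and grows monotonically until it stabilizes or a bound is exceeded.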

D. Formulation

Based on the formulas for computing end-to-end latencies and response times, we construct a mathematical formulation that contains all the design variables. Part of the formulation is similar to the one in [22]: both explore the same set of design variables—task allocation, signal packing and message allocation, as well as task and message priorities. In [22], the problem was formulated as a mixed integer linear program (MILP). To reduce the complexity, the problem was divided into subproblems and solved with a two-step approach. However, in [22] the objective is to minimize end-to-end latencies, while in this work we optimize task extensibility.

Formulating task extensibility with respect to end-to-end deadline constraints is quite challenging. In general, inverting the function that computes response times as a function of the task execution times has exponential complexity even in the simple case of single-CPU scheduling [4]. When dealing with end-to-end constraints, the problem is definitely more complex. A possible approach consists of a very simple (but possibly time-expensive) bisection algorithm that finds the sensitivity of end-to-end response times with respect to increases in task execution times (this is the solution used for performing sensitivity analysis in [18]). Formally, if $\Delta r_j$ denotes the increase of task $\tau_j$'s response time when task $\tau_i$'s computation time is increased by $\Delta C_i$, the end-to-end latency constraints and utilization constraints are expressed as follows: (7) (8)

where $lp(i)$ refers to the set of tasks with priority lower than $\tau_i$ and executed on the same node, $\mathcal{T}_n$ denotes the set of tasks on computational node $n$, and $U_n$ denotes the maximum utilization allowed on $n$. The relation between $\Delta r_j$ and $\Delta C_i$ can be derived from (5), as shown in (9) and (10).
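The bisection search mentioned above can be sketched generically. Here `feasible(delta)` stands for a full schedulability check (response times plus end-to-end deadline constraints) with one task's execution time increased by `delta`; the helper name and the monotonicity assumption are ours:

```python
def slack_by_bisection(feasible, hi=1000.0, tol=1e-6):
    """Largest execution-time increase delta for which feasible(delta)
    still holds. Assumes feasibility is monotone in delta: once the
    system becomes infeasible, it stays infeasible."""
    lo = 0.0
    if not feasible(lo):
        return 0.0                 # already infeasible with no increase
    while feasible(hi):
        hi *= 2.0                  # expand until an infeasible upper bound
        if hi > 1e12:
            return float("inf")    # effectively unbounded slack
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if feasible(mid):
            lo = mid
        else:
            hi = mid
    return lo
```

Each probe of `feasible` re-runs the response-time analysis, which is why this approach, while simple, can be time-expensive inside an optimization loop.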

Fig. 1. Algorithm flow for task extensibility optimization.

For brevity, the above formulas do not model task allocation and priority assignment as variables. In the complete formulation, they are expanded to include those variables. Contrary to the problem in [22], in our case the formulation cannot be linearized because of the second term in (9). It could be solved by nonlinear solvers, but the complexity is in general too high for industrial-size applications. Therefore, we propose a two-stage algorithm that decomposes the complexity: a first stage in which mathematical programming (MILP) is used, and a refinement stage that consists of several heuristic steps.

III. OPTIMIZATION ALGORITHM

The flow of our algorithm is shown in Fig. 1. First, we decide the allocation of tasks, since the choices of the other design variables are restricted by task allocation. In the initial allocation stage, the problem is formulated as an MILP and solved by an MILP solver. Then a series of heuristics is used in the refinement stage: in the signal packing and message allocation step, a heuristic decides signal-to-message packing and message-to-bus allocation; in the task and message priority assignment step, an iterative method assigns the priorities of tasks and messages. After these steps are completed, if the design constraints cannot be satisfied, or if we want to further improve extensibility, the tasks can be reallocated and the process repeated. Because of the complexity of the MILP formulation, we designed a heuristic for task reallocation, based on the extensibility and end-to-end latency values obtained in the previous steps.

A. Initial Task Allocation

(9) (10)

In the initial task allocation stage, tasks are mapped onto nodes while meeting the utilization and end-to-end latency constraints. Utilization constraints are considered in place of the true extensibility metric to allow a linear formulation. In this stage, we also allocate signals to messages and buses, assuming each message contains only one signal. The initial task and



message priority assignment is assumed as given. In case the procedure is used to optimize an existing configuration, priorities are already defined. In case of new designs, any suitable policy, such as Rate Monotonic, can be used. The MILP problem formulation includes the following variables and constraints: 1) Allocation Constraints: (11) (12) (13) (14) (15) (16) (17)

(18) (19)

(20) (21) (22) (23)

(24) (25) (26) (27) (28) (29) (30)

Gatewaying of signals requires additional definitions and a modification of the signal set to accommodate the replicated signals that are sent by the gateway tasks. Gateway tasks are preallocated, with known period and with priority subject to optimization. For each signal in the task communication model, with source task $\tau_s$ and destination $\tau_d$, we use $\sigma_{s,d}$ to represent the signal originating from the source task and directed to the destination or (if needed) to the gateway task with final destination $\tau_d$. In addition, for each possible gateway $g$, there is an additional possible signal, labeled $\sigma_{g,d}$, representing the signal from the gateway task to the destination $\tau_d$ (allocated on a computation node that can be reached through gateway $g$, Fig. 2). In case the source task and the destination task are on the same node, the signal $\sigma_{s,d}$ and all the $\sigma_{g,d}$ may be disregarded, since they do not contribute to the latency and they will not need

Fig. 2. Signal forwarding by the gateways.

a message to be transmitted. In case the source and destination task are connected by a single bus, $\sigma_{s,d}$ represents the signal between them and all the $\sigma_{g,d}$ should be disregarded (accomplished by treating them as local signals). For each signal there is one set of signals $\sigma_{s,d}$, with as many elements as the number of receivers, and one set of signals $\sigma_{g,d}$, with cardinality equal to the product of the number of possible gateways by the number of receivers. All gateway signals have the same period and data length as the signals from which they originate.

$\mathcal{N}_i$ is the set of nodes that task $\tau_i$ can be allocated to. $\mathcal{B}_n$ represents the set of buses to which node $n$ is connected. The Boolean variable $a_{i,n}$ indicates whether task $\tau_i$ is mapped onto node $n$, and $a_{ij,n}$ defines whether $\tau_i$ and $\tau_j$ are both on node $n$. Constraint (11) ensures that each task is mapped to one and only one node, and the set of constraints (12)-(14) ensures the consistency of the definitions of the $a_{i,n}$ and $a_{ij,n}$ variables.

The Boolean variable $y_{\sigma,b}$ is 1 if signal $\sigma$ is mapped onto bus $b$ and 0 otherwise; a similar variable is defined for the gateway signals $\sigma_{g,d}$. To define these sets of variables, we need to consider all computation-node pairs for each signal, from its source to all its destinations. In the following, for simplicity, we label $\tau_s$ as the source task of signal $\sigma$. The set of constraints defined by (15), for all possible tuples (source task, destination task, source node on bus $b$, destination node that communicates through $b$), forces $y_{\sigma,b}$ to 1 for the bus from the source to the destination, or from the source to the gateway task between them. The following set (16) sets the corresponding gateway-signal variable to 1 (if necessary) for the bus from the gateway to the destination node, when gatewaying is needed (in this set of constraints, the gateway is on the bus while the destination is not). These variables have a positive contribution to the cost function; hence they will be set to 0 by the optimization engine, unless forced to 1 by the constraints.

To give an example of these constraints, in Fig. 2 the condition for the outgoing signal from the source node to be on the first bus is expressed as a set of constraints defined for each relevant computation-node pair. Similar sets of conditions then need to be defined for the other destinations and buses. As an example of gatewaying, the condition for the mapping of the (possible) signal forwarded by gateway $g$ as part

a condition that needs to be defined for each computation node pair consisting of the source node and a destination (or gateway) node reachable on that bus. Similar sets of conditions are then defined for the other buses and for the gateway signals. As an example of gatewaying, the condition for the mapping of the (possible) signal forwarded by the gateway as part

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination. ZHU et al.: OPTIMIZING THE SOFTWARE ARCHITECTURE FOR EXTENSIBILITY IN HARD REAL-TIME DISTRIBUTED SYSTEMS

of the communication from the source to the destination in the figure is expressed by the set of constraints

defined for all the computation node pairs connected through the gateway.

The value of another Boolean variable is 1 if a signal is global (i.e., transferred on a bus) and 0 otherwise; similarly, a corresponding variable is 1 if the signal forwarded by a gateway is global. The definitions of these variables are provided by constraints (17)–(18) and (19)–(20), respectively. Finally, a signal must be marked global if at least one of its instances is, as in constraint (21). A further Boolean variable is 1 if a signal needs to be transmitted on a given bus (needs a message on it) and 0 otherwise; constraints (22)–(24) encode these conditions. Two more sets of Boolean variables define whether two signals share the same bus; constraints (25)–(30) enforce consistency in their definition with respect to the signal-to-bus allocation variables.

2) Utilization Constraints: The following constraints enforce the utilization bounds on all nodes and buses, considering the load of the current tasks (the summation on the left-hand side of (31)) and the additional load caused by extensions of the execution times (also on the left-hand side of the equation). The utilization bounds on each computational node and bus are given parameters. The additional load caused by the extension must be considered only if the task is allocated to the node for which the bound is computed. This is represented by using an additional variable and the typical "big M" formulation used in MILP programming for conditional constraints, where M is a large constant. In our formulation, tasks can have different execution times depending on their allocation, so the worst-case execution time of each task is defined per node. Also, buses can have different speeds: a parameter denotes the transmission time of the message that carries a signal on a bus with unit speed. At this stage, we assume each message contains only one signal, so the transmission time of that message on a given bus is the unit-speed time divided by the bus speed.

(31) (32) (33) (34) (35)

3) End-to-End Latency Constraints: (36)

(37)

7

(38) (39) (40) (41) (42) (43)

(44) (45) (46) (47) (48) (49)

(50) (51) (52) (53) (54) (55)
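Since the "big M" device recurs in both the utilization and the latency constraints, it is worth spelling out the standard pattern in generic notation (ours, not the paper's): to linearize the product $z = a \cdot x$ of a binary variable $a$ and a continuous $x \in [0, X]$, one writes

```latex
z \le x, \qquad
z \le M\,a, \qquad
z \ge x - M\,(1 - a), \qquad
z \ge 0,
```

with $M \ge X$. When $a = 0$, the second and fourth inequalities force $z = 0$; when $a = 1$, the first and third force $z = x$. The conditional utilization and interference terms above follow this general scheme.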

Latency constraints are derived from (4), (5), and (6). Equation (4) shows the calculation of the end-to-end latency for a path; for each signal on the path, we know its destination task. If two tasks are on computation nodes connected to different buses, they communicate through a gateway task, and the corresponding additional latencies need to be considered. The calculation of end-to-end latency is shown in constraint (37). We assume the response time of a gateway task equals its execution time (i.e., a gateway task has the highest priority on its node). The response time of each task is defined by constraints (38)–(43); an integer term represents the number of possible interferences from each higher priority task, and a parameter denotes whether one task has higher priority than another. A large constant is used to linearize the relation that defines the number of actual interferences by higher priority tasks, similarly as in the utilization constraints. Constraint (49) enforces that a task's response time is no larger than its period, which is the assumption for our response-time calculation in (5). The response times of the messages carrying the signals and the gateway-forwarded signals are defined by constraints (44)–(45). If signals are local, the corresponding response times will be 0.
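As a concrete illustration of the response-time recursion behind constraints (38)–(43) (the classic fixed-point iteration of (5); the `Task` record and the task values below are our own examples, not from the paper):

```python
import math
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    wcet: float    # worst-case execution time
    period: float  # period (and implicit deadline)

def response_time(task, higher_prio, max_iter=100):
    """Fixed-point iteration r = C + sum(ceil(r / T_hp) * C_hp) over
    the higher-priority tasks on the same node; returns None when r
    grows past the period, mirroring the schedulability assumption."""
    r = task.wcet
    for _ in range(max_iter):
        interference = sum(math.ceil(r / hp.period) * hp.wcet
                           for hp in higher_prio)
        new_r = task.wcet + interference
        if new_r == r:
            return r
        if new_r > task.period:
            return None  # violates the r <= T assumption of (49)
        r = new_r
    return None

# Three hypothetical tasks on one node, priority order t1 > t2 > t3.
t1 = Task("t1", 1.0, 5.0)
t2 = Task("t2", 2.0, 10.0)
t3 = Task("t3", 3.0, 20.0)
```

The `None` return mirrors constraint (49): a task whose recursion exceeds its period is treated as unschedulable.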


The remaining interference terms are similarly defined as in the task response-time constraints. Constraints (49) and (55) enforce the assumption that task and message response times do not exceed their periods.

4) Objective Function:

(56)

We recall here the objective function in (1), which represents the task extensibility. An alternative objective function can also include the optimization of latency, as shown in (57), where a parameter is used to explore the trade-off between task extensibility and latencies. The special case of this parameter set to zero is the original objective function (56).
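For intuition, the two objectives can be sketched numerically. The function names and the trade-off form `extensibility - K * latency` below are our reading of (56)–(57), hedged since the displayed formulas are not reproduced in this extraction:

```python
def task_extensibility(tasks):
    """Objective (56)/(1): weighted sum of execution-time slack over
    period; each entry is (weight, slack, period). Names are ours."""
    return sum(w * slack / period for (w, slack, period) in tasks)

def combined_objective(tasks, total_latency, K):
    """Trade-off objective in the spirit of (57): extensibility
    penalized by K times the total path latency (qualitative form;
    the paper's exact normalization is not shown here)."""
    return task_extensibility(tasks) - K * total_latency

# Two hypothetical tasks, weights set to 1 as in the case studies.
tasks = [(1.0, 2.0, 10.0), (1.0, 1.0, 5.0)]
```

With `K = 0` the combined objective reduces to the pure extensibility metric, matching the special case noted above.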

Fig. 3. Iterative Priority Assignment Algorithm.

(57)

In Section VI, we report the experimental results for various values of this parameter, to show the relationship between task extensibility and path latencies.

B. Signal Packing and Message Allocation

After the allocation of tasks is chosen, we use a simple heuristic to determine the signal packing and message allocation. The steps are shown below.
1) Group the signals with the same source node and period as packing candidates.
2) Within each group, order the signals by priority, then pack them subject to the message size constraints (priorities are assumed given, from an existing configuration or some suitable policy, as in the initial task allocation). The priority of a message is set to the highest priority of the signals that are mapped into it.
3) Assign a weight to each message based on its priority, transmission time, and period. In our algorithm, the weight is a function of the message priority, its transmission time on a bus with unit speed, and its period, with two constant exponents whose values are tuned in case studies (both set to 1 in our experiments). When multiple buses are available between the source and destination nodes, we allocate messages to buses according to their weights: messages with larger weights are assigned first to faster buses.
Other, more sophisticated heuristics or mathematical programming solutions have been considered. For instance, signal packing can be formulated as a MILP, as in [22]. However, preliminary experiments showed no significant improvement that can outweigh the speed of this simple strategy.

C. Priority Assignment

In this step, we assign priorities to tasks and messages, given the task allocation, signal packing, and message allocation obtained from the previous steps. This priority assignment problem is proven to be NP-complete [5], and finding an optimal solution is generally not feasible for industrial-sized problems. Therefore, we propose an iterative heuristic to solve the problem.
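Returning to the packing heuristic above, steps 1) and 2) can be sketched as follows (the signal fields and the 64-bit CAN payload limit are our assumptions for illustration):

```python
from itertools import groupby

MAX_PAYLOAD_BITS = 64  # one CAN frame payload, as in the case studies

def pack_signals(signals):
    """Steps 1) and 2): group signals by (source node, period), order
    each group by priority (lower number = higher priority, CAN-style),
    and first-fit pack under the payload limit. Field names are ours."""
    key = lambda s: (s["node"], s["period"])
    messages = []
    for (node, period), group in groupby(sorted(signals, key=key), key=key):
        current, used = [], 0
        for s in sorted(group, key=lambda s: s["priority"]):
            if current and used + s["bits"] > MAX_PAYLOAD_BITS:
                messages.append({"node": node, "period": period,
                                 "signals": current})
                current, used = [], 0
            current.append(s["name"])
            used += s["bits"]
        if current:
            messages.append({"node": node, "period": period,
                             "signals": current})
    return messages

# Hypothetical signals: two 40-bit signals from node 1 cannot share
# one 64-bit message, so they split; the 8-bit signal from node 2
# forms its own message.
example = [
    {"name": "a", "node": 1, "period": 10, "priority": 1, "bits": 40},
    {"name": "b", "node": 1, "period": 10, "priority": 2, "bits": 40},
    {"name": "c", "node": 2, "period": 10, "priority": 1, "bits": 8},
]
packed = pack_signals(example)
```

Step 3), the weight-based assignment of messages to buses, then follows by sorting the resulting messages by a weight derived from priority, transmission time, and period.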

The flow of this heuristic is shown in Fig. 3. The basic idea is to define the local deadlines of tasks and messages over iteration steps, then assign priorities based on the deadlines. Intuitively, shorter deadlines require higher priorities, and longer local deadlines can afford lower priorities. Initially, the deadlines of tasks and messages are set equal to their periods. Then, deadlines are modified, and priorities are assigned using the deadline-monotonic (DM) approach [2]. Of course, there is no guarantee that the DM policy is optimal in this case, as for any system with nonpreemptable resources (the CAN bus), but there is no optimal counterpart that can be used here, and DM is a sensible choice in the context of our heuristics. During the iterations, deadlines are changed based on task and message criticality, as shown in Algorithm 1 and explained below.

Algorithm 1. Update Local Deadline
1: initialize the criticality of every task and message to 0
2: for all tasks do
3:   compute the maximum execution-time increase allowed by the utilization constraints
4:   increase the task's execution time by that amount
5:   for all affected tasks (the task and the lower priority tasks on its node) do
6:     update the response time
7:   for all paths whose latency is changed do
8:     if the path latency exceeds its deadline then
9:       for all tasks and messages on the path do
10:        add the path's weight over its deadline to the criticality
11:  reset all execution and response times to the values before the iteration
12: normalize the criticality of all tasks and messages
13: for all tasks do
14:   compute the local deadline from the period and the normalized criticality
15: for all messages do
16:   compute the local deadline likewise

The criticality of a task or message reflects how much the response times along the paths to which it belongs are affected by extensions in the execution times of other tasks. Tasks and messages with higher criticality are assigned higher priorities. To define the criticality of a task or a message, we increase the execution time of each task by the maximum amount allowed by the utilization constraints, an upper bound of the task's execution-time slack, as shown in lines 3 and 4 of Algorithm 1. Then, the response time of that task and of the lower priority tasks on the same node is recomputed (lines 5 and 6). The criticality of each affected task or message (both generically denoted as objects) is defined by adding up, for each path whose end-to-end latency exceeds its deadline after the increase, a term equal to the path's weight over its deadline (lines 7 to 10). After repeating this operation for every task, the criticality of all tasks and messages is computed. Criticality values are normalized, obtaining a value for each task and message, and finally local deadlines are computed from the periods and the normalized criticalities (lines 11 to 16). The scaling parameter in this computation is initially set to 1, then adjusted in later iteration steps using a strategy that takes into account the number of iteration steps, the number of times the current best solution is found, and the number of times the priority assignment remains unchanged. As shown in Fig. 3, after local deadlines are updated, the stop condition for priority assignment is checked: if the number of iterations reaches its limit, or the upper bound of task extensibility is reached, the priority assignment finishes; otherwise, we keep iterating. The strategy of changing priorities based on local deadlines can also be found in [8]. Different from our algorithm, its goal is only to meet end-to-end latency constraints, so deadlines are updated based on the slack time of tasks or messages, which indicates how much the local deadlines can be increased without violating latency constraints.

D. Task Reallocation

As shown in Fig. 1, after all the design variables are decided, we calculate the value of the objective function in Formula (1) and check the stop condition for our entire algorithm.
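That outer loop (Fig. 1) can be sketched as a plain iteration; every callable here is a placeholder for the corresponding stage, not the paper's implementation:

```python
def optimize(initial_allocation, evaluate, reallocate, pack,
             assign_priorities, max_iters=30, target=None):
    """Skeleton of the outer loop in Fig. 1: pack signals, assign
    priorities, evaluate the objective, then reallocate tasks and
    repeat; stop on the iteration limit or when a target objective
    value is reached. All callables are stand-ins for the stages
    described in the text."""
    allocation = initial_allocation
    best_value, best_config = float("-inf"), None
    for _ in range(max_iters):
        config = assign_priorities(pack(allocation))
        value = evaluate(config)
        if value > best_value:
            best_value, best_config = value, config
        if target is not None and best_value >= target:
            break
        allocation = reallocate(allocation)
    return best_value, best_config
```

With trivial stand-in callables, the loop simply keeps the best configuration seen and honors the stop conditions.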
If the results are not good enough and the iteration limit has not been exceeded, we reallocate the tasks and repeat the signal packing, message allocation, and priority assignment. We could use the same MILP-based method for reallocating tasks, with additional constraints to exclude the allocations that have already been considered. However, solving the MILP is time consuming. To speed up the algorithm, we designed a local optimization heuristic that leverages the results of previous iterations for the task reallocation step in Fig. 1. The details of this heuristic are shown in Algorithm 2.

Algorithm 2. Task Reallocation
Let f(M) denote the performance function of a mapping M
1: if the current solution does not satisfy the latency constraints then
2:   increase the latency weight used in f by a constant
3: initialize the best change to none and the best gain to 0
4: for all tasks and all nodes a task is not on do
5:   compute the gain f(M') − f(M), where M is the original mapping and M' is the new mapping after moving the task to the node
6:   if the gain exceeds the best gain then
7:     record "move the task to the node" as the best change
8:     and its gain as the best gain
9: for all pairs of tasks that are not on the same node do
10:  compute the gain of switching them, defined as above
11:  if the gain exceeds the best gain then
12:    record "switch the two tasks" as the best change
13:    and its gain as the best gain
14: Execute the best change

Two operators are considered for generating new configurations: moving one task to a different node, or switching two tasks on different nodes. For each possible application of the operators on each task or task pair that satisfies the utilization constraints, we compute the corresponding change of the performance function of (57), which includes both task extensibility and end-to-end latencies. In the case of multiple buses, the changes might lead to signal forwarding through gateway tasks, and this is taken into account in the calculation. Finally, the change that provides the largest increase of the performance function is selected. The latency-weight parameter in the cost function provides the tradeoff between task extensibility and end-to-end latencies. Initially, it is set to the same value as the corresponding parameter in (57), which is used in the initial task allocation. If the current solution does not satisfy the end-to-end deadlines, we increase it by a constant to emphasize the optimization of latencies (the constant was tuned to 0.05 in our experiments).

E. Algorithm Complexity

The algorithm shown in Fig. 1 is polynomial except for the MILP-based initial task allocation, which can be regarded as a preprocessing stage, since we use heuristics for task reallocation in the following iterations. Finding the optimal initial task allocation by MILP is an NP-hard problem; in practice, we set a timeout and use the best feasible solution. For the following steps, the complexity is polynomial in the numbers of tasks, signals, computational nodes, buses, and paths. The complexity of the signal packing and message allocation stage is dominated by sorting the signals. The complexity of the priority assignment is polynomial, assuming the number of iterations in Fig.
3 is within a constant (as stated in Section III-C, there is a preset limit on the number of iterations when checking the end condition). The heuristic task reallocation stage is the dominant one. If we assume the numbers of nodes and buses are small relative to the numbers of tasks and signals, which is usually the case in practice, the complexity of the entire algorithm (excluding the MILP-based preprocessing stage) simplifies accordingly, assuming the number of iterations in Fig. 1 is within a constant.

F. Extensibility Metric for Multiple Tasks

When using the extensibility metric for task groups defined in formula (3), the optimization algorithm introduced in the previous sections needs to be modified as follows.


In the MILP formulation, the utilization constraints (31) and (32) should be modified to

(58) (59) (60) (61)

Equation (2) needs to be added to the MILP formulation. Objective function (56) should be replaced with the new objective (3), and objective (57) should be replaced with

(62)

The allocation and the end-to-end latency constraints in the MILP formulation do not change, since they are only related to the original execution times. In Algorithm 1, the criticality calculation from lines 2 to 11 needs to be adjusted as follows.

1: for all task groups do
2:   compute an upper bound of the group extension considering only the utilization constraints
3:   for all tasks in the group do
4:     increase the task's execution time accordingly
5:   for all affected tasks do
6:     update the response time
7:   for all paths whose latency is changed do
8:     if the path latency exceeds its deadline then
9:       for all tasks and messages on the path do
10:        update the criticality
11:  reset all execution and response times to the values before the iteration

For each group, we compute the upper bound (line 2), and increase the execution times of all the tasks in the group (lines 3 and 4). Then, the response times of all the affected tasks are updated (lines 5 and 6), and the criticality of each task or message is updated (lines 7 to 10). Also, in Algorithm 2, the definition of the performance function needs to be changed to reflect the change of the objective function. Finally, in the calculation of the eventual objective value, both the utilization and the end-to-end latency constraints need to be considered. A bisection algorithm is used for approximating the group extension considering all the tasks in the group (a bisection algorithm is also used for approximating the slack in the original objective function). The experimental results for this metric based on task groups are shown in Section VI-A4.

IV. A RANDOMIZED OPTIMIZATION APPROACH

An alternative solution to the extensibility optimization problem consists in the application of randomized optimization techniques, such as genetic algorithms or simulated annealing. We programmed a simulated annealing algorithm for our problem, in which new system configurations are randomly produced and evaluated with respect to the extensibility metric. New solutions are accepted if their cost is lower (the cost is the extensibility negated) and, conditionally, even if the cost is higher, with a probability that depends on the cost difference between the new and the current solution and on a temperature parameter that is slowly lowered with time.

The simulated annealing algorithm applied to our extensibility problem has the general form of Algorithm 3.

Algorithm 3. Simulated Annealing Algorithm
1: void anneal(double init_temp, double final_temp, double coolrate, int MAXNUMCHAINS, int MAXTRY, int MAXCHANGE)
2: temp ← init_temp; nchains ← 0;
3: while nchains < MAXNUMCHAINS and temp > final_temp do
4:   nsucc ← 0;
5:   for try ← 0 to MAXTRY do
6:     ctype ← SelectChangeType();
7:     ApplyChange(ctype);
8:     newvalue ← EvalSolution();
9:     deltavalue ← value − newvalue;
10:    valid ← IsChange(deltavalue, temp);
11:    if valid then
12:      nsucc ← nsucc + 1;
13:      value ← newvalue;
14:      CommitChange(ctype);
15:    else
16:      UndoChange(ctype);
17:    if nsucc > MAXCHANGE then
18:      break;
19:  temp ← temp × coolrate;
20:  nchains ← nchains + 1;

Its application to the specific problem at hand requires the definition of a transition function for generating new system configurations, as in Algorithm 3, and a solution evaluation function. The transition function is obtained by randomly applying one of three possible mutation operators.
• The first changes the priority assignment of a task or a message, without changing its allocation. The new priority level is randomly chosen, and the priority of all other tasks/messages with priority lower than or equal to the new level and sharing the same resource is lowered, to ensure that no two tasks or messages are assigned the same priority.
• The second randomly selects a task and changes the computational node to which it is allocated. The task is randomly chosen, and so is the destination node. All the signals that are input or output by the task are removed from their messages and can either become local signals in the new destination, or new messages can be purposely created for them, including messages through a gateway, if necessary, with a randomly assigned priority. Similarly, the task is randomly assigned a priority for execution on its new node.
• The third randomly selects a signal and changes the message in which it is transmitted. The signal can be assigned a new message or randomly mapped into one of the messages that have the same source node, the same period, and enough spare space in the packet data.

The transition functions are very simple: the allocation and signal packing modifications do not attempt to optimize the priority of the task at the destination node or of the newly created messages, leaving this to the first operator. This is in agreement with the philosophy of randomized optimization, where speed (and the number of evaluated solutions) is typically preferred over the quality of the mutation operators, if the latter implies higher running times. All the theoretical results on the convergence of SA methods [1] indeed only assume the existence of a transition function and a cost evaluation function, and prescribe minimum lengths for the Markov chains at each step (number of tries and number of successful transitions) and a minimum value of the cooling factor (no assumptions are made on the quality of the transition function).

The computation of the extensibility (cost) function is approximated by a simple bisection algorithm. The extensibility for each task is computed by approximating the maximum extension of the computation time that preserves the feasibility of all paths, does not increase the load of the node beyond the chosen limit value (0.7 for our experiments), and does not increase the response times of the tasks beyond their periods. The extension is initially confined to an interval from zero to the maximum computation time allowed by the utilization bound.
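The bisection just mentioned can be sketched as follows (the feasibility oracle is a stand-in for the actual path, utilization, and response-time checks):

```python
def max_feasible_delta(is_feasible, upper, iters=20):
    """Bisection over [0, upper] for the largest execution-time
    extension that keeps the configuration feasible; 20 halvings
    leave a gap of upper * 2**-20, so the returned (feasible)
    lower bound is pessimistic by at most that amount."""
    lo, hi = 0.0, upper
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if is_feasible(mid):
            lo = mid   # feasible value: raise the lower bound
        else:
            hi = mid   # infeasible: lower the upper bound
    return lo
```

Twenty halvings bound the relative pessimism by 2^-20, as noted in the surrounding text.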
Then, the algorithm iterates 20 times by tentatively assigning the task a computation time equal to the middle point of the interval and testing the feasibility conditions. If the value satisfies all the bounds, it becomes the new lower bound for the following iteration; otherwise, it is the new upper bound. After 20 iterations, the lower bound of the interval is selected as the extension, with a maximum relative pessimism of 2^-20, which is approximately 10^-6, in the evaluation of the performance function. Finally, to allow the optimization to start with nonfeasible configurations (or to possibly accept one as an intermediate step in the optimization procedure), allocations and priority assignments that result in path latencies larger than the deadlines are not immediately rejected, but are given a high cost, equal to the maximum lateness (latency minus deadline) among all the paths.

V. HEURISTIC FOR INITIAL TASK ALLOCATION

An alternative to the MILP optimization described in Fig. 1 is to use heuristics for the initial task allocation step.


Algorithm 4. Initial Task Allocation Heuristic
1: for all computational nodes do
2:   set the node's allocated-task set to empty
3: for all tasks do
4:   compute the task's value from its weight, average computation time, and period
5: (end for)
6: sort the tasks in descending order of value
7: for all tasks (in sorted order) do
8:   initialize the best node and the best objective value
9:   for all candidate nodes do
10:    compute the node's current utilization
11:    compute the utilization required by the task on the node
12:    if their sum does not violate the node's utilization bound then
13:      compute the extensibility change if the task is allocated to the node
14:      for all paths with a deadline do
15:        for all signals between the task and the tasks already on the node do
16:          if the signal is on the path then
17:            add the saved sampling delay over the path deadline to the latency term
18:        (end for)
19:      (end for)
20:      combine the two terms into the objective value for the node
21:      if the objective value exceeds the best value then
22:        record this node as the best candidate
23:  allocate the task to the best node and add it to the node's set
24: (end for)

In the greedy heuristic shown in Algorithm 4, tasks are assigned to nodes one by one. Initially, no task is allocated (lines 1 and 2; the set of allocated tasks on each node starts empty). Then, we associate a value with each task (lines 3 to 5) to define its processing order, based on its weight, computation time, and period. The intuition is to allocate first the tasks that have larger weights and require a larger share of processor time. Since a task can have different computation times on different nodes, an average value is used (line 4). Once the tasks are sorted (line 6), they are assigned to computational nodes in order (lines 7 to 24). The destination node is selected based on a heuristic that attempts at clustering for an optimal value of the metric function and the satisfaction of deadline constraints. More specifically, while assigning a task, for every node we compute its current utilization and the utilization required by the task (lines 10 and 11). If the addition of the task does not violate the utilization constraint on the node (line 12), we consider the node as an allocation candidate and compute an objective value consisting of two parts (lines 13 to 20). The first part approximates the change of extensibility if the task is allocated to the node; the approximation takes into account the utilization constraint, but not the end-to-end latency constraints. The second part of the objective value models the impact of the mapping on the path latencies: when the task is added to the node, the signals between it and the tasks already mapped on that node become local, with no sampling delays. The heuristic accounts for this by adding a contribution equal to the amount of saved sampling delays (relative to the deadlines of the paths to which


they apply, to give emphasis to paths with shorter deadlines). Gatewaying is not considered, for simplicity (and because it would be difficult to handle). Contributions from signal response times and task response times are not considered, since they are difficult to estimate without a complete allocation and are typically much smaller than the sampling periods. A parameter is used to control the tradeoff between extensibility and latencies, similar to the corresponding parameter in the MILP formulation in (57); in the case studies in Section VI, a set of values of this parameter is chosen for the experiments. A straightforward implementation of Algorithm 4 has a higher complexity; in our implementation, we optimized it by computing the sampling delay terms before the iterations of task assignment start, which reduces the complexity. This heuristic for initial task allocation is integrated with the rest of the flow in Fig. 1, and the experimental results are demonstrated in Section VI-A3.

VI. CASE STUDIES

The effectiveness of the methodology and algorithm is validated in this section with three industrial case studies. The first two focus on improving the extensibility of two automotive architecture options, whereas the third investigates the impact of additional resources on the optimality of the design of a truck control system.

A. Active Safety Vehicle

In these case studies, we apply our algorithm to an experimental vehicle that incorporates advanced active safety functions. This is the same example studied in [22]. We considered two architecture platform options with different numbers of buses. Both options consist of nine ECUs (computational nodes). In the first configuration, they are connected through a single CAN bus; in the second, by two CAN buses, with one ECU functioning as a gateway between the two buses. The transmission speed is 500 kb/s.
The vehicle supports advanced distributed functions, with end-to-end computations collecting data from 360 sensors to the actuators, consisting of the throttle, brake, and steering subsystems and of advanced HMI (human-machine interface) devices. The subsystem that we considered consists of a total of 41 tasks executed on the ECUs and 83 signals exchanged between them. Worst-case execution time (WCET) estimates have been obtained for all tasks; in the formulation shown before, the WCET of a task is distinguished per ECU. For the purpose of our algorithm evaluation, we assumed that all ECUs have the same computational power, so that the worst-case execution time of tasks does not depend on their allocation. This simplification does not affect the complexity or the running time of the optimization algorithm and is only motivated by the lack of WCET data for the tasks on all possible ECUs. The bit length of the signals is between 1 (for binary information) and 64 (a full CAN message). The utilization upper bound of each ECU and bus has been set to 70%. The task weights in (1) are set to 1 for all tasks. End-to-end deadlines are placed over 10 pairs of source-sink tasks in the system. Most of the intermediate stages on the paths are shared among the tasks; therefore, despite the small number of source-sink pairs, there are 171 unique paths among them. The deadline is set at 300 ms for eight source-sink pairs and at 100 ms for the other two.

Fig. 4. Comparison of manual and optimized designs for the two architecture options.

1) Optimization Algorithm: The experiments are run on a 1.7-GHz processor with 1 GB RAM. CPLEX [16] is used as the MILP solver for the initial task allocation. The timeout limit in the MILP formulation is set to 1000 s. The tradeoff parameter is used to explore the tradeoff between task extensibility and end-to-end latencies during initial task allocation. We test our algorithm with several different values of it and compare the results with a system configuration produced manually; results are shown in Fig. 4. A manual design is available for the single-bus configuration and consists of the configuration of the tasks and messages provided by its designers. This initial configuration is not optimized: the total latency of all paths is 24528.1 ms and the task extensibility is 16.9113. For the single-bus case, in any of the four automatically optimized designs, all paths meet their deadlines. Different parameter values provide the tradeoff between task extensibility and end-to-end latencies. At one extreme, we have the largest task extensibility at 23.8038, which is a 41% improvement over the manual design; at the other, we have the shortest total end-to-end latency at 9075.46 ms, which is 63% less than the manual design. If a balanced design between extensibility and end-to-end latency is needed, intermediate values may be used: for one intermediate value, we obtain a 37% improvement on task extensibility and a 31% improvement on end-to-end latencies. For the two-bus case, again all optimized designs satisfy the end-to-end latency constraints: the largest task extensibility obtained after optimization is 23.1347, and the shortest total end-to-end latency is 16948.1 ms. If a balanced design is needed, intermediate values may be used, with the results shown in Fig. 4.
Comparing the single-bus and two-bus cases, the two-bus results have longer latencies in general, because of the additional time taken by gateway tasks and signals. Also, the two-bus results span a smaller range of extensibility and latency than the single-bus results, because the configurations in the two-bus case are less flexible due to the constraints from allocation and gatewaying. For both configurations, after the initial task allocation, each outer iteration of the signal packing and message allocation, priority assignment, and task reallocation takes less than 30 s, and the optimization converges within 30 iterations for the parameter values we tested. Fig. 5

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination. ZHU et al.: OPTIMIZING THE SOFTWARE ARCHITECTURE FOR EXTENSIBILITY IN HARD REAL-TIME DISTRIBUTED SYSTEMS

13

TABLE I. OPTIMIZATION RESULTS WITH DIFFERENT MILP TIMEOUT LIMITS (K = 0)

TABLE II COMPARISON BETWEEN MILP AND HEURISTIC FOR INITIAL TASK ALLOCATION

Fig. 5. Task extensibility over iterations.

Fig. 6. Extensibility over iterations in the Simulated Annealing Algorithm.

shows the current best task extensibility over 30 iterations for the two architecture options. Iteration 0 is the task extensibility after the initial task allocation. The running time is 732 s for 30 iterations in the single-bus case, and 545 s for 30 iterations in the two-bus case.

2) A Simulated Annealing Algorithm: To evaluate the quality of the results, we compared our algorithm's performance with a simulated annealing algorithm, executed with an initial temperature of 100, a cooling rate of 0.96, a final temperature of 0.005, and a maximum number of iterations at each step (MAXTRY, MAXCHANGE) equal to 2000. Simulated annealing, after approximately 12 h of computation, found a best value of 23.7539, with a total latency (not optimized by the algorithm) of 29718.9 ms for the single-bus architecture option, and an extensibility of 23.206 with a total latency of 24678.5 ms for the two-bus platform. Fig. 6 shows the improvement over time of the performance of the solutions found by the simulated annealing algorithm. Each iteration (approximately 3 min of computation time) refers to a decrease in the temperature parameter. For the single-bus case, the best value found until the given iteration index is shown (thick line in red). For the two-bus case, the best value found (thicker line) is shown together with the value currently accepted by the algorithm (thinner line), whose fluctuations show the conditional acceptance of higher cost solutions. Negative values are not shown, but are considered (and conditionally accepted) by the algorithm. The maximum task extensibility values obtained from the optimization algorithm (when K = 0) and from the simulated annealing algorithm are extremely close. This fact, together with

the way both algorithms converge to their final result, suggests that the obtained values and configurations are very close to the true optimum (although a final proof cannot be obtained unless all combinations are evaluated, which is clearly impossible in a feasible time).

3) Impact of Initial Task Allocation and the Heuristic: To study the impact of the initial task allocation on the final optimization results, we conducted a set of experiments with different timeout limits for the MILP initial step. We also compared it with an alternative approach based on an allocation heuristic, as described in Section V. Table I shows the final optimization results with different MILP timeout limits in the initial task allocation; the architecture platform is the two-bus configuration, and the parameter in (57) is set to 0. Results clearly degrade when shortening the timeout limit with respect to the selected timeout of 1000 s. However, rewards also decrease quickly when using more time in this first stage. Table II shows the comparison between the use of the MILP and of our heuristic for the initial task allocation step (the rest of the algorithm is the same, as shown in Fig. 1). The architecture platform is the two-bus configuration, and the timeout limit for the MILP is set to 1000 s. The tradeoff parameter in (57) for the MILP and the corresponding parameter in Algorithm 4 for the heuristic are set to the same value. When this parameter is set to 0 or 0.1, the heuristic could not find any feasible solution, even after the reallocation iterations. When it is set to 0.2 or 0.5, the heuristic puts more weight on minimizing latencies and is therefore able to find feasible solutions after a series of reallocation steps. However, the results are still worse than the ones obtained using the MILP.
Of course, it is in principle possible to design a better heuristic, but doing so is expected to be quite difficult given the need to balance the tradeoff between feasibility and extensibility and the need to cope with gatewaying (for which the MILP formulation provides intuitive solutions). On the other hand, the heuristic is very efficient: in all experiments, it takes less than 1 s to complete the initial task allocation step. 4) Extensibility Metric for Multiple Tasks: In Sections II-A and III-F, we defined an extensibility metric for task groups containing multiple tasks, and explained the changes to the original problem formulation and algorithm (shown in Fig. 1) needed to optimize this new metric. We implemented these changes and conducted a set of experiments on the active safety

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.


Fig. 7. Optimization results for extensibility metric of multiple tasks.
Fig. 8. Correlation of two extensibility metrics.
TABLE III. Description of distributed control system.

vehicle example with the two-bus architecture option. We defined two extensibility scenarios for experimental purposes. The corresponding task groups include three and five tasks, respectively. To explore the tradeoff between extensibility and latency, we selected for these groups the tasks that appear most often in the paths with deadlines. The timeout limit for the MILP step is set to 1000 s. The parameter defined in Formula (62) is used to trade off between extensibility and total latency. The results are shown in Fig. 7, which illustrates the impact of the parameter values on the extensibility metric and the total latency. 5) Correlation Between Extensibility Metrics: Our definition of the extensibility metric is a weighted sum of each task's execution time slack over its period, as shown in (1). This metric favors tasks with shorter periods and potentially gives them larger execution time slacks. The motivation for this is twofold. First, tasks with shorter periods are activated more often; therefore, increasing their execution time slacks might have a larger cumulative effect than optimizing tasks with longer periods. Second, this metric acknowledges the impact of task periods on latencies. As shown in (5), tasks with shorter periods have a bigger impact on the lower priority tasks on the same node, since the number of interferences they cause is larger. Therefore, it may be more beneficial to optimize them. Another option is to define the extensibility metric as a weighted sum of each task's execution time slack over its original execution time. We correlated these two metrics in the experiments, with results shown in Fig. 8. For each optimization result obtained by using the execution-time-slack-over-period metric (Metric 1 in the figure), we also computed its value for the metric of execution time slack over original execution time (Metric 2). The (lower) blue line represents these optimization results, with each point corresponding to a different parameter value between 0 and 0.5.
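The difference between the two metrics can be made concrete with a small numerical sketch. The task set, slack values, and unit weights below are hypothetical, chosen only to illustrate how slack-over-period (Metric 1) rewards slack on short-period tasks more heavily than slack-over-WCET (Metric 2) does.

```python
# Hypothetical tasks: (period ms, original WCET ms, execution-time slack ms, weight).
tasks = [
    (10.0, 1.0, 0.5, 1.0),     # short-period task
    (100.0, 5.0, 2.0, 1.0),
    (1000.0, 20.0, 10.0, 1.0),
]

# Metric 1: weighted sum of each task's slack over its period, as in (1).
metric1 = sum(w * slack / period for period, wcet, slack, w in tasks)

# Metric 2: weighted sum of each task's slack over its original execution time.
metric2 = sum(w * slack / wcet for period, wcet, slack, w in tasks)
```

For this toy data, Metric 1 = 0.05 + 0.02 + 0.01 = 0.08, where the small 0.5 ms slack on the 10 ms task dominates; under Metric 2 = 0.5 + 0.4 + 0.5 = 1.4, all three tasks contribute comparably. This illustrates why the two metrics need the scaling mentioned in the text before their value ranges can be compared.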
We then optimized the design using Metric 2 with different parameter values, and computed Metric 1 for each optimization result. The (upper) red line represents these results (the parameter is scaled with respect to the parameters chosen in Metric 1 to reflect the value range of the new extensibility metric). From the results in Fig. 8, we make the following observations: 1) the two metrics have a monotonic relationship, i.e., an optimal result that is better in one metric is generally also better in the other

metric; and 2) the difference between the two optimization lines is not very large. This is expected given the similarities between the two metrics. B. Distributed Control System In addition to the active safety vehicle application, another example is presented: a safety-critical distributed control system deployed within a small truck. This is a CAN-based system that implements distributed closed-loop control. The key feature of this system is the integration of slow and very fast (power electronics) control loops over the same communication network. In this example, we are interested in redesigning an existing system to understand the effects of adding communication and computational resources. The system implements several control loops, such as the power electronics control, as well as diagnostic features. To protect sensitive confidential data obtained from a major automobile manufacturer, the system is abstracted as a set of tasks with aggregate information. Table III summarizes the test case. Task periods range from 10 to 1000 ms. The example system is evaluated for an initial configuration consisting of seven nodes, and a derived system where one additional node is provided for additional flexibility. The optimization algorithm must define a new allocation of tasks to maximize extensibility on the new architecture. The average task utilization is 0.05. In the initial configuration with seven nodes, the average CPU utilization is 0.307, with a maximum of 0.45 and a minimum of 0.25. Results are shown in Fig. 9. Solid lines indicate the mapping of tasks (indicated by T#) to the seven nodes in the initial configuration, whereas dotted lines indicate the mapping computed by the algorithm for the extended configuration. In the optimized configuration with eight nodes, the average utilization is 0.269, with a maximum of 0.30 and a minimum of 0.20. The task extensibility values for the two systems are 12.15 and 13.97, respectively.
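The utilization figures reported above follow the standard definition: a node's utilization is the sum of C_i/T_i over the tasks mapped to it, and the average is taken across nodes. The sketch below recomputes per-node and average utilization for a hypothetical mapping (the task parameters and node names are made up for illustration; the actual Table III data is not reproduced here).

```python
# Hypothetical mapping: node -> list of (WCET ms, period ms) for its tasks.
mapping = {
    "n1": [(1.0, 10.0), (5.0, 50.0), (10.0, 100.0)],  # U = 0.10 + 0.10 + 0.10
    "n2": [(2.0, 10.0), (10.0, 200.0)],               # U = 0.20 + 0.05
}

# Per-node utilization: sum of C/T over the tasks on each node.
node_util = {n: sum(c / t for c, t in tasks) for n, tasks in mapping.items()}

# Average CPU utilization across nodes, as reported for the case study.
avg_util = sum(node_util.values()) / len(node_util)
```

Adding a node and reallocating tasks lowers the per-node sums (as in the drop from 0.307 to 0.269 reported above), which in turn leaves more execution-time slack and hence higher extensibility.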
The timeout limits of the MILP for the initial task allocation in the two cases are both set to 1000 s. The running time of the rest of the flow shown in Fig. 1, which includes 30 iterations of signal packing and message allocation,


Fig. 9. Reallocation of tasks for increased computational resources.

priority assignment, and task reallocation, is 298 s for the first case (seven nodes) and 346 s for the second case (eight nodes). VII. CONCLUSIONS AND FUTURE WORK A mathematical framework was presented for defining an extensibility metric and for solving the related optimization problem in distributed hard real-time systems, by exploring task allocation, signal packing and message allocation, as well as task and message priorities. The problem is NP-hard. We formulated it as a standard optimization problem, then proposed an algorithm based on mixed integer linear programming for the initial task allocation and heuristics for signal packing, message allocation, task and message priority assignment, and task reallocation. To evaluate the performance of the algorithm, we tested it on two case studies. The results show that this framework can effectively optimize extensibility while meeting design constraints such as end-to-end latency and utilization constraints. The algorithm also shows a significant improvement in running time compared with a simulated annealing algorithm we implemented. In the future, we plan to extend our framework to include not only task extensibility but also message extensibility. Further, we would like to consider task and message scalability (i.e., how many new tasks and messages can be added to an existing system).



Qi Zhu received the Ph.D. degree in electrical engineering and computer sciences from the University of California, Berkeley, in 2008. Since 2008, he has been a Research Scientist at the Strategic CAD Laboratories in Intel Corporation. His research interests are in the areas of design automation, electronic system-level design and embedded systems. Dr. Zhu received two Best Paper Awards at the Design Automation Conference (DAC), 2006 and 2007.

Yang Yang received the B.E. degree from Tsinghua University, Beijing, China, in 2003 and the M.S. degree from the University of California, Berkeley, in 2008. She is currently working towards the Ph.D. degree at the Department of Electrical Engineering and Computer Sciences, University of California, Berkeley. Her research interests include computer-aided design, real-time distributed systems, and embedded software.


Marco Di Natale (M'03) received the Ph.D. degree from Scuola Superiore Sant'Anna, Pisa, Italy, in 1991. He is an Associate Professor at the Scuola Superiore Sant'Anna, where he was Director of the Real-Time Systems (ReTiS) Lab from 2003 to 2006. He was a Visiting Researcher at the University of California, Berkeley, in 2006 and 2008/09. In 2006, he was selected by the Italian Ministry of Research as the national representative in the mirror group of the ARTEMIS European Union Technology Platform. He has been a researcher in the area of real-time and embedded systems for more than 15 years, and is author or coauthor of more than 100 scientific papers. Dr. Di Natale has won three Best Paper Awards and the Archie T. Colwell Award. He has served as a Program Committee member and has organized tutorials and special sessions for the main conferences in the area, including the Real-Time Systems Symposium, the IEEE/ACM Design Automation Conference (DAC), Design Automation and Test in Europe (DATE), and the Real-Time Application Symposium. He also served as Track Chair for the RTAS Conference and for the Automotive track of the 2010 DATE Conference. He has been an Associate Editor of the IEEE TRANSACTIONS ON CAD and the IEEE EMBEDDED SYSTEMS LETTERS and is currently on the Editorial Board of the IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS.

Eelco Scholte received the M.S. degree from the University of Twente, Twente, The Netherlands, in 1999 and the Ph.D. degree from Cornell University, Ithaca, NY, in 2004. His Ph.D. research was funded through the DARPA Software Enabled Control program and focused on the use of embedded control for high-performance autonomous vehicles. He is a research staff member in the Systems Department at United Technologies Research Center (UTRC), and has been with UTRC since 2005. At UTRC, he has worked on methodologies for improving the design of embedded control systems, including verification and modeling of large distributed embedded systems. Currently, he is working as a Project Leader focusing on the use of model-based design and formal verification methods for aircraft power systems.

Alberto Sangiovanni-Vincentelli (F'81) received the degree in electrical engineering and computer science ("Dottore in Ingegneria") summa cum laude from the Politecnico di Milano, Milano, Italy, in 1971. He holds the Edgar L. and Harold H. Buttner Chair of Electrical Engineering and Computer Sciences at the University of California at Berkeley, where he has been on the faculty since 1976. From 1980 to 1981, he spent a year as a Visiting Scientist at the Mathematical Sciences Department, IBM T. J. Watson Research Center. In 1987, he was a Visiting Professor at MIT. He has held a number of Visiting Professor positions at Italian universities, including


Politecnico di Torino, Universita' di Roma La Sapienza, Universita' di Roma Tor Vergata, Universita' di Pavia, Universita' di Pisa, and Scuola Superiore Sant'Anna. He was a co-founder of Cadence and Synopsys, the two leading companies in the area of electronic design automation. He is the Chief Technology Adviser of Cadence. He is a member of the Board of Directors of Cadence and the Chair of its Technology Committee; of UPEK, a company he helped spin off from ST Microelectronics; of Sonics; and of Accent, an ST Microelectronics-Cadence joint venture he helped found. He has consulted for many companies, including Bell Labs, IBM, Intel, United Technologies Corporation, COMAU, Magneti Marelli, Pirelli, BMW, Daimler-Chrysler, Fujitsu, Kawasaki Steel, Sony, ST, and Hitachi. He was an advisor to the Singapore Government for microelectronics and new ventures. He has consulted for the Greylock Ventures and Vertex Investment venture capital funds. He is a member of the Advisory Board of the Walden International, Sofinnova, and Innogest venture capital funds, and a member of the Investment Committee of a novel VC fund, Atlante Ventures, by Banca Intesa/San Paolo. He was the founder and Scientific Director of the Project on Advanced Research on Architectures and Design of Electronic Systems (PARADES), a European Group of Economic Interest supported by Cadence, Magneti-Marelli, and ST Microelectronics. Since 2010, he has been the Senior Advisor to the President and CEO of L'Elettronica. He is a member of the Advisory Board of the Lester Center for Innovation of the Haas School of Business and of the Center for Western European Studies, and is a member of the Berkeley Roundtable on the International Economy (BRIE). He is a member of the High-Level Group, the Steering Committee, the Governing Board, and the Public Authorities Board of the EU ARTEMIS Joint Technology Initiative. He is a member of the Scientific Council of the Italian National Research Council (CNR).
Since February 2010, he has been a member of the Executive Committee of the Italian Institute of Technology. He is an author of over 880 papers, 15 books, and 3 patents in the areas of design tools and methodologies, large-scale systems, embedded systems, hybrid systems, and innovation. Dr. Sangiovanni-Vincentelli has been a Member of the National Academy of Engineering, the highest honor bestowed upon a US engineer, since 1998. He was a member of the HP Strategic Technology Advisory Board, and is a member of the Science and Technology Advisory Board of General Motors and of the Scientific Councils of the Tronchetti Provera Foundation and the Snaidero Foundation. In 1981, he received the Distinguished Teaching Award of the University of California. He received the worldwide 1995 Graduate Teaching Award of the IEEE (a Technical Field Award for "Inspirational Teaching of Graduate Students"). In 2002, he was the recipient of the Aristotle Award of the Semiconductor Research Corporation. He has received numerous research awards, including the Guillemin-Cauer Award (1982-1983) and the Darlington Award (1987-1988) of the IEEE for the Best Paper Bridging Theory and Applications, two awards for the Best Paper published in the IEEE TRANSACTIONS ON CAS and CAD, five Best Paper Awards and one Best Presentation Award at the Design Automation Conference, and other Best Paper Awards at the Real-Time Systems Symposium and the VLSI Conference.
In 2001, he was given the Kaufman Award from the Electronic Design Automation Council for “Pioneering Contributions to EDA.” In 2008, he was awarded the IEEE/RSE Wolfson James Clerk Maxwell Medal for groundbreaking contributions that have had an exceptional impact on the development of electronics and electrical engineering or related fields with the following citation: “For pioneering innovation and leadership in electronic design automation that have enabled the design of modern electronics systems and their industrial implementation.” In 2009, he received the first ACM/IEEE A. Richard Newton Technical Impact Award in Electronic Design Automation to honor persons for an outstanding technical contribution within the scope of electronic design automation. In 2009, he was awarded an Honorary Doctorate by the University of Aalborg, Denmark.