Workload-aware Shaping of Shared Resource Accesses in Mixed-criticality Systems

Sebastian Tobuschat, Moritz Neukirchner, Leonardo Ecco, Rolf Ernst
Institute of Computer and Network Engineering, Technische Universität Braunschweig, Germany

{tobuschat | neukirchner | ecco | ernst}@ida.ing.tu-bs.de

ABSTRACT

For mixed-criticality systems, safety standards (e.g. ISO 26262) require sufficient independence among different criticality levels, unless the entire system is certified according to the highest applicable level. We present a resource arbitration scheme that provides sufficient independence among different criticality levels with respect to timing properties. We exploit the throughput and latency slack of critical applications by prioritizing non-critical over critical accesses and switching priorities only when necessary. By using an accurate representation of resource access patterns and workloads, the proposed arbitration scheme achieves a higher resource utilization than classical approaches that use simple access counters. The approach provides service guarantees for critical applications while reducing the adverse effects of strict prioritization on non-critical applications.

Categories and Subject Descriptors C.3 [Special-purpose and Application-based Systems]: Real-time and embedded systems; J.7 [Computers in other Systems]: Real time

General Terms Design, Reliability

Keywords multicore, shared resource, mixed-criticality

1. INTRODUCTION

Multicore systems are becoming increasingly interesting for safety-critical systems due to their performance, power, and size benefits. The increased computing performance makes it possible to consolidate multiple functions, which were previously distributed and isolated, onto a common multiprocessor system on chip (MPSoC).

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. ESWEEK'14, October 12-17, 2014, New Delhi, India. Copyright is held by the owner/author(s). Publication rights licensed to ACM. ACM 978-1-4503-3051-0/14/10 ...$15.00. http://dx.doi.org/10.1145/2656075.2656105.

In safety-critical systems, however, the consolidation of multiple functions introduces additional challenges, particularly when functions of different safety criticalities are integrated [7]. Safety-critical functions must be developed with high diligence according to well-defined processes [18], and thus their behavior (i.e. access patterns, timing) is well specified and tested. The confidence in the specification of non-critical applications is lower, as no such process must be followed. Therefore, the possibility that these functions exhibit errant behavior is expected to be higher. In multicore systems several resources, such as the communication interconnect and the memory, are shared between multiple functions. The concurrent use of resources couples the execution behavior of different functions. For mixed-criticality systems, safety standards such as IEC 61508 [17] and ISO 26262 [18] require sufficient independence among different criticality levels, unless the entire system is certified according to the highest applicable level. Sufficient independence includes functional (access protection) and temporal (performance) isolation. For temporal independence the blocking time on shared resources must be known and bounded. To achieve this, multicore architectures must provide special means, such as predictable access schedulers, when shared resources are used. Commonly used approaches to provide temporal isolation are time-division multiplexing (TDM), strict-priority scheduling with "criticality as a priority", traffic shaping, and the use of servers. None of these approaches, however, minimizes the adverse performance impact of isolation on the non-safety-critical functions. They typically assume a single size for all accesses to the resource and derive the bounds or budgets for the isolation approach from this size.
The use of a single access size, however, leads to an inaccurate estimation of the induced workload when a resource supports different access sizes. This inaccurate estimation leads to an imprecise bounding of the interference, so these approaches can result in a lower system utilization and a performance decrease, especially for the non-safety-critical parts.

Contribution: In this work we present a workload-aware shaping of shared resource accesses that avoids the drawbacks of an inaccurate workload estimation. We exploit the throughput and latency slack of critical applications by prioritizing non-critical over critical accesses and switching priorities only when necessary. By using an accurate specification of resource access patterns and workloads, the proposed arbitration scheme achieves an improved resource utilization, especially for resources with varying access sizes. It enables the provision of latency-rate properties [4, 30], and thus quality of service, to critical functions, while reducing the negative performance impact on non-critical functions.

This paper is structured as follows: After giving a short overview of the used setup in Section 2, we discuss related work in Section 3. Section 4 introduces the system model used throughout this paper. We then give an overview of the benefits of an accurate workload representation (Section 5) and show in Section 6 how it is used to achieve QoS guarantees for safety-critical accesses while improving the performance of non-critical accesses. The QoS guarantees are derived analytically in Section 7. After the analysis we show the performance benefit of the proposed mechanism with a case study in Section 8 and conclude in Section 9.

2. SYSTEM SETUP

In this section we present the proposed setup for an efficient workload-aware shaping. We first describe the traffic classes we distinguish. Then we give a brief overview of how the setup is used to provide isolation between certain classes.

The general system setup is depicted in Figure 1. We assume multiple requesters share a resource. At the resource the requesters are grouped according to their requirements into different classes. In the current approach we distinguish between non-safety-critical best-effort (BE) and safety-critical requesters. The latter are further divided into guaranteed latency with short deadlines (GLs), guaranteed latency with long deadlines (GLl), and guaranteed throughput (GT). GT requesters represent streaming applications that require a minimum guaranteed throughput but can tolerate high latencies due to their predictable access patterns. However, feasible upper bounds on their latency must exist for safe operation. GLs and GLl requesters transfer small amounts of data, e.g. in the size of cache lines, and have hard deadlines. Their accesses to the shared resource occur sporadically, and the latency must not exceed their deadline. Therefore an upper bound on the latency from the request to access the resource to the grant of the access must be provided. We assume the deadline of GLl to be longer than the deadline of GLs. Hence GLl can suffer a higher blocking without violating its deadline. With this assumption GLl can be used to transfer non-urgent inter-process communication, while GLs is suitable for urgent notifications. For BE the requested data size is not known, and BE requesters are sensitive to latency [21]; they achieve a higher performance when low latency is provided. Nevertheless, a guaranteed upper bound on their latency is not required. As BE requesters are not essential for the safe functioning of the system, they can be dropped without violating the dependability of the system.

Figure 1: Proposed system setup. Four requesters compete for the time on a resource. Access to the resource is handled by a monitor-controlled static-priority (MCSP) scheduler.

At the resource each class is equipped with dedicated queues to prevent blocking between classes. At the output of the queues the grouped workload each class induces on the resource is monitored. A monitor-controlled static-priority (MCSP) scheduler determines which access from the queues will be processed next. The group monitoring allows a single requester in a group to induce more workload than assumed, if other group members induce less workload. This can increase the utilization of the resource [22]. The load of a requester depends on the number of accesses issued to the resource and the blocking time C_i each of these accesses i induces. Based on the requested workload, the monitor controls the priorities used in the MCSP for each class. This enables the monitor to bound the interference a class can induce on a class of lower priority by switching their priorities. While group monitoring reduces the overhead compared to observing all requesters individually, it cannot bound the interference between requesters inside a class.

In the basic design, the MCSP provides one priority level per class. Hence a priority is shared between all requesters in a class. The MCSP distinguishes multiple operational states with different priority assignments: normal, critGT and critGL. GLs requesters always have the highest priority, while the priority assignment of the other classes differs between the states. We denote the priority of a requester r with P(r). In the normal state BE requesters have a higher priority than GT and GLl: P(GLs) > P(BE) > P(GLl) > P(GT). In the critGL state the priority of GLl is higher than that of BE: P(GLs) > P(GLl) > P(BE) > P(GT). And in the critGT state BE requesters have the lowest priority: P(GLs) > P(GLl) > P(GT) > P(BE). As long as GLl and GT make sufficient progress during the normal state, the system remains in this state and delivers low latency to BE and GLs to increase the overall system performance. If the monitor detects insufficient progress, the state is changed dynamically to guarantee sufficient progress of GLl and GT.
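The three operational states and their priority orderings can be sketched in a few lines (an illustrative Python model of the selection rule only, not the hardware design; class and state names follow the text):

```python
# Priority orderings of the MCSP states (higher number = higher priority).
# GLs always has the highest priority; the relative order of BE, GLl and
# GT changes with the operational state, as described in the text.
PRIORITIES = {
    "normal": {"GLs": 3, "BE": 2, "GLl": 1, "GT": 0},
    "critGL": {"GLs": 3, "GLl": 2, "BE": 1, "GT": 0},
    "critGT": {"GLs": 3, "GLl": 2, "GT": 1, "BE": 0},
}

def select_next(state, pending):
    """Pick the pending class with the highest priority in this state.

    pending maps class name -> True if that class has a queued access.
    """
    ranked = PRIORITIES[state]
    candidates = [c for c in pending if pending[c]]
    if not candidates:
        return None
    return max(candidates, key=lambda c: ranked[c])
```

With BE, GLl and GT pending, the normal state serves BE first, while the critGT state serves GLl (and then GT) before BE.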

3. RELATED WORK

Several approaches to provide quality of service and isolation exist. Closely related to our approach are schedulers that implement traffic shapers or dynamic priorities. Traffic shaping [11, 26, 29] controls the data rate by delaying events. At a static-priority scheduler, typically the safety-critical requesters are assigned a high priority to achieve a guaranteed service. To prevent starvation of non-critical low-priority requesters, the safety-critical requesters are shaped. However, these approaches focus on the isolation of safety-critical real-time requesters and do not optimize the performance of non-critical traffic. In [35] the authors propose budget-based memory throttling to bound the interference of certain requesters at the memory. The authors of [23, 24] use access counters to bound the blocking on a shared resource. The Constant Bandwidth Server (CBS) [2] provides isolation using an EDF scheduler and dynamic deadlines. Similar to this approach is the credit-controller static-priority arbitration (CCSP) [5], which uses a rate regulator and a static-priority scheduler. Each requester has a unique priority, and the rate regulator enforces a maximum bound on the provided service for each requester. The CCSP is based on the latency-rate (LR) server model [30], which presents a general model for the analysis of traffic shaping algorithms. However, the grouping of requesters and the use of varying access granularities at the resource to increase the performance for non-critical requests were not investigated. In [13, 14, 16] static-priority scheduling with priority alteration based on certain conditions is used. However, the change of priorities is based on fixed time periods and a single granularity for the accesses. Similar to these approaches is Priority Division (PD) scheduling [28], which changes the priorities based on a static (time) schedule. A more accurate accounting of the workload is used in [6, 9]. In the dual-priority scheme [9] a task can raise its priority when it detects that no more interfering workload is allowed. In [6] the adaptive mixed criticality (AMC) approach monitors the execution time of all tasks. If a non-critical task exceeds its budget, all non-critical tasks are de-scheduled. The approach was further extended in [36] to utilize preemption threshold scheduling. However, grouping of tasks is not investigated, and thus non-critical tasks might be de-scheduled too often.

The shown approaches have two major drawbacks. First, they do not exploit the throughput and latency slack of critical applications. Second, they use a single access granularity at the resource, which can lead to an inaccurate estimation of the induced workload. Both can lead to a reduced resource utilization and thus reduced system performance. To counter these drawbacks, we propose to use typed events [33] and to prioritize non-critical over critical accesses, switching priorities only when necessary. Typed events can account for the different workloads that requests induce on the resource. The workload is monitored [10, 22] to determine dynamically whether critical accesses must be prioritized to meet their requirements. In doing so, the proposed arbitration scheme achieves an improved resource utilization compared to the presented classical approaches.

4. SYSTEM MODEL

In this section we briefly introduce the system model used throughout this paper. The model is based on the models from [22, 31, 33, 34]. In a system with a shared resource, a set of requesters R = {r_1, …, r_n} competes for the time on the resource. The accesses to the resource from a certain requester are denoted as an access stream, where each access uses the resource for a certain time C. The requesters are divided into different classes based on their safety requirements: guaranteed throughput (GT), guaranteed latency (GLs and GLl) and best-effort (BE). Each of these classes R_class ⊂ R is considered as one virtual requester, e.g. r_BE for all BE requesters. The different requesters inside a class might require different times on the resource with each access. Therefore the different worst-case blocking times C_r^i that each access i from requester r can induce must be considered. This corresponds to the modeling of intra-task correlations in [33], where the resource blocking times can be modeled as execution times.

In order to reason about the interference that can occur on the resource, a notion of the temporal behavior of the accesses and the worst-case workload they entail is needed. For this we define access traces, which capture a specific sequence of accesses scheduled on the resource and the resulting worst-case workload.

Definition 1. An access trace of size n_max is a function

    σ_r : [1, n_max] → ℕ⁺ × ℕ⁺    (1)

where σ_r(n) = (σ^t, C) denotes that the n-th access of requester r is scheduled at time σ^t and uses the resource for time C in the worst case. Throughout this paper we use σ_r^t(n) and σ_r^C(n) to denote the point in time the n-th access of requester r is scheduled and its time on the resource, respectively. With this, the accumulated workload W_r(n, m) that a requester r induces in the closed time interval [σ_r^t(n), σ_r^t(m)], from the occurrence of the n-th access until the occurrence of the m-th access, can be specified as

    W_r(n, m) = ∑_{i=n}^{m} σ_r^C(i)    (2)
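Definition 1 and Equation (2) can be made concrete with a small sketch, where a trace is a list of (σ^t, C) pairs (an illustrative model; the 1-based access indexing follows Definition 1):

```python
def workload(trace, n, m):
    """Accumulated workload W_r(n, m) = sum of sigma_r^C(i) for i = n..m.

    trace: list of (sigma_t, C) pairs, one per access. Accesses are
    1-indexed as in Definition 1, so trace[i - 1] is the i-th access.
    """
    return sum(c for (_, c) in trace[n - 1:m])
```

For a trace with per-access costs 2, 1, 2, 1, the total workload W(1, 4) is simply the sum 6, and W(2, 3) covers only the middle two accesses.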

In many systems the behavior of the requesters is not deterministic, as their demands for resource usage can depend on external events or the state of the requester. To give guarantees for the system behavior, analyses such as RTC [31] and CPA [15] use worst-case abstractions of the traces. These are referred to as event models. Event models are represented by upper bounds on the number of events that can occur in any time interval ∆t, which can be given as event arrival functions ᾱ(∆t). As we are grouping multiple requesters into a virtual requester, the events of this group may demand different times on the resource. To account for the variable times we adopt the workload arrival functions from [33]. A workload arrival function defines an upper bound on the workload that can be requested in any time interval of size ∆t.

Definition 2. An upper workload arrival function (WAF) α_r of a requester r is an increasing function

    α_r : ℕ₀ → ℕ₀    (3)

where α_r(∆t) represents an upper bound on the workload W_r that requester r might require during any time interval of size ∆t.

These workload arrival functions are a generalization of event arrival functions. For traces where all accesses demand the same time C on the resource, both are related through

    α(∆t) = C · ᾱ(∆t)    (4)
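For this fixed-duration case, Equation (4) is a simple scaling of the event arrival function. A minimal sketch, assuming for the usage example a periodic event bound ⌈∆t/P⌉ (the period P and the durations are illustrative values, not from the paper):

```python
def waf_from_event_arrivals(event_arrival, c):
    """Build a workload arrival function alpha(dt) = C * event_arrival(dt)
    for traces whose accesses all occupy the resource for time C (Eq. 4)."""
    return lambda dt: c * event_arrival(dt)

# Example: a periodic requester with period 4 is bounded by ceil(dt / 4)
# events per window; each access costs C = 3 time units on the resource.
alpha = waf_from_event_arrivals(lambda dt: -(-dt // 4), 3)
```

Here `-(-dt // 4)` is integer ceiling division, so `alpha(8)` allows two events, i.e. 6 workload units.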

For traces where the accesses occupy the resource for different amounts of time, the relation between the two is more complex. An analytic derivation of these WAFs is presented in [33]. In this paper we use the generalized WAFs to bound the workload a group of requesters can generate on a resource. For this we must check whether a specific trace satisfies a WAF. To satisfy a WAF, a trace must not request more workload W(n, m) between any two requests n, m than allowed by the WAF α(σ^t(m) − σ^t(n)), or formally:

Definition 3. A scheduled trace σ satisfies an upper workload arrival function α if:

    ∀ n ≤ m ∈ [1, n_max] : α(σ^t(m) − σ^t(n)) ≥ W(n, m) = ∑_{i=n}^{m} σ^C(i)    (5)
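Definition 3 can be checked directly by testing every pair n ≤ m (a quadratic reference implementation for illustration only; a practical monitor would bound the checked window):

```python
def satisfies(trace, alpha):
    """Check Eq. (5): for all n <= m, the workload W(n, m) requested
    between accesses n and m must not exceed alpha(t_m - t_n).

    trace: list of (sigma_t, C) pairs; alpha: workload arrival function.
    """
    for n in range(len(trace)):
        acc = 0
        for m in range(n, len(trace)):
            acc += trace[m][1]  # accumulate sigma^C(i) for i = n..m
            if acc > alpha(trace[m][0] - trace[n][0]):
                return False
    return True
```

A trace issuing one unit of workload every four time units satisfies the WAF `dt // 4 + 1`, while the same workload issued back-to-back violates it.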

Along with the behavior of the requesters, the behavior of the resource must be specified. For this we define the resource capacity β and the remaining capacity β′ after some workload has been scheduled, according to [31]:

Definition 4. The resource capacity β(∆t) denotes the amount of workload the resource can serve within any time interval of size ∆t.

Definition 5. The remaining resource capacity β′(∆t) denotes the amount of workload the resource can serve for low-priority requesters in each time interval of size ∆t while a high-priority requester induces the worst-case workload α(∆t). For a work-conserving resource it can be described as:

    β′(∆t) = max_{0 ≤ u ≤ ∆t} (β(u) − α(u))    (6)

With the known remaining capacity, we can specify whether a design is feasible for a requester. A design is feasible if the remaining capacity is sufficient to serve the requested workload before a certain deadline d, after serving all requesters with a higher priority.

Definition 6. The design is feasible for requester r if the remaining capacity β′_r(∆t) is sufficient to serve the requested workload α_r(∆t) before the deadline d:

    α_r(∆t − d) ≤ β′_r(∆t)  ∀∆t,  with α_r(x) = 0 for x < 0    (7)
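Equations (6) and (7) can be evaluated pointwise over discrete time. A sketch, assuming integer time steps and a finite check horizon (both assumptions of this illustration, not of the model):

```python
def remaining_capacity(beta, alpha_hi, dt):
    """Eq. (6): capacity left for lower-priority requesters in any window
    of size dt, after the high-priority workload alpha_hi is served."""
    return max(beta(u) - alpha_hi(u) for u in range(dt + 1))

def feasible(alpha_r, beta_prime, d, horizon):
    """Eq. (7): alpha_r(dt - d) <= beta'(dt) for all dt, with
    alpha_r(x) = 0 for x < 0, checked up to a finite horizon."""
    return all(
        (alpha_r(dt - d) if dt >= d else 0) <= beta_prime(dt)
        for dt in range(horizon + 1)
    )
```

For example, with a fully available resource β(∆t) = ∆t and a high-priority requester that demands at most two workload units, a window of size 5 leaves three units for lower priorities.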

5. ACCURATE ACCOUNTING OF WORKLOAD

In this section we show that an accurate estimation of the induced workload can increase the performance of best-effort requesters and thus of the overall system. We do this by giving a brief overview of how our workload accounting, together with the MCSP, reduces the over-provisioning for critical requesters and hence increases the usable resource time for BE.

For an accurate workload accounting, the granularity of the workload itself and of the time in which this workload can be issued are important. Classical isolation approaches, e.g. [14], typically use a periodic budget refill and assume a single fixed time C_fixed for accesses to resources such as memory or interconnects. Based on this time, upper bounds on the number of allowed accesses n_bound during each period are derived. The interference of a requester during each period is then defined by C_fixed · n_bound. Using simple models for the budget refill, and thus for the bounds, does not reflect the complex activation patterns used in e.g. automotive systems [27]. This can result in a high over-reservation. The use of arrival functions, specified through minimum distance functions [15], results in a more fine-granular bounding. These fine-granular bounds make it possible to reduce the over-reservation and thus to increase the system performance.

Besides an accurate bound on the induced load, capturing the transient load induced by a requester is important. In order to derive a safe upper bound, classical approaches use the maximum time C_max of all accesses that can occur in the system to estimate the interference and the used budget: C_fixed = C_max. However, if the system contains resources that support different access granularities, and thus different access times, C_max is a pessimistic assumption. This leads to an over-estimation of the induced workload and hence of the used budget. This over-estimation results in a lower average utilization of the resource.

Figure 2: Influence of varying access lengths on a resource. Using multiple granularities for the accesses enables a more accurate measurement and thus allows more accesses to be issued before the upper bound is reached.

Two examples of resources with different access granularities are the AMBA AHB bus [1] and DDR2-400 memory [20]. The AMBA AHB bus supports locked transfers with different burst lengths and thus supports transmissions that occupy the bus for different time durations. The DDR2 memory offers different transaction sizes and hence also different durations. Based on the current state of the memory and the access granularity, which is specified via a burst length and burst count parameter [3, 19, 25], an access occupies the memory controller for a different amount of time. The use of different access types with varying granularity can increase the data efficiency and utilization of memory accesses. Requesters that need small amounts of data benefit from a fine granularity, as their requests are served faster. For requesters that need big data sets, a fine granularity increases the overhead, as more accesses are necessary. For these big data sets, coarse-grained accesses are preferable, as the overhead can be reduced. However, these accesses take longer and would thus unnecessarily increase the latency when only a small part of the fetched data is needed. Therefore the use of two different transaction granularities provides both requester types with the appropriate behavior.
If the isolation mechanism does not account for the different access granularities, it can decrease the utilization of resources and thus the overall system performance. Figure 2 outlines this problem for a resource with two different access sizes, where a requester continuously issues three short and one long access with durations of C_min and C_max respectively, where C_max = 1.5 · C_min. This is comparable to the different lengths of DDR2-400 memory transactions: accessing a single bank without a burst takes approximately 11 cycles, while accessing four banks interleaved with a burst takes 16 cycles [19, 20]. The figure presents the measured accumulated workload (ordinate) over the number of accesses (abscissa). The solid line shows the measurement when only a single fixed duration C_max is used to estimate the induced workload. The dotted line presents the results for a fine-granular estimation with two different durations. As can be seen in the figure, using the more accurate model with multiple durations makes it possible to issue three additional requests before reaching the bound (dashed horizontal line). Thus the requester can issue more requests before it gets shaped, and so it can achieve an increased performance.

A fine-granular estimation of the induced workload of each requester makes it possible to fully utilize the assigned budget. For this, the real workload can be measured with timers or approximated through the use of typed events [33]. The event types represent the maximum time an access will block the resource in the worst case. For the shown example, all arriving accesses are typed with C_min or C_max when they arrive at the scheduler: all requests with a guaranteed time under C_min are typed with C_min, and all others with C_max. The accuracy of this approximation depends on the number of different types the resource supports and the number of types used for the approximation. The use of typed events has the additional benefit of knowing the maximum time an access will block the resource before it is scheduled. This information can be used to fully utilize the remaining budget of a traffic class without over-reserving or delaying other classes. This is important for non-preemptive accesses: without knowing the time before an access is scheduled, it might occupy the resource for too long and thus block subsequent requesters.

In the proposed setup we distinguish GL, GT and BE. We assume GT traffic to use only large accesses, as needed for efficient handling of streaming data. GL will only be used for small accesses, such as synchronization primitives or coherency traffic. Hence we assume a single fixed time for each and denote them C_GT and C_GL respectively. Based on the amount of data the accesses might require, we set C_GT > C_GL. Additionally, we assume C_GL to be the smallest and C_GT the largest possible time an access can occupy the resource. As BE requesters are not well specified, they can contain accesses from different application types and issue accesses that require different times. Hence we assume C_GL ≤ C_BE^i ≤ C_GT for any request i from r_BE throughout the paper. Using these times, the MCSP types all GL and GT requests with the corresponding times when they arrive at the queues. Arriving BE requests are divided depending on their actual needs and are typed with C_GL or C_GT, for short and long transactions respectively, to estimate the transient workload online.
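The effect shown in Figure 2 can be reproduced with a toy accounting loop: with C_max = 1.5 · C_min and a repeating pattern of three short and one long access, the single-duration model exhausts a given workload bound after fewer accesses than the typed model (the durations and the bound below are illustrative numbers, not taken from the paper's measurements):

```python
C_MIN, C_MAX = 2, 3                      # C_max = 1.5 * C_min
PATTERN = [C_MIN, C_MIN, C_MIN, C_MAX]   # three short accesses, one long

def accesses_until_bound(bound, typed):
    """Count accesses admitted before the accounted workload exceeds
    `bound`. typed=False charges every access with C_MAX (single-size
    model); typed=True charges the actual per-access duration."""
    used = count = 0
    while True:
        cost = PATTERN[count % len(PATTERN)] if typed else C_MAX
        if used + cost > bound:
            return count
        used += cost
        count += 1
```

With a bound of 18 workload units, the single-size model admits 6 accesses, while the typed model admits 8 before shaping sets in.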

6. WORKLOAD-AWARE SHAPING

In this section we show how the proposed setup and the accurate workload accounting are used to provide quality-of-service guarantees to GT and GL while at the same time minimizing the latency of BE accesses. We first show how to integrate GT and BE in one base system, where BE has the highest priority in the normal state. In this base system we use the monitor and the monitor-controlled static-priority scheduler (MCSP) to provide QoS, i.e. to bound the interference. This is done by switching the priorities of BE and GT when necessary to provide sufficient throughput. Building on the base system, we then show how GL requesters can be integrated while still ensuring the GT guarantees.

6.1 Guaranteed Throughput

Figure 3: Grouping of two individual workload arrival functions (α₁, α₂) to one grouped WAF (α_group). (a) Individual workload arrival functions; (b) grouped workload arrival function.

In the base system we only consider a BE and a GT requester. The BE requester r_BE has a higher priority than

the GT requester r_GT, to reduce the latency for BE accesses. The access pattern and durations of r_BE are unknown. But to provide service guarantees to a low-priority requester, i.e. r_GT in this case, the interference of the higher-priority requester must be bounded. We achieve this bound through workload monitoring and priority switching in the MCSP. If the monitor observes a too high workload of r_BE, i.e. a too slow progress of r_GT, a priority switch between r_BE and r_GT occurs. The priority switch removes the BE requester from the set of interferers of GT, which consists of all higher- or equal-priority requesters. We denote these sets for GT as I_GT^normal and I_GT^critGT for the normal and critGT state respectively. Thus we have I_GT^critGT = I_GT^normal \ {r_BE}.

The monitor checks whether the current BE trace σ_BE satisfies its WAF based on Equation (5), to bound the maximum interference r_BE can induce. For this, the WAF and the observed trace must be stored in a buffer. The satisfaction condition requires checking the whole trace against the WAF, which leads to an overhead that increases with the trace length. In [22] the authors showed how monitoring can be achieved with a constant buffer size and in constant time through the use of limited workload arrival functions α^l. For this, only a limited trace buffer of length l must be stored and checked when a new event is scheduled.

For checking the satisfaction, the upper workload arrival function for BE, α_BE^GT, must be known. This is the workload that can be accommodated for BE streams without interfering with the service guarantees for GT. The WAF α_BE^GT can be derived from the specification of GT, which is known at design time. Based on the individual GT WAFs, a grouped WAF for the whole GT group can be obtained as indicated in Figure 3. We assume no correlations between the individual requesters, and thus the grouped WAF is given as the sum of the individual WAFs [22]. In the figure we have two periodic requesters with α₁ and α₂ as bounds on their required workload. Both requesters in the example will never request more than one workload unit in four time units (α₁(4) = α₂(4) = 1). Thus the grouped WAF α_group for both will never require more than two workload units in four time units: α_group(4) = 2.

With the known WAF α_GT for GT and the resource capacity β, we can determine α_BE^GT under the constraint that the remaining capacity β′_BE after BE is scheduled is sufficient to serve GT:

GT and is added to all sets of interferes for GT. With this we can specify the sets of GT’s interferes for the different states as follows:

!

′ αGT (∆t − d) ≤ βBE (∆t) ′ βBE (∆t)

= max (β(u) − 0≤u≤∆t

with

GT αBE (u))

∀∆t

(8)

where d denotes a specified delay for the GT requester. To GT obtain αBE the inverse of the resource transformation from Equation (6) can be constructed analogous to [34]. This delivers a conservative upper bound for the workload that can be entailed by the BE requester: GT αBE (∆t) ≤ β(∆t + λ) − αGT (∆t + λ)

with λ = max{γ : αGT (∆t + γ) = αGT (∆t)}

(9)

GT With the known WAF αBE for the BE requester, the monitor ensures that the scheduled BE trace σBE always GT satisfies αBE when GT accesses are pending. Before a BE request is scheduled, the monitor checks, if any BE request GT can be scheduled without violating αBE . This guarantees that GT is never blocked by more workload than specified also for non-preemptive accesses. If there is no access type with a duration C that can satisfy the condition, GT must be scheduled and the MCSP goes in the critical state critGT . This results in GT having a higher priority than BE. Hence the MCSP will select GT accesses. The switch back to normal state is raised when no more GT accesses are pending in the critical state. After the switch back, BE has higher priority again and can benefit from a lower latency. If any access type can be scheduled without violating the satisfaction condition only accesses with a duration lower or equal to this type are scheduled. In case a BE access is scheduled the trace buffer is updated with the induced workload. This can be based on its approximated time (i.e. the duration associated with the access type) or its exact time (i.e. obtained through a timer). The checking for satisfaction based on the access type is done before an access is scheduled. This ensures that the priority switch does not occur while an access is occupying the resource and that the interference is conservatively bounded GT by αBE . The allowance of different durations for the accesses allows BE to optimally utilize the allowed workload, before its priority is degraded. By fully utilizing the allowed workload GT accesses are scheduled as late as possible and the performance for BE is increased while still bounding the interference BE induces to GT.

6.2 Including Guaranteed Latency

After obtaining QoS for GT requesters, we integrate the GL class into the system. For this we define which priority levels are assigned to the different GL requesters and how the MCSP handles them. Additionally, we show how GL can be integrated without violating the QoS guarantees for GT. The latency-sensitive safety-critical requesters are represented as the two GL requesters r_GLs and r_GLl. For these a guaranteed upper bound on latency must be provided. In the current design r_GLs is set to the highest priority level in all MCSP states. This ensures that r_GLs achieves the lowest latency. The r_GLl requester is set to a priority lower than BE in the normal state, which is raised to a level higher than BE in the critGL state. In all states both GL requesters have a higher priority than GT. Thus the set of requesters that can interfere with GT in each state is:

I_GT^normal = {GLs, GLl, BE}
I_GT^critGL = {GLs, GLl, BE}
I_GT^critGT = {GLs, GLl}   (10)

As r_GLl has a lower priority than BE in the normal state, a monitor observes the progress of r_GLl, similar to the one described for GT. Based on the specification of r_GLl, the allowed workload of BE, α_BE^GLl, can be obtained. The monitor then checks whether the current BE trace satisfies this WAF. If a r_GLl access is pending and no access type of r_BE can be scheduled without violating α_BE^GLl, a r_GLl access must be scheduled and the state changes to critGL. This ensures an upper bound on the possible interfering workload. As for GT, accesses from r_GLs are accounted to the BE workload when r_GLl has pending accesses. Safety-critical requesters, and thus the GL requesters, are developed with high diligence, so the behavior of these functions is well specified. The worst-case resource demand of GL is specified by α_GL = α_GLs + α_GLl. As the system design should ensure a reliable and safe system, we assume that GT and GL can be served in the worst case without any violation of the QoS requirements, similar to Equation (8):

α_GT(Δt − d) ≤ max_{0≤u≤Δt} (β(u) − α_GL(u))   ∀Δt   (11)
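A design-time check of this constraint can be sketched numerically. The sketch below assumes leaky-bucket WAFs and a fully available unit-rate resource β(Δt) = Δt; all parameter values are invented for illustration, and the remaining service is clamped at zero (a common convention, not spelled out in the paper's Definition 5).

```python
# Numeric sketch of the design constraint from Equation (11); all curves
# and parameter values here are illustrative assumptions.

def make_lb(b, r):
    # leaky-bucket workload arrival function alpha(dt) = b + r * dt
    return lambda dt: b + r * dt

def shifted(alpha, d):
    # alpha(dt - d): no demand before the specified delay d has elapsed
    return lambda dt: alpha(dt - d) if dt >= d else 0.0

def remaining_service(beta, alpha_gl, dt):
    # max over 0 <= u <= dt of beta(u) - alpha_gl(u), clamped at zero
    return max(0.0, max(beta(u) - alpha_gl(u) for u in range(dt + 1)))

def constraint_holds(alpha_gt, alpha_gl, beta, d, horizon):
    # Check Equation (11) on integer window lengths up to `horizon`.
    gt = shifted(alpha_gt, d)
    return all(gt(dt) <= remaining_service(beta, alpha_gl, dt)
               for dt in range(horizon))

beta = lambda dt: float(dt)   # fully available unit-rate resource
```

With a sufficiently large delay d the GT burst is absorbed and the constraint holds; with d = 0 the initial GT burst exceeds the (zero) remaining service at Δt = 0 and the check fails.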

With this design constraint there is no need to actively bound (i.e. through priority changes) the interference of GL on GT. GL accesses are scheduled directly, without checking against the satisfaction condition. The workload induced by GL on the resource is accounted to the BE requester in the monitor if GT accesses are pending. If the accumulated workload of the current GL and BE traces violates α_BE^GT, the MCSP changes its state to critGT and BE obtains the lowest priority. GL remains at the highest priority levels. Thus GL is allowed to issue further accesses to the resource during the critGT state, which are still accounted to the BE workload in the monitor. This leads to an overcharge of the allowed workload: in the worst case only BE accesses have been scheduled until the state changes, so GL might induce the full workload specified by α_GL in the critGT mode. To account for this additional load, the MCSP remains in the critGT state until no more GT accesses are pending. Thus, if the initial blocking through BE and GL leads to a backlog of multiple GT accesses, it can be caught up.
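The state handling described above can be sketched as a small state machine. Again, this is only an illustrative model under our own naming; the accounting of GL workload to the BE trace is simplified to a single counter rather than a full trace buffer.

```python
# Illustrative sketch of the MCSP priority states (normal / critGT);
# class and method names are our own, not the paper's implementation.

class MCSP:
    NORMAL, CRIT_GT = "normal", "critGT"

    def __init__(self):
        self.state = MCSP.NORMAL
        self.charged = 0.0   # workload charged to BE (includes GL accesses)

    def account(self, duration, gt_pending):
        # While GT accesses are pending, BE *and* GL workload is charged
        # against the allowed BE workload alpha_BE^GT.
        if gt_pending:
            self.charged += duration

    def schedule(self, gt_pending, be_fits):
        """Decide which class is served next. `be_fits` indicates whether
        some BE access type still satisfies the WAF (monitor check)."""
        if self.state == MCSP.NORMAL:
            if gt_pending and not be_fits:
                self.state = MCSP.CRIT_GT   # BE exhausted its allowance
                return "GT"
            return "BE"
        # critGT: GT keeps priority until no GT access is pending, so a
        # backlog of GT accesses caused by the initial blocking is caught up.
        if gt_pending:
            return "GT"
        self.state = MCSP.NORMAL
        return "BE"
```

Note that the switch back to normal only happens once no GT access is pending, matching the overcharge compensation described above.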

7. QUALITY OF SERVICE GUARANTEES

In this section we derive bounds on the behavior of the GL and GT requesters. For the GL requesters we only consider r_GLl; as r_GLs always has the highest priority, standard analysis of non-preemptive static-priority systems can be applied to it [32]. For a safe functioning of r_GLl an upper bound on the latency, and thus on the interference, must exist. We show how this upper bound can be derived from the monitor definitions. To guarantee the safe functioning of GT requesters, we must show a minimum throughput and a maximum latency until GT starts being served with this rate. We show that the MCSP belongs to the class of latency-rate servers [4, 30]. To prove this, we first show that the system delivers a minimum rate towards the GT requester in the critGT state. After obtaining this throughput guarantee, we derive an upper bound on the latency until the requester is served with this rate. For this we show that an upper bound on the latency exists for all priority states of the MCSP, i.e. an upper bound until the critGT state is entered and until GT is served in the critGT state.

Bounded latency for GLl: We show the bounded latency by deriving an upper bound on the interfering consecutive workload. In the worst case the latency of r_GLl is composed of three parts: (1) the time a r_GLl access is pending until the system switches to the critGL state, Θ_GL^normal, (2) the time of serving r_GLs requests in the critGL state before the first access of r_GLl is served, Θ_GL^crit, and (3) the time for switching the state, Θ^switch:

Θ_GLl = Θ_GL^normal + Θ^switch + Θ_GL^crit   (12)

We assume the state switch happens instantaneously during the selection of the next access. Thus we neglect this value for the rest of the analysis and only need to derive Θ_GL^normal and Θ_GL^crit.

The latency Θ_GL^normal only depends on the allowed workload arrival curve for BE, α_BE^GLl(Δt). This follows directly from the definition of the monitoring at the MCSP and the satisfaction condition from Equation (5). If a GLl access is pending, BE is only served if the WAF is satisfied. Thus the allowed burst of BE requests in α_BE^GLl denotes the workload that can be served consecutively until the MCSP enters the critGL state. We denote this allowed burst as b:

Definition 7. The allowed burst b(α_r) of requester r with an upper workload arrival function α_r describes the maximum allowed burst of workload that can be scheduled consecutively.

From this definition we can directly derive Θ_GL^normal as:

Θ_GL^normal = b(α_BE^GLl)   (13)

For Θ_GL^crit, the maximum time between the state switch and the start of serving r_GLl must be obtained. In the critical mode only r_GLs and r_GLl accesses are served, and the design ensures that no access is being served during the state change. Hence this time only depends on α_GLs. As in the case of the delay induced by BE, this time corresponds to the maximum burst of workload allowed in α_GLs, which is b(α_GLs):

Θ_GL^crit = b(α_GLs)   (14)

With these latencies we can now derive the maximum latency for r_GLl until it gets served as:

Θ_GLl = b(α_BE^GLl) + b(α_GLs)   (15)

Throughput guarantee for GT: For the derivation of the minimum throughput, the interference from other requesters must be known. The interference describes how long other requesters use the resource and thus how much service remains for the requester under observation in the worst case. The remaining service for GT, β'_GT, can be obtained from the resource capacity β and the interfering workload α_Interference as shown in Definition 5:

β'_GT(Δt) = max_{0≤u≤Δt} (β(u) − α_Interference(u))   ∀Δt   (16)

To derive the remaining service, the interfering workload must be known. The MCSP uses a static-priority scheduler, so the interfering workload is generated by the set of higher-priority requesters of GT. In the critGT state this set only consists of the GL requesters, as defined in Equation (10). Hence the interfering workload is given as α_Interference(Δt) = α_GL(Δt). With this, the remaining service curve can be calculated as:

β'_GT(Δt) = max_{0≤u≤Δt} (β(u) − α_GL(u))   ∀Δt   (17)

To provide throughput guarantees, we have to show that the remaining service is sufficient to serve the reserved workload of the GT requester, α_GT. This follows directly from Equation (11). Thus GT and GL can be scheduled without a violation of the QoS guarantees. In this way the MCSP provides a minimum service rate that is sufficient to serve the requested workload of GT in the critical mode.

Bounded latency for GT: After proving a minimum and sufficient service rate, we obtain the maximum latency Θ_GT until GT gets served with this rate. We derive this analogously to the latency bound of GL. In the worst case this latency is composed of three parts: (1) the time a GT request is pending until the system switches to the critical state, Θ_GT^normal, (2) the time of serving GL requests in the critical state before the first request of GT is served, Θ_GT^crit, and (3) the time for switching the state, Θ^switch:

Θ_GT = Θ_GT^normal + Θ_GT^crit + Θ^switch   (18)

Again we neglect the switching time for the rest of this consideration and only need to derive Θ_GT^normal and Θ_GT^crit. The latency Θ_GT^normal only depends on the allowed workload arrival curve for BE, α_BE^GT(Δt). This follows directly from the definition of the monitoring at the MCSP and the satisfaction condition from Equation (5). If a GT access is pending, BE is only served if the WAF is satisfied. Thus the allowed burst of BE requests in α_BE^GT denotes the workload that can be served until the MCSP enters the critGT state. With this we can define Θ_GT^normal as:

Θ_GT^normal = b(α_BE^GT)   (19)

For Θ_GT^crit, the maximum time between the state switch and the start of serving GT must be obtained. In the critical mode only GT and GL accesses are served (Equation (10)), and the design ensures that no access is being served during the state change. Hence this time only depends on the allowed burst for GL accesses:

Θ_GT^crit = b(α_GL)   (20)

With these latencies we can now specify the maximum latency for GT until it gets served with the reserved rate:

Θ_GT = b(α_BE^GT) + b(α_GL)   (21)

The WAFs for GL and GT are defined at design time, and from these the WAFs for BE can be derived. Hence the interference and latency bounds for GT and GL are known at design time. With the known latency bounds Θ_GT, Θ_GL and the guaranteed rate α_GT, the MCSP is in the class of latency-rate servers (LR) [30]. This enables the use of the LR model to derive bounds on latencies and buffering for any combination of multiple LR servers, and thus MCSPs, in sequence.
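For concreteness, the latency bounds of Equations (15) and (21) can be evaluated numerically. The sketch below assumes leaky-bucket WAFs of the form α(Δt) = b + r·Δt (an assumption for illustration; for such curves the allowed burst b(α) is simply the constant term), with invented parameter values.

```python
# Numeric sketch of the latency bounds, Equations (15) and (21), under
# leaky-bucket WAF assumptions. All parameter values are invented.

def make_leaky_bucket(b, r):
    return lambda dt: b + r * dt

def burst(alpha):
    # b(alpha): maximum workload schedulable consecutively; for a
    # leaky-bucket curve this is alpha(0), i.e. the burst term.
    return alpha(0)

alpha_be_gll = make_leaky_bucket(3.0, 0.2)   # allowed BE workload w.r.t. GLl
alpha_gls    = make_leaky_bucket(1.0, 0.05)  # WAF of r_GLs
alpha_be_gt  = make_leaky_bucket(5.0, 0.3)   # allowed BE workload w.r.t. GT
alpha_gl     = make_leaky_bucket(2.0, 0.1)   # alpha_GLs + alpha_GLl

theta_gll = burst(alpha_be_gll) + burst(alpha_gls)   # Eq. (15)
theta_gt  = burst(alpha_be_gt) + burst(alpha_gl)     # Eq. (21)
print(theta_gll, theta_gt)   # 4.0 7.0
```

Together with a guaranteed rate α_GT, these two numbers are exactly the latency parameters needed to instantiate the LR server model.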

8. EXPERIMENTAL EVALUATION

In this section we evaluate the benefits of workload-aware shaping compared to classic shaping approaches that use access counters and prioritize the safety-critical requesters. For this we compare the latency and execution time of different applications in the BE class under the different shaping approaches. The evaluation was done using Python simulation models of a memory and the different schedulers. For the comparison we used two related shaping approaches that are capable of providing QoS guarantees, and our MCSP:

• SP: BE has the lowest priority; GT is shaped based on the number of accesses with a fixed size.
• SPrev: BE has a higher priority than GT and GLl; BE is shaped based on the number of accesses with a fixed size.
• MCSP: BE has a higher priority than GT and GLl; BE is shaped with the proposed workload-aware shaping.

In the experiments, the performance impact of the different scheduling approaches on BE traffic was investigated. For the BE traffic we obtained memory access traces from the CHStone [12] benchmark. To generate the traces, we used a model of the ARM architecture with 32-byte cache lines and a 32 kB L1 cache in the gem5 simulator [8]. Each generated trace consisted of over 2,500 accesses to the memory, inducing an average load of 5% on the memory. For each CHStone benchmark we generated 20 sets of traces for the safety-critical and other BE functions as background traffic. The traces for the safety-critical functions were chosen longer, so that all accesses of the BE function from the benchmark had to compete with them for the memory. Based on this we conducted two experiments. First, we compared the average latency and execution time of the different BE applications across the different scheduling approaches with a fixed reservation for GT. Second, we compared the execution times of setups with different reservations for GT.

In the first experiment, we reserved 30% of the resource capacity for GT traffic with an access size of Cmax. BE, together with the benchmark under observation, used 50% of the resource capacity. For BE we assumed two different access sizes, Cmin and Cmax with Cmax = 1.5 · Cmin, each using 25%. Additionally, we used a workload of 10% for sporadic activations of GL accesses (5% for GLs and 5% for GLl), each with a duration of Cmin. Thus we had an average resource usage of 90%. We compared two scenarios in this experiment. In the first scenario, the safety-critical functions fully utilized their reservations (exact reservation). In the second scenario, we assumed an over-reservation, so that only 50% of the reserved workload was used by GT; that is, the safety-critical GT functions reserved 30% but used only 15%.

Figure 4: Average latency for different applications normalized to SP with exact reservation (a) and over-reservation (b)

The results of the first experiment are shown in Figures 4 and 5. Figure 4 shows the average latencies of several applications (blowfish, sha, gsm, mips, motion, dfdiv, aes) for exact reservation (a) and over-reservation (b). For each application the results are normalized to the latency under the classical SP shaper. As can be seen, the prioritization of BE traffic in SPrev and MCSP drastically reduces the latency for BE traffic. Furthermore, the more accurate workload estimation with different access sizes (MCSP) yields better results than using a simple
access counter (SPrev). However, this improvement depends on the actual traffic patterns: for the blowfish, sha, and aes benchmarks the improvement is smaller than for applications such as mips. The reduced latencies lead to decreased execution times of the benchmarks and hence improved performance, as seen in Figure 5, which presents the normalized execution times of the different applications. The prioritization of BE in SPrev decreases the execution time by 5-10%, while the MCSP reduces it by up to 17%.

Figure 5: Average execution times normalized to SP with exact reservation (a) and over-reservation (b)

In the second experiment, we varied the workload reservations of the safety-critical requesters. We used the same average resource load as before (90%) and varied the GT reservation and load between 10% and 70%. Figure 6 shows the execution times of one benchmark (aes) for three different safety-critical loads: in the first scenario (I) we reserved 10% for GT, in the second (II) 40%, and in the third (III) 70%. All results are normalized to the time of the SP scheduler in the first scenario (I). As can be seen, the execution time increases with increasing GT load if SP with classic prioritization of the GT traffic is used. This results from the fact that BE has to wait for more GT load before it is allowed to use the resource. With the prioritization of BE in SPrev and MCSP, the execution time does not increase. In these designs the priority of BE is only degraded if too much BE load arrives at the resource. With increasing GT load, fewer BE requesters are present and thus fewer BE requests can arrive concurrently. Hence the probability that BE becomes low priority and has to wait for GT is smaller. The decrease in execution time for SPrev and MCSP that can be seen in the figure is not a direct effect of the proposed mechanism: as less BE load is present, the chance of blocking between multiple BE requesters is reduced, which leads to smaller latencies and thus execution times.

Figure 6: Average execution time for different GT reservations (I: 10%, II: 40%, III: 70%)

From the evaluation we see that exploiting the throughput and latency slack of critical applications by prioritizing non-critical over critical accesses increases the overall system performance. However, the performance benefit depends on the access patterns of the applications and the overall system load.

9. CONCLUSION

In this paper we presented a novel resource access shaping scheme for mixed-criticality systems. Unlike many existing approaches, we prioritize best-effort traffic whenever possible and use an accurate acquisition of the workload. With this we achieve improved performance for general-purpose functions while at the same time providing full guarantees to safety-critical real-time functions. We did this using monitoring techniques and typed events [33] to represent the workload induced by a group of functions on the shared resources. The monitor ensures that all guarantees are met and enables exploiting the throughput and latency slack of critical applications by prioritizing non-critical over critical accesses. We formally derived the quality-of-service guarantees of the new approach to provide safe bounds for the safety-critical functions. With this we showed that monitoring can provide latency-rate properties to a low-priority requester at a strict-priority scheduler. Our experimental evaluation revealed that the use of a more accurate workload model achieves better performance than a fixed granularity. By prioritizing BE traffic, the approach reduces the latency for general-purpose functions by up to 30% and the execution times by up to 10%. Additionally, the fine-granular workload-aware shaping further increases this benefit, reducing the execution times by up to 15% and the latencies by up to 50%. Thus workload-aware shaping allows a higher overall resource utilization while providing isolation guarantees identical to existing solutions.

10. ACKNOWLEDGMENTS

This work was funded within the ARAMiS project by the German Federal Ministry for Education and Research with the funding ID 01|S11035. The responsibility for the content remains with the authors.

11. REFERENCES

[1] AMBA Specification (Rev. 2), May 1999.
[2] L. Abeni and G. Buttazzo. Resource reservation in dynamic real-time systems. Real-Time Systems, 27(2):123–167, 2004.
[3] B. Akesson, K. Goossens, and M. Ringhofer. Predator: A predictable SDRAM memory controller. In Int'l Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), pages 251–256, Sept. 2007.
[4] B. Akesson, A. Hansson, and K. Goossens. Composable resource sharing based on latency-rate servers. In 12th Euromicro Conference on Digital System Design (DSD), pages 547–555, Aug. 2009.
[5] B. Akesson, L. Steffens, E. Strooisma, and K. Goossens. Real-time scheduling using credit-controlled static-priority arbitration. In 14th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA), pages 3–14, Aug. 2008.
[6] S. Baruah, A. Burns, and R. Davis. Response-time analysis for mixed criticality systems. In 32nd IEEE Real-Time Systems Symposium (RTSS), pages 34–43, Nov. 2011.
[7] S. Baruah, H. Li, and L. Stougie. Towards the design of certifiable mixed-criticality systems. In 16th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), pages 13–22, 2010.
[8] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood. The gem5 simulator. SIGARCH Comput. Archit. News, 39(2):1–7, Aug. 2011.
[9] A. Burns and A. Wellings. Dual priority assignment: A practical method for increasing processor utilisation. In Fifth Euromicro Workshop on Real-Time Systems, pages 48–53, June 1993.
[10] F. Dewan and N. Fisher. Efficient admission control for enforcing arbitrary real-time demand-curve interfaces. In 33rd IEEE Real-Time Systems Symposium (RTSS), pages 127–136, Dec. 2012.
[11] A. Francini and F. Chiussi. Minimum-latency dual-leaky-bucket shapers for packet multiplexers: theory and implementation. In Eighth International Workshop on Quality of Service (IWQoS), pages 19–28, 2000.
[12] Y. Hara, H. Tomiyama, S. Honda, H. Takada, and K. Ishii. CHStone: A benchmark program suite for practical C-based high-level synthesis. In ISCAS, pages 1192–1195, 2008.
[13] F. Harmsze, A. Timmer, and J. van Meerbergen. Memory arbitration and cache management in stream-based systems. In Design, Automation and Test in Europe (DATE), pages 257–262, 2000.
[14] S. Heithecker and R. Ernst. Traffic shaping for an FPGA based SDRAM controller with complex QoS requirements. In 42nd Design Automation Conference (DAC), pages 575–578, 2005.
[15] R. Henia, A. Hamann, M. Jersak, R. Racu, K. Richter, and R. Ernst. System level performance analysis - the SymTA/S approach. IEE Proceedings - Computers and Digital Techniques, 152(2):148–166, Mar. 2005.
[16] S. Hosseini-Khayat and A. D. Bovopoulos. A simple and efficient bus management scheme that supports continuous streams. ACM Trans. Comput. Syst., 13(2):122–140, May 1995.
[17] IEC 61508: Functional Safety of Electrical/Electronic/Programmable Electronic Safety Related Systems. International Electrotechnical Commission, 1999.
[18] ISO 26262:2011, Road vehicles - Functional safety. 2011.
[19] B. Jacob, S. Ng, and D. Wang. Memory Systems: Cache, DRAM, Disk. Elsevier Science, 2010.
[20] JEDEC. JESD79-2F: DDR2 SDRAM Specification, Nov. 2009.
[21] N. Muralimanohar and R. Balasubramonian. Interconnect design considerations for large NUCA caches. SIGARCH Comput. Archit. News, 35(2):369–380, June 2007.
[22] M. Neukirchner, P. Axer, T. Michaels, and R. Ernst. Monitoring of workload arrival functions for mixed-criticality systems. In 34th IEEE Real-Time Systems Symposium (RTSS), pages 88–96, Dec. 2013.
[23] J. Nowotsch and M. Paulitsch. Quality of service capabilities for hard real-time applications on multi-core processors. In 21st International Conference on Real-Time Networks and Systems (RTNS), pages 151–160, 2013.
[24] J. Nowotsch, M. Paulitsch, D. Bühler, H. Theiling, S. Wegener, and M. Schmidt. Multi-core interference-sensitive WCET analysis leveraging runtime resource capacity enforcement. Technical Report 2013-10, Informatik, 2013.
[25] M. Paolieri, E. Quiñones, and F. J. Cazorla. Timing effects of DDR memory systems in hard real-time multicore architectures: Issues and solutions. ACM Trans. Embed. Comput. Syst., 12(1s):64:1–64:26, Mar. 2013.
[26] J. Rexford, F. Bonomi, A. Greenberg, and A. Wong. Scalable architectures for integrated traffic shaping and link scheduling in high-speed ATM switches. IEEE Journal on Selected Areas in Communications, 15(5):938–950, June 1997.
[27] K. Richter. Compositional scheduling analysis using standard event models: The SymTA/S approach. PhD thesis, 2005.
[28] H. Shah, A. Raabe, and A. Knoll. Priority division: A high-speed shared-memory bus arbitration with bounded latency. In Design, Automation and Test in Europe (DATE), pages 1–4, 2011.
[29] D. Stiliadis and A. Varma. A general methodology for designing efficient traffic scheduling and shaping algorithms. In INFOCOM '97, volume 1, pages 326–335, Apr. 1997.
[30] D. Stiliadis and A. Varma. Latency-rate servers: a general model for analysis of traffic scheduling algorithms. IEEE/ACM Trans. Netw., 6(5):611–624, Oct. 1998.
[31] L. Thiele, S. Chakraborty, and M. Naedele. Real-time calculus for scheduling hard real-time systems. In ISCAS 2000, volume 4, pages 101–104, 2000.
[32] K. Tindell, A. Burns, and A. Wellings. Analysis of hard real-time communications. Real-Time Systems, 9:147–171, 1994.
[33] E. Wandeler, A. Maxiaguine, and L. Thiele. Quantitative characterization of event streams in analysis of hard real-time applications. Real-Time Syst., 29(2-3):205–225, Mar. 2005.
[34] E. Wandeler and L. Thiele. Real-time interfaces for interface-based design of real-time systems with fixed priority scheduling. In 5th ACM International Conference on Embedded Software (EMSOFT), pages 80–89, 2005.
[35] H. Yun, G. Yao, R. Pellizzoni, M. Caccamo, and L. Sha. Memory access control in multiprocessor for real-time systems with mixed criticality. In ECRTS, pages 299–308, 2012.
[36] Q. Zhao, Z. Gu, and H. Zeng. PT-AMC: Integrating preemption thresholds into mixed-criticality scheduling. In Design, Automation and Test in Europe (DATE), pages 141–146, Mar. 2013.