Minimizing Power Supply Noise Through Harmonic Mappings in Networks-on-Chip ∗



Nizar Dahir, Terrence Mak, Fei Xia and Alex Yakovlev
School of Electrical and Electronic Engineering, Newcastle University, Newcastle upon Tyne, UK

{nizar.dahir,terrence.mak,fei.xia,alex.yakovlev}@ncl.ac.uk

ABSTRACT

Power supply integrity has become a critical concern with the rapid shrinking of device dimensions and the ever increasing power consumption in nano-scale integration. In particular, power supply noise is strongly correlated with the spatial distribution of activity densities. In networks-on-chip this distribution is largely determined by on-chip communication, which dictates the power dissipation and overall system performance. In this paper, we propose a new mapping strategy that aims to create a balanced activity distribution across the whole chip. We formulate the problem of application mapping as a minimization of the activity density by employing a repulsive force-based objective function. Metrics of regional activity density and characteristics of its impact on power supply noise are considered. The proposed method has been rigorously evaluated on a large set of real-application benchmarks. A significant reduction in power supply noise can be achieved with negligible energy overhead. This new approach would provide a more scalable solution for future large-scale system integration.

Categories and Subject Descriptors
B.7.2 [Hardware]: Integrated Circuits - Design Aids

General Terms
Performance, Reliability

Keywords
Networks-on-Chip, Power Supply Noise, Application Mapping, Activity Factor, Power Density

∗Nizar Dahir is also a staff member with the University of Kufa in Najaf, Iraq, which is sponsoring his Ph.D. study at Newcastle University.
†Terrence Mak is also with the Department of Computer Science and Engineering, The Chinese University of Hong Kong.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CODES+ISSS'12, October 7-12, 2012, Tampere, Finland. Copyright 2012 ACM 978-1-4503-1426-8/12/09 ...$15.00.

1. INTRODUCTION

Power supply noise (PSN) has adverse effects on digital circuit performance and reliability. It can cause signal deterioration and create soft errors. It has been reported that variations in power supply have a significant impact on operational frequency and system power dissipation [16]. Power supply noise is caused by resistive (IR) drop and inductive (∆I) droop. The resistive voltage drop occurs mainly due to the resistance of the power delivery wires on the chip, while the inductive droop is mainly due to wire inductance in the package and chip wires and is proportional to the rate of change of current, di/dt.

Technology scaling exacerbates the problem of power supply noise for several reasons. First, large-scale integration substantially increases power density, while shrinking wires in the power network substantially increase resistance and therefore IR drop. Second, higher switching frequency increases ∆I droop. Third, lower operating voltage decreases the noise margin. As a result, the voltage drop in 65 nm technology can be as much as 30% of the supply voltage [2]. Mitigating power supply noise is thus a grand challenge for the sustainability of future large-scale integration development.

In communication-centric systems such as Networks-on-Chip (NoCs), the impact of power supply noise is even more prominent. It deteriorates performance and increases power dissipation significantly. Power supply noise causes timing inconsistencies through its impact on timing path delays, and it also causes fluctuations of clock frequency. It has been reported that a 10% change in power supply causes about a 6% change in frequency and an 18% increase in power dissipation due to the increase in leakage power [19]. Power dissipation activity in NoCs is strongly correlated with the network traffic load. This load can be determined in the early design stages, once the application is mapped to the target architecture, offering a unique opportunity to estimate power consumption with good accuracy using high-level network simulators.

In this work, we propose an NoC application mapping method which aims for power supply noise minimization. We use the metric of activity density and analyze the impact of core activity density on power supply noise. To minimize activity density, we propose tile repulsive force as a mapping objective and argue that using this objective in application mapping for NoCs results in a more scattered and thus better balanced activity distribution. We also verify, using the NoC power supply noise model, that systems resulting from this mapping have significantly lower power supply noise than energy-minimization mappings, with a low energy penalty. The main contributions of this work are:

1. It presents a new concept of optimizing power supply noise in application mapping for NoCs.
2. It considers the impact of the NoC communication workload on the power delivery network for multi-core NoC-based systems.
3. It introduces the metric of activity density and analyses the impact of its spatial patterns on power delivery. Significant relationships between the spatial distribution of cores and power supply noise are discovered.
4. It proposes tile repulsive force as a mapping objective which, in contrast to other NoC mapping strategies, spreads the highly active tiles across the chip and causes a significant drop in power supply noise with a relatively low energy overhead compared to energy-aware mapping.

The remainder of this paper is organized as follows. Section 2 presents the related work on power supply noise mitigation and NoC application mapping, together with the notations, definitions and main models used in this paper. Section 3 describes the methodology used to minimize power supply noise through application mapping in NoCs, and proposes and discusses the metrics of tile repulsive force and activity density. Results are presented and discussed in Section 4. Finally, the paper is concluded in Section 5.

2. RELATED WORK AND BACKGROUND

2.1 Related Work

Classical techniques employ on-chip decoupling capacitors to reduce power supply noise. Additional capacitors are inserted into the computational circuits to increase the load so that extra current spikes are filtered. This can be an effective technique, at the cost of additional chip area. In [6], methodologies are proposed to optimally determine the physical location of the decoupling capacitors during floor-planning. Alternatively, optimal power-gating techniques coupled with dynamic scheduling have been proposed to minimize the voltage drop caused by high-frequency logic switching in the gated blocks [21].

Power supply noise can also be tackled at the system level. In particular, workload assignment can have a significant impact on the induced power supply noise in a multi-core system. In [20], a simulated-annealing approach is employed to optimize the assignment of workloads to the cores such that the resulting power supply noise is minimized. Although the methodology has been demonstrated to be efficient in reducing the overall power supply noise, the communication activities between the cores, which can create significant power supply noise, are ignored. Networks-on-chip take up a significant portion of the overall power consumption and produce notable power supply noise around the highly communicating regions.

Previously, most application mapping techniques focused on minimizing communication delay or energy consumption. Notable examples include optimization algorithms to minimize energy consumption [11], increase performance [5], and improve system reliability [1]. The impact of power supply noise and its derived risks to an NoC system have been ignored. The problem has also been exacerbated by aggressive energy- and performance-focused optimization, because highly communicating cores are usually mapped to regions in close proximity, which leaves some regions of the chip with significantly higher activity density. More power is drawn into these areas at the expense of voltage drops and power supply noise. This unbalanced communication can be seen as an advantage for energy and performance, but it comes at the expense of system reliability.

The concept of activity density in VLSI circuits was first introduced in [15]. The author defines transition density to be the average switching rate and develops an algorithm to compute this density based on stochastic models of logic signals. The results show that higher activity density leads to higher power and ground current densities, which directly affect power supply integrity and introduce thermal hotspots. They also show that higher activity harms circuit reliability in terms of electromigration failures. This is another problem with higher activity density and further motivates balanced activity design in VLSI circuits.

In our previous work, we proposed a computational model that determines the power supply variations across NoC-based chip multi-processors (CMPs) [18]. To determine the voltage variations across the chip in the presence of a workload, both energy and traffic models are captured. This follows a characterization of the switching activities of each individual component on the chip. An approximation model, which is based on peak voltage noise estimation for the power delivery network [23], is employed in the overall noise approximation.

In this paper, we propose a method to balance both system energy efficiency and reliability by providing a balanced communication activity density across the whole chip.

2.2 Background

We now give the definitions and notations used in this paper and present a brief description of the metrics of power supply noise and energy.

2.2.1 Definitions

The NoC architecture is characterized by the Architectural Graph ARG = G(T, P), which is defined as a directed graph where each vertex t_i ∈ T represents an NoC tile and each directed arc p_{i,j} ∈ P represents the path from tile i to tile j. Each path p_{i,j} consists of a set of links L(p_{i,j}). The set L(p_{i,j}) is determined by the routing algorithm used to route packets from source to destination. On the other hand, the application communication requirements are described by the Application Graph APG = G(S, A), which is a directed acyclic graph where each vertex s_i ∈ S represents a task and each arc a_{i,j} ∈ A represents the communication from task s_i to task s_j. The quantities associated with a_{i,j} are the bandwidth requirement from s_i to s_j, b(a_{i,j}), and the data volume w(a_{i,j}). A mapping function (Ω) maps an application characterized by the APG to the target architecture characterized by the ARG.

2.2.2 Metrics for Power Supply Noise

To compute power supply noise, we adopt the definition given in [7]. Consider G to be the set of all power grid nodes. For node x ∈ G in the power delivery network, the power supply noise during a switching time period T_s is given as:

$$PSN(x) = \int_{0}^{T_s} \min\big(V_{NM} - \Delta V(x,t),\ 0\big)\, dt \qquad (1)$$

where V_NM is the noise margin and ∆V(x, t) is the power supply variation for node x at time t. We determine the total power supply noise on the chip as the summation of the power supply noise for all nodes in the power grid, i.e.

$$PSN_{tot} = \sum_{\forall x \in G} PSN(x) \qquad (2)$$

where PSN_tot is the total power supply noise on the chip.
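As an illustration of Eqs. 1 and 2, the following minimal sketch approximates the integral by a Riemann sum over sampled voltage variations. The function names, the dictionary layout of the traces and the fixed sampling step `dt` are assumptions made for this example and are not taken from the paper's tool. Taken literally, Eq. 1 yields a non-positive value whose magnitude grows with the amount by which ∆V exceeds the noise margin.

```python
def psn_node(delta_v_trace, v_nm, dt):
    """Riemann-sum approximation of Eq. 1 for one power grid node.

    delta_v_trace: sampled supply variations dV(x, t) in volts (assumed layout)
    v_nm:          noise margin in volts
    dt:            sampling step in seconds (assumed constant)
    """
    # Only samples whose variation exceeds the margin contribute (min(..., 0)).
    return sum(min(v_nm - dv, 0.0) for dv in delta_v_trace) * dt


def psn_total(traces_by_node, v_nm, dt):
    """Eq. 2: sum the per-node noise over all power grid nodes."""
    return sum(psn_node(trace, v_nm, dt) for trace in traces_by_node.values())


# Hypothetical example: two grid nodes, three samples each, 1 ns apart.
traces = {"n0": [0.05, 0.12, 0.15], "n1": [0.02, 0.03, 0.04]}
total = psn_total(traces, v_nm=0.1, dt=1e-9)
```

In this toy input only node `n0` exceeds the 0.1 V margin, so it alone contributes to the total.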

2.2.3 Energy Model

To compute the energy cost of a mapping function, we use the bit energy metric, E^bit. The average energy consumed by sending one bit from tile i to tile j, E^bit_{i,j}, is given by [22]:

$$E^{bit}_{i,j} = \big(E^{bit}_{S} + E^{bit}_{B}\big)\, n_{hops}(i,j) + E^{bit}_{L}\, \big(n_{hops}(i,j) - 1\big) \qquad (3)$$

where n_hops(i, j) is the number of hops along the path between tiles i and j, and E^bit_S, E^bit_B and E^bit_L are the energy dissipated by the crossbar, buffer and link of the router, respectively. The total energy cost of a mapping function, E_tot(Ω), is computed as:

$$E_{tot}(\Omega) = \sum_{a_{i,j} \in A} w(a_{i,j}) \cdot E^{bit}_{\Omega(i),\Omega(j)} \qquad (4)$$

where w(a_{i,j}) is the communication data volume from task s_i to task s_j in the APG. The number of hops from the source tile i to the destination tile j, n_hops(i, j), is determined by the routing algorithm. In this work the XY routing algorithm is assumed. XY is a deterministic, deadlock-free routing algorithm which routes each packet from its source along the X direction first and then along the Y direction towards its destination. A summary of all notations used in this paper is shown in Table 1.
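The energy model of Eqs. 3 and 4 can be sketched as follows for a mesh with XY routing. This is a minimal illustration under stated assumptions: tiles are identified by hypothetical (x, y) coordinates, and n_hops is taken to be the number of routers traversed (Manhattan distance plus one), which is one reading consistent with the (n_hops − 1) link term in Eq. 3. The function and parameter names are invented for the example.

```python
def n_hops(src, dst):
    """Routers traversed on an XY route between two mesh tiles (assumption:
    Manhattan distance + 1, so that the path uses n_hops - 1 links)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1]) + 1


def bit_energy(src, dst, e_s, e_b, e_l):
    """Eq. 3: energy to send one bit from tile src to tile dst."""
    hops = n_hops(src, dst)
    return (e_s + e_b) * hops + e_l * (hops - 1)


def total_energy(volumes, mapping, e_s, e_b, e_l):
    """Eq. 4: data volume of every arc times the bit energy of its route.

    volumes: {(task_i, task_j): w(a_ij)}   (assumed layout)
    mapping: {task: (x, y) tile}           (the mapping function Omega)
    """
    return sum(
        w * bit_energy(mapping[i], mapping[j], e_s, e_b, e_l)
        for (i, j), w in volumes.items()
    )


# Hypothetical example with unit-less energies to keep the arithmetic exact.
mapping = {"a": (0, 0), "b": (2, 1)}
energy = total_energy({("a", "b"): 10}, mapping, e_s=1, e_b=2, e_l=3)
```

Under these assumptions the route from (0, 0) to (2, 1) crosses four routers and three links, so a mapping that places communicating tasks closer together lowers E_tot.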

3. METHODOLOGY

Table 1: Notations used in this paper

Symbol        Description
G(T, P)       NoC architectural graph with the set of NoC tiles {T} as vertices and the set of paths {P} among these tiles as arcs.
G(S, A)       Application graph with the set of tasks {S} as vertices and the set of communications {A} among these tasks as arcs.
L(p)          The set of NoC links that constitutes a path p in the architectural graph.
Ω             Mapping function that maps the application graph to the architectural graph.
PSN_tot       Total power supply noise.
V_NM          Power supply noise margin.
E_tot(Ω)      Total energy cost of mapping Ω.
W_r           Channel width of router r.
N_r           Number of channels in router r.
F_r           Frequency of router r.
B_r(Ω)        The bandwidth capacity of router r.
B̂_r(Ω)        The bandwidth load of router r.
α_u           Activity factor of VLSI unit u.
γ_k(D)        Regional activity density of tile k considering a region of diameter D.
Γ_i           Metaphor of charge for tile i, which is a function of the tile local activity density: Γ_i = e^{k_i γ_i}, where k_i is a tuning constant.
F_{i,j}       The tile repulsive force between tiles i and j.
F_tot         The total repulsive force for all NoC tiles.

3.1 Local and Regional Activity Densities

In this work we employ a mapping strategy based on minimizing the activity density. We define the metric of local activity density considering an NoC tile k which consists of a set of functional units (floating point unit, SRAM, router, etc.). For unit u ∈ k we define the local activity density (γ_u) in terms of the unit's maximum power consumption (P_u^max), switching activity factor (α_u) and area as:

$$\gamma_u = P_u^{max} \left( \frac{\alpha_u}{area_u} \right) \qquad (5)$$

Here α_u ranges from 0 to 1 and is the ratio of the unit's average load to its maximum loading capacity. Now, considering an NoC tile k which consists of a set of units, the local activity density γ_k can be expressed as:

$$\gamma_k = \frac{\sum_{\forall u \in k} P_u^{max}\, \alpha_u}{area_k} \qquad (6)$$

where area_k is the tile area. We now define the regional activity density of an NoC tile in terms of local activity densities. Consider the activity in a region of a particular size in the vicinity of a tile of interest k. For tile k we define this region as the set of all tiles j that satisfy ‖k, j‖ ≤ D, where ‖k, j‖ is the Manhattan distance between tiles k and j and D is the region radius. Considering a regular NoC with similar tile size and architecture, the regional activity density of tile k as a function of the region radius D, γ_k(D), is expressed as the summation of the local activity densities of all tiles that lie within this region, i.e.:

$$\gamma_k(D) = \sum_{\forall j \in T,\ \|k,j\| \le D} \gamma_j \qquad (7)$$

According to this definition, the local activity density defined in Eq. 6 can also be denoted as γ_k(0). In contrast to other tile units, whose activity is determined only by the task assigned to the tile, the router activity is largely characterized by the communication demand of the application and the placement of the communicating tasks across the NoC system. Considering router r with N_r channels of width W_r and frequency F_r, the bandwidth capacity of router r, B_r, can be expressed as:

$$B_r = W_r \times N_r \times F_r \qquad (8)$$
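The local and regional activity densities of Eqs. 6 and 7 can be sketched in a few lines. The data layout is an assumption for illustration: each unit is a hypothetical (P_max, α) pair, and tiles are keyed by (x, y) mesh coordinates so that the Manhattan distance can be computed directly.

```python
def local_density(units, tile_area):
    """Eq. 6: sum of P_max * alpha over the tile's units, divided by tile area.

    units: list of (p_max, alpha) pairs, with alpha in [0, 1] (assumed layout)
    """
    return sum(p_max * alpha for p_max, alpha in units) / tile_area


def regional_density(local, k, radius):
    """Eq. 7: sum of local densities of all tiles within Manhattan
    distance `radius` of tile k (tile k itself included).

    local: {(x, y): gamma} local activity density per tile (assumed layout)
    """
    kx, ky = k
    return sum(
        gamma
        for (x, y), gamma in local.items()
        if abs(x - kx) + abs(y - ky) <= radius
    )


# Hypothetical densities on four tiles of a mesh.
local = {(0, 0): 1.0, (0, 1): 2.0, (1, 1): 4.0, (2, 2): 8.0}
gamma_0 = regional_density(local, (0, 0), 0)  # local density only
gamma_1 = regional_density(local, (0, 0), 1)  # adds the direct neighbour
gamma_2 = regional_density(local, (0, 0), 2)  # adds the diagonal tile
```

With radius 0 the function returns the tile's own local density, which matches the remark that Eq. 6 can be written as γ_k(0).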

The actual switching load of the router's logic, $\hat{B}_r$, is the data bandwidth the router is responsible for relaying. This load is determined by the application mapping function (Ω) and the routing algorithm. The router load is the summation of the loads of all router channels. For router r with a set of channels CH_r, the router load is given by:

$$\hat{B}_r = \sum_{ch \in CH_r} \hat{B}(ch) \qquad (9)$$

where $\hat{B}(ch)$ is the communication bandwidth load of channel ch. Now α_r can be readily computed as:

$$\alpha_r = \frac{\hat{B}_r}{B_r} \qquad (10)$$

3.2 Power Supply Noise Optimization Objective

Higher switching activity density leads to higher average power and current densities. Moreover, high activity density also increases peak current demand because more circuitry switches simultaneously. This causes both the average and peak IR drops to increase. The inductive (∆I) droop also increases with activity density, because a higher rate of switching leads to larger fluctuations in the current draw [15, 3, 17]. Thus, minimizing activity density will improve supply integrity and lower power supply noise. These facts are also supported by our analysis of the correlation between power supply noise and both local and regional activity densities presented in Section 4.2.

To reduce the regional activity density within a particular region, our objective function must minimize the number of high activity tiles that lie within this region (see Eq. 7). Thus, mappings which result in condensing high activity tiles must have a high cost. To define this cost, we introduce the metric of tile repulsive force F_{i,j}. As an analogy, this force can be considered similar to the repulsive force among like charges, which is given by Coulomb's law (see Figure 1(a)). Two tiles i and j repel each other with a force directly proportional to the product of their charges (a function of their local activity) and inversely proportional to the square of the distance between them. We express this force as follows:

$$F_{i,j} = K\, \frac{\Gamma_i\, \Gamma_j}{d_{i,j}^{2}} \qquad (11)$$

where Γ_i = e^{k_i γ(i)} and Γ_j = e^{k_j γ(j)}, and k_i, k_j and K are constants. The exponential function is used to impose an exponential increase in F_{i,j}, and thus a higher cost, when the local activity increases. The distance d_{i,j} considered here is the Manhattan distance between the tiles. Since the force defined by Eq. 11 decreases quadratically with distance, and to reduce the computation time required to evaluate this force during mapping optimization, we consider the repulsive force only among tiles that lie within a region of a predefined radius D (see Figure 1(b)). Thus, we compute the total of this repulsive force as:

$$F_{tot} = \sum_{\forall j \in T} \left( \sum_{\forall i \in T,\ \|i,j\| \le D} F_{i,j} \right) \qquad (12)$$

[Figure 1: Illustration of tile repulsive force. (a) Repulsive force among like charges. (b) Repulsive force between tile 0 and other tiles for different region sizes; Γ_i = e^{k_i γ_i} is the tile charge.]

A mapping function which results in a lower F_tot will have a more scattered distribution of highly active tiles. This results in lower regional activity and lower power supply noise.

3.3 Problem Formulation

We now formulate the problem of power supply noise optimization mapping in NoCs as follows.

Given: an application graph G(S, A) and an architectural graph G(T, P) that satisfy

$$|S| \le |T| \qquad (13)$$

Find: a mapping function Ω that maps each task in the APG to a tile in the ARG.

Objective:

$$\min \sum_{\forall j \in T} \left( \sum_{\forall i \in T,\ \|i,j\| \le D} F_{i,j} \right)$$

such that:

$$\Omega(s_i) \in T, \quad \forall s_i \in S \qquad (14)$$

$$\Omega(s_i) \ne \Omega(s_j), \quad \forall s_i \ne s_j \qquad (15)$$

$$B(l_k) \ge \hat{B}(l_k), \quad \forall l_k \in L(p_i),\ \forall p_i \in P \qquad (16)$$

where B(l_k) is the link bandwidth and $\hat{B}(l_k)$ is the link load, which is determined by the application mapping function and computed as:

$$\hat{B}(l_k) = \sum_{\forall a_{i,j} \in A} b(a_{i,j}) \times \pi\big(l_k,\ p(\Omega(s_i), \Omega(s_j))\big) \qquad (17)$$

where π(l, p) is 1 if l belongs to the set of links constituting path p, L(p), and 0 otherwise. The first two conditions ensure that each task is mapped onto exactly one tile and that no more than one task is mapped onto any tile, respectively. The third constraint is necessary to guarantee that a link load does not exceed its bandwidth capacity.
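To make the mapping objective concrete, the sketch below evaluates the total repulsive force of Eq. 12, reading Eq. 11 as F_{i,j} = K Γ_i Γ_j / d_{i,j}² with Γ = e^{kγ}. It is an illustrative sketch only: tiles are keyed by hypothetical (x, y) coordinates, a single tuning constant `k` is shared by all tiles, and a tile is assumed not to repel itself (the i = j term, where the distance is zero, is skipped).

```python
import math


def charge(gamma, k=1.0):
    """Tile 'charge' Gamma = exp(k * gamma); k is a tuning constant."""
    return math.exp(k * gamma)


def total_repulsive_force(local, radius, big_k=1.0, k=1.0):
    """Eq. 12: sum F_ij over all ordered tile pairs within `radius`.

    local: {(x, y): gamma} local activity density per tile (assumed layout)
    Each unordered pair is counted twice, as in the double sum of Eq. 12.
    """
    total = 0.0
    for j, gamma_j in local.items():
        for i, gamma_i in local.items():
            if i == j:
                continue  # assumption: no self-repulsion (distance would be 0)
            dist = abs(i[0] - j[0]) + abs(i[1] - j[1])  # Manhattan distance
            if dist <= radius:
                total += big_k * charge(gamma_i, k) * charge(gamma_j, k) / dist**2
    return total


# Two hypothetical placements of the same two active tiles (gamma = 1.0).
clustered = {(0, 0): 1.0, (0, 1): 1.0, (3, 3): 0.0, (3, 2): 0.0}
scattered = {(0, 0): 1.0, (3, 3): 1.0, (0, 1): 0.0, (3, 2): 0.0}
f_clustered = total_repulsive_force(clustered, radius=2)
f_scattered = total_repulsive_force(scattered, radius=2)
```

In this toy case the scattered placement yields the lower F_tot, which is the behaviour the objective is meant to reward.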

3.4 Simulated Annealing-Based Solution

The application mapping problem in NoCs is known to be NP-hard [11]. For an NoC of size n × m there are (n × m)! possible mappings. In this work we use a simulated annealing-based solution. Simulated annealing helps to avoid being trapped at local minima, because the temperature function gives the search an opportunity to jump out of such minima. It is suitable for NP-hard problems where a suboptimal solution is acceptable, which is the case in this work. The pseudo code of this solution is shown in Algorithm 1.

Algorithm 1: Pseudo code of the simulated annealing-based solution
Define: Temp_f: final temperature; β: cooling rate; N(t): the set of neighbours of tile t.
Input: APG: application graph G(S, A); ARG: architectural graph G(T, P).
Output: Ω, the resulting mapping function.
Initialize: Temp_c = INI_TEMP, Ω = RAND_MAPPING.
 1: Ωp ← Ω
 2: based on Ωp, compute γ_t ∀t ∈ T
 3: compute F^p_tot using Eq. 12
 4: while Temp_c ≥ Temp_f do
 5:    t1 ← arg max_t (γ_t), ∀t ∈ T
 6:    choose t2 from N(t1) randomly with equal probability
 7:    Ωp(Ω^{-1}(t1)) = t2
 8:    Ωp(Ω^{-1}(t2)) = t1
 9:    if bandwidth constraints for Ωp are satisfied then
10:       based on Ωp, compute γ_t ∀t ∈ T
11:       compute F_tot using Eq. 12
12:       ∆F = F_tot − F^p_tot
13:       if (∆F < 0) then
14:          Ω ← Ωp
15:          F^p_tot ← F_tot
16:       else
17:          if e^{−∆F/Temp_c} > rand() then
18:             Ω ← Ωp
19:             F^p_tot ← F_tot
20:          end if
21:       end if
22:    end if
23:    Temp_c = β Temp_c
24: end while
25: RETURN Ω

The algorithm takes an application graph (APG) and an architectural graph (ARG) as inputs. The final temperature Temp_f and the cooling rate β also need to be defined at the beginning of the algorithm. Then, both the mapping function and the temperature (Temp_c) are set to their initial values. In lines 1-3, based on the initial mapping function (Ωp), the activity densities of all tiles (γ_t ∀t ∈ T) and the objective function are computed; F^p_tot holds the objective value of the most recently accepted mapping, while F_tot is the value for the candidate mapping. The parameters Temp_c, Temp_f and β are tuned experimentally.

The initial temperature Temp_c controls the randomness of exploration at the start of the optimization and needs to be chosen carefully to enable good exploration of the solution space. The cooling rate β controls the speed of convergence. It needs to be chosen to allow convergence slow enough for a good exploration of the solution space, but not slower than necessary.

The main optimization loop is at lines 4-24. At lines 5-8 the new mapping (or neighbouring state) is generated by first choosing the tile with the highest activity density, and then randomly choosing another tile from its neighbours. The tasks assigned to these tiles are then exchanged (lines 7 and 8). By always modifying the tile with the highest activity, we ensure that the next mapping has a different hotspot, which helps in exploring the solution space and makes convergence faster. The condition at line 9 checks whether the new mapping satisfies the bandwidth constraint stated in Eq. 16, with link loads computed by Eq. 17. If this constraint is not satisfied, the new mapping is ignored and the algorithm goes directly to the next iteration. If the bandwidth constraint is met, the new activity densities and the objective function are computed for this new mapping (lines 10 and 11). In lines 13-15 the new mapping is accepted as the new state if it has a lower value of the objective function (i.e. ∆F < 0); if not, this mapping can still be accepted with a probability of e^{−∆F/Temp_c} (lines 17-19). Then the current temperature is reduced by the cooling rate and the loop starts again.
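The loop of Algorithm 1 can be sketched in Python as follows. This is a hedged illustration rather than the authors' implementation: the mapping is stored as a hypothetical tile-to-task dictionary (the inverse of Ω), the density, objective and bandwidth checks are passed in as caller-supplied functions, and each candidate is built from the last accepted mapping, so a rejected swap is simply discarded. All names and default parameter values are assumptions.

```python
import math
import random


def mesh_neighbours(tile, cols, rows):
    """Tiles adjacent to `tile` on a cols x rows mesh."""
    x, y = tile
    candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [(a, b) for a, b in candidates if 0 <= a < cols and 0 <= b < rows]


def anneal(mapping, cols, rows, density, objective, feasible,
           temp=1.0, temp_final=1e-3, beta=0.95, rng=None):
    """Simulated annealing over task placements.

    mapping:   {tile: task} with every tile present (assumed layout)
    density:   function(mapping) -> {tile: gamma}
    objective: function(mapping) -> cost, e.g. the total repulsive force
    feasible:  function(mapping) -> bool, e.g. the link bandwidth check
    """
    rng = rng or random.Random(0)
    current = dict(mapping)
    f_current = objective(current)
    while temp >= temp_final:
        gamma = density(current)
        hottest = max(gamma, key=gamma.get)           # tile with highest activity
        other = rng.choice(mesh_neighbours(hottest, cols, rows))
        candidate = dict(current)                      # swap the two tasks
        candidate[hottest], candidate[other] = current[other], current[hottest]
        if feasible(candidate):
            f_new = objective(candidate)
            delta = f_new - f_current
            # Accept improvements always, worse moves with prob. exp(-delta/temp).
            if delta < 0 or math.exp(-delta / temp) > rng.random():
                current, f_current = candidate, f_new
        temp *= beta                                   # geometric cooling
    return current, f_current
```

A caller would pass, for example, a density function derived from the traffic model and the total repulsive force as the objective; the returned mapping is the last accepted state, which is not guaranteed to be the best state visited.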

4. EXPERIMENTAL RESULTS

4.1 Simulation Setup and Tools

To compute power supply noise in NoCs, we integrate an NoC power and area model [12], a traffic simulator [9] and a power delivery network model [23]. This computational model takes application characteristics as input and employs a mapping function to assign tasks to the NoC architecture. This function determines the traffic distribution, which is input to the NoC simulator to compute the communication workload across the chip. This work is not limited to a particular NoC topology or architecture. However, for convenience we describe it in the context of a regular mesh NoC. We also assume CMPs that run concurrent parallel tasks.

To determine supply voltage variations across the chip in the presence of a workload, two models are needed. The first characterizes the switching workload of the different components on the chip. The second models the power delivery network, which delivers power from the power source to these components. We employed a fast power grid model which computes the peak voltage noise at the nodes across the power delivery network [23].

For the power delivery network (PDN), we included both off-chip and on-chip power delivery network models. The on-chip PDN consists of a global-level mesh structure routed in the top metal layers. In this work we used a lumped model of the power delivery network. The on-chip power network is modelled as an RLC mesh with a grid segment length chosen to give a 5 × 5 granularity per NoC tile. This granularity was shown to be enough for capturing the power supply voltage variations across the chip [10]. The RLC values for the grid segments were determined using PTM [4]. We employed a SPICE netlist of this PDN.

We integrated a cycle-accurate NoC simulator [9] and a router power model [12] to compute the workload. The floorplan and architecture of Intel's TeraFlop tile are used. The power traces of the tile's computational units are estimated using the results presented in [20]. Integrating these models enables us to see the temporal and spatial voltage variations as a direct function of the activity across the chip. The tool accuracy was also evaluated against SPICE simulation and an average error of only 1.4% was reported [18].

4.2 Activity Density and Power Supply Noise

To illustrate the significance of the problem and the motivation behind our power supply noise-aware mapping in NoCs, we start by analyzing 100 random mappings for both the MMS and the VOPD benchmarks [14]. For each of these mappings we computed the resulting voltage variations. A snapshot of this simulation for the VOPD application is shown in Figure 2. We also compute the resulting total power supply noise (see Eq. 2) and the total energy E_tot (see Eq. 4) that result from each mapping.

[Figure 2: A snapshot of voltage variations during the VOPD benchmark simulation. Axes: grid X, grid Y, node voltage (V); Tiles 0, 1 and 15 are labelled.]

The ranges of both PSN and energy for both benchmarks are shown in Figure 3. It can be seen that the range of PSN (see Figure 3(b)) is much larger than the range of energy (see Figure 3(a)). This implies that mapping has a significant impact on power supply noise, and that this impact is significantly higher than its impact on energy. This suggests that considering PSN in application mapping for NoCs would have a high impact on the power supply noise of the resulting system.

[Figure 3: Random mappings energy and power supply noise ranges for two benchmarks (MMS and VOPD). (a) Energy range: minimum and maximum total energy E_tot (mJ), with bar annotations 34.7% and 41.6%. (b) Power supply noise range: minimum and maximum PSN (V·s), with bar annotations 91.5% and 98.2%. Different mappings can have a significant difference in power supply noise. This fact is exploited to optimize power supply noise through mapping.]

Using the power supply noise results of these random mappings, we also investigated how spatial patterns of regional activity density affect power supply noise. In particular, we investigated the impacts of tile local activity, γ_t(0), and regional activity, γ_t(D) (for two region sizes D = 1 and D = 2), on power supply noise. Figure 4 plots the correlation coefficient of total power supply noise with both the tile local and regional activities (for D = 1 and D = 2) for the VOPD benchmark. The noise margin V_NM (see Eq. 1) is varied from 20 mV to 120 mV with a step of 10 mV. It can be noticed that for high V_NM (closer to V_DD), noise is highly dependent on the local activity γ(0). When the noise margin decreases, the noise dependence on both local and regional activities decreases. However, the correlation of the regional activity (γ(D) for D > 0) with power supply noise does not decrease as rapidly as that of the local activity (γ(0)). As a result, at some point, regional activity dominates local activity as a source of noise. This change in the roles of local and regional activity can be explained by the fact that when the noise margin increases, local activity alone is not enough to cause the voltage to drop below this margin. At this point regional activity starts to play a larger role in introducing noise. Also, we can see that regional activity for D = 2 has a slightly higher impact on noise than regional activity for D = 1. The same trend was found for the MMS benchmark. This implies that the higher the noise margin is, the larger the region size that needs to be considered for noise optimization.

Previous works on application mapping for NoCs considered different objectives, but no technique took the activity density of the resulting system into consideration. Even worse, some techniques (energy-aware mapping, for instance) usually result in higher activity density in some regions of the chip. This results from placing highly communicating tasks close to each other. Consequently, hotspots arise in these regions, which cause the chip to experience unbalanced switching activity, higher power supply noise, and increased temperature. This in turn leads to lower system reliability.
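The correlation analysis of this section relates, across the random mappings, the total power supply noise to an activity measure such as γ(0), γ(1) or γ(2). The text does not state which correlation coefficient is used; the minimal sketch below assumes Pearson's coefficient, and the sample values are hypothetical placeholders, not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical per-mapping samples: peak regional activity vs. total noise.
activity_per_mapping = [0.10, 0.14, 0.18, 0.25]
noise_per_mapping = [1.0, 1.3, 1.9, 2.4]
r = pearson(activity_per_mapping, noise_per_mapping)
```

Repeating this computation for each noise margin value would give one curve per activity measure, in the manner of Figure 4.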

4.3 Application Topology and Activity Density

To better understand the dependence of activity density on application topology, we characterize applications by two parameters: connectivity and bandwidth. The connectivity of the application tasks is defined as the ratio of the existing connections between tasks to the number of all possible connections. Given an application graph (APG) G = G(S, A), connectivity is defined as:

$$Connectivity = \frac{|A|}{2 \binom{|S|}{2}} \qquad (18)$$

while the average bandwidth (BW) is defined as:

$$BW = \frac{\sum_{\forall a \in A} b(a)}{|A|} \qquad (19)$$

[Figure 4: The power supply noise correlation to activity density. Results from 100 random mappings for the VOPD benchmark. Axes: correlation to PSN versus V_NM (mV); curves: local activity γ(0), regional activity γ(1), regional activity γ(2). Significant correlation between power supply noise and activity density can be seen, which is higher for regional activity than local activity.]

To analyse how both parameters influence activity density (see Eq. 7), random APGs with a range of connectivities (2% to 10%) and bandwidths (20-100 MB/s) are generated using TGFF [8] with 64 tasks. For each graph, 1000 random mappings to an 8 × 8 NoC are generated. The average activity density over these 1000 random mappings is plotted against both metrics in Figure 5. It can be seen that activity density increases with both bandwidth requirements and connectivity. Moreover, the higher the connectivity, the more sharply the activity density increases with the bandwidth.

To determine the impact of these parameters together with mapping on regional activity density, we characterize the impact of mapping by computing the standard deviation (STD) of activity density over the 1000 random mappings. Higher values of STD reflect a higher impact of mapping on activity density. Figure 5(b) plots STD against the APG topology parameters (connectivity and average bandwidth). It can be noticed that for applications with lower connectivity the impact of mapping on activity density is higher. This can be explained by the fact that a lower connectivity leaves more room for the mapping technique to manoeuvre and redistribute the communication activities across the chip. As a result, in applications with highly connected tasks the regional activity does not vary with mapping as much as in applications with low connectivity. It can also be observed that application bandwidth has nearly no impact on the STD of regional activity, since this trend is nearly the same for different bandwidths.

[Figure 5: Average and STD of activity density for application graphs with different bandwidths and connectivities. (a) Average of activity density (W/mm²) versus bandwidth (MB/sec). (b) STD of activity density (%) versus bandwidth (MB/sec). Curves: conn. = 10%, 15%, 20%, 25%. 1000 random mappings are generated for each graph. Activity density increases with both connectivity and bandwidth. However, mapping has a higher impact on activity density for applications with lower connectivity.]

4.4 Mapping Results

To evaluate the proposed power supply noise minimization mapping, we used six real benchmarks with different sizes, topologies and bandwidth requirements. These comprise a generic complex multimedia system (MMS) that includes an H.263 video encoder and an MP3 audio decoder [11], a telecommunication benchmark (TELE) and a Video Object Plane Decoder (VOPD) [14], together with three further benchmarks, AMI49, AMI25 and an MPEG4 decoder, taken from [1]. The size and communication bandwidth requirements of these benchmarks are listed in the first four columns of Table 2. Using simulated annealing-based optimization, these benchmarks were first mapped onto their target NoC architectures with the objective of minimizing the total repulsive force Ftot defined in Eq. 12. To evaluate each mapping in terms of power supply noise, we used the tool described in subsection 4.1, assuming 65 nm technology with a nominal supply voltage VDD = 1 V and a 1 GHz clock frequency. We compared our noise-aware mappings with energy-aware mappings in terms of both power supply noise and energy. Since the power supply varies on a slower time scale than the clock [13], we used a time window of 100 cycles (0.1 µs) for computing the power supply variations [18]. The simulation runs for 10,000 windows (one million clock cycles) for each mapping. We computed power supply noise using Eq. 1, assuming the noise margin VNM to be 10% of the nominal VDD (i.e. 0.1 V).

An example illustrating the impact of our repulsive force minimization mapping can be seen in Figure 6, which shows the spatial activity density and power supply noise distributions resulting from both energy and repulsive force minimization for the AMI49 benchmark. Energy-aware mapping clusters highly active tiles together, producing regions of high activity on the chip (see Figure 6(a)). This results in hotspots that experience high power supply noise (see Figure 6(c)). In contrast, our repulsive force minimization mapping results in a scattered and more homogeneous activity distribution (see Figure 6(b)), which, in turn, reduces the power supply noise significantly (by 76% for this example), as depicted in Figure 6(d).

[Figure 6 panels: (a) activity density γ (W/mm²) for the min. Etot mapping; (b) activity density γ (W/mm²) for the min. Ftot mapping; (c) power supply noise (V·s) for the min. Etot mapping; (d) power supply noise (V·s) for the min. Ftot mapping. Panels (a) and (b) are plotted over tile X and tile Y; panels (c) and (d) over grid X and grid Y.]

Figure 6: The spatial distribution of activity density and power supply noise for the AMI49 application. Significant reduction in power supply noise is achieved by harmonic distribution of activity.

A summary of the results for both power supply noise and energy, using both mapping strategies for the six benchmarks, is shown in Table 2. These results are also depicted graphically in Figure 7. Our repulsive force minimization resulted in significant drops in power supply noise (72.5% of the noise is removed on average) with low energy penalties (about 4% on average). This implies that considering regional activity in NoC mapping is very useful and can result in systems with lower noise and better power supply integrity.

It can also be noticed that, in general, the difference in energy between energy-aware mapping and repulsive-force-based mapping is relatively small (for example, 1.1% and 1.5% for the TELE and MPEG4 benchmarks, respectively). This can be explained by the fact that Ftot, defined in Eq. 12, can be reduced by increasing the distance between highly active tiles, by lowering the local activities of the tiles, or both. A mapping function that results in lower total switching activity will also result in lower total energy (see Eq. 4). Thus, using Ftot as the objective for the mapping function yields a sub-optimal solution for energy minimization. This is also verified experimentally by the very high correlation found between Ftot computed using Eq. 12 and Etot computed using Eq. 4 for the mappings mentioned in Section 4.2.
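The simulated annealing-based mapping used in this section can be illustrated with the following minimal Python sketch. It rests on several assumptions that are not taken from the paper: the objective `total_repulsive_force` uses a Coulomb-like form (the product of two tile activities divided by their squared distance), which is only one plausible reading of a repulsive force and is not claimed to be Eq. 12; each task's activity is treated as a fixed number, whereas in a NoC it also depends on routing; and the cooling schedule and parameter names are hypothetical.

```python
import math
import random


def total_repulsive_force(activity, mapping, width):
    """Assumed Coulomb-like objective (not necessarily Eq. 12):
    sum over task pairs of a_i * a_j / d_ij^2 on a width x width mesh."""
    total = 0.0
    for i in range(len(mapping)):
        xi, yi = mapping[i] % width, mapping[i] // width
        for j in range(i + 1, len(mapping)):
            xj, yj = mapping[j] % width, mapping[j] // width
            dist_sq = (xi - xj) ** 2 + (yi - yj) ** 2
            total += activity[i] * activity[j] / dist_sq
    return total


def anneal_mapping(activity, width, rng, initial=None,
                   t_start=1.0, t_end=1e-3, alpha=0.95, moves_per_temp=50):
    """Simulated annealing over task-to-tile permutations using swap moves.
    `initial`, if given, must be a permutation of all width*width tiles."""
    n_tiles = width * width
    if len(activity) > n_tiles:
        raise ValueError("more tasks than tiles")
    # Pad with zero-activity entries so empty tiles can take part in swaps.
    acts = list(activity) + [0.0] * (n_tiles - len(activity))
    if initial is not None:
        current = list(initial)
    else:
        current = rng.sample(range(n_tiles), n_tiles)
    cost = total_repulsive_force(acts, current, width)
    best, best_cost = list(current), cost
    scale = cost  # temperatures are expressed relative to the initial cost
    temperature = t_start * scale
    while temperature > t_end * scale:
        for _ in range(moves_per_temp):
            a, b = rng.sample(range(n_tiles), 2)
            current[a], current[b] = current[b], current[a]
            new_cost = total_repulsive_force(acts, current, width)
            delta = new_cost - cost
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                cost = new_cost
                if cost < best_cost:
                    best, best_cost = list(current), cost
            else:
                current[a], current[b] = current[b], current[a]  # undo swap
        temperature *= alpha  # geometric cooling
    return best, best_cost


if __name__ == "__main__":
    # Four highly active tasks and twelve lightly active ones on a 4 x 4 mesh.
    activity = [10.0] * 4 + [1.0] * 12
    clustered = list(range(16))  # hot tasks start side by side in one row
    mapping, force = anneal_mapping(activity, 4, random.Random(0), initial=clustered)
    print("initial force:", total_repulsive_force(activity, clustered, 4))
    print("final force:  ", force)
    print("hot task tiles:", mapping[:4])
```

Because the best mapping seen so far is always kept, the returned force can never exceed that of the starting mapping. Under the assumed objective, highly active tasks tend to be pushed apart, which mirrors the scattered distribution described for Figure 6(b).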

                                                  Etot optimization     Ftot optimization
no.  benchmark        NoC    min/max BW     energy  PSN         energy  PSN         energy   PSN
     (no. of tasks)   size   (MB/s)         (mJ)    (V·s)       (mJ)    (V·s)       penalty  reduction
1    AMI49 (49)       7×7    5.3/85.3       6.44    3.427e-8    6.84    8.215e-9    6.2%     76%
2    AMI25 (25)       5×5    53.3/213.3     3.21    1.046e-8    3.32    2.353e-9    3.3%     78%
3    MMS (25)         5×5    0.025/116.8    5.18    1.416e-8    5.52    5.082e-9    6.1%     65%
4    TELE (16)        4×4    11/71          7.90    4.984e-9    7.98    1.407e-9    1.1%     72%
5    VOPD (16)        4×4    16/500         26.2    6.816e-8    28.3    1.204e-8    7.4%     83%
6    MPEG4 (9)        3×3    8.5/502        26.1    6.846e-8    26.6    2.701e-8    1.5%     61%
     AVERAGE                                                                        4.26%    72.5%

Table 2: Summary of the benchmarks and mapping results.

[Figure 7 panels: (a) total energy consumption, showing the total energy Etot (mJ) of the energy-minimizing and PSN-minimizing mappings and the energy difference (%) for each benchmark; (b) resulting total power supply noise, showing the PSN (V·s) of both mappings and the PSN reduction (%) for each benchmark.]

Figure 7: Comparison of both power supply noise and energy optimizations. Our noise minimization achieves a significant reduction in power supply noise compared to energy minimization, with a low energy penalty.

5. CONCLUSIONS

Power supply noise has a significant impact on system reliability. In particular, on-chip communication can easily incur power supply noise without proper isolation. We propose a new mapping strategy which aims at reducing power supply noise in inter-core communication by optimizing the activity distribution. We found that different mappings can result in significant variation in power supply noise (up to 98.2%). A new metric based on communication activity density, which has a direct impact on power supply noise, has been developed. We integrate this metric into a new mapping algorithm that employs a repulsive force-based strategy. This strategy leads to a balanced distribution of activities across the chip and, hence, to lower power supply noise. The new mapping strategy can achieve a significant reduction in power supply noise (up to 83%) with a negligible energy penalty (about 4% on average). This work enables better power supply integrity and reliable power delivery for future many-core systems.
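The headline figures quoted above can be cross-checked against the per-benchmark percentages reported in the last two columns of Table 2. The short Python sketch below performs only that arithmetic; the variable names are illustrative, and the numbers are copied from the table rather than recomputed from the underlying energy and PSN values.

```python
from statistics import mean

# Per-benchmark percentages as reported in the last two columns of Table 2.
benchmarks = ["AMI49", "AMI25", "MMS", "TELE", "VOPD", "MPEG4"]
energy_penalty = [6.2, 3.3, 6.1, 1.1, 7.4, 1.5]  # percent
psn_reduction = [76, 78, 65, 72, 83, 61]         # percent

avg_penalty = mean(energy_penalty)
avg_reduction = mean(psn_reduction)
largest = max(psn_reduction)
best = benchmarks[psn_reduction.index(largest)]

print(f"average energy penalty: {avg_penalty:.2f}%")
print(f"average PSN reduction:  {avg_reduction:.1f}%")
print(f"largest PSN reduction:  {largest}% ({best})")
```

The mean PSN reduction evaluates to 72.5% and the largest reduction, 83%, belongs to VOPD. The mean energy penalty is 25.6/6, roughly 4.27%, which Table 2 lists as 4.26% and the text summarizes as about 4%.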

6. REFERENCES

[1] C. Ababei, H. S. Kia, O. P. Yadav, and J. Hu. Energy and reliability oriented mapping for regular networks-on-chip. In Networks on Chip (NoCS), 2011 Fifth IEEE/ACM International Symposium on, pages 121–128, 2011.
[2] A. H. Ajami, K. Banerjee, and M. Pedram. Scaling analysis of on-chip power grid voltage variations in nanometer scale ULSI. Analog Integrated Circuits and Signal Processing, 42(3):277–290, 2005.
[3] K. Arabi, R. Saleh, and X. Meng. Power supply noise in SoCs: Metrics, management, and measurement. Design & Test of Computers, IEEE, 24(3):236–244, May-June 2007.
[4] Y. Cao, T. Sato, D. Sylvester, M. Orshansky, and C. Hu. Predictive technology model. Nanoscale Integration and Modeling Group, Arizona State University, http://ptm.asu.edu/, 2006.
[5] E. Carvalho and F. Moraes. Congestion-aware task mapping in heterogeneous MPSoCs. In System-on-Chip, 2008. SOC 2008. International Symposium on, pages 1–4, 2008.
[6] H. H. Chen and D. D. Ling. Power supply noise analysis methodology for deep-submicron VLSI chip design. In Design Automation Conference, 1997. Proceedings of the 34th, pages 638–643, 1997.
[7] A. R. Conn, R. A. Haring, and C. Visweswariah. Noise considerations in circuit optimization. In Proceedings of the 1998 IEEE/ACM International Conference on Computer-Aided Design, pages 220–227. ACM, 1998.
[8] R. Dick, D. Rhodes, and W. Wolf. TGFF: task graphs for free. In Hardware/Software Codesign, 1998. (CODES/CASHE '98) Proceedings of the Sixth International Workshop on, pages 97–101, Mar. 1998.
[9] F. Fazzino, M. Palesi, and D. Patti. Noxim: Network-on-chip simulator. URL: http://sourceforge.net/projects/noxim, 2008.
[10] M. S. Gupta, J. L. Oatley, R. Joseph, G.-Y. Wei, and D. M. Brooks. Understanding voltage variations in chip multiprocessors using a distributed power-delivery network. In Design, Automation and Test in Europe Conference and Exhibition, 2007. DATE '07, pages 1–6, 2007.
[11] J. Hu and R. Marculescu. Energy-aware mapping for tile-based NoC architectures under performance constraints. In Design Automation Conference, 2003. Proceedings of the ASP-DAC 2003. Asia and South Pacific, pages 233–239, 2003.
[12] A. B. Kahng, B. Li, L. S. Peh, and K. Samadi. Orion 2.0: A power-area simulator for interconnection networks. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, PP(99):1–5, 2011.
[13] S. Kvatinsky, E. Friedman, A. Kolodny, and L. Schächter. Power grid analysis based on a macro circuit model. In Electrical and Electronics Engineers in Israel (IEEEI), 2010 IEEE 26th Convention of, pages 000708–000712, Nov. 2010.
[14] S. Murali and G. De Micheli. Bandwidth-constrained mapping of cores onto NoC architectures. In Design, Automation and Test in Europe Conference and Exhibition, 2004. Proceedings, volume 2, pages 896–901, 2004.
[15] F. Najm. Transition density: a new measure of activity in digital circuits. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 12(2):310–323, Feb. 1993.
[16] S. R. Nassif. Power grid analysis benchmarks. In Design Automation Conference, 2008. ASPDAC 2008. Asia and South Pacific, pages 376–381, 2008.
[17] S. Nithin, G. Shanmugam, and S. Chandrasekar. Dynamic voltage (IR) drop analysis and design closure: Issues and challenges. In Quality Electronic Design (ISQED), 2010 11th International Symposium on, pages 611–617, Mar. 2010.
[18] N. Dahir, T. Mak, and A. Yakovlev. Communication centric on-chip power grid models for networks-on-chip. In VLSI and System-on-Chip (VLSI-SoC), 2011 IEEE/IFIP 19th International Conference on, pages 180–183, 2011.
[19] M. Saint-Laurent and M. Swaminathan. Impact of power-supply noise on timing in high-frequency microprocessors. IEEE Transactions on Advanced Packaging, 27(1):135–144, 2004.
[20] S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, P. Iyer, A. Singh, and T. Jacob. An 80-tile 1.28 TFLOPS network-on-chip in 65nm CMOS. In Solid-State Circuits Conference, 2007. ISSCC 2007. Digest of Technical Papers. IEEE International, pages 98–589. IEEE, 2007.
[21] Y. Wang, J. Xu, Y. Xu, W. Liu, and H. Yang. Power gating aware task scheduling in MPSoC. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, (99):1–12, 2010.
[22] T. Ye, L. Benini, and G. De Micheli. Analysis of power consumption on switch fabrics in network routers. In Design Automation Conference, 2002. Proceedings. 39th, pages 524–529, 2002.
[23] L. R. Zheng and H. Tenhunen. Fast modeling of core switching noise on distributed LRC power grid in ULSI circuits. In Electrical Performance of Electronic Packaging, 2000, IEEE Conference on, pages 307–310, 2000.