Power-Performance Study of Block-Level Monolithic 3D-ICs Considering Inter-Tier Performance Variations

Shreepad Panth†, Kambiz Samadi§, Yang Du§, and Sung Kyu Lim†

† School of ECE, Georgia Institute of Technology, Atlanta, GA
§ Qualcomm Research, San Diego, CA

{spanth,limsk}@ece.gatech.edu

ABSTRACT
In this paper we study the power vs. performance tradeoff in block-level monolithic 3D IC designs. Our study shows that we can close the power-performance gap between 2D and a theoretical lower bound by up to 50%. We model the inter-tier performance variations caused by a low temperature manufacturing process on the non-bottom tiers. We also model an alternate manufacturing process, where highly resistive tungsten interconnects are used on the bottom tier to withstand a high temperature process on the non-bottom tiers. We propose a variation-aware floorplanning technique that makes our design more tolerant to these variations. We demonstrate that our design methods can help us obtain high quality designs even under inter-tier performance variations.

Categories and Subject Descriptors
B.8.2 [Performance and Reliability]: Performance Analysis and Design Aids

Keywords
Monolithic 3D; Block-level; Inter-tier variation

Figure 1: The fabrication process of monolithic 3D ICs [1]. (a) The bottom tier is created the same way as 2D-ICs. (b,c,d) Attachment of a thin layer of silicon to the top of the bottom tier. (e) FEOL of the top tier and creation of MIVs and top-tier contacts. (f) BEOL processing of the top tier.

1. INTRODUCTION

Three dimensional integrated circuits (3D ICs) have emerged as a promising solution to extend the 2D scaling trajectory predicted by Moore’s Law. Currently, through-silicon vias (TSVs) enable 3D ICs, allowing vertical stacking of multiple dies fabricated separately. An emerging alternative is monolithic 3D, which enables orders of magnitude higher integration density due to the extremely small size of the monolithic inter-tier vias (MIVs). In monolithic 3D integration technology, one fabricates two or more tiers of devices sequentially, instead of bonding pre-fabricated dies. This eliminates the need for die alignment, enabling smaller via sizes. Overall, monolithic 3D ICs offer several advantages over traditional 3D ICs: (1) the small size of MIVs enables ultra-high integration density, considerably reducing silicon area and cost, (2) the significantly reduced MIV parasitics help improve the power-performance envelope, and (3) the manufacturing process is entirely foundry-driven, and does not involve a packaging house for the processing of backside redistribution layers and micro-bumps. This enables tighter process control, potentially leading to a faster ramp-up once the technology is mature.

The fabrication process for monolithic 3D ICs is shown in Figure 1 [1]. First, the bottom tier is fabricated similar to a conventional 2D-IC (Figure 1(a)). Next, a thermal oxide is grown on an empty wafer, and H+ ions are implanted just below the silicon surface at a constant depth (Figure 1(b)). This empty wafer is then flipped and bonded to the top of the bottom tier using low temperature molecular bonding (Figure 1(c)). The silicon is then sheared off at the H+ ion line and polished to give a high quality top silicon layer (Figure 1(d)). The gates are formed on the top tier, and the MIVs are created with the contact mask of the top tier (Figure 1(e)). Finally, the metallization of the top tier is created as usual (Figure 1(f)). The two device tiers are connected by extremely small inter-tier vias (< 100 nm diameter) [1].

Prior works on monolithic 3D ICs include (a) transistor-level logic designs [2, 4], where transistors in individual gates are split into multiple tiers, and (b) block-level designs [6], where 2D functional blocks are floorplanned onto multiple tiers and connected using MIVs. However, all of these prior works assume that both the top and bottom tier have equal performance, which can only be achieved once the technology is mature. Due to process limitations outlined in Section 2, monolithic 3D ICs will have either degraded transistors in the top tier or degraded interconnects in the bottom tier. There has been no prior work on quantifying which option gives a better power-performance envelope, and this paper aims to resolve this question. In order to do this, we implement several block-level designs, and compare the power-performance envelope of 2D-ICs with monolithic 3D ICs under inter-tier process variation in either devices or interconnects.

The contributions of this work are as follows: (1) We first evaluate the power-performance envelope of monolithic 3D ICs assuming a mature process, and show that it closes the gap between 2D and the ideal block-level implementation by up to 50%. (2) We present a methodology to evaluate the power-performance envelope of a monolithic 3D IC under inter-tier variations. (3) We present an inter-tier variation-aware floorplanning scheme that improves the power and performance by up to 10.6% and 12.6% respectively. (4) We demonstrate that degraded interconnects are preferable to degraded transistors, and that we can still close the gap to the ideal block-level implementation by up to 50% w.r.t. performance and 36% w.r.t. power.

This work is supported by Qualcomm Research.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
DAC’14, June 01–05 2014, San Francisco, CA, USA. Copyright 2014 ACM 978-1-4503-2730-5/14/06 ... $15.00. http://dx.doi.org/10.1145/2593069.2593188.

Table 1: The change in resistivity values of different metal layers in the Nangate 45nm library due to tungsten interconnects.

Layer               Width (nm)   Thickness (nm)   ρ(W) / ρ(Cu)
Metal1 - Metal3     70           140              2.38
Metal4 - Metal6     140          280              2.67
Metal7 - Metal8     400          800              2.94
Metal9 - Metal10    800          2000             3.04

2. INTER-TIER VARIATION

Monolithic 3D ICs differ from TSV-based 3D ICs in that tiers are fabricated sequentially. The devices and interconnects of the top tier are fabricated on top of an already existing front end-of-line (FEOL) and back end-of-line (BEOL). During the processing of the top tier, care must be taken to prevent damage to the devices and interconnects of the bottom tier.

To prevent damage to the devices of the bottom tier, a low temperature transistor process is key on the top tier. It has been demonstrated [10] that transistors can be fabricated at temperatures down to 625 °C without any loss of performance. While this is sufficient to prevent damage to the underlying devices, this temperature is still too high to prevent damage to the copper BEOL. This problem can be avoided by using tungsten as the interconnect material on the bottom tier [1]. Tungsten has a bulk resistivity 3.1× that of copper, and this is expected to slow down the system performance. However, the resistivity of an interconnect is size dependent, and it increases with smaller dimensions due to grain boundary and sidewall scattering. Lopez [5] provided a general curve-fit equation to take these effects into account, and several measurement-based studies have extracted the various curve fitting parameters for both copper and tungsten [7, 3, 9]. Using these equations, the change in the interconnect resistivity for the Nangate 45nm library is tabulated in Table 1. Since size effects play a greater role in narrower metal lines, we observe that the local metal lines degrade less than the global metal lines (a tungsten-to-copper ratio of 2.38 versus 3.04). We use these values to modify the interconnect technology file (.ict), and use Cadence QRC Techgen to re-characterize the interconnect extraction libraries.

If, however, we wish to use copper on the bottom tier, laser-scan anneal has been proposed for the dopant activation on the top tier. This method only results in localized heating, thereby preventing any damage to the devices and interconnects on the bottom tier. However, this process results in degraded transistors, and the PMOS and NMOS performance degrade by 27.8% and 16.2% respectively [8]. We refer to these degraded transistors as the TTm20p corner, as on average, the performance is worse by roughly 20%. However, this work was from six years ago, and improvements in the process are bound to be made. We define another corner, TTm10p, which has a PMOS and NMOS degradation of 13.9% and 8.1% respectively, which is exactly half of the TTm20p corner. This is meant to represent progress on the fabrication front. We modify the transistor parameters in the SPICE models to represent these corners and use Encounter Library Characterizer to obtain the new standard cell libraries. We tabulate the resulting performance of select standard cells at maximum loading in Table 2.

In addition to re-characterization at different transistor corners, tungsten interconnects also increase the internal parasitics of standard cells. We also re-characterize the standard cells under this condition, and name this corner TT_W. From Table 2, we observe that the cell delays for simple gates such as NAND roughly follow the average of NMOS and PMOS degradation, while complex gates are more or less dominated by PMOS degradation. In addition, we see that the setup time for the flip-flops degrades at a much higher rate than either NMOS or PMOS. Tungsten interconnects have only a minimal impact on the gate performance, as the wires within the standard cells are very small, and the resistance is dominated by the RON of the transistor.

Table 2: Minimum size (X1) standard cell average delay (in ps), assuming worst loading, at different corners. Numbers in brackets are normalized to the respective TT implementation.

Std. Cell    TT               TTm10p           TTm20p           TT_W
NAND2        221.8 (1.00)     243.9 (1.10)     265.2 (1.19)     222.35 (1.00)
AOI211       154.5 (1.00)     173.8 (1.12)     192.9 (1.25)     154.97 (1.00)
XOR2         163.42 (1.00)    187.6 (1.14)     210.85 (1.28)    163.86 (1.00)
DFF Clk-Q    213.1 (1.00)     243.8 (1.14)     277.7 (1.30)     214.05 (1.00)
DFF Setup    40.29 (1.00)     50.95 (1.26)     58.11 (1.44)     43.86 (1.08)

In summary, we have two choices: (1) Use tungsten on the bottom tier and deal with degraded interconnects and slightly worse standard cells, or (2) Use copper on the bottom tier and deal with significantly degraded standard cells on the top tier. In this paper, we study both options and compare and contrast them.
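The observations drawn from Table 2 can be checked with a few lines of arithmetic. The sketch below is illustrative only: it copies the delays from Table 2 and the degradation percentages from the text, and the helper names (`slowdown`, `mean_degradation`) are hypothetical. It treats a transistor "degradation" of d as directly comparable to a delay ratio of 1 + d, which is a simplifying assumption. Ratios recomputed this way can differ in the last digit from the bracketed values in Table 2.

```python
# Average X1 cell delays in ps at worst loading, copied from Table 2.
DELAY_PS = {
    "NAND2":     {"TT": 221.8,  "TTm10p": 243.9, "TTm20p": 265.2,  "TT_W": 222.35},
    "AOI211":    {"TT": 154.5,  "TTm10p": 173.8, "TTm20p": 192.9,  "TT_W": 154.97},
    "XOR2":      {"TT": 163.42, "TTm10p": 187.6, "TTm20p": 210.85, "TT_W": 163.86},
    "DFF Clk-Q": {"TT": 213.1,  "TTm10p": 243.8, "TTm20p": 277.7,  "TT_W": 214.05},
    "DFF Setup": {"TT": 40.29,  "TTm10p": 50.95, "TTm20p": 58.11,  "TT_W": 43.86},
}

# Transistor degradation per corner, as fractions, taken from the text.
DEGRADATION = {
    "TTm10p": {"pmos": 0.139, "nmos": 0.081},
    "TTm20p": {"pmos": 0.278, "nmos": 0.162},
}


def slowdown(cell, corner):
    """Delay of `cell` at `corner` relative to the nominal TT corner."""
    return DELAY_PS[cell][corner] / DELAY_PS[cell]["TT"]


def mean_degradation(corner):
    """Plain average of the PMOS and NMOS degradation for a corner."""
    d = DEGRADATION[corner]
    return (d["pmos"] + d["nmos"]) / 2.0


if __name__ == "__main__":
    for cell in DELAY_PS:
        ratios = ", ".join(
            f"{c}={slowdown(cell, c):.3f}" for c in ("TTm10p", "TTm20p", "TT_W")
        )
        print(f"{cell:10s} {ratios}")
    for corner in DEGRADATION:
        print(f"{corner}: mean N/P degradation = {mean_degradation(corner):.3f}")
```

Under these assumptions, the NAND2 slowdown at TTm20p sits close to the mean N/P degradation, while the flip-flop setup time grows faster than even the PMOS degradation alone, in line with the discussion above.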

3. DESIGN FLOW

This section presents the RTL-to-GDSII design flow used in this paper. An overview of the flow is shown in Figure 2. In this figure, orange boxes indicate 3D-specific steps. Once the design is synthesized, it is sent to our floorplanner (described in subsection 3.1), which gives us the outlines of all the blocks in the 3D space. Next, we perform MIV planning (described in subsection 3.2) to determine all the MIV locations. With these locations, each block and tier is placed and routed (P&R) separately in Cadence Encounter. At this stage, we dump wire-load models and go back to synthesis to get a better result. Once the P&R is complete again, we proceed to 3D timing and power analysis (described in subsection 3.3).

3.1 Floorplanning

In this paper, we use a sequence-pair-based simulated annealing floorplanner. We maintain one sequence pair per tier, representing the entire 3D space [6]. We assume all blocks to be soft, and place and route them only after their outlines have been determined. Our floorplanner is timing driven, and we achieve this by weighting each inter-block net by the longest path delay through it.
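As a reminder of how a sequence pair encodes a floorplan, the sketch below packs the blocks of a single tier using the standard sequence-pair rules: a block that precedes another in both sequences lies to its left, and a block that follows another in the positive sequence but precedes it in the negative sequence lies below it. This is a generic textbook-style evaluation in O(n²), not the authors' implementation; the function names and the example block sizes are hypothetical, and the timing-driven net weighting is not shown.

```python
def pack(pos_seq, neg_seq, sizes):
    """Return {block: (x, y)} lower-left corners for one tier.

    pos_seq / neg_seq: the positive and negative sequences (same blocks).
    sizes: {block: (width, height)}.
    """
    p = {b: i for i, b in enumerate(pos_seq)}
    x, y = {}, {}
    # Visiting blocks in negative-sequence order guarantees that every block
    # constrained to be left of or below b has already been placed.
    for b in neg_seq:
        placed = list(x)  # blocks that precede b in neg_seq
        # a is left of b when it precedes b in both sequences.
        x_b = max((x[a] + sizes[a][0] for a in placed if p[a] < p[b]), default=0.0)
        # a is below b when it follows b in pos_seq but precedes it in neg_seq.
        y_b = max((y[a] + sizes[a][1] for a in placed if p[a] > p[b]), default=0.0)
        x[b], y[b] = x_b, y_b
    return {b: (x[b], y[b]) for b in neg_seq}


def footprint(placement, sizes):
    """Width and height of the bounding box of a packed tier."""
    width = max(x + sizes[b][0] for b, (x, _) in placement.items())
    height = max(y + sizes[b][1] for b, (_, y) in placement.items())
    return width, height


if __name__ == "__main__":
    sizes = {"a": (2, 1), "b": (1, 3), "c": (2, 2)}  # hypothetical blocks
    placement = pack(["a", "b", "c"], ["b", "a", "c"], sizes)
    print(placement, footprint(placement, sizes))
```

In a multi-tier setting, one such packing would be evaluated per tier, and the 3D footprint taken as the larger of the per-tier bounding boxes.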

Figure 2: The block-level RTL-to-GDSII design flow used in this paper. Orange indicates 3D specific steps.

Figure 3: Our inter-tier variation-aware floorplanner.

Given an initial random solution, we perform the following kinds of solution perturbations: (1) In the sequence pair of a given tier, swap two blocks in the positive sequence or negative sequence or both sequences, (2) Change the aspect ratio of a particular block, and (3) Move a block from one tier to another, or swap two blocks between tiers. With this framework, the floorplanner can be run either with or without inter-tier variation awareness. If we assume that both tiers have identical performance, the cost function for the floorplanner is the weighted sum of wirelength and footprint area. This is in contrast to TSV-based 3D floorplanners, which also try to minimize the via count; MIVs are so small that their count need not be minimized.
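The three perturbation types can be sketched as follows. This is a minimal illustration under assumed data structures, not the authors' code: each tier is a dictionary holding its positive and negative sequence, aspect ratios live in a separate dictionary, and the aspect-ratio bounds (0.5 to 2.0) and the equal move probabilities are arbitrary placeholders.

```python
import random


def swap_within_tier(tiers, rng):
    """Move (1): swap two blocks in the positive, negative, or both sequences."""
    candidates = [t for t in tiers if len(t["pos"]) >= 2]
    if not candidates:
        return
    tier = rng.choice(candidates)
    a, b = rng.sample(tier["pos"], 2)
    which = rng.choice(("pos", "neg", "both"))
    for key in ("pos", "neg"):
        if which in (key, "both"):
            seq = tier[key]
            i, j = seq.index(a), seq.index(b)
            seq[i], seq[j] = seq[j], seq[i]


def change_aspect_ratio(aspect, rng, lo=0.5, hi=2.0):
    """Move (2): give one soft block a new aspect ratio (bounds are assumed)."""
    block = rng.choice(sorted(aspect))
    aspect[block] = rng.uniform(lo, hi)


def move_between_tiers(tiers, rng):
    """Move (3): relocate one block to another tier, or swap two across tiers."""
    if len(tiers) < 2:
        return
    src, dst = rng.sample(range(len(tiers)), 2)
    if not tiers[src]["pos"]:
        src, dst = dst, src
    if not tiers[src]["pos"]:
        return
    a = rng.choice(tiers[src]["pos"])
    if tiers[dst]["pos"] and rng.random() < 0.5:
        # Swap a with a block b of the other tier, keeping sequence positions.
        b = rng.choice(tiers[dst]["pos"])
        for key in ("pos", "neg"):
            i, j = tiers[src][key].index(a), tiers[dst][key].index(b)
            tiers[src][key][i], tiers[dst][key][j] = b, a
    else:
        # Remove a from its tier and insert it at a random position in the other.
        for key in ("pos", "neg"):
            tiers[src][key].remove(a)
            tiers[dst][key].insert(rng.randrange(len(tiers[dst][key]) + 1), a)


def perturb(tiers, aspect, rng):
    """Apply one randomly chosen perturbation in place."""
    move = rng.choice(("swap", "aspect", "tier"))
    if move == "swap":
        swap_within_tier(tiers, rng)
    elif move == "aspect":
        change_aspect_ratio(aspect, rng)
    else:
        move_between_tiers(tiers, rng)
```

A simulated annealing loop would call `perturb` on a copy of the current solution, evaluate the cost, and accept or reject the move according to the annealing schedule.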

3.1.1 Variation-Aware Floorplanner

In most designs, not every block is timing critical. The non-timing-critical blocks can in theory operate faster, but they are synthesized at the frequency of the critical block to save area and power. Therefore, even with degraded transistors, these blocks can be synthesized to operate at the frequency of the critical block, albeit with a larger area. As long as the critical blocks do not operate with slower transistors or interconnects, the chip can still meet timing. We utilize this in our variation-aware floorplanner.

An overview of our inter-tier variation-aware floorplanner is shown in Figure 3. First, given the block RTL and timing constraints, we synthesize four different versions of each block: one for the nominal corner, and one for each of the degraded libraries. In the case of tungsten interconnects, we also modify the resistivity of the wire load models to accurately drive synthesis. For each version of the block, we take note of the area and longest path delay (LPD) through it. Given that the design has inter-tier variations, each block will have a different area and LPD depending on the tier in which it lies. If LPD(b_i) is the tier-dependent longest path delay of a block b_i, we define the modified cost function of the floorplanner as:

$$\mathrm{Cost}_{VA} = \alpha \cdot WL + \beta \cdot \mathrm{Area} + \gamma \sum_{i=1}^{N_{Block}} LPD(b_i) \tag{1}$$

In the above equation, WL refers to the wirelength. The area of a block is also dependent on its tier. Therefore, whenever a 3D move is made, we update the area of all the blocks that have changed their tier. The third term in the above equation will try to place the timing critical blocks in the faster tier, and push the non-timing critical blocks to the slower tier.
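Equation (1) can be written down directly. The sketch below is a minimal illustration, assuming the wirelength and footprint area have already been computed by the floorplanner for the current solution; the function name `cost_va`, the tier names, the weights, and all numeric values are hypothetical.

```python
def cost_va(wirelength, footprint_area, lpd_by_tier, tier_of, alpha, beta, gamma):
    """Variation-aware floorplanning cost following Equation (1).

    lpd_by_tier: {block: {tier: longest path delay when synthesized for that tier}}
    tier_of:     {block: tier currently assigned by the floorplanner}
    """
    lpd_sum = sum(lpd_by_tier[b][tier_of[b]] for b in tier_of)
    return alpha * wirelength + beta * footprint_area + gamma * lpd_sum


if __name__ == "__main__":
    # Hypothetical two-block example: "crit" slows down a lot on the degraded
    # top tier, "slack" barely changes.
    lpd = {
        "crit": {"bottom": 0.70, "top": 0.84},
        "slack": {"bottom": 0.60, "top": 0.62},
    }
    weights = dict(alpha=1.0, beta=1.0, gamma=10.0)
    good = cost_va(1.0, 0.2, lpd, {"crit": "bottom", "slack": "top"}, **weights)
    bad = cost_va(1.0, 0.2, lpd, {"crit": "top", "slack": "bottom"}, **weights)
    print(good, bad)
```

With wirelength and area held equal, the assignment that keeps the critical block on the non-degraded tier has the lower cost, which is the behavior the third term is meant to encourage.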

3.2 MIV Planning

The output of the floorplanner is a set of block outlines in a 3D space. Once we have these, we need to insert MIVs into the design. Although MIVs are extremely small, they still need to be inserted in the whitespace in between the blocks. The authors of [6] provide a methodology to determine MIV locations by tricking a 2D router to do 3D routing. However, this method assumes hard blocks, where the block pin locations are pre-determined. In the case of soft blocks, the block pin locations are determined only after floorplanning is finished. However, in monolithic 3D ICs, these block pin locations depend on the MIV locations as well. This is a chicken-and-egg problem, and we present an iterative method to determine both the MIV and the block pin locations.

Given the block outlines, we first assume that all the pins of a block are in its center. For each 3D net, the optimal MIV location can then roughly be given as the center of its 3D bounding box. However, this approach will lead to overlap between blocks and MIVs, as well as between MIVs themselves (shown in Figure 4(a)). Given these initial MIV locations, we create Verilog and DEF files for each tier. We then open each tier in Cadence Encounter, and use its partitioner to give us block pin locations based on the estimated MIV locations (also shown in Figure 4(a)). We use these block pin locations to run the MIV planning algorithm again to determine the new MIV locations (shown in Figure 4(b)). This entire process can be repeated until the MIV locations stabilize. In practice, we observe that only one or two iterations are required. Once the MIV locations are finalized, each block and tier can be placed and routed separately in Cadence Encounter.

Figure 4: Our MIV planning methodology. (a) Initial estimated MIV locations. (b) After one iteration of MIV planning.
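The iteration described in this subsection can be outlined as below. This is only a structural sketch: the two callables `assign_pins` and `place_mivs` are hypothetical stand-ins for the steps that the paper performs with the Cadence Encounter partitioner and the MIV planning algorithm of [6], and the iteration cap and convergence tolerance are assumed values.

```python
import math


def bounding_box_center(points):
    """Center of the bounding box of a list of (x, y) points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return ((min(xs) + max(xs)) / 2.0, (min(ys) + max(ys)) / 2.0)


def plan_mivs(nets, block_centers, assign_pins, place_mivs, max_iters=5, tol=1e-6):
    """Iterate between block pin assignment and MIV placement.

    nets:          {net: [blocks connected by this 3D net]}
    block_centers: {block: (x, y)} from the floorplan outlines
    assign_pins:   callable, {net: miv_xy} -> {net: [pin_xy, ...]}
    place_mivs:    callable, {net: [pin_xy, ...]} -> {net: miv_xy}
    """
    # Initial estimate: all pins of a block sit at the block center.
    mivs = {
        net: bounding_box_center([block_centers[b] for b in blocks])
        for net, blocks in nets.items()
    }
    for _ in range(max_iters):
        pins = assign_pins(mivs)
        new_mivs = place_mivs(pins)
        shift = max((math.dist(mivs[n], new_mivs[n]) for n in mivs), default=0.0)
        mivs = new_mivs
        if shift <= tol:  # MIV locations have stabilized
            break
    return mivs
```

The loop stops as soon as no MIV moves by more than the tolerance, mirroring the observation that one or two iterations are usually enough.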

3.3 3D Timing and Power Analysis

Once we have the placed and routed netlists of all the blocks and tiers, we load them into Synopsys PrimeTime. For each cell, depending on the tier in which it lies, we pick the appropriate standard cell library. We also modify our extraction tech file for each block and tier depending on the interconnect material. These parasitics are also loaded into Synopsys PrimeTime. We also create a top-level netlist and parasitic file to represent the MIV connectivity and parasitics. According to [1], if the inter-tier oxide thickness is greater than or equal to 100 nm, there is negligible inter-tier coupling. Therefore, we ignore any such coupling in this paper. Once all the netlists and parasitics are loaded, we perform 3D timing analysis and statistical power analysis.

Figure 5: Power-performance trade-off curves assuming that both the tiers have identical transistors and interconnects. Panels: (a) des3, (b) b19, (c) mul128. Each panel plots power (mW) against frequency (GHz) for the 2D, 3D_TT, and Ideal implementations at VDD values from 1.00 V to 1.20 V.

Table 3: Benchmarks used in our evaluation.

Benchmark   #Blocks   #Gates    #Inter-Block Nets
des3        55        63,194    6,138
b19         55        78,852    14,223
mul128      63        253,867   12,447

Table 4: Basic floorplan comparisons assuming both tiers have same performance. Numbers in brackets are normalized to the respective 2D implementation.

Ckt.     Flavor   #Gates (×10^3)   Footprint (mm^2)   Total WL (m)    # MIV (×10^3)
des3     2D       68.9 (1.00)      0.328 (1.00)       1.514 (1.00)    -
des3     3D       66.2 (0.96)      0.156 (0.48)       1.287 (0.85)    3.75
des3     Ideal    64.4 (0.94)      -                  0.938 (0.62)    -
b19      2D       82.3 (1.00)      0.398 (1.00)       3.341 (1.00)    -
b19      3D       80.62 (0.98)     0.204 (0.51)       2.847 (0.85)    13.46
b19      Ideal    79.35 (0.96)     -                  1.838 (0.55)    -
mul128   2D       251 (1.00)       1.096 (1.00)       4.693 (1.00)    -
mul128   3D       245 (0.97)       0.550 (0.50)       4.447 (0.95)    7.261
mul128   Ideal    235 (0.93)       -                  3.271 (0.70)    -

4. POWER-PERFORMANCE STUDY

We pick one benchmark from the OpenCores benchmark suite (des3), one from the IWLS benchmark suite (b19), and design one custom 128-bit integer multiplier (mul128), and tabulate their statistics in Table 3. They are implemented in the Nangate 45nm library, and the cell counts shown are the synthesis results without any wire load models. In all our 3D implementations, the diameter of an MIV is assumed to be 100 nm, with a resistance of 2 Ω and a capacitance of 0.1 fF [4].

4.1 Identical Performance on Both Tiers

In this section, we assume that both tiers have identical transistors and interconnects. This represents an ideal manufacturing process, and the best possible case for monolithic 3D. As described in Section 3, we first perform initial floorplanning and use this to derive wire load models for each benchmark. Once we have the wire-load models, we perform floorplanning again, and tabulate basic floorplan comparisons for 2D and 3D in Table 4. In addition to these two flavors, we define an “ideal” block-level implementation. This implementation is obtained by assuming that all the inter-block nets have zero length and parasitics. During the block implementation, we set the output load of the blocks to be zero and the inputs to be driven by ideal drivers. This is the theoretical lower bound on any block-level implementation of this design, given the same set of blocks, and the constraint that each block is implemented in 2D.

From this table, we observe that monolithic 3D leads to significantly shorter wirelength. Although the inter-block wirelength is always significantly reduced, the total wirelength reduction depends on the intra-block wirelength as well. Benchmarks such as “mul128” have most of the wirelength within the block, so there is not much total wirelength reduction. In addition, we observe that the shorter wires lead to fewer gates being required.

Next, we study the power-performance trade-off of the three different implementations. In order to get the numbers for the ideal implementation, we force the parasitics of all inter-block nets to be zero in Synopsys PrimeTime. In addition to the nominal VDD of 1.1 V, we characterize the standard cell libraries at four additional VDD values covering ±10% of nominal VDD (1.00 V, 1.05 V, 1.10 V, 1.15 V, 1.20 V). We then measure the power and frequency at each of the VDD values. These curves are plotted in Figure 5. From this figure, we observe that 3D usually offers a performance advantage (at the same power) over 2D, and it closes the gap to ideal by up to 50%. This additional performance can be traded for power savings to meet the 2D frequency, and we see up to a 16.1% reduction in power.

In these curves, the ideal implementation of “b19” requires extrapolation to make iso-performance power comparisons at the nominal 2D frequency. We do not make such a comparison due to inaccuracies that are bound to be introduced by extrapolation. The absolute values of the gains in the “mul128” benchmark are small because the critical path is always within a single block. Since the inter-block nets are not timing critical, shortening them does not make the design faster, and there is no additional performance to trade for power. Making this design faster will require architectural modifications such as block folding.
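The iso-power and iso-performance comparisons can be read off a power-frequency curve by interpolating between the measured VDD points, and refusing to answer when the target lies outside the measured range. The sketch below is one plausible way to do this, not the authors' procedure: piecewise-linear interpolation is an assumption, the helper names are hypothetical, and the sample curve values are made up. The `gap_closed` definition (fraction of the 2D-to-ideal distance recovered by 3D) is likewise an assumed reading of "closing the gap"; the mul128 iso-power frequencies used in the example are taken from Table 7.

```python
def interpolate(xs, ys, x):
    """Piecewise-linear interpolation of y at x; None if x needs extrapolation."""
    pts = sorted(zip(xs, ys))
    if not pts[0][0] <= x <= pts[-1][0]:
        return None  # outside the measured VDD range, no extrapolation
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 <= x <= x1:
            if x1 == x0:
                return y0
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return pts[0][1]  # single-point curve with x equal to that point


def iso_performance_power(freqs, powers, target_freq):
    """Power needed to run at target_freq, or None if not reachable."""
    return interpolate(freqs, powers, target_freq)


def iso_power_frequency(freqs, powers, target_power):
    """Frequency reachable at target_power, or None if not reachable."""
    return interpolate(powers, freqs, target_power)


def gap_closed(value_2d, value_3d, value_ideal):
    """Fraction of the 2D-to-ideal gap recovered by the 3D implementation."""
    return (value_3d - value_2d) / (value_ideal - value_2d)


if __name__ == "__main__":
    # Hypothetical curve: frequency (GHz) and power (mW) at five VDD values.
    freqs = [1.00, 1.10, 1.20, 1.30, 1.40]
    powers = [300.0, 360.0, 430.0, 510.0, 600.0]
    print(iso_performance_power(freqs, powers, 1.25))
    print(iso_power_frequency(freqs, powers, 700.0))  # None: needs extrapolation
    # mul128 iso-power frequencies from Table 7: 2D, 3D (Both=TT), Ideal.
    print(gap_closed(0.779, 0.793, 0.807))
```

Returning `None` for out-of-range targets corresponds to the '-' entries used in the tables when a point is not achievable within ±10% of nominal VDD.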

4.2 Impact of Inter-Tier Variations

In this section, we study the impact of the inter-tier variations on the power-performance envelope of monolithic 3D. As described in Section 3, we perform synthesis of each block for each degraded library. The synthesis results for “des3” are shown in Figure 6. This figure plots the block area vs. the longest path delay through it. Each point on this plot is a single block. As seen from this graph, the largest blocks in this benchmark are not timing critical. They have the same frequency and area under all of the degraded transistor and interconnect options. However, the smallest blocks seem to be the most timing critical. They require much larger area (buffers) to try to meet timing, and even then they cannot meet it. Therefore, the floorplanner must move them to the non-degraded tier. We also observe that tungsten interconnects have almost no timing degradation, and only a marginal area overhead. This is because if a design is interconnect dominated, it is quite easy to make it gate dominated and meet timing by inserting more buffers. While timing can be met, it will come at a power penalty.

Figure 6: Synthesis results of “des3” benchmark for degraded transistors and interconnects. The plot shows LPD through block (ns) against block area (um^2) for the TT, TTm10p, TTm20p, and TT_W corners. Annotations: small blocks are most critical; large blocks are not timing critical; TT_W is only slightly larger than TT.

We now run our variation-aware floorplanner on all benchmarks for each degraded option. Assuming that the top tier is at the TTm20p corner, sample floorplan screenshots for “des3” with and without variation-aware floorplanning are shown in Figure 7. In this figure, we observe that the variation-aware floorplanner moves the smaller, more timing critical blocks to the bottom tier, so they can operate at the desired frequency.

Figure 7: Floorplan screenshots of “des3” when the top tier is at the TTm20p corner. (a) Without variation-aware floorplanning, and (b) With variation-aware floorplanning, where the critical block is moved to the bottom tier.

We tabulate the basic floorplan comparisons for all the degraded options in Table 5. The numbers are normalized to the respective 2D numbers in Table 4. As seen from this table, all of the degraded options use more gates than the case when both tiers have identical performance. However, the gate counts are still less than 2D. Similarly, the footprint area and the wirelength generally increase relative to the non-degraded case, but are still less than 2D. The exceptions, such as the “mul128” benchmark when the bottom tier is at the TT_W corner, have a slightly lower wirelength than the non-degraded option, but this is due to the trade-off with footprint area.

Table 5: Basic floorplan comparisons for different degraded 3D options. The numbers in brackets are normalized to the respective 2D numbers in Table 4.

Ckt.     Flavor        #Gates (×10^3)   Footprint (mm^2)   Total WL (m)   # MIV (×10^3)
des3     Top=TTm10p    68.1 (0.99)      0.159 (0.49)       1.29 (0.85)    3.92
des3     Top=TTm20p    67.2 (0.98)      0.177 (0.54)       1.44 (0.95)    5.67
des3     Bot=TT_W      66.8 (0.97)      0.153 (0.47)       1.31 (0.87)    3.11
b19      Top=TTm10p    80.8 (0.98)      0.212 (0.53)       2.84 (0.85)    11.6
b19      Top=TTm20p    82.0 (1.00)      0.222 (0.56)       2.90 (0.87)    11.3
b19      Bot=TT_W      80.8 (0.98)      0.208 (0.52)       2.91 (0.87)    12.9
mul128   Top=TTm10p    247 (0.98)       0.574 (0.52)       4.35 (0.93)    4.48
mul128   Top=TTm20p    249 (0.99)       0.575 (0.52)       4.38 (0.94)    4.48
mul128   Bot=TT_W      246 (0.98)       0.568 (0.52)       4.29 (0.91)    4.48

Next, we plot the power-performance trade-off curves for the degraded transistors and interconnects in Figure 8. These are shown as solid lines. We also plot the results of degraded transistors and interconnects on top of a non-variation-aware floorplanning solution. These are shown as dashed lines. As observed from this figure, our variation-aware floorplanner (VAFP) always outperforms the non-variation-aware one. We also observe that except in the case of “mul128”, the top tier having TTm20p transistors is worse than 2D, even with VAFP. We also observe that after VAFP, the top tier with TTm10p transistors is always better than 2D. Finally, we observe that tungsten interconnects on the bottom tier are by far the best option, and although there is negligible timing degradation compared to the identical tiers case, some power overhead exists.

4.3 Impact of Variation-Aware Floorplanning

To show the impact of our inter-tier variation-aware floorplanner, we tabulate the iso-power frequency and iso-performance power in Table 6. The comparison point for each of the three benchmarks is the respective 2D power and frequency at nominal VDD. If a particular point is not achievable within ±10% of nominal VDD, and extrapolation is required, we mark it with a ‘-’. From this table, we observe that our variation-aware floorplanner improves the iso-power performance by up to 12.6% and the iso-performance power by up to 10.6%. However, the non-variation-aware floorplan results are often not able to meet the 2D frequency even with a 10% VDD boost. If the VDD were increased further so that they could meet timing, our variation-aware floorplanner would show even more benefit.
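The headline improvements can be recomputed from the Table 6 entries. The short sketch below is a worked check only; the helper name `relative_change` is hypothetical, and the four numbers are the mul128 iso-power frequencies at Top=TTm20p and the des3 iso-performance powers at Bot=TT_W, as listed in Table 6.

```python
def relative_change(non_vafp, vafp):
    """Change of the VAFP result relative to the non-VAFP result."""
    return vafp / non_vafp - 1.0


if __name__ == "__main__":
    # mul128, Top=TTm20p: iso-power frequency in GHz (non-VAFP, VAFP).
    freq_gain = relative_change(0.692, 0.779)
    # des3, Bot=TT_W: iso-performance power in mW (non-VAFP, VAFP).
    power_change = relative_change(519.48, 464.55)
    print(f"iso-power frequency: {freq_gain:+.1%}")
    print(f"iso-performance power: {power_change:+.1%}")
```

These two entries reproduce the 12.6% performance improvement and the 10.6% power reduction quoted above.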

4.4 Overall Comparisons

Similar to the previous section, we tabulate the iso-power performance and iso-performance power for 2D, ideal, the non-degraded monolithic 3D, as well as the variation-aware floorplanning results for degraded monolithic 3D in Table 7. From this table, we clearly see that tungsten interconnects on the bottom tier outperform degraded transistors on the top tier. This option is preferable from the manufacturing perspective as well, as the process is already available. Even with tungsten interconnects on the bottom tier, we see that we can close the gap to the ideal block-level implementation by up to 50% w.r.t. performance and 36% w.r.t. power.

Figure 8: Power-performance trade-off curves assuming degraded transistors and interconnects. Dashed lines represent non-variation-aware floorplanning and solid lines represent variation-aware floorplanning. Panels: (a) des3, (b) b19, (c) mul128. Each panel plots power (mW) against frequency (GHz) for 2D; 3D, Top=TTm10p; 3D, Top=TTm20p; and 3D, Bot=TT_W.

Table 6: The impact of variation-aware floorplanning (VAFP). ‘-’ indicates that point is not achievable within ±10% VDD.

                                         Top=TTm10p                          Top=TTm20p                          Bot=TT_W
Ckt.     Parameter                       Non-VAFP          VAFP              Non-VAFP          VAFP              Non-VAFP          VAFP
des3     iso-power frequency (GHz)       1.233 (1.000)     1.259 (1.021)     1.14 (1.000)      1.19 (1.044)      1.222 (1.000)     1.28 (1.047)
des3     iso-performance power (mW)      507.746 (1.000)   479.1 (0.944)     -                 547.65 (-)        519.48 (1.000)    464.55 (0.894)
b19      iso-power frequency (GHz)       0.417 (1.000)     0.424 (1.017)     0.396 (1.000)     0.396 (1.000)     0.432 (1.000)     0.439 (1.016)
b19      iso-performance power (mW)      151.723 (1.000)   144.58 (0.953)    173.14 (1.000)    172.828 (0.998)   135.06 (1.000)    135.06 (1.000)
mul128   iso-power frequency (GHz)       0.737 (1.000)     0.793 (1.076)     0.692 (1.000)     0.779 (1.126)     -                 0.793 (-)
mul128   iso-performance power (mW)      -                 892.95 (-)        -                 922.53 (-)        -                 887.37 (-)

Table 7: Iso-power performance and iso-performance power results for all implementation flavors.

                                                                             3D
Ckt.     Parameter                       2D               Ideal             Both=TT          Top=TTm10p       Top=TTm20p        Bot=TT_W
des3     iso-power frequency (GHz)       1.222 (1.000)    1.411 (1.155)     1.293 (1.058)    1.259 (1.030)    1.19 (0.974)      1.28 (1.047)
des3     iso-performance power (mW)      519.48 (1.000)   372.06 (0.716)    458.45 (0.883)   479.1 (0.922)    547.65 (1.054)    464.55 (0.894)
b19      iso-power frequency (GHz)       0.408 (1.000)    0.5 (1.225)       0.439 (1.076)    0.424 (1.039)    0.396 (0.971)     0.439 (1.076)
b19      iso-performance power (mW)      157.05 (1.000)   - (-)             131.81 (0.839)   144.58 (0.921)   172.828 (1.100)   135.06 (0.860)
mul128   iso-power frequency (GHz)       0.779 (1.000)    0.807 (1.036)     0.793 (1.018)    0.793 (1.018)    0.779 (1.000)     0.793 (1.018)
mul128   iso-performance power (mW)      922.53 (1.000)   810.56 (0.879)    859.15 (0.931)   892.95 (0.968)   922.53 (1.000)    887.37 (0.962)

5. CONCLUSION

In this paper, we presented a methodology to design and analyze monolithic 3D ICs (M3D), where the tiers have different device and interconnect delay and power characteristics. We developed a floorplanning scheme that addresses this inter-tier variation and builds variation-tolerant block-level designs. Using these tools, we first showed a significant power-performance improvement with M3D designs compared with 2D designs and theoretical bounds. In addition, our variation-aware floorplanning scheme improves performance and power significantly when compared with non-variation-aware floorplanning. We also studied the impact of highly resistive tungsten interconnects that are used for the bottom tier to withstand high temperature manufacturing of non-bottom tiers. Our design methods can still help us obtain high quality power-performance results even under this material change.

6. REFERENCES

[1] P. Batude et al. 3-D Sequential Integration: A Key Enabling Technology for Heterogeneous Co-Integration of New Function With CMOS. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2012.
[2] S. Bobba et al. CELONCEL: Effective design technique for 3-D monolithic integration targeting high performance integrated circuits. In Proc. Asia and South Pacific Design Automation Conf., 2011.
[3] D. Choi et al. Electron mean free path of tungsten and the electrical resistivity of epitaxial (110) tungsten films. Phys. Rev. B, 2012.
[4] Y. J. Lee, P. Morrow, and S. K. Lim. Ultra High Density Logic Designs Using Transistor-Level Monolithic 3D Integration. In Proc. IEEE Int. Conf. on Computer-Aided Design, 2012.
[5] G. Lopez. The impact of interconnect process variations and size effects for gigascale integration. PhD. Thesis, Georgia Tech, 2009.
[6] S. Panth, K. Samadi, Y. Du, and S. K. Lim. High-Density Integration of Functional Modules Using Monolithic 3D-IC Technology. In Proc. Asia and South Pacific Design Automation Conf., 2013.
[7] J. Plombon et al. Influence of phonon, geometry, impurity, and grain size on Copper line resistivity. Applied Physics Letters, 2006.
[8] B. Rajendran et al. Low Thermal Budget Processing for Sequential 3-D IC Fabrication. IEEE Trans. on Electron Devices, 2007.
[9] W. Steinhogl et al. Tungsten interconnects in the nano-scale regime. Microelectronic Engineering, 2005.
[10] C. Xu et al. Improvements in low temperature (