
A Scalable Near-Memory Architecture for Training Deep Neural Networks on Large In-Memory Datasets

arXiv:1803.04783v1 [cs.DC] 19 Feb 2018

Fabian Schuiki, Michael Schaffner, Frank K. Gürkaynak, and Luca Benini, Fellow, IEEE

Abstract—Most investigations into near-memory hardware accelerators for deep neural networks have primarily focused on inference, while the potential of accelerating training has received relatively little attention so far. Based on an in-depth analysis of the key computational patterns in state-of-the-art gradient-based training methods, we propose an efficient near-memory acceleration engine called NTX that can be used to train state-of-the-art deep convolutional neural networks at scale. Our main contributions are: (i) identifying requirements for efficient data address generation and developing an efficient accelerator offloading scheme that reduces overhead by 7× over previously published results; and (ii) supporting a rich set of operations allowing for efficient calculation of the back-propagation phase. The low control overhead allows up to 8 NTX engines to be controlled by a simple processor. Evaluations in a near-memory computing scenario where the accelerator is placed on the logic base die of a Hybrid Memory Cube demonstrate a 2.6× energy efficiency improvement over contemporary GPUs at 4.4× less silicon area, and an average compute performance of 1.01 Tflop/s for training large state-of-the-art networks with full floating-point precision. The architecture is scalable and paves the way towards efficient deep learning in a distributed near-memory setting.

Index Terms—Parallel architectures, memory structures, memory hierarchy, machine learning, neural nets

1 Introduction

Modern Deep Neural Networks (DNNs) have to be trained on clusters of GPUs and millions of sample images to be competitive [1]. Complex networks can take weeks to converge, during which the involved compute machinery consumes megajoules of energy to perform the exa-scale amount of operations required. Inference, i.e. evaluating a network for a given input, provides many knobs for tuning and optimization. Substantial research has been performed in this direction and many good hardware accelerators have been proposed to improve inference speed and energy efficiency [2]. As we shall show, training of DNNs is much harder and many of these optimizations no longer apply. Stochastic Gradient Descent (SGD) is the standard algorithm used to train such deep networks [3]. Over time, improvements to SGD such as momentum and adaptive learning rates have appeared, but the core principle remains the same. Consider Figure 1, which shows the data dependencies when training a simple neural network with three layers f, g, h. Inference is concerned only with finding y, up to which the graph has a convenient structure: every operation only depends on the previous layer. Training, however, must also find the gradients ∆θf, ∆θg, ∆θh, which in addition to the previous layer also depend on the intermediate results x1, x2, x3, y. This has severe implications: while for inference the most data we ever need to memorize is the output of the largest layer, training requires us to temporarily store the output of every layer. Since we need to save all intermediate results, we can no longer apply optimizations such as fusing activation or sub-sampling functions with the preceding layer.



Figure 1. Data dependency graph of the forward pass (above) and backward pass (below). f, g, h are DNN layers, L is the loss function, x1, x2, x3 and θf, θg, θh are the layer activations and parameters. The backward pass introduces a data dependency between the first node f and the last node df. Thus intermediate activations need to be stored.

While it has been shown that inference is robust to lowering arithmetic precision [2], the impact of fixed-point or reduced-precision floating-point arithmetic on training is not yet fully understood (see Section 6). Until additional light is shed on the topic, a training accelerator must support 32 bit floating-point (fp32) arithmetic to be able to compete with the ubiquitous GPU. We know of only one accelerator that tackles learning in fp32 [4], and it requires the deployment of a significant amount of custom silicon. Using GPUs as training accelerators is attractive because they are readily available, but has its drawbacks. The large die sizes and high power consumption, together with the associated problem of dark silicon, lead to much more silicon being deployed than is strictly necessary. A problem often overlooked by accelerators is that of data storage: training DNNs requires large datasets that need to be brought into the system, and ample room for intermediate results.

In this paper we show that a processing system embedded in a memory device is a competitive and scalable option for training DNNs:

• We describe the computational patterns required to train modern DNNs in Section 2 and show that, despite the diverse set of layers and operations, the number of primitive operations required is small.
• We pair general purpose RISC-V cores with dedicated floating-point streaming co-processors to efficiently implement the identified operations. Section 3 discusses the architecture which organizes these processors into clusters and embeds them into the Logic Base (LoB) die of a Hybrid Memory Cube (HMC).
• In Section 4 we outline how the computational patterns identified in Section 2 map to the proposed architecture.
• Our architecture provides significant computational capabilities in the unused die area on the LoB. In Section 5 we show that through voltage and frequency scaling and additional logic dies we can increase these capabilities significantly, while staying within the thermal design limits of the device. Furthermore we show that our approach can compete with state-of-the-art GPUs in terms of energy efficiency and compute power. Most importantly, little to no additional silicon needs to be deployed besides the memory devices. On a data center scale, this translates into significant savings in power, cooling, and equipment cost.

The remainder of this paper is organized as follows: Section 2 introduces the computational patterns involved in training DNNs. Section 3 describes the proposed hardware architecture and Section 4 shows how DNN layers map to it. Section 5 presents experimental results and comparisons to other accelerators. The remaining sections describe related and future work, and provide a conclusion.


2 Computational Patterns

2.1 Stochastic Gradient Descent

SGD is the most widely used optimization algorithm for machine learning [3]. Given a neural network f(x; θ), it optimizes the network parameters θ to minimize a loss function L(f(x; θ), ŷ). This function models the deviation of the actual output f(x; θ) from the desired output ŷ. In its simplest form, SGD consists of evaluating the loss function for different inputs x (forward pass, Equation 1), finding the derivative with respect to θ (backward pass, Equation 2), and then updating the network parameters (Equation 3). In practice the learning rate ϵ is varied as training progresses.

$$ l_i(\theta) = L(f(x_i; \theta), \hat{y}_i) \tag{1} $$
$$ g = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \, l_i \tag{2} $$
$$ \theta \leftarrow \theta - \epsilon g \tag{3} $$

where m is the number of inputs x_i (batch size) and ŷ_i is the desired output for input x_i. Besides the forward and backward pass, SGD only requires support for the WAXPBY (scaled vector addition, Equation 3) operation in order to perform a full training step on an accelerator.
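For concreteness, the parameter update in Equation 3 is nothing more than a scaled vector addition over the flattened parameter and gradient arrays. A minimal C sketch (our own illustration, not NTX code):

#include <stddef.h>

// Plain SGD parameter update (Equation 3): theta <- theta - eps * g.
// theta and g are the flattened parameter and gradient vectors of length n;
// this is exactly the WAXPBY (scaled vector addition) pattern.
void sgd_step(float *theta, const float *g, float eps, size_t n) {
    for (size_t i = 0; i < n; ++i)
        theta[i] -= eps * g[i];
}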

Table 1. Overview of the layers found in the DNNs considered in this work, and hardware support in this work and [10] for performing a full training step inside the HMC.

Neural Networks      Conv.  Linear  Max Pool  Avg Pool  Softmax  ReLU  Tanh/Sig  Norm.‡  Dropout
AlexNet [11]             X       X         X         –        X     X         –       X        –
VGG 11/16/19 [12]        X       X         X         –        X     X         –       X        –
GoogLeNet [1]            X       X         X         X        X     X         –       X        X
ResNet 18-152 [13]       X       X         X         X        X     X         –       –        –
Incep. v2/v3 [14]        X       X         X         X        X     X         –       X        –
Incep. v4 [15]           X       X         X         X        X     X         –       X        X
RNNs/LSTMs [3]           –       X         –         –        X     X         X       –        X

Hardware Support for Training
NS [10]                 –*      –*        –†        –*       X§    –†        X§      X§        –
NTX (this work)          X       X         X         X       X§     X        X§      X§        X

† Only forward pass supported.  ‡ Local response and batch normalization.
* Requires expensive re-tiling of data.  § div, exp, log, sqrt require iterative evaluation (see Section 4.3.5).

2.1.1 Momentum
A common extension to SGD is the concept of momentum [5], which introduces a "velocity" v that is carried over from one training step to the next. Rather than directly affecting the parameters as in Equation 3, the gradient is now used to alter the velocity, which in turn is used to update the parameters:

$$ v \leftarrow \alpha v - \epsilon g \tag{4} $$
$$ \theta \leftarrow \theta + v \tag{5} $$

A further variation on this concept [6] introduces an additional interim update to the parameters before the gradient is calculated. The fundamental operation of the weight update remains WAXPBY.

2.1.2 AdaGrad, RMSProp, and Adam
The AdaGrad [7] and RMSProp [8] variations on SGD use an adaptive learning rate. In both cases the squared gradient is accumulated across iterations and used to adjust the learning rate as training progresses. The weight update for RMSProp looks as follows:

$$ r \leftarrow \rho r + (1 - \rho)\, g \odot g \tag{6} $$
$$ \theta \leftarrow \theta - \frac{\epsilon}{\sqrt{\delta + r}} \odot g \tag{7} $$

Note that the multiplication (⊙), division, and square-root operations are applied element-wise. In this case an accelerator must also support vector multiplication, division, and square-root operations to perform a full training step. The Adam algorithm [9] keeps separate track of first and second moments across iterations to adaptively adjust the learning process, but uses the same set of fundamental operations.
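To make the element-wise nature of Equations 6 and 7 explicit, here is a scalar C reference of the RMSProp update (our own illustration; an accelerator would vectorize this loop):

#include <math.h>
#include <stddef.h>

// RMSProp parameter update (Equations 6 and 7), applied element-wise.
// r accumulates the squared gradient; rho is the decay rate, eps the
// learning rate, and delta a small constant for numerical stability.
void rmsprop_step(float *theta, float *r, const float *g,
                  float rho, float eps, float delta, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        r[i] = rho * r[i] + (1.0f - rho) * g[i] * g[i];
        theta[i] -= eps / sqrtf(delta + r[i]) * g[i];
    }
}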

2.2 Layer Diversity

Contemporary DNNs consist of a variety of different layers. Table 1 provides an overview of the layers used in the DNNs considered in this work.


Table 2. Intermediate result sizes of the convolutions in ResNet-34. Note that layers are repeated multiple times. During inference, the memory must hold the largest intermediate result, i.e. the max(·) of all output sizes. During training, it must hold all intermediate results, i.e. the Σ(·) of all output sizes and repetitions.

Layer   Rep.   Output Size      Inference   Training
conv1    1×    112 · 112 · 64     803 000     803 000
conv2    6×     56 · 56 · 64      201 000   1 200 000
conv3    8×     28 · 28 · 128     100 000     802 000
conv4   12×     14 · 14 · 256      50 200     602 000
conv5    6×      7 · 7 · 512       25 100     151 000
Total                             803 000   3 560 000

We observe that the layers making up the bulk of computation, namely convolutions and linear layers [3], [4], [10], boil down to Multiply and Accumulate (MAC) reduction operations [16], [17], [18]. Non-linear operations such as max pool and Rectified Linear Unit (ReLU) require additional comparison operations. All other layers can be distilled into vector addition and multiplication operations, either due to their nature or by approximating them iteratively. An iterative approach is feasible if the approximated function is used rarely within the overall computation, which is often the case for div, exp, log, sqrt, and others. We thus conclude that an arithmetic unit capable of vectorized MAC and comparison operations is sufficient to efficiently perform the computations that arise in the considered DNNs (Table 1). Note that while inference architectures also support these operations, other concerns such as data scheduling and the availability of memory prevent their use for training.

2.3 Inference vs. Training

The computations involved in applying a neural network to an input can be described as a Directed Acyclic Graph (DAG), see Figure 1. The layered nature of this graph makes the represented computation amenable to hardware acceleration: the operations can be performed layer-by-layer, requiring only a limited amount of memory to carry intermediate results from one layer to the next. The training of such networks on the other hand requires various derivatives to be computed. Given a computation graph that describes the network (forward pass), automatic differentiation [19] may be used to find a complementary graph (backward pass) that determines the parameter derivatives. The added operations introduce unfavorable dependencies on the intermediate results of the forward pass: the very last derivative to be computed depends on the very first intermediate result produced, thus requiring significantly more memory than the forward pass, where intermediate results could just be discarded. Consider the convolutions of ResNet-34 listed in Table 2 for example. For an inference step, the layer with the largest output size dictates the amount of memory required, i.e. 803 · 10³ values. For a training step, the output of each layer must be kept in memory, i.e. 3560 · 10³ values.
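As a quick cross-check against Table 2 (counts in values, rounded as in the table):

$$ M_{\mathrm{inference}} = \max_l |y_l| = 112 \cdot 112 \cdot 64 \approx 803 \cdot 10^3 $$
$$ M_{\mathrm{training}} = \sum_l r_l \, |y_l| \approx (803 + 6 \cdot 201 + 8 \cdot 100 + 12 \cdot 50.2 + 6 \cdot 25.1) \cdot 10^3 \approx 3560 \cdot 10^3 $$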

2.4 Convolutional Neural Networks

2.4.1 Convolution Shapes

Convolution operations are abundant in Convolutional Neural Networks (CNNs) and are responsible for the majority of computations performed. As such they are a prime target for optimization. Take the following simplified formulation of the forward pass for example:

$$ y = x \ast w, \qquad x, y \in \mathbb{R}^{N \times N}, \; w \in \mathbb{R}^{U \times U} \tag{8} $$

The dimensions of the input and output tensors x, y are generally much bigger than those of the weights w. In real-world networks, we find U = 1, 3, 5 for the weights and N > 100 for the input and output. In simple terms, this is a convolution of a big tensor (x) with a small tensor (w), producing a big tensor (y). A dedicated hardware accelerator can make use of this by keeping the weights in a local memory and streaming the large image data in and out. Optimizations such as the Winograd and Fast Fourier Transform can help further reduce the computational footprint. In contrast, during the backward pass we have to compute the following two convolutions, the first for the chain rule and the second for the gradient:

$$ \Delta x = \Delta y \ast \mathrm{flip}(w) \tag{9} $$
$$ \Delta w = x \ast \Delta y \tag{10} $$

Note that flip(w) is w rotated by 180°. The convolution in Equation 10 has a completely different shape than the previous two: a big tensor (x) convolved with a big tensor (∆y), producing a small tensor (∆w).
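The shape difference becomes obvious when the two reductions are written out. A minimal single-channel C sketch (our own illustration; "valid" convolution, correlation convention, no kernel flip for brevity):

// Forward pass (Equation 8): a big input x convolved with a small kernel w
// yields a big output y.
void conv_fwd(int N, int U, const float x[N][N],
              const float w[U][U], float y[N - U + 1][N - U + 1]) {
    for (int n = 0; n <= N - U; ++n)
        for (int m = 0; m <= N - U; ++m) {
            float a = 0.0f;
            for (int u = 0; u < U; ++u)
                for (int v = 0; v < U; ++v)
                    a += x[n + u][m + v] * w[u][v];
            y[n][m] = a;
        }
}

// Weight gradient (Equation 10): two big tensors, x and dy, are convolved
// to produce the small tensor dw. Same loop body, very different shape:
// the reduction now runs over the large image dimensions.
void conv_wgrad(int N, int U, const float x[N][N],
                const float dy[N - U + 1][N - U + 1], float dw[U][U]) {
    for (int u = 0; u < U; ++u)
        for (int v = 0; v < U; ++v) {
            float a = 0.0f;
            for (int n = 0; n <= N - U; ++n)
                for (int m = 0; m <= N - U; ++m)
                    a += x[n + u][m + v] * dy[n][m];
            dw[u][v] = a;
        }
}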

2.4.2 Striding

Strided convolutions and pooling operations are a common scheme to implement spatial down-sampling. A stride greater than one implies that not all values in the input tensor contribute equally to the output tensor. While not a problem for the forward pass, the operations in the backward pass become less regular, since each incoming gradient may affect a different number of outgoing gradients. Subdividing the values into different cases produces regular sub-problems that can be treated independently. Strided convolutions, for example, can be decomposed into multiple unstrided convolutions, as sketched below.
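To make this concrete for the forward pass with stride s (a standard polyphase decomposition; our own presentation, the backward pass is handled analogously):

$$ y[n,m] = \sum_{u,v} x[sn+u,\, sm+v]\, w[u,v] = \sum_{p,q=0}^{s-1} \big( x_{p,q} \ast w_{p,q} \big)[n,m], $$
$$ x_{p,q}[i,j] = x[si+p,\, sj+q], \qquad w_{p,q}[a,b] = w[sa+p,\, sb+q], $$

i.e. the strided convolution splits into s² unstrided convolutions over sub-sampled copies of x and w.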

2.4.3 Non-linear Layers

The ReLU activation function and max pooling are two prominent examples of non-linear layers commonly employed in DNNs. Implementation is trivial for the forward pass, since both operations boil down to determining the maximum over a set of numbers. During the backward pass however, implementation becomes more involved. ReLU needs to conditionally propagate the incoming gradients based on the corresponding values in the forward pass input. Max pooling is more complicated, as each of the gradients has multiple potential destinations in the pooling window. Additionally, multiple gradients may be propagated to the same location, essentially requiring the concept of an additive scattering operation for the backward pass.
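A scalar C sketch of the two backward operations (our own reference formulation, not the NTX kernels):

#include <stddef.h>

// ReLU backward: propagate the incoming gradient dy only where the
// forward input x was positive.
void relu_bwd(const float *x, const float *dy, float *dx, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dx[i] = (x[i] > 0.0f) ? dy[i] : 0.0f;
}

// Max-pool backward (1D, window W, stride W for brevity): each incoming
// gradient is scattered to the position that held the maximum in the
// forward pass; overlapping windows would make this an additive scatter.
void maxpool_bwd(const float *x, const float *dy, float *dx,
                 size_t n, size_t W) {
    for (size_t i = 0; i < n; ++i) dx[i] = 0.0f;
    for (size_t o = 0; o < n / W; ++o) {
        size_t arg = o * W;
        for (size_t k = 1; k < W; ++k)
            if (x[o * W + k] > x[arg]) arg = o * W + k;
        dx[arg] += dy[o];   // additive scatter
    }
}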


2.4.4 Layer Fusion

Layers in CNNs are generally followed by an element-wise activation function and/or a sub-sampling operation. Accelerators that only perform the forward pass can easily fuse these with the computation of the convolution. For training however, the inputs to each function must be preserved for the backward pass, thus rendering layer fusion impossible.

2.5 Recurrent Neural Networks

Recurrent Neural Networks (RNNs) generally consist of matrix-vector multiplications and activation function evaluations [3]. The former are BLAS2 operations that tend to be memory-bound, which is problematic. By processing multiple samples in parallel (i.e. introducing a batch size), these operations become matrix-matrix multiplications, thus falling under the BLAS3 category [20]. The same applies to Multi-Layer Perceptrons (MLPs) as well. Variations such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) introduce additional bypass operations to compensate for the "Vanishing Gradient" effect [21] during training, but the underlying structure remains the same. RNNs have received little attention in terms of hardware acceleration, especially with respect to training. Our architecture fully supports training LSTMs and GRUs.
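In matrix notation, stacking a mini-batch of B input vectors as columns turns B memory-bound matrix-vector products into a single matrix-matrix product:

$$ y_i = W x_i \;(i = 1,\dots,B) \quad\Longleftrightarrow\quad Y = W X, \qquad X = [\,x_1 \cdots x_B\,], \; Y = [\,y_1 \cdots y_B\,], $$

so every element of W is reused across the B columns, moving the operation from the BLAS2 into the BLAS3 regime.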

2.6 Compression and Sparsity

Aggressive exploitation of sparsity is becoming more common for inference. Recent work has investigated the possibility to reduce the size of neural networks through quantization and compression [22], as well as joint training and pruning [23] techniques, while maintaining the network’s accuracy. These approaches do not generally apply to training, and our architecture has no direct support for them.

3 Architecture

The LoB of an HMC offers a unique opportunity to introduce a Processor-in-Memory (PiM) as depicted in Figure 2. The memory dies are subdivided into vertically connected vaults, with individual memory controllers on the LoB. Traffic between the serial links and vaults is transported by means of an all-to-all network (logically a crossbar) [24], [25], [26], [27], [28]. Our architecture consists of multiple processing clusters attached to a crossbar,¹ which thus gain full access to the entire memory space of the HMC. The memory cube is attached to a host CPU or other cubes via the four serial links. The on-chip network is responsible for arbitration of traffic between the serial links, the DRAM, and the PiM. This arbitration can be prioritized such that external memory accesses from the serial links are given priority over internal ones originating in the processing system. It also allows requests from the PiM to be routed to the serial links, for inter-HMC communication [29].

1. This network is implemented efficiently using a multi-stage logarithmic interconnect such as that described in [29].

3.1 Processing Cluster

We combine a general purpose RISC-V processor core [30] with multiple NTX floating-point streaming co-processors. Both operate on a 128 kB TCDM which offers a shared memory space with single-cycle access. The memory is divided into 32 banks that are connected to the processors via a low-latency logarithmic interconnect. These form a cluster which also contains a DMA engine that is capable of transferring two-dimensional planes of data between the TCDM and the HMC's memory space. This solution has proven to be more area- and energy-efficient than implicit caches, and the DMA can anticipate and time block data transfers precisely, thereby hiding latency [27], [31], [32]. The RISC-V processors perform address calculation and control data movement via the DMA. Actual computation is performed on the data in the TCDM by the NTX co-processors, which we describe in the next section. Address translation is performed either in software or via a lean Memory Management Unit with Translation Look-aside Buffer as described in [27]. This allows the PiM to directly operate on virtual addresses issued by the host. If there are multiple HMCs attached to the host, care must be taken since the PiMs can only access the memory in the HMC that they reside in. An additional explicitly-managed memory outside the clusters, labelled "L2" in Figure 2, holds the RISC-V binary executed by the processors and additional shared variables. The binary is loaded from DRAM.

3.2 Network Training Accelerator

The computations involved in training DNNs are highly regular. To leverage this feature we developed the Network Training Accelerator (NTX), a floating-point streaming co-processor that operates directly on the TCDM. Conceptually the NTX co-processor is similar to the one presented in [10], but it is a complete redesign optimized for performance and training. The streaming nature of the co-processor alleviates the need for a register file, which is a common bottleneck in SIMD architectures and thus limits their scalability [33]. The architecture of NTX is depicted in Figure 3. It consists of four main blocks: (i) the FPU containing the main data path, (ii) the register interface for command offloading, (iii) the controller that decodes the commands and issues micro-instructions to the FPU, and (iv) the address generators and hardware loops.

3.2.1 Floating-Point Multiply and Accumulate
The FPU in NTX can perform fast FMAC operations with single-cycle throughput. It is based on a Partial Carry-Save (PCS) accumulator [34] which aggregates the 48 bit multiplication result at full fixed-point precision (≈300 bit). After accumulation the partial sums are reduced in multiple pipelined segments. In order to reach an operating frequency above 1.5 GHz in 28 nm (SS, 125 °C, 1.0 V), two segments are sufficient. The employed format has been aligned with IEEE 754 32 bit floats.

3.2.2 Nested Hardware Loops
When implemented in C code, the layers in DNNs turn into deeply nested loops.


NTX 16 (small): m = 16 clusters, n = 1 core per cluster, k = 8 NTXs per cluster, NTX @ 1.5 GHz.
NTX 64 (big): m = 64 clusters, n = 1 core per cluster, k = 8 NTXs per cluster, NTX @ 1.5 GHz.

Figure 2. Top-level block diagram of one HMC enhanced with m processing clusters. The LoB contains the vault controllers, main interconnect, and the four serial links that lead off-cube. The proposed processing clusters attach directly to the main interconnect and gain full access to the HMC's memory space and the serial links. Each cluster consists of a DMA unit, a TCDM, and one or more RISC-V processor cores augmented with NTX streaming co-processors. We designed the NTX to operate at 1.5 GHz, while the remaining additions to the system operate at 750 MHz.


Figure 3. Block diagram of the NTX accelerator. It contains 5 hardware loops; 3 address generator units; a double-buffered command staging area in the register interface; a main controller; and an FPU with a comparator, index counter (for argmax calculations), and a fast FMAC unit. The employed depths for all FIFOs are indicated and have been determined in simulations for a TCDM read-latency of 1 cycle.

Consider the following example of a 2D convolution which operates on a 3D input tensor x and produces a 3D output tensor y:

for (int k = 0; k < K; ++k)
  for (int n = 0; n < N; ++n)
    for (int m = 0; m < M; ++m) {
      float a = b[k];
      for (int d = 0; d < D; ++d)
        for (int u = 0; u < U; ++u)
          for (int v = 0; v < V; ++v) {
            a += x[d][n+u][m+v] * w[k][d][u][v];
          }
      y[k][n][m] = a;
    }

As can be seen, the resulting code consists of 6 nested loops. To maximize autonomy and reduce offloading overhead, we designed the NTX to perform up to 5 of the innermost loops directly in hardware. This allows us to offload a 3D

reduction operation over a 2D area in one command. The remaining outer loops are executed on the control processors, interleaved with DMA control and high-level address calculation. Many loop nests in C follow a similar pattern: they contain a simple operation in the innermost loop's body which uses the loop indices for address calculation, and outer loops initialize an accumulation variable or store back a result. Figure 5 depicts the structure of the loops that NTX can directly execute. Each command issued contains an outer, init, and store level parameter, which dictate the number of loops and the levels at which the accumulator is initialized and stored back to memory. The indices of the five hardware loops are combined into two operand and one result address in the Address Generation Units (AGUs). The accumulator can be initialized to zero or to the value at any of the three addresses. Two ports into the TCDM perform the read and write operations, which we have found to be sufficient to maintain single-cycle reduction operations. Buffering queues of depth five and seven at the memory ports help smooth the effects of congestion in the memory system.
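Conceptually, each address generator evaluates an affine function of the five loop indices. A minimal C model (structure and field names are our own, for illustration only, not the actual NTX register map):

#include <stdint.h>

// Model of one NTX address generator: the address is an affine function
// of the five hardware loop indices, with one stride per loop level.
typedef struct {
    uint32_t base;       // base address in the TCDM
    int32_t  stride[5];  // byte stride contributed by each loop level
} agu_t;

static uint32_t agu_address(const agu_t *agu, const uint32_t idx[5]) {
    uint32_t addr = agu->base;
    for (int l = 0; l < 5; ++l)
        addr += (uint32_t)(agu->stride[l] * (int32_t)idx[l]);
    return addr;
}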


Figure 4. A 3x3 convolution running on one cluster. The periods of activity of the RISC-V processor and the DMA unit are shown as blocks, the activity of the co-processors is indicated as the number of active NTXs. The processor and DMA are busy during 58.2% and 28.2% of the computation, respectively. The utilization of the co-processors is given as percent of maximum throughput. The efficiency of the TCDM is given as the percentage of memory requests serviced per cycle; the remaining requests stall due to conflicts. The system has a banking factor of 1.8. A description of the convolution is given in Section 3.2.4.


Figure 5. (a) Nested C loops directly executable by NTX. (b) Overview of the supported commands and their throughput.

3.2.3 Conditional Streaming Operations

Non-linear layers such as ReLU or max pool require scattering operations or conditional accumulation to evaluate the backward pass (see Section 2.4.3). As shown in Figure 5b, the NTX supports conditional streaming operations (e.g. MASK and MASKMAC). Values read from memory can be masked or conditionally accumulated based on a comparison performed on a second stream of values, which requires three and two memory accesses per cycle, respectively. Since our architecture has two memory ports, the MASK instruction runs at 2/3 operations per cycle.
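A scalar C model of these conditional streaming operations as we read them from the description above (the exact NTX command semantics may differ in detail):

#include <stddef.h>

// MASK-style: three memory accesses per element (two reads, one write).
// Each value is kept or zeroed based on a comparison with a second stream.
void mask_stream(const float *val, const float *cmp, float *out, size_t n) {
    for (size_t i = 0; i < n; ++i)
        out[i] = (val[i] == cmp[i]) ? val[i] : 0.0f;
}

// MASKMAC-style: two memory accesses per element (two reads); the result
// stays in the accumulator and is conditionally updated.
float maskmac_stream(const float *val, const float *cmp, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        if (val[i] == cmp[i]) acc += val[i];
    return acc;
}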

3.2.4 Structure Reuse and Efficient Offloading

When we parallelize operations across multiple co-processors, the structure of the performed loops is generally the same across all of them. If we operate on tiles, even multiple offloads share the same loop configuration and merely differ in the operand base addresses. Therefore we have implemented command offloading via a staging area which holds the entire command configuration. When issuing a command this configuration is copied into the controller, giving us the benefit that the next command by default reuses the previous configuration. This double-buffering scheme also allows the next command to be prepared while one is still executing. This staging area is memory-mapped into the CPU core's address space. Sibling co-processors can either be programmed individually or, by using a special broadcast address, all at once. This minimizes the number of CPU instructions needed per offload. As a concrete example, consider the following pseudo code of a convolution that offloads the innermost five loops onto eight NTX co-processors:

// x: D*N*M, w: K*D*U*V, b: K, y: K*N*M
for tiles tK,tN,tM of K,N,M {
  DMA: load b;
  NTX: init y with b;
  for tiles tD of D {
    DMA: load x,w;
    for groups of 8 in tK {
      NTX: for n,m,d,u,v in tN,tM,tD,U,V {
        y[k][n][m] += x[d][n+u][m+v] * w[k][d][u][v];
      }
    }
  }
  DMA: store y;
}

The convolution operates on input x, weights w, and bias b, and produces an output y. The first loop subdivides the output into tiles. For each tile we load the bias b using

the DMA and initialize the output y with it using the NTXs. The second loop then subdivides the input into tiles. For each tile we load the input x and weights w using the DMA, offload the entire convolution of the tile to the NTXs, and store the output y back using the DMA. Our actual implementation of the above code overlaps operation of the DMA and the NTXs by slightly rearranging the loops and inserting appropriate synchronization points. The transformation is trivial but verbose, which is why we omit it here. Figure 4 shows the execution of one iteration of the innermost loop. The NTX's ability to perform long-running calculations independently allows the RISC-V processor to perform address calculation and orchestration of the DMA in the meantime. As a matter of fact, a significant portion of the core's 58% utilization is due to the DMA only being capable of two-dimensional transfers. This requires the core to perform more address calculations and DMA control than strictly necessary. The presented convolution is compute-bound, which manifests itself as a low DMA utilization of only 28% of the available bandwidth. Once a command is running the NTXs reach a utilization of 84%, i.e. they operate at 84% of their theoretical throughput. The remaining 16% are lost to banking conflicts in the TCDM. Overall the TCDM achieves an efficiency of 87%. The efficiency can be improved further by increasing the banking factor above the current 1.8 (32 banks, 18 ports), at the cost of a larger interconnect. We have found 32 banks to offer a good area/efficiency compromise.

4 Programming Model

In this section we present an overview of common layer types; identify the challenges involved in training and key differences to inference; elaborate on the computations involved in selected layers; and then discuss our offloading model, memory considerations, and how individual layers can be mapped to NTX for the forward and backward pass. We conclude with a discussion of compression and variants of the training algorithm.

4.1 Offloading

As described in Section 3 in detail, the NTX provides a memory-mapped staging area for command preparation. Upon a write to the command field, the entire configuration is copied from the staging area into a separate register where it is executed by the NTX. The write blocks if the register is still occupied by a command in execution. This allows command configuration and offloading to overlap command execution, allowing the controlling processor to configure the next command while the NTX is still busy. A convenient side effect of this scheme is that the previous command's configuration remains in the staging area, such that the next command need only adjust fields that have changed, and can reuse the rest.
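The following sketch shows what this offloading pattern could look like from the control core; the structure layout and field names are hypothetical, chosen only to illustrate the staging and double-buffering idea, and do not reflect the actual NTX register map:

#include <stdint.h>

// Hypothetical memory-mapped staging area of one NTX (illustrative only).
typedef struct {
    volatile uint32_t base_addr[3];   // two operand and one result address
    volatile uint32_t loop_bound[5];  // bounds of the five hardware loops
    volatile uint32_t stride[5 * 3];  // per-loop strides of the three AGUs
    volatile uint32_t levels;         // outer / init / store loop levels
    volatile uint32_t command;        // writing here launches the command
} ntx_staging_t;

// Issue a command: only the fields that changed since the previous offload
// need to be rewritten. Per the scheme above, the write to `command` copies
// the staging area into the execution register and stalls if a command is
// still in flight.
static void ntx_offload(ntx_staging_t *ntx, uint32_t x, uint32_t w,
                        uint32_t y, uint32_t opcode) {
    ntx->base_addr[0] = x;
    ntx->base_addr[1] = w;
    ntx->base_addr[2] = y;
    ntx->command = opcode;   // launch; loop configuration is reused
}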

4.2 Memory

As described in Section 3 and [10], [27], the NTX operates exclusively on memory local to the cluster. A DMA unit in conjunction with an MMU is used to copy data from DRAM via the host's virtual addresses into cluster memory, where the accelerators access them directly.


Figure 6. Data transferred by the DMA at different burst lengths, when calculating a 3 × 3 convolution tile. The histogram bins the total amount of data transferred between the TCDM and DRAM by burst length. Large bursts correspond to the transfer of input and output data, while small bursts are due to transfer of convolution weights.

via the host’s virtual addresses into cluster memory, where the accelerators access them directly. The cluster as described provides local scratchpad memory that needs to be explicitly managed rather than a cached transparent view into the entire memory space. Thus the operations to be performed on NTX and the cluster need to be appropriately tiled along the dimensions of their input and output data. The input data of each tile is copied into the cluster, and the output is copied back into DRAM using the DMA. This data can be laid out in the DRAM in a pre-tiled fashion as described in [10], which allows it to be transferred with a single DMA operation. Establishing such a tiling however is non-trivial and re-tiling of data is likely to be necessary for DNNs. We thus propose the tiling to be performed on-the-fly, such that the accelerator need not make any assumptions about the layout of the data in DRAM. To fully utilize the bandwidth into DRAM, it is paramount that the accesses emitted by the DMA occur in sufficiently long bursts and have high locality with respect to DRAM pages to reduce overhead. In the case of 4D tiling [10], this is given by the fact that the pre-tiled data lies in DRAM as a dense consecutive sequence. In the case of on-the-fly tiling the DMA has to issue more and smaller bursts since the required data does not lie in DRAM consecutively. The tile dimensions however offer multiple degrees of freedom to adjust the access patterns generated by the clusters. We expect multiple clusters to operate on neighboring tiles and often with the same filter kernels. This offers, in conjunction with transaction reordering and coalescing at the DRAM controller, ample opportunities to ensure the accesses into DRAM fully leverage the bus width and locality within open pages. For example, HMCs [28] use an internal bus width of 32 B, and a maximum block size (page size) in the range of 32 B to 256 B. The innermost tile dimension would then be chosen to be a multiple of 32 B. Consider for example one tile of the 3 × 3 convolution discussed earlier. Figure 6 shows the amount of data read and written by the DMA at specific burst lengths. Most data transfers occur as bursts of 72 B, 88 B, or 96 B, and 92% of all data is transferred in bursts above 32 B. The few small bursts are due to convolution weight transfers, which can be cached to improve burst length further. We thus conclude that our architecture is capable of fully utilizing DRAM bandwidth by emitting sufficiently large accesses.


Figure 7. Irregularity introduced by stride in stencil operations such as convolution and max pooling. With a stride of one (left), all input cells contribute to the same number of output cells. With a stride of three (right), input cells contribute to one, two, or three output cells.

4.3 Mapping Layers to NTX

4.3.1 General Strategy
The combination of general purpose RISC-V processors and dedicated floating-point streaming co-processors makes our architecture very flexible. It is a many-core platform with explicitly-managed scratchpad memories, where data copies and efficient bursts of Floating-Point Operations (flops) are controlled by the program running on all CPU cores. This program is not restricted to a simple compute kernel, but can be a much larger piece of code that performs multiple training iterations in a row. Thus once training data has been loaded into DRAM, our architecture can perform an entire training epoch autonomously without requiring any coordination with a host computer. The explicit memory and computation management makes programming NTX more involved than classic architectures with implicit caching. The following subsections provide details in this regard. However, since our processing cores are based on the RISC-V instruction set architecture, we can make use of the existing tool chain ecosystem [35], [36]. Programs can be written in C or C++ and compiled with GCC.²

4.3.2 Tiling
The NTXs in each cluster operate on an explicitly-managed scratchpad memory, which in general does not have sufficient capacity to hold the entire data associated with a layer at once. It is therefore necessary to subdivide the computation for each layer into smaller tiles. The control processor then coordinates tile-wise (a) copying of input data and parameters from DRAM into cluster memory, (b) operation of the NTX, and (c) copying of output data back from cluster memory to DRAM. See Section 3.2.4 for a concrete example.

4.3.3 Strided Convolution
A convolution's stride introduces an irregularity in how many output elements a given input affects. This poses a problem for the backward pass, since the incoming gradients now need to be propagated to a variable number of points. See Figure 7 for a visual depiction of the pattern. Such a pattern cannot be generated with NTX. However, we observe that we can subdivide the backward pass into multiple convolutions with a reduced kernel size and no stride, each of which is regular in the sense that the number of correspondences between input and output elements is constant.

2. At the time of writing, GCC 5.2.0 with custom extensions and the C runtime of the PULP Platform was used [30].

4.3.4 Strided Pooling
Similar to the convolution case, the stride in pooling introduces an irregularity which makes implementation of the backward pass challenging. Each outgoing gradient may have none, one, or multiple incoming gradients propagated to it. Again, see Figure 7. To account for this, we initialize the outgoing gradient to 0 and for every incoming gradient value iterate over all possible target locations within the pool, and conditionally add it to the outgoing gradient where appropriate. This requires us to keep track of the index of the maximal element for each output value during the forward pass. Average pooling exhibits the same irregularities due to striding, but the implementation of the backward pass is simpler in the sense that the incoming gradient is propagated to all possible locations within the pool, rather than to only one selectively.

4.3.5 Special Functions (exp, log, div, sqrt)
There is no dedicated hardware to evaluate special functions such as division, exp, log, square roots, or arbitrary powers. As the number of such operations is typically very low (on the order of a few thousand per training step), it is feasible to implement them using iterative algorithms [37] on the NTX, calculating multiple results in parallel (see the sketch below). For tens to hundreds of inputs, pipeline latency can be hidden and the evaluation takes on the order of 30 to 100 cycles per element.

4.3.6 Normalization
Schemes such as Local Response and Batch Normalization [38] can be implemented in our architecture by performing multiple passes over the input tensor. In a first pass, the necessary statistics of the input tensor are calculated. In a second pass, the values in the input tensor are offset and scaled as dictated by the employed normalization.
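As an illustration of such an iterative scheme (a standard Newton-Raphson formulation, not necessarily the exact algorithm of [37]), reciprocal and reciprocal square root can be refined using only multiply-add operations, which map directly onto the FMAC data path:

// Newton-Raphson refinement using only multiplies and adds.
// Reciprocal: r <- r * (2 - a*r) converges quadratically to 1/a.
static float nr_recip(float a, float r0, int iters) {
    float r = r0;                       // initial estimate, e.g. from a table
    for (int i = 0; i < iters; ++i)
        r = r * (2.0f - a * r);
    return r;
}

// Reciprocal square root: r <- r * (1.5 - 0.5*a*r*r) converges to 1/sqrt(a);
// sqrt(a) and a/b then follow as a * rsqrt(a) and a * recip(b).
static float nr_rsqrt(float a, float r0, int iters) {
    float r = r0;
    for (int i = 0; i < iters; ++i)
        r = r * (1.5f - 0.5f * a * r * r);
    return r;
}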

5 Results and Evaluation

In this section we evaluate the silicon and energy efficiency of our proposed architecture and compare it against the previous implementation [10]. Furthermore we investigate the effects of voltage and frequency scaling and the impact of multiple logic dies per memory cube. We conclude by comparing different NTX configurations against existing accelerators and evaluate the data center scale impact of our architecture.

5.1 Methodology

5.1.1 DRAM Power
We model the power consumption of the vault controllers, DRAM dies, and HMC interconnect as the following relationship: P_dram(B) = 7.9 W + B · 21.5 mW s/GB, where B is the requested bandwidth. We call this the "DRAM" power. This model is based on the observation in [10] that 8.8 W are consumed in a 1 GB cube, with a ±10% deviation due to traffic into DRAM. Under no traffic this equates to 0.9 · 8.8 W = 7.9 W. Under an average traffic of an estimated 51.2 GB/s caused by their investigated workloads, this increases to the reported 8.8 W, a bandwidth-dependent power increase of 21.5 mW s/GB. These estimates are conservative and do not consider further power-saving measures, such as DVFS or power gating of HMC components.


Figure 8. Execution time of a kernel running on a cluster. See Section 5.1.2 for details. T_dseq corresponds to memory transfers that need to happen before and after the main computation, e.g. the first data fetch and the last data store. T_dpar corresponds to transfers that can happen in parallel to the computation T_c. Shown are a compute-bound case where T_c dominates, and a memory-bound case where T_dpar dominates.

5.1.2 Cluster Power
We have synthesized our design for a 28 nm Fully Depleted Silicon On Insulator (FD-SOI) technology using Synopsys Design Compiler, which we also use to estimate power based on simulation traces. The Register Transfer Level (RTL) model was back-annotated with timing information obtained from the synthesized design at 125 °C / 1.0 V (slow-slow corner). To obtain a representative power figure of the cluster under load, we extracted the traces from an RTL simulation of the cluster running a 3 × 3 convolution. We observe an energy of 165 pJ per clock cycle. Since simulating a large number of computations is prohibitively slow on RTL, we model the power consumption and execution time of a kernel on one cluster as follows. For each kernel we determine the execution time of the computation (T_c) and DMA transfers (T_dpar, T_dseq) as:

$$ T_c = N_c \,/\, (\eta_c r_c f) \quad [\mathrm{s}] \tag{11} $$
$$ T_{dpar} = (D_{dma} - D_{head} - D_{tail}) \,/\, (\eta_d r_d f) \quad [\mathrm{s}] \tag{12} $$
$$ T_{dseq} = (D_{head} + D_{tail}) \,/\, (\eta_d r_d f) \quad [\mathrm{s}] \tag{13} $$

where T_dpar represents the DMA transfers that can run in parallel with computation and T_dseq those that need to happen before ("head") and after ("tail") the computation. In more detail, N_c and D_dma are the total number of compute operations performed and bytes transferred by the kernel; r_c are the peak compute operations per cycle of the cluster; and r_d is the peak bandwidth of the DMA per cycle. For the architecture with 8 NTXs presented in Section 3, r_c = 8 op and r_d = 4 B. η_c and η_d account for inefficiencies such as interconnect contentions and are determined empirically from simulations. We then formulate the execution time, requested bandwidth, and power consumption of the kernel as:

$$ T_{cl} = \max\{T_c,\, T_{dpar}\} + T_{dseq} \quad [\mathrm{s}] \tag{14} $$
$$ B_{cl} = D_{dma} \,/\, T_{cl} \quad [\mathrm{B/s}] \tag{15} $$
$$ P_{cl} = 165\,\mathrm{pJ} \cdot f \quad [\mathrm{W}] \tag{16} $$

See Figure 8 for a visual explanation. Note that we issue DMA transfers in chunks of multiple kB and the engine is


capable of having multiple simultaneous transfers in flight. This allows us to hide the latency into DRAM, which we estimate to be on the order of 40 cycles [39]. It is crucial that we fully saturate the precious bandwidth into DRAM when performing strided memory accesses, e.g. when transferring a tile of a tensor. There the length of the tile's innermost dimension is critical, as it determines the length of one burst access. Since we have full control over the tiling, we can ensure that a tile has at least 8 elements along its shortest dimension. This yields consecutive accesses of at least 32 B, which is the minimum block size in an HMC [28].

5.1.3 Cube Power
Based on the above, we model the requested bandwidth, power consumption, and energy efficiency of a kernel parallelized on an HMC with K clusters as

$$ B = K \cdot B_{cl} \quad [\mathrm{B/s}] \tag{17} $$
$$ T = T_{cl} \,/\, K \quad [\mathrm{s}] \tag{18} $$
$$ P = P_{dram}(B) + K \cdot P_{cl} \quad [\mathrm{W}] \tag{19} $$
$$ \eta = N_{fp} \,/\, (P\,T) \quad [\mathrm{flop/s\,W}] \tag{20} $$
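For reference, the complete model of Equations 11 to 20 can be restated compactly in C (our own restatement, using the 165 pJ per cycle figure and the DRAM power relationship from Section 5.1.1):

// Restatement of the cluster/cube performance and energy model
// (Equations 11-20); inputs are per-kernel operation and byte counts.
typedef struct {
    double Nc, Nfp;            // compute ops and flops of the kernel
    double Ddma, Dhead, Dtail; // total, head and tail DMA bytes
} kernel_t;

double cube_efficiency(kernel_t k, double f_hz, int K,
                       double eta_c, double eta_d) {
    const double rc = 8.0, rd = 4.0;          // ops and bytes per cycle
    double Tc    = k.Nc / (eta_c * rc * f_hz);                          // (11)
    double Tdpar = (k.Ddma - k.Dhead - k.Dtail) / (eta_d * rd * f_hz);  // (12)
    double Tdseq = (k.Dhead + k.Dtail) / (eta_d * rd * f_hz);           // (13)
    double Tcl   = (Tc > Tdpar ? Tc : Tdpar) + Tdseq;                   // (14)
    double Bcl   = k.Ddma / Tcl;                                        // (15)
    double Pcl   = 165e-12 * f_hz;                                      // (16)
    double B = K * Bcl, T = Tcl / K;                                    // (17), (18)
    double Pdram = 7.9 + (B / 1e9) * 21.5e-3;   // DRAM model of Sec. 5.1.1
    double P = Pdram + K * Pcl;                                         // (19)
    return k.Nfp / (P * T);                     // (20), efficiency in flop/s W
}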

Note that we assume the kernels to be embarrassingly parallel, which is true for mini-batched machine learning tasks. Also note that N_fp (flop/s) is related to but distinct from N_c (op/s): the NTX is capable of performing a fused FMAC which counts as one compute operation, but two floating-point operations.

5.1.4 Neural Network Models
To expand our estimates to entire neural networks, we first model the execution time and energy efficiency of different layers. To this end, we build a library of functions that model the number of computations and data transfers required for inference and training steps of each layer, parametrized over the input and output dimensions and layer parameters. We then model the inference and training of the neural networks presented in this section as successive calls to these library functions. Furthermore we assume that the training data has been loaded into the HMC by the host processor beforehand, which for the 1 GB cube takes less than 10 ms at a peak bandwidth of 240 GB/s (90% of the 320 GB/s theoretically provided by the HMC [28]). The energy spent to do this is amortized by the HMC performing independent operation on the data for several seconds, e.g. by performing dataset augmentation to prolong the time to the next host intervention.

5.1.5 Technology Scaling
We use internal comparisons and publicly available information to estimate the effect of scaling down the technology node of the LoB from the 28 nm FD-SOI process investigated by us to a more modern 14 nm FinFET node [40], [41]. For this change we observed across several designs an increase of 1.4× in speed, a decrease of 0.4× in area, and 0.7× in dynamic power dissipation. To our knowledge there is no publicly available information on the DRAM characteristics of HMCs. "SMCSim" [27] assumes them to be similar to the MT41J512M8 device by Micron, which is based on a 50 nm process. Given the manufacturer and [24], the device seems to be a reasonable reference for early HMCs. We estimate the DRAM technology scaling factor for power consumption to be 0.87, by comparing the supply currents and voltages of this device to the newer 30 nm MT40A512M8.


Figure 9. Energy efficiency versus operating frequency of different numbers of clusters and different technology nodes: 28 nm logic / 50 nm DRAM, 14 nm logic / 30 nm DRAM. The clusters perform a 3 × 3 convolution. The voltage is varied between 0.6 V and 1.2 V in proportion to the frequency. Points of highest efficiency of each configuration are marked in bold red. The internal bandwidth of the HMC puts an upper bound on the achievable energy efficiency, visible in the upper half of the graph.

5.2 Precision, Sparsity, Compression

Training a DNN with reduced floating-point precision or even fixed-point arithmetic is much harder than doing the same for inference. The intuition here is that the SGD algorithm performs smaller and smaller changes to the parameters as training progresses. If these changes fall beneath the numeric precision, the algorithm effectively stops converging. There is no a priori obvious range of magnitudes within which parameters fall, thus the arithmetic must support a significant dynamic range without additional prior analysis. NTX employs 32 bit floating-point arithmetic which is commonly used in deep learning frameworks and CPUs/GPUs, rendering such analysis unnecessary. Note that there is evidence that training is possible in fixed-point arithmetic with little accuracy loss in some cases [42]. However, results tend to be limited to specific networks and other work suggests that reducing precision may not be feasible at all without incurring significant accuracy loss [43]. Recent work on network compression and pruning techniques has shown promising results in terms of reducing computational overhead [22], [44], [45], [46]. The general purpose nature of the RISC-V processors in our architecture allows some of these schemes to be implemented. For example entire convolutions may be skipped or certain forms of decompression and re-compression may be performed on the processor cores. The NTX has not been optimized for sparse tensor operations however, and we leave their detailed analysis for future work.


5.3 Voltage and Frequency Scaling

In this section we assess the efficiency of NTX at different operating points. We vary the supply voltage between 0.6 V and 1.2 V, and the operating frequency between 0.1 GHz and 2.5 GHz for the 28 nm process and between 0.14 GHz and 3.5 GHz for the 14 nm process.


Figure 10. Energy dissipation of different configurations, evaluated at their most-efficient operating point in Figure 9. Note that even the massively parallel configurations with more than 64 clusters are below a TDP of 25 W.

The voltage is assumed to scale linearly with frequency [47] and is thus varied in proportion to the frequency. Figure 9 plots the energy efficiency of HMCs with different NTX configurations against the operating frequency. Two counteracting effects lead to a tradeoff between efficiency and frequency: on one hand, DRAM consumes significant static power, making it beneficial to operate at a higher frequency to decrease the time to solution. On the other hand, the NTX power consumption increases quadratically with voltage and thus frequency. For larger configurations, the internal bandwidth limit of the HMC is reached at a certain frequency, visible as a dent in the efficiency. The points of highest efficiency are listed in Table 4. Figure 10 shows a breakdown of the power consumption at these operating points. All configurations remain within a power budget of 25 W, which according to [48] is feasible for an HMC with active cooling. If the static power of the DRAM decreases, e.g. by switching to a different memory technology, these optimal operating points will change.

5.4 Multiple Logic Layers

Table 4 shows the area occupied by different NTX configurations. The unoccupied area on the LoB is not precisely known, and estimates range from 10 mm² [10] to 50 mm² [49]. In the following we assume that the LoB has an area of 50 mm², of which 25 mm² are unused and thus available to custom logic. This allows configurations of up to 64 clusters per HMC. For larger configurations, we propose the use of multiple stacked logic dies such as the 3D Logic in Memory (LiM) proposed in [50]. While the use of additional layers increases the complexity of the die stack, they allow for a significant increase in parallelism and efficiency. Furthermore, the use of LiM layers for custom accelerator logic has the additional benefit of decoupling the LoB manufacturing process from the accelerator, thus allowing modular assembly of "Application Specific Memory Cubes (ASMCs)". We expect this concept to be relevant for High Bandwidth Memory (HBM) [51] as well.

5.5 Comparison with NeuroStream

NeuroStream (NS) [10] was aimed primarily at efficient inference and requires data to be very carefully laid out in memory (4D tiling). This constraint on data layout makes training very inefficient, since intermediate activations after each layer need to be re-tiled when storing them back to DRAM. This puts a significant workload on the RISC-V processor cores and causes additional traffic into memory.


Table 3. Architecture comparison between NTX (this work) and the inference architecture NS [10]. Figures of merit based on inference and training steps of GoogLeNet [1]. We compare implementations in 28 nm FD-SOI technology.

Figure of Merit               NS [10]   NTX "small"   NTX "big"
Number of Clusters                 16            16          64
Cores per Cluster                   4             1           1
Accelerators per Core               2             8           8
Cluster Frequency [GHz]           1.0          0.75        0.75
Accelerator Frequency [GHz]       1.0           1.5         1.5
Peak Performance [Gop/s]          256           384        1536
Core Efficiency [Gop/s W]         116            97          97

Area [mm²]
Single Cluster                  0.580         0.636       0.636
– SRAM                          0.300         0.300       0.300
– Logic                         0.280         0.336       0.336
Total                             9.3          10.5        41.0
– SRAM                            4.8           5.1        19.5
– Logic                          4.48          5.38        21.5

Power [W]
Total                            11.2          13.1        28.7
– Logic                          1.10          2.31        9.24
– SRAM                           1.10          1.65        6.60
– DRAM                           9.00          9.14        12.9

Inference
Total [ms]                      14.0†          11.3        2.83
– Convolution [ms]               13.1          10.5        2.63
– Linear [ms]                    0.08          0.07        0.02
– Pooling [ms]                   0.83          0.74        0.19
Avg. Bandwidth [GB/s]            14.4          17.8        71.0
Peak Bandwidth [GB/s]            51.2          57.6         230
Efficiency [Gop/s W]             20.3          21.4        39.1

Training
Total [ms]                      56.8†          34.8        8.69
– Convolution [ms]               54.8          33.1        8.23
– Linear [ms]                    0.65          0.43        0.11
– Pooling [ms]                   1.38          1.23        0.31
Avg. Bandwidth [GB/s]            11.3          18.5        74.0
Peak Bandwidth [GB/s]            51.2          57.6         231
Efficiency [Gop/s W]             15.0          21.0        38.3

† Estimated assuming no pre-tiled memory layout (4D tiling [10]). ReLU and pooling layers are not directly supported by this architecture and have to be emulated on the control cores.

The processors are under high load to keep the NS saturated with floating-point operations, such that spending compute cycles on re-tiling also means stalling the NS co-processors. Our architecture does not depend on such a tiling. In Table 3 we compare NTX to NS, both implemented in 28 nm. The much improved offloading scheme allows us to increase the ratio of co-processors to control cores from 2:1 to 8:1. The fast FMAC allows us to operate the NTX at twice the frequency of the rest of the cluster, leading to an increase of peak performance from 256 Gop/s to 384 Gop/s for the 16 cluster version. The increased number of hardware loops and operations supported by NTX, together with the improved performance, allows us to increase the energy efficiency of a training step from 15 Gop/s W to 21 Gop/s W. The 16 cluster configuration requests a peak bandwidth of 57.6 GB/s, which does not saturate the internal bandwidth of up to 320 GB/s available inside the HMC. We can improve the energy efficiency to 38.3 Gop/s W by increasing the number of clusters to 64.


Figure 11. Comparison of energy efficiency when training the networks listed in Table 4 (geometric mean), with GPUs, NS [10], and the largest NTX configurations that do not require additional LiMs. NTX 32 in 28 nm achieves a 2.5x increase, and NTX 64 in 14 nm a 2.7x increase in efficiency over GPUs in similar technology nodes.

5.6 Comparison with other Accelerators

To compare against other accelerators, we use one training step of AlexNet [11], GoogLeNet [1], Inception v3 [14], three variants of ResNet [13], and an LSTM with 512 inputs and hidden states as workload. Table 4 and Figure 11 provide an overview of the compared architectures. To our knowledge there are three other custom accelerators that claim support for training at precisions similar to ours: NeuroStream [10], DaDianNao [42], and ScaleDeep [4]. DaDianNao is based on a 32 bit fixed-point data path, but evidence that training is possible with no accuracy loss is only provided for a simple network trained on the MNIST dataset. Nevertheless we include it in Table 4 for the sake of completeness. ScaleDeep is a systolic distributed architecture that provides full 32 bit floating-point arithmetic. It keeps the network parameters and state in distributed memory, which reduces data movement and increases efficiency, but also sets a lower bound on the amount of silicon required to train a DNN. Furthermore it remains unclear where the large training datasets can be stored in such architectures. Certainly additional external storage is required, which reduces energy efficiency. HMCs have the advantage of providing a high capacity of up to 8 GB per device in newer versions [55], allowing even a small accelerator configuration to work on large networks and datasets. GPUs are currently the accelerator of choice to train DNNs. Our architecture can achieve significantly higher energy efficiency than a GPU at a comparable technology node (see Figure 11). Considering the largest NTX configurations that do not require additional LiMs, we achieve an efficiency increase of 2.5× from 11.8 Gop/s W to 29.9 Gop/s W in 28 nm, and an increase of 2.7× from 20.4 Gop/s W to 54.9 Gop/s W in 14 nm.

5.7 Deployed Silicon

The key benefit of our architecture is that it leverages existing unused silicon area. This incurs almost no additional cost, since we assume the HMCs to be already present in the system as main memory of the CPU, and the manufacturing cost of the spare silicon area is the same regardless of whether it is being used. This allows us to deploy up to 32 processing clusters in 28 nm and 64 processing clusters in 14 nm with no additional silicon needed. Figure 12 compares the Gop/s of compute performance per deployed amount of silicon for NTX and GPUs. Our solution requires 4.4× less area to achieve the same compute performance as a GPU. Even more so when one considers that the chosen 32 and 64 cluster configurations can fit into the aforementioned unused silicon, their cost is virtually zero. This sets our solution apart from ScaleDeep, DaDianNao, and GPUs, which require significant silicon overhead.


Table 4
Comparison between different configurations of the architecture proposed in this work, related custom accelerators, and GPUs. The energy efficiencies reported are with respect to training different DNNs and an LSTM.

Characteristics:

Platform              Logic [nm]  DRAM [nm]  Area [mm²]  Peak Top/s  LiM  Freq. [GHz]  Arithmetic
This Work
  NTX (16×)              28          50         10.5        0.589     0      2.30         (a)
  NTX (32×)              28          50         20.7        0.870     0      1.70         (a)
  NTX (64×)              28          50         41.0        1.331     1      1.30         (a)
  NTX (16×)              14          30          4.2        0.788     0      3.08         (a)
  NTX (32×)              14          30          8.3        1.219     0      2.24         (a)
  NTX (64×)              14          30         16.4        1.720     0      1.68         (a)
  NTX (128×)             14          30         32.8        2.007     1      0.98         (a)
  NTX (256×)             14          30         65.6        2.294     2      0.56         (a)
  NTX (512×)             14          30        131.2        2.294     3      0.28         (a)
Custom Accelerators
  NS (16×) [10]          28          50          9.3        0.256     —      1.0          (a)
  DaDianNao [42]         28          28         67.7        2.09      —      0.6          (b)
  ScaleDeep [4]          14          —           —          680       —      0.6          (c)
GPUs
  Tesla K80 †            28          40×        561         8.74      —      0.59         (a)
  Tesla M40 †            28          30×        601         7.00      —      1.11         (a)
  Titan X ‡              28          30×        601         7.00      —      1.08         (a)
  Tesla P100 †           16          21◦        610        10.6       —      1.3          (a)
  GTX 1080 Ti ‡          16          20×        471        11.3       —      1.58         (a)

Energy Efficiency [Gop/s W]:

Platform            AlexNet [11]  GoogLeNet [1]  Incep. v3 [14]  ResNet 34 [13]  ResNet 50 [13]  ResNet 152 [13]  Geom. Mean  LSTM §
This Work
  NTX (16×, 28 nm)      23.6          24.1            21.6            21.3            23.5            19.7            22.3       29.9
  NTX (32×, 28 nm)      31.6          32.3            28.9            28.5            31.4            26.3            29.9       40.0
  NTX (64×, 28 nm)      40.8          41.7            37.3            36.8            40.6            34.0            38.6       51.6
  NTX (16×, 14 nm)      34.6          35.4            31.6            31.2            34.4            28.8            32.8       43.8
  NTX (32×, 14 nm)      45.6          46.7            41.7            41.2            45.4            38.0            43.2       61.3
  NTX (64×, 14 nm)      58.0          59.3            53.0            52.3            57.7            48.3            54.9       73.1
  NTX (128×)            69.5          71.0            63.4            62.6            69.1            57.9            65.8       90.1
  NTX (256×)            78.6          80.4            71.8            70.9            78.2            65.5            74.4      111.2
  NTX (512×)            82.9          84.8            75.7            74.8            82.5            69.1            78.5      116.9
Custom Accelerators
  NS (16×) [10]         10.2          15.1            14.6            13.1            12.9            14.2            13.0        —
  DaDianNao [42]          —             —               —               —               —               —            130.9*       —
  ScaleDeep [4]        366.2         350.3              —             581.1             —               —            420.9        —
GPUs
  Tesla K80 †             —            4.5             3.5              —              3.7             8.8             4.7        —
  Tesla M40 †             —           11.3              —               —               —               —             11.3       15.6
  Titan X ‡             12.8           9.9              —             17.6             8.5            12.2            11.8        —
  Tesla P100 †            —           19.8            19.5              —             18.6            24.18           20.4        —
  GTX 1080 Ti ‡         20.1          16.6              —             27.6            13.4            19.56           18.9        —

† Inception/ResNet: batch size 64 with TensorFlow/cuDNN 5.1 [52]; GoogLeNet: batch size 128 with Torch/cuDNN 5.1 [53].
‡ All nets: batch size 16 with Torch/cuDNN 5.1 [54].
§ LSTM with 512 inputs and hidden states, batch size 32 for NTX and 64 for GPU [20].
× GDDR5 and GDDR5X, process node estimated based on GPU release year.  ◦ HBM2 [51].  * Fixed-point implementation.
Arithmetic: (a) floating-point 32 bit, (b) fixed-point 16/32 bit, (c) floating-point 16/32 bit.



Figure 12. Comparison of the Gop/s of compute performance per deployed area of silicon, for GPUs, NS [10], and the largest NTX configurations that do not require additional LiMs. NTX 32 in 28 nm achieves a 2.7x increase, and NTX 64 in 14 nm a 4.4x increase in area efficiency over GPUs in similar technology nodes.

Our solution requires 4.4× less area to achieve the same compute performance as a GPU, and even more so when one considers that the chosen 32 and 64 cluster configurations fit into the aforementioned unused silicon, making their cost virtually zero. This sets our solution apart from ScaleDeep, DaDianNao, and GPUs, all of which require significant additional silicon.

5.8 Savings at Data Center Scale

Computing at data center scale incurs a significant energy and cost overhead over the raw hardware's power consumption, among other factors due to the required air conditioning and cooling. A standard measure for this overhead is the Power Usage Effectiveness (PUE) [56], the ratio of the power consumed by a data center to the power consumed solely by its compute units:

η_pue = P_total / P_compute

Data centers have been reported to have η_pue = 1.82 [57], [58], and more recently as low as 1.12 [59]. The figure depends heavily on the local climate, and usually only the winter months' numbers are published. We assume an average η_pue = 1.2. We consider an NVIDIA DGX-1 server with two Intel Xeon CPUs and eight Tesla P100 cards. One such unit consumes 3.2 kW of power, 2.4 kW of which are due to the GPUs [60]. We assume DDR4 DRAM to consume 6 W per 16 GB of storage under full load [61]. We investigate two different approaches to replacing the GPUs of the system with NTX-augmented HMCs.

5.8.1 Same Peak Compute

The P100 cards achieve a combined peak compute of 84.8 Tflop/s. Figure 13 shows the number of HMCs required to match this performance, and the achievable energy savings, for different NTX configurations per cube. The 43 HMCs with NTX 128 required to achieve the same compute power consume only 860 W, saving 2.4 kW of GPU power and an additional 128 W of DRAM power, for an overall reduction of 2.1×. With a PUE of 1.2 this translates to 1868 W of saved power, which at an energy price of 0.1104 $/kWh [62] amounts to 1808 $ per year and server.
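The figures above follow from a short back-of-the-envelope calculation, sketched below in Python. The per-cube power of roughly 20 W is an assumption derived from the 860 W quoted for 43 cubes; the saved wall power of 1868 W is taken from the text.

import math

PUE = 1.2                  # assumed average power usage effectiveness
PRICE_PER_KWH = 0.1104     # $/kWh [62]
HOURS_PER_YEAR = 24 * 365

target_tflops = 84.8       # combined peak of the eight P100 cards
tflops_per_hmc = 2.007     # NTX 128 configuration (Table 4)
watts_per_hmc = 860 / 43   # assumption: roughly 20 W per NTX-augmented cube

hmcs = math.ceil(target_tflops / tflops_per_hmc)   # -> 43 cubes
ntx_power = hmcs * watts_per_hmc                   # -> 860 W
saved_wall_power = 1868                            # W, figure quoted above (incl. PUE)
savings = saved_wall_power / 1000 * HOURS_PER_YEAR * PRICE_PER_KWH
# Prints roughly the 1808 $ quoted above; the small difference is rounding.
print(f"{hmcs} HMCs drawing {ntx_power:.0f} W, saving ${savings:.0f} per year and server")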

Figure 13. Number of HMCs required to meet a compute capability of 84.8 Tflop/s for different numbers of NTX clusters per HMC, and the corresponding power savings over 8 GPUs achieving the same compute capability.


Figure 14. Number of HMCs that can be deployed with a power budget of 2.4 kW for different numbers of NTX clusters per HMC, and the corresponding compute capability.

5.8.2 Same Thermal Design Power

Figure 14 shows the number of HMCs that can be deployed within the 2.4 kW GPU power budget of the DGX-1. 129 HMCs with NTX 128 are capable of achieving a total compute capability of 258.9 Tflop/s, a 3.1× improvement over the eight P100 cards.
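As a quick consistency check of the quoted capability and speedup (Python; both inputs are figures stated above):

hmcs, tflops_per_hmc = 129, 2.007     # NTX 128 cubes within the 2.4 kW budget
capability = hmcs * tflops_per_hmc    # ~258.9 Tflop/s
speedup = capability / 84.8           # ~3.1x over the eight P100 cards
print(f"{capability:.1f} Tflop/s, {speedup:.1f}x")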

6 RELATED WORK

Acceleration of DNNs, in particular of the forward pass, is a well-researched field with a rich literature. Goodfellow et al. [3] provide good coverage of the mathematical background of Deep Learning. An overview of techniques for efficient DNN inference and the involved challenges can be found in [2], [63], [64].

6.1 Accelerators for Inference

There exists an abundance of architectures for CNN inference [65], [66], [67], [68], [69], [70], [71], [72], [73], [74], [75], [76], [77], [78], [79]. Some are FPGA-based and usually report an energy efficiency below 10 Gop/s W [67], [68], [69], [70], [71], [72]. For example, a 16 bit fixed-point design synthesized for a Xilinx Virtex7 with the Caffeine framework [69] achieves 8.5 Gop/s W. ASIC-based accelerators [74], [75], [76], [77], [78], [79] provide higher efficiency in the range of several hundred to a few thousand Gop/s W. For example, the recent Tensor Processing Unit by Google [80] provides a maximum throughput of 92 Top/s (8 bit operations) in 28 nm at a power of 40 W per die, or 96 W including DRAM. Several approaches have been investigated to improve the performance of inference accelerators. For example, approximations such as heavy quantization and/or lossy network compression techniques are investigated in


more detail in [22], [73], [77], [81]. Sparsity of the activations is another angle that has recently been investigated in [82]. PIM architectures for inference that leverage the LoB of an HMC to deploy compute units close to the memory are investigated in [49] and [10]. Gao et al. [49] deploy several 2D accelerator arrays with 16 bit fixed-point arithmetic (similar to Eyeriss [75]) on the LoB and achieve a core energy efficiency in the order of 450 Gop/s W for the processing elements (excluding DRAM). Azarkhish et al. [10] present a flexible PIM architecture based on a clustered many-core platform, augmented with 32 bit FP streaming accelerators that reach an energy efficiency of up to 22.5 Gflop/s W (including DRAM). Although preliminary estimations for training are provided, the architecture lacks support for some operations encountered during the backward pass, which adversely impacts its efficiency when used for training. Several custom accelerator architectures [76], [78], [83] have been proposed for large convolutional kernels. Unfortunately, when operating on modern networks that consist of small convolutions on the order of 3×3 and below, these accelerators cannot work as efficiently, reducing their viability. Furthermore, several accelerators assume that the network parameters can be stored entirely on chip [79], [84]. This assumption does not hold for state-of-the-art networks, whose parameter data amounts to several tens to hundreds of megabytes [1], [11], [12], [13], [14], hence providing a compelling reason to push for PIM solutions. This observation becomes even more pronounced when considering the training of such networks, during which the data cannot be quantized and compressed as aggressively as during inference.

6.2 Accelerators for Training

Turning towards custom accelerators that support both the forward and backward pass, we observe that far fewer architectures have been proposed so far [4], [42], [83]. The Neurocube [83] is a PIM architecture employing a data-driven compute model in combination with finite state machines near the HMC vault controllers that generate the addresses for the currently processed CNN. The study only considers small networks with up to 6 layers, uses only 16 bit fixed-point arithmetic, and provides a lower energy efficiency (10 Gop/s W) than our architecture (25.3 Gflop/s W). DaDianNao [42], a multi-node system for neural networks, is an offspring of the DianNao inference architectures [79], [84]; each compute node consists of a 28 nm chip that includes 36 MB of eDRAM in order to keep the network weights on-chip. The chip provides 5.6 Top/s at a power consumption of around 16 W, resulting in an efficiency of 350 Gop/s W for 16 bit fixed-point arithmetic. The only custom architecture providing 16 bit and 32 bit FP support is ScaleDeep [4], a scalable multi-node architecture that leverages the heterogeneity in computations and employs chips that are assembled from memory-heavy and compute-heavy tiles. The network state is distributed across several chips and nodes in order to reduce data movement. Thanks to these architectural features, their energy efficiency estimates are very high, around 332 Gflop/s W in a 14 nm technology. While these numbers are impressive for a 32 bit FP architecture, it remains unclear where the training data of the large datasets


typically used in deep learning is stored, since the architecture does not have any DRAM storage. Our architecture, on the other hand, provides ample room for large datasets when scaled to multiple HMCs, since each HMC contains one gigabyte of DRAM in our setting. The architecture draws inspiration from the PIM architecture for inference proposed in [10], and introduces the NTX streaming accelerator with an efficient offloading mechanism and a rich set of streaming operations to support training of DNNs (see Section 5 for a comparison). Note that GPUs are currently the only platform used for energy-efficient training in production.

6.3 GPUs

GPUs can be seen as the main workhorse of Deep Learning and are commonly used for both inference and training due to their flexibility. Recent implementations on the GTX 780 and GTX Titan (both featuring a Kepler microarchitecture) reach 1650 Gflop/s at 250 W and 999 Gflop/s at 240 W, which corresponds to 6.6 and 4.2 Gflop/s W, respectively [10], [85]. Embedded GPUs like the Tegra K1 (Kepler) and X1 (Maxwell) have lower absolute throughput, but reach similar energy efficiencies of around 7 and 8.5 Gflop/s W, respectively [85], [86]. The Maxwell and Kepler generation GPUs were superseded by the Pascal generation, whose high-end data center models offer several new features that are beneficial for Deep Learning: stacked DRAM (HBM2), native support for 16 bit floating-point operations, and NVLink [87]. Compared to previous generations, the Tesla P100 has a 2× higher 32 bit floating-point throughput of 10.6 Tflop/s [87], and achieves a significantly higher energy efficiency in the order of 20 Gflop/s W [52], [54]. Finally, the recently announced Volta generation introduces a new compute element termed Tensor Core, able to perform a fused 16 bit FP 4×4 matrix-matrix multiply-add operation (with 16 bit or 32 bit FP output) in one cycle [88], [89]. Equipped with up to 640 such cores, the Volta generation promises a 5× increase in Deep Learning performance compared to the prior Pascal generation [90].

7 FUTURE WORK

7.1 Mesh of HMCs

Our architecture is inherently scalable, since the HMC standard allows memory cubes to be interconnected via their serial links [28]. This allows the HMCs to be arranged in a mesh, and the RISC-V processing cores can communicate via these links through the main interconnect. Parallel formulations of SGD [91], [92] will likely be necessary to scale the solution beyond a few cubes.
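One possible parallel formulation is synchronous data-parallel SGD, where each cube computes gradients on a local shard of the dataset and a single update is applied to the averaged gradient. The Python/NumPy sketch below is a generic illustration of this idea only; it is not part of the presented architecture, and the communication over the serial links is abstracted into a single averaging step.

import numpy as np

def parallel_sgd_step(weights, shards, grad_fn, lr=0.01):
    # Each cube computes a gradient on its local shard; the gradients are
    # averaged (an all-reduce over the serial links) and one update is
    # applied to the replicated weights.
    grads = [grad_fn(weights, shard) for shard in shards]
    return weights - lr * np.mean(grads, axis=0)

# Toy usage: least-squares gradients on four random shards (one per cube).
rng = np.random.default_rng(0)
w = np.zeros(8)
shards = [(rng.normal(size=(32, 8)), rng.normal(size=32)) for _ in range(4)]
grad = lambda w, s: 2.0 / len(s[1]) * s[0].T @ (s[0] @ w - s[1])
for _ in range(100):
    w = parallel_sgd_step(w, shards, grad)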

7.2 High Bandwidth Memory

The power consumption of the DRAM in our architecture is significant. By moving to a more modern 3D-stacked memory technology such as HBM, energy efficiency can be improved. The wide interface of HBM brings new challenges and design constraints that require detailed analysis.

8 CONCLUSION

We have presented NTX, a streaming floating-point co-processor with a decisive focus on training DNNs. Its data path is built around a fast fused accumulator with full 32 bit precision, which gives it a key advantage over architectures based on fixed-point arithmetic or lower floating-point precision. The co-processor is capable of generating three independent address streams from five nested hardware loops, allowing it to traverse structures with up to five dimensions in memory independently. A rich set of arithmetic and logic commands allows it to perform the reductions and matrix/vector operations commonly found in the forward pass, as well as the threshold, mask, and scatter operations encountered during the backward pass. We combine eight such co-processors with memory, a control processor, and a DMA unit into a cluster. An efficient offloading scheme frees up resources on the control processor to exert fine-grained control over data movement. The data therefore does not need to be laid out in memory in a specific, pre-tiled pattern, but can be operated on directly in its canonical and dense form. Integrated into the LoB of an HMC, multiple clusters can exploit the high bandwidth and low access latency of the DRAM in this near-memory setting, leading to an energy efficiency that is 2.6× higher than contemporary GPU implementations, while requiring 4.4× less silicon to be deployed. Furthermore, configurations which fit into the unused area on the LoB incur virtually zero additional manufacturing cost. Compared to a contemporary data center solution, NTX can provide the same compute capability at 2.1× less power, or 3.1× more compute capability at the same power.
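To illustrate the address generation summarized above, the Python sketch below models a single address stream driven by a nested hardware loop with up to five levels; the interface and parameter names are illustrative and do not correspond to the actual NTX command set.

from itertools import product

def hwloop_addresses(base, bounds, strides):
    # One NTX-style address stream: a loop nest with per-level iteration
    # counts `bounds` and byte strides `strides` emits
    # base + sum(i_k * stride_k) for every index tuple, so a dense structure
    # with up to five dimensions can be traversed without per-element address
    # computation on the control processor.
    assert len(bounds) == len(strides) <= 5
    for idx in product(*(range(n) for n in bounds)):
        yield base + sum(i * s for i, s in zip(idx, strides))

# Example: slide a 3x3 window over a 16x16 float32 feature map (row major),
# as the inner loops of a convolution would; the real co-processor runs three
# such streams concurrently (two operands, one result).
row_bytes = 16 * 4
addrs = list(hwloop_addresses(base=0x1000,
                              bounds=(14, 14, 3, 3),
                              strides=(row_bytes, 4, row_bytes, 4)))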

REFERENCES

[1]
C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in CVPR, 2015. [2] V. Sze, Y.-H. Chen, T.-J. Yang, and J. Emer, “Efficient processing of deep neural networks: A tutorial and survey,” arXiv:1703.09039, 2017. [3] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org. [4] S. Venkataramani, A. Ranjan, S. Banerjee, D. Das, S. Avancha, A. Jagannathan, A. Durg, D. Nagaraj, B. Kaul, P. Dubey et al., “ScaleDeep: A Scalable Compute Architecture for Learning and Evaluating Deep Networks,” in ISCA, 2017, pp. 13–26. [5] B. T. Polyak, “Some methods of speeding up the convergence of iteration methods,” USSR Computational Mathematics and Mathematical Physics, vol. 4, no. 5, pp. 1–17, 1964. [6] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization and momentum in deep learning,” in ICML, 2013, pp. 1139–1147. [7] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” JMLR, vol. 12, no. Jul, pp. 2121–2159, 2011. [8] T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural networks for machine learning, vol. 4, no. 2, pp. 26–31, 2012. [9] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv:1412.6980, 2014. [10] E. Azarkhish, D. Rossi, I. Loi, and L. Benini, “Neurostream: Scalable and Energy Efficient Deep Learning with Smart Memory Cubes,” IEEE TPDS, vol. PP, no. 99, 2017. [11] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012, pp. 1097–1105. [12] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556, 2014.


[13] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016. [14] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in CVPR, 2016, pp. 2818–2826. [15] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inceptionv4, inception-resnet and the impact of residual connections on learning.” in AAAI, 2017, pp. 4278–4284. [16] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv:1603.04467, 2016. [17] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in ICME, 2014, pp. 675–678. [18] R. Collobert, K. Kavukcuoglu, and C. Farabet, “Torch,” in Workshop on Machine Learning Open Source Software, NIPS, vol. 76, 2008. [19] L. B. Rall, “Automatic differentiation: Techniques and applications,” 1981. [20] J. Appleyard, T. Kociský, and P. Blunsom, “Optimizing performance of recurrent neural networks on gpus,” arXiv:1604.01946, 2016. [21] S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber et al., “Gradient flow in recurrent nets: the difficulty of learning long-term dependencies,” 2001. [22] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv:1510.00149, 2015. [23] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in NIPS, 2015, pp. 1135– 1143. [24] J. T. Pawlowski, “Hybrid memory cube (HMC),” in Hot Chips, 2011. [25] J. Jeddeloh and B. Keeth, “Hybrid memory cube new dram architecture increases density and performance,” in VLSI Technology (VLSIT), 2012 Symposium on. IEEE, 2012, pp. 87–88. [26] M. Black, “Hybrid memory cube,” in Electronic Design Process Symposium, 2013, pp. 3–3. [27] E. Azarkhish, D. Rossi, I. Loi, and L. Benini, “Design and evaluation of a processing-in-memory architecture for the smart memory cube,” in ARCS. Springer, 2016, pp. 19–31. [28] “Hybrid Memory Cube Specification 2.1,” http://www. hybridmemorycube.org, 2015, acc.: Sept 2017. [29] E. Azarkhish, C. Pfister, D. Rossi, I. Loi, and L. Benini, “Logic-Base Interconnect Design for Near Memory Computing in the Smart Memory Cube,” TVLSI, vol. 25, no. 1, pp. 210–223, Jan 2017. [30] M. Gautschi, P. D. Schiavone, A. Traber, I. Loi, A. Pullini, D. Rossi, E. Flamand, F. K. Gürkaynak, and L. Benini, “Near-Threshold RISC-V Core With DSP Extensions for Scalable IoT Endpoint Devices,” TVLSI, 2017. [31] D. Rossi, A. Pullini, I. Loi, M. Gautschi, F. K. Gürkaynak, A. Bartolini, P. Flatresse, and L. Benini, “A 60 GOPS/W, -1.8 V to 0.9 V Body Bias ULP Cluster in 28nm UTBB FD-SOI Technology ,” Solid-State Electronics, vol. 117, pp. 170 – 184, 2016. [32] M. Peemen, R. Shi, S. Lal, B. Juurlink, B. Mesman, and H. Corporaal, “The neuro vector engine: Flexibility to improve convolutional net efficiency for wearable vision,” in DATE, 2016, pp. 1604–1609. [33] S. Liu, Z. Du, J. Tao, D. Han, T. Luo, Y. Xie, Y. Chen, and T. Chen, “Cambricon: An instruction set architecture for neural networks,” in ISCA, 2016, pp. 393–405. [34] F. de Dinechin, B. Pasca, O. Cret, and R. Tudoran, “An FPGAspecific approach to floating-point accumulation and sum-ofproducts,” in ICFPT, Dec 2008, pp. 33–40. [35] A. Waterman, Y. Lee, D. A. Patterson, and K. 
Asanovic, “The riscv instruction set manual, volume i: Base user-level isa,” EECS Department, UC Berkeley, Tech. Rep. UCB/EECS-2011-62, 2011. [36] A. S. Waterman, Design of the RISC-V instruction set architecture. University of California, Berkeley, 2016. [37] J. F. Hart, Computer Approximations. Krieger Publishing Co., 1978. [38] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in ICML, 2015, pp. 448–456. [39] J. L. Hennessy and D. A. Patterson, Computer architecture: a quantitative approach. Elsevier, 2011. [40] R. Tewell, “FD-SOI – Harnessing the Power & A Little Spelunking into PPA,” Presented at DAC’53, USA, 2016.


[41] S. Davis, “IEDM 2013 Preview,” http://electroiq.com/chipworks_ real_chips_blog/2013/12/06/iedm_2013_preview/, 2013, acc.: Sept 2017. [42] T. Luo, S. Liu, L. Li, Y. Wang, S. Zhang, T. Chen, Z. Xu, O. Temam, and Y. Chen, “DaDianNao: a neural network supercomputer,” TOC, vol. 66, no. 1, pp. 73–88, 2017. [43] U. Köster, T. Webb, X. Wang, M. Nassar, A. K. Bansal, W. Constable, O. Elibol, S. Hall, L. Hornof, A. Khosrowshahi et al., “Flexpoint: An adaptive numerical format for efficient training of deep neural networks,” in Advances in Neural Information Processing Systems, 2017, pp. 1740–1750. [44] K. Ullrich, E. Meeds, and M. Welling, “Soft weight-sharing for neural network compression,” arXiv:1702.04008, 2017. [45] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” in NIPS, 2016, pp. 2074–2082. [46] Y. Guo, A. Yao, and Y. Chen, “Dynamic network surgery for efficient dnns,” in Advances In Neural Information Processing Systems, 2016, pp. 1379–1387. [47] T. Simunic, L. Benini, A. Acquaviva, P. Glynn, and G. De Micheli, “Dynamic voltage scaling and power management for portable systems,” in Proceedings of the 38th annual Design Automation Conference. ACM, 2001, pp. 524–529. [48] Y. Eckert, N. Jayasena, and G. H. Loh, “Thermal feasibility of diestacked processing in memory,” in WoNDP, 2014. [49] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, “TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory,” in ASPLOS, 2017. [50] A. Rush, “Memory Technology and Applications,” Presentation at HotChips, https://www.hotchips.org/wp-content/uploads/ hc_archives/hc28/HC28.21-Tutorial-Epub/HC28.21.1-Next-GenMemory-Epub/HC28.21.150-Mem-Tech-Allen.Rush-AMD.v3-t16.pdf, August 2016, acc.: July 2017. [51] A. Shilov, “JEDEC Publishes HBM2 Specification as Samsung Begins Mass Production of Chips,” https://www.anandtech.com/ show/9969/jedec-publishes-hbm2-specification, Jan 2016, acc.: Sept 2017. [52] “Tensorflow Benchmarks,” https://www.tensorflow.org/ performance/benchmarks, August 2017, acc.: September 2017. [53] J. Murphy, “Deep Learning Benchmarks of NVIDIA Tesla P100 PCIe, Tesla K80, and Tesla M40 GPUs,” https://www.microway.com/hpc-tech-tips/deep-learningbenchmarks-nvidia-tesla-p100-16gb-pcie-tesla-k80-tesla-m40gpus/, Jan 2017, acc.: Jul 2017. [54] J. Johnson, “cnn-benchmarks,” https://github.com/jcjohnson/ cnn-benchmarks, acc.: September 2017, Commit hash 83d441f. [55] U. Pirzada, “Micron Will Unveil the Hybrid Memory Cube 3.0 Specification in 2016 – Significant Gains Expected Over HMC 2.0,” http://wccftech.com/micron-hybrid-memory-cube-30-specification/, 2015, acc.: Sept 2017. [56] “Information technology – Data centres – Key performance indicators – Part 2: Power usage effectiveness (PUE),” International Organization for Standardization, Geneva, CH, Standard, April 2016. [57] J. Yuventi and R. Mehdizadeh, “A critical analysis of power usage effectiveness and its use in communicating data center energy consumption,” Energy and Buildings, vol. 64, pp. 90–94, 2013. [58] M. Dayarathna, Y. Wen, and R. Fan, “Data center energy consumption modeling: A survey,” IEEE Communications Surveys & Tutorials, vol. 18, no. 1, pp. 732–794, 2016. [59] J. Gao and R. Jamidar, “Machine learning applications for data center optimization,” Google White Paper, 2014. 
[60] NVidia, “NVIDIA DGX-1 Deep Learning System,” https://images.nvidia.com/content/technologies/deeplearning/pdf/61681-DB2-Launch-Datasheet-Deep-LearningLetter-WEB.pdf, 2016, acc.: November 2017. [61] C. Angelini, “Measuring ddr4 power consumption,” http://www.tomshardware.com/reviews/intel-core-i7-5960xhaswell-e-cpu,3918-13.html, August 2014, accessed Oct 2017. [62] U.S. Energy Information Administration, “Average price of electricity to ultimate customers by end-use sector,” https://www.eia.gov/electricity/monthly/epm_table_grapher. php?t=epmt_5_6_a, August 2017, accessed Oct 2017. [63] A. Canziani, A. Paszke, and E. Culurciello, “An analysis of deep neural network models for practical applications,” arXiv:1605.07678, 2016. [64] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in ECCV, 2014, pp. 818–833.


[65] L. Xu, D. P. Zhang, and N. Jayasena, “Scaling deep learning on multiple in-memory processors,” in WoNDP, 2015. [66] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, “Prime: A novel processing-in-memory architecture for neural network computation in reram-based main memory,” in ISCA, 2016, pp. 27–39. [67] G. Lacey, G. W. Taylor, and S. Areibi, “Deep learning on fpgas: Past, present, and future,” arXiv:1602.04283, 2016. [68] C. Wang, L. Gong, Q. Yu, X. Li, Y. Xie, and X. Zhou, “DLAU: A scalable deep learning accelerator unit on FPGA,” TCAD, vol. 36, no. 3, pp. 513–517, 2017. [69] C. Zhang, Z. Fang, P. Zhou, P. Pan, and J. Cong, “Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks,” in ICCAD, Nov 2016. [70] V. Gokhale, A. Zaidy, A. X. M. Chang, and E. Culurciello, “Snowflake: an Efficient Hardware Accelerator for Convolutional Neural Networks,” in ISCAS, 2017. [71] M. Zhu, L. Liu, C. Wang, and Y. Xie, “CNNLab: a Novel Parallel Framework for Neural Networks using GPU and FPGA-a Practical Study with Trade-off Analysis,” arXiv:1606.06234, 2016. [72] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing fpga-based accelerator design for deep convolutional neural networks,” in FPGA, 2015. [73] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “EIE: efficient inference engine on compressed deep neural network,” in ISCA, 2016. [74] K. Ovtcharov, O. Ruwase, J.-Y. Kim, J. Fowers, K. Strauss, and E. S. Chung, “Accelerating deep convolutional neural networks using specialized hardware,” Microsoft Research Whitepaper, vol. 2, no. 11, 2015. [75] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energyefficient reconfigurable accelerator for deep convolutional neural networks,” JSSC, vol. 52, no. 1, 2017. [76] Cavigelli, Lukas and Gschwend, David and Mayer, Christoph and Willi, Samuel and Muheim, Beat and Benini, Luca, “Origami: A convolutional network accelerator,” in GLSVLSI, 2015, pp. 199– 204. [77] R. Andri, L. Cavigelli, D. Rossi, and L. Benini, “YodaNN: An Architecture for Ultra-Low Power Binary-Weight CNN Acceleration,” TCAD, 2017. [78] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “A high-throughput neural network accelerator,” Micro, vol. 35, no. 3, pp. 24–32, 2015. [79] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, “ShiDianNao: Shifting vision processing closer to the sensor,” in ACM SIGARCH CAN, vol. 43, no. 3, 2015, pp. 92–104. [80] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers et al., “In-Datacenter Performance Analysis of a Tensor Processing Unit,” in ISCA, 2017, pp. 1–12. [81] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnornet: Imagenet classification using binary convolutional neural networks,” in ECCV, 2016. [82] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, “SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks,” in ISCA, 2017, pp. 27–40. [83] D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay, “Neurocube: A programmable digital neuromorphic architecture with high-density 3D memory,” in ISCA, 2016, pp. 380–392. [84] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning,” in ACM Sigplan Notices, vol. 49, no. 4, 2014, pp. 269–284. [85] L. 
Cavigelli, M. Magno, and L. Benini, “Accelerating real-time embedded scene labeling with convolutional networks,” in DAC, 2015, pp. 1–6. [86] NVidia, “GPU-Based Deep Learning Inference: A Performance and Power Analysis,” https://www.nvidia.com/content/tegra/ embedded-systems/pdf/jetson_tx1_whitepaper.pdf, 2015. [87] D. Foley and J. Danskin, “Ultra-performance pascal gpu and nvlink interconnect,” Micro, vol. 37, no. 2, 2017. [88] R. Smith, “NVIDIA Volta Unveiled: GV100 GPUP and Tesla V100 Accelerator Announced,” http://www.anandtech.com/ show/11367/nvidia-volta-unveiled-gv100-gpu-and-tesla-v100accelerator-announced, May 2017, acc.: July 2017.

[89] T. P. Morgan, “Nvidia’s Tesla Volta GPU Is The Beast Of The Datacenter,” https://www.nextplatform.com/2017/05/10/ nvidias-tesla-volta-gpu-beast-datacenter/, May 2017, acc.: July 2017. [90] NVidia, “Artificial Intelligence Architecture | NVIDIA Volta,” https://www.nvidia.com/en-us/data-center/voltagpu-architecture/, 2017, acc.: July 2017. [91] B. Recht, C. Re, S. Wright, and F. Niu, “Hogwild: A lock-free approach to parallelizing stochastic gradient descent,” in NIPS, 2011, pp. 693–701. [92] H. B. McMahan, E. Moore, D. Ramage, S. Hampson et al., “Communication-efficient learning of deep networks from decentralized data,” arXiv:1602.05629, 2016.

Fabian Schuiki received the B.Sc. and M.Sc. degree in electrical engineering from ETH Zürich, in 2014 and 2016, respectively. He is currently pursuing a Ph.D. degree with the Digital Circuits and Systems group of Luca Benini. His research interests include transprecision computing as well as near- and in-memory processing.

Michael Schaffner received the Ph.D. degree in electrical engineering from ETH Zürich, in 2017. Since 2012, he has been a Research Assistant with the Integrated Systems Laboratory and Disney Research, Zürich. His current research interests include digital signal processing, video processing, and the design of very large scale integration circuits and systems. Dr. Schaffner received the ETH Medal for his master’s thesis in 2013.

Frank K. Gürkaynak received the B.Sc. and M.Sc. degrees in electrical engineering from Istanbul Technical University, and the Ph.D. degree in electrical engineering from ETH Zürich, in 2006. He is currently a Senior Researcher with the Integrated Systems Laboratory at ETH Zürich, where his research interests include digital low-power design and cryptographic hardware.

Luca Benini received the Ph.D. degree in electrical engineering from Stanford University, Stanford, CA, USA, in 1997. He has served as the Chief Architect for the Platform2012/STHORM Project in STMicroelectronics, Grenoble, from 2009 to 2013. He has held visiting and consulting researcher positions at EPFL, IMEC, Hewlett-Packard Laboratories, and Stanford University. He is currently a Full Professor with the University of Bologna. He has published more than 700 papers in peer-reviewed international journals and conferences, four books, and several book chapters. His research interests are in energy-efficient system design and multi-core SoC design. He is also active in the area of energy-efficient smart sensors and sensor networks for biomedical and ambient intelligence applications. He is a member of the Academia Europaea. He is currently the Chair of the Digital Circuits and Systems group at ETH Zürich.