A Custom Instruction Approach for Hardware and Software Implementations of Finite Field Arithmetic over F2^163 using Gaussian Normal Bases

Marcio Juliato, Guido Araujo, Julio López and Ricardo Dahab
Institute of Computing, University of Campinas
Cidade Universitária Zeferino Vaz, P.O. Box 6176, Campinas-SP, Brazil
{marcio.juliato,guido,jlopez,rdahab}@ic.unicamp.br

Abstract

In this paper we explore the potential use of custom instructions in a reconfigurable hardware platform to accelerate arithmetic operations in the binary field F2^163 using a Gaussian normal basis representation. System-on-chip (SOC) techniques based on field-programmable gate arrays (FPGAs) are used, making it possible to run real applications on the system while accounting for all execution overheads. We are thus able to fairly compare hardware and software performance, as well as precisely determine their speedups. Using this approach, we show that a field multiplication can be accelerated over 2619 times when implemented in hardware. Moreover, using this fast field multiplier in a hardware/software approach, we accelerate point multiplication, the fundamental operation of ECC, over 116 times.

1. Introduction

Elliptic curve cryptography (ECC) is a public-key mechanism independently proposed by Victor Miller [1] and Neal Koblitz [2], which provides the same functionality as the well-known RSA [3, 5]. The security of ECC schemes is based on the elliptic curve discrete logarithm problem (ECDLP), for which the best known algorithm has fully exponential running time. This means that significantly smaller parameters can be used in ECC than in other competitive systems such as RSA and DSA; for example, a 160-bit ECC key provides a level of security equivalent to that of a 1024-bit RSA key. Consequently, ECC systems are well suited to highly constrained domains such as small sensors and smart cards, since they simultaneously provide high performance and small area.

The fundamental operation of ECC is the point multiplication (or scalar multiplication) kP, where k is a non-negative integer and P a point on an elliptic curve. This operation dominates the execution time of elliptic curve cryptographic schemes, and the most critical operation for its performance is the finite field multiplication. Thus, the speed of an ECC scheme depends on the performance of its underlying finite field arithmetic. Nowadays, the widely used 32-bit processors favor software implementations using polynomial bases, due to the easy mapping of finite field operations to 32-bit general-purpose instructions. On the other hand, little work has been done on implementing such operations in hardware using normal bases, which makes it difficult to predict whether they would offer any computational advantage over polynomial bases. Moreover, while it is easy to analyze the efficiency of a software implementation by simply running real applications on a general-purpose processor, the same is not true for hardware implementations: even with a wide range of tools available to analyze hardware timing, only the stand-alone behavior of the modules is considered, and the overheads of a real-world situation are discarded. Since the software counterparts run in a real environment where all overheads are present, we conclude that hardware implementations still lack more realistic measurement approaches.

In this paper, we evaluate finite field arithmetic operations over the NIST-recommended [15] binary field F2^163 using Gaussian Normal Bases (GNBs). Since we are working with field-programmable gate arrays (FPGAs), we built a real platform around a 32-bit general-purpose processor, which allows us to experiment with both software and hardware implementations. In this approach, the NIOS2 [16] processor incorporates the hardware implementation of the finite field operations as custom instructions. To make the comparisons as fair as possible, a set of standardized parameters was adopted, among them the processor clock, data and instruction cache sizes, SRAM memory sizes and types, and the measurement scheme.

We review previous related work in Section 2; that work mainly deals with hardware implementations using polynomial bases, whereas we adopt GNBs, which are known to yield efficient hardware implementations. In Section 3, we present the standardized platform for hardware and software comparisons, discussing which parameters may influence the fairness of the results. Once the platform is detailed, the finite field arithmetic implementation over F2^163 and some optimizations are presented in Section 4. In Section 5 we analyze the implementations in terms of execution time, speedup, and implementation area. Finally, in Section 6, we present some conclusions about our comparisons and the use of custom instructions in ECC.

0-7803-9407-0/05/$20.00 © 2005 IEEE

ICFPT 2005

2. Related Work

In contrast to software implementations using polynomial bases, few scientific papers address hardware implementations using normal bases. Extensions to the instruction set of the MIPS32 processor are proposed in [7] for fast arithmetic over Fp and F2^163 using polynomial bases, making a scalar multiplication over the binary field F2^191 run approximately six times faster than a standard software implementation. In [9], a parameterizable processor for elliptic curves over F2^m using a polynomial basis representation is proposed. Finally, a method for producing hardware designs for ECC systems over F2^m using an optimal normal basis representation is presented in [13]. Although a number of algorithms for fast GNB multiplication have been proposed in [4] and [6], only [14] presents an FPGA implementation of a processor for ECC over F2^163 using GNBs. None of these works, however, presents a hardware implementation capable of running programs written in a high-level programming language. Since finite field arithmetic hardware implementations are simulated as stand-alone modules using toolkits, a number of practical factors are not considered. As a consequence, it is difficult to determine the real speedup of such implementations over their software versions, and thus we cannot fairly compare them.

Since the performance of point multiplication [10] also depends on the underlying finite field arithmetic, we adopted a bottom-up approach: in this paper we treat only the field arithmetic operations running as custom instructions. We also analyze the advantage of the hardware/software approach, seeking a satisfactory trade-off between speedup and implementation area.

3. The Research Platform

When systems-on-chip (SOCs) are developed, it is important to consider factors such as performance, cost, and ease of integration with other cores. One should also consider the processor's ability to implement custom instructions, as well as the availability of a good compiler to generate code for the prototypes. In our case, we need a prototyping platform that allows us to investigate the use of custom instructions dedicated to elliptic curve cryptography.

Taking these constraints into account, we selected the NIOS2 processor as the basic element of our platform. This is a 32-bit processor whose instruction set can be modified to meet performance goals. In addition, it allows the smooth creation and integration of hardware modules into the processor, either as custom instructions or as custom peripherals, and it comes with a GCC compiler port. Besides the processor, the research platform consists of memory and configurable data and instruction caches. An Altera Stratix FPGA makes the platform reconfigurable. The idea behind implementing hardware modules as custom instructions is to tailor the processor to cryptographic algorithms, yielding large speedups with few modifications to already implemented programs; more precisely, it is only necessary to include a C library implementing the macro for the custom instruction call. Another important aspect is the reuse of the specialized processor in other SOC projects, which results in faster integration than redesigning processor/peripheral interfaces, thus favoring time-to-market. From the software point of view, custom instructions are called like C functions.

In this platform, a program starts executing in software, i.e., using the general-purpose instructions. When the program needs to execute a faster finite field operation, it calls a sequence of custom instructions, which makes the algorithm continue executing in hardware. Afterwards, the program's execution ends in software. This approach is much simpler than dealing with processor peripherals, which would demand the implementation of routines to manage DMA transfers and interrupts. Besides, it leads to a very simple and elegant implementation, as shown in Figure 1, where a multiplication in F2^163 is performed in hardware.

    #include "system.h"

    int main() {
        u32 a[6], b[6], c[6];
        /* SOFTWARE EXECUTION */
        ...
        /* HARDWARE EXECUTION: loading the arguments */
        MULTIPLICATION(0x10, a[0], b[0]);
        MULTIPLICATION(0x11, a[1], b[1]);
        MULTIPLICATION(0x12, a[2], b[2]);
        MULTIPLICATION(0x13, a[3], b[3]);
        MULTIPLICATION(0x14, a[4], b[4]);
        c[0] = MULTIPLICATION(0x15, a[5], b[5]);
        /* Reading the result */
        c[1] = MULTIPLICATION(0x01, 0, 0);
        c[2] = MULTIPLICATION(0x02, 0, 0);
        c[3] = MULTIPLICATION(0x03, 0, 0);
        c[4] = MULTIPLICATION(0x04, 0, 0);
        c[5] = MULTIPLICATION(0x05, 0, 0);
        /* SOFTWARE EXECUTION */
        ...
        return 0;
    }

Figure 1. Executing C = A·B in F2^163 using custom instructions

3.1. System Standardization

Since we want the comparison between hardware and software to be as fair as possible, we use the same NIOS2 system throughout this work: a six-stage pipelined processor running at 120 MHz, with dynamic branch prediction and a hardware multiplier. To decrease cache misses, which would interfere with the results by favoring the hardware implementation over the software one, we use relatively large caches: 8 KB for data and 16 KB for instructions. In addition, a 64 KB data memory internal to the FPGA stores the program data, and a 1 MB instruction memory, external to the FPGA, stores the program code.

To standardize the measurement system, all operations are wrapped in functions, whether implemented in hardware or in software. These functions are then called multiple times inside a loop. Besides the overhead introduced by the loop structure, the compiler must translate each macro call into a sequence of load and store instructions, preparing the instruction arguments and storing the result to memory. As a result, even if a custom instruction executes in one clock cycle, its invocation via a macro cannot be accomplished in a single clock cycle. A possible solution to this problem is to optimize the program at the assembly level, at the expense of losing source code portability. Still, the comparison between hardware and software remains fair, since we target real-world applications and such overheads are present in both implementations.

3.2. Custom Instructions

When working in the finite field F2^163, the operations take one or two 163-bit arguments and return one 163-bit result. General-purpose architectures usually work with 16-, 32-, or 64-bit datapaths, demanding a highly optimized software implementation to handle this bottleneck. If the word size is w bits, an m-bit element uses ceil(m/w) words. With m = 163 and the NIOS2 processor (w = 32), an element A ∈ F2^163 needs six 32-bit words. NIOS2 custom instructions have an interface with two 32-bit input arguments and one 32-bit output result, so it is important to adopt techniques for performing efficient custom instruction calls.

We therefore separate the finite field operations into two main groups. The first group comprises the so-called splittable operations, which can be executed in 32-bit slices: six calls operating on the 163-bit arguments are needed to obtain the 163-bit result. Addition is the simplest operation in this group. The second group contains the non-splittable operations, which require the entire 163-bit arguments before execution can take place.

3.2.1. Splittable Operations

Given the 163-bit operands A and B, each custom instruction call computes a 32-bit slice of the result C, so six calls are necessary to obtain the entire result. This is the case for addition, squaring, square root, trace, and the solution of quadratic equations. In most cases, internal registers are needed to store intermediate results, which are reused by the next 32-bit execution slice. Whenever possible, a custom instruction should use this approach, achieving higher speedups and lower implementation area, since only small registers are needed.

3.2.2. Non-splittable Operations

Unlike the case just discussed, some operations require their arguments to be entirely available before processing starts; this is the case for field multiplication, inversion, and some permutation functions. To overcome this problem, custom instructions must have internal registers for temporary storage of arguments, results, and intermediate data. Figure 2 represents the first step, in which the internal registers (IRA and IRB) are loaded with the operands. The last load call immediately launches the instruction execution, which may take multiple clock cycles. Upon completion, the result is written into the internal register IRC. Since that call already returns a 32-bit slice of the result (c[0]), five more read calls are needed to read the rest of the result from IRC.

Figure 2. Loading arguments into internal registers
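As a plain-software analogue of the splittable pattern just described, the 163-bit addition can be computed word by word. The sketch below (hypothetical helper name, not the paper's code) mirrors the six 32-bit slices that the addition custom instruction would execute:

```c
#include <stdint.h>

typedef uint32_t u32;

/* An element of F2^163 stored as six 32-bit words; the unused
 * rightmost bits of the last word are kept at 0. */
#define NWORDS 6

/* Splittable addition: C = A + B computed one 32-bit slice at a
 * time, exactly as six successive custom-instruction calls would. */
static void gf_add_slice(u32 c[NWORDS], const u32 a[NWORDS],
                         const u32 b[NWORDS]) {
    for (int i = 0; i < NWORDS; i++)
        c[i] = a[i] ^ b[i];   /* addition in F2^m is bitwise xor */
}
```

Each loop iteration corresponds to one custom instruction call; no internal state is needed here, since addition in characteristic 2 has no carries between slices.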

4. Finite Field Arithmetic

A normal basis for F2^m is defined as {β, β^2, β^(2^2), ..., β^(2^(m−1))}, where β ∈ F2^m. Any element A ∈ F2^m can be written as A = Σ_{i=0}^{m−1} a_i β^(2^i), where a_i ∈ {0, 1}, and can be represented as the binary string (a_0 a_1 a_2 ... a_{m−1}). Let n = ceil(m/w) and s = wn − m. In software, A may be stored in an array of n w-bit words, A = (a[0], a[1], ..., a[n−1]), where the rightmost s bits of a[n−1] are unused (set to 0). In a normal basis, Fermat's Little Theorem gives A^(2^m) = A for all A ∈ F2^m.

GNBs are normal bases of low complexity [12]. For binary fields for which there is no optimal normal basis, GNBs offer an alternative way of implementing a normal basis. The type T of a Gaussian normal basis is a positive integer measuring the complexity of the multiplication operation with respect to that basis. Let m be an odd prime and T a positive integer. Then a GNB of type T for F2^m exists if p = Tm + 1 is prime and gcd(Tm/k, m) = 1, where k is the multiplicative order of 2 modulo p. Optimal normal bases are precisely the GNBs of type T = 1 or T = 2.

To provide a broad comparison between the hardware and software approaches, we implemented the most important finite field operations as custom instructions: addition, squaring, square root, quadratic equation solution, trace, and two multiplication methods. The next sections describe these finite field operations, as well as some techniques for the efficient implementation of the corresponding custom instructions.

4.1. Addition

Given two elements A, B ∈ F2^m, the sum C = A + B is computed as the bitwise xor of the vectors A and B, i.e., c_i = a_i ⊕ b_i for i ∈ {0, ..., m−1}. The addition can therefore be split into 32-bit slices, each of them returning a 32-bit result, as shown in Figure 3.

Figure 3. Addition block diagram (inputs: word a, word b; output: word c)

4.2. Square and Square Root

Given A ∈ F2^m, the square is calculated as A^2 = (Σ_{i=0}^{m−1} a_i β^(2^i))^2 = Σ_{i=0}^{m−1} a_i β^(2^(i+1)). By Fermat's Little Theorem, A^2 = a_{m−1} β + Σ_{i=1}^{m−1} a_{i−1} β^(2^i), and therefore A^2 = (a_{m−1} a_0 a_1 a_2 ... a_{m−2}). Thus, the square of a finite field element is a simple right rotation of its vector representation. Similarly, it can be shown that the square root of an element is a left rotation of its vector representation. From Figures 4 and 5 we observe that both operations can be split into 32-bit execution slices.

Figure 4. Square block diagram (operating on words a[0], ..., a[5])

Figure 5. Square root block diagram (operating on words a[0], a[1], ..., a[5])

In the first square root call, as shown in Figure 6, the words a[0] and a[1] are sent to the custom instruction; the leftmost bit of a[0] must be kept in an internal register, to be used in the last call. In the same manner, a[1], except for its leftmost bit, must be stored in order to be used in the second call. A word number is also sent to the custom instruction to indicate which part of the argument register is being operated upon, and thus when to use the previously stored values.

    c[0] = SQUARE_ROOT(0, a[0], a[1]);
    c[1] = SQUARE_ROOT(1, a[2], 0);
    c[2] = SQUARE_ROOT(2, a[3], 0);
    c[3] = SQUARE_ROOT(3, a[4], 0);
    c[4] = SQUARE_ROOT(4, a[5], 0);
    c[5] = SQUARE_ROOT(5, 0, 0);

Figure 6. Calling the square root custom instruction

4.3. Trace and Quadratic Equation Solution

The trace of an element A ∈ F2^m, for m odd, is calculated as Tr(A) = Σ_{i=0}^{m−1} a_i. The computation of Tr(A) thus reduces to the xor of all bits of the vector A, which can be implemented very efficiently in hardware, as shown in Figure 7. To calculate the trace, one may divide the execution into 32-bit slices; it is then necessary to store the partial result in a 1-bit internal register, which is used by the next instruction call. By means of the third call, the custom instruction returns the trace of A. A solution¹ of the quadratic equation X^2 + X = A can be determined by Algorithm 1.
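As a software counterpart to the hardware trace, the xor-fold of all 163 coefficient bits can be sketched in plain C (word layout as above, with the unused bits of the last word zeroed; the helper name is hypothetical):

```c
#include <stdint.h>

typedef uint32_t u32;

/* Tr(A) over F2^163: the xor of all m = 163 coefficient bits.
 * A is stored MSB-first in six 32-bit words; only the 3 leftmost
 * bits of a[5] are used (163 = 5*32 + 3), the rest must be 0. */
static u32 gf_trace(const u32 a[6]) {
    u32 fold = 0;
    for (int i = 0; i < 6; i++)
        fold ^= a[i];          /* xor-fold all six words together */
    /* xor-fold the 32 accumulator bits down to a single bit */
    fold ^= fold >> 16;
    fold ^= fold >> 8;
    fold ^= fold >> 4;
    fold ^= fold >> 2;
    fold ^= fold >> 1;
    return fold & 1u;
}
```

The word-fold mirrors the 32-bit slicing discussed above: each `fold ^= a[i]` corresponds to one slice, and the final reduction plays the role of the 1-bit internal register.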

¹ Since m is odd, if X is a solution, then X + 1 is also a solution.
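The rotation view of squaring from Section 4.2 can also be mimicked in plain C: squaring rotates the 163-bit coefficient string one position to the right, and the square root rotates it back. The bit-at-a-time sketch below (helper names hypothetical) uses the MSB-first word layout described in Section 4 and is for clarity only; real software implementations operate on 32-bit slices:

```c
#include <stdint.h>
#include <string.h>

typedef uint32_t u32;
#define M 163   /* extension degree of F2^163 */

/* MSB-first layout: coefficient a_i is bit (31 - i%32) of word i/32. */
static int get_bit(const u32 a[6], int i) {
    return (a[i >> 5] >> (31 - (i & 31))) & 1;
}
static void set_bit(u32 a[6], int i, int v) {
    u32 mask = 1u << (31 - (i & 31));
    if (v) a[i >> 5] |= mask; else a[i >> 5] &= ~mask;
}

/* Squaring in a normal basis: A^2 = (a_{m-1} a_0 a_1 ... a_{m-2}),
 * i.e., a right rotation of the m-bit coefficient string. */
static void gf_square(u32 c[6], const u32 a[6]) {
    memset(c, 0, 6 * sizeof(u32));
    set_bit(c, 0, get_bit(a, M - 1));
    for (int i = 1; i < M; i++)
        set_bit(c, i, get_bit(a, i - 1));
}

/* Square root is the inverse rotation (one position to the left). */
static void gf_sqrt(u32 c[6], const u32 a[6]) {
    memset(c, 0, 6 * sizeof(u32));
    set_bit(c, M - 1, get_bit(a, 0));
    for (int i = 0; i < M - 1; i++)
        set_bit(c, i, get_bit(a, i + 1));
}
```

Since the two functions are inverse rotations, applying the square root to a square recovers the original element, which also makes the pair easy to sanity-check.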

Algorithm 1 Solve X^2 + X = A

INPUT: A = (a_0 a_1 ... a_{m−1}) ∈ F2^m, m odd
OUTPUT: X = (x_0 x_1 ... x_{m−1}) ∈ F2^m
1. Set x_0 ← a_0
2. For i from 1 to m − 2 do
   2.1. x_i ← a_i ⊕ x_{i−1}
3. Set x_{m−1} ← 0
4. Return(X)

Figure 7. Trace block diagram

From the hardware implementation shown in Figure 8, we see that the operation can be split by maintaining the last bit of the previous call in an internal 1-bit register. Thus, each instruction call returns a 32-bit slice of the result.

Figure 8. Q.E. solution block diagram

4.4. Multiplication

So far we have presented simple arithmetic operations, which use only basic logic operations. Finite field multiplication in normal bases, however, is a complex operation. Several algorithms have been proposed for optimal normal bases and Gaussian normal bases. In this paper, we consider two multiplication methods, both based on the multiplication proposed by NIST [15].

In [8], López introduced an algorithm for multiplication in F2^m using Gaussian normal bases, based on the observation that βB can be computed as the sum of T permutations of B, where T is the type of the GNB; for F2^163 we have T = 4. Consider the finite field elements δ_i = ββ^(2^i) for 0 ≤ i < m. Let n_i, 1 ≤ i < m, be the number of 1s in the normal representation of δ_i, i.e., δ_i = Σ_{j=1}^{n_i} β^(2^(w_ij)). Then β·B = Σ_{j=1}^{T} P_j(B), where the permutation P_j is defined as follows:

P_1(B) = (b_1, b_{w_{1,1}}, b_{w_{2,1}}, ..., b_{w_{m−1,1}}),
P_j(B) = (0, b_{w_{1,j}}, b_{w_{2,j}}, ..., b_{w_{m−1,j}}), j > 1.

In [4] it is shown how to calculate the w_ij.

This method for multiplication over F2^m is shown in Algorithm 2. Line 2.2 of the algorithm computes S_P = β·R_B (the sum of T permutations), which is called the β function. Figures 9 and 10 show the block diagrams of the hardware implementations of the β function and of López multiplication, respectively.

Algorithm 2 López's Multiplication

INPUT: A, B ∈ F2^m; w_ij, i ∈ {1, ..., m−1}, j ∈ {1, ..., T}
OUTPUT: C = AB
1. C ← 0, R_B ← B >> 1
2. For i from m − 1 downto 0 do
   2.1 C ← C >> 1
   2.2 S_P ← Σ_{j=1}^{T} P_j(R_B)
   2.3 If a_i = 1 then C ← C ⊕ S_P
   2.4 R_B ← R_B >> 1
3. Return (C)

Figure 9. Block diagram of the β function (the four permutations P1–P4 of the input word are xored together to produce S_P)

Figure 10. Block diagram of the hardware implementation for López multiplication (an initialization step loads A_0, B_0 and C_0; each multi-cycle execution step rotates A, R_B and C right and feeds R_B through the β function)

Since the finite field multiplication needs the entire A and B arguments before its execution, it is difficult to split it as was done for the previous operations. López multiplication leads to an efficient hardware implementation; in particular, we observed that it was possible to unroll up to six algorithm iterations and execute them in a single clock cycle, without changing the processor cycle time. An execution iteration consists of a β function, followed by a xor and a multiplexer.
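Algorithm 1 transcribes directly into C. The sketch below works on a small toy field (m = 7) with one coefficient per byte, purely to illustrate the recurrence; the paper's implementation instead packs the bits into 32-bit words:

```c
#include <stdint.h>

#define M 7   /* toy odd extension degree, for illustration only */

/* Algorithm 1: given a solvable A, find X with X^2 + X = A.
 * Coefficients are stored one per array entry (0 or 1). */
static void solve_quadratic(uint8_t x[M], const uint8_t a[M]) {
    x[0] = a[0];                      /* step 1 */
    for (int i = 1; i <= M - 2; i++)  /* step 2 */
        x[i] = a[i] ^ x[i - 1];
    x[M - 1] = 0;                     /* step 3 */
}
```

Why the recurrence works: in a normal basis X^2 is a right rotation of X, so X^2 + X = A reads a_i = x_{i−1} ⊕ x_i (indices mod m), and the loop simply unwinds these equations starting from x_0 = a_0.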

The custom instruction calls for the multiplication performed in hardware are shown in Figure 1, where one can also notice that each call must signal whether the internal registers are being written (0x1n) or read (0x0n), with n denoting the word number of the internal register. The sixth call dispatches the multi-cycle execution; upon completion, the result is written into the internal register IRC. Since that call already returns a 32-bit slice of the result (c[0]), five more read calls are needed to read the rest of the result from IRC.

An efficient bit-level algorithm for multiplication in F2^m can be derived from Algorithm 2. This algorithm, called modified NIST multiplication, is presented in Algorithm 3. Its execution core relies basically on four permutations and on xor and and gates; in addition, a trace must be calculated at each iteration.

Algorithm 3 Modified NIST Multiplication

INPUT: A, B ∈ F2^m; w_ij, i ∈ {1, ..., m−1}, j ∈ {1, ..., T}
OUTPUT: C = AB
1. C ← 0
2. For i from 1 to m − 1 do
   2.1 P_B ← Σ_{j=1}^{T} P_j(B)
   2.2 c_i ← Trace(A · P_B)
   2.3 A ← A << 1, B ← B << 1
3. Return (C)

5. Results

Both the hardware and software implementations were analyzed using a NIOS2 development board based on an Altera Stratix EP1S10F780C6ES FPGA. All test programs were highly optimized and compiled with the GCC compiler (version 3.4.1) ported to the NIOS2 processor; the -O2 optimization level produced, in most cases, the fastest results. The Quartus2 software (version 4.2) was used to synthesize the NIOS2 systems and to program the FPGA.

All tests were performed using only one custom instruction per processor, in order to precisely determine the real speedup of each implementation. Although instruction selection is not the focus of this work, such experiments can easily guide that process, since one must take into account the implementation area of the custom hardware, its generality, its speedup, and the speedup/area ratio. Due to the custom instruction approach, however, some constraints must be considered (e.g., cycle time, argument passing to the custom instruction, the custom instruction interface, etc.). Such implementation details may reduce the performance of the operations and/or increase their implementation area when compared to a stand-alone implementation.

5.1. Performance Analysis

In this section we compare the operations described previously, whose performances are summarized in Table 1. Since the finite field addition is a 32-bit xor function, and the NIOS2 processor already provides such a logical operation, the addition did not yield any valuable speedup. We also observe that the square, the square root and the trace already have good software implementations, since their hardware implementations do not lead to a good speedup. Leaving the multiplication algorithms aside, Table 2 shows that the β function results in a very good speedup, which is explained by the very efficient way the permutations can be implemented in hardware, consisting basically of a set of wires and a xor function. In software, the 163-bit permutations are relatively slow, since we are restricted to a sequence of shifts, ands, ors and xors. Finally, the module that computes the solution of a quadratic equation also leads to a good hardware implementation, since it executes almost five times faster than its software version.

Focusing now on the multiplication algorithms, we started by adopting a hardware/software approach for López multiplication: the multiplication executes in software, with only the β function performed in hardware. From Tables 1 and 2 we realize that, despite the great speedup (53.4)
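To make the control structure of Algorithm 3 concrete, here is a C skeleton over a toy m = 7, T = 4 setup. The permutations below are stand-in cyclic shifts, not a real GNB w_ij table (which [4] shows how to derive), so the routine illustrates only the data flow — sum of permutations, trace of a bitwise product, rotations — and is not a correct field multiplier:

```c
#include <stdint.h>

#define M 7   /* toy degree, for structure only */
#define T 4   /* GNB type assumed in the paper for F2^163 */

/* Stand-in permutation: cyclic left shift by j (NOT a real P_j). */
static void perm(uint8_t out[M], const uint8_t b[M], int j) {
    for (int i = 0; i < M; i++)
        out[i] = b[(i + j) % M];
}

static void rotl1(uint8_t a[M]) {
    uint8_t t = a[0];
    for (int i = 0; i < M - 1; i++) a[i] = a[i + 1];
    a[M - 1] = t;
}

/* Algorithm 3 skeleton: c_i = Tr(A . P_B), then rotate A and B. */
static void nist_mult_skeleton(uint8_t c[M], const uint8_t a0[M],
                               const uint8_t b0[M]) {
    uint8_t a[M], b[M], pb[M], p[M];
    for (int i = 0; i < M; i++) { a[i] = a0[i]; b[i] = b0[i]; c[i] = 0; }
    for (int i = 1; i < M; i++) {
        /* P_B <- sum (xor) of the T permutations of B */
        for (int k = 0; k < M; k++) pb[k] = 0;
        for (int j = 1; j <= T; j++) {
            perm(p, b, j);
            for (int k = 0; k < M; k++) pb[k] ^= p[k];
        }
        /* c_i <- Trace(A . P_B): parity of the bitwise AND */
        uint8_t tr = 0;
        for (int k = 0; k < M; k++) tr ^= (uint8_t)(a[k] & pb[k]);
        c[i] = tr;
        rotl1(a);   /* A <- A << 1, B <- B << 1 (cyclic rotations) */
        rotl1(b);
    }
}
```

Even with placeholder permutations, the skeleton exhibits the structural properties of the algorithm, e.g., a zero operand always yields a zero product.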