Efficient implementation of elliptic curves on sensor nodes

Efficient implementation of elliptic curves on sensor nodes

Diego F. Aranha, Julio López, Leonardo Oliveira, Ricardo Dahab

Institute of Computing - UNICAMP Supported by FAPESP, Grant No. 2007/06950-0.

Aranha, López, Oliveira, Dahab

Efficient implementation of ECC on sensor nodes

Wireless Sensor Networks

A wireless sensor network (WSN) is an ad hoc network composed of sensing devices employed for cooperative monitoring tasks.

[Figure: sensor nodes communicating through a gateway.]

The problem

Challenge: Because the nodes must be cheap and disposable, they are resource-constrained, and protecting the communication between them is hard.

Contributions:
- Efficient implementation of arithmetic in $\mathbb{F}_{2^{163}}$ and $\mathbb{F}_{2^{233}}$;
- Efficient implementation of elliptic curve cryptography.

The platform

MICAz Mote:
- ATMega128 processor, 7.3728 MHz clock frequency;
- 4 KB of RAM, 128 KB of ROM;
- Simple two-stage pipeline;
- Limited shift instructions;
- High cost of memory instructions (addressing, reads, writes).

Programmer's arsenal

The ATMega128 is a typical RISC processor:
- 32 registers, but 6 of them are reserved for pointers;
- 1 register for memory/arithmetic temporary values;
- Only 32 - 6 - 1 = 25 useful registers.

Relevant instructions:

| Instruction | Description | Cost |
|---|---|---|
| lsr, lsl | Right/left shift by 1 bit | 1 cycle |
| swap | Swap high and low nibbles | 1 cycle |
| bld, bst | Bit load/store from/to flag | 1 cycle |
| eor | Bitwise exclusive OR | 1 cycle |
| ld, st | Memory load/store | 2 cycles |
| adiw, sbiw | Pointer arithmetic | 2 cycles |

Related work

Table: Published timings for a point multiplication on a MICAz Mote at the 160-bit security level.

| Finite field | Work | Execution time (s) |
|---|---|---|
| Binary | [Malan et al. 2004] | 34 |
| Binary | [Yan and Shi 2006] | 13.9 |
| Binary | [Eberle et al. 2005] | 4.14 |
| Binary | [Szczechowiak et al. 2008] | 2.16 |
| Binary | [Seo et al. 2008] | 1.14 |
| Binary | [Kargl et al. 2008] | 0.83 |
| Prime | [Wang and Li 2006] | 1.35 |
| Prime | [Szczechowiak et al. 2008] | 1.27 |
| Prime | [Gura et al. 2004] | 0.87 |
| Prime | [Uhsadel et al. 2007] | 0.76 |
| Prime | [Großschädl 2006] | 0.745 |

According to previous works:
- Binary fields are insufficiently supported;
- Binary curves would lead to lower performance;
- Architectural extensions are heavily needed.

Binary elliptic curves

A binary elliptic curve is the set of solutions $(x, y) \in \mathbb{F}_{2^m} \times \mathbb{F}_{2^m}$ satisfying the equation

$$y^2 + xy = x^3 + ax^2 + b,$$

where $a, b \in \mathbb{F}_{2^m}$ with $b \neq 0$, together with a point at infinity $\infty$. When $a \in \{0, 1\}$ and $b = 1$, the curve is called a Koblitz curve.

Elliptic curves

The set of points $\{(x, y) \in E(\mathbb{F}_{2^m})\} \cup \{\infty\}$ under the addition operation $+$ (chord-and-tangent rule) forms an additive group. Given a point $P$ on the curve and an integer $k$, the operation $kP$, called scalar multiplication, is defined by

$$kP = \underbrace{P + P + \cdots + P}_{k \text{ times}}.$$

This is the fundamental operation employed by protocols based on elliptic curves.

Binary field $\mathbb{F}_{2^m}$

Irreducible polynomial: $f(z)$ (trinomial or pentanomial).

Polynomial basis: $a(z) = \sum_{i=0}^{m-1} a_i z^i \in \mathbb{F}_{2^m}$.

Software representation: vector of $n = \lceil m/8 \rceil$ bytes.

[Figure: graphical representation of a field element as a vector of bytes.]

Addition in $\mathbb{F}_{2^m}$

$$c(z) = a(z) + b(z) = \sum_{i=0}^{n-1} (A_i \oplus B_i) z^{8i},$$

where $A_i$ and $B_i$ are the $i$-th bytes of $a$ and $b$.
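The bytewise XOR above can be sketched in a few lines of Python. This is only an illustration under assumed conventions: the function name `gf2m_add` and the representation as little-endian `bytes` objects are not taken from the slides.

```python
def gf2m_add(a: bytes, b: bytes) -> bytes:
    """Add two field elements stored as equal-length byte vectors.

    Addition in a binary field is coefficient-wise XOR, so no carries
    propagate between bytes and no modular reduction is needed.
    """
    if len(a) != len(b):
        raise ValueError("operands must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


# Example with n = 21 bytes, the vector length for m = 163.
a = bytes(range(21))
b = bytes([0xFF] * 21)
c = gf2m_add(a, b)
```

Since every element is its own additive inverse, `gf2m_add(c, b)` recovers `a`.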


Squaring in $\mathbb{F}_{2^m}$

$$a(z)^2 = \sum_{i=0}^{m-1} a_i z^{2i} = a_{m-1} z^{2m-2} + \cdots + a_2 z^4 + a_1 z^2 + a_0$$

Squaring is a simple expansion of the coefficients of $a$: a zero bit is inserted between consecutive coefficients. Example: $a(z) = z^4 + z^3 + 1 = 11001$ gives $a(z)^2 = z^8 + z^6 + 1 = 101000001$.

We can accelerate this algorithm with a lookup table. For each 4-bit $u = (u_3, u_2, u_1, u_0)$, compute $T(u) = (0, u_3, 0, u_2, 0, u_1, 0, u_0)$.
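A minimal Python sketch of table-based squaring follows. The names `SQR_TABLE` and `gf2_square` and the little-endian byte layout are assumptions made for illustration; the sketch performs the expansion only, without modular reduction.

```python
# SQR_TABLE[u] spreads the four bits of u so that a zero sits between
# consecutive bits: (u3, u2, u1, u0) -> (0, u3, 0, u2, 0, u1, 0, u0).
SQR_TABLE = [
    sum(((u >> i) & 1) << (2 * i) for i in range(4)) for u in range(16)
]


def gf2_square(a: bytes) -> bytes:
    """Square a binary polynomial stored as little-endian bytes.

    Each input byte expands to two output bytes: one table lookup for
    the low nibble and one for the high nibble.
    """
    out = bytearray(2 * len(a))
    for i, byte in enumerate(a):
        out[2 * i] = SQR_TABLE[byte & 0x0F]
        out[2 * i + 1] = SQR_TABLE[byte >> 4]
    return bytes(out)
```

For the example $a(z) = 11001$, `gf2_square(bytes([0b11001]))` yields the two bytes `0x41, 0x01`, which read as the bit string 101000001.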

[Figure: table-based squaring, $c(z) = a(z)^2$.]

Modular squaring in $\mathbb{F}_{2^m}$

$c(z) = a(z)^2 \bmod f(z)$

Problem: Redundant memory accesses between squaring and modular reduction.

Our solution: Integrate squaring and modular reduction.

Integrated modular squaring in $\mathbb{F}_{2^m}$

Problem: Too many additions.

Proposed optimization for modular squaring

Our solution: Precompute sparse contributions.

Multiplication in $\mathbb{F}_{2^m}$

Two strategies:
- Karatsuba multiplication;
- López-Dahab multiplication.

Karatsuba multiplication in $\mathbb{F}_{2^m}$

$$c(z) = a(z) \cdot b(z) = a_1 b_1 z^m + [(a_1 + a_0)(b_1 + b_0) + a_1 b_1 + a_0 b_0] z^{m/2} + a_0 b_0$$
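One level of the Karatsuba formula can be sketched in Python as follows. Polynomials are held in plain integers (bit $i$ is the coefficient of $z^i$), and the helper names `clmul` and `karatsuba` are hypothetical. The split point is `bits // 2`, so the top term is shifted by twice the split point, which equals $m$ when $m$ is even.

```python
def clmul(a: int, b: int) -> int:
    """Schoolbook carry-less product of two binary polynomials."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def karatsuba(a: int, b: int, bits: int) -> int:
    """One Karatsuba level: three half-size products instead of four."""
    h = bits // 2
    mask = (1 << h) - 1
    a0, a1 = a & mask, a >> h          # a = a1 * z^h + a0
    b0, b1 = b & mask, b >> h          # b = b1 * z^h + b0
    lo = clmul(a0, b0)                 # a0 * b0
    hi = clmul(a1, b1)                 # a1 * b1
    mid = clmul(a0 ^ a1, b0 ^ b1) ^ lo ^ hi
    return (hi << (2 * h)) ^ (mid << h) ^ lo
```

In a real implementation the three half-size products would themselves be computed by a fast method rather than by the schoolbook `clmul` used here for brevity.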


Multiplication in $\mathbb{F}_{2^m}$

$$c(z) = a(z) \cdot b(z) = (\ldots(a_{m-1} b(z) z + a_{m-2} b(z)) z + \ldots + a_1 b(z)) z + a_0 b(z)$$

Example: $(z^3 + 1) \cdot (z^3 + z + 1) = 1001 \cdot 1011 = ((1011 \cdot z + 0) z + 0) z + 1011$

       1011
     × 1001
     ------
       1011
      0000
     0000
    1011
    -------
    1010011
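The nested form above is Horner's rule applied to the bits of $a(z)$. A small Python sketch, assuming polynomials stored in integers and a hypothetical function name `shift_and_add_mul`:

```python
def shift_and_add_mul(a: int, b: int, m: int) -> int:
    """Multiply binary polynomials a and b, scanning the m bits of a
    from the most significant coefficient down to a_0."""
    c = 0
    for i in range(m - 1, -1, -1):
        c <<= 1              # multiply the accumulator by z
        if (a >> i) & 1:
            c ^= b           # add a_i * b(z)
    return c


product = shift_and_add_mul(0b1001, 0b1011, 4)
```

Here `product` equals `0b1010011`, matching the worked example.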

López-Dahab multiplication in $\mathbb{F}_{2^m}$

We can use this formula to multiply $b(z)$ by a 4-bit polynomial. If $a(z)$ is divided into 4-bit polynomials, $a(z) \cdot b(z)$ is computed as shown in the figure.

[Figure: multiplication of $a(z)$, divided into 4-bit polynomials, by $b(z)$.]

We can accelerate this method with a precomputation table. For each 4-bit $u$, compute $T(u) = u \cdot b$, which gives a table with 16 entries $T(0), \ldots, T(15)$.
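The table-driven method can be sketched in Python as below. This is a simplified model, not the authors' assembly: the accumulator is one big integer instead of a byte vector in registers, and the name `ld_mul` and the little-endian byte layout are assumptions.

```python
def ld_mul(a: bytes, b: bytes) -> bytes:
    """Multiply two binary polynomials with a 4-bit precomputation table.

    High nibbles of every byte of a are processed first, the accumulator
    is shifted left by 4 bits, then the low nibbles are processed.
    """
    b_int = int.from_bytes(b, "little")
    # table[u] = u(z) * b(z) for all 16 polynomials u of degree < 4,
    # built from table[u >> 1] * z plus b when the low bit of u is set.
    table = [0] * 16
    for u in range(1, 16):
        table[u] = (table[u >> 1] << 1) ^ (b_int if u & 1 else 0)
    c = 0
    for k in (1, 0):
        for j in range(len(a)):
            u = (a[j] >> (4 * k)) & 0x0F
            c ^= table[u] << (8 * j)   # add T(u) at byte offset j
        if k:
            c <<= 4
    return c.to_bytes(len(a) + len(b), "little")
```

With the earlier example, `ld_mul(bytes([0b1001]), bytes([0b1011]))` returns the bytes `0x53, 0x00`, that is, 1010011.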

Problem: Lots of memory operations and not enough registers!

Proposed optimization for López-Dahab multiplication

Our solution: Use a rotating register window of length $n + 1$.

Problem: Available registers might be insufficient (e.g. $\mathbb{F}_{2^{233}}$).

Multi-step implementation of López-Dahab multiplication

Our solution: Break the series of summations into blocks.

Analysis of multiplication algorithms

Table: Costs in number of executed instructions for the multiplication of two $n$-byte vectors.

| Algorithm | Reads | Writes | XOR |
|---|---|---|---|
| López-Dahab | $4n^2 + 9n$ | $\lvert T \rvert + 2n^2 + 6n$ | $2n^2 + 13n$ |
| Proposed | $2n^2 + 4n$ | $\lvert T \rvert + 5n$ | $2n^2 + 11n$ |
| Karatsuba | $11n + 3M(\lceil n/2 \rceil)$ | $7n + 3M(\lceil n/2 \rceil)$ | $4n + 3M(\lceil n/2 \rceil)$ |

Table: Costs in number of cycles for multiplication in $\mathbb{F}_{2^{163}}$ and $\mathbb{F}_{2^{233}}$.

| Algorithm | Total cycles ($n = 21$) | Total cycles ($n = 30$) |
|---|---|---|
| López-Dahab | 7743 | 14844 |
| Proposed | 3923 | 7226 |
| Karatsuba+LD | 8379 | 13530 |
| Karatsuba+Proposed | 5019 | 7748 |

Modular reduction

Algorithm 1: Fast reduction for $f(z) = z^{163} + z^7 + z^6 + z^3 + 1$.
Input: $c(z) = c[0..2n-1]$.
Output: $c(z) = c(z) \bmod f(z)$.
    for i ← 41 downto 21 do
        t ← c[i]
        c[i − 21] ← c[i − 21] ⊕ (t ≪ 5)
        c[i − 20] ← c[i − 20] ⊕ (t ≪ 4) ⊕ (t ≪ 3) ⊕ t ⊕ (t ≫ 3)
        c[i − 19] ← c[i − 19] ⊕ (t ≫ 4) ⊕ (t ≫ 5)
    end for
    t ← c[20] ≫ 3
    c[0] ← c[0] ⊕ (t ≪ 7) ⊕ (t ≪ 6) ⊕ (t ≪ 3) ⊕ t
    c[1] ← c[1] ⊕ (t ≫ 1) ⊕ (t ≫ 2)
    c[20] ← c[20] & 0x07
    return c

Problems: Redundant memory operations and expensive shifts!
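To make the byte-level shifts of Algorithm 1 concrete, here is a Python sketch that follows it step by step and cross-checks the result against a bit-by-bit reference reduction. The names `reduce_163` and `reduce_generic` are hypothetical, and all left shifts are masked to 8 bits to mimic byte registers.

```python
F163 = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def reduce_generic(value: int) -> int:
    """Reference: clear every bit at position >= 163 using f(z)."""
    for bit in range(value.bit_length() - 1, 162, -1):
        if (value >> bit) & 1:
            value ^= F163 << (bit - 163)
    return value


def reduce_163(c: bytearray) -> bytearray:
    """Algorithm 1 on a 42-byte little-endian vector, in place.

    Byte i holds coefficients 8i..8i+7; since 8i - 163 = 8(i - 21) + 5,
    byte i folds into bytes i-21, i-20 and i-19. The reduced element
    ends up in c[0..20].
    """
    for i in range(41, 20, -1):
        t = c[i]
        c[i - 21] ^= (t << 5) & 0xFF
        c[i - 20] ^= ((t << 4) ^ (t << 3) ^ t ^ (t >> 3)) & 0xFF
        c[i - 19] ^= (t >> 4) ^ (t >> 5)
    t = c[20] >> 3                     # coefficients 163..167
    c[0] ^= ((t << 7) ^ (t << 6) ^ (t << 3) ^ t) & 0xFF
    c[1] ^= (t >> 1) ^ (t >> 2)
    c[20] &= 0x07
    return c


# Cross-check on an arbitrary 42-byte input.
data = bytearray((37 * i + 11) % 256 for i in range(42))
expected = reduce_generic(int.from_bytes(data, "little"))
result = int.from_bytes(reduce_163(bytearray(data))[:21], "little")
```

Each loop iteration touches memory seven times (four reads, three writes), which is the redundancy the slide points out.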


Proposed optimization for modular reduction

Our solutions: Small register window and lookup tables.

Algorithm 4: Proposed optimization for faster modular reduction.
Input: $c(z) = c[0..2n-1]$, $T_0$, $T_1$.
Output: $c(z) = c(z) \bmod f(z)$.
Note: $R(r_0, r_1, r_2, t) \equiv r_0 \leftarrow r_0 \oplus T_0[t],\ r_1 \leftarrow r_1 \oplus T_1[t],\ r_2 \leftarrow t \ll 5$
    rb ← 0, rc ← 0
    i ← 21, j ← 40
    while i > 3 do
        R(rb, rc, ra, c[j]), c[i] ← c[i] ⊕ rb
        R(rc, ra, rb, c[j − 1]), c[i − 1] ← c[i − 1] ⊕ rc
        R(ra, rb, rc, c[j − 2]), c[i − 2] ← c[i − 2] ⊕ ra
        i ← i − 3, j ← j − 3
    end while
    R(rb, rc, ra, c[22]), c[3] ← c[3] ⊕ rb
    R(rc, ra, rb, c[21]), c[2] ← c[2] ⊕ rc
    c[1] ← c[1] ⊕ ra
    c[0] ← c[0] ⊕ rb
    return c
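The following Python sketch models Algorithm 4 with three variables standing in for the rotating register window. Several points are assumptions rather than statements from the slides: the contents of `T0` and `T1` are inferred from the shifts in Algorithm 1, the top byte `c[41]` is assumed to be zero (as it is for a product of two 163-bit polynomials), and the final fold of the upper bits of `c[20]` is borrowed from Algorithm 1 because the listing above stops at `c[0]`.

```python
F163 = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

# Assumed table contents: byte t, multiplied by z^5 (z^7 + z^6 + z^3 + 1),
# spans three bytes; T0 is its top byte and T1 its middle byte.
T0 = [(t >> 4) ^ (t >> 5) for t in range(256)]
T1 = [((t << 4) ^ (t << 3) ^ t ^ (t >> 3)) & 0xFF for t in range(256)]


def reduce_generic(value: int) -> int:
    """Bit-by-bit reference reduction modulo f(z)."""
    for bit in range(value.bit_length() - 1, 162, -1):
        if (value >> bit) & 1:
            value ^= F163 << (bit - 163)
    return value


def reduce_163_tables(c: bytearray) -> bytearray:
    """Table-based reduction of a 42-byte vector with c[41] == 0."""
    r0 = r1 = r2 = 0
    for j in range(40, 20, -1):
        t = c[j]
        r0 ^= T0[t]                    # r0 is now complete for byte j - 19
        r1 ^= T1[t]
        r2 = (t << 5) & 0xFF
        c[j - 19] ^= r0                # one read and one write per byte
        r0, r1, r2 = r1, r2, r0        # rotate the register window
    c[1] ^= r0
    c[0] ^= r1
    t = c[20] >> 3                     # final fold, as in Algorithm 1
    c[0] ^= ((t << 7) ^ (t << 6) ^ (t << 3) ^ t) & 0xFF
    c[1] ^= (t >> 1) ^ (t >> 2)
    c[20] &= 0x07
    return c


data = bytearray((37 * i + 11) % 256 for i in range(41)) + bytearray(1)
expected = reduce_generic(int.from_bytes(data, "little"))
result = int.from_bytes(reduce_163_tables(bytearray(data))[:21], "little")
```

Compared with the byte-by-byte version, each destination byte is written once, and all shifts except `t << 5` are replaced by table lookups.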

Analysis of modular reduction

Modular reduction in $\mathbb{F}_{2^{163}}$:
- Uses a rotating window of 3 registers;
- Needs two 256-byte lookup tables.

Modular reduction in $\mathbb{F}_{2^{233}}$:
- Cannot use register windows;
- Does not need lookup tables;
- Unrolling and elimination of redundant memory operations.

Table: Costs in number of executed instructions for modular reduction.

| Algorithm | $\mathbb{F}_{2^{163}}$ reads | $\mathbb{F}_{2^{163}}$ writes | $\mathbb{F}_{2^{233}}$ reads | $\mathbb{F}_{2^{233}}$ writes |
|---|---|---|---|---|
| Original | 88 | 66 | 122 | 92 |
| Proposed | 43 | 23 | 92 | 62 |

Observations

Additional technicalities:
- Lookup tables and precomputed tables are aligned to 256-byte addresses;
- Inversion is implemented by the extended Euclidean algorithm with dedicated shifting functions (inversion-to-multiplication ratio $I/M \approx 16$).

Elliptic curve arithmetic

We selected two fast algorithms for point multiplication:
- 4-TNAF on Koblitz curves [Solinas 2000];
- López-Dahab method on generic curves [López et al. 1999].

Algorithm 5: $w$-TNAF method for point multiplication.
Input: $k \in \mathbb{Z}$, $P \in E(\mathbb{F}_{2^m})$.
Output: $kP \in E(\mathbb{F}_{2^m})$.
    Compute the representation $\mathrm{TNAF}_w(k) = \sum_{i=0}^{t-1} u_i \tau^i$
    Compute $P_u = \alpha_u P$, for $u \in \{1, 3, 5, \ldots, 2^{w-1} - 1\}$
    Q ← ∞
    for i ← t − 1 downto 0 do
        Q ← τQ
        if $u_i \neq 0$ then
            Let $u$ be such that $\alpha_u = u_i$ or $\alpha_{-u} = -u_i$
            if $u > 0$ then Q ← Q + $P_u$; else Q ← Q − $P_{-u}$
        end if
    end for
    return Q

Important: Point addition/subtraction costs 8 multiplications!

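Algorithm 6: LD method for point multiplication.
Input: $k = \sum_{i=0}^{t-1} k_i 2^i \in \mathbb{Z}$ with $k_{t-1} = 1$, $P = (x, y) \in E(\mathbb{F}_{2^m})$, curve coefficient $b$.
Output: $kP \in E(\mathbb{F}_{2^m})$.
    $x_1 \leftarrow x$, $z_1 \leftarrow 1$, $z_2 \leftarrow x^2$, $x_2 \leftarrow z_2^2 + b$
    for i ← t − 2 downto 0 do
        $r_1 \leftarrow x_1 \cdot z_2$, $r_2 \leftarrow x_2 \cdot z_1$, $r_3 \leftarrow r_1 + r_2$, $r_4 \leftarrow r_1 \cdot r_2$
        if $k_i \neq 0$ then
            $z_1 \leftarrow r_3^2$, $r_1 \leftarrow x \cdot z_1$, $x_1 \leftarrow r_1 + r_4$, $r_1 \leftarrow z_2^2$, $r_2 \leftarrow x_2^2$
            $z_2 \leftarrow r_1 \cdot r_2$, $x_2 \leftarrow r_2^2$, $r_1 \leftarrow r_1^2$, $r_2 \leftarrow b \cdot r_1$, $x_2 \leftarrow x_2 + r_2$
        else
            $z_2 \leftarrow r_3^2$, $r_1 \leftarrow x \cdot z_2$, $x_2 \leftarrow r_1 + r_4$, $r_1 \leftarrow z_1^2$, $r_2 \leftarrow x_1^2$
            $z_1 \leftarrow r_1 \cdot r_2$, $x_1 \leftarrow r_2^2$, $r_1 \leftarrow r_1^2$, $r_2 \leftarrow b \cdot r_1$, $x_1 \leftarrow x_1 + r_2$
        end if
    end for
    return $Q = (x_3, y_3)$ computed from $(x_1/z_1, x_2/z_2)$

Important: Resistant to simple timing attacks.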
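The ladder of Algorithm 6 can be exercised on a toy curve. The sketch below is illustrative only: the field $\mathbb{F}_{2^4}$ with $f(z) = z^4 + z + 1$, the curve coefficients $a = b = 1$, and every function name are assumptions chosen to keep the example small, and only the x-coordinate is recovered. The doubling step uses $X \leftarrow X^4 + bZ^4$, $Z \leftarrow X^2 Z^2$, and the result is compared against textbook affine addition.

```python
M, F = 4, 0b10011        # toy field GF(2^4), f(z) = z^4 + z + 1 (assumed)
A, B = 1, 1              # toy curve y^2 + xy = x^3 + x^2 + 1 (assumed)


def mul(a, b):
    """Field multiplication with interleaved reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= F
    return r


def sq(a):
    return mul(a, a)


def inv(a):
    """Inverse of a nonzero element as a^(2^M - 2)."""
    r = 1
    for _ in range(2 ** M - 2):
        r = mul(r, a)
    return r


def ld_ladder_x(k, x):
    """x-coordinate of kP for k >= 1, or None for the point at infinity."""
    x1, z1 = x, 1
    z2 = sq(x)
    x2 = sq(z2) ^ B
    for i in range(k.bit_length() - 2, -1, -1):
        r1, r2 = mul(x1, z2), mul(x2, z1)
        r3, r4 = r1 ^ r2, mul(r1, r2)
        if (k >> i) & 1:
            z1 = sq(r3)                       # P1 <- P1 + P2
            x1 = mul(x, z1) ^ r4
            r1, r2 = sq(z2), sq(x2)           # P2 <- 2 P2
            z2 = mul(r1, r2)
            x2 = sq(r2) ^ mul(B, sq(r1))
        else:
            z2 = sq(r3)                       # P2 <- P1 + P2
            x2 = mul(x, z2) ^ r4
            r1, r2 = sq(z1), sq(x1)           # P1 <- 2 P1
            z1 = mul(r1, r2)
            x1 = sq(r2) ^ mul(B, sq(r1))
    return mul(x1, inv(z1)) if z1 else None


def affine_add(P, Q):
    """Reference affine addition; None denotes the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2:
        if y2 == x1 ^ y1:                     # Q = -P
            return None
        lam = x1 ^ mul(y1, inv(x1))           # doubling
        x3 = sq(lam) ^ lam ^ A
        return (x3, sq(x1) ^ mul(lam ^ 1, x3))
    lam = mul(y1 ^ y2, inv(x1 ^ x2))
    x3 = sq(lam) ^ lam ^ x1 ^ x2 ^ A
    return (x3, mul(lam, x1 ^ x3) ^ x3 ^ y1)


def ladder_matches_affine():
    """Compare the ladder with repeated affine addition for k < ord(P)."""
    P = next((x, y) for x in range(1, 16) for y in range(16)
             if sq(y) ^ mul(x, y) == mul(sq(x), x) ^ mul(A, sq(x)) ^ B)
    Q, k = P, 1
    while Q is not None:
        if ld_ladder_x(k, P[0]) != Q[0]:
            return False
        Q, k = affine_add(Q, P), k + 1
    return True
```

Both branches of the loop perform the same sequence of field operations, which is the property behind the resistance to simple timing attacks noted above.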

Implementation

Material:
- GCC 4.1.2 for ATMega128;
- Software library implemented from scratch;
- AVR Simulator 4.14.

Programming languages:
- C;
- Assembly.

Curve parameters:
- Koblitz curves NIST-K163 and NIST-K233;
- Binary curves NIST-B163 and NIST-B233.

Implementation results

Table: Timings for arithmetic algorithms in $\mathbb{F}_{2^{163}}$.

| Algorithm | C language (cycles) | Assembly (cycles) |
|---|---|---|
| Squaring | 629 | 430 |
| Modular squaring | 1154 | 570 |
| LD mult. with registers | 9738∗ | 4508 |
| Karatsuba+LD with registers | 12246 | 6968 |
| Modular reduction | 606 | 430 |
| Inversion | 243790 | 81365 |

Observations:
- Reduction by pentanomial costs the same as reduction by trinomial in [Kargl et al. 2008];
- (∗) This timing is for a new variant of the algorithm.

Table: Timings for arithmetic algorithms in $\mathbb{F}_{2^{233}}$.

| Algorithm | C language (cycles) | Assembly (cycles) |
|---|---|---|
| Squaring | 908 | 463 |
| Modular squaring | 1340 | 956 |
| LD mult. with registers (multi-step) | 18028∗ | 8314 |
| Karatsuba+LD with registers | 25850 | 9261 |
| Modular reduction | 911 | 620 |
| Inversion | 473618 | 142986 |

Table: Timings for point multiplication.

| Algorithm | C language (s) | Assembly (s) |
|---|---|---|
| 4-TNAF on curve NIST-K163 | 0.67 | 0.32 |
| LD on curve NIST-K163 | 1.30 | 0.62 |
| LD on curve NIST-B163 | 1.55 | 0.74 |
| 4-TNAF on curve NIST-K233 | 1.48 | 0.73 |
| LD on curve NIST-K233 | 3.25 | 1.57 |
| LD on curve NIST-B233 | 3.90 | 1.89 |

Comparison - Execution time

Implementation in C language at the 160-bit security level:
- Improvement of 41% over the previous fastest implementation.

Implementation in Assembly at the 160-bit security level:
- Improvement of 61% over the previous fastest implementation;
- Improvement of 25% over the previous fastest implementation while also resisting simple timing attacks.

Comparison - Storage

Table: Memory cost of implementations of point multiplication at the 160-bit security level.

| Implementation | ROM memory | RAM memory |
|---|---|---|
| TinyECCK (C-only) | 5.6 KB | 0.6 KB |
| Kargl et al. (C+ASM) | 11 KB | – |
| 4-TNAF method (C version) | 20 KB | 1 KB |
| 4-TNAF method (C+ASM) | 24 KB | 1.6 KB |
| LD method (C version) | 12 KB | 1 KB |
| LD method (C+ASM) | 16 KB | 1.6 KB |

Conclusions

New state-of-the-art implementation of ECC on the MICAz Mote sensor node.

Efficient implementation of binary field arithmetic:
- Most efficient implementation of squaring, multiplication, modular reduction and inversion for this platform (improvements ranging from 11% to 68%);
- Binary fields can be efficient on wireless sensors;
- Optimizations can be applied to similar platforms.

Efficient implementation of elliptic curve cryptography:
- Point multiplication in under 1/3 second at the 163-bit security level and under 3/4 second at the 233-bit level;
- Binary curves are suitable for ECC on WSNs.

| Curve | Work | Execution time (s) |
|---|---|---|
| Binary, Generic | [Malan et al. 2004] | 34 |
| Binary, Generic | [Yan and Shi 2006] | 13.9 |
| Binary, Generic | [Eberle et al. 2005] | 4.14 |
| Binary, Generic | [Eberle et al. 2005] (extensions) | 0.50 |
| Binary, Generic | [Kargl et al. 2008] | 0.83 |
| Binary, Generic | Proposed (timing attacks) | 0.74 |
| Binary, Koblitz | [Szczechowiak et al. 2008] | 2.16 |
| Binary, Koblitz | [Seo et al. 2008] | 1.14 |
| Binary, Koblitz | Proposed | 0.32 |
| Binary, Koblitz | Proposed (timing attacks) | 0.62 |
| Binary, Koblitz | Proposed (233-bit security) | 0.73 |
| Prime | [Wang and Li 2006] | 1.35 |
| Prime | [Szczechowiak et al. 2008] | 1.27 |
| Prime | [Gura et al. 2004] | 0.87 |
| Prime | [Uhsadel et al. 2007] | 0.76 |
| Prime | [Großschädl 2006] | 0.745 |

Questions?
