Hardware Implementation of High Throughput RC4 Algorithm

Thi Hong Tran, Leonardo Lanante, Yuhei Nagao, Masayuki Kurosaki, Hiroshi Ochi
Dept. of Computer Science and Electronics, Kyushu Institute of Technology, 680-4 Iizuka Fukuoka JAPAN
E-mail: hong, leonardo, nagao, ochi, [email protected]

Abstract: In this paper, we present an efficient, high throughput hardware implementation of the RC4 algorithm. The main idea of the proposed architecture is the use of a tri-port RAM to reduce the memory resource and to increase throughput. The proposed design requires two clock cycles to generate one byte of ciphering key and uses only one 256-byte block of RAM. This results in a 50% increase in system throughput and a threefold reduction in RAM resource compared to recent architectures. The proposed implementation supports variable key lengths from 8 to 128 bits and achieves 80 MB/s throughput at a 160 MHz operating frequency. It aims to support WEP security in the MAC layer of a 600 Mbps 4×4 MIMO wireless LAN system based on the IEEE 802.11n standard.

Keywords: RC4 algorithm, RC4 hardware implementation, WEP, WLAN security, tri-port RAM.

978-1-4673-0219-7/12/$31.00 ©2012 IEEE

I. INTRODUCTION

The Rivest Cipher 4 (RC4), or Alleged RC4 (ARC4), algorithm was proposed by Ron Rivest of RSA Security in 1987. It was kept as a trade secret until 1994. In terms of application, RC4 is the most widely used software-based stream cipher. For example, it is used in Secure Sockets Layer (SSL) to protect internet traffic and in Wired Equivalent Privacy (WEP) to secure wireless networks.

Achieving a high throughput rate in RC4 is very challenging because of the byte-wise swapping of the S-box elements. To perform the byte swapping, S-box elements are processed in three steps: they are read out from the S-box, a new location is calculated for them, and they are written back to the new location in the S-box. Each byte of ciphering key is generated after all three steps are finished [1]. In addition, the three steps are difficult to parallelize or pipeline because each step and each iteration depends on the result of the previous one.

Conventional approaches take several clock cycles to generate a single byte of ciphering key. For instance, a software implementation of RC4 in assembly language presented in [2] required 7 clock cycles, the hardware implementation proposed in [3] took 8 clock cycles, a RAM-based CPLD implementation in [4] needed 4 clock cycles, and the RAM-based FPGA implementations proposed in [5] and [6] required 3 clock cycles. In addition, the architecture proposed in [6] used 3 blocks of 256 bytes of RAM to perform the byte swapping operation. Because its throughput is limited to 22 MB/s [6], it cannot be applied to high throughput systems such as the IEEE 802.11n 4×4 MIMO wireless LAN, which requires a 75 MB/s RC4 core.

In another approach, presented in [7], a register-based S-box is implemented. Its processing time is reduced to 1 clock cycle, but its hardware cost increased by three times compared with [6]. Moreover, the long critical path caused by the large MUX and DEMUX gates limits the operating frequency of the core. As a result, the throughput of this architecture is expected to be lower than that of the RAM-based approach.

In this paper, a novel RAM-based FPGA implementation of the RC4 algorithm is presented. The design is dedicated to WEP security in the MAC layer of 802.11n 600 Mbps 4×4 Multi Input Multi Output (MIMO) wireless LAN systems. The main idea of the proposed architecture is the use of a tri-port RAM, consisting of a read port, a write port, and a read/write port, to reduce the memory resource and increase throughput.

This paper is organized as follows. Section II explains the RC4 algorithm. The proposed architecture is shown in Section III. The hardware cost and throughput of the proposed implementation are compared with other works in Section IV. Section V concludes the paper.
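The throughput requirement and the cycles-per-byte figures quoted in this introduction are related by simple arithmetic. The sketch below is only an illustration of that relation: the function names are hypothetical, it assumes one ciphering-key byte per data byte and steady-state operation, and it takes 1 MB as 10^6 bytes.

```python
def required_mbytes_per_s(link_rate_mbps):
    """Key-stream rate needed to keep up with a link, in MB/s (8 bits per byte)."""
    return link_rate_mbps / 8


def core_throughput_mbytes_per_s(clock_mhz, cycles_per_byte):
    """Steady-state throughput of a core that needs a fixed number of cycles per byte."""
    return clock_mhz / cycles_per_byte


def min_clock_mhz(target_mbytes_per_s, cycles_per_byte):
    """Lowest clock frequency that reaches the target throughput."""
    return target_mbytes_per_s * cycles_per_byte


if __name__ == "__main__":
    need = required_mbytes_per_s(600)  # 600 Mbps 802.11n 4x4 MIMO link
    print("required:", need, "MB/s")
    print("2 cycles/byte at 160 MHz:", core_throughput_mbytes_per_s(160, 2), "MB/s")
    print("min clock at 2 cycles/byte:", min_clock_mhz(need, 2), "MHz")
    print("min clock at 3 cycles/byte:", min_clock_mhz(need, 3), "MHz")
```

Under these assumptions, a core that spends 2 clock cycles per byte meets the 75 MB/s requirement at any clock of 150 MHz or more, whereas a core that spends 3 clock cycles per byte would need 225 MHz.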

II. RC4 ALGORITHM

The RC4 algorithm defines a method to generate a pseudo-random stream of bits, called the ciphering key, from a provided master key. In a communication system, at the transmitter side, plaintext data is encrypted with the RC4 ciphering key using an XOR operation. At the receiver side, the encrypted data is decrypted with the same ciphering key, again using an XOR operation.

RC4 generates the ciphering key in two stages, as shown in Figure 1. The first stage is called the Key Scheduling Algorithm (KSA). This stage is divided into two sub-stages, named Initial and KeySetup. In the Initial sub-stage, an S-box of 256 one-byte elements is initialized. Each element of the S-box is assigned a value equal to its index number.

In the KeySetup sub-stage, the initialized S-box is permuted based on the provided master key. The master key is the security key used for encryption and decryption in the system. The possible length of the master key is from 1 to 256 bytes, but most systems support key lengths from 1 to 8 bytes. The KeySetup sub-stage performs 256 iterations, as shown in the left flowchart of Figure 1. In this figure, i and j are the index pointers of the S-box, and S[i] and S[j] are the current values of the S-box at index positions i and j. Key[i] is the value of the i-th byte of the key stream, which is generated by cyclically extending the master key to 256 bytes. All sums shown in Figure 1 are modulo 256 and are therefore one byte long. The final result of the KSA stage is the pseudo-randomized 256-byte S-box. The KSA stage is performed as soon as the master key is provided, so that its latency does not delay the actual data encryption.

The second stage is called the Pseudo Random Generation Algorithm (PRGA). This stage is performed when the ciphering key is needed to encrypt or decrypt the data. It uses the result of the KSA stage to generate the pseudo-random ciphering key. The processing of this stage is similar to that of the KeySetup sub-stage, with two differences. First, the index pointer j is calculated without the information of the master key. Second, in addition to the two index pointers i and j used for permuting the S-box, another index pointer k is used for generating the ciphering key. The processing of this stage is shown in the right flowchart of Figure 1. In this flowchart, N is the number of ciphering key bytes required for encrypting or decrypting the data.

Figure 1. RC4's processing flowchart

III. PROPOSED ARCHITECTURE

A. Block diagram
The proposed RC4 architecture is shown in Figure 2. The CAL-J block calculates the index pointer j. The COUNT block is an 8-bit counter used to generate the index pointer i. The CAL-K block calculates the index pointer k. The KEYGEN block receives the master key and its length and generates the 256-byte key stream that is needed to calculate the index pointer j in the KeySetup sub-stage. The proposed architecture supports key lengths from 8 bits to 128 bits. The main part of the core is the RAM block, which stores the S-box and plays an important role in performing the permutation. The main part of the RAM block is a 256-byte tri-port RAM, whose three ports are respectively read-only, write-only, and read/write (Figure 3).
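As a software reference for the KSA and PRGA stages described in Section II, and for the quantities that the KEYGEN, CAL-J, and CAL-K blocks compute, the following is a minimal sketch of standard RC4. It is not the authors' verification program. The function names (`keygen`, `ksa`, `prga`, `rc4_xor`) are chosen here for illustration only.

```python
def keygen(master_key: bytes):
    """Cyclically extend the master key to a 256-byte key stream (role of KEYGEN)."""
    if not 1 <= len(master_key) <= 256:
        raise ValueError("master key must be 1 to 256 bytes long")
    return [master_key[i % len(master_key)] for i in range(256)]


def ksa(master_key: bytes):
    """Key Scheduling Algorithm: Initial sub-stage followed by KeySetup."""
    s = list(range(256))                 # Initial: S[i] = i
    key = keygen(master_key)
    j = 0
    for i in range(256):                 # KeySetup: 256 iterations
        j = (j + s[i] + key[i]) & 0xFF   # index pointer j, modulo 256
        s[i], s[j] = s[j], s[i]          # byte swap
    return s


def prga(s_box, n: int) -> bytes:
    """Pseudo Random Generation Algorithm: produce n bytes of ciphering key."""
    s = list(s_box)                      # work on a copy of the KSA result
    i = j = 0
    out = bytearray()
    for _ in range(n):
        i = (i + 1) & 0xFF
        j = (j + s[i]) & 0xFF            # no master key information needed here
        s[i], s[j] = s[j], s[i]
        k = (s[i] + s[j]) & 0xFF         # index pointer k
        out.append(s[k])
    return bytes(out)


def rc4_xor(master_key: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt: XOR the data with the ciphering key."""
    stream = prga(ksa(master_key), len(data))
    return bytes(d ^ c for d, c in zip(data, stream))


if __name__ == "__main__":
    cipher = rc4_xor(b"Key", b"Plaintext")
    print(cipher.hex())
    print(rc4_xor(b"Key", cipher))       # applying the same XOR again decrypts
```

Because encryption and decryption are the same XOR with the same ciphering key, one function serves both directions.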

Figure 2. The proposed RC4 architecture

Figure 3. Illustration of the use of the tri-port RAM

Figure 4. The proposed RC4 core's operation

The read/write operations of the three ports are scheduled to reduce the core's processing time and thereby increase its throughput.

B. Operation
The operation of the proposed RC4 core is shown in Figure 4. In the KSA-Initial sub-stage, ports 2 and 3 of the RAM are used to initialize the S-box elements. Port 2 writes the values 0 to 127 to the first 128 elements, while port 3 writes the values 128 to 255 to the last 128 elements of the S-box. This sub-stage takes 128 clock cycles to finish.

In the KSA-KeySetup sub-stage and the PRGA stage, the first clock cycle is used to read the S-box at the initial value of the index pointer i. This is done on port 1 of the RAM. From the second clock cycle, each iteration is processed in two steps, each of which finishes in one clock cycle.

In step-1, the following occur simultaneously.
• The read data S[i] of the previous read command appears at port 1's output.
• The index pointer j is calculated as the sum of its previous value, the key stream byte Key[i], and the read data S[i]. Note that Key[i] is generated by the KEYGEN block and is equal to zero in the PRGA stage.
• For the PRGA stage, port 1 is used to read the S-box at the index pointer k. This becomes the (i − 1)-th byte of the ciphering key.
• Port 3 is used to read the S-box at the index pointer j.

In step-2, the following occur simultaneously.
• The read data S[j] appears at port 3's output.
• The read data S[i], which appeared at port 1's output in step-1, is delayed by one clock cycle.
• For the PRGA stage, the index pointer k is calculated as the sum of the read data S[j] and the delayed S[i].
• From the second iteration of the PRGA stage, the ciphering key S[k] appears at port 1's output.
• Port 1 is used to read the S-box at the index pointer i.
• Ports 2 and 3 are used for swapping S[i] and S[j].

In the swap operation of step-2, port 2 is used to write S[j] into the location pointed to by the index pointer i. Port 3, on the other hand, is used to write the delayed S[i] into the location pointed to by the index pointer j. To avoid the race condition that occurs when two ports write to the same memory location simultaneously, the values of i and j are compared before the write command is issued. If i equals j, swapping is not necessary, and the write command is cancelled to avoid any potential problem. In addition, the RAM is configured as cross-port write-first. With this setting, if read and write commands are issued simultaneously by different ports to the same address, the write command is performed ahead of the read. As a result, in step-2, the swapping of S[i] and S[j] is completed before S[i] is read for the next iteration.

Figure 3 summarizes the operation of the tri-port RAM by showing the connections of its three ports. The left-hand figure shows the connections used in the KSA-Initial sub-stage, and the right-hand figure shows the connections used in the KSA-KeySetup sub-stage and the PRGA stage.

The 256 iterations of the KSA-KeySetup sub-stage take 2 × 256 + 1 = 513 clock cycles. The PRGA stage takes 2 × N + 1 clock cycles to generate N bytes of key stream. In total, the core needs 128 + 513 + (2 × N + 1) = 642 + 2 × N clock cycles to finish the KSA and PRGA stages when generating N bytes of ciphering key. To improve the throughput rate of the core, the KSA stage is preprocessed as soon as the master key is known.
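The two-step port schedule and the cross-port write-first behaviour can be checked with a small behavioural model. The sketch below is not the authors' RTL. The class and function names (`TriPortRAM`, `tick`, `run_stage`, `core_keystream`) are hypothetical, and it assumes a RAM whose read data becomes usable one clock cycle after the read command. It models only the order of reads and writes, not exact cycle counts.

```python
class TriPortRAM:
    """256-byte RAM: port 1 read-only, port 2 write-only, port 3 read/write.

    Assumption: cross-port write-first, so writes issued in a cycle are
    applied before the reads issued in that same cycle are sampled.
    """

    def __init__(self):
        self.mem = [0] * 256

    def tick(self, rd1=None, wr2=None, rd3=None, wr3=None):
        assert rd3 is None or wr3 is None    # port 3 reads or writes, not both
        for write in (wr2, wr3):
            if write is not None:
                addr, data = write
                self.mem[addr] = data
        q1 = self.mem[rd1] if rd1 is not None else None
        q3 = self.mem[rd3] if rd3 is not None else None
        return q1, q3                        # usable in the following cycle


def ksa_initial(ram):
    """KSA-Initial: ports 2 and 3 each write half of the S-box (128 cycles)."""
    for v in range(128):
        ram.tick(wr2=(v, v), wr3=(128 + v, 128 + v))


def run_stage(ram, key_stream, iterations, first_i, emit):
    """KeySetup (emit=False) or PRGA (emit=True) using the two-step schedule."""
    out = bytearray()
    i, j, k = first_i, 0, None
    s_i, _ = ram.tick(rd1=i)                 # first cycle: read S[i]
    for _ in range(iterations):
        # step-1: compute j, read S[j] on port 3, read previous S[k] on port 1
        j = (j + s_i + key_stream[i]) & 0xFF
        s_k, s_j = ram.tick(rd1=k, rd3=j)
        if emit and k is not None:
            out.append(s_k)                  # key byte of the previous iteration
        # step-2: compute k, swap via ports 2 and 3, read the next S[i] on port 1
        k = (s_i + s_j) & 0xFF if emit else None
        nxt = (i + 1) & 0xFF
        if i != j:
            s_i, _ = ram.tick(rd1=nxt, wr2=(i, s_j), wr3=(j, s_i))
        else:
            s_i, _ = ram.tick(rd1=nxt)       # i == j: write commands cancelled
        i = nxt
    if emit and k is not None:
        s_k, _ = ram.tick(rd1=k)             # model-only flush of the last byte
        out.append(s_k)
    return bytes(out)


def core_keystream(master_key: bytes, n: int) -> bytes:
    ram = TriPortRAM()
    ksa_initial(ram)
    key_stream = [master_key[x % len(master_key)] for x in range(256)]
    run_stage(ram, key_stream, 256, first_i=0, emit=False)
    return run_stage(ram, [0] * 256, n, first_i=1, emit=True)


if __name__ == "__main__":
    print(core_keystream(b"Key", 10).hex())
```

The write-first setting matters when the next index i coincides with the current j. In that case the read issued in step-2 must return the value just written by the swap, which the model obtains by applying writes before sampling reads.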

Since the KSA stage is completed in advance, the throughput rate is determined by the processing time of the PRGA stage alone.

IV. RESULTS AND COMPARISONS

The proposed architecture was designed and simulated using Synphony HLS from Synopsys. A software program of the RC4 algorithm was created to verify the core. In terms of performance, the processing time of the proposed architecture is compared with other works ([2] – [6]) in Table I, where N is the number of ciphering key bytes required. The values show that the proposed design has the shortest processing time. It costs only 2 clock cycles per ciphering byte, compared with 3 clock cycles for [5] and [6]. Thus, it can increase the RC4 core throughput by at least 50%.

TABLE I. PROCESSING TIME COMPARISON

                                 [2]    [3]    [4]          [5] & [6]    Proposed
Processing Time (clock cycles)   7N     8N     1280 + 4N    768 + 3N     642 + 2N

The proposed architecture was implemented on a Xilinx 2V250fg256 FPGA. Its hardware resources and final throughput are compared with [3] and [6] in Table II. The results show that the proposed implementation uses fewer resources and achieves higher throughput than [3] and [6]. In detail, it needs the same amount of RAM but less combinational logic than [3], and one third of the RAM of [6]. At a 160 MHz operating frequency, the proposed architecture achieves a throughput rate of 80 MB/s, which is 16 times and 3.6 times that of [3] and [6], respectively.

TABLE II. HARDWARE RESOURCE AND THROUGHPUT COMPARISON

                    [3]                    [6]                      Proposed
FPGA Device         XC4000E4013EPQ208-2    XC2V250fg256             XC2V250fg256
Comb. Logic         547                    256                      229
CLB slices          255                    138                      128
DFF and Latch       -                      279                      262
Memory (Bytes)      256 (dual port RAM)    768 (single port RAM)    256 (tri-port RAM)
Gate Count          11372                  -                        -
Frequency (MHz)     40                     64                       160
Throughput (MB/s)   5                      22                       80

V. CONCLUSION

In this paper, a novel RAM-based hardware implementation of the RC4 algorithm has been presented. To achieve high throughput and reduce the RAM resource, a complex tri-port RAM was chosen. The scheduling of the RAM's read/write operations, which is decisive for the success of the proposed architecture, was explained in detail. The paper shows that the proposed implementation achieves the best throughput compared to recent works. The required RAM resource is reduced by three times compared with the recent RC4 core in [6]. At a 160 MHz operating frequency, the proposed design obtains an 80 MB/s throughput rate, which makes it applicable to WEP security in the IEEE 802.11n 600 Mbps 4×4 MIMO wireless LAN and to many other high throughput systems.

ACKNOWLEDGMENT

A part of this work was supported by a grant of Regional Innovation Cluster Program 2nd stage (Global Type) implemented by Ministry of Education, Culture, Sports, Science and Technology (MEXT).

REFERENCES

[1] B. Schneier, “Applied Cryptography - Protocols, Algorithms and Source Code in C”, Second Edition, John Wiley and Sons, New York, 1996.
[2] B. Schneier, D. Whiting, “Fast Software Encryption: Designing Encryption Algorithms for Optimal Software Speed on the Intel Pentium Processor”, Fast Software Encryption workshop (FSE97), LNCS, Vol. 1267, pp. 242-259, Springer-Verlag, Haifa, Israel, January 20-22, 1997.
[3] P. Hamalainen, M. Hannikainen, T. Hamalainen and J. Saarinen, “Hardware Implementation of the Improved WEP and RC4 Encryption Algorithms for Wireless Terminals”, The European Signal Processing Conference (EUSIPCO'2000), September 5-8, 2000, Tampere, Finland, pp. 2289-2292.
[4] P. D. Kundarewich, S. J.E. Wilton, A. J. Hu, “A CPLD-Based RC-4 Cracking System”, The 1999 Canadian Conference on Electrical and Computer Engineering, May 1999, vol. 1, pp. 397 – 402.
[5] K.H Tsoi, K.H Lee and P.H.W Leong, “A Massively Parallel RC4 Key Search Engine”, Proc. of the 10th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'02), September 22 - 24, 2002, Napa, California, pp. 13-21.
[6] P. Kitsos, G. Kostopoulos, N. Sklavos and O. Koufopavlou, “Hardware Implementation of the RC4 stream Cipher”, In Proc. of 46th IEEE Midwest Symposium on Circuits & Systems ’03, Cairo, Egypt, 2003, vol. 3, pp. 1363 – 1366.
[7] S. S. Gupta, K. Sinha, S. Maitra and B. P. Sinha, “One Byte per Clock: A Novel RC4 Hardware”, 11th International Conference on Cryptology - Indocrypt 2010, Dec. 2010, India.
[8] T. Khine, S. Seo, S. Yoon, S. Park, H. Ochi, “High Speed MAC for 4 Streams 11n WLAN Chip”, IEICE General conference, no. AS-2-3, March 2010.
[9] Z. Wang, “An Intelligent Multi Port Memory”, Journal of Computer, 2010, vol. 5, No. 3, pp. 471 – 478.
[10] Integrated Device Technology, Inc., Datasheet: High Speed 8K × 16 Triport Static RAM, January 2009.