Low-Cost IP Blocks for UMTS Turbo Decoders

G. Masera, M. Mazza, G. Piccinini, F. Viglione, and M. Zamboni
Politecnico di Torino, Dip. di Elettronica
Corso Duca degli Abruzzi 24, 10129 Torino, Italy
[email protected]

Abstract

Several years after their introduction in 1993, turbo codes are now universally recognized as one of the most effective techniques for achieving performance very close to the Shannon theoretical limits in many transmission systems; they have also been adopted as a coding standard for several applications, including the Universal Mobile Telecommunications System (UMTS). This paper presents new, low-cost IP blocks for the implementation of UMTS standard compliant turbo decoders. The proposed architectures achieve an important complexity reduction with respect to previously reported solutions: this result, obtained through careful architectural optimization, is mainly related to the elimination of some of the large memories allocated in the usual decoder structure.
1. Introduction
Convolutional concatenated codes with iterative decoding, usually called turbo codes [1], have been included in important communication standards, such as the Universal Mobile Telecommunication System (UMTS) for the third generation of mobile communications [2] and the new CCSDS telemetry channel coding [3]. From the theoretical point of view, there is a very rich literature on the system design of turbo codes and on their error correction capability; on the practical side, there is also great interest in the hardware implementation of turbo codes: as the decoding algorithm is computationally intensive, the design of efficient IP blocks implementing the decoder is a very challenging topic, especially in the case of wireless applications, where severe constraints are given in terms of cost and power consumption.

As specified in the UMTS standard, the turbo code is the parallel concatenation of two systematic and recursive convolutional encoders, separated by an interleaver (see Figure 1), that is, a memory with the function of scrambling the bit sequence. The information stream u is processed by the first encoder (E1) and, at the same time, the permuted sequence generated at the interleaver output is encoded by the second convolutional encoder (E2). The code sequence of the whole parallel concatenated convolutional code (PCCC) is formed by concatenating the two separate code sequences. Serial convolutional concatenated codes (SCCC) and more general schemes of concatenation have also been proposed [4, 5].

[Figure 1. The UMTS turbo encoder: the information sequence u feeds encoder E1 directly and encoder E2 through the interleaver I, producing the code sequences c' and c''.]

The decoding issue is solved by splitting it into a number of simpler problems, equal to the number of constituent encoders. The decoding network is composed of a concatenation of soft decoders, which manage quantities (the soft information) related to the reliability of the input and output bits of each encoder: these decoder units are known as SISO (soft-input, soft-output) modules; interleavers are placed among the single decoders, according to the scheme used in the encoder. The entire decoding algorithm uses an iterative procedure, which can be stopped when the required level of reliability is reached. The correction performance increases with the number of iterations, at the cost of a lower decoding speed and a higher latency and power consumption.

In this paper, after a brief introduction to the decoding algorithm and to the general scheme of the decoder, new, low-cost architectures are presented for the implementation of the turbo decoder as IP blocks. The reduction of the implementation cost is obtained mainly by acting on the amount of RAM memory to be allocated: first a new SISO is described, which reduces the required RAM capacity by 10% with respect to usual implementations; then a new scheme is presented for the design of the UMTS interleaver, which drastically reduces the required number of memory locations.
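The iterative exchange of extrinsic information between the two constituent decoders described above can be sketched as follows. This is a minimal functional model, not the hardware architecture proposed in this paper: the `siso` routine, the soft-value conventions (positive values mean bit 1), and all names are illustrative placeholders.

```python
def turbo_decode(llr_sys, llr_p1, llr_p2, perm, siso, n_iter=8):
    """Skeleton of the iterative turbo-decoding loop.

    llr_sys  -- soft channel values for the systematic bits
    llr_p1/2 -- soft channel values for the parity bits of encoders 1 and 2
    perm     -- interleaver: position i of the interleaved stream comes
                from position perm[i] of the original one
    siso     -- placeholder soft-input soft-output routine returning
                extrinsic information (hypothetical signature)
    """
    n = len(llr_sys)
    ext = [0.0] * n                      # extrinsic info, zero at start
    for _ in range(n_iter):
        # SISO 1 works on the natural-order stream
        ext = siso(llr_sys, llr_p1, ext)
        # interleave, run SISO 2, then de-interleave its output
        ext_i = siso([llr_sys[perm[i]] for i in range(n)],
                     llr_p2,
                     [ext[perm[i]] for i in range(n)])
        deint = [0.0] * n
        for i in range(n):
            deint[perm[i]] = ext_i[i]
        ext = deint
    # final decision combines channel and extrinsic values
    return [1 if llr_sys[i] + ext[i] > 0 else 0 for i in range(n)]
```

Stopping the loop early when the required reliability is reached trades correction performance for decoding speed, latency and power, as noted above.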
2. The Turbo Decoder

The architecture of the turbo decoder, compliant with the UMTS standard, is reported in Figure 2. The main blocks are the input and output buffers, the two Soft-Input Soft-Output (SISO) units, the interleaver (I) and the deinterleaver (I⁻¹), which performs the inverse permutation with respect to I.

[Figure 2. Architecture of the turbo decoder: input and output buffers, SISO 1 and SISO 2, interleaver I and deinterleaver I⁻¹, exchanging the soft metrics π(u;I), π(c;I), π(u;O) and π(c;O).]

The input buffer stores the input soft information during the iterations. The interleaver and the deinterleaver are two memories in which data are read and written in different orders, according to a predefined permutation law; the generation of the interleaver addresses is described in Section 4. The architecture reported in Figure 2 can process one or two blocks of data simultaneously. In the first case a single memory can be shared between the two interleavers (I and I⁻¹) and the two SISO units operate alternately, each for 50% of the decoding time; if the two constituent codes are identical, the entire decoder can be folded, needing only one SISO module. In the case of simultaneous decoding of two blocks, two memories for interleaver and deinterleaver must be allocated, and the input and output buffers must also be sized to contain two data blocks.

The SISO units are the core of the turbo decoder. They perform a MAP (Maximum A Posteriori) algorithm, receiving the soft information π(u;I) and π(c;I), related to the input and output data streams of the encoder, and generating π(u;O) and π(c;O), a refined version of the input quantities. In the following, only the basic processing equations are reported; the detailed description of the algorithm can be found in [5, 6]. The SISO routine elaborates the π metrics, related to the a posteriori probabilities of the data streams, evaluating the following equations:

\[ \alpha_k(s) = \max^{*}_{e:\,s^E(e)=s}\left\{\alpha_{k-1}\left[s^S(e)\right] + \pi_k\left[u(e);I\right] + \pi_k\left[c(e);I\right]\right\} \]

\[ \beta_k(s) = \max^{*}_{e:\,s^S(e)=s}\left\{\beta_{k+1}\left[s^E(e)\right] + \pi_{k+1}\left[u(e);I\right] + \pi_{k+1}\left[c(e);I\right]\right\} \]

\[ \pi_k(u;O) = \max^{*}_{e:\,u(e)=u}\left\{\alpha_{k-1}\left[s^S(e)\right] + \pi_k\left[c(e);I\right] + \beta_k\left[s^E(e)\right]\right\} \]

\[ \pi_k(c;O) = \max^{*}_{e:\,c(e)=c}\left\{\alpha_{k-1}\left[s^S(e)\right] + \pi_k\left[u(e);I\right] + \beta_k\left[s^E(e)\right]\right\} \]

where s^S(e) and s^E(e) are the states in which the transition e starts and ends respectively, s is the state for which the metrics are calculated, and the max* operator chooses the maximum between its arguments and adds a logarithmic correction factor.
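The max* operation appearing in these recursions is the standard Jacobian logarithm; a minimal sketch (not taken from the paper) is:

```python
import math

def max_star(a: float, b: float) -> float:
    """max*(a, b) = ln(e^a + e^b): the maximum of the two arguments
    plus a logarithmic correction term."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))
```

In hardware the correction term log(1 + e^{-|a-b|}) is typically read from a small lookup table, which is what the Add-Compare-Select sections described in the next section implement.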
3. The Architecture of the SISO Unit

The implementation of the equations given in the previous section for the updating of the α and β metrics involves the use of Add-Compare-Select (ACS) units, which we name ACS sections. The decoding algorithm operates on blocks of data of fixed length [6] (sliding windows); the minimum window length depends on the choice of the code and is selected after performance simulations. In order to guarantee good performance, the decoded outputs must be obtained from β metrics resulting from a backward recursion of at least NDP steps: this long sequence of updates is called the "training window" and is processed by a dedicated ACS section (ACS dummy). The β metrics resulting from the last step of the training window are used to initialize a second sequence of backward recursions, executed by the second ACS section: the β metrics obtained by this second ACS section have a memory longer than NDP trellis steps, so they are reliable enough to be used for the output generation.

Previous works [7, 8] propose schemes which include three memories for the storage of the input π metrics and a fourth one to temporarily store the α metrics. The requirement of three RAMs for the storage of the soft inputs comes from the fact that a minimum of two blocks must be completely stored inside the SISO before starting the backward recursion on the second block: this implies the allocation of two memories; the third memory is needed to receive a third block while the training window recursion is performed on the second one.

The new proposed architecture, reported in Figure 3, needs only two memories (Mem1 and Mem2) to store the soft inputs, thus reducing area and power consumption. It also includes the α metric memory (Mem α) and the ACS sections for the generation of the metrics α, β, π(u;O) and π(c;O). Basically, the third memory for the soft inputs is avoided by exploiting the input buffer, from which data are not read in the transmission order.
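This modified read order of the input buffer can be sketched as a simple index generator; a functional model under the paper's block length NDP, not the hardware address logic:

```python
def reversed_block_order(n_samples: int, ndp: int):
    """Yield buffer indices so that each NDP-long block is delivered in
    reverse order, letting the ACS-dummy section run its backward
    (training) recursion directly on the streamed data."""
    for start in range(0, n_samples, ndp):
        end = min(start + ndp, n_samples)
        for i in range(end - 1, start - 1, -1):
            yield i
```

Because the training recursion consumes the block back-to-front as it arrives, no third RAM is needed to hold it.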
[Figure 3. Architecture of the proposed two-memory SISO unit.]

Data are read from the input buffer in blocks of length NDP, in reverse order within each block: in this way, the training window processing is performed by the ACS dummy section directly on the data read from the input buffer, without the need of storing a whole block.

Figure 4 gives a graphical representation of how the architecture works. The dash-dotted line stands for the reverse-order writing of the NDP-long blocks of input metrics into Mem1 or Mem2; the dashed one represents the calculation of the α metrics and their storage in Mem α. The dotted line indicates the training recursion (ACS dummy), while the solid one corresponds to the simultaneous β metric (ACS β) and output (ACS π) evaluations.

[Figure 4. Timing diagram of the two-memory SISO schedule.]

The scheme works in the following way: the first block is written into Mem1 in reverse order. Then the α recursion is performed on the previously stored data, storing the results in Mem α; simultaneously, the second block is written into Mem2 and the training window recursion is applied to it. NDP steps later, the β and α recursions are executed on the first and on the second block respectively, generating the output metrics for the first one. During this last operation the first block is no longer useful, so it can be replaced with the third one, on which the training window recursion is applied. These operations are then iterated, alternating Mem1 and Mem2 in the storage of the input metrics.

In order to evaluate the actual cost reduction offered by the proposed solution, two synthesizable VHDL models have been produced: one for the usual SISO structure with three memories [8] and one for the two-memory scheme. The models have been synthesized using a commercial logic synthesis tool for a Xilinx XCV1000 device. The test case refers to the 8-state convolutional code specified in the UMTS standard; 7 bits have been used for the representation of the soft inputs and 10 bits for the internal state metrics, and the sliding window length has been fixed after performance simulations. From the synthesis results, a total of 3012 LUTs has been obtained for the three-memory case, with an estimated clock frequency of 64 MHz; the two-memory solution reduces the LUT count to 2710 at the same frequency.

4. Implementations of the Interleaver
The interleaver is a memory in which data are read and written in different orders. One of the two operations, reading or writing, is performed sequentially from the start to the end of the data block, while the other one follows a scrambled sequence. There are two ways to generate the scrambled addresses: either store them in a memory or compute them at run time. If the addresses are stored in a ROM, the architecture is not reconfigurable, fixing both the data block length and the permutation law. Storing the sequence in a RAM is a more flexible solution, allowing a new permutation law to be downloaded each time the frame length changes. However this scheme has a cost in terms of area and power consumption, since the address RAM is usually larger than the data RAM in which the soft data are scrambled. As an example, a UMTS decoder must support any block length in the range 40 to 5120; assuming that scrambled data are represented on 7 bits, a 5120x7 bit RAM is required for the data and a 5120x13 bit RAM is needed for the addresses.

In the following, a low-cost run-time address generator is presented. It is based on the UMTS technical specifications, which state that the interleaver addresses may be computed through a high-complexity algorithm [2] consisting of the following main steps:

1. write an R x C matrix by row;
2. scramble the matrix by column and then by row;
3. read the matrix by column.

The numbers of rows R and columns C depend on the frame length K, and there are 183 possible different matrix sizes; this makes the choice of implementing the matrix itself with a memory impractical. A better solution is to transform the UMTS specification into mathematical operations to be performed on the address values. The three main specified operations are the row scrambling, the column scrambling and the pruning. The last one is due to the fact that the data to be scrambled are arranged in a matrix whose number of elements is almost always larger than the frame length K; the unused elements therefore have to be deleted from the interleaved sequence after the two scrambling operations have been applied. The row scrambling is the simplest operation, because the standard includes only four possible permutations. The column scrambling is more complex, because it is composed of two levels of permutation, in order to ensure a pseudo-random behavior of the interleaver. The pruning operation has the highest level of complexity, and it works by processing the addresses expressed as the composition of row and column indexes.
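The three steps plus pruning can be modeled directly as a functional reference (this is not the hardware generator of the next paragraphs, and the row/column permutations are placeholders for the rules defined in [2]):

```python
def block_interleave(k: int, rows: int, cols: int, row_perm, col_perm):
    """Reference model of a UMTS-style matrix interleaver.

    k        -- frame length (k <= rows * cols)
    row_perm -- permuted row order (placeholder for the standard's rule)
    col_perm -- col_perm[r] is the permuted column order used in row r
    Returns the pruned sequence of interleaved addresses.
    """
    # step 1: write the addresses 0..rows*cols-1 into the matrix by row
    matrix = [[r * cols + c for c in range(cols)] for r in range(rows)]
    # step 2: scramble by column (per-row permutation) ...
    matrix = [[matrix[r][col_perm[r][c]] for c in range(cols)]
              for r in range(rows)]
    # ... and then by row (inter-row permutation)
    matrix = [matrix[row_perm[r]] for r in range(rows)]
    # step 3: read by column, pruning the addresses beyond the frame
    out = []
    for c in range(cols):
        for r in range(rows):
            if matrix[r][c] < k:        # pruning: drop unused elements
                out.append(matrix[r][c])
    return out
```

With identity permutations the model reduces to a plain row-write/column-read block interleaver, which makes the pruning step easy to see: any matrix cell whose address falls outside the frame is simply skipped on readout.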
[Figure 5. Block diagram of the interleaver address generator: the Row & Column Generator derives Xr and Xc from the input address; the Column Scrambler and Row Scrambler produce Xc_new and Xr_new; the Address Generator forms the temporary address addrtmp; the Pruning Generator computes the offset prun, and the Pruning Unit outputs the interleaved address.]
The block diagram of the address generator is reported in Figure 5, where six constituent modules are shown. The first module generates the row address Xr and the column address Xc, while the two following modules perform the column scrambling and the row scrambling. The pruning generator computes the offset to be subtracted from each temporary address (elaborated by the address generator), while the final result is produced by the pruning unit. It is assumed that some parameters dependent on the frame length K (such as C and R) are evaluated and provided externally.

The computation of the row address Xr and the column address Xc is obtained with a simple circuit including two counters and two adders that perform the required integer division and modulo operation (Xr = floor(A / C), Xc = A mod C, where A is the current address). The scrambled column address ξc is obtained by evaluating the following expression:

\[ \xi_c = s\left[\left(q(X_r) \cdot X_c\right) \bmod (p - 1)\right] \]

where q is a sequence of prime numbers, s is a recursive function, and p is provided by the control unit. The q sequence is obtained by reading a small RAM addressed by Xr, and its output is multiplied by Xc. The modulo operation is performed using a linear algorithm and is implemented with a chain of 8 cells, each including an adder and a comparator. The output of this chain addresses a second RAM, which implements the recursive function s. The two RAMs have to be rewritten externally each time K changes. In order to obtain the actual scrambled column address Xc_new, a few simple logic operations are applied to ξc.

The UMTS specifications include only four possible row scrambling rules, two of which are simply a reversal of the row index order. Thus the row scrambling module can be implemented with a simple combinational circuit which calculates the new row address Xr_new. The address generator rebuilds the temporary interleaved address addrtmp from Xr_new and Xc_new; since it has to perform the reading of the matrix by columns, it evaluates the expression addrtmp = Xc_new x R + Xr_new. The pruning generation module calculates the value prun to be subtracted from addrtmp in order to obtain the pruned address. It can be proved that this value is always one of four consecutive integers, and it can be obtained using a 256-word RAM along with three adders.
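The linear modulo reduction mentioned above, realized in hardware as a chain of adder/comparator cells, can be sketched as repeated conditional subtraction of shifted multiples of the modulus. The cell count (8) follows the text; the operand ranges are assumptions of this sketch:

```python
def mod_chain(x: int, m: int, n_cells: int = 8) -> int:
    """Reduce x modulo m with a fixed chain of subtract-and-compare
    cells: cell i conditionally subtracts m * 2**i, for i from
    n_cells-1 down to 0. Valid whenever 0 <= x < m * 2**n_cells."""
    for i in range(n_cells - 1, -1, -1):
        step = m << i                 # shifted multiple of the modulus
        if x >= step:                 # comparator
            x -= step                 # adder (as a subtraction)
    return x
```

Each cell maps to one adder and one comparator, so the whole reduction is combinational and multiplier-free, which is what makes it cheap relative to a general divider.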
The last module, the pruning unit, performs the subtraction and finally provides the interleaved address. This scheme can easily be pipelined in order to reach the maximum throughput of 2 Mbit/s required by the UMTS standard, with a latency of a few clock cycles. The entire interleaver address generator has been synthesized using Synplify, mapping the architecture onto a Xilinx Virtex XCV1000 CG560; it uses about 10,500 gates (including RAMs and ROMs). The whole scheme can work at a frequency of 40 MHz (the multiplications are performed in a single clock cycle), while the clock frequency required to reach the maximum UMTS rate is 32 MHz. Compared to downloading a precalculated address sequence into a RAM, the presented implementation allows an area saving of about 85% (with a consequent reduction of the power consumption), at the cost of a slightly higher latency.
5. Conclusions

In this paper two new architectures for the implementation of UMTS standard compliant turbo decoders have been presented. First, a new scheme for the main block of the turbo decoder, the soft-input soft-output (SISO) unit, is proposed: it uses only two memories to store the input metrics, saving about 10% of the gate count of the SISO and decreasing the latency by 30%. Moreover, a new, low-cost architecture for the interleaver address generator is presented, with a gate count about 85% smaller than that of a RAM-based implementation.

References

[1] C. Berrou, A. Glavieux, P. Thitimajshima, "Near Shannon limit error correcting coding and decoding: Turbo codes", Proc. 1993 Int. Conf. on Communications, pp. 1064-1070, May 1993.
[2] 3GPP/UMTS, "Universal Mobile Telecommunication System (UMTS); Multiplexing and channel coding (TDD)", 3G TS 25.222, version 3.1.1, Release 1999.
[3] Consultative Committee for Space Data Systems, "Telemetry channel coding", Blue Book 101.0-B-4, CCSDS, May 1999.
[4] S. Benedetto, G. Montorsi, "Iterative decoding of serially concatenated convolutional codes", Electronics Letters, July 1996.
[5] S. Benedetto, D. Divsalar, G. Montorsi, F. Pollara, "Soft-input soft-output modules for the construction and distributed iterative decoding of code networks", European Trans. on Telecommunications, vol. 9, no. 2, March-April 1998.
[6] L.R. Bahl, J. Cocke, F. Jelinek, J. Raviv, "Optimal decoding of linear codes for minimizing symbol error rate", IEEE Trans. on Information Theory, pp. 284-287, March 1974.
[7] A.J. Viterbi, "An intuitive justification and a simplified implementation of the MAP decoder for convolutional codes", IEEE Journal on Selected Areas in Communications, vol. 16, no. 2, pp. 260-264, February 1998.
[8] G. Masera, G. Piccinini, M. Ruo Roch, M. Zamboni, "VLSI architectures for turbo codes", IEEE Trans. on VLSI Systems, September 1999.