# **Reduced-Complexity k-best Decoder for LTE Standard**

Shirly Edward.A<sup>1</sup> and Malarvizhi.S<sup>2</sup>

<sup>1</sup>Department of ECE,SRM University, Vadapalani, Chennai, India <sup>2</sup>Department of ECE, SRM University, Kattankulathur, Chennai, India. <sup>1</sup>edward.s@vdp.srmuniv.ac.in, <sup>2</sup>malarvizhi.g@ktr.srmuniv.ac.in

### Abstract

This paper presents a VLSI implementation of reduced -complexity and reconfigurable MIMO(Multiple-Input Multiple-Output) signal detector targeting 3GPP-LTE standard. In recent wireless communication system, MIMO technology is considered as the key technique in LTE to meet the target. Maximum Likelihood (ML) detection is the optimal detection algorithm for MIMO systems. FPGA implementation of ML detector becomes infeasible as its complexity grows exponentially with the increase in number of antennas. Therefore, we propose a modified K-best detector algorithm which employs parallel and distributed sorting strategy combined with bitonic sorter that has near-ML detection solution. The design was implemented targeting Xilinx Spartan 6 device and the resource utilization results are presented and the performance comparison with the literature was also done. The total on-chip power estimated is 213mW.

**Keywords**: MIMO (Multiple-Input Multiple-Output), LTE (Long Term Evolution), Maximum-Likelihood (ML),Sphere decoder, Real Valued Decomposition(RVD), Partial Euclidean Distance(PED), Very Large Scale Integrated Circuit(VLSI), Field Programmable Gate Array(FPGA)

# **1. Introduction**

Multiple-Input Multiple-Output (MIMO) system provides increased data throughput and link range by using multiple transmit and receive antennas without additional bandwidth and transmit power[1]. MIMO plays a key role in every new wireless standard, such as HSDPA, IEEE 802.11n wireless LAN, IEEE 802.16e ,WiMax and 3<sup>rd</sup> Generation Partnership Project (3GPP) Long Term Evolution (LTE) [2]. The main challenge in exploiting the potentials of MIMO technology [3] is the increase in data rates, higher capacity requirements and high computing power at the receiver end.

MIMO systems can be employed to improve the transmission quality by diversity techniques and spatial multiplexing methods, but separation of multiplexed streams of data is the main implementation challenge in terms of computational complexity and power consumption. Therefore an efficient VLSI implementation of the detector is the key to enable low power, high performance and low cost equipment. The Maximum – Likelihood (ML) detection algorithm using exhaustive search gives the optimal solution. To reduce the exponential complexity in ML detection, sphere decoders [4] are proposed to achieve near-ML performance with reasonable complexity. QR decomposition technique is used to reduce the complexity of ML by the process of tree search and pruning. Tree search algorithms can achieve near-ML performance but require significantly less complexity compared to the optimal ML method [5].

Algorithms like depth-first search, breadth –first search, fixed complexity sphere decoder and best first search are available for pruning the tree. In [6], Burg implemented the hard output depth-first search algorithm and Guo implemented the soft –output K-best decoder [7]. In [8], Wong proposed a pipelined VLSI architecture for the K-best algorithm. The parallel merge algorithm was proposed in [9] to increase the throughput of

the existing conventional K-best architectures. In literature, some researchers divide the tree search into several parts [10, 11] and implement the K-best algorithm but still the complexity of the search does not decrease further.

Although K-best decoder presents constant throughput, the computation of accumulated branch metrics in each layer of the tree and the selection of K metrics with lowest values becomes a challenge. Hence appropriate sorting strategies are required in order to accomplish the selection tasks in K-best decoder. In [12], the architectures for parallel sorting algorithms and their functionality with K-best sphere decoder were analyzed. These parallel sorting architectures were based on sorting strategies like Bubble sort [13] and Batcher-Sort. In [14], parallel insertion sorting algorithm was presented.

In this paper, we present the FPGA implementation of a sphere detector using modified K-best breadth first search algorithm. The algorithm supports 4-QAM with a combination of 2 antennas since the baseline in 3GPP-LTE is 2x2 antennas. The tree search methodology is modified to decrease the latency and hardware complexity. A parallel and distributed sorting strategy is employed in the intermediate layers to exploit the parallelism of the FPGA in order to achieve high data rates. In the last layer, Batchers merge based bitonic sorting is used to sort the accumulated metrics and choose the transmitted symbol set.

In Section 2 and 3 of this paper MIMO system model, Maximum Likelihood detection and K-best algorithm for sphere detection are discussed in brief. The real value decomposition technique and the tree formation are briefly presented in Section 4. Section 5 and 6 presents the modified K-best algorithm and its VLSI architecture. In Section 7 the FPGA implementation and results are discussed. Section 8 concludes the paper.

# 2. MIMO System Model

Consider a MIMO system model with  $N_T$  transmit and  $N_R$  receive antennas. The equivalent complex-valued discrete –time baseband model of the MIMO channel between the transmitter and receiver antennas is described by an  $N_T \times N_R$  dimensional matrix *H*. The  $N_R$  –dimensional received signal vector is given by

$$y = Hs + n \tag{1}$$

Where  $s = [s_1 s_2 \dots s_{N_T}]^T$  the N<sub>T</sub>-dimensional transmit signal vector and *n* stands for the N<sub>R</sub>-dimensional additive i.i.d circularly symmetric complex Gaussian noise vector with variance N<sub>o</sub> per complex-valued dimension. The entries of *s* are chosen independently from a set *o* of complex-valued constellation points with *Q* bits per scalar symbol, i.e.,  $|o| = 2^{\circ 0}$ . The set of all possible transmitted vector symbols forms the constellation denoted by  $o^{N_T}$ .

The Maximum Likelihood(ML) criterion for estimating s from y, by assuming perfect knowledge of the channel matrix H, is given by

$$\hat{s} = \arg \min \Box y - Hs \Box^2$$

$$s \in o^{N_T}$$
(2)

The ML solution can be obtained through an exhaustive search, which offers excellent performance in a MIMO system. However, it is highly impractical to perform an exhaustive search for a large MIMO system.

#### 2.1. Sphere Decoding Algorithm

Sphere Decoding (SD) algorithms are proposed to reduce the computational complexity by converting the exhaustive search into a tree search problem. An efficient pruning criterion is used to decrease the number of visited nodes. SD takes into account only the lattice points that are inside a sphere of a given radius r. The following inequality is referred to as the sphere constraint:

$$\Box y - Hs \Box^2 < C \tag{3}$$

Where *C* is the squared radius of the sphere and *y* is the center of the sphere.

The channel matrix H can be decomposed by QR decomposition, and then equation (3) can be rewritten as

$$\Box y' - Rx \Box^2 \le C' \tag{4}$$

Where  $C' = C - \Box (Q')^H y \Box^2$  and  $y' = Q^H y$ .

R is an upper triangular matrix with positive diagonal elements and Q is a matrix with orthogonal columns.

The basic idea behind the tree-search algorithms lies in the transformation of the ML detection problem into a tree search problem. In tree search algorithm the distance between the received vector y and the candidate received vector symbols Hs can be decomposed into partial Euclidean distances  $d_i$  which depend only on s. The symbol s increases strictly when proceeding from a parent node to one of its children. The algorithm finds the leaf node that is associated with the smallest  $d_i$  which corresponds to the ML solution.

## **3. K-Best Algorithm**

Based on the search strategy, the sphere detector algorithm can be divided into depthfirst and breadth–first detection algorithms. The K-best algorithm [7] proceeds with the breadth-first search technique which does not require a sphere constraint. Tree pruning is performed by constraining the cardinality of the set of admissible nodes on each level of the tree to a parameter K.The breadth–first algorithm searches for the PEDs in the forward direction only and the best K candidate symbol sets based on PEDs are retained at each level in the tree. The candidates are selected from the symbol set by giving precedence to those children which yield the smallest associated PEDs.

The main advantage of the K-best algorithm over the depth-first algorithm is its fixed complexity, which is determined by the parameter K. The choice of parameter K in the algorithm also entails a tradeoff between complexity and BER performance. If K is chosen to be very large, complexity and memory usage requirements are high. But if K is small, there is a chance of accidentally excluding the ML solution from the list of candidate vector symbol set. The value of K should be kept as large as possible without compromising on the optimality, compared with ML detection algorithm. Limiting the value of K can reduce the complexity of the breadth-first search detection algorithm.

### 4. Real –valued Decomposition

The received signal vector in equation (1) will be in complex domain. The sphere decoding algorithm discussed in the previous section can be applied only when the real and imaginary components of y, H and s are decoupled, to form a real system with equations which will have twice the dimension of the complex system [15]. Therefore, the received  $N_T$  dimensional complex-valued system model can be decomposed into an equivalent  $2N_T$  dimensional real-valued system model according to the following equation.

International Journal of Multimedia and Ubiquitous Engineering Vol. 10, No. 3 (2015)

$$\begin{bmatrix} \mathbf{R}(\mathbf{y}) \\ I(\mathbf{y}) \end{bmatrix} = \begin{bmatrix} \mathbf{R}(H) & -I(H) \\ I(H) & \mathbf{R}(H) \end{bmatrix} \begin{bmatrix} \mathbf{R}(s) \\ I(s) \end{bmatrix} + \begin{bmatrix} \mathbf{R}(n) \\ I(n) \end{bmatrix}$$
(5)

The real-valued decomposition technique transforms the original search tree into a tree with twice the depth.

| MIMO<br>System | Real-valued |                        | Complex-<br>valued |                        |
|----------------|-------------|------------------------|--------------------|------------------------|
| Configuration  | Depth       | No. of<br>Sub<br>nodes | Depth              | No. of<br>Sub<br>nodes |
| 2x2 ,4 QAM     | 4           | 2                      | 2                  | 4                      |
| 2x2,16 QAM     | 4           | 4                      | 2                  | 16                     |
| 4x4,16 QAM     | 8           | 4                      | 4                  | 16                     |

Table 1. Comparison of Real-valued and Complex-valued Systems

From the Table 1 it can be inferred that when operating directly on complex constellation points,  $K \mid O \mid$  PEDs must be calculated in each step whereas applying RVD technique reduces the number to only  $K \sqrt{\mid O \mid}$  PEDs. Therefore, the overall silicon complexity and power consumption of the individual processing element is much lower with RVD.

Figure 1 shows the decision tree structure for 4-QAM, 2x2 MIMO System. After real valued decomposition the tree contains four layers. The symbol estimation in equation (2) can be done by sequentially following the decision tree structure. At each node in the tree a decision has to be made whether to send +1 or a -1, such that the uppermost path through the tree corresponds to a sent sequence(+1,+1,+1,+1).

## 5. Modified K-best Algorithm

The breadth-first search algorithm can be modified to decrease the latency, by calculating two PEDs in parallel and discarding the larger one. In our design, the processing element computes PEDs of all the children of a single parent node in one cycle.

We employ parallel and distributed sorting strategy in our algorithm [16, 17]. The steps involved are

(1). Distribute the parent nodes into two different sets.

(2). Parallel comparison for these set of nodes are done using comparators and the node with smallest PED is assigned the best node.

(3). The path extension is done based on best children nodes.

International Journal of Multimedia and Ubiquitous Engineering Vol. 10, No. 3 (2015)



Figure 1. Decision Tree Structure for 4-QAM 2x2 MIMO System

Input : y, H,  $M_T=2N_T$ , H=QRAlgorithm:

(1).Initialize  $d_{i+1} = 0$  and  $i=M_T$ 

(2). Calculate the PEDs for the symbol set at level I using the equation.

$$d_{i} = d_{i+1} + \left\| y_{i} - \sum_{j=i}^{M_{T}-1} r_{i,j} s_{j} \right\|^{2}$$

(3).While (i>0) do

(4) if  $(i=M_T)$  then

(5) { Calculate  $d_i$  for K symbols  $s_{j}$ .

(6) Set  $s_j$  as parent node and continue}

(7) Else

(8) { Calculate  $d_i$  for all leaf nodes of  $s_i$ 

(9) Compare and select the best PED for each  $s_i$ 

(10)Expand the tree with the nodes of K best PEDs}

(11)end if

(12)end while

In the last layer, the PEDs obtained are sorted using bitonic sorting technique and the best node with minimum PED and its symbol set are given as output. This represents the hard decision output of the decoder. The parallel implementation allows the algorithm to give a fixed throughput.

International Journal of Multimedia and Ubiquitous Engineering Vol. 10, No. 3 (2015)



Figure 2. Block Diagram of the Modified K-best Algorithm

## 6. VLSI Architecture

The block diagram for the modified K-best sphere decoder is shown in Figure 2. The blocks were divided into separate units and therefore processing can be pipelined.

In the preprocessing unit, the inputs y and H are buffered, OR decomposition was performed and received vector y is multiplied with  $Q^{H}$ . The input vector and the channel matrix are doubled by real valued decomposition. The PE1 unit calculates the Partial Euclidean Distances with  $d_4 = \left\| y_4 - R_{4,4} s_4 \right\|^2$ , where  $s_4$  is the set of possible partial symbols. For layer i=3, PEDs transmitted are computed by  $d_3 = \left\| y_3 - R_{3,3}s_3 - R_{3,4}s_4 \right\|^2$  and the PED from the previous list is added to the result corresponding  $s_4$ . The PEDs of the children of respective  $s_4$  are compared and the smallest PED and its symbol set are fed as input to the next PE2 unit. In PE2 unit, the computation of  $d_2 = \left\| y_2 - R_{2,2}s_2 - R_{2,3}s_3 - R_{2,4}s_4 \right\|^2$  and the smallest PEDs from the previous layer are added to the result corresponding  $s_4 s_3$ . The PEDs of the children of respective  $s_4s_3$  are compared in parallel and the smallest PEDs and its symbol sets are fed to PE3 unit.

The PE3 unit computes  $d_1 = \|y_1 - R_{1,1}s_1 - R_{1,2}s_2 - R_{1,3}s_3 - R_{1,4}s_4\|^2$  and the smallest PEDs from the previous layer are added to the result corresponding  $s_4s_3s_2$ . The PEDs of the leaf node of respective  $s_4s_3s_2$  are compared in parallel and four smallest PEDs and its symbol sets are given as output to the bitonic sorter. The output PEDs are sorted in the bitonic sorting unit and the smallest PED and its symbol set are taken as the optimal estimate of the received symbol.

#### 6.1. Bitonic Sorter

It is an efficient merge-based algorithm for parallel sorting introduced by Batcher [18].For many years, bitonic sorting was considered to be the most practical parallel sorting algorithm. Figure 3 shows the bitonic sorter unit, where 4 comparators are used in total and 2 comparators are serially connected in the critical path. Figure 4 shows the blocks present in each compare and sort unit. Table 2 shows the complexity of different sorters that are used in literature [19] for sorting of PEDs in the sphere decoder. It can be observed that as n increases the complexity of bitonic sorter shows minimal increase compared to bubble sorter and insertion sorter which increases proportionately. Therefore we apply bitonic sorting in order to reduce the overall delay and silicon complexity.







Figure 4. Compare and Sort Unit

| Sorting<br>techniques | Complexity         | n = 4   | n=8     |
|-----------------------|--------------------|---------|---------|
| Bubble<br>Sorting     | O(n <sup>2</sup> ) | 16      | 64      |
| Insertion<br>sorter   | O(K.n)             | 8 (K=2) | 16(K=2) |
| Bitonic<br>sorter     | O(nlogn)           | 2.408   | 7.225   |

# 7. Implementation and Results

The modified K-best algorithm with the sorter unit has been implemented using Xilinx Plan Ahead Design tool [20]. In Plan ahead software, the implementation and timing results can be viewed to analyze critical logic and to improve the design performance with floor planning and constraint modification. The Xilinx Plan Ahead Design tool is used to implement and verify the proposed modified K-best algorithm and its VLSI architecture on the Xilinx Spartan 6 FPGA with word length of 8-bits with  $N_T = 2$ .

We had implemented a 2x2 MIMO system with 4-QAM modulation by taking Rayleigh fading channel environment. At the receiver end, the channel information is assumed to be known perfectly. The estimated complex channel matrix H is converted into real valued matrix through RVD, in order to reduce the complexity and silicon area of the algorithm. The resource utilization results were presented for the architecture in Table 3.

| Resources            | PED Unit   | Bitonic<br>Sorter |
|----------------------|------------|-------------------|
| Slice Registers      | 112        | 4                 |
| Slice LUTs           | 1205       | 57                |
| BUFG                 | 1          | 1                 |
| Maximum<br>Frequency | 183.065MHz | 243.072MHz        |

**Table 3. Resource Utilization** 

The PE1 and CS unit takes 19 clock cycles for PED calculation and sorting for layer 4 and layer 3. In layer 2, the PE2 and CS unit takes 25 clock cycles. In layer 1 the PE3 unit takes 27 clock cycles and bitonic sorter unit takes 2 clock cycles for merge and sorting as indicated in Table 4.

| Blocks              | PED Calculation     | Sorting complexity  |
|---------------------|---------------------|---------------------|
|                     | (# of clock cycles) | (# of clock cycles) |
| PE1&CS              | 15                  | 4                   |
| PE2&CS              | 21                  | 4                   |
| PE3& bitonic sorter | 27                  | 2                   |

| Table 4 | 4. Latenc | y Comparison |
|---------|-----------|--------------|
|---------|-----------|--------------|

From the timing analysis, it is found that the minimum time period taken for the design is 5.463 ns. Therefore, the maximum achievable clock frequency for PED unit is found to

be 183.065MHz and for bitonic sorter unit it is 243.072MHz The power estimation of the design mapped on a FPGA device can be found using Xpower Analyzer Tool. The total on-chip power of the design is 213mW for PED unit and 108mW for bitonic sorter unit.

| Parameters        | This work | [21]      |
|-------------------|-----------|-----------|
| Device            | Spartan-6 | Virtex- 2 |
| Antennas          | 2x2       | 2x2       |
| Modulation        | 4QAM      | 4QAM      |
| Throughput        | 27Mbps    | 8Mbps     |
| PED<br>Complexity | 1317      | 918       |

Table 5. Comparison between K-Best SD-Implementations

The maximum throughput of the K-best detection with reference to [22] is calculated by the Eq.(6)

Throughput= 
$$f_c \frac{\log_2(M).N_T}{C}$$
 (6)

Where  $f_c$  is the maximum clock frequency, M is the constellation size,  $N_T$  is the antenna number and C is the number of clock cycles needed for calculating the PEDs in the last layer. For our K-best detector design, the parameter C = 27. Therefore, the maximum throughput achieved for the design would be 27Mbps. In Table 5, our proposed work is compared with previous K-best Sphere Decoder implementation, it shows clearly that our design improves the throughput of the detector with a slight increase in the resource utilization.

#### 8. Conclusion

In this paper, we presented a reconfigurable VLSI architecture for the proposed modified K-best algorithm targeting 3GPP-LTE standard. The 2x2, 4-QAM sphere decoder was implemented, which forms the baseline of LTE standard. The algorithm includes bitonic sorter in the last layer apart from parallel and distributed sorting in the previous layers. The hardware complexity reduces to a greater extent because less number of comparators is required in the critical path. In practice, the sphere detector will be attached with channel decoders to provide robustness against fading channels and noisy wireless environment. Our future work will be to implement the algorithm for higher constellation with different antenna configurations with further increase in throughput for emerging wireless communication standards.

## References

- A.J. Paulraj, D.A. Gore, R.U. Nabar and H. Bolcskei, "An overview of MIMO Communications-a key to gigabit wireless", Proceedings of IEEE, vol.92, no. 2, (2004), pp.198-218.
- [2] Overview of 3GPP Release 10 V 0.0.8(2010-09), http://www.3gpp.org/ftp/Information/Work-Plan/Description Releases/Rel-10.description 20100924.zip
- [3] Q. Li et.al, "MIMO Techniques in WiMax and LTE:A feature overview", IEEE Communication Magazine, vol. 48, no.5, (2010), pp.86-92.
- [4] B.Hassibi and H.Vikalo, "On the sphere-decoding algorithm I:expected complexity", IEEE Transactions on signal processing, vol. 53, (2005), pp. 2806.
- [5] L. Barbero and J. Thompson, "A fixed-complexity MIMO detector based on the complex sphere decoder", Proceedings of IEEE International workshop on Signal Processing and Advances in Wireless

Communication, (2006), pp.1-5.

- [6] A. Burg, M. Borgmann, M. Wenk, W. Zellweger, W. Fitchner and H. Bolcskei, "VLSI implementation of MIMO detection using sphere decoding algorithm", IEEE Journal of Solid-State Circuits, vol. 40, (2005), pp. 1566.
- [7] Z. Guo and P. Nilsson, "Algorithm and implementation of the K-best sphere decoding for MIMO detection", IEEE Journal on Selected Areas in Communications, vol. 24, (2006), pp. 491-50.
- [8] K. Wong, C. Tsui, R-K. Cheng and W. Mow, "A VLSI architecture of a K-best lattice decoding algorithm for MIMO channels", Proceedings of IEEE International Symposium in Circuits and Systems, (2002), pp.273-276.
- [9] N. Moezzi-Madani, W. Davis, "Parallel merge algorithm for high-throughput signal processing applications", Electronics Letters, vol.45, (2009), pp.188.
- [10] P. Radosavljevic, K.J. Kim, "Parallel searching-based sphere detector for MIMO downlink OFDM systems", IEEE Transactions on Signal Processing, vol. 60, no.6, (**2012**), pp. 3240-3252.
- [11] C.A. Shen and A.M. Eltawil, "A radius adaptive K-best decoder with early termination", IEEE Transactions on circuits and Systems I, vol. 57, no.9, (2010), pp. 2476-2486.
- [12] P. Cervantes-Lozano, L.F.Gonzalez-Perez and A. D. Garcia-Garcia, "Analysis of Parallel Sorting Algorithms in K-best Sphere decoder Architectures for MIMO Systems", IEEE International Conference on Reconfigurable Computing and FPGAs, (2011), pp.321-326.
- [13] W. Min, "Analysis on bubble sort algorithm optimization", International Forum on Information Technology and Applications, (2010), pp.208-211.
- [14] P.A. Bengough and S.J. Simmons, "Sorting-based VLSI architectures for the M-algorithm and Talgorithm trellis decoders", IEEE Transactions on Communications, vol. 43, no. 234, (1995), pp. 514-522.
- [15] B. M. Hochwald and S. T. Brink, "Achieving Near-Capacity on a Multiple-Antenna channel", IEEE Transactions on Communications, vol. 51, no. 3, (2003), pp. 389-399.
- [16] B.Kim and I-C.Park, "K-best MIMO detection based on interleaving of distributed sorting", Electronics Letters, vol.44, no.1, (2008)
- [17] S.chen, T.Zhang and Y.Xin, "Relaxed K-best MIMO signal detector design and VLSI Implementation", IEEE Transactions on Very Large Scale Integration Systems, vol. 15, no.3, (2007), pp. 328-337.
- [18] K.E. Batcher, "Sorting networks and their applications", Proceedings AFIPS Spring Joint Comput., Conference, vol. 32, (1968), pp.307-314.
- [19] G. E. Blelloch *et.al.*, "A Comparison of sorting algorithms for the connection machine CM-2", Proceedings of Symposium on parallel Algorithms and Architectures, (1991)
- [20] "Xilinx Plan Ahead Design Tool, Reference Guide", www.Xilinx.com.
- [21] J. Kerttula, M. Myllyla and M. Juntti, "Implementation aspects of list sphere detector algorithms", Proceedings of the IEEE Global Telecommunications Conference (2007), pp.3915-3920.
- [22] L. Liu, J. Lofgren and P. Nilsson, "Area-Efficient configurable High-throughput signal detector supporting Multiple MIMO modes", IEEE Transactions on circuits and systems –I:Regular Papers, vol. 59, no. 9, (2012), pp. 2085-2096.