# VLSI Architecture for Discrete Cosine Transform using Modified CORDIC Algorithm

Sapna Tiwari<sup>1</sup>, Dr. Payal Suhane<sup>2</sup>

M. Tech. Scholar, Department of Electronics and Communication Engineering, Vidhyapeeth Institute of Science and Technology, Bhopal, M.P.<sup>1</sup>

Director, Vidhyapeeth Institute of Science and Technology, Bhopal, M.P.<sup>2</sup>

Abstract: In recent years many architectures for the discrete cosine transform (DCT) have been proposed, and CORDIC (Coordinate Rotation Digital Computer) processor based designs have been found well suited and convenient for DCT implementation. CORDIC is a shift-and-add based principle for vector rotation and plane rotation, used mainly for the calculation of trigonometric and hyperbolic functions. A CORDIC based architecture works iteratively, in a regular digit-by-digit manner, and uses only addition, subtraction, bit shifting and a lookup table. The proposed architecture comprises input adders and subtractors, a CORDIC module and output elements. It is simulated for an 8-point DCT and synthesized using Xilinx ISE 14.1i with a Virtex-5 device (xc5vfx100t-3ff1738) as the target, on which it can operate at a maximum frequency of 184.556 MHz.

**Keywords: - DCT, CORDIC, Shift and Add, Virtex-5, Number of Slice**

## I. INTRODUCTION

DCT based coding and decoding systems play a dominant role in real-time applications in science and engineering, such as audio and image processing. However, the DCT is computationally intensive. The 1-D DCT has been recommended by standards organizations such as the Joint Photographic Experts Group (JPEG) [1]. The standards developed by these groups help manufacturers develop real-time 1-D DCT chips for use in various image transmission and storage systems [2].
VLSI DCT processor chips have become indispensable in real-time coding systems because of their fast processing speed and high reliability. JPEG has defined an international standard for the coding and compression of continuous-tone still images, commonly referred to as the JPEG standard [3]. The primary aim of the JPEG standard is to propose an image compression algorithm that is generic, application independent and suited to VLSI implementation of data compression. As the DCT core becomes a critical part of an image compression system, close study of its performance and implementation is worthwhile. Application specific requirements are the basic concern in its design.

Data communication techniques advanced significantly in the last decade, and the explosive growth of the Internet increased the demand for multimedia [4]. Video and audio data streams require a huge bandwidth when transferred in uncompressed form. Several ways of compressing multimedia streams have evolved; some of them use the Discrete Cosine Transform (DCT) for transform coding and its inverse (IDCT) for transform decoding.

Image compression is a useful topic in the digital world. A digital image bitmap can contain a considerably large amount of data, causing substantial overhead in both computational complexity and data processing. Storage media offer very large capacity; however, access speeds are typically inversely proportional to capacity. Compression is a must for managing large amounts of data on networks, the Internet or storage media [5]. Compression techniques have been studied for years and will continue to improve. Typically, image and video compressors and decompressors (CODECs) are implemented mainly in software, as signal processors can manage these operations without incurring too much computational overhead [6]. However, these complex operations can also be implemented efficiently in hardware.
Hardware specific CODECs can be integrated into digital systems fairly easily. Improvements in speed occur primarily because the hardware is tailored to the compression algorithm rather than built to handle a broad range of operations like a digital signal processor. Data compression itself is the process of reducing information to a smaller data set that can be used to represent and reproduce the information [7]. Image compression is either lossless or lossy, and the choice depends on the needs of the specific application.

The proposed work is a realization of a low-delay 1-D DCT for image compression. Because the architecture uses row-column decomposition, the number of calculations for processing an 8x8 block of pixels is reduced. The 1-D DCT operation is expressed as an addition of vector-scalar products, and basic common computations are identified and shared to reduce computational complexity. Compared with a distributed arithmetic based architecture, the proposed DCT has less delay. The low-delay DCT cores are implemented in VHDL.

## II. LITERATURE REVIEW

Mahendra Vucha et al. [1]: The design of low-complexity computing hardware for trigonometric and exponential functions has become a challenge for many researchers. The CORDIC algorithm has emerged as an efficient way to compute trigonometric, exponential and logarithmic functions. As multimedia applications demand high speed and parallel computation of data, the CORDIC algorithm is a suitable computing technique for such applications. The article presents a hardware efficient architecture for the Discrete Cosine Transform (DCT) and evaluates it by executing an image processing application. The proposed architecture was realized in Verilog HDL and implemented on an Artix-7 FPGA to capture design attributes such as speed and design complexity, which were compared with simple DCT architectures.

Choudhary Sadhana et al. [2]: In FPGA systems, memory is one of the major limiting components for handling large amounts of data. FPGAs have limited on-chip memory, so resources must be used efficiently in order to meet system constraints such as power, size and on-demand performance. Several on-chip data compression techniques have been studied by researchers, and their development continues. In the result analysis section, the authors provide a comparison with existing coding techniques such as JPEG and the MQ-coder (i.e., the JPEG2000 standard). On a Xilinx FPGA, the considered ABRC architecture gives reduced memory, consumes less power and has a similar operating frequency.

I. Tsounis et al. [3]: The authors assess the error resilience of an image data compression IP core, an FPGA-based accelerator of the CCSDS 121.0-B-2 algorithm used to compress the ESA PROBA-3 ASPIICS Coronagraph System Payload image data. They enhanced a fault injection platform previously proposed for the SEU evaluation of FPGA soft processor cores so that it interfaces with the target image data compression IP core and computes the image quality metrics needed for failure analysis. Through an extensive fault injection campaign, they analyze the vulnerability of the image compression core to Single Event Upsets (SEUs) in the configuration memory of an SRAM FPGA.

The authors have compared the performance of 4x4 transforms with the conventional 8x8 DCT in floating point. First, the conventional 4x4 DCT in floating point is compared with the conventional 8x8 DCT in floating point. Next, the 4x4 integer transform is compared with the conventional 8x8 DCT in floating point. The comparison covers the computation time of the transform and inverse transform, and objective quality based on the PSNR between the input and reconstructed images.
The authors have concluded that the integer transform approximation of the DCT reduces the computational time considerably.

## III. DISCRETE COSINE TRANSFORM

The Discrete Cosine Transform (DCT) is one of the most widely used transform algorithms. The DCT, first proposed by Ahmed et al. [9] in 1974, has gained importance in recent years, in particular in the fields of image compression and video compression. This section focuses on efficient hardware implementation of the DCT by reducing the number of computations, improving the accuracy of reconstruction of the original data, and reducing chip area, as a result of which power consumption also decreases. DCT based hardware also improves speed compared with other standard image compression algorithms.

Figure 1: 8-point Discrete Cosine Transform

### DCT output

$$F(0) = 0.5\,[f(0) + f(1) + f(2) + f(3) + f(4) + f(5) + f(6) + f(7)]\cos\frac{\pi}{4}$$

$$F(1) = 0.5\left[\{f(0) - f(7)\}\cos\frac{\pi}{16} + \{f(1) - f(6)\}\cos\frac{3\pi}{16} + \{f(2) - f(5)\}\cos\frac{5\pi}{16} + \{f(3) - f(4)\}\cos\frac{7\pi}{16}\right]$$

$$F(2) = 0.5\left[\{f(0) - f(3) - f(4) + f(7)\}\cos\frac{2\pi}{16} + \{f(1) - f(2) - f(5) + f(6)\}\cos\frac{6\pi}{16}\right]$$

$$F(3) = 0.5\left[\{f(0) - f(7)\}\cos\frac{3\pi}{16} + \{f(6) - f(1)\}\cos\frac{7\pi}{16} + \{f(5) - f(2)\}\cos\frac{\pi}{16} + \{f(4) - f(3)\}\cos\frac{5\pi}{16}\right]$$

$$F(4) = 0.5\left[\{f(0) + f(3) + f(4) + f(7) - f(1) - f(2) - f(5) - f(6)\}\cos\frac{\pi}{4}\right]$$

$$F(5) = 0.5\left[\{f(0) - f(7)\}\cos\frac{5\pi}{16} + \{f(6) - f(1)\}\cos\frac{\pi}{16} + \{f(2) - f(5)\}\cos\frac{7\pi}{16} + \{f(3) - f(4)\}\cos\frac{3\pi}{16}\right]$$

$$F(6) = 0.5\left[\{f(0) - f(3) - f(4) + f(7)\}\cos\frac{6\pi}{16} - \{f(1) - f(2) - f(5) + f(6)\}\cos\frac{2\pi}{16}\right]$$

$$F(7) = 0.5\left[\{f(0) - f(7)\}\cos\frac{7\pi}{16} + \{f(6) - f(1)\}\cos\frac{5\pi}{16} + \{f(2) - f(5)\}\cos\frac{3\pi}{16} + \{f(4) - f(3)\}\cos\frac{\pi}{16}\right]$$

## IV. CORDIC ALGORITHM

The simple form of CORDIC is based on the observation that if a unit length vector starting at $(x, y) = (1, 0)$ is rotated by an angle $\alpha$, its new end point will be at $(x, y) = (\cos\alpha, \sin\alpha)$. The cosine and sine can therefore be computed by finding the coordinates of the new end point of the vector after rotation by the angle $\alpha$.

Figure 2: Block Diagram of CORDIC Processor

The rotation of any $(x, y)$ vector is described by the basic equations of the CORDIC algorithm:

$$x_{i+1} = x_i \cos(\alpha) - y_i \sin(\alpha) \tag{1}$$

$$y_{i+1} = y_i \cos(\alpha) + x_i \sin(\alpha) \tag{2}$$

Rearranging the equations gives

$$x_{i+1} = \cos(\alpha)\,[x_i - y_i \tan \alpha] \tag{3}$$

$$y_{i+1} = \cos(\alpha)\,[y_i + x_i \tan \alpha] \tag{4}$$

where

$$\tan \alpha = \frac{\sin \alpha}{\cos \alpha}$$

## V. PROPOSED METHODOLOGY

The algorithm performs its computation by decomposing a transform of size N into two equal transforms of size N/2 at each stage. The small element that combines these results to compute the DCT is known as a DCT butterfly unit of size 2. The flow graph of the proposed DCT architecture is shown in Figure 3.

Figure 3: Flow Chart of the Proposed DCT Architecture

Step-I: - The binary input block is a signal conditioning stage that interfaces to the serial-in serial-out shift register. All integer inputs are applied to the DCT architecture in binary form. The input range depends on the word length; for example, a binary input declared as (3 downto 0) covers the range 0 to 15.

Step-II: - The second block of the proposed DCT architecture is the serial-in serial-out shift register, which is built from flip-flops. The register is first cleared, forcing all of its outputs to zero. The input data is then applied serially to the input of the leftmost flip-flop. On every clock pulse, one bit is shifted from left to right.
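The shift behaviour described in Step-II can be modelled in a few lines of Python. This is only a behavioural sketch, not the VHDL used in the proposed design; the class name `SisoShiftRegister`, the 4-bit width (chosen to match the 4-bit word example of Step-I) and the test pattern are assumptions made for illustration.

```python
class SisoShiftRegister:
    """Behavioural model of a serial-in serial-out shift register."""

    def __init__(self, width=4):
        # A cleared register: every flip-flop output is zero.
        self.stages = [0] * width

    def clear(self):
        """Force all flip-flop outputs back to zero."""
        self.stages = [0] * len(self.stages)

    def clock(self, serial_in):
        """Apply one clock pulse and return the bit that leaves the register.

        The new bit enters the leftmost stage and every stored bit moves
        one position to the right.
        """
        serial_out = self.stages[-1]
        self.stages = [serial_in & 1] + self.stages[:-1]
        return serial_out


reg = SisoShiftRegister(width=4)
bits_in = [1, 0, 1, 1]                      # illustrative input pattern
# Clock in the pattern, then four zeros to flush the register.
bits_out = [reg.clock(b) for b in bits_in + [0, 0, 0, 0]]
print(bits_out)
```

With a 4-stage register the input pattern reappears at the output four clock pulses later, which is the delay the hardware shift register introduces.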
Step-III: - The third block of the proposed DCT architecture is the decision block. According to the input number, the block selects whether the data goes to the adder or the subtractor. The conditions applied in the decision block are based on the common terms of the DCT output equations.

Step-IV: - Depending on the outcome of the decision block, the adder or the subtractor block is used.

Step-V: - The last stage of the algorithm is the CORDIC algorithm. The CORDIC block handles two inputs per clock, so two output samples are processed per clock cycle. The advantage of the CORDIC technique is that it minimizes the delay of the overall system.

Figure 4 shows the DCT architecture using CORDIC, which has eight inputs from f(0) to f(7). The inputs are paired as f(0) with f(7), f(1) with f(6), f(2) with f(5) and f(3) with f(4), because common terms are shared within each pair. The architecture uses seven adders, nine subtractors and six CORDIC blocks. Each CORDIC block relies on rotation, shifting and addition, and is identified by a selection line.

Figure 4: Architecture for DCT using CORDIC

Table 1 shows how the shift index i is selected during each iteration of the equations; each entry lists the rotation direction followed by the shift index. If the rotation is positive, the vector is rotated counterclockwise: the X variable is reduced by a fraction of the Y variable, and the Y variable is incremented by a fraction of the X variable. If the angle is negative, the opposite operation is performed on each variable.

Table 1: Rotation Parameter for CORDIC Algorithm

| Processor | CORDIC(1) | CORDIC(2) | CORDIC(3)(6) | CORDIC(4)(5) |
|-----------|-----------|-----------|--------------|--------------|
| Angle     | π/4       | 3π/8      | 7π/16        | 3π/16        |
| 1         | +1, 0     | +1, 0     | +1, 0        | +1, 1        |
| 2         |           | +1, 2     | +1, 1        | +1, 3        |
| 3         |           | +1, 3     | +1, 3        | +1, 10       |
| 4         |           | +1, 6     | +1, 10       |              |
| 5         |           | +1, 7     |              |              |
| 6         |           | +1, 9     |              |              |

## VI. SIMULATION RESULT

In the experiment, eight 16-bit inputs f0, f1, f2, f3, f4, f5, f6 and f7 are simulated through a VHDL test bench in Xilinx ISE 14.1i for the DCT calculation. The final outputs are displayed in Table 5.1 as k1, k2, k3, k4, k5, k6, k7 and k8.

Figure 5: Register Transfer Level (RTL) View of 8-point DCT

The proposed DCT implementation using the CORDIC algorithm uses 342 slice registers, compared with 1102 for the previous DCT implementation using a multiplier based algorithm. The reported improvement over the previous algorithm in the number of slice registers is 31.03%.

The proposed DCT implementation using the CORDIC algorithm uses fewer LUTs, compared with 2551 for the previous multiplier based DCT implementation. The reported improvement over the previous algorithm in the number of LUTs is 51.28%.

The proposed DCT implementation using the CORDIC algorithm reaches a maximum frequency of 184.556 MHz, compared with 224.9 MHz for the previous multiplier based DCT implementation. The reported difference from the previous algorithm in maximum frequency is 17.09%.

The proposed DCT implementation using the CORDIC algorithm uses 238 IOBs, compared with 1588 for the previous multiplier based DCT implementation. The reported improvement over the previous algorithm in the number of IOBs is 74.35%.

The bar graph comparing the DCT architectures by percentage gain is given at the end of this section.
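When checking simulated outputs, a floating-point software model can serve as a reference. The Python sketch below is not the proposed VHDL design. It evaluates the 8-point DCT of Section III by direct summation, then recomputes F(2) and F(6) with a single CORDIC rotation of the shared terms by -π/8, following equations (1) and (2) of Section IV. It uses conventional iterative CORDIC with 32 iterations, which is an assumption of this sketch and differs from the fixed rotation parameters of Table 1. The function names and the sample input are illustrative only.

```python
import math


def dct8_direct(f):
    """8-point DCT with the scaling of Section III, by direct summation."""
    out = []
    for k in range(8):
        scale = math.cos(math.pi / 4) if k == 0 else 1.0
        acc = sum(f[n] * math.cos((2 * n + 1) * k * math.pi / 16)
                  for n in range(8))
        out.append(0.5 * scale * acc)
    return out


def cordic_rotate(x, y, angle, iterations=32):
    """Rotate (x, y) by `angle` radians with shift-and-add style updates.

    Conventional iterative CORDIC (an assumption of this sketch).
    Converges for |angle| up to roughly 1.74 rad.
    """
    z = angle      # residual angle still to be rotated
    gain = 1.0     # accumulated CORDIC gain, divided out at the end
    for i in range(iterations):
        sigma = 1.0 if z >= 0 else -1.0
        shift = 2.0 ** -i                  # hardware: right shift by i bits
        x, y = x - sigma * y * shift, y + sigma * x * shift
        z -= sigma * math.atan(shift)      # hardware: lookup table entry
        gain *= math.sqrt(1.0 + shift * shift)
    return x / gain, y / gain


def dct8_f2_f6_cordic(f):
    """F(2) and F(6) from one rotation of the shared terms by -pi/8."""
    a = f[0] - f[3] - f[4] + f[7]
    b = f[1] - f[2] - f[5] + f[6]
    xr, yr = cordic_rotate(a, b, -math.pi / 8)
    return 0.5 * xr, -0.5 * yr


sample = [12, 7, 3, 25, 14, 9, 30, 5]      # arbitrary illustrative input
reference = dct8_direct(sample)
f2, f6 = dct8_f2_f6_cordic(sample)
print(abs(f2 - reference[2]) < 1e-6, abs(f6 - reference[6]) < 1e-6)
```

The rotation works because cos(6π/16) equals sin(2π/16), so the pair F(2), F(6) is a plane rotation of the two shared sums. A fixed-point model with the shift indices of Table 1 would be closer to the hardware, at the cost of a small approximation error in the angle.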
```
Timing Summary:
-------
Speed Grade: -3

Minimum period: 5.418ns (Maximum Frequency: 184.556MHz)
Minimum input arrival time before clock: 10.476ns
Maximum output required time after clock: 11.789ns
Maximum combinational path delay: 13.795ns
```

Figure 6: Timing Summary of Proposed 8-point DCT

Figure 7: Output Waveform of Proposed 8-point DCT

Table 2: Device Utilization for DCT Algorithm

|                             | Previous Design | Proposed Design |
|-----------------------------|-----------------|-----------------|
| Number of Registers         | 742             | 342             |
| Number of Slice LUTs        | 1921            | 1303            |
| LUT-FF Pairs                | 412             | 236             |
| Number of IOBs              | 258             | 258             |
| Maximum Operating Frequency | 201.4 MHz       | 184.556 MHz     |

Figure 8: Bar Graph of the 8-point DCT

## VII. CONCLUSION

In this work, an 8-point DCT is implemented using the CORDIC algorithm, and its slice registers, slice LUTs, IOBs and maximum frequency are measured and compared with a previous design. The work is summarized as follows. Highly programmable, low-delay and efficient CORDIC based DCT designs were presented, verified and compared with similar logic structures already published. The new DCT designs are more efficient in implementation than previously published work, which is desirable for digital image processing. Compared with the previous algorithm, the reported figures for the implemented designs are 30.03% for slice registers, 48.72% for slice LUTs, 51.28% for fully used LUT-FF pairs, 74.35% for IOBs and 17.9% for maximum frequency. This greatly reduces the area compared with the previous algorithm.

## REFERENCES

- [1] Mahendra Vucha and A. L. Siridhara, "CORDIC Architecture for Discrete Cosine Transform", 6th International Conference on Communication and Electronics Systems, IEEE 2021.
- [2] Choudhary Sadhana and Sarika Raga, "A Comparative Analysis at Binary Arithmetic Coders on FPGA System", International Conference on Industry 4.0 Technology, pp. 136-140, IEEE 2020.
- [3] I. Tsounis and M. Psarakis, "Analyzing the Resilience to SEUs of an Image Data Compression Core in a COTS SRAM FPGA", NASA/ESA Conference, Colchester, UK, pp. 17-24, 2019.
- [4] I. S. Morina and P. D. P. Silitonga, "Compression and Decompression of Audio Files Using the Arithmetic Coding Method", Scientific Journal of Informatics, 6-1, 2019.
- [5] Jiajia Chen, Shumin Liu, Gelei Deng and Susanto Rahardja, "Hardware Efficient Integer Discrete Cosine Transform for Efficient Image/Video Compression", IEEE Access, Vol. 07, 2019.
- [6] Trong-Thuc Hoang, Cong-Kha Pham and Duc-Hung Le, "High-speed 8/16/32-point DCT Architecture using Fixed-rotation Adaptive CORDIC", IEEE 2018.
- [7] S. U. Uvaysov, V. A. Kokovin and S. S. Uvaysova, "Real-time sorting and lossless compression of data on FPGA", 2018 MWENT, Moscow, pp. 1-5, 2018.
- [8] Mamatha I, Nikhita Raj J, Shikha Tripathi and Sudarshan TSB, "Systolic Architecture Implementation of 1D DFT and 1D DCT", IEEE 2015.
- [9] J. E. Volder, "The CORDIC trigonometric computing technique", IRE Trans. Electron. Comput., Vol. EC-8, No. 3, pp. 335-339, Sept. 1959.
- [10] Liyi Xiao and Hai Huang, "Novel CORDIC Based Unified Architecture for DCT and IDCT", 2012 International Conference on Optoelectronics and Microelectronics (ICOM), IEEE 2012.
- [11] Shymna Nizar N. S, Abhila and R Krishna, "An Efficient Folded Pipelined Architecture for Fast Fourier Transform Using CORDIC Algorithm", 2014 IEEE International Conference on Advanced Communication Control and Computing Technologies (ICACCCT), IEEE.
- [12] E. Jebamalar Leavline, S. Megala and D. Asir Antony Gnana Singh, "CORDIC Iterations Based Architecture for Low Power and High Quality DCT", 2014 International Conference on Recent Trends in Information Technology, IEEE 2014.
- [13] Hyeonuk Jeong et al., "Low-Power Multiplierless DCT Architecture Using Image Data Correlation", IEEE Transactions on Consumer Electronics, Vol. 50, No. 1, February 2004.
- [14] Syed Ali Khayam, "The Discrete Cosine Transform (DCT): Theory and Application", Department of Electrical and Computer Engineering, Michigan State University, March 10, 2003.
- [15] Satyasen Panda, "Performance Analysis and Design of a Discrete Cosine Transform Processor Using CORDIC Algorithm", 2008-2010.
- [16] Behrooz Parhami, "Computer Arithmetic: Algorithms and Hardware Designs", Oxford University Press Inc., 198 Madison Avenue, New York, 2000.