TY - JOUR
T1 - A VLSI array processing oriented fast fourier transform algorithm and hardware implementation
AU - Liu, Zhenyu
AU - Song, Yang
AU - Ikenaga, Takeshi
AU - Goto, Satoshi
PY - 2005/12
Y1 - 2005/12
N2 - Many parallel Fast Fourier Transform (FFT) algorithms adopt multiple stages architecture to increase performance. However, data permutation between stages consumes volume memory and processing time. One FFT array processing mapping algorithm is proposed in this paper to overcome this demerit. In this algorithm, arbitrary 2k butterfly units (BUs) could be scheduled to work in parallel on n = 2s data (k = 0, 1,⋯, s - 1). Because no inter stage data transfer is required, memory consumption and system latency are both greatly reduced. Moreover, with the increasing of BUs, not only does throughput increase linearly, system latency also decreases linearly. This array processing orientated architecture provides flexible tradeoff between hardware cost and system performance. In theory, the system latency is (s × 2s-k) × tclk and the throughput is n/(s x 2s-k × tclk), where tclk is the system clock period. Based on this mapping algorithm, several 18-bit word-length 1024-point FFT processors implemented with TSMC0.18μm CMOS technology are given to demonstrate its scalability and high performance. The core area of 4-BU design is 2.991 × 1.121 mm2 and clock frequency is 326 MHz in typical condition (1.8 V, 25°C). This processor completes 1024 FFT calculation in 7.839 μs.
AB - Many parallel Fast Fourier Transform (FFT) algorithms adopt multiple stages architecture to increase performance. However, data permutation between stages consumes volume memory and processing time. One FFT array processing mapping algorithm is proposed in this paper to overcome this demerit. In this algorithm, arbitrary 2k butterfly units (BUs) could be scheduled to work in parallel on n = 2s data (k = 0, 1,⋯, s - 1). Because no inter stage data transfer is required, memory consumption and system latency are both greatly reduced. Moreover, with the increasing of BUs, not only does throughput increase linearly, system latency also decreases linearly. This array processing orientated architecture provides flexible tradeoff between hardware cost and system performance. In theory, the system latency is (s × 2s-k) × tclk and the throughput is n/(s x 2s-k × tclk), where tclk is the system clock period. Based on this mapping algorithm, several 18-bit word-length 1024-point FFT processors implemented with TSMC0.18μm CMOS technology are given to demonstrate its scalability and high performance. The core area of 4-BU design is 2.991 × 1.121 mm2 and clock frequency is 326 MHz in typical condition (1.8 V, 25°C). This processor completes 1024 FFT calculation in 7.839 μs.
KW - Array processing
KW - Fast fourier transform (FFT)
KW - Singleton algorithm
UR - http://www.scopus.com/inward/record.url?scp=29144533436&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=29144533436&partnerID=8YFLogxK
U2 - 10.1093/ietfec/e88-a.12.3523
DO - 10.1093/ietfec/e88-a.12.3523
M3 - Article
AN - SCOPUS:29144533436
SN - 0916-8508
VL - E88-A
SP - 3523
EP - 3529
JO - IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences
JF - IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences
IS - 12
ER -