TY - GEN
T1 - A 45nm 37.3GOPS/W heterogeneous multi-core SoC
AU - Yuyama, Yoichi
AU - Ito, Masayuki
AU - Kiyoshige, Yoshikazu
AU - Nitta, Yusuke
AU - Matsui, Shigezumi
AU - Nishii, Osamu
AU - Hasegawa, Atsushi
AU - Ishikawa, Makoto
AU - Yamada, Tetsuya
AU - Miyakoshi, Junichi
AU - Terada, Koichi
AU - Nojiri, Tohru
AU - Satoh, Makoto
AU - Mizuno, Hiroyuki
AU - Uchiyama, Kunio
AU - Wada, Yasutaka
AU - Kimura, Keiji
AU - Kasahara, Hironori
AU - Maejima, Hideo
PY - 2010
Y1 - 2010
AB - We develop a heterogeneous multi-core SoC for applications such as digital TV systems with IP networks (IP-TV), including image recognition and database search. Figure 5.3.1 shows the chip features. The SoC can decode 1080i audio/video data using only part of the chip (one general-purpose CPU core, a video processing unit called VPU5, and a sound processing unit called SPU) [1]. Four dynamically reconfigurable processors called FE [2] are integrated, with a total theoretical performance of 41.5GOPS at a power consumption of 0.76W. Two 1024-way matrix processors called MX-2 [3] are integrated, with a total theoretical performance of 36.9GOPS at a power consumption of 1.10W. Overall, the performance per watt of the SoC is 37.3GOPS/W at 1.15V, the highest among comparable processors [4-6] excluding special-purpose codecs. The operation granularities of the CPU, FE, and MX-2 are 32bit, 16bit, and 4bit, respectively, so each task can be assigned to the most suitable processor. Compared with homogeneous multi-core SoCs [4], a heterogeneous multi-core approach is one of the most promising ways to attain high performance at low frequency, and hence low power, for consumer electronics and scientific applications. For example, in the image-recognition part of the IP-TV system, the FEs are assigned the optical flow computation [7] on VGA (640x480) video at 15fps, which requires 0.62GOPS. The MX-2s handle face detection and feature-quantity calculation on the same VGA video at 15fps, which requires 30.6GOPS. In addition, the general-purpose CPU cores perform database search using the results of these operations, which calls for further enhancement of the CPUs. The automatic parallelizing compiler analyzes the parallelism of the data flow, generates coarse-grain tasks, and schedules them onto the general-purpose CPUs and FEs to minimize execution time while accounting for data-transfer overhead.
UR - http://www.scopus.com/inward/record.url?scp=77952207417&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77952207417&partnerID=8YFLogxK
U2 - 10.1109/ISSCC.2010.5434031
DO - 10.1109/ISSCC.2010.5434031
M3 - Conference contribution
AN - SCOPUS:77952207417
SN - 9781424460342
T3 - Digest of Technical Papers - IEEE International Solid-State Circuits Conference
SP - 100
EP - 101
BT - 2010 IEEE International Solid-State Circuits Conference, ISSCC 2010 - Digest of Technical Papers
T2 - 2010 IEEE International Solid-State Circuits Conference, ISSCC 2010
Y2 - 7 February 2010 through 11 February 2010
ER -