TY - JOUR
T1 - A low-overhead soft–hard fault-tolerant architecture, design and management scheme for reliable high-performance many-core 3D-NoC systems
AU - Dang, Khanh N.
AU - Meyer, Michael
AU - Okuyama, Yuichi
AU - Abdallah, Abderazek Ben
N1 - Funding Information:
This work is partially supported by Competitive Research Funding (CRF), The University of Aizu, Reference P-11 (2016), and JSPS KAKENHI Grant Number JP30453020. This work is also supported by VLSI Design and Education Center (VDEC), the University of Tokyo, Japan, in Collaboration with Synopsys, Inc. and Cadence Design Systems, Inc. The first and the last authors in the author list are the main contributors of this work.
Publisher Copyright:
© 2017, Springer Science+Business Media New York.
PY - 2017/6/1
Y1 - 2017/6/1
N2 - The Network-on-Chip (NoC) paradigm has been proposed as a favorable solution to handle the strict communication requirements between the increasingly large number of cores on a single chip. However, NoC systems are exposed to the aggressive scaling down of transistors, low operating voltages, and high integration and power densities, making them vulnerable to permanent (hard) faults and transient (soft) errors. A hard fault in a NoC can lead to external blocking, causing congestion across the whole network. A soft error is more challenging because of its silent data corruption, which leads to a large area of erroneous data due to error propagation, packet re-transmission, and deadlock. In this paper, we present the architecture and design of a comprehensive soft error and hard fault-tolerant 3D-NoC system, named 3D-Hard-Fault-Soft-Error-Tolerant-OASIS-NoC (3D-FETO). With the aid of efficient mechanisms and algorithms, 3D-FETO is capable of detecting and recovering from soft errors which occur in the routing pipeline stages and leverages reconfigurable components to handle permanent faults in links, input buffers, and crossbars. In-depth evaluation results show that the 3D-FETO system is able to work around different kinds of hard faults and soft errors, ensuring graceful performance degradation, while minimizing additional hardware complexity and remaining power efficient.
AB - The Network-on-Chip (NoC) paradigm has been proposed as a favorable solution to handle the strict communication requirements between the increasingly large number of cores on a single chip. However, NoC systems are exposed to the aggressive scaling down of transistors, low operating voltages, and high integration and power densities, making them vulnerable to permanent (hard) faults and transient (soft) errors. A hard fault in a NoC can lead to external blocking, causing congestion across the whole network. A soft error is more challenging because of its silent data corruption, which leads to a large area of erroneous data due to error propagation, packet re-transmission, and deadlock. In this paper, we present the architecture and design of a comprehensive soft error and hard fault-tolerant 3D-NoC system, named 3D-Hard-Fault-Soft-Error-Tolerant-OASIS-NoC (3D-FETO). With the aid of efficient mechanisms and algorithms, 3D-FETO is capable of detecting and recovering from soft errors which occur in the routing pipeline stages and leverages reconfigurable components to handle permanent faults in links, input buffers, and crossbars. In-depth evaluation results show that the 3D-FETO system is able to work around different kinds of hard faults and soft errors, ensuring graceful performance degradation, while minimizing additional hardware complexity and remaining power efficient.
KW - 3D NoCs
KW - Architecture
KW - Design
KW - Fault-tolerance
KW - Reliability
KW - Soft–hard faults
UR - http://www.scopus.com/inward/record.url?scp=85010767436&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85010767436&partnerID=8YFLogxK
U2 - 10.1007/s11227-016-1951-0
DO - 10.1007/s11227-016-1951-0
M3 - Article
AN - SCOPUS:85010767436
SN - 0920-8542
VL - 73
SP - 2705
EP - 2729
JO - Journal of Supercomputing
JF - Journal of Supercomputing
IS - 6
ER -