The Reliability Wall for Exascale Supercomputing

IEEE Transactions on Computers - Volume 61, Issue 6 - Pages 767-779 - 2012
Xuejun Yang1, Zhiyuan Wang1, Jingling Xue2, Yun Zhou1
1National Laboratory for Parallel and Distributed Processing, School of Computer, National University of Defense Technology, Changsha, Hunan, China
2School of Computer Science and Engineering, University of New South Wales, Sydney, NSW, Australia
