The Reliability Wall for Exascale Supercomputing
Abstract
Keywords
References
Budnik, 2011, Blue Heron Project
2011, Los Alamos Nat'l Laboratory “Operational Data to Support and Enable Computer Science Research”
Rudin, 1976, Principles of Mathematical Analysis
Moody, 2008, Scalable I/O Systems via Node-Local Storage: Approaching 1 TB/Sec File I/O
2011, UK High-End Computing “Overview of the Advanced Simulation and Computing Program”
Gustafson, 1995, Reevaluating Amdahl's Law, Multiprocessor Performance Measurement and Evaluation, 92
Johnson, 1990, Distributed System Fault Tolerance Using Message Logging and Checkpointing
Lin, 2004, Error Control Coding
Stearley, 2005, Defining and Measuring Supercomputer Reliability, Availability, and Serviceability (RAS), Proc Linux Cluster Inst Conf
Elnozahy, 2008, System Resilience at Extreme Scale
DeBardeleben, 2009, High-End Computing Resilience: Analysis of Issues Facing the HEC Community and Path-Forward for Research and Development
Chakravorty, 2008, A Fault Tolerance Protocol for Fast Recovery
Simon, 2011, Modeling and Simulation at the Exascale for Energy and the Environment
Scott, 2009, HW & SW Challenges and Trends to Reach Exascale, HPCChina '09 Proc High Performance Computing of China
Beck, 1994, Compiler-Assisted Checkpointing, technical report Univ of Tennessee Knoxville
Plank, 1995, Compiler-Assisted Memory Exclusion for Fast Checkpointing, IEEE Technical Committee on Operating Systems and Application Environments, 7, 10
Lu, 2005, Scalable Diskless Checkpointing for Large Parallel Systems
Plank, 1994, Libckpt: Transparent Checkpointing under Unix, technical report Univ of Tennessee Knoxville