Exploring the capabilities of support vector machines in detecting silent data corruptions

Sustainable Computing: Informatics and Systems - Volume 19 - Pages 277-290 - 2018
Omer Subasi1, Sheng Di2, Leonardo Bautista-Gomez3, Prasanna Balaprakash2, Osman Unsal3, Jesus Labarta3, Adrian Cristal3,4, Sriram Krishnamoorthy1, Franck Cappello2
1Pacific Northwest National Laboratory, Washington, USA
2Argonne National Laboratory, Lemont, IL, USA
3Barcelona Supercomputing Center, Spain
4IIIA – Artificial Intelligence Research Institute CSIC – Spanish National Research Council, Spain
