Seminars & Colloquia
CS, Colorado School of Mines
"Dependable High Performance Scientific Computing via Algorithmic Fault Tolerance"
Monday February 20, 2012 02:00 PM
Location: 3211, EB2 NCSU Centennial Campus
(Visitor parking instructions)
This talk is part of the System Research Seminar series
Exascale platforms expected to be available before the end of this decade will have 100 million to 1 billion CPU cores. Due to the large number of components, the probability that a failure occurs during the execution of an exascale application is expected to be much higher than today. In this talk, I will discuss our recent work on dependable high performance scientific Computing via algorithmic fault tolerance. We have developed highly efficient algorithmic fault tolerance techniques for selected widely used scientific computing algorithms to tolerate both soft and hard errors according to their specific algorithmic characteristics. The algorithms we consider include:
(1). Krylov subspace methods for solving sparse linear systems and eigenvalue problems including Conjugate Gradient method (CG), Generalized Minimal Residual method (GMRES) , BiConjugate Gradient method (BiCG), BiConjugate Gradient Stabilized method (Bi-CGSTAB), Arnoldi's method for eigenvalue problems, Lanczos method for symmetric eigenvalue problems, and Lanczos biorthogonalization for non-symmetric eigenvalue problems;
(2). Direct methods for solving dense linear systems and eigenvalue problems including LU, Cholesky, and QR; and
(3). Newton's method for solving systems of non-linear equations. By leveraging the algorithmic characteristics of the these algorithms, the proposed techniques sacrifice the generality of Triple Modular Redundancy technique (for soft errors) and checkpoint/restart technique (for hard errors) for the sake of higher efficiency.
Zizhong (Jeffrey) Chen is an Assistant Professor of Computer Science at the Colorado School of Mines. He is interested in the broad area of high performance computing (HPC). The goal of his research is to develop techniques, design algorithms, and build software tools for computational science applications to achieve both high performance and high reliability on a wide range of computational platforms. His research results have been frequently published in leading HPC related journals and conferences including TPDS, TC, JPDC, SIMAX, SISC, ICS, HPDC, PPoPP, and SC. He received a Best Paper Award from the 19th International Supercomputer Conference in 2004, a Distinguished Research Award from Jacksonville State University in 2008, and an Outstanding Faculty Award from the Colorado School of Mines in 2010. He is currently a senior member of the IEEE.
Host: Frank Mueller, Computer Science, NCSU