Seminars & Colloquia

Jeffrey Chen

CS, Colorado School of Mines

"Dependable High Performance Scientific Computing via Algorithmic Fault Tolerance"

Monday, February 20, 2012, 2:00 PM
Location: 3211 EB2, NCSU Centennial Campus

This talk is part of the System Research Seminar series

 

Abstract:

Exascale platforms, expected to be available before the end of this decade, will have 100 million to 1 billion CPU cores. Due to the large number of components, the probability that a failure occurs during the execution of an exascale application is expected to be much higher than it is today. In this talk, I will discuss our recent work on dependable high performance scientific computing via algorithmic fault tolerance. We have developed highly efficient algorithmic fault tolerance techniques for selected widely used scientific computing algorithms to tolerate both soft and hard errors according to their specific algorithmic characteristics. The algorithms we consider include:

(1). Krylov subspace methods for solving sparse linear systems and eigenvalue problems, including the Conjugate Gradient method (CG), the Generalized Minimal Residual method (GMRES), the BiConjugate Gradient method (BiCG), the BiConjugate Gradient Stabilized method (Bi-CGSTAB), Arnoldi's method for eigenvalue problems, the Lanczos method for symmetric eigenvalue problems, and Lanczos biorthogonalization for non-symmetric eigenvalue problems;

(2). Direct methods for solving dense linear systems and eigenvalue problems, including LU, Cholesky, and QR; and

(3). Newton's method for solving systems of non-linear equations.

By leveraging the algorithmic characteristics of these algorithms, the proposed techniques sacrifice the generality of the Triple Modular Redundancy technique (for soft errors) and the checkpoint/restart technique (for hard errors) for the sake of higher efficiency.
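The abstract does not spell out the speaker's techniques, so the following is only a rough illustration of the general idea behind algorithm-based fault tolerance: exploiting an invariant of the computation instead of replicating it. The sketch uses the classic checksum invariant for a matrix-vector product, sum(A x) = (e^T A) x, to detect a simulated soft error. All function names and the `fault` parameter are hypothetical and chosen for this example; they are not taken from the talk.

```python
def checksum_row(A):
    """Return the column sums of A, i.e. the checksum row e^T A."""
    return [sum(col) for col in zip(*A)]


def matvec(A, x):
    """Plain dense matrix-vector product y = A x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def checked_matvec(A, x, fault=None, tol=1e-9):
    """Compute y = A x and check the invariant sum(y) == (e^T A) x.

    fault is a hypothetical hook for this sketch: an (index, delta) pair
    that corrupts one entry of y to simulate a soft error.
    Returns (y, ok), where ok is False if the checksum test fails.
    """
    c = checksum_row(A)
    expected = sum(ci * xi for ci, xi in zip(c, x))
    y = matvec(A, x)
    if fault is not None:
        i, delta = fault
        y[i] += delta  # inject the simulated error
    ok = abs(sum(y) - expected) <= tol * max(1.0, abs(expected))
    return y, ok


A = [[4.0, 1.0], [1.0, 3.0]]
x = [1.0, 2.0]
y_clean, ok_clean = checked_matvec(A, x)
y_bad, ok_bad = checked_matvec(A, x, fault=(0, 0.5))
```

Here the check costs one extra dot product per multiplication, whereas Triple Modular Redundancy would repeat the whole computation three times. The price is that the check only applies to operations that preserve such a checksum relationship.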

 

Short Bio:

Zizhong (Jeffrey) Chen is an Assistant Professor of Computer Science at the Colorado School of Mines. He is interested in the broad area of high performance computing (HPC). The goal of his research is to develop techniques, design algorithms, and build software tools that let computational science applications achieve both high performance and high reliability on a wide range of computational platforms. His research results have been published frequently in leading HPC-related journals and conferences, including TPDS, TC, JPDC, SIMAX, SISC, ICS, HPDC, PPoPP, and SC. He received a Best Paper Award from the 19th International Supercomputer Conference in 2004, a Distinguished Research Award from Jacksonville State University in 2008, and an Outstanding Faculty Award from the Colorado School of Mines in 2010. He is currently a senior member of the IEEE.

Host: Frank Mueller, Computer Science, NCSU

