CSC News

February 05, 2010

Mueller Receives Award to Study HPC Resilience

Dr. Frank Mueller, associate professor of computer science at NC State University, has been awarded $50,000 by Sandia National Laboratories to support his research proposal titled “Developing and Evaluating Advanced Methods for Resilience at Scale.”

The award will run from February 1, 2010, to January 31, 2011.

Abstract - For large-scale high-performance computing (HPC) systems with tens to hundreds of thousands of cores, faults have become the norm rather than the exception. The objective of the proposed work is to alleviate the scalability limitations of current fault-tolerance practices on petascale installations, which could pave the way for forthcoming exascale systems. To this end, we propose to develop and evaluate advanced mechanisms that make large-scale HPC jobs resilient to failures. We will combine in-place rollback with redundant computing and evaluate the combination. We will also develop techniques to detect and recover from silent data corruption.

~coates~

 
