Jidong Zhai

Tsinghua University

"A Scalable Fault Tolerant HPL Implementation to Tolerate Permanent Node Losses"

Wednesday June 15, 2016 10:00 AM
Location: 1228, EBII NCSU Centennial Campus

HPL (High-Performance Linpack) is an important benchmark for ranking HPC systems. However, a large-scale system may not be able to complete a full HPL run, because system reliability decreases and HPL test time increases as systems grow. Despite previous efforts, existing methods either fail to tolerate permanent node losses on real HPC systems or fail to scale to large systems with an acceptable performance loss.

In this talk, I will present a scalable fault-tolerant HPL called NFT-HPL. It is inspired by the observation that we can trade part of the available memory for fault tolerance in HPL. To address the two key challenges of applying in-memory checkpointing to HPL, scalability and memory consumption, we propose two novel techniques in NFT-HPL. The first is a group encoding strategy that guarantees scalability. The second is back-to-back checkpoint updating, which reduces the memory consumed by checkpoints and increases the system’s reliability. Experiments with 24,576 processes show that NFT-HPL achieves over 95% of the performance of the original HPL.

Short Bio:

Jidong Zhai is an assistant professor in the Computer Science Department of Tsinghua University and a visiting assistant professor at Stanford University. He received his Ph.D. degree in Computer Science from Tsinghua University in 2010, with the Excellent Ph.D. Graduate Student Award of Tsinghua University. His research focuses on high performance computing, compiler optimization, and performance analysis and optimization of large-scale parallel applications. His research was a Best Paper Finalist at SC’14. The team he led swept the champion titles of all three student supercomputing challenges: ASC’15, ISC’15, and SC’15. He has served or is serving as a TPC member or reviewer for IEEE TPDS, IEEE TCC, IEEE TKDE, ICPP, NAS, LCPC, ICA3PP, and HPCC. He is a co-chair of the ICPP PASA workshop 2015. He is a 2010 Siebel Scholar and a recipient of the CCF Outstanding Doctoral Dissertation Award.

Host: Frank Mueller, CSC

