February 22, 2017

DOE Supports Mueller’s Research With Two Awards

Dr. Frank Mueller, professor of computer science at NC State University, has recently received two awards from the Lawrence Livermore National Laboratory via the US Department of Energy.
Mueller’s first proposal, “HPC Power Modeling and Active Control,” received $386,290, and will run from October 25, 2016 through September 30, 2019.
Abstract - As we approach the exascale era, power has become a primary bottleneck. The US Department of Energy has set a power constraint of 20 MW on each exascale machine. To achieve one exaflop within 20 MW, power must be used intelligently to maximize performance under that constraint. In this work, we propose to alleviate the shortcomings of current HPC systems in addressing power constraints by (1) power-aware machine partitioning, (2) power-constrained job scheduling, (3) systematic provisioning and procurement of hardware under a power cap, (4) modeling of network, deep memories, and storage, as well as (5) investigating the inter-dependence between power and cooling.
The second proposal, “Failure Prediction and Exact Localization,” received $84,684, and will run from October 18, 2016 through August 16, 2017.
Abstract - Extreme-scale computing platforms are increasingly suffering from job failures due to hardware and software faults. Past work has predicted system availability but cannot predict the locality of failures.

The objective of this work is to assess the potential of machine learning techniques for pinpointing failures before they happen with high true positive and low false positive rates. We propose to employ a combination of machine learning (ML) techniques for offline training followed by online real-time prediction of failure locations, early enough to take preemptive measures and “work around” predicted trouble spots (nodes, network links/switches).
