Distributed Computing - Finding Storage, Processing Power in Numbers
In her office on Centennial Campus, Dr. Xiaosong Ma constantly hears the “whine of progress.” It comes from researchers complaining about the difficulty of processing enormous amounts of experiment data, but it is the sound that drives the assistant professor of computer science. “They are always coming here with their problems,” the soft-spoken Ma says with a chuckle. “But their problems give me new areas to research.”
Computational science and data mining are growing more complex, making the resulting data harder to process and analyze, says Ma, who holds a joint faculty position at NC State and Oak Ridge National Laboratory (ORNL). Applications that once generated gigabytes of data now take up terabytes and will soon require petabytes (1 million gigabytes), far too much information for personal computers to handle. Using CAREER awards from the National Science Foundation and the Department of Energy, Ma is devising ways to tie dozens of computers together to handle the job.
Because more than half of the disk space on most personal computers usually sits idle, a giant cache of available space could be created by linking workstations in a particular department or building, Ma says. Together with assistant professor Vincent Freeh and scientists at ORNL, she has developed Project FreeLoader to borrow 50 gigabytes or so of unused storage from each computer to create a single “scratch space,” where data sets from computer simulations could be held temporarily.
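The pooling idea can be sketched in a few lines, assuming each workstation with enough idle disk donates a fixed 50 GB slice. The function name and numbers here are illustrative only, not FreeLoader's actual interface:

```python
# Hypothetical sketch of pooling idle disk into one "scratch space".
# Each workstation that can spare it donates a fixed slice; the pool's
# capacity is simply the sum of the donated slices.

DONATION_GB = 50  # the article's "50 gigabytes or so" per machine

def scratch_pool_gb(free_space_gb):
    """Total scratch capacity from machines with enough idle disk."""
    return sum(DONATION_GB for free in free_space_gb if free >= DONATION_GB)

# Four workstations in a department; three have 50 GB or more to spare.
print(scratch_pool_gb([120, 30, 80, 200]))  # 150
```

Because the pool only borrows a bounded slice from each machine, no single workstation's own storage is noticeably reduced, yet a department of a few dozen machines yields terabytes of temporary space.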
But storing the data is only half the battle. Professors tell Ma they also need the ability to analyze it but don’t have anywhere near the needed computing power. Once again, there is power in numbers. Ma and Freeh are working on background computing systems that can process large projects on a network of computers while users continue to work on their individual desktops. The system monitors processor usage on each machine and automatically slows the background program when a user requires more speed for his or her work.
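One way such self-throttling could be sketched (a hypothetical policy, not the researchers' actual system): sample how busy the user's processes are, and stretch the background job's pause between work units accordingly.

```python
import time

def throttle_delay(user_cpu_fraction, base_delay=0.01, max_delay=1.0):
    """Seconds to pause between background work units.

    A nearly idle desktop (fraction near 0) yields almost no pause; a
    busy one (fraction near 1) stretches the pause toward max_delay,
    slowing the background job out of the user's way.
    """
    return min(max_delay, base_delay + user_cpu_fraction * max_delay)

def run_background(work_units, sample_user_cpu):
    """Skeleton loop: do one small chunk of the big job, then yield."""
    for unit in work_units:
        unit()                                         # a small unit of work
        time.sleep(throttle_delay(sample_user_cpu()))  # back off if user is busy
```

The key design point is that the big computation is broken into small units, so the system gets frequent chances to check the user's load and back off before the desktop feels sluggish.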
Ma also borrowed a page from multimedia producers to speed downloads and retrieve lost data quickly. Different pieces, or “stripes,” of data are sent simultaneously to the linked computers from an archival system and are made available for processing before the download is complete. Only a portion of the data is cached, which saves storage space and simplifies data recovery if a computer in the network crashes, Ma says.
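The striping idea can be sketched as a round-robin assignment of fixed-size chunks to the linked machines. This is an illustrative toy, not the project's actual protocol:

```python
def stripe(chunks, n_nodes):
    """Assign data chunks to n_nodes machines round-robin.

    Node i holds chunks i, i+n, i+2n, ...; each machine can start
    processing its stripe before the others finish downloading, and
    losing one machine costs only that machine's stripe.
    """
    stripes = [[] for _ in range(n_nodes)]
    for i, chunk in enumerate(chunks):
        stripes[i % n_nodes].append(chunk)
    return stripes

print(stripe(["c0", "c1", "c2", "c3", "c4", "c5"], 3))
# [['c0', 'c3'], ['c1', 'c4'], ['c2', 'c5']]
```

If one node crashes, only its stripe must be re-fetched from the archival system, which is why partial caching both saves space and simplifies recovery.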
“Scientists need to do science, not computer science,” she says. “We can help them do that.”
Published with permission, from the Fall 2006 issue of RESULTS magazine.
Photos by Roger Winstead.