Component failures are endemic to any large-scale computational resource. As inexpensive computational clusters continue to push the limits of performance and scalability, fundamental issues remain in keeping large-scale cluster systems stable. The increased component count inherent in cluster-based systems makes the resource as a whole less stable, because the integrated system depends on every component and single-component failure rates therefore compound. For instance, even if a cluster is built from very high reliability nodes, each of which suffers a software or hardware failure only once a year, a 1000-node cluster will fail several times a day on average. Stabilizing ever-growing networked collections of cluster systems requires a systemic software solution that provides reliable access to computing resources.
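The failure-rate arithmetic behind the 1000-node example can be checked with a short calculation. The sketch below is illustrative only: it assumes node failures are independent and occur at a constant rate (a Poisson model), which the text does not state, and the function names are hypothetical.

```python
import math

HOURS_PER_YEAR = 365.0 * 24.0


def cluster_failures_per_day(nodes: int, node_failures_per_year: float) -> float:
    """Expected number of node failures per day across the whole cluster.

    Assumes independent nodes, so the per-node rates simply add up.
    """
    return nodes * node_failures_per_year / 365.0


def cluster_mtbf_hours(nodes: int, node_failures_per_year: float) -> float:
    """Mean time between failures anywhere in the cluster, in hours."""
    return HOURS_PER_YEAR / (nodes * node_failures_per_year)


def prob_failure_free(nodes: int, node_failures_per_year: float, hours: float) -> float:
    """Probability that no node fails during a run of the given length.

    Assumes exponentially distributed times between failures (an assumption
    made for this sketch, not a claim from the text).
    """
    rate_per_hour = nodes * node_failures_per_year / HOURS_PER_YEAR
    return math.exp(-rate_per_hour * hours)


if __name__ == "__main__":
    # 1000 nodes, each failing once a year on average.
    print(cluster_failures_per_day(1000, 1.0))   # about 2.7 failures per day
    print(cluster_mtbf_hours(1000, 1.0))         # 8.76 hours between failures
    print(prob_failure_free(1000, 1.0, 24.0))    # chance a 24-hour job sees no failure
```

Under these assumptions, a job that needs all 1000 nodes for a full day has only a small chance of finishing without encountering at least one failure, which is the motivation for masking failures in software.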
Déjà vu provides an integrated solution for transparent fault tolerance that enables large-scale cluster supercomputers to mask hardware, operating system, and software failures. The primary contributions of Déjà vu are:
Interestingly, Déjà vu's failure recovery model also enables preemptive scheduling in traditionally batch-oriented environments. Because Déjà vu can recover from the "all nodes failure" case, a queuing system can use this mode to preempt a running job and later resume it from the saved checkpoint at an arbitrary point in time. We are using this approach to develop a preemptive scheduler based on weighted fair queuing.
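To make the idea of checkpoint-based preemption concrete, here is a minimal sketch of a weighted fair scheduler. It is not the project's scheduler: it uses a simplified virtual-time rule (closer to stride scheduling than to packet-level weighted fair queuing), and the `Job` class, its fields, and `run_schedule` are hypothetical names introduced for illustration. Checkpointing is represented only by a counter; a real system would save and later restore the job's state.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    weight: float              # larger weight = larger share of the machine
    remaining: int             # remaining work, in scheduling quanta
    virtual_time: float = 0.0  # weighted service received so far
    checkpoints: int = 0       # times this job was preempted (checkpointed)


def run_schedule(jobs, quantum=1):
    """Simulate the schedule and return which job held the machine in each quantum."""
    timeline = []
    running = None
    active = [j for j in jobs if j.remaining > 0]
    while active:
        # Pick the job that has received the least weighted service so far;
        # ties are broken by name so the result is deterministic.
        nxt = min(active, key=lambda j: (j.virtual_time, j.name))
        if running is not None and running is not nxt and running.remaining > 0:
            # Stand-in for "checkpoint the running job and release its nodes".
            running.checkpoints += 1
        running = nxt  # stand-in for "resume from the saved checkpoint"
        work = min(quantum, running.remaining)
        running.remaining -= work
        # Service is charged inversely to weight, so heavier jobs run more often.
        running.virtual_time += work / running.weight
        timeline.append(running.name)
        active = [j for j in active if j.remaining > 0]
    return timeline


if __name__ == "__main__":
    jobs = [Job("A", weight=2.0, remaining=4), Job("B", weight=1.0, remaining=2)]
    print(run_schedule(jobs))  # ['A', 'B', 'A', 'A', 'B', 'A']
    print([(j.name, j.checkpoints) for j in jobs])
```

In this example job A, with twice the weight of job B, receives twice as many quanta, and each switch away from an unfinished job corresponds to one checkpoint. The quantum length in a real deployment would have to be large relative to the cost of taking and restoring a checkpoint.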
This work is a collaboration with the Pittsburgh Supercomputing Center (integration with distributed storage systems, accounting systems, and grid integration) and the Institute for Scientific Research (grid integration). This work is supported by an NSF medium ITR grant (CNS-0325534). We thank the National Science Foundation for their support.