Deja Vu: Transparent Checkpoint, Recovery and Migration for Large-Scale Distributed Systems

Students:

Component failures are endemic to any large-scale computational resource. As inexpensive computational clusters continue to push the limits of performance and scalability, fundamental issues remain in engendering stability in large-scale cluster systems. The higher component count inherent in cluster-based systems makes the resource as a whole less stable, because the integrated system depends on every one of its components and their individual failure rates add up. For instance, even if a cluster is built from very reliable nodes, each suffering a software or hardware failure only once a year, a 1000-node cluster will fail about 2.7 times a day on average (1000 failures spread over 365 days). Engendering stability in ever-growing networked collections of cluster systems requires a systemic software solution that provides reliable access to computing resources.
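The arithmetic behind the 1000-node example can be checked with a short calculation. This is a minimal sketch, assuming that node failures are independent and occur at a constant rate (an exponential failure model); the function name and its parameters are illustrative and are not part of Deja Vu.

```python
import math

def cluster_failure_stats(nodes, node_failures_per_year=1.0):
    """Return (failures per day, mean hours between failures,
    probability of at least one failure in a given day).

    Assumes independent nodes with a constant failure rate.
    """
    # Independent failure rates add, so the cluster rate scales with node count.
    failures_per_day = nodes * node_failures_per_year / 365.0
    mean_hours_between_failures = 24.0 / failures_per_day
    # Poisson arrivals: P(no failure in one day) = exp(-rate).
    p_failure_in_a_day = 1.0 - math.exp(-failures_per_day)
    return failures_per_day, mean_hours_between_failures, p_failure_in_a_day

rate, hours, p = cluster_failure_stats(1000)
print(f"{rate:.2f} failures/day, one every {hours:.1f} h, "
      f"P(failure within a day) = {p:.2f}")
```

Under these assumptions a job that needs the whole cluster for longer than a few hours is unlikely to finish without being interrupted, which is the motivation for transparent checkpointing.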

Deja Vu provides an integrated solution to the problem of transparent fault tolerance, enabling large-scale cluster supercomputers to mask hardware, operating system and software failures. The primary contributions of Deja Vu are:

Interestingly, Deja Vu's failure recovery model also enables preemptive scheduling in traditionally batch-oriented environments. Since Deja Vu can recover from the "all nodes failure" case, a queuing system can use this mode to preempt a running job and resume it from the saved checkpoint at an arbitrary later time. We are using this approach to develop a preemptive scheduler based on weighted fair queuing.
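To illustrate how checkpoint-based preemption and weighted fair queuing fit together, the sketch below runs jobs in fixed time slices and always picks the job with the smallest virtual time, where a job's virtual time advances by its run time divided by its weight. This is a simplified virtual-time approximation of weighted fair queuing, not Deja Vu's scheduler: the function name, the job table and the quantum are hypothetical, and the point where a real system would checkpoint and later restart the job is marked only by a comment.

```python
import heapq

def wfq_schedule(jobs, quantum=1.0):
    """jobs maps a job name to (weight, total_work).

    Returns the execution order as a list of (name, run_time) slices.
    """
    remaining = {name: work for name, (_, work) in jobs.items()}
    # Heap of (virtual_time, name); ties are broken by name.
    heap = [(0.0, name) for name in jobs]
    heapq.heapify(heap)
    timeline = []
    while heap:
        vtime, name = heapq.heappop(heap)
        run = min(quantum, remaining[name])
        timeline.append((name, run))
        remaining[name] -= run
        if remaining[name] > 0:
            # Hypothetical preemption point: the job would be checkpointed
            # here and resumed from that checkpoint on its next slice.
            weight = jobs[name][0]
            heapq.heappush(heap, (vtime + run / weight, name))
    return timeline

# Job "a" has twice the weight of job "b".
slices = wfq_schedule({"a": (2.0, 4.0), "b": (1.0, 2.0)})
print(slices)
```

While both jobs are runnable, a job with twice the weight receives roughly twice the machine time. Each switch between jobs corresponds to one checkpoint and one restart, so in practice the quantum would have to be large relative to the checkpoint cost.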

This work is a collaboration with the Pittsburgh Supercomputing Center (integration with distributed storage systems, accounting systems and grid integration) and the Institute for Scientific Research (grid integration).

Current Status (1/15/2005)

Acknowledgements

This work is supported by an NSF medium ITR grant (CNS-0325534). We thank the National Science Foundation for their support.