-
Chapter and Conference Paper
A Scalable Asynchronous Replication-Based Strategy for Fault Tolerant MPI Applications
As computational clusters increase in size, their mean time to failure decreases. Typically, checkpointing is used to minimize the loss of computation. Most checkpointing techniques, however, require a central st...
-
Chapter and Conference Paper
Thread Migration/Checkpointing for Type-Unsafe C Programs
Thread migration/checkpointing is becoming indispensable for load balancing and fault tolerance in high performance computing applications, and its success depends on the migration/checkpointing-safety, which ...
-
Chapter and Conference Paper
On Improving Thread Migration: Safety and Performance
Application-level migration schemes have received more attention recently because of their great potential for heterogeneous migration. But they face an obstacle that few migration-unsafe features in ce...