SRS: A framework for developing malleable and migratable parallel applications for distributed systems

Sathish S. Vadhiyar, Jack J. Dongarra

Research output: Contribution to journal › Article › peer-review

70 Scopus citations

Abstract

The ability to produce malleable parallel applications that can be stopped and reconfigured during execution can offer attractive benefits for both the system and the applications. The reconfiguration can take the form of varying the parallelism of the applications, changing the data distributions during execution, or dynamically changing the software components involved in the application execution. In distributed and Grid computing systems, migration and reconfiguration of such malleable applications across distributed heterogeneous sites that do not share common file systems provide flexibility for scheduling and resource management. Existing reconfiguration systems do not support migration of parallel applications to distributed locations. In this paper, we discuss a framework for developing malleable and migratable MPI message-passing parallel applications for distributed systems. The framework includes a user-level checkpointing library called SRS and a runtime support system that manages the checkpointed data for distribution to distributed locations. Our experiments and results indicate that parallel applications instrumented with the SRS library were able to achieve reconfigurability while incurring about 15-35% overhead.

Original language: English
Pages (from-to): 291-312
Number of pages: 22
Journal: Parallel Processing Letters
Volume: 13
Issue number: 2
State: Published - Jun 2003
Externally published: Yes

Funding

* This work is supported in part by National Science Foundation contract #EIA-9975020, SC #R36505-29200099, and grant #EIA-9975015.

Funder: National Science Foundation
Funder numbers: EIA-9975020, R36505-29200099, EIA-9975015

    Keywords

    • Checkpointing
    • Distributed
    • MPI
    • Malleable
    • Migration
    • Parallel
    • Reconfiguration
