Abstract
The ability to produce malleable parallel applications that can be stopped and reconfigured during execution can offer attractive benefits for both the system and the applications. Reconfiguration can mean varying an application's degree of parallelism, changing its data distribution during execution, or dynamically changing the software components involved in the execution. In distributed and Grid computing systems, migrating and reconfiguring such malleable applications across distributed heterogeneous sites that do not share a common file system provides flexibility for scheduling and resource management. Existing reconfiguration systems do not support migration of parallel applications to distributed locations. In this paper, we discuss a framework for developing malleable and migratable MPI message-passing parallel applications for distributed systems. The framework includes a user-level checkpointing library called SRS and a runtime support system that manages the checkpointed data for distribution to distributed locations. Our experiments and results indicate that parallel applications instrumented with the SRS library were able to achieve reconfigurability while incurring about 15-35% overhead.
Original language | English |
---|---|
Pages (from-to) | 291-312 |
Number of pages | 22 |
Journal | Parallel Processing Letters |
Volume | 13 |
Issue number | 2 |
State | Published - Jun 2003 |
Externally published | Yes |
Funding
* This work is supported in part by National Science Foundation contract #EIA-9975020, SC #R36505-29200099, and grant #EIA-9975015.
Funders | Funder number |
---|---|
National Science Foundation | EIA-9975015, R36505-29200099, EIA-9975020 |
Keywords
- Checkpointing
- Distributed
- MPI
- Malleable
- Migration
- Parallel
- Reconfiguration