CUMULVS: Providing fault tolerance, visualization, and steering of parallel applications

G. A. Geist, James Arthur Kohl, Philip M. Papadopoulos

Research output: Contribution to journal › Article › peer-review

103 Scopus citations

Abstract

Visualization and computational steering can often assist scientists in analyzing large-scale scientific applications, and fault tolerance is of great importance when running on a distributed system. However, the details of implementing these features are complex and tedious, leaving many scientists with inadequate development tools. CUMULVS is a library that enables programmers to easily incorporate interactive visualization and computational steering into existing parallel programs. Built on the PVM virtual machine framework, CUMULVS is portable and interoperable with every computer architecture that PVM supports, a growing list that now stands at about 60. The CUMULVS library is divided into two pieces: one for the application program and one for the (possibly commercial) visualization and steering front end. Together, these two libraries encompass all the connection and data protocols needed to dynamically attach multiple, independent viewer front ends to a running parallel application. Viewer programs can also steer one or more user-defined parameters to "close the loop" for computational experiments and analyses. CUMULVS allows the programmer to specify user-directed checkpoints that save important program state in case of failures, and it also provides a mechanism to migrate tasks across heterogeneous machine architectures to achieve improved performance. Details of the CUMULVS design goals and compromises, as well as future directions, are given.

Original language: English
Pages (from-to): 224-235
Number of pages: 12
Journal: International Journal of High Performance Computing Applications
Volume: 11
Issue number: 3
State: Published - 1997
