Abstract
The number of processors embedded in high-performance computing platforms is growing daily to satisfy users' desire to solve larger and more complex problems. Scalable and fault-tolerant runtime environments are needed to support and adapt to the underlying libraries and hardware, which require a high degree of scalability in dynamic large-scale environments. This paper presents a self-healing network (SHN) for supporting scalable and fault-tolerant runtime environments. The SHN is designed to support transmission of messages across multiple nodes while also protecting against recursive node and process failures, and it recovers automatically after a failure occurs. SHN is implemented on top of a scalable fault-tolerant protocol (SFTP). Experimental results show that the latest multicast and broadcast routing algorithms used in SHN are both faster and more reliable than the original SFTP routing algorithms.
| Original language | English |
|---|---|
| Pages (from-to) | 479-485 |
| Number of pages | 7 |
| Journal | Future Generation Computer Systems |
| Volume | 26 |
| Issue number | 3 |
| State | Published - Mar 2010 |
| Externally published | Yes |
Keywords
- Fault tolerance
- Routing protocols
- Runtime environments
- Scalability
- Self-healing