Abstract
Programming development tools are a vital component for understanding the behavior of parallel applications. Event tracing is a principal ingredient of these tools, but new and serious challenges place event tracing at risk on extreme-scale machines. As the quantity of captured events increases with concurrency, the additional data can overload the parallel file system and perturb the application being observed. In this work we present a solution for event tracing on extreme-scale machines. We enhance an I/O forwarding software layer to aggregate and reorganize log data prior to writing to the storage system, significantly reducing the burden on the underlying file system. Furthermore, we introduce a sophisticated write buffering capability to limit the impact. To validate the approach, we employ the Vampir tracing toolset using these new capabilities. Our results demonstrate that the approach increases the maximum traced application size by a factor of 5, to more than 200,000 processes.
Original language | English |
---|---|
Pages (from-to) | 1-18 |
Number of pages | 18 |
Journal | Cluster Computing |
Volume | 17 |
Issue number | 1 |
DOIs | |
State | Published - Mar 2014 |
Funding
Acknowledgements: We thank Ramanan Sankaran (ORNL) for providing a working version of S3D as well as a benchmark problem set for JaguarPF. We are grateful to Matthias Jurenz for his assistance on VampirTrace as well as Matthias Weber and Ronald Geisler for their support for Vampir. The IOFSL project is supported by the DOE Office of Science and National Nuclear Security Administration (NNSA). This research used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory and the Oak Ridge Leadership Computing Facility at Oak Ridge National Laboratory, which are supported by the Office of Science of the U.S. Department of Energy under contracts DE-AC02-06CH11357 and DE-AC05-00OR22725, respectively. This work was supported in part by the National Science Foundation (NSF) through NSF-0937928 and NSF-0724599. This work is supported in part by the German Research Foundation (DFG) in the Collaborative Research Center 912 “Highly Adaptive Energy-Efficient Computing”.
Funders | Funder number |
---|---|
National Science Foundation | NSF-0937928, NSF-0724599 |
U.S. Department of Energy | DE-AC05-00OR22725, DE-AC02-06CH11357 |
Office of Science | |
National Nuclear Security Administration | |
Argonne National Laboratory | |
Deutsche Forschungsgemeinschaft | |
Keywords
- Atomic append
- Event tracing
- I/O forwarding