TY - GEN
T1 - Aggregation of real-time system monitoring data for analyzing large-scale parallel and distributed computing environments
AU - Böhm, S.
AU - Engelmann, C.
AU - Scott, S. L.
PY - 2010
Y1 - 2010
N2 - We present a monitoring system for large-scale parallel and distributed computing environments that allows accuracy to be traded off in a tunable fashion to gain scalability without compromising fidelity. The approach relies on classifying each gathered monitoring metric based on individual needs and on aggregating messages containing classes of individual monitoring metrics using a tree-based overlay network. The MRNet-based prototype is able to significantly reduce the amount of gathered and stored monitoring data, e.g., by a factor of ≈56 in comparison to the Ganglia distributed monitoring system. A simple scaling study reveals, however, that further efforts are needed to reduce the amount of data in order to monitor future-generation extreme-scale systems with up to 1,000,000 nodes. The implemented solution did not have a measurable performance impact, as the 32-node test system did not produce enough monitoring data to interfere with running applications.
AB - We present a monitoring system for large-scale parallel and distributed computing environments that allows accuracy to be traded off in a tunable fashion to gain scalability without compromising fidelity. The approach relies on classifying each gathered monitoring metric based on individual needs and on aggregating messages containing classes of individual monitoring metrics using a tree-based overlay network. The MRNet-based prototype is able to significantly reduce the amount of gathered and stored monitoring data, e.g., by a factor of ≈56 in comparison to the Ganglia distributed monitoring system. A simple scaling study reveals, however, that further efforts are needed to reduce the amount of data in order to monitor future-generation extreme-scale systems with up to 1,000,000 nodes. The implemented solution did not have a measurable performance impact, as the 32-node test system did not produce enough monitoring data to interfere with running applications.
UR - http://www.scopus.com/inward/record.url?scp=78149334326&partnerID=8YFLogxK
U2 - 10.1109/HPCC.2010.32
DO - 10.1109/HPCC.2010.32
M3 - Conference contribution
AN - SCOPUS:78149334326
SN - 9780769542140
T3 - Proceedings - 2010 12th IEEE International Conference on High Performance Computing and Communications, HPCC 2010
SP - 72
EP - 78
BT - Proceedings - 2010 12th IEEE International Conference on High Performance Computing and Communications, HPCC 2010
T2 - 2010 12th IEEE International Conference on High Performance Computing and Communications, HPCC 2010
Y2 - 1 September 2010 through 3 September 2010
ER -