TY - GEN
T1 - Balancing productivity and performance on the cell broadband engine
AU - Alam, Sadaf R.
AU - Meredith, Jeremy S.
AU - Vetter, Jeffrey S.
PY - 2007
Y1 - 2007
N2 - The Cell Broadband Engine (BE) is a heterogeneous multicore processor, combining a general-purpose POWER architecture core with eight independent single-instruction-multiple-data (SIMD) cores. Each core is capable of very high performance; however, users must explicitly manage data movement, scheduling, and synchronization. While these attributes provide some of the Cell processor's greatest performance strengths, they also form its greatest weaknesses in terms of developer productivity, code portability, and initial performance efficiencies. In this paper, we evaluate productivity and relative performance improvements of a Cell BE system for a diverse set of kernels and applications. Our experimental workload includes algorithms from scientific, cognitive, and imaging problem domains. Our results demonstrate that the Cell processor could be several times faster than a SSE-enabled, contemporary dual-core processor, and could sustain a high performance-to-productivity ratio. We outline strategies for transforming applications to exploit the Cell's architectural features, and measure productivity by comparing programming effort in terms of lines of code and performance. For instance, our measurements revealed that a covariance matrix creation routine - a common routine in hyperspectral imaging - ran over eight times faster than a 2.66 GHz Intel Woodcrest processor while sustaining a productivity metric of over two by parallelizing across the heterogeneous cores, unrolling loops, and improving instruction level parallelism with SIMD instructions in a high-level language.
AB - The Cell Broadband Engine (BE) is a heterogeneous multicore processor, combining a general-purpose POWER architecture core with eight independent single-instruction-multiple-data (SIMD) cores. Each core is capable of very high performance; however, users must explicitly manage data movement, scheduling, and synchronization. While these attributes provide some of the Cell processor's greatest performance strengths, they also form its greatest weaknesses in terms of developer productivity, code portability, and initial performance efficiencies. In this paper, we evaluate productivity and relative performance improvements of a Cell BE system for a diverse set of kernels and applications. Our experimental workload includes algorithms from scientific, cognitive, and imaging problem domains. Our results demonstrate that the Cell processor could be several times faster than a SSE-enabled, contemporary dual-core processor, and could sustain a high performance-to-productivity ratio. We outline strategies for transforming applications to exploit the Cell's architectural features, and measure productivity by comparing programming effort in terms of lines of code and performance. For instance, our measurements revealed that a covariance matrix creation routine - a common routine in hyperspectral imaging - ran over eight times faster than a 2.66 GHz Intel Woodcrest processor while sustaining a productivity metric of over two by parallelizing across the heterogeneous cores, unrolling loops, and improving instruction level parallelism with SIMD instructions in a high-level language.
UR - http://www.scopus.com/inward/record.url?scp=53349175902&partnerID=8YFLogxK
U2 - 10.1109/CLUSTR.2007.4629227
DO - 10.1109/CLUSTR.2007.4629227
M3 - Conference contribution
AN - SCOPUS:53349175902
SN - 1424413885
SN - 9781424413881
T3 - Proceedings - IEEE International Conference on Cluster Computing, ICCC
SP - 149
EP - 158
BT - Proceedings - 2007 IEEE International Conference on Cluster Computing, CLUSTER 2007
T2 - 2007 IEEE International Conference on Cluster Computing, CLUSTER 2007
Y2 - 19 September 2007 through 20 September 2007
ER -