TY - JOUR
T1 - Programming with BIG Data in R
T2 - Scaling Analytics from One to Thousands of Nodes
AU - Schmidt, Drew
AU - Chen, Wei Chen
AU - Matheson, Michael A.
AU - Ostrouchov, George
N1 - Publisher Copyright:
© 2016 Elsevier Inc.
PY - 2017/7
Y1 - 2017/7
N2 - We present a tutorial overview showing how one can achieve scalable performance with R. We do so by utilizing several package extensions, including those from the pbdR project. These packages consist of high performance, high-level interfaces to and extensions of MPI, PBLAS, ScaLAPACK, I/O libraries, profiling libraries, and more. While these libraries shine brightest on large distributed platforms, they also work rather well on small clusters and often, surprisingly, even on a laptop with only two cores. Our tutorial begins with recommendations on how to get more performance out of your R code before considering parallel implementations. Because R is a high-level language, a function can have a deep hierarchy of operations. For big data, this can easily lead to inefficiency. Profiling is an important tool to understand the performance of an R code for both serial and parallel improvements. The pbdR packages provide a highly scalable capability for the development of novel distributed data analysis algorithms. This level of scalability is unmatched in other analysis software. Interactive speeds (seconds) are achieved for complex analysis algorithms on data 100 GB and more. This is possible because the interfaces add little overhead to the scalable libraries and their extensions. Furthermore, this is often achieved with little or no change to serial R codes. Our overview includes codes of varying complexity, illustrating reading data in parallel, the process of changing a serial code to a distributed parallel code, and how to engage distributed matrix computation from within R.
AB - We present a tutorial overview showing how one can achieve scalable performance with R. We do so by utilizing several package extensions, including those from the pbdR project. These packages consist of high performance, high-level interfaces to and extensions of MPI, PBLAS, ScaLAPACK, I/O libraries, profiling libraries, and more. While these libraries shine brightest on large distributed platforms, they also work rather well on small clusters and often, surprisingly, even on a laptop with only two cores. Our tutorial begins with recommendations on how to get more performance out of your R code before considering parallel implementations. Because R is a high-level language, a function can have a deep hierarchy of operations. For big data, this can easily lead to inefficiency. Profiling is an important tool to understand the performance of an R code for both serial and parallel improvements. The pbdR packages provide a highly scalable capability for the development of novel distributed data analysis algorithms. This level of scalability is unmatched in other analysis software. Interactive speeds (seconds) are achieved for complex analysis algorithms on data 100 GB and more. This is possible because the interfaces add little overhead to the scalable libraries and their extensions. Furthermore, this is often achieved with little or no change to serial R codes. Our overview includes codes of varying complexity, illustrating reading data in parallel, the process of changing a serial code to a distributed parallel code, and how to engage distributed matrix computation from within R.
KW - Distributed computing
KW - Principal components analysis
KW - SPMD
KW - Scalable statistical computing
UR - http://www.scopus.com/inward/record.url?scp=85006826675&partnerID=8YFLogxK
U2 - 10.1016/j.bdr.2016.10.002
DO - 10.1016/j.bdr.2016.10.002
M3 - Review article
AN - SCOPUS:85006826675
SN - 2214-5796
VL - 8
SP - 1
EP - 11
JO - Big Data Research
JF - Big Data Research
ER -