TY - GEN
T1 - NVIDIA’s Cloud Native Supercomputing
AU - Shainer, Gilad
AU - Graham, Richard
AU - Newburn, Chris J.
AU - Hernandez, Oscar
AU - Bloch, Gil
AU - Gibbs, Tom
AU - Wells, Jack C.
N1 - Publisher Copyright: © 2022, Springer Nature Switzerland AG.
PY - 2022
Y1 - 2022
AB - NVIDIA is defining a High-Performance Computing system architecture called Cloud Native Supercomputing to provide bare-metal system performance with security isolation and functional offload capabilities. Cloud Native Supercomputing delivers a cloud-based user experience in a way that maintains the performance and scalability that is uniquely delivered with supercomputing facilities. This new set of capabilities is being driven by the need to accommodate new scientific workflows that combine traditional simulation with experimental data from the edge and combine it with AI, data analytics and visualization frameworks in an integrated and even real-time fashion. These new workflows stress the system management, security and non-computational functions of traditional cloud or supercomputing facilities. Specifically, workflows that include data from untrusted (or non-local) sources, user experiences that range from Jupyter notebooks and interactive jobs to Gordon Bell-class capacity batch runs and I/O patterns that are unique to the emerging mix of in silico and live data sources. To achieve these objectives, we introduce a new architectural component called the Data Processing Unit (DPU), which in early embodiments is a system-on-a-chip (SoC) that includes an InfiniBand (IB) and Ethernet network adapter, programmable Arm cores, memory, PCI switches, and custom accelerators. The BlueField-1 and BlueField-2 devices are NVIDIA’s first DPU instances. This paper describes the architecture of cloud native supercomputing systems that use DPUs for isolation and acceleration, along with system services provided by that DPU. These services provide enhanced security through isolation, file-system management capabilities, monitoring, and the offloaded support for communication libraries.
KW - Artificial intelligence
KW - Cloud computing
KW - Data processing unit
KW - High performance computing
KW - Storage
UR - http://www.scopus.com/inward/record.url?scp=85127038404&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-96498-6_20
DO - 10.1007/978-3-030-96498-6_20
M3 - Conference contribution
AN - SCOPUS:85127038404
SN - 9783030964979
T3 - Communications in Computer and Information Science
SP - 340
EP - 357
BT - Driving Scientific and Engineering Discoveries Through the Integration of Experiment, Big Data, and Modeling and Simulation - 21st Smoky Mountains Computational Sciences and Engineering, SMC 2021, Revised Selected Papers
A2 - Nichols, Jeffrey
A2 - Maccabe, Arthur ‘Barney’
A2 - Nutaro, James
A2 - Pophale, Swaroop
A2 - Devineni, Pravallika
A2 - Ahearn, Theresa
A2 - Verastegui, Becky
PB - Springer Science and Business Media Deutschland GmbH
T2 - 21st Smoky Mountains Computational Sciences and Engineering Conference, SMC 2021
Y2 - 18 October 2021 through 20 October 2021
ER -