Best Practices and Lessons Learned from Deploying and Operating Large-Scale Data-Centric Parallel File Systems

Sarp Oral, James Simmons, Jason Hill, Dustin Leverman, Feiyi Wang, Matt Ezell, Ross Miller, Douglas Fuller, Raghul Gunasekaran, Youngjae Kim, Saurabh Gupta, Devesh Tiwari, Sudharshan S. Vazhkudai, James H. Rogers, David Dillow, Galen M. Shipman, Arthur S. Bland

Research output: Contribution to journal › Conference article › peer-review

43 Scopus citations

Abstract

The Oak Ridge Leadership Computing Facility (OLCF) has deployed multiple large-scale parallel file systems (PFS) to support its operations. During this process, OLCF acquired significant expertise in large-scale storage system design, file system software development, technology evaluation, benchmarking, procurement, deployment, and operational practices. Based on the lessons learned from each new PFS deployment, OLCF improved its operating procedures and strategies. This paper provides an account of our experience and lessons learned in acquiring, deploying, and operating large-scale parallel file systems. We believe that these lessons will be useful to the wider HPC community.

Original language: English
Article number: 7013005
Pages (from-to): 217-228
Number of pages: 12
Journal: International Conference for High Performance Computing, Networking, Storage and Analysis, SC
Volume: 2015-January
Issue number: January
State: Published - Jan 16 2014
Event: International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2014 - New Orleans, United States
Duration: Nov 16 2014 → Nov 21 2014
