Abstract
The FAIR principles of open science (Findable, Accessible, Interoperable, and Reusable) have had transformative effects on modern large-scale computational science. In particular, they have encouraged more open access to and use of data, an important consideration as collaboration among teams of researchers accelerates and the use of workflows by those teams to solve problems increases. How best to apply the FAIR principles to workflows themselves, and software more generally, is not yet well understood. We argue that the software engineering concept of technical debt management provides a useful guide for application of those principles to workflows, and in particular that it implies reusability should be considered as 'first among equals'. Moreover, our approach recognizes a continuum of reusability where we can make explicit and selectable the tradeoffs required in workflows for both their users and developers. To this end, we propose a new abstraction approach for reusable workflows, with demonstrations for both synthetic workloads and real-world computational biology workflows. Through application of novel systems and tools that are based on this abstraction, these experimental workflows are refactored to rightsize the granularity of workflow components to efficiently fill the gap between end-user simplicity and general customizability. Our work makes it easier to selectively reason about and automate the connections between trade-offs across user and developer concerns when exposing degrees of freedom for reuse. Additionally, by exposing fine-grained reusability abstractions we enable performance optimizations, as we demonstrate on both institutional-scale and leadership-class HPC resources.
Original language | English |
---|---|
Title of host publication | Proceedings - 2021 IEEE International Conference on Cluster Computing, Cluster 2021 |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 444-455 |
Number of pages | 12 |
ISBN (Electronic) | 9781728196664 |
State | Published - 2021 |
Event | 2021 IEEE International Conference on Cluster Computing, Cluster 2021 - Virtual, Portland, United States; Duration: Sep 7 2021 → Sep 10 2021 |
Publication series
Name | Proceedings - IEEE International Conference on Cluster Computing, ICCC |
---|---|
Volume | 2021-September |
ISSN (Print) | 1552-5244 |
Conference
Conference | 2021 IEEE International Conference on Cluster Computing, Cluster 2021 |
---|---|
Country/Territory | United States |
City | Virtual, Portland |
Period | 09/07/21 → 09/10/21 |
Funding
This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725. Sandia National Laboratories is a multi-mission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525. SAND2021-9168C.

Looking toward future development, we see great potential for more powerful and granular metadata representation, automation of reusable workflow composition, and applications across diverse areas of computational science (including climate, materials research, computational systems biology, and hybrid experimental/simulation platforms). Workflows represent the connections between data, computation, and human decision-making, and making them more reusable and automatable will have benefits across the science ecosystem.

ACKNOWLEDGMENT
This manuscript has been authored by UT-Battelle, LLC under contract no. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a nonexclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan, last accessed September 16, 2020).
Funding was provided by the Plant-Microbe Interfaces (PMI) SFA, the Exascale & Petascale Networks for KBase project, and the Center for Bioenergy Innovation (CBI), all of which are supported by the Genomic Sciences Program of the Office of Biological and Environmental Research in the DOE Office of Science. This work was also supported in part by the joint U.S. Department of Veterans Affairs and U.S. Department of Energy MVP CHAMPION program, and by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration.
Keywords
- Distributed information systems
- FAIR
- Middleware
- Reusability
- Workflows