Abstract
Background: Clouds and MapReduce have shown themselves to be a broadly useful approach to scientific computing especially for parallel data intensive applications. However they have limited applicability to some areas such as data mining because MapReduce has poor performance on problems with an iterative structure present in the linear algebra that underlies much data analysis. Such problems can be run efficiently on clusters using MPI leading to a hybrid cloud and cluster environment. This motivates the design and implementation of an open source Iterative MapReduce system Twister.

Results: Comparisons of Amazon, Azure, and traditional Linux and Windows environments on common applications have shown encouraging performance and usability comparisons in several important non iterative cases. These are linked to MPI applications for final stages of the data analysis. Further we have released the open source Twister Iterative MapReduce and benchmarked it against basic MapReduce (Hadoop) and MPI in information retrieval and life sciences applications.

Conclusions: The hybrid cloud (MapReduce) and cluster (MPI) approach offers an attractive production environment while Twister promises a uniform programming environment for many Life Sciences applications.

Methods: We used commercial clouds Amazon and Azure and the NSF resource FutureGrid to perform detailed comparisons and evaluations of different approaches to data intensive computing. Several applications were developed in MPI, MapReduce and Twister in these different environments.

Original language | English |
---|---|
Article number | S3 |
Journal | BMC Bioinformatics |
Volume | 11 |
Issue number | SUPPL. 12 |
State | Published - Dec 21 2010 |
Externally published | Yes |
Funding
MPI: Message Passing Interface; NSF: National Science Foundation; UC Santa Barbara HPC Research: University of California Santa Barbara High Performance Computing Research; OCI: Office of Cyberinfrastructure; DOE: Department of Energy; EU: European Union; VM: Virtual Machine; HPC: High Performance Computing; DNA: Deoxyribonucleic Acid; BLAST: Basic Local Alignment Search Tool; MDS: Multidimensional Scaling; JVM: Java Virtual Machine.

We thank Microsoft for its technical support. This work was made possible by the computing use grant provided by Amazon Web Services, titled “Proof of concepts linking FutureGrid users to AWS”. This work is partially funded by a Microsoft “CRMC” grant and NIH Grant Number RC2HG005806-02. This document was developed with support from the National Science Foundation (NSF) under Grant No. 0910812 to Indiana University for “FutureGrid: An Experimental, High-Performance Grid Test-bed.” Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

This article has been published as part of BMC Bioinformatics Volume 11 Supplement 12, 2010: Proceedings of the 11th Annual Bioinformatics Open Source Conference (BOSC) 2010. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S12.

Funders | Funder number |
---|---|
University of California Santa Barbara High Performance Computing Research | |
National Science Foundation | 0910812 |
National Institutes of Health | RC2HG005806-02 |
U.S. Department of Energy | |
Microsoft | |
Indiana University | |
University of California, Santa Barbara | |
Amazon Web Services | |
European Commission | |