Using Apache Spark on genome assembly for scalable overlap-graph reduction

Alexander J. Paul, Dylan Lawrence, Myoungkyu Song, Seung Hwan Lim, Chongle Pan, Tae Hyuk Ahn

Research output: Contribution to journal › Article › peer-review

3 Scopus citations

Abstract

BACKGROUND: De novo genome assembly builds the genome of a specimen from the overlaps among genomic fragments, without relying on a reference sequence. Sequence fragments (called reads) are assembled into contigs and scaffolds based on their overlaps. The quality of a de novo assembly depends on the length and continuity of the assembly. To enable faster and more accurate assembly of species, sequencing techniques such as high-throughput next-generation sequencing and long-read third-generation sequencing have been proposed. However, these techniques require large amounts of computer memory when very large overlap graphs are resolved, and resolving such graphs is difficult to parallelize.

RESULTS: To address these limitations, we propose an algorithmic approach called Scalable Overlap-graph Reduction Algorithms (SORA). SORA is an algorithm package that performs string graph reduction using Apache Spark. SORA's implementations are designed to execute de novo genome assembly on either a single machine or a distributed computing platform. SORA efficiently reduces the number of edges along the paths of enormous graphs by adapting the scalable features of the graph processing libraries provided by Apache Spark, GraphX and GraphFrames.

CONCLUSIONS: We shared the algorithms and the experimental results at our project website, https://github.com/BioHPC/SORA . We evaluated SORA with human genome samples. First, it processed a graph of nearly one billion edges on a distributed cloud cluster. Second, it processed small to mid-size graphs on a single workstation within a short time frame. Overall, SORA achieved linear scaling as the number of computing instances increased.
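To make the idea of reducing edges along graph paths concrete, the following is a minimal single-machine sketch of one common string-graph reduction step: contracting unbranched chains of nodes into a single edge. It is an illustration only. The function name `contract_simple_paths`, the edge representation, and the read names are hypothetical, it does not use Apache Spark, GraphX, or GraphFrames, and the abstract does not state that SORA performs this step in this way.

```python
from collections import defaultdict


def contract_simple_paths(edges):
    """Collapse every unbranched chain u -> v -> w into one edge u -> w.

    `edges` is an iterable of (src, dst) pairs of a directed graph.
    Returns a dict mapping (src, dst) to the list of interior nodes
    that were merged into that edge.
    """
    # Each edge remembers the interior nodes it has absorbed so far.
    merged = {(u, v): [] for u, v in edges}

    changed = True
    while changed:
        changed = False
        preds, succs = defaultdict(list), defaultdict(list)
        for u, v in merged:
            succs[u].append(v)
            preds[v].append(u)

        for v in sorted(set(preds) & set(succs)):
            if len(preds[v]) != 1 or len(succs[v]) != 1:
                continue  # v is a branch point, so keep it
            u, w = preds[v][0], succs[v][0]
            if u == v or w == v or u == w or (u, w) in merged:
                continue  # skip self-loops, 2-cycles and parallel edges
            interior = merged.pop((u, v)) + [v] + merged.pop((v, w))
            merged[(u, w)] = interior
            changed = True
            break  # degrees changed, so rebuild the adjacency maps
    return merged


# Hypothetical reads r1..r4 that overlap in a single chain.
chain = [("r1", "r2"), ("r2", "r3"), ("r3", "r4")]
print(contract_simple_paths(chain))  # {('r1', 'r4'): ['r2', 'r3']}
```

The sketch rebuilds the adjacency maps after every contraction, which keeps the logic easy to follow but would not scale to the graph sizes described in the abstract; a distributed implementation would need a different strategy.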

Original language: English
Pages (from-to): 48
Number of pages: 1
Journal: Human genomics
Volume: 13
State: Published - Oct 22 2019

Funding

TA is supported by NSF-1566292, NSF-1564894, Saint Louis University (SLU) Startup, SLU President’s Research Fund 2018, and Amazon Web Services (AWS) Cloud Credits for Research. DL is supported by T32 HG000045 from the National Human Genome Research Institute. Publication was funded by TA’s SLU Startup fund.

Funders (funder number)
National Science Foundation: NSF-1566292, NSF-1564894
National Human Genome Research Institute: T32 HG000045
Amazon Web Services
Saint Louis University

    Keywords

    • Apache Spark
    • Cloud computing
    • Genome assembly
    • Graph reduction
    • Overlap-layout-consensus
