TY - GEN
T1 - Enabling graph appliance for genome assembly
AU - Singh, Rina
AU - Graves, Jeffrey A.
AU - Lee, Sangkeun
AU - Sukumar, Sreenivas R.
AU - Shankar, Mallikarjun
N1 - Publisher Copyright:
© 2015 IEEE.
PY - 2015/12/22
Y1 - 2015/12/22
N2 - In recent years, there has been a huge growth in the amount of genomic data available as reads generated from various genome sequencers. The number of reads generated can be huge, ranging from hundreds to billions of nucleotides, each varying in size. Assembling such large amounts of data is one of the challenging computational problems for both biomedical and data scientists. Most of the genome assemblers that have been developed use de Bruijn graph techniques. A de Bruijn graph represents a collection of read sequences by billions of vertices and edges, which require large amounts of memory and computational power to store and process. This is the major drawback to de Bruijn graph assembly. Massively parallel, multithreaded, shared memory systems can be leveraged to overcome some of these issues. The objective of our research is to investigate the feasibility and scalability issues of de Bruijn graph assembly on Cray's Urika-GD system; Urika-GD is a high-performance graph appliance with a large shared memory and massively multithreaded custom processor designed for executing SPARQL queries over large-scale RDF data sets. However, to the best of our knowledge, there is no research on representing a de Bruijn graph as an RDF graph or finding Eulerian paths in RDF graphs using SPARQL for potential genome discovery. In this paper, we address the issues involved in representing de Bruijn graphs as RDF graphs and propose an iterative querying approach for searching cycles to find Eulerian paths in large RDF graphs. We evaluate the performance of our implementation on real-world Ebola genome datasets and illustrate how genome assembly can be accomplished with Urika-GD using iterative SPARQL queries.
AB - In recent years, there has been a huge growth in the amount of genomic data available as reads generated from various genome sequencers. The number of reads generated can be huge, ranging from hundreds to billions of nucleotides, each varying in size. Assembling such large amounts of data is one of the challenging computational problems for both biomedical and data scientists. Most of the genome assemblers that have been developed use de Bruijn graph techniques. A de Bruijn graph represents a collection of read sequences by billions of vertices and edges, which require large amounts of memory and computational power to store and process. This is the major drawback to de Bruijn graph assembly. Massively parallel, multithreaded, shared memory systems can be leveraged to overcome some of these issues. The objective of our research is to investigate the feasibility and scalability issues of de Bruijn graph assembly on Cray's Urika-GD system; Urika-GD is a high-performance graph appliance with a large shared memory and massively multithreaded custom processor designed for executing SPARQL queries over large-scale RDF data sets. However, to the best of our knowledge, there is no research on representing a de Bruijn graph as an RDF graph or finding Eulerian paths in RDF graphs using SPARQL for potential genome discovery. In this paper, we address the issues involved in representing de Bruijn graphs as RDF graphs and propose an iterative querying approach for searching cycles to find Eulerian paths in large RDF graphs. We evaluate the performance of our implementation on real-world Ebola genome datasets and illustrate how genome assembly can be accomplished with Urika-GD using iterative SPARQL queries.
UR - http://www.scopus.com/inward/record.url?scp=84963749659&partnerID=8YFLogxK
U2 - 10.1109/BigData.2015.7364056
DO - 10.1109/BigData.2015.7364056
M3 - Conference contribution
AN - SCOPUS:84963749659
T3 - Proceedings - 2015 IEEE International Conference on Big Data, IEEE Big Data 2015
SP - 2583
EP - 2590
BT - Proceedings - 2015 IEEE International Conference on Big Data, IEEE Big Data 2015
A2 - Luo, Feng
A2 - Ogan, Kemafor
A2 - Zaki, Mohammed J.
A2 - Haas, Laura
A2 - Ooi, Beng Chin
A2 - Kumar, Vipin
A2 - Rachuri, Sudarsan
A2 - Pyne, Saumyadipta
A2 - Ho, Howard
A2 - Hu, Xiaohua
A2 - Yu, Shipeng
A2 - Hsiao, Morris Hui-I
A2 - Li, Jian
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 3rd IEEE International Conference on Big Data, IEEE Big Data 2015
Y2 - 29 October 2015 through 1 November 2015
ER -