TY - GEN
T1 - Efficient data access for parallel BLAST
AU - Lin, Heshan
AU - Ma, Xiaosong
AU - Chandramohan, Praveen
AU - Geist, Al
AU - Samatova, Nagiza
PY - 2005
Y1 - 2005
N2 - Searching biological sequence databases is one of the most routine tasks in computational biology. This task is significantly hampered by the exponential growth in sequence database sizes. Recent advances in parallelization of biological sequence search applications have enabled bioinformatics researchers to utilize high-performance computing platforms and, as a result, greatly reduce the execution time of their sequence database searches. However, existing parallel sequence search tools have focused mostly on parallelizing the sequence alignment engine. While the computation-intensive alignment tasks become cheaper with larger machines, data-intensive initial preparation and result merging tasks become more expensive. Inefficient handling of input and output data can easily create performance bottlenecks even on supercomputers. It also causes considerable data management overhead. In this paper, we present a set of techniques for efficient and flexible data handling in parallel sequence search applications. We demonstrate our optimizations by improving mpiBLAST, an open-source parallel BLAST tool that is rapidly gaining popularity. These optimization techniques aim at enabling flexible database partitioning, reducing I/O by caching small auxiliary files and results, enabling parallel I/O on shared files, and using scalable result processing protocols. As a result, we reduce mpiBLAST users' operational overhead by removing the requirement of prepartitioning databases. Meanwhile, our experiments show that these techniques can bring an order-of-magnitude improvement to both the overall performance and scalability of mpiBLAST.
AB - Searching biological sequence databases is one of the most routine tasks in computational biology. This task is significantly hampered by the exponential growth in sequence database sizes. Recent advances in parallelization of biological sequence search applications have enabled bioinformatics researchers to utilize high-performance computing platforms and, as a result, greatly reduce the execution time of their sequence database searches. However, existing parallel sequence search tools have focused mostly on parallelizing the sequence alignment engine. While the computation-intensive alignment tasks become cheaper with larger machines, data-intensive initial preparation and result merging tasks become more expensive. Inefficient handling of input and output data can easily create performance bottlenecks even on supercomputers. It also causes considerable data management overhead. In this paper, we present a set of techniques for efficient and flexible data handling in parallel sequence search applications. We demonstrate our optimizations by improving mpiBLAST, an open-source parallel BLAST tool that is rapidly gaining popularity. These optimization techniques aim at enabling flexible database partitioning, reducing I/O by caching small auxiliary files and results, enabling parallel I/O on shared files, and using scalable result processing protocols. As a result, we reduce mpiBLAST users' operational overhead by removing the requirement of prepartitioning databases. Meanwhile, our experiments show that these techniques can bring an order-of-magnitude improvement to both the overall performance and scalability of mpiBLAST.
UR - http://www.scopus.com/inward/record.url?scp=33746293354&partnerID=8YFLogxK
U2 - 10.1109/IPDPS.2005.190
DO - 10.1109/IPDPS.2005.190
M3 - Conference contribution
AN - SCOPUS:33746293354
SN - 0769523129
SN - 9780769523125
T3 - Proceedings - 19th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2005
SP - 72b
BT - Proceedings - 19th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2005
T2 - 19th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2005
Y2 - 4 April 2005 through 8 April 2005
ER -