TY - GEN
T1 - Analytics-driven lossless data compression for rapid in-situ indexing, storing, and querying
AU - Jenkins, John
AU - Arkatkar, Isha
AU - Lakshminarasimhan, Sriram
AU - Shah, Neil
AU - Schendel, Eric R.
AU - Ethier, Stephane
AU - Chang, Choong Seock
AU - Chen, Jacqueline H.
AU - Kolla, Hemanth
AU - Klasky, Scott
AU - Ross, Robert
AU - Samatova, Nagiza F.
PY - 2012
Y1 - 2012
AB - The analysis of scientific simulations is highly data-intensive and is becoming an increasingly important challenge. Peta-scale data sets require the use of light-weight query-driven analysis methods, as opposed to heavy-weight schemes that optimize for speed at the expense of size. This paper takes a step toward query processing over losslessly compressed scientific data. We propose a co-designed double-precision compression and indexing methodology for range queries by performing unique-value-based binning on the most significant bytes of double-precision data (sign, exponent, and most significant mantissa bits), and inverting the resulting metadata to produce an inverted index over a reduced data representation. Without the inverted index, our method matches or improves compression ratios over both general-purpose and floating-point compression utilities. The inverted index is light-weight, and the overall storage requirement for both reduced column and index is less than 135%, whereas existing DBMS technologies can require 200-400%. As a proof-of-concept, we evaluate univariate range queries that additionally return column values, a critical component of data analytics, against state-of-the-art bitmap indexing technology, showing multi-fold query performance improvements.
UR - http://www.scopus.com/inward/record.url?scp=84866036756&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-32597-7_2
DO - 10.1007/978-3-642-32597-7_2
M3 - Conference contribution
AN - SCOPUS:84866036756
SN - 9783642325960
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 16
EP - 30
BT - Database and Expert Systems Applications - 23rd International Conference, DEXA 2012, Proceedings
T2 - 23rd International Conference on Database and Expert Systems Applications, DEXA 2012
Y2 - 3 September 2012 through 6 September 2012
ER -