Analytics-driven lossless data compression for rapid in-situ indexing, storing, and querying

John Jenkins, Isha Arkatkar, Sriram Lakshminarasimhan, Neil Shah, Eric R. Schendel, Stephane Ethier, Choong Seock Chang, Jacqueline H. Chen, Hemanth Kolla, Scott Klasky, Robert Ross, Nagiza F. Samatova

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

5 Scopus citations

Abstract

The analysis of scientific simulations is highly data-intensive and is becoming an increasingly important challenge. Peta-scale data sets require the use of light-weight query-driven analysis methods, as opposed to heavy-weight schemes that optimize for speed at the expense of size. This paper is an attempt in the direction of query processing over losslessly compressed scientific data. We propose a co-designed double-precision compression and indexing methodology for range queries by performing unique-value-based binning on the most significant bytes of double precision data (sign, exponent, and most significant mantissa bits), and inverting the resulting metadata to produce an inverted index over a reduced data representation. Without the inverted index, our method matches or improves compression ratios over both general-purpose and floating-point compression utilities. The inverted index is light-weight, and the overall storage requirement for both reduced column and index is less than 135%, whereas existing DBMS technologies can require 200-400%. As a proof-of-concept, we evaluate univariate range queries that additionally return column values, a critical component of data analytics, against state-of-the-art bitmap indexing technology, showing multi-fold query performance improvements.

Original languageEnglish
Title of host publicationDatabase and Expert Systems Applications - 23rd International Conference, DEXA 2012, Proceedings
Pages16-30
Number of pages15
EditionPART 2
DOIs
StatePublished - 2012
Event23rd International Conference on Database and Expert Systems Applications, DEXA 2012 - Vienna, Austria
Duration: Sep 3 2012Sep 6 2012

Publication series

NameLecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
NumberPART 2
Volume7447 LNCS
ISSN (Print)0302-9743
ISSN (Electronic)1611-3349

Conference

Conference23rd International Conference on Database and Expert Systems Applications, DEXA 2012
Country/TerritoryAustria
CityVienna
Period09/3/1209/6/12

Funding

FundersFunder number
National Science Foundation1028746, 1240682

    Fingerprint

    Dive into the research topics of 'Analytics-driven lossless data compression for rapid in-situ indexing, storing, and querying'. Together they form a unique fingerprint.

    Cite this