TY - GEN
T1 - Apply Block Index Technique to Scientific Data Analysis and I/O Systems
AU - Wu, Tzuhsien
AU - Chou, Jerry
AU - Podhorszki, Norbert
AU - Gu, Junmin
AU - Tian, Yuan
AU - Klasky, Scott
AU - Wu, Kesheng
N1 - Publisher Copyright: © 2017 IEEE.
PY - 2017/7/10
Y1 - 2017/7/10
N2 - Scientific discoveries increasingly rely on the analysis of massive amounts of data. The ability to directly access the most relevant data records through queries, without sifting through all of them, becomes essential. However, scientific datasets are commonly stored on parallel file systems and I/O systems that are optimized for reading and writing large chunks of data, and many scientific datasets exhibit spatial-temporal data similarity, such that records with similar values are often located in close proximity to each other. Therefore, our previous work began to investigate the benefit of using a block range index technique for scientific datasets, which records only the value range of all the records in a data block. In this paper, we extend our work in several aspects. First, we implement our block index technique and integrate it with the ADIOS I/O system. Second, we show that our proposed method can be significantly better than the existing minmax and bitmap indexing methods supported in ADIOS, and can still have comparable performance in the worst case. Third, we propose several techniques that take advantage of the block index information to greatly reduce data retrieval time for query results. Fourth, we evaluate our approach using several real scientific datasets and analyze their spatial-temporal data similarity characteristics. Through our study, we believe the block index can be an effective indexing technique for scientific datasets with little implementation and operating overhead. Its size is small enough for building the indexes on the fly, and yet its query information is sufficient for efficient data access.
AB - Scientific discoveries increasingly rely on the analysis of massive amounts of data. The ability to directly access the most relevant data records through queries, without sifting through all of them, becomes essential. However, scientific datasets are commonly stored on parallel file systems and I/O systems that are optimized for reading and writing large chunks of data, and many scientific datasets exhibit spatial-temporal data similarity, such that records with similar values are often located in close proximity to each other. Therefore, our previous work began to investigate the benefit of using a block range index technique for scientific datasets, which records only the value range of all the records in a data block. In this paper, we extend our work in several aspects. First, we implement our block index technique and integrate it with the ADIOS I/O system. Second, we show that our proposed method can be significantly better than the existing minmax and bitmap indexing methods supported in ADIOS, and can still have comparable performance in the worst case. Third, we propose several techniques that take advantage of the block index information to greatly reduce data retrieval time for query results. Fourth, we evaluate our approach using several real scientific datasets and analyze their spatial-temporal data similarity characteristics. Through our study, we believe the block index can be an effective indexing technique for scientific datasets with little implementation and operating overhead. Its size is small enough for building the indexes on the fly, and yet its query information is sufficient for efficient data access.
KW - IO systems
KW - Indexing
KW - Query analysis
KW - Scientific data
UR - http://www.scopus.com/inward/record.url?scp=85027465842&partnerID=8YFLogxK
U2 - 10.1109/CCGRID.2017.37
DO - 10.1109/CCGRID.2017.37
M3 - Conference contribution
AN - SCOPUS:85027465842
T3 - Proceedings - 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGRID 2017
SP - 865
EP - 871
BT - Proceedings - 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGRID 2017
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGRID 2017
Y2 - 14 May 2017 through 17 May 2017
ER -