TY - GEN
T1 - Parallel in situ indexing for data-intensive computing
AU - Kim, Jinoh
AU - Abbasi, Hasan
AU - Chacón, Luis
AU - Docan, Ciprian
AU - Klasky, Scott
AU - Liu, Qing
AU - Podhorszki, Norbert
AU - Shoshani, Arie
AU - Wu, Kesheng
PY - 2011
Y1 - 2011
N2 - As computing power increases exponentially, many scientific research activities create vast amounts of data. However, the bandwidth for storing data to disk and reading it back has been improving at a much slower pace. These two trends produce an ever-widening data access gap. Our work brings together two distinct technologies to address this data access issue: indexing and in situ processing. From decades of database research literature, we know that indexing is an effective way to address the data access issue, particularly for accessing a relatively small fraction of data records. As data sets increase in size, more and more analysts need to use selective data access, which makes indexing even more important for improving data access. The challenge is that most implementations of indexing technology are embedded in large database management systems (DBMS), but most scientific datasets are not managed by any DBMS. In this work, we choose to include indexes with the scientific data instead of requiring the data to be loaded into a DBMS. We use compressed bitmap indexes from the FastBit software, which are known to be highly effective for query-intensive workloads common to scientific data analysis. To use the indexes, we need to build them first. The index building procedure needs to access the whole data set and may also require a significant amount of compute time. In this work, we adapt in situ processing technology to generate the indexes, thus removing the need to read data from disk and allowing the indexes to be built in parallel. The in situ data processing system used is ADIOS, a middleware for high-performance I/O. Our experimental results show that the indexes can improve data access time by up to 200 times, depending on the fraction of data selected, and that using an in situ data processing system can reduce the time needed to create the indexes by up to 10 times when using identical parallel settings.
AB - As computing power increases exponentially, many scientific research activities create vast amounts of data. However, the bandwidth for storing data to disk and reading it back has been improving at a much slower pace. These two trends produce an ever-widening data access gap. Our work brings together two distinct technologies to address this data access issue: indexing and in situ processing. From decades of database research literature, we know that indexing is an effective way to address the data access issue, particularly for accessing a relatively small fraction of data records. As data sets increase in size, more and more analysts need to use selective data access, which makes indexing even more important for improving data access. The challenge is that most implementations of indexing technology are embedded in large database management systems (DBMS), but most scientific datasets are not managed by any DBMS. In this work, we choose to include indexes with the scientific data instead of requiring the data to be loaded into a DBMS. We use compressed bitmap indexes from the FastBit software, which are known to be highly effective for query-intensive workloads common to scientific data analysis. To use the indexes, we need to build them first. The index building procedure needs to access the whole data set and may also require a significant amount of compute time. In this work, we adapt in situ processing technology to generate the indexes, thus removing the need to read data from disk and allowing the indexes to be built in parallel. The in situ data processing system used is ADIOS, a middleware for high-performance I/O. Our experimental results show that the indexes can improve data access time by up to 200 times, depending on the fraction of data selected, and that using an in situ data processing system can reduce the time needed to create the indexes by up to 10 times when using identical parallel settings.
KW - D.4.2 [Storage Management]: Access methods
KW - H.3.3 [Information Search and Retrieval]
KW - [D.4.2]: Storage Management
UR - http://www.scopus.com/inward/record.url?scp=84055192852&partnerID=8YFLogxK
U2 - 10.1109/LDAV.2011.6092319
DO - 10.1109/LDAV.2011.6092319
M3 - Conference contribution
AN - SCOPUS:84055192852
SN - 9781467301541
T3 - 1st IEEE Symposium on Large-Scale Data Analysis and Visualization 2011, LDAV 2011 - Proceedings
SP - 65
EP - 72
BT - 1st IEEE Symposium on Large-Scale Data Analysis and Visualization 2011, LDAV 2011 - Proceedings
T2 - 1st IEEE Symposium on Large-Scale Data Analysis and Visualization 2011, LDAV 2011
Y2 - 23 October 2011 through 24 October 2011
ER -