TY - GEN
T1 - Dimension reduction and visualization of large high-dimensional data via interpolation
AU - Bae, Seung Hee
AU - Choi, Jong Youl
AU - Qiu, Judy
AU - Fox, Geoffrey C.
PY - 2010
Y1 - 2010
N2 - The recent explosion of publicly available biological gene sequences and chemical compounds offers an unprecedented opportunity for data mining. To make data analysis feasible for such voluminous, high-dimensional scientific data, we apply high-performance dimension reduction algorithms. Dimension reduction facilitates the investigation of unknown structures through three-dimensional visualization. Among the known dimension reduction algorithms, we use multidimensional scaling (MDS) and generative topographic mapping (GTM) to map the given high-dimensional data into the target dimension. However, both algorithms require large amounts of physical memory and computation. We therefore propose an interpolation approach that uses the mapping of only a subset of the given data. This approach effectively reduces computational complexity. At the cost of a minor approximation, the interpolation method makes it possible to process millions of data points with modest computation and memory requirements. Because huge amounts of data are involved, we also present how to parallelize the proposed interpolation algorithms. Evaluating the interpolated MDS by the STRESS criterion requires a symmetric all-pairwise computation in which each process holds only a subset of the required data, so we also propose a simple but efficient parallel mechanism for this computation. Our experimental results show that the quality of the interpolated mappings is comparable to that of mappings produced by the original algorithms alone. In terms of parallel performance, the interpolation methods parallelize well, with high efficiency. With the proposed interpolation method, we construct a configuration of two million out-of-sample data points in the target dimension, and the number of out-of-sample points can be increased further.
AB - The recent explosion of publicly available biological gene sequences and chemical compounds offers an unprecedented opportunity for data mining. To make data analysis feasible for such voluminous, high-dimensional scientific data, we apply high-performance dimension reduction algorithms. Dimension reduction facilitates the investigation of unknown structures through three-dimensional visualization. Among the known dimension reduction algorithms, we use multidimensional scaling (MDS) and generative topographic mapping (GTM) to map the given high-dimensional data into the target dimension. However, both algorithms require large amounts of physical memory and computation. We therefore propose an interpolation approach that uses the mapping of only a subset of the given data. This approach effectively reduces computational complexity. At the cost of a minor approximation, the interpolation method makes it possible to process millions of data points with modest computation and memory requirements. Because huge amounts of data are involved, we also present how to parallelize the proposed interpolation algorithms. Evaluating the interpolated MDS by the STRESS criterion requires a symmetric all-pairwise computation in which each process holds only a subset of the required data, so we also propose a simple but efficient parallel mechanism for this computation. Our experimental results show that the quality of the interpolated mappings is comparable to that of mappings produced by the original algorithms alone. In terms of parallel performance, the interpolation methods parallelize well, with high efficiency. With the proposed interpolation method, we construct a configuration of two million out-of-sample data points in the target dimension, and the number of out-of-sample points can be increased further.
KW - GTM
KW - Interpolation
KW - MDS
UR - http://www.scopus.com/inward/record.url?scp=78650015538&partnerID=8YFLogxK
U2 - 10.1145/1851476.1851501
DO - 10.1145/1851476.1851501
M3 - Conference contribution
AN - SCOPUS:78650015538
SN - 9781605589428
T3 - HPDC 2010 - Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
SP - 203
EP - 214
BT - HPDC 2010 - Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
T2 - 19th ACM International Symposium on High Performance Distributed Computing, HPDC 2010
Y2 - 21 June 2010 through 25 June 2010
ER -