TY - GEN
T1 - Perturbed Gibbs samplers for generating large-scale privacy-safe synthetic health data
AU - Park, Yubin
AU - Ghosh, Joydeep
AU - Shankar, Mallikarjun
PY - 2013
Y1 - 2013
N2 - This paper introduces a non-parametric data synthesizing algorithm to generate privacy-safe "realistic but not real" synthetic health data. Our goal is to provide a systematic mechanism that guarantees an adequate and controllable level of privacy while substantially improving on the utility of public use data, compared to current practices by CMS, OSHPD and other agencies. The proposed algorithm synthesizes artificial records while preserving the statistical characteristics of the original data to the extent possible. The risk from a "database linking attack" is quantified by either an l-diversified or a differentially perturbed data generation process. Moreover, its algorithmic performance is optimized using Locality-Sensitive Hashing and parallel computation techniques to yield a linear-time algorithm that is suitable for Big Data Health applications. We synthesize a public Medicare claim dataset using the proposed algorithm, and demonstrate multiple data mining applications and statistical analyses using the data. The synthetic dataset delivers results that are substantially identical to those obtained from the original dataset, without revealing the actual records.
AB - This paper introduces a non-parametric data synthesizing algorithm to generate privacy-safe "realistic but not real" synthetic health data. Our goal is to provide a systematic mechanism that guarantees an adequate and controllable level of privacy while substantially improving on the utility of public use data, compared to current practices by CMS, OSHPD and other agencies. The proposed algorithm synthesizes artificial records while preserving the statistical characteristics of the original data to the extent possible. The risk from a "database linking attack" is quantified by either an l-diversified or a differentially perturbed data generation process. Moreover, its algorithmic performance is optimized using Locality-Sensitive Hashing and parallel computation techniques to yield a linear-time algorithm that is suitable for Big Data Health applications. We synthesize a public Medicare claim dataset using the proposed algorithm, and demonstrate multiple data mining applications and statistical analyses using the data. The synthetic dataset delivers results that are substantially identical to those obtained from the original dataset, without revealing the actual records.
KW - Gibbs Sampler
KW - Healthcare
KW - Non-parametric
KW - Privacy
KW - Synthetic Data
UR - http://www.scopus.com/inward/record.url?scp=84893472037&partnerID=8YFLogxK
U2 - 10.1109/ICHI.2013.76
DO - 10.1109/ICHI.2013.76
M3 - Conference contribution
AN - SCOPUS:84893472037
SN - 9780769550893
T3 - Proceedings - 2013 IEEE International Conference on Healthcare Informatics, ICHI 2013
SP - 493
EP - 498
BT - Proceedings - 2013 IEEE International Conference on Healthcare Informatics, ICHI 2013
T2 - 2013 1st IEEE International Conference on Healthcare Informatics, ICHI 2013
Y2 - 9 September 2013 through 11 September 2013
ER -