A Fast, Provably Accurate Approximation Algorithm for Sparse Principal Component Analysis Reveals Human Genetic Variation Across the World

Agniva Chowdhury, Aritra Bose, Samson Zhou, David P. Woodruff, Petros Drineas

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

4 Scopus citations

Abstract

Principal component analysis (PCA) is a widely used dimensionality reduction technique in machine learning and multivariate statistics. To improve the interpretability of PCA, various approaches to obtain sparse principal direction loadings have been proposed, which are termed Sparse Principal Component Analysis (SPCA). In this paper, we present ThreSPCA, a provably accurate algorithm based on thresholding the Singular Value Decomposition for the SPCA problem, without imposing any restrictive assumptions on the input covariance matrix. Our thresholding algorithm is conceptually simple, much faster than the current state of the art, and performs well in practice. When applied to genotype data from the 1000 Genomes Project, ThreSPCA is faster than previous benchmarks, at least as accurate, and leads to a set of interpretable biomarkers, revealing genetic diversity across the world.
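To make the idea of thresholding a spectral decomposition concrete, the following is a minimal sketch of a generic thresholding heuristic for sparse PCA. It is not the ThreSPCA algorithm from the paper, whose exact steps are not given in this record. The function name `threshold_sparse_pc`, the sparsity parameter `k`, and the synthetic data are all assumptions made for illustration: the sketch computes the leading eigenvector of a covariance matrix, keeps its `k` largest-magnitude entries, and renormalizes.

```python
import numpy as np

def threshold_sparse_pc(cov, k):
    """Hypothetical helper: k-sparse unit vector from the leading eigenvector."""
    # eigh returns eigenvalues in ascending order for a symmetric matrix,
    # so the last column is the leading eigenvector.
    _, eigvecs = np.linalg.eigh(cov)
    v = eigvecs[:, -1]
    # Indices of the k entries with the largest absolute value.
    keep = np.argsort(np.abs(v))[-k:]
    x = np.zeros_like(v)
    x[keep] = v[keep]
    norm = np.linalg.norm(x)
    return x / norm if norm > 0 else x

# Synthetic data (an assumption, not genotype data): 50 features,
# the first 5 of which share a common signal.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
X[:, :5] += 3.0 * rng.standard_normal((200, 1))
X -= X.mean(axis=0)
cov = X.T @ X / X.shape[0]

x = threshold_sparse_pc(cov, k=5)
explained = float(x @ cov @ x)              # variance captured by the sparse direction
top_eigenvalue = float(np.linalg.eigvalsh(cov)[-1])
print(np.count_nonzero(x), explained, top_eigenvalue)
```

The variance captured by any unit vector, sparse or not, cannot exceed the top eigenvalue of the covariance matrix, so comparing `explained` with `top_eigenvalue` gives a rough sense of how much is lost by enforcing sparsity in this simplified setting.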

Original language: English
Title of host publication: Research in Computational Molecular Biology - 26th Annual International Conference, RECOMB 2022, Proceedings
Editors: Itsik Pe’er
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 86-106
Number of pages: 21
ISBN (Print): 9783031047480
State: Published - 2022
Event: 26th International Conference on Research in Computational Molecular Biology, RECOMB 2022 - San Diego, United States
Duration: May 22 2022 to May 25 2022

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 13278 LNBI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 26th International Conference on Research in Computational Molecular Biology, RECOMB 2022
Country/Territory: United States
City: San Diego
Period: 05/22/22 to 05/25/22

Funding

Acknowledgements. PD and AC were partially supported by the National Science Foundation (NSF) 10001390, NSF III-10001674, NSF III-10001225, and an IBM Faculty Award to PD. AB was supported by IBM. DPW and SZ acknowledge partial support from NSF grant No. CCF-181584, Office of Naval Research (ONR) grant N00014-18-1-2562, National Institutes of Health (NIH) grant 5401 HG 10798-2, and a Simons Investigator Award.

Keywords

  • Population stratification
  • Population structure
  • Principal Component Analysis
  • Sparse PCA
