Fast, low-memory detection and localization of large, polymorphic inversions from SNPs

Ronald J. Nowling, Fabian Fallas-Moy, Amir Sadovnik, Scott Emrich, Matthew Aleck, Daniel Leskiewicz, John G. Peters

Research output: Contribution to journal › Article › peer-review

3 Scopus citations

Abstract

Background: Large (>1 Mb), polymorphic inversions have substantial impacts on population structure and maintenance of genotypes. These large inversions can be detected from single nucleotide polymorphism (SNP) data using unsupervised learning techniques such as principal component analysis (PCA). Constructing and analyzing a feature matrix from millions of SNPs requires a large amount of memory, which limits the size of the data sets that can be analyzed. Methods: We propose using feature hashing to construct the feature matrix from a VCF file of SNPs in order to reduce memory usage. The matrix is constructed in a streaming fashion, so the entire VCF file is never loaded into memory at one time. Results: When evaluated on Anopheles mosquito and Drosophila fly data sets, our approach reduced memory usage by 97% with minimal reductions in accuracy for inversion detection and localization tasks. Conclusion: With these changes, inversions in larger data sets can be analyzed easily and efficiently on common laptop and desktop computers. Our method is publicly available through our open-source inversion analysis software, Asaph.
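
The streaming, hashed construction described in the Methods can be illustrated with a minimal sketch. This is not Asaph's implementation. The function names, the `chrom:pos:allele` feature key, the signed-hash scheme, and the `n_features` default are all assumptions chosen for illustration. The point it shows is that the matrix has a fixed width of `n_features` columns regardless of how many SNPs the VCF contains, and that records are consumed one line at a time.

```python
import hashlib
import io


def hashed_index_and_sign(key, n_features):
    """Map a feature key to a (column, sign) pair with a stable hash.

    hashlib is used instead of the built-in hash(), which is salted
    per process for strings and so is not reproducible across runs.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "little")
    index = (value >> 1) % n_features      # column in the hashed matrix
    sign = 1.0 if value & 1 else -1.0      # signed hashing offsets collisions
    return index, sign


def stream_vcf_to_hashed_matrix(lines, n_features=1024):
    """Build a samples x n_features matrix from an iterable of VCF lines.

    Only one record is held at a time; the matrix itself is the only
    structure whose size depends on the data (number of samples).
    """
    matrix = None
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # The first 9 columns are fixed VCF fields; the rest are samples.
            n_samples = len(fields) - 9
            matrix = [[0.0] * n_features for _ in range(n_samples)]
            continue
        if matrix is None:
            raise ValueError("#CHROM header line must precede records")
        chrom, pos = fields[0], fields[1]
        for sample_idx, call in enumerate(fields[9:]):
            # Assumes GT is the first FORMAT entry, as the VCF spec requires
            # when GT is present.
            genotype = call.split(":")[0].replace("|", "/")
            for allele in genotype.split("/"):
                if allele == ".":
                    continue  # missing call contributes nothing
                key = f"{chrom}:{pos}:{allele}"  # hypothetical feature key
                idx, sign = hashed_index_and_sign(key, n_features)
                matrix[sample_idx][idx] += sign
    if matrix is None:
        raise ValueError("no #CHROM header line found")
    return matrix


# Tiny in-memory VCF used only to exercise the sketch.
demo_vcf = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ts1\ts2\ts3\n"
    "2L\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/0\t0/1\t1/1\n"
    "2L\t250\t.\tC\tT\t.\tPASS\t.\tGT\t0|1\t./.\t1|1\n"
)
matrix = stream_vcf_to_hashed_matrix(demo_vcf, n_features=16)
```

In practice the same function could be passed an open file handle instead of the in-memory demo, and the resulting matrix handed to a PCA routine. A smaller `n_features` lowers memory use at the cost of more hash collisions.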

Original language: English
Article number: e12831
Journal: PeerJ
Volume: 10
State: Published - Jan 2022
Externally published: Yes

Funding

This work was supported by the National Science Foundation under Grant No. IIS-1947257 to Ronald J Nowling. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Keywords

  • Chromosomal inversions
  • Feature hashing
  • Principal component analysis
  • Single nucleotide polymorphisms
