Extreme Phenotype Sampling Improves LASSO and Random Forest Marker Selection for Complex Traits

Cai John, Wellington Muchero, Scott Emrich

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

1 Scopus citation

Abstract

Most attempts to fit a supervised machine learning (ML) model in bioinformatics try to predict the full range of trait or response values. While such prediction tasks effectively capture the entire phenotypic range of the samples, they are cost-prohibitive and can be statistically underpowered for detection of rare variants. In a study design known as extreme phenotype sampling (EPS), samples are selected from the two extremes of the phenotypic distribution. This approach cuts genotyping/sequencing costs and can also increase statistical power. Although combining EPS with ML algorithms has the potential to enhance association studies by improving their computational efficiency, EPS-ML approaches have seen limited use. In this paper we demonstrate an efficient and effective approach to leverage the EPS study design using LASSO regression and random forests, two ML algorithms commonly used within the broader bioinformatics community. We analyze two distinct data sets: leaf expression values generated from black cottonwood and malaria parasite transcriptome data collected from patients. We demonstrate that focusing only on the phenotypic extremes of these sample sets (by forming binary classes) can select more biologically meaningful features than using the full range. This approach will be useful to investigators when examining complex or novel traits. It is particularly well suited to RNA-seq data, where investigators often want to narrow attention to a small number of candidate transcripts out of a large initial pool. Our approach intentionally leverages existing software with efficient implementations to enable future applications of EPS-ML.
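
The EPS-ML workflow summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the software used, so scikit-learn is assumed here, L1-penalized logistic regression stands in for LASSO on the binary classes, and the function names, the 20% tail fraction, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


def extreme_phenotype_subset(X, y, tail_fraction=0.2):
    """Keep only samples in the lower and upper tails of phenotype y.

    Returns the reduced feature matrix and binary labels
    (0 = low extreme, 1 = high extreme). tail_fraction is an
    illustrative parameter, not a value taken from the paper.
    """
    if not 0.0 < tail_fraction < 0.5:
        raise ValueError("tail_fraction must be in (0, 0.5)")
    low_cut, high_cut = np.quantile(y, [tail_fraction, 1.0 - tail_fraction])
    low = y <= low_cut
    high = y >= high_cut
    keep = low | high
    return X[keep], high[keep].astype(int)


def select_markers(X, y, tail_fraction=0.2, top_k=10, seed=0):
    """Rank candidate features using only the phenotypic extremes."""
    X_eps, labels = extreme_phenotype_subset(X, y, tail_fraction)

    # L1 penalty drives most coefficients to exactly zero; the
    # surviving features are the LASSO-style selection.
    X_std = StandardScaler().fit_transform(X_eps)
    lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
    lasso.fit(X_std, labels)
    lasso_selected = np.flatnonzero(lasso.coef_.ravel())

    # Random forest: rank features by impurity-based importance.
    forest = RandomForestClassifier(n_estimators=200, random_state=seed)
    forest.fit(X_eps, labels)
    rf_ranked = np.argsort(forest.feature_importances_)[::-1][:top_k]

    return lasso_selected, rf_ranked


# Synthetic demonstration: only features 0 and 1 influence the trait.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 50))
y_demo = 2 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(size=300)

lasso_idx, rf_idx = select_markers(X_demo, y_demo)
print("LASSO-selected features:", lasso_idx)
print("Top random-forest features:", rf_idx)
```

In this sketch the middle 60% of samples is discarded before any model is fit, which is where the cost and compute savings of EPS would come from; the tail fraction and the regularization strength `C` would need tuning on real expression data.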

Original language: English
Title of host publication: Proceedings - 2020 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2020
Editors: Taesung Park, Young-Rae Cho, Xiaohua Tony Hu, Illhoi Yoo, Hyun Goo Woo, Jianxin Wang, Julio Facelli, Seungyoon Nam, Mingon Kang
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 2771-2778
Number of pages: 8
ISBN (Electronic): 9781728162157
DOIs
State: Published - Dec 16 2020
Event: 2020 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2020 - Virtual, Seoul, Korea, Republic of
Duration: Dec 16 2020 - Dec 19 2020

Publication series

Name: Proceedings - 2020 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2020

Conference

Conference: 2020 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2020
Country/Territory: Korea, Republic of
City: Virtual, Seoul
Period: 12/16/20 - 12/19/20

Funding

This work was supported in part by the National Institutes of Health (NIH) grant P01 AI127338 and in part by the Center for Bioenergy Innovation (CBI). CBI is a U.S. DOE Bioenergy Research Center supported by the Office of Biological and Environmental Research in the DOE Office of Science. Oak Ridge National Laboratory is managed by UT-Battelle, LLC, for the U.S. DOE under contract DE-AC05-00OR22725. Part of this work was performed at the Oak Ridge Leadership Computing Facility, including resources of the Compute And Data Environment for Science (CADES).

Funders (with funder number where given)
National Institutes of Health: P01 AI127338
U.S. Department of Energy
Office of Science
Biological and Environmental Research
Oak Ridge National Laboratory
Center for Bioenergy Innovation
UT-Battelle: DE-AC05-00OR22725
