Abstract
Most attempts to fit a supervised machine learning (ML) model in bioinformatics try to predict the full range of trait or response values. While such prediction tasks effectively capture the entire phenotypic range of the samples, they are cost-prohibitive and can be statistically underpowered for detecting rare variants. In a study design known as extreme phenotype sampling (EPS), samples are selected from the two extremes of the phenotypic distribution. This design cuts genotyping and sequencing costs and can increase statistical power. Although combining EPS with ML algorithms has the potential to enhance association studies by improving their computational efficiency, EPS-ML approaches have seen limited use. In this paper we demonstrate an efficient and effective way to leverage the EPS study design using LASSO regression and random forests, two ML algorithms commonly used in the broader bioinformatics community. We analyze two distinct data sets: leaf expression values generated from black cottonwood and malaria parasite transcriptome data collected from patients. We demonstrate that focusing only on the phenotypic extremes of these sample sets (by forming binary classes) can select more biologically meaningful features than using the full range. This approach will be useful to investigators examining complex or novel traits. It is particularly well suited to RNA-seq data, where investigators often want to narrow attention to a small number of candidate transcripts out of a large initial pool. Our approach intentionally leverages existing software with efficient implementations to enable future applications of EPS-ML.
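The sampling step described in the abstract (keeping only the two phenotypic extremes and treating them as binary classes) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `extreme_phenotype_sample`, the `tail_fraction` parameter, and the toy phenotype values are all hypothetical and do not come from the paper.

```python
from typing import List, Sequence, Tuple


def extreme_phenotype_sample(
    phenotypes: Sequence[float], tail_fraction: float = 0.2
) -> Tuple[List[int], List[int]]:
    """Return (indices, labels) for samples in the two phenotypic tails.

    tail_fraction is a hypothetical parameter: the share of samples kept
    from EACH end of the ranked phenotype distribution.
    """
    if not 0.0 < tail_fraction <= 0.5:
        raise ValueError("tail_fraction must be in (0, 0.5]")
    n = len(phenotypes)
    k = max(1, int(n * tail_fraction))  # number of samples kept per tail
    if 2 * k > n:
        raise ValueError("not enough samples to form two disjoint tails")

    # Rank sample indices by phenotype value, lowest first.
    order = sorted(range(n), key=lambda i: phenotypes[i])
    low, high = order[:k], order[-k:]

    # Binary classes: 0 for the low extreme, 1 for the high extreme.
    indices = low + high
    labels = [0] * k + [1] * k
    return indices, labels


# Toy phenotype values (made up for illustration only).
phenotypes = [5.1, 0.3, 9.8, 4.7, 0.9, 8.6, 5.5, 4.9, 1.2, 9.1]
indices, labels = extreme_phenotype_sample(phenotypes, tail_fraction=0.2)
print(indices, labels)
```

The selected rows of an expression matrix, together with these binary labels, could then be handed to an existing LASSO or random forest implementation for feature selection; which tail fraction and which software the paper used are not stated in the abstract.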
| Original language | English |
|---|---|
| Title of host publication | Proceedings - 2020 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2020 |
| Editors | Taesung Park, Young-Rae Cho, Xiaohua Tony Hu, Illhoi Yoo, Hyun Goo Woo, Jianxin Wang, Julio Facelli, Seungyoon Nam, Mingon Kang |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| Pages | 2771-2778 |
| Number of pages | 8 |
| ISBN (Electronic) | 9781728162157 |
| State | Published - Dec 16 2020 |
| Event | 2020 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2020 - Virtual, Seoul, Korea, Republic of. Duration: Dec 16 2020 → Dec 19 2020 |
Publication series
| Name | Proceedings - 2020 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2020 |
|---|---|
Conference
| Conference | 2020 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2020 |
|---|---|
| Country/Territory | Korea, Republic of |
| City | Virtual, Seoul |
| Period | 12/16/20 → 12/19/20 |
Funding
This work was supported in part by the National Institutes of Health (NIH) grant P01 AI127338 and in part by the Center for Bioenergy Innovation (CBI). CBI is a U.S. DOE Bioenergy Research Center supported by the Office of Biological and Environmental Research in the DOE Office of Science. Oak Ridge National Laboratory is managed by UT-Battelle, LLC, for the U.S. DOE under contract DE-AC05-00OR22725. Part of this work was performed at the Oak Ridge Leadership Computing Facility, including resources of the Compute And Data Environment for Science (CADES).