TY - GEN
T1 - Extreme Phenotype Sampling Improves LASSO and Random Forest Marker Selection for Complex Traits
AU - John, Cai
AU - Muchero, Wellington
AU - Emrich, Scott
N1 - Publisher Copyright: © 2020 IEEE.
PY - 2020/12/16
Y1 - 2020/12/16
N2 - Most attempts to fit a supervised machine learning (ML) model in bioinformatics try to predict the full range of trait or response values. While such prediction tasks effectively capture the entire phenotypic range of the samples, they are cost-prohibitive and can be statistically underpowered for detection of rare variants. In a study design known as extreme phenotype sampling (EPS), samples are selected from the two extremes of the phenotypic distribution. This approach is cost-cutting, because it reduces genotyping/sequencing costs, and is also capable of increasing statistical power. Although combining EPS with ML algorithms has the potential to enhance association studies by improving their computational efficiency, EPS-ML approaches have seen limited use. In this paper, we demonstrate an efficient and effective approach to leverage the EPS study design using LASSO regression and random forests, two commonly used ML algorithms within the broader bioinformatics community. We analyze two distinct data sets: leaf expression values generated from black cottonwood and malaria parasite transcriptome data collected from patients. We demonstrate that focusing only on the phenotypic extremes of these sample sets (by forming binary classes) can select more biologically meaningful features than using the full range. This approach will be useful to investigators when examining complex or novel traits. It is particularly well-suited to RNA-seq data, where investigators often want to narrow attention to a small number of candidate transcripts out of a large initial pool. Our approach intentionally leverages existing software with efficient implementations to enable future applications of EPS-ML.
UR - http://www.scopus.com/inward/record.url?scp=85100345606&partnerID=8YFLogxK
U2 - 10.1109/BIBM49941.2020.9313524
DO - 10.1109/BIBM49941.2020.9313524
M3 - Conference contribution
AN - SCOPUS:85100345606
T3 - Proceedings - 2020 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2020
SP - 2771
EP - 2778
BT - Proceedings - 2020 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2020
A2 - Park, Taesung
A2 - Cho, Young-Rae
A2 - Hu, Xiaohua Tony
A2 - Yoo, Illhoi
A2 - Woo, Hyun Goo
A2 - Wang, Jianxin
A2 - Facelli, Julio
A2 - Nam, Seungyoon
A2 - Kang, Mingon
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2020 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2020
Y2 - 16 December 2020 through 19 December 2020
ER -