Mr. Clean: An Ensemble of Data Cleaning Algorithms for Increased Data Retention

Kenneth Smith, Sharlee Climer

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

Handling missing data is a critical issue in nearly all analytical fields. Techniques that address this problem are generally categorized as either imputation or deletion algorithms. Imputation techniques replace missing data based on observed values and an assumed relationship, which can bias both the imputed values and the analysis results. Deletion programs remove missing data by deleting the corresponding sample or feature. This manuscript focuses on partial deletion, where some missing data is allowed, but samples and features with excessive missing data are removed. By intelligently selecting which rows and columns to delete, more valid data can be retained than with traditional deletion techniques. We developed three new algorithms for partial deletion: a greedy algorithm and two mathematical optimization programs. We compare these methods against the DataRetainer, Auto-miss, list-wise, and feature-wise programs, using several real-world data sets and a range of allowed missingness values. Our Greedy algorithm outperforms or ties existing algorithms in terms of run time and valid elements kept in nearly all scenarios. Our mathematical optimization programs further increase the number of valid elements kept, but at additional computational cost. These programs will allow researchers to retain more of their precious data, thereby strengthening downstream analyses.
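
To make the distinction between list-wise, feature-wise, and partial deletion concrete, the following is a minimal Python sketch. It is not the authors' Greedy algorithm, whose details are not given in the abstract. It assumes that missing cells are encoded as `None`, and that "allowed missingness" means every retained row and column may have at most a fraction `gamma` of missing cells, measured within the retained submatrix. The function names (`listwise`, `featurewise`, `greedy_partial`, `valid_count`) and the toy matrix are hypothetical and chosen only for illustration.

```python
from typing import Optional, Sequence, Set, Tuple

Matrix = Sequence[Sequence[Optional[float]]]


def valid_count(data: Matrix, rows: Set[int], cols: Set[int]) -> int:
    """Count the non-missing cells in the retained submatrix."""
    return sum(1 for r in rows for c in cols if data[r][c] is not None)


def listwise(data: Matrix) -> Tuple[Set[int], Set[int]]:
    """List-wise deletion: drop every row containing a missing value."""
    rows = {r for r, row in enumerate(data) if all(v is not None for v in row)}
    return rows, set(range(len(data[0])))


def featurewise(data: Matrix) -> Tuple[Set[int], Set[int]]:
    """Feature-wise deletion: drop every column containing a missing value."""
    n_rows, n_cols = len(data), len(data[0])
    cols = {c for c in range(n_cols)
            if all(data[r][c] is not None for r in range(n_rows))}
    return set(range(n_rows)), cols


def greedy_partial(data: Matrix, gamma: float) -> Tuple[Set[int], Set[int]]:
    """Illustrative greedy partial deletion (an assumption, not the paper's method).

    Repeatedly delete the retained row or column with the highest missing
    fraction until every retained row and column is at or below gamma.
    """
    rows = set(range(len(data)))
    cols = set(range(len(data[0])))
    while rows and cols:
        # Missing fractions are recomputed within the current submatrix.
        row_frac = {r: sum(data[r][c] is None for c in cols) / len(cols) for r in rows}
        col_frac = {c: sum(data[r][c] is None for r in rows) / len(rows) for c in cols}
        worst_row = max(sorted(rows), key=row_frac.get)
        worst_col = max(sorted(cols), key=col_frac.get)
        if max(row_frac[worst_row], col_frac[worst_col]) <= gamma:
            break  # every retained row and column satisfies the threshold
        if row_frac[worst_row] >= col_frac[worst_col]:
            rows.remove(worst_row)
        else:
            cols.remove(worst_col)
    return rows, cols


# Toy 4x4 matrix with 5 missing cells (hypothetical data).
data = [
    [1.0, 2.0, None, 4.0],
    [1.0, None, None, 4.0],
    [1.0, 2.0, 3.0, 4.0],
    [None, 2.0, None, 4.0],
]

print(valid_count(data, *listwise(data)))              # 4
print(valid_count(data, *featurewise(data)))           # 4
print(valid_count(data, *greedy_partial(data, 0.25)))  # 6
```

On this toy matrix, list-wise and feature-wise deletion each keep 4 valid elements, while the greedy sketch with `gamma = 0.25` keeps 6 by deleting one column and two rows. Raising `gamma` lets more rows and columns survive, which is the trade-off the "range of allowed missingness values" in the abstract refers to.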

Original language: English
Title of host publication: Proceedings - 2023 2023 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2023
Editors: Xingpeng Jiang, Haiying Wang, Reda Alhajj, Xiaohua Hu, Felix Engel, Mufti Mahmud, Nadia Pisanti, Xuefeng Cui, Hong Song
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 3149-3156
Number of pages: 8
ISBN (Electronic): 9798350337488
State: Published - 2023
Externally published: Yes
Event: 2023 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2023 - Istanbul, Turkey
Duration: Dec 5 2023 to Dec 8 2023

Publication series

Name: Proceedings - 2023 2023 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2023

Conference

Conference: 2023 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2023
Country/Territory: Turkey
City: Istanbul
Period: 12/5/23 to 12/8/23

Funding

Alzheimer’s Association Research Grant (AARG-22-925002), University of Missouri - St. Louis Research Grants

Funder: University of Missouri (no funder number listed)

    Keywords

    • Data Cleaning
    • data deletion
    • imputation
