TY - JOUR
T1 - A Keyword-Enhanced Approach to Handle Class Imbalance in Clinical Text Classification
AU - Blanchard, Andrew E.
AU - Gao, Shang
AU - Yoon, Hong Jun
AU - Christian, J. Blair
AU - Durbin, Eric B.
AU - Wu, Xiao Cheng
AU - Stroup, Antoinette
AU - Doherty, Jennifer
AU - Schwartz, Stephen M.
AU - Wiggins, Charles
AU - Coyle, Linda
AU - Penberthy, Lynne
AU - Tourassi, Georgia D.
N1 - Publisher Copyright: © 2013 IEEE.
PY - 2022/6/1
Y1 - 2022/6/1
N2 - Recent applications of deep learning have shown promising results for classifying unstructured text in the healthcare domain. However, the reliability of models in production settings has been hindered by imbalanced data sets in which a small subset of the classes dominate. In the absence of adequate training data, rare classes necessitate additional model constraints for robust performance. Here, we present a strategy for incorporating short sequences of text (i.e. keywords) into training to boost model accuracy on rare classes. In our approach, we assemble a set of keywords, including short phrases, associated with each class. The keywords are then used as additional data during each batch of model training, resulting in a training loss that has contributions from both raw data and keywords. We evaluate our approach on classification of cancer pathology reports, which shows a substantial increase in model performance for rare classes. Furthermore, we analyze the impact of keywords on model output probabilities for bigrams, providing a straightforward method to identify model difficulties for limited training data.
AB - Recent applications of deep learning have shown promising results for classifying unstructured text in the healthcare domain. However, the reliability of models in production settings has been hindered by imbalanced data sets in which a small subset of the classes dominate. In the absence of adequate training data, rare classes necessitate additional model constraints for robust performance. Here, we present a strategy for incorporating short sequences of text (i.e. keywords) into training to boost model accuracy on rare classes. In our approach, we assemble a set of keywords, including short phrases, associated with each class. The keywords are then used as additional data during each batch of model training, resulting in a training loss that has contributions from both raw data and keywords. We evaluate our approach on classification of cancer pathology reports, which shows a substantial increase in model performance for rare classes. Furthermore, we analyze the impact of keywords on model output probabilities for bigrams, providing a straightforward method to identify model difficulties for limited training data.
KW - Machine learning
KW - medical information systems
KW - natural language processing
UR - http://www.scopus.com/inward/record.url?scp=85123347177&partnerID=8YFLogxK
U2 - 10.1109/JBHI.2022.3141976
DO - 10.1109/JBHI.2022.3141976
M3 - Article
C2 - 35020599
AN - SCOPUS:85123347177
SN - 2168-2194
VL - 26
SP - 2796
EP - 2803
JO - IEEE Journal of Biomedical and Health Informatics
JF - IEEE Journal of Biomedical and Health Informatics
IS - 6
ER -