Enhancing Text Classification Models with Generative AI-aided Data Augmentation

Huanhuan Zhao, Haihua Chen, Hong Jun Yoon

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citation

Abstract

This study investigated whether text classification performance can be improved by augmenting the training dataset with external knowledge samples generated by a generative AI, specifically ChatGPT. Experiments were conducted on three models (CNN, HiSAN, and BERT) using the Reuters dataset. First, the study evaluated the effectiveness of incorporating ChatGPT-generated samples; it then analyzed how sample size, sample variability, and sample length affect model performance by varying the number, variety, and length of the generated samples. The models were assessed using macro- and micro-averaged scores. The macro-averaged scores improved significantly across all three models, with the BERT model showing the greatest improvement (from 49.87% to 65.73% in macro F1 score). The study further found that adding 30 distinct samples produced better results than adding 6 duplicates of 5 samples, and that samples with 150 and 256 words performed similarly, while those with 50 words performed slightly worse. These findings suggest that incorporating external knowledge samples generated by a generative AI is an effective way to enhance the performance of text classification models. The study also highlights that the variability of articles generated by ChatGPT positively impacted the models' accuracy, and that longer synthesized texts convey more comprehensive information on the subjects, leading to higher classification accuracy scores. Additionally, we compared our results with those obtained from EDA, a widely used data augmentation generator. The findings demonstrate that our method surpasses EDA and offers additional advantages by reducing computational costs and addressing the zero-shot problem. Our code is available on GitHub [1].
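The abstract reports both macro- and micro-averaged scores, and the gap between them matters on an imbalanced dataset. The following is a minimal sketch, not the authors' evaluation code: it assumes single-label predictions, and the function name `f1_scores` and the toy labels are hypothetical. It illustrates why a classifier that ignores a rare class can keep a high micro F1 while its macro F1 drops sharply, which is the kind of shortfall that added minority-class samples are meant to reduce.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: average the per-class F1 scores, so every class counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes, so frequent classes dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical toy data: 8 majority-class and 2 minority-class documents.
y_true = ["earn"] * 8 + ["cocoa"] * 2
# A classifier that always predicts the majority class.
y_pred = ["earn"] * 10

macro, micro = f1_scores(y_true, y_pred)
print(f"macro F1 = {macro:.3f}, micro F1 = {micro:.3f}")
# prints "macro F1 = 0.444, micro F1 = 0.800"
```

In this toy case the minority class contributes an F1 of 0 to the macro average, while the micro average is barely affected.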

Original language: English
Title of host publication: Proceedings - 5th IEEE International Conference on Artificial Intelligence Testing, AITest 2023
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 138-145
Number of pages: 8
ISBN (Electronic): 9798350336290
State: Published - 2023
Event: 5th IEEE International Conference on Artificial Intelligence Testing, AITest 2023 - Athens, Greece
Duration: Jul 17 2023 to Jul 20 2023

Publication series

Name: Proceedings - 5th IEEE International Conference on Artificial Intelligence Testing, AITest 2023

Conference

Conference: 5th IEEE International Conference on Artificial Intelligence Testing, AITest 2023
Country/Territory: Greece
City: Athens
Period: 07/17/23 to 07/20/23

Funding

This manuscript has been authored in part by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

[1] https://github.com/HuanhuanZhao08/AI-data-augmentation

Funders: U.S. Department of Energy
Funder number: (none listed)

    Keywords

    • ChatGPT
    • artificial intelligence
    • data augmentation
    • imbalanced data
    • machine learning
    • natural language processing
    • text classification
