Abstract
This study investigated whether text classification performance can be improved by augmenting the training dataset with external knowledge samples generated by a generative AI, specifically ChatGPT. Experiments were conducted on three models (CNN, HiSAN, and BERT) using the Reuters dataset. The study first evaluated the effectiveness of incorporating ChatGPT-generated samples, then analyzed how the number, variety, and length of the generated samples affected the models' performance. The models were assessed using macro- and micro-averaged scores. The macro-averaged scores improved significantly across all three models, with the BERT model showing the greatest improvement (from 49.87% to 65.73% in macro F1 score). Adding 30 distinct samples produced better results than adding 6 duplicates of 5 samples. Samples of 150 and 256 words performed similarly, while samples of 50 words performed slightly worse. These findings suggest that incorporating external knowledge samples generated by a generative AI is an effective way to enhance the performance of text classification models. The study also highlights that the variability of the articles generated by ChatGPT positively affected the models' accuracy, and that longer synthesized texts convey more comprehensive information on the subjects, leading to higher classification accuracy. Additionally, we compared our results with those obtained from EDA, a widely used data augmentation generator. The comparison shows that our method surpasses EDA and offers additional advantages by reducing computational costs and addressing the zero-shot problem. Our code is available on GitHub (see footnote 1).
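The abstract reports macro- and micro-averaged scores, and the largest gains appear in the macro average. The following is a minimal sketch of how the two averages differ on imbalanced data. It is not the authors' evaluation code: the function name `f1_scores` and the toy labels are assumptions invented for illustration.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions.

    Hypothetical helper for illustration only.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: average the per-class F1 scores, so every class counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes, so frequent classes dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Toy imbalanced data (invented): 8 documents of a common class, 2 of a rare one.
y_true = ["common"] * 8 + ["rare"] * 2
# A classifier that never predicts the rare class.
y_pred = ["common"] * 10

macro, micro = f1_scores(y_true, y_pred)
print(f"macro F1 = {macro:.3f}, micro F1 = {micro:.3f}")
```

In this toy case the micro average stays high because the common class dominates the pooled counts, while the macro average is pulled down by the rare class that is never predicted. That is one reason a macro score can be expected to respond more strongly when augmentation adds samples for under-represented classes.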
Original language | English |
---|---|
Title of host publication | Proceedings - 5th IEEE International Conference on Artificial Intelligence Testing, AITest 2023 |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 138-145 |
Number of pages | 8 |
ISBN (Electronic) | 9798350336290 |
State | Published - 2023 |
Event | 5th IEEE International Conference on Artificial Intelligence Testing, AITest 2023 - Athens, Greece; Duration: Jul 17 2023 → Jul 20 2023 |
Publication series
Name | Proceedings - 5th IEEE International Conference on Artificial Intelligence Testing, AITest 2023 |
---|
Conference
Conference | 5th IEEE International Conference on Artificial Intelligence Testing, AITest 2023 |
---|---|
Country/Territory | Greece |
City | Athens |
Period | 07/17/23 → 07/20/23 |
Funding
This manuscript has been authored in part by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

1 https://github.com/HuanhuanZhao08/AI-data-augmentation
Keywords
- ChatGPT
- artificial intelligence
- data augmentation
- imbalanced data
- machine learning
- natural language processing
- text classification