Improving Text Classification with Large Language Model-Based Data Augmentation

Huanhuan Zhao, Haihua Chen, Thomas A. Ruggles, Yunhe Feng, Debjani Singh, Hong Jun Yoon

Research output: Contribution to journal › Article › peer-review

2 Scopus citations

Abstract

Large language models (LLMs) such as ChatGPT possess advanced capabilities in understanding and generating text. These capabilities enable ChatGPT to create text based on specific instructions, which can serve as augmented data for text classification tasks. Previous studies have approached data augmentation (DA) by either rewriting the existing dataset with ChatGPT or generating entirely new data from scratch. However, because the two methods have not been compared directly, it is unclear which is more effective. This study investigates the application of both methods to two datasets: a general-topic dataset (Reuters news data) and a domain-specific dataset (Mitigation dataset). Our findings indicate that: (1) new data generated by ChatGPT consistently enhanced the model’s classification results on both datasets; (2) generating new data generally outperforms rewriting existing data, though careful prompt design is crucial to extract the most valuable information from ChatGPT, particularly for domain-specific data; (3) the size of the augmentation data affects the effectiveness of DA, although we observed a plateau after incorporating 10 samples; and (4) combining rewritten samples with newly generated samples can potentially further improve the model’s performance.

Original language: English
Article number: 2535
Journal: Electronics (Switzerland)
Volume: 13
Issue number: 13
State: Published - Jul 2024

Funding

The paper is a substantially extended version of the IEEE AITest 2023 conference paper “Enhancing Text Classification Models with Generative AI-aided Data Augmentation”. This manuscript has been authored in part by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan, accessed on 30 April 2024). This research was partially funded by US Department of Energy’s Water Power Technologies Office.

Funders
U.S. Department of Energy
Water Power Technologies Office

    Keywords

    • ChatGPT
    • artificial intelligence
    • data augmentation
    • imbalanced data
    • large language model
    • machine learning
    • natural language processing
    • text classification
