A study on the evaluation of tokenizer performance in natural language processing

Sanghyun Choo, Wonjoon Kim

Research output: Contribution to journal, Article, peer-reviewed

10 Scopus citations

Abstract

The present study compares the performance of two tokenizers, Mecab-Ko and SentencePiece, in natural language processing for sentiment analysis. Each tokenizer was evaluated with five algorithms: Naive Bayes (NB), k-Nearest Neighbor (kNN), Support Vector Machine (SVM), Artificial Neural Networks (ANN), and Long Short-Term Memory Recurrent Neural Networks (LSTM-RNN). Performance was assessed using four widely used metrics: accuracy, precision, recall, and F1-score. The results indicated that SentencePiece performed better than Mecab-Ko. To ensure the validity of the results, paired t-tests were conducted on the evaluation outcomes. The study concludes that SentencePiece demonstrated superior classification performance, especially with ANN and LSTM-RNN, when used to interpret customer sentiment in Korean online reviews. Furthermore, SentencePiece can assign specific meanings to short words or jargon that are commonly used in product evaluations but not defined beforehand.
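The evaluation procedure described in the abstract (four classification metrics per model, followed by a paired t-test across tokenizers) can be illustrated with a minimal sketch. This is not the authors' code: the function names `classification_metrics` and `paired_t_statistic` are hypothetical, the labels and scores are made-up toy values rather than results from the study, and only the Python standard library is assumed.

```python
import math
from statistics import mean, stdev


def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees of freedom) for a paired t-test on matched scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # t = mean difference divided by the standard error of the differences.
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Toy sentiment labels (1 = positive review, 0 = negative review); hypothetical.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)

# Hypothetical per-model scores for two tokenizers (NOT figures from the study).
tokenizer_a_scores = [0.80, 0.82, 0.85, 0.88, 0.90]
tokenizer_b_scores = [0.78, 0.81, 0.82, 0.84, 0.85]
t_value, dof = paired_t_statistic(tokenizer_a_scores, tokenizer_b_scores)
print(f"t = {t_value:.3f}, degrees of freedom = {dof}")
```

The t statistic would then be compared against a t-distribution with the reported degrees of freedom to obtain a p-value; a statistics package can perform that step directly.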

Original language: English
Article number: 2175112
Journal: Applied Artificial Intelligence
Volume: 37
Issue number: 1
State: Published - 2023
Externally published: Yes

Funding

This research was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT: Ministry of Science and ICT) (No. 2020R1G1A1003384).

Funder: Ministry of Science, ICT and Future Planning (funder number: 2020R1G1A1003384)
Funder: National Research Foundation of Korea
