TY - GEN
T1 - Semantic Stealth
T2 - 16th ACM Workshop on Artificial Intelligence and Security, AISec 2024, co-located with CCS 2024
AU - Roa, Camila
AU - Mahbub, Maria
AU - Srinivasan, Sudarshan
AU - Begoli, Edmon
AU - Sadovnik, Amir
N1 - Publisher Copyright: © 2024 Copyright held by the owner/author(s).
PY - 2024/11/22
Y1 - 2024/11/22
AB - Deep learning models have been shown to be vulnerable to adversarial attacks, in which perturbations to their inputs cause the model to produce incorrect predictions. As opposed to adversarial attacks in computer vision, where small changes introduced to pixel values can drastically alter a model’s output while remaining imperceptible to humans, text-based attacks are difficult to conceal due to the discrete nature of tokens. Consequently, unconstrained gradient-based attacks often produce adversarial examples that lack semantic meaning, rendering them detectable through visual inspection or perplexity filters. In contrast to methods that rely on gradient-based optimization in the embedding space, we propose an approach that leverages a Large Language Model’s ability to generate grammatically correct and semantically meaningful text to craft adversarial patches that seamlessly blend in with the original input text. These patches can be used to alter the behavior of a target model, such as a text classifier. Since our approach does not rely on gradient backpropagation, it only requires access to the target model’s confidence scores, making it a grey-box attack. We demonstrate the feasibility of our approach using open-source LLMs, including Intel’s Neural Chat, Llama2, and Mistral-Instruct, to generate adversarial patches capable of altering the predictions of a distilBERT model fine-tuned on the IMDB reviews dataset for sentiment classification.
KW - adversarial attack
KW - adversarial patches
KW - large language model
KW - sentiment classification
KW - transformer-based model
UR - https://www.scopus.com/pages/publications/85216572675
U2 - 10.1145/3689932.3694758
DO - 10.1145/3689932.3694758
M3 - Conference contribution
AN - SCOPUS:85216572675
T3 - AISec 2024 - Proceedings of the 2024 Workshop on Artificial Intelligence and Security, Co-Located with: CCS 2024
SP - 42
EP - 52
BT - AISec 2024 - Proceedings of the 2024 Workshop on Artificial Intelligence and Security, Co-Located with: CCS 2024
PB - Association for Computing Machinery, Inc
Y2 - 14 October 2024 through 18 October 2024
ER -