TY - GEN
T1 - On the Abuse and Detection of Polyglot Files
AU - Koch, Luke
AU - Oesch, Sean
AU - Sadovnik, Amir
AU - Weber, Brian
AU - Chaulagain, Amul
AU - Dixson, Matthew
AU - Dixon, Jared
AU - Huettel, Mike
AU - Watson, Cory
AU - Hartman, Jacob
AU - Patulski, Richard
N1 - Publisher Copyright: © 2025 Copyright held by the owner/author(s).
PY - 2025/4/28
Y1 - 2025/4/28
N2 - A polyglot is a file that is valid in two or more formats. Polyglot files pose a problem for file-upload and generative AI web interfaces that rely on format identification to determine how to securely handle incoming files. In this work we found that existing file-format and embedded-file detection tools, even those developed specifically for polyglot files, fail to reliably detect polyglot files used in the wild. To address this issue, we studied the use of polyglot files by malicious actors in the wild, finding 30 polyglot samples and 15 attack chains that leveraged polyglot files. Using knowledge from our survey of polyglot usage in the wild—the first of its kind—we created a novel data set based on adversary techniques. We then trained a machine learning detection solution, PolyConv, using this data set. PolyConv achieves a precision-recall area-under-curve score of 0.999 with an F1 score of 99.20% for polyglot detection and 99.47% for file-format identification, significantly outperforming all other tools tested. We developed a content disarmament and reconstruction tool, ImSan, that successfully sanitized 100% of the tested image-based polyglots, which were the most common type found via the survey. Our work provides concrete tools and suggestions to enable defenders to better defend themselves against polyglot files, as well as directions for future work to create more robust file specifications and methods of disarmament.
KW - APT Survey
KW - Content Disarmament
KW - File-format Identification
KW - Machine Learning
KW - Malware Detection
KW - Polyglot Files
KW - Reconstruction
UR - http://www.scopus.com/inward/record.url?scp=105005147991&partnerID=8YFLogxK
U2 - 10.1145/3696410.3714814
DO - 10.1145/3696410.3714814
M3 - Conference contribution
AN - SCOPUS:105005147991
T3 - WWW 2025 - Proceedings of the ACM Web Conference
SP - 4810
EP - 4822
BT - WWW 2025 - Proceedings of the ACM Web Conference
PB - Association for Computing Machinery, Inc
T2 - 34th ACM Web Conference, WWW 2025
Y2 - 28 April 2025 through 2 May 2025
ER -