On the Abuse and Detection of Polyglot Files

Luke Koch, Sean Oesch, Amir Sadovnik, Brian Weber, Amul Chaulagain, Matthew Dixson, Jared Dixon, Mike Huettel, Cory Watson, Jacob Hartman, Richard Patulski

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

A polyglot is a file that is valid in two or more formats. Polyglot files pose a problem for file-upload and generative AI web interfaces that rely on format identification to determine how to securely handle incoming files. In this work we found that existing file-format and embedded-file detection tools, even those developed specifically for polyglot files, fail to reliably detect polyglot files used in the wild. To address this issue, we studied the use of polyglot files by malicious actors in the wild, finding 30 polyglot samples and 15 attack chains that leveraged polyglot files. Using knowledge from our survey of polyglot usage in the wild—the first of its kind—we created a novel data set based on adversary techniques. We then trained a machine learning detection solution, PolyConv, using this data set. PolyConv achieves a precision-recall area-under-curve score of 0.999 with an F1 score of 99.20% for polyglot detection and 99.47% for file-format identification, significantly outperforming all other tools tested. We developed a content disarmament and reconstruction tool, ImSan, that successfully sanitized 100% of the tested image-based polyglots, which were the most common type found via the survey. Our work provides concrete tools and suggestions to enable defenders to better defend themselves against polyglot files, as well as directions for future work to create more robust file specifications and methods of disarmament.
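To make the definition in the first sentence concrete, the following minimal sketch builds a toy GIF+ZIP polyglot in memory and shows that it matches two formats at once. It is an illustration only, not PolyConv or any tool evaluated in the paper. The helper names (`candidate_formats`, `make_toy_gif_zip`) and the small signature table are assumptions introduced here. The abstract reports that existing format-identification tools fail to reliably detect in-the-wild polyglots, so a simple signature check like this should not be read as a detector.

```python
import io
import zipfile

# Hypothetical, deliberately tiny table of leading "magic" signatures.
MAGIC_PREFIXES = {
    "png": (b"\x89PNG\r\n\x1a\n",),
    "gif": (b"GIF87a", b"GIF89a"),
    "jpeg": (b"\xff\xd8\xff",),
    "pdf": (b"%PDF-",),
}


def candidate_formats(data: bytes) -> set[str]:
    """Return every format whose basic structural check accepts `data`."""
    found = {
        name
        for name, prefixes in MAGIC_PREFIXES.items()
        if data.startswith(prefixes)
    }
    # ZIP readers locate the archive from the END of the file, so a ZIP
    # appended to another format can still be opened as a ZIP.
    if zipfile.is_zipfile(io.BytesIO(data)):
        found.add("zip")
    return found


def make_toy_gif_zip() -> bytes:
    """Build a toy polyglot: GIF-style header bytes followed by a ZIP archive.

    The GIF part is only a header stub for illustration, not a full image.
    """
    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("payload.txt", "hello")
    gif_stub = b"GIF89a" + b"\x01\x00\x01\x00\x00\x00\x00;"
    return gif_stub + archive.getvalue()


if __name__ == "__main__":
    sample = make_toy_gif_zip()
    print(sorted(candidate_formats(sample)))
```

A handler that trusts only the leading bytes would treat this sample as an image, while an archive tool would still extract `payload.txt` from it. That mismatch between parsers is the ambiguity the abstract describes.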

Original language: English
Title of host publication: WWW 2025 - Proceedings of the ACM Web Conference
Publisher: Association for Computing Machinery, Inc
Pages: 4810-4822
Number of pages: 13
ISBN (Electronic): 9798400712746
State: Published - Apr 28 2025
Event: 34th ACM Web Conference, WWW 2025 - Sydney, Australia
Duration: Apr 28 2025 → May 2 2025

Publication series

Name: WWW 2025 - Proceedings of the ACM Web Conference

Conference

Conference: 34th ACM Web Conference, WWW 2025
Country/Territory: Australia
City: Sydney
Period: 04/28/25 → 05/2/25

Keywords

  • APT Survey
  • Content Disarmament
  • File-format Identification
  • Machine Learning
  • Malware Detection
  • Polyglot Files
  • Reconstruction
