TY - GEN
T1 - Toward the Detection of Polyglot Files
AU - Koch, Luke
AU - Oesch, Sean
AU - Chaulagain, Amul
AU - Adkisson, Mary
AU - Erwin, Samantha
AU - Weber, Brian
N1 - Publisher Copyright: © 2022 ACM.
PY - 2022/8/8
Y1 - 2022/8/8
N2 - Standardized file types play a key role in the development and use of computer software. However, it is possible to confound standardized file type processing by creating a file that is valid in multiple file types. The resulting polyglot (many languages) file can confuse file type identification, allowing elements of the file to evade analysis. This is especially problematic for malware detection systems that rely on file type identification for feature extraction. Although work has been done to identify file types using more comprehensive methods than file signatures, accurate identification of polyglot files remains an open problem. Since malware detection systems routinely perform file type-specific feature extraction, polyglot files need to be filtered out prior to ingestion by these systems. Otherwise, malicious content could pass through undetected. To address the problem of polyglot detection, we assembled a data set using the mitra tool. We then evaluated the performance of the most commonly used file identification tools, including file, polydet, binwalk, and TrID. Our analysis demonstrates that existing file type detection tools fail to provide reliable polyglot detection. We then evaluated the ability of a range of machine and deep learning models to detect polyglot files. The most performant models were MalConv2 and CatBoost, which demonstrated the highest recall on our data set with 95.16% and 95.45%, respectively. These models outperformed existing methods and could be incorporated into a malware detector's file processing pipeline to filter out potentially malicious polyglots before file type-dependent feature extraction takes place.
AB - Standardized file types play a key role in the development and use of computer software. However, it is possible to confound standardized file type processing by creating a file that is valid in multiple file types. The resulting polyglot (many languages) file can confuse file type identification, allowing elements of the file to evade analysis. This is especially problematic for malware detection systems that rely on file type identification for feature extraction. Although work has been done to identify file types using more comprehensive methods than file signatures, accurate identification of polyglot files remains an open problem. Since malware detection systems routinely perform file type-specific feature extraction, polyglot files need to be filtered out prior to ingestion by these systems. Otherwise, malicious content could pass through undetected. To address the problem of polyglot detection, we assembled a data set using the mitra tool. We then evaluated the performance of the most commonly used file identification tools, including file, polydet, binwalk, and TrID. Our analysis demonstrates that existing file type detection tools fail to provide reliable polyglot detection. We then evaluated the ability of a range of machine and deep learning models to detect polyglot files. The most performant models were MalConv2 and CatBoost, which demonstrated the highest recall on our data set with 95.16% and 95.45%, respectively. These models outperformed existing methods and could be incorporated into a malware detector's file processing pipeline to filter out potentially malicious polyglots before file type-dependent feature extraction takes place.
KW - file type identification
KW - machine learning
KW - polyglot
KW - steganographic malware
UR - http://www.scopus.com/inward/record.url?scp=85136801077&partnerID=8YFLogxK
U2 - 10.1145/3546096.3546106
DO - 10.1145/3546096.3546106
M3 - Conference contribution
AN - SCOPUS:85136801077
T3 - ACM International Conference Proceeding Series
SP - 120
EP - 128
BT - Proceedings of CSET 2022 - 15th Workshop on Cyber Security Experimentation and Test
PB - Association for Computing Machinery
T2 - 15th Workshop on Cyber Security Experimentation and Test, CSET 2022
Y2 - 8 August 2022
ER -