jones5am/Malware_Classification_Using_Deep_Learning

Malware_Classification_Using_Deep_Learning

Data Science Capstone project on creating a deep learning algorithm that can classify compressed images of known malware binary streams

As the world becomes more digitized, government agencies and large organizations have a growing need for digital forensics. This sub-field of cybersecurity investigates how cyber crimes were committed, which vulnerabilities need to be remedied and, in some cases, what prosecution steps can be pursued. The first of these tasks, diagnosing how a crime was committed, carries a persistent backlog: there is a large volume of information to investigate, many potentially malicious files to classify, and the process is time-intensive. Most digital forensics today relies on investigations by subject matter experts who use proprietary tools and hand-craft the “features” used to classify malicious files. This process is expensive and time-consuming, which leaves room for improvement.

This project therefore focuses solely on using deep learning techniques to classify malware files. Instead of the traditional tabular features, the models classify each file from its binary representation: quite literally the machine-language 0s and 1s in sequential order (the binary stream) of each malware file. This approach is beneficial not only because it can be applied much more quickly, but also because it generalizes to all file types, since any file can be represented in binary form.

The project builds 3 different deep learning networks and compares their ability to classify these malware files into 9 classes. The 9 classes come from a Microsoft data set, so most of the files affect Windows users. The class distribution is uneven, so I used evaluation metrics such as the F1 score and adjusted the training weights to prevent overfitting to the majority classes.
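
To make the idea of treating a binary stream as an image concrete, here is a minimal sketch in plain Python. It is not the preprocessing code from this repository: the function name `bytes_to_grayscale_rows`, the fixed row width, and the zero padding of the last row are all assumptions chosen for illustration. Each byte (0 to 255) is read as one grayscale pixel intensity.

```python
from typing import List


def bytes_to_grayscale_rows(data: bytes, width: int = 256) -> List[List[int]]:
    """Reshape a byte stream into rows of `width` pixel intensities (0-255).

    Hypothetical helper for illustration; the width and the zero padding
    are assumptions, not the project's actual settings.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    # Zero-pad the tail so that the final row has exactly `width` pixels.
    remainder = len(data) % width
    if remainder:
        data = data + bytes(width - remainder)
    # Slice the padded stream into consecutive rows of `width` bytes.
    return [list(data[i:i + width]) for i in range(0, len(data), width)]


# Five sample bytes laid out as a 2 x 4 grid of pixel values.
sample = bytes([0x4D, 0x5A, 0x90, 0x00, 0x03])
image = bytes_to_grayscale_rows(sample, width=4)
print(image)  # [[77, 90, 144, 0], [3, 0, 0, 0]]
```

In practice the rows would be converted to an array and resized to a fixed resolution before being fed to a network; that step is left out here to keep the sketch dependency-free.
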
This project does not focus on a high degree of visualization or EDA, except where necessary to understand the data files & class imbalance.
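
The two measures mentioned above for handling class imbalance, weighting classes during training and scoring with F1, can be sketched as follows. This is an illustration under stated assumptions rather than the project's code: the helper names are hypothetical, the weights use the common "balanced" heuristic (total samples divided by the number of classes times the class count), and the F1 score is macro-averaged, which the README does not specify.

```python
from collections import Counter
from typing import Dict, Sequence


def balanced_class_weights(labels: Sequence[int]) -> Dict[int, float]:
    """Weight each class inversely to its frequency (assumed heuristic)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 scores (macro averaging is assumed)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted samples contributes 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy imbalanced labels: class 0 dominates, class 2 is rare.
labels = [0, 0, 0, 0, 0, 0, 1, 1, 2]
weights = balanced_class_weights(labels)
print(weights)  # {0: 0.5, 1: 1.5, 2: 3.0}

# A model that always predicts the majority class looks fine on accuracy
# (6 of 9 correct) but scores poorly on macro F1.
always_majority = [0] * len(labels)
print(round(macro_f1(labels, always_majority), 3))  # 0.267
```

A dictionary of this shape is what many training APIs accept as per-class weights, so that errors on rare classes cost more than errors on the majority class.
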

Please note:
