Data Science Capstone project on creating a deep learning algorithm that can classify compressed images of known malware binary streams
As the world becomes more digitized, government agencies and large organizations have a growing need for digital forensics. This sub-field of cybersecurity investigates how cyber crimes were committed, which vulnerabilities need to be remedied and, in some cases, what prosecution steps can be pursued. The first of these tasks, diagnosing how a crime was committed, carries a consistent backlog of information to investigate and potentially malicious files to classify, and the process is time-intensive. Most digital forensics today relies on investigations by subject matter experts who use proprietary tools and hand-craft the “features” used to classify malicious files. This is expensive and slow, which leaves room for improvement.
This project therefore focuses solely on using deep learning to classify malware files. Instead of the traditional tabular features, the models classify each file from its binary representation: quite literally the sequence of machine-language 0s and 1s (the binary stream) that makes up the file. This approach is faster to apply, and it generalizes to all file types, since any file can be expressed in binary form.
The project builds 3 different deep learning networks and compares their ability to classify malware files into 9 classes. The classes come from a Microsoft data set, so most of the files affect Windows users. Because the class distribution is uneven, I evaluated the models with metrics such as the F1 score and adjusted the weights used during training so that the models would not be biased toward the majority classes.
This project does not focus on a high degree of visualization or EDA, except where necessary to understand the data files & class imbalance.
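
The imbalance handling described above can be illustrated with a small sketch. It is not the project's actual code: it assumes the training weights are per-class weights set inversely proportional to class frequency (the heuristic scikit-learn calls "balanced") and that the F1 score is macro-averaged across classes. The function names `balanced_class_weights` and `macro_f1` are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Assumption: inverse-frequency weighting; the project may use another scheme.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted samples contributes 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy labels: class 0 is the majority, class 1 the minority.
    y_true = [0, 0, 0, 1]
    y_pred = [0, 0, 0, 0]  # a model that always predicts the majority class
    print(balanced_class_weights(y_true))
    print(macro_f1(y_true, y_pred))
```

On the toy labels, a model that always predicts the majority class reaches 75% accuracy but a much lower macro F1, which is why accuracy alone is a poor guide on an uneven class distribution. The weight dictionary could be handed to any training routine that accepts per-class weights.
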
Please note:
- For the most concise description of this project please read the PowerPoint presentation here: https://github.com/jones5am/Malware_Classification_Using_Deep_Learning/raw/master/Malware%20Classification%20Using%20Deep%20Learning%20-%20v1.2.pptx
- For the most detailed description of this project, please read the full report here: https://github.com/jones5am/Malware_Classification_Using_Deep_Learning/raw/master/Malware%20Classification%20Using%20Deep%20Learning%20v1.1.docx
- This project involves about 400GB of known malware files provided as text files in hex format. These are further processed into 1D images so that deep learning classification methods can be applied. Because of the size and nature of these files, you cannot run this project with only what I've posted in my repository.
- This project is heavily based on the research paper found here: https://arxiv.org/ftp/arxiv/papers/1807/1807.08265.pdf
- If you would like to replicate this project, please download the data from its original source here: https://www.kaggle.com/c/malware-classification
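
The hex-to-1D-image step mentioned in the notes above might look roughly like the following sketch. It rests on several assumptions that are not confirmed by this repository: that each line of a hex text file starts with an address token followed by space-separated hex byte values, that unknown bytes appear as `??` and can be mapped to 0, and that resizing to a fixed length is done by averaging equal-width bins. The function names `parse_hex_dump` and `resize_1d` are hypothetical, and the project's own resizing method may differ.

```python
def parse_hex_dump(text):
    """Turn a hex-dump string into a flat list of byte values (0-255)."""
    values = []
    for line in text.splitlines():
        tokens = line.split()
        # Assumption: the first token on each line is an address, not data.
        for token in tokens[1:]:
            if token == "??":
                values.append(0)  # assumption: unknown bytes become 0
            else:
                values.append(int(token, 16))
    return values


def resize_1d(values, target_len):
    """Resize a sequence to target_len by averaging equal-width bins."""
    n = len(values)
    if n == 0:
        return [0.0] * target_len
    resized = []
    for i in range(target_len):
        start = i * n // target_len
        # Guarantee at least one element per bin when upsampling.
        end = max(start + 1, (i + 1) * n // target_len)
        chunk = values[start:end]
        resized.append(sum(chunk) / len(chunk))
    return resized


if __name__ == "__main__":
    sample = "00401000 56 8D 44 24\n00401010 ?? FF 00 10"
    byte_values = parse_hex_dump(sample)
    print(byte_values)
    print(resize_1d(byte_values, 4))
```

Resizing every file to the same length gives the networks a fixed-size input regardless of the original file size, at the cost of discarding fine-grained byte detail.
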