For decades, attackers have used malware such as spyware, ransomware, and trojans to compromise computers and launch attacks that cause billions in damages, which makes malware detection a vital research area. A significant challenge is that obfuscation (concealment) techniques have made detection increasingly difficult and energy-intensive for computers. My work builds on recent research that prototypes a lightweight detection approach: using features extracted from snapshots of a machine's volatile memory to detect malware behavioral signatures with machine learning models. I use the new CIC-MalMem-2022 dataset of benign and malicious memory dumps not only to examine the accuracy of several trained models at malware detection, but also to build on prior research by studying how well those models adapt to several factors that matter for industry application. These factors include resource-saving feature reduction during training, response to natural or artificial data drift, applicability on low-resource machines (IoT), and the capacity to recognize ‘zero-day’ attacks.
The cybersecurity_ml_analysis.ipynb notebook contains data science research that analyzes a range of algorithms for obfuscated malware detection by examining several core factors.
These factors are:
- training on a semi-synthetic dataset for greater utility across machines
- application capabilities on low-resource machines (e.g. IoT)
- adjusting to resource-saving feature reduction training (for a low memory and processing footprint)
- capacity for recognizing ‘zero-day’ attacks (with an incomplete malware dataset)
- response to natural or artificial data drift (responsiveness of the model to change)
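The ‘zero-day’ factor above can be illustrated with a small sketch: hold one malware family out of training entirely, then test whether a model still flags it. This is only a minimal illustration and not necessarily how the notebook does it. It assumes each row has a `Category` field whose value is either `Benign` or starts with the family name (for example `Ransomware-...`). The helper names `family_of` and `zero_day_split` and the `benign_test_every` parameter are hypothetical.

```python
def family_of(category: str) -> str:
    """Return the leading family name of a Category string (assumed format)."""
    return category.split("-", 1)[0]


def zero_day_split(rows, held_out_family, category_key="Category",
                   benign_test_every=5):
    """Split rows so that one malware family never appears in training.

    rows: iterable of dict-like records (e.g. from csv.DictReader).
    held_out_family: family to treat as the unseen 'zero-day' malware.
    benign_test_every: every n-th benign row is also sent to the test set,
        so false positives can be measured alongside detection of the
        held-out family.
    """
    train, test = [], []
    benign_seen = 0
    for row in rows:
        family = family_of(row[category_key])
        if family == held_out_family:
            # Unseen family: only ever used for testing.
            test.append(row)
        elif family == "Benign":
            benign_seen += 1
            if benign_seen % benign_test_every == 0:
                test.append(row)
            else:
                train.append(row)
        else:
            # Remaining malware families are available for training.
            train.append(row)
    return train, test
```

A model trained on `train` and scored on `test` then gives a rough estimate of how well it generalizes to a family it has never seen.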

Data files:
- Obfuscated-MalMem2022.csv contains the raw memory dump data captured during a variety of benign operations and malware attacks. Source: https://www.unb.ca/cic/datasets/malmem-2022.html
- augmented_data_set.csv is a larger augmented dataset that contains the previous memory data along with synthetic data samples generated using a GAN. The GAN is implemented in the notebook.
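
As a quick sanity check on either file, the class balance can be counted before training. The sketch below uses only the standard library and assumes the label column is named `Class`. That name is an assumption about the CSV layout, so adjust it if the header differs. The helper name `class_counts` is hypothetical.

```python
import csv
from collections import Counter


def class_counts(csv_file, label_column="Class"):
    """Count how many rows carry each label in an open CSV file object.

    label_column is an assumed column name; change it to match the header.
    """
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column] for row in reader)


# Example usage (assumes the file is in the working directory):
# with open("augmented_data_set.csv", newline="") as f:
#     print(class_counts(f))
```

Comparing the counts for the raw and augmented files shows how many synthetic samples were added per class.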