The general objective of this project is to investigate the various deep learning techniques available to solve selected cyber security related problems.
The objective of this particular project is to use deep learning techniques to classify malware quickly, without resorting to manual analysis of the malware's source code and behaviour.
The dataset used for this project was retrieved from the Kaggle website, more specifically from their Microsoft Malware Classification Challenge (BIG 2015) website.
-
Simply run the following command in your git repository to get the latest version of the codebase:
git pull origin master
-
The conda environment is where your Python notebooks and code are usually executed. It contains the Python packages that have been installed into that environment.
-
To activate it, run the following:
conda activate
-
To deactivate it, run the following:
conda deactivate
-
If your conda environment has been activated, you should be able to see your terminal prompt change, like so:
(<conda environment name>) user@domain $
-
For example, the default and main conda environment is named 'base':
(base) student@localhost $
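The prompt prefix is the quickest check, but the active environment can also be read from Python. The sketch below assumes that conda exports the name of the active environment in the CONDA_DEFAULT_ENV environment variable, which conda activate normally sets. The helper name active_conda_env is made up for this illustration and is not part of this repo.

```python
import os


def active_conda_env(environ=None):
    """Return the name of the active conda environment, or None.

    Assumption: conda sets CONDA_DEFAULT_ENV while an environment is active.
    """
    env = os.environ if environ is None else environ
    name = env.get("CONDA_DEFAULT_ENV")
    return name or None


if __name__ == "__main__":
    name = active_conda_env()
    if name is None:
        print("No conda environment is active; run 'conda activate' first.")
    else:
        print(f"Active conda environment: {name}")
```

Passing a dictionary instead of the real environment makes the helper easy to try out without changing your shell.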
-
Note that if you have installed a newer version of the Anaconda Python distribution, you will have to activate your conda environment before running the following commands.
First, clone this GitHub repo to your home directory (or any other directory you prefer) on the Linux virtual machine you are using. A virtual machine running the Ubuntu Linux distro was used for this project. We will call this directory INSTALL_PATH:
cd INSTALL_PATH # Where INSTALL_PATH is the path in which you will be cloning this git repo
git clone https://github.com/HarishB99/application_of_ai_in_cyber_security.git
cd application_of_ai_in_cyber_security/
Install the p7zip-full package on Ubuntu Linux (the Linux distribution that was used for this project):
sudo apt install -y p7zip-full
The p7zip-full package provides the utilities needed to extract 7-zip archive files.
Download train.7z from the challenge website. (Note that you will have to create an account with Kaggle to download the file.)
Please note the calls to the keras.preprocessing.image.load_img() function in the two functions in CNN/utils/utils.py, namely load_image_as_np() and resize_from_file(). For newer versions of the keras.preprocessing.image library, you will have to pass color_mode='grayscale' to load grayscale images. For older versions, you need to pass grayscale=True instead. This reminder is also documented in the comments for the two calls to the said library made in CNN/utils/utils.py.
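If you would rather not edit the calls by hand, one option is to pick the keyword at run time. The following is a minimal sketch, not the code in CNN/utils/utils.py: the helper name grayscale_kwargs is hypothetical, and it assumes the loader's signature can be inspected to see whether it accepts color_mode.

```python
import inspect


def grayscale_kwargs(load_img_func):
    """Pick the grayscale keyword that the given image loader accepts."""
    params = inspect.signature(load_img_func).parameters
    if "color_mode" in params:
        # Newer API: color_mode='grayscale'
        return {"color_mode": "grayscale"}
    # Older API: grayscale=True
    return {"grayscale": True}


# Usage with Keras (not executed here, since it needs Keras installed):
#   from keras.preprocessing.image import load_img
#   image = load_img("sample.png", **grayscale_kwargs(load_img))
```

When a version accepts both keywords, the sketch prefers color_mode, since that is the one the newer versions expect.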
Before you proceed to execute the Python notebooks in the CNN/ folder, perform the following:
Create a directory named dataset_bytes at the root of this git repo:
cd INSTALL_PATH/application_of_ai_in_cyber_security
mkdir dataset_bytes
Extract the .bytes files in train.7z into INSTALL_PATH/application_of_ai_in_cyber_security/dataset_bytes/ (one straightforward way is to extract all the files into a separate folder and then move all the .bytes files into this directory).
Now, INSTALL_PATH/application_of_ai_in_cyber_security/dataset_bytes/ should contain only .bytes files.
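Before running the notebooks, it can help to confirm that nothing else ended up in the directory. The sketch below lists every entry that is not a file with the expected suffix. The helper name unexpected_entries is hypothetical, and the script assumes it is run from the root of the repo.

```python
from pathlib import Path


def unexpected_entries(directory, suffix):
    """Return sorted names of entries that are not files ending in suffix."""
    return sorted(
        entry.name
        for entry in Path(directory).iterdir()
        if not (entry.is_file() and entry.name.endswith(suffix))
    )


if __name__ == "__main__":
    dataset_dir = Path("dataset_bytes")  # relative to the root of the repo
    if dataset_dir.is_dir():
        leftovers = unexpected_entries(dataset_dir, ".bytes")
        print("OK" if not leftovers else f"Unexpected entries: {leftovers}")
```

An empty result means the directory holds only .bytes files.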
You are now ready to execute the Python notebooks in the CNN/ directory.
Please note that you will have to run CNN/byte_to_image_converter.ipynb to convert the .bytes files into .png files before executing CNN/main.ipynb.
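To give a rough idea of what such a conversion involves, here is a minimal sketch that turns the text of a .bytes file into rows of grayscale pixel values. It is not the notebook's implementation. It assumes each line holds an address followed by hexadecimal byte values, with ?? marking an unreadable byte. Mapping ?? to 0, the row width, and the function names are all choices made for this illustration.

```python
def parse_bytes_text(text):
    """Turn the text of a .bytes file into a flat list of ints in 0..255."""
    values = []
    for line in text.splitlines():
        tokens = line.split()
        # Assumption: the first token on each line is the address, so skip it.
        for token in tokens[1:]:
            # Assumption: '??' (unreadable byte) is mapped to 0.
            values.append(0 if token == "??" else int(token, 16))
    return values


def to_pixel_rows(values, width):
    """Split values into rows of `width` pixels, zero-padding the last row."""
    rows = []
    for start in range(0, len(values), width):
        row = values[start:start + width]
        rows.append(row + [0] * (width - len(row)))
    return rows


sample = "00401000 56 8D 44 24\n00401004 08 ?? FF"
pixels = to_pixel_rows(parse_bytes_text(sample), width=4)
print(pixels)  # [[86, 141, 68, 36], [8, 0, 255, 0]]
```

Each inner list is one image row, so the result can be handed to an imaging library of your choice to save a grayscale .png file.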
Before you proceed to execute the Python notebooks in the LSTM/ folder, perform the following:
Create a directory named dataset_asm_files_subset at the root of this git repo:
cd INSTALL_PATH/application_of_ai_in_cyber_security
mkdir dataset_asm_files_subset
Extract a subset of the .asm files in train.7z into INSTALL_PATH/application_of_ai_in_cyber_security/dataset_asm_files_subset/ (one straightforward way is to extract all the files into a separate folder and then move the first few .asm files that you want the model to process into this directory).
Now, INSTALL_PATH/application_of_ai_in_cyber_security/dataset_asm_files_subset/ should contain only .asm files.
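Moving the first few files by hand gets tedious for large archives. The sketch below moves the first N matching files from an extraction folder into a target directory. The function name move_first_files, the folder name extracted_train, and the limit of 100 are all hypothetical and should be adapted to your setup.

```python
import shutil
from pathlib import Path


def move_first_files(src_dir, dest_dir, suffix, limit):
    """Move the first `limit` files ending in suffix (sorted by path) into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Search subfolders too, in case the archive extracted into a nested folder.
    candidates = sorted(
        path for path in Path(src_dir).rglob("*" + suffix) if path.is_file()
    )
    moved = []
    for path in candidates[:limit]:
        shutil.move(str(path), str(dest / path.name))
        moved.append(path.name)
    return moved


# Example call (hypothetical folder name and limit):
#   move_first_files("extracted_train", "dataset_asm_files_subset", ".asm", limit=100)
```

Sorting before slicing keeps the selection reproducible between runs.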
You are now ready to execute the Python notebooks in the LSTM/ directory.
You can simply execute LSTM/main.ipynb to run the program.
The purpose of this testing program is to classify malware bytecode samples (.bytes files with their PE headers removed) into their respective malware families.
Before you proceed to execute the Python notebooks in the CNN/testing/ folder, perform the following:
Execute CNN/testing/byte_to_image_converter.ipynb or CNN/testing/byte_to_image_converter.py to convert the .bytes files into .png files before executing CNN/main.ipynb. For this, you will have to create a directory somewhere in your filesystem to store the bytecodes (.bytes files with their PE headers removed) that you want the program to classify. We will refer to this directory as BYTECODES_DIR:
cd INSTALL_PATH/application_of_ai_in_cyber_security/CNN/testing/
python byte_to_image_converter.py -d BYTECODES_DIR
If you are executing the Python notebook instead (CNN/testing/byte_to_image_converter.ipynb), you will have to edit the variable arguments_list in cell 9 to match the above.
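A notebook cannot receive command-line flags directly, which is presumably why the arguments_list variable exists. The sketch below shows how such a list typically stands in for the shell arguments, assuming the script parses its options with argparse. Only the -d flag and the arguments_list name come from the instructions above; the parser definition itself is hypothetical.

```python
import argparse


def build_parser():
    # Hypothetical parser: only the -d flag is taken from the command above.
    parser = argparse.ArgumentParser(description="Convert .bytes files to .png")
    parser.add_argument("-d", dest="directory", required=True,
                        help="directory containing the .bytes files")
    return parser


# In a notebook, list the flags exactly as you would type them in the shell.
arguments_list = ["-d", "BYTECODES_DIR"]  # replace with your real path
args = build_parser().parse_args(arguments_list)
print(args.directory)
```

Each shell token becomes one list element, so the flag and its value are separate strings.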
You are now ready to execute the Python notebooks in the CNN/testing/ directory.
You can simply execute CNN/testing/malware_sorter.ipynb or CNN/testing/malware_sorter.py to run the program. It is simpler to run the latter:
python malware_sorter.py -d BYTECODES_DIR
If you are executing the Python notebook instead (CNN/testing/malware_sorter.ipynb), you will have to edit the variable arguments_list in cell 10 to match the above.