This project implements an Intrusion Detection System (IDS) using machine learning techniques. The goal of the system is to classify different types of network attacks using network traffic data. The model is trained using a Random Forest classifier to detect various network attacks based on labeled data and its associated features. Additionally, we are using Streamlit to create an interactive web dashboard that allows users to visualize the data and model predictions.
This project uses several Python libraries for data processing, machine learning, model evaluation, and dashboard visualization:
- Purpose: Used for data manipulation and analysis. It provides data structures like
DataFrame
for handling and analyzing structured data efficiently. - Installation:
pip install pandas
- Usage: Loading and preprocessing the dataset, handling missing values, and performing feature extraction.
- Purpose: Provides support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
- Installation:
pip install numpy
- Usage: Used for numerical operations like creating arrays, performing calculations, and handling feature data.
- Purpose: A machine learning library that includes simple and efficient tools for data mining and data analysis. It includes algorithms for classification, regression, clustering, and more.
- Installation:
pip install scikit-learn
- Usage:
- LabelEncoder: For encoding categorical variables into numeric values.
- RandomForestClassifier: Used as the model for training and predicting network attack types.
- train_test_split: For splitting the dataset into training and testing sets.
- accuracy_score and classification_report: For model evaluation.
- Purpose: Used for serializing Python objects (saving and loading models and transformers).
- Installation:
pip install joblib
- Usage: Saving and loading the trained machine learning model and label encoder to disk.
- Purpose: This library is used for handling the Parquet file format, enabling fast data processing.
- Installation:
pip install pyarrow
- Usage: Reading Parquet files for efficient data loading.
- Purpose: Used for creating interactive web applications and dashboards for machine learning projects.
- Installation:
pip install streamlit
- Usage: Building a user-friendly interface for the IDS, allowing users to input data and visualize results. Streamlit will display model predictions and graphs dynamically.
- Purpose: A plotting library used for creating static, animated, and interactive visualizations in Python.
- Installation:
pip install matplotlib
- Usage: Used for creating various plots such as bar charts, line plots, and histograms to visualize data and model performance.
- Purpose: A Python data visualization library based on Matplotlib, which provides a high-level interface for drawing attractive statistical graphics.
- Installation:
pip install seaborn
- Usage: Used for creating more advanced visualizations like correlation heatmaps and distributions.
- Purpose: A graphing library that makes interactive, publication-quality graphs online.
- Installation:
pip install plotly
- Usage: Used for creating interactive plots, especially for visualizing model predictions and performance metrics.
-
Data Loading and Preprocessing:
- The dataset is loaded from a Parquet file using the
pyarrow
engine. - Categorical features like
Label
andClassLabel
are encoded into numeric values usingLabelEncoder
fromscikit-learn
. - Missing values are checked and handled appropriately.
- The dataset is loaded from a Parquet file using the
-
Feature Selection:
- Features are extracted by dropping the target columns (
Label
andClassLabel
) from the dataset. - The
Label
column is selected as the target variable for the classification task.
- Features are extracted by dropping the target columns (
-
Model Training:
- The dataset is split into training and testing sets using
train_test_split
. - A Random Forest classifier (
RandomForestClassifier
) is used to train the model on the training data. The model is trained with parallel processing enabled by settingn_jobs=-1
for faster training.
- The dataset is split into training and testing sets using
-
Model Evaluation:
- The model’s performance is evaluated using accuracy and classification metrics provided by
accuracy_score
andclassification_report
fromscikit-learn
.
- The model’s performance is evaluated using accuracy and classification metrics provided by
-
Model Saving:
- The trained model and label encoder are saved using
joblib
for later use in the application (for example, in a Streamlit dashboard).
- The trained model and label encoder are saved using
-
Dashboard Creation:
- Streamlit is used to create an interactive web dashboard where users can upload their own data, visualize the attack types, and view real-time predictions of network traffic based on the trained model.
- Various plots and graphs are displayed using
Matplotlib
,Seaborn
, andPlotly
for a better understanding of the data and model performance.
To install the required libraries, run the following command:
pip install pandas numpy scikit-learn joblib pyarrow streamlit matplotlib seaborn plotly