
Commit

doc update
nikml committed Jan 22, 2024
1 parent b545ac1 commit 57c67a4
Showing 3 changed files with 22 additions and 24 deletions.
30 changes: 14 additions & 16 deletions docs/how_it_works/multivariate_drift.rst
@@ -154,28 +154,26 @@ tutorial.
Classifier for Drift Detection
------------------------------

Classifier for drift detection is an implementation of domain classifiers, as it is called
in `relevant literature`_. NannyML uses a LightGBM classifier to distinguish between
the reference data and the examined chunk data. Similar to data reconstruction with PCA,
this method is also able to capture complex changes in our data. The algorithm implementing
Classifier for Drift Detection follows the steps described below.

Please note that the process described below is repeated for each :term:`Data Chunk`.
First, we prepare the data by assigning label 0 to reference data and label 1 to chunk data.
We use the model inputs as features and concatenate the reference and chunk data.
Duplicate rows are removed once, keeping the one coming from the chunk data.
This ensures that when we estimate on reference data, we get meaningful results.
Finally, categorical data are encoded as integers, since this works well with LightGBM.
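
For intuition, the preparation step might look roughly like the following pandas sketch.
This is only an illustration of the idea, not NannyML's internal code; the toy data and
column names are made up for the example.

.. code-block:: python

    import pandas as pd

    # Toy stand-ins for the reference and chunk model inputs (illustrative only).
    reference_df = pd.DataFrame({"f1": [1.0, 2.0, 3.0], "color": ["red", "blue", "red"]})
    chunk_df = pd.DataFrame({"f1": [3.0, 4.0], "color": ["red", "green"]})
    feature_column_names = ["f1", "color"]

    # Label reference data 0 and chunk data 1, then concatenate.
    data = pd.concat(
        [reference_df.assign(y=0), chunk_df.assign(y=1)],
        ignore_index=True,
    )

    # Drop duplicate feature rows, keeping the copy that came from the chunk
    # (it was concatenated last), so estimating on reference data stays meaningful.
    data = data.drop_duplicates(subset=feature_column_names, keep="last")

    # Encode categorical columns as integer codes, which LightGBM handles well.
    data["color"] = data["color"].astype("category").cat.codes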

To evaluate the domain classifier's discrimination performance, we use its cross-validated AUROC score.
We follow these steps to do so: First, we optionally perform hyperparameter tuning.
We perform hyperparameter optimization once on the combined data and store the resulting optimal hyperparameters.
Users can also provide hyperparameters. If nothing is specified, LightGBM defaults are used.
Next, we use sklearn's `StratifiedKFold` to split the data. For each fold split,
we train an `LGBMClassifier` and save its predicted score in the validation fold.
Finally, we use the predictions across all folds to calculate the resulting AUROC score.
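
The cross-validation loop can be sketched as below. This is a simplified illustration
using sklearn and LightGBM directly rather than NannyML's implementation; the synthetic
data and default hyperparameters are assumptions made for the example.

.. code-block:: python

    import numpy as np
    from lightgbm import LGBMClassifier
    from sklearn.datasets import make_classification
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import StratifiedKFold

    # Stand-in for the prepared features and the 0/1 reference-vs-chunk labels.
    X, y = make_classification(n_samples=500, n_features=5, random_state=42)

    # Collect out-of-fold predicted scores from a stratified split.
    scores = np.zeros(len(y))
    for train_idx, valid_idx in StratifiedKFold(n_splits=5).split(X, y):
        # Tuned or user-provided hyperparameters would be passed here;
        # otherwise LightGBM defaults are used.
        model = LGBMClassifier()
        model.fit(X[train_idx], y[train_idx])
        scores[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]

    # A single AUROC over all out-of-fold predictions is the drift measure.
    auroc = roc_auc_score(y, scores)
    print(round(auroc, 3))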

The higher the AUROC score, the easier it is to distinguish the datasets, and hence the
more different they are.
@@ -5,7 +5,7 @@ Classifier for Drift Detection
==============================

The second multivariate drift detection method of NannyML is Classifier for Drift Detection.
This method trains a classification model to differentiate between data from the reference
dataset and the chunk dataset. Cross-validation is used for training.
The classifier's performance on the cross-validated folds, measured by AUROC, is
the multivariate drift measure. When there is no data drift, the datasets
@@ -34,16 +34,15 @@ The method returns a single number, measuring the discrimination capability of t
Any increase in the discrimination value above 0.5 reflects a change in the structure of the model inputs.

NannyML calculates the discrimination value for the monitored model's inputs, and raises an alert if the
values get outside the pre-defined range of ``[0.45, 0.65]``. If needed, this range can be adjusted by specifying
a threshold strategy more appropriate for the user's data.
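
As a rough illustration of the alerting logic (a hand-written sketch, not the library's
actual thresholding code), a value is flagged once it leaves the configured band:

.. code-block:: python

    # Default band used above; a different threshold strategy can replace it.
    LOWER, UPPER = 0.45, 0.65

    def is_alert(drift_value: float, lower: float = LOWER, upper: float = UPPER) -> bool:
        """Return True when the cross-validated AUROC falls outside the allowed band."""
        return drift_value < lower or drift_value > upper

    print(is_alert(0.50))  # False: reference and chunk are hard to tell apart
    print(is_alert(0.80))  # True: the classifier separates them easily, likely drift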

In order to monitor a model, NannyML needs to learn about it from a reference dataset.
Then it can monitor the data subject to actual analysis, provided as the analysis dataset.
You can read more about this in our section on :ref:`data periods<data-drift-periods>`.

Let's start by loading some synthetic data provided by the NannyML package and setting it up as our reference and analysis dataframes.
This synthetic data is for a binary classification model, but multi-class classification can be handled in the same way.

.. nbimport::
:path: ./example_notebooks/Tutorial - Drift - Multivariate - Classifier for Drift.ipynb
@@ -53,7 +52,7 @@ classification or regression can be handled in the same way.
:path: ./example_notebooks/Tutorial - Drift - Multivariate - Classifier for Drift.ipynb
:cell: 2

The :class:`~nannyml.drift.multivariate.classifier_for_drift_detection.calculator.DriftDetectionClassifierCalculator`
module implements this functionality. We need to instantiate it with appropriate parameters:

- **feature_column_names:** A list with the column names of the features we want to run drift detection on.
@@ -33,7 +33,8 @@ values get outside a range defined by the variance in the reference :ref:`data p
In order to monitor a model, NannyML needs to learn about it from a reference dataset. Then it can monitor the data subject to actual analysis, provided as the analysis dataset.
You can read more about this in our section on :ref:`data periods<data-drift-periods>`.

Let's start by loading some synthetic data provided by the NannyML package and setting it up as our reference and analysis dataframes.
This synthetic data is for a binary classification model, but multi-class classification can be handled in the same way.

.. nbimport::
:path: ./example_notebooks/Tutorial - Drift - Multivariate.ipynb
