---
title: "Time series anomaly detection - Canadian weather data"
author: "Ahmet Zamanis"
format:
  html:
    toc: true
    embed-resources: true
editor: visual
jupyter: python3
execute:
  warning: false
---
## Introduction
This report will apply & compare several anomaly detection algorithms on a multivariate time series dataset of weather measurements. We'll use numerous algorithms from the [PyOD](https://github.com/yzhao062/pyod) anomaly detection package, along with time series data formats & anomaly scoring wrappers from the [Darts](https://github.com/unit8co/darts) package. We'll also implement a simple autoencoder using PyTorch Lightning. We'll use several visualizations including 3D Plotly scatterplots to illustrate the results and unique outputs of each algorithm.
The data consists of daily mean temperature and total precipitation measurements across 13 Canadian locations, some ranging from 1940 to 2020, others starting from 1960. The data was downloaded from [OpenML](https://openml.org/search?type=data&status=active&id=43843&sort=runs), shared by user Elif Ceren Gök.
The Python scripts for this analysis are available on the [GitHub repository](https://github.com/AhmetZamanis/WeatherAnomalyDetectionClassification). They may not be fully up to date with the report.
```{python Imports}
#| code-fold: true
#| code-summary: "Show imports"
# Data handling
import pandas as pd
import numpy as np
import arff # Installed as liac-arff
import warnings
# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
# Preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from darts.dataprocessing.transformers.scaler import Scaler
from darts.dataprocessing.transformers.missing_values_filler import MissingValuesFiller
# Darts
from darts.timeseries import TimeSeries
from darts.ad.scorers.kmeans_scorer import KMeansScorer
from darts.ad.scorers.pyod_scorer import PyODScorer
from darts.ad.detectors.quantile_detector import QuantileDetector
# PyOD anomaly scorers
from pyod.models.gmm import GMM
from pyod.models.ecod import ECOD
from pyod.models.pca import PCA
from pyod.models.iforest import IForest
# Torch and Lightning
import torch
import lightning as L
from X_LightningClassesAnom import TrainDataset, TestDataset, AutoEncoder
# Hyperparameter tuning
import optuna
# Dimensionality reduction
from sklearn.manifold import TSNE
# Helper functions
from X_HelperFunctionsAnom import score, detect, plot_series, plot_dist, plot_anom3d, plot_detection, plot_tsne, validate_nn
```
```{python Settings}
#| code-fold: true
#| code-summary: "Show settings"
# Set print options
np.set_printoptions(suppress=True, precision=4)
pd.options.display.float_format = '{:.4f}'.format
pd.set_option('display.max_columns', None)
# Set plotting options
plt.rcParams['figure.dpi'] = 150
plt.rcParams["figure.autolayout"] = True
sns.set_style("darkgrid")
px_width = 800
px_height = 800
# Set Torch settings
torch.set_default_dtype(torch.float32)
torch.set_float32_matmul_precision('high')
L.seed_everything(1923, workers = True)
warnings.filterwarnings("ignore", ".*does not have many workers.*")
```
## Data preparation
The data is in .arff format, and we'll use the [liac-arff](https://github.com/renatopp/liac-arff) package to read it. We have daily mean temperature and total precipitation values for each of the 13 locations.
```{python LoadData}
# Load raw data from .arff
raw_data = arff.load(open("./InputData/canadian_climate.arff", "r"))
# Convert to Pandas dataframe & view
df = pd.DataFrame(
raw_data["data"], columns = [x[0] for x in raw_data["attributes"]])
print(df.iloc[:,0:4])
```
To score anomalies, we need to compare the weather variables with the time variables: The same weather measurements can be perfectly normal or anomalous depending on the season. We are looking for **local anomalies,** not global.
- To this end, we'll extract month, week of year and day of year features from our date column.
- We'll use cyclical encoding to transform these categorical variables into numeric features while conveying their cyclical nature to the algorithms we'll use. This creates a sine and cosine pair for each time variable, giving 6 time features which, together with the two weather measurements, make 8 features per location.
```{python TimeFeatures}
# Convert LOCAL_DATE to datetime
df["LOCAL_DATE"] = pd.to_datetime(df["LOCAL_DATE"])
# Add cyclic terms for month, week of year and day of year
df["month_sin"] = np.sin(2 * np.pi * df["LOCAL_DATE"].dt.month / 12)
df["month_cos"] = np.cos(2 * np.pi * df["LOCAL_DATE"].dt.month / 12)
df["week_sin"] = np.sin(2 * np.pi * df["LOCAL_DATE"].dt.isocalendar().week / 53)
df["week_cos"] = np.cos(2 * np.pi * df["LOCAL_DATE"].dt.isocalendar().week / 53)
df["day_sin"] = np.sin(2 * np.pi * df["LOCAL_DATE"].dt.dayofyear / 366)
df["day_cos"] = np.cos(2 * np.pi * df["LOCAL_DATE"].dt.dayofyear / 366)
```
We'll focus our attention only on Ottawa, one of the locations that has data available from 1940. There are a few missing weather measurements across the time range, and we'll interpolate them.
```{python TSOttawa}
# Retrieve the weather data for Ottawa (1940 start) as Darts TimeSeries
ts_ottawa = TimeSeries.from_dataframe(
df,
time_col = "LOCAL_DATE",
value_cols = ["MEAN_TEMPERATURE_OTTAWA", "TOTAL_PRECIPITATION_OTTAWA"],
fill_missing_dates = True
)
# Interpolate missing values
na_filler = MissingValuesFiller()
ts_ottawa = na_filler.transform(ts_ottawa)
```
We don't have any missing values for our time variables, as we created them from the dates.
```{python TSCovars}
# Retrieve date covariates as Darts TS
ts_covars = TimeSeries.from_dataframe(
df,
time_col = "LOCAL_DATE",
value_cols = ['month_sin', 'month_cos', 'week_sin', 'week_cos', 'day_sin',
'day_cos'],
fill_missing_dates = True
)
```
We'll split our data roughly in two: The anomaly scorers will be trained on data before 1980, and "tested" on data after 1980, though we can't calculate any objective performance metrics, as we don't have true anomaly labels to compare our models' predictions with.
```{python Preprocess}
# Concatenate time covariates to Ottawa weather series
ts_ottawa = ts_ottawa.concatenate(ts_covars, axis = 1)
# Split train-test data
test_start = pd.Timestamp("1980-01-01")
train_end = pd.Timestamp("1979-12-31")
ts_train = ts_ottawa.drop_after(test_start)
ts_test = ts_ottawa.drop_before(train_end)
```
## Anomaly detection with PyOD & Darts
Most of the algorithms we'll use are sensitive to features with differing value scales, so we'll scale all features to the -1 to 1 range. This is already the value range of the cyclically encoded features, so they will remain unchanged.
We'll also create a Darts anomaly detector, with a high quantile threshold. It will take in the anomaly scores produced by the scorers, and flag the top Qth quantile as anomalies. It's also possible to use a raw score threshold detector.
```{python Utilities}
# Create Darts wrapper for MinMaxScaler
feature_range = (-1, 1) # The value range of cyclical encoded features
scaler = Scaler(MinMaxScaler(feature_range = feature_range))
# Create quantile detector
q = 0.999 # The detector will flag scores above this quantile as anomalies
detector = QuantileDetector(high_quantile = q)
```
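As an alternative to the quantile-based detector, the raw score threshold detector mentioned above could be created roughly as follows; this is a sketch, and the threshold value here is purely illustrative:

```python
from darts.ad.detectors.threshold_detector import ThresholdDetector

# Flag any anomaly score above a fixed raw value (illustrative threshold)
raw_detector = ThresholdDetector(high_threshold = 10.0)
```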
To shorten the code, we'll use several functions to concisely train our models, anomaly score the data & create plots. These are available [here](https://github.com/AhmetZamanis/WeatherAnomalyDetectionClassification/blob/main/X_HelperFunctionsAnom.py).
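For reference, here is a minimal sketch of what the `score` and `detect` helpers might look like, inferred from how they are called in this report; the actual implementations in the linked script may differ:

```python
def score(ts_train, ts_test, scorer, scaler = None):
    """Fit a Darts anomaly scorer on the training series, score both periods."""
    # Optionally scale the series, fitting the scaler on the training period only
    if scaler is not None:
        ts_train = scaler.fit_transform(ts_train)
        ts_test = scaler.transform(ts_test)
    # Fit the scorer on the training period, then score both periods
    scorer.fit(ts_train)
    scores_train = scorer.score(ts_train)
    scores_test = scorer.score(ts_test)
    # Also return the concatenated scores for full-series plots
    return scores_train, scores_test, scores_train.append(scores_test)


def detect(scores_train, scores_test, detector):
    """Fit a Darts detector on the training scores, flag anomalies in both periods."""
    detector.fit(scores_train)
    anoms_train = detector.detect(scores_train)
    anoms_test = detector.detect(scores_test)
    return anoms_train, anoms_test, anoms_train.append(anoms_test)
```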
### K-means clustering
Our first anomaly scorer is based on the K-means clustering algorithm. K clusters are fit to the data, and each observation's anomaly score is calculated from its distance to the nearest cluster.
- As an intuitive choice for the number of clusters, we'll go with 12, one for each month of the year. Another decent choice is likely 4, one for each season.
The `window` parameter controls the number of observations that will be anomaly scored together. Since we're interested in daily weather anomalies, we'll use the default value of 1. For many time series anomaly detection tasks, larger windows will make more sense.
- For example, when anomaly scoring data from frequent sensor readings, a single time step with unusual readings may not be very meaningful, but a sequence of unusual readings may indicate an issue.
- In such a case, with `window = N`, Darts reformats the data so each row is a sequence of N time steps. Anomaly scoring is then performed on each sequence instead of single observations, as in the hypothetical sketch below.
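A hypothetical larger-window scorer (not used in this report) might look like this:

```python
# Score 7-day sequences instead of single days, so an unusual week of
# readings stands out rather than a single unusual day (illustration only)
weekly_scorer = KMeansScorer(window = 7, k = 12, random_state = 1923)
```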
We fit the anomaly scorer on the training period, and generate anomaly scores both for the training & testing periods.
```{python KMeansScore}
# Create K-means scorer
scorer_kmeans = KMeansScorer(
window = 1, # Score each time step by itself
k = 12, # Number of K-means clusters
random_state = 1923)
# Perform anomaly scoring
scores_train_kmeans, scores_test_kmeans, scores_kmeans = score(
ts_train, ts_test, scorer_kmeans, scaler)
```
We pass the anomaly scores to our detector, and get binary anomaly labels.
```{python KMeansDetect}
# Perform anomaly detection
anoms_train_kmeans, anoms_test_kmeans, anoms_kmeans = detect(
scores_train_kmeans, scores_test_kmeans, detector)
```
Since we don't have true anomaly labels, there are no performance metrics we can calculate. Instead, we'll plot the anomaly scores along with the weather measurements.
```{python KMeansPlotSeries}
# Plot anomaly scores & original series
plot_series(
"K-means scorer", ts_train, ts_test, scores_train_kmeans, scores_test_kmeans)
```
From the plots above, it appears the K-means anomaly scores are highly related to the total precipitation values. A particular post-2000 day with very high precipitation stands out.
- The overall anomaly scores and weather measurements don't seem too different before and after 1980, though a few post-1980 days particularly stand out.
Next, we'll plot the same variables, but this time we'll color by anomaly labels. The black line will be the "normal" observations below the Qth quantile of anomaly scores, and the blue line will be those above the Qth quantile, the flagged anomalies.
```{python KMeansPlotDetect}
# Detections plot
plot_detection("K-means scores", q, ts_ottawa, scores_kmeans, anoms_kmeans)
```
We see the anomaly scores for the flagged observations are clearly and considerably higher than non-flagged ones, which is good. Ideally we want as much separation between the anomaly scores as possible to avoid false positives or false negatives. In other words, we want the lowest possible scores for normal observations, and the highest possible scores for actual anomalies.
- The anomalies all have the highest observed precipitation values. Their temperature values are not unusual compared to the remaining observations, but they are generally on the higher side.
- In summary, the K-means scorer identifies warm days with unusually high precipitation as anomalies.
Next, we'll look at the anomaly score density plots for the training and testing periods, comparing their distributions.
```{python KMeansPlotDist}
# Plot distributions of anomaly scores
plot_dist("K-means scorer", scores_train_kmeans, scores_test_kmeans)
```
The scores for both periods are distributed similarly, though the post-1980 period has slightly more high-scored observations and slightly fewer median-scored ones.
Next, we'll plot all observations in a 3D scatterplot, with our weather variables and the month as our three dimensions. We'll color the observations by their anomaly labels. The interactive plot can be moved around by clicking and holding, and zoomed with the mouse wheel.
```{python KMeansPlot3D}
# 3D anomalies plot
plot_anom3d(
"K-means scorer", ts_ottawa, anoms_kmeans, px_width, px_height)
```
Here we see the majority of anomalies are summer days with high precipitation and typical temperatures. Some anomalies, including the most extreme observation, also belong to the fall months.
We can also retrieve the fitted cluster labels from the K-means algorithm, and use them to color the observations in the same 3D scatterplot.
```{python KMeansClustering}
#| code-fold: true
#| code-summary: "Show code for 3D clustering plot"
# Clustering plot
train_labels = scorer_kmeans.model.labels_.astype(str)
fig = px.scatter_3d(
x = ts_train['MEAN_TEMPERATURE_OTTAWA'].univariate_values(),
y = ts_train['TOTAL_PRECIPITATION_OTTAWA'].univariate_values(),
z = ts_train.time_index.month,
color = train_labels,
title = "K-Means clustering plot, train set",
labels = {
"x": "Mean temperature",
"y": "Total precipitation",
"z": "Month",
"color": "Clusters"},
width = px_width,
height = px_height
)
fig.show()
```
We see the K-means algorithm very clearly clusters the observations according to their months. The anomaly scores are then calculated as the distance from the closest cluster, so the most anomalous observations are the high-precipitation ones in summer.
- Compared to the summer months, we don't see such particularly distant observations in winter. That does not mean they couldn't be considered anomalous or unusual for their respective months.
### Gaussian mixture models (GMM)
The GMM approach assumes the data distribution is a mixture of N Gaussian distributions. These N mixture components are learned from the training data. Each observation's anomaly score is based on its likelihood under the fitted mixture: the lower the likelihood, the higher the anomaly score. A rough sketch of this scoring idea follows the list below.
- In practice, this approach is fairly similar to K-means clustering. Just like the choice of K clusters in K-means, we'll choose the number of Gaussian mixture components as a parameter.
- We'll go with `n_components = 4`, one for each season, just for variety against the previous approach of 12 cluster centroids.
- GMM is trained with expectation maximization, so the solution is not deterministic, though convergence to a local solution is guaranteed. We'll perform 10 initializations, and the best results will be kept.
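To make the likelihood-based scoring concrete, here is a minimal sketch of the idea using scikit-learn's `GaussianMixture` directly; PyOD's `GMM` wraps the same estimator, though its exact score scaling may differ:

```python
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import MinMaxScaler

# Illustration only: scale features, fit a 4-component mixture on the training period
mm = MinMaxScaler(feature_range = (-1, 1))
x_tr = mm.fit_transform(ts_train.values())
x_te = mm.transform(ts_test.values())
gm = GaussianMixture(n_components = 4, n_init = 10, random_state = 1923).fit(x_tr)

# score_samples returns per-observation log-likelihoods; negating them gives
# anomaly scores (lower likelihood = higher score)
anomaly_scores = -gm.score_samples(x_te)
```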
Darts offers the `PyODScorer` time series wrapper for any algorithm in `PyOD`.
```{python CreateGMM}
# Create PyOD GMM model
gmm = GMM(
n_components = 4, # N. of Gaussian mixture components
n_init = 10, # N. of initializations for expectation maximization
contamination = (1 - q), # Proportion of expected anomalies in the dataset
random_state = 1923)
# Create Darts GMM scorer
scorer_gmm = PyODScorer(model = gmm, window = 1)
```
We'll repeat essentially the same anomaly detection & plotting code for each algorithm, illustrating the differences and commenting on the outputs.
```{python GMMCode}
#| code-fold: true
#| code-summary: "Show code for GMM anomaly detection"
# Perform anomaly scoring
scores_train_gmm, scores_test_gmm, scores_gmm = score(
ts_train, ts_test, scorer_gmm, scaler)
# Perform anomaly detection
anoms_train_gmm, anoms_test_gmm, anoms_gmm = detect(
scores_train_gmm, scores_test_gmm, detector)
# Plot scores & original series
plot_series("GMM scorer", ts_train, ts_test, scores_train_gmm, scores_test_gmm)
# Detections plot
plot_detection("GMM scores", q, ts_ottawa, scores_gmm, anoms_gmm)
# Plot distributions of anomaly scores
plot_dist("GMM scorer", scores_train_gmm, scores_test_gmm)
# 3D anomalies plot
plot_anom3d("GMM scorer", ts_ottawa, anoms_gmm, px_width, px_height)
# 3D cluster plot
labels = scorer_gmm.model.detector_.predict(
scaler.transform(ts_ottawa).values()).astype(str)
fig = px.scatter_3d(
x = ts_ottawa['MEAN_TEMPERATURE_OTTAWA'].univariate_values(),
y = ts_ottawa['TOTAL_PRECIPITATION_OTTAWA'].univariate_values(),
z = ts_ottawa.time_index.month,
color = labels,
title = "GMM clustering plot",
labels = {
"x": "Mean temperature",
"y": "Total precipitation",
"z": "Month",
"color": "Clusters"},
width = px_width,
height = px_height
)
fig.show()
```
The GMM algorithm's outputs are similar to K-means in some respects.
- The differences between scores for anomalous and non-anomalous observations appear to be even larger in magnitude, which is good. The most unusual observation after year 2000 stands out even more compared to other observations.
- We could say the GMM anomalies are a bit colder & less rainy overall.
- The score distributions of train & test periods are even more similar, and are very right-skewed. While it may be easier to identify individual anomalies, the overall comparison of score distributions across periods may be less meaningful.
- Unlike K-means, we now have a few anomalies in all months of the year, including winter months. This is likely because the scoring is now based on 4 mixture components (seasons) instead of 12 clusters (months).
- For example, it's possible the anomalies in February are flagged because they are compared with the same mixture component as observations from January. In contrast, they were likely just compared with the "February cluster" in K-means, and appeared relatively non-anomalous.
- If we were interested in seasonal anomalies, this approach could be preferable, and for monthly anomalies K-means could be preferable (or, either approach could be fit with either number of components / clusters).
- We do not have a strict definition of an anomaly, or true anomaly labels to compare our results with, so either method could be preferred based on the use case.
- The fitted mixture components correspond to each season, as can be seen in the 3D clustering plot. Only the most extreme observation, which is in September, is attributed to the summer component.
### ECOD
ECOD is a fairly simple and efficient anomaly detection method which relies on empirical cumulative distribution functions. Let's make a few definitions to understand how.
- For a value $x$ drawn from a random variable $X$, the cumulative distribution function $F(x)$ is the integral of the probability density function over all values $\leq x$. This corresponds to the probability that $X \leq x$.
- The empirical CDF (eCDF) is a step function estimate of the CDF, derived from an observed sample of $X$.
ECOD estimates a non-parametric eCDF from each dimension (feature) in the training data.
- Tail probabilities are also estimated for each dimension. For each observation, the tail probabilities are used to calculate a dimensional anomaly score. These are then aggregated.
- The underlying assumption is that anomalous values will lie at the tails of the eCDFs; a rough sketch of this idea follows this list.
- ECOD is simple & efficient, with no parameters to choose. However, I believe the algorithm, by its nature, cannot capture the interactions in our multivariate time series.
- For example, when evaluated alone, the temperature and precipitation values for a day could be normal compared to the entire distribution for each feature. But their combination, along with the observation date, could make them anomalies.
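As a simplified illustration of the idea (not PyOD's actual implementation), dimensional tail probabilities could be computed from empirical CDFs like this:

```python
import numpy as np

def ecdf_tail_scores(X_train, X):
    """Simplified sketch of the ECOD idea: per-dimension tail probabilities from
    empirical CDFs, aggregated into one anomaly score per observation."""
    n = X_train.shape[0]
    dim_scores = np.zeros(X.shape)
    for j in range(X.shape[1]):
        ref = np.sort(X_train[:, j])
        # Empirical left & right tail probabilities of each value in dimension j
        left_tail = np.searchsorted(ref, X[:, j], side = "right") / n
        right_tail = 1 - np.searchsorted(ref, X[:, j], side = "left") / n
        # Negative log probability of the smaller tail = dimensional outlier score
        dim_scores[:, j] = -np.log(np.minimum(left_tail, right_tail) + 1e-12)
    # Aggregate the dimensional scores into one score per observation
    return dim_scores.sum(axis = 1)
```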
```{python CreateECOD}
# Create ECOD scorer
ecod = ECOD(contamination = (1 - q))
scorer_ecod = PyODScorer(model = ecod, window = 1)
```
```{python ECODCode}
#| code-fold: true
#| code-summary: "Show code for ECOD anomaly detection"
# Perform anomaly scoring
scores_train_ecod, scores_test_ecod, scores_ecod = score(
ts_train, ts_test, scorer_ecod)
# Perform anomaly detection
anoms_train_ecod, anoms_test_ecod, anoms_ecod = detect(
scores_train_ecod, scores_test_ecod, detector)
# Plot scores & original series
plot_series("ECOD scorer", ts_train, ts_test, scores_train_ecod, scores_test_ecod)
# Detections plot
plot_detection("ECOD scores", q, ts_ottawa, scores_ecod, anoms_ecod)
# Plot distributions of anomaly scores
plot_dist("ECOD scorer", scores_train_ecod, scores_test_ecod)
# 3D anomalies plot
plot_anom3d("ECOD scorer", ts_ottawa, anoms_ecod, px_width, px_height)
```
Indeed, we see the anomaly scores of the flagged anomalies are fairly close to those of the non-anomalous observations. The particularly extreme post-2000 observation is not even assigned a very high score.
- The flagged anomalies have temperature and precipitation values all over the value range.
- The anomaly score distributions are considerably less right-skewed compared to previous algorithms, as ECOD does not attribute particularly high scores to any observation.
- Looking at the 3D anomalies plot, ECOD still flags some of the very rainy summer & fall observations, but misses quite a few. There are also numerous flagged observations that seem fairly normal, especially in December.
- Of course, another factor is the quantile we use for the detection threshold. With an even higher quantile, we could end up with only a few meaningful anomalies in the summer and fall months.
We can explain each training point's anomaly scoring with a dimensional outlier score plot. We'll view the plot for the highest precipitation day in the training data, which was assigned the highest training anomaly score by the previous algorithms.
```{python ECODExplainer}
# Special to ECOD scorer: Explain a single training point's outlier scoring
idx_max_precip = np.argmax(
ts_train['TOTAL_PRECIPITATION_OTTAWA'].univariate_values())
scorer_ecod.model.explain_outlier(
ind = idx_max_precip, # Index of point to explain
columns = [0, 1, 2, 3], # Dimensions to explain
feature_names = ["MeanTemp", "TotalPrecip", "MonthSin", "MonthCos"])
```
We see the precipitation value for this observation is assigned a dimensional anomaly score above the 0.999th quantile, while the temperature score is fairly low.
- As stated before, I believe the ECOD algorithm is not very suitable to this task, as I don't think it takes the interactions between weather and time features into account. The scores for each feature are simply aggregated.
- The dimensional anomaly scores for the time features are meaningless and potentially misleading when aggregated, as it's not possible to observe "anomalous" values for these. They are simply an encoding of the observation date, and are meant to interact with the weather measurements rather than be evaluated alone.
### Principal components analysis (PCA)
PCA is mainly used as a dimensionality reduction method: Given N features, it outputs N uncorrelated principal components, ordered so that the first few capture as much of the data variance as possible.
- PCA constructs a hyperplane of eigenvectors from the input features. Anomaly scores can be computed from the distance of each observation to this hyperplane, as roughly sketched below.
- Instead of scaling our features between -1 and 1, we'll use the `standardization = True` argument to center the data around a zero mean and scale it to unit variance.
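A minimal sketch of the distance-from-hyperplane idea, using scikit-learn directly rather than PyOD's `PCA` detector (whose exact scoring may differ):

```python
from sklearn.decomposition import PCA as SkPCA
from sklearn.preprocessing import StandardScaler

# Illustration only: standardize, project onto the main PCs, reconstruct,
# and use the squared reconstruction distance as an anomaly score
sk_scaler = StandardScaler()
x_tr = sk_scaler.fit_transform(ts_train.values())

sk_pca = SkPCA(n_components = 3).fit(x_tr)                  # keep the main hyperplane
recon = sk_pca.inverse_transform(sk_pca.transform(x_tr))    # project & reconstruct
recon_error = ((x_tr - recon) ** 2).sum(axis = 1)           # distance to hyperplane
```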
```{python CreatePCA}
# Create PCA scorer
pca = PCA(contamination = (1 - q), standardization = True, random_state = 1923)
scorer_pca = PyODScorer(model = pca, window = 1)
```
```{python PCACode}
#| code-fold: true
#| code-summary: "Show code for PCA anomaly detection"
# Perform anomaly scoring
scores_train_pca, scores_test_pca, scores_pca = score(
ts_train, ts_test, scorer_pca)
# Perform anomaly detection
anoms_train_pca, anoms_test_pca, anoms_pca = detect(
scores_train_pca, scores_test_pca, detector)
# Plot scores & original series
plot_series("PCA scorer", ts_train, ts_test, scores_train_pca, scores_test_pca)
# Detections plot
plot_detection("PCA scores", q, ts_ottawa, scores_pca, anoms_pca)
# Plot distributions of anomaly scores
plot_dist("PCA scorer", scores_train_pca, scores_test_pca)
# 3D anomalies plot
plot_anom3d("PCA scorer", ts_ottawa, anoms_pca, px_width, px_height)
```
Similar to the K-means approach, the PCA anomaly scorer flags warm days that are unusually rainy as anomalies.
- The differences between scores for anomalous and non-anomalous observations are large.
- Looking at the 3D anomalies plot, the PCA scorer restricts anomaly flags mostly to summer days, and some fall days.
- Similar to the GMM scorer, the anomaly score distributions are very right-skewed.
#### Principal components plots & T-SNE
PCA is mainly used as a dimensionality reduction & feature selection method, and we can extract quite a bit more intuition from it. We'll start by printing the variances explained by each principal component.
```{python PCAVariances}
# Variances explained by each component
pc_variances = scorer_pca.model.explained_variance_ratio_ * 100
pc_variances = [round(x, 2) for x in pc_variances]
pc_variances
```
The first 3 PCs alone explain roughly 98% of the variance in our data. This is not surprising, as 6 of our 8 input features are different encodings of time, highly correlated with one another. We can think of our problem as three-dimensional: temperature, precipitation and time / season.
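The roughly 98% figure can be checked from the cumulative explained variance ratios, for example:

```python
# Cumulative % variance explained by the first 3 PCs
print(np.cumsum(scorer_pca.model.explained_variance_ratio_)[:3] * 100)
```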
Using a heatplot, we can examine the PC loadings: How much each input feature contributes to each PC.
```{python PCAHeatplot}
#| code-fold: true
#| code-summary: "Show code to generate PC loadings heatplot"
# Heatplot of PC loadings, X = PCs, Y = each feature's contribution to the PCs
pc_loadings = pd.DataFrame(
scorer_pca.model.components_.T,
columns = pc_variances,
index = ts_ottawa.components)
_ = sns.heatmap(pc_loadings, cmap = "PiYG")
_ = plt.title("PC loadings")
_ = plt.xlabel("% variances explained by PCs")
_ = plt.ylabel("Features")
plt.show()
plt.close("all")
```
We see PC1 draws mostly on the temperature & time values. PC2 mostly captures information from the time values, while PC3 and PC4 are dominated by precipitation and temperature values respectively.
We can plot the PCs in 3D scatterplots, three at a time, and color the observations by anomaly label.
```{python PCA3DPlots}
#| code-fold: true
#| code-summary: "Show code to generate 3D PC plots"
# Transform data into PC values
pc_train = scorer_pca.model.detector_.transform(
scorer_pca.model.scaler_.transform(ts_train.values())
)
pc_test = scorer_pca.model.detector_.transform(
scorer_pca.model.scaler_.transform(ts_test.values())
)
pcs = np.concatenate((pc_train, pc_test), axis = 0)
# Principal components plot, first 3 PCs
fig = px.scatter_3d(
x = pcs[:, 0],
y = pcs[:, 1],
z = pcs[:, 2],
color = anoms_pca.univariate_values().astype(str),
title = "PCA components plot, first 3 PCs",
labels = {
"x": "PC1",
"y": "PC2",
"z": "PC3",
"color": "Anomaly labels"},
width = px_width,
height = px_height
)
fig.show()
# Principal components plot, PC3-4-5
fig = px.scatter_3d(
x = pcs[:, 4],
y = pcs[:, 3],
z = pcs[:, 2],
color = anoms_pca.univariate_values().astype(str),
title = "PCA components plot, middle 3 PCs",
labels = {
"x": "PC5",
"y": "PC4",
"z": "PC3",
"color": "Anomaly labels"},
width = px_width,
height = px_height
)
fig.show()
# Principal components plot, last 3 PCs
fig = px.scatter_3d(
x = pcs[:, -1],
y = pcs[:, -2],
z = pcs[:, -3],
color = anoms_pca.univariate_values().astype(str),
title = "PCA components plot, last 3 PCs",
labels = {
"x": "PC6",
"y": "PC7",
"z": "PC8",
"color": "Anomaly labels"},
width = px_width,
height = px_height
)
fig.show()
```
Looking at the first plot for PCs 1-3, we can clearly see the temporal structure of the data represented in just 3 dimensions.
- Each of the pillar-like, distinct clusters of observations in the plot likely represents one month of the year.
- The circular shape is due to the cyclical nature of the time variables, such as months or days in a year. We could say the temperature values, which contribute mainly to PC1, also have a cyclical nature.
- We can see anomalies exclusively belong to certain clusters, i.e. certain months.
- The most discriminative dimension when it comes to separating anomalies from normal observations is PC3, which almost exclusively reflects the precipitation dimension.
Similarly in the second plot for PCs 3-5, we see PC3 is the most discriminative dimension. We also see the anomalies are constrained to a certain range in the PC4 dimension, which is made up by the temperature & time values.
The third plot, for PCs 6-8, seems to show these PCs are not very useful (at least by themselves) in separating anomalies from normal observations. Since these PCs seem to have no contribution from the weather features, the shapes are likely not meaningful.
One more thing we can do is apply T-SNE to the PCs, and reduce them further to only 3 dimensions.
- T-SNE stands for t-distributed stochastic neighbor embedding. It is a non-linear dimensionality reduction method that usually outputs 2 or 3 dimensions for data visualization purposes.
- The algorithm finds the similarities of each pair of observations, over all dimensions, and converts them to joint probabilities in a Gaussian distribution. Similar pairs have a higher probability, and vice versa.
- Then the same is done for values in the embedded (2D or 3D) space, but assuming a heavy-tailed Student's t-distribution. The algorithm then tries to minimize the difference between the two joint probabilities.
- T-SNE is computationally expensive, non-deterministic, and has some parameters that may greatly affect the solution. For high-dimensional data, it's suggested to first apply PCA (or another dimensionality reduction method) before applying T-SNE.
There are several potential pitfalls to the T-SNE algorithm, which are explained well [here](https://distill.pub/2016/misread-tsne/).
- In short, it's a good idea to perform & plot T-SNE with several different values for the perplexity parameter, as it can greatly affect the outcome, and result in meaningless shapes if the perplexity is not suitable for the data.
- Perplexity can be thought of as a "number of nearest neighbors" parameter. Generally a value between 5 and 50 is suggested, with larger and denser datasets requiring higher values.
- We'll also run many iterations of the optimization algorithm to ensure we have a good solution. See the [helper functions script](https://github.com/AhmetZamanis/WeatherAnomalyDetectionClassification/blob/main/X_HelperFunctionsAnom.py) for details on the T-SNE parameters & code; a rough sketch of the core step follows below.
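A minimal sketch of what the T-SNE step inside that helper might look like (the actual parameters in the script may differ):

```python
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Illustration only: a single 3D T-SNE embedding of the standardized PCs
tsne = TSNE(
    n_components = 3,    # 3 output dimensions for the scatterplots
    perplexity = 30,     # roughly the "number of nearest neighbors" considered
    init = "pca",        # PCA initialization tends to give more stable results
    random_state = 1923)
embedding_3d = tsne.fit_transform(StandardScaler().fit_transform(pcs))
```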
```{python PCATSNE}
#| code-fold: true
#| code-summary: "Show code to generate 3D T-SNE plots"
# Standardize & normalize PCs
std_scaler = StandardScaler()
pcs_scaled = std_scaler.fit_transform(pcs)
# Apply T-SNE to PCs
plot_tsne(pcs_scaled, anoms_pca, px_width, px_height, perplexities = [10, 30, 50])
```
The T-SNE reductions result in interesting figures. It seems a perplexity of 30 or higher is more suitable to our data, as lower values tend to result in less-separated blobs.
- It looks like there are many smaller clusters of observations, some of them circular, others line-shaped. These likely represent time sequences of particular lengths. The flagged anomalies are at the very center of the figure (try zooming in).
- Keep in mind that the sizes and distances of clusters may be misleading. The T-SNE algorithm tends to expand dense clusters and contract smaller ones, while a single perplexity value may not be suitable to capture the distances between all clusters.
### Isolation forest
The IForest algorithm is based on the premise that isolating an anomalous data point is easier than isolating normal points.
- For each datapoint, a random feature is selected, and a random split value is selected between the minimum and maximum for the feature. This is repeated until the datapoint is isolated from all other datapoints with different values. This process can be represented as a decision tree.
- The shorter the "isolation tree path", the more anomalous the datapoint is. The final anomaly scores are averaged from a forest of such trees. We'll build 500 trees instead of the default 100.
- By default, the IForest algorithm samples 256 datapoints to build each tree in the forest (or uses all datapoints if there are fewer than 256).
- We won't sample features per tree, as we want the time variables to always interact with the measurement variables.
```{python CreateIForest}
# Create IForest scorer
iforest = IForest(
n_estimators = 500, # N. of trees in forest
contamination = (1 - q),
random_state = 1923)
scorer_iforest = PyODScorer(model = iforest, window = 1)
```
```{python IForestCode}
#| code-fold: true
#| code-summary: "Show code for IForest anomaly detection"
# Perform anomaly scoring
scores_train_iforest, scores_test_iforest, scores_iforest = score(
ts_train, ts_test, scorer_iforest)
# Perform anomaly detection
anoms_train_iforest, anoms_test_iforest, anoms_iforest = detect(
scores_train_iforest, scores_test_iforest, detector)
# Plot scores & original series
plot_series(
"IForest scorer", ts_train, ts_test, scores_train_iforest, scores_test_iforest)
# Detections plot
plot_detection("IForest scores", q, ts_ottawa, scores_iforest, anoms_iforest)
# Plot distributions of anomaly scores
plot_dist("IForest scorer", scores_train_iforest, scores_test_iforest)
# 3D anomalies plot
plot_anom3d(
"IForest scorer", ts_ottawa, anoms_iforest, px_width, px_height)
```
Compared to normal observations, the IForest anomaly scorer does not assign much higher scores to anomalous points. The anomalous points are still on the warmer and more rainy side, but less so compared to K-means and GMM anomalies.
- Interestingly, the post-2000 day with the highest precipitation is not flagged as an anomaly.
- Somewhat similar to ECOD, IForest seems to flag anomalies all over the year, some of them fairly questionable. It also flags some points as anomalies while missing other close points that appear even more anomalous.
- This suggests the randomness in the algorithm is not very suitable for finding interactions between variables, though it's better than ECOD.
- The score distributions are less right-skewed, again similar to ECOD. The post-1980 score distribution is a bit higher.
Similar to other tree-based models, we are able to extract feature importance scores from the trained IForest model.
- Keep in mind these scores evaluate each feature's importance by itself. In our case, the particular combinations of feature values are more important.
- The IForest [documentation](https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.iforest) also states that the importance values for high-cardinality features can be misleading, such as for our day of year features.
```{python IForestFeatImp}
#| code-fold: true
#| code-summary: "Show code to plot IForest feature importances"
# Feature importances (can be misleading for high cardinality features, e.g. day
# and week features)
feat_imp = pd.DataFrame({
"Feature importance": scorer_iforest.model.feature_importances_,
"Feature": ts_ottawa.components
}).sort_values("Feature importance", ascending = False)
_ = sns.barplot(data = feat_imp, x = "Feature importance", y = "Feature")
plt.show()
plt.close("all")
```
Looking at these feature importance scores, it's possible the IForest algorithm made its decisions mostly by looking at the temperature & day of year values. This would explain why some high-precipitation anomalies are missed.
## Autoencoder with PyTorch Lightning
The last anomaly scorer we'll use is an autoencoder. Put simply, an autoencoder is a neural network that takes N inputs, compresses them into fewer than N dimensions, and uses the compressed representation to reconstruct the original N inputs.
- The network starts with an **encoder** structure: Layers which compress the inputs into fewer dimensions. This compressed representation is called the **latent space**.
- The layers may get progressively smaller as the inputs progress through the network, enforcing a bottleneck to arrive at the latent space.
- The latent space representation is fed into the **decoder** structure: Layers which expand the latent space back into a reconstruction of the original inputs.
- Usually, the encoder and decoder are symmetrical in terms of architecture.
Autoencoder architectures are used in tasks such as dimensionality reduction, denoising and anomaly detection. In our case, the anomaly score will simply be the **reconstruction error** for each datapoint: The higher the reconstruction error, the more anomalous a datapoint is.
- The assumption is that normal observations can easily be reconstructed from the latent space, while anomalous observations will be harder to reconstruct.
PyOD does implement several autoencoders, mainly in Keras, and also in PyTorch. However, I wanted to implement one from scratch in PyTorch Lightning, especially because the PyOD one does not offer the option to retrieve the latent space representations. We will use the latent space for visualization, just like we did with principal components.
We'll build a fairly simple autoencoder, with two layers each in the encoder and decoder. See the [Lightning classes script](https://github.com/AhmetZamanis/WeatherAnomalyDetectionClassification/blob/main/X_LightningClassesAnom.py) for more details on the architecture and code.
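A rough sketch of what such a LightningModule might look like; the actual `AutoEncoder` class in the linked script differs in details such as layer sizes, dropout and learning rate scheduling:

```python
import torch
import lightning as L

class SketchAutoEncoder(L.LightningModule):
    """Minimal illustration of the encoder-bottleneck-decoder idea, not the
    actual AutoEncoder class used in this report."""
    def __init__(self, input_size = 8, latent_size = 3, learning_rate = 1e-3):
        super().__init__()
        self.learning_rate = learning_rate
        # Encoder: compress the inputs into the latent space
        self.encoder = torch.nn.Sequential(
            torch.nn.Linear(input_size, 6), torch.nn.ReLU(),
            torch.nn.Linear(6, latent_size))
        # Decoder: reconstruct the inputs from the latent space
        self.decoder = torch.nn.Sequential(
            torch.nn.Linear(latent_size, 6), torch.nn.ReLU(),
            torch.nn.Linear(6, input_size))

    def forward(self, x):
        return self.decoder(self.encoder(x))

    def training_step(self, batch, batch_idx):
        loss = torch.nn.functional.l1_loss(self(batch), batch)  # reconstruction error
        self.log("train_loss", loss)
        return loss

    def predict_step(self, batch, batch_idx):
        return self(batch)  # return reconstructions for anomaly scoring

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr = self.learning_rate)
```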
- The tradeoff between overfitting and underfitting is also relevant in autoencoders: We want to accurately reconstruct inputs, but we don't want to memorize them without learning a useful latent space representation.
- We'll handle this tradeoff by simply forcing a bottleneck in the network. We previously saw almost all of the variance in our data can be compressed into 3 principal components, so we'll reduce our 8 inputs into 3 latent space dimensions.
- This also allows us to plot the entire latent space without having to apply T-SNE.
- In my previous experiments, a larger latent space did not result in much lower reconstruction errors on the testing dataset.
- There are other, more sophisticated architectures that handle this problem differently, mainly by including regularization terms in the loss function. [Here](https://www.jeremyjordan.me/autoencoders/) is a good article that illustrates some of them.
### Hyperparameter tuning
I previously tuned the learning rate, learning rate decay and dropout hyperparameters, using the Optuna framework. We'll load the saved best tune from this run. See the code & comments below for details.
```{python TuneAutoencoder}
#| code-fold: true
#| code-summary: "Show code to tune Autoencoder hyperparameters"
#| eval: false
# Split train - val data
val_start = pd.Timestamp("1970-01-01")
train_end = pd.Timestamp("1969-12-31")
ts_tr = ts_train.drop_after(val_start)
ts_val = ts_train.drop_before(train_end)
# Perform preprocessing for train - validation split
x_tr = std_scaler.fit_transform(ts_tr.values())
x_val = std_scaler.transform(ts_val.values())
# Load data into TrainDataset
train_data = TrainDataset(x_tr)
val_data = TrainDataset(x_val)
# Create data loaders
train_loader = torch.utils.data.DataLoader(
train_data, batch_size = 128, num_workers = 0, shuffle = True)
val_loader = torch.utils.data.DataLoader(
val_data, batch_size = len(ts_val), num_workers = 0, shuffle = False)
# Define Optuna objective
def objective_nn(trial, train_loader, val_loader):
    # Define parameter ranges to tune over & suggest param set for trial
    learning_rate = trial.suggest_float("learning_rate", 5e-4, 5e-2)
    lr_decay = trial.suggest_float("lr_decay", 0.9, 1)
    dropout = trial.suggest_float("dropout", 1e-4, 0.2)
    # Create hyperparameters dict
    hyperparams_dict = {
        "input_size": ts_train.values().shape[1],
        "learning_rate": learning_rate,
        "lr_decay": lr_decay,
        "dropout": dropout
    }
    # Validate hyperparameter set
    score, epoch = validate_nn(hyperparams_dict, train_loader, val_loader, trial)
    # Report best n. of epochs
    trial.set_user_attr("n_epochs", epoch)
    return score
# Create study
study_nn = optuna.create_study(
sampler = optuna.samplers.TPESampler(seed = 1923),
pruner = optuna.pruners.HyperbandPruner(),
study_name = "tune_nn",
direction = "minimize"
)
# Instantiate objective
obj = lambda trial: objective_nn(trial, train_loader, val_loader)
# Optimize study
study_nn.optimize(obj, n_trials = 500, show_progress_bar = True)
# Retrieve and export trials
trials_nn = study_nn.trials_dataframe().sort_values("value", ascending = True)
trials_nn.to_csv("./OutputData/trials_nnX.csv", index = False)
```
### Anomaly detection
We'll train the model on the pre-1980 data, and use it to reconstruct both pre-1980 and post-1980 observations.
```{python TrainAutoencoder}
#| code-fold: true
#| code-summary: "Show code to train & predict with Autoencoder model"
# Import best trial
best_trial_nn = pd.read_csv("./OutputData/trials_nn3.csv").iloc[0,]
# Retrieve best hyperparameters
hyperparams_dict = {
"input_size": ts_train.values().shape[1],
"learning_rate": best_trial_nn["params_learning_rate"],
"lr_decay": best_trial_nn["params_lr_decay"],
"dropout": best_trial_nn["params_dropout"]
}
# Perform preprocessing
x_train = std_scaler.fit_transform(ts_train.values())
x_test = std_scaler.transform(ts_test.values())
# Load data into TrainDataset & TestDataset
train_data = TrainDataset(x_train)
test_data = TestDataset(x_test)
train_score_data = TestDataset(x_train)
# Create data loaders
train_loader = torch.utils.data.DataLoader(
train_data, batch_size = 128, num_workers = 0, shuffle = True)
test_loader = torch.utils.data.DataLoader(
test_data, batch_size = len(ts_test), num_workers = 0, shuffle = False)
train_score_loader = torch.utils.data.DataLoader(
train_score_data, batch_size = len(ts_train), num_workers = 0, shuffle = False)
# Create trainer
trainer = L.Trainer(
max_epochs = int(best_trial_nn["user_attrs_n_epochs"]),
accelerator = "gpu", devices = "auto", precision = "16-mixed",
enable_model_summary = True,
logger = True,
enable_progress_bar = True,
enable_checkpointing = True
)
# Create & train model
model = AutoEncoder(hyperparams_dict = hyperparams_dict)
trainer.fit(model, train_loader)
# Perform reconstructions of training & testing data
preds_train = trainer.predict(
model, train_score_loader)[0].cpu().numpy().astype(np.float64)
preds_test = trainer.predict(
model, test_loader)[0].cpu().numpy().astype(np.float64)
```
We'll compute the reconstruction error for each training & testing datapoint as the mean absolute error over all input dimensions. This metric will serve as our anomaly score.
```{python ScoreAutoencoder}
#| code-fold: true
#| code-summary: "Show code to get reconstruction errors"
# Perform anomaly scoring: Get mean reconstruction error for each datapoint
# Train set
scores_train = np.abs(x_train - preds_train) # Absolute errors of each dimension
scores_train = np.mean(scores_train, axis = 1) # Mean absolute error over all dimensions
scores_train = pd.DataFrame(scores_train, index = ts_train.time_index) # Dataframe with corresponding dates
scores_train = TimeSeries.from_dataframe(scores_train) # Darts TS
scores_train = scores_train.with_columns_renamed("0", "Scores")
# Test set
scores_test = np.abs(x_test - preds_test)
scores_test = np.mean(scores_test, axis = 1)
scores_test = pd.DataFrame(scores_test, index = ts_test.time_index)
scores_test = TimeSeries.from_dataframe(scores_test)
scores_test = scores_test.with_columns_renamed("0", "Scores")
scores = scores_train.append(scores_test)
```
Finally, we'll perform anomaly detection & plot our results as usual.
```{python CodeAutoencoder}
#| code-fold: true
#| code-summary: "Show code for Autoencoder anomaly detection"
# Perform anomaly detection
anoms_train, anoms_test, anoms = detect(scores_train, scores_test, detector)
# Plot scores & original series
plot_series("Autoencoder scorer", ts_train, ts_test, scores_train, scores_test)
# Detections plot
plot_detection("Autoencoder scores", q, ts_ottawa, scores, anoms)
# Plot distributions of anomaly scores
plot_dist("Autoencoder scorer", scores_train, scores_test)
# 3D anomalies plot
plot_anom3d(
"Autoencoder scorer", ts_ottawa, anoms, px_width, px_height)
```