-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathReportClassification.qmd
686 lines (452 loc) · 31.8 KB
/
ReportClassification.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
---
title: "Time series classification - Canadian weather data"
author: "Ahmet Zamanis"
format:
gfm:
toc: true
editor: visual
jupyter: python3
execute:
warning: false
---
## Introduction
In time series classification, the goal is to predict the class label attributed to an entire time series rather than a single observation. A common example is classifying ECG readings as normal and abnormal, or predicting whether a machine is faulty from a sequence of sensor readings.
In this report, we'll take multivariate sequences of weather measurements from several locations, and try to predict which location these sequences were observed in. We'll test and compare several time series classifiers from the [sktime](https://github.com/sktime/sktime) package. We'll also try the method of transforming our time series into images (with the [pyts](https://github.com/johannfaouzi/pyts) package) and classifying the images with a convolutional neural network built in PyTorch Lightning.
The data consists of daily mean temperature and total precipitation measurements across 13 Canadian locations, some ranging from 1940 to 2020, others starting from 1960. The data was downloaded from [OpenML](https://openml.org/search?type=data&status=active&id=43843&sort=runs), shared by user Elif Ceren Gök.
The Python scripts for this analysis are available on the [GitHub repository](https://github.com/AhmetZamanis/WeatherAnomalyDetectionClassification). They may not be fully up to date with the report.
```{python Imports}
#| code-fold: true
#| code-summary: "Show imports"
# Data handling
import pandas as pd
import numpy as np
import arff # Installed as liac-arff
import warnings
# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
# PyTorch Lightning & Optuna
import torch
import lightning as L
import optuna
# Time series classifiers
from sktime.classification.distance_based import KNeighborsTimeSeriesClassifier
from sktime.classification.kernel_based import RocketClassifier, Arsenal
from sktime.classification.dictionary_based import MUSE
# Transformations
from pyts.image import RecurrencePlot
from sklearn.preprocessing import OneHotEncoder
# Performance metrics
from sklearn.metrics import accuracy_score, log_loss, confusion_matrix
# Custom Lightning classes
from X_LightningClassesClassif import TrainDataset, TestDataset, CNN, OptunaPruning
# Helper functions
from X_HelperFunctionsClassif import plot_confusion, scale_dims, get_images, plot_images, validate_cnn, test_model
```
```{python Settings}
#| code-fold: true
#| code-summary: "Show settings"
# Set print options
np.set_printoptions(suppress=True, precision=4)
pd.options.display.float_format = '{:.4f}'.format
pd.set_option('display.max_columns', None)
# Set plotting options
plt.rcParams['figure.dpi'] = 200
plt.rcParams['savefig.dpi'] = 200
plt.rcParams["figure.autolayout"] = True
sns.set_style("darkgrid")
# Set Torch settings
torch.set_default_dtype(torch.float32)
torch.set_float32_matmul_precision('high')
L.seed_everything(1923, workers = True)
warnings.filterwarnings("ignore", ".*does not have many workers.*")
```
## Data preparation
The data is in .arff format, and we'll use the [liac-arff](https://github.com/renatopp/liac-arff) package to read it. We have daily mean temperature and total precipitation values for each of the 13 locations.
```{python LoadData}
# Load raw data
raw_data = arff.load(open("./InputData/canadian_climate.arff", "r"))
# Convert to pandas dataframe & view
df = pd.DataFrame(
raw_data["data"], columns = [x[0] for x in raw_data["attributes"]])
print(df.iloc[:,0:4])
```
```{python CheckMissing}
# Check missing values
pd.isnull(df).sum()
```
We'll focus our attention on three of the pre-1960 locations: Ottawa, Toronto and Vancouver. First we'll long convert the data, and retrieve the rows only for these three cities.
```{python LongConvert}
# Wide to long conversion
df = pd.wide_to_long(
df, stubnames = ["MEAN_TEMPERATURE", "TOTAL_PRECIPITATION"],
i = "LOCAL_DATE", j = "LOCATION", sep = "_", suffix = r"\w+")
df = df.reset_index()
# Select observations only for Ottawa, Toronto, Vancouver
df = df[df["LOCATION"].isin(["OTTAWA", "TORONTO", "VANCOUVER"])]
# Convert LOCAL_DATE to datetime, set index for NA interpolation
df["LOCAL_DATE"] = pd.to_datetime(df["LOCAL_DATE"])
df = df.set_index("LOCAL_DATE")
# View long data
print(df)
```
We have a few missing weather measurements across the time period. We'll interpolate them with a simple method.
```{python InterpolateMissing}
# Interpolate missing values in Ottawa, Toronto, Vancouver
df = df.groupby("LOCATION", group_keys = False).apply(
lambda g: g.interpolate(method = "time"))
df = df.reset_index()
```
The daily mean temperature and total precipitation values will be our weather variables. To predict the city of a sequence of weather measurements, we'll need to compare these variables with the dates they were observed in.
- To this end, we'll extract month, week of year and day of year features from our date column.
- We'll use cyclical encoding to transform these categorical variables into numeric, while reflecting their cyclical nature to the algorithms we'll use. This creates a pair of sine and cosine values for each time variable, resulting in a total of 8 variables per location.
```{python CyclicalEncode}
# Add cyclic terms for month, week of year and day of year
df["month_sin"] = np.sin(2 * np.pi * df["LOCAL_DATE"].dt.month / 12)
df["month_cos"] = np.cos(2 * np.pi * df["LOCAL_DATE"].dt.month / 12)
df["week_sin"] = np.sin(2 * np.pi * df["LOCAL_DATE"].dt.isocalendar().week / 53)
df["week_cos"] = np.cos(2 * np.pi * df["LOCAL_DATE"].dt.isocalendar().week / 53)
df["day_sin"] = np.sin(2 * np.pi * df["LOCAL_DATE"].dt.dayofyear / 366)
df["day_cos"] = np.cos(2 * np.pi * df["LOCAL_DATE"].dt.dayofyear / 366)
```
We'll split the data for each city into 28-day sequences. We'll store the data in a 3D numpy array, which is accepted as input by both sktime and pyts.
```{python SequenceConvert}
#| code-fold: true
#| code-summary: "Show code to split data into 28-day sequences"
# Assign n as one less than the target sequence length
n = 27
# Add counter for days
df["DAYCOUNT"] = df.groupby("LOCATION").LOCATION.cumcount().add(1)
# Add rowgroups: A unique number for each n+1 day sequence
df["ROWGROUP"] = (df["DAYCOUNT"] // (n + 1)).astype(str)
df = df.drop("DAYCOUNT", axis = 1)
# Eliminate rowgroups which are not of length N + 1 per city
n_rowgroups = len(df["ROWGROUP"].unique())
rowgroup_lengths = [
int(len(df[df["ROWGROUP"] == str(x)]) / 3) for x in range(0, n_rowgroups)]
rowgroups_to_keep = np.where(np.array(rowgroup_lengths) == n + 1)[0].tolist()
rowgroups_to_keep = [str(x) for x in rowgroups_to_keep]
df = df.loc[df["ROWGROUP"].isin(rowgroups_to_keep)]
# Retrieve targets for each sequence
y = df.groupby(["LOCATION", "ROWGROUP"]).head(1)["LOCATION"]
y = y.reset_index().drop("index", axis = 1).values.flatten()
# Get class labels
classes = np.unique(y)
# Retrieve features as 3Darray of shape (n_sequences, n_dimensions, seq_length)
# Get 2D arrays of (n_dimensions, seq_length) for each sequence
x = df.groupby(["LOCATION", "ROWGROUP"], as_index = False).apply(
lambda g: np.array(
[g["MEAN_TEMPERATURE"].values,
g["TOTAL_PRECIPITATION"].values,
g["month_sin"].values,
g["month_cos"].values,
g["week_sin"].values,
g["week_cos"].values,
g["day_sin"].values,
g["day_cos"].values
]
)
)
# Get a single 3Darray
x = np.array([x[i] for i in range(0, len(x))], dtype = np.float64)
# Print data shape
print("Shape of data (n_sequences, n_dimensions, seq_length): " + str(x.shape))
```
We have 3126 multivariate sequences of 8 variables, each sequence 28 days long. Each of the three cities have an equal number of sequences. Many time series classifiers don't accept sequences of unequal lengths, so we dropped the remainder days which could not make up a 28-day sequence.
Next, we'll split the training and testing data while minding the time order of the sequences. The training data will consist of the first 80% of sequences for each city, and the testing data will consist of the last 20%. We'll do this split manually, as shown in the cell below.
```{python TrainTestSplit}
#| code-fold: true
#| code-summary: "Show code to split training & testing data for each city"
l = len(y) # Length of entire data, for all 3 cities
len_test = int(l / 3 * 0.2) # Length of test data for one city
len_train = int(l / 3 - len_test) # Length of train data for one city
j = int(l / 3) # Length of entire data for one city
# Get indices for training set, for each city
idx_train = list(range(0, len_train)) + list(range(j, len_train + j)) + list(range(j * 2, len_train + (j * 2)))
# Get indices for testing set as the difference from the training set
idx_test = list(range(0, l))
idx_test = list(set(idx_test).difference(idx_train))
# Perform train-test split
y_train, y_test = y[idx_train], y[idx_test]
x_train, x_test = x[idx_train], x[idx_test]
# Print data shapes
print(
"Shape of training data (n_sequences, n_dimensions, seq_length): " +
str(x_train.shape))
print(
"Shape of testing data (n_sequences, n_dimensions, seq_length): " +
str(x_test.shape))
```
### Recurrence plot transformations
The pyts package offers several [methods](https://pyts.readthedocs.io/en/stable/modules/image.html) to convert time series into images, which can then be classified with image classification models. We'll convert our sequences into recurrence plots: Black-and-white images that represent the pairwise distances between each time step in the sequence (called trajectories).
- For each sequence, we'll end up with a 28 x 28 matrix of distances for each of the 8 variables.
- Then the problem practically turns into image classification, with 8 images (channels) of size 28 x 28 for each sequence.
After the recurrence plot transformation, we'll also scale each channel between 0 to 1. We'll do the scaling manually as the usual scaler functions do not accept our data dimensions. See the [helper functions script](https://github.com/AhmetZamanis/WeatherAnomalyDetectionClassification/blob/main/X_HelperFunctionsClassif.py) for more details.
```{python GetImages}
# Create RecurrencePlot transformer
trafo_image = RecurrencePlot()
# Transform the features
x_train_img, x_test_img = get_images(x_train, x_test, trafo_image)
# Print data shape
print(
"Shape of one image-transformed sequence (n_dims, seq_length, seq_length): " + str(x_train_img[0].shape))
```
We can plot our transformed data as black and white images. We'll plot two consecutive sequences of weather measurements for each city.
```{python PlotImages1}
# Plot the recurrence plot for two consecutive sequences per city, for the weather
# dimensions
plot_images(x_train_img, 0, "MeanTemp", 0, 1, len_train)
plot_images(x_train_img, 1, "TotalPrecip", 0, 1, len_train)
```
Each plot is the pairwise similarity matrix of each trajectory in that sequence, for that variable. The resulting images should identify the city when compared with other images for the same city.
We can also plot the time dimensions, and confirm they are the same for each city.
- These dimensions won't identify the cities by themselves, they are only fed into the models for interaction with the weather variables.
- A particular weather measurement alone may not identify the city, but the same measurement at specific times could.
```{python PlotImages2}
# Plot the recurrence plot for two consecutive sequences per city, for the month dimensions
plot_images(x_train_img, 2, "MonthSin", 0, 1, len_train)
plot_images(x_train_img, 3, "MonthCos", 0, 1, len_train)
```
## Time series classification
We'll try several time series classification methods, focusing on those that natively handle multivariate time series.
- The idea behind most of these multivariate methods is to perform certain transformations on the time series, or to extract certain features from them. These can then be used as inputs in regular classification algorithms.
- The image transformation we applied is one such method. Other methods take a text classification approach, transforming time series into word representations and deriving features from these.
- Some methods even rely on extracting simple summary statistics from the time series.
### K-nearest neighbors with DTW distance
The first approach we'll use is the k-nearest neighbor classifier adapted to time series data. The main difference is the distance metric used: **Dynamic time warping** distance.
- DTW is a method that calculates the optimal match between the observations of two time series, even if they are of unequal length. This is done by minimizing the differences between each pair subject to certain conditions.
- DTW is often used in tasks like speech recognition, where the analyzed sequences may be of different lengths.
- DTW outputs a similarity metric for two time series, which will be our distance metric in the kNN classifier.
The number of nearest neighbors used to classify each observation is the key parameter in kNN. We'll just use 3, but the result may considerably change with different values.
```{python CreateKNN}
# Create KNN classifier
model_knn = KNeighborsTimeSeriesClassifier(
n_neighbors = 3, # N. of nearest neighbors to compare each observation
n_jobs = -1)
```
We'll train our models, test their performances and retrieve their predictions on the test set. We'll aggregate and compare the performance metrics at the end.
```{python TestKNN}
# Test KNN classifier
preds_knn, probs_knn, acc_knn, loss_knn = test_model(
model_knn, x_train, x_test, y_train, y_test, scale = True)
# View predicted probabilities for each city
print("KNN classifier predicted probabilities for each city:")
print(pd.DataFrame(probs_knn, columns = ["Ottawa", "Toronto", "Vancouver"]))
```
Since our model makes predictions with the 3 nearest neighbors, the probability predictions are fairly simplistic.
### ROCKET & Arsenal
ROCKET is a fast and efficient transformation used to derive features for time series classification. It relies on convolutional kernels, not unlike the convolutional neural network approach we'll use later. [See here](https://d2l.ai/chapter_convolutional-neural-networks/conv-layer.html) for a good explanation of convolution, which is best made visually.
- Instead of learning the weights for the convolutional kernels during training (as we will do with a CNN), ROCKET simply generates a large number of kernels randomly. In the sktime implementation, the kernels are randomly assigned a length of 7, 9 or 11.
- The convolutions are applied to each observation (which are univariate or multivariate time series, and not single values), and two features are extracted from the resulting feature maps: The maximum value and the proportion of positive values. These can then be used as features in classifier algorithms.
- With a univariate time series of length N, the convolution is between an input vector of length N, and a kernel vector of length L.
- With a multivariate time series, each kernel is assigned a random selection of D features as the input channels. A set of weights are randomly generated for each selected input channel. The convolution is between an input matrix of size N x D, and a kernel matrix of L x D.
- In effect, we convert our 3D data of shape `(n_sequences, n_features, seq_length)` into 2D data of shape `(n_sequences, n_kernels * 2)`.
- sktime's `RocketClassifier` combines a ROCKET transformation with sklearn's `RidgeClassifierCV`: L2-regularized logistic regression with built-in crossvalidation. We'll stick to this approach, but other classifiers can also be used.
We'll also try `Arsenal`, which is simply an ensemble of ROCKET + RidgeClassifierCV classifiers, by default 25 of them. The ensemble weighs the predictions of each classifier according to their built-in crossvalidation accuracies.
Besides the default ROCKET transformation described above, there are two more variations of the ROCKET transform, MiniRocket and MultiRocket.
- MiniRocket is a highly deterministic version of ROCKET: The kernels are fixed at length 9 and kernel weights are restricted to two values. The only feature derived from the feature maps is the proportion of positive values.
- MultiRocket applies MiniRocket to both the original and first-order differenced series. It also derives three more features from the feature maps: Mean of positive values, mean of indices of positive values, and longest stretch of positive values.
- Both MiniRocket and MultiRocket performed considerably better than the default ROCKET on this problem. We'll continue with MultiRocket in this demonstration.
- `RocketClassifier` automatically chooses the univariate or multivariate version of either ROCKET variant based on the input data shape.
- MultiRocket is not to be confused with a multivariate version of the default ROCKET. Both variants have univariate and multivariate versions.
```{python CreateRocket}
# Create RocketClassifier
model_rocket = RocketClassifier(
rocket_transform = "multirocket", # Use MultiRocket variant
n_jobs = -1, random_state = 1923)
# Create Arsenal classifier (probabilistic ROCKET ensemble, memory intensive)
model_arsenal = Arsenal(
rocket_transform = "multirocket", # Use MultiRocket variant
random_state = 1923)
```
```{python TestRocket}
# Test RocketClassifier
preds_rocket, probs_rocket, acc_rocket, loss_rocket = test_model(
model_rocket, x_train, x_test, y_train, y_test, scale = False)
# View predicted probabilities for each city (ROCKET is non-probabilistic)
print("MultiRocket classifier predicted probabilities for each city:")
print(pd.DataFrame(probs_rocket, columns = ["Ottawa", "Toronto", "Vancouver"]))
```
```{python TestArsenal}
# Test Arsenal classifier
preds_arsenal, probs_arsenal, acc_arsenal, loss_arsenal = test_model(
model_arsenal, x_train, x_test, y_train, y_test, scale = False)
# View predicted probabilities for each city
print("Arsenal classifier predicted probabilities for each city:")
print(pd.DataFrame(probs_arsenal, columns = ["Ottawa", "Toronto", "Vancouver"]))
```
A single `RocketClassifier` does not yield probabilistic predictions, but an `Arsenal` ensemble does, even though they are derived from the average of 25 binary predictions.
### MUSE
The next method we'll use is a bag-of-words approach.
- In general, these approaches assign "letters" to values in a time series, binning similar values into the same categories.
- Subsequences can be extracted from time series, and transformed into a sequence of letters, i.e. "words".
- Predictions can be made based on the appearance frequencies of words, borrowing from text classification concepts such as TF-IDF, or by using the derived words as features in regular classifiers.
The **WEASEL** method extracts words from a time series with multiple sliding windows of varying sizes, and selects the most predictive ones, with the chi-squared test by default. **MUSE** extends this approach to multivariate time series.
The sktime implementation of `MUSE` combines the WEASEL + MUSE transformation with a logistic regression classifier, trained on the features chosen by a one-way ANOVA test. We'll use this method with pretty much the default settings.
- We won't use the first-order differences of the features, as they are likely not meaningful for our time features. MiniRocket does not use first-order differences while MultiRocket does, and both methods perform very similarly on this problem.
```{python CreateMUSE}
# Create MUSE classifier
model_muse = MUSE(
use_first_order_differences = False, # Not meaningful for time features
support_probabilities = True, # Train LogisticRegression which outputs probs.
n_jobs = -1, random_state = 1923)
```
```{python TestMUSE}
# Test classifier
preds_muse, probs_muse, acc_muse, loss_muse = test_model(
model_muse, x_train, x_test, y_train, y_test, scale = True)
# View predicted probabilities for each city
print("MUSE classifier predicted probabilities for each city:")
print(pd.DataFrame(probs_muse, columns = ["Ottawa", "Toronto", "Vancouver"]))
```
The probability predictions of the MUSE classifier are likely more sophisticated than the previous methods.
### Convolutional neural network with recurrence plots
Finally, we'll perform image classification on the data we transformed into recurrence plots, using a typical convolutional neural network. For details on the code & architecture, see the [custom Lightning classes script](https://github.com/AhmetZamanis/WeatherAnomalyDetectionClassification/blob/main/X_LightningClassesClassif.py). Also [see here](https://d2l.ai/chapter_convolutional-neural-networks/index.html) for an excellent source on CNNs and deep learning in general. Below, we'll talk briefly about the intuition behind a CNN and our implementation.
Convolutional layers apply convolution kernels to their inputs, similar to ROCKET. Unlike the kernels in ROCKET though, the kernel weights in a convolutional layer are optimized as parameters during model training, just like weights and bias terms in a dense layer.
The main idea in CNNs is to take in a high-dimensional input, and to reduce its dimensionality with multiple convolutional layers, while also increasing the number of channels. This basically trades image resolution in favor of more features, allowing us to capture non-linear relationships & interaction effects from the data in a computationally efficient manner.
- The convolution operations "summarize" the values of neighboring points in a matrix, which may be a matrix of pixels of an image. In our case, it is a matrix of pairwise distances for time steps in a sequence.
- The performance of CNNs depend on several assumptions about the nature of the problem and the data (see the source above for detailed explanations). The way I understand it, we're assuming we can solve our problem by "locally summarizing" neighboring values, and looking for relationships & interactions between these "local summaries".
Our network inputs (one sequence, ignoring training batch size) are of shape `(8, 28, 28)`: Images of size 28 x 28, with 8 channels (features). Our network starts with three successive blocks of convolutional + maximum pooling layers. Each block will roughly reduce the input size into half, while roughly doubling the channel size.
- Pooling simply applies a window to its input (the output of convolution), and pools the values in the window according to a function, usually the maximum. This can help the network generalize better.
- The fourth convolutional + pooling block will reduce the channel size to 3, the number of target class labels in our classification problem.
- Finally, we'll apply global average pooling & flattening to arrive at a vector of length 3 as our final output: The predicted logits of belonging to each target class. To convert these into probability predictions, we can simply apply a softmax activation function.
- Older CNN architectures feed the outputs of convolutional layers into dense layers to arrive at the final predictions. More recent architectures instead leverage global average pooling, which reduces model complexity and often performs better.
- [See here](https://d2l.ai/chapter_convolutional-modern/index.html) for a detailed breakdown of some modern CNN architectures, which were referenced while designing the network for this problem.
- The CNN used in this report mostly resembles the [Network in Network](https://arxiv.org/abs/1312.4400) (NiN) architecture, which avoids using dense layers at the end. Instead, there are two 1x1 convolutions after each "main" convolution layer. These induce local non-linearities while avoiding large & computationally expensive dense layers in the network.
#### Data prep
Below are the additional data preparation steps needed to train our CNN model with Lightning. One advantage of CNNs over most other NNs is relatively fewer number of hyperparameters to optimize. In this case, I only tuned the learning rate & learning rate decay with Optuna. Here we'll load the best tune previously saved, [see here](https://github.com/AhmetZamanis/WeatherAnomalyDetectionClassification/blob/main/ScriptsClassification/2.4_CNN.py) for the code used for hyperparameter tuning.
```{python CNNDataPrep}
#| code-fold: true
#| code-summary: "Show code to prepare data for CNN training"
# Convert multiclass targets into binary matrices of shape (n_seq, n_classes)
encoder_onehot = OneHotEncoder(sparse_output = False)
y_train_img = encoder_onehot.fit_transform(y_train.reshape(-1, 1))
y_test_img = encoder_onehot.fit_transform(y_test.reshape(-1, 1))
# Load data into TrainDataset
train_data = TrainDataset(x_train_img, y_train_img)
test_data = TestDataset(x_test_img)
# Create dataloaders
train_loader = torch.utils.data.DataLoader(
train_data, batch_size = 128, num_workers = 0, shuffle = True)
test_loader = torch.utils.data.DataLoader(
test_data, batch_size = len(test_data), num_workers = 0, shuffle = False)
# Import best trial
best_trial_cnn = pd.read_csv("./OutputData/trials_cnn1.csv").iloc[0,]
# Retrieve best hyperparameters
hyperparams_dict = {
"input_channels": x_train.shape[1],
"learning_rate": best_trial_cnn["params_learning_rate"],
"lr_decay": best_trial_cnn["params_lr_decay"]
}
```
#### Model testing
We'll train our CNN, predict the testing data and retrieve both the probability & class predictions in the same format as the previous models.
```{python TrainCNN}
# Create trainer
trainer = L.Trainer(
max_epochs = int(best_trial_cnn["user_attrs_n_epochs"]),
accelerator = "gpu", devices = "auto", precision = "16-mixed",
enable_model_summary = True,
logger = True,
log_every_n_steps = 20,
enable_progress_bar = True,
enable_checkpointing = True
)
# Create & train model
model = CNN(hyperparams_dict)
trainer.fit(model, train_loader)
```
```{python PredCNN}
# Predict testing data
probs_cnn = trainer.predict(model, test_loader)
probs_cnn = probs_cnn[0].cpu().numpy().astype(np.float32)
# Convert to class predictions
preds_cnn = classes[np.argmax(probs_cnn, axis = 1)]
# Calculate performance metrics
acc_cnn = accuracy_score(classes[np.argmax(y_test_img, axis = 1)], preds_cnn)
loss_cnn = log_loss(y_test_img, probs_cnn)
# View predicted probabilities for each city
print("CNN classifier predicted probabilities for each city:")
print(pd.DataFrame(probs_cnn, columns = ["Ottawa", "Toronto", "Vancouver"]))
```
Again, one benefit of the CNN model over some previous models is meaningful probability predictions.
## Performance comparison
Now we can compare the performances of our classifiers. Since our data is perfectly balanced between the three target classes, we can use accuracy as a simple and intuitive metric for class predictions overall. We'll also assess the quality of probability predictions with log loss.
### Metrics table
We'll compare the performance metrics of our classifiers with random chance predictions as a baseline. With 3 perfectly balanced target classes, random chance would yield an accuracy of roughly 33%. We'll also compute log loss for random chance, assuming each class is assigned a probability of 0.33 for all sequences.
```{python MetricsTable}
#| code-fold: true
#| code-summary: "Show code to create performance metrics table"
# Generate random chance performance metrics
p = 0.33
probs_random = np.repeat(np.repeat(p, 3), len(y_test))
probs_random = np.reshape(probs_random, (len(y_test), 3))
loss_random = log_loss(y_test, probs_random)
# Gather performance metrics as dictionary
dict_metrics = {
"Accuracy": [round(x, 4) for x in [
p, acc_knn, acc_rocket, acc_arsenal, acc_muse, acc_cnn]],
"Log loss": [round(x, 4) for x in [
loss_random, loss_knn, loss_rocket, loss_arsenal, loss_muse, loss_cnn]]
}
# Print as table
print(pd.DataFrame(dict_metrics, index = [
"Random choice", "kNN with DTW", "MultiRocket", "Arsenal", "MUSE", "CNN"]))
```
We see the Arsenal method performs considerably better than the others in accuracy, and even a single MultiRocket + Ridge model comes second.
- The CNN model is second from last in accuracy, though it comes very close to kNN and MultiRocket.
- MUSE takes last place in accuracy, though it's not far from the other methods, and all methods are a huge improvement over random chance.
When it comes to the quality of probability predictions, we have a different story, as CNN has the best log loss value, followed by Arsenal.
- MUSE performs slightly worse in log loss compared to the random chance method.
- kNN and MultiRocket do not output meaningful probability predictions.
### Confusion matrix plots
We have three target classes, and neither performance metric will inform us about the isolated performance for one class. We'll plot the confusion matrices for the class predictions of each model to visually assess this. We won't plot one for MultiRocket, as Arsenal is an ensemble of MultiRocket + Ridge models.
```{python ConfusionMatrix}
#| code-fold: true
#| code-summary: "Show code to create confusion matrix plots"
# Gather class predictions as dictionary
dict_preds = {
"kNN with DTW": preds_knn,
# "ROCKET + ridge": preds_rocket,
"Arsenal + ridge": preds_arsenal,
"MUSE + logistic": preds_muse,
"CNN with recurrence plots": preds_cnn
}
# Gather axes as dictionary
dict_axes = {
"kNN with DTW": (0, 0),
"Arsenal + ridge": (0, 1),
"MUSE + logistic": (1, 0),
"CNN with recurrence plots": (1, 1)
}
# Create figure
fig, ax = plt.subplots(2,2, sharey = True, sharex = True)
_ = fig.suptitle("Confusion matrices of time series classifiers")
# Generate plots
for key in dict_preds.keys():
# Get confusion matrix
matrix = confusion_matrix(y_test, dict_preds[key], labels = classes)
# Create plot
idx = dict_axes[key]
_ = sns.heatmap(
matrix, xticklabels = classes, yticklabels = classes, cmap = "Reds",
annot = True, fmt = "g", square = True, cbar = False, linecolor = "black",
linewidths = 0.5, ax = ax[idx])
_ = ax[idx].set_xlabel("Predicted classes")
_ = ax[idx].set_ylabel("True classes")
_ = ax[idx].set_title(key)
plt.show()
plt.close("all")
```
A simple way to assess the confusion matrix plots is to look at the diagonals, which represent cases of true classes matching predicted classes. This way, we can see Arsenal does the best job in class predictions, overall and for each class individually.
Looking at horizontal rows will give us an idea of isolated performances for each class.
- We see all models do a great job of identifying sequences from Vancouver, but generally struggle telling apart Ottawa and Toronto. This is expected as Vancouver is on the west coast, with a considerably different climate, while Ottawa and Toronto are close to one another on the east coast.
- kNN often confuses Ottawa & Toronto for Vancouver, in addition to one another. Other models mostly confuse Ottawa & Toronto for one another, and not for Vancouver.
- Arsenal has the best performance in identifying all cities.
## Conclusion
We compared several different approaches to multivariate time series classification, and the results are subjective.
- While the Arsenal ensemble of MultiRocket + Ridge classifiers clearly performed better in class predictions, the CNN model trained on image-transformed data performed better in probability predictions.
- The latter can be more important in many cases, especially when it comes to predicting inherently unstable phenomena like the weather. We may prefer a more realistic probability estimate over a simplified class prediction, so we can assess the uncertainty ourselves.
- Still, the Arsenal ensemble also achieved a comparable log loss score, which is very impressive considering the simple nature of its probability predictions. Also keep in mind that we can combine a ROCKET transformation with any classifier in theory, instead of just logistic regression.
- Personally, I believe in finding the simplest tool that solves a problem. In future time series classification problems, ROCKET variations will likely be my first choices.
Any comments or suggestions are welcome.