Evaluation
Once a data set has been labeled (see Labeling), one or more models may be evaluated for accuracy using the evaluate tool. Evaluation uses K-fold cross validation, in which the data set is randomly divided into K folds (partitions). K-1 folds are used to train the model, and the held-out fold is used as a test set to determine accuracy. This is repeated K times so that each fold is held out once, and the results from the held-out folds are averaged to produce a single set of accuracy numbers. Model accuracy is reported as a confusion matrix and three key metrics: precision, recall, and F1 score. The evaluate tool provides each of these.
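The fold rotation described above can be sketched in a few lines of Python. This is only an illustration of the general K-fold technique, not the evaluate tool's implementation; the function name `k_fold_splits` and the `seed` parameter are hypothetical.

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random assignment of samples to folds
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, test

# 395 samples and 3 folds, as in the evaluation later on this page.
splits = list(k_fold_splits(n_samples=395, k=3))
for train, test in splits:
    print(len(train), "training samples,", len(test), "held-out samples")
```

Each sample lands in exactly one held-out fold, so every sample is tested once by a model that never saw it during training.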
Consider the following data set:
% sound-info -sounds metadata.csv
Loading aisp properties from file:/c:/dev/aisp/aisp.properties
Loading sounds from [metadata.csv]
Total: 103 samples, 1 label(s), 00:01:38.188 hr:min:sec
Label: class, 103 samples, 3 value(s), 00:01:38.188 hr:min:sec
Value: ambient, 61 samples, 00:01:11.960 hr:min:sec
Value: click, 23 samples, 00:00:5.076 hr:min:sec
Value: voice, 19 samples, 00:00:21.152 hr:min:sec
For our example, we will use the lpknn model. Models typically require fixed-length clips of sound for both training and classification. The appropriate clip length generally depends on the target sounds. A general rule of thumb is that the labeled sound should represent 75% of the clip. So for very short events, say 500-1000 milliseconds long, you should probably not use a clip length greater than 1 second. Our sounds contain a very short event, click, which calls for a short clip length of 250 msec. Using this clip length, the inventory of sounds becomes:
% sound-info -sounds metadata.csv -clipLen 250
Loading aisp properties from file:/c:/dev/aisp/aisp.properties
Sounds will be clipped every 250 msec into 250 msec clips (padding=NoPad)
Total: 342 samples, 1 label(s), 00:01:25.500 hr:min:sec
Label: class, 342 samples, 3 value(s), 00:01:25.500 hr:min:sec
Value: ambient, 261 samples, 00:01:5.250 hr:min:sec
Value: click, 5 samples, 00:00:1.250 hr:min:sec
Value: voice, 76 samples, 00:00:19.000 hr:min:sec
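The drop in click samples can be understood with a small sketch. It assumes that, without padding, a sound is cut into non-overlapping clips and any trailing remainder shorter than the clip length is discarded. The function name `count_clips` is hypothetical, and the actual tool may differ in details.

```python
def count_clips(duration_msec, clip_len_msec):
    """Number of full, non-overlapping clips in a sound when no padding is applied."""
    return duration_msec // clip_len_msec

print(count_clips(1100, 250))  # four full clips; the trailing 100 msec is dropped
print(count_clips(220, 250))   # a sound shorter than the clip length yields no clips
```

In the original inventory, the 23 click samples total 5.076 seconds, an average of roughly 220 msec each, so under this assumption most of them are shorter than 250 msec and produce no clip at all.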
Note that we now have only 5 click samples, which is small relative to the other label values. This can be partially addressed by padding clips that are too short with duplicate data, as follows:
% sound-info -sounds metadata.csv -clipLen 250 -pad duplicate
Loading aisp properties from file:/c:/dev/aisp/aisp.properties
Loading sounds from [metadata.csv]
Sounds will be clipped every 250 msec into 250 msec clips (padding=DuplicatePad)
Total: 395 samples, 1 label(s), 00:01:38.750 hr:min:sec
Label: class, 395 samples, 3 value(s), 00:01:38.750 hr:min:sec
Value: ambient, 288 samples, 00:01:12.000 hr:min:sec
Value: click, 24 samples, 00:00:6.000 hr:min:sec
Value: voice, 83 samples, 00:00:20.750 hr:min:sec
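A minimal sketch of duplicate padding follows. It assumes a short clip is filled by repeating its own samples from the start, and that a clip shorter than half the requested length is dropped rather than padded. The function name `pad_duplicate` is hypothetical, and the exact way the tool duplicates data is an assumption.

```python
def pad_duplicate(samples, clip_len):
    """Return samples extended to clip_len by repetition, or None if too short to pad."""
    if len(samples) >= clip_len:
        return samples[:clip_len]
    if 2 * len(samples) < clip_len:
        return None  # less than half the requested length: not padded
    repeats = -(-clip_len // len(samples))  # ceiling division
    return (samples * repeats)[:clip_len]

print(pad_duplicate([1, 2, 3], 5))  # padded by repeating the leading samples
print(pad_duplicate([1, 2], 5))     # too short to pad
```

Under these assumptions, a click of about 220 msec would be padded to a full 250 msec clip instead of being discarded, which is consistent with the click count rising in the inventory above.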
Padding is not done when the remaining clip is less than half the requested length. Padding has improved the numbers, but ideally there would be an equal number of samples for each label value. When this is not the case, the data can be balanced for both evaluation and training. For evaluation, the -kfoldBalance option is used (see the -balance-with option for the train tool). To evaluate our model:
% evaluate -model lpknn -sounds metadata.csv -clipLen 250 -pad duplicate -label class -cm \
-folds 3 -kfoldBalance max
Loading aisp properties from file:/c:/dev/aisp/aisp.properties
Loading sounds from [metadata.csv]
Sounds will be clipped every 250 msec into 250 msec clips (padding=DuplicatePad)
Warning: Nashorn engine is planned to be removed from a future JDK release
Training and evaluating classifier (LpDistanceMergeKNNClassifier)
Sounds : Total: 395 samples, 1 label(s), 00:01:38.750 hr:min:sec
Label: class, 395 samples, 3 value(s), 00:01:38.750 hr:min:sec
Value: ambient, 288 samples, 00:01:12.000 hr:min:sec
Value: click, 24 samples, 00:00:6.000 hr:min:sec
Value: voice, 83 samples, 00:00:20.750 hr:min:sec
Evaluating 3 of 3 folds with balanced training data at 288 samples per label value.
Evaluation completed in 2436 msec. 812 msec/fold (computed in parallel)
Evaluated label name: class
COUNT MATRIX:
Predicted ->[ ambient ][ click ][ voice ]
ambient ->[ * 285 * ][ 2 ][ 1 ]
click ->[ 8 ][ * 16 * ][ 0 ]
voice ->[ 10 ][ 3 ][ * 70 * ]
PERCENT MATRIX:
Predicted ->[ ambient ][ click ][ voice ]
ambient ->[ * 72.15 * ][ 0.51 ][ 0.25 ]
click ->[ 2.03 ][ * 4.05 * ][ 0.00 ]
voice ->[ 2.53 ][ 0.76 ][ * 17.72 * ]
Label | Count | F1 | Precision | Recall
ambient | 288 | 96.447 | 94.059 | 98.958
click | 24 | 71.111 | 76.190 | 66.667
voice | 83 | 90.909 | 98.592 | 84.337
Micro-averaged | 395 | 93.922 | 93.922 | 93.922
Macro-averaged | 395 | 85.432 | 88.684 | 83.324
Precision: 93.922 +/- 0.6399% (micro), 88.684 +/- 3.4790% (macro)
Recall : 93.922 +/- 0.6399% (micro), 83.324 +/- 6.9818% (macro)
F1 : 93.922 +/- 0.6399% (micro), 85.432 +/- 5.4500% (macro)
The above shows both confusion matrices and the precision, recall, and F1 metrics. Note that the confusion matrix does NOT show balanced data, because only the K-1 training folds are balanced on each iteration, not the held-out test set. At this point, you can choose to use the model as is or tune it using your JavaScript model definition (see Models for how to define your own model).
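To see how the per-label metrics relate to the count matrix, the following sketch recomputes them using the standard definitions of precision, recall, and F1. It is an illustration only, not the evaluate tool's code; the function name `per_label_metrics` is hypothetical.

```python
labels = ["ambient", "click", "voice"]
# Rows are actual labels, columns are predicted labels (from the COUNT MATRIX).
counts = [
    [285, 2, 1],
    [8, 16, 0],
    [10, 3, 70],
]

def per_label_metrics(counts):
    """Return a (precision, recall, f1) tuple for each label in the matrix."""
    metrics = []
    for i in range(len(counts)):
        tp = counts[i][i]                          # correctly predicted as label i
        actual = sum(counts[i])                    # row sum: samples truly of label i
        predicted = sum(row[i] for row in counts)  # column sum: samples predicted as i
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics

results = per_label_metrics(counts)
for label, (p, r, f1) in zip(labels, results):
    print(f"{label:8s} F1={100 * f1:.3f} precision={100 * p:.3f} recall={100 * r:.3f}")
```

The per-label values computed this way agree with the table above. The micro- and macro-averaged figures reported by the tool differ slightly from averages taken over this pooled matrix (for example, 371 of 395 correct is about 93.924%), presumably because the tool averages the metrics across folds.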