Evaluation
Once a data set has been labeled (see [Labeling](Labeling)), one or more models may be evaluated for accuracy using the evaluate tool. Evaluation uses K-fold cross validation, in which the data set is randomly divided into K folds (partitions); K-1 folds are used to train the model and the held-out fold is used as a test set to measure accuracy. The K combinations of training and held-out folds are trained and tested in turn, and the results from the held-out folds are averaged to produce a single set of accuracy numbers. Model accuracy is reported as a confusion matrix and three key metrics: precision, recall, and F1 score. The evaluate tool provides each of these.
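For reference, the three metrics are computed per label value in the standard way from true positives (TP), false positives (FP), and false negatives (FN), with each reported number averaged over the K held-out folds:

$$
\text{precision} = \frac{TP}{TP+FP}, \qquad
\text{recall} = \frac{TP}{TP+FN}, \qquad
F_1 = 2\cdot\frac{\text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}}
$$

These are the standard definitions and are not specific to the evaluate tool.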
We consider a set of data as follows:
```bash
% sound-info -sounds metadata.csv
Loading aisp properties from file:/c:/dev/aisp/aisp.properties
Loading sounds from [metadata.csv]
Total: 103 samples, 1 label(s), 00:01:38.188 hr:min:sec
Label: class, 103 samples, 3 value(s), 00:01:38.188 hr:min:sec
Value: ambient, 61 samples, 00:01:11.960 hr:min:sec
Value: click, 23 samples, 00:00:5.076 hr:min:sec
Value: voice, 19 samples, 00:00:21.152 hr:min:sec
```
For our example, we will use the lpknn model. Models typically require a fixed length of sound to train on and to classify, and the appropriate clip length generally depends on the target sounds. A general rule of thumb is that the labeled event should make up about 75% of the clip, so for very short events, say 500-1000 milliseconds long, you should probably not use a clip length larger than 1 second. Our sounds contain a very short event, click, which calls for a short clip length of 250 msec.
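For example, under this rule of thumb an event lasting roughly 750 milliseconds suggests a clip length of at most

$$
\frac{750\ \text{msec}}{0.75} = 1000\ \text{msec},
$$

consistent with the 1 second upper bound above.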
Using such a clip length, we can now see the resulting inventory of sounds:

```bash
% sound-info -sounds metadata.csv -clipLen 250
Loading aisp properties from file:/c:/dev/aisp/aisp.properties
Sounds will be clipped every 250 msec into 250 msec clips (padding=NoPad)
Total: 342 samples, 1 label(s), 00:01:25.500 hr:min:sec
Label: class, 342 samples, 3 value(s), 00:01:25.500 hr:min:sec
Value: ambient, 261 samples, 00:01:5.250 hr:min:sec
Value: click, 5 samples, 00:00:1.250 hr:min:sec
Value: voice, 76 samples, 00:00:19.000 hr:min:sec
```

Note that we have only 5 click samples, which is relatively few compared to the other label values. This can be partially addressed by padding clips that are too short with duplicate data, as follows:
```bash
% sound-info -sounds metadata.csv -clipLen 250 -pad duplicate
Loading aisp properties from file:/c:/dev/aisp/aisp.properties
Loading sounds from [metadata.csv]
Sounds will be clipped every 250 msec into 250 msec clips (padding=DuplicatePad)
Total: 395 samples, 1 label(s), 00:01:38.750 hr:min:sec
Label: class, 395 samples, 3 value(s), 00:01:38.750 hr:min:sec
Value: ambient, 288 samples, 00:01:12.000 hr:min:sec
Value: click, 24 samples, 00:00:6.000 hr:min:sec
Value: voice, 83 samples, 00:00:20.750 hr:min:sec
```
Padding is not done when the remaining clip is less than half the requested length.
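To make this behavior concrete, here is a minimal sketch of duplicate padding in Python. It is an illustration of the scheme as described above, not the tool's implementation; the function name and signature are hypothetical.

```python
def duplicate_pad(samples, clip_len):
    """Pad a short clip out to clip_len samples by repeating its own data.

    Returns None when the clip is shorter than half the requested length,
    mirroring the rule that such remainders are simply dropped.
    """
    if len(samples) < clip_len / 2:
        return None                       # too short: drop, don't pad
    padded = list(samples)
    while len(padded) < clip_len:
        # Append copies of the original samples until the clip is full.
        padded.extend(samples[:clip_len - len(padded)])
    return padded

# A 150-sample remainder is padded up to a 250-sample clip:
assert len(duplicate_pad(list(range(150)), 250)) == 250
# A 100-sample remainder (less than half of 250) is dropped:
assert duplicate_pad(list(range(100)), 250) is None
```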
This has improved the numbers, but ideally there would be an equal number of samples for each label value. When that is not the case, the data can be balanced for both evaluation and training. For evaluation, the -kfoldBalance option is used (see the -balance-with option of the train tool for training); one simple balancing strategy is sketched below.
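For intuition, a common balancing strategy is to randomly downsample each label value to the size of the smallest one. Whether the tool downsamples, upsamples, or does something else is controlled by the options above; this sketch and its function name are illustrative only.

```python
import random
from collections import defaultdict

def balance_downsample(samples, label_of):
    """Randomly downsample every label value to the smallest class size."""
    by_label = defaultdict(list)
    for s in samples:
        by_label[label_of(s)].append(s)
    smallest = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(random.sample(group, smallest))
    return balanced

# With 288 ambient, 24 click and 83 voice clips, this would keep
# 24 randomly chosen clips of each label value (72 in total).
```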
So, to evaluate our model:
```bash
% evaluate -model lpknn -sounds metadata.csv -clipLen 250 -pad duplicate -label class -cm -folds 3
Loading aisp properties from file:/c:/dev/aisp/aisp.properties
Loading sounds from [metadata.csv]
Sounds will be clipped every 250 msec into 250 msec clips (padding=DuplicatePad)
Warning: Nashorn engine is planned to be removed from a future JDK release
Training and evaluating classifier (LpDistanceMergeKNNClassifier)
Sounds : Total: 395 samples, 1 label(s), 00:01:38.750 hr:min:sec
Label: class, 395 samples, 3 value(s), 00:01:38.750 hr:min:sec
Value: ambient, 288 samples, 00:01:12.000 hr:min:sec
Value: click, 24 samples, 00:00:6.000 hr:min:sec
Value: voice, 83 samples, 00:00:20.750 hr:min:sec
Evaluating 3 of 3 folds .
Evaluation completed in 1091 msec. 363 msec/fold (computed in parallel)
Evaluated label name: class
COUNT MATRIX:
Predicted ->[ ambient ][  click  ][  voice  ]
ambient   ->[ * 285 * ][    2    ][    1    ]
click     ->[    8    ][ * 16  * ][    0    ]
voice     ->[   10    ][    3    ][ * 70  * ]
PERCENT MATRIX:
Predicted ->[  ambient  ][  click   ][  voice    ]
ambient   ->[ * 72.15 * ][   0.51   ][   0.25    ]
click     ->[   2.03    ][ * 4.05 * ][   0.00    ]
voice     ->[   2.53    ][   0.76   ][ * 17.72 * ]
Label          | Count |     F1 | Precision | Recall
ambient        |   288 | 96.447 |    94.059 | 98.958
click          |    24 | 71.111 |    76.190 | 66.667
voice          |    83 | 90.909 |    98.592 | 84.337
Micro-averaged |   395 | 93.922 |    93.922 | 93.922
Macro-averaged |   395 | 85.432 |    88.684 | 83.324
Precision: 93.922 +/- 0.6399% (micro), 88.684 +/- 3.4790% (macro)
Recall   : 93.922 +/- 0.6399% (micro), 83.324 +/- 6.9818% (macro)
F1       : 93.922 +/- 0.6399% (micro), 85.432 +/- 5.4500% (macro)
```
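To see how the summary rows are derived, the sketch below recomputes the per-label and averaged metrics from the pooled count matrix. Note that the tool computes its macro averages per fold and then averages across folds, so its macro numbers can differ slightly from those computed on the pooled matrix; this code is an illustration, not the tool's implementation.

```python
def summarize(matrix, labels):
    """Per-label precision/recall/F1 plus micro and macro averages.

    matrix[i][j] is the number of samples whose true label is labels[i]
    and whose predicted label is labels[j].
    """
    n = len(labels)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))
    per_label = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fp = sum(matrix[row][i] for row in range(n)) - tp   # column sum minus TP
        fn = sum(matrix[i]) - tp                            # row sum minus TP
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        per_label[label] = (p, r, f1)
    micro = correct / total   # identical for precision, recall and F1 here
    macro = tuple(sum(m[k] for m in per_label.values()) / n for k in range(3))
    return per_label, micro, macro

counts = [[285, 2, 1],    # ambient (from the COUNT MATRIX above)
          [8, 16, 0],     # click
          [10, 3, 70]]    # voice
per_label, micro, macro = summarize(counts, ["ambient", "click", "voice"])
print(per_label["ambient"])   # approx (0.9406, 0.9896, 0.9645)
print(micro)                  # approx 0.9392
```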
The above shows both the confusion matrices and the per-label and averaged precision, recall, and F1 metrics. At this point, you can choose to use the model as is or tune it using your JavaScript model definition (see [Models](Models) for how to define your own model).