<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="generator" content="pandoc">
<meta name="author" content="Jon Nordby [email protected]">
<meta name="dcterms.date" content="2021-03-25">
<title>Sound Event Detection with Machine Learning</title>
<meta name="apple-mobile-web-app-capable" content="yes">
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent">
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no, minimal-ui">
<link rel="stylesheet" href="https://unpkg.com/reveal.js@^4//dist/reset.css">
<link rel="stylesheet" href="https://unpkg.com/reveal.js@^4//dist/reveal.css">
<style>
code{white-space: pre-wrap;}
span.smallcaps{font-variant: small-caps;}
span.underline{text-decoration: underline;}
div.column{display: inline-block; vertical-align: top; width: 50%;}
div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;}
ul.task-list{list-style: none;}
pre > code.sourceCode { white-space: pre; position: relative; }
pre > code.sourceCode > span { display: inline-block; line-height: 1.25; }
pre > code.sourceCode > span:empty { height: 1.2em; }
.sourceCode { overflow: visible; }
code.sourceCode > span { color: inherit; text-decoration: inherit; }
div.sourceCode { margin: 1em 0; }
pre.sourceCode { margin: 0; }
@media screen {
div.sourceCode { overflow: auto; }
}
@media print {
pre > code.sourceCode { white-space: pre-wrap; }
pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; }
}
pre.numberSource code
{ counter-reset: source-line 0; }
pre.numberSource code > span
{ position: relative; left: -4em; counter-increment: source-line; }
pre.numberSource code > span > a:first-child::before
{ content: counter(source-line);
position: relative; left: -1em; text-align: right; vertical-align: baseline;
border: none; display: inline-block;
-webkit-touch-callout: none; -webkit-user-select: none;
-khtml-user-select: none; -moz-user-select: none;
-ms-user-select: none; user-select: none;
padding: 0 4px; width: 4em;
color: #aaaaaa;
}
pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa; padding-left: 4px; }
div.sourceCode
{ }
@media screen {
pre > code.sourceCode > span > a:first-child::before { text-decoration: underline; }
}
code span.al { color: #ff0000; font-weight: bold; } /* Alert */
code span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */
code span.at { color: #7d9029; } /* Attribute */
code span.bn { color: #40a070; } /* BaseN */
code span.bu { } /* BuiltIn */
code span.cf { color: #007020; font-weight: bold; } /* ControlFlow */
code span.ch { color: #4070a0; } /* Char */
code span.cn { color: #880000; } /* Constant */
code span.co { color: #60a0b0; font-style: italic; } /* Comment */
code span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */
code span.do { color: #ba2121; font-style: italic; } /* Documentation */
code span.dt { color: #902000; } /* DataType */
code span.dv { color: #40a070; } /* DecVal */
code span.er { color: #ff0000; font-weight: bold; } /* Error */
code span.ex { } /* Extension */
code span.fl { color: #40a070; } /* Float */
code span.fu { color: #06287e; } /* Function */
code span.im { } /* Import */
code span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */
code span.kw { color: #007020; font-weight: bold; } /* Keyword */
code span.op { color: #666666; } /* Operator */
code span.ot { color: #007020; } /* Other */
code span.pp { color: #bc7a00; } /* Preprocessor */
code span.sc { color: #4070a0; } /* SpecialChar */
code span.ss { color: #bb6688; } /* SpecialString */
code span.st { color: #4070a0; } /* String */
code span.va { color: #19177c; } /* Variable */
code span.vs { color: #4070a0; } /* VerbatimString */
code span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */
</style>
<link rel="stylesheet" href="https://unpkg.com/reveal.js@^4//dist/theme/white.css" id="theme">
<link rel="stylesheet" href="style.css"/>
</head>
<body>
<div class="reveal">
<div class="slides">
<section class="slide level2">
<section class="titleslide level1" data-background-image="./img/soundsensing-withlogo.jpg" style="background: rgba(255, 255, 255, 0.3); padding-top: 1.7em;">
<h1 style>
Sound Event Detection with Machine Learning
</h1>
<p>
Jon Nordby</br> Head of Data Science & Machine Learning</br> Soundsensing AS</br> [email protected]</br> </br> EuroPython 2021</br>
</p>
</section>
<aside class="notes">
<p>Hi and good morning everyone</p>
<p>Jon Nordby Head of Machine Learning and Data Science at Soundsensing</p>
<p>Today we will be talking about Sound Event Detection using Machine Learning</p>
</aside>
</section>
<section>
<section id="introduction" class="title-slide slide level1">
<h1>Introduction</h1>
<aside class="notes">
</aside>
</section>
<section id="about-soundsensing" class="slide level2">
<h2>About Soundsensing</h2>
<p><img data-src="./img/soundsensing-solution.svg.png" style="width:100.0%" /></p>
<aside class="notes">
<p>Soundsensing is a company that focuses on audio and machine learning.</p>
<p>We provide easy-to-use IoT sensors that can continuously measure sound, and use Machine Learning to extract interesting information.</p>
<p>The information is presented in our online dashboard, and is also available through an API for integration with other systems.</p>
<p>Our products are used for Noise Monitoring and Condition Monitoring of equipment.</p>
</aside>
</section>
<section id="sound-event-detection" class="slide level2">
<h2>Sound Event Detection</h2>
<p><img data-src="./img/audio-classification-tagging-detection.png" style="width:100.0%" /></p>
<blockquote>
<p>Given input audio </br> return the timestamps (start, end) </br> for each event class</p>
</blockquote>
<aside class="notes">
<p>One of many common tasks in Audio Machine Learning</p>
<p>Other examples of tasks are Audio Classification, and Audio Tagging</p>
<p>In Classification there is only a single class label as output. No timing information In Tagging one allows multiple classes. But also no timing information. Event Detection gives a series of time-stamps as output</p>
<p>Also known as: Acoustic Event Detection or Audio Event Detection (AED)</p>
<p>Audio Classification with Machine Learning (Jon Nordby, EuroPython 2019) https://www.youtube.com/watch?v=uCGROOUO_wY</p>
</aside>
</section>
<section id="events-and-non-events" class="slide level2">
<h2>Events and non-events</h2>
<p>Events are sounds with a clearly-defined duration or onset.</p>
<table>
<thead>
<tr class="header">
<th>Event (time limited)</th>
<th>Class (continuous)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Car passing</td>
<td>Car traffic</td>
</tr>
<tr class="even">
<td>Honk</td>
<td>Car traffic</td>
</tr>
<tr class="odd">
<td>Word</td>
<td>Speech</td>
</tr>
<tr class="even">
<td>Gunshot</td>
<td>Shooting</td>
</tr>
</tbody>
</table>
<aside class="notes">
<p>What are events?</p>
<p>Start-end. Onset/offset Or at least a clear start</p>
<p>Isolated claps (event) versus clapping (ongoing, class)</p>
<p>If events are overlapping a lot, might not make sense as events anymore</p>
<p>For events one can count the number of occurrences. Classification might instead count the number of seconds.</p>
</aside>
</section></section>
<section>
<section id="application" class="title-slide slide level1">
<h1>Application</h1>
<p>Fermentation tracking when making alcoholic beverages. Beer, Cider, Wine, etc.</p>
<aside class="notes">
<p>Tried to pick a somewhat fun task as an example.</p>
</aside>
</section>
<section id="alcohol-is-produced-via-fermentation" class="slide level2">
<h2>Alcohol is produced via fermentation</h2>
<p><img data-src="./img/beer-brewing-crop.jpg" style="width:30.0%" /></p>
<aside class="notes">
<p>When brewing alcoholic beverages such as beer, cider or wine, one puts together a compound of yeast, a source of sugars, and water (the wort) into a vessel.</p>
<p>The vessel is put in a location with an appropriate temperature, and after some time the fermentation process will start.</p>
<p>During fermentation the yeast will eat the sugar, which will produce alcohol, and as a byproduct also CO2 gas</p>
<p>There are many things that can go wrong: it can fail to start, be way too intense (foaming, blowout), or stop abruptly.</p>
<p>So as a brewer, one has to monitor the process.</p>
<p>At the top of the vessel you see an airlock. This is a device that will let the CO2 gas out, while not allowing oxygen, bugs or other contaminants in.</p>
</aside>
</section>
<section id="airlock-activity" class="slide level2">
<h2>Airlock activity</h2>
<video data-autoplay src="videos/!-j7md-wkL1U0.mp4" controls style="height: 800px">
</video>
<!--
<iframe src="https://www.youtube.com/watch?v=j7md-wkL1U0" controls width="1500" height="1000"/></iframe>
-->
<aside class="notes">
<p>In this video clip the fermentation process has started, with medium activity</p>
<p>CO2 is being pushed through the airlock, and escapes out at the top</p>
<p>As you can hear this makes a characteristic sound, a “plop” for each bubble of gas that escapes</p>
<p>This example has a very nice and clear sound. It is not always so nice.</p>
<p>This is something we can track using Machine Learning. We can have a microphone that picks up the sound, pass it through some software and use a machine learning model to detect each individual “plop”</p>
<p>Example of an event. Clear time-defined sound that we want to count.</p>
<p>If you count the plop or bubbling activity then you can estimate how much fermentation is going on. Can also be used to estimate alcohol content, though it is not very precise for that. It can tell at least whether fermentation has started or not, and roughly how the brew is progressing.</p>
</aside>
</section>
<section id="fermentation-tracking" class="slide level2">
<h2>Fermentation tracking</h2>
<p>Fermentation activity can be tracked as Bubbles Per Minute (BPM).</p>
<p><img data-src="./img/fermentation-rate-over-time.png" style="width:80.0%" /></p>
<aside class="notes">
<p>Typical curves look like this. Starts out with nothing, then ramps up. And as the yeast eats up the sugars, fermentation will gradually go down.</p>
<p>Many variations in the curves possible depending on your brew, some examples shown here.</p>
<p>Affected by temperature, external and in the brew. And of the changes over time in sugar and yeast concentrations.</p>
</aside>
</section>
<section id="our-goal" class="slide level2">
<h2>Our goal</h2>
<p>Make a system that can track fermentation activity, </br>outputting Bubbles per Minute (BPM), </br>by capturing airlock sound using a microphone, </br>using Machine Learning to count each “plop”</p>
<aside class="notes">
<p>Of course there are existing devices dedicated to this task, such as a Plaato Airlock. But for fun and learning we will do this using sound. This is a Sound Event Detection problem.</p>
</aside>
</section></section>
<section>
<section id="machine-learning-needs-data" class="title-slide slide level1">
<h1>Machine Learning needs Data!</h1>
<aside class="notes">
<p>When one says “Machine Learning” many people think mainly about ML algorithms and code. But just as important, or in many cases more important, is the <strong>data</strong>.</p>
<p>Without appropriate data for your task, you will not get a good ML model, or ML powered system!</p>
</aside>
</section>
<section id="supervised-machine-learning" class="slide level2">
<h2>Supervised Machine Learning</h2>
<p><img data-src="./img/aed-supervised-learning.png" style="width:80.0%" /></p>
<aside class="notes">
<p>The technique we are going to use is Supervised learning, which is the most common for learning a classifier or detector like this.</p>
<p>Supervised learning is based on labeled examples, of Input (Audio) AND Expected output (bubble yes/no)</p>
<p>In sound event detection there are multiple ways of labeling your data. We will work here with strongly labeled data (shown at the top), where the start and end of each event instance is marked. Very detailed. Takes a considerable amount of time to make, but easy for a system to learn from.</p>
<p>So the labeled data will go into the training system, and output a Sound Event Detector. This detector can be run on new audio, and will output detected events, in our case the plops.</p>
<p>TODO: mark only the relevant case in image</p>
</aside>
<!--
As an alternative you can have weakly labeled data.
This is when you have a longer audio clip, maybe 10 seconds or more,
and you have noted which kinds of events are present in the clip,
but not marked where they are, or how many events there are.
This is less time-consuming to make.
But means that there is much less information available for the machine learning algorithm,
so this is a more challenging task.
One can also use unlabeled data for learning.
But even then you will usually need some labels, to evaluate performance.
-->
</section>
<section id="data-requirements-quantity" class="slide level2">
<h2>Data requirements: Quantity</h2>
<p>Need <em>enough</em> data.</p>
<table>
<thead>
<tr class="header">
<th>Instances per class</th>
<th>Suitability</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>100</td>
<td>Minimal</td>
</tr>
<tr class="even">
<td>1000</td>
<td>Good</td>
</tr>
<tr class="odd">
<td>10000+</td>
<td>Very good</td>
</tr>
</tbody>
</table>
<aside class="notes">
<p>What are the requirements for the data?</p>
<p>One requirement is that we have enough data. This varies a lot, depending on complexity of the problem. But here are some rough guidelines.</p>
<p>100 events. Couple of minutes. When you split out a test set from this, might only have 30 instances there. Can be used as a start. But will be hard to work with, because you will have a lot of variation in statistics.</p>
<p>1000 events. Approx 1 hour. Can have a couple of hundred events in the test sets. Reasonable for low to medium complexity tasks.</p>
<p>10000 events. Tens of hours. Best case. Then one has robust statistics.</p>
</aside>
</section>
<section id="data-requirements-quality" class="slide level2">
<h2>Data requirements: Quality</h2>
<p>Need <em>realistic</em> data. Capturing natural variation in</p>
<ul>
<li>the event sound</li>
<li>recording devices used</li>
<li>recording environment</li>
</ul>
<aside class="notes">
<p>But the other important thing is to have <em>realistic</em> data</p>
<p>Variations in event sound: different airlock designs and vessels cause different sounds. Different brews, and phases of the fermentation process, cause differences. The recording devices also vary, which changes the captured sound.</p>
<p>We also have different environments. We need to separate the events of interest from the background noise. There might be other people and activities in the room, or sounds coming in from other rooms or outside. Such variation needs to be represented in our dataset, so that we know that our model will handle it well - and not confuse other sounds for “plops”.</p>
</aside>
</section>
<section id="check-the-data" class="slide level2">
<h2>Check the data</h2>
<video data-autoplay src="videos/eda-audacity.mkv" controls style="height: 800px">
</video>
<aside class="notes">
<p>Data collected via Youtube</p>
<p>!! Only show 2-3 examples</p>
<p>Group 2: much lower in frequency.</p>
<p>Group 3: even more noise, a machine in the background. Starting to be hard to hear.</p>
<p>4: again a different sound, a car in the background.</p>
<p>5: two plops at the same time. Events that overlap can be very challenging, especially if very similar; can be practically impossible.</p>
<p>6: first a plop, then a sound that in the spectrogram looks quite similar but is actually something different; can be very easily confused.</p>
</aside>
</section>
<section id="understand-the-data" class="slide level2">
<h2>Understand the data</h2>
<p>Note down characteristics of the sound</p>
<ul>
<li>Event length</li>
<li>Distance between events</li>
<li>Variation in the event sound</li>
<li>Changes over time</li>
<li>Differences between recordings</li>
<li>Background noises</li>
<li>Other events that could be easily confused</li>
</ul>
<aside class="notes">
<p>Always inspect and explore the data!</p>
<p>Listen to audio, look at spectrogram.</p>
<p>Length: around 200 milliseconds. Distance: varies based on activity. Variations: another type of airlock design (3-part) makes much less sound. Changes over time: very high. Recording differences.</p>
<p>TODO, make into a table for this case</p>
</aside>
</section>
<section id="labeling-data-manually-using-audacity" class="slide level2">
<h2>Labeling data manually using Audacity</h2>
<p><img data-src="./img/audacity.png" style="width:80.0%" /></p>
<div class="sourceCode" id="cb1"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> pandas</span>
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a>labels <span class="op">=</span> pandas.read_csv(path, sep<span class="op">=</span><span class="st">'</span><span class="ch">\t</span><span class="st">'</span>, header<span class="op">=</span><span class="va">None</span>,</span>
<span id="cb1-4"><a href="#cb1-4" aria-hidden="true" tabindex="-1"></a> names<span class="op">=</span>[<span class="st">'start'</span>, <span class="st">'end'</span>, <span class="st">'annotation'</span>],</span>
<span id="cb1-5"><a href="#cb1-5" aria-hidden="true" tabindex="-1"></a> dtype<span class="op">=</span><span class="bu">dict</span>(start<span class="op">=</span><span class="bu">float</span>,end<span class="op">=</span><span class="bu">float</span>,annotation<span class="op">=</span><span class="bu">str</span>))</span></code></pre></div>
<aside class="notes">
<p>Audacity is an open source audio editor. Supports “label tracks”.</p>
<p>Select an area in time. Hit Ctrl+B to add a label: T for true (event of interest), N for no (other events). Can also mark other sounds, events/activities that are ongoing. Can be useful for error analysis.</p>
<p>Labels can be exported as a text file, and read easily with Pandas, as shown in this example code.</p>
</aside>
<!--
"How to Label Audio for Deep Learning in 4 Simple Steps"
Miguel Pinto, TowardsDataScience.com
https://towardsdatascience.com/how-to-label-audio-for-deep-learning-in-4-simple-steps-6a2c33b343e6
Shows how to use Audacity to label.
Including switching to spectrograms,
annotating a frequency range,
exporting the labels to files,
and importing the label files in Python.
-->
</section></section>
<section>
<section id="machine-learning-system" class="title-slide slide level1">
<h1>Machine Learning system</h1>
<aside class="notes">
<p>Now that we have data, labeled and checked we can go over to the model part</p>
</aside>
</section>
<section id="audio-ml-pipeline-overview" class="slide level2">
<h2>Audio ML pipeline overview</h2>
<p><img data-src="./img/detection-pipeline.svg.png" style="width:60.0%" /></p>
<aside class="notes">
<p>Split the audio into fixed-length windows.</p>
<p>Compute some features. For example a spectrogram.</p>
<p>Each spectrogram window will go into a classifier.</p>
<p>Outputs a probability between 0.0 and 1.0.</p>
<p>Event tracker converts the probability into a discrete list of event starts/stops.</p>
<p>Count these over time to estimate the Bubbles per Minute.</p>
</aside>
<!--
Single audio stream. Monophonic.
Single event class. Binary classification
Uniform probability of event occuring.
Not considering sequences, or states, in the detector
Ie in speech recognition certain sequences of phonemes are more probable
Requires that each event is clearly audible and understandable - without context
Low-to-no overlap between events.
-->
</section>
<section id="spectrogram" class="slide level2">
<h2>Spectrogram</h2>
<p><img data-src="./img/frog_spectrogram.png" style="width:50.0%" /></p>
<div class="sourceCode" id="cb2"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> librosa</span>
<span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a>audio, sr <span class="op">=</span> librosa.load(path)</span>
<span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a>spec <span class="op">=</span> librosa.feature.melspectrogram(y<span class="op">=</span>audio, sr<span class="op">=</span>sr)</span>
<span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a>spec_db <span class="op">=</span> librosa.power_to_db(spec, ref<span class="op">=</span>np.<span class="bu">max</span>)</span>
<span id="cb2-6"><a href="#cb2-6" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-7"><a href="#cb2-7" aria-hidden="true" tabindex="-1"></a>lr.display.specshow(ps_db, x_axis<span class="op">=</span><span class="st">'time'</span>, y_axis<span class="op">=</span><span class="st">'mel'</span>)</span></code></pre></div>
<aside class="notes">
<p>Also available in PyTorch Audio, TensorFlow, etc.</p>
</aside>
</section>
<section id="cnn-classifier-model" class="slide level2">
<h2>CNN classifier model</h2>
<p><img data-src="./img/cnn.jpg" style="width:60.0%" /></p>
<div class="sourceCode" id="cb3"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> tensorflow <span class="im">import</span> keras</span>
<span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> keras.layers <span class="im">import</span> Convolution2D, MaxPooling2D</span>
<span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb3-4"><a href="#cb3-4" aria-hidden="true" tabindex="-1"></a>model <span class="op">=</span> keras.Sequential([</span>
<span id="cb3-5"><a href="#cb3-5" aria-hidden="true" tabindex="-1"></a> Convolution2D(filters, kernel,</span>
<span id="cb3-6"><a href="#cb3-6" aria-hidden="true" tabindex="-1"></a> input_shape<span class="op">=</span>(bands, frames, channels)),</span>
<span id="cb3-7"><a href="#cb3-7" aria-hidden="true" tabindex="-1"></a> MaxPooling2D(pool_size<span class="op">=</span>pool),</span>
<span id="cb3-8"><a href="#cb3-8" aria-hidden="true" tabindex="-1"></a>....</span>
<span id="cb3-9"><a href="#cb3-9" aria-hidden="true" tabindex="-1"></a>])</span></code></pre></div>
<aside class="notes">
<p>If you are unfamiliar with deep learning, can also try a simple Logistic Regression on MFCC, with scikit-learn. Might do OK for many tasks!</p>
<p>Once the pipeline is set up, a large variety of models can work well.</p>
</aside>
<!--
-->
<!--
Trick: Normalization. Window-based. Median or max.
Trick: Include delta features
-->
</section>
<section id="evaluation" class="slide level2">
<h2>Evaluation</h2>
<p><img data-src="./img/evaluation-curves.png" style="width:80.0%" /></p>
<aside class="notes">
<p>Multiple levels</p>
<p>Window-wise - False Positive Rate / False Negative Rate - Precision / recall</p>
<p>Might be overly strict. Due to overlap, can afford to miss a couple of windows</p>
<p>Should be able to miss a couple of events without losing track of the BPM</p>
</aside>
<!--
- Event-wise evaluation
using sed-eval
TODO: include evaluation of BPM. Per
- Blops per Minute
Errors within +- 10%?
LATER: include slide on dataset splitting.
Grouped split in scikit-learn
:::
<!--
Results
Detection performance
LATER: results on windows
precision/recall or TPR/FPR curve
LATER: Results on BPM
-->
</section>
<section id="event-tracker" class="slide level2">
<h2>Event Tracker</h2>
<p>Converting to discrete list of events</p>
<ul>
<li>Threshold the probability from classifier</li>
<li>Keep track of whether we are currently in an event or not</li>
</ul>
<div class="sourceCode" id="cb4"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a> <span class="cf">if</span> <span class="kw">not</span> inside_event <span class="kw">and</span> probability <span class="op">>=</span> on_threshold:</span>
<span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a> inside_event <span class="op">=</span> <span class="va">True</span></span>
<span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a> <span class="bu">print</span>(<span class="st">'EVENT on'</span>, t, probability)</span>
<span id="cb4-4"><a href="#cb4-4" aria-hidden="true" tabindex="-1"></a> <span class="cf">if</span> inside_event <span class="kw">and</span> probability <span class="op"><=</span> off_threshold:</span>
<span id="cb4-5"><a href="#cb4-5" aria-hidden="true" tabindex="-1"></a> inside_event <span class="op">=</span> <span class="va">False</span></span>
<span id="cb4-6"><a href="#cb4-6" aria-hidden="true" tabindex="-1"></a> <span class="bu">print</span>(<span class="st">'EVENT off'</span>, t, probability)</span></code></pre></div>
<aside class="notes">
<p>Using separate on/off threshold avoids noise/oscillation due to minor changes around the threshold value. Called hysteresis</p>
</aside>
</section>
<section id="statistics-estimator" class="slide level2">
<h2>Statistics Estimator</h2>
<p>To compute the Bubbles Per Minute</p>
<p><img data-src="./img/histogram.png" style="width:30.0%" /></p>
<ul>
<li>Using the typical time-between-events</li>
<li>Assumes regularity</li>
<li>Median more robust against outliers</li>
</ul>
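<p>A minimal sketch of this median-based estimate (the function name and example timestamps are illustrative):</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">import numpy as np

def estimate_bpm(event_times):
    # Median time between events is robust against missed or extra detections
    deltas = np.diff(sorted(event_times))
    median_interval = np.median(deltas)
    return 60.0 / median_interval

# Example: plops roughly every 2 seconds, with one missed detection around t=8
print(estimate_bpm([0.0, 2.1, 4.0, 6.2, 10.1, 12.0]))</code></pre></div>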
<aside class="notes">
<p>Could just count events over 1 minute and report as-is. However our model will make some mistakes. Missed events, additional events.</p>
<p>Since we have a very periodic and slowly changing process, can instead use the distance between events.</p>
<p>Can have outliers. If missing event, or false triggering. Take the median value and report as the BPM.</p>
</aside>
<!---
TODO: update with real picture
Median filtering.
Reject time-difference values outside of IQR.
Maybe give a range.
Confidence Interval of the mean
Student-T extimation
-->
</section>
<section id="tracking-over-time-using-brewfather" class="slide level2">
<h2>Tracking over time using Brewfather</h2>
<p><img data-src="./img/brewfather-fermenting-crop.png" style="width:50.0%" /></p>
<div class="sourceCode" id="cb5"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="co"># API documentation: https://docs.brewfather.app/integrations/custom-stream</span></span>
<span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> requests</span>
<span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-4"><a href="#cb5-4" aria-hidden="true" tabindex="-1"></a>url <span class="op">=</span> <span class="st">'http://log.brewfather.net/stream?id=9MmXXXXXXXXX'</span></span>
<span id="cb5-5"><a href="#cb5-5" aria-hidden="true" tabindex="-1"></a>data <span class="op">=</span> <span class="bu">dict</span>(name<span class="op">=</span><span class="st">'brewaed-0001'</span>, bpm<span class="op">=</span>CALCULATED<span class="op">-</span>BPM)</span>
<span id="cb5-6"><a href="#cb5-6" aria-hidden="true" tabindex="-1"></a>r <span class="op">=</span> requests.post(url, json<span class="op">=</span>data)</span></code></pre></div>
<aside class="notes">
<p>LATER. Edit picture to make less tall</p>
</aside>
</section></section>
<section>
<section id="outro" class="title-slide slide level1">
<h1>Outro</h1>
</section>
<section id="more-resources" class="slide level2">
<h2>More resources</h2>
<p></br> Github project: <a href="https://github.com/jonnor/brewing-audio-event-detection">jonnor/brewing-audio-event-detection</a></p>
<p></br> General Audio ML: <a href="https://github.com/jonnor/machinehearing">jonnor/machinehearing</a></p>
<ul>
<li><a href="https://arxiv.org/abs/2107.05463">Sound Event Detection: A tutorial</a>. Virtanen et al.</li>
<li><a href="https://www.youtube.com/watch?v=uCGROOUO_">Audio Classification with Machine Learning</a> (EuroPython 2019)</li>
<li><a href="https://www.youtube.com/watch?v=ks5kq1R0aws">Environmental Noise Classification on Microcontrollers</a> (TinyML 2021)</li>
</ul>
<p></br> Slack: <a href="https://valeriovelardo.com/the-sound-of-ai-community/">Sound of AI community</a></p>
</section>
<section id="what-do-you-want-make" class="slide level2">
<h2>What do you want to make?</h2>
<p>Now that you know the basics of Audio Event Detection with Machine Learning in Python, some ideas:</p>
<ul>
<li>Popcorn popping</li>
<li>Bird call</li>
<li>Cough</li>
<li>Umm/aaa speech patterns</li>
<li>Drum hits</li>
<li>Car passing</li>
</ul>
<aside class="notes">
<p>Not-events. Alarm goes off. Likely to persist (for a while)</p>
</aside>
</section>
<section id="continious-monitoring-using-audio-ml" class="slide level2">
<h2>Continuous Monitoring using Audio ML</h2>
<p>Want to deploy Continuous Monitoring with Audio?</br> Consider using the Soundsensing sensors and data-platform.</p>
<p><img data-src="./img/soundsensing-solution.svg.png" style="width:80.0%" /></p>
<p></br> <em>Get in Touch! [email protected]</em></p>
<aside class="notes">
<ul>
<li>Built-in cellular connectivity.</li>
<li>Rugged design for industrial and outdoor use cases.</li>
<li>Can run Machine Learning both on-device or in-cloud</li>
<li>Supports Sound Event Detection, Audio Classification, Acoustic Anomaly Detection</li>
</ul>
</aside>
</section>
<section id="join-soundsensing" class="slide level2">
<h2>Join Soundsensing</h2>
<p></br>Want to work on Audio Machine Learning in Python?</br> We have many opportunities.</p>
<ul>
<li>Full-time positions</li>
<li>Part-time / freelance work</li>
<li>Engineering thesis</li>
<li>Internships</li>
<li>Research or industry partnerships</li>
</ul>
<p></br> <em>Get in Touch! [email protected]</em></p>
<aside class="notes">
</aside>
</section>
<section id="section" class="slide level2" data-background="./img/soundsensing-withlogo.jpg" style="background: rgba(255, 255, 255, 0.3);">
<h2 data-background="./img/soundsensing-withlogo.jpg" style="background: rgba(255, 255, 255, 0.3);"></h2>
<h1>
Questions ?
</h1>
<p></br> <em> Sound Event Detection with Machine Learning</br> EuroPython 2021 </em></p>
</br>
<p>
Jon Nordby </br>[email protected] </br>Head of Data Science & Machine Learning
</p>
</section></section>
<section>
<section id="bonus" class="title-slide slide level1">
<h1>Bonus</h1>
<p>Bonus slides after this point</p>
<!-- TODO: Maybe include some of this in main talk -->
</section>
<section id="semi-automatic-labelling" class="slide level2">
<h2>Semi-automatic labelling</h2>
<p>Using a Gaussian Mixture, Hidden Markov Model (GMM-HMM)</p>
<p><img data-src="./img/labeling-gmm-hmm.png" style="width:80.0%" /></p>
<div class="sourceCode" id="cb6"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> hmmlearn.hmm, librosa, sklearn.preprocessing</span>
<span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a>features <span class="op">=</span> librosa.feature.mfcc(audio, n_mfcc<span class="op">=</span><span class="dv">13</span>, ...)</span>
<span id="cb6-4"><a href="#cb6-4" aria-hidden="true" tabindex="-1"></a>model <span class="op">=</span> hmmlearn.hmm.GMMHMM(n_components<span class="op">=</span><span class="dv">2</span>, ...)</span>
<span id="cb6-5"><a href="#cb6-5" aria-hidden="true" tabindex="-1"></a>X <span class="op">=</span> sklearn.preprocessing.StandardScaler().fit_transform(data)</span>
<span id="cb6-6"><a href="#cb6-6" aria-hidden="true" tabindex="-1"></a>model.fit(X)</span>
<span id="cb6-7"><a href="#cb6-7" aria-hidden="true" tabindex="-1"></a>probabilities <span class="op">=</span> model.score_samples(X)[<span class="dv">1</span>][:,<span class="dv">1</span>]</span></code></pre></div>
<aside class="notes">
<p>Unsupervised learning. Does not need any labels. Compute statistics, try to cluster into 2 groups: event and background. Can work quite well when the events are quite clear.</p>
<p>Workflow: first run it to generate label files, then review and edit the labels in Audacity.</p>
<p>From hmmlearn: https://github.com/hmmlearn/hmmlearn. Uses Mel-Frequency Cepstral Coefficients (MFCC) as features, a lossy compression on top of a mel-spectrogram.</p>
</aside>
</section>
<section id="synthesize-data" class="slide level2">
<h2>Synthesize data</h2>
<p>How to get more data</br>without gathering “in the wild”?</p>
<ul>
<li>Mix in different kinds of background noise.</li>
<li>Vary the Signal to Noise Ratio, etc.</li>
<li>Useful to estimate performance on tricky, not-yet-seen data</li>
<li>Can be used to compensate for small amount of training data</li>
<li><em>scaper</em> Python library: <a href="https://github.com/justinsalamon/scaper">github.com/justinsalamon/scaper</a></li>
</ul>
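<p>The core idea can be sketched with plain numpy: mix a clean event recording with background noise at a chosen Signal to Noise Ratio. The signals and values below are synthetic illustrations; for real datasets with annotations, scaper automates this:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">import numpy as np

def mix_at_snr(event, noise, snr_db):
    # Scale the noise so the event-to-noise power ratio matches the target SNR (dB)
    noise = noise[:len(event)]
    event_power = np.mean(event ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = event_power / (10 ** (snr_db / 10))
    return event + noise * np.sqrt(target_noise_power / noise_power)

# Synthetic example: a 1 second tone mixed with white noise at 6 dB SNR
sr = 16000
event = 0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
noise = np.random.normal(0, 0.1, sr)
mixture = mix_at_snr(event, noise, snr_db=6)</code></pre></div>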
<aside class="notes">
<p>Challenge in Acoustic Event Detection in uncontrolled environment.</p>
<p>Handling the large variety of different background noises that could occur.</p>
</aside>
</section>
<section id="streaming-inference" class="slide level2">
<h2>Streaming inference</h2>
<p>Key: Chopping up incoming stream into (overlapping) audio windows</p>
<div class="sourceCode" id="cb7"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> sounddevice, queue</span>
<span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Setup audio stream from microphone</span></span>
<span id="cb7-4"><a href="#cb7-4" aria-hidden="true" tabindex="-1"></a>audio_queue <span class="op">=</span> queue.Queue()</span>
<span id="cb7-5"><a href="#cb7-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb7-6"><a href="#cb7-6" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> audio_callback(indata, frames, time, status):</span>
<span id="cb7-7"><a href="#cb7-7" aria-hidden="true" tabindex="-1"></a> audio_queue.put(indata.copy())</span>
<span id="cb7-8"><a href="#cb7-8" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb7-9"><a href="#cb7-9" aria-hidden="true" tabindex="-1"></a>stream <span class="op">=</span> sounddevice.InputStream(callback<span class="op">=</span>audio_callback, ...)</span>
<span id="cb7-10"><a href="#cb7-10" aria-hidden="true" tabindex="-1"></a>...</span>
<span id="cb7-11"><a href="#cb7-11" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb7-12"><a href="#cb7-12" aria-hidden="true" tabindex="-1"></a><span class="co"># In classification loop</span></span>
<span id="cb7-13"><a href="#cb7-13" aria-hidden="true" tabindex="-1"></a> data <span class="op">=</span> audio_queue.get()</span>
<span id="cb7-14"><a href="#cb7-14" aria-hidden="true" tabindex="-1"></a> <span class="co"># shift old audio over, add new data</span></span>
<span id="cb7-15"><a href="#cb7-15" aria-hidden="true" tabindex="-1"></a> audio_buffer <span class="op">=</span> numpy.roll(audio_buffer, <span class="bu">len</span>(data), axis<span class="op">=</span><span class="dv">0</span>)</span>
<span id="cb7-16"><a href="#cb7-16" aria-hidden="true" tabindex="-1"></a> audio_buffer[<span class="bu">len</span>(audio_buffer)<span class="op">-</span><span class="bu">len</span>(data):<span class="bu">len</span>(audio_buffer)] <span class="op">=</span> data</span>
<span id="cb7-17"><a href="#cb7-17" aria-hidden="true" tabindex="-1"></a> new_samples <span class="op">+=</span> <span class="bu">len</span>(data)</span>
<span id="cb7-18"><a href="#cb7-18" aria-hidden="true" tabindex="-1"></a> <span class="co"># check if we have received enough new data to do new prediction</span></span>
<span id="cb7-19"><a href="#cb7-19" aria-hidden="true" tabindex="-1"></a> <span class="cf">if</span> new_samples <span class="op">>=</span> hop_length:</span>
<span id="cb7-20"><a href="#cb7-20" aria-hidden="true" tabindex="-1"></a> p <span class="op">=</span> model.predict(audio_buffer)</span>
<span id="cb7-21"><a href="#cb7-21" aria-hidden="true" tabindex="-1"></a> <span class="cf">if</span> p <span class="op"><</span> threshold:</span>
<span id="cb7-22"><a href="#cb7-22" aria-hidden="true" tabindex="-1"></a> <span class="bu">print</span>(<span class="ss">f'EVENT DETECTED time=</span><span class="sc">{</span>datetime<span class="sc">.</span>datetime<span class="sc">.</span>now()<span class="sc">}</span><span class="ss">'</span>)</span></code></pre></div>
<aside class="notes">
<p>The brewer does not really care about each and every plop. BPM changes slowly and (normally) quite evenly, and does not have to be reported often. Brewfather limits updates to once per 15 minutes.</p>
<p>Detection time: the delay between the sound event happening and the detection being performed and reported. Depends on how quickly someone needs to see/use the result. Some applications may need a short detection time.</p>
<p>But real-time streaming detection can be useful to verify detection when setting up. And makes for nicer demo :)</p>
<p>LATER: video demo? can just be console output, while input audio is playing</p>
</aside>
</section>
<section id="event-detection-with-weakly-labeled-data" class="slide level2">
<h2>Event Detection with Weakly Labeled data</h2>
<p>Can one learn Sound Event Detection </br>without annotating the times for each event? </br> </br>Yes!</p>
<ul>
<li>Referred to as <em>weakly labeled</em> Sound Event Detection</li>
<li>Can be tackled with <em>Multiple Instance Learning</em></li>
<li>Inputs: Audio clips consisting of 0-N events</li>
<li>Labels: True if any events in clip, else false</li>
<li>Multiple analysis windows per 1 label</li>
<li>Using temporal pooling in Neural Network</li>
</ul>
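<p>A minimal Keras sketch of this idea, assuming per-window spectrogram inputs; the layer sizes and shapes are placeholders, not the architecture used in this project:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">from tensorflow import keras
from tensorflow.keras import layers

bands, frames, channels = 64, 32, 1  # per-window spectrogram shape (placeholder values)

# Window-level classifier, applied with shared weights to every window in a clip
window_model = keras.Sequential([
    layers.Conv2D(16, (3, 3), activation='relu',
                  input_shape=(bands, frames, channels)),
    layers.GlobalAveragePooling2D(),
    layers.Dense(1, activation='sigmoid'),
])

# Clip-level model: variable number of windows, temporal max pooling over window predictions
clip_input = keras.Input(shape=(None, bands, frames, channels))
per_window = layers.TimeDistributed(window_model)(clip_input)  # (batch, n_windows, 1)
clip_probability = layers.GlobalMaxPooling1D()(per_window)     # positive if any window is
model = keras.Model(clip_input, clip_probability)
model.compile(optimizer='adam', loss='binary_crossentropy')</code></pre></div>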
<aside class="notes">
<p>TODO, maybe expand on this, show example code</p>
<p>Active area of research, e.g. in the DCASE challenges. Speech recognition systems can give phone-level output with sentence-level annotations.</p>
<p>Multiple Instance Learning: the principal model architecture with neural networks. Each (overlapping) analysis window in a clip goes through the same neural network. Outputs are pooled across time to make a prediction of event present-or-not. Common pooling operations: max or softmax. More advanced: attention pooling, or Autopool (a softmax generalization).</p>
</aside>
</section>
<section id="data-collection-via-youtube" class="slide level2">
<h2>Data collection via Youtube</h2>
<p>Criteria for inclusion:</p>
<ul>
<li>Preferably a couple of minutes long, minimum 15 seconds</li>
<li>No talking to the camera</li>
<li>Mostly stationary camera</li>
<li>No audio editing/effects</li>
<li>One or more airlocks bubbling</li>
<li>Bubbling can be heard by ear</li>
</ul>
<p>Approx 1000 videos reviewed, 100 usable</p>
<aside class="notes">
<p>Making note of</p>
<ul>
<li>Bubbling rate</li>
<li>Clarity of bubble sound</li>
<li>Other noise around</li>
</ul>
<p>Maybe 1000 videos reviewed. Ended up with around 100 potentially useful. Many hours of work.</p>
<p>Up to 100 recording devices and 100 environments. Maybe 2000 events. Some recordings are very long, several hours. Maybe 5000 events.</p>
<p>Using youtube-dl to download: <code>youtube-dl --extract-audio $URL</code></p>
<p>https://youtube-dl.org/ https://github.com/ytdl-org/youtube-dl/</p>
</aside>
</section>
<section id="characteristics-of-audio-events" class="slide level2">
<h2>Characteristics of Audio Events</h2>
<ul>
<li>Duration</li>
<li>Tonal/atonal</li>
<li>Temporal patterns</li>
<li>Percussive</li>
<li>Frequency content</li>
<li>Temporal envelope</li>
<li>Foreground vs background</li>
<li>Signal to Noise Ratio</li>
</ul>
<aside class="notes">
<p>Some events are short Gunshot Bark</p>
<p>Some are bit longer Cat mjau</p>
<p>Some events are percussive / atonal. Cough, etc</p>
<p>Some have temporal patterns Some are more tonal Alarms</p>
<p>Transitions. Into state. Out of state.</p>
</aside>
</section>
<section id="analysis-windows" class="slide level2">
<h2>Analysis windows</h2>
<p><img data-src="./img/overlapped-windows.png" style="width:50.0%" /></p>
<p>Window length a bit longer than the event length.</p>
<p>Overlapping gives classifier multiple chances at seeing each event.</p>
<p>Reducing overlap increases resolution! Overlap for AED: 10%</p>
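<p>A minimal sketch of the windowing (the function name and default values are illustrative):</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">import numpy as np

def split_windows(audio, sr, window_seconds=0.3, hop_seconds=0.03):
    # Fixed-length windows; a smaller hop gives more overlap and finer time resolution
    window_length = int(window_seconds * sr)
    hop_length = max(1, int(hop_seconds * sr))
    starts = range(0, len(audio) - window_length + 1, hop_length)
    return np.stack([audio[s:s + window_length] for s in starts])

# Example: 5 seconds of (silent) audio at 16 kHz
windows = split_windows(np.zeros(5 * 16000), sr=16000)
print(windows.shape)</code></pre></div>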
<aside class="notes">
</aside>
</section></section>
</div>
</div>
<script src="https://unpkg.com/reveal.js@^4//dist/reveal.js"></script>
<!-- reveal.js plugins -->
<script src="https://unpkg.com/reveal.js@^4//plugin/notes/notes.js"></script>
<script src="https://unpkg.com/reveal.js@^4//plugin/search/search.js"></script>
<script src="https://unpkg.com/reveal.js@^4//plugin/zoom/zoom.js"></script>
<script>
// Full list of configuration options available at:
// https://revealjs.com/config/
Reveal.initialize({
// Add the current slide number to the URL hash so that reloading the
// page/copying the URL will return you to the same slide
hash: true,
// The "normal" size of the presentation, aspect ratio will be preserved
// when the presentation is scaled to fit different resolutions. Can be
// specified using percentage units.
width: 1920,
height: 1080,
// Factor of the display size that should remain empty around the content
margin: 0,
// reveal.js plugins
plugins: [
RevealNotes,
RevealSearch,
RevealZoom
]
});
</script>
</body>
</html>