---
author: Jon Nordby [email protected]
date: March 25, 2021
css: style.css
width: 1920
height: 1080
margin: 0
pagetitle: 'Sound Event Detection with Machine Learning'
---
Jon Nordby
Head of Data Science & Machine Learning
Soundsensing AS
[email protected]
EuroPython 2021
::: notes
Hi and good morning everyone
Jon Nordby Head of Machine Learning and Data Science at Soundsensing
Today we will be talking about Sound Event Detection using Machine Learning
:::
::: notes
:::
::: notes
Soundsensing is a company that focuses on audio and machine learning.
We provide easy-to-use IoT sensors that can continuously measure sound, and use Machine Learning to extract interesting information.
The information is presented in our online dashboard, and is also available in an API for integrating with other systems.
Our products are used for Noise Monitoring and Condition Monitoring of equipment.
:::
Given input audio
return the timestamps (start, end)
for each event class
::: notes
One of many common tasks in Audio Machine Learning
Other examples of tasks are Audio Classification, and Audio Tagging
- Classification: only a single class label as output. No timing information.
- Tagging: allows multiple classes, but also no timing information.
- Event Detection: gives a series of timestamps as output.
Also known as: Acoustic Event Detection or Audio Event Detection (AED)
Audio Classification with Machine Learning (Jon Nordby, EuroPython 2019) https://www.youtube.com/watch?v=uCGROOUO_wY
:::
Events are sounds with a clearly-defined duration or onset.
Event (time limited) | Class (continuous) |
---|---|
Car passing | Car traffic |
Honk | Car traffic |
Word | Speech |
Gunshot | Shooting |
::: notes
What are events?
Start-end. Onset/offset. Or at least a clear start.
Isolated claps (event) versus clapping (ongoing, class)
If events are overlapping a lot, might not make sense as events anymore
For events one can count the number of occurrences. Classification might instead count the number of seconds.
:::
Fermentation tracking when making alcoholic beverages. Beer, Cider, Wine, etc.
::: notes
Tried to pick a fun task as an example
:::
::: notes
When brewing alcoholic beverages such as beer, cider or wine, one puts a mixture of yeast, a source of sugars and water (the wort) into a vessel.
The vessel is put in a location with an appropriate temperature, and after some time the fermentation process will start.
During fermentation the yeast will eat the sugar, which will produce alcohol, and as a byproduct also CO2 gas
There are many things that can go wrong.
- can fail to start
- way too intense: foaming, blowout
- abrupt stop
So as a brewer, one has to monitor the process.
At the top of the vessel you see an airlock. This is a device that will let the CO2 gas out, while not allowing oxygen, bugs or other contaminants in.
:::
::: notes
In this video clip the fermentation process has started, with medium activity
CO2 is being pushed through the airlock, and escapes out at the top
As you can hear this makes a characteristic sound, a "plop" for each bubble of gas that escapes
This example has a very nice and clear sound. It is not always so nice.
This is something we can track using Machine Learning. We can have a microphone that picks up the sound, pass it through some software and use a machine learning model to detect each individual "plop"
Example of an event. Clear time-defined sound that we want to count.
If you count the plop or bubbling activity then you can estimate how much fermentation is going on. Can also be used to estimate alcohol content, though it is not very precise for that. It can tell at least whether fermentation has started or not, and roughly how the brew is progressing.
:::
Fermentation activity can be tracked as Bubbles Per Minute (BPM).
::: notes
Typical curves look like this. Starts out with nothing, then ramps up. And as the yeast eats up the sugars, fermentation will gradually go down.
Many variations in the curves possible depending on your brew, some examples shown here.
Affected by temperature, external and in the brew. And by the changes over time in sugar and yeast concentrations.
:::
Make a system that can track fermentation activity,
outputting Bubbles per Minute (BPM),
by capturing airlock sound using a microphone,
using Machine Learning to count each "plop"
::: notes
Of course there are existing devices dedicated to this task, such as the Plaato Airlock. But for fun and learning we will do this using sound. This is a Sound Event Detection problem.
:::
::: notes
When one says "Machine Learning", many people think mainly about ML algorithms and code. But just as important, or in many cases more important, is the data.
Without appropriate data for your task, you will not get a good ML model, or ML powered system!
:::
::: notes
The technique we are going to use is Supervised learning, which is the most common for learning a classifier or detector like this.
Supervised learning is based on labeled examples, of Input (Audio) AND Expected output (bubble yes/no)
In sound event detection there are multiple ways of labeling your data. We will work here with strongly labeled data (shown at the top), where the start and end of each event instance is marked. Very detailed. Takes a considerable amount of time to make, but easy for a system to learn from.
So the labeled data will go into the training system, which outputs a Sound Event Detector. This detector can be run on new audio, and will output detected events, in our case the plops.
TODO: mark only the relevant case in image
:::
Need enough data.
Instances per class | Suitability |
---|---|
100 | Minimal |
1000 | Good |
10000+ | Very good |
::: notes
What are the requirements for the data?
One requirement is that we have enough data. This varies a lot, depending on complexity of the problem. But here are some rough guidelines.
100 events. Couple of minutes. When you split out a test set from this, might only have 30 instances there. Can be used as a start. But will be hard to work with, because you will have a lot of variation in statistics.
1000 events. Approx 1 hour. Can have a couple of hundred events in the test sets. Reasonable for low to medium complexity tasks.
10000 events. Tens of hours. Best case. Then one has robust statistics.
:::
Need realistic data. Capturing natural variation in
- the event sound
- recording devices used
- recording environment
::: notes
But the other important thing is to have realistic data
Variations in event sound. Different airlock designs and vessels cause different sounds. Different brews, and phases of the fermentation process, cause differences.
The recording devices also vary, which changes the captured sound.
Also have different environments. Need to separate the events of interest from the background noise. There might be other people and activities in the room, or sounds coming in from other rooms or outside.
Such variation needs to be represented in our dataset, so that we know that our model will handle it well, and not confuse other sounds for "plops".
:::
::: notes
Data collected via Youtube
!! Only show 2-3 examples
Group 2: much lower in frequency.
Group 3: even more noise. Machine in the background. Starting to be hard to hear.
Group 4: again a different sound. Car in the background.
Group 5: two plops at the same time. Events that overlap can be very challenging. Especially if very similar, can be practically impossible.
Group 6: first a plop, then a sound that looks quite similar in the spectrogram but is actually something different. Can very easily be confused.
:::
Note down characteristics of the sound
- Event length
- Distance between events
- Variation in the event sound
- Changes over time
- Differences between recordings
- Background noises
- Other events that could be easily confused
::: notes
Always inspect and explore the data!
Listen to audio, look at spectrogram.
- Length. Around 200 milliseconds
- Distance. Varies based on activity
- Variations. Another type of airlock design, 3-part. Makes much less sound
- Time changes. Very high
- Rec differences.
TODO, make into a table for this case
:::
import pandas
# Audacity label tracks export as tab-separated text: start, end, annotation
labels = pandas.read_csv(path, sep='\t', header=None,
    names=['start', 'end', 'annotation'],
    dtype=dict(start=float, end=float, annotation=str))
::: notes
Audacity is an open source audio editor. It supports "label tracks".
Select an area in time and hit Ctrl+B to add a label.
- T for true. Event of interest
- N for no. Other events
Can also mark other sounds, events/activities that are ongoing. Can be useful for error analysis.
The labels can be exported as a text file, which can be read easily with Pandas, as shown in this example code.
:::
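A minimal sketch of how strong labels like these could be turned into per-window training targets. The function name `window_targets`, the window and hop lengths, and the rule "a window is positive if it overlaps an event by at least `min_overlap` seconds" are assumptions for illustration, not taken from the project code.

```python
def window_targets(events, total_duration, window_length=0.5,
                   hop_length=0.25, min_overlap=0.1):
    """Return a list of (window_start, target) pairs, target being 0 or 1.

    events: list of (start, end) tuples in seconds for the event of interest,
    for example the rows annotated 'T' in an Audacity label file.
    """
    n_windows = int((total_duration - window_length) / hop_length) + 1
    targets = []
    for i in range(n_windows):
        w_start = i * hop_length
        w_end = w_start + window_length
        # Largest overlap (in seconds) between this window and any event
        overlap = 0.0
        for e_start, e_end in events:
            overlap = max(overlap, min(w_end, e_end) - max(w_start, e_start))
        targets.append((w_start, int(overlap >= min_overlap)))
    return targets


# Two hypothetical 200 ms plops in a 4 second recording
events = [(1.0, 1.2), (2.6, 2.8)]
targets = window_targets(events, total_duration=4.0)
positive_starts = [start for start, target in targets if target]
print(positive_starts)
```

With overlapping windows each event ends up in more than one positive window, which is why window-level counts are not the same as event counts.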
::: notes
Now that we have data, labeled and checked we can go over to the model part
:::
::: notes
Split the audio into fixed-length windows.
Compute some features. For example a spectrogram.
Each spectrogram window will go into a classifier.
Outputs a probability between 0.0 and 1.0.
Event tracker converts the probability into a discrete list of event starts/stops.
Count these over time to estimate the Bubbles per Minute.
:::
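The first step of the pipeline, splitting into fixed-length overlapping windows, could be sketched like this. The helper name `split_windows` and the window/hop sizes are illustrative assumptions, and a random array stands in for a real mel-spectrogram.

```python
import numpy as np


def split_windows(spec, window_frames, hop_frames):
    """Split a (bands, frames) spectrogram into overlapping windows.

    Returns an array of shape (n_windows, bands, window_frames).
    Assumes the spectrogram is at least one window long.
    Frames at the end that do not fill a whole window are dropped.
    """
    windows = []
    last_start = spec.shape[1] - window_frames
    for start in range(0, last_start + 1, hop_frames):
        windows.append(spec[:, start:start + window_frames])
    return np.stack(windows)


# Stand-in for a mel-spectrogram with 32 bands and 100 time frames
spec = np.random.default_rng(0).random((32, 100))
windows = split_windows(spec, window_frames=20, hop_frames=5)
print(windows.shape)
```

Each window is then the input for one classifier prediction. A hop shorter than the window gives overlapping windows.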
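import numpy as np
import librosa
import librosa.display
# Load audio and compute a mel-spectrogram on a decibel scale
audio, sr = librosa.load(path)
spec = librosa.feature.melspectrogram(y=audio, sr=sr)
spec_db = librosa.power_to_db(spec, ref=np.max)
librosa.display.specshow(spec_db, x_axis='time', y_axis='mel')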
::: notes
Also available in PyTorch Audio, TensorFlow, etc.
:::
from tensorflow import keras
from tensorflow.keras.layers import Convolution2D, MaxPooling2D
model = keras.Sequential([
    Convolution2D(filters, kernel,
        input_shape=(bands, frames, channels)),
    MaxPooling2D(pool_size=pool),
    ....
])
::: notes
If you are unfamiliar with deep learning, can also try a simple Logistic Regression on MFCC, with scikit-learn. Might do OK for many tasks!
Once the pipeline is set up, a large number of different kinds of models can work well.
:::
::: notes
Multiple levels
Window-wise
- False Positive Rate / False Negative Rate
- Precision / recall
Might be overly strict. Due to overlap, can afford to miss a couple of windows
Should be able to miss a couple of events without losing track of the BPM
:::
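A minimal sketch of window-wise evaluation, assuming binary targets and thresholded predictions per window. The helper name `precision_recall` and the example values are made up for illustration.

```python
def precision_recall(targets, predictions):
    """Window-wise precision and recall for binary targets/predictions."""
    pairs = list(zip(targets, predictions))
    true_positives = sum(1 for t, p in pairs if t and p)
    false_positives = sum(1 for t, p in pairs if not t and p)
    false_negatives = sum(1 for t, p in pairs if t and not p)
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    # Report 0.0 when there is nothing to compute the ratio over
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Hypothetical labels and model outputs for 8 analysis windows
targets     = [0, 1, 1, 0, 0, 1, 0, 0]
predictions = [0, 1, 0, 0, 1, 1, 0, 0]
precision, recall = precision_recall(targets, predictions)
print(precision, recall)
```

Here one window is missed and one is falsely triggered, so both precision and recall are 2/3. The same counting can be repeated on the event level, after windows have been merged into events.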
Converting to discrete list of events
- Threshold the probability from classifier
- Keep track of whether we are currently in an event or not
if not inside_event and probability >= on_threshold:
    inside_event = True
    print('EVENT on', t, probability)
if inside_event and probability <= off_threshold:
    inside_event = False
    print('EVENT off', t, probability)
::: notes
Using separate on/off thresholds avoids noise/oscillation due to minor changes around the threshold value. This is called hysteresis.
:::
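The on/off thresholding can be wrapped into a function that returns a list of events instead of printing. This is a sketch: the name `track_events`, the threshold values and the example probabilities are assumptions.

```python
def track_events(times, probabilities, on_threshold=0.6, off_threshold=0.4):
    """Convert per-window probabilities into a list of (start, end) events.

    Uses separate on/off thresholds (hysteresis).
    An event that is still ongoing at the end of the input is not reported.
    """
    events = []
    inside_event = False
    start = None
    for t, probability in zip(times, probabilities):
        if not inside_event and probability >= on_threshold:
            inside_event = True
            start = t
        elif inside_event and probability <= off_threshold:
            inside_event = False
            events.append((start, t))
    return events


times = [i / 10 for i in range(10)]  # one prediction every 100 ms
probabilities = [0.1, 0.7, 0.5, 0.65, 0.3, 0.2, 0.8, 0.9, 0.1, 0.0]
events = track_events(times, probabilities)
print(events)
```

The dip to 0.5 inside the first event does not end it, because it stays above the off threshold. With a single threshold at 0.6 the same input would be reported as three events instead of two.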
To compute the Bubbles Per Minute
- Using the typical time-between-events
- Assumes regularity
- Median more robust against outliers
::: notes
Could just count events over 1 minute and report as-is. However our model will make some mistakes. Missed events, additional events.
Since we have a very periodic and slowly changing process, can instead use the distance between events.
Can have outliers. If missing event, or false triggering. Take the median value and report as the BPM.
:::
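A sketch of the median-based estimate, assuming a list of detected event start times in seconds. The function name `bubbles_per_minute` and the example times are illustrative.

```python
import statistics


def bubbles_per_minute(event_times):
    """Estimate BPM from the median time between consecutive events.

    event_times: sorted event start times in seconds.
    Returns None when there are fewer than two events.
    """
    if len(event_times) < 2:
        return None
    intervals = [b - a for a, b in zip(event_times, event_times[1:])]
    return 60.0 / statistics.median(intervals)


# Bubbles roughly every 2 seconds, with one missed event (around 6.0)
# and one false trigger (at 10.3)
event_times = [0.0, 2.0, 4.0, 8.0, 10.0, 10.3, 12.0]
bpm = bubbles_per_minute(event_times)
print(bpm)
```

The missed event and the false trigger create outlier intervals (4.0 s and 0.3 s), but the median interval is still 2.0 s, which gives 30 BPM.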
# API documentation: https://docs.brewfather.app/integrations/custom-stream
import requests
url = 'http://log.brewfather.net/stream?id=9MmXXXXXXXXX'
data = dict(name='brewaed-0001', bpm=calculated_bpm)  # calculated_bpm: the BPM estimate computed earlier
r = requests.post(url, json=data)
::: notes
LATER. Edit picture to make less tall
:::
Github project: [jonnor/brewing-audio-event-detection](https://github.com/jonnor/brewing-audio-event-detection)
General Audio ML: [jonnor/machinehearing](https://github.com/jonnor/machinehearing)
- Sound Event Detection: A tutorial. Virtanen et al.
- Audio Classification with Machine Learning (EuroPython 2019)
- Environmental Noise Classification on Microcontrollers (TinyML 2021)
Slack: [Sound of AI community](https://valeriovelardo.com/the-sound-of-ai-community/)
Now that you know the basics of Audio Event Detection with Machine Learning in Python, here are some other events you could try to detect:
- Popcorn popping
- Bird call
- Cough
- Umm/aaa speech patterns
- Drum hits
- Car passing
::: notes
Not-events. Alarm goes off. Likely to persist (for a while)
:::
Want to deploy Continuous Monitoring with Audio?
Consider using the Soundsensing sensors and data-platform.
Get in Touch! [email protected]
::: notes
- Built-in cellular connectivity.
- Rugged design for industrial and outdoor use cases.
- Can run Machine Learning both on-device and in-cloud
- Supports Sound Event Detection, Audio Classification, Acoustic Anomaly Detection
:::
Want to work on Audio Machine Learning in Python?
We have many opportunities.
- Full-time positions
- Part-time / freelance work
- Engineering thesis
- Internships
- Research or industry partnerships
Get in Touch! [email protected]
::: notes
:::
Sound Event Detection with Machine Learning
EuroPython 2021
Jon Nordby
[email protected]
Head of Data Science & Machine Learning
Bonus slides after this point
Using a Gaussian Mixture, Hidden Markov Model (GMM-HMM)
import hmmlearn.hmm, librosa, sklearn.preprocessing
features = librosa.feature.mfcc(y=audio, n_mfcc=13, ...)
model = hmmlearn.hmm.GMMHMM(n_components=2, ...)
# scikit-learn and hmmlearn expect (frames, features), so transpose the MFCCs
X = sklearn.preprocessing.StandardScaler().fit_transform(features.T)
model.fit(X)
probabilities = model.score_samples(X)[1][:,1]
::: notes
Unsupervised learning. Does not need any labels. Compute statistics, try to cluster into 2 groups: event and background. Can work quite well when the events are quite clear.
Workflow: first run it to generate label files, then review and edit the labels in Audacity.
GMM-HMM from hmmlearn https://github.com/hmmlearn/hmmlearn
Using Mel-Frequency Cepstral Coefficients (MFCC) as features. A lossy compression on top of a mel-spectrogram.
:::
How to get more data
without gathering "in the wild"?
- Mix in different kinds of background noise.
- Vary Signal to Noise ratio etc
- Useful to estimate performance on tricky, not-yet-seen data
- Can be used to compensate for small amount of training data
- scaper Python library: github.com/justinsalamon/scaper
::: notes
A challenge in Acoustic Event Detection in uncontrolled environments:
handling the large number of different background noises that could occur.
:::
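Mixing in background noise at a chosen Signal to Noise Ratio can be sketched with plain NumPy. This is a hand-rolled illustration, not the scaper API. The function name `mix_at_snr` and the synthetic signals are assumptions, and both inputs are assumed to be mono arrays of the same length.

```python
import numpy as np


def mix_at_snr(signal, noise, snr_db):
    """Scale the noise so that signal power / noise power matches snr_db, then mix."""
    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    gain = np.sqrt(target_noise_power / noise_power)
    return signal + gain * noise


sr = 16000
rng = np.random.default_rng(0)
event = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # stand-in for an event recording
noise = rng.normal(size=sr)                           # stand-in for background noise
mixed = mix_at_snr(event, noise, snr_db=6.0)

# Check the SNR that was actually achieved
noise_part = mixed - event
achieved_snr = 10 * np.log10(np.mean(event ** 2) / np.mean(noise_part ** 2))
print(round(float(achieved_snr), 2))
```

Repeating this with several noise recordings and a range of SNR values gives many variations of each labeled event, while the labels stay the same.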
Key: Chopping up incoming stream into (overlapping) audio windows
import datetime, queue
import numpy, sounddevice
# Setup audio stream from microphone
audio_queue = queue.Queue()
def audio_callback(indata, frames, time, status):
    audio_queue.put(indata.copy())
stream = sounddevice.InputStream(callback=audio_callback, ...)
...
# In classification loop
data = audio_queue.get()
# shift old audio towards the start, add new data at the end
audio_buffer = numpy.roll(audio_buffer, -len(data), axis=0)
audio_buffer[len(audio_buffer)-len(data):len(audio_buffer)] = data
new_samples += len(data)
# check if we have received enough new data to do new prediction
if new_samples >= hop_length:
    new_samples = 0
    p = model.predict(audio_buffer)
    if p >= threshold:
        print(f'EVENT DETECTED time={datetime.datetime.now()}')
::: notes
Brewer does not really care about each and every plop. BPM changes slowly and (normally) quite evenly, and does not have to be reported often. Brewfather limits updates to once per 15 minutes.
Detection time: the delay between the sound event happening and the detection being performed and reported. Depends on how quickly someone needs to see/use the result. Some applications may require a short detection time.
But real-time streaming detection can be useful to verify detection when setting up. And makes for nicer demo :)
LATER: video demo? can just be console output, while input audio is playing
:::
Can one learn Sound Event Detection
without annotating the times for each event?
Yes!
- Referred to as weakly labeled Sound Event Detection
- Can be tackled with Multiple Instance Learning
- Inputs: Audio clips consisting of 0-N events
- Labels: True if any events in clip, else false
- Multiple analysis windows per 1 label
- Using temporal pooling in Neural Network
::: notes
TODO, maybe expand on this, show example code
Active area of research. DCASE.
Speech recognition systems can give phone-level output with sentence-level annotations.
Multiple Instance Learning. Principle model architecture with neural networks:
- Each (overlapped) analysis window in a clip goes through the same neural network.
- Outputs are pooled across time to make a prediction of event present-or-not.
- Common pooling operations: max, or softmax.
- More advanced: attention pooling, or Autopool (softmax generalization).
:::
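The temporal pooling step can be illustrated without a neural network, assuming we already have one event probability per analysis window in a clip. The function names and the fixed `alpha` are assumptions for illustration. In Autopool the `alpha` parameter is learned during training.

```python
import math


def max_pool(window_probabilities):
    """Clip-level prediction is the most confident window."""
    return max(window_probabilities)


def softmax_pool(window_probabilities, alpha=5.0):
    """Softmax-weighted average of the window probabilities.

    alpha=0 gives the plain mean, large alpha approaches max pooling.
    """
    weights = [math.exp(alpha * p) for p in window_probabilities]
    weighted = sum(w * p for w, p in zip(weights, window_probabilities))
    return weighted / sum(weights)


# Hypothetical per-window outputs for one clip: a single window contains the event
window_probabilities = [0.1, 0.2, 0.9, 0.1]
clip_max = max_pool(window_probabilities)
clip_soft = softmax_pool(window_probabilities)
clip_mean = softmax_pool(window_probabilities, alpha=0.0)
print(clip_max, clip_soft, clip_mean)
```

The pooled value is compared against the clip-level label during training, so the network only ever needs to know whether an event occurred somewhere in the clip.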
Criteria for inclusion:
- Preferably couple of minutes long, minimum 15 seconds
- No talking to the camera
- Mostly stationary camera
- No audio editing/effects
- One or more airlocks bubbling
- Bubbling can be heard by ear
Approx 1000 videos reviewed, 100 usable
::: notes
Making note of
- Bubbling rate
- Clarity of bubble sound
- Other noise around
Maybe 1000 videos reviewed. End up with around 100 potentially useful. Many hours of work.
Up to 100 recording devices and 100 environments. Maybe 2000 events. Some recordings are very long, several hours. Maybe 5000 events.
Using youtube-dl to download: youtube-dl --extract-audio $URL
https://youtube-dl.org/ https://github.com/ytdl-org/youtube-dl/
:::
- Duration
- Tonal/atonal
- Temporal patterns
- Percussive
- Frequency content
- Temporal envelope
- Foreground vs background
- Signal to Noise Ratio
::: notes
Some events are short: gunshot, bark.
Some are a bit longer: cat meow.
Some events are percussive / atonal: cough, etc.
Some have temporal patterns. Some are more tonal: alarms.
Transitions. Into state. Out of state.
:::
Window length bit longer than the event length.
Overlapping gives classifier multiple chances at seeing each event.
Reducing overlap increases resolution! Overlap for AES: 10%
::: notes
:::