This is an ambitious project that I decided to embark on as a learning opportunity before my main "passivesonar" effort. In this document, I will outline the goals and approaches to the task at hand, along with some of the mathematics behind them.
Bats are fascinating creatures. As different as they are in every possible way from cetaceans like orcas, dolphins and whales, these two major groups of species share one fundamental skill that is at the very core of their survival. And that skill is echolocation.
The evolutionary mechanisms of how bats and cetaceans developed this sophisticated ability are beyond the scope of this project, although one factor is obvious - both hunt in darkness. Bats usually come out at twilight and go on with their feeding for the rest of the evening, while large whales like the sperm whale dive down to over a thousand meters looking for the ultimate treat - the giant squid.
As a scuba diver, one thing I know for a fact is that light does not penetrate very deep into water. Most times it gets pretty dark by the time you hit 50 meters, and past a few hundred meters hardly any sunlight is left. The hunting grounds of sperm whales are way beyond that limit, so pitch darkness is why and where they need echolocation.
Again - why bats and not whales? For a few reasons, number one being accessibility and the opportunity to gather and analyze my own data. See, capturing whale songs is one hell of an endeavour, dependent on so many things that it is only possible for large-scale operations with specialized vessels. With bats, all I need is an ultrasonic microphone and a cave. Luckily, caves in South-East Asia are not uncommon, and ultrasonic mics can be purchased for a few hundred dollars apiece.
The goal of this project is to have an open-source platform that will offer several key features:
- Real-time monitoring of bat vocalizations
- Real-time classification of bat vocalizations
- Post-recording analysis
- Species identification
The code of the backend will be in Python. However, it is going to require a number of dependencies:
- Librosa: for analyzing the recordings
- NumPy: for mathematical operations
- Pandas: for working with datasets
- PyTorch: for the ML models
- Audacity: to prepare the learning dataset
- Docker: for what docker is good for
Ideally, this should be able to run on something as small as a Raspberry Pi 4 with 4 GB of RAM.
Now that the objectives and motivations are established, I am moving on to the science behind this. I partly do it for my own benefit, to have a reference, because preparing for this has already required hundreds of hours of reading books, researching, writing code, binge-watching Coursera, you name it. It's been fun, though, and I've absolutely loved it.
The wavelength
The amplitude, aka sound pressure
The frequency
The human hearing range is modest, from 20 Hz to 20 kHz, although most people will only hear sounds up to 16-17 kHz.
The sampling rate
Nyquist
. Think of sampling a soundwave as slicing pizza. With a high
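The Nyquist frequency is half the sampling rate, $f_N = f_s / 2$, which is the highest frequency a recording can represent.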
In practice, we should always allow for more space, so a realistic sampling rate should be:
For instance, if we are recording bat calls that have
The Bit Depth
Where
The Sine Wave is a fundamental idea of digital audio. Any real-life audio recording is the sum of many sine waves combined together. The formula for a basic sine wave is as follows:
$$y(t) = A \sin(2 \pi f t + \phi)$$
Where amplitude $A$, frequency $f$, $\pi \approx 3.14$, time vector $t$ and phase $\phi$.
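As a small illustration of the sine wave described above, here is a minimal sketch that builds the time vector and evaluates the standard form `A * sin(2*pi*f*t + phi)` with NumPy. The function name `sine_wave` and the example values (a 45 kHz tone, 5 ms long, sampled at 384 kHz) are assumptions chosen for illustration, not settings from this project.

```python
import numpy as np


def sine_wave(amplitude, frequency_hz, duration_s, sample_rate_hz, phase_rad=0.0):
    """Return (t, y) where y = A * sin(2*pi*f*t + phi)."""
    n_samples = int(round(duration_s * sample_rate_hz))
    t = np.arange(n_samples) / sample_rate_hz  # time vector in seconds
    y = amplitude * np.sin(2 * np.pi * frequency_hz * t + phase_rad)
    return t, y


# Assumed example values: a 45 kHz tone lasting 5 ms, sampled at 384 kHz.
t, y = sine_wave(amplitude=0.5, frequency_hz=45_000,
                 duration_s=0.005, sample_rate_hz=384_000)
print(len(y), float(np.max(np.abs(y))))
```

Adding several such arrays together (with different frequencies, amplitudes and phases) gives a more realistic composite signal.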
The Frame is simply a collection of samples. Frames allow processing samples in chunks, with common frame sizes ranging from 256 to 8192 samples.
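To make the idea of frames concrete, the sketch below cuts a signal into fixed-size chunks. The helper `split_into_frames` and the overlapping `hop_size` parameter are assumptions for illustration (overlap is a common choice, but nothing above prescribes it); trailing samples that do not fill a whole frame are simply dropped here.

```python
import numpy as np


def split_into_frames(samples, frame_size=1024, hop_size=512):
    """Return a 2-D array with one frame per row.

    Frames start every `hop_size` samples; leftover samples at the end
    that do not fill a complete frame are dropped.
    """
    samples = np.asarray(samples)
    if len(samples) < frame_size:
        return np.empty((0, frame_size), dtype=samples.dtype)
    n_frames = 1 + (len(samples) - frame_size) // hop_size
    return np.stack([
        samples[i * hop_size: i * hop_size + frame_size]
        for i in range(n_frames)
    ])


signal = np.arange(4096)  # stand-in for real audio samples
frames = split_into_frames(signal, frame_size=1024, hop_size=512)
print(frames.shape)
```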
Quantization is another fundamental concept closely related to Bit Depth. Quantization is a process of mapping analogue values to a limited range of discrete values. To visualise, imagine a sine wave traveling through space. If we want to accurately sample that wave and represent it digitally, we need to put that wave against a background, and draw horizontal lines from the bottom to the top. The more such lines or levels we have, the more accurately we can reconstruct a signal. The number of such levels is directly determined by bit depth $Q$: there are $2^Q$ of them. If we have $Q = 8$, that is $2^8 = 256$ levels.
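The relationship between bit depth and the number of levels can be sketched in a few lines. This is a deliberately simple uniform quantizer over the range [-1.0, 1.0], assumed for illustration; real audio formats use slightly different integer mappings, and the names `quantization_levels` and `quantize` are hypothetical.

```python
def quantization_levels(bit_depth):
    """Number of discrete levels available at a given bit depth: 2**Q."""
    return 2 ** bit_depth


def quantize(sample, bit_depth):
    """Snap a sample in [-1.0, 1.0] to the nearest of 2**Q evenly spaced levels."""
    levels = quantization_levels(bit_depth)
    step = 2.0 / (levels - 1)              # distance between neighbouring levels
    clipped = max(-1.0, min(1.0, sample))  # out-of-range input is clipped
    index = round((clipped + 1.0) / step)  # index of the nearest level
    return -1.0 + index * step


print(quantization_levels(8), quantization_levels(16))
print(quantize(0.3, 8), quantize(0.3, 16))
```

The rounding error of a single sample is at most half a step, which is why more bits give a more faithful reconstruction.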
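The Doppler Shift describes an effect where a continuous, constant-rate signal may appear faster or slower depending on the speed of movement of the listener and the source of sound. Essentially, as we get closer to the source of the sound, signals arrive faster: the emitted frequency does not change, but the frequency we observe does. This change of frequency can be formulated as follows:
$$f_{obs} = f_{src} \cdot \frac{c + v_{listener}}{c - v_{source}}$$
Where $f_{src}$ is the emitted frequency, $f_{obs}$ is the observed frequency, $c$ is the speed of sound, and $v_{listener}$ and $v_{source}$ are the speeds of the listener and the source, taken as positive when they move towards each other.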
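To get a feel for the size of the effect, here is a minimal sketch using the standard Doppler relation for sound, `f_obs = f_src * (c + v_listener) / (c - v_source)`. The speed of sound (343 m/s, roughly dry air at 20 °C), the function name `doppler_shift` and the example flight speed are assumptions for illustration.

```python
SPEED_OF_SOUND_AIR = 343.0  # m/s, assumed value for dry air at about 20 degrees C


def doppler_shift(source_freq_hz, listener_speed=0.0, source_speed=0.0,
                  c=SPEED_OF_SOUND_AIR):
    """Observed frequency in Hz.

    Speeds are in m/s and positive when listener and source move
    towards each other, negative when they move apart.
    """
    if source_speed >= c:
        raise ValueError("source speed must be below the speed of sound")
    return source_freq_hz * (c + listener_speed) / (c - source_speed)


# A 50 kHz call from a bat flying at 5 m/s towards a stationary microphone.
print(doppler_shift(50_000, source_speed=5.0))
# The same bat flying away from the microphone.
print(doppler_shift(50_000, source_speed=-5.0))
```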
| There is a variety of ways to represent a digital signal, namely oscillograms, power spectra and sonograms, aka spectrograms.
Oscillograms show time against amplitude, power spectra show frequency against amplitude, and sonograms/spectrograms display time against frequency with amplitude being represented by colour intensity.
FFT - Fast Fourier Transform. This is one of the more technical aspects of digital audio processing that is relatively math-heavy, but not impossible to understand. First, why do we need it?
FFT is an algorithm that allows us to move a signal from the time domain to the frequency domain. If you think about it, any audio signal can be illustrated on a cartesian plane, where the x-axis is time, and the y-axis is amplitude. That is a signal in the time domain. Although useful, it does not tell us what frequencies are contained in the signal, which is critical information if we are analyzing bat vocalizations.
That is where FFT comes into play. The Fourier transform essentially decomposes the original signal into its frequencies, which allows us to plot them on a cartesian plane, where the x-axis has the frequencies, and the y-axis shows their relative amplitude. This provides a clear picture of how much of a specific frequency is contained in the signal.
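As a quick illustration of moving from the time domain to the frequency domain, the sketch below builds a synthetic signal from two sine waves and uses NumPy's real-input FFT (`np.fft.rfft` with `np.fft.rfftfreq`) to find the strongest frequency. The helper name `peak_frequency`, the 384 kHz sampling rate and the two test tones are assumptions for illustration only.

```python
import numpy as np


def peak_frequency(samples, sample_rate_hz):
    """Return (peak_hz, freqs, magnitudes) from the magnitude spectrum."""
    magnitudes = np.abs(np.fft.rfft(samples))  # amplitude per frequency bin
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate_hz)  # bin centres in Hz
    return freqs[np.argmax(magnitudes)], freqs, magnitudes


sample_rate_hz = 384_000
t = np.arange(3840) / sample_rate_hz  # 10 ms of signal
# A strong 45 kHz component plus a weaker 90 kHz component.
signal = 1.0 * np.sin(2 * np.pi * 45_000 * t) + 0.3 * np.sin(2 * np.pi * 90_000 * t)

peak_hz, freqs, magnitudes = peak_frequency(signal, sample_rate_hz)
print(peak_hz)
```

Plotting `freqs` against `magnitudes` gives exactly the frequency-versus-amplitude picture described above.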
That is the basic idea. Keep reading to learn about the math behind the Fourier Transform or skip to the next chapter.
Much of what I've learned about these animals is from a fantastic book by Jon Russ, British Bat Calls. In this section I will be quoting him extensively. Although the title specifically refers to bats found in Britain, the fundamentals of bat communication are the same across the world.
This project originated during my time in South-East Asia, specifically in central Vietnam, in the city of Da Nang on the coast of the South China Sea. The species that live here are not present in Britain and vice versa; however, the basic rules still apply.
Bats echolocate by producing and projecting ultrasonic sounds from their mouths or noses and then detecting the echoes that return from any solid object within range. Bats produce these pulses in rapid succession in order to receive a regularly updated picture of their environment. Thus a single call provides the bat with a single snapshot of its environment whereas a series of calls provides a series of snapshots, in much the same way as a strobe light provides us with a series of staggered images.
A bat’s echolocation system is highly sophisticated. By emitting short high-frequency pulses of sound from their mouths or noses, bats are able to use the information contained within the echoes returned from a solid object to construct a ‘sound picture’ of their environment. Not only are they able to identify the size, position and speed of objects within three-dimensional space, they are also able to differentiate forms and surface textures. However, as there is no single signal form that is optimal for all purposes, bats have evolved a large number of signal types.
Frequency-modulated broadband signals (FM): a broadband signal refers to a signal that spans a wide range of frequencies. Example: an FM echolocation call from a bat sweeping from 20 kHz to 100 kHz is broadband. Useful in cluttered environments like forests, where bats need precise spatial resolution to avoid obstacles.
qCF (quasi-constant-frequency) signals: combine the benefits of narrow-band (CF) and broad-band (FM) signals. Bat species foraging primarily in a cluttered environment usually put more emphasis on the FM components of their calls, while those that forage primarily in an open environment tend to put more emphasis on the qCF components of their calls. CF calls provide long-range detection.
Social calls produced by bats are often more structurally complex than echolocation calls used for orientation. Social calls are used to communicate with other bats, and for many species they consist of a wide variety of trills and harmonics, comparable in many respects to bird song.
Some are used to defend patches of insects against other bats or to sustain territorial boundaries. Others function in attracting a mate or, in the case of distress calls, to initiate a mobbing response. Perhaps the most astounding are the isolation calls emitted by young bats, which allow their mothers to identify them.
Bats that use FM or FM/qCF calls determine the distance of prey by the time it takes for the echo to return, while the direction is determined from analysis of the time difference between the arrival of the echo at the right and the left ears. The echolocation system of horseshoe bats is not based on this time-differential orientation, because the echo would overlap with the transmitted sound as a result of the long duration of the calls. Horseshoe bat calls contain constant frequency components of very long duration and they have a filter in their ears tuned precisely to that particular frequency.
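The echo-delay ranging described above comes down to simple arithmetic: sound travels to the target and back, so distance is half of speed times delay. The sketch below also shows why long calls overlap with their own echoes. The speed of sound (343 m/s) and both function names are assumptions for illustration; the 50.5 ms duration is the mean greater horseshoe bat call duration quoted later in this document.

```python
SPEED_OF_SOUND_AIR = 343.0  # m/s, assumed value for air at about 20 degrees C


def echo_distance_m(echo_delay_s, c=SPEED_OF_SOUND_AIR):
    """Distance to a target, given the round-trip delay of its echo."""
    return c * echo_delay_s / 2.0


def min_range_without_overlap_m(call_duration_ms, c=SPEED_OF_SOUND_AIR):
    """Closest target whose echo returns only after the call has ended."""
    return echo_distance_m(call_duration_ms / 1000.0, c)


# An echo arriving 10 ms after the call was emitted.
print(echo_distance_m(0.010))
# A short 3.2 ms FM call versus a long 50.5 ms horseshoe bat call.
print(min_range_without_overlap_m(3.2), min_range_without_overlap_m(50.5))
```

With a 50.5 ms call, echoes from anything closer than roughly 8.7 m arrive while the bat is still calling, which matches the point about overlap made above.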
Bats also produce sound for communication and this may function in attracting a mate (advertisement calls, section 2.4.8.1), defending a feeding area (patch defence calls), calling for help (distress calls, section 2.4.8.2), or mother–infant communication.
CF or qCF components contain the highest concentration of energy in the call, known as the peak frequency or frequency of maximum energy (FmaxE).
The speed at which these calls are emitted is known as the pulse repetition rate and there are differences between species. For example, Natterer’s bat has an extremely fast pulse repetition rate. In comparison, the repetition rate of the echolocation calls of noctule is very slow. Some species may have a very regular ‘rhythm’ to this repetition rate (e.g. Daubenton’s bat) whereas others are erratic (e.g. soprano pipistrelle).
| This here sounds promising for our design: I could use this approach for the GUI, which can benefit from real-time monitoring, while analysis will be done in a separate process
Frequency division (FD) is normally the cheapest of the ‘broadband’ systems that simultaneously monitor the full range of frequencies contained within all bat calls. It uses a ‘zero-crossing’ circuit that produces a square wave output with the same frequency as the fundamental of the incoming signal. The number of square waves is counted and for every 10 waves a single square wave of the same total duration is outputted. This reduces frequencies by a factor of 10 (sometimes a different factor is used) and brings them within the audible range but has no effect on time, so calls are heard in real time.
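The zero-crossing scheme described above can be approximated in software. The following is a rough sketch, not a model of any particular detector: it counts positive-going zero crossings of the input and emits a square wave that completes one cycle for every `division_factor` input cycles. The function name `frequency_divide` and the test tone (40 kHz sampled at 400 kHz) are assumptions for illustration.

```python
import numpy as np


def frequency_divide(samples, division_factor=10):
    """Return a +1/-1 square wave with one cycle per `division_factor` input cycles."""
    samples = np.asarray(samples, dtype=float)
    # A positive-going zero crossing marks the start of a new input cycle.
    rising = (samples[:-1] < 0) & (samples[1:] >= 0)
    cycles_seen = np.concatenate(([0], np.cumsum(rising)))
    # High for the first half of every group of `division_factor` cycles, low for the rest.
    high = (cycles_seen % division_factor) < (division_factor / 2)
    return np.where(high, 1.0, -1.0)


sample_rate_hz = 400_000
t = np.arange(4000) / sample_rate_hz  # 10 ms
tone = np.sin(2 * np.pi * 40_000 * t + 0.5)  # 40 kHz input tone
divided = frequency_divide(tone, division_factor=10)

cycles_out = int(np.sum((divided[:-1] < 0) & (divided[1:] >= 0)))
print(cycles_out / 0.010)  # approximate output frequency in Hz
```

The output has the same length and duration as the input, so time is unaffected, while the frequency lands at roughly a tenth of the original, inside the audible range.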
| And this is what can be useful for analysis
Along with full spectrum sampling (see below), time expansion (TE) gives the most accurate reproduction of bat calls. TE stores the ultrasound signal digitally and replays it at a slower speed (usually 10 times, but sometimes slower) so it can be recorded to a standard audio device. The signal retains all the characteristics of the original signal, so we hear the entire call as it should sound except that it is 10 times lower in frequency and 10 times slower.
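In software, time expansion amounts to playing the same samples back at a lower sample rate. The sketch below only does the arithmetic, under that assumption; the function names and the example call (45 kHz, 5 ms, recorded at 384 kHz) are hypothetical.

```python
def playback_rate_hz(recording_rate_hz, factor=10):
    """Sample rate to play the unchanged samples at for a `factor`-times slowdown."""
    return recording_rate_hz / factor


def time_expanded(frequency_hz, duration_s, factor=10):
    """Frequency and duration of a call as heard after time expansion."""
    return frequency_hz / factor, duration_s * factor


# A 45 kHz call lasting 5 ms, recorded at 384 kHz.
print(playback_rate_hz(384_000))
print(time_expanded(45_000, 0.005))
```

Because the samples themselves are untouched, the original recording can always be recovered by restoring the original sample rate.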
| Another interesting note. We could address that by storing the sounds in a folder in the GUI, but keep real-time monitoring going. Or, preview using FD approach. Either way, the original signal should be preserved.
The disadvantage of TE is that during the period when the detector plays back the time-expanded sound, it is not capturing any new sounds.
- In general, a bat call will be longer than 2.5 ms and shorter than 70 ms.
- Measuring parameters from the series of three to five selected calls is good practice and ensures any variation between calls is taken into account; if there are alternating call types, parameters from both are measured.
- Peak frequency (also referred to as the frequency containing maximum energy (FmaxE)) is often the key parameter used to identify species, in conjunction with call shape.
- Start or maximum frequency can be very difficult to measure, depending on the level of background noise and the quality of the recordings
- The time parameters duration and inter-pulse interval are rarely diagnostic but often measured to help confirm likely species identification.
- Occasionally, calls may be recorded which appear to have repeated or ‘double’ calls. In most cases this appears on a sonogram as distorted sound. These echoes are due to the emitted sound from the bat being picked up directly by the detector but also bouncing off an object such as a wall and the resulting echo also being picked up by the detector.
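The rules of thumb above can be turned into a small sanity filter. This is a minimal sketch that assumes call boundaries have already been detected somehow and are given as `(start_ms, end_ms)` pairs; the function name `summarize_calls` is hypothetical, and the inter-pulse interval is taken here as start-to-start time, which is an assumption rather than a definition from the book.

```python
MIN_CALL_MS = 2.5   # "longer than 2.5 ms"
MAX_CALL_MS = 70.0  # "shorter than 70 ms"


def summarize_calls(pulses_ms):
    """Return (durations, intervals) in ms for plausible bat calls.

    pulses_ms: list of (start_ms, end_ms) tuples in chronological order.
    Pulses outside the 2.5-70 ms duration window are discarded first.
    """
    plausible = [(start, end) for start, end in pulses_ms
                 if MIN_CALL_MS < (end - start) < MAX_CALL_MS]
    durations = [end - start for start, end in plausible]
    # Assumed definition: interval = start of one call to start of the next.
    intervals = [nxt[0] - cur[0] for cur, nxt in zip(plausible, plausible[1:])]
    return durations, intervals


detected = [(0.0, 5.0), (100.0, 101.0), (200.0, 206.0), (300.0, 400.0)]
print(summarize_calls(detected))
```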
The author presents call parameters for a number of species (section 5.1). We are only interested in a few, as those are the species that inhabit Vietnam.
- Inter-pulse interval (ms): 90.2 (24.9–186.6)
- Call duration (ms): 50.5 (16.3–73.8)
- Frequency of maximum energy (peak) (kHz): 81.3 (77.8–83.8)
- Start frequency (kHz): 70.2 (62.2–78.5)
- End frequency (kHz): 67.3 (58.1–80.9)
- Forages in pasture, parkland, meadows and woodland, especially near water.
- Social calls of the greater horseshoe bat are generally not produced in flight.

- Inter-pulse interval (ms): 70.4 (14.1–113.7)
- Call duration (ms): 43.6 (11.9–61.4)
- Frequency of maximum energy (peak) (kHz): 111.1 (107.3–114.0)
- Start frequency (kHz): 99.0 (92.3–107.8)
- End frequency (kHz): 96.6 (83.4–110.3)
- Mainly deciduous woodland and wetlands, but also pasture, woodland edge and hedgerows. Also over water and in farmyards.
- Social calls of the lesser horseshoe bat are generally not produced in flight.

- Inter-pulse interval (ms): 75.5 (27.5–186.0)
- Call duration (ms): 3.2 (1.4–5.8)
- Frequency of maximum energy (peak) (kHz): 47.0 (41.8–56.5)
- Start frequency (kHz): 81.1 (50.3–109.7)
- End frequency (kHz): 29.4 (22.4–38.6)
- Generally forages higher above the water.
- Social calls, though rarely heard, sound like a loud extra call slipped into the echolocation call sequence.

- Inter-pulse interval (ms): 102.5 (59.9–211.0)
- Call duration (ms): 5.9 (3.2–8.6)
- Frequency of maximum energy (peak) (kHz): 46.6 (43.3–49.9)
- Start frequency (kHz): 68.8 (50.8–95.2)
- End frequency (kHz): 45.9 (41.2–50.6)

- Inter-pulse interval (ms): 113.0 (66.7–251.5)
- Call duration (ms): 4.2 (3.1–6.4)
- Frequency of maximum energy (peak) (kHz): 47.5 (39.2–68.5)
- Start frequency (kHz): 88.3 (69.9–101.8)
- End frequency (kHz): 32.4 (25.6–43.3)
- Woodland, parks, meadows, flowing water, and gardens.

- Inter-pulse interval (ms): 216.9 (120.3–413.1); 372.2 (120.2–807.2)
- Call duration (ms): 14.7 (8.8–23.4); 22.1 (13.2–29.9)
- Frequency of maximum energy (peak) (kHz): 24.5 (22.4–33.6); 19.3 (17.5–23.6)
- Start frequency (kHz): 37.9 (23.8–52.2); 23.2 (18.2–30.4)
- End frequency (kHz): 23.7 (21.4–32.2); 18.3 (17.1–23.0)
- Found in a wide range of open habitats. Common over deciduous woodland, parkland, pasture, marshland and rivers. Not very common in larger cities.
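Since peak frequency (FmaxE) is often the key identification parameter, a first rough pass at species identification can be a simple range lookup. The sketch below uses only the two parameter sets above whose notes name the species (greater and lesser horseshoe bat); the table layout and the function name `candidates_by_peak` are assumptions for illustration, and a real classifier would also need call shape and the other parameters.

```python
# Peak-frequency (FmaxE) ranges in kHz, taken from the parameter lists above.
PEAK_RANGES_KHZ = {
    "greater horseshoe bat": (77.8, 83.8),
    "lesser horseshoe bat": (107.3, 114.0),
}


def candidates_by_peak(peak_khz, ranges=PEAK_RANGES_KHZ):
    """Return the species whose FmaxE range contains the measured peak."""
    return [name for name, (low, high) in ranges.items()
            if low <= peak_khz <= high]


print(candidates_by_peak(81.3))
print(candidates_by_peak(111.1))
print(candidates_by_peak(50.0))
```

An empty result simply means the peak falls outside every range in the table, which is a cue to look at the remaining parameters or extend the table.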
- Timing: Bats are most active after sunset.
- Location: Look for entrances to caves or areas where bats emerge.
- Keep away from human noise or overlapping natural sounds (e.g., running water).