Interpretable Bioacoustic Classifiers

Modelling temporal shift-invariance in self-supervised generative models improves accuracy and interpretability of species detection in weakly labelled soundscape recordings

Kieran A. Gibb

Alice Eldridge | Aisha Lawal Shuaibu | Ivor J. A. Simpson

[Paper] | [Code]

Abstract

Realising the potential for acoustic monitoring to deliver biodiversity insight at scale requires new approaches to the automated analysis of PAM recordings that are trustworthy as well as cost-effective. Discriminative models trained on annotated species data are gaining popularity but are labour-intensive, notoriously opaque and biased. Self-supervised generative models such as Variational Autoencoders (VAEs) offer great potential for learning compact yet expressive representations of data, and can provide a strong prior for downstream discriminative tasks such as species detection while remaining intrinsically interpretable.

We propose and evaluate a novel modification to the VAE learning algorithm that models intra-frame shift-invariance. We demonstrate that this modification (SIVAE) provides representations that are more interpretable and consistent, and that yield good performance on few-shot learning tasks with very weak labels. Species-specific linear classifiers were trained to predict the presence or absence of over 150 avian and anuran species using SIVAE representations of 60 s audio samples. A simple attention mechanism identifies the most relevant timestep(s) in each sample, while L1 regularisation sparsifies the linear model weights to perform species-specific feature selection.

Whilst demonstrated in terrestrial recordings, the approach is transferable to marine, freshwater, and soil habitats. These innovations set the path for trustworthy, data and time-efficient tools to support solid ecological inference from large-scale passive acoustic monitoring surveys.

Disentangling Intra-frame Shift

When frames are segmented automatically (as is the norm), the absolute temporal position of events within a frame (e.g. the onset of a particular bird vocalisation) is arbitrary. However, during training the classic VAE algorithm finds the most parsimonious compressed latent representation that minimises the difference between input and generated spectrograms. The result is that spurious information (such as the distance from frame start to signal onset) carries as much weight as more ecologically meaningful information (such as the distance between peaks or nuances of spectral morphology that represent the acoustic traits of a given species).

Figure 1: (B) A bird call interpolated into a generic soundscape background and shifted in time \(\delta\) by linear interpolation in time (A). Taking the 1st derivative of the frame representation \(\mathbf{z}\) with respect to the shift across the latent space highlights its inconsistency for a classic VAE (C). The shift-invariant soundscape VAE is robust to such a perturbation (D).
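The sensitivity probe illustrated in Figure 1 (C–D) can be approximated numerically: encode circularly shifted copies of a frame and take finite differences of the latents with respect to the shift. A minimal sketch, where the `encode` callable is a hypothetical stand-in for the VAE encoder mean:

```python
import numpy as np

def shift_sensitivity(encode, frame, shifts, axis=-1):
    """Probe how an encoder's latent reacts to circular time-shifts.

    encode : maps a spectrogram frame -> latent vector z (stand-in for
             the VAE encoder mean).
    frame  : (mels, time) log mel spectrogram frame.
    shifts : increasing integer shifts (in time bins) to apply.

    Returns finite-difference norms ||dz/d(shift)||; a shift-invariant
    encoder yields values near zero.
    """
    zs = np.stack([encode(np.roll(frame, s, axis=axis)) for s in shifts])
    dz = np.diff(zs, axis=0) / np.diff(np.asarray(shifts))[:, None]
    return np.linalg.norm(dz, axis=1)
```

A per-mel-bin time average, for example, is exactly invariant to circular shifts and scores zero, whereas a flattened spectrogram does not.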

We explicitly disentangle absolute position within a frame by training a shift prediction network, which predicts the intra-frame shift \(\delta \in [-1, 1]\) of the frame contents with respect to a learned canonical timeline. Consistency is enforced by 2-way cross-decoding across paired instances, where each instance is a translation of an input frame. A periodic boundary condition is applied, and we regularise by learning a zero-centred prior distribution.

$$\mathcal{L}_{\text{align}} = -\log\mathcal{N}(\hat{\boldsymbol{\delta}}; \mathbf{0}, \sigma^2_{\mathcal{T}}\mathbf{I}) = \sum_{t=0}^T \frac{1}{2}\bigg(\log(\sigma^2_{\mathcal{T}}) + \frac{\hat{\delta}_t^2}{\sigma^2_{\mathcal{T}}} \bigg)$$
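Up to an additive constant, this is a zero-centred Gaussian negative log-likelihood on the predicted shifts. A minimal sketch, parameterising the learned scale as a log-variance (an assumption, for numerical stability):

```python
import numpy as np

def alignment_loss(delta_hat, log_var):
    """Zero-centred Gaussian NLL on predicted intra-frame shifts,
    matching the L_align sum above (constant term dropped).

    delta_hat : (T,) predicted shifts in [-1, 1].
    log_var   : scalar log(sigma^2_T), learned alongside the model.
    """
    var = np.exp(log_var)
    return 0.5 * np.sum(log_var + delta_hat**2 / var)
```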

Model Architecture

SIVAE is pre-trained to embed and reconstruct log mel spectrograms. Downstream linear classifiers are fit to samples from the learned posterior to predict weak presence / absence labels with a learned per-species gated attention mechanism to identify the most relevant frames \(t \in T\) in each sample.
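The gated attention and L1-sparsified classifier can be sketched as follows. All weights here (`V`, `U`, `w`, `W_j`) are hypothetical learned parameters, and the tanh-times-sigmoid gating form is an assumption based on the common gated-attention construction, not necessarily the paper's exact parameterisation:

```python
import numpy as np

def gated_attention_pool(Z, V, U, w):
    """Pool frame latents Z (T, D) into one vector with gated attention.

    Scores combine a tanh feature branch with a sigmoid gate; a softmax
    over frames then yields per-frame relevance weights a (T,).
    """
    gate = 1.0 / (1.0 + np.exp(-(Z @ U)))     # sigmoid gate, (T, H)
    scores = (np.tanh(Z @ V) * gate) @ w      # scalar score per frame, (T,)
    a = np.exp(scores - scores.max())
    a /= a.sum()                              # softmax over frames
    return a @ Z, a

def l1_penalty(W_j, lam):
    """L1 term added to the classification loss; sparsifies the
    species-specific weights W_j for feature selection."""
    return lam * np.abs(W_j).sum()
```

The attention weights `a` are what identify the most relevant frames for each detection.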

Figure 2: (Top) Encoding architecture overview. Spectrograms \(\mathbf{x}_{0, \ldots, T}\) are encoded into intra-frame shift-invariant latent representations, \(\mathbf{z}_{0, \ldots, T}\), with a corresponding shift \(\hat{\boldsymbol{\delta}}_{0, \ldots, T}\) for each frame. These representations are learned in a fully self-supervised manner; subsequently, classifiers can be trained using \(\mathbf{z}\) as a feature. (Bottom) Decoding architecture overview. Samples \(\bar{\mathbf{z}}\) are drawn from the average Gaussian distribution using the reparametrisation trick. The decoder maps each sample \(\bar{\mathbf{z}}_t\) to approximate each input spectrogram, both as a full contiguous clip and as independently translated frames. During decoding, each frame representation is duplicated for each target and shifted in time by applying a translation \(\mathcal{T}(..., \hat{\boldsymbol{\delta}})\).
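The periodic translation \(\mathcal{T}\) applied during decoding can be sketched as circular linear interpolation along the time axis; the sign convention for \(\delta\) here is an assumption:

```python
import numpy as np

def translate(frame, delta):
    """Periodic time translation T(frame, delta) by linear interpolation.

    frame : (mels, T) spectrogram frame.
    delta : shift in [-1, 1], expressed as a fraction of frame length;
            the wrap-around implements the periodic boundary condition.
    """
    T = frame.shape[-1]
    shift = delta * T
    idx = (np.arange(T) - shift) % T          # periodic source positions
    lo = np.floor(idx).astype(int) % T
    hi = (lo + 1) % T
    frac = idx - np.floor(idx)
    return (1 - frac) * frame[..., lo] + frac * frame[..., hi]
```

For integer-valued shifts this reduces to `np.roll`; fractional shifts blend adjacent time bins.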

Generative Species Representations

Using the generative model, we can inspect the basis of a detection by applying classifier weights as an affine transformation in the latent space, generating novel samples from regions predictive of each species. Specifically, we define a species presence transformation \(\mathcal{S}_j(\mathbf{z}_k)\):

$$ d_{j,k} = \frac{W_j^T\mathbf{z}_k + b}{||W_j||}, $$

$$ \bar{W}_j = \frac{W_j}{||W_j||}, $$

$$ \mathcal{S}_j(\mathbf{z}_k) = \mathbf{z}_k + (\delta - d_{j,k})\bar{W}_j $$

where \(\mathbf{z}_k\) is a sample from an appropriate starting region of the latent space, \(d_{j,k}\) is the signed distance to the hyperplane for species \(j\), \(\bar{W}_j\) is the direction normal to the hyperplane and \(\delta\) is a tunable distance parameter beyond the decision boundary controlling the amplitude of the signal in the generated spectrogram. We set \(\mathbf{z}_k\) by taking the average latent representation for each habitat, resulting in a habitat-specific silent background. The final result is decoded, \(\mathcal{D}(\mathcal{S}_j(\mathbf{z}_k))\), providing a spectrogram illustrating the predictive factors for each species.
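Assembled into code, the transformation is a few lines (a minimal sketch; the convention that the result lands at signed distance \(\delta\) on the presence side of the hyperplane is an assumption):

```python
import numpy as np

def species_transform(z_k, W_j, b_j, delta):
    """Move a latent sample across species j's decision hyperplane.

    z_k   : starting latent (e.g. a habitat-average silent background).
    W_j,  : classifier weights and bias for species j.
    b_j
    delta : target signed distance beyond the decision boundary.
    """
    norm = np.linalg.norm(W_j)
    W_bar = W_j / norm                     # unit normal to the hyperplane
    d = (W_j @ z_k + b_j) / norm           # signed distance d_{j,k}
    return z_k + (delta - d) * W_bar       # land at signed distance delta
```

Decoding the result with the generative decoder then yields the species-representative spectrogram.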


Figure 3: (Left) a real example call typical of each species. (Right) a decoded generative species representation highlights what sounds were used for detecting a species.

Observe that for many species, the spectro-temporal morphology of the call, its fundamental frequency and its harmonics are used to delineate presence. However, in nearly all cases, correlated soundscape components, such as other species' calls, aid or even dominate the prediction.

Species Detection

Quantitative experiments show our approach is effective in terms of discriminative performance, achieving the highest area under the ROC curve (0.92) and Top-1 accuracy (0.67), outperforming BirdNET V2.4 (0.89, 0.52), while achieving good average precision (AP) despite using only very weak per-minute presence annotations.
Figure 4: Area under the ROC curve (auROC) and average precision (AP) score distributions for out-of-the-box BirdNET V2.4 and both VAE variants. Across species, the auROC is approximately equivalent between BirdNET and both the VAE and SIVAE on the full dataset, while our models typically perform slightly better on examples with heavy occlusion. Both VAE variants provide less precision than BirdNET on the SO dataset. Introducing shift-invariance yields a small improvement on the UK dataset and a drastic improvement on the RFCX bird dataset, with a slight drop in precision on Ecuadorian soundscapes.
Table 1: Area under the ROC curve (auROC), mean average precision (mAP) and top-1 accuracy scores for species detection models trained on latent space representations for each model variant, alongside BirdNET and Perch. Scores are averaged across the entire species community over 3 training runs for each dataset. Comparisons are made with both out-of-the-box BirdNET V2.4 and the best results from Ghani et al. (2023), where BirdNET and Perch were fine-tuned on a few-shot learning task. * indicates "strong labels", where bounding boxes were used to delineate the call in the training data. Size denotes the dimensionality of the embedding for the corresponding temporal resolution in seconds.
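For reference, the two headline metrics can be computed per species before averaging. A sketch using the rank-sum identity for auROC (score ties and the paper's exact averaging scheme are not handled):

```python
import numpy as np

def auroc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney rank-sum identity."""
    scores = np.asarray(scores, float)
    labels = np.asarray(labels, bool)
    ranks = np.empty(len(scores))
    ranks[scores.argsort()] = np.arange(1, len(scores) + 1)
    n_pos, n_neg = labels.sum(), (~labels).sum()
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def average_precision(scores, labels):
    """AP: mean precision at the rank of each positive example."""
    order = np.argsort(scores)[::-1]          # rank by descending score
    hits = np.asarray(labels, bool)[order]
    precision = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return precision[hits].mean()
```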