The architectural details of the encoder and decoder networks are described below and shown in the figures. The figure below presents the visualization legend, describing each type of layer used in the model and the additional annotations that determine their shapes. Note that we use the same notation as in the paper.
- convolution - convolution layer with kernels and no activation function, outputting feature maps,
- convolution with ReLU - convolution layer with a ReLU activation function and kernels, outputting feature maps,
- convolution with sigmoid - convolution layer with a sigmoid activation function and kernels, outputting feature maps,
- transposed convolution with ReLU - transposed convolution layer with a ReLU activation function and kernels, outputting feature maps,
- maxpool - MaxPool2D layer,
- feature map - feature map tensor of the shape given in the annotation,
- flatten and stack - tensor created by flattening and stacking the vectors assigned to each cell in each feature map, of the shape given in the annotation.
Algorithm below presents SSDIR's encoder flow: for each cell in the feature pyramid's grids it creates the where, present, what, and depth latent variables. The inference for each cell in the feature pyramid, as well as generating the latent representations, is performed in parallel.
Algorithm (SSDIR encoder). INPUT: normalized image. OUTPUT: where, present, what, and depth latent variables for each cell.
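To make the flow concrete, the sketch below outlines the encoder as a PyTorch module; the module names and the overall wiring are illustrative assumptions rather than the exact SSDIR implementation.

```python
import torch
from torch import nn


class SSDIREncoder(nn.Module):
    """Hypothetical sketch of the SSDIR encoder flow; all submodule names are assumptions."""

    def __init__(self, backbone, where_enc, present_enc, what_enc, depth_enc):
        super().__init__()
        self.backbone = backbone        # frozen VGG11-BN with a feature pyramid
        self.where_enc = where_enc      # SSD localization head
        self.present_enc = present_enc  # presence probability head
        self.what_enc = what_enc        # appearance encoder
        self.depth_enc = depth_enc      # depth encoder

    def forward(self, image):
        # image: normalized input batch of shape (B, 3, H, W)
        feature_maps = self.backbone(image)          # list of pyramid levels
        z_where = self.where_enc(feature_maps)       # per-cell bounding boxes
        z_present = self.present_enc(feature_maps)   # per-cell presence probabilities
        z_what = self.what_enc(feature_maps)         # per-cell appearance latents
        z_depth = self.depth_enc(feature_maps)       # per-cell depth latents
        return z_where, z_present, z_what, z_depth
```

All four encoders consume the same pyramid of feature maps, so the per-cell latents can be computed in parallel.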
The backbone used in SSDIR is a standard VGG11 with batch normalization, whose classification head was replaced with a feature pyramid, as shown in the figure below. The input images are normalized with a fixed mean and standard deviation and resized to the model's input resolution. As a result, the backbone outputs feature maps at several resolutions, which denote the sizes of the grids in each level of the feature pyramid. These features are passed to the latent vectors' encoders. During training, the weights of the backbone are frozen.
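A minimal sketch of setting up such a backbone in PyTorch is shown below; in SSDIR the weights are transferred from a trained SSD model, and the feature pyramid layers are omitted here.

```python
from torchvision import models

# VGG11 with batch normalization; in SSDIR its weights are transferred from a
# trained SSD model (loading those weights is not shown in this sketch).
vgg = models.vgg11_bn(weights=None)

# The classification head is dropped; only the convolutional part is kept and
# would be extended with the feature pyramid described above.
backbone_features = vgg.features

# The backbone weights are frozen during SSDIR training.
for param in backbone_features.parameters():
    param.requires_grad = False
```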
The architectures of the where and present encoders are presented in the figures below. Both encoders are based on SSD prediction heads and utilize one convolutional layer for each feature map. Since the SSD model used in SSDIR assigns two predictions to each cell, the output representation consists of two vectors per cell. As in the backbone's case, the weights of the encoders transferred from an SSD model are frozen during SSDIR training.
Figure: the where encoder (left) and the present encoder (right) architectures.
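A prediction-head-style encoder of this kind could be sketched as below; the kernel size, the number of predictions per cell, and the four box parameters follow standard SSD heads and are assumptions, not the exact SSDIR configuration.

```python
import torch
from torch import nn


class WhereEncoder(nn.Module):
    """Sketch of an SSD-style prediction-head encoder; dimensions are assumptions."""

    def __init__(self, pyramid_channels, boxes_per_cell=2, out_dim=4):
        super().__init__()
        # One convolutional layer per feature map of the pyramid.
        self.heads = nn.ModuleList([
            nn.Conv2d(channels, boxes_per_cell * out_dim, kernel_size=3, padding=1)
            for channels in pyramid_channels
        ])
        self.out_dim = out_dim

    def forward(self, feature_maps):
        outputs = []
        for head, fmap in zip(self.heads, feature_maps):
            out = head(fmap)                               # (B, boxes*out_dim, H, W)
            out = out.permute(0, 2, 3, 1).contiguous()     # (B, H, W, boxes*out_dim)
            outputs.append(out.view(out.size(0), -1, self.out_dim))
        # Flatten and stack the vectors from all pyramid levels.
        return torch.cat(outputs, dim=1)                   # (B, n_objects, out_dim)
```

The present encoder would follow the same pattern with a single output value per prediction, passed through a sigmoid to obtain a probability.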
The what encoder, shown in the figure below, is slightly extended compared to the other encoders. Each feature map can be processed by multiple convolutional layers, each with the same number of kernels, equal to the size of the what latent vector. The output is a vector of means used to sample the what latent vector.

The architecture of the depth encoder is similar to the present encoder (see the figure below). As in the what encoder, the output is used as the mean for sampling the depth latent vector.
Figure: the what encoder (left) and the depth encoder (right) architectures.
These modules generate one latent vector for each cell; in order to match the size of SSD's output, each latent vector is duplicated.
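The sketch below illustrates the what encoder and the duplication step; the latent size, the number of layers, and the two-fold duplication factor (matching SSD's two predictions per cell) are assumptions.

```python
import torch
from torch import nn


class WhatEncoder(nn.Module):
    """Sketch of the what encoder; latent size and layer count are assumptions."""

    def __init__(self, pyramid_channels, z_what_size=64, n_layers=2):
        super().__init__()
        self.z_what_size = z_what_size
        self.heads = nn.ModuleList()
        for channels in pyramid_channels:
            layers, in_channels = [], channels
            for i in range(n_layers):
                # Every layer uses the same number of kernels: the latent size.
                layers.append(nn.Conv2d(in_channels, z_what_size, kernel_size=3, padding=1))
                if i < n_layers - 1:
                    layers.append(nn.ReLU(inplace=True))
                in_channels = z_what_size
            self.heads.append(nn.Sequential(*layers))

    def forward(self, feature_maps):
        means = []
        for head, fmap in zip(self.heads, feature_maps):
            out = head(fmap)                           # (B, z_what_size, H, W)
            out = out.permute(0, 2, 3, 1).reshape(out.size(0), -1, self.z_what_size)
            means.append(out)
        z_what_mean = torch.cat(means, dim=1)          # one mean vector per cell
        # Duplicate each cell's latent to match the size of SSD's output.
        return z_what_mean.repeat_interleave(2, dim=1)
```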
Algorithm below shows the flow of the SSDIR decoder network. First, all latent vectors are filtered according to the present latent variable, producing present-only latent vectors. For batched decoding and transforming of reconstructions, all filtered latent representations in a batch are stacked and forwarded at once through the what decoder and the spatial transformer. The output image, created by merging the transformed reconstructions, is normalized to increase the intensities for visual fidelity.
Algorithm (SSDIR decoder). INPUT: objects' latent representations. OUTPUT: reconstructed images (normalized).
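The sketch below mirrors this flow; for readability it loops over images, whereas the actual implementation stacks all filtered latents in the batch and forwards them at once. The helper names, the presence threshold, and the max-based normalization are assumptions (a possible transform_and_merge is sketched further below).

```python
import torch


def decode(latents, what_decoder, transform_and_merge, present_threshold=0.5):
    """Simplified sketch of the SSDIR decoder flow; names and threshold are assumptions."""
    z_where, z_present, z_what, z_depth = latents

    # Keep only the latent vectors of objects marked as present.
    mask = z_present.squeeze(-1) > present_threshold          # (B, n_objects)
    images = []
    for b in range(z_what.size(0)):
        keep = mask[b]
        # Decode the present objects of this image in one forward pass.
        reconstructions = what_decoder(z_what[b, keep])        # (n_present, 3, h, w)
        # Paste the reconstructions at their inferred locations and merge by depth.
        image = transform_and_merge(reconstructions, z_where[b, keep], z_depth[b, keep])
        # Normalize intensities for visual fidelity.
        images.append(image / image.max().clamp(min=1e-8))
    return torch.stack(images)
```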
The what decoder consists of a sequence of convolutional layers. The first one prepares a larger feature map for transposed convolution. Then, a series of strided transposed convolutions upscales the feature map to the reconstruction resolution. Finally, the last convolutional layer with the sigmoid activation function outputs 3 channels, creating the objects' reconstructions.
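A possible realization of such a decoder is sketched below; the channel counts, the number of upscaling steps, and the 64x64 output resolution are assumptions.

```python
import torch
from torch import nn


class WhatDecoder(nn.Module):
    """Sketch of the what decoder; channel counts and output size are assumptions."""

    def __init__(self, z_what_size=64):
        super().__init__()
        self.layers = nn.Sequential(
            # The first convolution prepares a larger feature map for upscaling.
            nn.Conv2d(z_what_size, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            # Strided transposed convolutions upscale the feature map step by step.
            nn.ConvTranspose2d(256, 256, kernel_size=4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 128, kernel_size=4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(inplace=True),
            # The final convolution outputs 3 channels with a sigmoid activation.
            nn.Conv2d(32, 3, kernel_size=3, padding=1), nn.Sigmoid(),
        )

    def forward(self, z_what):
        # Treat each latent vector as a 1x1 feature map and upscale it to 64x64.
        x = z_what.view(z_what.size(0), -1, 1, 1)
        return self.layers(x)
```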
The filtered where latent vectors are used to transform the decoded reconstructions to the inferred locations in the image. We use an affine transformation to create full-sized per-object images, which are merged according to the softmaxed depth values.
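The pasting and merging step could be sketched with PyTorch's affine_grid and grid_sample as follows; the (center, width, height) parametrization of the where latents and the global per-object softmax over depths are assumptions.

```python
import torch
import torch.nn.functional as F


def transform_and_merge(reconstructions, z_where, z_depth, image_size=(300, 300)):
    """Sketch of pasting per-object reconstructions and merging them by depth.

    Assumes z_where = (center_x, center_y, width, height) in normalized [0, 1]
    coordinates; the exact parametrization in SSDIR may differ.
    """
    n_objects = reconstructions.size(0)
    cx, cy, w, h = z_where.unbind(dim=-1)

    # Build inverse affine matrices that place each object crop into the full image.
    theta = torch.zeros(n_objects, 2, 3, device=reconstructions.device)
    theta[:, 0, 0] = 1.0 / w.clamp(min=1e-3)
    theta[:, 1, 1] = 1.0 / h.clamp(min=1e-3)
    theta[:, 0, 2] = (1.0 - 2.0 * cx) / w.clamp(min=1e-3)
    theta[:, 1, 2] = (1.0 - 2.0 * cy) / h.clamp(min=1e-3)

    grid = F.affine_grid(theta, [n_objects, 3, *image_size], align_corners=False)
    pasted = F.grid_sample(reconstructions, grid, align_corners=False)  # (n, 3, H, W)

    # Merge the per-object images according to softmaxed depth values.
    weights = torch.softmax(z_depth.view(n_objects, 1, 1, 1), dim=0)
    return (weights * pasted).sum(dim=0)
```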
Standard deviations, used to sample latent representations of each object, are treated as model hyperparameters. They are given for each experiment in Section Training.
To increase the stability of training, the latent vectors of non-present objects (those whose presence probability is lower than the threshold) can be reset in order to prevent their values from exploding, which we observed during training when transferring a pre-trained backbone, where, and present encoders from SSD. In such a case, all non-present objects' means, standard deviations, and bounding box parameters were reset to fixed default values.
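A sketch of such a reset is given below; the exact default values and the threshold are not reproduced here, so zero means, unit standard deviations, and zero bounding boxes are used only as assumed placeholders.

```python
import torch


def reset_non_present(z_where, z_present, z_what_mean, z_what_std, threshold=0.5):
    """Sketch of resetting latents of non-present objects; default values are assumptions."""
    # Objects whose presence probability falls below the threshold.
    non_present = (z_present.squeeze(-1) < threshold).unsqueeze(-1)

    # Assumed placeholders: zero means and bounding boxes, unit standard deviations.
    z_what_mean = torch.where(non_present, torch.zeros_like(z_what_mean), z_what_mean)
    z_what_std = torch.where(non_present, torch.ones_like(z_what_std), z_what_std)
    z_where = torch.where(non_present, torch.zeros_like(z_where), z_where)
    return z_where, z_what_mean, z_what_std
```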
We noticed that training only the what and depth encoders was not sufficient to learn high-quality representations for more complex datasets. In such a case, it is possible to clone the convolutional backbone for learning the what and depth latent variables and train it jointly, while preserving the originally learned weights for inferring the where and present latent variables.
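A possible way to realize this in PyTorch is to deep-copy the frozen backbone (here reusing backbone_features from the earlier sketch) and unfreeze the copy:

```python
import copy

# Clone the frozen SSD backbone: the original stays frozen and keeps inferring the
# where and present latents, while the copy is trained jointly with the what and
# depth encoders.
trainable_backbone = copy.deepcopy(backbone_features)
for param in trainable_backbone.parameters():
    param.requires_grad = True
```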
The weights of modules that are not transferred from a trained SSD model are initialized using Xavier initialization, with biases set to zero.
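A minimal sketch of such an initialization, assuming zero biases, is given below:

```python
from torch import nn


def init_weights(module):
    """Xavier initialization for modules not transferred from SSD (sketch)."""
    if isinstance(module, (nn.Conv2d, nn.ConvTranspose2d, nn.Linear)):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)


# Example usage on a freshly created module, e.g. the what decoder:
# what_decoder.apply(init_weights)
```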
We prepared the Multi-scale scattered MNIST dataset to test multi-object representation learning using images with highly varying object sizes. It can be treated as a benchmark dataset, and its construction procedure contributes to this submission. The dataset was generated according to the algorithm below, with a given set of parameters (a sketch of the placement procedure follows the parameter list):
- output image size,
- set of grid sizes used for placing digits,
- minimum size of a digit in the image,
- maximum size of a digit in the image,
- the range in which the position of a digit may vary, given as the percentage of the cell size,
- threshold for indicating if a cell in a grid is already filled, given as the percentage of the cell's area,
- number of images to generate.
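The generation algorithm itself is not reproduced here; the sketch below illustrates how such a multi-scale placement procedure might look. All function names, default parameter values, and the placement probability are assumptions.

```python
import random

import numpy as np


def resize_digit(digit, size):
    """Nearest-neighbour resize of a square digit crop to size x size."""
    idx = np.arange(size) * digit.shape[0] // size
    return digit[np.ix_(idx, idx)]


def generate_image(digits, image_size=96, grid_sizes=(2, 4), min_size=16, max_size=48,
                   position_jitter=0.3, fill_threshold=0.4, place_probability=0.5):
    """Rough sketch of multi-scale digit placement; all names and values are assumptions."""
    canvas = np.zeros((image_size, image_size), dtype=np.float32)
    filled = np.zeros((image_size, image_size), dtype=bool)
    boxes = []

    for grid in grid_sizes:
        cell = image_size // grid
        for row in range(grid):
            for col in range(grid):
                y0, x0 = row * cell, col * cell
                # Skip cells already covered above the fill threshold.
                if filled[y0:y0 + cell, x0:x0 + cell].mean() > fill_threshold:
                    continue
                if random.random() > place_probability:
                    continue
                # Draw a digit size bounded by the allowed range and the cell size.
                size = random.randint(min_size, max(min_size, min(max_size, cell)))
                # Jitter the digit position within the cell.
                dy = int(random.uniform(-position_jitter, position_jitter) * cell)
                dx = int(random.uniform(-position_jitter, position_jitter) * cell)
                y = int(np.clip(y0 + (cell - size) // 2 + dy, 0, image_size - size))
                x = int(np.clip(x0 + (cell - size) // 2 + dx, 0, image_size - size))
                digit = resize_digit(random.choice(digits), size)
                canvas[y:y + size, x:x + size] = np.maximum(canvas[y:y + size, x:x + size], digit)
                filled[y:y + size, x:x + size] = True
                boxes.append((x, y, size, size))
    return canvas, boxes
```

Each grid level places digits of a scale bounded by its cell size, while the fill threshold prevents placing new digits over cells that are already mostly covered.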
In the table below we gather the parameter values used for generating the main dataset, prepared for training the SSD and SSDIR models. An additional validation dataset, the size of 10% of the training dataset, was used for evaluating SSDIR, SPAIR, and SPACE with regard to per-object reconstructions and the downstream task. We also present all researched values of the parameters, combinations of which were used to generate the ablation study's datasets.
Table: generation parameter values for the main dataset and the researched values for the ablation study (columns: parameter, main, ablation).
We used the dataset generated originally by the authors. It contains objects of pre-defined sizes (large and small), colors, materials, and shapes. The locations of the objects in the scene were processed to generate bounding boxes for training the SSD model. We trained SSDIR, SPAIR, and SPACE using the entire training dataset, while the validation dataset was used for evaluating each model's reconstruction quality.
This dataset was used for evaluating the performance of the models on images with a complex background when trying to focus on a particular type of object (here: faces). The dataset contains bounding box coordinates and hence could be used for training the SSD model directly. We applied an additional preprocessing stage, dropping small bounding boxes (smaller than 4% of the image) and removing images without any bounding box. Then, the SSDIR, SPAIR, and SPACE models were trained with the training dataset, and the validation dataset served as the reconstruction quality benchmark.
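The preprocessing step could be sketched as follows, assuming per-image annotations with pixel-space (x, y, width, height) boxes and a fixed image area; the data format is an assumption.

```python
def filter_annotations(annotations, image_area, min_box_fraction=0.04):
    """Sketch of the preprocessing described above; the data format is an assumption.

    `annotations` is assumed to be a list of (image_id, boxes) pairs with boxes
    given as (x, y, width, height) in pixels.
    """
    filtered = []
    for image_id, boxes in annotations:
        # Drop bounding boxes covering less than 4% of the image area.
        kept = [box for box in boxes if box[2] * box[3] >= min_box_fraction * image_area]
        # Remove images left without any bounding box.
        if kept:
            filtered.append((image_id, kept))
    return filtered
```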
In this section, we summarize the hyperparameters of SSDIR used for training the models for each dataset. The batch size was tuned to fit the GPU's memory, whereas the optimization of the other hyperparameters was conducted using Bayesian model-based optimization. In the table below we present the hyperparameters used for each task and dataset (the names of the hyperparameters match those used in the paper).
Below, we provide additional reconstructions for the following datasets: scattered MNIST, CLEVR, and WIDER FACE. Once again, the number of reconstructions shown for each image is limited due to the total number of objects reconstructed by each model; if the number of objects reconstructed by a model was smaller than the number of columns, we show only the ones returned by the model.