The architectural details of the encoder and decoder networks are described below and shown in the figures. The figure below presents the visualization legend, describing each type of layer used in the model and the additional annotations that determine their shapes. Note that we use the same notation as in the paper.
- convolution - convolution layer with kernels and no activation function, outputting feature maps,
- convolution with ReLU - convolution layer with a ReLU activation function and kernels, outputting feature maps,
- convolution with sigmoid - convolution layer with a sigmoid activation function and kernels, outputting feature maps,
- transposed convolution with ReLU - transposed convolution layer with a ReLU activation function and kernels, outputting feature maps,
- maxpool - MaxPool2D layer,
- feature map - feature map tensor of the shape given in the annotation,
- flatten and stack - tensor created by flattening and stacking the vectors assigned to each cell in each feature map, of the shape given in the annotation.
Algorithm below presents SSDIR's encoder flow: for each cell in the feature pyramid's grids it creates the where, present, what, and depth latent variables. The inference for each cell in the feature pyramid, as well as generating the latent representations, is performed in parallel.
Algorithm (SSDIR encoder). INPUT: normalized image. OUTPUT: where, present, what, and depth latent variables for each cell.
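To make the flow concrete, the sketch below outlines the encoder as a PyTorch module; the module names and the overall wiring are illustrative assumptions rather than the exact SSDIR implementation.

```python
import torch
from torch import nn


class SSDIREncoder(nn.Module):
    """Hypothetical sketch of the SSDIR encoder flow; all submodule names are assumptions."""

    def __init__(self, backbone, where_enc, present_enc, what_enc, depth_enc):
        super().__init__()
        self.backbone = backbone        # frozen VGG11-BN with a feature pyramid
        self.where_enc = where_enc      # SSD localization head
        self.present_enc = present_enc  # presence probability head
        self.what_enc = what_enc        # appearance encoder
        self.depth_enc = depth_enc      # depth encoder

    def forward(self, image):
        # image: normalized input batch of shape (B, 3, H, W)
        feature_maps = self.backbone(image)          # list of pyramid levels
        z_where = self.where_enc(feature_maps)       # per-cell bounding boxes
        z_present = self.present_enc(feature_maps)   # per-cell presence probabilities
        z_what = self.what_enc(feature_maps)         # per-cell appearance latents
        z_depth = self.depth_enc(feature_maps)       # per-cell depth latents
        return z_where, z_present, z_what, z_depth
```

All four encoders consume the same pyramid of feature maps, so the per-cell latents can be computed in parallel.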
The backbone used in SSDIR is a standard VGG11 with batch normalization, whose classification head was replaced with a feature pyramid, as shown in the figure below. The input images are normalized with a fixed mean and standard deviation and resized to the model's input resolution. As a result, the backbone outputs feature maps at several resolutions, which denote the sizes of the grids in each level of the feature pyramid. These features are passed to the latent vectors' encoders. During training, the weights of the backbone are frozen.
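A minimal sketch of setting up such a backbone in PyTorch is shown below; in SSDIR the weights are transferred from a trained SSD model, and the feature pyramid layers are omitted here.

```python
from torchvision import models

# VGG11 with batch normalization; in SSDIR its weights are transferred from a
# trained SSD model (loading those weights is not shown in this sketch).
vgg = models.vgg11_bn(weights=None)

# The classification head is dropped; only the convolutional part is kept and
# would be extended with the feature pyramid described above.
backbone_features = vgg.features

# The backbone weights are frozen during SSDIR training.
for param in backbone_features.parameters():
    param.requires_grad = False
```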
The architectures of the where and present encoders are presented in the figures below. Both encoders are based on SSD prediction heads and utilize one convolutional layer for each feature map. Since the SSD model used in SSDIR assigns two predictions to each cell, the output representation consists of two vectors per cell. As in the backbone's case, the weights of the encoders transferred from an SSD model are frozen during SSDIR training.
Figure: the where encoder (left) and the present encoder (right) architectures.
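A prediction-head-style encoder of this kind could be sketched as below; the kernel size, the number of predictions per cell, and the four box parameters follow standard SSD heads and are assumptions, not the exact SSDIR configuration.

```python
import torch
from torch import nn


class WhereEncoder(nn.Module):
    """Sketch of an SSD-style prediction-head encoder; dimensions are assumptions."""

    def __init__(self, pyramid_channels, boxes_per_cell=2, out_dim=4):
        super().__init__()
        # One convolutional layer per feature map of the pyramid.
        self.heads = nn.ModuleList([
            nn.Conv2d(channels, boxes_per_cell * out_dim, kernel_size=3, padding=1)
            for channels in pyramid_channels
        ])
        self.out_dim = out_dim

    def forward(self, feature_maps):
        outputs = []
        for head, fmap in zip(self.heads, feature_maps):
            out = head(fmap)                               # (B, boxes*out_dim, H, W)
            out = out.permute(0, 2, 3, 1).contiguous()     # (B, H, W, boxes*out_dim)
            outputs.append(out.view(out.size(0), -1, self.out_dim))
        # Flatten and stack the vectors from all pyramid levels.
        return torch.cat(outputs, dim=1)                   # (B, n_objects, out_dim)
```

The present encoder would follow the same pattern with a single output value per prediction, passed through a sigmoid to obtain a probability.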
The what encoder, shown in the figure below, is slightly extended compared to the other encoders. Each feature map can be processed by multiple convolutional layers, each with the same number of kernels, equal to the size of the what latent vector. The output is a vector of means used to sample the what latent vector.

The architecture of the depth encoder is similar to the present encoder (see the figure below). As in the what encoder, the output is used as the mean for sampling the depth latent vector.
Figure: the what encoder (left) and the depth encoder (right) architectures.
These modules generate one latent vector for each cell; in order to match the size of SSD's output, each latent vector is duplicated.
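The sketch below illustrates the what encoder and the duplication step; the latent size, the number of layers, and the two-fold duplication factor (matching SSD's two predictions per cell) are assumptions.

```python
import torch
from torch import nn


class WhatEncoder(nn.Module):
    """Sketch of the what encoder; latent size and layer count are assumptions."""

    def __init__(self, pyramid_channels, z_what_size=64, n_layers=2):
        super().__init__()
        self.z_what_size = z_what_size
        self.heads = nn.ModuleList()
        for channels in pyramid_channels:
            layers, in_channels = [], channels
            for i in range(n_layers):
                # Every layer uses the same number of kernels: the latent size.
                layers.append(nn.Conv2d(in_channels, z_what_size, kernel_size=3, padding=1))
                if i < n_layers - 1:
                    layers.append(nn.ReLU(inplace=True))
                in_channels = z_what_size
            self.heads.append(nn.Sequential(*layers))

    def forward(self, feature_maps):
        means = []
        for head, fmap in zip(self.heads, feature_maps):
            out = head(fmap)                           # (B, z_what_size, H, W)
            out = out.permute(0, 2, 3, 1).reshape(out.size(0), -1, self.z_what_size)
            means.append(out)
        z_what_mean = torch.cat(means, dim=1)          # one mean vector per cell
        # Duplicate each cell's latent to match the size of SSD's output.
        return z_what_mean.repeat_interleave(2, dim=1)
```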
Algorithm below shows the flow of the SSDIR decoder network. First, all latent vectors are filtered according to the present latent variable, producing present-only latent vectors. For batched decoding and transforming of reconstructions, all filtered latent representations in a batch are stacked and forwarded at once through the what decoder and the spatial transformer. The output image, created by merging the transformed reconstructions, is normalized to increase the intensities for visual fidelity.
Algorithm (SSDIR decoder). INPUT: objects' latent representations. OUTPUT: reconstructed images (normalized).
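The sketch below mirrors this flow; for readability it loops over images, whereas the actual implementation stacks all filtered latents in the batch and forwards them at once. The helper names, the presence threshold, and the max-based normalization are assumptions (a possible transform_and_merge is sketched further below).

```python
import torch


def decode(latents, what_decoder, transform_and_merge, present_threshold=0.5):
    """Simplified sketch of the SSDIR decoder flow; names and threshold are assumptions."""
    z_where, z_present, z_what, z_depth = latents

    # Keep only the latent vectors of objects marked as present.
    mask = z_present.squeeze(-1) > present_threshold          # (B, n_objects)
    images = []
    for b in range(z_what.size(0)):
        keep = mask[b]
        # Decode the present objects of this image in one forward pass.
        reconstructions = what_decoder(z_what[b, keep])        # (n_present, 3, h, w)
        # Paste the reconstructions at their inferred locations and merge by depth.
        image = transform_and_merge(reconstructions, z_where[b, keep], z_depth[b, keep])
        # Normalize intensities for visual fidelity.
        images.append(image / image.max().clamp(min=1e-8))
    return torch.stack(images)
```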
The what decoder consists of a sequence of convolutional layers. The first one prepares a larger feature map for transposed convolution. Then, a series of strided transposed convolutions upscales the feature map to the reconstruction resolution. Finally, the last convolutional layer with the sigmoid activation function outputs 3 channels, creating the objects' reconstructions.
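A possible realization of such a decoder is sketched below; the channel counts, the number of upscaling steps, and the 64x64 output resolution are assumptions.

```python
import torch
from torch import nn


class WhatDecoder(nn.Module):
    """Sketch of the what decoder; channel counts and output size are assumptions."""

    def __init__(self, z_what_size=64):
        super().__init__()
        self.layers = nn.Sequential(
            # The first convolution prepares a larger feature map for upscaling.
            nn.Conv2d(z_what_size, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            # Strided transposed convolutions upscale the feature map step by step.
            nn.ConvTranspose2d(256, 256, kernel_size=4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 128, kernel_size=4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(inplace=True),
            # The final convolution outputs 3 channels with a sigmoid activation.
            nn.Conv2d(32, 3, kernel_size=3, padding=1), nn.Sigmoid(),
        )

    def forward(self, z_what):
        # Treat each latent vector as a 1x1 feature map and upscale it to 64x64.
        x = z_what.view(z_what.size(0), -1, 1, 1)
        return self.layers(x)
```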
The filtered where latent vectors are used to transform the decoded reconstructions to the inferred locations in the image. We use an affine transformation to create full-sized per-object images, which are merged according to the softmaxed depth values.
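The pasting and merging step could be sketched with PyTorch's affine_grid and grid_sample as follows; the (center, width, height) parametrization of the where latents and the global per-object softmax over depths are assumptions.

```python
import torch
import torch.nn.functional as F


def transform_and_merge(reconstructions, z_where, z_depth, image_size=(300, 300)):
    """Sketch of pasting per-object reconstructions and merging them by depth.

    Assumes z_where = (center_x, center_y, width, height) in normalized [0, 1]
    coordinates; the exact parametrization in SSDIR may differ.
    """
    n_objects = reconstructions.size(0)
    cx, cy, w, h = z_where.unbind(dim=-1)

    # Build inverse affine matrices that place each object crop into the full image.
    theta = torch.zeros(n_objects, 2, 3, device=reconstructions.device)
    theta[:, 0, 0] = 1.0 / w.clamp(min=1e-3)
    theta[:, 1, 1] = 1.0 / h.clamp(min=1e-3)
    theta[:, 0, 2] = (1.0 - 2.0 * cx) / w.clamp(min=1e-3)
    theta[:, 1, 2] = (1.0 - 2.0 * cy) / h.clamp(min=1e-3)

    grid = F.affine_grid(theta, [n_objects, 3, *image_size], align_corners=False)
    pasted = F.grid_sample(reconstructions, grid, align_corners=False)  # (n, 3, H, W)

    # Merge the per-object images according to softmaxed depth values.
    weights = torch.softmax(z_depth.view(n_objects, 1, 1, 1), dim=0)
    return (weights * pasted).sum(dim=0)
```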
Standard deviations, used to sample latent representations of each object, are treated as model hyperparameters. They are given for each experiment in Section Training.
To increase the stability of training, the latent vectors of non-present objects (those whose presence probability is lower than the threshold) can be reset in order to prevent their values from exploding, which we observed during training when transferring a pre-trained backbone, where, and present encoders from SSD. In such a case, all non-present objects' means, standard deviations, and bounding box parameters were reset to fixed default values.
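A sketch of such a reset is given below; the exact default values and the threshold are not reproduced here, so zero means, unit standard deviations, and zero bounding boxes are used only as assumed placeholders.

```python
import torch


def reset_non_present(z_where, z_present, z_what_mean, z_what_std, threshold=0.5):
    """Sketch of resetting latents of non-present objects; default values are assumptions."""
    # Objects whose presence probability falls below the threshold.
    non_present = (z_present.squeeze(-1) < threshold).unsqueeze(-1)

    # Assumed placeholders: zero means and bounding boxes, unit standard deviations.
    z_what_mean = torch.where(non_present, torch.zeros_like(z_what_mean), z_what_mean)
    z_what_std = torch.where(non_present, torch.ones_like(z_what_std), z_what_std)
    z_where = torch.where(non_present, torch.zeros_like(z_where), z_where)
    return z_where, z_what_mean, z_what_std
```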
We noticed that training only the what and depth encoders was not sufficient to learn high-quality representations for more complex datasets. In such a case, it is possible to clone the convolutional backbone for learning the what and depth latent variables and train it jointly, while preserving the originally learned weights for inferring the where and present latent variables.
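A possible way to realize this in PyTorch is to deep-copy the frozen backbone (here reusing backbone_features from the earlier sketch) and unfreeze the copy:

```python
import copy

# Clone the frozen SSD backbone: the original stays frozen and keeps inferring the
# where and present latents, while the copy is trained jointly with the what and
# depth encoders.
trainable_backbone = copy.deepcopy(backbone_features)
for param in trainable_backbone.parameters():
    param.requires_grad = True
```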
The weights of modules that are not transferred from a trained SSD model are initialized using Xavier initialization, with biases set to zero.
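A minimal sketch of such an initialization, assuming zero biases, is given below:

```python
from torch import nn


def init_weights(module):
    """Xavier initialization for modules not transferred from SSD (sketch)."""
    if isinstance(module, (nn.Conv2d, nn.ConvTranspose2d, nn.Linear)):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)


# Example usage on a freshly created module, e.g. the what decoder:
# what_decoder.apply(init_weights)
```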
We prepared the Multi-scale scattered MNIST dataset to test multi-object representation learning using images with highly varying object sizes. It can be treated as a benchmark dataset, and its construction procedure contributes to this submission. The dataset was generated according to the algorithm below, with a given set of parameters (a sketch of the placement procedure follows the parameter list):
- output image size,
- set of grid sizes used for placing digits,
- minimum size of a digit in the image,
- maximum size of a digit in the image,
- the range in which the position of a digit may vary, given as the percentage of the cell size,
- threshold for indicating if a cell in a grid is already filled, given as the percentage of the cell's area,
- number of images to generate.
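The generation algorithm itself is not reproduced here; the sketch below illustrates how such a multi-scale placement procedure might look. All function names, default parameter values, and the placement probability are assumptions.

```python
import random

import numpy as np


def resize_digit(digit, size):
    """Nearest-neighbour resize of a square digit crop to size x size."""
    idx = np.arange(size) * digit.shape[0] // size
    return digit[np.ix_(idx, idx)]


def generate_image(digits, image_size=96, grid_sizes=(2, 4), min_size=16, max_size=48,
                   position_jitter=0.3, fill_threshold=0.4, place_probability=0.5):
    """Rough sketch of multi-scale digit placement; all names and values are assumptions."""
    canvas = np.zeros((image_size, image_size), dtype=np.float32)
    filled = np.zeros((image_size, image_size), dtype=bool)
    boxes = []

    for grid in grid_sizes:
        cell = image_size // grid
        for row in range(grid):
            for col in range(grid):
                y0, x0 = row * cell, col * cell
                # Skip cells already covered above the fill threshold.
                if filled[y0:y0 + cell, x0:x0 + cell].mean() > fill_threshold:
                    continue
                if random.random() > place_probability:
                    continue
                # Draw a digit size bounded by the allowed range and the cell size.
                size = random.randint(min_size, max(min_size, min(max_size, cell)))
                # Jitter the digit position within the cell.
                dy = int(random.uniform(-position_jitter, position_jitter) * cell)
                dx = int(random.uniform(-position_jitter, position_jitter) * cell)
                y = int(np.clip(y0 + (cell - size) // 2 + dy, 0, image_size - size))
                x = int(np.clip(x0 + (cell - size) // 2 + dx, 0, image_size - size))
                digit = resize_digit(random.choice(digits), size)
                canvas[y:y + size, x:x + size] = np.maximum(canvas[y:y + size, x:x + size], digit)
                filled[y:y + size, x:x + size] = True
                boxes.append((x, y, size, size))
    return canvas, boxes
```

Each grid level places digits of a scale bounded by its cell size, while the fill threshold prevents placing new digits over cells that are already mostly covered.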
In the table below we gather the parameter values used for generating the main dataset, prepared for training the SSD and SSDIR models. An additional validation dataset, the size of 10% of the training dataset, was used for evaluating SSDIR, SPAIR, and SPACE with regard to per-object reconstructions and the downstream task. We also present all researched values of the parameters, combinations of which were used to generate the ablation study's datasets.
Table: generation parameter values for the main dataset and the researched values for the ablation study (columns: parameter, main, ablation).
We used the dataset generated originally by the authors. It contains objects of pre-defined sizes (large and small), colors, materials, and shapes. The locations of the objects in the scene were processed to generate bounding boxes for training the SSD model. We trained SSDIR, SPAIR, and SPACE using the entire training dataset, while the validation dataset was used for evaluating each model's reconstruction quality.
This dataset was used for evaluating the performance of the models on images with a complex background when trying to focus on a particular type of object (here: faces). The dataset contains bounding box coordinates and hence could be used for training the SSD model directly. We applied an additional preprocessing stage, dropping small bounding boxes (smaller than 4% of the image) and removing images without any bounding box. Then, the SSDIR, SPAIR, and SPACE models were trained with the training dataset, and the validation dataset served as the reconstruction quality benchmark.
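The preprocessing step could be sketched as follows, assuming per-image annotations with pixel-space (x, y, width, height) boxes and a fixed image area; the data format is an assumption.

```python
def filter_annotations(annotations, image_area, min_box_fraction=0.04):
    """Sketch of the preprocessing described above; the data format is an assumption.

    `annotations` is assumed to be a list of (image_id, boxes) pairs with boxes
    given as (x, y, width, height) in pixels.
    """
    filtered = []
    for image_id, boxes in annotations:
        # Drop bounding boxes covering less than 4% of the image area.
        kept = [box for box in boxes if box[2] * box[3] >= min_box_fraction * image_area]
        # Remove images left without any bounding box.
        if kept:
            filtered.append((image_id, kept))
    return filtered
```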
In this section, we summarize the hyperparameters of SSDIR used for training the models for each dataset. The batch size was tuned to fit the GPU's memory, whereas the optimization of the other hyperparameters was conducted using Bayesian model-based optimization. In the table below we present the hyperparameters used for each task and dataset (the names of the hyperparameters match those used in the paper).
Below, we provide additional reconstructions for the following datasets: scattered MNIST, CLEVR, and WIDER FACE. Once again, the number of reconstructions shown for each image is limited due to the total number of objects reconstructed by each model; if the number of objects reconstructed by a model was smaller than the number of columns, we show only the ones returned by the model.