Deep Learning with a Small Training Batch (or Lack Thereof)

Overview of self-supervised methods.

While demand for neural networks is growing, adapting state-of-the-art approaches to business needs often stalls because labeled data is scarce or missing altogether. Supervised learning is hardly feasible in this situation, and standard unsupervised methods won’t work for most practical tasks. This is where self-supervised methods come to the rescue. Depending on the task, they need next to no labels, or none at all.


Some 100,000 new content units are uploaded to our iFunny app daily. Labeling that volume by hand would be costly, but we need information about the objects in the content to personalize the feed and select the right push notifications.

This article covers two self-supervised approaches to image classification, one developed at Google and the other at Facebook. Both work well when you have only a small labeled training set, and they can significantly reduce manual labeling or even let you drop it altogether.

SimCLR: a framework for unsupervised learning by Google Research

Original article: A Simple Framework for Contrastive Learning of Visual Representations

The method is based on a contrastive loss function. It is computed on z, the output of the network (a linear layer), using sim, a similarity function between two vectors. In practice, cosine similarity is used.
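Written out in the notation used here, the loss for a positive pair (i, j) takes the following form. This is the NT-Xent loss as given in the SimCLR paper; N, the number of source images in the batch (so that there are 2N augmented views), is a symbol not introduced in the text above.

```latex
\ell_{i,j} = -\log
  \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}
       {\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)},
\qquad
\mathrm{sim}(u, v) = \frac{u^\top v}{\lVert u \rVert \, \lVert v \rVert}
```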

The loss is computed for a positive pair of examples (i, j): two views of the same object that differ in their augmentations. The indicator 𝟙[k ≠ i] in the denominator takes the value 0 or 1 and marks the terms where k is not equal to i. The sum in the denominator therefore runs only over the other vectors; the similarity of a vector with itself, which always equals 1, is left out. τ is an additional tunable parameter called temperature.

Two augmented views are created for each object, and the loss described above is computed for them. As a result, information about the pairwise similarity to all other objects enters the loss only through the sum in the denominator.
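To make the computation concrete, here is a minimal NumPy sketch of this loss for a batch of N image pairs. It is an illustration under stated assumptions, not the authors' code: the function name `nt_xent_loss`, the row layout (row n of `z_a` and row n of `z_b` are two views of the same image) and the default temperature of 0.5 are all choices made for this example.

```python
import numpy as np


def nt_xent_loss(z_a, z_b, temperature=0.5):
    """Contrastive (NT-Xent) loss averaged over all 2N views.

    z_a, z_b: arrays of shape (N, D); row n of each is a view of image n.
    """
    n = z_a.shape[0]
    z = np.concatenate([z_a, z_b], axis=0)              # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)    # unit length: dot = cosine
    sim = z @ z.T / temperature                         # (2N, 2N) scaled similarities
    np.fill_diagonal(sim, -np.inf)                      # drop k == i from the sum

    # Row i's positive partner is the other view of the same image: i <-> i + N.
    positives = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    log_numerator = sim[np.arange(2 * n), positives]

    # Numerically stable log of the denominator (log-sum-exp over each row).
    row_max = sim.max(axis=1, keepdims=True)
    log_denominator = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))

    return float(np.mean(log_denominator - log_numerator))
```

Embeddings whose two views agree give a lower loss than unrelated ones, which is the signal the encoder is trained on.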

Large batch sizes (from 256 to 8,192) were used for training; augmentation doubled the number of objects per training step. The authors note that standard SGD with momentum can be unstable at such batch sizes, so they opted for LARS. They also used global batch normalization, aggregating the mean and variance across all the devices used in training.
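The idea behind LARS is to scale each layer's step by the ratio of its weight norm to its gradient norm, so that layers with very large or very small gradients still move by a comparable relative amount. Below is a rough NumPy sketch of one such update. The function name, the default hyperparameters and the in-place list-of-arrays interface are assumptions made for illustration; real implementations differ in details such as which parameters are excluded from the adaptation.

```python
import numpy as np


def lars_step(weights, grads, velocities, lr=0.1, momentum=0.9,
              weight_decay=1e-4, trust_coefficient=0.001):
    """One LARS update, applied in place to lists of per-layer arrays."""
    for w, g, v in zip(weights, grads, velocities):
        w_norm = np.linalg.norm(w)
        g_norm = np.linalg.norm(g)
        # Layer-wise trust ratio; fall back to 1 when either norm is zero.
        if w_norm > 0 and g_norm > 0:
            local_lr = trust_coefficient * w_norm / (g_norm + weight_decay * w_norm)
        else:
            local_lr = 1.0
        # Momentum update with the layer-specific learning rate.
        v *= momentum
        v += lr * local_lr * (g + weight_decay * w)
        w -= v
```

With weight decay switched off, the size of the step depends only on the weight norm, not on how large the gradient happens to be.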

Note also that the similarity is computed not on the vectors coming straight from the encoder but on the output of a small fully connected network that follows it (two fully connected layers). The authors report that this works better than directly comparing the embeddings produced by the convolutional network.
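The split between encoder output and projection output can be sketched as follows. This is a toy NumPy illustration, assuming a ReLU between the two layers as in the SimCLR paper; the function name, the random weights and the tiny dimensions (32 and 8) are placeholders chosen for this example.

```python
import numpy as np


def projection_head(h, w1, b1, w2, b2):
    """Two-layer MLP: z = ReLU(h @ w1 + b1) @ w2 + b2, applied row-wise."""
    hidden = np.maximum(h @ w1 + b1, 0.0)
    return hidden @ w2 + b2


rng = np.random.default_rng(0)
h = rng.normal(size=(4, 32))                  # stand-in for encoder outputs
w1, b1 = rng.normal(scale=0.1, size=(32, 32)), np.zeros(32)
w2, b2 = rng.normal(scale=0.1, size=(32, 8)), np.zeros(8)

z = projection_head(h, w1, b1, w2, b2)        # fed to the contrastive loss
# After pretraining, downstream tasks use h; z is only needed for the loss.
```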

The gif below explains the learning process. The same network is fed first a batch of images and then their augmented copies. The resulting vectors are compared, and the network learns to place embeddings from the same source image as close together as possible in the feature space while pushing apart embeddings of different source images.

SimCLR learning process

The authors tested numerous augmentations, as shown below. At most two transformations were combined at a time.

The result of the augmentation application

Out of all augmentation combinations, the ones that brought the best result were studied closely.

The brighter the rectangle, the greater the contribution from the combination of augmentations on the horizontal + vertical axes to the trained model’s accuracy.


The table below compares the accuracy of the resulting model with that of other methods when labels are available for only 1% or 10% of the training data.

Comparison of the resulting model with other methods

SimCLRv2: an updated version with a teacher network

Original article: Big Self-Supervised Models are Strong Semi-Supervised Learners

In the fall of 2020, the authors published an article describing an updated version of the approach. Here are the three main points that set it apart from the initial one:

  • Deeper and wider networks are used as encoders.

  • The fully connected network at the output (the projection head) has more layers.

  • A memory mechanism for buffering negative examples is added.

Unlike in the first version, fine-tuning for specific classes starts from the first layer of this MLP. Previously, the whole MLP was discarded, and the embeddings received from the encoder were used instead.

Also, the new version has three training stages (compared to two in the initial one):

  1. Unsupervised pretraining of a large network. The network is trained without any specific task and learns general-purpose features.

  2. Supervised fine-tuning on a small set of labeled data for a specific task.

  3. Training a second, more compact network (the student) on the labels predicted by the first, large network (the teacher).

SimCLRv2 with a teacher network, which is pretrained on unlabeled data using the original SimCLR procedure, and a more compact student network, trained on the labels provided by the teacher.
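The third stage can be sketched as a distillation loss: the student is trained to match the class probabilities that the fine-tuned teacher predicts for unlabeled images. The NumPy snippet below is a simplified illustration; the function names and the `temperature` argument are assumptions for this example, and in practice the logits would come from the two networks.

```python
import numpy as np


def log_softmax(logits, temperature=1.0):
    """Row-wise log-softmax of temperature-scaled logits, numerically stable."""
    scaled = logits / temperature
    scaled = scaled - scaled.max(axis=1, keepdims=True)
    return scaled - np.log(np.exp(scaled).sum(axis=1, keepdims=True))


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Cross-entropy between teacher soft labels and student predictions."""
    teacher_probs = np.exp(log_softmax(teacher_logits, temperature))
    student_log_probs = log_softmax(student_logits, temperature)
    return float(-(teacher_probs * student_log_probs).sum(axis=1).mean())
```

The loss is smallest when the student reproduces the teacher's distribution, so no ground-truth labels are needed at this stage.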


The table below compares the accuracy of the resulting model with that of other methods and of the initial version of the algorithm when labels are available for only 1% or 10% of the training data.

A study of SimCLR properties

Original article: Intriguing Properties of Contrastive Losses

The Brain Team at Google Research published an article covering the general properties of this approach to training. In it, they generalized the loss function used when comparing objects. The authors also studied images containing several objects at once, creating an artificial dataset for this purpose; see examples below. A network was first trained on this dataset without labels using the SimCLR methodology and then trained on labeled examples, where each image featured only one object (a digit).

MultiDigit dataset example

The final model quality was assessed using images with a single object. The results of comparing the resulting solution with the classical supervised learning approach are shown below.

Comparison of SimCLR and Supervised approaches depending on the number of digits in the image

Another interesting study covered in the article concerns SimCLR’s ability to teach the network to extract local features. For this purpose, K-means clustering was applied to the features produced by different layers of the trained network. The results are shown below.

Highlighting of local features by the network. Colors are used to highlight areas of the image that are close to each other in the feature space
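A rough idea of how such a visualization can be produced: treat the activation at every spatial position as a feature vector, cluster those vectors, and color each position by its cluster. The NumPy sketch below assumes a single (H, W, C) activation map is already available; the plain K-means loop and both function names are written for this example and are not taken from the paper's code.

```python
import numpy as np


def kmeans(points, k, n_iters=20, seed=0):
    """Plain K-means on an (n_points, n_features) array."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Squared distance from every point to every center: (n_points, k).
        distances = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = distances.argmin(axis=1)
        for c in range(k):
            members = points[labels == c]
            if len(members) > 0:      # keep the old center if a cluster empties
                centers[c] = members.mean(axis=0)
    return labels, centers


def segment_feature_map(feature_map, k):
    """Cluster the per-location feature vectors of one (H, W, C) map."""
    h, w, c = feature_map.shape
    labels, _ = kmeans(feature_map.reshape(h * w, c), k)
    return labels.reshape(h, w)
```

Positions that end up with the same label are close in feature space and would share a color in the figure.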

The last experiment, involving feature suppression, consists of three parts.

For the first part, an artificial dataset (DigitOnImageNet) was created by overlaying digits from the MNIST dataset on each picture from the ImageNet dataset, as shown below. The question was whether the network would learn the simpler features of the MNIST digits or the complex ImageNet patterns. Each unique ImageNet image was matched with a unique MNIST image. The number of unique digit images used was a hyperparameter; its effect on model quality was studied separately.

DigitOnImageNet. The dataset uses images from ImageNet with superimposed white digits from the MNIST dataset
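Building one such sample can be sketched in a few lines of NumPy. How exactly the digits are blended into the picture is not stated here, so the per-pixel maximum used below is an assumption, as are the function name and the [0, 1] value range.

```python
import numpy as np


def overlay_digit(image, digit, top, left):
    """Paste a grayscale digit (h, w) onto an RGB image (H, W, 3), both in [0, 1]."""
    out = image.copy()
    h, w = digit.shape
    region = out[top:top + h, left:left + w, :]
    # The maximum keeps the background wherever the digit image is dark.
    out[top:top + h, left:left + w, :] = np.maximum(region, digit[:, :, None])
    return out
```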

The second dataset contained 112×112 px images with two digits of different sizes. The first digit was always 20×20 px, while the second ranged from 20×20 to 80×80 px. This tested the hypothesis that the spatial size of a feature affects how well it is learned.

MultiDigits. A set of 2 random digits from the MNIST dataset on a single image, same scale (top) and different scale (bottom)
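A sample of this kind can be assembled as shown below. The sketch assumes square digit images, nearest-neighbour resizing and fixed positions in opposite corners; the actual placement and resizing used by the authors are not described here, and both function names are made up for this example.

```python
import numpy as np


def resize_nearest(digit, size):
    """Nearest-neighbour resize of a square 2-D array to (size, size)."""
    idx = np.arange(size) * digit.shape[0] // size
    return digit[np.ix_(idx, idx)]


def make_two_digit_image(digit_a, digit_b, size_b, canvas=112, size_a=20):
    """Place a fixed-size digit top-left and a variable-size digit bottom-right."""
    image = np.zeros((canvas, canvas), dtype=digit_a.dtype)
    image[:size_a, :size_a] = resize_nearest(digit_a, size_a)
    image[canvas - size_b:, canvas - size_b:] = resize_nearest(digit_b, size_b)
    return image
```

Sweeping `size_b` from 20 to 80 while keeping `size_a` at 20 reproduces the size contrast the experiment is about.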

The third dataset was created by adding an extra channel to the normal images, containing a random number from a predetermined range. To simplify training, the number was replicated uniformly across the whole image and encoded in binary. Unlike the channels of the original image, this channel was not augmented in any way during SimCLR training.

RandBit. A dataset of images with an added channel featuring a random digit, bearing no informative value.
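One way to picture this construction is the NumPy sketch below. The exact layout of the bits is an assumption: here every bit gets its own constant plane appended after the image channels, and the function name and the `n_bits` parameter are hypothetical.

```python
import numpy as np


def add_random_bits(image, n_bits, rng):
    """Append n_bits constant planes that encode one random integer in binary."""
    value = int(rng.integers(0, 2 ** n_bits))
    bits = [(value >> i) & 1 for i in range(n_bits)]       # least significant first
    h, w, _ = image.shape
    planes = np.broadcast_to(np.array(bits, dtype=image.dtype), (h, w, n_bits))
    return np.concatenate([image, planes], axis=2), value
```

Because the extra planes are identical in both augmented views of an image but differ between images, they are an easy shortcut for telling images apart while saying nothing about their content.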


In the first experiment, as the number of unique digit images overlaid on the ImageNet pictures increased, classification of the digits improved, while classification of the ImageNet content worsened. Several values of the temperature parameter in the loss function were also tested; they affected how quickly quality dropped. For comparison, a supervised model was found to be almost unaffected by the digits added to the images.

DigitOnImageNet. Graphs of quality dependence on the number of unique digit images and the temperature parameter

The second experiment tested how well a model trained on the dataset described above classifies a digit of a given size. The effect became noticeable once the larger digit reached 50×50 px: from that point on, the classification quality of the smaller digit fell.

MultiDigits. The classification quality of the digit depends on its size in the image, as well as the temperature parameter

The experiment with the third dataset also showed a drop in quality as the range from which the random number was drawn increased. All of the results described can be seen in the figures below.

RandBit. Image classification quality depends on variance when selecting a random digit in Channel 4


Barlow Twins by Facebook Research

Original article: Barlow Twins: Self-Supervised Learning via Redundancy Reduction

The method is named after the neuroscientist Horace Barlow, or more specifically after his redundancy-reduction principle.

Like SimCLR and other self-supervised learning methods, Barlow Twins processes the embeddings of the original and augmented image. According to the authors, what sets their approach apart is a new loss function.

First off, let’s explain the notation used below. 𝓣 stands for the set of augmentations applied to images. Z^A and Z^B are the embeddings obtained at the network output for two different augmentations drawn from 𝓣. The superscripts A and B indicate which augmentation produced the vector.

C is the cross-correlation matrix computed between the outputs of the two identical networks along the batch dimension. In practice, the code uses a single network to compute all the embeddings. Each element of the matrix is the correlation, taken over the batch, between one output component for augmentation A and one output component for augmentation B.
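In the notation of the Barlow Twins paper, and assuming the embeddings have been mean-centered along the batch dimension, an element of the matrix can be written as:

```latex
\mathcal{C}_{ij} =
  \frac{\sum_b z^A_{b,i} \, z^B_{b,j}}
       {\sqrt{\sum_b \left(z^A_{b,i}\right)^2} \, \sqrt{\sum_b \left(z^B_{b,j}\right)^2}}
```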

b is the index of the element within the batch, and i, j index the dimensions of the networks’ output vectors. The resulting matrix is square, its size equal to the dimensionality of the network’s output, with values between -1 (perfect anti-correlation) and 1 (perfect correlation).
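A minimal NumPy sketch of this matrix for two batches of embeddings of shape (B, D) follows. The paper computes it on embeddings that are mean-centered along the batch dimension, which the sketch does explicitly; the function name and the small `eps` guard against division by zero are additions made for this example.

```python
import numpy as np


def cross_correlation(z_a, z_b, eps=1e-12):
    """Cross-correlation matrix (D, D) between two embedding batches (B, D)."""
    # Center every output dimension over the batch.
    z_a = z_a - z_a.mean(axis=0)
    z_b = z_b - z_b.mean(axis=0)
    numerator = z_a.T @ z_b                      # sum over b of z_a[b, i] * z_b[b, j]
    norm_a = np.sqrt((z_a ** 2).sum(axis=0))     # shape (D,)
    norm_b = np.sqrt((z_b ** 2).sum(axis=0))     # shape (D,)
    return numerator / (np.outer(norm_a, norm_b) + eps)
```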

Intuitively, the first term of the loss function, by trying to equate the diagonal elements of the cross-correlation matrix to 1, makes the embedding invariant to the distortions applied (showing the network that it is the same object). The second term, by trying to equate the other elements of the matrix to 0, decorrelates the different components of the embedding vector, reducing the redundancy between them.
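In symbols, the two terms are Σ_i (1 − C_ii)² for the diagonal and λ Σ_i Σ_{j≠i} C_ij² for everything else, as in the Barlow Twins paper. A minimal sketch, assuming C is already available as a square NumPy array; the function name `barlow_twins_loss` and the parameter name `lambda_` are made up for this example, with the default taken from the value of λ the authors report.

```python
import numpy as np


def barlow_twins_loss(c, lambda_=5e-3):
    """Loss computed from a square cross-correlation matrix c of shape (D, D)."""
    diagonal = np.diag(c)
    invariance = ((1.0 - diagonal) ** 2).sum()       # pull C_ii towards 1
    off_diagonal = c - np.diag(diagonal)             # zero out the diagonal
    redundancy = (off_diagonal ** 2).sum()           # pull C_ij, i != j, towards 0
    return float(invariance + lambda_ * redundancy)
```

An identity matrix gives a loss of exactly zero: every component agrees perfectly across the two views and carries no information about any other component.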

The entire pipeline, shown in the figure below, is almost identical to the approach described in SimCLR.

A ResNet-50 is used as the encoder that turns pictures into vectors. It has no classification head; instead, it is followed by a fully connected three-layer network with an internal dimension of 8,192. This add-on is used only during training. For the final classification, only the core (convolutional) part of the network is kept, and a new head with as many outputs as there are classes is added. The head is then fine-tuned to classify embeddings using whatever labeled data is available.

The LARS optimizer was used in training. The training took 1,000 epochs with a batch size of 2,048. The learning rate was 0.2 for the weights and 0.0048 for the biases and batch normalization parameters. The first 10 epochs were a learning rate warm-up period, after which the learning rate was reduced by a factor of 1,000 following a cosine decay schedule. The authors also searched for the optimal λ parameter (it is not mentioned how), settling on λ = 5 × 10⁻³. The entire training took 124 hours on 32 NVIDIA V100 GPUs.
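The shape of such a schedule is easy to sketch with the standard library alone. The snippet assumes a linear warm-up over the first 10 epochs followed by a cosine decay down to 1/1,000 of the base rate, which is how the Barlow Twins paper describes it; the function name, the per-epoch granularity and the exact warm-up formula are simplifications made for this example.

```python
import math


def learning_rate(epoch, base_lr=0.2, warmup_epochs=10, total_epochs=1000,
                  final_factor=1e-3):
    """Linear warm-up, then cosine decay from base_lr to base_lr * final_factor."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))   # goes from 1 to 0
    end_lr = base_lr * final_factor
    return end_lr + (base_lr - end_lr) * cosine
```

The same function can be called with `base_lr=0.0048` to obtain the schedule for the biases and batch normalization parameters.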

The authors report that the method outperforms many of the alternatives they compare it against, including the initial SimCLR version.


Fine-tuning using the full ImageNet training set

Semi-supervised learning with a small portion of labeled images

The authors also separately studied the effect of changing certain parameters during training on the final result.

The accuracy of this approach has a local maximum at a batch size of 1,024, which sets it apart from the other approaches compared over this range of values.

Batch size change: when studying this graph, consider both the sample size and the computing power used

Increasing the dimensionality of the final embedding is expected to improve the final prediction accuracy. In the alternative approaches, the dependence has the same character but is less pronounced.

Changing the length of the final embedding

Reducing the number of augmentations lowers accuracy in all the approaches considered, but to different degrees.

Removing some augmentations

These experiments are less thorough than those in Intriguing Properties of Contrastive Losses by the Google Research team. Also, despite the improved accuracy claimed for some settings, the method still loses to its counterparts in others.



Both approaches described are noteworthy. The final results are more likely to depend on your specific application than on the accuracy demonstrated on benchmark datasets. No less important is the framework in which each method is implemented:

  • If you primarily use PyTorch, use the Barlow Twins code. That will make it easier to integrate the result and modify the training code.

  • If you want to use TensorFlow, or have a TPU available, go for SimCLR, seeing as it has already proven itself on Kaggle.

We at FunCorp prefer PyTorch, so we opted for Barlow Twins. We trained on a set of 8.5 million unlabeled images. The result was a feature space in which subsets of images with common patterns can be picked out using fairly simple clustering methods. These vectors are already being used to select personalized push notifications.