# Image Self-Supervised Training With PyTorch Lightning

(You can also view this post in Google Colab)

Self-Supervision is the current hotness of deep learning. Yes, deep networks and transfer learning are now old hat — you need to include self-supervised somewhere if you want to get those big VC dollars. Like transfer learning, though at its core it’s a very simple idea: there is so much data in the world — how can we use it without the big expense of getting humans to label it all? And the answer really does feel like cheating. Self-supervision is essentially “get the computer to automatically add labels to all your data, train a network on that, and then use transfer learning on the task you actually want to solve.” That’s it. The only interesting bits are how to decide what labels you add to what is called the “pretext task”, but the technique is surprisingly effective, especially in image and text-based problems where the Internet provides an almost endless supply of data.

Let’s have a look at the two main approaches to image self-supervised learning that are popular right now — rebuilding the original input from a distorted input, and automatically adding labels to data and training using those synthetic labels

### Reconstructing & Augmenting The Input

If you remember our look at the super-resolution architectures, they’re taking a small image and producing a larger, enhanced image. A self-supervised dataset for this problem can be fairly easily obtained by simply looking at the problem in the opposite way: harvest images from the Internet, and create smaller versions of them. You now have training images and the ground truth labels (the original images). If you’re building a model that colourizes images, then you grab colour images…and turn them into black and white ones!

You can extend this to a more general principle, where you take an image, apply a series of transforms to that image, and then train a neural network to go from the manipulated image to the original. You’ll end up with some sort of U-Net-like architecture, but after you’ve trained the network, you can throw away the ‘decoder’ half and use the ‘encoder’ part for your actual task by adding a Linear layer or two on top of the features you obtain at the bottom of the ‘U’.

You’ll want an augmentation technique that forces the architecture to learn things like how to structure elements of images, how to in-paint missing parts of an image, correct orientations, and so on. Here’s a couple to get you started

#### CutOut / Random Erasing

Perhaps the easiest to apply is simply removing part of an image and getting the model to restore it. This approach is often known as CutOut, and was shown to improve model performance with classification tasks in its introductory paper “Improved Regularization of Convolutional Neural Networks with Cutout”.

And it’s rather easy to apply, because it’s now included as a torchvision transform by default! You can just use:

torchvision.transforms.RandomErasing(p, scale, ratio value, inplace)


This can be slotted into a transform pipeline as we’ve seen many times throughout the book. The parameters you can set are:

• p — the probability of the transform taking place
• scale — range of proportion of erased area against input image.
• ratio — range of aspect ratio of erased area.
• value — the value that will be used in the erased box. Default is 0. If given a single integer, that integer will be used. A tuple of length 3 will make the transform use values within for replacing R, G, and B channels. If passed the string "random", each pixel in the box will be replaced with a random value.
• inplace — boolean to make this transform inplace. Default set to False.

In general, you’ll probably want to use the random strategy for erasing details from an image.

#### Crappify

Crappify is a fun idea from the fast.ai project which literally ‘crappifies’ your images. The concept is simple: pack a transform function with resizing, adding text, and JPEG artefacting, or anything else you decide to add to ruin the image, and then train the network to restore things back to the original.

### Automatically Labelling Data

The full image-based based self-supervision works very well, but you could say that it’s a little wasteful in a classification task; you end up training a full U-Net and throwing half of it away. Is there a way we can we be lazier and still do self-supervision?

Yes! And it’s what we’re going to spend the rest of this section implementing. Consider this image:

Okay, so it’s another picture of Helvetica the cat, but we would need a human annotator to give us the cat label. But we can take this image, transform it, and give it a meaningful label at the same time.

image_90

We have given this new image the label of image_90 to indicate that it has been indicated by 90º. But no human was needed in this (trivial) labelling. We can build a completely synthetic classification task, where we can build a training dataset and corresponding labels entirely programatically. We don’t need to build a U-Net block because all we’re training is a simple classification task; there’s no image reconstruction. But in order to learn how to classify correctly, the model will have to learn how to recognize what way up a cat normally is, and this pre-trained model can then be used on our actual classification task.

We’re going to build this approach to self-supervision, but we’re going to do it with a slightly higher-level framework than PyTorch. Enter PyTorch Lightning!

## PyTorch Lightning, or A Little Help From The Internet

PyTorch Lightning is a wrapper around PyTorch that handles a lot of the standard PyTorch boilerplate that you end up writing for every project (e.g. training, test, and validation loops, determining whether a model should be in eval or not, setting up data, and so on). Like fast.ai, it has an extensible callback system that allows you to hook custom code during almost any part of the training cycle, so you end up with most of the power of PyTorch, but without having to rewrite train() every time you start a new project. Instead, you end up just doing this to train a custom model:

from pytorch_lightning import Trainer

model = LightningModel()
trainer = Trainer(gpus=1, num_nodes=1)
trainer.fit(model)


Some people prefer working with pure PyTorch all the time, but I definitely see a lot of value in Lightning, as it does remove a lot of the error-prone tedium of writing training code while still remaining flexible enough for research purposes. I personally write most of my deep learning code either with Lightning or fast.ai (the new fast.ai2 library even has a tiered layer of APIs that allows you to delve deeper when you need to but still use their powerful higher-level abstractions for most of your work) rather than in raw PyTorch.

Don’t worry, though, because as we’ll see, building a model with PyTorch Lightning isn’t that much different than what we’ve been doing throughout the rest of the book. We just don’t need to worry about the training quite so much!

## Light Leaves, ResNet Sees

In order to demonstrate self-supervised training, we’re going to use a smaller version of ImageNet called Imagenette. This dataset contains images from 10 classes of the larger set, and was constructed by Jeremy Howard as a way of being able to quickly test new ideas on a representative sample of ImageNet rather than having to spend a considerable amount of time training on the whole thing. We’ll be using the full-sized version for our model, which means a 300Mb download. Let’s declare our imports and download Imagenette.

!pip install pytorch-lightning
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import pytorch_lightning as pl
from PIL import Image
from pathlib import Path
from torchvision import transforms
import torchvision.transforms.functional as TF
import random

!wget https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-320.tgz
!tar xvzf imagenette2-320.tgz


### A Self-Supervised Dataset, As A Treat

You, sobbing: “You can’t just point at a picture and call it a label!”

Me, an intellectual, pointing at a cat rotated ninety degrees: “Label.”

Even though we’re using PyTorch Lightning, we’ll construct our datasets in the usual way with the Dataset class. When an image is requested from the dataset, we will either simply return a tensor version of the image with the label 0, or randomly apply a rotational transform through 90, 180, or 270 degrees, or flipping the image’s axis either horizontally or vertically. Each of these potential transforms has a separate label, giving us six potential labels for any image. Note that we’re not doing any normalization in this pipeline to keep things relatively simple, but feel free to add the standard ImageNet normalization if you desire.

class RotationalTransform:
def __init__(self, angle):
self.angle = angle

def __call__(self, x):
return TF.rotate(x, self.angle)

class VerticalFlip:
def __init__(self):
pass
def __call__(self, x):
return TF.vflip(x)

class HorizontalFlip:
def __init__(self):
pass
def __call__(self, x):
return TF.hflip(x)


We’ll then wrap those transforms up inside a Dataset class, which will apply a chosen transformation when __getitem__ is called, as well as returning the correct label for that transform.

class SelfSupervisedDataset(object):
def __init__(self, image_path=Path("imagenette2-320/train")):
self.imgs = list(image_path.glob('**/*.JPEG'))
self.class_transforms = [RotationalTransform(0), RotationalTransform(90),
RotationalTransform(180), RotationalTransform(270),
HorizontalFlip(),VerticalFlip()]
self.to_tensor = transforms.Compose([transforms.ToTensor()])
self.classes = len(self.class_transforms)

def __getitem__(self, idx):
img = Image.open(self.imgs[idx])
label = random.choice(range(0, self.classes))
img = img.convert("RGB")
# Resize first, then apply our selected transform and finally convert to tensor
transformed_image = self.to_tensor(self.class_transforms[label](transforms.Resize((224,224))(img)))
return transformed_image, label

def __len__(self):
return len(self.imgs)


### ResNet-34 Go Brrr

With our dataset completed, we’re now ready to write the LightningModule that will be the model we train on this data. Writing a model in PyTorch Lightning is not too much different from the standard PyTorch approach we’ve seen throughout the book, but there are some additions that make the class more self-contained and allow PyTorch Lightning to do things like handle training for us. Here’s a skeleton LightningModule:

class SkeletonModel(pl.LightningModule):
def __init__(self):
pass
def forward(self, x):
pass
pass
def training_step(self, batch, batch_idx):
pass
def configure_optimizers(self):
pass
def prepare_data(self):
pass


As you can see, we have our familiar __init__ and forward methods, which work in exactly the same way as before. But we now also have methods for various parts of the training cycle, including setting up dataloaders and performing training and validation steps. We also have a prepare_data method which can do any preprocessing needed for datasets, as well as configure_optimizer for setting up our model’s optimizing function.

PyTorch Lightning includes hooks for lots of other parts of the training process (e.g. handling validation steps and DataLoaders, running code at the start or end of training epochs, and lots more besides), but these are the minimal parts we’ll need to implement.

Now that we know the structure, let’s throw together a model based on ResNet-34 with a small custom head. Note that we’re not using a pretrained ResNet model here; we’re going to be training from scratch. We’ll also add another method, validation_epoch_end, which will update statistics for loss and accuracy in our validation set at the end of every epoch.

class SelfSupervisedModel(pl.LightningModule):
def __init__(self, hparams=None, num_classes=6, batch_size=64):
super(SelfSupervisedModel, self).__init__()
self.resnet = torchvision.models.resnet34(pretrained=False)
self.resnet.fc = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, num_classes))
self.batch_size = batch_size
self.loss_fn = nn.CrossEntropyLoss()
if "lr" not in hparams:
hparams["lr"] = 0.001
self.hparams = hparams

def forward(self, x):
return self.resnet(x)

def training_step(self, batch, batch_idx):
inputs, targets = batch
predictions = self(inputs)
loss = self.loss_fn(predictions, targets)
return {'loss': loss}

def configure_optimizers(self):

def prepare_data(self):
self.training_dataset = SelfSupervisedDataset()
self.val_dataset = SelfSupervisedDataset(Path("imagenette2-320/val"))

def validation_step(self, batch, batch_idx):
inputs, targets = batch
predictions = self(inputs)
val_loss = self.loss_fn(predictions, targets)
_, preds = torch.max(predictions, 1)
acc = torch.sum(preds == targets.data) / (targets.shape[0] * 1.0)
return {'val_loss': val_loss, 'val_acc': acc}

def validation_epoch_end(self, outputs):
avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
avg_acc = torch.stack([x['val_acc'].float() for x in outputs]).mean()
logs = {'val_loss': avg_loss, 'val_acc': avg_acc}
return {'progress_bar': logs}


Having defined the model, we can start training by using PyTorch Lightning’s Trainer class. We’ll pass in max_epochs to only train for 5 epochs with the learing rate of 0.001 (though the framework comes with lr_finder method to find an appropriate learning rate that uses the same approach that we have been using in the book so far and what’ll you’ll find in fast.ai). We’ll also need to tell the trainer how many GPUs we have available; if more than one is present and available, then the class will use as many as directed for multi-GPU training.

model = SelfSupervisedModel({'lr': 0.001})
trainer = pl.Trainer(max_epochs=5, gpus=1)
trainer.fit(model)
trainer.save_checkpoint("selfsupervised.pth")


We’ve now trained for 5 epochs on our pretraining task. What we need to now is to train on the actual task we’re trying to solve - not to classify for rotations or flipping, but to determine the ImageNet class an image belongs to. We can do this training simply by swapping out the current dataloaders for ones that returns the images and the labels for the provided Imagenette dataset. We do this using the old faithful ImageFolder:

tfms = transforms.Compose([
transforms.Resize((224,224)),
transforms.ToTensor()
])

imagenette_training_data = torchvision.datasets.ImageFolder(root="imagenette2-320/train/", transform=tfms)

imagenette_val_data = torchvision.datasets.ImageFolder(root="imagenette2-320/val/", transform=tfms)


We’ll then load in our saved checkpoint, replacing the original training data with the new DataLoader, and we’ll replace the head of the classifier so it now is predicting the 10 ImageNet labels instead of our self-supervised labels. The model will be trained for a further 5 epochs on the supervised training data.

model = model.load_from_checkpoint("selfsupervised.pth")
model.resnet.fc[2] = nn.Linear(256,12)


Training will be performed using the Trainer class again, but this time we’ll pass in these new training and validation dataloaders, which will override the ones we defined in the actual class (and prepare_dataset will not be called by PyTorch Lightning during this training phase).

trainer = pl.Trainer(max_epochs=5, gpus=1)


The model’s accuracy its final (10th) epoch of training ended up around 54%. Which isn’t too bad considering that we have only trained 5 epochs on the data itself (and did no augmentation on that pipeline). But was it worth it? Well, let’s check! If we recreate our model from scratch and just pass in the non-supervised dataloaders for training and validation, training for 10 epochs, we can have a comparison between that result and our self-supervised model.

standard_model = SelfSupervisedModel({'lr': 0.001})
trainer = pl.Trainer(max_epochs=10, gpus=1)


On my training run, it ended up with a best accuracy over 10 epochs of 33%. We can see that pre-training with our self-supervised dataset offers a greater performance despite being trained on the final task for only 5 epochs.

## One Step (or more) Beyond

This has been dipping a toe into the waters of self-supervised learning. If you want to go deeper, you could experiment further with the framework in this chapter. Can you improve performance by adding other transformations to the initial pipeline, perhaps? Or augmentation in the training on the task fine-tuning stage? Or training with larger ResNet architectures?

In addition, I urge you to look contrastive learning, which is a technique where the model is trained by being shown augmented and non-augmented images and another image of a completely different class. This turns out to be another powerful way of extracting as much as you can from your existing data and, as part of Google’s SimCLR system, is currently the state-of-the-art when it comes to training models on ImageNet.

I don’t normally hype work things on this blog, but seeing as how I had a part in making it happen, I’d be remiss if I didn’t link to Lucidworks’ new Smart Answers, a deep learning-powered question and answer system. We put a lot of work into making it really easy to get started either with or without training on your existing data and we’ll be continuing to improve it for quite some time to come.

Having said all that, I’m not sure if there’s much else to talk about this week. I should have the last part of Chapter 9.5 finished next weekend, so if nothing else, you’ll have a Bank Holiday Weekend of just under 3,000 words on the burning subject of Image Classification via Self-Supervision. Get in, lads!1

Also, you should be concerned, because I’m toying with the idea of writing a long and rambling post about Community, a show which is not the best American sitcom of the past 20 years (because that’s Brooklyn-Nine-Nine), but is probably my favourite. But don’t worry! There will be vintage clips of British shows in that post as well! I know that’s what you’re really here for.

In the meantime, stay safe, and remain indoors.

1. Of course, the quantization post a couple of weeks ago was one of my most shared tweets / pages ever, so maybe you really are hungry for that PyTorch action. I’m hoping to get you some reinforcement learning and, if I can pull it off, a silly but useful crossover of my knowledge of Transformers…and Transformers. [return]

# Give me the hope to run out of steam

And we’re back to non-deep learning content again1. I’m currently watching my cat attempt to hunt birds from the basement window, losing her footing and falling onto the pillow that has been down here since Thanksgiving. And then getting back up again. It’s entertaining. Welcome to lockdown…week…infinity?

This has been a very different week from most quarantine weeks, in that the house has been mostly full! People recovering from surgery, visitors, cinnamon rolls, a disappointing Detroit-style pizza2, the return of the deer family, a depressing Catalyst-related Spark bug, the completion of the TF UK #82 line up of The Wreckers3, a shed-load of Brook-Rose novels, and my first attempt at Sichuan cooking. Which, really, makes staying home sound a lot more interesting than it actually is. And I didn’t mention the nights of not sleeping, or the hours of staring at log files. Or that my turnips rotted.

Still, aside from the weight gain, the general stress, the mounting apprehension about November, and a host of other things, everything is fine! Totally!

🥺🥺🥺

Come back next week for more uplifting content!

1. By the way, I’m sorry that the LaTeX for the equations didn’t render properly in the last post. I did attempt to upgrade my Hugo install and include KaTeX rendering…but ended up completely destroying my rendering pipeline due to the Hyde-X theme not having been updated for a few years now. I…will do a blog redesign at some point with a more modern theme, but for now, I’ve downgraded twenty versions of Hugo to get the pipeline working again. Yay, software! [return]
2. Probably expected when you use NY pizza dough for it, mind you… [return]
3. WRECK’N’RULE! Anyway, better than the other news coming from Hasbro and Takara this week… [return]

# Big Models Hate This One Weird Trick! (Quantization, T5, & PyTorch 1.4)

Here’s the notebook in Google Colab

As we know, models can be big lumbering beasts, comprised of millions of parameters (both weights and activations) that require lots of matrix multiplications to take an input and arrive at an answer. And for most of our work so far, that’s been fine! We have mighty GPUs that can handle these burdens with ease.

But what if we didn’t? We often package a model up for production inference usage so that it only runs on the CPU. And what if we wanted to run our model on a smaller embedded platform? Suddenly, both the size of the model and all those floating-point operations become a little more problematic. Thankfully, there’s a trick we can perform that makes our model smaller and faster, normally with the trade off with some accuracy. Even better, PyTorch allows us to perform this one weird trick with just one line of code, with some other approaches for squeezing even more performance. Let’s have a quick look at quantization.

## Quantization

Every parameter in our model is a 32-bit floating point number, taking up 4 bytes of memory. That’s not a lot, but it can soon add up. Let’s have a look at Google’s recent T5 transformer-based model, which has a t5-small variant that’s available in the transformers library.

import torch
from transformers import pipeline, T5ForConditionalGeneration

def count_parameters(model):
return sum(p.numel() for p in model.parameters())

base_model = T5ForConditionalGeneration.from_pretrained("t5-small")

param_count = count_parameters(base_model)

memory = (param_count * 4) / (1024 *1024)
memory


230.8154296875


Even with the smallest pre-trained T5 weights, our model is roughly 60m parameters and weighs in at a whopping 230Mb!

However, what if we decided that we didn’t need the full precision of our floating-point parameters? If our parameters could be restricted to within a certain range of values, then we could use a smaller type of number representation to store the parameters. This quantization is the key to speeding up our inference time and reducing the memory footprint of our models. What we tend to aim for is to quantize down from a 32-bit floating point to an 8-bit integer. The basic idea is:

$$x{int8} = (\frac{x{float32}}{x{scale}} + x{offset})$$

Which is essentially just fitting the potential values of the parameters of a network to a line of $y = mx + c$, although due to the reduced resolution of the 8-bit integer, there’s only so many values a parameter now may take instead of the huge amount that a float32 value could be. PyTorch does its quantizing in a slightly more complicated affair that ensures that zero is always zero, but the basic idea is the same - we have a range of values that our parameters can take, and then find an appropriate pair $x{scale}$ and $x{offset}$ to provide 256 graduations to represent that range - or 255 if you think about PyTorch always keeping zero around.

At the moment (PyTorch 1.5), quantized layers are best supported with CNN and Linear layers. Thankfully, if we have a look at the model structure of T5, we can see a happy coincidence:

base_model

T5Model(
(shared): Embedding(32128, 512)
(encoder): T5Stack(
(embed_tokens): Embedding(32128, 512)
(block): ModuleList(
(0): T5Block(
(layer): ModuleList(
(0): T5LayerSelfAttention(
(SelfAttention): T5Attention(
(q): Linear(in_features=512, out_features=512, bias=False)
(k): Linear(in_features=512, out_features=512, bias=False)
(v): Linear(in_features=512, out_features=512, bias=False)
(o): Linear(in_features=512, out_features=512, bias=False)
(relative_attention_bias): Embedding(32, 8)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): T5LayerFF(
(DenseReluDense): T5DenseReluDense(
(wi): Linear(in_features=512, out_features=2048, bias=False)
(wo): Linear(in_features=2048, out_features=512, bias=False)
(dropout): Dropout(p=0.1, inplace=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(1): T5Block(
(layer): ModuleList(
(0): T5LayerSelfAttention(
(SelfAttention): T5Attention(
(q): Linear(in_features=512, out_features=512, bias=False)
(k): Linear(in_features=512, out_features=512, bias=False)
(v): Linear(in_features=512, out_features=512, bias=False)
(o): Linear(in_features=512, out_features=512, bias=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): T5LayerFF(
(DenseReluDense): T5DenseReluDense(
(wi): Linear(in_features=512, out_features=2048, bias=False)
(wo): Linear(in_features=2048, out_features=512, bias=False)
(dropout): Dropout(p=0.1, inplace=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(2): T5Block(
(layer): ModuleList(
(0): T5LayerSelfAttention(
(SelfAttention): T5Attention(
(q): Linear(in_features=512, out_features=512, bias=False)
(k): Linear(in_features=512, out_features=512, bias=False)
(v): Linear(in_features=512, out_features=512, bias=False)
(o): Linear(in_features=512, out_features=512, bias=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): T5LayerFF(
(DenseReluDense): T5DenseReluDense(
(wi): Linear(in_features=512, out_features=2048, bias=False)
(wo): Linear(in_features=2048, out_features=512, bias=False)
(dropout): Dropout(p=0.1, inplace=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(3): T5Block(
(layer): ModuleList(
(0): T5LayerSelfAttention(
(SelfAttention): T5Attention(
(q): Linear(in_features=512, out_features=512, bias=False)
(k): Linear(in_features=512, out_features=512, bias=False)
(v): Linear(in_features=512, out_features=512, bias=False)
(o): Linear(in_features=512, out_features=512, bias=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): T5LayerFF(
(DenseReluDense): T5DenseReluDense(
(wi): Linear(in_features=512, out_features=2048, bias=False)
(wo): Linear(in_features=2048, out_features=512, bias=False)
(dropout): Dropout(p=0.1, inplace=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(4): T5Block(
(layer): ModuleList(
(0): T5LayerSelfAttention(
(SelfAttention): T5Attention(
(q): Linear(in_features=512, out_features=512, bias=False)
(k): Linear(in_features=512, out_features=512, bias=False)
(v): Linear(in_features=512, out_features=512, bias=False)
(o): Linear(in_features=512, out_features=512, bias=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): T5LayerFF(
(DenseReluDense): T5DenseReluDense(
(wi): Linear(in_features=512, out_features=2048, bias=False)
(wo): Linear(in_features=2048, out_features=512, bias=False)
(dropout): Dropout(p=0.1, inplace=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(5): T5Block(
(layer): ModuleList(
(0): T5LayerSelfAttention(
(SelfAttention): T5Attention(
(q): Linear(in_features=512, out_features=512, bias=False)
(k): Linear(in_features=512, out_features=512, bias=False)
(v): Linear(in_features=512, out_features=512, bias=False)
(o): Linear(in_features=512, out_features=512, bias=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): T5LayerFF(
(DenseReluDense): T5DenseReluDense(
(wi): Linear(in_features=512, out_features=2048, bias=False)
(wo): Linear(in_features=2048, out_features=512, bias=False)
(dropout): Dropout(p=0.1, inplace=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
)
(final_layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(decoder): T5Stack(
(embed_tokens): Embedding(32128, 512)
(block): ModuleList(
(0): T5Block(
(layer): ModuleList(
(0): T5LayerSelfAttention(
(SelfAttention): T5Attention(
(q): Linear(in_features=512, out_features=512, bias=False)
(k): Linear(in_features=512, out_features=512, bias=False)
(v): Linear(in_features=512, out_features=512, bias=False)
(o): Linear(in_features=512, out_features=512, bias=False)
(relative_attention_bias): Embedding(32, 8)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): T5LayerCrossAttention(
(EncDecAttention): T5Attention(
(q): Linear(in_features=512, out_features=512, bias=False)
(k): Linear(in_features=512, out_features=512, bias=False)
(v): Linear(in_features=512, out_features=512, bias=False)
(o): Linear(in_features=512, out_features=512, bias=False)
(relative_attention_bias): Embedding(32, 8)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(2): T5LayerFF(
(DenseReluDense): T5DenseReluDense(
(wi): Linear(in_features=512, out_features=2048, bias=False)
(wo): Linear(in_features=2048, out_features=512, bias=False)
(dropout): Dropout(p=0.1, inplace=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(1): T5Block(
(layer): ModuleList(
(0): T5LayerSelfAttention(
(SelfAttention): T5Attention(
(q): Linear(in_features=512, out_features=512, bias=False)
(k): Linear(in_features=512, out_features=512, bias=False)
(v): Linear(in_features=512, out_features=512, bias=False)
(o): Linear(in_features=512, out_features=512, bias=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): T5LayerCrossAttention(
(EncDecAttention): T5Attention(
(q): Linear(in_features=512, out_features=512, bias=False)
(k): Linear(in_features=512, out_features=512, bias=False)
(v): Linear(in_features=512, out_features=512, bias=False)
(o): Linear(in_features=512, out_features=512, bias=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(2): T5LayerFF(
(DenseReluDense): T5DenseReluDense(
(wi): Linear(in_features=512, out_features=2048, bias=False)
(wo): Linear(in_features=2048, out_features=512, bias=False)
(dropout): Dropout(p=0.1, inplace=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(2): T5Block(
(layer): ModuleList(
(0): T5LayerSelfAttention(
(SelfAttention): T5Attention(
(q): Linear(in_features=512, out_features=512, bias=False)
(k): Linear(in_features=512, out_features=512, bias=False)
(v): Linear(in_features=512, out_features=512, bias=False)
(o): Linear(in_features=512, out_features=512, bias=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): T5LayerCrossAttention(
(EncDecAttention): T5Attention(
(q): Linear(in_features=512, out_features=512, bias=False)
(k): Linear(in_features=512, out_features=512, bias=False)
(v): Linear(in_features=512, out_features=512, bias=False)
(o): Linear(in_features=512, out_features=512, bias=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(2): T5LayerFF(
(DenseReluDense): T5DenseReluDense(
(wi): Linear(in_features=512, out_features=2048, bias=False)
(wo): Linear(in_features=2048, out_features=512, bias=False)
(dropout): Dropout(p=0.1, inplace=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(3): T5Block(
(layer): ModuleList(
(0): T5LayerSelfAttention(
(SelfAttention): T5Attention(
(q): Linear(in_features=512, out_features=512, bias=False)
(k): Linear(in_features=512, out_features=512, bias=False)
(v): Linear(in_features=512, out_features=512, bias=False)
(o): Linear(in_features=512, out_features=512, bias=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): T5LayerCrossAttention(
(EncDecAttention): T5Attention(
(q): Linear(in_features=512, out_features=512, bias=False)
(k): Linear(in_features=512, out_features=512, bias=False)
(v): Linear(in_features=512, out_features=512, bias=False)
(o): Linear(in_features=512, out_features=512, bias=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(2): T5LayerFF(
(DenseReluDense): T5DenseReluDense(
(wi): Linear(in_features=512, out_features=2048, bias=False)
(wo): Linear(in_features=2048, out_features=512, bias=False)
(dropout): Dropout(p=0.1, inplace=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(4): T5Block(
(layer): ModuleList(
(0): T5LayerSelfAttention(
(SelfAttention): T5Attention(
(q): Linear(in_features=512, out_features=512, bias=False)
(k): Linear(in_features=512, out_features=512, bias=False)
(v): Linear(in_features=512, out_features=512, bias=False)
(o): Linear(in_features=512, out_features=512, bias=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): T5LayerCrossAttention(
(EncDecAttention): T5Attention(
(q): Linear(in_features=512, out_features=512, bias=False)
(k): Linear(in_features=512, out_features=512, bias=False)
(v): Linear(in_features=512, out_features=512, bias=False)
(o): Linear(in_features=512, out_features=512, bias=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(2): T5LayerFF(
(DenseReluDense): T5DenseReluDense(
(wi): Linear(in_features=512, out_features=2048, bias=False)
(wo): Linear(in_features=2048, out_features=512, bias=False)
(dropout): Dropout(p=0.1, inplace=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(5): T5Block(
(layer): ModuleList(
(0): T5LayerSelfAttention(
(SelfAttention): T5Attention(
(q): Linear(in_features=512, out_features=512, bias=False)
(k): Linear(in_features=512, out_features=512, bias=False)
(v): Linear(in_features=512, out_features=512, bias=False)
(o): Linear(in_features=512, out_features=512, bias=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): T5LayerCrossAttention(
(EncDecAttention): T5Attention(
(q): Linear(in_features=512, out_features=512, bias=False)
(k): Linear(in_features=512, out_features=512, bias=False)
(v): Linear(in_features=512, out_features=512, bias=False)
(o): Linear(in_features=512, out_features=512, bias=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(2): T5LayerFF(
(DenseReluDense): T5DenseReluDense(
(wi): Linear(in_features=512, out_features=2048, bias=False)
(wo): Linear(in_features=2048, out_features=512, bias=False)
(dropout): Dropout(p=0.1, inplace=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
)
(final_layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)


Yes, that’s right, look at all those Linear layers! We should be able to get some benefit out of quantizing this model.

## One Weird Trick — Dynamic Quantization

import torch.quantization

quantized_model = torch.quantization.quantize_dynamic(base_model, {torch.nn.Linear}, dtype=torch.qint8)


No, really, that’s it. Chapter done. Bye!

Oh, okay, if you really insist. But honestly, there’s not much more to it. Okay, firstly, a caveat in that quantize_dynamic will only quantize the weights, not the activations in our parameters. But all we need to do is pass in the model we wish to quantize and a dict of layers that we wish to replace with our quantized versions, in this case Linear. The function returns a new model, though you could run with the optional parameter inplace=True to mutate the original model rather than make a copy.

Let’s save the model and take a look at the quantized size:

!mkdir t5
quantized_model.save_pretrained("t5")
!du -m t5

mkdir: cannot create directory ‘t5’: File exists
121 t5


Almost a 50% reduction in size! We can’t get down to 4 times smaller due to not being able to store the activations as 8-bit integers, but we’ve done pretty well for one line of code. Let’s do a very simple microbenchmark using both models in the transformers library summarization pipeline.

base_summarizer = pipeline("summarization", model=base_model, tokenizer="t5-small")
quantized_summarizer = pipeline("summarization", model=quantized_model, tokenizer="t5-small")

%timeit base_summarizer("From the very beginning, Regan was seen as having series potential. After the television film scored highly in the ratings, work began on the development of the series proper. Ian Kennedy Martin's idea was for the series to be mainly studio-based, with more dialogue and less action, but producer Ted Childs disagreed, and in consequence Ian Kennedy Martin parted company with the project. Childs produced it on 16mm film, a format that allowed for a much smaller film unit than videotape at that time. This made it possible to shoot almost entirely on location which helped give the series a startling degree of realism and to use film editing techniques which enabled him to give the show a heavy bias toward action sequences. The television play and the subsequent series were commissioned by Thames Television and produced by its film division Euston Films. It was originally broadcast on ITV between 2 January 1975 and 28 December 1978 at 21:00–22:00 on weekdays (usually Mondays), with repeated screenings at the same time until the early 1980s. The writers were given strict guidelines to follow: \"Each show will have an overall screen time (minus titles) of 48 minutes 40 seconds. Each film will open with a teaser of up to 3 minutes, which will be followed by the opening titles. The story will be played across three acts, each being no more than 19 minutes and no less than 8 minutes in length. Regan will appear in every episode, Carter in approximately 10 out of 13 episodes. In addition to these main characters, scripts should be based around three major speaking parts, with up to ten minor speaking parts")

1 loop, best of 3: 29.4 s per loop

%timeit quantized_summarizer("From the very beginning, Regan was seen as having series potential. After the television film scored highly in the ratings, work began on the development of the series proper. Ian Kennedy Martin's idea was for the series to be mainly studio-based, with more dialogue and less action, but producer Ted Childs disagreed, and in consequence Ian Kennedy Martin parted company with the project. Childs produced it on 16mm film, a format that allowed for a much smaller film unit than videotape at that time. This made it possible to shoot almost entirely on location which helped give the series a startling degree of realism and to use film editing techniques which enabled him to give the show a heavy bias toward action sequences. The television play and the subsequent series were commissioned by Thames Television and produced by its film division Euston Films. It was originally broadcast on ITV between 2 January 1975 and 28 December 1978 at 21:00–22:00 on weekdays (usually Mondays), with repeated screenings at the same time until the early 1980s. The writers were given strict guidelines to follow: \"Each show will have an overall screen time (minus titles) of 48 minutes 40 seconds. Each film will open with a teaser of up to 3 minutes, which will be followed by the opening titles. The story will be played across three acts, each being no more than 19 minutes and no less than 8 minutes in length. Regan will appear in every episode, Carter in approximately 10 out of 13 episodes. In addition to these main characters, scripts should be based around three major speaking parts, with up to ten minor speaking parts")

1 loop, best of 3: 16.6 s per loop
`

In addition to almost being half the size, the quantized model is almost twice as fast! So…why don’t we do this all the time? Are there no downsides? Well…it depends. We are losing information in our inference in a quantized model as our values cannot map to all the possible floating-point values that we find in the original model. So the chain of multiplications will be less accurate in our quantized model than in the original. You’ll need to check the new model against a reference dataset to determine the accuracy loss and whether that loss is an acceptable trade-off compared to the reduced storage demands and faster execution.

## Other Quantizing Options Are Available

In addition to dynamic quantizing, PyTorch also offers static quantizing, where a trained model is modified to include observer modules and a selection of data is fed into the model. During the inference on this data, the observers can generate a quantized distribution that fits best to the observed data and the activations that result. This can can produce even further space and time savings, especially with vision models like ResNet.

However, for the best-in-class of accuracy in your smaller model, you’ll want to investigate quantization-aware training (QAT). In this approach, the model fakes quantizing during the training loop of both the forward and backward passes; while all the computations take place with standard floats, everything is rounded down to integer values, so you end up with a quantized model after training is finished, but one with a higher accuracy than you can acheive with the dynamic or static approaches.

## Is It Worth It?

You might be wondering if you’re just better off training a smaller model rather than going to all this effort to compress larger models. In the recent paper, Train Large, Then Compress, there’s a good deal of evidence presented that transformer-based models really do benefit from this approach. Because larger models converge faster than smaller ones, you will likely get more accurate results by training a large model and compressing than if you spent the same compute time on a smaller model. So go forth and compress!

(and we’ll see you back here in the future for pruning models)

https://pytorch.org/docs/stable/quantization.html

https://arxiv.org/abs/2002.11794

# Falling Standards

A summary of how things are going: I have worked from home now for over four years. And yet it was the closing week of April 2020 where I ordered enough tracksuit bottoms to last a week, because I cannot be bothered to dress properly any more. Standards have fallen, people. Also, how come March felt like one thousand years instead of a month and yet April has flown by so quickly?

Slowly trying to thin down my web presence this week by backing up a hosted server I rent that…hasn’t been doing anything of note for at least two years now. I really want to convert this blog to just being served from an S3 bucket, but given that the current set up has been running on Dreamhost for…18 years, well, let’s just say that there’s some inertia there. Also, this blog will be old enough to drink come August (though not in its jurisdiction). That’s absolutely terrifying. It also means that it’s been the same amount of time since I went to UNC. Ah, halcyon days.

Meanwhile, from what is apparently the Cincinnati 2020 reboot of Keeping Up Appearances, let me tell you how I met the ex-mayor of Cincinnati. It’s Friday evening, and I’m celebrating the end of the week with a drink. It’s a slightly larger than usual drink because it was the end of the bottle. There I am, downstairs in the basement watching an episode of The Sweeney whilst also reading the instructions for a GPU-accelerated dense vector search library. Yeah, I know you’re jealous; it’s a wild Friday night and no mistake. Anyway, having finished the bottle of Wild Turkey, of course the doorbell rings. Running up the stairs, slightly drunk, pulling off a sock as I go1, I open the door and, yes, well there’s the former mayor, coincidentally my neighbour with a package in hand. Now, despite it also being the two-year anniversary of me moving to the area…I have never actually met him before. So he’s introducing himself, and I go to shake his hand.

“Oh no, I don’t think we do that any more”

I snap back, making it very clear I’ve been drinking, apologising and basically expecting a Miranda-like situation where I lose my trousers is coming any moment. On the bright side, I have now met both sets of neighbours! But I can never speak to one set ever again. I guess I’m thankful it’s not the set that I’m helping to take down a tree in the summer?

1. because, obviously, I only had one sock on at that point and that would have just been weird to answer the door with one sock. [return]

# Shaw Taylor Says Hi

Well, as the horror continues to pile up, a small bit of personal good news: apparently my book is being translated into German. I’m working with the editors to make a few changes in the upcoming edition, hopefully including the already-in-English Chapter 9.5 plus some sections on quantizing models and image self-supervision, providing I can get them written by the end of May. The English language versions of those will end up on the GitHub repo as promised. No word on a second edition (which would probably see massive revisions of the test and audio chapters, given that PyTorch changed everything a month after the book went to the printers), but I’ll let you know if it happens!

Otherwise, a week that mainly consisted of the New Normal of Remaining Indoors, but with a small visit to Jungle Jim’s on Saturday (with a mask, but nothing as fabulous as this outfit). I needed more tea. Of course, when you go, you can’t just buy one thing, as you will end up with at least ten things that catch you eye even when you go with a list and a plan in mind. Reader, I bought all the Star Bars. And the Tunnock’s Caramel Wafers. Look, I’m reduced to watching episodes of Police 5 whilst drinking a cup of tea. Eating Scottish sweets that haven’t changed since 1952 is the least of my worries right now.

I continue to plan for a future in the hope that there will be a future. Not that I can talk about things much at the moment, but one plan will have a website, hopefully around the end of June. It’ll be fun! At least fun in a certain type of way. Not quite as much fun as Sophie Ellis-Bextor’s Kitchen Discos, but something for a certain type of person to play about with for the rest of the year. No clues, but it will take obsessions from my early life and mix them with current obsessions. Keep ‘em peeled!

I’m so sorry

# What Even Is Time?

I think this week qualifies as my first meltdown? That’s what days of increasingly less sleep gets you, I guess. Also: you cannot solve Kubernetes problems by thinking about them from 1am until 6am.1

Anyway, I have had a long weekend of doing…almost nothing. It’s been quite nice, really. No baking or confectionary work. No long periods of staring at code. I even sat outside for a bit.

(Okay, I cheated by moving the baking to mid-week by virtue of having a quiet, but quite nice birthday on Wednesday. Not quite the birthday celebration we had planned, but that is pretty much 2020 so far…)

Next week: further adventures in Seldon Core. And maybe a masked-up visit to Jungle Jim’s…

(I don’t have a lot to say about being 41. Aside from being so old, obviously)

1. Though, oddly, you can solve them in the morning whilst in the shower. [return]

# Sheriff Fatman

Really, what I want to know is why I spent a large part of Friday night listening to Carter The Unstoppable Sex Machine?

It’s a sound of The End of History, big sans-serif fonts on top of VT from a performance on The 8:15 From Manchester, Amiga music intros, and big honking slabs of the cheapest drum beat you could possibly imagine. And there’s been no revival. The only reference point I can think of these days would be The Indelicates, and they’re not exactly a household name. Maybe I just love the puns and the vague memory of happier times before The Event.

Also, who booked them for the Smash Hits Poll Winners’ Party?

(I have to say that I’m on Schofield’s side of that argument; his response to them smashing up the stage is peak-Smash Hits. It’s not quite like firing a machine gun at the Brits, after all…)

In Event news, I went outside. Twice. Both to supermarkets, though one trip was for a prescription rather than food items. The trip to Kroger was weird in that it’s the first time in a month I’ve been near that number of people . Already it seems odd to have so many people in one place. Probably around 60-70% people wearing masks and…somewhat bad attempts at social distancing. You have to be more on your game noticing when people suddenly stop in the aisles. But I got in and out (I can’t say unscathed, because none of us really know, of course). There are supplies, and there will be a roast dinner (and a game of Root which actually ended up not being quite as scary for a first play-through as we expected).

As I type this, I realize that right now, I should have been suffering terrible jet-lag and trying to roam the streets of Tokyo with Tammy in a vain attempt to stay up to a reasonable bedtime. There would have been ramen, brutalism, bourbon hunting, and my face curling into a pained grimace as Tammy ate all the seafood. But the important thing is that we’re still all here and mostly okay!

# Remaining Indoors

Well, I was going to point back into the archives and highlight some posts from the Pac-Man Wedding of 2010…only it turns out that whilst I was good about posting updates leading up to the event highlighting voting for the reception songs, I wrote absolutely nothing about the day itself or the subsequent trip to New York. So, er…it’s been ten years since Sandwich Day at Fiore’s! I am celebrating by…remaining indoors as usual. It’s what we do in 2020!

I even double-downed on Remaining Indoors this weekend as there was no board game visit. I sadly didn’t really sleep on Friday and didn’t feel up to it. Instead, I made my own Easter egg.

View this post on Instagram

A post shared by Ian Pointer (@carsondial) on

Milk chocolate orange through and through, twenty-two centimetres tall. That’ll do, I think.

I am putting off going outside again until Tuesday. I can even probably survive another couple of weeks without going to a supermarket, but we’re going to have a roast dinner for Easter Sunday next weekend, so I will have to venture out for potatoes and meat. I’m not particularly looking forward to that. But I have masks, gloves, and enough hand sanitizer to sterilize a battalion. So I should be okay.

# I Only Remember The Bits With Liz

Well, I finally caved to reality this weekend and cancelled the flight to Japan. Re-scheduling may end up being somewhat difficult, even putting aside the pandemic, so it’s not looking likely that we will get to Japan this year after all. Bit of a disappointment, but hopefully there will be other times to come!

View this post on Instagram

A post shared by Ian Pointer (@carsondial) on

Meanwhile, I have joined the baking hordes. But I refuse to waste my precious flour reserves by throwing out a sizeable amount of it to begin a sourdough starter. Instead, soda bread! I was planning on doing a full Ulster Fry…but isolation makes you somewhat less inclined to do an actual production for lunch or dinner. Just Enough Is Fine.

Having said all that, I’m already making plans for something fun at Easter now that I’m going to…well, probably still confined to the house if we’re being honest with ourselves. More details as the plans come together! But you can guess that it’s going to involve chocolate somehow. Not much of a surprise there.

We also re-watched Billy Liar this week. I last watched it back in 2005, which seems to have meant that I completely forgot all the scenes in the film that don’t have Julie Christie in and show Billy to be entirely hateful. Or that Rodney Bewes was in it, basically playing Bob from The Likely Lads. It was definitely less charming than I remember, and I think spending the past 15 years with The Long Blondes’ Giddy Stratospheres made me think there was a touch more to the Liz scenes than are actually present in the film. Still, we’ll move onto Kes next week in my continuing season of “it’s grim up North”. Or maybe Panda’s Fen for something a little more weird.

Right, another week of staying home awaits! rushes and sits down