Jun 14, 2020 · 2 minute read
manchester
so so old
It’s been twenty years. At least, around that. Not sure exactly. I didn’t keep a diary, and as I keep pointing out to the great disappointment of others, there are very few photos of that time, but I think either this weekend or last weekend was my final weekend of living in Manchester. At the end of the first and second years, I rushed home as soon as I possibly could, sometimes literally the day of my final exam. In the last year, I hung around for as long as I could, knowing that this was a big ending. Some of the friends I made during that time I haven’t seen since the day I left.
St. Anselm Hall is now co-ed (though apparently it was the last hold-out of university halls in the entire country, which says…something), but still apparently has formal dinner. The thing about it is that we laughed at the Oxford-lite grasping at the time (the ‘JCR’ was an anteroom, the ‘SCR’ simply a normal room with delusions upon its station), and I’m sure everybody else who stayed there throughout the decades did too. But I did stay there for three years, in the exact same room, so it had its merits, even if when I tell people about it now they expect me to talk about a Sorting Hat.
It often feels like yesterday. But more often these days, it seems like a different world and a completely different person. At least I eventually managed to sort out half-decent looking glasses for me. It might have been better if I had got that down before I left home mind you…
(come back in August for the second part of this tale, where I point out that the entire course of my life post-2000 was decided by Dawson’s Creek of all things. Plus! Muppets! In Space!)
Jun 7, 2020 · 1 minute read
the darkest timeline
Even the Judges of Mega-City One don’t obscure their badges.
Things I didn’t expect at the start of the year: assembling a backpack with equipment to cut zipties, deal with tear gas, cuts, and bruises, all topped off with a pair of burner phones to avoid Stingray harvesting. And then, less than 24 hours later, taking it all back out again out of an abundance of COVID-19 caution.
How will I describe this week in the years to come? “That was the week I got curtains!” Maybe not. But it has been two and a half years since I bought the house, so it is probably beyond time of having some. Only for two windows, though. You don’t want to rush these things.
And I’m tired again. Maybe a longer entry next week? Probably wouldn’t get your hopes up, but I’ll see what I can do.
Jun 1, 2020 · 1 minute read
maybe June will be better?
Well, it’s been a rather terrible week on the macro and the micro level, hasn’t it? Perhaps June will bring better tidings. Or…given that it’s 2020, things will get worse.
It got worse. Much worse.
May 25, 2020 · 13 minute read
deeeeeep learning
quality content
being lazy and still doing better
(You can also view this post in Google Colab)
Self-supervision is the current hotness of deep learning. Yes, deep networks and transfer learning are now old hat; you need to include self-supervised somewhere if you want to get those big VC dollars. Like transfer learning, though, at its core it’s a very simple idea: there is so much data in the world, so how can we use it without the big expense of getting humans to label it all? And the answer really does feel like cheating. Self-supervision is essentially “get the computer to automatically add labels to all your data, train a network on that, and then use transfer learning on the task you actually want to solve.” That’s it. The only interesting bit is deciding what labels to add for what is called the “pretext task”, but the technique is surprisingly effective, especially in image and text-based problems where the Internet provides an almost endless supply of data.
Let’s have a look at the two main approaches to image self-supervised learning that are popular right now: rebuilding the original input from a distorted input, and automatically adding labels to data and training using those synthetic labels.
If you remember our look at the super-resolution architectures, they’re taking a small image and producing a larger, enhanced image. A self-supervised dataset for this problem can be fairly easily obtained by simply looking at the problem in the opposite way: harvest images from the Internet, and create smaller versions of them. You now have training images and the ground truth labels (the original images). If you’re building a model that colourizes images, then you grab colour images…and turn them into black and white ones!
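As a rough sketch of that "work backwards from the answer" idea, here is one way the input/target pairs could be built with Pillow. The helper names (`make_super_resolution_pair`, `make_colourization_pair`) and the solid-colour stand-in image are my own assumptions for illustration, not something from an existing library or from the original post.

```python
from PIL import Image


def make_super_resolution_pair(image, factor=4):
    """Return (small_input, original_target) for a super-resolution pretext task."""
    small = image.resize((image.width // factor, image.height // factor), Image.BICUBIC)
    return small, image


def make_colourization_pair(image):
    """Return (greyscale_input, colour_target) for a colourization pretext task."""
    grey = image.convert("L")
    return grey, image


# Hypothetical stand-in for an image harvested from the Internet.
original = Image.new("RGB", (256, 192), color=(200, 120, 40))

small, sr_target = make_super_resolution_pair(original)
grey, colour_target = make_colourization_pair(original)
```

In both cases the "label" is just the untouched original, which is why no human annotator is needed.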
You can extend this to a more general principle, where you take an image, apply a series of transforms to that image, and then train a neural network to go from the manipulated image to the original. You’ll end up with some sort of U-Net-like architecture, but after you’ve trained the network, you can throw away the ‘decoder’ half and use the ‘encoder’ part for your actual task by adding a Linear layer or two on top of the features you obtain at the bottom of the ‘U’.
You’ll want an augmentation technique that forces the architecture to learn things like how to structure elements of images, how to in-paint missing parts of an image, correct orientations, and so on. Here are a couple to get you started:
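To make the "throw away the decoder" step concrete, here is a minimal sketch. It assumes a deliberately tiny encoder-decoder with no skip connections (so not a real U-Net), and the class names `TinyAutoencoder` and `EncoderClassifier` are hypothetical ones I've picked for the example.

```python
import torch
import torch.nn as nn


class TinyAutoencoder(nn.Module):
    """Toy encoder-decoder standing in for the restoration network."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, kernel_size=4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))


class EncoderClassifier(nn.Module):
    """Reuses the (pre-trained) encoder and adds a small Linear head on top."""

    def __init__(self, encoder, num_classes):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),  # pool the bottom-of-the-U features
            nn.Flatten(),
            nn.Linear(32, num_classes),
        )

    def forward(self, x):
        return self.head(self.encoder(x))


autoencoder = TinyAutoencoder()  # imagine this has been trained on the pretext task
classifier = EncoderClassifier(autoencoder.encoder, num_classes=10)

batch = torch.rand(2, 3, 64, 64)
restored = autoencoder(batch)
logits = classifier(batch)
```

The decoder is simply never referenced again; only the encoder weights carry over to the task you actually care about.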
CutOut / Random Erasing
Perhaps the easiest to apply is simply removing part of an image and getting the model to restore it. This approach is often known as CutOut, and was shown to improve model performance on classification tasks in its introductory paper “Improved Regularization of Convolutional Neural Networks with Cutout”.
And it’s rather easy to apply, because it’s now included as a torchvision transform by default! You can just use:
torchvision.transforms.RandomErasing(p, scale, ratio, value, inplace)
This can be slotted into a transform pipeline as we’ve seen many times throughout the book. The parameters you can set are:
p: the probability of the transform taking place.
scale: the range of the proportion of the erased area against the input image.
ratio: the range of the aspect ratio of the erased area.
value: the value that will be used in the erased box. Default is 0. If given a single integer, that integer will be used. A tuple of length 3 will make the transform use its values for replacing the R, G, and B channels respectively. If passed the string "random", each pixel in the box will be replaced with a random value.
inplace: boolean to make this transform in-place. Default is False.
In general, you’ll probably want to use the random strategy for erasing details from an image.
Crappify
Crappify is a fun idea from the fast.ai project which literally ‘crappifies’ your images. The concept is simple: pack a transform function with resizing, adding text, and JPEG artefacting, or anything else you decide to add to ruin the image, and then train the network to restore things back to the original.
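A hedged sketch of what such a function might look like with Pillow follows. This is my own approximation of the idea rather than fast.ai's actual implementation, and the default values (`scale`, `jpeg_quality`, the overlaid text) are assumptions.

```python
import io

from PIL import Image, ImageDraw


def crappify(image, scale=0.25, jpeg_quality=10, text="crappy"):
    """Degrade an image: lose resolution, stamp text on it, add JPEG artefacts."""
    width, height = image.size
    # Shrink and blow back up to throw away fine detail.
    small = image.resize((max(1, int(width * scale)), max(1, int(height * scale))), Image.BILINEAR)
    degraded = small.resize((width, height), Image.BILINEAR)
    # Scribble some text over the top using Pillow's default font.
    ImageDraw.Draw(degraded).text((5, 5), text, fill=(255, 255, 255))
    # Round-trip through a low-quality JPEG held in memory.
    buffer = io.BytesIO()
    degraded.save(buffer, format="JPEG", quality=jpeg_quality)
    buffer.seek(0)
    return Image.open(buffer).convert("RGB")


# Hypothetical stand-in for a real training image.
original = Image.new("RGB", (224, 224), color=(30, 90, 160))
crappy = crappify(original)
```

The training pair is then `(crappy, original)`, exactly as with the other restoration tasks.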
Automatically Labelling Data
The full image-based self-supervision works very well, but you could say that it’s a little wasteful in a classification task; you end up training a full U-Net and throwing half of it away. Is there a way we can be lazier and still do self-supervision?
Yes! And it’s what we’re going to spend the rest of this section implementing. Consider this image:
Okay, so it’s another picture of Helvetica the cat, but we would need a human annotator to give us the cat label. But we can take this image, transform it, and give it a meaningful label at the same time.
image_90
We have given this new image the label of image_90 to indicate that it has been rotated by 90º. But no human was needed in this (trivial) labelling. We can build a completely synthetic classification task, where we can build a training dataset and corresponding labels entirely programmatically. We don’t need to build a U-Net block because all we’re training is a simple classification task; there’s no image reconstruction. But in order to learn how to classify correctly, the model will have to learn how to recognize what way up a cat normally is, and this pre-trained model can then be used on our actual classification task.
We’re going to build this approach to self-supervision, but we’re going to do it with a slightly higher-level framework than PyTorch. Enter PyTorch Lightning!
PyTorch Lightning, or A Little Help From The Internet
PyTorch Lightning is a wrapper around PyTorch that handles a lot of the standard PyTorch boilerplate that you end up writing for every project (e.g. training, test, and validation loops, determining whether a model should be in eval or not, setting up data, and so on). Like fast.ai, it has an extensible callback system that allows you to hook custom code during almost any part of the training cycle, so you end up with most of the power of PyTorch, but without having to rewrite train() every time you start a new project. Instead, you end up just doing this to train a custom model:
from pytorch_lightning import Trainer
model = LightningModel()
trainer = Trainer(gpus=1, num_nodes=1)
trainer.fit(model)
Some people prefer working with pure PyTorch all the time, but I definitely see a lot of value in Lightning, as it does remove a lot of the error-prone tedium of writing training code while still remaining flexible enough for research purposes. I personally write most of my deep learning code either with Lightning or fast.ai (the new fast.ai2 library even has a tiered layer of APIs that allows you to delve deeper when you need to but still use their powerful higher-level abstractions for most of your work) rather than in raw PyTorch.
Don’t worry, though, because as we’ll see, building a model with PyTorch Lightning isn’t that much different than what we’ve been doing throughout the rest of the book. We just don’t need to worry about the training quite so much!
Light Leaves, ResNet Sees
In order to demonstrate self-supervised training, we’re going to use a smaller version of ImageNet called Imagenette. This dataset contains images from 10 classes of the larger set, and was constructed by Jeremy Howard as a way of being able to quickly test new ideas on a representative sample of ImageNet rather than having to spend a considerable amount of time training on the whole thing. We’ll be using the 320px version (imagenette2-320) for our model, which means a download of roughly 300MB. Let’s declare our imports and download Imagenette.
!pip install pytorch-lightning
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import pytorch_lightning as pl
from PIL import Image
from pathlib import Path
from torchvision import transforms
import torchvision.transforms.functional as TF
import random
!wget https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-320.tgz
!tar xvzf imagenette2-320.tgz
A Self-Supervised Dataset, As A Treat
You, sobbing: “You can’t just point at a picture and call it a label!”
Me, an intellectual, pointing at a cat rotated ninety degrees: “Label.”
Even though we’re using PyTorch Lightning, we’ll construct our datasets in the usual way with the Dataset class. When an image is requested from the dataset, we will either simply return a tensor version of the image with the label 0, or randomly apply a rotational transform through 90, 180, or 270 degrees, or flip the image either horizontally or vertically. Each of these potential transforms has a separate label, giving us six potential labels for any image. Note that we’re not doing any normalization in this pipeline to keep things relatively simple, but feel free to add the standard ImageNet normalization if you desire.
class RotationalTransform:
    def __init__(self, angle):
        self.angle = angle

    def __call__(self, x):
        return TF.rotate(x, self.angle)

class VerticalFlip:
    def __init__(self):
        pass

    def __call__(self, x):
        return TF.vflip(x)

class HorizontalFlip:
    def __init__(self):
        pass

    def __call__(self, x):
        return TF.hflip(x)
We’ll then wrap those transforms up inside a Dataset class, which will apply a chosen transformation when __getitem__ is called, as well as returning the correct label for that transform.
class SelfSupervisedDataset(torch.utils.data.Dataset):
    def __init__(self, image_path=Path("imagenette2-320/train")):
        self.imgs = list(image_path.glob('**/*.JPEG'))
        # The index of each transform in this list doubles as its class label
        self.class_transforms = [RotationalTransform(0), RotationalTransform(90),
                                 RotationalTransform(180), RotationalTransform(270),
                                 HorizontalFlip(), VerticalFlip()]
        self.resize = transforms.Resize((224, 224))
        self.to_tensor = transforms.ToTensor()
        self.classes = len(self.class_transforms)

    def __getitem__(self, idx):
        img = Image.open(self.imgs[idx]).convert("RGB")
        # Pick one of the transforms at random; its index is the synthetic label
        label = random.randrange(self.classes)
        # Resize first, then apply our selected transform and finally convert to tensor
        transformed_image = self.to_tensor(self.class_transforms[label](self.resize(img)))
        return transformed_image, label

    def __len__(self):
        return len(self.imgs)
ResNet-34 Go Brrr
With our dataset completed, we’re now ready to write the LightningModule that will be the model we train on this data. Writing a model in PyTorch Lightning is not too much different from the standard PyTorch approach we’ve seen throughout the book, but there are some additions that make the class more self-contained and allow PyTorch Lightning to do things like handle training for us. Here’s a skeleton LightningModule:
class SkeletonModel(pl.LightningModule):
    def __init__(self):
        pass

    def forward(self, x):
        pass

    def train_dataloader(self):
        pass

    def training_step(self, batch, batch_idx):
        pass

    def configure_optimizers(self):
        pass

    def prepare_data(self):
        pass
As you can see, we have our familiar __init__ and forward methods, which work in exactly the same way as before. But we now also have methods for various parts of the training cycle, including setting up dataloaders and performing training and validation steps. We also have a prepare_data method which can do any preprocessing needed for datasets, as well as configure_optimizers for setting up our model’s optimizing function.
PyTorch Lightning includes hooks for lots of other parts of the training process (e.g. handling validation steps and DataLoaders, running code at the start or end of training epochs, and lots more besides), but these are the minimal parts we’ll need to implement.
Now that we know the structure, let’s throw together a model based on ResNet-34 with a small custom head. Note that we’re not using a pretrained ResNet model here; we’re going to be training from scratch. We’ll also add another method, validation_epoch_end, which will update statistics for loss and accuracy in our validation set at the end of every epoch.
class SelfSupervisedModel(pl.LightningModule):
    def __init__(self, hparams=None, num_classes=6, batch_size=64):
        super(SelfSupervisedModel, self).__init__()
        self.resnet = torchvision.models.resnet34(pretrained=False)
        # Replace the stock classifier with a small custom head
        self.resnet.fc = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, num_classes))
        self.batch_size = batch_size
        self.loss_fn = nn.CrossEntropyLoss()
        # Guard against the default of None before looking up the learning rate
        hparams = hparams or {}
        if "lr" not in hparams:
            hparams["lr"] = 0.001
        self.hparams = hparams

    def forward(self, x):
        return self.resnet(x)

    def training_step(self, batch, batch_idx):
        inputs, targets = batch
        predictions = self(inputs)
        loss = self.loss_fn(predictions, targets)
        return {'loss': loss}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams["lr"])

    def prepare_data(self):
        self.training_dataset = SelfSupervisedDataset()
        self.val_dataset = SelfSupervisedDataset(Path("imagenette2-320/val"))

    def train_dataloader(self):
        return torch.utils.data.DataLoader(self.training_dataset, batch_size=self.batch_size, num_workers=4, shuffle=True)

    def val_dataloader(self):
        return torch.utils.data.DataLoader(self.val_dataset, batch_size=self.batch_size, num_workers=4)

    def validation_step(self, batch, batch_idx):
        inputs, targets = batch
        predictions = self(inputs)
        val_loss = self.loss_fn(predictions, targets)
        _, preds = torch.max(predictions, 1)
        # Fraction of this batch that was classified correctly
        acc = (preds == targets).float().mean()
        return {'val_loss': val_loss, 'val_acc': acc}

    def validation_epoch_end(self, outputs):
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        avg_acc = torch.stack([x['val_acc'].float() for x in outputs]).mean()
        logs = {'val_loss': avg_loss, 'val_acc': avg_acc}
        return {'progress_bar': logs}
Having defined the model, we can start training by using PyTorch Lightning’s Trainer class. We’ll pass in max_epochs to only train for 5 epochs with the learning rate of 0.001 (though the framework comes with an lr_finder method to find an appropriate learning rate that uses the same approach that we have been using in the book so far and what you’ll find in fast.ai). We’ll also need to tell the trainer how many GPUs we have available; if more than one is present and available, then the class will use as many as directed for multi-GPU training.
model = SelfSupervisedModel({'lr': 0.001})
trainer = pl.Trainer(max_epochs=5, gpus=1)
trainer.fit(model)
trainer.save_checkpoint("selfsupervised.pth")
We’ve now trained for 5 epochs on our pretraining task. What we need to do now is to train on the actual task we’re trying to solve: not to classify for rotations or flipping, but to determine the ImageNet class an image belongs to. We can do this training simply by swapping out the current dataloaders for ones that return the images and the labels for the provided Imagenette dataset. We do this using the old faithful ImageFolder:
tfms = transforms.Compose([
    transforms.Resize((224,224)),
    transforms.ToTensor()
])
imagenette_training_data = torchvision.datasets.ImageFolder(root="imagenette2-320/train/", transform=tfms)
imagenette_training_data_loader = torch.utils.data.DataLoader(imagenette_training_data, batch_size=64, num_workers=4, shuffle=True)
imagenette_val_data = torchvision.datasets.ImageFolder(root="imagenette2-320/val/", transform=tfms)
imagenette_val_data_loader = torch.utils.data.DataLoader(imagenette_val_data, batch_size=64, num_workers=4)
We’ll then load in our saved checkpoint, replacing the original training data with the new DataLoader, and we’ll replace the head of the classifier so it now is predicting the 10 ImageNet labels instead of our self-supervised labels. The model will be trained for a further 5 epochs on the supervised training data.
model = model.load_from_checkpoint("selfsupervised.pth")
model.resnet.fc[2] = nn.Linear(256, 10)  # Imagenette has 10 classes
Training will be performed using the Trainer class again, but this time we’ll pass in these new training and validation dataloaders, which will override the ones we defined in the actual class (and prepare_data will not be called by PyTorch Lightning during this training phase).
trainer = pl.Trainer(max_epochs=5, gpus=1)
trainer.fit(model, train_dataloader=imagenette_training_data_loader, val_dataloaders=imagenette_val_data_loader)
The model’s accuracy at the end of its final (10th) epoch of training was around 54%, which isn’t too bad considering that we have only trained for 5 epochs on the labelled data itself (and did no augmentation in that pipeline). But was it worth it? Well, let’s check! If we recreate our model from scratch and just pass in the supervised Imagenette dataloaders for training and validation, training for 10 epochs, we can compare that result with our self-supervised model.
standard_model = SelfSupervisedModel({'lr': 0.001}, num_classes=10)  # 10 Imagenette classes, not the 6 pretext labels
trainer = pl.Trainer(max_epochs=10, gpus=1)
trainer.fit(standard_model, train_dataloader=imagenette_training_data_loader, val_dataloaders=imagenette_val_data_loader)
On my training run, it ended up with a best accuracy over 10 epochs of 33%. We can see that pre-training with our self-supervised dataset offers better performance despite the model being trained on the final task for only 5 epochs.
One Step (or more) Beyond
This has been dipping a toe into the waters of self-supervised learning. If you want to go deeper, you could experiment further with the framework in this chapter. Can you improve performance by adding other transformations to the initial pipeline, perhaps? Or by adding augmentation during the fine-tuning stage on the actual task? Or by training with larger ResNet architectures?
In addition, I urge you to look into contrastive learning, which is a technique where the model is trained by being shown augmented and non-augmented images and another image of a completely different class. This turns out to be another powerful way of extracting as much as you can from your existing data and, as part of Google’s SimCLR system, is currently the state-of-the-art when it comes to training models on ImageNet.
Further Reading
May 17, 2020 · 2 minute read
deeeeeep learning
quality content
I don’t normally hype work things on this blog, but seeing as how I had a part in making it happen, I’d be remiss if I didn’t link to Lucidworks’ new Smart Answers, a deep learning-powered question and answer system. We put a lot of work into making it really easy to get started either with or without training on your existing data and we’ll be continuing to improve it for quite some time to come.
Having said all that, I’m not sure if there’s much else to talk about this week. I should have the last part of Chapter 9.5 finished next weekend, so if nothing else, you’ll have a Bank Holiday Weekend of just under 3,000 words on the burning subject of Image Classification via Self-Supervision. Get in, lads
Also, you should be concerned, because I’m toying with the idea of writing a long and rambling post about Community, a show which is not the best American sitcom of the past 20 years (because that’s Brooklyn-Nine-Nine), but is probably my favourite. But don’t worry! There will be vintage clips of British shows in that post as well! I know that’s what you’re really here for.
In the meantime, stay safe, and remain indoors.
May 10, 2020 · 2 minute read
rewatching community again
transformers
actually, this time it is about those ones
And we’re back to non-deep learning content again. I’m currently watching my cat attempt to hunt birds from the basement window, losing her footing and falling onto the pillow that has been down here since Thanksgiving. And then getting back up again. It’s entertaining. Welcome to lockdown…week…infinity?
This has been a very different week from most quarantine weeks, in that the house has been mostly full! People recovering from surgery, visitors, cinnamon rolls, a disappointing Detroit-style pizza, the return of the deer family, a depressing Catalyst-related Spark bug, the completion of the TF UK #82 line up of The Wreckers, a shed-load of Brook-Rose novels, and my first attempt at Sichuan cooking. Which, really, makes staying home sound a lot more interesting than it actually is. And I didn’t mention the nights of not sleeping, or the hours of staring at log files. Or that my turnips rotted.
Still, aside from the weight gain, the general stress, the mounting apprehension about November, and a host of other things, everything is fine! Totally!
🥺🥺🥺
Come back next week for more uplifting content!
May 3, 2020 · 12 minute read
pytorch
quantization
transformers
no, not those ones
Here’s the notebook in Google Colab
As we know, models can be big lumbering beasts, comprised of millions of parameters (both weights and activations) that require lots of matrix multiplications to take an input and arrive at an answer. And for most of our work so far, that’s been fine! We have mighty GPUs that can handle these burdens with ease.
But what if we didn’t? We often package a model up for production inference usage so that it only runs on the CPU. And what if we wanted to run our model on a smaller embedded platform? Suddenly, both the size of the model and all those floating-point operations become a little more problematic. Thankfully, there’s a trick we can perform that makes our model smaller and faster, normally at the cost of some accuracy. Even better, PyTorch allows us to perform this one weird trick with just one line of code, with some other approaches available for squeezing out even more performance. Let’s have a quick look at quantization.
Quantization
Every parameter in our model is a 32-bit floating point number, taking up 4 bytes of memory. That’s not a lot, but it can soon add up. Let’s have a look at Google’s recent T5 transformer-based model, which has a t5-small variant that’s available in the transformers library.
import torch
from transformers import pipeline, T5ForConditionalGeneration
def count_parameters(model):
    return sum(p.numel() for p in model.parameters())
base_model = T5ForConditionalGeneration.from_pretrained("t5-small")
param_count = count_parameters(base_model)
memory = (param_count * 4) / (1024 *1024)
memory
230.8154296875
Even with the smallest pre-trained T5 weights, our model is roughly 60 million parameters and weighs in at a whopping 230MB!
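To get a feel for what is at stake, here is the same back-of-the-envelope arithmetic at different bit widths. The parameter count is the one implied by the output above (230.8154296875 × 1024 × 1024 / 4), and the 8-bit figure is an idealized best case that assumes every single parameter could be stored as one byte, which is my simplification rather than what any real quantizer will give you.

```python
def model_size_mib(param_count, bits_per_param):
    """Storage needed for the parameters alone, in MiB."""
    return param_count * bits_per_param / 8 / (1024 * 1024)


# Parameter count implied by the t5-small output shown above.
t5_small_params = 60_506_880

float32_size = model_size_mib(t5_small_params, 32)
int8_size = model_size_mib(t5_small_params, 8)

print(f"float32: {float32_size:.1f} MiB")
print(f"int8 (idealized): {int8_size:.1f} MiB")
```

Going from 4 bytes to 1 byte per parameter is where the headline "roughly a quarter of the size" comes from.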
However, what if we decided that we didn’t need the full precision of our floating-point parameters? If our parameters could be restricted to within a certain range of values, then we could use a smaller type of number representation to store the parameters. This quantization is the key to speeding up our inference time and reducing the memory footprint of our models. What we tend to aim for is to quantize down from a 32-bit floating point to an 8-bit integer. The basic idea is:
$$x_{int8} = (\frac{x_{float32}}{x_{scale}} + x_{offset})$$
Which is essentially just fitting the potential values of the parameters of a network to a line of $y = mx + c$, although due to the reduced resolution of the 8-bit integer, there are only so many values a parameter may now take instead of the huge number that a float32 value could be. PyTorch does its quantizing in a slightly more complicated way that ensures that zero is always zero, but the basic idea is the same: we have a range of values that our parameters can take, and then find an appropriate pair $x_{scale}$ and $x_{offset}$ to provide 256 graduations to represent that range (or 255 if you think about PyTorch always keeping zero around).
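The formula is easier to trust once you have pushed a few numbers through it. Below is a plain-Python sketch of the scale-and-offset idea, using the signed 8-bit range of -128 to 127. It is an illustration under my own assumptions (the function names, the rounding, and widening the range to always include zero), not PyTorch's actual implementation.

```python
def choose_qparams(values, qmin=-128, qmax=127):
    """Pick a scale and integer offset so the value range fits into [qmin, qmax]."""
    lo = min(min(values), 0.0)  # always include zero in the range
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin)
    if scale == 0:
        scale = 1.0  # every value was zero; any scale works
    offset = round(qmin - lo / scale)
    return scale, offset


def quantize(x, scale, offset, qmin=-128, qmax=127):
    """x_int8 = round(x_float32 / x_scale + x_offset), clamped to the int8 range."""
    q = round(x / scale + offset)
    return max(qmin, min(qmax, q))


def dequantize(q, scale, offset):
    """Map an int8 value back to an approximate float."""
    return (q - offset) * scale


weights = [-1.0, -0.25, 0.0, 0.5, 1.5]  # made-up example parameters
scale, offset = choose_qparams(weights)
quantized = [quantize(w, scale, offset) for w in weights]
restored = [dequantize(q, scale, offset) for q in quantized]
```

Because the offset is itself an integer, a float of exactly 0.0 maps to the offset and back to exactly 0.0, which is the "zero is always zero" property. Every other value comes back with an error of at most about one step of `scale`.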
At the moment (PyTorch 1.5), quantized layers are best supported with CNN and Linear layers. Thankfully, if we have a look at the model structure of T5, we can see a happy coincidence:
T5Model(
(shared): Embedding(32128, 512)
(encoder): T5Stack(
(embed_tokens): Embedding(32128, 512)
(block): ModuleList(
(0): T5Block(
(layer): ModuleList(
(0): T5LayerSelfAttention(
(SelfAttention): T5Attention(
(q): Linear(in_features=512, out_features=512, bias=False)
(k): Linear(in_features=512, out_features=512, bias=False)
(v): Linear(in_features=512, out_features=512, bias=False)
(o): Linear(in_features=512, out_features=512, bias=False)
(relative_attention_bias): Embedding(32, 8)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): T5LayerFF(
(DenseReluDense): T5DenseReluDense(
(wi): Linear(in_features=512, out_features=2048, bias=False)
(wo): Linear(in_features=2048, out_features=512, bias=False)
(dropout): Dropout(p=0.1, inplace=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(1): T5Block(
(layer): ModuleList(
(0): T5LayerSelfAttention(
(SelfAttention): T5Attention(
(q): Linear(in_features=512, out_features=512, bias=False)
(k): Linear(in_features=512, out_features=512, bias=False)
(v): Linear(in_features=512, out_features=512, bias=False)
(o): Linear(in_features=512, out_features=512, bias=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): T5LayerFF(
(DenseReluDense): T5DenseReluDense(
(wi): Linear(in_features=512, out_features=2048, bias=False)
(wo): Linear(in_features=2048, out_features=512, bias=False)
(dropout): Dropout(p=0.1, inplace=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(2): T5Block(
(layer): ModuleList(
(0): T5LayerSelfAttention(
(SelfAttention): T5Attention(
(q): Linear(in_features=512, out_features=512, bias=False)
(k): Linear(in_features=512, out_features=512, bias=False)
(v): Linear(in_features=512, out_features=512, bias=False)
(o): Linear(in_features=512, out_features=512, bias=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): T5LayerFF(
(DenseReluDense): T5DenseReluDense(
(wi): Linear(in_features=512, out_features=2048, bias=False)
(wo): Linear(in_features=2048, out_features=512, bias=False)
(dropout): Dropout(p=0.1, inplace=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(3): T5Block(
(layer): ModuleList(
(0): T5LayerSelfAttention(
(SelfAttention): T5Attention(
(q): Linear(in_features=512, out_features=512, bias=False)
(k): Linear(in_features=512, out_features=512, bias=False)
(v): Linear(in_features=512, out_features=512, bias=False)
(o): Linear(in_features=512, out_features=512, bias=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): T5LayerFF(
(DenseReluDense): T5DenseReluDense(
(wi): Linear(in_features=512, out_features=2048, bias=False)
(wo): Linear(in_features=2048, out_features=512, bias=False)
(dropout): Dropout(p=0.1, inplace=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(4): T5Block(
(layer): ModuleList(
(0): T5LayerSelfAttention(
(SelfAttention): T5Attention(
(q): Linear(in_features=512, out_features=512, bias=False)
(k): Linear(in_features=512, out_features=512, bias=False)
(v): Linear(in_features=512, out_features=512, bias=False)
(o): Linear(in_features=512, out_features=512, bias=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): T5LayerFF(
(DenseReluDense): T5DenseReluDense(
(wi): Linear(in_features=512, out_features=2048, bias=False)
(wo): Linear(in_features=2048, out_features=512, bias=False)
(dropout): Dropout(p=0.1, inplace=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(5): T5Block(
(layer): ModuleList(
(0): T5LayerSelfAttention(
(SelfAttention): T5Attention(
(q): Linear(in_features=512, out_features=512, bias=False)
(k): Linear(in_features=512, out_features=512, bias=False)
(v): Linear(in_features=512, out_features=512, bias=False)
(o): Linear(in_features=512, out_features=512, bias=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): T5LayerFF(
(DenseReluDense): T5DenseReluDense(
(wi): Linear(in_features=512, out_features=2048, bias=False)
(wo): Linear(in_features=2048, out_features=512, bias=False)
(dropout): Dropout(p=0.1, inplace=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
)
(final_layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(decoder): T5Stack(
(embed_tokens): Embedding(32128, 512)
(block): ModuleList(
(0): T5Block(
(layer): ModuleList(
(0): T5LayerSelfAttention(
(SelfAttention): T5Attention(
(q): Linear(in_features=512, out_features=512, bias=False)
(k): Linear(in_features=512, out_features=512, bias=False)
(v): Linear(in_features=512, out_features=512, bias=False)
(o): Linear(in_features=512, out_features=512, bias=False)
(relative_attention_bias): Embedding(32, 8)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): T5LayerCrossAttention(
(EncDecAttention): T5Attention(
(q): Linear(in_features=512, out_features=512, bias=False)
(k): Linear(in_features=512, out_features=512, bias=False)
(v): Linear(in_features=512, out_features=512, bias=False)
(o): Linear(in_features=512, out_features=512, bias=False)
(relative_attention_bias): Embedding(32, 8)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(2): T5LayerFF(
(DenseReluDense): T5DenseReluDense(
(wi): Linear(in_features=512, out_features=2048, bias=False)
(wo): Linear(in_features=2048, out_features=512, bias=False)
(dropout): Dropout(p=0.1, inplace=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(1): T5Block(
(layer): ModuleList(
(0): T5LayerSelfAttention(
(SelfAttention): T5Attention(
(q): Linear(in_features=512, out_features=512, bias=False)
(k): Linear(in_features=512, out_features=512, bias=False)
(v): Linear(in_features=512, out_features=512, bias=False)
(o): Linear(in_features=512, out_features=512, bias=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): T5LayerCrossAttention(
(EncDecAttention): T5Attention(
(q): Linear(in_features=512, out_features=512, bias=False)
(k): Linear(in_features=512, out_features=512, bias=False)
(v): Linear(in_features=512, out_features=512, bias=False)
(o): Linear(in_features=512, out_features=512, bias=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(2): T5LayerFF(
(DenseReluDense): T5DenseReluDense(
(wi): Linear(in_features=512, out_features=2048, bias=False)
(wo): Linear(in_features=2048, out_features=512, bias=False)
(dropout): Dropout(p=0.1, inplace=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(2): T5Block(
(layer): ModuleList(
(0): T5LayerSelfAttention(
(SelfAttention): T5Attention(
(q): Linear(in_features=512, out_features=512, bias=False)
(k): Linear(in_features=512, out_features=512, bias=False)
(v): Linear(in_features=512, out_features=512, bias=False)
(o): Linear(in_features=512, out_features=512, bias=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): T5LayerCrossAttention(
(EncDecAttention): T5Attention(
(q): Linear(in_features=512, out_features=512, bias=False)
(k): Linear(in_features=512, out_features=512, bias=False)
(v): Linear(in_features=512, out_features=512, bias=False)
(o): Linear(in_features=512, out_features=512, bias=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(2): T5LayerFF(
(DenseReluDense): T5DenseReluDense(
(wi): Linear(in_features=512, out_features=2048, bias=False)
(wo): Linear(in_features=2048, out_features=512, bias=False)
(dropout): Dropout(p=0.1, inplace=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(3): T5Block(
(layer): ModuleList(
(0): T5LayerSelfAttention(
(SelfAttention): T5Attention(
(q): Linear(in_features=512, out_features=512, bias=False)
(k): Linear(in_features=512, out_features=512, bias=False)
(v): Linear(in_features=512, out_features=512, bias=False)
(o): Linear(in_features=512, out_features=512, bias=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): T5LayerCrossAttention(
(EncDecAttention): T5Attention(
(q): Linear(in_features=512, out_features=512, bias=False)
(k): Linear(in_features=512, out_features=512, bias=False)
(v): Linear(in_features=512, out_features=512, bias=False)
(o): Linear(in_features=512, out_features=512, bias=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(2): T5LayerFF(
(DenseReluDense): T5DenseReluDense(
(wi): Linear(in_features=512, out_features=2048, bias=False)
(wo): Linear(in_features=2048, out_features=512, bias=False)
(dropout): Dropout(p=0.1, inplace=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(4): T5Block(
(layer): ModuleList(
(0): T5LayerSelfAttention(
(SelfAttention): T5Attention(
(q): Linear(in_features=512, out_features=512, bias=False)
(k): Linear(in_features=512, out_features=512, bias=False)
(v): Linear(in_features=512, out_features=512, bias=False)
(o): Linear(in_features=512, out_features=512, bias=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): T5LayerCrossAttention(
(EncDecAttention): T5Attention(
(q): Linear(in_features=512, out_features=512, bias=False)
(k): Linear(in_features=512, out_features=512, bias=False)
(v): Linear(in_features=512, out_features=512, bias=False)
(o): Linear(in_features=512, out_features=512, bias=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(2): T5LayerFF(
(DenseReluDense): T5DenseReluDense(
(wi): Linear(in_features=512, out_features=2048, bias=False)
(wo): Linear(in_features=2048, out_features=512, bias=False)
(dropout): Dropout(p=0.1, inplace=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(5): T5Block(
(layer): ModuleList(
(0): T5LayerSelfAttention(
(SelfAttention): T5Attention(
(q): Linear(in_features=512, out_features=512, bias=False)
(k): Linear(in_features=512, out_features=512, bias=False)
(v): Linear(in_features=512, out_features=512, bias=False)
(o): Linear(in_features=512, out_features=512, bias=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): T5LayerCrossAttention(
(EncDecAttention): T5Attention(
(q): Linear(in_features=512, out_features=512, bias=False)
(k): Linear(in_features=512, out_features=512, bias=False)
(v): Linear(in_features=512, out_features=512, bias=False)
(o): Linear(in_features=512, out_features=512, bias=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(2): T5LayerFF(
(DenseReluDense): T5DenseReluDense(
(wi): Linear(in_features=512, out_features=2048, bias=False)
(wo): Linear(in_features=2048, out_features=512, bias=False)
(dropout): Dropout(p=0.1, inplace=False)
)
(layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
)
(final_layer_norm): T5LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
Yes, that’s right, look at all those Linear layers! We should be able to get some benefit out of quantizing this model.
One Weird Trick — Dynamic Quantization
import torch.quantization
quantized_model = torch.quantization.quantize_dynamic(base_model, {torch.nn.Linear}, dtype=torch.qint8)
No, really, that’s it. Chapter done. Bye!
Oh, okay, if you really insist. But honestly, there’s not much more to it. Firstly, a caveat: quantize_dynamic only quantizes the weights ahead of time. The activations stay in floating point between layers and are quantized on the fly as each quantized layer runs. All we need to do is pass in the model we wish to quantize and a set of layer types that we wish to replace with quantized versions, in this case Linear. The function returns a new model, though you could pass the optional parameter inplace=True to mutate the original model rather than make a copy.
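If you want a feel for what "dynamic" means here, the following is a minimal pure-Python sketch, not PyTorch's actual implementation. The class and helper names are made up for illustration, and it assumes a simplified symmetric int8 scheme: weights are quantized once up front, while the activation scale is worked out fresh from each input at call time.

```python
# Illustrative only: DynamicQuantLinear, symmetric_scale and quantize are
# hypothetical names, and the symmetric int8 scheme is a simplifying assumption.

def symmetric_scale(values, qmax=127):
    """Scale that maps the largest magnitude in `values` onto qmax."""
    largest = max(abs(v) for v in values)
    return largest / qmax if largest > 0 else 1.0


def quantize(values, scale, qmax=127):
    """Round each float to the nearest integer step, clamped to [-qmax, qmax]."""
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


class DynamicQuantLinear:
    def __init__(self, weight_rows):
        # Weights are quantized once, ahead of time.
        flat = [w for row in weight_rows for w in row]
        self.w_scale = symmetric_scale(flat)
        self.q_weight = [quantize(row, self.w_scale) for row in weight_rows]

    def __call__(self, x):
        # The activation scale comes from this particular input: the "dynamic" part.
        x_scale = symmetric_scale(x)
        q_x = quantize(x, x_scale)
        # Integer multiply-accumulate, then a single float rescale per output.
        return [
            sum(qw * qx for qw, qx in zip(row, q_x)) * self.w_scale * x_scale
            for row in self.q_weight
        ]


weights = [[0.5, -0.25, 0.1], [0.05, 0.8, -0.6]]
x = [1.0, 2.0, -3.0]

exact = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
approx = DynamicQuantLinear(weights)(x)
max_error = max(abs(a - e) for a, e in zip(approx, exact))
print(exact, approx, max_error)
```

The outputs land close to the float results but not exactly on them, which is the trade you make for doing the bulk of the arithmetic in integers.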
Let’s save the model and take a look at the quantized size:
!mkdir t5
quantized_model.save_pretrained("t5")
!du -m t5
mkdir: cannot create directory ‘t5’: File exists
121 t5
Almost a 50% reduction in size! We can’t get all the way down to 4 times smaller, because only the Linear layers were swapped for 8-bit versions; the embedding table and everything else are still stored as 32-bit floats. Still, we’ve done pretty well for one line of code. Let’s do a very simple microbenchmark using both models in the transformers library summarization pipeline.
base_summarizer = pipeline("summarization", model=base_model, tokenizer="t5-small")
quantized_summarizer = pipeline("summarization", model=quantized_model, tokenizer="t5-small")
%timeit base_summarizer("From the very beginning, Regan was seen as having series potential. After the television film scored highly in the ratings, work began on the development of the series proper. Ian Kennedy Martin's idea was for the series to be mainly studio-based, with more dialogue and less action, but producer Ted Childs disagreed, and in consequence Ian Kennedy Martin parted company with the project. Childs produced it on 16mm film, a format that allowed for a much smaller film unit than videotape at that time. This made it possible to shoot almost entirely on location which helped give the series a startling degree of realism and to use film editing techniques which enabled him to give the show a heavy bias toward action sequences. The television play and the subsequent series were commissioned by Thames Television and produced by its film division Euston Films. It was originally broadcast on ITV between 2 January 1975 and 28 December 1978 at 21:00–22:00 on weekdays (usually Mondays), with repeated screenings at the same time until the early 1980s. The writers were given strict guidelines to follow: \"Each show will have an overall screen time (minus titles) of 48 minutes 40 seconds. Each film will open with a teaser of up to 3 minutes, which will be followed by the opening titles. The story will be played across three acts, each being no more than 19 minutes and no less than 8 minutes in length. Regan will appear in every episode, Carter in approximately 10 out of 13 episodes. In addition to these main characters, scripts should be based around three major speaking parts, with up to ten minor speaking parts")
1 loop, best of 3: 29.4 s per loop
%timeit quantized_summarizer("From the very beginning, Regan was seen as having series potential. After the television film scored highly in the ratings, work began on the development of the series proper. Ian Kennedy Martin's idea was for the series to be mainly studio-based, with more dialogue and less action, but producer Ted Childs disagreed, and in consequence Ian Kennedy Martin parted company with the project. Childs produced it on 16mm film, a format that allowed for a much smaller film unit than videotape at that time. This made it possible to shoot almost entirely on location which helped give the series a startling degree of realism and to use film editing techniques which enabled him to give the show a heavy bias toward action sequences. The television play and the subsequent series were commissioned by Thames Television and produced by its film division Euston Films. It was originally broadcast on ITV between 2 January 1975 and 28 December 1978 at 21:00–22:00 on weekdays (usually Mondays), with repeated screenings at the same time until the early 1980s. The writers were given strict guidelines to follow: \"Each show will have an overall screen time (minus titles) of 48 minutes 40 seconds. Each film will open with a teaser of up to 3 minutes, which will be followed by the opening titles. The story will be played across three acts, each being no more than 19 minutes and no less than 8 minutes in length. Regan will appear in every episode, Carter in approximately 10 out of 13 episodes. In addition to these main characters, scripts should be based around three major speaking parts, with up to ten minor speaking parts")
1 loop, best of 3: 16.6 s per loop
In addition to almost being half the size, the quantized model is almost twice as fast! So…why don’t we do this all the time? Are there no downsides? Well…it depends. We are losing information in our inference in a quantized model as our values cannot map to all the possible floating-point values that we find in the original model. So the chain of multiplications will be less accurate in our quantized model than in the original. You’ll need to check the new model against a reference dataset to determine the accuracy loss and whether that loss is an acceptable trade-off compared to the reduced storage demands and faster execution.
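As a rough sketch of what that check could look like, here is a tiny comparison harness. Everything in it is hypothetical: the two predict functions are toy stand-ins rather than real models, exact-match accuracy is an assumption (for summarization you would more likely use something like a ROUGE score), and the max_drop threshold is arbitrary.

```python
# Hypothetical harness: the "models" are toy callables, not real networks.

def accuracy(predict, dataset):
    """Fraction of (inputs, expected) pairs the predictor gets exactly right."""
    correct = sum(1 for inputs, expected in dataset if predict(inputs) == expected)
    return correct / len(dataset)


def compare_models(reference_predict, quantized_predict, dataset, max_drop=0.01):
    """Score both predictors on the same data and report the accuracy drop."""
    reference = accuracy(reference_predict, dataset)
    quantized = accuracy(quantized_predict, dataset)
    drop = reference - quantized
    return {
        "reference": reference,
        "quantized": quantized,
        "drop": drop,
        "acceptable": drop <= max_drop,
    }


def reference_predict(x):
    # Stand-in for the original float model.
    return 1 if x >= 0.5 else 0


def quantized_predict(x):
    # Stand-in for the quantized model: the input is snapped to a coarse grid first.
    return 1 if round(x * 4) / 4 >= 0.5 else 0


dataset = [(0.1, 0), (0.45, 0), (0.49, 0), (0.52, 1), (0.9, 1)]
report = compare_models(reference_predict, quantized_predict, dataset)
print(report)
```

The examples that flip are the ones sitting near a decision boundary, which is usually where a quantized model's lost precision shows up first.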
Other Quantizing Options Are Available
In addition to dynamic quantizing, PyTorch also offers static quantizing, where a trained model is modified to include observer modules and a selection of data is fed into the model. During inference on this data, the observers record the distribution of the activations and use it to pick the quantization parameters that best fit what they saw. This can produce even further space and time savings, especially with vision models like ResNet.
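To make the observer idea a little more concrete, here is a minimal pure-Python sketch. The RangeObserver class is a hypothetical stand-in (it is not one of PyTorch's observer classes), and it assumes the simplest possible strategy: track the minimum and maximum activation seen during calibration, then derive a scale and zero point for unsigned 8-bit values.

```python
# Hypothetical stand-in for an observer module; min/max tracking is an assumption.

class RangeObserver:
    def __init__(self):
        self.lo = float("inf")
        self.hi = float("-inf")

    def observe(self, activations):
        """Widen the recorded range to cover this batch of activations."""
        self.lo = min(self.lo, min(activations))
        self.hi = max(self.hi, max(activations))

    def qparams(self, qmin=0, qmax=255):
        """Turn the observed range into a (scale, zero_point) pair."""
        # Keep zero inside the range so that 0.0 maps onto an exact integer.
        lo, hi = min(self.lo, 0.0), max(self.hi, 0.0)
        scale = (hi - lo) / (qmax - qmin) or 1.0
        zero_point = int(round(qmin - lo / scale))
        return scale, max(qmin, min(qmax, zero_point))


# Calibration: feed a few representative batches through the observer.
observer = RangeObserver()
for batch in ([0.2, 1.5, -0.3], [2.25, 0.0, -0.75], [1.1, 0.4, 0.9]):
    observer.observe(batch)
scale, zero_point = observer.qparams()


def quantize(value):
    return max(0, min(255, round(value / scale) + zero_point))


def dequantize(q):
    return (q - zero_point) * scale


print(scale, zero_point, quantize(1.0), dequantize(quantize(1.0)))
```

Because the parameters are fixed after calibration, nothing needs to be measured at inference time, which is where the extra speed over the dynamic approach comes from. The flip side is that the calibration data needs to look like the data the model will actually see.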
However, for the best accuracy in your smaller model, you’ll want to investigate quantization-aware training (QAT). In this approach, the model fakes quantization during both the forward and backward passes of the training loop; while all the computations still take place with standard floats, values are rounded to the ones an integer representation could hold. You end up with a quantized model once training is finished, but one with typically higher accuracy than you can achieve with the dynamic or static approaches.
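Here is a toy sketch of the "faking" part, assuming a single weight, a fixed grid spacing of 0.25 and a straight-through gradient (all of these are illustrative choices, not what PyTorch does internally). The forward pass only ever sees the weight snapped onto the grid, while the optimizer keeps updating a full-precision copy.

```python
# Toy illustration: fake_quantize, the 0.25 grid and the data are all assumptions.

def fake_quantize(value, scale=0.25, qmin=-128, qmax=127):
    """Snap a float onto the integer grid but hand it back as a float."""
    q = max(qmin, min(qmax, round(value / scale)))
    return q * scale


# Fit y = w * x where the ideal w is 1.3, which is not on the 0.25 grid.
data = [(1.0, 1.3), (2.0, 2.6), (-1.0, -1.3)]
w = 0.0    # full-precision weight that the optimizer updates
lr = 0.05

for _ in range(200):
    for x, y in data:
        w_q = fake_quantize(w)     # the forward pass sees the quantized weight
        error = w_q * x - y
        grad = 2 * error * x       # straight-through: treat d(w_q)/d(w) as 1
        w -= lr * grad

final_weight = fake_quantize(w)
print(w, final_weight)
```

The float weight settles near the boundary between the two grid points either side of 1.3, and the weight you would actually ship is whichever of those it snaps to. The point is that training has already adapted to the rounding rather than having it imposed afterwards.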
Is It Worth It?
You might be wondering if you’re just better off training a smaller model rather than going to all this effort to compress larger models. In the recent paper, Train Large, Then Compress, there’s a good deal of evidence presented that transformer-based models really do benefit from this approach. Because larger models converge faster than smaller ones, you will likely get more accurate results by training a large model and compressing than if you spent the same compute time on a smaller model. So go forth and compress!
(and we’ll see you back here in the future for pruning models)
Further Reading
https://pytorch.org/docs/stable/quantization.html
https://arxiv.org/abs/2002.11794
May 3, 2020 · 3 minute
read
meeting the mayor
where did april go?
A summary of how things are going: I have worked from home now for over four years. And yet it was the closing week of April 2020 where I ordered enough tracksuit bottoms to last a week, because I cannot be bothered to dress properly any more. Standards have fallen, people. Also, how come March felt like one thousand years instead of a month and yet April has flown by so quickly?
Slowly trying to thin down my web presence this week by backing up a hosted server I rent that…hasn’t been doing anything of note for at least two years now. I really want to convert this blog to just being served from an S3 bucket, but given that the current set up has been running on Dreamhost for…18 years, well, let’s just say that there’s some inertia there. Also, this blog will be old enough to drink come August (though not in its jurisdiction). That’s absolutely terrifying. It also means that it’s been the same amount of time since I went to UNC. Ah, halcyon days.
Meanwhile, from what is apparently the Cincinnati 2020 reboot of Keeping Up Appearances, let me tell you how I met the ex-mayor of Cincinnati. It’s Friday evening, and I’m celebrating the end of the week with a drink. It’s a slightly larger than usual drink because it was the end of the bottle. There I am, downstairs in the basement watching an episode of The Sweeney whilst also reading the instructions for a GPU-accelerated dense vector search library. Yeah, I know you’re jealous; it’s a wild Friday night and no mistake. Anyway, having finished the bottle of Wild Turkey, of course the doorbell rings. Running up the stairs, slightly drunk, pulling off a sock as I go, I open the door and, yes, well there’s the former mayor, coincidentally my neighbour, with a package in hand. Now, despite it also being the two-year anniversary of me moving to the area…I have never actually met him before. So he’s introducing himself, and I go to shake his hand.
“Oh no, I don’t think we do that any more”
I snap back, making it very clear I’ve been drinking, apologising and basically expecting a Miranda-like situation where I lose my trousers to arrive at any moment. On the bright side, I have now met both sets of neighbours! But I can never speak to one set ever again. I guess I’m thankful it’s not the set that I’m helping to take down a tree in the summer?
Apr 26, 2020 · 2 minute
read
just run it through T5 and call it done
Well, as the horror continues to pile up, a small bit of personal good news: apparently my book is being translated into German. I’m working with the editors to make a few changes in the upcoming edition, hopefully including the already-in-English Chapter 9.5 plus some sections on quantizing models and image self-supervision, providing I can get them written by the end of May. The English language versions of those will end up on the GitHub repo as promised. No word on a second edition (which would probably see massive revisions of the test and audio chapters, given that PyTorch changed everything a month after the book went to the printers), but I’ll let you know if it happens!
Otherwise, a week that mainly consisted of the New Normal of Remaining Indoors, but with a small visit to Jungle Jim’s on Saturday (with a mask, but nothing as fabulous as this outfit). I needed more tea. Of course, when you go, you can’t just buy one thing, as you will end up with at least ten things that catch your eye even when you go with a list and a plan in mind. Reader, I bought all the Star Bars. And the Tunnock’s Caramel Wafers. Look, I’m reduced to watching episodes of Police 5 whilst drinking a cup of tea. Eating Scottish sweets that haven’t changed since 1952 is the least of my worries right now.
I continue to plan for a future in the hope that there will be a future. Not that I can talk about things much at the moment, but one plan will have a website, hopefully around the end of June. It’ll be fun! At least fun in a certain type of way. Not quite as much fun as Sophie Ellis-Bextor’s Kitchen Discos, but something for a certain type of person to play about with for the rest of the year. No clues, but it will take obsessions from my early life and mix them with current obsessions. Keep ‘em peeled!
I’m so sorry
Apr 18, 2020 · 1 minute
read
today is —day
tomorrow is —day
this concludes the shipping forecast. remain indoors
I think this week qualifies as my first meltdown? That’s what days of less and less sleep get you, I guess. Also: you cannot solve Kubernetes problems by thinking about them from 1am until 6am.
Anyway, I have had a long weekend of doing…almost nothing. It’s been quite nice, really. No baking or confectionery work. No long periods of staring at code. I even sat outside for a bit.
(Okay, I cheated by moving the baking to mid-week by virtue of having a quiet, but quite nice birthday on Wednesday. Not quite the birthday celebration we had planned, but that is pretty much 2020 so far…)
Next week: further adventures in Seldon Core. And maybe a masked-up visit to Jungle Jim’s…
(I don’t have a lot to say about being 41. Aside from being so old, obviously)