Mar 1, 2020 · 12 minute read
pytorch
sometimes it is easier to write when there's no deadline
gpt-2
(note, you can find a Jupyter Notebook version of this post here)
While I’m mostly happy with how the book turned out, bar some silly errors that should not have made it to print and needing about another six months to do it properly (although work would have precluded that, so…anyway), I was a little disappointed with how I handled text generation. It worked, that’s for sure, but it was little more than ‘run this program on this text, then run this script to transform the TensorFlow model into a PyTorch compatible format, and run this script to generate output’. And then, to top it all off, about a week after the book went to print, the repo that housed most of the code underwent a major change from pytorch-pretrained-BERT to its eventual name of transformers. A bit of a pain.
As a way of making that up to people, welcome to Chapter 9.5 - A Half-Chapter in Two Parts. In this part, we’ll take another look at text generation, but this time, we won’t leave PyTorch. Promise. In Part Two (or is that Chapter 9.75?), we’ll have a bit of a final look back at images. The common theme between both parts will be self-supervision and domain modelling. I don’t have an ETA for Part Two yet, but it’ll come, promise.
If you’re looking for a refresher on the Transformer architecture, then there’s some in Chapter 9 of my book, but more usefully, you could go here to read The Illustrated Transformer, and here for The Illustrated GPT-2.
Adding New Generation Tricks To GPT-2
Right, so if you remember in the book, we went on a jolly side-jaunt with P.G. Wodehouse. And that was all very fine and whimsical, but maybe we want something that shows off the capabilities of GPT-2 a little better, even if it’s really just doing most of the same thing under the covers.
Instead of Jeeves and Wooster, we’re going to generate tweets. And we’re going to take things a step further by adding a new “control code” to our fine-tuned GPT-2 model, so we can instruct GPT-2 that we specifically want to generate a new tweet. If we don’t add the control code, then we should just get a (mostly) standard GPT-2 output. And we can use this technique to add multiple control codes, so if you had different sets of synthetic data that you wish to generate, you can use those codes to determine which type to create.
And first…let’s go back to the standard thing we always do.
“Gee Brain, what are we going to do tonight?”
“The same thing we do every night Pinky. Write a new custom dataset and take over the world!”
Datasets
Don’t worry though, we won’t be doing anything too crazy with this Dataset.
Much.
import csv

import torch
from torch.utils.data import Dataset
from transformers import GPT2Tokenizer


class CSVTwitter(Dataset):
    def __init__(self, control_code, truncate=False, gpt2_type="gpt2", max_length=768):
        self.tokenizer = GPT2Tokenizer.from_pretrained(gpt2_type)
        self.tweets = []

        # This uses the same CSV of Sentiment140 that we created in Chapter 5
        with open('train-processed.csv', newline='') as csvfile:
            tweet_csv = csv.reader(csvfile)
            for row in tweet_csv:
                # row[5] is used as the tweet text; wrap it in the control code and end marker
                self.tweets.append(torch.tensor(
                    self.tokenizer.encode(f"<|{control_code}|>{row[5][:max_length]}<|endoftext|>")
                ))
        if truncate:
            self.tweets = self.tweets[:20000]
        self.tweet_count = len(self.tweets)

    def __len__(self):
        return self.tweet_count

    def __getitem__(self, item):
        return self.tweets[item]
Firstly, you might wonder why we’re chopping our strings at 768 characters. It’s a cautious cap rather than a hard limit of the model: GPT-2 can only attend to a fixed window of tokens (1024 for every released size), and 768 characters of English text will normally encode to fewer tokens than that. The number itself is borrowed from the hidden dimensionality of gpt2-small, the model we’re using in this chapter (the larger pre-trained models are gpt2-medium/1024, gpt2-large/1280, gpt2-xl/1600), but that dimensionality doesn’t restrict how long the input can be. Of course, because this dataset is only tweets, we’re never going to bump up against the limit, but I thought I’d include it so you know to be aware of the limitation.
You’ll also see that we’re injecting our <|tweet|> control code at the start of each entry, and the <|endoftext|> code at the end - this is actually a code that GPT-2 has already learnt during its initial training to signify the end of a piece of text. It’ll become useful later on in training when we pack our training tensors.
The last part of the dataset is encoding. This is similar to the encoding of text that we did back in Chapter 5, but with a small twist. Instead of a simple mapping of all words to a new dictionary, we are using a byte pair encoding tokenizer. This works in a different way to what we have seen before as it builds a dictionary by keeping track of common pairs of bytes and replaces them with a byte that is not present in the encoding.
For example, take the nonsense string:
aabaabdeaa
The first pass of the byte pair encoder would replace our aa strings:
AbAbdeA
A = aa
But note that we now have new byte pairs and so we can replace again:
BBdeA
A = aa
B = Ab
For building up a vocabulary from our data, the byte pair encoding in language models these days tends to work in the opposite direction; it starts out with a set of characters in that language, and through passes on the data, builds up subwords by finding the pairs present in the dataset, and then merging to find larger pairs, and so on. In this way, the tokenizer learns a vocabulary directly from the dataset itself and not from any manual input from an external source (like us).
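To make the toy replacement above concrete, here is a minimal sketch of that compression-style pass. The function name `toy_bpe_compress` and the choice of uppercase letters as the "unused" symbols are assumptions made for illustration; this is not how the GPT-2 tokenizer is implemented, just the pair-replacement idea in a few lines.

```python
from collections import Counter
from string import ascii_uppercase


def toy_bpe_compress(text):
    """Repeatedly replace the most frequent adjacent pair with an unused symbol."""
    table = {}
    symbols = iter(ascii_uppercase)  # assumption: the input contains no uppercase letters
    while True:
        # Count every adjacent pair of characters in the current string
        pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:  # nothing repeats any more, so stop
            break
        symbol = next(symbols)
        table[symbol] = pair
        text = text.replace(pair, symbol)
    return text, table


compressed, table = toy_bpe_compress("aabaabdeaa")
print(compressed)  # BBdeA
print(table)       # {'A': 'aa', 'B': 'Ab'}
```

The vocabulary-building variant used by language models runs the same counting step, but records each merged pair as a new subword instead of swapping in a placeholder symbol.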
Happily, we can use the BPE tokenizer that has already been trained on the dataset of GPT-2 and not have to worry about training it ourselves here (though if you’re looking to train on a new language, Huggingface’s tutorial on learning Esperanto will tell you everything you need to get started). We create a pre-trained version using GPT2Tokenizer.from_pretrained(gpt2_type), which will download the appropriate files for the version of GPT-2 we’re working with. We then encode the dataset and create tensors, returning a particular tensor within __getitem__() as normal.
In addition to the CSV-based Dataset, I’ve also included a different implementation that uses PyArrow to load in named columns from a parquet file. I just had a bunch of parquet-based datasets lying around so it was useful to make a class that could handle them as well.
We’ll build a DataLoader in our usual way:
DataLoader(dataset, batch_size=1, shuffle=True)
(the reason for batch_size being 1 is something we’ll come back to later)
Training
Okay, so how do we train this thing? Well, it turns out that it’s actually a lot simpler than you’d think. We already have a pre-trained model, so we’re just doing some fine-tuning (we won’t freeze layers here, but you can certainly experiment with it). But…don’t we need labels?
Training GPT-2 involves passing our input text into the transformer model…and training the model to get the text back as output. In this way, the model learns something of how text is structured, and eventually builds up a language model that can be used for generating further text. So our labels are the input text!
To get the model to produce anything resembling English or whatever language you’re training it on requires a gargantuan amount of text (OpenAI trained GPT-2 on 8 million webpages). But as we’re using a pre-trained model, all that hard work has been done for us, so we can get away with a much smaller dataset. We can create a pre-trained GPT-2 transformer with one line of code:
model = GPT2LMHeadModel.from_pretrained(gpt2_type)
As for our training loop, given that our labels are our input, all we’re really doing is:
outputs = model(input)
loss = loss_function(outputs, input)
loss.backward()
optimizer.step()
But there’s a slight catch. You remember that GPT-2 is big, right? Very big. It’s quite possible that you can’t fit all the parameters and all the gradient updates inside your GPU. I know I can’t, and I have a 1080Ti. There are various approaches we can use to get around this problem, like distributed training, or maybe gradient checkpointing (covered in Chapter 7).
However, there’s a simpler option we can use. What we’re going to do is accumulate our gradients for a number of batches and then do the updating every x batches instead of every batch. We’ll divide our loss updates by the accumulated_batch_size to average out the loss that we’re applying.
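As a small sanity check of the idea, the sketch below compares one backward pass over a full batch with several accumulated passes over smaller chunks. Everything in it (the tiny `Linear` model, the random data, the name `accumulation_steps`) is an assumption chosen for illustration and has nothing to do with GPT-2 itself; it only shows that dividing each loss by the number of accumulated chunks gives the same gradient as the averaged full-batch loss.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
data = torch.randn(8, 4)
target = torch.randn(8, 1)
loss_fn = torch.nn.MSELoss()

# Reference: a single backward pass over the whole batch of 8
model.zero_grad()
loss_fn(model(data), target).backward()
full_batch_grad = model.weight.grad.clone()

# Accumulated: four chunks of 2, each loss divided by the number of chunks
accumulation_steps = 4
model.zero_grad()
for x, y in zip(data.chunk(accumulation_steps), target.chunk(accumulation_steps)):
    loss = loss_fn(model(x), y) / accumulation_steps
    loss.backward()  # gradients add up in .grad until zero_grad() is called
accumulated_grad = model.weight.grad.clone()

print(torch.allclose(full_batch_grad, accumulated_grad, atol=1e-6))
```

Only the peak memory differs: each chunk's activations are freed after its own backward pass, so the large batch never has to sit on the GPU at once.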
We’re almost at the point of having the training loop sorted. But what’s that, Columbo?
You may have looked at the links to the illustrated Transformer articles and discovered that GPT-2 will ‘see’ all of its input at once. And we’re sending in encoded tensors of 140-character strings. That’s leaving a lot of our input set to…basically zero. Is that going to be great for training? Probably not, as we’re not going to get a lot of information flowing forwards and backwards through our network. Enter…pack_tensor()!
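def pack_tensor(new_tensor, packed_tensor, max_seq_len):
    # Nothing packed yet: start a new packed tensor from this entry
    if packed_tensor is None:
        return new_tensor, True, None
    # Adding this entry would overflow: hand back the packed tensor and the leftover entry
    if new_tensor.size()[1] + packed_tensor.size()[1] > max_seq_len:
        return packed_tensor, False, new_tensor
    else:
        # Prepend the new entry ([:, 1:] drops the first token of what was already packed)
        packed_tensor = torch.cat([new_tensor, packed_tensor[:, 1:]], dim=1)
        return packed_tensor, True, None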
This is a very simple method that just tries to fit as many pieces of text into an input tensor as possible. This is why we created the DataLoader with a batch_size of 1, as in our training loop, we’ll simply loop over and over the data until we’ve stuffed a tensor, and then push it through our model. Of course, this breaks the relationship between batches that come from the Dataset and what we send to the model for the training, so we add accumulating_batch_count as a counter to work out when we need to train on our accumulated gradients.
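The packing loop is easier to picture with fake data. The generator below is a simplified stand-in for the idea, not the post's own function: `pack_stream` is a hypothetical name, it appends rather than prepends, and it carries the entry that did not fit over into the next packed row. The token ids are random and only the shapes matter.

```python
import torch


def pack_stream(entries, max_seq_len):
    """Greedily concatenate 1 x N token tensors into rows of at most max_seq_len tokens."""
    packed = None
    for entry in entries:
        if packed is None:
            packed = entry
        elif packed.size(1) + entry.size(1) > max_seq_len:
            yield packed      # full: this is what would go through the model
            packed = entry    # the entry that did not fit starts the next row
        else:
            packed = torch.cat([packed, entry], dim=1)
    if packed is not None:
        yield packed          # flush whatever is left at the end of the data


# Five fake "tweets" of 30 tokens each, packed into rows of at most 100 tokens
fake_tweets = [torch.randint(0, 50257, (1, 30)) for _ in range(5)]
packed_lengths = [row.size(1) for row in pack_stream(fake_tweets, max_seq_len=100)]
print(packed_lengths)  # [90, 60]
```

Five entries from the DataLoader turn into two forward passes here, which is why the number of optimizer steps can no longer be read off the length of the Dataset.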
You’ll also notice in the train() code below that instead of our normal pattern of:
outputs = model(input)
loss = loss_function(outputs, input)
We’re actually doing:
outputs = model(input, labels=input)
loss = outputs[0]
There’s nothing too nefarious going on here; the GPT-2 model simply has code inside it that calculates the loss to make things easier. It’s just a simple CrossEntropyLoss as we’ve seen in previous chapters.
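One detail worth spelling out: passing the input as its own labels does not mean the model is scored on copying each token. The language-model head compares the prediction at each position with the token that comes next, so the labels are shifted by one before the cross-entropy is taken. Below is a minimal sketch of that calculation on random tensors; `causal_lm_loss` is a hypothetical helper written for illustration, not the library's code.

```python
import torch
import torch.nn.functional as F


def causal_lm_loss(logits, input_ids):
    """Cross-entropy between each position's prediction and the token that follows it."""
    shifted_logits = logits[:, :-1, :]   # predictions made at positions 0 .. n-2
    shifted_labels = input_ids[:, 1:]    # the tokens that actually came next
    return F.cross_entropy(
        shifted_logits.reshape(-1, shifted_logits.size(-1)),
        shifted_labels.reshape(-1),
    )


# Fake data: a batch of 1 sequence, 6 tokens, vocabulary of 10
torch.manual_seed(0)
input_ids = torch.randint(0, 10, (1, 6))
logits = torch.randn(1, 6, 10)
loss = causal_lm_loss(logits, input_ids)
print(loss.item())
```

The last position has nothing to predict against and the first token is never a target, which is why a sequence of n tokens contributes n - 1 terms to the loss.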
Our optimizer and learning rate also come from the transformers library, and we’re using the AdamW (Adam + Weight Decay) optimizer with a warmup and linear decay (you can see alternatives at Huggingface’s docs page). Plus we also include the ability to save a set of weights at the end of an epoch.
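For a feel of what "warmup and linear decay" does to the learning rate, here is a plain-Python sketch of the multiplier such a schedule applies to the base rate. The function name and the step counts are made up for illustration; it mirrors the general shape of the schedule and is not the library's implementation.

```python
def warmup_linear_multiplier(step, warmup_steps, total_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)  # ramp up from 0 towards 1
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))  # ramp back down to 0


base_lr = 2e-5
for step in (0, 50, 100, 550, 1000):
    multiplier = warmup_linear_multiplier(step, warmup_steps=100, total_steps=1000)
    print(step, base_lr * multiplier)
```

The rate climbs for the first 100 steps, peaks at the base value, and then falls in a straight line to zero at the final step.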
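import os

import torch
from torch.utils.data import DataLoader
from tqdm import tqdm
# Newer transformers releases deprecate this AdamW in favour of torch.optim.AdamW
from transformers import AdamW, get_linear_schedule_with_warmup


def train(
    dataset,
    model,
    tokenizer,
    batch_size=16,
    epochs=4,
    lr=2e-5,
    max_seq_len=768,
    warmup_steps=5000,
    gpt2_type="gpt2",
    device="cuda",
    output_dir=".",
    output_prefix="wreckgar",
    test_mode=False,
    save_model_on_epoch=False,
):
    model = model.to(device)
    model.train()
    optimizer = AdamW(model.parameters(), lr=lr)
    # NOTE: with num_training_steps=-1 the schedule's multiplier appears to clamp to zero
    # once warmup ends; pass an estimated step count if you train past warmup_steps
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=warmup_steps, num_training_steps=-1
    )

    train_dataloader = DataLoader(dataset, batch_size=1, shuffle=True)
    accumulating_batch_count = 0
    input_tensor = None

    for epoch in range(epochs):
        print(f"Training epoch {epoch}")
        for idx, entry in tqdm(enumerate(train_dataloader)):
            (input_tensor, carry_on, remainder) = pack_tensor(entry, input_tensor, max_seq_len)

            # Keep packing until the tensor is full or we run out of data
            if carry_on and idx != len(train_dataloader) - 1:
                continue

            input_tensor = input_tensor.to(device)
            outputs = model(input_tensor, labels=input_tensor)
            loss = outputs[0]
            # Gradients accumulate across calls; the loss is not divided by batch_size here
            loss.backward()

            # Only update the weights once every batch_size packed tensors
            accumulating_batch_count += 1
            if (accumulating_batch_count % batch_size) == 0:
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad()
                model.zero_grad()

            # Start the next packed tensor from the entry that did not fit (None if all fit)
            input_tensor = remainder
        if save_model_on_epoch:
            torch.save(
                model.state_dict(),
                os.path.join(output_dir, f"{output_prefix}-{epoch}.pt"),
            )
    return model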
Generating Text
For generating text from our fine-tuned model, there are multiple approaches that we could use, including beam search, top_k filtering, and the one we’re going to use: nucleus sampling (or top_p filtering). We take our input, in this case our new control code <|tweet|>, and then feed that into the model to generate a new sequence. But all we care about is the next word, and in particular, the probabilities of all the possible words that the model predicts should appear there.
Of course, lots of words that the model may predict will not make sense, and that’s where we can bring in nucleus sampling (or top_k or any other approach). In this approach, we sort the probabilities in descending order and add them up until the running total (the cumulative distribution function) rises above an adjustable hyperparameter, p, which is normally set between 0.7 and 0.9. There’s another parameter, temperature, which can be used to scale the model’s scores before they’re turned into probabilities and summed up into the CDF.
Once the CDF is formed, we eliminate everything that falls outside of our p by setting its score to -Infinity. We’re not messing around here. Note that as we’re doing this by summing the highest probability selections first, it’s possible that if there are a few high probability choices, they’ll be the only ones present. And that makes sense if you think about sentences like:
The dog lifted up its ____
Possible options here could include paw, tail, tongue. You’d expect paw or tail much more than tongue. In this way, our sampling feels more natural, while still providing the possibility for surprise when probabilities are more spread out.
Most of the code here is taken from Huggingface’s run_generation.py script.
Once we have our next word, we loop back around to the start, but this time we feed in the sentence with the new word added and choose the following word in the same way. We continue until we either reach entry_length or the model generates a <|endoftext|> marker. And then it’s back to the outer loop to generate our next sentence until we’ve generated the requested number of sentences.
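Before the full function, here is the filtering step on its own, using the dog example from earlier. This is a toy sketch in plain Python: the `nucleus` helper and the probabilities are invented for illustration, whereas the real code works on the model's raw scores over the whole vocabulary.

```python
def nucleus(probabilities, top_p):
    """Keep the smallest set of most-likely words whose total probability exceeds top_p."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = {}, 0.0
    for word, prob in ranked:
        kept[word] = prob
        cumulative += prob
        if cumulative > top_p:  # the word that crosses the threshold is still kept
            break
    total = sum(kept.values())
    return {word: prob / total for word, prob in kept.items()}  # renormalise to sum to 1


# Made-up probabilities for "The dog lifted up its ____"
next_word = {"paw": 0.55, "tail": 0.30, "tongue": 0.10, "spreadsheet": 0.05}
print(nucleus(next_word, top_p=0.8))
```

With top_p at 0.8 only paw and tail survive; raising it to 0.9 lets tongue back in, and the next word is then sampled from whatever is left.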
import torch
import torch.nn.functional as F
from tqdm import trange


def generate(
    model,
    tokenizer,
    prompt,
    entry_count=10,
    entry_length=100,
    top_p=0.8,
    temperature=1.,
):
    model.eval()
    generated_num = 0
    generated_list = []
    filter_value = -float("Inf")

    with torch.no_grad():
        for entry_idx in trange(entry_count):
            entry_finished = False
            generated = torch.tensor(tokenizer.encode(prompt)).unsqueeze(0)

            # Using top-p (nucleus sampling): https://github.com/huggingface/transformers/blob/master/examples/run_generation.py
            for i in range(entry_length):
                outputs = model(generated, labels=generated)
                loss, logits = outputs[:2]
                # Only the scores for the final position matter for picking the next token
                logits = logits[:, -1, :] / (temperature if temperature > 0 else 1.0)

                sorted_logits, sorted_indices = torch.sort(logits, descending=True)
                cumulative_probs = torch.cumsum(
                    F.softmax(sorted_logits, dim=-1), dim=-1
                )

                # Shift the mask right so the token that crosses top_p is kept
                sorted_indices_to_remove = cumulative_probs > top_p
                sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[
                    ..., :-1
                ].clone()
                sorted_indices_to_remove[..., 0] = 0

                indices_to_remove = sorted_indices[sorted_indices_to_remove]
                logits[:, indices_to_remove] = filter_value

                next_token = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)
                generated = torch.cat((generated, next_token), dim=1)

                if next_token in tokenizer.encode("<|endoftext|>"):
                    entry_finished = True

                if entry_finished:
                    generated_num = generated_num + 1
                    output_list = list(generated.squeeze().numpy())
                    output_text = tokenizer.decode(output_list)
                    generated_list.append(output_text)
                    break

            # Ran out of length without an end marker: close the entry ourselves
            if not entry_finished:
                output_list = list(generated.squeeze().numpy())
                output_text = f"{tokenizer.decode(output_list)}<|endoftext|>"
                generated_list.append(output_text)

    return generated_list
Example Output
And here’s some output of calling generate on our trained model.
"<|tweet|>Casa the fifth Monday afternoons in the summer. Stay for one more - you'll be much better at finding a workplace than you would at the office.\n\nThe Hours\n\n14:00 - 15:00, Hot and Cold\n\n18:00 - 19:00, Cafe Oktoberfest\n\n19:00 - 21:00, More Information<|endoftext|>",
'<|tweet|>Tweet what you like.<|endoftext|>',
'<|tweet|>Sigh. Hope to see ya in there.<|endoftext|>',
'<|tweet|> | The Walking Dead ends, '10 hours after everybody gets killed! I'm sick of zombies. pic.twitter.com/tsxhXdGLuGx.<|endoftext|>'
Further Techniques & Reading
Huggingface
Better Language Models and Their Implications (GPT-2)
Applying BERT-based models in Search
How To Sample From Language Models
Feb 23, 2020 · 5 minute read
San Francisco
All the hotels
And the drinking
So, so much
Cheery electoral forecasting
San Francisco visit clichés: they always start with the feeling of optimism, soon ground into the dirt after a day’s exposure to the city, and the final leaving always seems a relief. This trip was no different; an incredibly pretty journey into the city centre on a brand spanking new BART train, and then as I was leaving my hotel for the very last time, I witnessed a Starbucks employee have a flashlight thrown at her face by a belligerent man, who was then chased off by her colleague brandishing a broom.
It’s difficult to know what to do; they were across the street and clearly didn’t need any more help or random people coming up to them, so I got in the waiting car and headed off to the vicinity of the airport, staying for one further night in a Holiday Inn Express that seemed to be equidistant and opposite from the Holiday Inn Express I stayed in last February.
Also, I may have to give up alcohol for Lent. This year’s company get-together was a little more restrained than the previous one with its Tiki Bar party. But trust me, we made up for that in the evening. Round and round we went, on a tasting tour of various bourbons from the country, explaining the cultural aversion to rice pudding that many British people of A Certain Age harbour, and that a prison that serves tea, but with no milk and only Rich Tea biscuits is one of our visions of Hell itself. It was a fun few days; I was reminded of that odd period in primary school where all my friends were shunted off to a different class and I discovered pop music, Smash Hits and became one of the cool kids. But with a lot more drinking and inadvertently confusing people as the Two British Ians sat next to each other quite a bit.
And then, of course, the actual reason for the trip itself. The days of talks, learning what everybody else in the company is getting up to, and what we’re planning to do this year. Obviously, I come to these things with a culturally cynical eye. But in a remarkable coincidence, one of the Play For Todays I watched on the plane was Instant Enlightenment, including VAT where Simon Callow puts on an American accent and makes everybody go through a clear knockoff of est. Only Dot Cotton gets out intact. And well…you can’t deny that it has an effect (stares at 2016). It was good to see everyone, nice to have our work congratulated, and I’m looking forward to developing a few new models in the very near future (and I spent Friday night starting instead of doing something more appropriate like going to bed early for my 3am start on Saturday).
(One of my thoughts for the year - it’s absolutely fine to make neural models that are one-offs and have no real use beyond their initial one. I made a fine-tuned model based on DeOldify over Christmas for one purpose; I would defer to the new DeOldify for further colourization work, but I think I did okay with that little model I made for what I needed it for. And hopefully you’ll see some of that work end up here on the blog, or in my book repo as I fix some of the code examples that have broken or become obsolete through new PyTorch versions)
I did fulfil one long-standing SF desire on this trip - I finally made it to Wursthall, J. Kenji López-Alt’s restaurant. I can report that the Korean hot chicken bites are very good, and the Impossible sub is very meat-like, though it was also covered in approximately 8162 mushrooms, which reduced its appeal somewhat. I do hope to go back and try other bits of the menu. Anyway, that almost just leaves Lazy Bear as the last place there that I really want to try. But it’s not one you can do on a work trip, and I try to avoid the place otherwise. But one day, we’ll go and do the city in a more tourist vein. Maybe I won’t hate it so much.
March is coming. Super Tuesday. My first ever vote in a Presidential primary. The inevitability of current polling and how delegates get proportionally allocated. Last time it ran on for ages, despite it being mathematically improbable that the leader would change in the remaining states (and it didn’t). This year? We may get to that improbable stage by early March and have to deal with a zombie race all the way to the convention itself, the spectre of 1972 hanging over in multiple ways. Still, what could possibly go wrong, eh?