It’s the first quiet weekend since, maybe February? We have no schedule, no appointments, or any plans. Maeryn has just gone down for an unexpected but solid afternoon nap. Tammy and I meet in the kitchen.
What are we supposed to do now?
Eventually, we’ll get used to it. And then Maeryn will stop sleeping in the afternoon…
We did, however, all go out to dinner…in a restaurant that we haven’t set foot in for four years. Plenty of ‘roadside delivery’ during that time, but we haven’t had a meal there since the start of the pandemic. Also, apparently Maeryn likes Bhangra music, rocking out in her little high chair while eating paneer.
With the release of Llama3 this week, I’ve been toying with the idea of a series entitled: “Let’s look at old papers and replace ChatGPT3 with Llama3-8b-instruct and see what happens!” I spent part of Friday night[^1] getting the ADaPT paper working, which took about five minutes, and then two hours attempting to work out why the WebShop evals weren’t working for the full 100 traces before giving up after staring at the mess of Java and Python that comprises the benchmark. So the tl;dr is: I saw ADaPT work with Llama3 for several traces, but can’t actually report on how it compares to the original ChatGPT implementation. Promising, though.[^2]
[^1]: Don’t worry, we had already watched an episode of Pole To Pole, so archive television had been slotted in!
[^2]: Although I will say that I have some fundamental objections to the functions they make available to the planner/agent LLMs - I don’t think SimpleMatch is ever going to return something useful in the WebShop context - I’d replace it with a very quick and dirty embedding function to give the agent a chance of returning candidates to the planner, even if they end up not being a perfect fit.
So, I like this new LLM2Vec paper which presents a handy recipe for taking a decoder-only LLM and turning it into a strong embedding model - to the point where you can make a Mistral-7b model top the MTEB benchmark quite easily (and I think there’s probably a little headroom for more improvement if you used a slightly more complicated fine-tuning regime than SimCSE). But I don’t think it quite manages to answer the question it poses in the abstract: ‘why is the community only slowly adopting these models for text embedding tasks?’
And I think there is a bit of a disconnect between what the Search/IR community does with embedding models and the research community here. Consider a relatively standard use case of vector search on a knowledge base; we are going to be making embeddings for tens of thousands, if not millions of documents. That’s a lot of embeddings.
In order for this to be practical, we need models that are good and fast. I can, right now, pull out a BGE-base model that is a tenth of the size of the smallest model tested in the paper (Sheared-LLaMA-1.3B), obtain a higher MTEB score, and throw batches through the model in under 50ms. And if I’m willing to spend a bit of time optimizing the model (like maybe an hour or two), I can make those batches fly through in under 10ms. On T4/L4 GPUs. I just can’t do that with billion-parameter scale LLMs without spending a lot of time and money on beefier GPUs and complicated optimizing regimes.
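To put rough numbers on that, here is a back-of-the-envelope sketch of how per-batch latency turns into wall-clock time for a whole knowledge base. The corpus size (one million documents) and the batch size (32) are my assumptions for illustration, not measurements; the 50ms and 10ms figures are the upper bounds mentioned above, and the function name is made up.

```python
import math


def embedding_minutes(num_docs: int, batch_size: int, batch_latency_ms: float) -> float:
    """Estimate wall-clock minutes to embed num_docs on a single GPU,
    processing one batch at a time with a fixed per-batch latency."""
    num_batches = math.ceil(num_docs / batch_size)
    total_seconds = num_batches * batch_latency_ms / 1000.0
    return total_seconds / 60.0


# Assumed workload (hypothetical): one million documents, 32 per batch.
NUM_DOCS = 1_000_000
BATCH_SIZE = 32

# Per-batch latencies taken from the figures quoted in the text.
for label, latency_ms in [("out of the box", 50.0), ("optimized", 10.0)]:
    minutes = embedding_minutes(NUM_DOCS, BATCH_SIZE, latency_ms)
    print(f"{label}: {minutes:.1f} minutes at {latency_ms:.0f}ms per batch")
```

Under those assumptions, a million documents is about 26 minutes of GPU time at 50ms per batch and about 5 minutes at 10ms. Plug in whatever latency you actually measure for a billion-parameter model and the gap becomes obvious quite quickly.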
So, I like the paper. It’s really good in terms of the recipe provided and all the details of the experiments performed, but for the moment, I’m sticking with our old BERT-based friends.
Apr 14, 2024 · 2 minute read
it was a lot of drywall, okay important questions about a 35-year-old-show
As I’m turning 45, I can say that the thing I’m most excited about is that this week I will get rid of the last bin bag containing parts of the garage ceiling. It’s only been five years. Important Dad Goals!
Back to the nostalgia well this week, as I have finally got around to watching Michael Palin’s Around the World in 80 Days (I was too young to stay up to watch it on first broadcast). You can clearly see that the time limit was the first thing Palin dropped from every documentary following - the pace is relentless and most of the time he doesn’t even get a chance to see the new country he’s in. The worst example is Singapore, where he basically lands and then instantly gets on another boat (to catch up with a ship that has already sailed (!)). It also has that weird issue with endings that a lot of multi-part UK documentaries of the time did; I have absolutely no idea why the Reform Club wouldn’t let him film on his return, but it makes for a bizarrely downbeat ending.
Having finished the series, I did wonder about whether he’d be up for a remake in 2028 (age permitting). Some things would be a lot easier - almost everybody has the internet at their fingertips these days, but I wonder if some of the routes that only barely existed in 1988 would still be viable. At least he wouldn’t have to suffer Pacers when he got back to Britain this time…
Finally, I did G O O D N U M B E R S this week with a post on LinkedIn. I wasn’t really expecting almost 2,000 people to read my complaints about the LLM2Vec paper, but there we are. I will probably copy the text over to here later in the week, because it’s nice to have as much as possible of my long-form writing over here rather than on somebody else’s platform[^1].
[^1]: It is amusing to think that I currently have one of the longest-running blogs still going on the net…
Mar 31, 2024 · 2 minute read
caterpillars must die! kevin mccloud unleashed unhelpful checkout
And in time-honoured tradition, a catch-up bullet post!
Of all the caterpillar cakes, we feel that Tesco’s Slinky is the worst, made with little care and with a fondant face that borders on the deranged. Morrisons’ Morris put in a decent showing, though!
The houses at Graven Hill are a great advertisement for the case for planning. Most of the self-builds resemble office blocks (with larch cladding, obviously), with a few totally bizarre choices - yes, I guess you can build a Carolina blue beach house in the middle of Bicester…but should you? Really? Still, respect to the house with the 40ft metal giraffe in the driveway.
You’ll be surprised just how happy a small child can be with a chair that looks like a lion. And possessive of it, too!
The South Bank was weird this time around…I found something was odd, something that I couldn’t really describe, and I didn’t want to be there that much…
I’m convinced that all the self-checkout systems in UK supermarkets are designed specifically to be user-hostile. Trying to simply get out of Sainsbury’s was an event.
I wonder how often the vocal tracks on the bus tours are re-recorded?
If I can go all “middle-class parent” for a moment, the gb Pockit+ All-Terrain is an amazing buggy. It folds up so small you can put it in a backpack! It’s light and manoeuvrable enough that you don’t feel like you’re being a pain on the Underground, and Maeryn seems to love being in it for the moment. 10/10!
I miss the New York Bloomer from Pret (I know they have something similar in roll form now, but it’s not quite the same).
It’s weird watching linear broadcast television again.
Hopefully, Maeryn doesn’t get too many ideas from our surprise upgrade on the flight back home. It’s not always going to be three-course meals and seats that can lie flat, I’m afraid!
The great thing about the first hour of the BBC’s Election ‘97 coverage is the absolute glee that kicks in as soon as the exit poll drops. Peter Snow gets out his shiny new toys to show the landslide knocking Tories out all over the country, and then Paxman just butchers Michael Portillo for five minutes or so, gets interrupted by an OB with Paddy Ashdown, and then they come back so Paxman can go for another five at him. Also, at that stage, they didn’t think Portillo was going to lose his seat, so the night only goes downhill for him from there.
And then Frank Skinner interviews John Major and Tony Blair lookalikes, where he gets them to dance together while Skinner sings ‘Rock Around The Clock’. There’s nothing quite like a live General Election broadcast…
Weirdly, I also found myself down another nostalgia hole, one that has been somewhat time-locked due to the author. I finally broke down and read Planetary (it was essentially $4 for the entire series digitally on Amazon over Christmas and I thought ‘why not?’). It made me think, and even dream some thoughts on comics. One of Ellis’s problems (shared with somebody else a little more current) is that, in the end, he’s too aloof and distant, too afraid of the cringe[^1] to really land a lot of his work, and doesn’t have Moore’s skills to back him in the tour of the 20th century that Planetary starts out as. Still, it was better than Ministry of Space at the very least.
I then got curious, and yes, Ellis is still blogging away. Like nothing happened. There are vague allusions to comics work…but who would publish him these days?
[^1]: Grant Morrison, on the other hand, is 100% cringe, but can sell the hell out of a line like “REALITY DIES AT DAWN!” that nobody else really can.
This week, I basically implemented an entire RLAIF pipeline that can scale up to thousands of topics, generating scads of synthetic data, all with open-source software and models, and it took four days from a standing start to the first 7bn-parameter model rolling off the assembly line. And it would have been three if I had had the courage to kick off the training on Wednesday evening instead of Thursday morning. Oh, and it works, too.
It is nice to actually be building things again.
Anyway, we’re now officially into “it is X days until we travel so I will now start making noises about suitcases and packing like an absolute weirdo seeing as how we still have over two weeks until we fly home” season. It is a fun game that I’m sure Maeryn will be absolutely sick of by the time she’s 4. I also need enough space to bring back around 32kg of Cadbury Mini Eggs, so you can see my need for upfront planning. And also a reserve suitcase. We’ll be going to London as well, so maybe I need a Pelican case to cover the eventuality of visiting the booksellers at the South Bank?
By the way, if anybody wants me to bring something back overseas, now is your chance to speak out before the cases get filled with soap for my family back home…
Feb 25, 2024 · 1 minute read
daycare plague incubators
Another week in the series of: “Did Maeryn get sick and pass it on to Daddy, or was it the other way around?” I personally have to blame the person who spends all her time playing with other small infection vectors (okay, I guess infants is a more ‘appropriate’ term) rather than the one who works from home. Just saying.
Anyway, will try to get up a longer post before the end of February, but March and the return to the UK awaits!