Mar 23, 2025 · 2 minute read
a ways to go, then · dead mall watch · saying good things about magic pie is not allowed
Quick round up!
I’ve now seen Eno four more times, and my thoughts from back in January haven’t changed. I’d also add that watching the film multiple times across a number of days really does highlight that the ‘spine’ running through the film is just too constant for the gimmick to work. The first time you rewatch, after a couple of months, you think ‘fair enough, there were a couple of new things there’. But after the third showing in 24 hours, it’s more ‘for god’s sake, that bit with taking photos of the beetles is in every version???’ I want a 90 minute Cybernetic Eno! Or 60 minutes on James! (okay, I might not want that all that much, but it’d at least be an interesting difference). It probably works better if you just watch a version once a year or so.
Meanwhile, there’s attacking sacred cows, and then there’s Marcello’s brave reassessment of Be Here Now. I can’t go that far, though I’ve always felt Noel wanted to be more musically diverse than you’d think…but it only showed up here and there across the Oasis albums because he always bottled it. The only thing I can listen to from it these days is D’You Know What I Mean?, which I will still stan for, but the rest of it…just no.
April will be “GRPO month!” at Snappish Towers. I have two main experiments to work on; one will not be a surprise…the other probably won’t either, but if I can get it to work, it will be an interesting new approach to the particular problem.
It looks like the two cities I’ve lived in while being in the US have another thing in common: malls called “Northgate” that have failed. Right now, the Cincinnati one has a single occupant in the food court, and the anchor stores are basically Torrid (closing in a few days) and Hot Topic (which, admittedly, looks like every other Hot Topic and may be the last tenant standing).
Adventures in toddlerdom: whereupon Daddy, Maeryn, and Helvetica all make poor choices which end up with a cat walking around with melted chocolate all over its fur. Most of the bad choices were mine, but I’m still looking side-eye at the supposedly intelligent cat that decided that was a great place to sleep…
Mar 23, 2025 · 2 minute read
two two two · full spectrum toddler powers!
Obviously, the big news of the week is that our tiny little dictator, I mean, lovely little Maeryn, is now 2 years old! We celebrated with cake1, balloons, presents, and the all-important Sunday Roast. Well, she’s older now, so it’s time for her to really understand her heritage. Next week: Dennis Potter. Or maybe she just plays with her new Little People Barbie Dream House for the moment…
The Atlantic had an exposé this week about how Meta used LibGen to train Llama models, along with a little search bar to see if a book or author is present in LibGen (and thus whether the text(s) were likely used to train a bunch of LLMs). I realized late in the day that I would likely be in the database, and lo…I have 3 entries, including the German version of my PyTorch book. I am mostly fine with this, and I’m more amused that some of my writing has gone into training Meta’s LLMs how to write PyTorch code, PyTorch being a Meta-owned project. Of course, it’s easy for me to say that, being both a wizened anti-copyright person that came of age during the Copyleft Wars of the 90s2, and somebody that doesn’t make their main income by writing books. I can see exactly where others are coming from, but I also don’t want to restrict us to a world where only OpenAI and Anthropic have the money to build and research models because nobody else can afford the usage fees on Common Crawl.
(also, I note that the AI narration that The Atlantic sticks on the article was almost certainly powered by copyrighted content too…so…y’know)
Dear older Maeryn, if you’re reading — I’m sorry I went to the shop and bought milk chocolate frosting instead of making a whipped milk chocolate ganache. You are 2 right now. I promise there will be fancier ones later… ↩︎
Plus, for old hands of the blog, remember when I was interviewed by the New York Times when NTL (!) threatened to cut off my internet access because I put MP3s up on here? ↩︎
Mar 15, 2025 · 10 minute read
embeddings · search · chunking · litany of failures
Okay, so this month we’re going to talk about something that doesn’t really work. You’ve seen all those amazing arXiv papers with their fancy new discoveries? Bah, that’s easy. The true apex of research is talking about the things that failed. So welcome to a litany of failure!
Anyway, this is an idea I’ve been wanting to try out for at least a year. I have entries in my notes file that go back to 2023, and really I just wanted to get it out of my head, using my co-worker, Claude Sonnet, to determine whether there was any promise in the technique.
Welcome, then, to ‘Gist’, an attempt to improve the search relevance of chunking in vector search!
(if you’d like a précis of searching with vectors, head over here and then come back. The tl;dr: use a model to create vectors of your documents, then make a vector out of your query and select the ‘closest’ document vectors to that query vector)
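For the impatient, here’s a minimal sketch of that flow, assuming sentence-transformers and the arctic-embed model I use later in this post; the documents and query are just toy examples rather than anything from the real experiment.

```python
# A minimal sketch of vector search: embed documents, embed a query,
# and pick the document whose vector is closest by cosine similarity.
# Model choice and data here are purely illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-xs")

documents = [
    "The BT Tower is a communications tower in Fitzrovia, London.",
    "The Eiffel Tower is a wrought-iron lattice tower in Paris.",
]

doc_vectors = model.encode(documents, normalize_embeddings=True)
query_vector = model.encode("post office tower height", normalize_embeddings=True)

# With normalized vectors, cosine similarity reduces to a dot product.
scores = doc_vectors @ query_vector
best = int(np.argmax(scores))
print(documents[best], scores[best])
```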
Problems With Embeddings & Chunking
When you’re making an embedding of a document using a ‘standard’ embedding model, one of the issues you’ll run into is that most of these models have a context length of 512 tokens or about 400 words.1 This has given rise to the cottage industry of ‘chunking’, where a document is split somehow into small chunks and each of them is vectorized. This way, your searches can dig deep into the documents and hopefully get really good results as opposed to just searching across the first 400 words.
However…
Let’s consider a dumb-but-common chunking method - splitting on sentences. Here’s a simple example:
The BT Tower is a grade II listed communications tower in Fitzrovia, London, England, owned by BT Group. It has also been known as the GPO Tower, the Post Office Tower, and the Telecom Tower. The main structure is 581 feet (177 m) high, with aerial rigging bringing the total height to 620 feet (189 m).
This gives us three chunks:
The BT Tower is a grade II listed communications tower in Fitzrovia, London, England, owned by BT Group.
It has also been known as the GPO Tower, the Post Office Tower, and the Telecom Tower.
The main structure is 581 feet (177 m) high, with aerial rigging bringing the total height to 620 feet (189 m).
If you were embedding these three chunks, and you were also embedding lots of other documents in the same way, it’s possible that your search is going to run into problems. If you have a query like post office tower height, you’re going to want that last chunk to score very highly. But that sentence, stripped from the rest of the paragraph, has no link to the concept of the tower whatsoever, and so neither does its embedding. Instead, what you’re likely to get is a response of all the chunks across your search index that mention height. Terrible!
The easiest fix to this, and one that would likely work well in this particular case, is to split on paragraphs instead of sentences, so the embedding would have the context of the tower and the height in the same vector. But imagine a longer document, and you can see that you are likely to start missing context clues in your chunks, which will have a big impact on your search results.
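(If you’re wondering how little code ‘chunking’ actually involves, here’s a rough sketch of sentence- versus paragraph-level splitting; real pipelines tend to use proper sentence splitters and token-aware windows rather than these regexes.)

```python
# A rough sketch of two naive chunking strategies, to show how much context
# survives in each. Production systems usually use proper sentence splitters
# (spaCy, nltk, etc.) and token-aware windows instead of these regexes.
import re

def sentence_chunks(text: str) -> list[str]:
    # Split on sentence-ending punctuation followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def paragraph_chunks(text: str) -> list[str]:
    # Split on blank lines, keeping each paragraph as one chunk.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = (
    "The BT Tower is a grade II listed communications tower in Fitzrovia, "
    "London, England, owned by BT Group. It has also been known as the GPO "
    "Tower, the Post Office Tower, and the Telecom Tower. The main structure "
    "is 581 feet (177 m) high, with aerial rigging bringing the total height "
    "to 620 feet (189 m)."
)

print(sentence_chunks(doc))   # three sentence-level chunks
print(paragraph_chunks(doc))  # one paragraph-level chunk
```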
But what to do…what to do?
The Idea
Admittedly, this is a pretty dumb idea, and it’s likely somebody has already done it before, but hear me out: what if the system could carry a CliffsNotes version of the document with it for every chunk? That way, the search engine can be on page 23 of 113, but still have a general idea of what the chunk is talking about by relating it back to the notes. That should help boost the appropriate relevance when searching.
Turning that into an actual plan is even dumber: we get an LLM like Llama, Gemini, etc. to generate a 400-word summary of the document, taking advantage of their wide context windows to summarize the entire document. We then embed that and, here comes the magic, we average this vector with every single chunk vector in the document.
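In code, the whole trick is a handful of lines. Here’s a hedged sketch of the averaging step; the summary text is assumed to have come back from whatever LLM you prefer, and the model name is just the one I use later on.

```python
# A minimal sketch of the 'Gist' averaging idea: embed a ~400-word LLM
# summary of the whole document, then average it with every chunk vector.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-xs")

def gist_vectors(chunks: list[str], summary: str) -> np.ndarray:
    chunk_vecs = model.encode(chunks, normalize_embeddings=True)
    summary_vec = model.encode(summary, normalize_embeddings=True)
    # Average each chunk vector with the document-level summary vector,
    # then re-normalize so cosine similarity still behaves sensibly.
    averaged = (chunk_vecs + summary_vec) / 2.0
    return averaged / np.linalg.norm(averaged, axis=1, keepdims=True)
```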
It’s so stupid. And yet…compels me though…
A Brief PoC
So yes, I’ve had the idea rattling around in my head for a while, but never really enough time to sit down and do an evaluation. But then Claude 3.7 came out and I thought I’d use this as a chance to test that out and get this idea out of my brain.
Firstly, an evaluation dataset. I could have used MSMARCO, but I distrust it, knowing that a lot of the quality judgements in it are just plain wrong. Plus, every embedding model is trained on the MSMARCO data, so it’s not a fair test anymore (in my, and lots of others’ opinion). Instead, I downloaded 1000 random pages from Wikipedia and got Llama-3.3-70bn to generate possible search query terms for each page.
Now at this point, various IR people are yelling at me that this isn’t fair either, as you can’t guarantee that the search terms I generate for one document are completely separate from any other document in the set…and that’s a good point, but this is not supposed to be a rigorous examination. It’s “does this even make sense to pursue?”
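For the curious, the query generation amounted to little more than a prompt; the sketch below assumes an OpenAI-compatible endpoint sitting in front of a Llama-3.3-70B deployment, so the client setup, model name, and prompt wording are all illustrative rather than the exact script I ran.

```python
# A hedged sketch of generating candidate search queries for each page.
# Assumes an OpenAI-compatible API in front of Llama-3.3-70B; the endpoint,
# model name, and prompt wording are illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def generate_queries(page_text: str, n: int = 5) -> list[str]:
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.3-70B-Instruct",
        messages=[{
            "role": "user",
            "content": (
                f"Write {n} short search queries a user might type to find "
                f"the following Wikipedia page. One query per line.\n\n"
                f"{page_text[:4000]}"
            ),
        }],
    )
    # One query per line comes back; strip out any blank lines.
    return [q.strip() for q in response.choices[0].message.content.splitlines() if q.strip()]
```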
Anyhow, I now have documents, queries, and a mapping of what query goes with what document. Next up, we write a script that generates a summary of each document, again using Llama-3.3-70bn2, breaks the documents up into paragraph-level chunks, and then produces two embeddings per chunk: one just being the chunk itself, and the other being the chunk embedding averaged with the summary embedding.
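Stitched together, the indexing side looks roughly like the sketch below; `summarise` is a stand-in for the Llama summary call, and the record layout is just a convenient shape rather than the exact one I used.

```python
# A sketch of building both indexes at once: per-chunk 'base' embeddings and
# 'averaged' embeddings that fold in the document summary. summarise() is a
# placeholder for the LLM summarization call described above.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-xs")

def index_document(doc_id: str, text: str, summarise) -> list[dict]:
    chunks = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunk_vecs = model.encode(chunks, normalize_embeddings=True)
    summary_vec = model.encode(summarise(text), normalize_embeddings=True)

    records = []
    for chunk, vec in zip(chunks, chunk_vecs):
        avg = (vec + summary_vec) / 2.0
        records.append({
            "doc_id": doc_id,                          # needed later for scoring
            "chunk": chunk,
            "embedding_base": vec,
            "embedding_average": avg / np.linalg.norm(avg),
        })
    return records
```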
Having got all that sorted, we finally embed the queries, identify the top-scoring chunks (and importantly, their document ids), and write that out as a series of run files, which ranx then scores against the qrels answer set to give us a nice set of summary tables and plots. (This is the “draw the rest of the owl” part)
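The “rest of the owl” is mostly ranx doing the heavy lifting; here’s a minimal sketch of the scoring step, with the query and document ids standing in for the real Wikipedia page ids.

```python
# A minimal sketch of scoring a retrieval run against the answer set with
# ranx. The ids and scores are placeholders for the real experiment data.
from ranx import Qrels, Run, evaluate

qrels = Qrels({"q_1": {"doc_42": 1}})                 # which doc answers which query
run = Run({"q_1": {"doc_42": 0.92, "doc_7": 0.55}})   # retrieval scores per query

print(evaluate(qrels, run, ["ndcg@1", "ndcg@5", "ndcg@10", "mrr", "map", "recall@100"]))
```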
Results
In my best Peter Snow voice, this is just a bit of fun. I used Snowflake’s arctic-embed-xs model for embedding, as it’s very small, quick, and capable, and tested Gist in a paragraph-splitting scenario. Have some tables and graphs. What we’re tracking here is NDCG, which scores between 0 and 1, putting higher weight on the correct documents appearing towards the start of the ranking list.
Paragraph Chunking & Averaged Summary Chunks
Here’s the data formatted as a markdown table:

| Embedding Type | ndcg@1 | ndcg@5 | ndcg@10 | mrr | map | recall@100 |
| --- | --- | --- | --- | --- | --- | --- |
| embedding_base | 0.298597 | 0.456463 | 0.496966 | 0.424646 | 0.424646 | 0.727856 |
| embedding_average | 0.536072 | 0.681063 | 0.698795 | 0.645529 | 0.645529 | 0.864128 |
A Wild Sava Approaches
Having seen the big jump in NDCG scores, I was cautiously excited, so I broke cover and told my co-worker, Sava, about the idea. He was initially suspicious of my eval scores, which made sense because I had accidentally sent him scores based on sentence chunking rather than paragraphs, where the NDCG difference is even more pronounced. It was much more reasonable when I corrected for that, but he did point out I was missing another baseline: what are the NDCG scores if you just used the summary and didn’t have any chunks at all? A great idea from our Principal Research Scientist, so I collected the results:
Here’s the data formatted as markdown:

| Embedding Type | ndcg@1 | ndcg@5 | ndcg@10 | mrr | map | recall@100 |
| --- | --- | --- | --- | --- | --- | --- |
| embedding_doc_summary | 0.789178 | 0.853623 | 0.861771 | 0.841532 | 0.841532 | 0.984168 |
BOBBINS. I must confess I was somewhat crushed by this, and I still haven’t told him (until he reads this). I even went off and repeated the experiment with HuggingFace’s fineweb dataset…and…yes, as you can see, pretty much the same result.
| Model | ndcg@1 | ndcg@5 | ndcg@10 | mrr | map | recall@100 |
| --- | --- | --- | --- | --- | --- | --- |
| embedding_base | 0.2793 | 0.418632 | 0.456080 | 0.391381 | 0.391381 | 0.6630 |
| embedding_average | 0.5111 | 0.668388 | 0.691684 | 0.631249 | 0.631249 | 0.8803 |
| embedding_doc_summary | 0.7092 | 0.797029 | 0.809260 | 0.781517 | 0.781517 | 0.9753 |
Conclusion
All that work and I might as well have just used the summary. So is this a complete washout? Not entirely, and this is how even ‘failed’ research can still be useful. For one thing, instead of going to the trouble of doing all that chunking, it’s possible that for a variety of search applications we could just not bother and use a summary instead. That way, we get good results and we don’t have to store all those chunk vectors in the database.3
Also, there are some use cases where the averaged chunks could still be useful. In the currently common retrieval augmented generation pattern, chunks from various documents are sent to a large language model in order to answer a question. Even though today’s frontier models are powerful, they can still find themselves being confused by lots of different documents being sent to them in one go, so you want to send relevant information and not drown it in distracting text. If you’re trying to limit yourself to 5 or so docs to send, this technique gets you a much bigger chance of having the correct answer sent to the model…
There’s more that could be tested. We probably should have a better set of generated queries that accounts for potential ‘bleed’ between queries and other documents, by judging the relevance of each query against other documents in the system (I have a version of this in progress as I type), as well as trying a bunch of other ideas. Should we try appending the summary in text form to the chunk text and then vectorizing that new chunk? Just add the title and maybe some important keywords? With those ideas, you do have to worry about total token counts again, but if things are kept concise and a longer-context embedding model is used, it should be easier to handle. Could we tailor the summary prompt to produce something more like an Anki card to see if that helps? Should we add a hyperparameter to multiply against the summary vector before it’s added and averaged against the chunk vector? Maybe we give up on the idea of the summary altogether and instead use a sliding window of previous and following chunks to try and keep context that way?4 So many different directions that could be run into the ground. There’s still so much left on the table in the world of IR and embeddings. So, ‘failure’, yes, but as failures go, not a bad one.
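To make the hyperparameter idea concrete, here’s a hedged sketch of a weighted version of the averaging step; how you re-normalize after weighting is a design choice, and the 0.5 weight mentioned in the footnote below is just one point of a sweep.

```python
# A sketch of weighting the summary vector before mixing it with a chunk
# vector. weight=1.0 recovers the plain average used above; dividing by
# (1 + weight) is one reasonable normalization choice, not the only one.
import numpy as np

def weighted_gist(chunk_vec: np.ndarray, summary_vec: np.ndarray, weight: float = 1.0) -> np.ndarray:
    mixed = (chunk_vec + weight * summary_vec) / (1.0 + weight)
    return mixed / np.linalg.norm(mixed)
```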
This is due to most embedding models being based around the BERT architecture. There are some exceptions, and there are more modern BERT variants (e.g. ModernBERT) which have larger context lengths, but even with something like 8192 tokens, you’re still going to have difficulty embedding large documents in a one-shot fashion, and trying to capture an entire document in a single vector often proves detrimental to search. Even the more modern models often fail at capturing information, despite their longer context lengths. ↩︎
This is likely overkill, and you should probably just use something like Gemini Flash or a smaller Llama model instead. ↩︎
You’ll be surprised at how much disk space these vectors will take up, especially when you start making lots of chunks! ↩︎
I did actually have a quick play with the hyperparameter and sliding window approaches. The hyperparameter approach really needs a complete sweep to see how it’ll work in practice, but multiplying the summary vector by 0.5 barely affected the NDCG scores. Sliding window also has its own hyperparameter(s) - how far ahead and back do you look? With a simple test of looking back and forward one chunk, the performance was barely above the baseline embeddings… ↩︎
Finally coming to the end of the flu. Still have the hacking cough and getting through a box of tissues a day, but it’s better than the start of the week.
Having said all that, still not a lot to talk about. I have finally done my taxes, although with the current state of the IRS, I’m not exactly convinced my refund will be coming to me any time soon. It’ll be nice once it does arrive though, as it’ll move me a big step closer to Weird Financial Goal.
I am still trying to work out Maeryn’s birthday cake. The basic contours are set: chocolate cake plus a raspberry component. But do I make it in the Berlin mold, which will be fancy, but I feel like it doesn’t have enough cake, as well as making a ganache layer much harder? Or I could use Kyoto and have a ganache filling and raspberries on top. It feels like a more substantial cake, but not overwhelming for toddler hands (I am fully aware that I’m overthinking this, and it could just be a box cake in a sheet pan and she’ll shove it into her face just the same, but let me have this!). Don’t worry, I still have two weeks left to figure it out…
First tech blog of the year has most of the first draft done. Unfortunately, it requires a vast amount of changes in the second draft, turning it from “look at this cool new technique” into more of “look at this idea that didn’t actually work out”, which is a little less fun to write up. I need to do a few more experiments to convince myself one way or another, but I do feel it’s important to write up the things that fail as well as the things that succeed. So stay tuned for failure!
Maeryn brought home a wonderful strain of flu this week, so we’ve all taken turns being absent from daycare and work. We’re on the mend, but it’s a slow, slow process, so hopefully see you all back here next week…
Feb 23, 2025 · 2 minute read
Apple Juice FTW · Adventures in the UK
How can you pack even more into a four-day trip to the UK? Have you considered spending Saturday afternoon and evening in A&E?1 A sick baby needed medical attention. And while I think we could probably write a few volumes on how 111 could be better set up for providing updates, and perhaps the US concept of Urgent Care clinics might be a way to alleviate some of the pressures on A&E departments, we walked into the Horton, were seen quickly, and thankfully, Maeryn perked up once her blood sugar level was helped by apple juice. And then we just walked out, and having now lived in the US for almost 14 years…I’ll confess even I found it strange not to have handed over a credit card at any point. Maeryn is back to her usual “you will read all the books now, daddy” self, which is wonderful.
Anyway, the concert was great — there may be a separate post coming to unpack all that in the next week or so. It was my first time at Troxy, too; a lovely 1930s cinema that retains a lot of its former glory…unlike what they did to the Oxford Road Odeon in Manchester (still bitter about the loss of that fantastic Screen One). Now, I’m thinking of other venues in London I have known and loved…The Luminaire being the saddest loss, I think, even above the delights of The Astoria (come on, just look at the mirrorball in a venue for 200 people and tell me you wouldn’t fall in love with that place). A packed trip of highs and lows…but next time, hopefully we’ll stay long enough to get over the jetlag before we get more…
And now back in the US as things continue to fall apart. But I have 4 kilograms of Mini Eggs, so things are not quite at their worst just yet. The snow is melting, we’re all house safe, and I’m putting together thoughts for Maeryn’s first homemade birthday cake.
No kings, no tyrants.
When telling this story to my boss, he got quite confused when I kept using A&E and Casualty interchangeably. ↩︎
Last week was San Francisco. This upcoming week? London. For a day (and Bicester for a couple more days). It’s a whirlwind tour, and hopefully we’ll be back properly for a longer period of time sooner rather than later. But this time, for Valentine’s Day, Tammy and I will be seeing Los Campesinos! in London…probably the first time I’ve seen them in the UK since 2010 or 2011. And a train ride there and back! What else could I ask for, really?
First made-up story at 3am when demands were made to read a book in the dark. I have a feeling that Rev. W. Awdry would not have approved of Thomas Delivers Spent Nuclear Fuel Rods To A Previously Undisclosed Station on Sodor, but hey, Maeryn seemed to think it passed muster before passing out.
Feb 2, 2025 · 1 minute read
let's not count the drinks · surprisingly warm-ish
A sometimes strange, but mostly good visit to San Francisco. Although I probably shouldn’t drink again until the second half of March at the very least.
Once again, I felt broadly safe in the downtown and Mission areas (and anywhere else I went, really). One altercation early on Monday morning, but that was about it. However, you can’t deny that the city centre has taken a big hit from Covid and shows no real sign of bouncing back. Even more vacant shopfronts than my previous visit, with even It’s Sugar having bailed out of the old Forever 21 location…and it seems that the Macy’s that has been there since 1929, and has roots going back to 1866 (!) might be shutting down. The San Francisco Centre, stripped of the Westfield branding, is now a downtown mall with less than 25% occupancy and the haunted air of a place that feels like The Caretaker should be piped over the PA system. Not the healthiest, let’s just say.
Of course, we’ll always truly remember San Francisco this way:
(Of course, when I got home, I was greeted with smiles. And then directed to sit in the reading corner, because I had fallen behind in the daily task of reading all the books)
After managing to miss showings of it by weeks or even mere days when it has been screened in cities nearby or ones I’ve been visiting (missed it by days in SF twice, even), thanks to the global livestream yesterday, I finally got to see Eno, the 85-ish minute, “different every time it’s shown” documentary on Brian Eno (obv.)1
And…well. Having seen two of the 6 generations of the film over the past 24 hours, I can’t help but think there’s a great 160-minute cut hiding across the different generations. Which is not to say it isn’t good…but because the spine of the documentary remains fixed, with digressions here and there (maybe you’ll see Eno take the piss out of Bono while they’re recording Pride, maybe you’ll see some footage of Roxy Music, maybe you’ll get the admittedly amusing ‘repetition’ Oblique Strategies card that threatens to replay the entire previous sequence, etc.), I don’t think it will hold up to that many repeated viewings, because the versions just don’t seem different enough. Which makes me think that a longer, more traditional cut would have been better after all. At least Eno comes across well; not taking himself seriously, or just seriously enough, yelling at the YouTube ads when they get in the way of him showing music, or giggling over the insanity of the Windows 95 startup-sound brief.
(things I didn’t see in the edits I watched, but I think exist in other versions: an old interview with Sandi Toksvig, which I am curious about, because I wonder if it’s something from No. 73, and in the second edit, there’s a brief flash of the PAUL MORLEY KLAXON, so obviously, I was sad I didn’t get to see the full thing. Curiously, there were only a few flashes of the ‘so famous, it got parodied at least twice’ performance of Virginia Plain on TOTP, instead mainly using footage from European music shows…)
I think this would break with how Eno and Hustwit viewed the concept, and with Eno’s feelings towards generative art in general, but I would prefer a way of being able to turn the “cybernetic dial” and get even 60 minutes of Eno talking about Stafford Beer, or the “Horny Eno dial” to get an edit that is much more in the vein of various diary entries from 1995. Being able to either consciously or randomly alter the complete documentary structure around concepts would make it a more attractive offering. Or just give me three hours of footage to watch.
Having said all that, the Q&A FaceTime thing afterwards last night gave me an idea of using embedding models with the Bluesky protocol that I will have to try and build in the next month or so. And who knows, this time I might actually finish something.
Anyway, off to San Francisco in the morning for a week. If you see me, why not say “haven’t you been promising a tech blog for quite some time now, eh?”
Song of the week, but it may take you less than one second to realise this is 100% Ian Back On His Thing Again. Maybe even less if you see the thumbnail.
This week, I began the Year of Dentistry. Which along with everything else, apparently comes with a brace of pills that you shouldn’t take anywhere near alcohol…even three days after, according to the pharmacist. So, we’ll be starting those after I get back from San Francisco, then…
Only one model trained this week, but ideas forming for more experiments in ‘test-time inference’ for embeddings. And a slow realization that despite its hipness towards the end of last year, I was essentially doing it at the start of 2024 with my tree-search jailbreaking techniques. I wish I had had the time to add the Monte Carlo Tree Search system on top, but alas.
Any ideas on fun places to visit in SF? Part of me feels like I should once again do the 90s thing and go to DNA Lounge, but in my head, I am still not cool enough to be there…