Gist

Okay, so this month we’re going to talk about something that doesn’t really work. You’ve seen all those amazing arXiv papers with their fancy new discoveries? Bah, that’s easy. The true apex of research is talking about the things that failed. So welcome to a litany of failure!

Anyway, this is an idea I’ve been wanting to try out for at least a year. I have entries in my notes file that go back to 2023, and really I just wanted to get it out of my head, using my co-worker, Claude Sonnet, to determine whether there was any promise in the technique.

Welcome, then, to ‘Gist’, an attempt to improve the search relevance of chunking in vector search!

(if you’d like a précis of searching with vectors, head over here and then come back. The tl;dr: use a model to create vectors of your documents, then make a vector out of the query and select the ‘closest’ document to the query vector)

Problems With Embeddings & Chunking

When you’re making an embedding of a document using a ‘standard’ embedding model, one of the issues you’ll run into is that most of these models have a context length of 512 tokens or about 400 words.1 This has given rise to the cottage industry of ‘chunking’, where a document is split somehow into small chunks and each of them is vectorized. This way, your searches can dig deep into the documents and hopefully get really good results as opposed to just searching across the first 400 words.

However…

Let’s consider a dumb-but-common chunking method - splitting on sentences. Here’s a simple example:

The BT Tower is a grade II listed communications tower in Fitzrovia, London, England, owned by BT Group. It has also been known as the GPO Tower, the Post Office Tower, and the Telecom Tower. The main structure is 581 feet (177 m) high, with aerial rigging bringing the total height to 620 feet (189 m).

This gives us three chunks:

The BT Tower is a grade II listed communications tower in Fitzrovia, London, England, owned by BT Group.

It has also been known as the GPO Tower, the Post Office Tower, and the Telecom Tower.

The main structure is 581 feet (177 m) high, with aerial rigging bringing the total height to 620 feet (189 m).

If you were embedding these three chunks, and you were also embedding lots of other documents in the same way, it’s possible that your search is going to run into problems. If you have a query like post office tower height, you’re going to want that last chunk to score very highly. But that sentence, stripped from the rest of the paragraph, has no link to the concept of the tower whatsoever, and so neither does its embedding. Instead, what you’re likely to get is a response of all the chunks across your search index that mention height. Terrible!

The easiest fix, and one that would likely work well in this particular case, is to split on paragraphs instead of sentences, so the embedding would have the context of the tower and the height in the same vector. But imagine a longer document, and you can see that you are likely to start missing context clues in your chunks, which will have a big impact on your search results.
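
If you want to play along at home, both splitting strategies are only a couple of lines of Python; here’s a naive sketch with hand-rolled splitting, not something you’d want in production:

```python
import re

def split_sentences(text: str) -> list[str]:
    # Naive sentence splitter: break after ., ! or ? followed by whitespace.
    # A real pipeline would use a proper sentence tokenizer.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def split_paragraphs(text: str) -> list[str]:
    # Paragraph splitter: break on blank lines.
    return [p.strip() for p in text.split("\n\n") if p.strip()]

doc = (
    "The BT Tower is a grade II listed communications tower in Fitzrovia, "
    "London, England, owned by BT Group. It has also been known as the GPO "
    "Tower, the Post Office Tower, and the Telecom Tower. The main structure "
    "is 581 feet (177 m) high, with aerial rigging bringing the total height "
    "to 620 feet (189 m)."
)

print(split_sentences(doc))   # the three chunks above
print(split_paragraphs(doc))  # one chunk: the whole paragraph
```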

But what to do…what to do?

The Idea

Admittedly, this is a pretty dumb idea, and it’s likely somebody has already done it before, but hear me out: what if the system could carry a CliffsNotes version of the document with it for every chunk? That way, the search engine can be on page 23 of 113, but still have a general idea of what the chunk is talking about by relating it back to the notes. That should help boost the appropriate relevance when searching.

Turning that into an actual plan is even dumber: we get an LLM like Llama, Gemini, etc. to generate a 400-word summary of the document, taking advantage of their wide context windows to summarize the entire document. We then embed that, and, here comes the magic, average this vector with the vector of every single chunk in the document.
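
In code, the whole trick is barely a function. A minimal sketch, assuming an embed() wrapper around whatever embedding model you’re using that turns a list of strings into an (n, dim) NumPy array:

```python
import numpy as np

def gist_vectors(chunks: list[str], summary: str, embed) -> np.ndarray:
    """Average each chunk embedding with the document-summary embedding.

    `embed` is assumed to map a list of strings to an (n, dim) array;
    swap in whichever embedding model you like.
    """
    summary_vec = embed([summary])[0]
    chunk_vecs = embed(chunks)
    averaged = (chunk_vecs + summary_vec) / 2.0
    # Re-normalise so cosine / dot-product scoring behaves as usual.
    return averaged / np.linalg.norm(averaged, axis=1, keepdims=True)
```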

It’s so stupid. And yet…compels me though…

A Brief PoC

So yes, I’ve had the idea rattling around in my head for a while, but never really had enough time to sit down and do an evaluation. But then Claude 3.7 came out and I thought I’d use this as a chance to test that out and get this idea out of my brain.

Firstly, an evaluation dataset. I could have used MSMARCO, but I distrust it, knowing that a lot of the relevance judgements in it are just plain wrong. Plus, every embedding model is trained on the MSMARCO data, so it’s not a fair test anymore (in my opinion, and that of lots of others). Instead, I downloaded 1000 random pages from Wikipedia and got Llama-3.3-70bn to generate possible search query terms for each page.
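
The query generation was nothing clever; roughly the shape below, though the endpoint, model name, and prompt here are placeholders rather than exactly what I ran:

```python
from openai import OpenAI

# Any OpenAI-compatible endpoint serving Llama-3.3-70B will do;
# the base_url and model name are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def generate_queries(page_text: str, n: int = 3) -> list[str]:
    prompt = (
        f"Generate {n} realistic search queries that this Wikipedia page "
        "would be a good answer for. One query per line, no numbering.\n\n"
        f"{page_text[:8000]}"
    )
    response = client.chat.completions.create(
        model="llama-3.3-70b-instruct",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    lines = response.choices[0].message.content.splitlines()
    return [q.strip() for q in lines if q.strip()]
```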

Now at this point, various IR people are yelling at me saying that that’s not fair either, as you can’t guarantee that the search terms I generate for one document are completely separate from any other document in the set…and that’s a good point, but this is not supposed to be a rigorous examination. It’s “does this even make sense to pursue?”

Anyhow, I now have documents, queries, and a mapping of which query goes with which document. Next up, we create a script that generates a summary of each document, again using Llama-3.3-70bn2, breaks the document up into paragraph-level chunks, and then produces two embeddings per chunk: one of just the chunk itself, and the other being the chunk embedding averaged with the summary embedding.

Having got all that sorted, finally we embed the queries, identify the top-scoring chunks (and, importantly, their document ids), and write that out as a series of run files, which ranx then scores against the qrels answer set to give us a nice set of summary tables and plots. (This is the “draw the rest of the owl” part)
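
The ranx end of the owl is pleasingly small. Something like the following, with made-up dictionary contents standing in for the real qrels and run files:

```python
from ranx import Qrels, Run, compare

# qrels: which document answers which query (the answer set); toy example.
qrels = Qrels({"q1": {"d12": 1}, "q2": {"d7": 1}})

# One run per embedding variant: query -> {document id: best chunk score}.
run_base = Run({"q1": {"d12": 0.71, "d3": 0.74}, "q2": {"d7": 0.66, "d9": 0.81}}, name="embedding_base")
run_avg = Run({"q1": {"d12": 0.82, "d3": 0.69}, "q2": {"d7": 0.79, "d9": 0.58}}, name="embedding_average")

report = compare(
    qrels,
    runs=[run_base, run_avg],
    metrics=["ndcg@1", "ndcg@5", "ndcg@10", "mrr", "map", "recall@100"],
)
print(report)  # summary table of all runs across all metrics
```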

Results

In my best Peter Snow voice, this is just a bit of fun. I used Snowflake’s arctic-embed-xs model for embedding, as it’s very small, quick, and capable, and tested Gist in a paragraph-splitting scenario. Have some tables and graphs. What we’re tracking here is NDCG, which scores between 0 and 1, giving higher weight to the correct documents appearing towards the start of the ranking list.
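
For the record, the embedding side is plain sentence-transformers; roughly this, where the query prefix is the one I believe the arctic-embed model card recommends (double-check it against the card):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-xs")

chunks = [
    "The main structure is 581 feet (177 m) high, with aerial rigging "
    "bringing the total height to 620 feet (189 m).",
]
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

# arctic-embed models expect a prefix on the query side only;
# this string is from memory, so check the model card.
prefix = "Represent this sentence for searching relevant passages: "
query_vec = model.encode([prefix + "post office tower height"], normalize_embeddings=True)

scores = query_vec @ chunk_vecs.T  # cosine similarity on normalised vectors
print(scores)
```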

Paragraph Chunking & Averaged Summary Chunks

Here’s the data formatted as a markdown table:

| Embedding Type | ndcg@1 | ndcg@5 | ndcg@10 | mrr | map | recall@100 |
| --- | --- | --- | --- | --- | --- | --- |
| embedding_base | 0.298597 | 0.456463 | 0.496966 | 0.424646 | 0.424646 | 0.727856 |
| embedding_average | 0.536072 | 0.681063 | 0.698795 | 0.645529 | 0.645529 | 0.864128 |

A Wild Sava Approaches

Having seen the big jump in NDCG scores, I was cautiously excited, so I broke cover and told my co-worker, Sava, about the idea. He was initially suspicious about my eval scores, which made sense because I had accidentally sent him scores based on sentence chunking rather than paragraphs, where the NDCG difference is even more pronounced. It was much more reasonable when I corrected for that, but he did point out I was missing another baseline: what are the NDCG scores if you just used the summary and didn’t have any chunks at all? A great idea from our Principal Research Scientist, so I collected the results:

Here’s the data formatted as markdown:

| Embedding Type | ndcg@1 | ndcg@5 | ndcg@10 | mrr | map | recall@100 |
| --- | --- | --- | --- | --- | --- | --- |
| embedding_doc_summary | 0.789178 | 0.853623 | 0.861771 | 0.841532 | 0.841532 | 0.984168 |

BOBBINS. I must confess I was somewhat crushed by this, and I still haven’t told him (until he reads this). I even went off and repeated the experiment with HuggingFace’s FineWeb dataset…and…yes, as you can see, pretty much the same result.

| Model | ndcg@1 | ndcg@5 | ndcg@10 | mrr | map | recall@100 |
| --- | --- | --- | --- | --- | --- | --- |
| embedding_base | 0.2793 | 0.418632 | 0.456080 | 0.391381 | 0.391381 | 0.6630 |
| embedding_average | 0.5111 | 0.668388 | 0.691684 | 0.631249 | 0.631249 | 0.8803 |
| embedding_doc_summary | 0.7092 | 0.797029 | 0.809260 | 0.781517 | 0.781517 | 0.9753 |

Conclusion

All that work and I might as well have just used the summary. So is this a complete washout? Not entirely, and this is how even ‘failed’ research can still be useful. For one thing, instead of actually going to the trouble of doing all that chunking, it’s possible that for a variety of search applications we could just not bother and use a summary instead. That way, we get good results and we don’t have to store all those chunk vectors in the database.3

Also, there are some use cases where the averaged chunks could still be useful. In the currently common retrieval-augmented generation pattern, chunks from various documents are sent to a large language model in order to answer a question. Even though today’s frontier models are powerful, they can still find themselves confused by lots of different documents being sent to them in one go, so you want to send relevant information and not drown the model in distracting text. If you’re trying to limit yourself to five or so docs to send, this technique gives you a much better chance of having the correct answer sent to the model…

There’s more that could be tested. We probably should have a better set of generated queries that accounts for potential ‘bleed’ between queries and other documents, by judging the relevance of each query against the other documents in the system (I have a version of this in progress as I type), as well as trying a bunch of other ideas. Should we try appending the summary in text form to the chunk text and then vectorizing that new chunk? Just add the title and maybe some important keywords? With those ideas, you do have to worry about total token counts again, but maybe if it’s kept concise and uses a longer-context embedding model, it would be easier to handle. Could we tailor the summary prompt to produce something more like an Anki card to see if that helps? Should we add a hyperparameter to multiply against the summary vector before it’s added and averaged against the chunk vector? Maybe we give up on the idea of the summary altogether and instead use a sliding window of previous and following chunks to try and keep context that way?4 So many different directions that could be run into the ground. There’s still so much left on the table in the world of IR and embeddings. So, ‘failure’, yes, but as failures go, not a bad one.
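
As a flavour of that hyperparameter idea (see footnote 4), it’s just a weighted mix before renormalising; a sketch using the same hypothetical embed() convention as earlier:

```python
import numpy as np

def gist_vectors_weighted(chunks, summary, embed, summary_weight=0.5):
    # summary_weight=1.0 recovers the plain average (up to normalisation);
    # summary_weight=0.0 falls back to the chunk-only baseline.
    summary_vec = embed([summary])[0]
    chunk_vecs = embed(chunks)
    mixed = chunk_vecs + summary_weight * summary_vec
    return mixed / np.linalg.norm(mixed, axis=1, keepdims=True)
```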


  1. This is due to most embedding models being based around the BERT architecture. There are some exceptions, and there are more modern BERT variants (e.g. ModernBERT) which have larger context lengths, but even with something like 8192 tokens, you’re still going to have difficulty embedding large documents in a one-shot fashion, and trying to capture the entire document in a single vector often proves detrimental to search. Even the more modern models often fail at capturing information, despite their longer context lengths. ↩︎

  2. This is likely overkill, and you should probably just use something like Gemini Flash or a smaller Llama model instead. ↩︎

  3. You’ll be surprised at how much disk space these vectors will take up, especially when you start making lots of chunks! ↩︎

  4. I did actually have a quick play with the hyperparameter and sliding window approaches. The hyperparameter approach really needs a complete sweep to see how it’ll work in practice, but multiplying the summary vector by 0.5 barely affected the NDCG scores. Sliding window also has its own hyperparameter(s) - how far ahead and back do you look? With a simple test of looking back and forward one chunk, the performance was barely above the baseline embeddings… ↩︎