A Christmas LEAF

One of the crazier things I did after Thanksgiving (and I don’t believe the ER visit had much to do with it, but you can read a midlife crisis into things if you want) is buy a DGX Spark1. Now, my rationalisation of this expense was that my existing deep learning rig is almost nine years old and showing its age (while the graphics card in it is three years old and can still be flogged on eBay for the exact price I paid for it in 2023, and might even go up in value in 2026), plus I’m assuming that the big jump in RAM prices means that it is not likely to get any cheaper in the next 24 months. And 128GB of VRAM would be more than I can get in a single A100 deployment at work.

Anyway, it turned up, it has been set up, and it has been sitting around waiting for a project. Hello, Christmas!

Over the past…five? Six? years, a standard pattern for training embedding models has emerged:

  1. You get a pretrained model, or train your own model from scratch to be a general language model
  2. You then do a ‘weakly-supervised’ fine-tune on top of that model, where you take a large corpus of web data and basically tune on ‘title / first x tokens of text’ pairs (sketched just after this list)
  3. A much smaller, highly-supervised fine-tuning pass on high-quality datasets like MSMARCO.
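
For concreteness, step 2 usually boils down to in-batch contrastive training over ‘title / first x tokens’ pairs. A minimal sketch with the sentence-transformers trainer follows; the model id and the toy documents are stand-ins, and a real run chews through hundreds of millions of pairs:

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)

# A couple of toy web documents stand in for the usual billions-of-pages crawl.
docs = [
    {"title": "Crazy golf in Brighton",
     "text": "Brighton's seafront course has eighteen holes of miniature golf, each themed..."},
    {"title": "How to brine a turkey",
     "text": "Submerge the bird in salted water for twelve hours before roasting, then..."},
]

# 'title / first x tokens of text' pairs: the weak supervision signal.
pairs = Dataset.from_dict({
    "anchor": [d["title"] for d in docs],
    "positive": [d["text"][:512] for d in docs],  # crude 'first x tokens' via characters
})

# The base model starts as a plain pretrained encoder; sentence-transformers
# wraps it with mean pooling automatically. The HF id is an assumption.
model = SentenceTransformer("jhu-clsp/ettin-encoder-17m")

# In-batch negatives: every other document in the batch acts as a negative for a
# given title, which is why this step traditionally wants enormous batch sizes.
loss = losses.MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(output_dir="step2-weak", per_device_train_batch_size=2)
SentenceTransformerTrainer(model=model, args=args, train_dataset=pairs, loss=loss).train()
```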

I absolutely despise step 2. It requires huge amounts of data, most of which is extremely low quality and quite often not obviously semantically related, and also needs significant infrastructure to train at the batch sizes required to make it passable. With the improvements in datasets over the past couple of years (things like fineweb-edu), I have been convinced there’s a better way to do it. Something that you could pull off on a single A100…or a DGX Spark.

The first little inkling of an idea came with the Stella/Jasper paper that appeared over the New Year of 2024/5, where a much larger teacher model was used to orient the vector space of a smaller model. At the time, I saw it as a vindication of my approach of using a projection layer to make images searchable from a text-only embedding model, but I also felt that more could be done with the idea. Other things got in the way, so I filed it away in my head for the rest of the year, and it wasn’t until I saw the LEAF paper a couple of months ago that it popped back out and I got excited again.

The general idea of both papers is this: we take a lot of data (good data, hopefully), and vectorize it with a very strong teacher model. Maybe it outputs 768-dimensional embeddings. We then have a student model, which is likely to be smaller and outputs a lower dimensionality of embedding, say 256. The student is a standard pre-trained model, which means it can probably produce text, but is absolutely rubbish at tackling embedding similarity. We add a small projection network to the student, which can be as simple as a single linear layer expanding the 256-dimensional embedding to 768 dimensions. We then take the embeddings and the texts from the first step of the process and train the student model (plus its projection layer) to match the vectors obtained by the teacher. So if “crazy golf” is vectorized by the teacher to a 768-dimensional vector of [1.0, 1.0,…], we’d keep updating our student model until it said something very similar.
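
A minimal sketch of that matching step, assuming a transformers-style student and a sentence-transformers teacher; the model ids, pooling, and cosine loss here are illustrative rather than the exact LEAF recipe:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer
from transformers import AutoModel, AutoTokenizer

TEACHER_ID = "sentence-transformers/all-mpnet-base-v2"  # a 768-d teacher, purely illustrative
STUDENT_ID = "jhu-clsp/ettin-encoder-17m"               # assumed HF id for the 17m encoder

teacher = SentenceTransformer(TEACHER_ID, device="cuda")
tokenizer = AutoTokenizer.from_pretrained(STUDENT_ID)
student = AutoModel.from_pretrained(STUDENT_ID).cuda()

# The projection: a single linear layer lifting the student's hidden size up to the teacher's 768.
projection = nn.Linear(student.config.hidden_size,
                       teacher.get_sentence_embedding_dimension()).cuda()

optimizer = torch.optim.AdamW(list(student.parameters()) + list(projection.parameters()), lr=1e-4)

def embed_student(texts):
    batch = tokenizer(texts, padding=True, truncation=True, max_length=512,
                      return_tensors="pt").to("cuda")
    hidden = student(**batch).last_hidden_state                      # (batch, tokens, hidden)
    mask = batch["attention_mask"].unsqueeze(-1).float()
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)            # mean-pool over real tokens
    return projection(pooled)                                        # (batch, 768)

def distillation_step(texts):
    with torch.no_grad():
        target = teacher.encode(texts, convert_to_tensor=True, normalize_embeddings=True)
    pred = F.normalize(embed_student(texts), dim=-1)
    # Nudge the student (plus its projection) towards the teacher's vector for the same text.
    loss = 1 - F.cosine_similarity(pred, target).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```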

Now, the trick is that the amount of data needed to train this student is much, much smaller than you’d traditionally need for the “fire the entirety of the web at the model and hope it works” weakly-supervised step that most embedding models are trained with, and you don’t need pairs. Just text. Having said that, my attempts to replicate the LEAF paper met with mixed results. It was definitely improving the ability of the student model, but not to the extent that the paper promised.

But, but…what if we take LEAF and use it to bypass step 2, but still run the supervised pass in step 3? That would likely allow us to get a half-decent model in under 24 hours, if we use some good-quality datasets, a reasonable teacher, and…say, have access to a DGX Spark.
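
In code, that chaining is nothing exotic: save the distilled student out as a sentence-transformers checkpoint, then run an ordinary supervised pass over curated pairs on top of it. A sketch under those assumptions (the checkpoint path, dataset, batch size, and cached-negatives loss are all illustrative choices rather than the exact run):

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)

# Start from the LEAF-distilled student saved out of the previous step (hypothetical path).
model = SentenceTransformer("checkpoints/ettin-17m-leaf")

# A curated pairs dataset stands in for the high-quality supervised data.
train = load_dataset("sentence-transformers/gooaq", split="train[:100000]")

# Cached (GradCache-style) in-batch negatives decouple the effective batch size from
# what fits in memory at once, which is one way to push batches very high on a single GPU.
loss = losses.CachedMultipleNegativesRankingLoss(model, mini_batch_size=64)

args = SentenceTransformerTrainingArguments(
    output_dir="checkpoints/ettin-17m-leaf-supervised",
    per_device_train_batch_size=4096,
    learning_rate=2e-5,
    num_train_epochs=1,
    bf16=True,
)
SentenceTransformerTrainer(model=model, args=args, train_dataset=train, loss=loss).train()
```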

The Christmas project in full:

So, between Christmas and New Year, Claude and I did some experimenting. I did have some trouble getting a good base Docker image for the Spark, but once I had sorted that4, we got some good training runs in.

Final Score

| Model | nanoBEIR NDCG@10 |
| --- | --- |
| all-MiniLM-L6-v2 | 0.5623 |
| ettin-encoder-17m | 0.0667 |
| ettin-encoder-17m+F2LLM | 0.3490 |
| ettin-encoder-17m+LEAF | 0.5148 |
| ettin-encoder-17m+LEAF+F2LLM | 0.5690 |

Victory! Well, just, but at the end of the day, Brian, a win is a win. Now back to the studio…

I think the experiment shows that there really is something to LEAF. The base model is utterly useless by itself, and while training on the curated datasets helps considerably, it is still lagging behind the other variants. And there is still something to be said for the final supervised fine-tuning run. It provides a 10% uplift to the LEAF-only model (and this tracks with my experiments in October when throwing the supervised datasets into the LEAF mix).
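
For anyone wanting to reproduce the scoring, sentence-transformers now ships a NanoBEIR evaluator, so checking a model looks roughly like this (exact result keys vary a little between versions):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import NanoBEIREvaluator

# Score any of the variants in the table; the MiniLM baseline is shown here.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

evaluator = NanoBEIREvaluator()  # defaults to the full set of NanoBEIR datasets
results = evaluator(model)

# Pull out the aggregate NDCG@10; per-dataset scores live in the same dict.
for key, value in results.items():
    if "mean" in key and "ndcg@10" in key:
        print(key, round(value, 4))
```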

Of course, beating all-MiniLM-L6-v2 isn’t all that these days. One of my favourite small models, snowflake-arctic-embed-xs, scores 0.61 on nanoBEIR. But it has 22m parameters versus my model’s 17m, and I’ve barely done any training - 4 million data points, whereas most embedding models break 200m without a sweat.5 Follow-ups planned for 2026 include: better teachers, more LEAF data, and a look at those hard negatives in F2LLM6. But I’m quite pleased with this first experiment run on the Spark.


  1. Actually an ASUS Ascent, trading a grand of savings for a smaller hard disk. But I have a 38TB NAS sitting on my home network, so it seemed a reasonable trade… ↩︎

  2. To save time, I only ran 10 epochs of training instead of the 30 in the LEAF paper, so there’s probably more room for improvement here ↩︎

  3. And a batch size of 32768, which is something I have never got from a single GPU before… ↩︎

  4. At the time of publishing this article, nvcr.io/nvidia/pytorch:25.11-py3 is what you’re looking for ↩︎

  5. In fairness, you could point out that LEAF is freeloading on the teacher model’s work…but it’s what open source is all about! Plus, up until 2024, almost all embedding models stemmed from a checkout of Google’s BERT… ↩︎

  6. I can be heard often saying “DEATH TO MSMARCO!” and I do mean it. ↩︎