A Christmas LEAF

One of the crazier things I did after Thanksgiving (and I don’t believe the ER visit had much to do with it, but you can read a midlife crisis into things if you want) is buy a DGX Spark1. Now, my rationalisation of this expense was that my existing deep learning rig is almost nine years old and showing its age (while the graphics card in it is three years old and can still be flogged on eBay for the exact price I paid for it in 2023, and might even go up in value in 2026), plus I’m assuming that the big jump in RAM prices means that it is not likely to get any cheaper in the next 24 months. And 128GB of VRAM would be more than I can get in a single A100 deployment at work.

Anyway, it turned up, it has been set up, and it has been sitting around waiting for a project. Hello, Christmas!

Over the past…five? Six? years, a standard pattern for training embedding models has emerged:

  1. You get a pretrained model, or train your own model from scratch to be a general language model
  2. You then do a ‘weakly-supervised’ fine-tune on top of that model, where you take a large corpus of web data and basically tune on ‘title / first x tokens of text’ pairs (sketched just after this list)
  3. A much smaller, highly-supervised fine-tuning pass on high-quality datasets like MSMARCO.
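
For concreteness, step 2 usually boils down to in-batch contrastive training over ‘title / first x tokens’ pairs. A minimal sketch with the sentence-transformers trainer follows; the model id and the toy documents are stand-ins, and a real run chews through hundreds of millions of pairs:

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)

# A couple of toy web documents stand in for the usual billions-of-pages crawl.
docs = [
    {"title": "Crazy golf in Brighton",
     "text": "Brighton's seafront course has eighteen holes of miniature golf, each themed..."},
    {"title": "How to brine a turkey",
     "text": "Submerge the bird in salted water for twelve hours before roasting, then..."},
]

# 'title / first x tokens of text' pairs: the weak supervision signal.
pairs = Dataset.from_dict({
    "anchor": [d["title"] for d in docs],
    "positive": [d["text"][:512] for d in docs],  # crude 'first x tokens' via characters
})

# The base model starts as a plain pretrained encoder; sentence-transformers
# wraps it with mean pooling automatically. The HF id is an assumption.
model = SentenceTransformer("jhu-clsp/ettin-encoder-17m")

# In-batch negatives: every other document in the batch acts as a negative for a
# given title, which is why this step traditionally wants enormous batch sizes.
loss = losses.MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(output_dir="step2-weak", per_device_train_batch_size=2)
SentenceTransformerTrainer(model=model, args=args, train_dataset=pairs, loss=loss).train()
```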

I absolutely despise step 2. It requires huge amounts of data, most of which is extremely low quality and quite often not obviously semantically related, and also needs significant infrastructure to train at the batch sizes required to make it passable. With the improvements in datasets over the past couple of years (things like fineweb-edu), I have been convinced there’s a better way to do it. Something that you could pull off on a single A100…or a DGX Spark.

The first little inkling of an idea came with the Stella/Jasper paper that appeared over the New Year of 2024/5, where a much larger teacher model was used to orient the vector space of a smaller model. At the time, I saw it as a vindication of my approach of using a projection layer to make images searchable from a text-only embedding model, but I also felt that more could be done with the idea. Other things got in the way, so I filed it away in my head for the rest of the year, and it wasn’t until I saw the LEAF paper a couple of months ago that it popped back out and I got excited again.

The general idea of both papers is this: we take a lot of data (good data, hopefully), and vectorize it with a very strong teacher model. Maybe it outputs 768-dimensional embeddings. We then have a student model, which is likely to be smaller and outputs a lower dimensionality of embedding, say 256. The student is a standard pre-trained model, which means it can probably produce text, but is absolutely rubbish at tackling embedding similarity. We add a small projection network to the student, which can be as simple as a single linear layer expanding the 256-dimensional embedding to 768 dimensions. We then take the embeddings and the texts from the first step of the process and train the student model (plus its projection layer) to match the vectors obtained by the teacher. So if “crazy golf” is vectorized by the teacher to a 768-dimensional vector of [1.0, 1.0,…], we’d keep updating our student model until it said something very similar.
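
A minimal sketch of that matching step, assuming a transformers-style student and a sentence-transformers teacher; the model ids, pooling, and cosine loss here are illustrative rather than the exact LEAF recipe:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer
from transformers import AutoModel, AutoTokenizer

TEACHER_ID = "sentence-transformers/all-mpnet-base-v2"  # a 768-d teacher, purely illustrative
STUDENT_ID = "jhu-clsp/ettin-encoder-17m"               # assumed HF id for the 17m encoder

teacher = SentenceTransformer(TEACHER_ID, device="cuda")
tokenizer = AutoTokenizer.from_pretrained(STUDENT_ID)
student = AutoModel.from_pretrained(STUDENT_ID).cuda()

# The projection: a single linear layer lifting the student's hidden size up to the teacher's 768.
projection = nn.Linear(student.config.hidden_size,
                       teacher.get_sentence_embedding_dimension()).cuda()

optimizer = torch.optim.AdamW(list(student.parameters()) + list(projection.parameters()), lr=1e-4)

def embed_student(texts):
    batch = tokenizer(texts, padding=True, truncation=True, max_length=512,
                      return_tensors="pt").to("cuda")
    hidden = student(**batch).last_hidden_state                      # (batch, tokens, hidden)
    mask = batch["attention_mask"].unsqueeze(-1).float()
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)            # mean-pool over real tokens
    return projection(pooled)                                        # (batch, 768)

def distillation_step(texts):
    with torch.no_grad():
        target = teacher.encode(texts, convert_to_tensor=True, normalize_embeddings=True)
    pred = F.normalize(embed_student(texts), dim=-1)
    # Nudge the student (plus its projection) towards the teacher's vector for the same text.
    loss = 1 - F.cosine_similarity(pred, target).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```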

Now, the trick is that the amount of data needed to train this student is much, much smaller than you’d traditionally need for the “fire the entirety of the web at the model and hope it works” weakly-supervised step that most embedding models are trained with, and you don’t need pairs. Just text. Having said that, my attempts to replicate the LEAF paper met with mixed results. It was definitely improving the ability of the student model, but not to the extent that the paper promised.

But, but…what if we take LEAF and use it to bypass step 2, but still run the supervised pass in step 3? That would likely allow us to get a half-decent model in under 24 hours, if we use some good-quality datasets, a reasonable teacher, and…say, have access to a DGX Spark.
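
In code, that chaining is nothing exotic: save the distilled student out as a sentence-transformers checkpoint, then run an ordinary supervised pass over curated pairs on top of it. A sketch under those assumptions (the checkpoint path, dataset, batch size, and cached-negatives loss are all illustrative choices rather than the exact run):

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)

# Start from the LEAF-distilled student saved out of the previous step (hypothetical path).
model = SentenceTransformer("checkpoints/ettin-17m-leaf")

# A curated pairs dataset stands in for the high-quality supervised data.
train = load_dataset("sentence-transformers/gooaq", split="train[:100000]")

# Cached (GradCache-style) in-batch negatives decouple the effective batch size from
# what fits in memory at once, which is one way to push batches very high on a single GPU.
loss = losses.CachedMultipleNegativesRankingLoss(model, mini_batch_size=64)

args = SentenceTransformerTrainingArguments(
    output_dir="checkpoints/ettin-17m-leaf-supervised",
    per_device_train_batch_size=4096,
    learning_rate=2e-5,
    num_train_epochs=1,
    bf16=True,
)
SentenceTransformerTrainer(model=model, args=args, train_dataset=train, loss=loss).train()
```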

The Christmas project in full:

So, between Christmas and New Year, Claude and I did some experimenting. I did have some trouble getting a good base Docker image for the Spark, but once I had sorted that4, we got some good training runs in.

Final Score

| Model | nanoBEIR NDCG@10 |
| --- | --- |
| all-MiniLM-L6-v2 | 0.5623 |
| ettin-encoder-17m | 0.0667 |
| ettin-encoder-17m+F2LLM | 0.3490 |
| ettin-encoder-17m+LEAF | 0.5148 |
| ettin-encoder-17m+LEAF+F2LLM | 0.5690 |

Victory! Well, just, but at the end of the day, Brian, a win is a win. Now back to the studio…

I think the experiment shows that there really is something to LEAF. The base model is utterly useless by itself, and while training on the curated datasets helps considerably, it is still lagging behind the other variants. And there is still something to be said for the final supervised fine-tuning run. It provides a 10% uplift to the LEAF-only model (and this tracks with my experiments in October when throwing the supervised datasets into the LEAF mix).
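
For anyone wanting to reproduce the scoring, sentence-transformers now ships a NanoBEIR evaluator, so checking a model looks roughly like this (exact result keys vary a little between versions):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import NanoBEIREvaluator

# Score any of the variants in the table; the MiniLM baseline is shown here.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

evaluator = NanoBEIREvaluator()  # defaults to the full set of NanoBEIR datasets
results = evaluator(model)

# Pull out the aggregate NDCG@10; per-dataset scores live in the same dict.
for key, value in results.items():
    if "mean" in key and "ndcg@10" in key:
        print(key, round(value, 4))
```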

Of course, beating all-MiniLM-L6-v2 isn’t all that these days. One of my favourite small models, snowflake-arctic-embed-xs, scores 0.61 on nanoBEIR. But it has 22m parameters versus my model’s 17m, and I’ve barely done any training - 4 million data points, whereas most embedding models break 200m without a sweat.5 Follow-ups planned for 2026 include: better teachers, more LEAF data, and a look at those hard negatives in F2LLM6. But I’m quite pleased with this first experiment run on the Spark.


  1. Actually an ASUS Ascent, trading a grand of savings for a smaller hard disk. But I have a 38TB NAS sitting on my home network, so it seemed a reasonable trade… ↩︎

  2. To save time, I only ran 10 epochs of training instead of the 30 in the LEAF paper, so there’s probably more room for improvement here ↩︎

  3. And a batch size of 32768, which is something I have never got from a single GPU before… ↩︎

  4. At the time of publishing this article, nvcr.io/nvidia/pytorch:25.11-py3 is what you’re looking for ↩︎

  5. In fairness, you could point out that LEAF is freeloading on the teacher model’s work…but it’s what open source is all about! Plus, up until 2024, almost all embedding models stemmed from a checkout of Google’s BERT… ↩︎

  6. I can be heard often saying “DEATH TO MSMARCO!” and I do mean it. ↩︎