LLM2Vec: Benchmark Scores vs Actual Usage

So, I like this new LLM2Vec paper, which presents a handy recipe for taking a decoder-only LLM and turning it into a strong embedding model - to the point where you can make a Mistral-7B model top the MTEB benchmark quite easily (and I think there's probably a little headroom for more improvement if you used a slightly more complicated fine-tuning regime than SimCSE). But I don't think it quite manages to answer the question it poses in the abstract: 'why is the community only slowly adopting these models for text embedding tasks?'

And I think there is a bit of a disconnect between what the Search/IR community does with embedding models and what the research community here is doing. Consider a relatively standard use case of vector search on a knowledge base: we are going to be making embeddings for tens of thousands, if not millions, of documents. That's a lot of embeddings.

In order for this to be practical, we need models that are good and fast. I can, right now, pull out a BGE-base model that is a tenth of the size of the smallest model tested in the paper (Sheared-LLaMA-1.3B), obtain a higher MTEB score, and throw batches through the model in under 50ms. And if I'm willing to spend a bit of time optimizing the model (like maybe an hour or two), I can make those batches fly through in under 10ms. On T4/L4 GPUs. I just can't do that with billion-parameter scale LLMs without spending a lot of time and money on beefier GPUs and complicated optimization regimes.
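For concreteness, here is a minimal sketch of the kind of batch-embedding latency check I have in mind, using sentence-transformers with a BGE-base checkpoint. The model name, batch size, and device are my own assumptions rather than anything from the paper, and the actual numbers will depend heavily on the GPU, sequence lengths, and any optimization (ONNX, fp16, etc.) you apply:

```python
# Rough sketch of timing a batch of embeddings with a BGE-base model.
# Model name, batch size, and device are assumptions for illustration;
# real latencies depend on hardware (e.g. T4/L4), sequence length, and dtype.
import time

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-en-v1.5", device="cuda")

# A toy batch standing in for one chunk of a larger knowledge base.
docs = ["Example passage from the knowledge base."] * 64

# Warm up once so model loading and kernel launches don't skew the timing.
model.encode(docs, batch_size=64, normalize_embeddings=True)

start = time.perf_counter()
embeddings = model.encode(docs, batch_size=64, normalize_embeddings=True)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"{len(docs)} docs embedded in {elapsed_ms:.1f} ms, dim={embeddings.shape[1]}")
```

Running something like this over a million-document corpus makes the size/latency trade-off pretty concrete: a per-batch difference of tens of milliseconds adds up to hours of indexing time.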

So, I like the paper. The recipe and the experimental details are really solid, but for the moment, I'm sticking with our old BERT-based friends.