FrugalGPT: This Big Boy Can Fit So Many LLMs

FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

I feel like the timing of this paper is amazing; you get the feeling that the authors watched some of the Tears of The Kingdom trailers, looked at the pile of models they had lying around and just thought “Why don’t we just use Ultrahand on them?”

What we have here, then, is a carefully constructed Heath Robinson machine designed to work around two big issues with calling GPT-3/4 in a query pipeline: the cost of all those API calls, and the quality of the answers you get back for them.

The authors construct a system using five different techniques to reduce the cost of using OpenAI’s LLMs, some of which avoid talking to them at all:

  1. Prompt Selection — reducing the number of examples provided in a prompt to cut down the total number of tokens sent to the LLM
  2. Query Concatenation — combining multiple queries into a single request to the LLM, and demultiplexing the response to answer the separate queries
  3. Response Cache — a cache that stores previous responses and returns an answer from the cache if an incoming query is judged ‘similar’ enough to one it has already seen
  4. Use a fine-tuned model instead — Collect responses from a large model (e.g. GPT-4) and use those responses to fine-tune a smaller model (the paper uses GPT-J), which can then be used in the:
  5. LLM Cascade — this is the main component of the paper. The cascade service sends a query to a list of LLMs in order of increasing cost, and each response is evaluated by a scoring model to determine whether it’s acceptable. If so, the response is returned to the user; if not, the next LLM on the list is queried (there’s a rough sketch of this just after the list).
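To make the cascade concrete, here’s a minimal sketch of the control flow as I understand it. The `call_llm` and `scorer` callables, the tier names, and the per-tier thresholds are all stand-ins for whatever API client and scoring model you’d actually wire in — the paper learns the model ordering and thresholds from labelled data rather than hard-coding them, so treat this as the shape of the idea, not the authors’ implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CascadeTier:
    name: str                  # e.g. a fine-tuned GPT-J, then GPT-3.5, then GPT-4
    cost_per_1k_tokens: float  # only used to keep the list ordered cheapest-first
    threshold: float           # minimum acceptable score from the scoring model


def run_cascade(
    query: str,
    tiers: List[CascadeTier],
    call_llm: Callable[[str, str], str],  # (model_name, query) -> response
    scorer: Callable[[str, str], float],  # (query, response) -> score in [0, 1]
) -> str:
    """Try models cheapest-first; return the first response the scorer accepts."""
    response = ""
    for tier in sorted(tiers, key=lambda t: t.cost_per_1k_tokens):
        response = call_llm(tier.name, query)
        if scorer(query, response) >= tier.threshold:
            return response
    # Nothing passed its threshold: fall back to the priciest model's answer.
    return response
```

The interesting engineering is all hidden inside `scorer` and those per-tier thresholds, which is exactly where the labelled-data requirement I grumble about below comes from.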

The resulting hodge-podge contains some surprises, the main one being: not only is it cheaper than just talking to GPT-4 directly, but when things are tuned, it actually performs better than GPT-4. The improvement isn’t massive, but combined with the 50-98% cost savings in their experiments, it does feel like there’s definitely something worth digging into here.

But also, a few issues. Using a fine-tuned model sounds like a good idea, but pretty much all the major LLM providers include clauses in their terms of service that would prevent you from doing this in production, unless you have a robust legal department that is eager to argue that the textual outputs of LLMs have no copyright protection and that those terms are therefore unenforceable. Some providers even prevent you from caching LLM queries! And then there’s the issue the authors point out as a major limitation — the scoring models need to be trained on labelled examples drawn from the same domain and distribution as the incoming queries. Which means this is a system that will need to be continuously updated, or else you’re going to have some serious model drift. Which makes me think of a bunch of data scientists running around this crazy contraption like Wallace and Gromit, trying to stop it from blowing up and spraying cheese all over the house. I do wonder if you could get this going in an RL framework, or something else that would cut down on the care and feeding the system needs.

(my eyebrows are also raised a little by the prompt selection and query concatenation stages — making LLMs that you don’t have raw access to consistently follow directions can be something of a challenge. You’d need to provide extra guardrails in production to make this work reliably)

One thing I think is interesting, and glossed over a little in the paper, is the cache component. The paper refers to it as a ‘database’, but my thinking is that a traditional cache/database in such a pipeline is going to miss out on a lot of potential reuse of queries, e.g. “When was New Order’s Blue Monday originally released?” and “When did Blue Monday first come out?” are the same question, but you’re going to get a cache miss on the second. So! Why not use a fast embedding model, a vector database, and an aggressive distance cut-off for nearest neighbours, so you can respond to a lot more queries without having to go to the LLMs?1 There’s a rough sketch of what I mean below.
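Here’s a back-of-the-envelope sketch of that idea: a hypothetical `embed` callable stands in for whatever fast embedding model you pick, and a brute-force numpy similarity search stands in for a proper vector database, but the shape of it — embed, nearest-neighbour lookup, aggressive cut-off — is the same.

```python
import numpy as np
from typing import Callable, List, Optional


class SemanticCache:
    """Cache LLM responses keyed by query embedding rather than exact text."""

    def __init__(self, embed: Callable[[str], np.ndarray], min_similarity: float = 0.9):
        self.embed = embed                    # any fast sentence-embedding model
        self.min_similarity = min_similarity  # the aggressive cut-off: tune on your own query logs
        self._vectors: List[np.ndarray] = []
        self._responses: List[str] = []

    def get(self, query: str) -> Optional[str]:
        if not self._vectors:
            return None
        q = self.embed(query)
        matrix = np.stack(self._vectors)
        # cosine similarity between the new query and every cached one
        sims = (matrix @ q) / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(q) + 1e-9)
        best = int(np.argmax(sims))
        return self._responses[best] if sims[best] >= self.min_similarity else None

    def put(self, query: str, response: str) -> None:
        self._vectors.append(self.embed(query))
        self._responses.append(response)
```

With a threshold tuned on your own query logs, the second Blue Monday question should (hopefully) land within the cut-off of the already-cached one and never touch the cascade at all.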

Anyway, that’s FrugalGPT. Save all the monies! Keep your Zonai batteries charged!


  1. Another fun thing you can do here - you can take queries and redirect them to things that you want to promote. For example, a merchandiser could set up a promotion for nappies, and searches for ‘Pampers’ could automatically return the promotion results along with the user’s actual query response. ↩︎