Caching OpenAI Embeddings API Calls In-memory

I recently posted about a job posting search engine I prototyped that used OpenAI’s Embeddings API.

As I tested this out in a Google Colab notebook, I guessed that the same text would always result in the same embedding. I compared embeddings of the same text and found that they were indeed identical, while adding a space or changing a word produced a different embedding.
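That check is easy to reproduce in a notebook cell. Here is a sketch, assuming the old openai.embeddings_utils helper and a placeholder model name:

from openai.embeddings_utils import get_embedding

a = get_embedding("Senior data engineer, remote", engine="text-embedding-ada-002")
b = get_embedding("Senior data engineer, remote", engine="text-embedding-ada-002")
c = get_embedding("Senior data engineer, remote ", engine="text-embedding-ada-002")  # extra trailing space

print(a == b)  # identical text gave identical embeddings in my tests
print(a == c)  # the extra space produced a different embedding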

I started by saving the embeddings in the dataframe. This worked, but I would have to call the API again whenever I wanted the same embedding later (which happened a couple of times, since my code was not robust enough on the first few runs). I also wanted previously requested search queries to return faster.

Since I was going to embed many job postings and might run the notebook multiple times, I wanted to cache the results to save a little money and speed up future runs. This was especially helpful while I was iterating on the code across many postings, since some of the postings caused my code to error out.

One solution is to store the embeddings in a database, perhaps a vector database. That would be more persistent and more production-friendly. For the time being, I decided to keep things simple and cache the results in memory until I saw that the overall approach would work.

After some research, I found that Python's functools module provides decorators for caching the results of a function. In Python 3.9+, there is a @cache decorator. However, I was using Python 3.8. The docs note that @cache is equivalent to using the lru_cache decorator with maxsize=None, so I tried that instead and it worked.

# Python < 3.9 version
from functools import lru_cache
from openai.embeddings_utils import get_embedding

@lru_cache(maxsize=None)
def cached_get_embedding(string: str, engine: str):
    # print first 50 characters of string
    print(f'Hitting OpenAI embedding endpoint with "{string[0:50]}..."')
    return get_embedding(string, engine=engine)

# Python >= 3.9 version
from functools import cache

@cache
def cached_get_embedding(string: str, engine: str):
    # print first 50 characters of string
    print(f'Hitting OpenAI embedding endpoint with "{string[0:50]}..."')
    return get_embedding(string, engine=engine)

Then you can replace any get_embedding calls with cached_get_embedding calls. The first time you call the function, it will print and hit the API. Subsequent calls will return the cached result and not print anything.
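For example (a sketch; the posting text and model name are placeholders):

# First call hits the API and prints the message; the repeat call returns the cached embedding silently.
first = cached_get_embedding("Senior data engineer, remote", engine="text-embedding-ada-002")
second = cached_get_embedding("Senior data engineer, remote", engine="text-embedding-ada-002")
assert first == second  # served from the in-memory cache, no second API call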

Another way of doing this would be to wrap OpenAI's get_embedding inside your own function, also called get_embedding, that applies the cache decorator or looks up the result in a database. Then you don't need to change any other code in your project and still get the benefits of caching. (It has a slightly higher chance of being surprising or confusing, though.)
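A minimal sketch of that approach, assuming the old openai.embeddings_utils helper and aliasing the library function on import:

from functools import lru_cache
from openai.embeddings_utils import get_embedding as _openai_get_embedding

@lru_cache(maxsize=None)
def get_embedding(string: str, engine: str):
    # Same name and signature as the library helper, so existing call sites
    # can import this version instead and get caching for free.
    return _openai_get_embedding(string, engine=engine)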

Since the embeddings seemed whitespace-sensitive, you may also want to normalize leading, trailing, and inner whitespace before calling the API (assuming that whitespace is not meaningful for your case) to reduce cache misses.
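Something like this could work (a sketch; normalize_whitespace is just an illustrative helper and the model name is a placeholder):

import re

def normalize_whitespace(string: str) -> str:
    # Collapse runs of whitespace into single spaces and strip the ends,
    # so trivially different strings map to the same cache key.
    return re.sub(r"\s+", " ", string).strip()

embedding = cached_get_embedding(normalize_whitespace("  Senior   data engineer, remote  "),
                                 engine="text-embedding-ada-002")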

Overall this worked well for my use case. I wanted to share it since it seemed like an elegant, Pythonic way of caching API calls.
