Picture This: Open Source AI for Image Description

A cartoon of a smiling robot looking at a computer screen with an image of trees and a house; it's emitting a speech bubble with a stream of made-up character shapes representing its description.
Image by Annie Ruygt

I’m Nolan, and I work on Fly Machines here at Fly.io. We’re building a new public cloud—one where you can spin up CPU and GPU workloads, around the world, in a jiffy. Try us out; you can be up and running in minutes. This is a post about LLMs being really helpful, and an extensible project you can build with open source on a weekend.

Picture this, if you will.

You’re blind. You’re in an unfamiliar hotel room on a trip to Chicago.

If you live in Chicago IRL, imagine the hotel in Winnipeg, the Chicago of the North.

You’ve absent-mindedly set your coffee down, and can’t remember where. You’re looking for the thermostat so you don’t wake up frozen. Or, just maybe, you’re playing a fun-filled round of “find the damn light switch so your sighted partner can get some sleep already!”

If, like me, you’ve been blind for a while, you have plenty of practice finding things without the luxury of a quick glance around. It may be more tedious than you’d like, but you’ll get it done.

But the speed of innovation in machine learning and large language models has been dizzying, and in 2024 you can snap a photo with your phone and have an app like Be My AI or Seeing AI tell you where in that picture it found your missing coffee mug, or where it thinks the light switch is.

Creative switch locations seem to be a point of pride for hotels, so the light game may be good for a few rounds of quality entertainment, regardless of how good your AI is.

This is big. It’s hard for me to state just how exciting and empowering AI image descriptions have been for me without sounding like a shill. In the past year, I’ve:

  • Found shit in strange hotel rooms.
  • Gotten descriptions of scenes and menus in otherwise inaccessible video games.
  • Requested summaries of technical diagrams and other materials where details weren’t made available textually.

I’ve been consistently blown away at how impressive and helpful AI-created image descriptions have been.

Also…

Which thousand words is this picture worth?

As a blind internet user for the last three decades, I have extensive empirical evidence to corroborate what you already know in your heart: humans are pretty flaky about writing useful alt text for all the images they publish. This does tend to make large swaths of the internet inaccessible to me!

In just a few years, the state of image description on the internet has gone from complete reliance on the aforementioned lovable, but ultimately tragically flawed, humans, to automated strings of words like “Image may contain person, glasses, confusion, banality, disillusionment,” to LLM-generated text that reads a lot like it was written by a person, perhaps sipping from a steaming cup of Earl Grey as they reflect on their previous experiences of “a background that features a tree with snow on its branches, suggesting that this scene takes place during winter.”

If an image is missing alt text, or if you want a second opinion, there are screen-reader addons, like this one for NVDA, that you can use with an API key to get image descriptions from GPT-4 or Google Gemini as you read. This is awesome!

And this brings me to the nerd snipe. How hard would it be to build an image description service we can host ourselves, using open source technologies? It turns out to be spookily easy.

Here’s what I came up with:

  1. Ollama to run the model
  2. A PocketBase project that provides a simple authenticated API for users to submit images, get descriptions, and ask followup questions about the image
  3. The simplest possible Python client to interact with the PocketBase app on behalf of users

The idea is to keep it modular and hackable, so if sentiment analysis or joke creation is your thing, you can swap out image description for that and have something going in, like, a weekend.

If you’re like me, and you go skipping through recipe blogs to find the “go directly to recipe” link, you can find the code itself here.

The LLM is the easiest part

An API to accept images and prompts, run the model, and spit out answers sounds like a lot! But it’s the simplest part of this whole thing, because: that’s Ollama.

You can just run the Ollama Docker image, get it to grab the model you want to use, and that’s it. There’s your AI server. (We have a blog post all about deploying Ollama on Fly.io; Fly GPUs are rad, try ’em out, etc.).

For this project, we need a model that can make sense—or at least words—out of a picture. LLaVA is a trained, Apache-licensed “large multimodal model” that fits the bill. Get the model with the Ollama CLI:

ollama pull llava:34b
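
To make “that’s Ollama” concrete: once the model is pulled, getting a description is a single HTTP request to Ollama’s generate endpoint. Here’s a rough Python sketch, assuming Ollama is listening on its default port 11434 and there’s an image.jpg in the current directory (the requests library and the prompt text are my stand-ins, not part of the project):

import base64
import requests

# Ollama takes images as base64 strings alongside the text prompt.
with open("image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llava:34b",
        "prompt": "Please describe this image.",
        "images": [image_b64],
        "stream": False,  # one complete answer instead of a token stream
    },
)
print(resp.json()["response"])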

If you have hardware that can handle it, you could run this on your computer at home. If you run AI models on a cloud provider, be aware that GPU compute is expensive! It’s important to take steps to ensure you’re not paying for a massive GPU 24/7.

On Fly.io, at the time of writing, you’d achieve this with the autostart and autostop functions of the Fly Proxy, restricting Ollama access to internal requests over Flycast from the PocketBase app. That way, if there haven’t been any requests for a few minutes, the Fly Proxy stops the Ollama Machine, which releases the CPU, GPU, and RAM allocated to it. Here’s a post that goes into more detail.

A multi-tool on the backend

I want user auth to make sure not just anyone can grab my “image description service” and keep it busy generating short stories about their cat. If I build this out into a service for others to use, I might also want business logic around plans or credits, or mobile-friendly APIs for use in the field. PocketBase provides a scaffolding for all of it. It’s a Swiss army knife: a Firebase-like API on top of SQLite, complete with authentication, authorization, an admin UI, extensibility in JavaScript and Go, and various client-side APIs.

Yes, of course I’ve used an LLM to generate feline fanfic. Theme songs too. Hasn’t everyone?

I “faked” a task-specific API that supports followup questions by extending PocketBase in Go, modeling requests and responses as collections (i.e. SQLite tables) with event hooks to trigger pre-set interactions with the Ollama app (via LangChainGo) and the client (via the PocketBase API).

If you’re following along, here’s the module that handles all that, along with initializing the LLM connection.

In a nutshell, this is the dance:

  • When a user uploads an image, a hook on the images collection sends the image to Ollama, along with this prompt: "You are a helpful assistant describing images for blind screen reader users. Please describe this image."
  • Ollama sends back its response, which the backend 1) passes back to the client and 2) stores in its followups collection for future reference.
  • If the user responds with a followup question about the image and description, that also goes into the followups collection; user-initiated changes to this collection trigger a hook to chain the new followup question with the image and the chat history into a new request for the model.
  • Lather, rinse, repeat.

This is a super simple hack to handle followup questions, and it’ll let you keep adding followups until something breaks. You’ll see the quality of responses degrade, possibly into incoherence, as the conversation outgrows the model’s context window.
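
Just to show the shape of that chaining, here’s roughly what a followup turn looks like as a request to Ollama’s chat endpoint, sketched in Python. The real backend does this in Go through LangChainGo, and the message contents below are placeholders borrowed from later in this post:

import base64
import requests

with open("image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

# Replay the original prompt and image, the model's description, and the new
# followup question, so the model sees the whole conversation as context.
history = [
    {
        "role": "user",
        "content": "You are a helpful assistant describing images for blind screen reader users. Please describe this image.",
        "images": [image_b64],
    },
    {"role": "assistant", "content": "The image depicts a serene winter scene..."},
    {"role": "user", "content": "What types of trees are in the image?"},
]

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={"model": "llava:34b", "messages": history, "stream": False},
)
print(resp.json()["message"]["content"])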

I also set up API rules in PocketBase, ensuring that users can’t read or write others’ chats with the AI.

If image descriptions aren’t your thing, this business logic is easily swappable for joke generation, extracting details from text, or any other simple task you might want to throw at an LLM. Just slot the best model into Ollama (LLaVA is pretty OK as a general starting point too), and match the PocketBase schema and pre-set prompts to your application.

A seedling of a client

With the image description service in place, the user can talk to it with any client that speaks the PocketBase API. PocketBase already has SDK clients in JavaScript and Dart, but because my screen reader is written in Python, I went with a community-created Python library. That way I can build this out into an NVDA add-on if I want to.

If you’re a fancy Python developer, you probably have your preferred tooling for handling virtualenvs and friends. I’m not, and since my screen reader doesn’t use those anyway, I just pip installed the library so my client can import it:

pip install pocketbase

My client is a very simple script. It expects a couple of things: a file called image.jpg, located in the current directory, and environment variables to provide the service URL and user credentials to log into it with.

When you run the client script, it uploads the image to the user’s images collection on the backend app, starting the back-and-forth between user and model we saw in the previous section. The client prints the model’s output to the CLI and prompts the user to input a followup question, which it passes up to the followups collection, and so on.
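
For a feel of what that round trip looks like on the wire, here’s a stripped-down sketch that talks to PocketBase’s REST API directly with requests instead of going through the SDK. The collection and field names (images, followups, file, text, image) are illustrative guesses rather than the repo’s actual schema, and the environment variable names are placeholders:

import os
import time
import requests

base = os.environ["DESCRIBER_URL"].rstrip("/")

# Log in as a regular PocketBase user; the response includes a JWT token.
auth = requests.post(
    f"{base}/api/collections/users/auth-with-password",
    json={
        "identity": os.environ["DESCRIBER_EMAIL"],
        "password": os.environ["DESCRIBER_PASSWORD"],
    },
).json()
headers = {"Authorization": auth["token"]}

# Upload image.jpg to the images collection; the backend hook takes it from there.
with open("image.jpg", "rb") as f:
    image = requests.post(
        f"{base}/api/collections/images/records",
        headers=headers,
        files={"file": ("image.jpg", f, "image/jpeg")},
    ).json()

# Poll the followups collection until the description arrives
# (a cold start can take tens of seconds).
while True:
    followups = requests.get(
        f"{base}/api/collections/followups/records",
        headers=headers,
        params={"filter": f'image = "{image["id"]}"'},
    ).json()
    if followups.get("items"):
        print(followups["items"][0]["text"])
        break
    time.sleep(2)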


All together now

I grabbed this image and saved it to a file called image.jpg.

While I knew I was downloading an image of a winter scene, all I see on Unsplash is:

brown trees beside river under blue sky during daytime Bright winter landscape with lake, snow, forest, beautiful blue sky and white clouds. An example of charming wildlife in Russia.

Let’s see what our very own AI describer thinks of this picture:

$ python __init__.py
The image depicts a serene winter scene. There is a small stream of water flowing through the center, with patches of snow on its surface and along the banks. The surrounding area appears to be a mix of bare trees and shrubs, suggesting it's late in the season or possibly early spring when some plants have started to grow again but haven't yet leafed out.

The sky is clear and blue, indicating good weather conditions at the time the photo was taken. The sunlight seems to be coming from the left side of the image, casting a warm glow on the scene and creating reflections in the water. There are no visible texts or distinguishing marks that provide additional context about the location or the photographer.

The overall atmosphere is peaceful and natural, with the quiet beauty of the landscape undisturbed except for the gentle movement of the stream.

Is it a stellar description? Maybe not, but it certainly gives me a better sense of connection with the scene.

Let’s see how our describer copes with a followup question.

Enter your followup question, or 'quit' to quit: What types of trees are in the image?
Sending followup question
It's difficult to determine the exact species of trees in the image without more specific details, as many deciduous trees lose their leaves during winter and look quite similar. However, based on the general appearance of the bare branches and the fact that they are leafless, it is likely that these are common deciduous trees such as oaks, maples, or elms.

The presence of snow and the absence of foliage suggest that this photo was taken during winter or early spring when the trees have not yet started to bloom again. The exact species would require a closer examination of the bark, buds, and other distinguishing features, which are not clearly visible in this image.

Boo, the general-purpose LLaVA model couldn’t identify the leafless trees. At least it knows why it can’t. Maybe there’s a better model out there for that. Or we could train one, if we really needed tree identification! We could make every component of this service more sophisticated!

But that I, personally, can make a proof of concept like this with a few days of effort continues to boggle my mind. Thanks to a handful of amazing open source projects, it’s really, spookily, easy. And from here, I (or you) can build out a screen-reader addon, or a mobile app, or a different kind of AI service, with modular changes.

Deployment notes

On Fly.io, stopping GPU Machines saves you a bunch of money and some carbon footprint, in return for cold-start latency when you make a request for the first time in more than a few minutes. In testing this project, on the a100-40gb Fly Machine preset, the 34b-parameter LLaVA model took several seconds to generate each response. If the Machine was stopped when the request came in, starting it up took another handful of seconds, followed by several tens of seconds to load the model into GPU RAM. The total time from cold start to completed description was about 45 seconds. Just something to keep in mind.

If you’re running Ollama in the cloud, you likely want to put the model onto storage that’s persistent, so you don’t have to download it repeatedly. You could also build the model into a Docker image ahead of deployment.

The PocketBase Golang app compiles to a single executable that you can run wherever. I run it on Fly.io, unsurprisingly, and the repo comes with a Dockerfile and a fly.toml config file, which you can edit to point at your own Ollama instance. It uses a small persistent storage volume for the SQLite database. Under testing, it runs fine on a shared-cpu-1x Machine.