Challenges of local-first LLMs

Underjord is a tiny, wholesome team doing Elixir consulting and contract work. If you like the writing you should really try the code. See our services for more information.

TLDR: the tricky part is “Large”. “Language” and “Model” seem manageable. But largeness has all sorts of trouble associated with it for mobile devices. Not insurmountable, but challenging. It also makes for finicky development.

To start off, I am not an AI expert. I am not an ML engineer. I’m a web and systems dev who mostly works in Elixir these days. Through the Elixir community I got connected with ElectricSQL and have done some collaboration with them. This post was written as part of my work with them and is paid work. I let them read it before publishing, but this is all me and my perspectives. They have kindly paid me to endure JavaScript and experiment with their tools a bit :)

ElectricSQL are firmly in the local-first ecosystem. Their open source project/product gives you active-active replication with eventual consistency between a server-side Postgres and a wherever-you-want SQLite. You handle the data model from the Postgres side with your normal tooling. Whenever your clients get a chance to reach the sync service/Postgres proxy, you’ll get schema updates and whatever new data the server might have into your local database. Anyway. Cool stuff. On to the local-first inference!

I really am not enjoying React Native here. Part of that is almost certainly that I’m poking around the immature, ragged edge of what people have launched, like llama.rn (based on llama.cpp) and the outdated example for Transformers.js. There are no well-trodden paths and I’m not deep enough into mobile dev to patch up the holes well myself. I also don’t know enough C/C++ to ship a custom SQLite with vector extensions and whatnot. Might get there, but I’m not there currently.

I know there are people who have made models run locally. I am not blazing a trail. But it is also not particularly mature as a space.

The current approaches I see for getting LLMs to work on phones mostly come down to quantization, which to my understanding is the process of reducing the precision of the numbers and calculations to save heavily on space and memory usage. Take Llama2 7B: at 16-bit floating point it is a 13.5GB model. Quantizing it with llama.cpp down to Q2_K, which I believe uses mostly 2-bit integers, is pretty lossy. But it brings the model down to 2.6GB and it can actually be run on a modern iPhone (mine is a 13 Pro). With infinite time I would explore Mistral 7B as an option, along with Orca variants and whatnot. There is a lot out there.
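
To get a feel for why the bit width matters so much, here is a back-of-envelope sketch in TypeScript. The ~6.74 billion parameter count for Llama2 7B and the “effective bits per weight” figure are rough numbers, and real GGUF files carry extra metadata, so treat the output as ballpark rather than exact file sizes.

```typescript
// Rough model size: parameter count × bits per weight.
// Quantized formats mix bit widths and store block scales/metadata,
// so these are ballpark figures, not exact file sizes.
const GB = 1e9;

function approxSizeGB(params: number, bitsPerWeight: number): number {
  return (params * bitsPerWeight) / 8 / GB;
}

const LLAMA2_7B_PARAMS = 6.74e9; // "7B" is really ~6.74 billion weights

console.log(approxSizeGB(LLAMA2_7B_PARAMS, 16)); // ≈ 13.5 at fp16
console.log(approxSizeGB(LLAMA2_7B_PARAMS, 3));  // ≈ 2.5 at ~3 effective bits, Q2_K territory
```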

Now there are also efforts to rework models to be fundamentally smaller, and then you have something like TinyLlama-1.1B (GGUF). This model is minuscule (relatively speaking) at 747MB. I tried the 0.3 version at one point and it was generating more nonsense than sense: wrong language, pure gibberish and more. The 1.x versions seem much more coherent, though they certainly make mistakes frequently. And they are plenty snappy.
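
For reference, loading one of these GGUF files through llama.rn looked roughly like the sketch below in my experiments. The option and field names are from my memory of the llama.rn README at the time and may well have changed, so treat it as an outline rather than copy-paste.

```typescript
// Sketch: running a TinyLlama GGUF file with llama.rn in a React Native app.
// Option names follow my recollection of the llama.rn README; verify against
// the current docs before relying on them.
import { initLlama } from 'llama.rn';

async function askTinyLlama(modelPath: string, prompt: string): Promise<string> {
  const context = await initLlama({
    model: modelPath, // file:// path to the downloaded/bundled .gguf
    n_ctx: 2048,      // context window; smaller saves memory
    n_gpu_layers: 1,  // > 0 is supposed to enable Metal offload on iOS
  });

  const { text } = await context.completion(
    {
      prompt,
      n_predict: 128, // cap on generated tokens
      stop: ['</s>'], // stop sequences
    },
    (data) => {
      // streaming callback with partial tokens, handy for showing progress
      console.log(data.token);
    },
  );

  return text;
}
```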

The crunched-up Llama has produced semi-relevant responses for me, but it has also been much harder to run. A 731MB asset is perhaps on the larger side already. But whether it goes over wifi or the Lightning cable at USB 2.x speeds, npx react-native@latest run-ios is not impressively fast when slinging a 2.6GB Llama over the wire. (Nullsoft would have loved this era of computing. WinAmp should return as an LLM, I guess. It would really whip …)

And I’ve had the phone go very warm using Llama2 models, and then pause all audio playback as background tasks stop getting priority. The CPU, GPU and probably some Neural Engine get busy doing the hard work of guessing the next word, and your phone becomes warm and inert. Not a great experience. But also kind of fun to actually see what this device can do.

Overall, my sense is that the quantized models are ahead of the built-to-be-small models. I think those approaches can converge and lead to some useful stuff. With enough persistence you can probably already wrangle some good usage out of them.

A challenge with these large models is of course that every app that wants to use one ends up being essentially the size of a full-on mobile game. And they aren’t very fast. I think there might be a case for an app that manages models for people and provides practical on-device APIs to interact with them. To my knowledge this can be done in a few ways under iOS specifically. You can do a bunch of hacking around with app URL schemes, as sketched below. Your model service app can expose Shortcuts that the user can then base integrations on, a real power-user thing. And it seems you could build an Action Extension.
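
As a sketch of the URL-scheme route: a client app hands a prompt to a hypothetical model-manager app and asks for the answer back on its own deep link. The modelrunner:// scheme, its parameters and the callback scheme are all invented for illustration; React Native’s Linking API is the real part (and on iOS the custom scheme would also need to be listed under LSApplicationQueriesSchemes for canOpenURL to work).

```typescript
// Hypothetical: delegate a prompt to a separate "model manager" app via a
// custom URL scheme and receive the result on our own deep link.
import { Linking } from 'react-native';

async function delegatePrompt(prompt: string): Promise<void> {
  const url =
    'modelrunner://complete' + // invented scheme for the model manager app
    `?prompt=${encodeURIComponent(prompt)}` +
    `&callback=${encodeURIComponent('myapp://llm-result')}`;

  if (await Linking.canOpenURL(url)) {
    await Linking.openURL(url); // hop over to the model manager app
  } else {
    // model manager not installed; fall back to a bundled model or a cloud call
  }
}

// At app startup: listen for the callback deep link coming back to us.
Linking.addEventListener('url', ({ url }) => {
  if (url.startsWith('myapp://llm-result')) {
    // parse the completion out of the URL (or fetch it via a shared app group)
  }
});
```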

Essentially this is the revival of the shared library. Assuming people do want LLM-sized models doing things on their phones, it will quickly become unpleasant that every app ships its own Llama or Mistral. This might be resolved if Apple (and Google on Android) expose their own on-device models that cover the desirable use-cases. We will find out.

I haven’t gotten into it yet, but I want to try some TensorFlow Lite/TFLite models of various types, as that approach seems well suited to phone-level devices. I just recently ran into this library for React Native. Like any React Native library I assume it will be torture to get going, but it has the potential to solve all my problems. Qualcomm recently released 80-or-so edge/mobile-friendly model variants on HuggingFace, so that’s a good place to look if you want some TFLite to toy with.
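
I haven’t run any of this yet, so purely as a sketch of the general shape of on-device TFLite inference from the JavaScript side: load a .tflite file, feed it typed arrays, get typed arrays back. The loadModel/run names below are hypothetical stand-ins, not any particular library’s API.

```typescript
// Hypothetical TFLite binding: the shape of the calls, not a real API.
type TFLiteModel = {
  run(inputs: Float32Array[]): Promise<Float32Array[]>;
};

declare function loadModel(assetPath: string): Promise<TFLiteModel>; // stand-in

async function classify(imagePixels: Float32Array): Promise<number> {
  const model = await loadModel('mobilenet.tflite'); // bundled asset, name invented
  const [scores] = await model.run([imagePixels]);   // one input tensor, one output

  // argmax over class scores
  let best = 0;
  for (let i = 1; i < scores.length; i++) {
    if (scores[i] > scores[best]) best = i;
  }
  return best;
}
```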

The work on this is paused for the foreseeable future but high on my list was to try more things with vector embeddings. Those seem perfectly feasible to generate on a phone. The TFLite version of OpenAI CLIP is 304MB.
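
Generating the embeddings depends on whichever runtime you manage to get going, but everything after that is plain vector math that a phone handles easily. A minimal sketch, assuming you already have embeddings as arrays of numbers:

```typescript
// Cosine similarity between two embedding vectors of the same length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force nearest neighbours: score everything, sort, take the top k.
// Perfectly fine for a few thousand items on-device.
function topK(
  query: number[],
  items: { id: string; embedding: number[] }[],
  k = 5,
) {
  return items
    .map((item) => ({ id: item.id, score: cosineSimilarity(query, item.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```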

I think there are many things that can be done and if you have some pain-tolerance you can already run a local model. It is kind of fun and a good challenge to get real use out of them. I certainly like the idea of not relying on cloud to achieve these things.

Now, one reason to pause this is that the ElectricSQL folks have already made some good progress on other local-first ML/AI experiments: Postgres, pgvector, Tauri and Llama2. Who needs React Native, really? Shortly after, they also launched a very exciting collab with the Neon Postgres folks: pglite.

You might have seen pglite if you frequent Hacker News and such. It would take the pgvector opportunities further. An embeddable Postgres, eh? Not bad :)
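
As a sketch of where that could go, here is roughly what pgvector-style SQL inside PGlite might look like. The PGlite calls follow my reading of its README; whether the vector extension is actually available in a given PGlite build is an assumption on my part, and the table is made up for illustration.

```typescript
// Sketch: embedded Postgres via PGlite with pgvector-style SQL.
import { PGlite } from '@electric-sql/pglite';

async function demo() {
  const db = new PGlite(); // in-memory; persistence options exist too

  // Assumption: the vector extension is available in this PGlite build.
  await db.query('CREATE EXTENSION IF NOT EXISTS vector;');
  await db.query(`
    CREATE TABLE IF NOT EXISTS notes (
      id bigserial PRIMARY KEY,
      body text,
      embedding vector(3) -- tiny dimension just for the example
    );
  `);

  await db.query('INSERT INTO notes (body, embedding) VALUES ($1, $2);', [
    'hello local-first',
    '[0.1, 0.2, 0.3]',
  ]);

  // Nearest-neighbour search by L2 distance, standard pgvector syntax.
  const result = await db.query(
    'SELECT body FROM notes ORDER BY embedding <-> $1 LIMIT 5;',
    ['[0.1, 0.2, 0.25]'],
  );
  console.log(result.rows);
}
```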

There is a repo from my experiments on my GitHub if you go digging. It may be incomplete and not build. It may have junk in it. It is not a place of honor. Overall I think I explored some good stuff, but most paths definitely hit an “oof, this is too rough” end. Trying to get SQLite with VSS was hairy. Trying to get Shortcuts working, of all things, proved incredibly frustrating. Realizing Transformers.js was missing pieces needed to make it work on React Native.

Overall, if you want to get into this, I think I’d recommend doing it from Swift/Kotlin. The natives will save you from one layer of fragile indirection while you explore the bleeding edge.

Thanks for reading. If you have questions or concerns, hit me up at @lawik on the Fediverse/Mastodon or via lars@underjord.io.

Underjord is a four-person team doing Elixir consulting and contract work. If you like the writing you should really try the code. See our services for more information.

Note: Or try the videos on the YouTube channel.