Simon Willison’s Weblog

Subscribe

sqlite-utils 2.0: real upserts

30th December 2019

I just released version 2.0 of my sqlite-utils library/CLI tool to PyPI.

It’s 2.0 because this is my first backwards compatibility breaking version of the library. I was hoping to avoid this for as long as possible: given semantic versioning, my opinion is that the highest quality libraries are the 1.x versions with the highest sub-version. Version 1.253 means a library managed to ship 253 new features without once breaking compatibility!

For sqlite-utils I needed the clean break, because I made a big mistake in my definition of “upsert”.

An upsert is a database operation which says “Update this record with these new values. If the record does not exist, create it.”

This is actually a documented feature of SQLite version 3.24.0 and higher. I’ve seen the term used for other databases such as PostgreSQL as well.

The problem is: when I first implemented the .upsert() method for sqlite-utils I incorrectly assumed that upsert was an alias for INSERT OR REPLACE.

So in sqlite-utils 1.x, the following code:

db["dogs"].upsert({
    "id": 1,
    "name": "Cleo",
    "age": 4
}, pk="id")

Executes then following SQL:

INSERT OR REPLACE INTO dogs
    ("id", "name", "age")
VALUES
    (1, "Cleo", 4);

It turns out this isn’t how upsert should be expected to work. This query will entirely replace any existing record. But... an upsert should update the record in place. Crucially, it should leave any columns not referenced in the upsert alone. My implementation over-wrote the entire row, deleting any missing values.

So if I run this Python code:

db["dogs"].upsert({
    "id": 1,
    "age": 5
}, pk="id")

The existing record for id=1 should have its age updated, but the name should stay the same.

I thought hard about whether I could keep the existing, incorrect .upsert() behavior and introduce a different method name or argument that did things right—maintaining backwards compatibility at the price of a less intuitive API—but this felt like too big a discrepancy. And this is what major version number changes are for! You can follow my deliberations about this in this issue.

While this is technically a breaking change, when I look at the code I have already written that uses sqlite-utils it turns out all of my uses of .upsert() are compatible with the new implementation anyway—they all send the full record object, so they will over-ride all of the fields no matter if they are running against 1.x or 2.x.

If I’m lucky, other users of sqlite-utils will be unaffected by this behavior change either. So hopefully it will be a painless upgrade.

If you do need to keep the old behavior, I’ve made it available as a replace=True argument to .insert() and .insert_all(), like so:

db["dogs"].insert({
    "id": 1,
    "name": "Cleo",
    "age": 4
}, pk="id", replace=True)

I did think about introducing a new piece of terminology for this—.inplace(), for insert-or-replace—but after getting in trouble redefining upsert by mistake I decided inventing new language wasn’t worth the risk.

Using this for Niche Museums

The project that made me realize I needed this upsert change was Niche Museums.

Niche Museums works by building a SQLite database file from a museums.yaml file. The YAML file contains the data for all 81 museums on the site. To add or edit a listing, I edit that file—sometimes in Visual Code on my Mac, sometimes via the GitHub web interface and often by using the Working Copy git client on my iPhone.

Creating the database is done using my yaml-to-sqlite tool:

$ yaml-to-sqlite browse.db museums museums.yaml --pk=id

The build process then runs a few extra scripts to annotate the records with extra data.

annotate_timestamps.py iterates through the full git history and uses it to figure out when each record was first created or last updated.

annotate_nominatim.py uses the OpenStreetMap Nominatim API to reverse geocode the latitude and longitude of each museum, adding extra columns for the city, state, country and all kinds of other interesting geographical details.

I’m not doing much with this OSM data yet—mainly using it to include the OSM country and city name in the full-text search index—but I hope to use it for faceting and other navigation improvements in the future (and maybe location-based breadcrumbs).

Here’s the problem though: every time I deploy the site, I run those scripts. Nominatim asks you not to hit their API more than once a second, so I have a sleep(1) in the code. This means the more museums I add, the longer that build step takes to reverse geocode them all.

It would be great to avoid geocoding museums that have already been geocoded. Properly implemented upsert let’s me do exactly that.

I haven’t made these changes yet, but my plan is to update the build script so that instead of starting from a blank database each time, it downloads the previously built version of the database and runs an upsert against it with the data from the YAML. This will overwrite all of the data that I might have updated in that file, but leave the osm_* columns that were created by the annotate_nominatim.py script in place.

That script can then just run against records for which osm_country is null. As such, it should only geocode newly created records.

Other improvements this week

I added a few new features to Niche Museums this week.

First, I added simple photo galleries to the listings pages, using CSS columns. This means I can add more than one photo to a museum! I’ve done this to a few now, including going back to the listing for the Ilfracombe Museum to add a whole bunch of extra photos.

I added photos and press links to the Atom feed for the site. Since the Atom feed is defined using a custom SQL query and datasette-atom, achieving this involved building a truly diabolical SQL query (making extensive use of the SQLite json1 extension).

I fixed an issue with the search engine where unexpected punctuation could cause an FTS syntax error in SQLite. I’ve extracted that into a fix for a Datasette issue as well.

The site search engine also now attempts to geocode the user’s query and offers a link to the corresponding lat/lon radius search for that point. Try a search for new orleans to see what that looks like—the implementation (entirely client-side, calling the Nominatim API) is here.

Here are the museums I added this week: