Simon Willison’s Weblog

Exploring search relevance algorithms with SQLite

7th January 2019

SQLite isn’t just a fast, high quality embedded database: it also incorporates a powerful full-text search engine in the form of the FTS4 and FTS5 extensions. You’ve probably used these a bunch of times already: many iOS, Android and desktop applications use SQLite under the hood to implement their built-in search.

I’ve been using these capabilities for basic search in Datasette for over a year now, but I’ve recently started digging into some of their more advanced features. It turns out hacking around with SQLite is a great way to learn more about how fundamental information retrieval algorithms work under the hood.

Today I’m releasing sqlite-fts4—a Python package that provides a collection of custom SQL functions for working with SQLite’s FTS4 module. It includes some neat tools for introspecting how relevancy ranking algorithms actually work.

Why not just use FTS5?

If it’s available to you, FTS5 is usually the best option: it has a good ranking algorithm built in. I described how to use it to build fast autocomplete search for your website for the 2018 24 ways advent calendar. You can join directly against a virtual table and order by a pre-calculated relevance score accessible through that table.
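Here’s a rough sketch of what that pattern looks like in Python, assuming an FTS5 table called articles_fts with title and author columns (the table layout and filename here are illustrative, not the exact schema from that article). Every FTS5 table exposes a hidden rank column (bm25 by default), and ordering by it ascending returns the most relevant documents first:

import sqlite3

# hypothetical database file containing an FTS5 table called articles_fts
conn = sqlite3.connect("24ways.db")
rows = conn.execute(
    "select title, author, rank from articles_fts "
    "where articles_fts match ? order by rank",
    ["jquery maps"],
).fetchall()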

What makes FTS4 interesting is that it doesn’t include a scoring mechanism: it instead exposes raw statistical data to you in a way that lets you build your own ranking functions.

You probably don’t need to do this—unless you are stuck on an older SQLite version that doesn’t support the latest features. But… if you’re interested in understanding more about how search actually works, the need to implement a ranking function is an excellent learning opportunity.

I’ll be demonstrating these functions using a hosted Datasette instance running at datasette-sqlite-fts4.datasette.io (with the data from my 24 ways article). You can play with them there, or, if you want to use your own Datasette instance, you can enable these custom SQL functions by pip installing my new datasette-sqlite-fts4 plugin.

Raw FTS4 matchinfo() data

When using FTS4, the only scoring help SQLite gives you is the built-in matchinfo() function. For each document in your search result set, this function will expose raw statistical data that can be used to calculate a score.

Let’s try it out using the following query:

select
    *, matchinfo(articles_fts, "pcx")
from
    articles_fts
where
    articles_fts match :search

Run matchinfo() in Datasette

The pcx here is called the format string—it lets SQLite know what information about the match you would like to see.

The results are returned as a binary string! For the first matching document, we get back the following:

\x02\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\xa3\x00\x00\x00\x1f\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\\\x00\x00\x00\x15\x00\x00\x00

SQLite’s C heritage is showing through here.

decode_matchinfo() to decode the binary

The first step in working with matchinfo is to decode that binary string. It’s actually a sequence of unsigned 32-bit integers. We can turn it into a Python list of integers using the following:

struct.unpack("I" * (len(matchinfo) // 4), matchinfo)
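If you want to experiment with this outside of Datasette, you can register a function like that against any Python sqlite3 connection yourself. This is a minimal sketch (the function body, filename and query here are illustrative, not the package’s own registration API):

import sqlite3
import struct

def decode_matchinfo(buf):
    # matchinfo() returns a C-style array of unsigned 32-bit integers;
    # unpack it into a Python list and return it as a string
    return repr(list(struct.unpack("I" * (len(buf) // 4), buf)))

# hypothetical database file containing the articles_fts FTS4 table
conn = sqlite3.connect("24ways.db")
conn.create_function("decode_matchinfo", 1, decode_matchinfo)
sql = """
    select title, decode_matchinfo(matchinfo(articles_fts, 'pcx'))
    from articles_fts
    where articles_fts match ?
"""
for row in conn.execute(sql, ["jquery maps"]):
    print(row)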

sqlite-fts4 exposes a SQL function called decode_matchinfo() which does exactly this. Let’s expand our example to use it:

select
    title, author,
    decode_matchinfo(matchinfo(articles_fts, "pcx")),
    matchinfo(articles_fts, "pcx")
from
    articles_fts
where
    articles_fts match :search

Run decode_matchinfo() in Datasette

The matchinfo for our first matching document now looks like this:

[2, 3, 0, 2, 2, 0, 0, 0, 1, 163, 31, 0, 2, 2, 0, 0, 0, 2, 92, 21]

Better, but still obscure. What does it mean?

The answer lies in the SQLite matchinfo documentation. In our format string, we requested p, c and x:

  • p requests a single integer representing the number of search terms we are matching. Since our search query is jquery maps, this is 2.
  • c requests the number of searchable columns in our table. We created articles_fts with 3 columns, so it’s 3. That’s the second integer in the list.
  • x is much more interesting: it returns 3 integer values for each term/column combination. Since we have 2 terms and 3 columns, that’s 2 * 3 = 6 combinations, which gives us 6 * 3 = 18 integers. If you count the items in the array above you’ll see there are 18 left after you remove the first two. Each triple represents the number of times the term appears in this column of the current row, the number of times it appears in this column across every row, and the total number of documents with at least one match for the term in this column. The sketch below shows how to slice the decoded list into these triples.
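To make that concrete, here’s a quick Python sketch that slices the decoded list from our first matching document into its p, c and x parts:

decoded = [2, 3, 0, 2, 2, 0, 0, 0, 1, 163, 31, 0, 2, 2, 0, 0, 0, 2, 92, 21]

phrases = decoded[0]   # p: 2 search terms
columns = decoded[1]   # c: 3 columns
x = decoded[2:]        # x: 3 integers per term/column combination

for phrase in range(phrases):
    for column in range(columns):
        offset = (phrase * columns + column) * 3
        hits_this_row, hits_all_rows, docs_with_hits = x[offset:offset + 3]
        print(phrase, column, hits_this_row, hits_all_rows, docs_with_hits)

Running this prints one line per term/column pair. The triple for the first term in the third column, for example, is 1, 163, 31.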

Search relevancy scores are usually calculated against exactly this kind of collection of statistics: we rank based on how rare the matching terms are across the rest of the corpus.

annotate_matchinfo() to annotate the integers

Having a list of integers made things easier, but still not easy enough. That’s where annotate_matchinfo() comes in. This custom SQL function expands the matchinfo list of integers into a giant JSON object describing exactly what each of the results means.

We’ll try it out like this:

select
    title, author,
    decode_matchinfo(matchinfo(articles_fts, "pcx")),
    json_object("pre", annotate_matchinfo(matchinfo(articles_fts, "pcx"), "pcx"))
from
    articles_fts
where
    articles_fts match :search

Run annotate_matchinfo() in Datasette

Note that we have to provide the format string twice, so that annotate_matchinfo() knows the requested order of the binary matchinfo data.

This returns a JSON object that looks like this:

{
  "p": {
    "title": "Number of matchable phrases in the query",
    "value": 2,
    "idx": 0
  },
  "c": {
    "title": "Number of user defined columns in the FTS table",
    "value": 3,
    "idx": 1
  },
  "x": {
    "title": "Details for each phrase/column combination"
    "value": [
      ...
      {
        "phrase_index": 0,
        "column_index": 2,
        "hits_this_column_this_row": 1,
        "hits_this_column_all_rows": 163,
        "docs_with_hits": 31,
        "idxs": [8, 9, 10]
      },
      {
        "phrase_index": 1,
        "column_index": 0,
        "hits_this_column_this_row": 0,
        "hits_this_column_all_rows": 2,
        "docs_with_hits": 2,
        "idxs": [11, 12, 13]
      }...
    ]
  }
}

Try it out with pcxnalyb to see the complete set of format string options.

You may be wondering why I wrapped that function call in json_object("pre", ...). This is a Datasette trick: I recently added the ability to pretty-print JSON to my datasette-html-json plugin—see that package’s README for details.

Building ranking functions

These statistics are everything we need to calculate relevance scores. sqlite-fts4 implements two such functions: rank_score() is a simple TF/IDF function. rank_bm25() is much more interesting—it’s an implementation of the Okapi BM25 algorithm, inspired by the one that ships with the peewee ORM.

Let’s try them both out:

select
    title, author,
    rank_score(matchinfo(articles_fts, "pcx")) as score,
    rank_bm25(matchinfo(articles_fts, "pcnalx")) as bm25,
    json_object("pre", annotate_matchinfo(matchinfo(articles_fts, "pcxnalyb"), "pcxnalyb"))
from
    articles_fts
where
    articles_fts match :search
order by bm25

Try rank_score() and rank_bm25() in Datasette

You can switch the order by clause between bm25 and score to compare the two.

bm25() is definitely the better option. It’s the default algorithm used these days by Elasticsearch, and they wrote up an excellent explanation of how it works on their blog.

Take a look at the source code for the ranking functions to see how they are implemented. They work against the data structure returned by annotate_matchinfo() to try and make it clear what is going on.
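To give a flavour of the shape of these functions, here’s a toy TF/IDF-style scorer written directly against the p/c/x statistics. This is a simplified sketch for illustration only, not the actual rank_score() implementation:

import struct

def toy_rank_score(matchinfo_buf):
    # expects the buffer produced by matchinfo(table, "pcx")
    info = struct.unpack("I" * (len(matchinfo_buf) // 4), matchinfo_buf)
    phrases, columns = info[0], info[1]
    x = info[2:]
    score = 0.0
    for phrase in range(phrases):
        for column in range(columns):
            offset = (phrase * columns + column) * 3
            hits_this_row, hits_all_rows, _docs_with_hits = x[offset:offset + 3]
            if hits_this_row and hits_all_rows:
                # reward documents where this term accounts for a larger
                # share of the term's total occurrences in this column
                score += hits_this_row / hits_all_rows
    # negate so that a plain ascending "order by" lists the best matches first
    return -score

# register it just like any other custom SQL function:
# conn.create_function("toy_rank_score", 1, toy_rank_score)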

Building the rank_bm25() function took me longer than I expected: I was comparing my results against bm25() from peewee to ensure I was getting them right, but I couldn’t get them to match. After some furious debugging I finally figured out the problem: peewee had a rare bug! I reported it to Charles Leifer and he analyzed it and turned around a fix in a matter of hours—it turns out the C library that peewee had ported to Python had the same problem.

Next steps

I’m really impressed with the flexibility that FTS4 provides—it turns out FTS5 isn’t the only worthwhile option for search in SQLite.

I’m thinking about ways to expose some of the bm25 tuning parameters (in particular the magic B and K1 constants explained by the Elasticsearch article) and I plan to upgrade Datasette’s search functionality to make ranking available as a first-class feature on the searchable table view.

I’m also generally excited about SQLite as a learning tool for exploring different search ranking mechanisms. Once you’ve decoded that binary matchinfo string it’s impressive how much you can get done with the underlying data.