Solr Indexing in Kithe

So you may recall the kithe toolkit we are building in concert with our new digital collections app, which I introduced here.

I have completed some Solr indexing support in kithe. It’s just about indexing, getting your data into Solr. It doesn’t assume Blacklight, but should work fine with Blacklight; there isn’t currently any support in kithe for building search UX on top of your Solr index. You can look at the kithe guide documentation for the indexing features for a walk-through.

The kithe indexing support is based on ActiveRecord callbacks, in particular the after_commit callback. While callbacks get a bad rap, I think they are appropriate here; note that both the popular sunspot gem (Solr/Rails integration, currently looking for new maintainers) and the popular searchkick gem (ElasticSearch/Rails integration) base their indexing synchronization on AR callbacks too. (There are various ways in kithe’s feature to turn off automatic callbacks, temporarily or permanently, as there are in those other two gems.) I spent some time looking at the APIs, features, and implementation of the indexing-related functionality in sunspot and searchkick, as well as other “prior art”, before and while developing kithe’s support.
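
To make that concrete, here is a minimal sketch of the general after_commit synchronization pattern. This is not kithe’s actual implementation; the method and attribute names (update_solr_index, remove_from_solr_index, auto_indexing_disabled) are hypothetical illustrations of the pattern.

    # Sketch of the general after_commit sync pattern; not kithe's actual code.
    class Work < ApplicationRecord
      # Class-level switch, the kind of "turn off automatic callbacks" escape
      # hatch mentioned above. Hypothetical name.
      class_attribute :auto_indexing_disabled, default: false

      # after_commit fires only after the DB transaction has actually committed,
      # so we never send Solr a record that might still get rolled back.
      after_commit :update_solr_index, on: [:create, :update]
      after_commit :remove_from_solr_index, on: :destroy

      def update_solr_index
        return if auto_indexing_disabled
        # ... map this record to a Solr document and send an update ...
      end

      def remove_from_solr_index
        return if auto_indexing_disabled
        # ... send a delete-by-id to Solr ...
      end
    end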

The kithe indexing support is also based on traject for defining your mappings.
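
A mapping ends up looking roughly like the following. This is adapted from memory of the kithe guide, so check the guide for the actual macro names and details; the Solr field names and model attributes here are illustrative.

    # Roughly what a kithe/traject indexing mapping looks like; the obj_extract
    # macro and specific field/attribute names here are illustrative.
    class WorkIndexer < Kithe::Indexer
      configure do
        to_field "title_tsim",      obj_extract("title")
        to_field "department_ssim", obj_extract("department")

        # Ordinary traject logic blocks work too; the "source record" handed
        # to them is the ActiveRecord model instance itself.
        to_field "date_created_dtsi" do |record, accumulator|
          accumulator << record.created_at&.utc&.iso8601
        end
      end
    end

A model class then points at an instance of an indexer like this (in kithe, via a class-level setting on the model; see the guide for the exact name), and the after_commit machinery uses it to produce the Solr document.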

I am very happy with how it turned out, I think the implementation and public API both ended up pretty decent. (I am often reminded of the quote of uncertain attribution “I didn’t have time to write a short letter, so I wrote a long one instead” — it can take a lot of work to make nice concise code).

The kithe indexing support is independent of kithe’s other features and doesn’t depend on them. I think it might be worth looking at for anyone writing an app whose persistence is based on ActiveRecord. (If your persistence layer is ActiveModel-like but not ActiveRecord, it probably doesn’t have after_commit callbacks; but if it has after_save callbacks, we could make the kithe feature optionally use those instead; sunspot and searchkick can both do that.)

Again, here’s the kithe documentation giving a tour of the indexing features. 

Note on traject

The part of the architecture I’m least happy with is traject, actually.

Traject was written for a different use case: command-line executed, high-volume bulk/batch indexing from file serializations. It was built, YAGNI-style, for just that basic domain and context at the time.

So why try to use it for a different case, event-based syncing of one or a few objects at a time, integrated into an app?  Well, hopefully it was not just because I already had traject and was the maintainer (‘when all you have is a hammer’), although that’s a risk. Partly because traject’s mapping DSL/API has proven to work well for many existing users. And it did at least lead me to a nice architecture where the indexing code is separate and fairly decoupled from the ActiveRecord model.

And the Traject SolrJsonWriter already had nice batching functionality (and thread-safety, although I didn’t end up using that in the current kithe architecture), which made it convenient to implement batching features in a de-coupled way (just send to a writer that’s batching; the other code doesn’t need to know about it, except for maybe flushing at the end).
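
Here is roughly what that decoupling looks like at the traject level. This is a sketch from memory using traject 3’s process_with; the setting and method names are as I recall them, and the Solr URL and batch size are made up.

    # The mapping code just hands records to a writer; the writer buffers them
    # and sends one Solr update request per batch.
    writer = Traject::SolrJsonWriter.new(
      "solr.url"               => "http://localhost:8983/solr/collection1",
      "solr_writer.batch_size" => 100   # one HTTP update request per ~100 docs
    )

    indexer = WorkIndexer.new   # the traject-style indexer sketched earlier

    # Map each record and hand it to the writer; nothing here needs to know
    # whether batching is on.
    indexer.process_with(Work.all, writer)

    writer.flush   # send any remaining partial batch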

And, well, maybe I just wanted to try it. And I think it worked out pretty well, although there are some oddities in there due to traject’s current basic architectural decisions. (Like, instantiating a Traject “Indexer” can be slow, so we use a global singleton in the kithe architecture, which is weird.)  I have some ideas for possible refactors of traject (some backwards compat some not) that would make it seem more polished for this kind of use case, but in the meantime, it really does work out fine.
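
For what it’s worth, the memoization itself is trivial; something like the following, a hypothetical sketch rather than kithe’s actual internals.

    # Pay the Traject::Indexer instantiation cost once, then reuse the instance
    # process-wide (this shared global-ish object is the "weird" part).
    class WorkIndexer < Kithe::Indexer
      def self.cached_instance
        @cached_instance ||= new
      end
    end

    # Wherever a record needs (re)indexing, reuse the shared instance:
    output_hash = WorkIndexer.cached_instance.map_record(work)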

Note on times to index, legacy sufia app vs our kithe-based app

Our collection, currently in a sufia app, is relatively small. We have about 7,000 Works (some of which are “child works”), 23,000 “FileSets” (which in kithe we call “Assets”), and 50 Collections.

In our existing Sufia-based app, it takes about 6 hours to reindex to Solr on an empty index.

  • Except actually, on an empty index it might take two re-index operations, because sufia indexing relies on getting things out of the index to figure out the proper way to index the thing at hand. (We put a lot of work into reorganizing the indexing so it doesn’t require an index to index, but I’m not sure if we succeeded, and may ironically have made performance issues with fedora worse with the new patterns?) So maybe 12 hours.
  • Except that 6 hours is just a guess from memory. I tried to do a bulk reindex-everything in our sufia app to reconfirm it — but we can’t actually currently do a bulk reindex at all, because it triggers an HTTP timeout from Fedora taking too long to respond to some API request.
    • If we upgraded to ActiveFedora 12, we could increase how long ActiveFedora is willing to wait for a fedora response. If we upgraded to ActiveFedora 12.1, it would include this PR, which I believe is intended to eliminate those super long fedora responses. I don’t think it would significantly change our end-to-end indexing time, since the bulk of it is not in those initial very long fedora API calls. But I could be wrong. And I’m not sure how realistic it is to upgrade our sufia app to AF 12 anyway.
    • To be fair, if we already had an existing index, but needed to reindex our actual works/collections/filesets because of a Solr config change, we had another routine which could do so in only ~25 minutes.

In our new app, we can run our complete reindexing routine in currently… 30 seconds. (That’s about 300 records/second throughput — only indexing Works and Collections. In past versions as I was building out the indexing I was getting up to 1000 records/second, but I haven’t taken time to investigate what changed, cause 30s is still just fine).

In our sufia app we are backing up our on-disk Solr indexes, because we didn’t want to risk the downtime it would take to rebuild (possibly including fighting with the code to get it to reindex).  In addition to just being more bytes to sling, this leads to ongoing developer time on such things as “did we back up the solr data files in a consistent state? Sync’d with our postgres backup?”, and “turns out an error in the backup routine meant the backup actually wasn’t happening, and we only just noticed.” (As anyone who deals with backups of any sort knows, that can be A Thing.)

In the new system, we can just… not do that.  We know we can easily and quickly regenerate the Solr index whenever, from the data in postgres. (And if we upgrade to a new Solr version that requires an index rebuild, no need to figure out how to do so without downtime in a complicated way).
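
Concretely, the bulk reindex is little more than this (per the kithe guide, as I recall it; exact method and option names may differ):

    # Re-create the whole Solr index from what's in postgres. The index_with
    # batching block makes all the update_index calls inside share one
    # batching Solr writer.
    Kithe::Indexable.index_with(batching: true) do
      Work.find_each(&:update_index)
      Collection.find_each(&:update_index)
    end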

Why is the new system so much faster? I’ve identified three likely contributing factors, but haven’t actually done much profiling to determine which of them (if any) predominate, so I couldn’t say for sure.

  1. Getting things out of fedora (at least under sufia’s usage patterns) is slow. Getting things out of postgres is fast.
  2. We are now only indexing what we need to support search.
    • The only things that show up in our search results are Works and Collections, so that’s all we’re indexing. (Sufia indexes FileSets too, as well as some ancillary objects such as one or two kinds of permission objects, and possibly a variety of other things I’m not familiar with. Sufia is trying to put pretty much everything that’s in fedora into Solr. For Reasons, mainly that it’s hard to query your things in Fedora with Fedora.)
    • And we are only indexing the fields we actually need in Solr for those objects. Sufia tries to index a more or less round-trippable representation to Solr, with every property in its own stored solr field, etc. We aren’t doing that anymore. We could put all text in one “text” field, if we didn’t want to boost some higher than others. So we only index to as many fields as need different boosts, plus fields for facets, etc. Only what we need to support the Solr functionality we want.
      • If you want to render your results from only Solr stored fields (as sufia/hyrax do, and blacklight kind of wants you to) you’d also need those stored fields, sufficiently independently addressable to render what you want (or perhaps just in one big serialized JSON?). We are hoping to not use solr stored fields for rendering at all, but even if we end up with Solr stored fields for rendering, it will be just enough that we need for rendering. (For instance, some people using Blacklight are using solr stored fields for the “index”/search results/hits page, but not for the individual record ‘show’ page).
  3. The indexing routines in the new app send updates to Solr in an efficient way, both batching record updates into fewer Solr HTTP update requests, and not sending synchronous Solr “hard commits” at all. (The bulk reindex, like the after_commit indexing, currently sends a softCommit per update request, although this could be configured differently; see the sketch below.)
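
For the curious, at the traject writer level that kind of commit configuration looks something like the following. The solr_update_args setting name is from traject’s SolrJsonWriter as I recall it, so double-check the traject docs; the URL and batch size are made up.

    writer = Traject::SolrJsonWriter.new(
      "solr.url"                     => "http://localhost:8983/solr/collection1",
      "solr_writer.batch_size"       => 100,                   # fewer, larger HTTP update requests
      "solr_writer.solr_update_args" => { softCommit: true }   # cheap soft commit, no synchronous hard commit
    )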

So

Check out the kithe guide on indexing support! Maybe you want to use kithe, maybe you’re writing an ActiveRecord-based app and want to consider kithe’s solr indexing support in isolation, or maybe you just want to look at it for API and implementation ideas in your own thing(s).
