Deep-dive into hydra-derivatives

(Actually first wrote this in November, five months ago, getting it published now…)

In our sufia 7.4 digital repository, we wanted to add some more derivative thumbnails and download JPGs from our large TIFF originals: 3-4 sizes of JPG to download, and 3 total sizes of thumbnail for the three sizes in our customized design, with each of them having a 2x version for srcset too. But we also wanted to change some of the ways the derivatives-creation code worked in our infrastructure.

1. Derivatives creation is already in a bg ActiveJob, but we wanted to run it on a different server than the rails app server. The built-in job was capable of this, downloading the original from fedora; but in our experience, in at least some circumstances, it left that temporary download behind instead of removing it when done. That caused problems, especially if you had to do bulk derivatives creation of already uploaded items.

  • Derivative-creating bg jobs ought not to be fighting over CPU/RAM with our Rails server, and also ought to be able to be on a server separately properly sized and scaled for the amount of work to be done.

2. We wanted to store derivatives on AWS S3

  • All our stuff is deployed on AWS, storing on S3 is over the long-term cheaper than storing on an Elastic Block Storage ‘local disk’.
  • If you ever wanted to horizontally scale your rails server, “local disk” storage (when delivered through a rails controller as sufia 7 does it) requires some complexity, probably a shared file system, which can be expensive and/or unreliable on AWS.
  • If we instead deliver directly from S3 to browsers, we take that load off the Rails server, which doesn’t need it. (This does make auth more challenging; we decided to punt on it for now, with the same justification and possible future directions as we discussed for DZI tiles.)
  • S3 is just a storage solution that makes sense for a whole bunch of JPGs and other assets you are going to deliver over the web, it’s what it’s for.

3. Ideally, it would be great to tweak the TIFF->JPG generation parameters a bit. The JPGs should preferably be progressive JPGs, for instance, which they weren’t in the stock codebase. The parameters might vary somewhat between JPGs intended as thumbnails and on-screen display, vs JPGs intended as downloads. The thumb ones should ideally use some pretty aggressive parameters to reduce size, such as removing embedded color profiles. (We ended up using vips instead of imagemagick).

4. Derivatives creation seemed pretty slow, it would be nice to speed it up a bit, if there were opportunities discovered to do so. This was especially inconvenient if you had to generate or re-generate one or more derivatives for all objects already existing in the repo. But could also be an issue even with routine operation, when ingesting many new files at once.

I started with a sort of “deep-dive” into seeing what Sufia (via hydra-derivatives) was doing already. I was looking for possible places to intervene, and also to see what it was doing, so that if I ended up reimplementing any of it I could duplicate anything that seemed important. I ultimately decided that I would need to customize or override so many parts of the existing stack that it made sense to just replace most of it locally. I’ll lead you through both those processes, and end with some (much briefer than usual) thoughts.

Deep-dive into Hydra Derivatives

We are using Sufia 7.4, and CurationConcerns 1.7.8. Some of this has changed in Hyrax, but I believe the basic architecture is largely similar. I’ll try to make a note of parts I know have changed in Hyrax. (links to hyrax code will be to master at the time I write this, links to Sufia and CC will be to the versions we are using).

CreateDerivativesJob

We’ll start at the top with the CurationConcerns CreateDerivativesJob. (Or similar version in Hyrax).  See my previous post for an overview of how/when this job gets scheduled.  Turns out the execution of a CreateDerivativesJob is hard-coded into the CharacterizeJob, you can’t choose to have it run a different job or none at all. (Same in hyrax).

The first thing this does is acquire a file path to the original asset file, with `CurationConcerns::WorkingDirectory.find_or_retrieve(file_id, file_set.id, filepath)`. CurationConcerns::WorkingDirectory (or see in hyrax) checks to see if the file is already there in an expected place inside CurationConcerns.working_directory, and if not copies it to the working directory from a fedora fetch,  using a Hydra::PCDM::File object.

Because it’s using Hydra::PCDM::File object #content API, it fetches the entire fedora file into memory, before writing it to the CurationConcerns.working_directory.  For big files, this uses a lot of RAM temporarily, but more distressing to me is the additional latency, to first fetch the thing into RAM and then stream RAM to disk, instead of streaming right to disk. While the CurationConcerns::WorkingDirectory code seems to have been written originally to try to stream, with a copy_stream_to_working_directory method in terms of streams, the current implementation just turns a full in-memory string into a StringIO instead.  The hyrax implementation is the same. 
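To make the difference concrete, here is a minimal sketch in plain Ruby. It is not the CurationConcerns code: the helper name `stream_to_working_file` is hypothetical, and a StringIO stands in for the fedora response body. The point is only that IO.copy_stream moves bytes in chunks instead of first building the whole file as one String.

```ruby
require "stringio"
require "tempfile"

# Hypothetical helper: copy any readable IO to a file on disk in chunks,
# so the full content never has to sit in memory as a single String.
def stream_to_working_file(source_io, dest_path)
  File.open(dest_path, "wb") do |dest|
    IO.copy_stream(source_io, dest)
  end
  dest_path
end

# Stand-in for a network response body (an assumption for this demo).
source = StringIO.new("fake TIFF bytes" * 1000)

working = Tempfile.new(["original", ".tiff"])
working.close
stream_to_working_file(source, working.path)
copied_size = File.size(working.path)
```

With a real HTTP fetch, something like Net::HTTP's block form of `read_body` would play the role of the source, writing each chunk to the file as it arrives.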

Back to the CreateDerivativesJob, we now have a filename to a copy of the original asset in the ‘working directory’.  I don’t see any logic here to clean up that copy, so perhaps this is the source of the ‘temporary file buildup’ my team has sometimes seen.  I’m not sure why we only sometimes see it, or if there are other parts of the stack meant to clean this up later in some cases. I’m not sure if the contract of `CurationConcerns::WorkingDirectory#find_or_retrieve` is to always return a temporary file that the caller is meant to clean up when done, if it’s always safe to assume the filename returned can be deleted by caller; or if instead future actors are meant to use it and/or clean it up.
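One way to make the contract unambiguous is to have whoever creates the working copy also own its removal. A minimal sketch, assuming a hypothetical `with_working_copy` helper (nothing like this exists in CurationConcerns as far as I know):

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper: yields the path of a working copy and always removes
# it afterwards, even if the block raises.
def with_working_copy(content, filename)
  dir = Dir.mktmpdir("derivatives-working")
  path = File.join(dir, filename)
  File.binwrite(path, content)
  yield path
ensure
  FileUtils.remove_entry(dir) if dir && File.exist?(dir)
end

seen_path = nil
existed_inside_block = false
with_working_copy("original bytes", "original.tiff") do |path|
  seen_path = path
  existed_inside_block = File.exist?(path)
  # ... derivatives would be created from `path` here ...
end
```

The trade-off is that a later step cannot reuse the file and has to fetch it again.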

The CreateDerivativesJob does an acquire_lock_for: I think this is probably left over from when derivatives were actually stored in fedora; now that they are not, this seems superfluous (and possibly expensive, not sure). And indeed it’s gone from the hyrax version, so that’s probably true.

Later, the CreateDerivativesJob reindexes the fileset object (first doing a file_set.reload, I think that’s from fedora, not solr?), and in some cases its parent. This is a potentially expensive operation, which matters especially if you’re, say, trying to regenerate all derivatives. Why does it need a reindex? Well, sufia/hyrax objects in the Solr index have a relative URL to thumbnails in a `thumbnail_path_ss` field (a design our app no longer uses). But thumbnail paths in sufia/hyrax are consistently predictable from file_set_id, of the form /downloads/#{file_set_id}?file=thumbnail. Maybe the reindex dates from before this was true? Or maybe it’s just meant to register “yes, a thumbnail is there now”, so the front-end can tell the difference between a thumb that is present and one that is absent? (I’d rather just keep that out of the index and handle thumbs not present at expected URLs with some JS.)
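Since the path follows a fixed pattern, it can be computed without consulting the index at all. A tiny sketch, with a hypothetical helper name:

```ruby
# Hypothetical helper mirroring the URL pattern described above.
def thumbnail_path(file_set_id)
  "/downloads/#{file_set_id}?file=thumbnail"
end

example_path = thumbnail_path("abc123")
```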

I tried removing the index update from my locally overridden CreateDerivativesJob, and discovered one reason it is there. In normal operation, this is the only time a parent work gets reindexed after a fileset is added to it that will be marked as its representative fileset. And it needs to get reindexed to have the representative_id and such. I added it to AddFileToFileSet instead, where it belongs. Phew!

So anyway, how are the derivatives actually created? Just by calling file_set.create_derivatives(filename). Note that a method on the model object, taking the actual local (working directory) filename, doesn’t seem quite the right place for this, since you might want different derivatives in different contexts for the same model, but it works. Hyrax is making the same call. Hyrax introduces a DerivativeService class not present in Sufia/CC, which I believe is meant to support easier customization.

FileSet#create_derivatives

FileSet#create_derivatives is defined in a module that gets mixed into your FileSet class. It branches on the mime type of your original, running different (hard-coded) classes from the hydra-derivatives gem depending on type.  For images, that’s:

Hydra::Derivatives::ImageDerivatives.create(filename,
 outputs: [{ label: :thumbnail, 
             format: 'jpg', 
             size: '200x150>', 
             url: derivative_url('thumbnail') }])

You can see it passes in the local filepath again, as well as various options in an outputs keyword arg, including a specified url of the to-be-created derivative, as a single hash inside an array for some reason. derivative_url uses a derivative_path_factory to get a path (on local FS?) and change it into a file: url. So this is really more of a path than a URL; it’s apparently not the eventual end-user-facing URL, but just instructions for where to write the file. The derivative_path_factory is a DerivativePath, which uses CurationConcerns.config.derivatives_path to decide where to put it. It seems like there’s a baked-in assumption (passed through several layers) that the destination will be on a local filesystem on the machine running the job.
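To show why this “url” is really a path in disguise, here is a small sketch. The `derivative_path` helper and its directory layout are made up for illustration; the real DerivativePath lays files out differently.

```ruby
require "uri"

# Hypothetical stand-in for a derivative path factory: builds a local path
# from a root directory, a file set id, and a derivative label.
def derivative_path(root, file_set_id, label)
  File.join(root, "#{file_set_id}-#{label}.jpeg")
end

# Wrap the local path as a file: URL, then recover the path from it.
path = derivative_path("/var/derivatives", "abc123", "thumbnail")
url = "file://#{path}"
recovered = URI.parse(url).path
```

Nothing in the round trip knows about a web-facing location; the URL only tells the writer where on disk to put the bytes.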

Hyrax actually changes this somewhat: the relevant create_derivatives method seems to have moved to the FileSetDerivativeService. It works largely the same, although the different code to run for each mime-type branch has been moved to separate methods, perhaps to make it easier to override. I’m not quite sure how/where FileSet#create_derivatives is defined (Hyrax CreateDerivativesJob still calls it), as the Hyrax::FileSet::Derivatives module doesn’t seem to mix it in anymore. But FileSet#create_derivatives presumably calls #create_derivatives on the FileSetDerivativeService somehow. Since I was mainly focusing on our code using Sufia/CC, I stopped following the trail here. The Hyrax version does have a cleanup_derivatives method as a before_destroy, presumably on the FileSet itself, which is about cleaning up derivatives if a fileset is deleted (did the sufia version not do that at all?). Hyrax seems to still be using the same logic from hydra-derivatives to actually do derivatives creation.

Since I was mostly interested in images, I’m going to dive specifically into only the Hydra::Derivatives::ImageDerivatives code. Both Hyrax and Sufia use this. Our Sufia 7.4 app is using hydra-derivatives 3.2.1. At the time of this writing, the latest hydra-derivatives release is 3.3.2, and hyrax does require 3.3.x, so a different minor version than what I’m using.

Hydra::Derivatives::ImageDerivatives and cooperators

If we look at Hydra::Derivatives::ImageDerivatives (same in master and 3.2.1) — there isn’t much there. It sets a self.processor_class to Processors::Image, inherits from Runner, and does something to set a format: png as a default argument.

The superclass Hydra::Derivatives::Runner has some business logic for being a derivative processor. It has a class-wide output_file_service defaulting to whatever is configured as Hydra::Derivatives.output_file_service, and a class-wide source_file_service defaulting to Hydra::Derivatives.source_file_service. It fetches the original using the source file service. For each arg hash passed in (now we understand why that argument was an array of hashes), it just sends it to the configured processor class, along with the output_file_service. The processor_class seems to be responsible for using the passed-in output_file_service to actually write output. While it also passes in the source_file_service, this seems to be ignored: the source file itself has already been fetched and had its local file system path passed in directly, and I did not find anything using the passed-in source_file_service. (This logic seems the same between 3.2.1 and current master.)
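The shape of that collaboration can be sketched in a few lines of plain Ruby. Every class name below is hypothetical, and this is not the hydra-derivatives source; it only mirrors the roles described above: a runner with class-level dependencies, one processor instance per output hash, and an output service that does the actual writing.

```ruby
# Hypothetical output service: records what it was asked to write.
class InMemoryOutputService
  attr_reader :written

  def initialize
    @written = {}
  end

  def call(content, directives)
    @written[directives.fetch(:url)] = content
  end
end

# Hypothetical processor: "creates" a derivative and hands it to the
# output service it was given.
class LabelingProcessor
  def initialize(source_path, directives, output_file_service:)
    @source_path = source_path
    @directives = directives
    @output_file_service = output_file_service
  end

  def process
    content = "#{File.basename(@source_path)} as #{@directives.fetch(:label)}"
    @output_file_service.call(content, @directives)
  end
end

# Hypothetical runner: class-level dependencies, one processor per output.
class SketchRunner
  class << self
    attr_accessor :processor_class, :output_file_service
  end

  def self.create(source_path, outputs:)
    outputs.each do |directives|
      processor_class
        .new(source_path, directives, output_file_service: output_file_service)
        .process
    end
  end
end

SketchRunner.processor_class = LabelingProcessor
SketchRunner.output_file_service = InMemoryOutputService.new
SketchRunner.create(
  "/tmp/original.tiff",
  outputs: [
    { label: :thumbnail, url: "file:///derivs/thumb.jpeg" },
    { label: :large, url: "file:///derivs/large.jpeg" }
  ]
)
results = SketchRunner.output_file_service.written
```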

In my Sufia app, Hydra::Derivatives.output_file_service is CurationConcerns::PersistDerivatives, which basically just writes it to local file system, again using a derivative_path_factory set to DerivativePath. The derivative_path_factory in PersistDerivatives probably has to match the one up in FileSet#create_derivatives; I guess if you changed the derivative_path_factory in your FileSet but not in PersistDerivatives, bad things would probably happen? And Hydra::Derivatives.source_file_service is CurationConcerns::LocalFileService, which does nothing but open the local file path passed in and return a File object. Hyrax has pretty much the same PersistDerivatives and LocalFileService services; I would guess they are also the defaults, although I haven’t checked.

I’d guess this architecture was designed with the intention that if you wanted to get a source file from somewhere other than local file system, you’d set a custom  source_file_service.   But even though Sufia and Hyrax do get a source file from somewhere else, they don’t customize the source_file_service, they fetch from fedora a layer up and then just pass in a local file that can be handled by the LocalFileService.

Okay, but what about actually creating derivatives?

So okay, the actual derivative generation though, recall, was handled by the processor_class dependency, hard-coded to Processors::Image.

Hydra::Derivatives::Processors::Image I think is the same in hydra-derivatives 3.2.1 and current master. It uses MiniMagick to do its work. It will possibly change the format of the image, and possibly set (or change?) its quality (which mostly only affects JPGs I think, maybe PNGs too). Then it will run a layer flatten operation on the image, and resize it. Recall that #create_derivatives actually passed in an imagemagick-compatible argument for desired size, size: '200x150>', so create_derivatives is actually assuming that Hydra::Derivatives::ImageDerivatives.create will be imagemagick-based, or at least understand imagemagick-type size specifications. There’s some coupling here.
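For anyone replacing imagemagick, that size string has to be interpreted by something. As I understand the ImageMagick geometry syntax, a trailing `>` means “only shrink images larger than these dimensions”, keeping the aspect ratio. Here is a rough sketch of that rule; the helper name is hypothetical, it handles only the `WxH>` form, and ImageMagick’s own rounding may differ by a pixel in edge cases.

```ruby
# Hypothetical helper: compute output dimensions for a "WxH>" geometry.
# Images already within the box are left alone; larger ones are scaled
# down to fit inside it, preserving aspect ratio.
def shrink_to_fit(width, height, geometry)
  match = /\A(\d+)x(\d+)>\z/.match(geometry)
  raise ArgumentError, "unsupported geometry: #{geometry}" unless match

  max_width = match[1].to_i
  max_height = match[2].to_i
  return [width, height] if width <= max_width && height <= max_height

  scale = [max_width.to_f / width, max_height.to_f / height].min
  [[(width * scale).round, 1].max, [(height * scale).round, 1].max]
end

landscape = shrink_to_fit(4000, 3000, "200x150>")
small = shrink_to_fit(100, 80, "200x150>")
square = shrink_to_fit(3000, 3000, "200x150>")
```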

MiniMagick actually does its work by shelling out to command-line imagemagick (or optionally graphicsmagick, which is more or less API-compatible with imagemagick). A line in the MiniMagick README makes me concerned about how many times MiniMagick is writing temporary files:

MiniMagick::Image.open makes a copy of the image, and further methods modify that copy (the original stays untouched). We then resize the image, and write it to a file. The writing part is necessary because the copy is just temporary, it gets garbage collected when we lose reference to the image.

I’m not sure if that would apply to the flatten command too. Or even the format and quality directives?  If the way MiniMagick is being used, files are written/read multiple times, that would definitely be an opportunity for performance improvements, because these days touching the file system is one of the slowest things one can do. ImageMagick/GraphicsMagick/other-similar are definitely capable of doing all of these operations without interim temporary file system writes in between each, I’m not certain if Hydra::Derivatives::Processors::Image use of MiniMagick is doing so.
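For comparison, here is a sketch of what “all operations in one pass” looks like when building the command line yourself. The helper name is hypothetical; `-flatten`, `-resize` and `-quality` are standard ImageMagick options, and `convert` is the ImageMagick 6 executable name (ImageMagick 7 prefers `magick`). This says nothing about what MiniMagick does internally; it only shows that a single process can read the original once and write the result once.

```ruby
# Hypothetical helper: build one ImageMagick invocation that flattens,
# resizes, and sets quality in a single process, with no interim files.
def convert_command(input, output, size:, quality: nil)
  command = ["convert", input, "-flatten", "-resize", size]
  command += ["-quality", quality.to_s] if quality
  command << output
  command
end

command = convert_command("original.tiff", "thumb.jpg", size: "200x150>", quality: 85)
# To actually run it (requires ImageMagick to be installed):
#   system(*command)
```

Passing the array to `system` as separate arguments avoids the shell, so the `>` in the geometry is not treated as a redirect.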

It’s not clear to me how to change what operations Hydra::Derivatives::Processors::Image does. Let’s say you want to strip extra metadata for a smaller thumb, as for instance Google suggests: how would you do that? I guess you’d write your own class to use as a processor_class. It could sub-class Hydra::Derivatives::Processors::Image or not (really no need for a sub-class I don’t think, what it’s doing is pretty straightforward). How would you set your custom processor to be used? I guess you’d have to override the line in Hydra::Derivatives::ImageDerivatives. Or perhaps you should instead provide your own class to replace Hydra::Derivatives::ImageDerivatives, and have that used instead? In Sufia that would probably be by overriding FileSet#create_derivatives to call your custom class. Or in Hyrax, there’s that newer Hyrax::DerivativeService stuff; perhaps you’d change your local FileSet to use a different DerivativeService, which seems at least more straightforward (alas I’m not on Hyrax). If you did this, I’m not sure if it would be recommended for you to re-use pieces of the existing architecture as components (and in what way), or just write the whole thing from scratch.

Some Brief Analysis and Decision-making

So I actually wanted to change nearly every part of the default pipeline here in our app.

Reading: I want to continue reading from fedora, being sure to stream it from fedora to local file system as a working copy.

Cleanup: I want to make sure to clean up the temporary working copy when you’re done with it, which I know in at least some cases was not being done in our out of the box code. Maybe to leave it around for future ‘actor’ steps? In our actual app, downloading from one EC2 to another on the same local AWS network is very speedy, I’d rather just be safe and clean it up even if it means it might get downloaded again.

Transformation: I want to have different image transformation options: stripping metadata, interlaced JPGs, setting color profiles. Maybe different parameters for images to be used as in-browser thumbs vs downloadable files. (See advice about thumb parameters from Google, or from vips.) Maybe using a non-ImageMagick processor (we ended up with vips).
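As an illustration of per-purpose parameters, here is a sketch that builds `vipsthumbnail` command lines from a small table of settings. The table, the helper name, and the specific numbers are all made up for the example and are not the parameters we actually use. `--size`, `-o`, and the bracketed JPEG save options (`Q`, `interlace`, `strip`, `optimize_coding`) are options I believe libvips supports, though newer releases may prefer a different spelling for stripping metadata, so check the version you have.

```ruby
# Hypothetical settings per derivative purpose; numbers are illustrative.
DERIVATIVE_PROFILES = {
  thumbnail: { size: "200x150", save_options: "Q=85,interlace,strip,optimize_coding" },
  download: { size: "1200x1200", save_options: "Q=90,interlace" }
}.freeze

# Hypothetical helper: build a vipsthumbnail invocation for one purpose.
def vipsthumbnail_command(input, output, purpose)
  profile = DERIVATIVE_PROFILES.fetch(purpose)
  [
    "vipsthumbnail", input,
    "--size", profile[:size],
    "-o", "#{output}[#{profile[:save_options]}]"
  ]
end

thumb_command = vipsthumbnail_command("/work/original.tiff", "/work/thumb.jpg", :thumbnail)
download_command = vipsthumbnail_command("/work/original.tiff", "/work/large.jpg", :download)
# To actually run one (requires libvips tools to be installed):
#   system(*thumb_command)
```

Keeping the settings in data rather than in code makes it easy to give thumbs aggressive size reduction while leaving downloads closer to the original.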

Output: I want to write to S3, because it makes sense to store assets like this there, especially but not only if you’re deploying on AWS already like we are.  Of course, you’d have to change the front-end to find the thumbs (and/or downloads) at a separate URL still, more on that later.

So, there are many parts I wanted to customize. And for nearly all of them, it was unclear to me what the ‘right’/intended/best way to customize was in the current architecture. I figured, okay then, I’m just going to completely replace CreateDerivativesJob with my own implementation.

The good news is that worked out pretty fine — the only place this is coupled to the rest of sufia at all, is in sufia knowing what URLs to link to for thumbs (which I suspect many people have customized already, for instance to use an IIIF server for thumbs instead of creating them statically, as the default and my new implementation both do). So in one sense that is an architectural success!

Irony?

Sandi Metz has written about the consequences of “the wrong abstraction”, sometimes paraphrased as “the wrong abstraction is worse than no abstraction.”

hydra-derivatives, and parts of sufia/hyrax that use it, have a pretty complex cooperating object graph, with many cooperating objects and several inheritance hierarchies.  Presumably this was done intending to support flexibility, customization, and maintainability, that’s why you do such things.

Ironically, adding more cooperating objects (that is, abstractions), can paradoxically inhibit flexibility, customizability, or maintainability — if you don’t get it quite right. With more code, there’s more for developers to understand, and it can be easy to get overwhelmed and not be able to figure out the right place to intervene for a change  (especially in the absence of docs). And changes and improvements to the codebase can require changes across many different accidentally-coupled objects in concert, raising the cost of improvements, especially when crossing gem boundaries too.

If the lines between objects, and the places objects interface with each other, aren’t drawn quite right to support needed use cases, you may sometimes have to customize or override or change things in multiple places now (because you have more places) to do what seems like one thing.

Some of this may be at play in hydra-derivatives and sufia/hyrax’s use of it. And I think some of it comes from people adding additional layers of abstraction to try to compensate for problems in the existing ones, instead of changing the existing ones. (Why does one do this? For backwards compat reasons? Because they don’t understand the existing ones enough to touch them? Organizational boundaries? Quicker development?)

It would be interesting to do a survey to see how often hooks in hydra-derivatives that seem to have been put there for customization have actually been used, or what people are doing instead/in addition for the customization they need.

Getting architecture right (the right abstractions) is not easy, and takes more than just good intentions. It probably takes pretty good understanding of the domain and expected developer usage scenarios; careful design of object graphs and interfaces to support those scenarios; documentation of such to guide future users and developers. Maybe ideally starting with some working individual examples in local ‘bespoke’ codebases that are only then abstracted/generalized to a shared codebase (which takes time). And with all that, some luck and skill and experience too.

The number of different cooperating objects you have involved should probably be proportional to how much thinking and research you’ve done about usage scenarios to support and how the APIs will support them — when in doubt keep it simpler and less granular.

What We Did

Everything in this article up to here, I wrote about 5 months ago. Then I sat on it until now… for some reason the whole thing just filled me with a sort of psychic exhaustion, can’t totally explain it. So, looking back at code I wrote a while ago, I can try to give you a very brief overview of it.

Here’s the PR, which involves quite a bit of code, as well as building on top of some existing custom local architecture.

We completely override the CreateDerivativesJob#perform method, to just call our own “service” class to create derivatives (extracted into a service object instead of being inline in the job!), if our Env variables are configured to use our new-fangled store-things-on-s3 functionality. Otherwise we call super, but try to clean up the temporary working files that the built-in code was leaving lying around to fill up our file system.
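The overall shape of that override looks roughly like the sketch below. It is not our actual code: the stock job and the service are replaced with stand-in classes so the example runs outside a Rails app, and the class names, the `STORE_DERIVATIVES_ON_S3` variable, and the `perform` signature are all assumptions for illustration.

```ruby
require "tempfile"

# Stand-in for the stock job (hypothetical): pretends to create derivatives
# and leaves the working file behind.
class StockCreateDerivativesJob
  def perform(_file_set, _file_id, _filepath = nil)
    :stock
  end
end

# Stand-in for a custom derivatives service (hypothetical).
class S3DerivativesService
  def initialize(file_set, file_id)
    @file_set = file_set
    @file_id = file_id
  end

  def call
    :s3
  end
end

class CreateDerivativesJobSketch < StockCreateDerivativesJob
  def perform(file_set, file_id, filepath = nil)
    if ENV["STORE_DERIVATIVES_ON_S3"] == "true"
      S3DerivativesService.new(file_set, file_id).call
    else
      begin
        super
      ensure
        # Remove the working copy the stock code may have left behind.
        File.delete(filepath) if filepath && File.exist?(filepath)
      end
    end
  end
end

leftover = Tempfile.create(["working", ".tiff"])
leftover.close

ENV.delete("STORE_DERIVATIVES_ON_S3")
stock_result = CreateDerivativesJobSketch.new.perform(:file_set, "file1", leftover.path)

ENV["STORE_DERIVATIVES_ON_S3"] = "true"
s3_result = CreateDerivativesJobSketch.new.perform(:file_set, "file1")
ENV.delete("STORE_DERIVATIVES_ON_S3")
```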

Our derivatives-creating service is relatively straightforward.  Creating a bunch of derivatives and storing them in S3 is not something particularly challenging.

We made it harder for ourselves by trying to support derivatives stored on S3 or in local file system, based on config; partially because it’s convenient to not have to use S3 in dev and test, and partially thinking about generalizing to share with the community.

Also, there needs to be a way for front-end code to get urls to derivatives of course, and really this should be tied into the derivatives creation, something hydra-derivatives appears to lack.  And in our case, we also need to add our derivatives meant to be offered as downloads to our ‘downloads’ menu, including in our custom image viewer. So there’s a lot of code related to that, including some refactoring of our custom image viewer.
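One way to tie URL generation to storage is to make both part of the same small interface, so the code that writes a derivative and the code that links to it ask the same object. A sketch under that assumption follows; the class names are hypothetical, and the bucket-backed class is an in-memory stand-in that makes no real S3 calls.

```ruby
require "fileutils"
require "stringio"
require "tmpdir"

# Writes derivatives under a local directory; URLs hang off a base path.
class LocalDerivativeStorage
  def initialize(root:, base_url:)
    @root = root
    @base_url = base_url
  end

  def write(key, io)
    path = File.join(@root, key)
    FileUtils.mkdir_p(File.dirname(path))
    File.open(path, "wb") { |file| IO.copy_stream(io, file) }
    key
  end

  def url_for(key)
    "#{@base_url}/#{key}"
  end
end

# In-memory stand-in for a bucket-backed store (hypothetical host name).
class FakeBucketStorage
  def initialize(bucket:)
    @bucket = bucket
    @objects = {}
  end

  def write(key, io)
    @objects[key] = io.read
    key
  end

  def url_for(key)
    "https://#{@bucket}.example.org/#{key}"
  end
end

# Pick a storage implementation from a config value.
def build_storage(kind)
  case kind
  when "bucket" then FakeBucketStorage.new(bucket: "derivatives")
  when "local" then LocalDerivativeStorage.new(root: Dir.mktmpdir, base_url: "/derivatives")
  else raise ArgumentError, "unknown storage kind: #{kind}"
  end
end

storage = build_storage("local")
key = storage.write("abc123/thumb_200.jpg", StringIO.new("jpeg bytes"))
thumb_url = storage.url_for(key)
bucket_url = build_storage("bucket").url_for("abc123/thumb_200.jpg")
```

Front-end code then only ever calls `url_for`, and switching storage in dev versus production is a config change.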

One neat thing we did is (at least when using S3, as we do in production) deliver our downloads with a content-disposition header specifying a more human-friendly filename, including the first few words of the title.
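A sketch of how such a filename and header value could be built. The helper names, the three-word limit, and the naming scheme are assumptions for illustration, not our exact code.

```ruby
# Hypothetical helper: build a friendly filename from the first few words
# of a title, plus a size label and extension.
def download_filename(title, suffix:, extension:, max_words: 3)
  words = title.downcase.scan(/[a-z0-9]+/).first(max_words)
  base = words.empty? ? "download" : words.join("_")
  "#{base}_#{suffix}.#{extension}"
end

# Hypothetical helper: the Content-Disposition header value for a download.
def content_disposition(filename)
  %(attachment; filename="#{filename}")
end

filename = download_filename("Portrait of Robert Boyle, 1689", suffix: "large", extension: "jpg")
header = content_disposition(filename)
```

With S3, a value like this can, as far as I know, be supplied per request through the `response-content-disposition` parameter of a signed URL, so the stored object itself does not need to carry the friendly name.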

Generalizing? Upstream? Future?

I knew from the start that what I had wasn’t quite good enough to generalize for upstream or other shareable dependency.  In fact, in the months since I implemented it, it hasn’t worked out great even for me, additional use cases I had didn’t fit neatly into it, my architecture has ended up overly complex and confusing.

Abstracting/generalizing to share really requires even more care and consideration to get the right architecture, compared to having something that works well enough for your app. In part, because refactoring something only used by one app is a lot less costly than with a shared dependency.

Initially, some months ago, even knowing what I had was not quite good enough to generalize, I thought I had figured out enough and thought about enough to be able to spend more time to come up with something that would be a good generalized shareable dependency.  This would only be worth spending time on if there seemed a good chance others would want to use it of course.

I even had a break-out session at Samvera Connect to discuss it, and others who showed up agreed that the current hydra-derivatives API was really not right (including at least one who was involved in writing it originally), and that a new try was due.

And then I just… lost steam to do it. In part overwhelmed by community things: the process of doing a samvera working group, the uncertainty of knowing whether anyone would really switch from hydra-derivatives to use a new thing, of whether it could become the thing in hyrax (with the hyrax valkyrie refactor already going on, how does this affect it?), etc.

And in part, I just realized… the basic challenge here is coming up with the right API and architecture to a) allow choice of back-end storage (S3, local file system, etc), with b) URL generation, and ideally an API for both streaming bytes from the storage location and downloading the whole thing, regardless of back-end storage. This is a harder part architecturally than just actually creating the derivatives. And… nothing about this is particularly unique to the domain of digital collections/repositories, so isn’t there something already existing we could just use?

My current best bet is shrine. It already handles those basic things above with a really nice, very flexible decoupled architecture. It’s a bit more confusing to use than, say, carrierwave (or the newer built-into-Rails ActiveStorage), but that’s because it’s a more flexible decoupled-components API, which is probably worth it so we can do exactly what we want with it and build it into our own frameworks. (More flexibility is always more complexity; I think ActiveStorage currently lacks the flexibility we need for our community’s use cases.) Although it works great with Rails and ActiveRecord, it doesn’t even depend on Rails or ActiveRecord (the author prefers hanami I think), so it quite possibly could work with ActiveFedora too.

But then the community (maybe? probably?) seems to be… at least in part… moving away from ActiveFedora too. Could you integrate shrine, to support derivatives, with valkyrie in a back-end independent way? I’m sure you could; I have no idea what the best way would be to do so, how much work it would be, the overall cost/benefit, or whether anyone would use it if you did.

So I’m not sure I’m going to be looking at shrine myself in a valkyrie context. (Although I think the very unsuitable hydra-derivatives is the only relevant shared dependency anyone is currently using with valkyrie, and presumably what hyrax 3 will still be using, and I still think it’s not really… right).

But I am going to be looking at shrine more. I’ve already started talking to the shrine author about what I see as my (and my understanding of our community’s) needs for features for derivatives (which shrine currently calls “versions”), and I think I’m going to try to do some R&D on a new shrine plugin that meets my/our needs better. I’m not sure whether I’ll end up wanting to try to integrate it with valkyrie and/or hyrax, or with some new approaches I’ve been thinking on and doing some R&D on, which I hope to share more about in the medium-term future.
