Cover by Archivo-FSP - Own work, CC BY-SA 3.0, original

Archiving is important

I’m a big fan of archives, of all sorts. Whatever we might think in the heat of the current times, we owe it to future generations to preserve our cultural artifacts so that they might be analyzed and studied in the future, devoid of current bias - and surely imbued with new ones, but let’s not go there.

It’s utterly maddening that we probably know more about certain periods of the Roman empire than about the early stages of the Web: its links are all dead, and the hard drives that stored those early nuggets of optimism and experimentation are either dead too or in some landfill by now.

And it’s not just the early Web: one study from 2019 posits that “Just 1.11% of all links lasted longer than three months”, which is kinda scary. This came to affect me personally! I save a lot of links to read later with Linkding, which does feature Wayback Machine integration, though I had never used it before. More than once now, I’ve wanted to return to a link I saved a few weeks earlier, only to find it’s just gone. Sometimes I can find it on Wayback, but not always. So I started looking for what’s out there in terms of solutions.

Archiving landscape

I’m only really interested in open-source solutions that allow exporting data in standard formats, to prevent being unable to access the data I produce due to the very bitrot that I’m trying to fight. From my research, these are the most active solutions in the space:

  • ArchiveBox does almost everything, but it’s kinda clunky, the REST API is in alpha, and running it requires a few add-on services too
  • replayweb.page has no pre-packaged Docker setup I could try out
  • Conifer (formerly Webrecorder.io) looks cool, but the Docker setup requires a ton of extra services to run
  • SingleFile is a pretty cool Firefox addon that stores the current page as a single HTML file, pictures and all! Very close to what I want, but it’s bothersome to move files to my NAS for safekeeping

SingleFile sure is interesting, I wonder if there’s a way to achieve a similar result with something else… 🤔

Getting started

I wanted to see if there was a CLI version of SingleFile, and luckily there is. Using something like webhook, I could create a script to run it and output the resulting file to a known location I can access via NFS or WebDAV. Then I just needed something to trigger the hook from Firefox, ideally from the toolbar. As luck would have it, Send Tab URL is a thing! So the plan is: the addon sends the current page’s URL and title to the server, which runs singlefile-cli, and voilà!

Step 1

Let’s set up webhook - this is the YAML I’m using:

- id: singlefile-url
  execute-command: "/path/to/archive_script"
  command-working-directory: "/path/to"
  success-http-response-code: 200
  pass-arguments-to-command:
  - source: "url"
    name: "url"
  - source: "url"
    name: "title"
  trigger-rule:
    match:
      type: "value"
      value: "<STRONG RANDOM STRING>"
      parameter:
        source: "url"
        name: "secret"

Calls to this URL will look like /hooks/singlefile-url?url={URL}&title={TITLE}&secret={STRONG RANDOM STRING}, with the random string there just to prevent possible misuse - callers will have to match it for the hook to be triggered.
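One thing worth keeping in mind is that the url and title values need to be URL-encoded, otherwise an & or ? inside the page address would be read as part of the hook's own query string. Here's a minimal sketch of building such a call by hand, say for testing the hook from a terminal. The hook_url helper, the nas.local host and port 9000 are my own placeholders, not something the setup above requires:

```ruby
require "uri"

# Hypothetical helper: builds the URL that triggers the singlefile-url hook.
# `base` is an assumption - use whatever host and port your webhook listens on.
def hook_url(base, page_url, title, secret)
  # encode_www_form percent-encodes each value, so characters like & and ?
  # inside the page URL can't be mistaken for separators in the hook's query.
  query = URI.encode_www_form(url: page_url, title: title, secret: secret)
  "#{base}/hooks/singlefile-url?#{query}"
end

puts hook_url(
  "http://nas.local:9000",
  "https://example.com/post?id=1&lang=en",
  "Tips & Tricks",
  "STRONG_RANDOM_STRING"
)
```

The printed URL can be pasted into curl or a browser to trigger an archive run without going through the addon.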

The archive_script just grabs stuff from ARGV and calls singlefile:

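#!/usr/bin/env ruby

# webhook passes the query parameters as positional arguments: url, then title.
url = ARGV[0].dup
title = ARGV[1].dup

raise "need a url" unless url
raise "need a title" unless title

# Drop any path-like prefix, replace unsafe characters, and cap the length.
safe_title = title.gsub(/^.*(\\|\/)/, '').gsub(/[^0-9A-Za-z.\-]/, '_')[0..255]
image = "capsulecode/singlefile"

# Pass the arguments as a list so the caller-supplied URL never goes through
# a shell, and redirect the container's stdout into the .html file.
system(
	"docker", "run", "--rm", image, url,
	out: "/path/to/singlefile/#{safe_title}.html",
	exception: true
)
File.write("/path/to/singlefile/#{safe_title}.url", url)

puts
puts "OK"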

Here I’m using the Docker image for singlefile as I don’t want to pollute the local state on my NAS too much. Since the script creates a file from the page title, I make the string safe by replacing some characters that would be problematic. When the script runs successfully, it creates a .html file, and an extra .url one with the same name that just contains the original URL.
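To get a feel for what the sanitising step does to real-world titles, here's a small sketch that wraps the same two substitutions in a helper. The safe_title method name and the sample titles are mine, purely for illustration:

```ruby
# Hypothetical helper wrapping the same substitutions the script uses.
def safe_title(title)
  # Drop everything up to the last slash or backslash, so a title can't
  # smuggle in a directory path.
  base = title.gsub(/^.*(\\|\/)/, "")
  # Replace anything outside a conservative whitelist with an underscore,
  # then keep at most the first 256 characters.
  base.gsub(/[^0-9A-Za-z.\-]/, "_")[0..255]
end

puts safe_title("Hello, World!")     # prints "Hello__World_"
puts safe_title("../../etc/passwd")  # prints "passwd"
puts safe_title("Docs / FAQ")        # prints "_FAQ"
```

Note that [0..255] keeps up to 256 characters, and the script then appends .html or .url. Many filesystems cap file names at 255 bytes, so a smaller limit might be the safer choice if you tend to archive pages with very long titles.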

Running webhook is super simple:

./webhook -port SOME_PORT_NUMBER -hooks hooks.yaml -verbose -header 'Access-Control-Allow-Origin=*'

Step 2

Now we just need to install the Send Tab URL Firefox addon, and configure it in the addon preferences page:

Send Tab URL addon settings

Give the server any name, and set the URL to your server’s hook address with ?url={URL}&title={TITLE}&secret=<your random string from before> as the query string. {URL} and {TITLE} are special placeholders that the addon replaces with the current tab’s address and title.

Step 3

Let’s try it out!

Seems to work! This generated an .html and a .url file on my NAS that I can read later, without fear of 404s.
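Once a few pages have piled up, it's handy to check that every capture actually produced something. Since the .html file is created by redirecting the command's output, a failed run can leave an empty file behind. Here's a rough sketch of a listing script, assuming the same directory layout as above. The archived_pages helper is hypothetical, not part of any of the tools mentioned:

```ruby
# Hypothetical helper: pairs each .url file with its sibling .html file
# and flags captures whose HTML is missing or empty.
def archived_pages(dir)
  Dir.glob(File.join(dir, "*.url")).sort.map do |url_file|
    html_file = url_file.sub(/\.url\z/, ".html")
    {
      url: File.read(url_file).strip,
      html: html_file,
      # File.size? returns nil for files that are missing or zero bytes long.
      ok: File.size?(html_file) ? true : false
    }
  end
end

if __FILE__ == $PROGRAM_NAME
  # The default directory is the placeholder path used by the archive script.
  archived_pages(ARGV[0] || "/path/to/singlefile").each do |page|
    status = page[:ok] ? "ok" : "MISSING"
    puts "#{status}\t#{page[:url]}\t#{page[:html]}"
  end
end
```

Run it with the archive directory as the first argument, and re-send any tab that shows up as MISSING.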

Parting words

Using open-source components like this to achieve a result is really cool! Also: remember to donate to the Internet Archive some time.

Until next time! 🖖

 

Feel free to reply with comments or feedback to this toot