Content hashing static assets to break caches with md5sum and bash
Info
Summary | Join me as I (over-)engineer implementing cache-busting for static assets on a simple website! |
---|---|
Shared | 2024-01-30 |
Revised | 2024-01-31 @ 01:00 UTC |
The Problem
When I make something for the web, I have plenty of âstatic assetsâ: CSS files, JS files, and images. In my HTML files, I reference their sources as you might expect:
<link rel="stylesheet" href="./styles.css" />
<script async src="./scripts.js"></script>
<img src="./cool-image.avif" />
But when a browser downloads these, it will note the source URLs and cache the static assets, typically for a time set by the server via headers like Expires: <some date>
or Cache-Control: public, max-age=15552000
(6 months). This is exactly what we want the browser to do, but what happens if I change the contents of the file?
Nothing! The last-cached content is served!
The browser caching is doing its job to help the user by not re-downloading the same content, but if we have updated static asset content, we need to give the browser a way to know that the content is different. Then it should go download and use that updated asset.
One option, if Iâm using a CDN, is to manually purge the assets. There are also some additional headers, like ETag
and Last-Modified
, that can help hint to a browser that it can keep its cache or not for an asset, but if you donât want to mess with request headers and/or want to guarantee the browser gets the latest version of an asset, you can provide a different file name for each version of an asset. Since the URL to the resource is different, the browser should always go and try to download it.
Yes, frameworks like Ruby on Rails and Phoenix will automatically do this sort of thing for you, but weâre exploring here and trying to keep things simple! (That will remain to be seen⊠đ )
Letâs talk about how we can DIY (do it yourself).
What our assets should look like when weâre done
What we want to do is take a file like styles.css
, get an MD5 fingerprint (checksum/digest/hash) based on the fileâs contents, and output something like styles.78f7f2c2d416e59525938565dd6dd565.css
. This way, if anything in our file changes, weâll get a new hash and therefore a new file name.
Given we have these files:
index.html
cool-image.avif
scripts.js
styles.css
and our index.html
file contains this:
<link rel="stylesheet" href="./styles.css" />
<script async src="./scripts.js"></script>
<img alt="" src="./cool-image.avif" />
then we should create a dist/
directory with files that resemble these:
index.html
cool-image.dadb0e162005e9b241a13ca5f871e250.avif
scripts.9efef7ad3d06e7703c7563dbc1ed78a9.js
styles.78f7f2c2d416e59525938565dd6dd565.css
and our index.html
file should have its assetsâ paths updated to resemble these:
<link rel="stylesheet" href="./styles.78f7f2c2d416e59525938565dd6dd565.css" />
<script async src="./scripts.9efef7ad3d06e7703c7563dbc1ed78a9.js"></script>
<img src="./cool-image.dadb0e162005e9b241a13ca5f871e250.avif" />
Using md5sum to get file content hashes
If you havenât used md5sum
before, go ahead and run man md5sum
in your terminal. There are some neat things you can use this for, like storing a list of file checksums in a file, then detecting which files changed, having your build system make decisions based on that, and avoiding costly project rebuilds by only rebuilding files or directories and their dependencies that changed. But we only need the top-level, most basic thing from md5sum
: computing an MD5 message digest.
Letâs say this is what our project folder looks like:
λ tree -a -L 1
.
âââ .git
âââ .gitignore
âââ cool-image.avif
âââ dist
âââ index.html
âââ scripts.js
âââ styles.css
If I want to get an MD5 content hash for styles.css
, I pass the filename to md5sm
:
λ md5sum styles.css
e6dd05b39c5fb97218130638c0a374de styles.css
Sweet! If we want to query by a bunch of different file extensions, md5sum
can handle that:
λ md5sum *.{avif,css,js}
dadb0e162005e9b241a13ca5f871e250 cool-image.avif
e6dd05b39c5fb97218130638c0a374de styles.css
78f7f2c2d416e59525938565dd6dd565 bingo.js
But if we want md5sum
to ignore certain directories, find a bunch of different file types, and maybe do so a bit more efficiently, we can lean on the find
tool. Run man find
if youâre unfamiliar with it or canât remember its syntax!
Letâs run it with some options and then break down what we did:
λ find . \
-type f \
! -path "./.git/*" \
! -path "./dist/*" \
\( -iname "*.css" -o \
-iname "*.js" -o \
-iname "*.avif" -o \
-iname "*.bmp" -o \
-iname "*.gif" -o \
-iname "*.heif" -o \
-iname "*.jpeg" -o \
-iname "*.jpg" -o \
-iname "*.png" -o \
-iname "*.svg" -o \
-iname "*.webp" \
\) \
-exec md5sum '{}' +
e6dd05b39c5fb97218130638c0a374de ./styles.css
dadb0e162005e9b241a13ca5f871e250 ./cool-image.avif
78f7f2c2d416e59525938565dd6dd565 ./bingo.js
Above, we told the find command to find all files in this directory, excluding the .git/
and dist/
directories, where the file extension ends in one of a handful of extensions of likely static assets, and then we tell it to execute md5sum
on each one. At the bottom, we see the results!
Next, we want to take that MD5 hash on the left and output a new file where the filename has the hash just before the extension. For that, weâre going to want to start putting this into a build
script.
Starting our build script
In your terminal, run the following commands to create a build file with some scaffolding, then change it to an executable file (donât copy the λ):
λ cat <<EOF > ./build
#!/usr/bin/env bash
set -o errexit
set -o errtrace
set -o nounset
set -eou pipefail
function main {
}
main
EOF
λ chmod +x ./build
Once youâve done that open the file, and letâs add our find function in there:
# ...
BUILD_DIR="./dist"
function get_asset_md5sums {
find . \
-type f \
! -path "./.git/*" \
! -path "${BUILD_DIR}/*" \
\( -iname "*.css" -o \
-iname "*.js" -o \
-iname "*.avif" -o \
-iname "*.bmp" -o \
-iname "*.gif" -o \
-iname "*.heif" -o \
-iname "*.jpeg" -o \
-iname "*.jpg" -o \
-iname "*.png" -o \
-iname "*.svg" -o \
-iname "*.webp" \
\) \
-exec md5sum '{}' +
}
function main {
get_asset_md5sums
}
If you then run that file via ./build
, youâll get back the same results as before.
Outputting hashed static asset file names
Update your main
function with the following. Weâll use code comments to explain most of this part:
function main {
# Recreate build dir
rm -rf "${BUILD_DIR}" && mkdir -p "${BUILD_DIR}"
# Create a bash array for holding # "file=file_with_sum"
# pairs for use later. Yes, I know bash 4 has associative arrays.
# E.g.: "styles.css=styles.78f7f2c2d416e59525938565dd6dd565.css"
assets_array=()
# Get all asset MD5 checksums, put them into an assets
# array for later use, and write each file to a new
# file with the checksum in the name.
while read -r sum file; do
file_name="${file%.*}" # Extract the file's name
file_ext="${file##*.}" # Extract the file's extension
file_with_sum="${file_name}.${sum}.${file_ext}" # Hashed file name
# Append to the assets array
assets_array+=( "${file}=${file_with_sum}" )
# Write the file's contents to the build directory
# at the new, hashed file name.
cat "${file}" > "${BUILD_DIR}/${file_with_sum}"
done < <(get_asset_md5sums)
}
If youâre wondering about the <(get_asset_md5sums)
part, it uses process substitution to let us have access to the assets_array
variable, which we wouldnât have access to if we piped get_asset_md5sums
to while read ...
, for the while
loop would be ran in a subshell environment. Instead, with process substitution, the result of that function is stored in a named pipe/special temporary file (in /dev/fd/
on my system), the file name is passed, and then its contents are read and attached to the standard input by the <
input file descriptor. To sum this aside up, if we did get_asset_md5sums | while read...
, weâd get an assets_array[@]: unbound variable
error, so weâre using process substitution to get around that.
If you run ./build
again, you wonât see any terminal output, but you will see a shiny new ./dist
folder with your files in it!
The next part is a little more involved, for we need to create new HTML files that have updated values for asset source locations.
Including the hashed file names in our HTML file(s)
Before we can copy, update, and output our HTML files (of which we only have one in this example), we first need a way to find them! Add this below your get_asset_md5sums
function:
function get_html_files {
find . \
-type f \
! -path "./.git/*" \
! -path "${BUILD_DIR}/*" \
-iname "*.html"
}
Next, at the bottom of your main
function, add this code, and weâll use comments to try to explain each piece in context:
# For each HTML file...
while read -r file; do
# For each line in the current HTML file...
while IFS='' read -r line; do
line_updated="${line}"
# For each "file=file_with_sum" pairing...
for val in "${assets_array[@]}"; do
file_name_original=$(echo "${val}" | cut -d "=" -f 1)
file_name_summed=$(echo "${val}" | cut -d "=" -f 2)
# If the current line has the original file name...
if [[ "${line}" =~ ${file_name_original} ]]; then
# ...then replace that file name with the hashed one
line_updated=$(echo "${line}" | sed -E 's@'"${file_name_original}"'@'"${file_name_summed}"'@g')
break
fi
done
# Print the line
echo "${line_updated}"
# Pass the file in, then once done, redirect the
# printed file lines to a new file in our build dir
done < "${file}" > "${BUILD_DIR}/${file}"
# Pass the HTML files in via process substitution
done < <(get_html_files)
Once this code runs, your HTML files should be copied over to your dist/
directory, but the lines referencing your static assets should all be updated!
This runs fast enough for my purposes, but if you have any performance tips or explanation corrections, please email me.
Building with GitHub Actions and deploying to GitHub Pages
Once youâve got this building, you could build it locally and push the dist/
folder up to your source control, but I want ./build
to run automatically, and since I primarily use GitHub, I just want to deploy the dist/
directory to GitHub Pages.
To make this a reality, we can use Github Actions to deploy to GitHub Pages. Create a .github/workflows/main.yml
(the YAML file name can be whatever you like) and add the following:
name: CI
on:
pull_request:
push:
jobs:
build:
runs-on: ubuntu-latest
permissions:
contents: write
steps:
- name: Checkout repo under GH workspace
uses: actions/checkout@v4
- name: Run build script
run: ./build
- name: Deploy to gh-pages
uses: peaceiris/actions-gh-pages@v3
if: ${{ github.ref == 'refs/heads/main' }}
with:
github_token: ${{ secrets.GITHUB_TOKEN }}
publish_dir: ./dist
After youâve committed and merged everything into your main
branch, go to your projectâs âSettingsâ page, then click on âPagesâ on the left, and set your âBranchâ to point to gh-pages
and / (root)
. The page should tell you that your site is live and give you a link and button to visit the site.
Example project
Here is a silly project where everything in this blog post was implemented: https://github.com/rpearce/gom-jabbar-bingo.
Wrapping up
So⊠did we over-engineer our website? Maybe? But weâre also avoiding static asset caching issues by automating away a guaranteed way of cache-busting our static assets, so thatâs something!
If youâd like to see more bash content or something else entirely, send me an email!
Thanks for reading!
â Robert