Word Counting My Whole Site
My site is static HTML, built with Jekyll (more details in my colophon). This means I have a folder that contains the whole site in HTML files.
I wanted to find the total word count. I found this combination of commands works great:
find . -iname "*.html" | parallel pandoc -t plain | wc -w
It uses:
- UNIX find to list the relative paths of all the HTML files locally.
- GNU parallel to run a command on each file in parallel.
- pandoc document converter to convert the input HTML to plain text.
- UNIX wc to calculate the total word count.
It took about 2 seconds on my computer to tell me my site currently has about 75,000 words. That's more than I expected, though the total counts text that repeats on every page, such as footers, many times over.
Thanks to pandoc’s universality, you can also use this to count words in many file formats: markdown, reStructuredText, MS Word, etc.
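As a rough sketch, the same pipeline can be wrapped in a small shell function that takes the filename pattern as a parameter. The name count_words is made up for illustration and is not from any tool. This version converts files one at a time instead of using GNU parallel, so it is slower but has one fewer dependency. It assumes pandoc is on your PATH and that no filename contains a newline.

```shell
# count_words PATTERN [DIR]: total words in files under DIR (default: .)
# whose names match PATTERN, case-insensitively.
# Hypothetical helper for illustration; assumes pandoc is installed and
# that no filename contains a newline.
count_words() {
  find "${2:-.}" -type f -iname "$1" |
    while IFS= read -r file; do
      # With no explicit input format, pandoc guesses it from the file
      # extension. stdin is redirected so nothing consumes the path list.
      pandoc -t plain "$file" </dev/null
    done |
    wc -w
}
```

For example, count_words "*.md" would total the Markdown sources in the current folder, and count_words "*.docx" ~/Documents would do the same for Word files elsewhere.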
If your site is more dynamic, but still small enough to download, you might consider using GNU wget. Its --recursive flag will let you download every page as HTML locally, following links to find everything on the website.
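Here is one way that download step might look, as a sketch rather than a tested recipe. The function name mirror_site, the default folder name site, and the example.com URL are all placeholders. The flags beyond --recursive are my own additions: --no-parent keeps wget from climbing above the starting path, --adjust-extension saves HTML pages with a .html suffix so the find pattern above can match them, and --directory-prefix picks the download folder.

```shell
# mirror_site URL [DIR]: download URL and the pages it links to into DIR
# (default: ./site). Hypothetical helper for illustration.
mirror_site() {
  # --recursive          follow links to fetch further pages
  # --no-parent          never ascend above the starting path
  # --adjust-extension   save HTML pages with a .html suffix
  # --directory-prefix   folder to save everything under
  wget --recursive --no-parent --adjust-extension \
    --directory-prefix="${2:-site}" "$1"
}
```

After running something like mirror_site https://example.com/, you could cd into the site folder and run the find, parallel, pandoc, and wc pipeline there. Note that wget only follows links a limited number of levels deep by default, so a deeply nested site may need its --level option raised.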
Tags: commandline, jekyll