Web analytics from CloudFront access logs with GoAccess

First, turn on logging in your CloudFront distribution settings to get logs written to an S3 bucket. Then, use the AWS command line interface[1] to sync the logs to a local directory:[2]

aws s3 sync s3://jeffkreeftmeijer.com-log-cf logs
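
To check that logs are actually being written before syncing everything, list a few of the objects in the bucket first:

aws s3 ls s3://jeffkreeftmeijer.com-log-cf | head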

With the logs downloaded, pipe them through GoAccess to produce a 28-day HTML[3] report:

find logs -name "*.gz" | \
    xargs gzcat | \
    grep --invert-match --file=exclude.txt | \
    /usr/local/bin/goaccess \
        --log-format "%d\\t%t\\t%^\\t%b\\t%h\\t%m\\t%^\\t%r\\t%s\\t%R\\t%u\\t%^" \
        --date-format CLOUDFRONT \
        --time-format CLOUDFRONT \
        --ignore-crawlers \
        --ignore-status=301 \
        --ignore-status=302 \
        --keep-last=28 \
        --output index.html

This pipeline chains four commands together:

find logs -name "*.gz"

Produce a list of all gzipped log files in the logs directory. Because of the number of files, passing a directory glob to gunzip directly would fail with an “argument list too long” error, as the expanded list of filenames exceeds the system’s ARG_MAX limit:

gunzip logs/*.gz
zsh: argument list too long: gunzip
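
To see the limit on the current system, query it with getconf:

getconf ARG_MAX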

xargs gzcat

xargs takes the output from find, a stream of newline-delimited filenames, and appends those filenames as arguments to the passed command, batching them so each invocation stays under the ARG_MAX limit. Essentially, this runs gzcat on every file in the logs directory.

gzcat is an alias for gzip --decompress --stdout, which decompresses gzipped files and prints the output to the standard output stream.
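
To inspect the command xargs builds without decompressing anything, substitute echo for gzcat:

find logs -name "*.gz" | xargs echo gzcat

This prints the gzcat invocation (or invocations, if the filenames don’t fit into a single one) that xargs would run.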

grep --invert-match --file=exclude.txt

grep filters the stream, dropping every log line that matches a pattern in the exclude file (exclude.txt). The exclude file lists patterns that are ignored when producing the report[4].
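
To preview the filter, count how many log lines match the exclude patterns and would therefore be dropped (this reads the full log set, so it can take a while):

find logs -name "*.gz" | xargs gzcat | grep --count --file=exclude.txt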

goaccess …

The decompressed, filtered log lines are piped to goaccess, which generates the report with the following options:

--log-format "%d\\t%t\\t%^\\t%b\\t%h\\t%m\\t%^\\t%r\\t%s\\t%R\\t%u\\t%^"

Match CloudFront’s tab-separated log format: %d and %t are the date and time, %b the response size, %h the client IP, %m the request method, %r the requested resource, %s the status code, %R the referrer, %u the user agent, and %^ skips a field.[5]

--date-format CLOUDFRONT --time-format CLOUDFRONT

Use CloudFront’s date and time formats to parse the log lines.

--ignore-crawlers --ignore-status=301 --ignore-status=302

Ignore crawlers and redirects.

--keep-last=28

Use the last 28 days to build the report.

--output index.html

Output an HTML report to a file named index.html.

To sync the logs and generate a fresh report every night at midnight, wrap the two commands in a pair of scripts (sync.sh and html.sh, sketched below) and schedule them in a cron job:

echo '0 0 * * * ~/stats/sync.sh && ~/stats/html.sh' | crontab
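
Note that piping to crontab replaces the entire crontab, so any existing entries need to be included in the echoed string. A minimal sketch of the two scripts, assuming the paths and bucket name used above:

#!/bin/sh
# sync.sh: download new CloudFront logs from the S3 bucket
aws s3 sync s3://jeffkreeftmeijer.com-log-cf ~/stats/logs

#!/bin/sh
# html.sh: regenerate the 28-day HTML report from the synced logs
cd ~/stats
find logs -name "*.gz" | \
    xargs gzcat | \
    grep --invert-match --file=exclude.txt | \
    /usr/local/bin/goaccess \
        --log-format "%d\\t%t\\t%^\\t%b\\t%h\\t%m\\t%^\\t%r\\t%s\\t%R\\t%u\\t%^" \
        --date-format CLOUDFRONT \
        --time-format CLOUDFRONT \
        --ignore-crawlers \
        --ignore-status=301 \
        --ignore-status=302 \
        --keep-last=28 \
        --output index.html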

  1. On a Mac, install it with Homebrew:

    brew install awscli
    
  2. Running the aws s3 sync command on an empty local directory took me two hours and produced a 2.1 GB directory of .gz files for roughly 3 years of logs. Updating the logs by running the same command takes about five minutes.

    Since I’m only interested in the stats for the last 28 days, it would make sense to only download the last 28 days of logs to generate the report. However, AWS’s command line tool doesn’t support filters like that.

    One thing that does work is using both the --exclude and --include options to include only the logs for the current month:

    aws s3 sync --exclude "*" --include "*2021-07-*" s3://jeffkreeftmeijer.com-log-cf ~/stats/logs
    

    While this still loops over all files, it won’t download anything outside of the selected month.

    The command accepts the --include option multiple times, so it’s possible to select multiple months this way. One could, theoretically, write a script that determines the current year and month, then downloads the logs for that month and the month before it to produce a 28-day report, as sketched below.
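
    A sketch of such a script, assuming BSD date as shipped with macOS (GNU date would use --date="1 month ago" instead of -v-1m):

    #!/bin/sh
    # Download only the logs for the current and the previous month.
    this_month=$(date +%Y-%m)
    last_month=$(date -v-1m +%Y-%m)

    aws s3 sync --exclude "*" \
        --include "*$this_month-*" \
        --include "*$last_month-*" \
        s3://jeffkreeftmeijer.com-log-cf ~/stats/logs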

  3. GoAccess generates JSON and CSV reports when the output filename has a .json or .csv extension, respectively. To generate the same 28-day report in CSV format:

    find logs -name "*.gz" | \
        xargs gzcat | \
        grep --invert-match --file=exclude.txt | \
        /usr/local/bin/goaccess \
            --log-format "%d\\t%t\\t%^\\t%b\\t%h\\t%m\\t%^\\t%r\\t%s\\t%R\\t%u\\t%^" \
            --date-format CLOUDFRONT \
            --time-format CLOUDFRONT \
            --ignore-crawlers \
            --ignore-status=301 \
            --ignore-status=302 \
            --keep-last=28 \
            --output stats.csv
    
  4. My exclude.txt currently consists of the HEAD HTTP request type and the path to the feed file:

    HEAD
    feed.xml
    
  5. Initially, the value for the --log-format option was CLOUDFRONT, which points to a predefined log format. However, that predefined format changed in a GoAccess update, which broke the script and produced errors like this:

    ==77953== FILE: -
    ==77953== Parsed 10 lines producing the following errors:
    ==77953==
    ==77953== Token for '%H' specifier is NULL.
    ==77953== Token for '%H' specifier is NULL.
    ==77953== Token for '%H' specifier is NULL.
    ==77953== Token for '%H' specifier is NULL.
    ==77953== Token for '%H' specifier is NULL.
    ==77953== Token for '%H' specifier is NULL.
    ==77953== Token for '%H' specifier is NULL.
    ==77953== Token for '%H' specifier is NULL.
    ==77953== Token for '%H' specifier is NULL.
    ==77953== Token for '%H' specifier is NULL.
    ==77953==
    ==77953== Format Errors - Verify your log/date/time format
    

    I haven’t been able to find out what the problem is, so I’ve reverted to the old log format (the explicit string above) for the time being. I suspect the updated predefined format in GoAccess no longer matches the old log lines from 2017.

    An example of a log line that doesn’t match the new log format:

    2017-09-17	09:16:00	CDG50	573	132.166.177.54	GET	d2xkchmcg9g2pt.cloudfront.net	/favicon.ico	301	-	Mozilla/5.0%2520(X11;%2520Linux%2520x86_64)%2520KHTML/5.37.0%2520(like%2520Gecko)%2520Konqueror/5%2520KIO/5.37	-	-	Redirect	QgyNFmDkLiZ23dKCu9ozmQFWrGY407bHn9VRlWzhp9KjyCe3b0b4WQ==	jeffkreeftmeijer.com	http	425	0.000	-	-	-	Redirect	HTTP/1.1
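
    To test whether a format parses a sample of the logs without building a full report, a few lines can be piped through goaccess directly (the errors above came from a run like this; the output filename is arbitrary):

    find logs -name "*.gz" | \
        xargs gzcat | \
        head -10 | \
        /usr/local/bin/goaccess \
            --log-format CLOUDFRONT \
            --date-format CLOUDFRONT \
            --time-format CLOUDFRONT \
            --output /tmp/test.html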
    