How to download a documentation website with Wget

Time to grab a whole document tree.

This post is an adapted extract from my book Boost Your Django DX, available now.

Sometimes you want to download a whole website so you have a local copy that you can browse offline. When programming, this is often useful for documentation sites that do not provide downloadable versions and are not available in offline tools like DevDocs.

One tool for making such copies is Wget (from “web get”). With the right flags, Wget can download a whole website and convert it for offline browsing.

Install Wget

Wget is widely available from platform package managers.

On macOS, you can use Homebrew:

$ brew install wget

On Windows, you can use Chocolatey:

> choco install wget

On Linux, most distributions have Wget pre-installed. If not, it’s normally installable from a wget package.

How to download a website

You can invoke this single big Wget command to download a site, replacing <website> with the URL of the site:

$ wget --mirror --convert-links --adjust-extension --page-requisites --no-parent <website>

The URL may be either the full domain such as https://www.example.com, or have a path prefix such as https://www.example.com/tutorial/en/. (We’ll take apart all those flags in a few sections.)

Downloading a website can take a little while, even on a fast connection. This is because Wget downloads pages one at a time, in order to discover links as it goes.
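Wget's crawl is essentially a breadth-first traversal: fetch a page, extract its links, and queue the ones on the same host for fetching next. Here's a rough Python sketch of the link-discovery step, not Wget's actual implementation (same_host_links is an illustrative helper):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect href/src attributes from a page, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(urljoin(self.base_url, value))


def same_host_links(html, base_url):
    """Return the links a mirroring crawl would follow: those on the same host."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    host = urlparse(base_url).netloc
    return [u for u in parser.links if urlparse(u).netloc == host]
```

Wget repeats this fetch-and-extract cycle for every queued page, which is why it can't know up front how many pages a site has, and why large sites take a while.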

Wget stores the downloaded pages in a directory named after the website’s domain name, such as www.example.com. After Wget has completed, you can open pages from there in your web browser, and navigate as usual.

Example: the Django REST Framework documentation

The DRF documentation is available on DevDocs, but it can be out of date. And unfortunately, the DRF site doesn’t provide downloads.

You can use the above Wget command to download the Django REST Framework documentation like so:

$ wget --mirror --convert-links --adjust-extension --page-requisites --no-parent https://www.django-rest-framework.org/

Wget prints a lot of output, starting:

--2021-10-27 10:56:12--  https://www.django-rest-framework.org/
Resolving www.django-rest-framework.org (www.django-rest-framework.org)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to www.django-rest-framework.org (www.django-rest-framework.org)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 30663 (30K) [text/html]
Saving to: ‘www.django-rest-framework.org/index.html’

www.django-rest-fr 100%[==============>]  29.94K  --.-KB/s    in 0.002s

2021-10-27 10:56:12 (13.2 MB/s) - ‘www.django-rest-framework.org/index.html’ saved [30663/30663]

Loading robots.txt; please ignore errors.
--2021-10-27 10:56:12--  https://www.django-rest-framework.org/robots.txt
Reusing existing connection to www.django-rest-framework.org:443.
HTTP request sent, awaiting response... 404 Not Found
2021-10-27 10:56:12 ERROR 404: Not Found.
...

…after downloading every file, Wget finishes by converting links:

...
Converting links in www.django-rest-framework.org/community/3.9-announcement/index.html... 109.
109-0
Converting links in www.django-rest-framework.org/css/prettify.css... nothing to do.
Converting links in www.django-rest-framework.org/css/bootstrap.css... 2.
2-0
Converting links in www.django-rest-framework.org/css/default.css... 1.
1-0
Converting links in www.django-rest-framework.org/css/bootstrap-responsive.css... nothing to do.
Converted links in 73 files in 0.3 seconds.

…and it’s done.

Once Wget has finished, you can check the downloaded files:

$ ls www.django-rest-framework.org
api-guide  css        index.html search     tutorial
community  img        js         topics

Things seem in place. To read the offline copy, you can open index.html in the browser, and browse away as usual.

Read offline documentation with Python’s web server

Some websites do not work when opened as a .html file in the web browser. This is because they use web features that browsers block on file:// URLs, for security. To make such offline copies work, you need to open them over http:// URLs, via a local web server, and luckily there’s one built into Python.

For example, take the Django Girls Tutorial at https://tutorial.djangogirls.org/en/. After downloading the site with Wget, you can open its pages in the browser, but navigation doesn’t work. If you open the browser’s developer console, you’ll see errors from clicking links, such as:

Security Error: Content at file:///.../tutorial.djangogirls.org/en/index.html may not load data from file:///.../tutorial.djangogirls.org/en/intro_to_command_line/index.html.

Uncaught DOMException: The operation is insecure.

These messages are the browser reporting that it is blocking the website’s use of JavaScript for navigation.

You can fix these errors by loading the site through Python’s built-in web server. (This server is only suitable for local development, like Django’s runserver.)

To do so, navigate to the site folder:

$ cd tutorial.djangogirls.org
$ ls
en gitbook

…then, start the web server:

$ python -m http.server 8001
Serving HTTP on :: port 8001 (http://[::]:8001/) ...

Note this command explicitly uses port 8001, to avoid colliding with Django’s runserver, which you probably have running. Both http.server and runserver default to port 8000.

With the server running, open http://localhost:8001 in the browser, and you’ll find the documentation loads with working navigation. Huzzah!
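If you’d rather start the server from a script than the command line, the same built-in http.server module can be used programmatically. A minimal sketch, assuming the site was downloaded to tutorial.djangogirls.org (serve_directory is an illustrative helper, not part of the standard library):

```python
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def serve_directory(directory, port=8001):
    """Build an HTTP server serving `directory` over http://, the
    programmatic equivalent of `python -m http.server <port>`."""
    handler = partial(SimpleHTTPRequestHandler, directory=directory)
    return ThreadingHTTPServer(("", port), handler)


# To serve the downloaded site:
# server = serve_directory("tutorial.djangogirls.org")
# server.serve_forever()  # then open http://localhost:8001/
```

Relatedly, on Python 3.7+ the command-line version accepts a --directory option, so you can run python -m http.server 8001 --directory tutorial.djangogirls.org without cd-ing into the folder first.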

An explanation of all the flags

Wget has very many options. Here’s a brief explanation of the flags we’re using:

--mirror: turn on recursive downloading with infinite depth and timestamping, so the whole site is fetched (shorthand for -r -N -l inf --no-remove-listing).

--convert-links: after downloading, rewrite links in the pages to point at the local copies, so they work offline.

--adjust-extension: save files with matching extensions, such as .html for HTML pages, so they open correctly from disk.

--page-requisites: also download the assets each page needs to display properly, such as images and stylesheets.

--no-parent: never ascend above the starting directory, so a URL with a path prefix downloads only that section of the site.

Another flag that you may find useful is --wait <n>, which adds a delay of <n> seconds between requests. This lightens the load both for others on your internet connection and for the web server you’re downloading from.
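You can picture the effect of --wait as a small pacing step between requests. A hedged Python sketch of the idea (RequestPacer is illustrative, not how Wget is implemented):

```python
import time


class RequestPacer:
    """Enforce a minimum gap between consecutive requests, like `--wait <n>`."""

    def __init__(self, delay):
        self.delay = delay
        self._last = None

    def wait(self):
        """Sleep just long enough that requests are at least `delay` seconds apart."""
        now = time.monotonic()
        if self._last is not None:
            remaining = self._last + self.delay - now
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()
```

A crawler would call pacer.wait() before each download; the first call returns immediately, and later calls sleep only for whatever part of the delay hasn’t already elapsed.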

For more info see the Wget documentation.

Fin

Enjoy your time offline,

—Adam

