One of the key principles of a well-designed web infrastructure is high availability: whatever may happen server-side, users should experience minimal to no service disruption.

In order to achieve this goal, most online services rely on a technique named load balancing: it allows scaling by distributing traffic among multiple servers, while also ensuring that the servers we’re dispatching requests to are actually able to process them.

This latter function is named health checking, and it is now available for use with Elastic IP addresses. In this article we will showcase how you can leverage managed Elastic IP health checking to build a highly available web infrastructure on Exoscale.

Managed Elastic IP

Architecture Overview

Let’s consider the following web infrastructure: we’re running a WordPress site on 2 independent compute instances, relying on a pair of database servers (one primary, and one secondary replicating the primary).

  • db1 (private network interface: 10.0.0.1/24)
  • db2 (private network interface: 10.0.0.2/24)
  • web1 (private network interface: 10.0.0.3/24)
  • web2 (private network interface: 10.0.0.4/24)

In case of a database failure on the primary server, a failover to the secondary can be performed by switching the database server address in the WordPress configuration to the secondary server’s address, as sketched below.
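
For illustration, such a failover could boil down to a one-line change in WordPress’ configuration – the file path and addresses below are assumptions based on the layout above, adjust them to your installation:

# /var/www/html/wp-config.php (excerpt)
define('DB_HOST', '10.0.0.1'); // primary (db1) – switch to '10.0.0.2' to fail over to db2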

Demo Infrastructure Deployment

Note

In order to keep this article focused on the topic of load balancing, we will not detail the actual database server configuration/replication setup nor the web server setup/WordPress installation: you can find plenty of up-to-date resources and tutorials covering those topics online.

We start by creating a Private Network through which all servers communicate with each other:

$ exo privnet create data --zone ch-gva-2
┼──────┼─────────────┼──────────────────────────────────────┼──────┼
│ NAME │ DESCRIPTION │                  ID                  │ DHCP │
┼──────┼─────────────┼──────────────────────────────────────┼──────┼
│ data │             │ ffca3300-319a-41a5-ba6a-0a6cc55b140f │ n/a  │
┼──────┼─────────────┼──────────────────────────────────────┼──────┼
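
Note that this Private Network comes without managed DHCP (the DHCP column above shows n/a), so each instance will need a static address on its private interface. A minimal netplan sketch for db1 – assuming Ubuntu instances and that the private NIC appears as eth1, both of which may differ on your setup – could look like this:

# /etc/netplan/51-privnet.yaml (on db1 – use 10.0.0.2, .3 and .4 on the other instances)
network:
  version: 2
  ethernets:
    eth1:
      addresses:
        - 10.0.0.1/24

$ sudo netplan apply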

The next step is to create an Anti-Affinity Group (AAG) where we’ll place our database server instances. This is optional, but it is a highly recommended best practice in cloud-based deployments: instances in the same AAG are explicitly dispatched among different hypervisors, lowering the probability of losing several redundant instances at once in case of a hypervisor failure – thus improving overall reliability.

$ exo affinitygroup create aa-db
┼───────┼─────────────┼──────────────────────────────────────┼
│ NAME  │ DESCRIPTION │                  ID                  │
┼───────┼─────────────┼──────────────────────────────────────┼
│ aa-db │             │ f7997f5b-7896-4753-b450-b9649f31b949 │
┼───────┼─────────────┼──────────────────────────────────────┼

We then proceed to create our database server instances in the Private Network data and Anti-Affinity Group aa-db:

$ exo vm create db1 --zone ch-gva-2 --anti-affinity-group aa-db --privnet data --security-group default
$ exo vm create db2 --zone ch-gva-2 --anti-affinity-group aa-db --privnet data --security-group default

Now that the database tier is up and running, we can focus on the web tier – that is, our web server instances serving the WordPress blog. Similar to what we did for the db* instances, we start by creating a new Anti-Affinity Group for the web* instances:

$ exo affinitygroup create aa-web
┼────────┼─────────────┼──────────────────────────────────────┼
│  NAME  │ DESCRIPTION │                  ID                  │
┼────────┼─────────────┼──────────────────────────────────────┼
│ aa-web │             │ f294a902-004c-448f-920e-43d6adfa9140 │
┼────────┼─────────────┼──────────────────────────────────────┼

Since we want to expose our web servers on the Internet, we create a new firewall Security Group (SG) to allow ingress traffic from any source to port 80 (HTTP):

$ exo firewall create web
┼──────┼─────────────┼──────────────────────────────────────┼
│ NAME │ DESCRIPTION │                  ID                  │
┼──────┼─────────────┼──────────────────────────────────────┼
│ web  │             │ 91c8cd04-3f2d-4aa4-ae69-eb9dac527e81 │
┼──────┼─────────────┼──────────────────────────────────────┼

$ exo firewall add web --protocol tcp --port 80-80
┼─────────┼────────────────┼──────────┼───────────┼─────────────┼──────────────────────────────────────┼
│  TYPE   │     SOURCE     │ PROTOCOL │   PORT    │ DESCRIPTION │                  ID                  │
┼─────────┼────────────────┼──────────┼───────────┼─────────────┼──────────────────────────────────────┼
│ INGRESS │ CIDR 0.0.0.0/0 │ tcp      │ 80 (http) │             │ b2d9d919-f46c-4083-811e-858b3fca85b7 │
┼─────────┼────────────────┼──────────┼───────────┼─────────────┼──────────────────────────────────────┼

We then create our 2 web server instances with the AAG and SG we created earlier, also including them in the data Private Network so they can reach the database servers securely:

$ exo vm create web1 --zone ch-gva-2 --anti-affinity-group aa-web --privnet data --security-group web
$ exo vm create web2 --zone ch-gva-2 --anti-affinity-group aa-web --privnet data --security-group web

As mentioned before, we’ll skip the WordPress installation and configuration, as well as the Nginx HTTP server setup. However, to help us visualise the practical effects of the load balancing, we’ll tweak the Nginx configuration a little so that each web server returns its hostname in an HTTP response header:

# /etc/nginx/conf.d/add_header.conf
add_header "X-Server" "$hostname";

Note: Do not do this on your production servers, as it may expose internal infrastructure information that can be used by attackers.

We can now create our managed Elastic IP, configuring an HTTP health check targeting the / path on port 80:

$ exo eip create ch-gva-2 --healthcheck-mode http --healthcheck-path / --healthcheck-port 80
┼──────────┼─────────────────┼──────────────────────────────────────┼
│   ZONE   │       IP        │                  ID                  │
┼──────────┼─────────────────┼──────────────────────────────────────┼
│ ch-gva-2 │ 159.100.241.202 │ c595c5b6-8211-4497-9154-050c696eae1c │
┼──────────┼─────────────────┼──────────────────────────────────────┼

The final step is to associate our managed Elastic IP with our web servers, effectively starting the health checking process:

$ exo eip associate 159.100.241.202 web1 web2
associate "159.100.241.202" EIP
associate "159.100.241.202" EIP

Within a few seconds, sending requests to the associated Elastic IP shows that both the web1 and web2 servers receive the incoming requests (the balancing method is a 5-tuple hash of the source IP address, source port, destination IP address, destination port and the layer 4 protocol):

$ curl -I 159.100.241.202
HTTP/1.1 200 OK
Server: nginx/1.14.0 (Ubuntu)
Date: Tue, 02 Apr 2019 15:36:38 GMT
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
Link: <http://159.100.241.220/index.php?rest_route=/>; rel="https://api.w.org/"
X-Server: web2

$ curl -I 159.100.241.202
HTTP/1.1 200 OK
Server: nginx/1.14.0 (Ubuntu)
Date: Tue, 02 Apr 2019 15:36:56 GMT
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
Link: <http://159.100.241.220/index.php?rest_route=/>; rel="https://api.w.org/"
X-Server: web1

$ curl -I 159.100.241.202
HTTP/1.1 200 OK
Server: nginx/1.14.0 (Ubuntu)
Date: Tue, 02 Apr 2019 15:36:57 GMT
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
Link: <http://159.100.241.220/index.php?rest_route=/>; rel="https://api.w.org/"
X-Server: web2

$ curl -I 159.100.241.202
HTTP/1.1 200 OK
Server: nginx/1.14.0 (Ubuntu)
Date: Tue, 02 Apr 2019 15:36:58 GMT
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
Link: <http://159.100.241.220/index.php?rest_route=/>; rel="https://api.w.org/"
X-Server: web1
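
To observe the distribution more quickly, one can also loop over a handful of requests and extract just the header we added, for example:

$ for i in $(seq 1 10); do curl -sI 159.100.241.202 | grep X-Server; done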

If we simulate a web service outage by stopping the HTTP server on the instance web1, we see that now only the instance web2 serves incoming requests:

ubuntu@web1:~$ sudo systemctl stop nginx
$ curl -I 159.100.241.202
HTTP/1.1 200 OK
Server: nginx/1.14.0 (Ubuntu)
Date: Tue, 02 Apr 2019 15:39:39 GMT
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
Link: <http://159.100.241.220/index.php?rest_route=/>; rel="https://api.w.org/"
X-Server: web2

$ curl -I 159.100.241.202
HTTP/1.1 200 OK
Server: nginx/1.14.0 (Ubuntu)
Date: Tue, 02 Apr 2019 15:39:40 GMT
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
Link: <http://159.100.241.220/index.php?rest_route=/>; rel="https://api.w.org/"
X-Server: web2

$ curl -I 159.100.241.202
HTTP/1.1 200 OK
Server: nginx/1.14.0 (Ubuntu)
Date: Tue, 02 Apr 2019 15:39:40 GMT
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
Link: <http://159.100.241.220/index.php?rest_route=/>; rel="https://api.w.org/"
X-Server: web2

$ curl -I 159.100.241.202
HTTP/1.1 200 OK
Server: nginx/1.14.0 (Ubuntu)
Date: Tue, 02 Apr 2019 15:39:41 GMT
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
Link: <http://159.100.241.220/index.php?rest_route=/>; rel="https://api.w.org/"
X-Server: web2

$ curl -I 159.100.241.202
HTTP/1.1 200 OK
Server: nginx/1.14.0 (Ubuntu)
Date: Tue, 02 Apr 2019 15:39:42 GMT
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
Link: <http://159.100.241.220/index.php?rest_route=/>; rel="https://api.w.org/"
X-Server: web2

Restarting the HTTP server on server web1 resumes traffic distribution after a few seconds:

ubuntu@web1:~$ sudo systemctl start nginx
$ curl -I 159.100.241.202
HTTP/1.1 200 OK
Server: nginx/1.14.0 (Ubuntu)
Date: Tue, 02 Apr 2019 15:44:30 GMT
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
Link: <http://159.100.241.220/index.php?rest_route=/>; rel="https://api.w.org/"
X-Server: web1

$ curl -I 159.100.241.202
HTTP/1.1 200 OK
Server: nginx/1.14.0 (Ubuntu)
Date: Tue, 02 Apr 2019 15:44:30 GMT
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
Link: <http://159.100.241.220/index.php?rest_route=/>; rel="https://api.w.org/"
X-Server: web2

$ curl -I 159.100.241.202
HTTP/1.1 200 OK
Server: nginx/1.14.0 (Ubuntu)
Date: Tue, 02 Apr 2019 15:44:31 GMT
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
Link: <http://159.100.241.220/index.php?rest_route=/>; rel="https://api.w.org/"
X-Server: web1

$ curl -I 159.100.241.202
HTTP/1.1 200 OK
Server: nginx/1.14.0 (Ubuntu)
Date: Tue, 02 Apr 2019 15:44:32 GMT
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
Link: <http://159.100.241.220/index.php?rest_route=/>; rel="https://api.w.org/"
X-Server: web1

$ curl -I 159.100.241.202
HTTP/1.1 200 OK
Server: nginx/1.14.0 (Ubuntu)
Date: Tue, 02 Apr 2019 15:44:33 GMT
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
Link: <http://159.100.241.220/index.php?rest_route=/>; rel="https://api.w.org/"
X-Server: web2

Conclusion

It is worth mentioning that this solution, although battle-tested for years, is not a silver bullet: there is a slight delay before the managed Elastic IP health check detects an unhealthy instance and removes it from the distribution, during which requests may still be sent towards it and errors potentially returned to users. However, this is still better than minutes or even hours of partially degraded service in the case of a manual intervention.

Need more capacity to handle increasing incoming traffic? Just create more instances and associate them with your managed Elastic IP the same way we did with web1 and web2, and you can virtually scale to infinity and beyond.