Couchbase High Availability and Disaster Recovery: Part 2

In our previous post, we learned how Couchbase achieves high availability. This post focuses on the disaster recovery mechanisms Couchbase provides to prevent data loss.

Disaster Recovery

Couchbase uses the following mechanisms to prevent potential data loss due to unplanned incidents or disasters.

XDCR

As discussed in the previous post, Cross Data Center Replication (XDCR) replicates data between clusters in different data centers and keeps them consistent as quickly as the network allows, which also contributes to high availability.
So when a disaster strikes, the standby data center is ready for clients to switch over. XDCR therefore provides both disaster recovery and geographic load balancing.
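
To make "as quickly as the network allows" concrete, here is a minimal toy model of one-way asynchronous replication. It is not the Couchbase SDK or the XDCR implementation: the `Cluster` and `ToyXdcr` classes, the cluster names, and the `max_mutations` parameter (standing in for network bandwidth) are all assumptions made up for illustration.

```python
from collections import deque


class Cluster:
    """A toy cluster: just a key-value store (hypothetical, not the SDK)."""

    def __init__(self, name):
        self.name = name
        self.docs = {}


class ToyXdcr:
    """One-way, asynchronous replication from a source to a target cluster."""

    def __init__(self, source, target):
        self.source = source
        self.target = target
        self.pending = deque()  # mutations written locally but not yet shipped

    def write(self, key, value):
        # The source stores the write immediately...
        self.source.docs[key] = value
        # ...and queues it for the remote cluster.
        self.pending.append((key, value))

    def replicate(self, max_mutations):
        """Ship up to max_mutations queued writes to the target."""
        shipped = 0
        while self.pending and shipped < max_mutations:
            key, value = self.pending.popleft()
            self.target.docs[key] = value
            shipped += 1
        return shipped


primary = Cluster("dc-east")
standby = Cluster("dc-west")
xdcr = ToyXdcr(primary, standby)

for i in range(5):
    xdcr.write(f"user::{i}", {"visits": i})

xdcr.replicate(max_mutations=3)
lagging = len(xdcr.pending)             # writes the standby has not seen yet
xdcr.replicate(max_mutations=10)
in_sync = primary.docs == standby.docs  # the queue is drained at this point
print(lagging, in_sync)
```

The point of the sketch is that the standby trails the primary only by whatever is still in flight, so the faster the link drains the queue, the closer the two data centers stay.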

Enterprise backup and restoration tools

cbbackupmgr is the Enterprise backup and restore tool. It supports both full and incremental backups, along with merge, remove, and restore operations.
To support this, a backup repository manages the archives and performs well on large datasets. The backup is also not limited to the Data service: it covers the Index and View services as well.
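
As a rough sketch of a typical workflow, the snippet below only builds and prints the command lines for the `config`, `backup`, and `restore` subcommands; it does not run them. The archive path, repository name, cluster address, and credentials are hypothetical placeholders, and the `cbbackupmgr_cmd` helper is invented for this example. The merge and remove steps are left out because they need the names of real backups.

```python
import shlex


def cbbackupmgr_cmd(subcommand, archive, repo, **options):
    """Build an argv list for one cbbackupmgr subcommand (helper is illustrative)."""
    argv = ["cbbackupmgr", subcommand, "--archive", archive, "--repo", repo]
    for name, value in options.items():
        argv += [f"--{name}", str(value)]
    return argv


# Hypothetical values, for illustration only.
ARCHIVE = "/data/backups"
REPO = "nightly"
CLUSTER = {
    "cluster": "couchbase://127.0.0.1",
    "username": "Administrator",
    "password": "password",
}

steps = [
    # Create the backup repository inside the archive (done once).
    cbbackupmgr_cmd("config", ARCHIVE, REPO),
    # The first backup into a repository is full; later runs are incremental.
    cbbackupmgr_cmd("backup", ARCHIVE, REPO, **CLUSTER),
    # Load the backed-up data into a cluster.
    cbbackupmgr_cmd("restore", ARCHIVE, REPO, **CLUSTER),
]

for argv in steps:
    print(shlex.join(argv))
```

In practice you would check the exact options against the documentation for your Couchbase version and avoid passing a password on the command line.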

In contrast, the open-source tools available with Couchbase Community Edition are cbbackup, cbrestore, and cbtransfer.

Recovery Time Objective

We now know how Couchbase achieves high availability and disaster recovery. But in this fast-paced world, it is also necessary to avoid delays in response and protect the customer experience.
Above all, we must know our Recovery Time Objective (RTO): how long the system may take to get back online after a failure.

To understand this, let’s discuss some failure and disaster scenarios.

The cluster lost a node.

To recover from this, an automatic failover occurs.
First, Couchbase activates the replica documents, and updated cluster maps are sent to all clients.
Similarly, when a new node is added to the cluster, data is rebalanced across the cluster while updated cluster maps are shared with clients.
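
The failover step can be pictured with a small toy model. This is a simplified sketch, not Couchbase's actual algorithm: the node names, the round-robin placement, and the tiny `num_vbuckets=8` are assumptions chosen to keep the example readable (a real bucket is split into many more vBuckets).

```python
def build_vbucket_map(nodes, num_vbuckets=8):
    """Give each vBucket an active node and one replica on the next node."""
    return {
        vb: {
            "active": nodes[vb % len(nodes)],
            "replica": nodes[(vb + 1) % len(nodes)],
        }
        for vb in range(num_vbuckets)
    }


def fail_over(vbucket_map, failed_node):
    """Promote the replica of every vBucket whose active copy was on failed_node."""
    new_map = {}
    for vb, owners in vbucket_map.items():
        active, replica = owners["active"], owners["replica"]
        if active == failed_node:
            # The replica takes over; the vBucket has no replica until a rebalance.
            active, replica = replica, None
        elif replica == failed_node:
            # Only the replica copy was lost; the active copy keeps serving.
            replica = None
        new_map[vb] = {"active": active, "replica": replica}
    return new_map


nodes = ["node-a", "node-b", "node-c"]
cluster_map = build_vbucket_map(nodes)
after_failover = fail_over(cluster_map, "node-b")

# Every vBucket still has an active copy on a surviving node.
still_served = all(o["active"] in ("node-a", "node-c") for o in after_failover.values())
print(still_served)
```

The new map returned by `fail_over` plays the role of the updated cluster map that clients receive, so they know which node now owns each vBucket.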

Zero downtime!! Clients are unaffected!

The cluster lost a complete Rack/Physical Zone.

Here, Couchbase leverages Rack Zone Awareness, which ensures that active and replica documents reside on different racks.

So, when failover occurs, replicas are still available to serve requests: the replicas are promoted to active documents, and updated cluster maps are shared with clients.
Similarly, when new nodes are added to the cluster, the new rack must be configured and brought online so that data streams proportionally to its nodes. Updated cluster maps are continuously shared with clients for any topology changes.
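
The idea behind rack awareness can be sketched in a few lines. The node names, rack names, and both helper functions below are hypothetical, and the placement rule (pick the first node on another rack) is a deliberate simplification that ignores load balancing.

```python
def place_replica(active_node, node_to_rack):
    """Pick a replica node that sits on a different rack than the active node."""
    active_rack = node_to_rack[active_node]
    candidates = sorted(n for n, rack in node_to_rack.items() if rack != active_rack)
    if not candidates:
        raise ValueError("need at least two racks to separate active and replica")
    return candidates[0]


def serving_nodes_after_rack_failure(placement, node_to_rack, failed_rack):
    """Return which node serves each document once a whole rack is lost."""
    serving = {}
    for doc, (active, replica) in placement.items():
        if node_to_rack[active] != failed_rack:
            serving[doc] = active
        else:
            serving[doc] = replica  # the replica is promoted to active
    return serving


# Hypothetical topology: two racks with two nodes each.
node_to_rack = {"n1": "rack-1", "n2": "rack-1", "n3": "rack-2", "n4": "rack-2"}
actives = {"doc-1": "n1", "doc-2": "n3"}
placement = {
    doc: (active, place_replica(active, node_to_rack))
    for doc, active in actives.items()
}

serving = serving_nodes_after_rack_failure(placement, node_to_rack, "rack-1")
print(serving)
```

Because no document has both copies on the same rack, losing `rack-1` still leaves one copy of every document on `rack-2`.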

As a result, clients stay unaffected and never even know something has happened!

What if the complete Data Center goes down? Might have flooded!

With a resilient application design, the client code must be ready to re-route requests on timeout, so that failed requests are re-executed against the hot-standby data center, which is kept ready and consistent using XDCR.

When the downed data center comes back online, two-way (bidirectional) XDCR begins reloading data to it from the other data center. This is considerably faster than restoring from a backup and avoids the downtime a restore would involve.

So, clients will be slightly affected by the timeout from the downed data center and the re-routing of requests to the standby data center. This timeout is configurable and can be set as low as 5 ms.
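
The re-routing logic described above might look roughly like the sketch below. It uses plain Python callables in place of real SDK connections: `call_with_failover`, `downed_primary`, and `healthy_standby` are hypothetical names, and the timeout values are arbitrary.

```python
import concurrent.futures
import time


def call_with_failover(request, primary, standby, timeout_s):
    """Try the primary; if it fails or does not answer in time, use the standby."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(primary, request)
    try:
        return future.result(timeout=timeout_s)
    except (concurrent.futures.TimeoutError, ConnectionError):
        # Re-execute the same request against the standby data center.
        return standby(request)
    finally:
        # Do not block on the abandoned primary call; it finishes in the background.
        pool.shutdown(wait=False)


def downed_primary(request):
    time.sleep(0.2)  # simulates a data center that does not answer in time
    return f"primary:{request}"


def healthy_standby(request):
    return f"standby:{request}"


result = call_with_failover(
    "get user::42", downed_primary, healthy_standby, timeout_s=0.05
)
print(result)
```

The smaller `timeout_s` is, the sooner clients give up on the downed data center, at the cost of more false alarms when the primary is merely slow.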

Entire system slagged; All Data Centers are unavailable

Talking about disasters, nothing is impossible! So, what if both the active and standby data centers have been wiped out simultaneously?
In that case, the Enterprise backup and restore tool can be used to restore the data centers to any incremental backup point.
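
Restoring to an incremental backup point can be pictured as replaying the full backup and then each increment up to the chosen point. The toy format below (a list of dictionaries, with `None` marking a deletion) is an assumption made for this example and has nothing to do with the tool's real archive layout.

```python
def restore_to_point(backups, point):
    """Rebuild the dataset from the full backup plus increments up to point."""
    if not 0 <= point < len(backups):
        raise IndexError("no such backup point")
    data = {}
    for increment in backups[: point + 1]:
        for key, value in increment.items():
            if value is None:
                data.pop(key, None)  # None marks a deletion in this toy format
            else:
                data[key] = value
    return data


backups = [
    {"a": 1, "b": 2},  # point 0: full backup
    {"b": 3, "c": 4},  # point 1: incremental (b changed, c added)
    {"a": None},       # point 2: incremental (a deleted)
]

at_point_1 = restore_to_point(backups, 1)
at_point_2 = restore_to_point(backups, 2)
print(at_point_1, at_point_2)
```

Each increment stores only what changed since the previous backup, which is why any point in the chain can be reconstructed without keeping a full copy per backup.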

Clients will be affected for a while until the backup and restore process completes!

Conclusion

So, we learned how Couchbase disaster recovery mechanisms not only prevent potential data loss but also help the system recover quickly without affecting clients.
Nodes can therefore be provisioned or removed without interrupting running applications, and without requiring developers to modify those applications.
Furthermore, Couchbase does much of this automatically, so most of these scenarios need little manual intervention or downtime.

To sum up, we’ve learned how high availability and disaster recovery prevent data loss and keep clients unaffected.
Hope you liked the post. Feel free to share your queries or thoughts in the comments section.

Thanks!