DBA in training: SQL Server high availability options

DBAs must make sure data is highly available, and there are many SQL Server high availability options. Pamela Mooney discusses those options in this article.

The series so far:

  1. DBA in training: So, you want to be a DBA...
  2. DBA in training: Preparing for interviews
  3. DBA in training: Know your environment(s)
  4. DBA in training: Security
  5. DBA in training: Backups, SLAs, and restore strategies
  6. DBA in training: Know your server’s limits 
  7. DBA in training: SQL Server under the hood
  8. DBA in training: SQL Server high availability options

In previous articles in this series, I have stated that the job of the DBA is to make the right data available to the right people as quickly as possible.

Here is where we delve more into the word “available” and take it up a notch. SQL Server offers several options for high availability, and understanding the advantages and caveats of each one will give you the best chance of ensuring the availability of data in any scenario. Let’s discuss the options for high availability in general terms and find out where to go to get more information as you need it.

Due to the breadth of this article and in keeping with the idea of just learning the basics, I am not going to cover Azure here except to say that Azure either has compatibility with these features in most of its offerings or uses them in background processes.

Backups – The Foundation

This series has already covered understanding backup strategies and SLAs. This is the foundation of making the right data available. Having a tried and tested backup and restore strategy is your protection against a host of issues, from data entry errors to unintentional deletes to database corruption. No high availability solution should ever take the place of your backups. But restores take time – and businesses may sometimes require less downtime than a restore allows.

On other occasions, backups and restores cannot meet a business need. Say, for instance, that you have a situation where you have a third-party database that performs poorly for you when you try to report against it. It is, in fact, almost unusable for that purpose – but reporting against that data is something that must happen. You can’t modify the database (i.e., apply indexing or tune stored procedures) without voiding your warranty, and the business doesn’t want to take that risk. They do, however, need queries to run faster than they currently do.

Your data needs to be more readily available to your business. What choices do you have to make your data more performant and accessible to your business users?

One option comes immediately to mind: replication.

Replication – Object-Level Availability

Replication, simply defined, is a process where schema and data from a database are copied from one location to another location (or locations). It can be used to copy parts of a database or as a solution to the scenario I’ve just described. Replication allows you to keep your third-party database in pristine condition while modifying the copied schema and data. In this example, the third-party database could be reserved for data entry, while the most critical data is replicated to another server that is used for reporting. Those replicated tables can then be indexed and optimized to your heart’s content. Replication can offer advantages over other options, such as the availability groups I’ll describe later; for one, the schema doesn’t have to be exactly the same on both sides if you know how to do it, which I will explain.

How Replication Works

Replication has three components: a publisher (source database), a distributor (the database in the middle, in charge of getting the data to one or more destinations), and a subscriber (the destination database). Inside the publisher are publications, which are groups of database objects (called articles) that you assemble to be replicated together. The publications can have one or more subscriptions, or destinations where the articles are copied.
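To make those terms concrete, here is a minimal, hedged sketch of the same pieces in T-SQL, assuming the distributor is already configured and the database has already been enabled for publishing; the publication, table, server, and database names (SalesPub, dbo.Orders, REPORTSRV, SalesReporting) are hypothetical.

    -- Create a publication (the container for the articles), run in the published database.
    -- Assumes the database has already been enabled for publishing (sp_replicationdboption).
    EXEC sp_addpublication
        @publication = N'SalesPub',
        @repl_freq   = N'continuous',   -- transactional; use N'snapshot' for snapshot replication
        @status      = N'active';

    -- Add an article (here, a single table) to the publication.
    EXEC sp_addarticle
        @publication   = N'SalesPub',
        @article       = N'Orders',
        @source_owner  = N'dbo',
        @source_object = N'Orders';

    -- Point the publication at a subscriber (a push subscription in this sketch).
    EXEC sp_addsubscription
        @publication       = N'SalesPub',
        @subscriber        = N'REPORTSRV',
        @destination_db    = N'SalesReporting',
        @subscription_type = N'push';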

To make the magic happen, SQL Server uses two agents. The snapshot agent assembles the metadata of the schema and data of the objects to be replicated and keeps track of the synchronizations in the distribution database. There is also a log reader agent, which monitors the transaction log of the database for any changes which will need to be copied to the subscriber(s). When you start replication on a publication, SQL Server uses the snapshot agent to create the snapshot of the schema so that an identical copy can be made on the other side. Once that process is complete, it sends over the data using the bulk copy program (BCP) utility. When the process is finished, there is an exact copy of all the articles on all of the subscribers.

You have options with replication. Not all options are supported across all editions. If you have questions about whether the option you want is available in your edition of SQL Server, this should help. Options for replication are:

  1. Snapshot replication – SQL Server will take a snapshot of your data (either immediately or at a specified time of day) and send it over. This point-in-time process is a great choice for data that doesn’t often change and/or isn’t needed in real-time since it can be done on a schedule or as needed, allowing you to comply with business requirements or to avoid maintenance windows. Since snapshot replication isn’t an ongoing process, it is also cheaper for SQL Server to perform than the other replication options.
  2. Transactional replication – This is SQL Server copying the data over in “real-time”. (For the purposes of this article, “real-time” will be defined by factors such as how fast your network is, how much replication you have going on at a time, etc.) This choice is for data which needs to be as close to real-time as possible.

There are two types of subscriptions in transactional replication. The first is the push subscription, where SQL Server “pushes” the changes the log reader finds over to the subscriber without the subscriber needing to request them. The distribution agent runs at the distributor, freeing up the subscriber.

The other kind of subscription is known as a pull subscription. In this case, the distribution agent runs on the subscriber. With pull subscriptions, the subscriber will request for the changes to be brought over. Because of these differences, pull subscriptions should not be expected to be as close to real-time as a push subscription (although they can be pretty close). They are typically used when there is more than one subscriber, when you want to minimize resource contention at the distributor, or when you need the data sooner rather than later but not necessarily immediately. Push subscriptions are also commonly used with merge replication (the next option).

  3. Merge replication – This is the choice when there is no primary key on the table to be replicated, but it still needs to be copied over in real-time. You may also consider merge replication if you have subscribers updating data, and you want that data copied to the publisher and/or other subscribers.

Merge replication works a bit differently than transactional and snapshot replication. In merge replication, data is updated using a series of insert, delete and update triggers, and by assigning a unique identifier to every row of data being copied. Another set of triggers captures the schema changes. Instead of using the log reader agent, a merge agent coordinates data movement. Merge replication can be tricky to set up and troubleshoot, and conflicts are not uncommon. When possible, I recommend that you go with snapshot or transactional replication instead.

  4. Peer-to-peer – This type of replication goes back and forth between two or more servers in real-time. Updates on any one server are copied over to the others. This option is only available to Enterprise edition users. If you use this choice, the schema and object names must remain identical on all nodes, taking this choice off the list for the reporting scenario examined earlier. Publications cannot be filtered, a consideration for very large tables. There are quite a few restrictions on this type of replication, so if you are considering it, you might start learning about them here. Peer-to-peer replication does present some latency and conflict/data loss possibilities, so it should be tested thoroughly before using in production.

I explain, step by step, how to set up transactional replication here. Snapshot replication is just as easy. TechBrothersIT does a walkthrough of setting up snapshot replication over here. For peer-to-peer and merge replication, I like SQL Shack’s walkthrough.

The Gotchas of Replication

You’ve probably heard that replication is complicated. You may have even been afraid to try it. The truth is that although it can be a little complicated, it is a well-documented and stable option for you to consider. While it is true that in previous versions (especially versions prior to SQL Server 2005), replication could be difficult to work with, I have worked with it since SQL Server 2008 R2 with success. I have encountered scenarios when I had to stagger the times when snapshot publications fired off so they wouldn’t be competing with each other for resources. Once in a while, I have had to reinitialize a hung publication. Neither issue was particularly difficult to resolve. On the other hand, I am always surprised by replication’s resiliency. I have had to bounce servers where replication was running, during patching or other situations that called for it. (Spoiler alert: Nothing happened. I didn’t need to do anything special. The log reader agent simply picked up where it left off, and we all lived happily ever after.) I have had to do database restores or database snapshot rollbacks with replicated databases, which require replication to be completely blown away and recreated. There wasn’t so much as a blip on the radar, just some hurry-up-and-wait for it to complete. I have even automated replication with success. In general, replication issues and their fixes are out there for the Googling and have rarely been a problem.

That is not to say there are no catches or things you might need to know upfront. For one thing, let’s go back to our indexing scenario.

At some point, you will need to reinitialize replication or even blow it away and recreate it. Any time the process is restarted, the objects and data on the other side are also deleted and will be repopulated from scratch. That means any indexing, permissions, triggers, etc., you may have put on the subscriber side will be lost unless they are backed up in some form. The fix is to do just that. Keep backup .sql files to recreate your indexes, triggers and permissions. You even have the option in the GUI to tell SQL Server to run your scripts after it is done making the publication, or you can choose to do it yourself, either manually or in an Agent job.
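That GUI option corresponds to the publication’s post_snapshot_script property, which you can also set in T-SQL. A minimal sketch, with a hypothetical publication name and file share:

    -- Run a script of your own (indexes, triggers, permissions) after the snapshot is applied.
    -- The publication name and UNC path are hypothetical.
    EXEC sp_changepublication
        @publication = N'SalesPub',
        @property    = N'post_snapshot_script',
        @value       = N'\\fileshare\repl\recreate_subscriber_indexes.sql',
        @force_invalidate_snapshot = 1;   -- may be needed, since snapshot-related properties invalidate the existing snapshot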

Any indexed views on the subscriber dependent on replicated tables will need to be taken down prior to reinitializing or restarting replication and put back afterwards. Again, having a backup .sql script to do this is very helpful.

It is important to remember that if you reinitialize or blow away/restart replication on a publication, all the data is dropped and repopulated. All the data. Unless you are using peer-to-peer replication, you can filter the publication to remedy that. One common way is to filter on the date – for example, maybe the business specifies that you only need the last six months of data. However, if you ever have to blow that publication away and restart it, you will have to remember to adjust that filter, because the date will be hardcoded (you will have saved backup scripts of your publications, so reapplying them and adjusting the filter will be a breeze). I have tried applying a dynamic filter to publications in the past, but with no success. Adjusting filter dates is easy to forget, but the good news is that no matter what, at least the old filter date will hold, and all the data from the beginning of time won’t come over.
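For illustration, here is a hedged sketch of adding that kind of date filter to an article with sp_articlefilter and sp_articleview, typically done while setting up the publication. The publication, table, and date are hypothetical; note the hardcoded date, which is exactly the value you must remember to adjust when you recreate the publication.

    -- Row-filter the Orders article so only recent rows replicate.
    EXEC sp_articlefilter
        @publication   = N'SalesPub',
        @article       = N'Orders',
        @filter_name   = N'FLTR_Orders_6Months',
        @filter_clause = N'OrderDate >= ''20250101''';   -- hardcoded date: adjust when recreating the publication

    -- Create the synchronization view that applies the same filter.
    EXEC sp_articleview
        @publication   = N'SalesPub',
        @article       = N'Orders',
        @view_name     = N'SYNC_Orders_6Months',
        @filter_clause = N'OrderDate >= ''20250101''';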

Monitoring replication is essential but not as easy to do as you would think. Some monitoring tools offer the ability to alert if something is wrong with replication, but they usually have to be combined with home-grown checks to ensure that all is well. Checks can be written to compare row counts on publisher and subscriber tables to ensure they match. Alternatively, you could query the distributor for latency and alert on that. Microsoft has some T-SQL to help get you started. I also highly recommend taking a look at Kendra Little’s approach.
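One lightweight way to query for latency in transactional replication is a tracer token, which rides from the publisher through the distributor to the subscriber and records how long each leg took. A hedged sketch, with hypothetical publication, publisher, and database names:

    -- 1. At the publisher, in the published database: drop a tracer token into the log.
    DECLARE @token_id int;
    EXEC sys.sp_posttracertoken
        @publication     = N'SalesPub',
        @tracer_token_id = @token_id OUTPUT;

    -- 2. A little later, at the distributor, in the distribution database:
    --    list the tokens for the publication, then see how long each leg took.
    EXEC sys.sp_helptracertokens
        @publication  = N'SalesPub',
        @publisher    = N'PUBSRV',
        @publisher_db = N'SalesDB';

    EXEC sys.sp_helptracertokenhistory
        @publication  = N'SalesPub',
        @publisher    = N'PUBSRV',
        @publisher_db = N'SalesDB',
        @tracer_id    = -2147483641;   -- use a tracer_id returned by sp_helptracertokens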

Replication can be a great option for making your data more available. It can also help reserve your third-party databases for data entry and keep your vendors happy. It is a well-vetted, stable process – but you do have to understand what you are doing and where it can cause issues. Properly applied, though, it is a good tool to have in your toolbelt.

So, replication is one alternative for copying parts of a database to one or more places. But what if you want to make an entire instance of SQL Server highly available? This is where you might consider a failover clustered instance. Failover clustered instances (FCIs) are built on top of a Windows Server Failover Cluster (WSFC). We will discuss WSFCs in some depth, since their internals govern the other technologies that are built on top of them.

Let’s talk about what it is and how it works.

Windows Server Failover Clustering – Server-Level Availability

Microsoft defines Windows server failover clustering like this:

A failover cluster is a group of independent computers that work together to increase the availability and scalability of clustered roles (formerly called clustered applications and services). The clustered servers (called nodes) are connected by physical cables and by software. If one or more of the cluster nodes fail, other nodes begin to provide service (a process known as failover). In addition, the clustered roles are proactively monitored to verify that they are working properly. If they are not working, they are restarted or moved to another node.

Simply put, a Windows Server Failover Cluster (WSFC) is a set of servers that can seamlessly take up for each other in the case of a problem. If one goes down, another picks up where the other left off. It has a mechanism to automatically detect problems and fail over when it needs to. The amount of downtime perceived by the end-user depends on many factors and should be tested to ensure it complies with service level agreements established with the business. Windows Server failover clusters can be used as the foundation for failover cluster instances and availability groups for SQL Server, but they can also be used for other things, such as distributed file systems. Because this is configured at the OS level, your networking or Infrastructure teams will most likely set this up for you, but if you want to learn more (or find yourself having to do it), this will get you started.

How WSFC Works

A WSFC can share storage between its nodes (and does when it hosts a failover clustered instance). Only one node will own a given clustered service or instance at a time. How many nodes you are allowed to have in total depends on the edition and release of Windows Server.

So, how does it know how or when to failover? Failover works through a mechanism called heartbeats, very similar to how heartbeats work in the human body. A doctor can listen to a patient’s heart. If it is beating within a certain range of beats per minute, the heart is functioning normally. If, however, the doctor doesn’t hear a heartbeat within a certain length of time, the person is either dead or in the process of dying.

With a WSFC, if a heartbeat from the active server is not detected for six consecutive heartbeats, Windows knows there is a problem and will try to restart it. If that is unsuccessful, the server is presumed to be down, and another server takes over.

So, you may wonder, one server is down. Say you have three total nodes. How does the WSFC know when it’s getting low on nodes? Is there any warning? The answer is yes. WSFC uses a thing called quorum to track whether the WSFC has enough nodes to be operational. Usually, a quorum is over half the nodes. Quorum is the rule for the number of failures that can happen while the failover cluster remains operational. Because quorum ensures that only one set of nodes can run the cluster at a time, it also helps prevent a nasty consequence called the split-brain scenario, in which two halves of the cluster start acting independently of each other and trying to write data to the same place.

When a node goes down, Windows will try to restart it. If that fails, a failover happens. The other nodes check in with something called a vote. Generally speaking, it is a good practice to have an odd number of nodes. If you don’t, you may also have something called a witness, to help with maintaining quorum. If quorum is maintained, the cluster is viable. Windows will figure out which node it wants to make primary and will failover. Microsoft gives an in-depth explanation of quorum and witnesses here.

The Gotchas of WSFCs

The main thing to know about using WSFCs is that you likely won’t have access to them. If you do, understand that the steps to set one up have to be followed very carefully. I highly recommend you practice it on a cluster of test servers before you try it in production because there are many steps, and they are not all obvious. You must run the validation checks during setup and clear them if you want Microsoft support, so don’t skip them.

There are many requirements when setting up a WSFC, but I’ll touch on a few of the more important ones. Matching hardware on all nodes is no longer necessary (although it is still considered a good practice). However, the hardware on all nodes must be compatible and pass the validation checks. In addition, each node must be able to handle the workload of whatever is going to fail over to it. This sounds like common sense but may not be intuitive. A public network is required for client access to a clustered instance. Ensure the “Allow clients to connect through this network” box is checked in the Cluster Network Properties.

WSFCs are a solid foundation for server-level failover. They are not simple, but they are effective. But what if what you want to make highly available is a SQL Server instance running on that cluster? That is where failover cluster instances come in.

Failover Cluster Instances (FCIs) – SQL Server Instance-Level Availability

FCIs are SQL Server instances that sit on top of two or more nodes of a Windows Server failover cluster. The instance can only run on one node at a time. The FCI is run and governed by the same technology that runs the WSFC we discussed above, and it provides instance-level availability across two or more nodes in the cluster. As with WSFCs, failure detection is triggered when six consecutive heartbeats are missed or the instance is unresponsive for six consecutive interval checks. The length of the intervals in the interval checks is configurable through the Failover Cluster Manager.
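From the SQL Server side, you can see whether an instance is clustered, which node it is currently running on, and which nodes it can fail over to, using a couple of built-in server properties and a DMV. A quick sketch:

    -- Is this instance an FCI, and which node is it running on right now?
    SELECT
        SERVERPROPERTY('IsClustered')                 AS is_clustered,   -- 1 for an FCI
        SERVERPROPERTY('ComputerNamePhysicalNetBIOS') AS current_node;

    -- The nodes in the underlying cluster, and which one currently owns the instance.
    SELECT NodeName, status_description, is_current_owner
    FROM sys.dm_os_cluster_nodes;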

The Gotchas of FCIs

Because the storage in FCIs is shared, the schema and objects in a failover clustered instance must be identical – no adding indexes or filtering data, and no picking and choosing what you are bringing over.

Patching and upgrades require a couple of extra steps. This will help get you started. In-place and rolling upgrades are supported.

Don’t use FCIs with SSRS or SSIS. SSRS offers a scale-out, load-balanced solution that works well. For SSIS, Microsoft recommends installing it as a stand-alone service on each node of the cluster.

If you haven’t worked with PowerShell before, get ready to. You’ll need it for setup and also for easier troubleshooting. Speaking of troubleshooting, it can be a little tricky to understand what is happening when issues occur. DBATools has a set of cmdlets to help.

Now, we’ve covered high availability at the object, server, and instance levels. What about entire selected databases?

Availability Groups (AGs) – Database-Level Availability

A long time ago, Microsoft introduced an availability option called database mirroring, in which changes to a database on one server were copied to a mirror copy of that database on another server in a different location.

Think of availability groups as mirroring on steroids.

Now, instead of one copy of database(s) going to one place, you can have multiple copies of database(s) going to multiple places (depending on the edition and release you have of SQL Server). How many of those copies (or replicas) you can read depends on what editions and releases of SQL Server you are running. Availability groups have become the de facto mode for database-level high availability; mirroring is now deprecated.

How Availability Groups Work

Availability groups are built on top of a Windows Server failover cluster and use the same basic mechanisms, with a couple of twists. The readable secondaries use snapshot isolation (meaning transaction-level concurrency ensured by the creation of a version store in tempdb). Once the failover cluster is built, the AGs are configured. Now, however, you get options as to the type of failover you want. You can choose automatic failover with no data loss, a planned manual failover (again, no data loss) or a forced failover (with data loss). You can also specify how quickly your data needs to be sent to the replicas. Choose synchronous commit if your replicas are geographically close. If the replicas are far apart, asynchronous commit is the better option.
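To make those choices concrete, here is a minimal, hedged sketch of creating a two-replica, synchronous-commit AG with automatic failover and a readable secondary. The group, database, server, and endpoint names are hypothetical, and it assumes the WSFC and database mirroring endpoints already exist.

    -- Run on the intended primary replica.
    CREATE AVAILABILITY GROUP [AG_Sales]
    FOR DATABASE [SalesDB]
    REPLICA ON
        N'SQLNODE1' WITH (
            ENDPOINT_URL      = N'TCP://SQLNODE1.corp.local:5022',
            AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,   -- use ASYNCHRONOUS_COMMIT for far-apart replicas
            FAILOVER_MODE     = AUTOMATIC,            -- automatic failover with no data loss
            SECONDARY_ROLE (ALLOW_CONNECTIONS = READ_ONLY)
        ),
        N'SQLNODE2' WITH (
            ENDPOINT_URL      = N'TCP://SQLNODE2.corp.local:5022',
            AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
            FAILOVER_MODE     = AUTOMATIC,
            SECONDARY_ROLE (ALLOW_CONNECTIONS = READ_ONLY)
        );

    -- Then, on the secondary replica:
    -- ALTER AVAILABILITY GROUP [AG_Sales] JOIN;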

You even have choices as to the type of AG you want to use.

  1. You can opt for a basic availability group if you are on SQL Server Standard edition, which supports a single database with no readable access to the secondary. However, you cannot upgrade a basic availability group to an Always On AG if you move to Enterprise Edition; you will have to drop the basic availability group first and recreate it.
  2. You can choose Always On Availability Groups (or Availability Groups (AGs)), which is what I will be speaking to for the rest of this section.
  3. You can choose Read-Scale Availability Groups, which are a special type of AG used for a specific purpose. You don’t use these for high availability or disaster recovery, but for offloading read workloads to your secondary(ies). These AGs don’t require a WSFC, although that alone is not what defines them. You can learn more about those here.
  4. Finally, there are Distributed AGs, which are an availability group of availability groups. These are quite advanced. Learn more about them here. Denny Cherry also does a nice writeup on distributed availability groups, their use in disaster recovery, their setup and gotchas.

AGs offer you the flexibility to offload backups to secondaries and can even, in some cases, fix corruption (NB: Be careful offloading your backups, and don’t count on your AG to fix corruption. Ever.). Still, it’s good to know.

Unlike WSFCs, you are most likely going to be the person to set up an availability group. This will help walk you through the setup of an Always On availability group.

The Gotchas of AGs

Put succinctly: AGs are a complicated technology. The more complicated the tech, the more gotchas. And with AGs, there are many of them.

One thing to consider carefully is the databases you want to include. Very large databases can be problematic because they take forever to populate, or seed. In fact, seeding your replicas could well be your biggest logistical challenge on setup. Be sure that the log and data file paths are the same on both sides. If you use automatic seeding, know that if backups are going on during seeding, it can cause blocking. One alternative in your toolbelt to avoid these possible headaches might be to pick an off-hours time, stop activity on the databases in the group, take a backup of them, restore the backups to the replicas, then start the AG and let everyone back in. Otherwise, the blank database(s) can be created on the replicas and seeded, and you will need to wait until that is done.
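The “back up, restore, then start the AG” approach described above looks roughly like this in T-SQL. The database, file share, and group names are hypothetical, and it assumes the database is in the FULL recovery model, is part of the AG definition on the primary, and that the secondary has already joined the group.

    -- On the primary: back up the database and its log.
    BACKUP DATABASE [SalesDB] TO DISK = N'\\backupshare\SalesDB.bak';
    BACKUP LOG      [SalesDB] TO DISK = N'\\backupshare\SalesDB.trn';

    -- On each secondary: restore both, leaving the database in a restoring state.
    RESTORE DATABASE [SalesDB] FROM DISK = N'\\backupshare\SalesDB.bak' WITH NORECOVERY;
    RESTORE LOG      [SalesDB] FROM DISK = N'\\backupshare\SalesDB.trn' WITH NORECOVERY;

    -- Still on the secondary: join the restored database to the availability group.
    ALTER DATABASE [SalesDB] SET HADR AVAILABILITY GROUP = [AG_Sales];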

Cross-database and distributed transactions are not supported on AGs until SQL Server 2016 SP2, so if you are not on that version, you will need to upgrade if you have those in place and want to use availability groups.

Monitoring and troubleshooting AGs can be tricky and is best done with a combination of PowerShell, the built-in dashboard, and the DMVs provided by Microsoft. This is a nice list of the Microsoft-supplied DMVs. DBATools (which I linked earlier) also has a list of its AG resources here. Microsoft also provides a list of its PowerShell resources. Some monitoring tools, such as Redgate’s SQL Monitor, have availability group monitoring built right in.
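As a starting point for your own checks, here is a hedged example of the kind of DMV query you might run against the primary to see synchronization state and queue sizes for every database in every AG; the columns shown are just one reasonable selection.

    SELECT
        ag.name                        AS availability_group,
        ar.replica_server_name,
        DB_NAME(drs.database_id)       AS database_name,
        drs.synchronization_state_desc,
        drs.synchronization_health_desc,
        drs.log_send_queue_size,       -- KB of log not yet sent to this replica
        drs.redo_queue_size            -- KB of log not yet redone on this replica
    FROM sys.dm_hadr_database_replica_states AS drs
    JOIN sys.availability_replicas           AS ar ON ar.replica_id = drs.replica_id
    JOIN sys.availability_groups             AS ag ON ag.group_id   = drs.group_id
    ORDER BY ag.name, ar.replica_server_name, database_name;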

It’s important to understand that although availability groups are built on top of a Windows Server failover cluster, they are a separate entity from it. Don’t use the Failover Cluster Manager to manage your availability groups.

It is possible (although quite complicated) to use replication with availability groups. You cannot use peer-to-peer replication on an AG. If you use replication on an AG, the publisher goes on the primary replica. However, if you remove the server from the availability group for some reason, you have to tear down replication, so this is something to keep in mind. You are also adding a layer of complexity to an already complex process, so troubleshooting becomes that much more complicated.

There are outside factors that can cause an availability group to go down (at least temporarily). Server-level snapshot backups that quiesce the databases in an AG can cause the availability group to go down. Memory dumps can cause failover, as can excessive I/O. You may find that different circumstances cause you to increase the health check timeout on the failover cluster manager, but if you do that, you won’t know about failovers as quickly – which can be an issue for troubleshooting.

In short, availability groups are another choice to consider when you want to copy entire databases and downtime must be minimal, but the flexibility and options they offer come with complexity and potential problems. Be sure to practice using them in a test environment before trying them for real.

SQL Server high availability options

Database administrators have a number of choices to consider to make the data they manage highly available to the businesses they serve. Understanding when and why to choose one option over another, practicing each technology until you are comfortable with it, and having troubleshooting measures in place will give you the best chance of ensuring your data is available as quickly as possible.