A cost-effective approach to SQL Server high availability in the cloud

Configuring SQL Server for high availability (HA) can be a costly prospect. In a traditional on-premises approach, one creates a failover cluster instance (FCI) with two (or more) servers. Only one of those servers is typically performing production tasks at any moment; the others are largely standing ready to be called into service should the primary server fail. When your SQL Server requirements demand a large system with multiple high-powered CPU cores and hundreds of gigabytes of memory, your FCI can have a lot of expensive hardware doing nothing but standing by.

The cloud affords you different options when it comes to configuring for HA. In Azure, AWS, and Google Cloud Platform (GCP) you can create a SQL Server FCI on virtual machines (VMs) rather than on physical machines. More interestingly, you may find that you can create an FCI in the cloud whose backup VMs are not equal in size and performance to the primary VM running your production SQL Server instance. You might configure your secondary VMs as much smaller systems.

Why? Because you may be able to cut your operating costs considerably. The VM you need for your primary production environment may be very expensive, but if you provision your backup VMs as smaller servers (think of them as emergency spare tires as opposed to full-sized spares), you can pay far less for the systems that are doing nothing but waiting to be called into emergency service.
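To make the trade-off concrete, here is a back-of-the-envelope comparison in Python. The hourly rates and the 730-hour month are illustrative assumptions, not real price-list figures; substitute current prices from your provider and region.

```python
# Back-of-the-envelope comparison of standby costs for a two-node FCI.
# The hourly rates below are hypothetical placeholders, not quoted prices.

HOURS_PER_MONTH = 730

full_size_rate = 3.80   # large memory-optimised VM, $/hour (assumed)
spare_tire_rate = 0.40  # small general-purpose VM, $/hour (assumed)

full_size_standby = full_size_rate * HOURS_PER_MONTH
undersized_standby = spare_tire_rate * HOURS_PER_MONTH

print(f"Full-size standby:  ${full_size_standby:,.2f}/month")
print(f"Undersized standby: ${undersized_standby:,.2f}/month")
print(f"Monthly saving:     ${full_size_standby - undersized_standby:,.2f}")
```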

But here's where the cloud and the elasticity of VMs provide a distinct advantage over an FCI built on-premises: if an event occurs that causes your FCI to fail over to one of the smaller secondary VMs, you can re-provision that smaller VM so that it reconstitutes as a new VM that is as large and as powerful as the original primary. The secondary that would have been far too small to support your production load becomes a VM that can deliver the full support your SQL Server application demands. The fee for that secondary VM will increase commensurately, but you have avoided paying that higher fee until this moment. In an on-premises FCI you would have been paying for the larger system for months, possibly years, while it sat waiting to be brought online.
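As a rough illustration of what that re-provisioning step involves, here is a minimal Python sketch using the AWS boto3 SDK to resize an EC2 instance (Azure and GCP offer equivalent operations through their own SDKs). The region, instance ID, and target instance type are hypothetical placeholders.

```python
# Minimal sketch: grow an undersized EC2 secondary to match the former
# primary after a failover. Requires boto3 and AWS credentials; the
# instance ID and target type below are hypothetical placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

INSTANCE_ID = "i-0123456789abcdef0"   # the undersized secondary (assumed)
TARGET_TYPE = "r5.4xlarge"            # size of the original primary (assumed)

# The instance type can only be changed while the instance is stopped,
# which is the brief reprovisioning interruption discussed above.
ec2.stop_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[INSTANCE_ID])

ec2.modify_instance_attribute(
    InstanceId=INSTANCE_ID,
    InstanceType={"Value": TARGET_TYPE},
)

ec2.start_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_running").wait(InstanceIds=[INSTANCE_ID])
```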

Later, when the previous primary VM comes back online, you have a choice: you can either move your production SQL Server load back to that VM and return the secondary VM to its emergency spare-tire size, or shrink the original primary to that spare-tire size and continue to use it as the new secondary failover server in the FCI. If the latter, you'd continue to use the expanded secondary VM as your primary production system. Note that if you're taking advantage of the AWS EC2 Reserved Instances option, you will continue to be charged the higher rate once you've expanded the VM, even if you subsequently shrink it back down to its previously undersized dimensions.

Are there trade-offs to configuring an FCI with undersized secondary VMs? There are, and they are important to weigh in the balance.

You're configuring for HA for a reason, and it's important to have a clear understanding of your expectations. We can talk about HA in terms of a cloud SLA that guarantees access to at least one of the VMs in your FCI 99.99 per cent of the time, but when weighing the use of undersized backup servers in a SQL Server FCI there are two other metrics you need to take into consideration.

The first is your recovery time objective (RTO), which represents the amount of time you can tolerate for getting your application back up and running in the event of a failure. By definition, an HA solution must be able to detect a failure of the primary VM and then perform an automatic recovery, which, at a high level, means failing over to the secondary VM, rolling back the database to the last committed transaction, and making the secondary instance of SQL Server the primary instance so that users can begin working with the database again. The amount of elapsed time you would consider acceptable between the event that causes failure of the primary and the resumption of user interaction with SQL Server on the secondary VM is your RTO.

Knowing your RTO is important because one of the trade-offs in using an undersized secondary that you intend to convert into a larger VM when necessary is that reprovisioning takes time. It's only a matter of minutes, but if those extra minutes might result in the loss of millions of dollars' worth of transactions, then using undersized VMs as your secondaries may not be worthwhile. However, if taking an extra two minutes to reprovision the secondary as a larger VM results in a minimal loss of revenue or customer satisfaction, then the amount of money you save by not paying for a larger standby VM may warrant consideration of an undersized approach.
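A simple way to reason about this trade-off is to add up the components of the recovery timeline. The durations below are illustrative assumptions only; measure the actual figures in your own environment.

```python
# Rough RTO budget for a failover that includes resizing the secondary.
# Every duration here is an illustrative assumption, not a measurement.

detection = 30   # seconds for the cluster to detect primary failure (assumed)
failover = 45    # seconds to bring SQL Server online on the secondary (assumed)
recovery = 60    # seconds to roll back uncommitted transactions (assumed)
resize = 180     # extra seconds to stop, resize, and restart the VM (assumed)

rto_without_resize = detection + failover + recovery
rto_with_resize = rto_without_resize + resize

print(f"RTO, same-size secondary:  {rto_without_resize} s")
print(f"RTO, undersized secondary: {rto_with_resize} s")
```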

The second metric to weigh in the balance is your recovery point objective (RPO), which represents the amount of data you can stand to lose in a failure scenario. When you're configuring for HA, it's safe to assume that you don't want to lose any data, which means you need to ensure that your backup VMs have access to the data that your primary SQL Server instance is working with. Since no provider currently offers a shared cloud storage solution with a 99.99 per cent availability SLA, you'll need a way to reliably replicate your SQL Server data among the separate physical locations where your secondary VMs reside.

If you configure for HA using a SQL Server Always On Availability Group (AG) approach (rather than an FCI), SQL Server will replicate your user-defined databases to your secondary servers. However, Always On Availability Groups require SQL Server Enterprise Edition, which is going to increase your costs (and the whole point of undersizing your secondaries is to decrease your costs). You'll also find that key SQL Server system databases (which hold agent jobs, logins, and the like) are not replicated to the secondary VMs under an AG.
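If you want to see exactly which databases an AG would leave unprotected, a quick query against sys.databases can tell you: its replica_id column is NULL for any database that is not part of an Availability Group, and system databases can never join one. Below is a minimal Python sketch using pyodbc; the connection string and server name are placeholders.

```python
# Sketch: list databases on the primary that an Availability Group would
# not cover. Uses pyodbc; the connection string is a placeholder.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=sql-primary.example.com;"      # hypothetical server name
    "DATABASE=master;Trusted_Connection=yes;"
)

# replica_id is NULL for databases outside any AG; system databases
# (master, model, msdb, tempdb) can never be added to one.
rows = conn.execute(
    "SELECT name FROM sys.databases WHERE replica_id IS NULL"
).fetchall()

print("Databases not replicated by any Availability Group:")
for (name,) in rows:
    print(" -", name)

conn.close()
```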

If you're using SQL Server Standard Edition, or if your RPO demands that you replicate all SQL Server databases to the secondary VMs, then you'll want to construct an FCI using a SANless clustering tool such as SIOS DataKeeper, which provides complete database replication between your primary and secondary VMs. That way, when the secondary VM is called into service, all the data the primary had been working with is available to the secondary.

Two final points to consider when weighing the cost-effectiveness of configuring SQL Server for HA in the cloud using undersized secondaries are these:

First, you must be careful when picking the size of the undersized target. Cloud instances throttle disk IOPS based upon instance size, so you should check the disk IOPS available to the secondary VM to ensure it will not become a bottleneck for your SQL Server load at failover. Fortunately, on the target VM you will typically be seeing write IOPS, not read IOPS.

Second, while services within an AG or the Windows Server failover cluster manager can automate failover to the secondary VM, it is not possible to automate the resizing of the secondary server; you'll have to do that manually. You should start by configuring an alert that notifies you when a failover occurs. At that point you will need to make a decision: do I upsize the target, or fail back to the original server? Some failures might be transient, in which case moving the workload back to the original server will be your best option for the quickest recovery. However, it's not always obvious why the original server failed, so you may find SQL Server failing over again soon after you fail back. In other cases, such as a service interruption in the availability zone where your primary VM resides, the best option will be to go ahead and resize the undersized VM, since you won't know how long the outage will last.
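As a starting point for that alert, one approach is a small watchdog that periodically asks the clustered instance which node it is currently running on and raises a notification when the answer changes. The sketch below uses Python and pyodbc; the listener name, expected node name, and polling interval are all assumptions to adapt, and the print statements stand in for whatever notification channel you actually use.

```python
# Sketch: poll which cluster node currently hosts the SQL Server FCI and
# flag a failover. Names and interval are hypothetical placeholders.
import time
import pyodbc

LISTENER = "sqlfci.example.com"   # FCI network name (assumed)
EXPECTED_NODE = "SQLNODE1"        # node hosting the full-size primary (assumed)

CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    f"SERVER={LISTENER};DATABASE=master;Trusted_Connection=yes;"
)

while True:
    try:
        conn = pyodbc.connect(CONN_STR, timeout=5)
        # SERVERPROPERTY reports the physical node currently running the FCI.
        node = conn.execute(
            "SELECT CAST(SERVERPROPERTY("
            "'ComputerNamePhysicalNetBIOS') AS sysname)"
        ).fetchval()
        conn.close()
        if node.upper() != EXPECTED_NODE:
            print(f"ALERT: FCI now running on {node}; "
                  "decide whether to upsize this node or fail back.")
    except pyodbc.Error as exc:
        print(f"ALERT: cannot reach the FCI: {exc}")
    time.sleep(60)  # polling interval in seconds (assumed)
```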

Dave Bermingham, Senior Technical Evangelist, SIOS Technology
