Cloudburst: Hard lessons learned from the OVH datacenter blaze


In every tabletop disaster-recovery exercise in every enterprise IT shop, there's a moment when attention grudgingly shifts from high-profile threats (malicious intrusion, data theft, ransomware) to more mundane (and seemingly less likely) threats, like natural disasters, accidents, and low-tech turmoil.

What hurricanes, explosions, earthquakes, fires, and floods lack in cybersecurity panache, they often make up for in ferocity. The history is clear: CIOs need to put more emphasis on force majeure (an act of God or moment of mayhem that threatens data availability at scale) when making their plans.

On Christmas Day 2020, a bomb packed into an RV decimated a section of downtown Nashville, Tennessee. The collateral damage included a crippled AT&T transmission facility, which disrupted communications and network traffic across three states and grounded flights at Nashville International Airport. Outages for business clients and their customers lasted through the rest of the holiday season.

This week brought even more stark evidence of the disruptive power of calamity. One of Europe's largest cloud hosting firms, OVH Groupe SAS, better known as OVHCloud, suffered a catastrophic fire at its facility in Strasbourg, France. The blaze, in a cluster of boxy, nondescript structures (actually stacks of shipping containers repurposed to save on construction costs), completely destroyed one of OVH's four datacenters at the site and heavily damaged another.

OVH officials were quick to sound the alarm, with founder and chair Octave Klaba warning that it could take weeks for the firm to fully recover and urging clients to implement their own data recovery plans.

Assuming they had them. Many did not.

Scarcely protected data remains a significant problem for businesses of all stripes and sizes. In 2018, Riverbank IT Management in the U.K. found that 46% of SMEs (small and mid-size enterprises) had no plan in place for backup and recovery. Most companies (95%) failed to account for all of their data, on-premises and in the cloud, in whatever backup plans they did have.

The results of such neglect are costly. According to Gartner, data-driven downtime costs the average company roughly $5,600 per minute, which works out to more than $300,000 per hour. The destruction at the OVH facility on the banks of the Rhine near the German border took down 3.6 million websites, from government agencies to financial institutions to computer gaming companies, many of which remain dark as of this writing. Affected customers complained on blogs and social media that years' worth of data was lost for good in the OVH conflagration. The final financial tally will be staggering.

"Not all data catastrophes are caused by a hoodie-wearing, Eastern European hacker," said Kenneth R. van Wyk, president and principal consultant at KRvW Associates, a security consultancy and training company in Alexandria, Virginia. "Some are caused by the most mundane circumstances."

"Sure, we need to consider modern security threats like ransomware, [but] let's never forget the power of a backhoe ripping through a fiber optic line feeding a business-critical datacenter."

"It's about a mindset of always expecting the worst," van Wyk said. "Security professionals look at systems and immediately ask, 'What could go wrong?' Every business owner should do the same."

In this age of ubiquitous cloud migration and digital transformation, what can IT leadership do to gird the organization against hazards large and small? The answer lies within the realm of business continuity and disaster recovery (BCDR). This well-codified discipline in information security is a critical, but often missing, piece in enterprise risk management and mitigation. Most organizations understand the basic rules of engagement when it comes to BCDR, but security experts agree that execution often lacks rigor and commitment.

"As a CIO, I'd immediately ask, 'Have we truly tested our backups and recovery capability?'" said cloud security specialist Dave Shackleford, founder and principal consultant at Voodoo Security in Roswell, Georgia. "Whether cloud-based or not, too many organizations turn disaster recovery and business continuity planning and testing into paper exercises without really ensuring they're effective."
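What "truly tested" looks like in practice is a restore drill, not a green check mark on the backup job. As a rough sketch only (the archive path, manifest format, and scratch directory below are hypothetical, and any real drill should use your own backup tooling), a scheduled script can restore the most recent backup into an isolated location and verify file checksums against a manifest, failing loudly if anything is missing or corrupt:

# Minimal restore-drill sketch using only the Python standard library.
# Paths and manifest format are assumptions for illustration; the point is to
# prove a backup can actually be restored and verified, not that the backup
# job merely exited successfully.
import hashlib
import json
import tarfile
from pathlib import Path

BACKUP_ARCHIVE = Path("/backups/latest/app-data.tar.gz")  # assumed location
MANIFEST = Path("/backups/latest/manifest.json")          # {"relative/path": "sha256", ...}
SCRATCH = Path("/tmp/restore-drill")                      # isolated restore target

def run_restore_drill() -> bool:
    SCRATCH.mkdir(parents=True, exist_ok=True)
    with tarfile.open(BACKUP_ARCHIVE, "r:gz") as archive:
        archive.extractall(SCRATCH)                       # restore into scratch space

    expected = json.loads(MANIFEST.read_text())
    failures = []
    for relpath, expected_sha in expected.items():
        restored = SCRATCH / relpath
        if not restored.exists():
            failures.append(f"missing: {relpath}")
            continue
        actual_sha = hashlib.sha256(restored.read_bytes()).hexdigest()
        if actual_sha != expected_sha:
            failures.append(f"checksum mismatch: {relpath}")

    for problem in failures:
        print(problem)
    return not failures

if __name__ == "__main__":
    ok = run_restore_drill()
    print("restore drill passed" if ok else "restore drill FAILED")

Running a drill like this on a schedule, and treating a failure as an incident, is the difference between a paper exercise and tested recovery capability.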

For organizations looking to protect key digital assets, what Shackleford deems an effective BCDR approach begins with a few time-tested best practices.

Ask about redundancy and geographic resilience, and get it in writing. Losing two cloud datacenters will always result in disruption and downtime, even for a host like OVH, with 300,000 servers in 14 facilities across Europe and 27 worldwide. But how painful and protracted that loss is will largely depend on the robustness of the hosting company's own backup and fail-over protocols.

The assurances, as spelled out in the service-level agreement (SLA), must also go beyond data processing and storage. A big part of Roubaix-based OVH's troubles stemmed from the failure of backup power supplies, which damaged its own custom-built servers even in areas unaffected by the actual fire.

Look for items in the SLA that address not only the service guarantee but also the eligibility for compensation and level of compensation offered. Offering five-nines availability is great, but the host should also demonstrate a commitment to diverse transit connections; multiple sources of power; redundant networking devices; and multiple, discrete storage assets on the backend.
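It helps to be concrete about what those availability figures actually permit. "Five nines" means 99.999% uptime, which still allows a small but nonzero window of downtime each year; a quick back-of-the-envelope calculation (illustrative only) makes the trade-off plain:

# Quick arithmetic on what an availability guarantee actually permits.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for label, availability in [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]:
    allowed_downtime = MINUTES_PER_YEAR * (1 - availability)
    print(f"{label}: about {allowed_downtime:.1f} minutes of downtime per year")

# three nines: about 525.6 minutes of downtime per year
# four nines:  about 52.6 minutes of downtime per year
# five nines:  about 5.3 minutes of downtime per year

Five nines works out to roughly five minutes of permitted downtime per year, and each nine you give up adds an order of magnitude, which is why the compensation terms matter as much as the headline percentage.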

Holding your cloud host accountable is a solid start, but it's important to remember that, as the OVH experience casts in stark relief, enterprise-grade cloud is not some mythical realm of infinite resources and eternal uptime. Moving important digital assets to the cloud means swapping your own infrastructure for that of another, for-profit vendor partner.

The first requirement is to establish a framework for determining whether a move to the cloud is wise and workable in the first place. Then there needs to be a comprehensive plan in place to protect everything the organization holds dear.

"Inventory all your critical assets," van Wyk suggests. "Ask how much it would cost you if any of them were unavailable, for any reason, for an hour, a day, a week. Ask how you would restore your business if everything in your inventory vaporized. What would the downtime be? Can you afford that? What is your Plan B?"
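One way to make that exercise concrete, sketched below with purely illustrative asset names and figures, is to keep the inventory in a simple script that pairs each asset's hourly cost of unavailability with a realistic recovery time and the downtime the business can actually tolerate:

# A toy version of van Wyk's exercise. All assets, costs, and hours are
# invented for illustration; substitute your own inventory and estimates.
from dataclasses import dataclass

@dataclass
class Asset:
    name: str
    cost_per_hour: float       # revenue/productivity lost while unavailable
    recovery_hours: float      # realistic time to restore from Plan B
    max_tolerable_hours: float # downtime the business can actually absorb

inventory = [
    Asset("customer-facing web store", 25_000, 4, 2),
    Asset("internal ERP",               8_000, 12, 24),
    Asset("email and collaboration",    2_000, 6, 8),
]

for asset in inventory:
    exposure = asset.cost_per_hour * asset.recovery_hours
    status = "OK" if asset.recovery_hours <= asset.max_tolerable_hours else "GAP"
    print(f"{asset.name}: ~${exposure:,.0f} per outage [{status}]")

Anything flagged as a gap, where recovery takes longer than the business can afford, is where the Plan B conversation needs to start.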

The Cloud Security Alliance offers excellent guidance for preparing, analyzing, and justifying cloud projects with an eye toward risk, particularly through its Cloud Controls Matrix (CCM).

If third-party hosting is warranted, it should be guided by formal policy that covers issues such as:

Understand that failures are going to happen. Backup and recovery is so fundamental to the security triad of data confidentiality, integrity, and availability (CIA) that it gets its own function (Recover) in the NIST Cybersecurity Framework. NIST's CSF encourages organizations to ensure that "recovery processes and procedures are executed and maintained to ensure timely restoration of systems or assets affected by cybersecurity incidents."

There's a lot going on in that sentence, to be sure.

Developing a robust approach to recovery that can satisfy NIST and withstand a catastrophic event like the OVH fire takes more than scheduling some automated backups and hoping for the best.

Van Wyk said it's a good idea to take extra precautions with your vital business data and processing, and to ensure you will actually be able to use your backup plans in different emergency scenarios.

Whether an organization's crown jewels live on-premises, in a hybrid environment, or solely in the cloud, a mature and pragmatic BCDR approach should include:

No BCDR plan can ward off all chaos and guarantee perfect protection. But as the OVH incident demonstrates, half-hearted policies and incomplete protocols are about as effective as no plan at all. Establishing a solid BCDR posture requires meaningful investment in resources, time, and capital. The payoff comes when the lights flicker back on and rebooted systems go back online, data intact and none the worse for the experience.
