10 things to know about data-center outages – Network World

The severity of data-center outages appears to be falling, while the cost of outages continues to climb. Power failures are the biggest cause of significant site outages. Network failures and IT system glitches also bring down data centers, and human error often contributes.

Those are some of the problems pinpointed in the most recent Uptime Institute data-center outage report, which analyzes types of outages, their frequency, and what they cost both in money and consequences.

Uptime cautions that data relating to outages should be treated skeptically given the lack of transparency of some outage victims and the quality of reporting mechanisms. "Outage information is opaque and unreliable," said Andy Lawrence, executive director of research at Uptime, during a briefing about Uptime's Annual Outages Analysis 2023.

"While some industries, such as airlines, have mandatory reporting requirements, there's limited reporting in other industries," Lawrence said. "So we have to rely on our own means and methods to get the data. And as we all know, not everybody wants to share details about outages for a whole variety of reasons. Sometimes you get a very detailed root-cause analysis, and other times you get pretty well nothing," he said.

The Uptime report culled data from three main sources: Uptime's Abnormal Incident Report (AIRs) database; its own surveys; and public reports, which include news stories, social media, outage trackers, and company statements. The accuracy of each varies. Public reports may lack details and sources might not be trustworthy, for example. Uptime rates its own surveys as producing fair/good data, since the respondents are anonymous and their job roles vary. AIRs quality is deemed very good, since it comprises detailed, facility-level data voluntarily shared by data-center owners and operators among their peers.

There's evidence that outage rates have been gradually falling in recent years, according to Uptime.

That doesn't mean the total number of outages is shrinking. In fact, the number of outages globally increases each year as the data-center industry expands. "This can give the false impression that the rate of outages relative to IT load is growing, whereas the opposite is the case," Uptime reported. "The frequency of outages is not growing as fast as the expansion of IT or the global data-center footprint."

Overall, Uptime has observed a steady decline in the outage rate per site, as tracked through four of its own surveys of data-center managers and operators conducted from 2020 to 2022. In 2022, 60% of survey respondents said they had an outage in the past three years, down from 69% in 2021 and 78% in 2020.

"There seems to be a gently, gently improving picture of the outage rate," Lawrence said.

While 60% of data-center sites have experienced an outage in the past three years, only a small proportion are rated serious or severe.

Uptime measures the severity of outages on a scale of one to five, with five being the most severe. Level 1 outages are negligible and cause no service disruptions. Level 5 mission-critical outages involve major and damaging disruption of services and/or operations and often include large financial losses, safety issues, compliance breaches, customer losses, and reputational damage.

Level 5 (severe) and Level 4 (serious) outages historically account for about 20% of all outages. In 2022, outages in the serious/severe categories fell to 14%.

A key reason is that data-center operators are better equipped to handle unexpected events, according to Chris Brown, chief technical officer at Uptime. "We've become much better at designing systems and managing operations to a point where a single fault or failure does not necessarily result in a severe or serious outage," he said.

Today's systems are built with redundancy, and operators are more disciplined about creating systems that are capable of responding to abnormal incidents and averting outages, Brown said.

When outages do occur, they are becoming more expensive, a trend that is likely to continue as dependency on digital services grows.

Looking at the last four years of Uptime's own survey data, the proportion of major outages that cost more than $100,000 in direct and indirect costs is increasing. In 2019, 60% of outages fell under $100,000 in terms of recovery costs. In 2022, just 39% of outages cost less than $100,000.

Also in 2022, 25% of respondents said their most recent outage cost more than $1 million, and 45% said their most recent outage cost between $100,000 and $1 million.

Inflation is part of the reason, Brown said; the costs of replacement equipment and labor are higher.

More significant is the degree to which companies depend on digital services to run their businesses. The loss of a critical IT service can be tied directly to disrupted business and lost revenue. "Any of these outages, especially the serious and severe outages, have the ability to impact multiple organizations, and a larger swath of people," Brown said, "and the cost of having to mitigate that is ever increasing."

As more workloads are outsourced to external service providers, the reliability of third-party digital infrastructure companies is increasingly important to enterprise customers, and these providers tend to suffer the most public outages.

Third-party commercial operators of IT and data centers (cloud providers, digital service providers, telecommunications providers) accounted for 66% of all the public outages tracked since 2016, Uptime reported. Looked at year by year, the percentage has been creeping up. In 2021 the proportion of outages caused by cloud, colocation, telecommunications, and hosting companies was 70%, and in 2022 it was up to 81%.

"The more that companies push their IT services into other people's domain, they're going to have to do their due diligence, and also continue to do their due diligence even after the deal is struck," Brown said.

While it's rarely the single or root cause of an outage, human error plays some role in 66% to 80% of all outages, according to Uptime's estimate based on 25 years of data. But it acknowledges that analyzing human error is challenging. Shortcomings such as improper training, operator fatigue, and a lack of resources can be difficult to pinpoint.

Uptime found that human error-related outages are mostly caused either by staff failing to follow procedures (cited by 47% of respondents) or by the procedures themselves being faulty (40%). Other common causes include in-service issues (27%), installation issues (20%), insufficient staff (14%), preventative maintenance-frequency issues (12%), and data-center design or omissions (12%).

On the positive side, investing in good training and management processes can go a long way toward reducing outages without costing too much.

"You don't need to go to a banker and get a bunch of capital money to solve these problems," Brown said. "People need to make the effort to create the procedures, test them, make sure they're correct, train their staff to follow them, and then have the oversight to ensure that they truly are following them."

"This is the low-hanging fruit to prevent outages, because human error is implicated in so many," Lawrence said.

Uptime said its current survey findings are consistent with previous years and show that on-site power problems remain the biggest cause of significant site outages by a large margin. This is despite the fact that most outages have several causes, and that the quality of reporting about them varies.

In 2022, 44% of respondents said power was the primary cause of their most recent impactful incident or outage. Power was also the leading cause of significant outages in 2021 (cited by 43%) and 2020 (37%).

Network issues, IT system errors, and cooling failures also stand out as troubling causes, Uptime said.

Uptime used its own data, from its 2023 Uptime resiliency survey, to dig into network outage trends. Among survey respondents, 44% said their organization had experienced a major outage caused by network or connectivity issues over the past three years. Another 45% said no, and 12% didn't know.

The two most common causes of networking- and connectivity-related outages are configuration or change management failure (cited by 45% of respondents) and a third-party network provider's failure (39%).

Uptime attributed the trend to today's network complexity. "In modern, dynamically switched and software-defined environments, programs to manage and optimize networks are constantly revised or reconfigured. Errors become inevitable, and in such a complex and high-throughput environment, frequent small errors can propagate across networks, resulting in cascading failures that can be difficult to stop, diagnose, and fix," Uptime reported.

Other common causes of major network-related outages include:

When Uptime asked respondents to its resiliency survey if their organization experienced a major outage caused by an IT systems or software failure over the past three years, 36% said yes, 50% said no, and 15% didn't know. The most common causes of outages related to IT systems and software are:

Publicly recorded outages, which include outages that are reported in the media, reveal a wide range of causes. The causes can differ from what data-center operators and IT teams report, since the media sources' knowledge and understanding of outages depends on their perspective. "What's really interesting is the sheer variety of causes, and that's partly because this is how the public and the media perceive them," Lawrence said.

Fire is one cause that showed up among publicly reported outages but didn't rank highly among IT-related sources. Specifically, Uptime found that 7% of publicly reported data-center outages were caused by fires. In the web briefing, Uptime researchers related the incidence of data-center fires to increasing use of lithium-ion (Li-ion) batteries.

Li-ion batteries have a smaller footprint, simpler maintenance, and longer lifespan compared to lead-acid batteries. However, Li-ion batteries present a greater fire risk. "A Maxnod data center in France suffered a devastating fire on March 28, 2023, and we believe it's caused by lithium-ion battery fire," Lawrence said. A lithium-ion battery fire is also the reported cause of a major fire on Oct. 15, 2022, at a South Korea colocation facility owned by SK Group and operated by its C&C subsidiary.

"We find, every time we do these surveys, fire doesn't go away," Lawrence said.
