Downtime, Outages and Failures - Understanding Their True Costs
- 11 Apr 2019
- Written by: Gad Cohen
This content is brought to you by Evolven. Evolven Change Analytics is a unique AIOps solution that tracks and analyzes all actual changes carried out in the enterprise cloud environment. Evolven helps leading enterprises cut the number of incidents, slash troubleshooting time, and eliminate unauthorized changes.
When it comes to mission-critical applications or data-center performance quality, enterprises are willing to make huge investments. Unfortunately, these investments don’t always fully deliver.
Confronting system downtime
Despite the efforts invested in infrastructure robustness, many IT organizations continue to deal with database, hardware, and software downtime incidents that last from just a few minutes to several days, completely incapacitating the business and causing tremendous losses.
The world of IT failure can sometimes seem paradoxical.
Despite the variety of advanced solutions and the mounting data collected by major enterprise software vendors and IT departments (from ERP to CRM and more), outages are still a real and terrifying threat to the industry.
On the other hand, IT failures have somehow become an inherently accepted, even expected, part of enterprise life.
This is counterintuitive.
IT downtime revisited
While IT professionals confront downtime from time to time and focus entirely on getting on top of it when it happens, the business as a whole feels the ‘financial pain’ of its effects, which tend to be very significant.
In the past, we took an in-depth look at the multiple ways in which IT downtime can impact enterprises’ bottom line (you can read more about it here - Cost and Scope of Unplanned Outages). We looked at different aspects, from direct loss of revenues through reputation damage to indirect effects such as decrease in productivity.
Now, I wish to revisit the issue and examine how organizations should address and assess threats to their IT operations, including systems, applications and data, by analysing solid (and established) benchmarks that represent the potential costs behind downtime and outages.
Measuring big brand failures
When should the industry start measuring the financial impact of big-brand outages, such as the one that recently hit Facebook, the one that hit hundreds of thousands of Lloyds Bank customers, or the Jetstar outage that resulted in hundreds of flight delays?
In other words, at what point is an outage ‘significant enough’ so that a cost analysis becomes valuable to the industry in order to learn from it and predict the impact of future outage incidents?
Well, apparently at some point an outage creates an impact that can’t be ignored, PR-wise. That’s the point of no return, and it is followed by estimates of the financial impact.
Downtime costs vary significantly between industries. The affected business size is obviously a critical factor, but it is not the only major one. The role of the IT systems in the business is also key.
Setting a numerical value behind an IT outage means predefining its implications across multiple business and organizational aspects, so that the whole industry can learn and optimize accordingly.
A failure of a critical application can lead to two distinct types of losses:
- Loss of the application service – the impact of downtime varies according to the application and the business;
- Loss of data – the potential loss of data due to a system outage can have significant legal and financial implications.
Now, I am sure that you would agree that today's data centers should never go down; applications must stay available 24/7, and internal (let alone external) end-users worldwide must be able to rely on data centers’ availability (for critical data and application availability) at all times.
Well, reality bites. In the back office (meaning inside the data center) this is not the case. No organization enjoys 100% uptime. Should you aspire to reach 100%? Sure. But you should also develop a deep understanding of downtime implications and ways to minimize it.
The worst outage nightmare ever? Probably the one that happened to you…
Some past outage incidents turned into PR catastrophes, like the infamous Virgin Blue debacle from 2010, or the recent one that affected Facebook.
Why? The mass impact probably had something to do with it.
As a reminder, the Virgin Blue outage prevented passengers from boarding flights for 11 days (!!) resulting in negative press, damaged reputation, and millions of dollars lost.
To be more accurate: Virgin Blue's reservations management company, Navitaire, ended up compensating Virgin Blue for more than $20 million (Navitaire booking glitch earns Virgin $20M in Compo).
There are many other incidents that still manage to capture the attention of the media. Here’s just one recent article by USA Today about the Wells Fargo outage that prevented customers from accessing their accounts for many hours.
I can safely say that anyone in the IT industry would agree that outages or downtimes are VERY bad for business. They are unwanted, very harmful financially, and must be fought against using all available resources.
Misconfigurations are key
The IT Process Institute's Visible Ops Handbook reported in the past that "80% of unplanned outages are due to ill-planned changes made by administrators ("operations staff") or developers" (Visible Ops).
Enterprise Management Associates (EMA) reported that 60% of availability and performance errors are the result of misconfigurations.
What’s the cost?
Web application downtime can cost companies $5,600 per minute, which works out to more than $300,000 per hour (according to a 2014 Gartner analysis).
[Chart: The average hourly cost of enterprise server downtime, worldwide, 2017-2018]
Application maintenance costs are increasing at an annual rate of 20%, but higher spending can’t solve all of your problems. A past industry survey revealed that configuration errors caused at least a quarter of the downtime reported by respondents (How much will you spend on application downtime this year?).
How common are downtimes or outages?
Ok, downtime can be a financial nightmare. That part is clear. But if you wish to properly estimate the risk outages pose to your business, the immediate question should be “how likely is it to happen?”
Source: Data Center Knowledge
Ok, so outages are way too common to be ignored by thinking “I am not likely to experience a major outage”. Now comes the question of how to calculate their specific risk to your business.
Production and application downtime costs made clear
Unplanned outages are up to IT to resolve. Nevertheless, and as I already mentioned, at the end of the day these outages impact the entire organization.
An important part of a thorough outage risk evaluation is estimating how much money you will lose per hour (or per minute, or any other time increment of your choice) in the event of downtime.
For enterprises that depend solely on data centers' ability to deliver IT and networking services to customers – such as telecommunications service providers or e-commerce companies – downtime can be particularly costly, with the highest cost of a single event topping $1 million (more than $11,000 per minute) according to estimations by experts.
In a USA Today survey of 200 data center managers, over 80% reported that their downtime costs exceeded $50,000 per hour. Over 25% reported downtime costs of over $500,000 per hour (!!).
According to another survey, while companies can't achieve zero downtime, one in every 10 companies said that their availability must be greater than 99.999%.
Source: Searchcio Techtarget
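To put the 99.999% (“five nines”) target in perspective, the short Python sketch below converts an availability percentage into the maximum downtime it allows per year. It is a minimal illustration that assumes a 365-day year and counts every minute of the year rather than only business hours; the function name is hypothetical.

```python
# Assumption: a non-leap 365-day year, counting every minute (not just business hours).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def allowed_downtime_minutes(availability_pct: float) -> float:
    """Maximum downtime per year, in minutes, permitted by an availability target."""
    unavailable_fraction = 1 - availability_pct / 100
    return MINUTES_PER_YEAR * unavailable_fraction

for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% availability -> {allowed_downtime_minutes(target):.2f} minutes/year")
```

At 99.999%, the budget is only about 5.26 minutes of downtime for the entire year, which is why so few organizations commit to it.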
To get a firm understanding of the implications of production and release downtime, let's take a look at how the consequences of downtime are manifested.
Downtime cost - per year or per incident?
A 2017 study revealed that out of 400 IT decision makers, 46% experienced more than four hours of IT-related downtime over 12 months; 23% said that they incurred costs ranging from $12,000 up to more than $1 million per hour.
Over 35% admitted that they are unsure of the cost of an outage to their business.
If you ask Delta Air Lines, which had to cancel 280 flights due to an outage in 2017, the losses of a single outage incident can reach over $150 million.
A couple of years ago, Dun & Bradstreet reported that 59% of Fortune 500 companies experience a minimum of 1.6 downtime hours per week.
If you take the average Fortune 500 company (or any company with at least 10,000 employees) and assume that it pays its employees an average of $56 per hour, then, assuming the downtime affects the entire workforce, the labor cost of downtime alone for an organization of this size would reach $896,000 per week (10,000 × $56 × 1.6 hours), translating to more than $46 million per year (Assessing The Financial Impact Of Downtime).
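The labor estimate above is simple multiplication, and a few lines of Python make the assumptions explicit. This sketch assumes, as in the cited estimate, 10,000 employees paid $56 per hour who are all affected for 1.6 hours of downtime per week, over 52 weeks per year; the function name is hypothetical.

```python
WEEKS_PER_YEAR = 52

def weekly_labor_cost(employees: int, hourly_wage: float, downtime_hours_per_week: float) -> float:
    """Labor cost of downtime, assuming every employee is affected for the full outage."""
    return employees * hourly_wage * downtime_hours_per_week

# Figures from the estimate above: 10,000 employees, $56/hour, 1.6 downtime hours per week.
weekly = weekly_labor_cost(employees=10_000, hourly_wage=56.0, downtime_hours_per_week=1.6)
yearly = weekly * WEEKS_PER_YEAR
print(f"Weekly: ${weekly:,.0f}; yearly: ${yearly:,.0f}")
```

Lowering `employees` to only the staff actually affected is the quickest way to adapt the estimate to a real incident.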
Of course, reality is more complicated: you need to take many parameters into consideration, such as the timing of the event (mid-week or weekend? Daytime or nighttime?) and more. Still, understanding the costs of outages will significantly help you estimate your risk potential and the ROI of tools that can help minimize the effect of downtime incidents.
Has the industry managed to learn from the past and to minimize the collateral damage during an outage?
How have things changed from the past?
So, we already know that downtime and outage incidents still happen today, and that the industry has yet to eliminate them. But how has their cost changed over time? Are these incidents less harmful today?
In 2010, research by Coleman Parkes found that IT downtime incidents collectively cost businesses more than 127 million man-hours per year in employee productivity, an average of 545 man-hours per company.
In 2009, it was reported that the average downtime costs vary considerably across industries, from approximately $90,000 per hour in the media sector to about $6.48 million per hour for large online brokerages (How to quantify downtime).
According to a survey of IT managers conducted during those years, companies are becoming more aware of the direct financial costs of computer downtime. The survey revealed that one in every five businesses loses $12,000 an hour through systems downtime (How to quantify downtime).
As mentioned above, a later analysis performed by Gartner in 2014 reported an average cost of $5,600 per minute, or over $300,000 per hour.
Even as early as 2004, a conservative estimate from Gartner pegged the hourly cost of downtime for computer networks at $42,000. Accordingly, a company that suffers from a worse-than-average downtime of 175 hours per year can lose more than $7 million annually. However, the cost of each outage affects each company differently, so it's important to know how to calculate the precise financial impact (How to quantify downtime).
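The $7 million figure is a flat hourly cost multiplied by annual downtime hours. The minimal sketch below reproduces it; it assumes the hourly cost is constant across every outage hour, which, as the paragraph above notes, will not hold exactly for any real company. The function name is hypothetical.

```python
def annual_downtime_cost(hourly_cost: float, downtime_hours_per_year: float) -> float:
    """Annualized downtime cost, assuming the same cost for every hour of downtime."""
    return hourly_cost * downtime_hours_per_year

# Gartner's conservative 2004 estimate ($42,000/hour) and the 175-hour scenario above.
print(f"${annual_downtime_cost(42_000, 175):,.0f}")  # $7,350,000
```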
It makes sense to believe that the cost of outages only grows with time, since we all lean more heavily on data systems today. That is why figures from the past may need to be scaled up significantly to reflect today’s reality.
Every minute counts
Over ten years ago, the average cost of data center downtime across industries was valued at approximately $5,600 per minute (Unplanned IT Outages Cost More than $5,000 per Minute), a figure which, according to Gartner, remained the same until 2014. A past study by the Ponemon Institute calculated the minimum, median, mean and maximum cost per minute of unplanned outages, based on input from 41 data centers. The greatest cost of an unplanned outage was found to exceed $11,000 per minute.
On average, the cost of an unplanned outage is likely to exceed $5,000 per minute.
It only gets more significant
A 2013 study found an increase of over 41% from the earlier averages described above, with an average cost of more than $7,900 per minute.
An ITIC survey from 2015 clearly showed that the hourly cost of downtime had increased by 25% to 30% compared to data from 2008.
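The increases quoted in this section are plain percent-change calculations. The sketch below checks the 2013 increase using the rounded per-minute figures from this article ($5,600 and $7,900), and shows one hedged way to scale an older figure by the 25% to 30% range ITIC reported. The $100,000 baseline and both function names are purely hypothetical.

```python
def percent_change(old: float, new: float) -> float:
    """Percentage change from an old value to a new one."""
    return (new - old) / old * 100

def scale_by_growth(past_value: float, low_pct: float, high_pct: float) -> tuple[float, float]:
    """Return (low, high) estimates after applying a percentage growth range to a past figure."""
    return past_value * (1 + low_pct / 100), past_value * (1 + high_pct / 100)

# Rounded per-minute averages quoted above: about $5,600 earlier, more than $7,900 in 2013.
print(f"{percent_change(5_600, 7_900):.1f}%")  # 41.1%

# Hypothetical $100,000/hour figure from 2008, scaled by ITIC's reported 25%-30% increase.
low, high = scale_by_growth(100_000, 25, 30)
print(f"${low:,.0f} to ${high:,.0f} per hour")  # $125,000 to $130,000 per hour
```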
Downtime impact per year
A past Gartner analysis calculated that downtime incidents can reach 87 hours per year, on average. Obviously, that’s the sum of many outages, each lasting anywhere from a few minutes to several hours (Average large corporation experiences 87 hours of network downtime a year).
How have things changed?
Later research from 2011 revealed that although the industry has managed to fight the downtime epidemic and reduce the number of incidents, we are still seeing significant downtime hours and huge revenue losses.
The impact on reputation and loyalty
How much is your business reputation worth? This may be extremely difficult to assess, as well as the long-term effect of a damaged reputation and its impact on revenue and profitability.
In this case, downtime costs include lost customers (both short and long term), and other tangible elements that reflect the costs of reputation impairment like stock downturns, marketing hours (crisis and brand recovery management) and media budget required to reboot and polish up an organization's profile.
What parameters should impact your calculation?
When trying to estimate the cost of downtime, there are the obvious direct costs (such as loss of business during the downtime). However, many indirect costs, such as employee overhead or the reputation issues discussed above, should be factored in as well.
Workforce overhead is derived from the cost of burning ‘war-room’ tasks that focus on getting the IT systems back up and running, the cost of being delayed with all other planned tasks, the cost of employee overtime expenses (if applicable), and more. Then there’s the value of data loss, emergency maintenance fees (particularly if the outage occurs during off hours), and additional repair costs that may continue long after service has been restored.
Needless to say, you must calculate these costs when you estimate the implication of downtime, as they are usually very significant; but even a rough guesstimate can prove to be extremely beneficial for understanding the risks and deciding on the required level of technology you should lean on, in order to fight it.
There’s also the impact of lost sales. To have an accurate assessment of the total lost sales, the impact percentage must be increased to reflect the real lifetime value of customers who permanently defect to a competitor (Cost-Unconscious: Denying the True Cost of Network Downtime). For instance, the Facebook (and WhatsApp) outage that I mentioned earlier led over 3 million users (apparently WhatsApp users) to migrate to Telegram. What is the revenue loss resulting from the fact that these users will generate fewer billable ad impressions?
Stock dropped by 25%
Although it's hard to put a number on so many parameters, they are still substantial and significant. For instance, when Amazon.com went offline for several hours during its early days, its stock dropped by 25% in a single day (Cost-Unconscious: Denying the True Cost of Network Downtime)!
In a later Amazon example, a cloud outage left the company scrambling to get its cloud services back online. As a result, many customers questioned the reliability of its cloud and Amazon’s communication surrounding the outage. Others thought they should be compensated for the downtime under their SLA.
I know you are curious: As for the SLA, despite the almost-four-day outage, Amazon's EC2 SLA was not breached (Seven lessons to learn from Amazon's outage).
The cost of downtime: Calculating it yourself
How much are you bound to lose from an unexpected downtime of your servers or business applications?
According to multiple sources, the simplest way to calculate potential revenue losses during an outage is by using this equation:
LOST REVENUE = (GR / TH) x I x H
where:
- GR = gross yearly revenue
- TH = total yearly business hours
- I = percentage impact
- H = number of hours of outage
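The lost-revenue equation translates directly into a small Python function. This is a sketch, assuming I is expressed as a percentage between 0 and 100; the function name and the company in the usage line ($50M yearly revenue, 2,080 business hours from 52 weeks of 40 hours, 60% impact, a 4-hour outage) are hypothetical.

```python
def lost_revenue(gross_yearly_revenue: float, yearly_business_hours: float,
                 impact_pct: float, outage_hours: float) -> float:
    """LOST REVENUE = (GR / TH) x I x H, with I given as a percentage (0-100)."""
    revenue_per_business_hour = gross_yearly_revenue / yearly_business_hours  # GR / TH
    return revenue_per_business_hour * (impact_pct / 100) * outage_hours

# Hypothetical company: $50M yearly revenue, 2,080 business hours, 60% impact, 4-hour outage.
print(f"${lost_revenue(50_000_000, 2_080, 60, 4):,.2f}")  # about $57,692
```

If you track I as a fraction (0.6) rather than a percentage (60), drop the division by 100. To reflect customers who permanently defect, raise I as described above rather than changing the formula.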
How to minimize outage and downtime risk?
Downtime and outages can be catastrophic, but their impact can be reduced. By using solutions that focus on getting to the root of the problem, many outages can be prevented before they even occur.
Evolven developed Evolven Change Analytics, a unique AIOps solution that focuses on changes, the true root cause of performance incidents. Evolven helps enterprise IT and Cloud Ops teams prevent and troubleshoot incidents before the trouble starts.
Contact us to see how we help leading enterprises slash the number of incidents and MTTR.
What is the true cost of system downtime?
Quick downtime calculator
To get a quick estimate of your company's probable downtime costs, use the following formula, based on the size of your business and the number of minutes your most recent incident lasted: Downtime cost = minutes of downtime x cost-per-minute. For small business, use $427 as cost-per-minute.
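The quick formula above is a single multiplication. In this sketch, $427 is the small-business cost-per-minute quoted above and is used only as a default; pass a different rate for larger organizations. The function name and the 45-minute outage in the usage line are hypothetical.

```python
# Rule-of-thumb cost-per-minute for small businesses, as quoted above.
SMALL_BUSINESS_COST_PER_MINUTE = 427

def quick_downtime_cost(minutes_of_downtime: float,
                        cost_per_minute: float = SMALL_BUSINESS_COST_PER_MINUTE) -> float:
    """Downtime cost = minutes of downtime x cost-per-minute."""
    return minutes_of_downtime * cost_per_minute

print(quick_downtime_cost(45))  # 45 minutes at $427/minute -> 19215
```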
Downtime cost is defined as any profit that a company loses when its equipment or network stops functioning. The cost of downtime implies not only direct financial loss; it can also affect your company in at least four other ways.
What is the difference between downtime and outage?
Downtime occurs when a system can't complete its primary function. It can be broken up into two types: IT outages and brownouts. IT brownouts occur when a system is slowed or partially available. This might mean customers can access your site, but pages load slowly or dynamic features like "add to cart" don't function.
What is the meaning of outage cost in business?
Outage Costs means the actual increased costs of replacement energy incurred by Transmission Owner during an Outage calculated in accordance with this section and does not include costs that would have been incurred notwithstanding the Generating Facility interconnection.
TDC is a methodology of analyzing all cost factors associated with downtime, and using this information for cost justification and day-to-day management decisions. Most likely, this data is already being collected in your facility, and needs only to be consolidated and organized according to the TDC guidelines.
What are the three types of downtime?
Common categories of downtime include excessive tool changeover, excessive job changeover, lack of operator, and unplanned machine maintenance.
What is downtime failure?
In industrial environments, downtime may refer to failures in production equipment. This type of downtime is often measured as downtime per work shift or downtime per 12- or 24-hour period. Downtime duration is the period of time when a system fails to perform its primary function.
What are the main causes of downtime?
This can be due to several reasons, including hardware or software failure, human error, malicious attacks or natural disasters. Since unplanned downtime is unexpected and occurs without warning, preventing it can be a challenge.
How do you explain downtime?
A time during a regular working period when an employee is not actively productive; an interval during which a machine is not productive, as during repair, malfunction, or maintenance.
What are the two types of downtime?
Downtime falls into two categories: planned and unplanned. Planned downtime is notable because it offers advance warning and gives users a chance to prepare. Planned downtime is usually done for upgrades or maintenance to the network infrastructure.
How do you calculate maintenance downtime?
1. Divide your total revenue by the planned operating time to get your daily revenue.
2. Assess by how much your daily revenue goes down if the chosen piece of equipment stops working for 1 hour.
What does downtime mean in maintenance?
In manufacturing, “downtime” occurs when an unplanned event halts production for a period of time. This event can be a malfunction, repair, or changeover of tools or equipment. Maintenance downtime in particular is when a machine is not operating or being productive due to required maintenance work.
What does outage mean in project?
A period when a service or an application is not available or when equipment is not operational.
All manufacturing downtime reduces overall output by stopping production. Unplanned downtime can cost 15 times more than planned downtime. The loss of revenue during any type of asset maintenance can be as high as $3 million per incident.
What is the meaning of outage in accounting?
Definition of Outage
Loss of electrical power long enough to interrupt a firm's essential business, data processing system, support services, and/or other activities that may result in loss of income or associated liabilities.
- Not-Utilizing Talent.
- Motion Waste.
- Excess Processing.
Calculating Downtime Cost
The duration of the downtime and the cost incurred per minute you're offline are the two variables that most affect the financial impact of an outage.
How Much Does Downtime Cost a Company?
The average cost of downtime is significant. Each minute costs an average of $9,000, according to the Ponemon Institute, bringing the downtime cost per hour to over $500,000.
What is downtime also called?
DOWNTIME stands for Defect, Overproduction, Waiting, Non-Utilized Talent, Transportation, Inventory, Motion, and Extra Processing.
What are the benefits of downtime?
Downtime gives us time and space to enjoy our personal lives and get personal tasks done. It grants us time with family, friends, and our hobbies. On a brain level, it allows us to reach homeostasis and is a necessary break from the aroused state, Dr. Hanson says.
What is a major outage?
Major Outage means any Power Outage that lasts for at least ten (10) consecutive minutes and/or any Temperature Irregularity, in each case causing inoperability of Customer's Equipment.
Importance of Reducing Unplanned Downtime
Waiting on parts or the necessary personnel to fix an issue takes time and could mean the machine is going to stay down for longer. Longer downtime means less time making product, directly affecting the bottom line.
Consequences of unplanned downtime
Lost productivity and revenue: Every minute of downtime can result in lost productivity and revenue, affecting a business's bottom line. Decreased customer satisfaction: Unplanned downtime can lead to delayed deliveries, canceled orders, and frustrated customers.
After a busy day at work, I look forward to some downtime at home. The kids napped during their downtime. We need to minimize network downtime.
What are some downtime activities?
- Volunteer. There are only a few things that feel better than genuinely making a contribution and helping other people. ...
- Write down everything you're grateful for. ...
- Meditate. ...
- Do something creative. ...
- Spend time in nature. ...
- Organize your space. ...
- Go over and personalize your devices' settings. ...
- Go for Inbox Zero.
World Class Standards For Downtime
Aim for unscheduled downtime to be 10% or less.
This is known as “cost of ownership”, which follows the formula below: Cost of labour + Cost of materials + Suppliers (outsourcing) + Energy + Other Expenses. Please note that this formula only considers routine maintenance activities, minor repairs, and the cost of parts.
What is acceptable downtime?
Maximum allowable downtime denotes the maximum time a business can tolerate the absence or unavailability of a particular business function. Different business functions are likely to have different answers to the allowable downtime equation.
What is the difference between breakdown time and downtime?
Breakdown time is downtime that results from the equipment breaking down. You'd start counting from the time the asset fails to the time you manage to get it up and running again. Equipment downtime, on the other hand, is any amount of time in which a piece of equipment is offline.
How do you handle an outage?
- Acknowledge the issue. ...
- Empathize with impacted customers. ...
- Clearly communicate the scope of the outage. ...
- Focus on customer impact. ...
- Give alternatives where possible. ...
- Don't lay blame; take responsibility. ...
- But do give important context. ...
- Write to your audience's technical level.
What is an example of a planned outage?
Planned outages are deliberate and are scheduled at a convenient time, for example, for the following purposes: database administration, such as offline backup or offline reorganization; software maintenance of the operating system or database server; software upgrades of the operating system or database server.
What is the purpose of outage management?
Outage Management System (OMS), an ADMS solution, provides rapid, real-time information to predict outages, enabling you to respond quickly when faced with extreme weather or excess demand.
What are outages and shutdowns?
A Shutdown, Turnaround, or Outage (STO), three terms that mean the same thing depending on the industrial sector we are acting in, is an event in which an industrial plant or some process units are decommissioned for a planned period to perform maintenance, inspection, regeneration, or revamp.
How much does downtime cost the auto industry?
For example, in the auto industry, downtime can cost up to $50,000 per minute. That's $3 million per hour. The true downtime cost includes a variety of wasted business support costs and lost business opportunity costs because resources were needed to resolve a downtime incident that probably didn't need to happen.
Is database downtime costly?
Database outages can have a significant impact on top-line revenue. In fact, according to a survey conducted by ITIC, 98% of organizations say a single hour of downtime costs over $100,000, while 81% report that it costs over $300,000. And that's just for a single hour!
What is the average cost of downtime in a data center?
According to Gartner, downtime costs $5,600 per minute on average. This results in average costs between $140,000 and $540,000 per hour depending on the organization. Some factors that contribute to the costs associated with downtime include: Lost sales.
Is the auto shortage getting better?
The Auto Chip Shortage Remains, But It May Be Improving
However, if Fiorani's estimate holds true, it would mark a significant improvement for the industry. More than 10.5 million vehicles were cut from production in 2021, according to Auto News.
Diagnostic Labor – This requires significantly more training than a repair laborer, as well as different tools, both of which require training and exact a significant expense. Repair Labor – This requires a significant amount of training and experience, which master technicians take many years to accrue.
The first way to measure your equipment downtime is in actual time. For a given asset (or set of assets), record the amount of time during each month that the asset is broken down. Keeping a running tally and comparing it to past months will help you know when an asset is having more issues than normal.
How does downtime affect a business?
Repeated downtime events can result in unhappy customers, which can quickly translate into bad customer reviews and tarnished brand image. Data Loss: Downtime affects not only your business but your clients as well. Downtime due to cyberattacks, server or network outage can result in corrupt, damaged or stolen data.
What are the financial impacts of downtime?
The cost of downtime = downtime duration x per-minute cost.
You can use around $400 as a cost-per-minute figure for small enterprises. In the case of large and medium businesses, use $10,000. Many people only associate downtime costs with lost revenue.
In other words, data downtime refers to the periods of time when data quality is bad or the data is unavailable. You can't do anything without data or even with bad data. For example, let's say you're forecasting stocks using Twitter as your data source. If Twitter is down, you won't have any data to use for forecasting.
How do you calculate downtime cost per hour?
The cost per hour of downtime is calculated by adding labor costs per hour to the revenue lost per hour.
What is average downtime?
Average downtime is usually built into the price of goods produced to recover its costs through the sales revenue. Opposite of "uptime." Also called "waiting time."