Building Highly Available IT Infrastructure – a Paradox of 9s

Article by Sunil Gupta

Filed under: IMS


Most IT managers assume that more 9s is always better, but two questions deserve consideration first: what is the cost of each additional 9, and how many 9s are actually needed?

Mandar Kulkarni (Senior Vice President - Solutions Engineering and Private Cloud Practice at Netmagic Solutions)

High availability (HA) is a measure of the uptime required to run your IT infrastructure. Simply put, for a CIO it translates into two things: peace of mind and growth for the business. If a CIO is assured of the availability (read: uptime) of his business applications, he can be confident that the business is running optimally at all times. HA is the measure of this assurance.

There is a fundamental change needed in how people think about availability: it is not negotiated or bought, and it is not a commercial discussion.

The table below shows the acceptable downtime, from an availability perspective, for various levels of 9s.

SLA       Acceptable downtime per month (minutes)   Acceptable downtime annualized (minutes)
99%       432                                       5,184
99.90%    43.2                                      518.4
99.99%    4.32                                      51.84
99.995%   2.16                                      25.92
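The arithmetic behind the table is simple: each extra 9 cuts the permitted downtime by a factor of ten. A minimal Python sketch, assuming a 30-day (43,200-minute) month as the table does:

```python
# Acceptable downtime for a given SLA, assuming a 30-day month,
# matching the figures in the table above.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200

def downtime_minutes(sla_percent: float) -> tuple[float, float]:
    """Return (monthly, annualized) acceptable downtime in minutes."""
    monthly = MINUTES_PER_MONTH * (1 - sla_percent / 100)
    return monthly, monthly * 12

for sla in (99.0, 99.9, 99.99, 99.995):
    monthly, yearly = downtime_minutes(sla)
    print(f"{sla}% -> {monthly:g} min/month, {yearly:g} min/year")
```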

There is a standard approach to go from 99 to 99.9 to 99.99.

Achieving 99% Availability

Broadly and briefly, achieving 99% availability means the infrastructure is allowed 432 minutes of downtime each month, which is a little over 7 hours. In most professionally run IT setups, the recovery time is roughly 2-4 hours, depending on the kind of IT operations being run.

Look at diagram 1.2 (a) below. Here you are allowed a single point of failure, whether at the power, IT systems, bandwidth, or data centre air-conditioning level, and you have enough time to recover from the failure.

Diagram 1.2 (a): Setup for 99% availability

Diagrams 1.2 (b), (c) and (d) show outages of the architecture built for 99% availability.


Moving from 99% to 99.9% means that the acceptable downtime is only 43.2 minutes per month. Here, the target recovery time mentioned above exceeds this acceptable downtime.

So how does one avoid an outage? Isolate all single points of failure and create redundancies. As shown in diagram 1.3 (a), there are two ISPs, redundant power inputs, clustered servers and network devices, redundant air conditioning, and so on.

Diagram 1.3 (a): Setup for 99.9% availability


In this architecture, failure at any single point does not result in end user outage.

Failure at any single point also allows ample time to switch to the redundancy that is built into the architecture. The 43.2-minute budget is enough only to shift to the redundant device or setup, not to recover from a failure.
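The budget-versus-recovery argument can be checked in a couple of lines. This is an illustrative sketch: the 2-4 hour recovery figure comes from earlier in the article, while the 10-minute failover time is an assumed value.

```python
# Sketch: does a given recovery action fit inside the monthly downtime budget?
# A 3-hour full recovery fits the 99% budget (432 min) but not the
# 99.9% budget (43.2 min); a quick failover to a redundant device does.

def fits_budget(action_minutes: float, sla_percent: float,
                minutes_per_month: float = 43_200) -> bool:
    budget = minutes_per_month * (1 - sla_percent / 100)
    return action_minutes <= budget

print(fits_budget(180, 99.0))   # 3-hour recovery vs 432-minute budget
print(fits_budget(180, 99.9))   # 3-hour recovery vs 43.2-minute budget
print(fits_budget(10, 99.9))    # 10-minute failover (assumed) vs 43.2 minutes
```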


The move from 99% to 99.9% significantly increases the cost: it almost doubles the cost of the setup, roughly 100% incremental cost over the infrastructure built for 99% availability.

99.99% availability means only 4.32 minutes of outage is acceptable per month. This essentially means there is no time even to move workloads or shift to redundant infrastructure pieces.

Here, a highly available architecture like the 99.9% setup of diagram 1.3 (a) is paired with an active-passive disaster recovery site that looks like the 99% setup of diagram 1.2 (a); see diagram 1.4 (a). It is a fault-tolerant architecture that can sustain two system failures in parallel. The need here is to have replication and DR management tools in place, such as Sanovi / SRM, to invoke DR in case of multiple system failures or a site-level disaster.
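One way to see why a 99.9% primary site backed by a 99% DR site can reach roughly four nines is the standard reliability arithmetic for independent components (an assumption the article does not state explicitly): availabilities in series multiply, while redundant (parallel) paths combine as one minus the product of the unavailabilities.

```python
# Standard series/parallel availability arithmetic for independent components.
# This is textbook reliability math, not a claim from the article; real sites
# share failure modes, so treat the result as an upper bound.
from math import prod

def series(*avail: float) -> float:
    """Components that must all be up: availabilities multiply."""
    return prod(avail)

def parallel(*avail: float) -> float:
    """Redundant components: 1 minus the product of unavailabilities."""
    return 1 - prod(1 - a for a in avail)

# A 99.9% primary site with a 99% active-passive DR site:
combined = parallel(0.999, 0.99)
print(f"{combined:.6f}")  # 0.999990, i.e. roughly four nines
```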


Diagram 1.4 (a): Setup for 99.99% availability


Achieving 99.995% availability at the infrastructure level is a different ball game altogether. While it means adding an automation and virtualization layer over the 99.99% architecture, the implications and complexity involved are quite different.

The complexity is not limited to having a global load balancer at the primary site; the applications themselves must support this architecture. Both the infrastructure and the application layer need to be architected to achieve 99.995% availability. See diagram 1.5 (a).


Diagram 1.5 (a): Setup for 99.995% availability

Diagram 1.5 (b): Setup for 99.995% availability

In Conclusion

The 99.99% and 99.995% availability tiers are used for financial systems with very high transaction volumes (millions of transactions per minute): stock markets, core banking systems, billing applications at telcos, and so on. They are also used by organizations with a regulatory mandate for this level of resilience, and by those doing business transactions where a single minute of outage causes significant financial losses.

The most widely consumed tier, 99.9% availability, is typical for the tier 1 applications run by companies with 24x7 operations. E-commerce websites, SAP environments, Internet banking websites, and the systems of international companies accessed around the clock run on 99.9% availability.

Environments that opt for 99% availability include UAT (User Acceptance Testing), QA (Quality Assurance), file systems, intranets, and internal applications such as HRMS and collaboration sites. Internet companies without serious financial penalties for downtime and most corporate websites also run on 99% availability.