Building Highly Available IT Infrastructure – a Paradox of 9s
High availability (HA) is a measure of the uptime required to run your IT infrastructure. Simply put, for a CIO it translates into two things: peace of mind and growth for the business. A CIO who is assured of the availability (read: uptime) of business applications can be confident that the business is running optimally at all times. HA is the coefficient of this assurance.
There is a fundamental change needed in how people think about availability. It is not negotiated or bought – it is not a commercial discussion.
The table below highlights the acceptable downtime – from an availability perspective – for various 9s.
| SLA | Acceptable downtime per month (minutes) | Acceptable downtime per year (minutes) |
|---------|--------|----------|
| 99% | 432 | 5,256 |
| 99.9% | 43.2 | 525.6 |
| 99.99% | 4.32 | 52.56 |
| 99.995% | 2.16 | 26.28 |
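The downtime budget for any SLA follows directly from the availability percentage. A minimal sketch of the arithmetic, assuming a 30-day (43,200-minute) month and a 365-day (525,600-minute) year:

```python
# Downtime budget implied by an availability SLA.
# Assumes a 30-day month and a 365-day year (illustrative conventions;
# contracts may define the measurement period differently).

def downtime_minutes(sla_percent: float, total_minutes: float) -> float:
    """Minutes of acceptable downtime for a given SLA over a period."""
    return total_minutes * (1 - sla_percent / 100)

MINUTES_PER_MONTH = 30 * 24 * 60   # 43,200
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600

for sla in (99.0, 99.9, 99.99, 99.995):
    monthly = downtime_minutes(sla, MINUTES_PER_MONTH)
    yearly = downtime_minutes(sla, MINUTES_PER_YEAR)
    print(f"{sla}%: {monthly:.2f} min/month, {yearly:.2f} min/year")
```

Each extra 9 divides the downtime budget by ten, which is why every step up the table demands a qualitatively different architecture rather than incremental tuning.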
There is a standard approach to go from 99 to 99.9 to 99.99.
Achieving 99% Availability
Broadly and briefly, achieving 99% availability means the infrastructure is allowed 432 minutes of downtime each month – roughly a little over 7 hours. In most professionally run IT setups, the recovery time is roughly 2–4 hours, depending on the kind of IT operations being run.
Look at diagram 1.2 (a) below. Here you are allowed a single point of failure – at the power, IT systems, bandwidth or air-conditioning level of the DC – and you have enough time to recover from the failure.
Diagrams 1.2 (b), (c) and (d) show outages of the architecture built for 99% availability.
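The reason a setup full of single points of failure tops out at this tier can be sketched with standard series-reliability arithmetic (an illustrative simplification, not a model from the article): components in a chain must all be up for the system to be up, so their availabilities multiply.

```python
# Series availability: a chain of single points of failure.
# The system is up only when every component in the chain is up.

def series_availability(*component_availabilities: float) -> float:
    """Multiply the availability of components arranged in series."""
    availability = 1.0
    for a in component_availabilities:
        availability *= a
    return availability

# Power, bandwidth, servers and cooling each at 99.75% (hypothetical
# figures chosen for illustration):
result = series_availability(0.9975, 0.9975, 0.9975, 0.9975)
print(f"{result:.4f}")  # prints 0.9900 – the chain lands just under 99%
```

Four individually decent components in series already eat the whole 99% budget, which is why removing single points of failure is the only route to the next 9.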
Moving from 99% to 99.9% means the acceptable downtime is only 43.2 minutes per month. Here, the target recovery time mentioned above (2–4 hours) is greater than this acceptable downtime.
So how does one avoid an outage? Isolate all single points of failure and create redundancies. As shown in diagram 1.3 (a), there are two ISPs, two power inputs, clustered servers and network devices, redundant air-conditioning, and so on.
Diagram 1.3 (a): Setup for 99.9% availability
In this architecture, failure at any single point does not result in end user outage.
Failure at any single point also allows ample time to switch to the redundancy that is built into the architecture. The 43.2 minutes is enough only to shift to the redundant device or setup – not to recover from a failure.
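The payoff of this redundancy can be sketched with the complementary parallel-reliability formula (again an illustrative simplification assuming independent failures): redundant components fail together only when every one of them is down.

```python
# Parallel availability: redundant components backing each other up.
# The system is down only when every redundant component is down,
# assuming independent failures and instantaneous switchover
# (idealizations for illustration).

def parallel_availability(*component_availabilities: float) -> float:
    """Availability of redundant components arranged in parallel."""
    unavailability = 1.0
    for a in component_availabilities:
        unavailability *= (1 - a)
    return 1 - unavailability

# Two independent components, each at 99%:
result = parallel_availability(0.99, 0.99)
print(f"{result:.4f}")  # prints 0.9999 – two 9s in parallel yield four 9s
```

This is the mathematical reason duplicating the setup roughly doubles the cost yet buys a disproportionate jump in availability.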
The move from 99% to 99.9% significantly increases the cost. It almost doubles the cost of the setup – about 100% incremental cost over the infrastructure built for 99% availability.
99.99% availability means 4.32 minutes of acceptable outage per month. This essentially means that one does not even have time to move workloads or shift to redundant infra pieces.
A 99.99% setup combines the highly available architecture of diagram 1.3 (a) (99.9% availability) with an active-passive disaster recovery site that itself looks like the 99% architecture of diagram 1.2 (a) – see diagram 1.4 (a). It is a fault-tolerant architecture that can sustain two system failures in parallel. The need here is to have replication and DR management tools in place, such as Sanovi / SRM, to invoke DR in case of multiple system failures or a site-level disaster.
Diagram 1.4 (a): Setup for 99.99% availability
Achieving 99.995% availability at the infrastructure level is a different ball game altogether. While it means adding an automation and virtualization layer over the 99.99% architecture, the implications and complexity involved are quite different.
The complexity is not limited to having a global load balancer at the primary site: applications too must support this architecture. Both the infrastructure and the application layer have to be architected to achieve 99.995% availability. See diagram 1.5 (a).
Diagram 1.5 (a): Setup for 99.995% availability
The 99.995% and 99.99% availability tiers are used for financial systems with very high transaction volumes – millions of transactions per minute – such as stock markets, core banking systems and telco billing applications. They are also used in organizations with a regulatory mandate for this level of resilience, and in businesses where a single minute of outage causes significant financial losses.
The most commonly consumed tier, 99.9% availability, is typically used for tier-1 applications run by companies with 24x7 operations. E-commerce websites, SAP environments, Internet banking websites, and international companies accessed around the clock run on 99.9% availability.
Environments that opt for 99% availability include UAT (User Acceptance Testing), QA (Quality Assurance), file systems, and internal applications such as HRMS, Intranet sites and collaboration sites. Internet companies without serious financial penalties for downtime, and most corporate websites, also run on 99% availability.
Sunil's expertise spans managing and growing a datacenter business. His experience includes driving business development, service delivery and assurance, revenue assurance, and back-office infrastructure operations and growth. Prior to joining Netmagic, he spent over a decade at Reliance Communications where, as Senior Vice President, he helped build and manage their Internet Datacenter business. He is a regular speaker and panelist at major industry seminars and trade shows. Sunil has over 18 years of IT and communications industry experience. He has held senior executive positions at Global Telecom Services, Iris Software and Angel Solutions. Sunil has an Engineering degree with a specialization in Electronics and Telecommunications as well as an MBA in Marketing Management.