Causes of Downtime in Datacenters and How to Overcome Them
Author: Nilesh Rane | Date: November 30, 2015 | Category: Datacenter
What are some of the causes of downtime in a data center? Equipment failure, bad design, or human error? The reason could be a combination of all three, but human error is possibly the biggest culprit, and not without good reason. Let us analyze each of these factors individually.
Equipment failure is one cause of downtime, and the risk can be reduced by purchasing good-quality hardware from reputable vendors. The IT assets in a data center (servers, racks, networking equipment, cabling, etc.) are always running and constantly generating heat. If the temperature of servers, storage and other equipment rises above the prescribed limits, performance degrades, which can lead to system failure and eventually to downtime in the data center. Proper ventilation, cooling and air conditioning reduce the threat of downtime from overheated systems.
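To make the idea of "prescribed limits" concrete, here is a minimal sketch of a threshold check over temperature readings. The function name, the rack names and the 27 °C limit are assumptions chosen for illustration (27 °C is the upper end of the commonly cited ASHRAE recommended inlet range); the limit that actually applies should come from your equipment vendor's specification.

```python
from typing import Dict, List

# Assumed limit for illustration only; use the vendor's prescribed value.
MAX_INLET_C = 27.0


def overheated(readings: Dict[str, float], limit_c: float = MAX_INLET_C) -> List[str]:
    """Return the names of sensors whose reading exceeds the limit, sorted by name."""
    return sorted(name for name, temp in readings.items() if temp > limit_c)


if __name__ == "__main__":
    # Hypothetical inlet temperatures in degrees Celsius.
    sample = {"rack-a": 24.5, "rack-b": 29.0, "rack-c": 27.0}
    print(overheated(sample))
```

A reading exactly at the limit is not flagged here; whether to alert at or above the limit is a policy choice.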
Most organizations have a disaster recovery (DR) and business continuity planning (BCP) strategy to handle downtime caused by system failures. Technologies such as RAID, automatic failover systems, cloud DR, backup and retrieval systems, and redundant power systems help ensure business continuity in the event of an outage or system failure.
Another cause of downtime is poor design of the data center facility. If you increase server density without making adequate provision for space, power, ventilation, cooling and air conditioning, you are headed for trouble. You should therefore design your data center to accommodate any future additions to the existing facility that your business may demand. It is also advisable to have N+1 or 2N redundancy, that is, one spare unit beyond the N units needed to carry the load, or a full duplicate set of them, to support future growth and prevent outages caused by increased load.
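The arithmetic behind N+1 and 2N can be sketched as follows. This is a simplified illustration, not a sizing tool: the function name and the example figures are hypothetical, and it assumes identical units (for example UPS modules or cooling units) with a known capacity each.

```python
import math


def units_required(load_kw: float, unit_capacity_kw: float, scheme: str = "N+1") -> int:
    """Return how many identical units to install for a redundancy scheme.

    N is the number of units needed to carry the load with no spare capacity.
    """
    if load_kw <= 0 or unit_capacity_kw <= 0:
        raise ValueError("load and unit capacity must be positive")
    n = math.ceil(load_kw / unit_capacity_kw)
    if scheme == "N":
        return n
    if scheme == "N+1":
        return n + 1      # one spare unit
    if scheme == "2N":
        return 2 * n      # a full duplicate set
    raise ValueError(f"unknown scheme: {scheme}")


if __name__ == "__main__":
    # Hypothetical example: size for the projected load, not only today's load.
    for scheme in ("N", "N+1", "2N"):
        print(scheme, units_required(450, 100, scheme))
```

Running the calculation against the projected load rather than the current one is what ties redundancy to the growth planning described above.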
Human error is another prime reason for downtime in data centers. According to the Uptime Institute, a New York-based research and consulting organization that focuses on data center performance, human error causes roughly 70% of the problems that plague data centers today. The group analyzed 4,500 data center incidents, including 400 full downtime events. Whether it is due to neglect, insufficient training, end-user interference, tight purse strings or simple mistakes, human error is unavoidable. And these days, thanks to the ever-increasing complexity of IT systems and the related problem of increasingly overworked data center staff, even the mishaps that can be avoided often aren't. Poor cabling increases the chances of human error causing unplanned downtime. Loose cables are also likely to obstruct cooling vents, leading to a rise in temperature in the data center and ultimately to system failure.
To address human error, service providers should properly train all employees, not just in their day-to-day responsibilities but in worst-case scenarios too, so that they can respond quickly and mitigate damage in any situation.
There are other strategies that data center providers can leverage to help prevent more common causes of downtime, one of which is robust redundancy in critical systems throughout the data center. When a facility is equipped with backup sources for power, connectivity and cooling, even if one source is interrupted or otherwise negatively impacted, operators can switch to the redundant system to keep the data center up and running. Proactive investments in redundant systems can help prevent costly outage events down the line.
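The switch-over logic described above can be sketched in a few lines. This is a minimal illustration under assumed names (the `Source` type, `select_active` and the feed names are hypothetical); real facilities rely on dedicated hardware such as automatic transfer switches rather than application code for power failover.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Source:
    name: str
    healthy: bool


def select_active(sources: List[Source]) -> Optional[Source]:
    """Return the first healthy source in priority order, or None if all are down."""
    for source in sources:
        if source.healthy:
            return source
    return None


if __name__ == "__main__":
    # Hypothetical power feeds, listed from primary to last resort.
    feeds = [Source("utility", healthy=False), Source("generator", healthy=True)]
    active = select_active(feeds)
    print(active.name if active else "no source available")
```

The same priority-ordered pattern applies to connectivity and cooling: the list order encodes which source is primary and which are backups.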
Being proactive in avoiding possible downtime incidents is a good practice that all organizations should follow. If you have any queries on how your company can tackle the issue of downtime, do write to us at firstname.lastname@example.org
Nilesh Rane is the Associate Vice President - Product and Service at Netmagic Solutions. Nilesh is an expert in the data center domain, specifically in areas such as Disaster Recovery, DR-as-a-Service, IDC and Bandwidth. Of his 20 years of total work experience, over 10 have been in the data center domain. Nilesh has been with Netmagic for 6 years, handling key roles and responsibilities in areas such as DR, DRaaS, IDC and Bandwidth.