RAID’s Failure Paves Way for Object Storage in Big Data Environments
Author: Govind Desikan | Date: December 06, 2013 | Category: Managed Storage Services
In the bustling world of technology, Big Data has clearly created a buzz. While IT providers are adding the Big Data lever to their marketing campaigns, user organizations are applying it to very specific areas of scale and analysis. The best part is that it is not only for the large and deep-pocketed: smaller organizations can also derive business value and insight from the concept.
So What Makes Big Data The Buzz It Is?
Companies in industries such as Media & Entertainment (M&E) and Research and Development (R&D) have long dealt with large volumes of unstructured data (often changing in real time). However, extracting meaning from this data has been a daunting task – cost-prohibitive and dependent on advanced, expensive technology.
Also, technologies that traditionally handle large volumes of data and analysis have dealt overwhelmingly with structured data, with analysis operations running in batches – which can take days, and often weeks, to research, analyze and surface insights. Big Data is about large volumes of unstructured data complemented by rapid analysis, with insights delivered within seconds.
Big Data Requires Big Storage
Big Data requires more capacity, high scalability and highly efficient access. Traditional storage architectures are designed to scale with data growth by adding more capacity – more storage boxes. RAID (Redundant Array of Independent Disks)-based storage systems end up with huge storage sprawl – not what Big Data requires.
Big Data storage requires scale-out or clustered storage systems – such as scale-out NAS (Network Attached Storage). This architecture can scale out to meet capacity or performance requirements, and uses parallel file systems distributed across many storage devices that can handle billions of files without the performance degradation that ordinary file systems suffer.
Another strategy is Object Storage – a storage architecture that tackles large data volumes and file counts in much the same way scale-out NAS does.
RAID Is Not Big Data Ready
IDC has predicted that the digital content explosion will result in over 35,000 Exabytes (1 Exabyte = 1 million terabytes) of stored data by 2020. Big Data on the Hadoop stack has been gaining acceptance, and real-world requirements have become extremely demanding: delivering high performance for faster decision making and timely insights, and unlocking significant value by making information transparent and usable at much higher frequency.
Also, as organizations create and store more transactional data in digital form, they can collect more accurate and detailed performance information on everything from product inventories to sick days, which exposes variability and boosts performance. Leading companies use data collection and analysis to conduct controlled experiments and make better management decisions; others use data for everything from basic low-frequency forecasting to high-frequency now-casting, adjusting their business levers just in time.
Current data storage systems based on RAID arrays were not designed to scale to this type of data growth. As a result, the cost of RAID-based storage systems increases with the total amount of data stored, while data protection degrades, resulting in permanent loss of digital assets. With the capacity of today's storage devices, RAID-based systems cannot protect data from loss. Most IT organizations using RAID for Big Data storage are bound to incur additional costs, copying their data two or three times to protect it from otherwise inevitable data loss.
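The claim that large drives undermine RAID protection can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative, not from the article: it assumes an unrecoverable read error (URE) rate of 1 in 10^14 bits, a commonly quoted spec for consumer SATA drives, and estimates the chance of hitting at least one URE while re-reading the surviving data during a RAID 5 rebuild.

```python
# Illustrative calculation (assumed URE rate, not from the article):
# probability of at least one unrecoverable read error (URE) while
# rebuilding a RAID 5 array, given how much surviving data must be re-read.

URE_RATE = 1e-14          # assumed probability of a URE per bit read
BITS_PER_TB = 8e12        # 10^12 bytes * 8 bits per byte

def rebuild_failure_probability(data_read_tb: float) -> float:
    """P(at least one URE) when data_read_tb terabytes are re-read."""
    bits = data_read_tb * BITS_PER_TB
    return 1.0 - (1.0 - URE_RATE) ** bits

# Rebuilding a set that must re-read 12 TB of surviving data:
p = rebuild_failure_probability(12)
print(f"Chance of an unrecoverable error during rebuild: {p:.1%}")
```

With multi-terabyte drives the probability approaches certainty, which is why a single rebuild so often fails outright on large RAID 5 sets.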
Object Storage For Big Data
Object storage systems (e.g. SimpliDrive) have risen to prominence in the storage industry and underlie both public and private cloud offerings. Object storage is an approach to addressing and manipulating discrete units of storage called objects. Like files, objects contain data – but unlike files, objects are not organized in a hierarchy. An Object-based Storage Device (OBSD) or Object Storage Device (OSD) is a storage system that organizes data into containers called objects.
Every object exists at the same level in a flat address space called a storage pool and one object cannot be placed inside another object.
Both files and objects have metadata associated with the data they contain, but objects are characterized by their extended metadata. Each object is assigned a unique identifier which allows a server or end user to retrieve the object without needing to know the physical location of the data.
To draw an analogy, object storage is like valet parking at an upscale restaurant. When a customer uses valet parking, he exchanges his car keys for a receipt. The customer does not know where his car will be parked or how many times an attendant might move it while he is dining; a storage object's unique identifier is the customer's receipt.
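The model described above – a flat pool of objects, each addressed only by a unique identifier (the "valet receipt") and carrying its own extended metadata – can be sketched in a few lines. All names here (`ObjectStore`, `put`, `get`) are illustrative, not a real product's API.

```python
# Minimal sketch of the object-storage model: a flat address space,
# objects retrieved by unique ID alone, metadata stored alongside data.
import uuid

class ObjectStore:
    def __init__(self):
        self._pool = {}  # flat address space: object_id -> (data, metadata)

    def put(self, data: bytes, **metadata) -> str:
        """Store an object; return its unique identifier (the 'receipt')."""
        object_id = str(uuid.uuid4())
        self._pool[object_id] = (data, metadata)
        return object_id

    def get(self, object_id: str) -> bytes:
        """Retrieve by ID alone: no path, no hierarchy, no physical location."""
        data, _ = self._pool[object_id]
        return data

store = ObjectStore()
oid = store.put(b"raw video frame", content_type="video/raw", camera="A7")
assert store.get(oid) == b"raw video frame"
```

Note that, unlike a filesystem path, the identifier says nothing about where the data lives – which is exactly what lets an object store move or replicate data freely behind the scenes.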
Resiliency And Data Integrity
Distributing copies of data across storage nodes can provide a measure of data protection; distributing those nodes geographically can add a Disaster Recovery (DR) capability.
Object Storage Device systems can also provide data resiliency with erasure coding. This is a computational process somewhat similar to RAID parity, which parses a data set into sub-blocks, or "chunks," adding a percentage of redundant chunks, depending upon the desired level of protection.
Erasure coding helps maintain data integrity, restoring lost or corrupted data using these redundant chunks, much like RAID does at the disk drive level. But, compared with RAID and traditional replication, erasure coding can be more efficient and more robust and is being used by some OSD vendors to replace RAID protection altogether within objects. This can reduce the storage capacity and processing overhead required.
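The chunking-plus-redundancy idea can be illustrated with the simplest possible erasure code: a single XOR parity chunk. Real object stores use Reed-Solomon-style codes that tolerate multiple simultaneous losses; this toy sketch (all function names illustrative) shows only the principle of rebuilding a lost chunk from the survivors.

```python
# Toy single-parity erasure code: k data chunks plus one XOR parity chunk.
# Any one lost chunk can be rebuilt by XOR-ing the surviving chunks.

def encode(chunks: list[bytes]) -> bytes:
    """Compute one parity chunk as the byte-wise XOR of all data chunks."""
    parity = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            parity[i] ^= b
    return bytes(parity)

def recover(surviving: list[bytes]) -> bytes:
    """XOR of the surviving chunks (including parity) rebuilds the lost one."""
    return encode(surviving)

data = [b"AAAA", b"BBBB", b"CCCC"]
parity = encode(data)
# Lose the middle chunk; rebuild it from the other chunks plus parity.
rebuilt = recover([data[0], data[2], parity])
assert rebuilt == b"BBBB"
```

Production codes generalize this: with k data chunks and m redundant chunks, any m losses can be tolerated at a capacity overhead of m/k, typically far less than the 2x-3x cost of full replication.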
Object storage supports growth without significant performance degradation, and OSDs can be scaled geographically. Object-based storage can scale without a matching increase in resource demands, helping curtail cost increases.
To translate this into real-world costs, the following table summarizes an example one-petabyte requirement with 99.9999% reliability.
RAID 5 was not designed for terabytes of data; data loss becomes near-certain in short order. RAID 6 can prevent data loss for several years with low storage overhead; however, at 512 terabytes there is a very high probability of data loss in less than a year.
Erasure-coding-based results do not appear in the above chart because, even for a very large amount of data such as 524K terabytes, the theoretical mean time to data loss is approximately 79 million years.
The popularity of SMAC (Social, Mobile, Analytics and Cloud computing) has changed the way organizations look at IT. As the amount of unstructured data in the enterprise grows, CIOs are continuously looking for newer approaches to manage data.
Going forward, organizations will build storage architectures with Big Data in mind, emphasizing essential features such as scalability, high performance, simple management, efficiency, increased data protection and interoperability – all helping them curtail huge cost increases and drive better business efficiency and decision-making.
Govind Desikan is the Business Development Head for Cloud Services at Netmagic, responsible for evangelizing Cloud initiatives and engaging deeply with customers to prepare Cloud blueprints for successful Cloud roll-outs. He has been associated with the IT industry for close to 20 years, with wide-ranging experience across large enterprises, software vendors, building datacenter services from the ground up, and architecting large system roll-outs, including elastic and adaptable architectures. Conversant with most software technologies, he is a passionate and avid believer in simplifying technology to make it relevant for business decision makers. He has previously worked with popular software OEM brands such as VMware, Microsoft, Sun and Oracle. Apart from being a Computer Science engineer, he is also a Cost Accountant and an ISO Lead Auditor. He is well known to his customers, who fondly recognize him as "one of the best consultative solution sellers".