Disaster recovery and redundancy testing at Dirigo
03/05/2012
At Dirigo we know that data is one of the most important assets of any organization. In our business we cannot lose data. Ever! Your data enables you to act on Business Intelligence (BI) and can make the difference between success and failure. We treat all client data as if it were our own. Having an enterprise backup solution and a disaster recovery plan is a must for any serious development firm. Sadly, most web development shops have neither.
Today I am performing redundancy experiments using a Dell PowerVault MD1000 storage array connected to a PowerEdge host server. Here at Dirigo we invest in disaster planning and recovery, so I have very nice tools to work with.
You might be wondering what all of this means. The MD1000 is a 3U rackmount enclosure that holds 15 hard drives. Wow, managing 15 hard drives must be a real pain! Imagine all of those drives showing up in your file explorer. Not so: advanced storage architectures like RAID (Redundant Array of Independent Disks) combine the drives into a single logical volume and can improve performance, redundancy, or both.
There are many different RAID levels, and they can be quite confusing at first. It is important to properly calculate the I/O demands of your application in order to pick the correct RAID level. For instance, RAID 0 takes two or more disks and stripes your data across each disk. As an example, the hard drive in your computer probably spins at 7,200 RPM and can read and write about 75MB/s on average. If you distribute the load between two hard drives using RAID 0, it is roughly like having one hard drive that can read and write 150MB/s. RAID 0 isn't limited to two disks, though: imagine a RAID 0 array using eight hard drives, and you have multiplied your sequential I/O performance by roughly 8x.
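The striping idea is simple enough to sketch in a few lines of Python. This is a toy illustration of the concept, not how a hardware controller actually works: consecutive chunks of data are dealt round-robin across the disks, so sequential transfers are spread over every spindle at once.

```python
# Toy sketch of RAID 0 striping: deal fixed-size chunks round-robin
# across N "disks" (here, plain Python lists), then reassemble them.

CHUNK_SIZE = 64 * 1024  # 64 KB stripe unit, a common default

def stripe(data: bytes, num_disks: int) -> list[list[bytes]]:
    """Split data into chunks and distribute them round-robin across disks."""
    disks = [[] for _ in range(num_disks)]
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        disks[(i // CHUNK_SIZE) % num_disks].append(chunk)
    return disks

def read_back(disks: list[list[bytes]]) -> bytes:
    """Interleave chunks from each disk to reassemble the original data."""
    chunks = []
    for i in range(max(len(d) for d in disks)):
        for disk in disks:
            if i < len(disk):
                chunks.append(disk[i])
    return b"".join(chunks)

data = bytes(range(256)) * 2048  # 512 KB of sample data
disks = stripe(data, 4)
assert read_back(disks) == data

# With every disk sustaining ~75 MB/s, an N-disk stripe can approach
# N x 75 MB/s of sequential throughput (in theory):
per_disk_mb_s = 75
print("2 disks:", 2 * per_disk_mb_s, "MB/s")
print("8 disks:", 8 * per_disk_mb_s, "MB/s")
```

Note that the speedup applies to large sequential transfers; small random I/O scales less cleanly because each request may still land on a single disk.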
RAID 0 won't provide fault tolerance, though. If one of your eight hard drives fails, that is an unrecoverable failure resulting in total data loss. The Dell PowerVault MD1000 in the picture I posted is running eight 15K RPM enterprise hard drives with 3Gb/s SAS interfaces. The array is configured using RAID 5, which is similar to RAID 0 except that it tolerates a single hard drive failure and continues to function. RAID 5 stripes the data across the hard drives and also distributes extra data called parity across the drives, so that the contents of any one failed drive can be reconstructed. If one of the drives fails, we just replace it and the array rebuilds. The whole time there is zero downtime and no data is lost.
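The trick behind RAID 5's parity is plain XOR. Here is a minimal sketch of the math (an illustration only, nothing like real controller firmware): the parity block of a stripe is the XOR of its data blocks, so any single missing block can be recomputed by XORing the survivors with the parity.

```python
# Minimal illustration of RAID 5 parity: parity = XOR of the data blocks,
# so any one lost block is recoverable from the remaining blocks + parity.

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            result[i] ^= b
    return bytes(result)

# One stripe spread across three data drives, plus its parity block:
d0, d1, d2 = b"\x01\x02", b"\x10\x20", b"\xa0\x0b"
parity = xor_blocks([d0, d1, d2])

# Simulate losing drive 1: XOR the surviving blocks with the parity
# to rebuild the missing data exactly.
rebuilt = xor_blocks([d0, d2, parity])
assert rebuilt == d1
```

This is also why RAID 5 only tolerates a single failure: with two blocks of a stripe missing, one XOR equation can't recover both.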
The array in the picture doesn't actually use all eight drives for data, though. Seven drives make up the RAID 5 array, and one drive is dedicated as a hot spare. This means that as soon as one of the drives in the array fails, it is replaced with the hot spare. The array begins the rebuild process instantly to protect your data and return the array to full redundancy. The host server immediately sends an alert so that a technician can be dispatched to replace the failed hard drive. The replacement then becomes the new hot spare.
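The hot-spare workflow described above boils down to a simple state change. Here is a toy sketch of that logic (hypothetical names, nothing resembling actual PERC firmware): on failure, promote the spare into the array, start a rebuild, and raise an alert for a technician.

```python
# Toy model of hot-spare promotion: on a member-drive failure, the spare
# joins the array, a rebuild begins, and an alert is raised for replacement.

def handle_drive_failure(array: dict, failed: str) -> list[str]:
    """Promote the hot spare into the array and report the actions taken."""
    actions = []
    array["members"].remove(failed)
    if array["hot_spare"] is not None:
        spare = array["hot_spare"]
        array["members"].append(spare)
        array["hot_spare"] = None
        actions.append(f"rebuilding onto {spare}")
    actions.append(f"ALERT: replace failed drive {failed}")
    return actions

array = {"members": ["disk0", "disk1", "disk2"], "hot_spare": "disk3"}
print(handle_drive_failure(array, "disk1"))

# Once the technician swaps in a fresh drive, it is configured
# as the new hot spare and the cycle can repeat:
array["hot_spare"] = "disk4"
```

The important property is that the rebuild starts the moment the failure is detected, not when the technician arrives, which minimizes the window where a second failure would be fatal.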
My test was a success. Today I expanded the array by adding extra drives. The first part of the experiment was to set up the RAID 5 array with three drives. I then expanded it with three more drives, which ensures that as data grows we can continue to increase our storage capacity. Next I physically pulled one of the drives to simulate a drive failure. This degraded the array and triggered an automatic rebuild. It worked like a charm: the hot spare started rebuilding instantly, and I then replaced the pulled drive and reconfigured it to be the new hot spare.
A storage array like the MD1000 or MD3000 holds fifteen 2TB drives, which gives Dirigo a peak raw capacity of 30TB per enclosure. A Dell PowerVault MD3000 can have two MD1000s chained to it, putting our main three-enclosure array at 90TB of highly available direct attached storage. At Dirigo we use large MD1000 and MD3000 arrays to back up all of our corporate workstations and servers. By systematically calculating and planning for hardware failure we can anticipate and circumvent catastrophic data loss and provide near-100% uptime.
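The capacity math above is worth making explicit, since raw and usable capacity differ. A quick back-of-the-envelope calculation (raw figures only; RAID 5 parity and any hot spares would reduce the usable number):

```python
# Back-of-the-envelope capacity math for the chained enclosures described above.
drive_tb = 2           # 2 TB drives
drives_per_shelf = 15  # an MD1000/MD3000 enclosure holds 15 drives
shelves = 3            # one MD3000 plus two chained MD1000s

raw_per_shelf = drive_tb * drives_per_shelf  # raw TB per enclosure
raw_total = raw_per_shelf * shelves          # raw TB across the chain
print(raw_per_shelf, "TB per shelf,", raw_total, "TB total")  # 30 TB per shelf, 90 TB total
```

In a RAID 5 layout, one drive's worth of space per array goes to parity, and a dedicated hot spare removes another drive from the usable pool, so the deliverable capacity is somewhat below these raw numbers.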