I have just read an interesting article: http://www.smbitjournal.com/2012/05/when-no-redundancy-is-more-reliable/ and I am nothing but shocked and astonished…
Astonished how maths has become a weird, mysterious artefact of the past and logical thinking is now passé. C’mon SMB IT guys, I expect more from writers of such long essays.
Long story short: the author claims that today, with massive multi-terabyte disks, RAID5 has lost its appeal and does not provide sustainable data protection.
I don’t argue with that. Even without much experience with RAID5 I can imagine that it might be very hard to successfully resilver such an array when one drive dies:
- resilvering not only slows the array down, it also stresses it much more. Combined with other factors, such as the age of the drives, this makes the failure of a second drive during resilvering much more probable.
- there is also the case of UREs (Unrecoverable Read Errors), whose severity I don’t really understand.
I accept the fact that current implementations will fail the array when a URE happens during resilvering, but this is just a limitation of the software. A URE during resilvering does not destroy the data: it is still there, in exactly the same state as it was before the URE. Why does the array not just drop the resilvering and go into an “unrecoverable/online” state? The data would still be accessible!
But that’s not what struck me… The author went further and suggested that RAID5 is so bad that even RAID0 provides better protection. What a blunt piece of nonsense…
To the calculators!
Let’s do some calculations (Bernoulli trials, to be precise)… We’ll be using the 3% annual failure rate of a hard drive in RAID0, as stated in the article. I’ll bump the failure rate to 4% for RAID5, as the initial build puts some stress on the drives…
The probability of an 8-disk RAID0 array surviving a year (no hard drive failure) equals 0.97^8 ≈ 78%. Thus the probability of it failing within a year is 22%.
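The RAID0 number is easy to reproduce. Here is a minimal sketch, assuming that drives fail independently of each other and all share the same annual failure rate (the function name is mine, not the article's):

```python
def raid0_survival(disks: int, annual_failure_rate: float) -> float:
    """Probability that none of the drives fails within a year.

    Assumes independent failures with an identical rate per drive.
    RAID0 has no redundancy, so a single failure kills the array.
    """
    return (1.0 - annual_failure_rate) ** disks


survival = raid0_survival(disks=8, annual_failure_rate=0.03)
print(f"RAID0 survives the year: {survival:.1%}")
print(f"RAID0 fails within the year: {1.0 - survival:.1%}")
```

Rounded to whole percents this gives the 78% and 22% quoted above.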
The probability of a 9-disk RAID5 array (to provide as much capacity as the 8-disk RAID0) surviving a year (0 or 1 hard drive failures) can be calculated by taking the simpler, non-overlapping cases and adding their probabilities:
- zero HDD failures: 0.96^9 ≈ 69%
- one HDD failure: 9 × 0.04 × 0.96^8 ≈ 26%
- zero or one HDD failure: 69% + 26% = 95%
- probability of RAID5 failure within a year: 5%
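The same cases can be checked with a few lines of code. This is only a sketch under the same assumptions as the figures above: independent failures, a 4% annual rate per drive, and any second failure within the year counted as fatal. The helper name is hypothetical.

```python
from math import comb


def failure_count_probability(disks: int, afr: float, failures: int) -> float:
    """Binomial probability that exactly `failures` of `disks` drives fail in a year."""
    return comb(disks, failures) * afr**failures * (1.0 - afr) ** (disks - failures)


# 9-disk RAID5 at the bumped 4% annual failure rate
zero_failures = failure_count_probability(9, 0.04, 0)
one_failure = failure_count_probability(9, 0.04, 1)
raid5_survival = zero_failures + one_failure  # the two cases cannot overlap

print(f"zero HDD failures:   {zero_failures:.1%}")
print(f"one HDD failure:     {one_failure:.1%}")
print(f"zero or one failure: {raid5_survival:.1%}")
print(f"RAID5 failure:       {1.0 - raid5_survival:.1%}")
```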
You can verify the above figures with a piece of paper (the formulas are in the previously linked Wikipedia page) or an online calculator.
RAID0 is more than 4 times more prone to failure than RAID5 (and that goes up to more than 7 times should I use the 3% failure rate for RAID5 as well).
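For anyone who prefers to see the ratio computed rather than take my word for it, here is a rough sketch. It assumes independent drive failures and treats an array as lost once more drives fail within the year than it can tolerate (0 for RAID0, 1 for RAID5). The function and variable names are made up for this example.

```python
from math import comb


def annual_failure_probability(disks: int, afr: float, tolerated_failures: int) -> float:
    """Probability that more than `tolerated_failures` drives fail within a year."""
    survival = sum(
        comb(disks, k) * afr**k * (1.0 - afr) ** (disks - k)
        for k in range(tolerated_failures + 1)
    )
    return 1.0 - survival


# 8-disk RAID0 at 3%, no failure tolerated
raid0_failure = annual_failure_probability(8, 0.03, tolerated_failures=0)

# 9-disk RAID5, once with the bumped 4% rate and once with the plain 3% rate
ratios = {}
for raid5_afr in (0.04, 0.03):
    raid5_failure = annual_failure_probability(9, raid5_afr, tolerated_failures=1)
    ratios[raid5_afr] = raid0_failure / raid5_failure
    print(f"RAID5 at {raid5_afr:.0%}: RAID0 is {ratios[raid5_afr]:.1f}x more likely to fail")
```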
Please, don’t believe in some mumbo-jumbo… Believe in numbers… As flawed as RAID5 is today, it is still much more reliable than RAID0.