29 Sep 2008

56% Chance Your Hard Drive Is Not (Fully) Readable, A Lawsuit in the Making

If you are a hard drive junkie, you have probably at least heard of the bit error rate. The last time I read about it was when I was happy to have those huge honking 40GB drives. Back then, the error rate combined with the size of my drives was small enough that it was not worth worrying about. At some point, I forgot that the worry is actually a function of how much data there is, not to mention that the quantity of data is probably also a factor in the utility function for that data. Thus, as time has gone by and drive sizes have increased dramatically, the bit error rate has become a serious concern.

The short story first: Your consumer level 1TB SATA drive has only a 44% chance of being read completely without any error. If you run a RAID setup, this is really bad news because it may prevent rebuilding an array in the case of disk failure, making your RAID not so Redundant.

The clincher is that this is easily fixed with a software update and a modest decrease in drive size to make space for error correcting data. This would increase drive prices modestly but would stave off lots of problems. Of course, explaining this all to the average consumer in two words or less (do you really look at much more than “750 GB” or whatever when choosing your HD?) is rather difficult.

The long story (from someone who knows much more about these things than myself):

SATA vs. SCSI reliability

Here’s a guy who discusses SATA vs. SCSI disk reliability. Short conclusion: actual disk failures (MTBF) are almost exactly as likely with cheap SATA disks as with expensive SCSI disks. But the bit error rate of SATA is much higher. In other words, the likelihood of not being able to read a sector because it got corrupted is vastly higher on SATA than on SCSI. By his calculation, on a 1TB disk, you have about a 56% chance of not being able to read every single sector, which means rebuilding your RAID correctly in the case of a failed disk is usually impossible.

It’s true, and I learned this the hard way myself. Back at NITI, we ran into exactly this problem. In the days when we introduced our software RAID support, typical disk sizes were around 20 GB, about 50x smaller than they are now. (Wow!) The bit error rates now are about the same as they were then, which means that, assuming the failure percentage scales linearly with disk size(1), those 20 GB disks had about a 1.1% failure rate in recovering a RAID.

In general, that 1.1% failure rate isn’t so bad. Remember, it’s 1.1% on top of the rather low chance that your RAID failed in the first place, and even then it doesn’t result in total data loss – just some inconvenience and the loss of a sector here or there. Anyway, the failure rate was small enough that nobody knew about it, including us. So when we had about a 1.1% rate of weird tech support cases involving RAID problems, we looked into it, but blamed it on bad luck with hard drive vendors.

By the time disks were 200GB and failure rates were more like 10%, we were having some long chats with those hard drive vendors. Um, guys? Your disks. They’re dropping dead at a pretty ridiculous pace, here.

You see, we were still proceeding under the assumption that IDE disks are either entirely good, or they’re bad. That is, if you get a bad sector on an IDE disk, it’s supposed to be the beginning of the end. That’s because modern disks have a spare sector remapping feature that’s supposed to automatically (and silently) stop using sectors when the disk finds that they’re bad. The problem, though, is that the disk has to discover this at write time, not at read time. If you’re writing a sector, you can just read it back, make sure it was written correctly, and if not, write it to your spare sector area. But if you read it back later and it fails the checksum – what then?

This is the “bit error rate” problem. It’s not nice to think about, but modern disks just plain lose data over time. You can write it one day, and read-verify it right afterwards without a problem, and then the data can be missing again tomorrow. Ugh.

And the frequency – per bit – with which this happens is the same as ever. With SCSI it’s less than with SATA, but as we have more bits per disk, the frequency per disk is getting ridiculous. A 56% chance that you can’t read all the data from a particular disk now.

There are two reasons you probably haven’t heard about this problem. First, you probably don’t run a RAID. Let’s face it, if your home disk has a terabyte of stuff on it, you just probably aren’t accessing all that data. Most of the files on your disk, you will probably never access again. Face it! It’s true. If you filled up a 1TB disk, you probably filled it with a bunch of movies, and most of those movies suck and you will never watch them again. Statistically speaking, the part of your disk that loses data is probably in the movies that suck, not the movies that are good, simply because the vast majority of movies suck.

But if you’re using a RAID, you occasionally need to read the entire disk, so the system will find those bad sectors, even in files you don’t care about. Maybe the system will be smart enough not to report those bad sectors to you, but it’ll find them.

Secondly, even when you lose a sector here and there, you usually don’t even care. Movies, again: MPEG streams are designed to recover from occasional data loss, because they’re designed to be delivered over much less reliable streams than hard disks. What happens if you get a corrupt blob of data in your movie? A little sprinkle of digital junk on your screen. And within a second or so, the MPEG decoder hits another keyframe and the junk is gone. Whatever, just another decoder glitch, right? Maybe. Maybe not. But you don’t really care either way.

The Solution

At NITI, we eventually settled on a clever solution to this that won’t lose data on RAIDs. Of course we can’t protect data on a non-RAID disk in any direct sense, but we strongly recommended for our customers to do frequent incremental backups instead.

But on a RAID, the problem is actually easier: simply catch the problem before a disk fails. In the background, we would be constantly, but slowly, reading through the contents of all your disks, averaging about one pass per week. If your RAID is still intact but we find a bad sector, no data has been lost yet: the other disks in the RAID can still be used to reconstruct it. So that’s exactly what we would do! Reconstruct the bad sector, and write it back to the failing disk which could then automatically use its sector remapping code to make the bad sector disappear forever.
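The background pass described above can be sketched roughly as follows. This is not the NITI code (which was never released); it is a toy in-memory simulation that assumes a RAID-5-style layout where any one block in a stripe is the XOR of the others. The names `Disk`, `ReadError`, `xor_blocks` and `scrub` are made up for illustration.

```python
from functools import reduce


class ReadError(Exception):
    """Raised when a simulated sector cannot be read."""


class Disk:
    """Toy in-memory disk: a list of sectors plus a set of unreadable ones."""

    def __init__(self, sectors):
        self.sectors = list(sectors)
        self.bad = set()

    def read(self, n):
        if n in self.bad:
            raise ReadError(n)
        return self.sectors[n]

    def write(self, n, data):
        # On a real drive, the write is where sector remapping would kick in.
        self.sectors[n] = data
        self.bad.discard(n)


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def scrub(disks):
    """Read every sector once; rebuild unreadable ones from the other disks.

    Returns a list of (disk_index, sector) pairs that were repaired.
    If two disks are unreadable at the same sector, ReadError propagates:
    that stripe really is lost.
    """
    repaired = []
    for n in range(len(disks[0].sectors)):
        for i, disk in enumerate(disks):
            try:
                disk.read(n)
            except ReadError:
                others = [d.read(n) for j, d in enumerate(disks) if j != i]
                disk.write(n, xor_blocks(others))
                repaired.append((i, n))
    return repaired


# Two data disks plus one parity disk, with one sector gone bad.
data0 = [b"\x01\x02", b"\x10\x20"]
data1 = [b"\x0f\x0f", b"\xaa\xbb"]
parity = [xor_blocks([a, b]) for a, b in zip(data0, data1)]
array = [Disk(data0), Disk(data1), Disk(parity)]
array[1].bad.add(1)
print(scrub(array))  # [(1, 1)]
```

A real implementation would also throttle itself (one pass per week over 1TB works out to well under 2 MB/s) and would talk to block devices rather than Python lists.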

The read-reconstruction part was never open sourced, so if you want that, you’d have to write it yourself. Luckily, it was easy, and now that we have ionice you don’t have to be nearly as careful to do it slowly in the background.

The other part was to make sure Linux’s software RAID could recover in case it ran into individual bad sectors. You see, they made the same bad assumption that we did: if you get a bad sector, the disk is bad, so drop it out of the RAID right away and use the remaining good disks. The problem is that nowadays, every disk in the RAID is likely to have sector errors, so it will be impossible to rebuild the RAID under that assumption. Not only that, but throwing the disk out of the RAID is the worst thing you can do, because it prevents you from recovering the bad sectors on the other disks!

A co-worker of mine at the time, Peter Zion(2), modified the Linux 2.4 RAID code to do something much smarter: it would keep a list of bad sectors it found on each disk, and simply choose to read that sector from the set of other disks whenever you tried to read it. Of course it would then report the problem back to userspace through a side channel, where we could report a warning about your disks being potentially bad, and accelerate the background auto-recovery process.
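To make the idea concrete, here is a minimal sketch of a per-disk bad-sector list, written as a userspace toy rather than kernel code. It assumes a simple mirror (every disk holds the same data) to keep it short; the actual patch is not available, so the class name `MirroredArray`, the `on_warning` callback and the use of `None` to mark an unreadable sector are all illustrative assumptions.

```python
def log_warning(disk_index, sector):
    """Stand-in for the side channel that reports problems to userspace."""
    print(f"warning: disk {disk_index} has an unreadable sector {sector}")


class MirroredArray:
    """Toy mirror that remembers bad sectors instead of dropping the disk."""

    def __init__(self, replicas, on_warning=log_warning):
        # Each replica is a dict mapping sector number -> bytes.
        # A value of None simulates a sector that fails to read.
        self.replicas = replicas
        self.bad = [set() for _ in replicas]
        self.on_warning = on_warning

    def read(self, sector):
        for i, replica in enumerate(self.replicas):
            if sector in self.bad[i]:
                continue  # known bad here: go straight to the other disks
            data = replica.get(sector)
            if data is None:
                # Record the failure and report it, but keep the disk in
                # the array so its good sectors remain usable.
                self.bad[i].add(sector)
                self.on_warning(i, sector)
                continue
            return data
        raise IOError(f"sector {sector} is unreadable on every disk")


array = MirroredArray([{0: b"a", 1: None}, {0: b"a", 1: b"b"}])
print(array.read(1))  # warns about disk 0, then returns the copy from disk 1
print(array.read(1))  # no new warning: disk 0 is skipped for this sector
```

The point of the design is in the `continue`: a single failed read marks one sector, not the whole disk, so the disk's other sectors stay available for reconstructing bad sectors elsewhere in the array.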

Sadly, while the code for this must be GPL as it modified the Linux kernel, the old svn repository at svn.nit.ca seems to have disappeared. I imagine it’s an accident, albeit a legally ambiguous one. But I can’t point you to it right now.

I also don’t know if the latest Linux RAID code is any smarter out of the box than it used to be. As we learned, it used to be pretty darn dumb. But I don’t have to know; I don’t work there anymore. Still, please feel free to let me know if you’ve got some interesting information about it.

Footnote

(1) Of course the failure rate is not exactly a linear function of the disk size, for simple reasons of probability. The probability of a 1TB disk having no errors (0.44, apparently) is actually the same as the probability that all of a set of 50x 20GB disks has no errors. The probability of no failures on any one disk is thus the 50th root of that, or 98.4%. In other words, the probability of failure was more like 1.6% back in the day, not 1.1%.

(2) Peter is now a member of The Navarra Group, a software contracting group which may be able to solve your Linux kernel problems too. (Full disclosure: I’m an advisor and board member at Navarra.)
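The arithmetic in footnote (1) is easy to check. The sketch below takes the 0.44 figure for a 1TB disk as given and assumes errors are independent and spread evenly across the disk, so that the chance of a clean read scales as a power of the disk size. The function names are made up for illustration.

```python
P_CLEAN_1TB = 0.44  # chance of reading a whole 1TB disk without error (from the post)
REFERENCE_GB = 1000


def p_unreadable(size_gb):
    """Chance that a disk of size_gb has at least one unreadable sector."""
    return 1 - P_CLEAN_1TB ** (size_gb / REFERENCE_GB)


def linear_estimate(size_gb):
    """The naive estimate: scale the 56% figure linearly with disk size."""
    return (1 - P_CLEAN_1TB) * size_gb / REFERENCE_GB


for size in (20, 200, 1000):
    print(f"{size:>5} GB: exact {p_unreadable(size):.1%}, linear {linear_estimate(size):.1%}")
```

For 20GB this gives about 1.6% against the linear 1.1%, matching the footnote. Under the same assumptions the 200GB figure comes out nearer 15% than the "more like 10%" the linear estimate suggests.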

5 Responses to “56% Chance Your Hard Drive Is Not (Fully) Readable, A Lawsuit in the Making”

  1. EconTech » MHDDFS: Yet Another Neat Fuse FileSystem says:

    [...] As all of our collections of harddrives continues to grow, trying to find files and keep them organized in one place becomes impossible. Multiple mount points (or, if you are a poor soul, drive letters) means making choices at to what set of files goes on what drive and in what organization pattern. There are of course ways around this, RAID, which adds its own level of complexity for the user and of course eats disk space with questionable redundancy. [...]

  2. Louis Morrison says:

    Though you are correct about the Bit Error Rate (BER) of the SATA interface (vs. SCSI/SAS), there is CRC detection and the SATA packet will fail and the operating system will retry. This is a physical layer BER and it is severely subjected to the lack of unshielded SATA cables. It is not uncommon to use a BER of 10^19 when in a PCB backplane situation. The BER of the physical HDD media is a completely different situation. The data on the hard drive is protected by multiple layers of ECC and CRC; this is how they can store data at the higher densities. Yes, SCSI and SAS drives have “wider” tracks to allow more magnetic area to be polarized, but the underlying ECC and CRC protection is the same. The uncorrectable data you are talking about is dominated by the areal density and not the SATA vs SCSI/SAS as you claim.

  3. computer.economist says:

    I definitely don’t claim to be an expert on this, but that is pretty much my point. The difference is NOT fundamental to the technology, and yet there is a consistent difference between the implementations of the technology, leading to this situation, and thus, the potential for collusion and other anti-trust claims.

    As for the wider tracks vs areal density, isn’t the high areal density in SATA drives a result of using more narrow tracks, which could be changed in the software by telling the head to write to more magnetic material, thus creating the wider track and more reliable error correction? Even maybe a firmware option (or does the head need to be of a different size to write to the ‘wider’ track?) It seems possible to run a given drive in two different states, a “high density, low reliability” and a “lower density, high reliability” that could adjust based on need.

  4. ngpilsung says:

    The numbers quoted are from DESKTOP drives with a BER of 10^14. Anyone using Desktop drives in the first place in a RAID group gets what they deserve. There are other reasons why using Desktop SATA for RAID is not a good idea – different error retry handling, I/O algorithms that minimize head movements and may starve I/O to certain areas of the disk, etc. Looking at Enterprise SATA, the BER is 10^15, decreasing the number by an order of magnitude to 5.6%. Of course, this assumes running a RAID 5. Running RAID 6 minimizes this. While I agree that the BER is a serious problem and needs to be addressed, let’s get the facts straight and stop saying the sky is falling.

  5. computer.economist says:

    I think the desktop/server drive distinction is a bit overstated. Build a server in Dell or HP small business with RAID-5 and check the drives you get. They are ‘desktop’ drives. These drives are definitely marketed with RAID in mind, e.g. NAS devices with RAID sold by the HD manufacturers themselves. If the drives are known to be, in the typical case, not capable of reconstruction of the array, then I believe that is grounds for holding the sellers liable. I’m not sure who is saying the sky is falling, just litigation may be.
