Friday, September 17, 2010

Quick-Take: ZFS and Early Disk Failure

Anyone who's discussed storage with me knows that I "hate" desktop drives in storage arrays. When using SAS disks as a standard, that's typically a non-issue because the SAS world draws no real distinction between "desktop" and "server" disks. So you know I'm talking about the other "S" word - SATA. Here's a tale of SATA woe that I've seen repeatedly cause problems for inexperienced ZFS'ers out there...

When volumes fail in ZFS, the "final" indicator is data corruption. Fortunately, ZFS checksums recognize corrupted data and can take action to correct and report the problem. But that's like treating cancer only after you've experienced the symptoms. In fact, a failing disk will likely begin to "under-perform" well before actual "hard" errors show up as read, write or checksum errors in the ZFS pool. Depending on the reason for "under-performing," this can affect the performance of any controller, pool or enclosure that contains the disk.

Wait - did he say enclosure? Sure. Just like a bad NIC chattering on a loaded network, a bad SATA device can occupy enough of the available service time of a controller or SAS bus (i.e. JBOD enclosure) to cause a noticeable performance drop in otherwise "unrelated" ZFS pools. Hence, detecting such events early is important. Here's an example of an old WD SATA disk failing as viewed from the NexentaStor "Data Sets" GUI:

[caption id="attachment_1660" align="aligncenter" width="450" caption="Something is wrong with device c5t84d0..."]Disk Statistics showing failing drive[/caption]

Device c5t84d0 is having some serious problems. Its busy time is 7x higher than its counterparts', and its average service time is 14x higher. As a member of a RAIDz group, the entire group is being held back by this "under-performing" member. From this snapshot, it appears that NexentaStor's "web GUI" is giving us good information about the disk, but that assumption would not be correct. In fact, the "web GUI" only reports "real time" data while the disk is under load. In the case of a lightly loaded zpool, the statistics may not be reported at all.
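A comparison like that is easy to script. Here's a small sketch - the asvc_t samples below are hypothetical stand-ins for numbers you'd collect from 'iostat -xn', and the 3x threshold is an arbitrary choice - that checks each member's average service time against the group mean:

```shell
# Compare each RAIDz member's average service time (ms) against the
# group mean and flag outliers. The sample values are hypothetical;
# on a live system they would come from 'iostat -xn' asvc_t readings.
stats="c5t81d0 4.8
c5t82d0 5.1
c5t83d0 4.7
c5t84d0 68.2"

outliers=$(printf '%s\n' "$stats" | awk '
    { svc[$1] = $2; sum += $2; n++ }
    END {
        mean = sum / n
        for (d in svc)
            if (svc[d] > 3 * mean)   # arbitrary "3x slower than peers" cutoff
                printf "%s: %.1f ms vs %.1f ms mean", d, svc[d], mean
    }')
echo "$outliers"
```

Run periodically against real samples, a filter like this would have flagged c5t84d0 long before anyone stared at the GUI.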

However, from the command shell, historic and real-time access to per-device performance is available. The output of "iostat -exn" shows the count of all errors for devices since the last time counters were reset, and average I/O loads for each:

[caption id="attachment_1662" align="aligncenter" width="450" caption="Device statistics from 'iostat' show error and I/O history."]Device statistics from 'iostat' show error and I/O history.[/caption]
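Those error columns can also be filtered programmatically. The sketch below mimics the 'iostat -exn' column layout with a heredoc - the numbers are illustrative, not the actual array's - and flags any device reporting hard (h/w) or transport (trn) errors:

```shell
# Flag devices with nonzero hard (h/w) or transport (trn) error counts.
# The heredoc imitates Solaris 'iostat -exn' output with made-up numbers;
# on a live system you would pipe 'iostat -exn' in directly instead.
flagged=$(awk 'NR > 1 && ($12 + $13) > 0 {
    printf "%s: h/w=%d trn=%d tot=%d", $15, $12, $13, $14
}' <<'EOF'
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.4    6.2    3.1   48.0  0.0  0.0    0.1    4.9   0   1   0   0   0   0 c5t81d0
    0.4    6.1    3.0   47.8  0.0  0.4    0.1   68.2   0  21   0  15  22  37 c5t84d0
EOF
)
echo "$flagged"
```

The same one-liner works as a cron job feeding a log or an alert - cheap insurance against a "quietly" degrading member.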

The output of iostat clearly shows this disk has serious hardware problems. It reports hardware errors as well as transmission errors for the device recognized as 'c5t84d0', and the I/O statistics - chiefly read, write and average service time - implicate this disk as a performance problem for the associated RAIDz group. So, if the device is really failing, shouldn't there be a log report of such an event? Yes, and here's a snip from the message log showing the error:

[caption id="attachment_1663" align="aligncenter" width="450" caption="SCSI error with ioc_status=0x8048 reported in /var/log/messages for failing device."]SCSI error with ioc_status=0x8048 reported in /var/log/messages[/caption]

However, in this case, the log is not "full" of messages of this sort. In fact, the error only showed up under the stress of an iozone benchmark (run from the NexentaStor 'nmc' console). I can (somewhat safely) conclude this to be a device failure, since at least one other disk in this group is of the same make, model and firmware revision as the culprit. The interesting aspect of this "failure" is that it does not result in a read, write or checksum error for the associated zpool. Why? Because the device is only loosely coupled to the zpool as a constituent leaf device. It also implies that the device errors were recoverable by either the drive or the device driver (mapping around a bad/hard error).

Since these problems are being resolved at the device layer, the ZFS pool is "unaware" of the problem as you can see from the output of 'zpool status' for this volume:

[caption id="attachment_1661" align="aligncenter" width="450" caption="Problems with disk device as yet undetected at the zpool layer."]zpool status output for pool with undetected failing device[/caption]

This doesn't mean that the "consumers" of the zpool's resources are "unaware" of the problem: the disk error manifests itself in the zpool as higher delays, lower I/O throughput and, consequently, less pool bandwidth. In short, if the error is persistent under load, the drive has a correctable but (performance-wise) catastrophic problem and will need to be replaced. If, however, the error goes away, it is possible that the device driver has suitably corrected for the problem and the drive can stay in place.

SOLORI's Take: How do we know if the drive needs to be replaced? Time will establish an error rate. In short, running the benchmark again and watching the error counters for the device will determine if the problem persists. Eventually, the errors will either go away or they won't. For me, I'm hoping that the disk fails to give me an excuse to replace the whole pool with a new set of SATA "eco/green" disks for more lab play. Stay tuned...
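Mechanically, "watching the error counters" just means snapshotting the 'tot' column of 'iostat -exn' before and after a load run and diffing. A sketch - the two snapshots below are hypothetical, not measured:

```shell
# Diff total error counts (the 'tot' column of 'iostat -exn') taken
# before and after a benchmark run. A count that grows under load means
# the fault is persistent. Snapshot values here are hypothetical.
before="c5t81d0 0
c5t84d0 37"
after="c5t81d0 0
c5t84d0 51"

delta=$(printf '%s\n%s\n' "$before" "$after" | awk '
    $1 in tot { if ($2 > tot[$1]) printf "%s: +%d errors", $1, $2 - tot[$1] }
    { tot[$1] = $2 }')
echo "$delta"
```

A zero delta over repeated runs suggests the driver has worked around the defect; a climbing one is the disk telling you to order its replacement.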

SOLORI's Take: In all of its flavors, 1.5Gbps, 3Gbps and 6Gbps, I find SATA drives inferior to "similarly" spec'd SAS for just about everything. In my experience, the worst SAS drives I've ever used have been more reliable than most of the SATA drives I've used. That doesn't mean there are "no" good SATA drives, but it means that you really need to work within tighter boundaries when mixing vendors and models in SATA arrays. On top of that, the additional drive port and better typical sustained performance make SAS a clear winner over SATA (IMHO). The big exception to the rule is economy - especially where disk arrays are used for on-line backup - but that's another discussion...

4 comments:

  1. This is a very common problem for people using SATA drives (even the "enterprise" ones). We try to stay on top of it by pro-actively replacing disks when they start to show signs of failure or poor performance, but it's still hard to determine where the threshold lies. Like you said, if the workload is very low, our clients might not even notice it that much. And they often don't, especially e-mail and web... the virtualization folks on the other hand won't allow a single time-out to pass unnoticed :)

  2. @giovanni - I actually considered taking the "anti-SATA rant" out of this Quick-Take, but instead moved it to the editorial comment at the end. I've always had a love-hate relationship with SATA because of its excellent density/cost ratio. Fortunately, SAS is catching SATA in cost and density, and with the better signaling, extra port and reliability, the few remaining dollars of difference are becoming harder and harder to justify... If you're using SATA, I hope you're running mirror groups and not RAIDz groups; and if RAIDz, I hope it's RAIDz2 or better :) These kinds of error modes are intrinsically worse on RAIDz types.

    You raise a great point about virtualization guys - especially since admins without cross-training in SAN and VM are apt to point fingers at the network or CPU when these conditions arise. Shared storage either lifts or sinks all boats in VM. It's kind of a catch-22 for the guys "doing it on the cheap" who chose SATA over SAS on density/price and skimped by with the minimum number of RAID groups. Here's where this type of "hidden" failure comes to bite them where it hurts :) And faced with a resilver, their day is going to get much worse before it gets better...

  3. [...] This post was mentioned on Twitter by Andy Leonard, Collin C MacMillan. Collin C MacMillan said: Quick post ZFS/NexentaStor and early drive failure/consequences: http://bit.ly/cVupy9 [...]

  4. Curious if you tried using Nexenta's AutoSMART plugin and whether it provided anything useful in the diagnosis.
