Reasons to Love and Hate ZFS

So yesterday – inspired by Chris Gerhard – I installed my own homebrew take-a-snapshot-every-five-minutes script on my shiny new home NAS box. Everything has been going swimmingly, so I decided to cut over from my old striped-external-disk solution on the iMac to the new 2.5TB RAID-Z on the Solaris box.

After all, what could possibly go wrong?

So: yesterday I copied over the entire iMac data store – nearly 500GB – to the AMD box’s ZFS filesystem and snapshotted it for good measure, independent of the script.[0] Then – and this is where I really goofed – I erased and reformatted the old iMac RAID array.
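
For the record, the copy-and-snapshot step was nothing exotic; something of this shape would do it – the hostname, paths and snapshot name here are illustrative, not a transcript of what I actually typed:

# on the iMac: push everything across to the Solaris box
rsync -a /Volumes/Data/ suzi:/tank/local/alecm/

# then, on the Solaris box: a belt-and-braces snapshot, independent of the script
zfs snapshot tank/local/alecm@post-migration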

So it was karma this morning when I awoke to find a moderately wedged Solaris box, and skimmed through the /var/adm/messages file to find:

(heavily edited for repetitititition)

Sep 13 05:33:15 suzi scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@d,1/ide@0 (ata4):
Sep 13 05:33:15 suzi timeout: abort request, target=0 lun=0

Sep 13 05:33:16 suzi scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@d/ide@1 (ata3):
Sep 13 05:33:16 suzi timeout: abort request, target=0 lun=0

Sep 13 05:33:16 suzi scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@d/ide@0 (ata2):
Sep 13 05:33:16 suzi timeout: abort request, target=0 lun=0

Sep 13 05:33:16 suzi gda: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@d,1/ide@0/cmdk@0,0 (Disk2):
Sep 13 05:33:16 suzi Error for command 'write sector' Error Level: Informational
Sep 13 05:33:16 suzi gda: [ID 107833 kern.notice] Sense Key: aborted command
Sep 13 05:33:16 suzi gda: [ID 107833 kern.notice] Vendor 'Gen-ATA ' error code: 0x3

Sep 13 05:33:16 suzi gda: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@d/ide@1/cmdk@0,0 (Disk1):
Sep 13 05:33:16 suzi Error for command 'write sector' Error Level: Informational
Sep 13 05:33:16 suzi gda: [ID 107833 kern.notice] Sense Key: aborted command
Sep 13 05:33:16 suzi gda: [ID 107833 kern.notice] Vendor 'Gen-ATA ' error code: 0x3

Sep 13 05:33:16 suzi gda: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@d/ide@0/cmdk@0,0 (Disk4):
Sep 13 05:33:16 suzi Error for command 'write sector' Error Level: Informational
Sep 13 05:33:16 suzi gda: [ID 107833 kern.notice] Sense Key: aborted command
Sep 13 05:33:16 suzi gda: [ID 107833 kern.notice] Vendor 'Gen-ATA ' error code: 0x3

Sep 13 05:40:06 suzi fmd: [ID 441519 daemon.error] SUNW-MSG-ID: ZFS-8000-GH, TYPE: Fault, VER: 1, SEVERITY: Major
Sep 13 05:40:06 suzi EVENT-TIME: Thu Sep 13 05:40:06 BST 2007
Sep 13 05:40:06 suzi PLATFORM: System Product Name, CSN: System Serial Number, HOSTNAME: suzi
Sep 13 05:40:06 suzi SOURCE: zfs-diagnosis, REV: 1.0
Sep 13 05:40:06 suzi EVENT-ID: 5ad39b52-c2e7-6d53-b937-d43694ed2568
Sep 13 05:40:06 suzi DESC: The number of checksum errors associated with a ZFS device
Sep 13 05:40:06 suzi exceeded acceptable levels. Refer to http://sun.com/msg/ZFS-8000-GH for more information.
Sep 13 05:40:06 suzi AUTO-RESPONSE: The device has been marked as degraded. An attempt
Sep 13 05:40:06 suzi will be made to activate a hot spare if available.
Sep 13 05:40:06 suzi IMPACT: Fault tolerance of the pool may be compromised.
Sep 13 05:40:06 suzi REC-ACTION: Run 'zpool status -x' and replace the bad device.

…which was deeply scary. Scarier still was the output of zpool status:

pool: tank
state: ONLINE
status: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
see: http://www.sun.com/msg/ZFS-8000-8A
scrub: none requested
[...]
errors: Permanent errors have been detected in the following files:
<metadata>:<0x305>
tank/local/alecm@quick.20070913.053500:<0x0>

Let me put it to you like this: 15 years of my life are on those spindles – and for that moment they contained the only copy. There was no backup from which to restore.

I didn’t panic – I’m not given to panic in such circumstances – but I downed the machine and thought about it very hard. And swore, rather a lot. Especially about the lack of coffee.

At one point zpool was calling the tank pool unmountable: three drives missing, two “degraded”, and not enough “replicas” to bring the pool back up.

Start with the basics: I swapped cables around and checked the SATA power leads were properly seated (I had had one come loose before). To cut a long story short, the next few hours of research were deeply fraught, with multiple reboots, testing hypotheses that the motherboard’s SATA controller might be fried, or that a particular disk might be fried because it was making ticking noises (it’s useful, having a stethoscope) – and so forth.

Particularly worrying was the “cascade” failure – three disks seemed to have died at the same time, although only one of them was making the curious ticking noise – and what really confused me was that the names of the faulty disks in /var/adm/messages seemed to bear no relationship to the ones which zpool was reporting as faulty.
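
With hindsight, there is a way to map between the two naming schemes: the /dev/dsk entries that zpool reports are just symlinks back to the physical device paths that appear in the messages file. Something like this – the c-numbers in the output below are invented for illustration, yours will differ:

# list the symlinks behind the cXdY names that zpool uses
ls -l /dev/dsk/c*d0s0

# hypothetical output, pointing back at the paths from /var/adm/messages:
# c1d0s0 -> ../../devices/pci@0,0/pci-ide@d/ide@0/cmdk@0,0:a
# c2d0s0 -> ../../devices/pci@0,0/pci-ide@d,1/ide@0/cmdk@0,0:a

# per-device error counters are also worth a look
iostat -En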

In the end I bet each-way on two possibilities, and fixed both:

  1. that the daisy-chained power lead driving a bank of three disks was at fault
  2. that the SATA data leads were too close/tied together, and crosstalk under heavy load was causing ‘sense errors’

So I separated all the SATA cables out so they have air gaps[1], and will run them separately next time I service the machine; I then reconnected the power cable using different plugs – and the “ticking” stopped, so I am betting that the tail-end power connector was flaky, leading to an undervolt on one or more of the drives.

But this still left the toasted zpool.

The machine booted and mounted the filesystems, but did not appear to enjoy the experience: occasionally it would hang. I was curiously pleased that it was complaining about having lost metadata, because although that would surely require a rebuild:

Unfortunately, the data cannot be repaired, and the only choice to repair the data is to restore the pool from backup.

…at least it gave me hope that data could be recovered without loss before having to rebuild the pool.

So once zpool could see all the disks again, I ran zpool clear tank and rebooted. And eventually it would hang. So I rebooted again. And it hung. And again. And again.

This was overkill, but in between each reboot I had time to rsync a little more data onto the reformatted iMac RAID system; and gradually, miracle of all miracles, I built up a copy of all the unreplicated data.
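
The salvage itself was just repeated, resumable rsync pulls, re-run after each crash and picking up roughly where the last one stopped; the shape of it (hostname and paths illustrative):

# on the iMac, after each reboot of the Solaris box:
rsync -a --partial suzi:/tank/local/alecm/ /Volumes/RAID/alecm/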

So now I had a corrupt zpool and nothing to lose, so I could afford to experiment. Noting that the corrupt metadata pertained to an (inherently read-only) snapshot:

tank/local/alecm@quick.20070913.053500:<0x0>

– I destroyed all the snapshots on the machine, leaving only the filesystems. Then I iterated:

  1. zpool scrub tank
  2. zpool clear tank
  3. lather, rinse, repeat…
…until magically it sprang back to life:
22:14:43 suzi:~ $ zpool status -x
all pools are healthy
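
For anyone else staring at the same message, the whole work-around boils down to something like the following. The pool name is mine, and whether it works for you depends entirely on where your corruption lives – treat it as a sketch, not a recipe:

# destroy every snapshot (the corrupt metadata lived in one of mine)
# destructive: eyeball the list before piping it into destroy
zfs list -H -t snapshot -o name | xargs -n 1 zfs destroy

# then loop: scrub, inspect, clear the error counters, repeat
zpool scrub tank
zpool status -v tank      # wait for the scrub to finish, see what it found
zpool clear tank
zpool status -x           # repeat until this says "all pools are healthy"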

I am still gonna rebuild the pool – not taking chances – but the data is safe and I can take time to hammer the pool thoroughly before committing my archive data to it as a primary store.

First thing I shall do is reinstate the snapshot script.
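
The script itself is still in beta[0], but the core of any such thing is little more than a timestamped zfs snapshot fired from cron; a minimal sketch – not the real script, and the dataset name and snapshot prefix are just for illustration:

#!/bin/sh
# quicksnap.sh – minimal sketch of a five-minute snapshot script
FS=tank/local/alecm                        # dataset to protect
STAMP=`date '+%Y%m%d.%H%M%S'`              # e.g. 20070913.053500
/usr/sbin/zfs snapshot ${FS}@quick.${STAMP}

# crontab entry (Solaris cron has no */5, so spell the minutes out):
# 0,5,10,15,20,25,30,35,40,45,50,55 * * * * /usr/local/bin/quicksnap.sh

Snapshots are cheap in ZFS – copy-on-write means they cost next to nothing until you start overwriting or deleting data – which is what makes a five-minute cadence sane in the first place.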

So why do I love ZFS?

Well I can guarantee I have not lost any data to corruption, or indeed any data at all. No files missing, none with bad blocks. If there were, it would have told me. And I managed to salvage the situation.

So why hate it?

Well if the documentation – and the tool itself – tells you “action: Destroy the pool and restore from backup.” then you will pretty obviously assume you are screwed. Not merely is that defeatist. It is (potentially) wrong – so if you get stuck in this situation, I recommend you consider having a go at working around it.

It worked for me.


[0] Which I am beta-testing and will release once I am happy.
[1] Chatting with Chris Gerhard he confirmed with some research: SATA sucks for crosstalk, the cables are unshielded, under no account tie them together “to improve airflow” because it’ll bugger your system. Keep the data leads away from the power leads, too…

8 thoughts on “Reasons to Love and Hate ZFS”

  1. Ben

    “Chatting with Chris Gerhard he confirmed with some research: SATA sucks for crosstalk, the cables are unshielded, under no account tie them together “to improve airflow” because it’ll bugger your system. Keep the data leads away from the power leads, too…”

    Thanks for the heads-up on that one!! Okay, I don’t have 500GB of data lying around to get corrupted, but still, knowing that SATA has a weakness like this is very helpful.

    ttfn

  2. Wes W.

    Excellent commentary. Some time ago I had similar odd disk error messages with a pair of new “nearline” Seagate SATA drives and a known-good SATA controller. I never did figure that one out and just let ZFS keep my data safe…though I did a lot more testing on it before moving my years of archived data there! ;) And yes, I had tied the pair of SATA cables, one to another, to the CPU heatsink to keep the cables out of the CPU cooler fans and just generally tidy things up. Now knowing a probable cause of those random errors is great!

  3. Jim

    I don’t know much about ZFS apart from the basic concepts, but that sounds absolutely terrifying. However, the fact that you were able to recover the data at all is impressive.

  4. JT

    Wow, that sucks. Good on you for getting your data back. When my external drive crashed I took it to CBL data recovery and they managed to save my photos – www cbltech.com. Unfortunately I didn’t know enough to do it myself, but it’s encouraging to hear that DIY solutions are still available in some situations. JT

  5. Jerie

    “Well I can guarantee I have not lost any data to corruption, or indeed any data at all. No files missing, none with bad blocks. If there were, it would have told me. And I managed to salvage the situation.”

    Well, you did remove your snapshots.

    “Well if the documentation – and the tool itself – tells you “action: Destroy the pool and restore from backup.” then you will pretty obviously assume you are screwed. Not merely is that defeatist. It is (potentially) wrong – so if you get stuck in this situation, I recommend you consider having a go at working around it.”

    Yes, I had this when my device names changed. Of course all the drives were connected, but now they were called ad4/ad6/ad8/ad10 instead of ad4/ad5/ad6/ad7. Later a few were called da0/da1/da2/da3. But they were the exact same disks; I had merely switched them to different controllers.

    “Thanks for the heads-up on that one!!”

    Indeed. My NAS (w/ZFS et al) does have them rolled up until they reach the drive bay, because they are long cables – too long for my liking. The reasons for doing so are indeed airflow (and hardware administration), but now that I know they are unshielded (quite a d’oh) I should reconsider. I wonder, is a multi-lane cable also unshielded?

  6. Dick Davies

    Had something similar happen last week. To be fair, the error does say ‘you may be able to restore your lost files’ (which is what happened to me), but for some reason the ‘ALL YOUR DATA IS TOAST’ message drew my eye first :)

    In my case, I had a bad read from both sides of the mirror when our SAN had a total outage and ZFS decided everything was ruined. A ‘zpool scrub’ cleared things up OK.

  7. Lucia

    You did so many stupidities ‘managing’ that problem that only the robustness of ZFS saved you from yourself. Next time have backups.

