The stress of disk failure


A few weeks ago my ZFS raid, containing twelve 1.5 TB Western Digital Green
drives, was degraded by a drive failure. As I also had a hot spare
added to the raid, it was quickly rebuilt and kept working fine
throughout the problems and afterwards. I ordered two new drives, and
today I thought I would swap out the defective drive and let the
replacement drive become the new hot spare. Before I started, I
detached the defective drive to make the hot spare permanently stay
in its place: ‘zpool detach DATA …’



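For reference, the general form of the detach is sketched below; the device name is a placeholder, since the exact name of my defective drive isn't shown above:

```shell
# Detach a device from the pool; ZFS then promotes the hot spare
# that was resilvered in during the rebuild to a permanent member.
# <defective-device> is a placeholder (e.g. a c0t2d0-style name).
zpool detach DATA <defective-device>

# Verify the pool layout afterwards.
zpool status DATA
```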
Before I started, I thought it would be a good idea to reboot the
server, just to see that it was working as it should before I began
tinkering. I rebooted, and the server started beeping loudly about a
missing drive (this came from the raid controller, before the OS had
started). I thought nothing of it, assuming it was the defective one.
So I started the OS to see if everything worked as it should.
A quick ‘zpool status’ showed me that I indeed had a new problem:
another drive had failed, and one of the three raidz1 groups was now
degraded, with no free hot spare left to rebuild the raid. While
running ‘zpool status’ again, my heart suddenly skipped a few beats
as I noticed a second drive in the same raidz1 group missing, and the
message now said that the zpool did not have enough replicas to
function at all. Luckily I have been in tough situations with drives
and hardware before, so I decided this was not a good time to panic.
I descended from the attic and took a break with some Coca-Cola.
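For anyone following along, the health checks I kept running look roughly like this (pool name DATA as above):

```shell
# Summarise overall pool health: prints "all pools are healthy"
# when nothing is wrong, otherwise lists only the troubled pools.
zpool status -x

# Full status for the pool, including per-vdev state
# (ONLINE/DEGRADED/FAULTED) and any resilver progress.
zpool status -v DATA
```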


I thought: why would two “new” drives fail at the same time, before I
had even touched anything? I got back up to the attic, and now one of
the missing drives was suddenly working again and some data had been
resilvered. This made me think that maybe the drives were not
defective after all; maybe they just had bad cables. I turned off and
opened the server, and checked every cable thoroughly. With close to
20 hard drives in this server there are, as you can understand, quite
a few power splitters. One of them went to the two drives that were
missing (of which one had luckily come back). I swapped out the power
splitter, and also changed the defective drive that was the original
problem. I booted the server and made the changes in the raid
controller, creating new volumes for the new drive and the replaced
one. Then I booted the OS in single-user mode.

Luckily drive number two of the two that had gone missing was still
working, and I ran ‘zpool replace DATA c0t3d0 c0t3d0’ to get the
zpool to rebuild and start using the first drive that had been
missing. Then I added the new hot spare to the zpool:
‘zpool add DATA spare c3t5d0’. After about 1.5 hours the raid had
been rebuilt and was back in normal mode with no faults.
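Put together, the recovery steps above look roughly like this; the device names are the ones from my pool, so yours will differ:

```shell
# Replace the device with itself: since the underlying volume was
# recreated in the raid controller, ZFS treats it as a fresh disk
# and resilvers the data onto it.
zpool replace DATA c0t3d0 c0t3d0

# Add the brand-new drive back in as a hot spare.
zpool add DATA spare c3t5d0

# Watch the resilver progress until the pool reports ONLINE again.
zpool status DATA
```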

Hopefully the problem with the two drives that went missing was only
the power splitter, and they won’t give any new trouble. And
hopefully it will be a long, long time before the new hot spare is
needed again.
Thank God for raid and hot spares; I do not even want to think about
losing the 12 TB raid. I do have backups of the most important stuff,
but backing up all of the data at home is quite impossible with such
amounts. Starting from scratch with all of my ripped movies and music
and so on would be a nightmare.





One Response to “The stress of disk failure”

  1. ChristianSloper Says:

    You managing to save this is God’s way of telling you that it is ok to rip.
