Confusing hardware issues at home

I’ve got this server in my loft at home that’s mainly a file server for the data we use/view/listen to here. It looks like this:

A bit of a beast. When I bought it over 4 years ago I somehow thought I’d be adding a lot more drives. Anyway.

It’s been a good, reliable bit of kit and had no problems for a long time apart from overheating in the old house, but that was a problem with the room it was in. It’s never even lost a disk. A couple of months ago, though, the PSU went pop, and ever since then it has occasionally been giving me this sort of thing:

Mar 21 13:53:16 specialbrew kernel: [5875576.400044] ata3.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Mar 21 13:53:16 specialbrew kernel: [5875576.400095] ata3.01: cmd c8/00:50:9e:a2:1d/00:00:00:00:00/f2 tag 0 dma 40960 in
Mar 21 13:53:16 specialbrew kernel: [5875576.400098]          res 40/00:01:01:4f:c2/00:00:00:00:00/10 Emask 0x4 (timeout)
Mar 21 13:53:16 specialbrew kernel: [5875576.400167] ata3.01: status: { DRDY }
Mar 21 13:53:16 specialbrew kernel: [5875576.400196] ata3: soft resetting link
Mar 21 13:53:16 specialbrew kernel: [5875576.719196] ata3.00: configured for UDMA/33
Mar 21 13:53:16 specialbrew kernel: [5875576.759036] ata3.01: configured for UDMA/100
Mar 21 13:53:16 specialbrew kernel: [5875576.759075] ata3: EH complete
Mar 21 13:53:16 specialbrew kernel: [5875576.800851] sd 2:0:0:0: [sdc] 625134827 512-byte hardware sectors (320069 MB)
Mar 21 13:53:16 specialbrew kernel: [5875576.801386] sd 2:0:0:0: [sdc] Write Protect is off
Mar 21 13:53:16 specialbrew kernel: [5875576.801418] sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
Mar 21 13:53:16 specialbrew kernel: [5875576.808855] sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Mar 21 13:53:16 specialbrew kernel: [5875576.810058] sd 2:0:1:0: [sdd] 625134827 512-byte hardware sectors (320069 MB)
Mar 21 13:53:16 specialbrew kernel: [5875576.810452] sd 2:0:1:0: [sdd] Write Protect is off
Mar 21 13:53:16 specialbrew kernel: [5875576.810482] sd 2:0:1:0: [sdd] Mode Sense: 00 3a 00 00
Mar 21 13:53:16 specialbrew kernel: [5875576.867347] sd 2:0:1:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Mar 21 13:53:16 specialbrew kernel: [5875576.871943] sd 2:0:0:0: [sdc] 625134827 512-byte hardware sectors (320069 MB)
Mar 21 13:53:16 specialbrew kernel: [5875576.873744] sd 2:0:0:0: [sdc] Write Protect is off
Mar 21 13:53:16 specialbrew kernel: [5875576.873770] sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
Mar 21 13:53:16 specialbrew kernel: [5875576.873966] sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Mar 21 13:53:16 specialbrew kernel: [5875576.874062] sd 2:0:1:0: [sdd] 625134827 512-byte hardware sectors (320069 MB)
Mar 21 13:53:16 specialbrew kernel: [5875576.874125] sd 2:0:1:0: [sdd] Write Protect is off
Mar 21 13:53:16 specialbrew kernel: [5875576.874148] sd 2:0:1:0: [sdd] Mode Sense: 00 3a 00 00
Mar 21 13:53:16 specialbrew kernel: [5875576.874195] sd 2:0:1:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

There are six drives in there, and the above messages have been seen referring to all of them at one time or another, so I don’t believe it’s as simple as a broken disk.
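To check whether the resets really are spread across every port rather than clustered on one, the exception lines can be tallied per ATA device. This is only a rough sketch: the function name `count_ata_exceptions` is made up for illustration, and the pattern assumes the log lines look like the excerpt above.

```python
import re
from collections import Counter

# Matches libata error-handler lines such as:
#   "... kernel: [5875576.400044] ata3.01: exception Emask 0x0 ..."
# Assumption: the log format matches the excerpt in this post.
EXCEPTION_RE = re.compile(r"\b(ata\d+(?:\.\d+)?): exception Emask")


def count_ata_exceptions(lines):
    """Map each ATA port/device (e.g. 'ata3.01') to its number of exception lines."""
    counts = Counter()
    for line in lines:
        match = EXCEPTION_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Two lines taken from the log excerpt above; only the first is an exception.
sample = [
    "Mar 21 13:53:16 specialbrew kernel: [5875576.400044] ata3.01: "
    "exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen",
    "Mar 21 13:53:16 specialbrew kernel: [5875576.400196] ata3: soft resetting link",
]

for device, count in sorted(count_ata_exceptions(sample).items()):
    print(device, count)
```

Any iterable of lines works, so an open log file can be passed straight in. Where the kernel log lives depends on the distribution, so the path is left to the reader.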

These errors have become more and more frequent, so today I spent some time trying to work out where the problem lies.

The way it seemed to affect all ATA busses made me think maybe the (new) PSU was underperforming, but I tried two different ones and they seem fine.

The six disks are inserted into two 3-bay Icydocks. Here’s what they look like:

They’re pretty dumb devices which just let you fit three 3.5″ disks into two 5.25″ bays. On the back they have three SATA data connectors (one for each disk), two molex power, one SATA power and a fan. I bought them because I didn’t want to buy a really expensive disk chassis for home, but I also didn’t want to screw six drives inside the case where they’d be hard to get access to.

Inside I have four of the drives connected to the motherboard’s SATA controller, and two of them connected to an additional Si3112 SATA card. This setup has been in place for over four years.

When all the drives are removed from the Icydocks and directly connected to SATA and power, everything appears to be fine. When either of the Icydocks has three disks in, the problem reappears. I then tried three disks in an Icydock and three disks directly connected, but popped one of the disks out of the Icydock. This appears to also work fine (the file systems are all RAID-10 so can stand to run with one disk missing).

I’m a bit confused by that. When I was testing the Icydocks individually, I was using the same set of three disks with each one (with the other three disks connected directly). I could believe that the disk I have now removed is bad in some way that causes the whole bus to reset, but I would have to ask why it affects the other busses, and why it doesn’t happen when it’s directly connected.

I know other people who bought Icydocks and had a real struggle getting them to behave reliably, but mine worked well from the start and have done for over four years. I could believe that one of them went bad when the power popped, even though they are very simple electro-mechanical devices, but it’s hard to believe that two of them did.

I can’t just remove the Icydocks from the picture and forget about it because that leaves six SATA drives running on the floor. 🙂 They need to be inside some form of enclosure, and I don’t want to fork out for a new enclosure or two right now if I can help it.

I’ve left it there for this evening, but I’ll have to return to it tomorrow afternoon. I’ll probably start by putting the other three disks back in their Icydock to see if the removal of that one really does fix it.

Any ideas for ways to narrow the problem down?

I hate hardware.

Update 2010-03-31

I tentatively believe I’ve tracked down the issue.

Joel wins: despite the new PSU being a bit beefier in max output than the dead one I was replacing (500W vs 384W), the new one actually had a lower limit on the 12V rail: 2.5A vs the previous 3.3A.

I scavenged a PSU from elsewhere that also had 3.3A and everything seems fine now and has been for 2 days.

I think that things worked fine outside the Icydocks because the Icydocks have fans, which are probably not very good, and suck additional power. Or else they maybe don’t do any kind of staggered spinup that might happen without them.

4 thoughts on “Confusing hardware issues at home”

    1. @neuro: I wouldn’t have expected it all to work without the Icydocks or work as it is now, with only one of the two Icydocks. Also the resets are for busses both on the motherboard and on the Si3112 card.

  1. I’d think your only option is to have the discs on the floor… It does seem unlikely that they’d both fail, but it must be similar to someone getting two flat tyres at once – really f***ing annoying and unlikely – but it has to happen.

    Hadn’t heard of icydocks before – wish I had, they look cool / useful.

  2. Sounds to me like your PSU hasn’t got enough grunt to spin them up properly at times. I’ve seen that before specifically with portable disk enclosures (might even have been Icydocks, tbh).

    When you replaced the PSU did you ensure it had enough power to drive everything?
