November 2015 – The ongoing struggle

Contents

All the RAIDs are breaking
TL;DR
How do storage timeouts work, and when are they a factor?
How to check/set drive timeouts
I’ve got a mix of drives that support scterc, some that have it disabled, and some that don’t support it. What now?
What about hardware RAIDs?
What about ZFS?
What about btrfs?
What about NAS devices?
So which drives support SCTERC/TLER and how much more do they cost?
Fart fart fart

All the RAIDs are breaking ^

I feel like I’ve been seeing a lot more threads on the linux-raid mailing list recently where people’s arrays have broken, they need help putting them back together (because they aren’t familiar with what to do in that situation), and it turns out that there’s nothing much wrong with the devices in question other than device timeouts.

When I say “a lot”, I mean, “more than I used to.”

I think the reason for the increase in failures may be that HDD vendors have been busy segregating their products into “desktop” and “RAID” editions in a somewhat arbitrary fashion, by removing features from the “desktop” editions in the drive firmware. One of the features that today’s consumer desktop drives tend to entirely lack is configurable error timeouts, also known as SCTERC, also known as TLER.

TL;DR ^

If you use redundant storage but may be using non-RAID drives, you absolutely must check them for configurable timeout support. If they don’t have it then you must increase your storage driver’s timeout to compensate, otherwise you risk data loss.

How do storage timeouts work, and when are they a factor? ^

When the operating system requests from or write to a particular drive sector and fails to do so, it keeps trying, and does nothing else while it is trying. An HDD that either does not have configurable timeouts or that has them disabled will keep doing this for quite a long time—minutes—and won’t be responding to any other command while it does that.

At some point Linux’s own timeouts will be exceeded and the Linux kernel will decide that there is something really wrong with the drive in question. It will try to reset it, and that will probably fail, because the drive will not be responding to the reset command. Linux will probably then reset the entire SATA or SCSI link and fail the IO request.

In a single drive situation (no RAID redundancy) it is probably a good thing that the drive tries really hard to get/set the data. If it really persists it just may work, and so there’s no data loss, and you are left under no illusion that your drive is really unwell and should be replaced soon.

In a multiple drive software RAID situation it’s a really bad thing. Linux MD will kick the drive out because as far as it is concerned it’s a drive that stopped responding to anything for several minutes. But why do you need to care? RAID is resilient, right? So a drive gets kicked out and added back again, it should be no problem.

Well, a lot of the time that’s true, but if you happen to hit another unreadable sector on some other drive while the array is degraded then you’ve got two drives kicked out, and so on. A bus / controller reset can also kick multiple drives out. It’s really easy to end up with an array that thinks it’s too damaged to function because of a relatively minor amount of unreadable sectors. RAID6 can’t help you here.

If you know what you’re doing you can still coerce such an array to assemble itself again and begin rebuilding, but if its component drives have long timeouts set then you may never be able to get it to rebuild fully!

What should happen in a RAID setup is that the drives give up quickly. In the case of a failed read, RAID just reads it from elsewhere and writes it back (causing a sector reallocation in the drive). The monthly scrub that Linux MD does catches these bad sectors before you have a bad time. You can monitor your reallocated sector count and know when a drive is going bad.

How to check/set drive timeouts ^

You can query the current timeout setting with smartctl like so:

# for drive in /sys/block/sd*; do drive="/dev/$(basename $drive)"; echo "$drive:"; smartctl -l scterc $drive; done

You hopefully end up with something like this:

/dev/sda:
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-4-amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
 
SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)
 
/dev/sdb:
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-4-amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
 
SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

That’s a good result because it shows that configurable error timeouts (scterc) are supported, and the timeout is set to 70 all over. That’s in centiseconds, so it’s 7 seconds.

Consumer desktop drives from a few years ago might come back with something like this though:

SCT Error Recovery Control:
           Read:     Disabled
          Write:     Disabled

That would mean that the drive supports scterc, but does not enable it on power up. You will need to enable it yourself with smartctl again. Here’s how:

# smartctl -q errorsonly -l scterc,70,70 /dev/sda

That will be silent unless there is some error.

More modern consumer desktop drives probably won’t support scterc at all. They’ll look like this:

Warning: device does not support SCT Error Recovery Control command

Here you have no alternative but to tell Linux itself to expect this drive to take several minutes to recover from an error and please not aggressively reset it or its controller until at least that time has passed. 180 seconds has been found to be longer than any observed desktop drive will try for.

# echo 180 > /sys/block/sda/device/timeout

I’ve got a mix of drives that support scterc, some that have it disabled, and some that don’t support it. What now? ^

It’s not difficult to come up with a script that leaves your drives set into their most optimal error timeout condition on each boot. Here’s a trivial example:

#!/bin/sh
 
for disk in `find /sys/block -maxdepth 1 -name 'sd*' | xargs -n 1 basename`
do
    smartctl -q errorsonly -l scterc,70,70 /dev/$disk
 
    if test $? -eq 4
    then
        echo "/dev/$disk doesn't suppport scterc, setting timeout to 180s" '/o\'
        echo 180 > /sys/block/$disk/device/timeout
    else
        echo "/dev/$disk supports scterc " '\o/'
    fi
done

If you call that from your system’s startup scripts (e.g. /etc/rc.local on Debian/Ubuntu) then it will try to set scterc to 7 seconds on every /dev/sd* block device. If it works, great. If it gets an error then it sets the device driver timeout to 180 seconds instead.

There are a couple of shortcomings with this approach, but I offer it here because it’s simple to understand.

It may do odd things if you have a /dev/sd* device that isn’t a real SATA/SCSI disk, for example if it’s iSCSI, or maybe some types of USB enclosure. If the drive is something that can be unplugged and plugged in again (like a USB or eSATA dock) then the drive may reset its scterc setting while unpowered and not get it back when re-plugged: the above script only runs at system boot time.

A more complete but more complex approach may be to get udev to do the work whenever it sees a new drive. That covers both boot time and any time one is plugged in. The smartctl project has had one of these scripts contributed. It looks very clever—for example it works out which devices are part of MD RAIDs—but I don’t use it yet myself as a simpler thing like the script above works for me.

What about hardware RAIDs? ^

A hardware RAID controller is going to set low timeouts on the drives itself, so as long as they support the feature you don’t have to worry about that.

If the support isn’t there in the drive then you may or may not be screwed there: chances are that the RAID controller is going to be smarter about how it handles slow requests and just ignore the drive for a while. If you are unlucky though you will end up in a position where some of your drives need the setting changed but you can’t directly address them with smartctl. Some brands e.g. 3ware/LSI do allow smartctl interaction through a control device.

When using hardware RAID it would be a good idea to only buy drives that support scterc.

What about ZFS? ^

I don’t know anything about ZFS and a quick look gives some conflicting advice:

Drives with scterc support don’t cost that much more, so I’d probably want to buy them and check it’s enabled if it were me.

What about btrfs? ^

As far as I can see btrfs does not disable drives, it leaves it until Linux does that, so you’re probably not at risk of losing data.

If your drives do support scterc though then you’re still best off making sure it’s set as otherwise things will crawl to a halt at the first sign of trouble.

What about NAS devices? ^

The thing about these is, they’re quite often just low-end hardware running Linux and doing Linux software RAID under the covers. With the disadvantage that you maybe can’t log in to them and change their timeout settings. This post claims that a few NAS vendors say they have their own timeouts and ignore scterc.

So which drives support SCTERC/TLER and how much more do they cost? ^

I’m not going to list any here because the list will become out of date too quickly. It’s just something to bear in mind, check for, and take action over.

Fart fart fart ^

Comments along the lines of “Always use hardware RAID” or “always use $filesystem” will be replaced with “fart fart fart,” so if that’s what you feel the need to say you should probably just do so on Twitter instead, where I will only have the choice to read them in my head as “fart fart fart.”

M	T	W	T	F	S	S
« Aug				Dec »
						1
2	3	4	5	6	7	8
9	10	11	12	13	14	15
16	17	18	19	20	21	22
23	24	25	26	27	28	29
30

The ongoing struggle

I'll get there one day.

Month: November 2015

Supermicro IPMI remote console on Ubuntu 14.04 through SSH tunnel

Linux Software RAID and drive timeouts