Linux Software RAID and drive timeouts

All the RAIDs are breaking ^

I feel like I’ve been seeing a lot more threads on the linux-raid mailing list recently where people’s arrays have broken, they need help putting them back together (because they aren’t familiar with what to do in that situation), and it turns out that there’s nothing much wrong with the devices in question other than device timeouts.

When I say “a lot”, I mean, “more than I used to.”

I think the reason for the increase in failures may be that HDD vendors have been busy segregating their products into “desktop” and “RAID” editions in a somewhat arbitrary fashion, by removing features from the “desktop” editions in the drive firmware. One of the features that today’s consumer desktop drives tend to entirely lack is configurable error timeouts, also known as SCTERC, also known as TLER.

TL;DR ^

If you use redundant storage but may be using non-RAID drives, you absolutely must check them for configurable timeout support. If they don’t have it then you must increase your storage driver’s timeout to compensate, otherwise you risk data loss.

How do storage timeouts work, and when are they a factor? ^

When the operating system asks a drive to read or write a particular sector and the drive fails to do so, the drive keeps retrying, and does nothing else while it is retrying. An HDD that either does not have configurable timeouts or that has them disabled will keep this up for quite a long time (minutes) and won’t respond to any other command while it does so.

At some point Linux’s own timeouts will be exceeded and the Linux kernel will decide that there is something really wrong with the drive in question. It will try to reset it, and that will probably fail, because the drive will not be responding to the reset command. Linux will probably then reset the entire SATA or SCSI link and fail the IO request.
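
For reference, you can see how long the Linux SCSI layer will wait on a command by reading the per-device timeout from sysfs. It’s in seconds and typically defaults to 30, which is far less than the minutes a desktop drive may spend retrying. Assuming the drive is /dev/sda:

# cat /sys/block/sda/device/timeout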

In a single drive situation (no RAID redundancy) it is probably a good thing that the drive tries really hard to read or write the data. If it really persists it just may succeed, so there’s no data loss, and you are left in no doubt that your drive is unwell and should be replaced soon.

In a multiple drive software RAID situation it’s a really bad thing. Linux MD will kick the drive out because as far as it is concerned it’s a drive that stopped responding to anything for several minutes. But why do you need to care? RAID is resilient, right? So a drive gets kicked out and added back again, it should be no problem.

Well, a lot of the time that’s true, but if you happen to hit another unreadable sector on some other drive while the array is degraded then you’ve got two drives kicked out, and so on. A bus or controller reset can also kick multiple drives out. It’s really easy to end up with an array that thinks it’s too damaged to function because of a relatively small number of unreadable sectors. RAID6 can’t help you here.

If you know what you’re doing you can still coerce such an array to assemble itself again and begin rebuilding, but if its component drives have long timeouts set then you may never be able to get it to rebuild fully!

What should happen in a RAID setup is that the drives give up quickly. In the case of a failed read, RAID just reads it from elsewhere and writes it back (causing a sector reallocation in the drive). The monthly scrub that Linux MD does catches these bad sectors before you have a bad time. You can monitor your reallocated sector count and know when a drive is going bad.
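
If you want to watch this happening, you can start a check by hand and keep an eye on the relevant SMART counters yourself. A quick sketch, assuming an array called md0 and a member drive /dev/sda:

# echo check > /sys/block/md0/md/sync_action
# cat /proc/mdstat
# smartctl -A /dev/sda | grep -i -e Reallocated -e Pending

The first command kicks off the same kind of scrub as the monthly one, the second shows its progress, and the third shows the reallocated and pending sector counts for that drive.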

How to check/set drive timeouts ^

You can query the current timeout setting with smartctl like so:

# for drive in /sys/block/sd*; do drive="/dev/$(basename $drive)"; echo "$drive:"; smartctl -l scterc $drive; done

You hopefully end up with something like this:

/dev/sda:
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-4-amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
 
SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)
 
/dev/sdb:
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-4-amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
 
SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

That’s a good result because it shows that configurable error timeouts (scterc) are supported, and the timeout is set to 70 all over. That’s in units of 100 milliseconds, so it means 7 seconds.

Consumer desktop drives from a few years ago might come back with something like this though:

SCT Error Recovery Control:
           Read:     Disabled
          Write:     Disabled

That would mean that the drive supports scterc, but does not enable it on power up. You will need to enable it yourself with smartctl again. Here’s how:

# smartctl -q errorsonly -l scterc,70,70 /dev/sda

That will be silent unless there is some error.
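
On most drives this setting doesn’t survive a power cycle, so you’ll want to reapply it at boot (see the script further down). To confirm it took effect right now, just run the query again:

# smartctl -l scterc /dev/sda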

More modern consumer desktop drives probably won’t support scterc at all. They’ll look like this:

Warning: device does not support SCT Error Recovery Control command

Here you have no alternative but to tell Linux itself to expect this drive to take several minutes to recover from an error and please not aggressively reset it or its controller until at least that time has passed. 180 seconds has been found to be longer than any observed desktop drive will try for.

# echo 180 > /sys/block/sda/device/timeout

I’ve got a mix of drives that support scterc, some that have it disabled, and some that don’t support it. What now? ^

It’s not difficult to come up with a script that puts each of your drives into its best available error timeout configuration on each boot. Here’s a trivial example:

#!/bin/sh
 
for disk in `find /sys/block -maxdepth 1 -name 'sd*' | xargs -n 1 basename`
do
    smartctl -q errorsonly -l scterc,70,70 /dev/$disk
 
    if test $? -eq 4
    then
        echo "/dev/$disk doesn't suppport scterc, setting timeout to 180s" '/o\'
        echo 180 > /sys/block/$disk/device/timeout
    else
        echo "/dev/$disk supports scterc " '\o/'
    fi
done

If you call that from your system’s startup scripts (e.g. /etc/rc.local on Debian/Ubuntu) then it will try to set scterc to 7 seconds on every /dev/sd* block device. If it works, great. If it gets an error then it sets the device driver timeout to 180 seconds instead.
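
For example, if you save the script as /usr/local/sbin/set-drive-timeouts (a path picked purely for illustration) and make it executable, you would just add a line like this to /etc/rc.local before its final exit 0:

/usr/local/sbin/set-drive-timeouts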

There are a couple of shortcomings with this approach, but I offer it here because it’s simple to understand.

It may do odd things if you have a /dev/sd* device that isn’t a real SATA/SCSI disk, for example if it’s iSCSI, or maybe some types of USB enclosure. If the drive is something that can be unplugged and plugged in again (like a USB or eSATA dock) then the drive may reset its scterc setting while unpowered and not get it back when re-plugged: the above script only runs at system boot time.

A more complete but more complex approach may be to get udev to do the work whenever it sees a new drive. That covers both boot time and any time one is plugged in. The smartctl project has had one of these scripts contributed. It looks very clever—for example it works out which devices are part of MD RAIDs—but I don’t use it yet myself as a simpler thing like the script above works for me.
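
If you did want to roll your own instead, a minimal sketch of such a udev rule is below. The helper path is made up; it would apply the same smartctl/timeout logic as the boot script above to the single device udev passes it in %k:

# /etc/udev/rules.d/60-drive-timeouts.rules
ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd*", ENV{DEVTYPE}=="disk", RUN+="/usr/local/sbin/set-drive-timeout %k"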

What about hardware RAIDs? ^

A hardware RAID controller is going to set low timeouts on the drives itself, so as long as they support the feature you don’t have to worry about that.

If the support isn’t there in the drive then you may or may not be screwed: chances are that the RAID controller is going to be smarter about how it handles slow requests and will just ignore the drive for a while. If you are unlucky, though, you will end up in a position where some of your drives need the setting changed but you can’t directly address them with smartctl. Some brands, e.g. 3ware/LSI, do allow smartctl interaction through a control device.
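
The exact invocation varies by controller. As an illustration (the device names and port numbers here are made up, and whether the SCT command actually gets passed through depends on the controller), drives behind 3ware and MegaRAID controllers are addressed along these lines:

# smartctl -l scterc,70,70 -d 3ware,0 /dev/twa0
# smartctl -l scterc,70,70 -d megaraid,0 /dev/sda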

When using hardware RAID it would be a good idea to only buy drives that support scterc.

What about ZFS? ^

I don’t know anything about ZFS, and a quick look gives some conflicting advice.

Drives with scterc support don’t cost that much more, so I’d probably want to buy them and check it’s enabled if it were me.

What about btrfs? ^

As far as I can see, btrfs does not kick drives out itself; it leaves that to Linux, so you’re probably not at risk of losing data.

If your drives do support scterc though then you’re still best off making sure it’s set as otherwise things will crawl to a halt at the first sign of trouble.

What about NAS devices? ^

The thing about these is, they’re quite often just low-end hardware running Linux and doing Linux software RAID under the covers, with the disadvantage that you may not be able to log in to them and change their timeout settings. This post claims that a few NAS vendors say they have their own timeouts and ignore scterc.

So which drives support SCTERC/TLER and how much more do they cost? ^

I’m not going to list any here because the list will become out of date too quickly. It’s just something to bear in mind, check for, and take action over.

Fart fart fart ^

Comments along the lines of “Always use hardware RAID” or “always use $filesystem” will be replaced with “fart fart fart,” so if that’s what you feel the need to say you should probably just do so on Twitter instead, where my only option will be to read them in my head as “fart fart fart.”

13 thoughts on “Linux Software RAID and drive timeouts”

  1. This is venturing dangerously into “fart fart fart” advocacy territory, however I’ve found that ZFS raidz is much more forgiving than the likes of Linux mdraid in the face of marginal disks. Non-TLER disks are still a pain in that they’ll get prematurely dropped from the array on attempts to read a dodgy block, but re-adding the disk causes a “resilver” that essentially just replays the data written since the disk was last seen in the array, and not a full-disk rebuild, which is much less stressful on the array, and also appears to use fresh LBAs so avoids the bad blocks. This is particularly helpful for disks connected over USB which has its own exciting failure modes that cause disks to fail to respond to commands even though the disk itself is fine.

    Resilvering works even if the whole array had been dropped due to no longer having a quorum of “good” disks (which is already game over for Linux mdraid), and no data will be lost provided the bad sectors are not so extensive that quorum is lost for the individual blocks. This is rather contrary to my experience with Linux mdraid, which I now refuse to use except where it is required for small boot/root volumes.

  2. Some udev configuration files to set proper timeouts have been contributed upstream and to Debian, but neither upstream nor the distro maintainers have managed to update the udev config files in their packages.

    mdadm https://bugs.debian.org/780207 The default HDD block error correction timeouts make entire! drives fail + high risk of data loss during array re-build

    smartmontools https://bugs.debian.org/780162 general debian base-system fix: default HDD timeouts cause data loss or corruption (silent controller resets)

  3. Thanks for this very interesting article.
    I’m using an SSD (Intel SSDSA2MH08) as the system disk (without redundancy) which does not support SCT commands: should I set the kernel timeout to 180, or is it useless for an SSD?

    1. JDG,

      You only have one device so it does not matter. The purpose of increasing the SCSI timeout is so that the drive can return a failure and allow another layer (usually MD) to recover, e.g. by rereading the data from a redundant location and writing it back over the top of the dead sectors. You have no other layer, so you don’t really care whether the drive appears locked up for minutes and is entirely dropped, or returns an error swiftly, at which point the kernel complains and usually remounts the filesystem read-only.

      On the one hand it can be slightly nicer to have swift error and read-only remount, but on the other hand when you have no redundancy you usually do want the single drive to try the hardest it can to read the data as it is the only copy of the data that you have. In your situation I’d probably leave it as is and make sure I had good backups.

  4. When trying to set scterc on an external drive in a USB dock, the error code is 2 (device open failed), but the script only considers value 4, so the 180 s timeout won’t be set. I am using this modified version:

    #!/bin/bash

    for disk in $(find /sys/block -maxdepth 1 -name 'sd*' -exec basename {} \;); do
        smartctl -q errorsonly -l scterc,70,70 /dev/$disk
        ret=$?

        # Exit bit 0 set: smartctl couldn't parse the command line; skip this device.
        if [ $((ret & 1)) -ne 0 ]; then
            continue
        fi

        # Exit bit 1 set: device open failed; retry via the SAT pass-through
        # (helps with USB enclosures and docks).
        if [ $((ret & 2)) -ne 0 ]; then
            smartctl -d sat -q errorsonly -l scterc,70,70 /dev/$disk
            ret=$?
        fi

        # Exit bits 1 or 2 still set: treat as no scterc support and raise the
        # kernel's timeout instead.
        if [ $((ret & 6)) -ne 0 ]; then
            echo "/dev/$disk doesn't support scterc, setting timeout to 180s" '/o\'
            echo 180 > /sys/block/$disk/device/timeout
        else
            echo "/dev/$disk supports scterc " '\o/'
        fi
    done

  5. What about NVMe drives? They too seem to have a kernel driver timeout, although it lives at a different path: `/sys/block/nvme0n1/queue/io_timeout`

    I assume the same applies with regard to the timeout issue, i.e. the driver will reset the disk (or the whole controller) if the disk doesn’t support SCTERC (or hasn’t had it enabled)?
    How do you check for SCTERC on NVMe devices? `smartctl -l scterc /dev/nvme0n1` doesn’t work on NVMe devices.

    1. I’m afraid I don’t know the answer to any of this, although I do have a few NVMe drives now so I probably should find out. It’s perhaps a question for the linux-raid mailing list…
