Recap ^
In a previous article I demonstrated that Linux RAID-10 lacked an optimisation for non-rotational devices that was present in RAID-1.
On a system with imbalanced devices, such as mine with one SATA SSD and one PCIe NVMe drive, this could cause RAID-10 to perform 3 times worse than RAID-1 at random reads.
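For anyone who didn't read that article, the optimisation in question is in RAID-1's read balancing: when any mirror is a non-rotational device, RAID-1 sends the read to the copy with the fewest in-flight requests rather than the one with the shortest seek distance. Below is a minimal user-space sketch of that idea, assuming nothing beyond this post; it is not the kernel code, and the device names and numbers are made up for illustration.

```c
/*
 * Minimal user-space sketch (not kernel code) of the read-balance idea
 * RAID-1 already had: if any mirror is non-rotational, pick the copy with
 * the fewest in-flight requests; otherwise fall back to shortest seek
 * distance. Device names and numbers are made up for illustration.
 */
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>

struct mirror {
	const char *name;
	bool nonrot;             /* SSD/NVMe rather than spinning disk */
	unsigned int pending;    /* in-flight requests on this device */
	unsigned long distance;  /* |request sector - last head position| */
};

/* Return the index of the mirror a read should be sent to. */
static int pick_read_mirror(const struct mirror *m, int n)
{
	int best_pending_slot = -1, best_dist_slot = -1;
	unsigned int min_pending = UINT_MAX;
	unsigned long best_dist = ULONG_MAX;
	bool has_nonrot = false;

	for (int i = 0; i < n; i++) {
		if (m[i].nonrot) {
			has_nonrot = true;
			if (m[i].pending < min_pending) {
				min_pending = m[i].pending;
				best_pending_slot = i;
			}
		}
		if (m[i].distance < best_dist) {
			best_dist = m[i].distance;
			best_dist_slot = i;
		}
	}
	/* With an SSD in the mirror, load matters more than head position. */
	return has_nonrot ? best_pending_slot : best_dist_slot;
}

int main(void)
{
	struct mirror mirrors[] = {
		{ "nvme0n1 (PCIe NVMe)", true, 1, 5000 },
		{ "sda (SATA SSD)",      true, 7, 10   },
	};

	printf("read goes to %s\n",
	       mirrors[pick_read_mirror(mirrors, 2)].name);
	return 0;
}
```

The head-position heuristic makes little sense for SSDs, so without the least-pending rule the faster device can end up underused, which is what the earlier results showed.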
A possible fix ^
Kernel developer Guoqing Jiang contacted me with a patch that adds the same optimisation, already present in RAID-1, to RAID-10.
Updated performance figures ^
I’ve applied Guoqing’s patch and re-run the tests for the RAID-10 targets. Figures for the other targets are taken from the previous post for comparison.
Sequential IO ^
All figures are throughput in MiB/s.

Test | Fast RAID-1 | Fast RAID-10 | Fast RAID-10 (patched) | Slow RAID-1 | Slow RAID-10 | Slow RAID-10 (patched)
---|---|---|---|---|---|---
Read | 1,237 | 1,682 | 2,141 | 198 | 188 | 211
Write | 321 | 321 | 321 | 18 | 19 | 19
The patched RAID-10 is the clear winner for sequential IO. It even performs about 27% faster than the unpatched variant.
Random IO ^
All figures are IOPS.

Test | Fast RAID-1 | Fast RAID-10 | Fast RAID-10 (patched) | Slow RAID-1 | Slow RAID-10 | Slow RAID-10 (patched)
---|---|---|---|---|---|---
Random Read | 602,000 | 208,000 | 602,000 | 501 | 501 | 487
Random Write | 82,200 | 82,200 | 82,200 | 25 | 21 | 71
The patched RAID-10’s random read performance is now indistinguishable from that of RAID-1: almost 3 times faster than without the patch!
I am unable to explain why RAID-10 random writes on the slow devices (HDDs) are so much better than before.
The patch ^
Guoqing Jiang’s patch is as follows, in case anyone wants to test it. Guoqing has only compile-tested it, as they don’t have the required hardware. I have tested it and it seems okay, but don’t use it on any data you care about yet.
```diff
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 25e97de36717..2ebe49b18aeb 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -745,15 +745,19 @@ static struct md_rdev *read_balance(struct r10conf *conf,
 	int sectors = r10_bio->sectors;
 	int best_good_sectors;
 	sector_t new_distance, best_dist;
-	struct md_rdev *best_rdev, *rdev = NULL;
+	struct md_rdev *best_dist_rdev, *best_pending_rdev, *rdev = NULL;
 	int do_balance;
-	int best_slot;
+	int best_dist_slot, best_pending_slot;
+	int has_nonrot_disk = 0;
+	unsigned int min_pending;
 	struct geom *geo = &conf->geo;
 
 	raid10_find_phys(conf, r10_bio);
 	rcu_read_lock();
-	best_slot = -1;
-	best_rdev = NULL;
+	best_dist_slot = -1;
+	min_pending = UINT_MAX;
+	best_dist_rdev = NULL;
+	best_pending_rdev = NULL;
 	best_dist = MaxSector;
 	best_good_sectors = 0;
 	do_balance = 1;
@@ -775,6 +779,8 @@ static struct md_rdev *read_balance(struct r10conf *conf,
 		sector_t first_bad;
 		int bad_sectors;
 		sector_t dev_sector;
+		unsigned int pending;
+		bool nonrot;
 
 		if (r10_bio->devs[slot].bio == IO_BLOCKED)
 			continue;
@@ -811,8 +817,8 @@ static struct md_rdev *read_balance(struct r10conf *conf,
 					first_bad - dev_sector;
 				if (good_sectors > best_good_sectors) {
 					best_good_sectors = good_sectors;
-					best_slot = slot;
-					best_rdev = rdev;
+					best_dist_slot = slot;
+					best_dist_rdev = rdev;
 				}
 				if (!do_balance)
 					/* Must read from here */
@@ -825,14 +831,23 @@ static struct md_rdev *read_balance(struct r10conf *conf,
 		if (!do_balance)
 			break;
 
-		if (best_slot >= 0)
+		nonrot = blk_queue_nonrot(bdev_get_queue(rdev->bdev));
+		has_nonrot_disk |= nonrot;
+		pending = atomic_read(&rdev->nr_pending);
+		if (min_pending > pending && nonrot) {
+			min_pending = pending;
+			best_pending_slot = slot;
+			best_pending_rdev = rdev;
+		}
+
+		if (best_dist_slot >= 0)
 			/* At least 2 disks to choose from so failfast is OK */
 			set_bit(R10BIO_FailFast, &r10_bio->state);
 		/* This optimisation is debatable, and completely destroys
 		 * sequential read speed for 'far copies' arrays. So only
 		 * keep it for 'near' arrays, and review those later.
 		 */
-		if (geo->near_copies > 1 && !atomic_read(&rdev->nr_pending))
+		if (geo->near_copies > 1 && !pending)
 			new_distance = 0;
 
 		/* for far > 1 always use the lowest address */
@@ -841,15 +856,21 @@ static struct md_rdev *read_balance(struct r10conf *conf,
 		else
 			new_distance = abs(r10_bio->devs[slot].addr -
 					   conf->mirrors[disk].head_position);
+
 		if (new_distance < best_dist) {
 			best_dist = new_distance;
-			best_slot = slot;
-			best_rdev = rdev;
+			best_dist_slot = slot;
+			best_dist_rdev = rdev;
 		}
 	}
 	if (slot >= conf->copies) {
-		slot = best_slot;
-		rdev = best_rdev;
+		if (has_nonrot_disk) {
+			slot = best_pending_slot;
+			rdev = best_pending_rdev;
+		} else {
+			slot = best_dist_slot;
+			rdev = best_dist_rdev;
+		}
 	}
 	if (slot >= 0) {
```
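To make the long diff easier to follow, the heart of the change is the final slot selection in read_balance(): if any mirror is non-rotational the least-loaded candidate wins, otherwise the old nearest-head behaviour is kept. This is condensed from the last hunk above, with explanatory comments added by me:

```c
/* Condensed from the final hunk above: once all slots have been examined,
 * pick the least-loaded copy whenever the array contains a non-rotational
 * device, and only fall back to the old nearest-head rule otherwise. */
if (slot >= conf->copies) {
	if (has_nonrot_disk) {
		slot = best_pending_slot;	/* fewest in-flight requests */
		rdev = best_pending_rdev;
	} else {
		slot = best_dist_slot;		/* shortest seek distance */
		rdev = best_dist_rdev;
	}
}
```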
Comments ^
Interesting article, thanks! Out of interest, do you know if this ever got applied to upstream Linux RAID? Also, you don’t explicitly say, but are you using RAID-10 with just two devices?
Yes, it was applied: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/md/raid10.c?id=e9eeba28a1e01a55b49cdcf9c7a346d2aaa0aa7d
Because of this issue, and not wanting to run custom kernels, I switched to RAID-1. I still have servers deployed that use RAID-10 on fewer than three devices, though.
It doesn’t really make much difference when all the devices are of similar performance.