Recap ^
In a previous article I demonstrated that Linux RAID-10 lacked an optimisation for non-rotational devices that was present in RAID-1.
On a system with imbalanced devices, such as mine with one SATA SSD and one PCIe NVMe drive, this could cause RAID-10 to perform 3 times worse than RAID-1 at random reads.
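For anyone who didn't read that article, the optimisation in question is in RAID-1's read balancing: when any mirror is a non-rotational device, RAID-1 sends the read to the copy with the fewest in-flight requests rather than the one with the shortest seek distance. Below is a minimal user-space sketch of that idea, assuming nothing beyond this post; it is not the kernel code, and the device names and numbers are made up for illustration.

```c
/*
 * Minimal user-space sketch (not kernel code) of the read-balance idea
 * RAID-1 already had: if any mirror is non-rotational, pick the copy with
 * the fewest in-flight requests; otherwise fall back to shortest seek
 * distance. Device names and numbers are made up for illustration.
 */
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>

struct mirror {
	const char *name;
	bool nonrot;             /* SSD/NVMe rather than spinning disk */
	unsigned int pending;    /* in-flight requests on this device */
	unsigned long distance;  /* |request sector - last head position| */
};

/* Return the index of the mirror a read should be sent to. */
static int pick_read_mirror(const struct mirror *m, int n)
{
	int best_pending_slot = -1, best_dist_slot = -1;
	unsigned int min_pending = UINT_MAX;
	unsigned long best_dist = ULONG_MAX;
	bool has_nonrot = false;

	for (int i = 0; i < n; i++) {
		if (m[i].nonrot) {
			has_nonrot = true;
			if (m[i].pending < min_pending) {
				min_pending = m[i].pending;
				best_pending_slot = i;
			}
		}
		if (m[i].distance < best_dist) {
			best_dist = m[i].distance;
			best_dist_slot = i;
		}
	}
	/* With an SSD in the mirror, load matters more than head position. */
	return has_nonrot ? best_pending_slot : best_dist_slot;
}

int main(void)
{
	struct mirror mirrors[] = {
		{ "nvme0n1 (PCIe NVMe)", true, 1, 5000 },
		{ "sda (SATA SSD)",      true, 7, 10   },
	};

	printf("read goes to %s\n",
	       mirrors[pick_read_mirror(mirrors, 2)].name);
	return 0;
}
```

The head-position heuristic makes little sense for SSDs, so without the least-pending rule the faster device can end up underused, which is what the earlier results showed.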
A possible fix ^
Kernel developer Guoqing Jiang contacted me with a patch that adds the same optimisation, already present in RAID-1, to RAID-10.
Updated performance figures ^
I’ve applied Guoqing’s patch and re-run the tests for the RAID-10 targets. Figures for the other targets are taken from the previous post for comparison.
Sequential IO ^
All figures are throughput in MiB/s.

Test | Fast RAID-1 | Fast RAID-10 | Fast RAID-10 (patched) | Slow RAID-1 | Slow RAID-10 | Slow RAID-10 (patched)
---|---|---|---|---|---|---
Read | 1,237 | 1,682 | 2,141 | 198 | 188 | 211
Write | 321 | 321 | 321 | 18 | 19 | 19
The patched RAID-10 is the clear winner for sequential IO. It even performs about 27% faster than the unpatched variant.
Random IO ^
All figures are IOPS.

Test | Fast RAID-1 | Fast RAID-10 | Fast RAID-10 (patched) | Slow RAID-1 | Slow RAID-10 | Slow RAID-10 (patched)
---|---|---|---|---|---|---
Random Read | 602,000 | 208,000 | 602,000 | 501 | 501 | 487
Random Write | 82,200 | 82,200 | 82,200 | 25 | 21 | 71
The patched RAID-10’s random read performance is now indistinguishable from that of RAID-1: almost 3 times faster than without the patch!
I am unable to explain why RAID-10 random writes on the slow devices (HDDs) are so much better than before.
The patch ^
Guoqing Jiang’s patch is as follows, in case anyone wants to test it. Guoqing has only compile-tested it, as they don’t have the required hardware. I have tested it and it seems okay, but don’t use it on any data you care about yet.
```diff
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 25e97de36717..2ebe49b18aeb 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -745,15 +745,19 @@ static struct md_rdev *read_balance(struct r10conf *conf,
 	int sectors = r10_bio->sectors;
 	int best_good_sectors;
 	sector_t new_distance, best_dist;
-	struct md_rdev *best_rdev, *rdev = NULL;
+	struct md_rdev *best_dist_rdev, *best_pending_rdev, *rdev = NULL;
 	int do_balance;
-	int best_slot;
+	int best_dist_slot, best_pending_slot;
+	int has_nonrot_disk = 0;
+	unsigned int min_pending;
 	struct geom *geo = &conf->geo;
 
 	raid10_find_phys(conf, r10_bio);
 	rcu_read_lock();
-	best_slot = -1;
-	best_rdev = NULL;
+	best_dist_slot = -1;
+	min_pending = UINT_MAX;
+	best_dist_rdev = NULL;
+	best_pending_rdev = NULL;
 	best_dist = MaxSector;
 	best_good_sectors = 0;
 	do_balance = 1;
@@ -775,6 +779,8 @@ static struct md_rdev *read_balance(struct r10conf *conf,
 		sector_t first_bad;
 		int bad_sectors;
 		sector_t dev_sector;
+		unsigned int pending;
+		bool nonrot;
 
 		if (r10_bio->devs[slot].bio == IO_BLOCKED)
 			continue;
@@ -811,8 +817,8 @@ static struct md_rdev *read_balance(struct r10conf *conf,
 					first_bad - dev_sector;
 				if (good_sectors > best_good_sectors) {
 					best_good_sectors = good_sectors;
-					best_slot = slot;
-					best_rdev = rdev;
+					best_dist_slot = slot;
+					best_dist_rdev = rdev;
 				}
 				if (!do_balance)
 					/* Must read from here */
@@ -825,14 +831,23 @@ static struct md_rdev *read_balance(struct r10conf *conf,
 		if (!do_balance)
 			break;
 
-		if (best_slot >= 0)
+		nonrot = blk_queue_nonrot(bdev_get_queue(rdev->bdev));
+		has_nonrot_disk |= nonrot;
+		pending = atomic_read(&rdev->nr_pending);
+		if (min_pending > pending && nonrot) {
+			min_pending = pending;
+			best_pending_slot = slot;
+			best_pending_rdev = rdev;
+		}
+
+		if (best_dist_slot >= 0)
 			/* At least 2 disks to choose from so failfast is OK */
 			set_bit(R10BIO_FailFast, &r10_bio->state);
 		/* This optimisation is debatable, and completely destroys
 		 * sequential read speed for 'far copies' arrays. So only
 		 * keep it for 'near' arrays, and review those later.
 		 */
-		if (geo->near_copies > 1 && !atomic_read(&rdev->nr_pending))
+		if (geo->near_copies > 1 && !pending)
 			new_distance = 0;
 
 		/* for far > 1 always use the lowest address */
@@ -841,15 +856,21 @@ static struct md_rdev *read_balance(struct r10conf *conf,
 		else
 			new_distance = abs(r10_bio->devs[slot].addr -
 					   conf->mirrors[disk].head_position);
+
 		if (new_distance < best_dist) {
 			best_dist = new_distance;
-			best_slot = slot;
-			best_rdev = rdev;
+			best_dist_slot = slot;
+			best_dist_rdev = rdev;
 		}
 	}
 	if (slot >= conf->copies) {
-		slot = best_slot;
-		rdev = best_rdev;
+		if (has_nonrot_disk) {
+			slot = best_pending_slot;
+			rdev = best_pending_rdev;
+		} else {
+			slot = best_dist_slot;
+			rdev = best_dist_rdev;
+		}
 	}
 	if (slot >= 0) {
```
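To make the long diff easier to follow, the heart of the change is the final slot selection in read_balance(): if any mirror is non-rotational the least-loaded candidate wins, otherwise the old nearest-head behaviour is kept. This is condensed from the last hunk above, with explanatory comments added by me:

```c
/* Condensed from the final hunk above: once all slots have been examined,
 * pick the least-loaded copy whenever the array contains a non-rotational
 * device, and only fall back to the old nearest-head rule otherwise. */
if (slot >= conf->copies) {
	if (has_nonrot_disk) {
		slot = best_pending_slot;	/* fewest in-flight requests */
		rdev = best_pending_rdev;
	} else {
		slot = best_dist_slot;		/* shortest seek distance */
		rdev = best_dist_rdev;
	}
}
```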
Comments ^
Interesting article, thanks! Out of interest, do you know if this ever got applied to upstream Linux RAID? Also, you don’t explicitly say, but are you using RAID-10 with just two devices?
Yes, it was applied: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/md/raid10.c?id=e9eeba28a1e01a55b49cdcf9c7a346d2aaa0aa7d
Because of this issue, and not wanting to run custom kernels, I switched to RAID-1. I still have servers deployed that use RAID-10 on fewer than three devices, though.
It doesn’t really make much difference when all the devices are of similar performance.