A slightly more realistic look at lvmcache

Recap ^

And then… ^

I decided to perform some slightly more realistic benchmarks against lvmcache.

The problem with the initial benchmark was that it only covered 4GiB of data with a 4GiB cache device. Naturally once lvmcache was working correctly its performance was awesome – the entire dataset was in the cache. But clearly if you have enough fast block device available to fit all your data then you don’t need to cache it at all and may as well just use the fast device directly.

I decided to perform some fio tests with varying data sizes, some of which were larger than the cache device.

Test methodology ^

Once again I used a Zipf distribution with a factor of 1.2, which should have caused about 90% of the hits to come from just 10% of the data. I kept the cache device at 4GiB but varied the data size. The following data sizes were tested:

  • 1GiB
  • 2GiB
  • 4GiB
  • 8GiB
  • 16GiB
  • 32GiB
  • 48GiB

With the 48GiB test I expected to see lvmcache struggling, as the hot 10% (~4.8GiB) would no longer fit within the 4GiB cache device.

A similar fio job spec to those from the earlier articles was used:

[cachesize-1g]
size=512m
ioengine=libaio
direct=1
iodepth=8
numjobs=2
readwrite=randread
random_distribution=zipf:1.2
bs=4k
unlink=1
runtime=30m
time_based=1
per_job_logs=1
log_avg_msec=500
write_iops_log=/var/tmp/fio-${FIOTEST}

…the only difference being that several different job files were used each with a different size= directive. Note that as there are two jobs, the size= is half the desired total data size: each job lays out a data file of the specified size.

For each data size I took care to fill the cache with data first before doing a test run, as unreproducible performance is still seen against a completely empty cache device. This produced IOPS logs and a completion latency histogram. Test were also run against SSD and HDD to provide baseline figures.

Results ^

IOPS graphs ^

All-in-one ^

Immediately we can see that for data sizes 4GiB and below performance converges quite quickly to near-SSD levels. That is very much what we would expect when the cache device is 4GiB, so big enough to completely cache everything.

Let’s just have a look at the lower-performing configurations.

Low-end performers ^

For 8, 16 and 32GiB data sizes performance clearly gets progressively worse, but it is still much better than baseline HDD. The 10% of hot data still fits within the cache device, so plenty of acceleration is still happening.

For the 48GiB data size it is a little bit of a different story. Performance is still better (on average) than baseline HDD, but there are periodic dips back down to roughly HDD figures. This is because not all of the 10% hot data fits into the cache device any more. Cache misses cause reads from HDD and consequently end up with HDD levels of performance for those reads.

The results no longer look quite so impressive, with even the 8GiB data set achieving only a few thousand IOPS on average. Are things as bad as they seem? Well no, I don’t think they are, and to see why we will have to look at the completion latency histograms.

Completion latency histograms ^

The above graphs are generated by fitting a Bezier curve to a scatter of data points each of which represents a 500ms average of IOPS achieved. The problem there is the word average.

It’s important to understand what effect averaging the figures gives. We’ve already seen that HDDs are really slow. Even if only a few percent of IOs end up missing cache and going to HDD, the massive latency of those requests will pull the average for the whole 500ms window way down.

Presumably we have a cache because we suspect we have hot spots of data, and we’ve been trying to evaluate that by doing most of the reads from only 10% of the data. Do we care what the average performance is then? Well it’s a useful metric but it’s not going to say much about the performance of reads from the hot data.

The histogram of completion latencies can be more useful. This shows how long it took between issuing the IO and completing the read for a certain percentage of issued IOs. Below I have focused on the 50% to 99% latency buckets, with the times for each bucket averaged between the two jobs. In the interests of being able to see anything at all I’ve had to double the height of the graph and still cut off the y axis for the three worst performers.

A couple of observations:

  • Somewhere between 70% and 80% of IOs complete with a latency that’s so close to SSD performance as to be near-indistinguishable, no matter what the data size. So what I think I am proving is that:

    you can cache a 48GiB slow backing device with 4GiB of fast SSD and if you have 10% hot data then you can expect it to be served up at near-SSD latencies 70%–80% of the time. If your hot spots are larger (not so hot) then you won’t achieve that. If your fast device is larger than 1/12th the backing device then you should do better than 70%–80%.

  • If the cache were perfect then we should expect the 90th percentile to be near SSD performance even for the 32GiB data set, as the 10% hot spot of ~3.2GiB fits inside the 4GiB cache. For whatever reason this is not achieved, but for that data size the 90th percentile latency is still about half that of HDD.
  • When the backing device is many times larger (32GiB+) than the cache device, the 99th percentile latencies can be slightly worse than for baseline HDD.

    I hesitate to suggest there is a problem here as there are going to be very few samples in the top 1%, so it could just be showing close to HDD performance.

Conclusion

Assuming you are okay with using a 4.12..x kernel, and assuming you are already comfortable using LVM, then at the moment it looks fairly harmless to deploy lvmcache.

Getting a decent performance boost out of it though will require you to check that your data really does have hot spots and size your cache appropriately.

Measuring your existing workload with something like blktrace is probably advisable, and these days you can feed the output of blktrace back into fio to see what performance might be like in a difference configuration.

Full test output

You probably want to stop reading here unless the complete output of all the fio runs is of interest to you.
Continue reading “A slightly more realistic look at lvmcache” ^

Tracking down the lvmcache fix

Background ^

In the previous article I covered how, in order to get decent performance out of lvmcache with a packaged Debian kernel, you’d have to use the 4.12.2-1~exp1 kernel from experimental. The kernels packaged in sid, testing (buster) and stable (stretch) aren’t new enough.

I decided to bisect the Linux kernel upstream git repository to find out exactly which commit(s) fixed things.

Results ^

Here’s a graph showing the IOPS over time for baseline SSD and lvmcache with a full cache under several different kernel versions. As in previous articles, the lines are actually Bezier curves fitted to the data which is scattered all over the place from 500ms averages.

What we can see here is that performance starts to improve with commit 4d44ec5ab751 authored by Joe Thornber:

dm cache policy smq: put newly promoted entries at the top of the multiqueue

This stops entries bouncing in and out of the cache quickly.

This is part of a set of commits authored by Joe Thornber on the drivers/md/dm-cache-policy-smq.c file and committed on 2017-05-14. By the time we reach commit 6cf4cc8f8b3b we have the long-term good performance that we were looking for.

The first of Joe Thornber’s commits on that day in the dm-cache area was 072792dcdfc8 and stepping back to the commit immediately prior to that one (2ea659a9ef48) we get a kernel representing the moment that Linus designated the v4.12-rc1 tag. Joe’s commits went into -rc1, and without them the performance of lvmcache under these test conditions isn’t much better than baseline HDD.

It seems like some of Joe’s changes helped a lot and then the last one really provided the long term performance.

git bisect procedure ^

Normally when you do a git bisect you’re starting with something that works and you’re looking for the commit that introduced a bug. In this case I was starting off with a known-good state and was interested in which commit(s) got me there. The normal bisect key words of “good” and “bad” in this case would be backwards to what I wanted. Dominic gave me the tip that I could alias the terms in order to reduce my confusion:

$ git bisect start --term-old broken --term-new fixed

From here on, when I encountered a test run that produced poor results I would issue:

$ git bisect broken

and when I had a test run with good results I would issue:

$ git bisect fixed

As I knew that the tag v4.13-rc1 produced a good run and v4.11 was bad, I could start off with:

$ git bisect reset v4.13-rc1
$ git bisect fixed
$ git bisect broken v4.11

git would then keep bisecting the search space of commits until I would find the one(s) that resulted in the high performance I was looking for.

Good and bad? ^

As before I’m using fio to conduct the testing, with the same job specification:

ioengine=libaio
direct=1
iodepth=8
numjobs=2
readwrite=randread
random_distribution=zipf:1.2
bs=4k
size=2g
unlink=1
runtime=15m
time_based=1
per_job_logs=1
log_avg_msec=500
write_iops_log=/var/tmp/fio-${FIOTEST}

The only difference from the other articles was that the run time was reduced to 15 minutes as all of the interesting behaviour happened within the first 11 minutes.

To recap, this fio job specification lays out two 2GiB files of random data and then starts two processes that perform 4kiB-sized reads against the files. Direct IO is used, in order to bypass the page cache.

A Zipfian distribution with a factor of 1.2 is used; this gives a 90/10 split where about 90% of the reads should come from about 10% of the data. The purpose of this is to simulate the hot spots of popular data that occur in real life. If the access pattern were to be perfectly and uniformly random then caching would not be effective.

In previous tests we had observed that dramatically different performance would be seen on the first run against an empty cache device compared to all other subsequent runs against what would be a full cache device. In the tests using kernels with the fix present the IOPS achieved would converge towards baseline SSD performance, whereas in kernels without the fix the performance would remain down near the level of baseline HDD. Therefore the fio tests were carried out twice.

Where to next? ^

I think I am going to see what happens when the cache device is pretty small in comparison to the working data.

All of the tests so far have used a 4GiB cache with 4GiB of data, so if everything got promoted it would entirely fit in cache. Not only that but the Zipf distribution makes most of the hits come from 10% of the data, so it’s actually just ~400MiB of hot data. I think it would be interesting to see what happens when the hot 10% is bigger than the cache device.

git bisect progress and test output ^

Unless you are particularly interested in the fio output and why I considered each one to be either fixed or broken, you probably want to stop reading now.

Continue reading “Tracking down the lvmcache fix”

lvmcache with a 4.12.3 kernel

Background ^

In the previous two articles I had discovered that lvmcache had amazing performance on an empty cache but then on every run after that (i.e. when the cache device was full of junk) went scarcely better than baseline HDD.

A few days ago I happened across an email on the linux-lvm list where Mike Snitzer advised:

the [CentOS] 7.4 dm-cache will be much more performant than the 7.3 cache you appear to be using.

…and…

It could be that your workload isn’t accessing the data enough to warrant promotion to the cache. dm-cache is a “hotspot” cache. If you aren’t accessing the data repeatedly then you won’t see much benefit (particularly with the 7.3 and earlier releases).

Just to get a feel, you could try the latest upstream 4.12 kernel to see how effective the 7.4 dm-cache will be for your setup.

I don’t know what kernel version CentOS 7.3 uses, but the VM I’m testing with is Debian testing (buster), so some version of 4.11.x plus backported patches.

That seemed pretty new, but Mike is suggesting 4.12.x so I thought I’d re-test lvmcache with the latest stable upstream kernel, which at the time of writing is version 4.12.3.

Test methodology ^

This time around I only focused on fio tests, using the same settings as before:

[partial]
ioengine=libaio
direct=1
iodepth=8
numjobs=2
readwrite=randread
random_distribution=zipf:1.2
bs=4k
size=2g
unlink=1
runtime=20m
time_based=1
per_job_logs=1
log_avg_msec=500
write_iops_log=/var/tmp/fio-${FIOTEST}

The only changes were:

  1. to reduce the run time to 20 minutes from 30 minutes, since all the interesting things happened within the first 20 minutes before.
  2. to write an IOPS log entry every 500ms instead of ever 1000ms, as the log files were quite small and some higher resolution might help smooth graphs out.

Last time there was a dramatic difference between the initial run with an empty cache and subsequent runs with a cache volume full of junk, so I did a test for each of those conditions, as well as tests for the baseline SSD and HDD.

The virtual machine had been upgraded from Debian 9 (stretch) to testing (buster), so it still had packaged kernel versions 4.9.30-2 and 4.11.6-1 laying around to test things with. In addition I compiled up version 4.12.3 by copying the .config from 4.11.6-1 then doing make oldconfig accepting all defaults.

Results ^

Although the fio job spec was essentially the same as in the previous article, I have since worked out how to merge the IOPS logs from both jobs so the graphs will seem to show about double the IOPS as they did before.

All-in-one ^

Well that’s an interesting set of graphs but rather hard to distinguish. Let’s try that by kernel version.

Baseline SSD by kernel version ^

A couple of weird things here:

  1. 4.12.3 and 4.11.6-1 are actually fairly consistent, but 4.9.30-2 varies rather a lot.
  2. All kernels show a sharp dip a few minutes in. I don’t know what that is about.

Although these lines do look quite far apart, bear in mind that this graph’s y axis starts at 92k IOPS. The average IOPS didn’t vary that much:

Average IOPS by kernel version
4.9.30-2 4.11.6-1 4.12.3
102,325 102,742 104,352

So there was actually only a 1.9% difference between the worst performer and the best.

Baseline HDD by kernel version ^

4.9.30-2 and 4.12.3 are close enough here to probably be within the margin of error, but there is something weird going on with 4.11.6-1.

Its average IOPS across the 20 minute test were only 56% of those for 4.12.3 and 53% of those for 4.9.30-2, which is quite a big disparity. I re-ran these tests 5 times to check it wasn’t some anomaly, but no, it’s reproducible.

Maybe something to look into another day.

lvmcache by kernel version ^

Dragging things back to the point of this article: previously we discovered that lvmcache worked great the first time through, when its cache volume was completely empty, but then subsequent runs all absolutely sucked. They didn’t perform significantly better than HDD baseline.

Let’s graph all the lvmcache results for each kernel version against the SSD baseline for that kernel to see if things changed at all.

lvmcache 4.9.30-2 ^

This is the similar to what we saw before: an empty cache volume produces decent results of around 47k IOPS. Although it’s interesting that the second job is down around 1k IOPS. Again the results on a full cache are poor. In fact the results for the second job of the empty cache are about the same as the results for both jobs on a full cache.

lvmcache 4.11.6-1 ^

Same story again here, although the performance is a little higher. Again the first job on an empty cache is getting the big results of almost 60k IOPS while the second job—and both jobs on a full cache—get only around 1k IOPS.

lvmcache 4.12.3 ^

Wow. Something dramatic has been fixed. The performance on an empty cache is still better, but crucially the performance on a full cache pretty quickly becomes very close to baseline SSD.

Also the runs against both the empty and full cache device result in both jobs getting roughly the same IOPS performance rather than the first job being great and all others very poor.

What’s next? ^

It’s really encouraging that the performance is so much better with 4.12.3. It’s changed lvmcache from a “hmm, maybe” option to one that I would strongly consider using anywhere I could.

It’s a shame though that such a new kernel is required. The kernel version in Debian testing (buster) is currently 4.11.6-1. Debian experimental’s linux-image-4.12.0-trunk-amd64 package currently has version 4.12.2-1 so I should test if that is new enough I tested to see if that was new enough.

Failing that I think I should git bisect or similar in order to find out exactly which changeset fixed this, so I could have some chance of knowing when it hits a packaged version.

Continue reading “lvmcache with a 4.12.3 kernel”

bcache and lvmcache

Background ^

Over at BitFolk we offer both SSD-backed storage and HDD-backed archive storage. The SSDs we use are really nice and basically have removed all IO performance problems we have ever encountered in the past. I can’t begin to describe how pleasant it is to just never have to think about that whole class of problems.

The main downside of course is that SSD capacity is still really expensive. That’s why we introduced the HDD-backed archive storage: for bulk storage of things that didn’t need to have high performance.

You’d really think though that by now there would be some kind of commodity tiered storage that would allow a relatively small amount of fast SSD storage to accelerate a much larger pool of HDDs. Of course there are various enterprise solutions, and there is also ZFS where SSDs could be used for the ZIL and L2ARC while HDDs are used for the pool.

ZFS is quite appealing but I’m not ready to switch everything to that yet, and I’m certainly not going to buy some enterprise storage solution. I also don’t necessarily want to commit to putting all storage behind such a system.

I decided to explore Linux’s own block device caching solutions.

Scope ^

I’ve restricted the scope to the two solutions which are part of the mainline Linux kernel as of July 2017, these being bcache and lvmcache.

lvmcache is based upon dm-cache which has been included with the mainline kernel since April 2013. It’s quite conservative, and having been around for quite a while is considered stable. It has the advantage that it can work with any LVM logical volume no matter what the contents. That brings the disadvantage that you do need to run LVM.

bcache has been around for a little longer but is a much more ambitious project. Being completely dedicated to accelerating slow block devices with fast ones it is claimed to be able to achieve higher performance than other caching solutions, but as it’s much more complicated than dm-cache there are still bugs being found. Also it requires you format your block devices as bcache before you use them for anything.

Test environment ^

I’m testing this on a Debian testing (buster) Xen virtual machine with a 20GiB xvda virtual disk containing the main operating system. That disk is backed by a software (md) RAID-10 composed of two Samsung sm863 SSDs. It was also used for testing the baseline SSD performance from the directory /srv/ssd.

The virtual machine had 1GiB of memory but the pagecache was cleared between each test run in an attempt to prevent anything being cached in memory.

A 5GiB xvdc virtual disk was provided, backed again on the SSD RAID. This was used for the cache role both in bcache and lvmcache.

A 50GiB xvdd virtual disk was provided, backed by a pair of Seagate ST4000LM016-1N2170 HDDs in software RAID-1. This was used for the HDD backing store in each of the caching implementations. The resulting cache device was mounted at /srv/cache.

Finally a 50GiB xvde virtual disk also backed on HDD was used to test baseline HDD performance, mounted at /srv/slow.

The filesystem in use in all cases was ext4 with default options. In dom0, deadline scheduler was used in all cases.

TL;DR, I just want graphs ^

In case you can’t be bothered to read the rest of this article, here’s just the graphs with some attempt at interpreting them. Down at the tests section you’ll find details of the actual testing process and more commentary on why certain graphs were produced.

git test graphs ^

Times to git clone and git grep.

fio IOPS graphs ^

These are graphs of IOPS across the 30 minutes of testing. There’s two important things to note about these graphs:

  1. They’re a Bezier curve fitted to the data points which are one per second. The actual data points are all over the place, because achieved IOPS depends on how many cache hits/misses there were, which is statistical.
  2. Only the IOPS for the first job is graphed. Even when using the per_job_logs=0 setting my copy of fio writes a set of results for each job. I couldn’t work out how to easily combine these so I’ve shown only the results for the first job.

    For all tests except bcache (sequential_cutoff=0) you just have to bear in mind that there is a second job working in parallel doing pretty much the same amount of IOPS. Strangely for that second bcache test the second job only managed a fraction of the IOPS (though still more than 10k IOPS) and I don’t know why.

IOPS over time for all tests

Well, those results are so extreme that it kind of makes it hard to distinguish between the low-end results.

A couple of observations:

  • SSD is incredibly and consistently fast.
  • For everything else there is a short steep section at the beginning which is likely to be the effect of HDD drive cache.
  • With sequential_cutoff set to 0, bcache very quickly reaches near-SSD performance for this workload (4k reads, 90% hitting 10% of data that fits entirely in the bcache). This is probably because the initial write put data straight into cache as it’s set to writeback.
  • When starting with a completely empty cache, lvmcache is no slouch either. It’s not quite as performant as bcache but that is still up near the 48k IOPS per process region, and very predictable.
  • When sequential_cutoff is left at its default of 4M, bcache performs much worse though still blazing compared to an HDD on its own. At the end of this 30 minute test performance was still increasing so it might be worth performing a longer test
  • The performance of lvmcache when starting with a cache already full of junk data seems to be not that much better than HDD baseline.
IOPS over time for low-end results

Leaving the high-performers out to see if there is anything interesting going on near the bottom of the previous graph.

Apart from the initial spike, HDD results are flat as expected.

Although the lvmcache (full cache) results in the previous graph seemed flat too, looking closer we can see that performance is still increasing, just very slowly. It may be interesting to test for longer to see if performance does continue to increase.

Both HDD and lvmcache have a very similar spike at the start of the test so let’s look closer at that.

IOPS for first 30 seconds

For all the lower-end performers the first 19 seconds are steeper and I can only think this is the effect of HDD drive cache. Once that is filled, HDD remains basically flat, lvmcache (full cache) increases performance more slowly and bcache with the default sequential_cutoff starts to take off.

SSDs don’t have the same sort of cache and bcache with no sequential_cutoff spikes up too quickly to really be noticeable at this scale.

3-hour lvmcache test

Since it seemed like lvmcache with a full cache device was still slowly increasing in performance I did a 3-hour testing on that one.

Skipping the first 20 minutes which show stronger growth, even after 3 hours there is still some performance increase happening. It seems like even a full cache would eventually promote read hot spots, but it could take a very very long time.

Continue reading “bcache and lvmcache”