bcache and lvmcache

Background ^

Over at BitFolk we offer both SSD-backed storage and HDD-backed archive storage. The SSDs we use are really nice and basically have removed all IO performance problems we have ever encountered in the past. I can’t begin to describe how pleasant it is to just never have to think about that whole class of problems.

The main downside of course is that SSD capacity is still really expensive. That’s why we introduced the HDD-backed archive storage: for bulk storage of things that didn’t need to have high performance.

You’d really think though that by now there would be some kind of commodity tiered storage that would allow a relatively small amount of fast SSD storage to accelerate a much larger pool of HDDs. Of course there are various enterprise solutions, and there is also ZFS where SSDs could be used for the ZIL and L2ARC while HDDs are used for the pool.

ZFS is quite appealing but I’m not ready to switch everything to that yet, and I’m certainly not going to buy some enterprise storage solution. I also don’t necessarily want to commit to putting all storage behind such a system.

I decided to explore Linux’s own block device caching solutions.

Scope ^

I’ve restricted the scope to the two solutions which are part of the mainline Linux kernel as of July 2017, these being bcache and lvmcache.

lvmcache is based upon dm-cache which has been included with the mainline kernel since April 2013. It’s quite conservative, and having been around for quite a while is considered stable. It has the advantage that it can work with any LVM logical volume no matter what the contents. That brings the disadvantage that you do need to run LVM.

bcache has been around for a little longer but is a much more ambitious project. Being completely dedicated to accelerating slow block devices with fast ones it is claimed to be able to achieve higher performance than other caching solutions, but as it’s much more complicated than dm-cache there are still bugs being found. Also it requires you format your block devices as bcache before you use them for anything.

Test environment ^

I’m testing this on a Debian testing (buster) Xen virtual machine with a 20GiB xvda virtual disk containing the main operating system. That disk is backed by a software (md) RAID-10 composed of two Samsung sm863 SSDs. It was also used for testing the baseline SSD performance from the directory /srv/ssd.

The virtual machine had 1GiB of memory but the pagecache was cleared between each test run in an attempt to prevent anything being cached in memory.

A 5GiB xvdc virtual disk was provided, backed again on the SSD RAID. This was used for the cache role both in bcache and lvmcache.

A 50GiB xvdd virtual disk was provided, backed by a pair of Seagate ST4000LM016-1N2170 HDDs in software RAID-1. This was used for the HDD backing store in each of the caching implementations. The resulting cache device was mounted at /srv/cache.

Finally a 50GiB xvde virtual disk also backed on HDD was used to test baseline HDD performance, mounted at /srv/slow.

The filesystem in use in all cases was ext4 with default options. In dom0, deadline scheduler was used in all cases.

TL;DR, I just want graphs ^

In case you can’t be bothered to read the rest of this article, here’s just the graphs with some attempt at interpreting them. Down at the tests section you’ll find details of the actual testing process and more commentary on why certain graphs were produced.

git test graphs ^

Times to git clone and git grep.

fio IOPS graphs ^

These are graphs of IOPS across the 30 minutes of testing. There’s two important things to note about these graphs:

  1. They’re a Bezier curve fitted to the data points which are one per second. The actual data points are all over the place, because achieved IOPS depends on how many cache hits/misses there were, which is statistical.
  2. Only the IOPS for the first job is graphed. Even when using the per_job_logs=0 setting my copy of fio writes a set of results for each job. I couldn’t work out how to easily combine these so I’ve shown only the results for the first job.

    For all tests except bcache (sequential_cutoff=0) you just have to bear in mind that there is a second job working in parallel doing pretty much the same amount of IOPS. Strangely for that second bcache test the second job only managed a fraction of the IOPS (though still more than 10k IOPS) and I don’t know why.

IOPS over time for all tests

Well, those results are so extreme that it kind of makes it hard to distinguish between the low-end results.

A couple of observations:

  • SSD is incredibly and consistently fast.
  • For everything else there is a short steep section at the beginning which is likely to be the effect of HDD drive cache.
  • With sequential_cutoff set to 0, bcache very quickly reaches near-SSD performance for this workload (4k reads, 90% hitting 10% of data that fits entirely in the bcache). This is probably because the initial write put data straight into cache as it’s set to writeback.
  • When starting with a completely empty cache, lvmcache is no slouch either. It’s not quite as performant as bcache but that is still up near the 48k IOPS per process region, and very predictable.
  • When sequential_cutoff is left at its default of 4M, bcache performs much worse though still blazing compared to an HDD on its own. At the end of this 30 minute test performance was still increasing so it might be worth performing a longer test
  • The performance of lvmcache when starting with a cache already full of junk data seems to be not that much better than HDD baseline.
IOPS over time for low-end results

Leaving the high-performers out to see if there is anything interesting going on near the bottom of the previous graph.

Apart from the initial spike, HDD results are flat as expected.

Although the lvmcache (full cache) results in the previous graph seemed flat too, looking closer we can see that performance is still increasing, just very slowly. It may be interesting to test for longer to see if performance does continue to increase.

Both HDD and lvmcache have a very similar spike at the start of the test so let’s look closer at that.

IOPS for first 30 seconds

For all the lower-end performers the first 19 seconds are steeper and I can only think this is the effect of HDD drive cache. Once that is filled, HDD remains basically flat, lvmcache (full cache) increases performance more slowly and bcache with the default sequential_cutoff starts to take off.

SSDs don’t have the same sort of cache and bcache with no sequential_cutoff spikes up too quickly to really be noticeable at this scale.

3-hour lvmcache test

Since it seemed like lvmcache with a full cache device was still slowly increasing in performance I did a 3-hour testing on that one.

Skipping the first 20 minutes which show stronger growth, even after 3 hours there is still some performance increase happening. It seems like even a full cache would eventually promote read hot spots, but it could take a very very long time.

The tests ^

This being caching, we would only really expect it to be effective for reads of already-known data. Sequential reads on an HDD aren’t too bad but what really makes things feel slow are random reads all over the place: the time taken to seek to a new location on an HDD is really a killer for performance. So, we’d like a test that involves multiple reads of the same data but non-sequential.

git tests ^

I decided to do a git clone of the Linux kernel and then do a git grep of this, which would be guaranteed to be random access. The git grep step was done repeatedly to see if we could spot the cache being warmed up, also with the kernel’s pagecache being cleared after each step to avoid a situation where the entirety of the kernel source ended up in RAM. The clone process was also timed just to see if there would be any effect on write time.

fio tests ^

Despite git being a real application, the workload itself (searching for a string over and over) seemed rather too synthetic and simplistic. I decided to also try using the excellent fio to gather some detailed statistics. This is also synthetic of course, relating to no real workload, but it’s quite sophisticated and hopefully of interest.

The fio job spec used in all cases was as follows:

[partial]
ioengine=libaio
direct=1
iodepth=8
numjobs=2
readwrite=randread
random_distribution=zipf:1.2
bs=4k
size=2g
unlink=1
runtime=1800
time_based=1
per_job_logs=1
log_avg_msec=1000
write_iops_log=/var/tmp/fio-${FIOTEST}

This causes two 2GiB files of random data to be written out and then two processes read from them for 30 minutes with up to 8 IOs in flight at once for each process. Between the use of direct=1 and the fact that there is 4GiB of data and only 1GiB of RAM, that should be enough to avoid testing any memory cache.

log_avg_msec=1000 is specified in order to restrict the IOPS log to one entry per second, otherwise you get one entry per IO. As that is kept in memory until fio completes, unless the number of entries is reduced then due to the sheer number of IOs the SSDs can service the fio process will cause the VM to run out of memory. This directive causes just one averaged entry per 1000ms, which is adequate for graphing.

unlink is used to make sure that data files are deleted after each test run. Leaving the files in place would cause additional promotions into cache, which would favour tests that are run multiple times.

By default when told to do a random read, fio would ensure it reads the entire file once only (before starting again, if the test is time-based). That would not be any use for testing cache warming as there would only ever be one read from each part of the file. To avoid this, a Zipf distribution is used so that read locations are chosen according to a somewhat repetitive distribution, simulating hotspots. I have no real idea about this part so if anyone has any input on that I’d be interested to hear, but my understanding is that a Zipf distribution with exponent 1.2 should cause about 91% of the reads to be from about 10% of the blocks.

In fact I would be very interested to hear of other fio job configurations that people think would be useful here.

If you’re unfamiliar with the tool, fio output can appear quite daunting. There’s a wealth of information in there though, so it’s worth looking into if you ever need to test storage performance. Too many times people just reach for dd or bonnie as performance testing tools but apart from the most trivial of tests they are just too simple. I found this article by Albert Tobey useful in understanding fio‘s output.

Each test run was performed against:

  1. A purely SSD-backed filesystem.
  2. A purely HDD-backed filesystem.
  3. A filesystem on top of a bcache device, sequential_cutoff=4M (default).
  4. A filesystem on top of a bcache device, sequential_cutoff=0.
  5. A filesystem on top of an lvmcache device, empty cache.
  6. A filesystem on top of an lvmcache device, cache full of junk.

Baseline SSD ^

git grep test
$ cd /srv/ssd
$ /usr/bin/time -f '%U user %S system %E elapsed %P CPU' \
> git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
Cloning into 'linux-stable'...
remote: Counting objects: 5948690, done.
remote: Compressing objects: 100% (124799/124799), done.
remote: Total 5948690 (delta 427756), reused 549517 (delta 425675)
Receiving objects: 100% (5948690/5948690), 1.21 GiB | 39.61 MiB/s, done.
Resolving deltas: 100% (4961965/4961965), done.
Checking out files: 100% (59844/59844), done.
195.45 user 22.10 system 2:37.89 elapsed 137% CPU
 
$ cd /srv/ssd/linux-stable
$ for i in {1..20}; do
> echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null ; \
> /usr/bin/time -f '%U user %S system %E elapsed %P CPU' \
> git --no-pager grep -q blah ; \
> sleep 1; \
> done
0.98 user 1.28 system 0:02.28 elapsed 99% CPU
0.95 user 1.15 system 0:02.18 elapsed 96% CPU
0.89 user 1.30 system 0:02.23 elapsed 98% CPU
1.04 user 1.13 system 0:02.25 elapsed 96% CPU
0.95 user 1.22 system 0:02.24 elapsed 96% CPU
0.95 user 1.23 system 0:02.23 elapsed 97% CPU
1.00 user 1.16 system 0:02.24 elapsed 96% CPU
1.00 user 1.16 system 0:02.23 elapsed 96% CPU
0.89 user 1.28 system 0:02.22 elapsed 97% CPU
0.97 user 1.20 system 0:02.24 elapsed 97% CPU
1.00 user 1.18 system 0:02.23 elapsed 97% CPU
0.96 user 1.21 system 0:02.23 elapsed 97% CPU
0.96 user 1.22 system 0:02.24 elapsed 97% CPU
1.01 user 1.17 system 0:02.24 elapsed 97% CPU
0.94 user 1.22 system 0:02.24 elapsed 96% CPU
1.00 user 1.16 system 0:02.24 elapsed 96% CPU
0.98 user 1.18 system 0:02.24 elapsed 96% CPU
0.92 user 1.24 system 0:02.24 elapsed 97% CPU
0.92 user 1.28 system 0:02.25 elapsed 98% CPU
1.09 user 1.07 system 0:02.23 elapsed 97% CPU

The initial git clone took 157.89 seconds and each git grep reliably took about 2.24 seconds.

fio test
$ cd /srv/ssd/fio && FIOTEST=ssd fio ~/partial.fio
partial: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=8
...
fio-2.16
Starting 2 processes
partial: Laying out IO file(s) (1 file(s) / 2048MB)
partial: Laying out IO file(s) (1 file(s) / 2048MB)
Jobs: 2 (f=2): [r(2)] [100.0% done] [409.8MB/0KB/0KB /s] [105K/0/0 iops] [eta 00m:00s]
partial: (groupid=0, jobs=1): err= 0: pid=29306: Wed Jul 19 05:17:39 2017
  read : io=379033MB, bw=215628KB/s, iops=53906, runt=1800001msec
    slat (usec): min=2, max=1547, avg= 4.38, stdev= 1.68
    clat (usec): min=26, max=4350, avg=141.54, stdev=29.67
     lat (usec): min=94, max=4353, avg=146.39, stdev=29.65
    clat percentiles (usec):
     |  1.00th=[  103],  5.00th=[  108], 10.00th=[  112], 20.00th=[  119],
     | 30.00th=[  124], 40.00th=[  129], 50.00th=[  135], 60.00th=[  141],
     | 70.00th=[  149], 80.00th=[  161], 90.00th=[  179], 95.00th=[  197],
     | 99.00th=[  243], 99.50th=[  262], 99.90th=[  314], 99.95th=[  338],
     | 99.99th=[  406]
    lat (usec) : 50=0.01%, 100=0.14%, 250=99.08%, 500=0.77%, 750=0.01%
    lat (usec) : 1000=0.01%
    lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%
  cpu          : usr=14.30%, sys=39.70%, ctx=50601996, majf=0, minf=16
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=97032448/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8
partial: (groupid=0, jobs=1): err= 0: pid=29307: Wed Jul 19 05:17:39 2017
  read : io=358188MB, bw=203769KB/s, iops=50942, runt=1800001msec
    slat (usec): min=2, max=225, avg= 4.46, stdev= 1.71
    clat (usec): min=21, max=4385, avg=150.08, stdev=43.00
     lat (usec): min=69, max=4391, avg=155.01, stdev=42.98
    clat percentiles (usec):
     |  1.00th=[  104],  5.00th=[  110], 10.00th=[  115], 20.00th=[  121],
     | 30.00th=[  127], 40.00th=[  133], 50.00th=[  137], 60.00th=[  145],
     | 70.00th=[  155], 80.00th=[  171], 90.00th=[  199], 95.00th=[  237],
     | 99.00th=[  322], 99.50th=[  350], 99.90th=[  414], 99.95th=[  438],
     | 99.99th=[  498]
    lat (usec) : 50=0.01%, 100=0.10%, 250=95.82%, 500=4.06%, 750=0.01%
    lat (usec) : 1000=0.01%
    lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%
  cpu          : usr=13.80%, sys=37.97%, ctx=48695297, majf=0, minf=15
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=91696255/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8
 
Run status group 0 (all jobs):
   READ: io=737221MB, aggrb=419396KB/s, minb=203769KB/s, maxb=215627KB/s, mint=1800001msec, maxt=1800001msec
 
Disk stats (read/write):
  xvda: ios=188724848/22, merge=0/28, ticks=27221784/0, in_queue=27267336, util=100.00%

This provides a nice illustration of just how fast the SSDs are: each process managed a very consistent 52 thousand IOPS and between the two of them they managed to read ~720GiB of data in 30 minutes.

Baseline HDD ^

git grep test
$ cd /srv/slow
$ /usr/bin/time -f '%U user %S system %E elapsed %P CPU' \
> git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
Cloning into 'linux-stable'...
remote: Counting objects: 5948690, done.
remote: Compressing objects: 100% (124799/124799), done.
remote: Total 5948690 (delta 427756), reused 549517 (delta 425675)
Receiving objects: 100% (5948690/5948690), 1.21 GiB | 38.36 MiB/s, done.
Resolving deltas: 100% (4961965/4961965), done.
Checking out files: 100% (59844/59844), done.
194.61 user 21.48 system 3:42.23 elapsed 97% CPU
 
$ cd /srv/slow/linux-stable
$ for i in {1..20}; do
> echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null ; \
> /usr/bin/time -f '%U user %S system %E elapsed %P CPU' \
> git --no-pager grep -q blah ; \
> sleep 1; \
> done
1.22 user 1.60 system 0:23.60 elapsed 11% CPU
1.35 user 1.32 system 0:23.08 elapsed 11% CPU
1.24 user 1.43 system 0:22.49 elapsed 11% CPU
1.15 user 1.57 system 0:22.37 elapsed 12% CPU
1.25 user 1.44 system 0:22.09 elapsed 12% CPU
1.26 user 1.48 system 0:22.60 elapsed 12% CPU
1.20 user 1.52 system 0:22.47 elapsed 12% CPU
1.21 user 1.48 system 0:22.82 elapsed 11% CPU
1.14 user 1.55 system 0:22.45 elapsed 12% CPU
1.28 user 1.49 system 0:22.32 elapsed 12% CPU
1.21 user 1.48 system 0:22.43 elapsed 12% CPU
1.26 user 1.51 system 0:22.95 elapsed 12% CPU
1.21 user 1.49 system 0:22.33 elapsed 12% CPU
1.23 user 1.45 system 0:22.54 elapsed 11% CPU
1.19 user 1.51 system 0:22.08 elapsed 12% CPU
1.26 user 1.42 system 0:22.42 elapsed 12% CPU
1.20 user 1.51 system 0:22.47 elapsed 12% CPU
1.19 user 1.53 system 0:22.42 elapsed 12% CPU
1.15 user 1.56 system 0:22.48 elapsed 12% CPU
1.24 user 1.47 system 0:22.10 elapsed 12% CPU

The initial git clone took 222.23 seconds and each git grep reliably took about 22.53 seconds.

Of course, a significant part of the clone time was spent downloading the data.

fio test
$ cd /srv/slow/fio && FIOTEST=hdd fio ~/partial.fio
partial: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=8
...
fio-2.16
Starting 2 processes
partial: Laying out IO file(s) (1 file(s) / 2048MB)
partial: Laying out IO file(s) (1 file(s) / 2048MB)
Jobs: 2 (f=2): [r(2)] [100.0% done] [3795KB/0KB/0KB /s] [948/0/0 iops] [eta 00m:00s]
partial: (groupid=0, jobs=1): err= 0: pid=29158: Wed Jul 19 04:33:19 2017
  read : io=3276.3MB, bw=1863.8KB/s, iops=465, runt=1800078msec
    slat (usec): min=2, max=132, avg=13.00, stdev= 8.10
    clat (usec): min=59, max=830656, avg=17151.07, stdev=36630.33
     lat (usec): min=64, max=830668, avg=17164.94, stdev=36630.39
    clat percentiles (usec):
     |  1.00th=[   70],  5.00th=[   78], 10.00th=[   88], 20.00th=[  111],
     | 30.00th=[  143], 40.00th=[  205], 50.00th=[  988], 60.00th=[ 1496],
     | 70.00th=[15424], 80.00th=[26240], 90.00th=[50432], 95.00th=[82432],
     | 99.00th=[177152], 99.50th=[224256], 99.90th=[346112], 99.95th=[403456],
     | 99.99th=[544768]
    lat (usec) : 100=15.10%, 250=26.82%, 500=3.79%, 750=1.99%, 1000=2.47%
    lat (msec) : 2=10.84%, 4=0.39%, 10=2.91%, 20=10.92%, 50=14.68%
    lat (msec) : 100=6.53%, 250=3.21%, 500=0.34%, 750=0.02%, 1000=0.01%
  cpu          : usr=0.39%, sys=1.17%, ctx=822685, majf=0, minf=16
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=838728/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8
partial: (groupid=0, jobs=1): err= 0: pid=29159: Wed Jul 19 04:33:19 2017
  read : io=3328.4MB, bw=1893.4KB/s, iops=473, runt=1800076msec
    slat (usec): min=2, max=157, avg=13.20, stdev= 8.22
    clat (usec): min=4, max=1046.5K, avg=16882.45, stdev=35737.35
     lat (usec): min=64, max=1046.6K, avg=16896.52, stdev=35737.42
    clat percentiles (usec):
     |  1.00th=[   70],  5.00th=[   78], 10.00th=[   88], 20.00th=[  111],
     | 30.00th=[  143], 40.00th=[  211], 50.00th=[  996], 60.00th=[ 1528],
     | 70.00th=[15296], 80.00th=[25728], 90.00th=[49920], 95.00th=[81408],
     | 99.00th=[173056], 99.50th=[218112], 99.90th=[329728], 99.95th=[391168],
     | 99.99th=[522240]
    lat (usec) : 10=0.01%, 50=0.01%, 100=15.03%, 250=26.59%, 500=3.86%
    lat (usec) : 750=2.07%, 1000=2.49%
    lat (msec) : 2=10.70%, 4=0.36%, 10=3.03%, 20=11.45%, 50=14.46%
    lat (msec) : 100=6.49%, 250=3.17%, 500=0.30%, 750=0.01%, 1000=0.01%
    lat (msec) : 2000=0.01%
  cpu          : usr=0.47%, sys=1.15%, ctx=834258, majf=0, minf=14
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=852044/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8
 
Run status group 0 (all jobs):
   READ: io=6604.6MB, aggrb=3757KB/s, minb=1863KB/s, maxb=1893KB/s, mint=1800076msec, maxt=1800078msec
 
Disk stats (read/write):
  xvde: ios=1690665/12, merge=0/15, ticks=28764332/2848, in_queue=28767992, util=100.00%

Both jobs together managed an average of 938 IOPS here, which is pretty weak but still considerably more than a pair of HDDs should be able to do (100–150 IOPS each, so 300 if you were amazingly lucky). This indicates that the drive cache must have been involved, and indeed the IOPS rate does very rapidly increase within the first few seconds of the test run but reaches a stable maximum within 20 seconds. That marks the time when the disk cache is full and cache hit rate can’t be maintained.

A total of 6,604.6MiB was read during the 30 minute test.

bcache ^

Setup

By default bcache uses writethrough cache mode, where writes go to both the backing store and the cache and are only acknowledged once they’ve completed at the backing store. In this mode bcache will never accelerate writes, but it is safe to lose the cache device at any time.

I felt comfortable switching the bcache device to writeback mode as the underlying storage used for the cache is already reliable. In writeback mode the writes are acknowledged as soon as they are completed on the cache device and are later sequentially written to backing device.

This should result in higher write performance as well as the higher read performance we’d expect from using a cache.

$ sudo make-bcache -B /dev/xvdd
UUID:                   cf0353eb-ac08-45bd-a3d5-f8b92fbc0fb8
Set UUID:               8c06d317-e735-41f5-8dec-52da26a37883
version:                1
block_size:             1
data_offset:            16
$ sudo mkfs.ext4 -L cached /dev/bcache0
mke2fs 1.43.4 (31-Jan-2017)
Discarding device blocks: done
Creating filesystem with 13107198 4k blocks and 3276800 inodes
Filesystem UUID: 7db8568e-282b-4fa2-bf19-a42143708a0d
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424
 
Allocating group tables: done
Writing inode tables: done
Creating journal (65536 blocks): done
Writing superblocks and filesystem accounting information: done
 
$ sudo make-bcache -C /dev/xvdc
UUID:                   ae581052-4449-460d-8cf5-b11d6467403d
Set UUID:               2fcc7937-9f5f-4089-9400-552675929ea5
version:                0
nbuckets:               10240
block_size:             1
bucket_size:            1024
nr_in_set:              1
nr_this_dev:            0
first_bucket:           1
$ sudo bcache-super-show /dev/xvdc
sb.magic                ok
sb.first_sector         8 [match]
sb.csum                 206ADF979437131E [match]
sb.version              3 [cache device]
 
dev.label               (empty)
dev.uuid                ae581052-4449-460d-8cf5-b11d6467403d
dev.sectors_per_block   1
dev.sectors_per_bucket  1024
dev.cache.first_sector  1024
dev.cache.cache_sectors 10484736
dev.cache.total_sectors 10485760
dev.cache.ordered       yes
dev.cache.discard       no
dev.cache.pos           0
dev.cache.replacement   0 [lru]
 
cset.uuid               2fcc7937-9f5f-4089-9400-552675929ea5
 
$ echo 2fcc7937-9f5f-4089-9400-552675929ea5 | sudo tee /sys/block/bcache0/bcache/attach >/dev/null
$ echo writeback | sudo tee /sys/block/bcache0/bcache/cache_mode >/dev/null
$ sudo mount /dev/bcache0 /srv/cache
$ sudo chown andy: /srv/cache
git grep test
$ cd /srv/cache
$ /usr/bin/time -f '%U user %S system %E elapsed %P CPU' \
> git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
Cloning into 'linux-stable'...
remote: Counting objects: 5948690, done.
remote: Compressing objects: 100% (124799/124799), done.
remote: Total 5948690 (delta 427755), reused 549518 (delta 425675)
Receiving objects: 100% (5948690/5948690), 1.21 GiB | 40.30 MiB/s, done.
Resolving deltas: 100% (4961964/4961964), done.
Checking out files: 100% (59844/59844), done.
194.88 user 21.30 system 3:34.10 elapsed 100% CPU
 
$ cd /srv/cache/linux-stable
$ for i in {1..20}; do
> echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null ; \
> /usr/bin/time -f '%U user %S system %E elapsed %P CPU' \
> git --no-pager grep -q blah ; \
> sleep 1; \
> done
1.03 user 1.40 system 0:03.02 elapsed 80% CPU
1.01 user 1.25 system 0:02.55 elapsed 88% CPU
1.05 user 1.21 system 0:02.63 elapsed 85% CPU
0.90 user 1.34 system 0:02.56 elapsed 87% CPU
0.96 user 1.28 system 0:02.59 elapsed 86% CPU
0.90 user 1.35 system 0:02.54 elapsed 88% CPU
1.04 user 1.22 system 0:02.51 elapsed 90% CPU
1.00 user 1.25 system 0:02.50 elapsed 89% CPU
0.93 user 1.34 system 0:02.54 elapsed 89% CPU
1.02 user 1.26 system 0:02.53 elapsed 90% CPU
0.91 user 1.43 system 0:02.67 elapsed 88% CPU
0.94 user 1.39 system 0:02.57 elapsed 90% CPU
1.04 user 1.28 system 0:02.57 elapsed 90% CPU
1.00 user 1.37 system 0:02.58 elapsed 91% CPU
1.02 user 1.33 system 0:02.84 elapsed 82% CPU
1.11 user 1.21 system 0:02.58 elapsed 90% CPU
1.03 user 1.30 system 0:02.57 elapsed 90% CPU
0.96 user 1.37 system 0:02.58 elapsed 90% CPU
1.02 user 1.36 system 0:02.62 elapsed 90% CPU
0.92 user 1.41 system 0:02.62 elapsed 89% CPU

The initial git clone took 214.10 seconds and each git grep on average took about 2.61 seconds. It can be seen that the first run is slightly slower but very quickly things settle down.

The git grep results are very encouraging being only 11.6% of the time taken for the baseline HDD tests, or almost 9x faster. Each run was only 0.37 seconds slower than when run directly against SSD.

The initial clone results are disappointing though, being only 8.13 seconds faster than when run against baseline HDD. It looks like most data must have bypassed the cache and waited for HDD. bcache does bypass cache for sequential writes which is what the git clone would have done, and we can prove this:

$ cat /sys/block/bcache0/bcache/stats_total/bypassed
270M

So yes, there we go, all 270MiB bypassed the cache and waited on HDD completion.

We can give bcache hints about when to consider writes sequential, and at the extreme end tell it to never consider writes sequential, so let’s try that.

$ echo 0 | sudo tee /sys/block/bcache0/bcache/sequential_cutoff >/dev/null
$ cd /srv/cache
$ rm -r linux-stable
$ /usr/bin/time -f '%U user %S system %E elapsed %P CPU' \
> git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
Cloning into 'linux-stable'...
remote: Counting objects: 5948690, done.
remote: Compressing objects: 100% (124799/124799), done.
remote: Total 5948690 (delta 427756), reused 549517 (delta 425675)
Receiving objects: 100% (5948690/5948690), 1.21 GiB | 39.51 MiB/s, done.
Resolving deltas: 100% (4961965/4961965), done.
Checking out files: 100% (59844/59844), done.
195.10 user 23.53 system 2:39.87 elapsed 136% CPU
$ cat /sys/block/bcache0/bcache/stats_total/bypassed
299M

This time only 29MiB of data bypassed cache, and the time for clone was 159.87 seconds. That’s 62.36 seconds faster than against HDD, and only 1.98 seconds slower than against SSD!

fio test – sequential_cutoff=4M
$ cd /srv/cache/fio && FIOTEST=bcache-sc4m fio ~/partial.fio
partial: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=8
...
fio-2.16
Starting 2 processes
partial: Laying out IO file(s) (1 file(s) / 2048MB)
partial: Laying out IO file(s) (1 file(s) / 2048MB)
Jobs: 2 (f=2): [r(2)] [100.0% done] [119.9MB/0KB/0KB /s] [30.7K/0/0 iops] [eta 00m:00s]
partial: (groupid=0, jobs=1): err= 0: pid=8640: Thu Jul 20 18:13:08 2017
  read : io=45237MB, bw=25734KB/s, iops=6433, runt=1800041msec
    slat (usec): min=3, max=5707, avg= 6.14, stdev=10.97
    clat (usec): min=2, max=1238.4K, avg=1235.14, stdev=10675.04
     lat (usec): min=70, max=1238.5K, avg=1241.70, stdev=10675.36
    clat percentiles (usec):
     |  1.00th=[  101],  5.00th=[  104], 10.00th=[  106], 20.00th=[  108],
     | 30.00th=[  111], 40.00th=[  117], 50.00th=[  119], 60.00th=[  121],
     | 70.00th=[  122], 80.00th=[  124], 90.00th=[  133], 95.00th=[  151],
     | 99.00th=[35584], 99.50th=[67072], 99.90th=[148480], 99.95th=[185344],
     | 99.99th=[305152]
    lat (usec) : 4=0.01%, 50=0.01%, 100=0.37%, 250=97.11%, 500=0.08%
    lat (usec) : 750=0.02%, 1000=0.02%
    lat (msec) : 2=0.03%, 4=0.01%, 10=0.13%, 20=0.65%, 50=0.87%
    lat (msec) : 100=0.46%, 250=0.24%, 500=0.01%, 750=0.01%, 1000=0.01%
    lat (msec) : 2000=0.01%
  cpu          : usr=1.81%, sys=6.79%, ctx=10722687, majf=0, minf=17
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=11580733/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8
partial: (groupid=0, jobs=1): err= 0: pid=8641: Thu Jul 20 18:13:08 2017
  read : io=43743MB, bw=24885KB/s, iops=6221, runt=1800037msec
    slat (usec): min=3, max=5703, avg= 6.27, stdev=10.53
    clat (usec): min=2, max=1081.7K, avg=1277.44, stdev=10884.69
     lat (usec): min=70, max=1081.8K, avg=1284.13, stdev=10884.99
    clat percentiles (usec):
     |  1.00th=[  100],  5.00th=[  103], 10.00th=[  104], 20.00th=[  106],
     | 30.00th=[  107], 40.00th=[  109], 50.00th=[  111], 60.00th=[  117],
     | 70.00th=[  120], 80.00th=[  122], 90.00th=[  129], 95.00th=[  145],
     | 99.00th=[37632], 99.50th=[69120], 99.90th=[150528], 99.95th=[189440],
     | 99.99th=[317440]
    lat (usec) : 4=0.01%, 20=0.01%, 50=0.01%, 100=0.68%, 250=96.73%
    lat (usec) : 500=0.07%, 750=0.02%, 1000=0.02%
    lat (msec) : 2=0.03%, 4=0.01%, 10=0.13%, 20=0.68%, 50=0.90%
    lat (msec) : 100=0.48%, 250=0.25%, 500=0.02%, 750=0.01%, 1000=0.01%
    lat (msec) : 2000=0.01%
  cpu          : usr=1.74%, sys=6.57%, ctx=10425775, majf=0, minf=15
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=11198290/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8
 
Run status group 0 (all jobs):
   READ: io=88981MB, aggrb=50618KB/s, minb=24884KB/s, maxb=25734KB/s, mint=1800037msec, maxt=1800041msec
 
Disk stats (read/write):
    bcache0: ios=22775854/880, merge=0/0, ticks=28668080/36768, in_queue=197531580, util=100.00%, aggrios=11393419/289009, aggrmerge=0/182, aggrticks=14329508/435284, aggrin_queue=14764868, aggrutil=100.00%
  xvdc: ios=22230796/570128, merge=0/348, ticks=2656456/37312, in_queue=2693812, util=67.39%
  xvdd: ios=556042/7890, merge=0/17, ticks=26002560/833256, in_queue=26835924, util=100.00%

The ~15k IOPS per process achieved towards the end of this test are quite impressive and suggest that there is more scope for performance increase in a longer test. The combined average IOPS across 30 minutes were “only” 12.7k though. Vastly superior to a HDD and we would be glad to take it.

22,779,023 IOs were issued (11,580,733+11,198,290) of which 22,230,796 (97.6%) hit the cache. That’s a really great result, but we learned earlier that setting sequential_cutoff to 0 allows us to get data we write straight into the cache, so let’s see how that changes things.

fio test – sequential_cutoff=0
$ cd /srv/cache/fio && FIOTEST=bcache-sc0 fio ~/partial.fio
partial: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=8
...
fio-2.16
Starting 2 processes
partial: Laying out IO file(s) (1 file(s) / 2048MB)
partial: Laying out IO file(s) (1 file(s) / 2048MB)
Jobs: 2 (f=2): [r(2)] [100.0% done] [276.1MB/0KB/0KB /s] [70.1K/0/0 iops] [eta 00m:00s]
partial: (groupid=0, jobs=1): err= 0: pid=9190: Thu Jul 20 20:34:49 2017
  read : io=367029MB, bw=208799KB/s, iops=52199, runt=1800001msec
    slat (usec): min=3, max=350673, avg= 5.49, stdev=54.90
    clat (usec): min=22, max=558335, avg=145.49, stdev=593.01
     lat (usec): min=67, max=558342, avg=151.41, stdev=597.19
    clat percentiles (usec):
     |  1.00th=[  102],  5.00th=[  106], 10.00th=[  109], 20.00th=[  116],
     | 30.00th=[  120], 40.00th=[  123], 50.00th=[  127], 60.00th=[  133],
     | 70.00th=[  139], 80.00th=[  153], 90.00th=[  185], 95.00th=[  223],
     | 99.00th=[  310], 99.50th=[  338], 99.90th=[  442], 99.95th=[  868],
     | 99.99th=[20352]
    lat (usec) : 50=0.01%, 100=0.27%, 250=96.55%, 500=3.11%, 750=0.02%
    lat (usec) : 1000=0.01%
    lat (msec) : 2=0.02%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
    lat (msec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
  cpu          : usr=13.23%, sys=41.79%, ctx=46793234, majf=0, minf=17
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=93959531/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8
partial: (groupid=0, jobs=1): err= 0: pid=9191: Thu Jul 20 20:34:49 2017
  read : io=116645MB, bw=66357KB/s, iops=16589, runt=1800020msec
    slat (usec): min=3, max=86150, avg= 6.31, stdev=22.07
    clat (usec): min=16, max=581877, avg=473.49, stdev=3466.14
     lat (usec): min=67, max=581884, avg=480.26, stdev=3466.62
    clat percentiles (usec):
     |  1.00th=[  102],  5.00th=[  107], 10.00th=[  110], 20.00th=[  116],
     | 30.00th=[  120], 40.00th=[  123], 50.00th=[  126], 60.00th=[  131],
     | 70.00th=[  135], 80.00th=[  143], 90.00th=[  161], 95.00th=[  183],
     | 99.00th=[14656], 99.50th=[23680], 99.90th=[47360], 99.95th=[59648],
     | 99.99th=[94720]
    lat (usec) : 20=0.01%, 50=0.01%, 100=0.23%, 250=97.81%, 500=0.37%
    lat (usec) : 750=0.03%, 1000=0.02%
    lat (msec) : 2=0.02%, 4=0.02%, 10=0.18%, 20=0.66%, 50=0.58%
    lat (msec) : 100=0.08%, 250=0.01%, 500=0.01%, 750=0.01%
  cpu          : usr=4.86%, sys=16.81%, ctx=24164179, majf=0, minf=15
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=29861123/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8
 
Run status group 0 (all jobs):
   READ: io=483674MB, aggrb=275153KB/s, minb=66357KB/s, maxb=208798KB/s, mint=1800001msec, maxt=1800020msec
 
Disk stats (read/write):
    bcache0: ios=123818417/28, merge=0/0, ticks=27945560/5768, in_queue=429639892, util=100.00%, aggrios=61919282/259095, aggrmerge=0/25, aggrticks=13927904/125270, aggrin_queue=14059068, aggrutil=100.00%
  xvdc: ios=123345393/500268, merge=0/35, ticks=17064408/30904, in_queue=17107112, util=99.99%
  xvdd: ios=493172/17923, merge=0/15, ticks=10791400/219636, in_queue=11011024, util=100.00%

123,820,654 (93,959,531+29,861,123) IOs issued, 123,345,393 (99.62%) hit cache. It’s just ~2% higher than the previous test, but what a difference it makes. The IOPS log indicates that performance very quickly increases to near-SSD levels, averaging 68.8k IOPS across 30 minutes.

Interestingly unlike all other tests (even unlike bcache with the default sequential_cutoff), the second job doesn’t manage anywhere near the same level of performance as the first job. It’s 52.2k IOPS vs. 16.6k IOPS. Perhaps at this high level of cache utilisation there is some sort of locking going on. Might be worth following up with upstream.

lvmcache ^

Setup
$ sudo pvcreate /dev/xvdc
  Physical volume "/dev/xvdc" successfully created.
$ sudo pvcreate /dev/xvdd
  Physical volume "/dev/xvdd" successfully created.
$ sudo vgcreate cachedvg /dev/xvd{d,c}
  Volume group "cachedvg" successfully created
$ sudo lvcreate -L50m -n lv_cache_meta cachedvg /dev/xvdc
  Rounding up size to full physical extent 52.00 MiB
  Logical volume "lv_cache_meta" created.
$ sudo lvcreate -L4g -n lv_cache cachedvg /dev/xvdc
  Logical volume "lv_cache" created.
$ sudo lvcreate -l100%PVS -n lv cachedvg /dev/xvdd
  Logical volume "lv" created.
$ sudo lvs
  LV            VG       Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  lv            cachedvg -wi-a----- 50.00g
  lv_cache      cachedvg -wi-a-----  4.00g
  lv_cache_meta cachedvg -wi-a----- 52.00m
$ sudo lvconvert --type cache-pool --cachemode writeback --poolmetadata cachedvg/lv_cache_meta cachedvg/lv_cache
  Converted cachedvg/lv_cache to cache pool.
$ sudo lvconvert --type cache --cachepool cachedvg/lv_cache cachedvg/lv
  Logical volume cachedvg/lv is now cached.
$ sudo lvs -a -o +devices
  LV               VG       Attr       LSize  Pool       Origin     Data%  Meta%  Move Log Cpy%Sync Convert Devices 
  lv               cachedvg Cwi-a-C--- 50.00g [lv_cache] [lv_corig] 0.00   1.01            0.00             lv_corig(0) 
  [lv_cache]       cachedvg Cwi---C---  4.00g                       0.00   1.01            0.00             lv_cache_cdata(0)
  [lv_cache_cdata] cachedvg Cwi-ao----  4.00g                                                               /dev/xvdc(13)
  [lv_cache_cmeta] cachedvg ewi-ao---- 52.00m                                                               /dev/xvdc(0) 
  [lv_corig]       cachedvg owi-aoC--- 50.00g                                                               /dev/xvdd(0) 
  [lvol0_pmspare]  cachedvg ewi------- 52.00m                                                               /dev/xvdc(1037)
$ sudo mkfs.ext4 -L cached /dev/cachedvg/lv
mke2fs 1.43.4 (31-Jan-2017)
Creating filesystem with 13106176 4k blocks and 3276800 inodes
Filesystem UUID: bff28530-339a-403d-889d-dd42f4465ab5
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424
 
Allocating group tables: done
Writing inode tables: done
Creating journal (65536 blocks): done
Writing superblocks and filesystem accounting information: done   
$ sudo mount /dev/cachedvg/lv /srv/cache
$ sudo chown andy: /srv/cache
git grep test
$ cd /srv/cache
$ /usr/bin/time -f '%U user %S system %E elapsed %P CPU' \
> git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
Cloning into 'linux-stable'...
remote: Counting objects: 5948690, done.
remote: Compressing objects: 100% (124799/124799), done.
remote: Total 5948690 (delta 427755), reused 549518 (delta 425675)
Receiving objects: 100% (5948690/5948690), 1.21 GiB | 35.05 MiB/s, done.
Resolving deltas: 100% (4961964/4961964), done.
Checking out files: 100% (59844/59844), done.
196.00 user 23.40 system 2:43.13 elapsed 134% CPU
$ cd /srv/cache/linux-stable
$ for i in {1..20}; do
> echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null ; \
> /usr/bin/time -f '%U user %S system %E elapsed %P CPU' \
> git --no-pager grep -q blah ; \
> sleep 1; \
> done
1.00 user 1.58 system 0:06.41 elapsed 40% CPU
0.93 user 1.46 system 0:04.22 elapsed 56% CPU
1.09 user 1.30 system 0:04.09 elapsed 58% CPU
1.02 user 1.39 system 0:03.89 elapsed 62% CPU
0.97 user 1.41 system 0:03.89 elapsed 61% CPU
1.03 user 1.34 system 0:03.52 elapsed 67% CPU
1.06 user 1.30 system 0:03.60 elapsed 65% CPU
0.94 user 1.46 system 0:03.79 elapsed 63% CPU
1.05 user 1.33 system 0:03.28 elapsed 72% CPU
1.00 user 1.39 system 0:03.81 elapsed 62% CPU
1.03 user 1.37 system 0:03.25 elapsed 74% CPU
1.05 user 1.37 system 0:03.30 elapsed 73% CPU
1.07 user 1.30 system 0:03.18 elapsed 74% CPU
1.07 user 1.32 system 0:03.28 elapsed 72% CPU
0.96 user 1.46 system 0:02.95 elapsed 82% CPU
1.07 user 1.31 system 0:03.14 elapsed 76% CPU
0.98 user 1.40 system 0:03.04 elapsed 78% CPU
0.96 user 1.42 system 0:03.19 elapsed 74% CPU
1.00 user 1.40 system 0:03.12 elapsed 77% CPU
0.89 user 1.48 system 0:02.89 elapsed 82% CPU

The initial git clone took 163.13 seconds, not far off native SSD timings (157.68 seconds). Each git grep on average took about 3.35 seconds. These started off slower (6.41 seconds) but settled down with the last 15 runs averaging at 3.29 seconds.

fio test – empty cache
$ cd /srv/cache/fio && FIOTEST=lvmcache-empty fio ~/partial.fio
partial: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=8
...
fio-2.16
Starting 2 processes
partial: Laying out IO file(s) (1 file(s) / 2048MB)
partial: Laying out IO file(s) (1 file(s) / 2048MB)
Jobs: 2 (f=2): [r(2)] [100.0% done] [373.5MB/0KB/0KB /s] [95.7K/0/0 iops] [eta 00m:00s]
partial: (groupid=0, jobs=1): err= 0: pid=6652: Thu Jul 20 12:14:06 2017
  read : io=333831MB, bw=189912KB/s, iops=47478, runt=1800001msec
    slat (usec): min=2, max=245, avg= 6.41, stdev= 2.17
    clat (usec): min=29, max=234093, avg=159.69, stdev=401.74
     lat (usec): min=69, max=234098, avg=166.57, stdev=401.75
    clat percentiles (usec):
     |  1.00th=[  103],  5.00th=[  108], 10.00th=[  113], 20.00th=[  119],
     | 30.00th=[  124], 40.00th=[  129], 50.00th=[  135], 60.00th=[  143],
     | 70.00th=[  153], 80.00th=[  175], 90.00th=[  221], 95.00th=[  274],
     | 99.00th=[  370], 99.50th=[  410], 99.90th=[  502], 99.95th=[  996],
     | 99.99th=[16768]
    lat (usec) : 50=0.01%, 100=0.16%, 250=92.76%, 500=6.97%, 750=0.05%
    lat (usec) : 1000=0.01%
    lat (msec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.02%, 50=0.01%
    lat (msec) : 100=0.01%, 250=0.01%
  cpu          : usr=12.04%, sys=44.39%, ctx=42073879, majf=0, minf=16
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=85460611/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8
partial: (groupid=0, jobs=1): err= 0: pid=6653: Thu Jul 20 12:14:06 2017
  read : io=313896MB, bw=178572KB/s, iops=44642, runt=1800001msec
    slat (usec): min=2, max=335, avg= 6.55, stdev= 2.25
    clat (usec): min=22, max=212646, avg=170.22, stdev=721.29 
     lat (usec): min=63, max=212656, avg=177.25, stdev=721.32 
    clat percentiles (usec):
     |  1.00th=[  103],  5.00th=[  108], 10.00th=[  112], 20.00th=[  118],
     | 30.00th=[  122], 40.00th=[  127], 50.00th=[  131], 60.00th=[  137],
     | 70.00th=[  145], 80.00th=[  157], 90.00th=[  177], 95.00th=[  197],
     | 99.00th=[  258], 99.50th=[  314], 99.90th=[12096], 99.95th=[15424],
     | 99.99th=[28032]
    lat (usec) : 50=0.01%, 100=0.22%, 250=98.57%, 500=0.91%, 750=0.01%
    lat (usec) : 1000=0.01%
    lat (msec) : 2=0.01%, 4=0.02%, 10=0.11%, 20=0.11%, 50=0.03%
    lat (msec) : 100=0.01%, 250=0.01%
  cpu          : usr=11.36%, sys=42.77%, ctx=41146564, majf=0, minf=14
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=80357365/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8
 
Run status group 0 (all jobs):
   READ: io=647726MB, aggrb=368484KB/s, minb=178571KB/s, maxb=189912KB/s, mint=1800001msec, maxt=1800001msec
 
Disk stats (read/write):
    dm-2: ios=165815070/5, merge=0/0, ticks=27172896/564, in_queue=27297688, util=100.00%, aggrios=55273566/7259, aggrmerge=0/0, aggrticks=8986252/1498, aggrin_queue=9007888, aggrutil=100.00%
    dm-1: ios=2/19047, merge=0/0, ticks=0/3072, in_queue=3072, util=0.14%, aggrios=165379217/17690, aggrmerge=0/4083, aggrticks=23890596/3616, aggrin_queue=23926592, aggrutil=100.00%
  xvdc: ios=165379217/17690, merge=0/4083, ticks=23890596/3616, in_queue=23926592, util=100.00%
  dm-0: ios=165379215/2726, merge=0/0, ticks=23917816/860, in_queue=23979056, util=100.00%
    dm-3: ios=441483/6, merge=0/0, ticks=3040940/564, in_queue=3041536, util=75.31%, aggrios=441483/5, aggrmerge=0/1, aggrticks=3040896/464, aggrin_queue=3041312, aggrutil=75.30%
  xvdd: ios=441483/5, merge=0/1, ticks=3040896/464, in_queue=3041312, util=75.30%

We can see that the first job issued 85,460,611 IOs and the second job issued 80,357,365 IOs, for a total of 165817976 IOs. Of these, 165,379,217 (99.74%) came from xvdc, the cache (SSD) device. Only 441,483 (0.27%) came from xvdd, the backing (HDD) device. Again this is very aggressive cache population, with a combined average of 92k IOPS being achieved over 30 minutes.

80k IOPS are achieved within 20 seconds of the start of testing, which is probably helped by the disk cache, but the steady maximum of ~90k IOPS is still achieved by 40 seconds in. Average IOPS ended up at 88% that of the SSD.

This is incredible performance, but I could only get it to happen the first time, against an empty cache device. Since the next test run will unlink the previous run’s files, the cache device will at that point be full of data that is of no use as it’s been deleted. The next test will write out 4GiB of new, uncached data.

fio test – full cache
$ cd /srv/cache/fio && FIOTEST=lvmcache-full fio ~/partial.fio
partial: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=8
...
fio-2.16
Starting 2 processes
partial: Laying out IO file(s) (1 file(s) / 2048MB)
partial: Laying out IO file(s) (1 file(s) / 2048MB)
Jobs: 2 (f=2): [r(2)] [100.0% done] [6072KB/0KB/0KB /s] [1518/0/0 iops] [eta 00m:00s]
partial: (groupid=0, jobs=1): err= 0: pid=6857: Thu Jul 20 12:58:00 2017
  read : io=5266.1MB, bw=2996.2KB/s, iops=749, runt=1800111msec
    slat (usec): min=3, max=117, avg=12.93, stdev= 7.50
    clat (usec): min=58, max=1060.8K, avg=10663.38, stdev=31025.49
     lat (usec): min=73, max=1060.8K, avg=10676.99, stdev=31025.92
    clat percentiles (usec):
     |  1.00th=[  105],  5.00th=[  112], 10.00th=[  117], 20.00th=[  125],
     | 30.00th=[  127], 40.00th=[  133], 50.00th=[  137], 60.00th=[  153],
     | 70.00th=[  179], 80.00th=[14144], 90.00th=[33024], 95.00th=[59136],
     | 99.00th=[144384], 99.50th=[191488], 99.90th=[337920], 99.95th=[411648],
     | 99.99th=[618496]
    lat (usec) : 100=0.29%, 250=74.34%, 500=0.46%, 750=0.20%, 1000=0.21%
    lat (msec) : 2=0.47%, 4=0.03%, 10=1.30%, 20=7.26%, 50=9.23%
    lat (msec) : 100=4.09%, 250=1.88%, 500=0.22%, 750=0.02%, 1000=0.01%
    lat (msec) : 2000=0.01%
  cpu          : usr=0.48%, sys=1.52%, ctx=1335232, majf=0, minf=17
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=1348336/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8
partial: (groupid=0, jobs=1): err= 0: pid=6858: Thu Jul 20 12:58:00 2017
  read : io=5171.2MB, bw=2941.7KB/s, iops=735, runt=1800133msec
    slat (usec): min=3, max=320, avg=13.05, stdev= 7.56
    clat (usec): min=63, max=1145.3K, avg=10861.20, stdev=30827.62
     lat (usec): min=72, max=1145.3K, avg=10874.93, stdev=30828.07
    clat percentiles (usec):
     |  1.00th=[  109],  5.00th=[  116], 10.00th=[  120], 20.00th=[  126],
     | 30.00th=[  131], 40.00th=[  133], 50.00th=[  141], 60.00th=[  155],
     | 70.00th=[  185], 80.00th=[14912], 90.00th=[33536], 95.00th=[59136],
     | 99.00th=[142336], 99.50th=[189440], 99.90th=[333824], 99.95th=[411648],
     | 99.99th=[602112]
    lat (usec) : 100=0.33%, 250=73.29%, 500=0.50%, 750=0.22%, 1000=0.22%
    lat (msec) : 2=0.52%, 4=0.03%, 10=1.38%, 20=7.58%, 50=9.63%
    lat (msec) : 100=4.18%, 250=1.88%, 500=0.21%, 750=0.02%, 1000=0.01%
    lat (msec) : 2000=0.01%
  cpu          : usr=0.43%, sys=1.55%, ctx=1310390, majf=0, minf=15
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=1323812/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8
 
Run status group 0 (all jobs):
   READ: io=10438MB, aggrb=5937KB/s, minb=2941KB/s, maxb=2996KB/s, mint=1800111msec, maxt=1800133msec
 
Disk stats (read/write):
    dm-2: ios=2671993/27, merge=0/0, ticks=28764200/8380, in_queue=28773988, util=100.00%, aggrios=891886/15335, aggrmerge=0/0, aggrticks=9387100/6312, aggrin_queue=9393456, aggrutil=100.00%
    dm-1: ios=173/42639, merge=0/0, ticks=16/9316, in_queue=9336, util=0.24%, aggrios=1967684/42191, aggrmerge=0/3788, aggrticks=291076/9440, aggrin_queue=300476, aggrutil=15.68%
  xvdc: ios=1967684/42191, merge=0/3788, ticks=291076/9440, in_queue=300476, util=15.68%
  dm-0: ios=1967511/3340, merge=0/0, ticks=291184/1240, in_queue=292496, util=15.55%
    dm-3: ios=707976/28, merge=0/0, ticks=27870100/8380, in_queue=27878536, util=100.00%, aggrios=707976/12, aggrmerge=0/16, aggrticks=27870028/2908, aggrin_queue=27872968, aggrutil=100.00%
  xvdd: ios=707976/12, merge=0/16, ticks=27870028/2908, in_queue=27872968, util=100.00%

Testing with an already-full cache produces dramatically different results. We can see that a total of 2,672,148 (1,348,336+1,323,812) IOs were issued, with 1,967,684 (73.64%) coming from the cache. 707,976 (26.49%) came from the backing device. Although on the face of it that sounds like a good hit rate, it was poor enough to severely limit the average IOPS to 1484 across both jobs. We saw how much the difference was between 99.6% and 97.6% cache hit rate, and now we see what the difference is between ~99% and ~73%.

1,484 IOPS represents a 158% increase on the baseline HDD IOPS of 938 so it’s nothing to be sniffed at, and as I say it may be the case that it only gets better over much longer time span.

Even after this 30 minute test the performance of the lvmcache was still very slowly going up, so I decided to run another test across 3 hours instead of 30 seconds. The performance did continue to increase very very slowly peaking at about 1,640 aggregate IOPS after 3 hours. It seems like it is quite difficult to evict data from the cache in favour of new data that is being read, whereas when the cache was empty it promoted everything that was read immediately. Possibly if this was left for long enough there would eventually be enough cache hits to achieve similar performance to the first run.

bcache did not suffer from this problem: every test run against bcache was the same, except for the effect of preloading the cache device by messing with sequential_cutoff. I suspect that this is because bcache is a write cache; when set to writethrough or writeback mode, writes are going through the cache and immediately being cached. lvmcache is not a write cache; it will only promote things when they are read.

It seems like lvmcache needs a very large number of reads before it will evict data from the cache and bring new data in.

Further tests? ^

If you have any further tests you’d like to suggest please leave a comment or a tweet. I think it would be good to come up with more fio tests for this, but bear in mind that it really needs to test random read IO with a repetitive element so that a cache can actually be effective. If you read every part of 100GiB equally as often then there will be no hot spots and a comparatively small cache will not be effective, no matter what technology is used.

Also real-world benchmarks would be good, so perhaps a really complex SQL SELECT on a data set that won’t fit in (1GiB) memory but will fit in (5GiB) SSD cache.

It might be interesting to use a cache device much smaller than the working set. These tests laid out two 2GiB files then used a Zipf distribution to randomly read 10% of each (so two sets of ~205MiB data) while the cache devices were 4 or 5GiB in size. It is difficult to guess what the hot spots would be on a production file system, but a 90/10 or 80/20 distribution does seem reasonable.

What may be less reasonable is starting off with such a small data set. What would it be like to try to cache 1TiB with 10GiB of SSD? I’d rather not write out 1TiB of data to check, but I could simulate by writing out 51GiB (5%) and having a cache device that is only 512MiB. On a 90/10 distribution it would still want to cache ~5.2GiB and only have 512MiB to put it in.

Conclusions ^

SSDs are lovely but they’re still really expensive. Although they are fun for turning out impressive benchmarks, in the real world you only need things to be fast enough. If you want consistently great performance and you have the money, buy some decent SSDs. If you don’t have the money then there is a whole world of trade-offs.

bcache from what I can measure provides really impressive performance but I am concerned about relying upon it. The existing implementation is no longer the focus of the main developer’s attention as the developer has now moved on to bcachefs, a complete filesystem in its own right that builds upon the bcache concept.

I’m really conservative when it comes to filesystems and I know it can take years, even decades, to get to feature parity with all the nasty bugs hammered out. I also worry about continued maintenance of a separate filesystem.

lvmcache on the other hand is providing only a very modest performance boost. In many cases it won’t be worth it. I do like its simplicity though and I’m already very familiar with LVM.

The way that any logical volume can have a cache attached without needing to start over with the filesystem makes lvmcache very flexible; it also means that some volumes can be cached and others not. Furthermore with a slow PV and a fast PV both in the volume group, if necessary an individual logical volume can be allocated entirely from the fast PV.

I may experiment with bcache some more in situations where I can quickly restore the data in event of calamity, or fairly easily move to a different solution if bcache becomes abandonware. I may experiment with lvmcache in places where a modest speedup would be welcome.

3 thoughts on “bcache and lvmcache

Leave a Reply

Your email address will not be published. Required fields are marked *