When is a 64-bit counter not a 64-bit counter?

…when you run a Xen device backend (commonly dom0) on a kernel version earlier than 4.10, e.g. Debian stable.

TL;DR ^

Xen netback devices used 32-bit counters until that bug was fixed and released in kernel version 4.10.

On a kernel with that bug you will see counter wraps much sooner than you would expect, and if the interface is doing enough traffic for there to be multiple wraps in 5 minutes, your monitoring will no longer be accurate.

The problem ^

A high-bandwidth VPS customer reported that the bandwidth figures presented by BitFolk’s monitoring bore no resemblance to their own statistics gathered from inside their VPS. Their figures were a lot higher.

About octet counters ^

The Linux kernel maintains byte/octet counters for its network interfaces. You can view them in /sys/class/net/<interface>/statistics/*_bytes.

They’re a simple count of bytes transferred, and so the count always goes up. Typically these are 64-bit unsigned integers so their maximum value would be 18,446,744,073,709,551,615 (264-1).

When you’re monitoring bandwidth use the monitoring system records the value and the timestamp. The difference in value over a known period allows the monitoring system to work out the rate.

Wrapping ^

Monitoring of network devices is often done using SNMP. SNMP has 32-bit and 64-bit counters.

The maximum value that can be held in a 32-bit counter is 4,294,967,295. As that is a byte count, that represents 34,359,738,368 bits or 34,359.74 megabits. Divide that by 300 (seconds in 5 minutes) and you get 114.5. Therefore if the average bandwidth is above 114.5Mbit/s for 5 minutes, you will overflow a 32-bit counter. When the counter overflows it wraps back through zero.

Wrapping a counter once is fine. We have to expect that a counter will wrap eventually, and as counters never decrease, if a new value is smaller than the previous one then we know it has wrapped and can still work out what the rate should be.

The problem comes when the counter wraps more than once. There is no way to tell how many times it has wrapped so the monitoring system will have to assume the answer is once. Once traffic reaches ~229Mbit/s the counters will be wrapping at least twice in 5 minutes and the statistics become meaningless.

64-bit counters to the rescue ^

For that reason, network traffic is normally monitored using 64-bit counters. You would have to have a traffic rate of almost 492 Petabit/s to wrap a 64-bit byte counter in 5 minutes.

The thing is, I was already using 64-bit SNMP counters.

Examining the sysfs files ^

I decided to remove SNMP from the equation by going to the source of the data that SNMP uses: the kernel on the device being monitored.

As mentioned, the kernel’s interface byte counters are exposed in sysfs at /sys/class/net/<interface>/statistics/*_bytes. I dumped out those values every 10 seconds and watched them scroll in a terminal session.

What I observed was that these counters, for that particular customer, were wrapping every couple of minutes. I never observed a value greater than 8,469,862,875. That’s larger than a 32-bit counter would hold, but very close to what a 33 bit counter would hold (8,589,934,591).

64-bit counters not to the rescue ^

Once I realised that the kernel’s own counters were wrapping every couple of minutes inside the kernel it became clear that using 64-bit counters in SNMP was not going to help at all, and multiple wraps would be seen in 5 minutes.

What a difference a minute makes ^

To test the hypothesis I switched to 1-minute polling. Here’s what 12 hours of real data looks like under both 5- and 1-minute polling.

As you can see that is a pretty dramatic difference.

The bug ^

By this point, I’d realised that there must be a bug in Xen’s netback driver (the thing that makes virtual network interfaces in dom0).

I went searching through the source of the kernel and found that the counters had changed from an unsigned long in kernel version 4.9 to a u64 in kernel version 4.10.

Of course, once I knew what to search for it was easy to unearth a previous bug report. If I’d found that at the time of the initial report that would have saved 2 days of investigation!

Even so, the fix for this was only committed in February of this year so, unfortunately, is not present in the kernel in use by the current Debian stable. Nor in many other current distributions.

For Xen set-ups on Debian the bug could be avoided by using a backports kernel or packaging an upstream kernel.

Or you could do 1-minute polling as that would only wrap one time at an average bandwidth of ~572Mbit/s and should be safe from multiple wraps up to ~1.1Gbit/s.

Inside the VPS the counters are 64-bit so it isn’t an issue for guest administrators.

A slightly more realistic look at lvmcache

Recap ^

And then… ^

I decided to perform some slightly more realistic benchmarks against lvmcache.

The problem with the initial benchmark was that it only covered 4GiB of data with a 4GiB cache device. Naturally once lvmcache was working correctly its performance was awesome – the entire dataset was in the cache. But clearly if you have enough fast block device available to fit all your data then you don’t need to cache it at all and may as well just use the fast device directly.

I decided to perform some fio tests with varying data sizes, some of which were larger than the cache device.

Test methodology ^

Once again I used a Zipf distribution with a factor of 1.2, which should have caused about 90% of the hits to come from just 10% of the data. I kept the cache device at 4GiB but varied the data size. The following data sizes were tested:

  • 1GiB
  • 2GiB
  • 4GiB
  • 8GiB
  • 16GiB
  • 32GiB
  • 48GiB

With the 48GiB test I expected to see lvmcache struggling, as the hot 10% (~4.8GiB) would no longer fit within the 4GiB cache device.

A similar fio job spec to those from the earlier articles was used:

[cachesize-1g]
size=512m
ioengine=libaio
direct=1
iodepth=8
numjobs=2
readwrite=randread
random_distribution=zipf:1.2
bs=4k
unlink=1
runtime=30m
time_based=1
per_job_logs=1
log_avg_msec=500
write_iops_log=/var/tmp/fio-${FIOTEST}

…the only difference being that several different job files were used each with a different size= directive. Note that as there are two jobs, the size= is half the desired total data size: each job lays out a data file of the specified size.

For each data size I took care to fill the cache with data first before doing a test run, as unreproducible performance is still seen against a completely empty cache device. This produced IOPS logs and a completion latency histogram. Test were also run against SSD and HDD to provide baseline figures.

Results ^

IOPS graphs ^

All-in-one ^

Immediately we can see that for data sizes 4GiB and below performance converges quite quickly to near-SSD levels. That is very much what we would expect when the cache device is 4GiB, so big enough to completely cache everything.

Let’s just have a look at the lower-performing configurations.

Low-end performers ^

For 8, 16 and 32GiB data sizes performance clearly gets progressively worse, but it is still much better than baseline HDD. The 10% of hot data still fits within the cache device, so plenty of acceleration is still happening.

For the 48GiB data size it is a little bit of a different story. Performance is still better (on average) than baseline HDD, but there are periodic dips back down to roughly HDD figures. This is because not all of the 10% hot data fits into the cache device any more. Cache misses cause reads from HDD and consequently end up with HDD levels of performance for those reads.

The results no longer look quite so impressive, with even the 8GiB data set achieving only a few thousand IOPS on average. Are things as bad as they seem? Well no, I don’t think they are, and to see why we will have to look at the completion latency histograms.

Completion latency histograms ^

The above graphs are generated by fitting a Bezier curve to a scatter of data points each of which represents a 500ms average of IOPS achieved. The problem there is the word average.

It’s important to understand what effect averaging the figures gives. We’ve already seen that HDDs are really slow. Even if only a few percent of IOs end up missing cache and going to HDD, the massive latency of those requests will pull the average for the whole 500ms window way down.

Presumably we have a cache because we suspect we have hot spots of data, and we’ve been trying to evaluate that by doing most of the reads from only 10% of the data. Do we care what the average performance is then? Well it’s a useful metric but it’s not going to say much about the performance of reads from the hot data.

The histogram of completion latencies can be more useful. This shows how long it took between issuing the IO and completing the read for a certain percentage of issued IOs. Below I have focused on the 50% to 99% latency buckets, with the times for each bucket averaged between the two jobs. In the interests of being able to see anything at all I’ve had to double the height of the graph and still cut off the y axis for the three worst performers.

A couple of observations:

  • Somewhere between 70% and 80% of IOs complete with a latency that’s so close to SSD performance as to be near-indistinguishable, no matter what the data size. So what I think I am proving is that:

    you can cache a 48GiB slow backing device with 4GiB of fast SSD and if you have 10% hot data then you can expect it to be served up at near-SSD latencies 70%–80% of the time. If your hot spots are larger (not so hot) then you won’t achieve that. If your fast device is larger than 1/12th the backing device then you should do better than 70%–80%.

  • If the cache were perfect then we should expect the 90th percentile to be near SSD performance even for the 32GiB data set, as the 10% hot spot of ~3.2GiB fits inside the 4GiB cache. For whatever reason this is not achieved, but for that data size the 90th percentile latency is still about half that of HDD.
  • When the backing device is many times larger (32GiB+) than the cache device, the 99th percentile latencies can be slightly worse than for baseline HDD.

    I hesitate to suggest there is a problem here as there are going to be very few samples in the top 1%, so it could just be showing close to HDD performance.

Conclusion

Assuming you are okay with using a 4.12..x kernel, and assuming you are already comfortable using LVM, then at the moment it looks fairly harmless to deploy lvmcache.

Getting a decent performance boost out of it though will require you to check that your data really does have hot spots and size your cache appropriately.

Measuring your existing workload with something like blktrace is probably advisable, and these days you can feed the output of blktrace back into fio to see what performance might be like in a difference configuration.

Full test output

You probably want to stop reading here unless the complete output of all the fio runs is of interest to you.
Continue reading “A slightly more realistic look at lvmcache” ^

Tracking down the lvmcache fix

Background ^

In the previous article I covered how, in order to get decent performance out of lvmcache with a packaged Debian kernel, you’d have to use the 4.12.2-1~exp1 kernel from experimental. The kernels packaged in sid, testing (buster) and stable (stretch) aren’t new enough.

I decided to bisect the Linux kernel upstream git repository to find out exactly which commit(s) fixed things.

Results ^

Here’s a graph showing the IOPS over time for baseline SSD and lvmcache with a full cache under several different kernel versions. As in previous articles, the lines are actually Bezier curves fitted to the data which is scattered all over the place from 500ms averages.

What we can see here is that performance starts to improve with commit 4d44ec5ab751 authored by Joe Thornber:

dm cache policy smq: put newly promoted entries at the top of the multiqueue

This stops entries bouncing in and out of the cache quickly.

This is part of a set of commits authored by Joe Thornber on the drivers/md/dm-cache-policy-smq.c file and committed on 2017-05-14. By the time we reach commit 6cf4cc8f8b3b we have the long-term good performance that we were looking for.

The first of Joe Thornber’s commits on that day in the dm-cache area was 072792dcdfc8 and stepping back to the commit immediately prior to that one (2ea659a9ef48) we get a kernel representing the moment that Linus designated the v4.12-rc1 tag. Joe’s commits went into -rc1, and without them the performance of lvmcache under these test conditions isn’t much better than baseline HDD.

It seems like some of Joe’s changes helped a lot and then the last one really provided the long term performance.

git bisect procedure ^

Normally when you do a git bisect you’re starting with something that works and you’re looking for the commit that introduced a bug. In this case I was starting off with a known-good state and was interested in which commit(s) got me there. The normal bisect key words of “good” and “bad” in this case would be backwards to what I wanted. Dominic gave me the tip that I could alias the terms in order to reduce my confusion:

$ git bisect start --term-old broken --term-new fixed

From here on, when I encountered a test run that produced poor results I would issue:

$ git bisect broken

and when I had a test run with good results I would issue:

$ git bisect fixed

As I knew that the tag v4.13-rc1 produced a good run and v4.11 was bad, I could start off with:

$ git bisect reset v4.13-rc1
$ git bisect fixed
$ git bisect broken v4.11

git would then keep bisecting the search space of commits until I would find the one(s) that resulted in the high performance I was looking for.

Good and bad? ^

As before I’m using fio to conduct the testing, with the same job specification:

ioengine=libaio
direct=1
iodepth=8
numjobs=2
readwrite=randread
random_distribution=zipf:1.2
bs=4k
size=2g
unlink=1
runtime=15m
time_based=1
per_job_logs=1
log_avg_msec=500
write_iops_log=/var/tmp/fio-${FIOTEST}

The only difference from the other articles was that the run time was reduced to 15 minutes as all of the interesting behaviour happened within the first 11 minutes.

To recap, this fio job specification lays out two 2GiB files of random data and then starts two processes that perform 4kiB-sized reads against the files. Direct IO is used, in order to bypass the page cache.

A Zipfian distribution with a factor of 1.2 is used; this gives a 90/10 split where about 90% of the reads should come from about 10% of the data. The purpose of this is to simulate the hot spots of popular data that occur in real life. If the access pattern were to be perfectly and uniformly random then caching would not be effective.

In previous tests we had observed that dramatically different performance would be seen on the first run against an empty cache device compared to all other subsequent runs against what would be a full cache device. In the tests using kernels with the fix present the IOPS achieved would converge towards baseline SSD performance, whereas in kernels without the fix the performance would remain down near the level of baseline HDD. Therefore the fio tests were carried out twice.

Where to next? ^

I think I am going to see what happens when the cache device is pretty small in comparison to the working data.

All of the tests so far have used a 4GiB cache with 4GiB of data, so if everything got promoted it would entirely fit in cache. Not only that but the Zipf distribution makes most of the hits come from 10% of the data, so it’s actually just ~400MiB of hot data. I think it would be interesting to see what happens when the hot 10% is bigger than the cache device.

git bisect progress and test output ^

Unless you are particularly interested in the fio output and why I considered each one to be either fixed or broken, you probably want to stop reading now.

Continue reading “Tracking down the lvmcache fix”

lvmcache with a 4.12.3 kernel

Background ^

In the previous two articles I had discovered that lvmcache had amazing performance on an empty cache but then on every run after that (i.e. when the cache device was full of junk) went scarcely better than baseline HDD.

A few days ago I happened across an email on the linux-lvm list where Mike Snitzer advised:

the [CentOS] 7.4 dm-cache will be much more performant than the 7.3 cache you appear to be using.

…and…

It could be that your workload isn’t accessing the data enough to warrant promotion to the cache. dm-cache is a “hotspot” cache. If you aren’t accessing the data repeatedly then you won’t see much benefit (particularly with the 7.3 and earlier releases).

Just to get a feel, you could try the latest upstream 4.12 kernel to see how effective the 7.4 dm-cache will be for your setup.

I don’t know what kernel version CentOS 7.3 uses, but the VM I’m testing with is Debian testing (buster), so some version of 4.11.x plus backported patches.

That seemed pretty new, but Mike is suggesting 4.12.x so I thought I’d re-test lvmcache with the latest stable upstream kernel, which at the time of writing is version 4.12.3.

Test methodology ^

This time around I only focused on fio tests, using the same settings as before:

[partial]
ioengine=libaio
direct=1
iodepth=8
numjobs=2
readwrite=randread
random_distribution=zipf:1.2
bs=4k
size=2g
unlink=1
runtime=20m
time_based=1
per_job_logs=1
log_avg_msec=500
write_iops_log=/var/tmp/fio-${FIOTEST}

The only changes were:

  1. to reduce the run time to 20 minutes from 30 minutes, since all the interesting things happened within the first 20 minutes before.
  2. to write an IOPS log entry every 500ms instead of ever 1000ms, as the log files were quite small and some higher resolution might help smooth graphs out.

Last time there was a dramatic difference between the initial run with an empty cache and subsequent runs with a cache volume full of junk, so I did a test for each of those conditions, as well as tests for the baseline SSD and HDD.

The virtual machine had been upgraded from Debian 9 (stretch) to testing (buster), so it still had packaged kernel versions 4.9.30-2 and 4.11.6-1 laying around to test things with. In addition I compiled up version 4.12.3 by copying the .config from 4.11.6-1 then doing make oldconfig accepting all defaults.

Results ^

Although the fio job spec was essentially the same as in the previous article, I have since worked out how to merge the IOPS logs from both jobs so the graphs will seem to show about double the IOPS as they did before.

All-in-one ^

Well that’s an interesting set of graphs but rather hard to distinguish. Let’s try that by kernel version.

Baseline SSD by kernel version ^

A couple of weird things here:

  1. 4.12.3 and 4.11.6-1 are actually fairly consistent, but 4.9.30-2 varies rather a lot.
  2. All kernels show a sharp dip a few minutes in. I don’t know what that is about.

Although these lines do look quite far apart, bear in mind that this graph’s y axis starts at 92k IOPS. The average IOPS didn’t vary that much:

Average IOPS by kernel version
4.9.30-2 4.11.6-1 4.12.3
102,325 102,742 104,352

So there was actually only a 1.9% difference between the worst performer and the best.

Baseline HDD by kernel version ^

4.9.30-2 and 4.12.3 are close enough here to probably be within the margin of error, but there is something weird going on with 4.11.6-1.

Its average IOPS across the 20 minute test were only 56% of those for 4.12.3 and 53% of those for 4.9.30-2, which is quite a big disparity. I re-ran these tests 5 times to check it wasn’t some anomaly, but no, it’s reproducible.

Maybe something to look into another day.

lvmcache by kernel version ^

Dragging things back to the point of this article: previously we discovered that lvmcache worked great the first time through, when its cache volume was completely empty, but then subsequent runs all absolutely sucked. They didn’t perform significantly better than HDD baseline.

Let’s graph all the lvmcache results for each kernel version against the SSD baseline for that kernel to see if things changed at all.

lvmcache 4.9.30-2 ^

This is the similar to what we saw before: an empty cache volume produces decent results of around 47k IOPS. Although it’s interesting that the second job is down around 1k IOPS. Again the results on a full cache are poor. In fact the results for the second job of the empty cache are about the same as the results for both jobs on a full cache.

lvmcache 4.11.6-1 ^

Same story again here, although the performance is a little higher. Again the first job on an empty cache is getting the big results of almost 60k IOPS while the second job—and both jobs on a full cache—get only around 1k IOPS.

lvmcache 4.12.3 ^

Wow. Something dramatic has been fixed. The performance on an empty cache is still better, but crucially the performance on a full cache pretty quickly becomes very close to baseline SSD.

Also the runs against both the empty and full cache device result in both jobs getting roughly the same IOPS performance rather than the first job being great and all others very poor.

What’s next? ^

It’s really encouraging that the performance is so much better with 4.12.3. It’s changed lvmcache from a “hmm, maybe” option to one that I would strongly consider using anywhere I could.

It’s a shame though that such a new kernel is required. The kernel version in Debian testing (buster) is currently 4.11.6-1. Debian experimental’s linux-image-4.12.0-trunk-amd64 package currently has version 4.12.2-1 so I should test if that is new enough I tested to see if that was new enough.

Failing that I think I should git bisect or similar in order to find out exactly which changeset fixed this, so I could have some chance of knowing when it hits a packaged version.

Continue reading “lvmcache with a 4.12.3 kernel”

12 hours of lvmcache

In the previous post I noted that the performance of lvmcache was still increasing and it might be worth testing it for longer than 3 hours.

Here’a a 12 hour test ^

$ cd /srv/cache/fio && FIOTEST=lvmcache-12h fio ~/lvmcache-12h.fio
partial: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=8
...
fio-2.16
Starting 2 processes
partial: Laying out IO file(s) (1 file(s) / 2048MB)
partial: Laying out IO file(s) (1 file(s) / 2048MB)
Jobs: 2 (f=2): [r(2)] [100.0% done] [6272KB/0KB/0KB /s] [1568/0/0 iops] [eta 00m:00s]
partial: (groupid=0, jobs=1): err= 0: pid=11130: Fri Jul 21 16:37:30 2017
  read : io=136145MB, bw=3227.2KB/s, iops=806, runt=43200062msec
    slat (usec): min=3, max=586402, avg=14.27, stdev=619.54
    clat (usec): min=2, max=1517.9K, avg=9897.80, stdev=29334.14
     lat (usec): min=71, max=1517.9K, avg=9912.72, stdev=29344.74
    clat percentiles (usec):
     |  1.00th=[  103],  5.00th=[  110], 10.00th=[  113], 20.00th=[  119],
     | 30.00th=[  124], 40.00th=[  129], 50.00th=[  133], 60.00th=[  143],
     | 70.00th=[  157], 80.00th=[11840], 90.00th=[30848], 95.00th=[56576],
     | 99.00th=[136192], 99.50th=[179200], 99.90th=[309248], 99.95th=[382976],
     | 99.99th=[577536]
    lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=0.03%
    lat (usec) : 250=76.84%, 500=0.26%, 750=0.13%, 1000=0.13%
    lat (msec) : 2=0.25%, 4=0.02%, 10=1.19%, 20=6.67%, 50=8.59%
    lat (msec) : 100=3.93%, 250=1.77%, 500=0.18%, 750=0.02%, 1000=0.01%
    lat (msec) : 2000=0.01%
  cpu          : usr=0.44%, sys=1.63%, ctx=34524570, majf=0, minf=17
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=34853153/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8
partial: (groupid=0, jobs=1): err= 0: pid=11131: Fri Jul 21 16:37:30 2017
  read : io=134521MB, bw=3188.7KB/s, iops=797, runt=43200050msec
    slat (usec): min=3, max=588479, avg=14.35, stdev=613.38
    clat (usec): min=2, max=1530.3K, avg=10017.42, stdev=29196.28
     lat (usec): min=70, max=1530.3K, avg=10032.43, stdev=29207.06
    clat percentiles (usec):
     |  1.00th=[  103],  5.00th=[  109], 10.00th=[  112], 20.00th=[  118],
     | 30.00th=[  124], 40.00th=[  127], 50.00th=[  133], 60.00th=[  143],
     | 70.00th=[  157], 80.00th=[12352], 90.00th=[31360], 95.00th=[57600],
     | 99.00th=[138240], 99.50th=[179200], 99.90th=[301056], 99.95th=[370688],
     | 99.99th=[561152]
    lat (usec) : 4=0.01%, 20=0.01%, 50=0.01%, 100=0.04%, 250=76.56%
    lat (usec) : 500=0.26%, 750=0.12%, 1000=0.13%
    lat (msec) : 2=0.26%, 4=0.02%, 10=1.20%, 20=6.75%, 50=8.65%
    lat (msec) : 100=4.01%, 250=1.82%, 500=0.17%, 750=0.01%, 1000=0.01%
    lat (msec) : 2000=0.01%
  cpu          : usr=0.45%, sys=1.60%, ctx=34118324, majf=0, minf=15
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=34437257/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8
 
Run status group 0 (all jobs):
   READ: io=270666MB, aggrb=6415KB/s, minb=3188KB/s, maxb=3227KB/s, mint=43200050msec, maxt=43200062msec
 
Disk stats (read/write):
    dm-2: ios=69290239/1883, merge=0/0, ticks=690078104/246240, in_queue=690329868, util=100.00%, aggrios=23098543/27863, aggrmerge=0/0, aggrticks=229728200/637782, aggrin_queue=230366965, aggrutil=100.00%
    dm-1: ios=247/64985, merge=0/0, ticks=36/15464, in_queue=15504, util=0.02%, aggrios=53025553/63449, aggrmerge=0/7939, aggrticks=7413340/14760, aggrin_queue=7427028, aggrutil=16.42%
  xvdc: ios=53025553/63449, merge=0/7939, ticks=7413340/14760, in_queue=7427028, util=16.42%
  dm-0: ios=53025306/6403, merge=0/0, ticks=7417028/1852, in_queue=7419784, util=16.42%
    dm-3: ios=16270078/12201, merge=0/0, ticks=681767536/1896032, in_queue=683665608, util=100.00%, aggrios=16270077/12200, aggrmerge=1/1, aggrticks=681637744/1813744, aggrin_queue=683453224, aggrutil=100.00%
  xvdd: ios=16270077/12200, merge=1/1, ticks=681637744/1813744, in_queue=683453224, util=100.00%

It’s still going up, slowly. The cache hit rate was 76.53%. In the 30 minute test the hit rate was 73.64%.

Over 30 minutes the average IOPS was 1,484.

Over 12 hours the average IOPS was 1,603.

I was kind of hoping to reach the point where the hit rate is so high that it just takes off like bcache does and we start to see tens of thousands of IOPS, but it wasn’t to be.

24 hours of lvmcache ^

…so I went ahead and ran the same thing for 24 hours.

I’ve skipped the first 2 hours of results since we know what they look like. It appears to still be going up, although the results past 20 hours leave some doubt there.

Here’s the full fio output.

$ cd /srv/cache/fio && FIOTEST=lvmcache-24h fio ~/lvmcache-24h.fio
partial: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=8
...
fio-2.16
Starting 2 processes
partial: Laying out IO file(s) (1 file(s) / 2048MB)
partial: Laying out IO file(s) (1 file(s) / 2048MB)
Jobs: 2 (f=2): [r(2)] [100.0% done] [7152KB/0KB/0KB /s] [1788/0/0 iops] [eta 00m:00s]
partial: (groupid=0, jobs=1): err= 0: pid=14676: Sat Jul 22 21:34:12 2017
  read : io=278655MB, bw=3302.6KB/s, iops=825, runt=86400091msec
    slat (usec): min=3, max=326, avg=12.43, stdev= 6.97
    clat (usec): min=1, max=1524.1K, avg=9673.02, stdev=28748.45
     lat (usec): min=71, max=1525.7K, avg=9686.11, stdev=28748.87
    clat percentiles (usec):
     |  1.00th=[  103],  5.00th=[  106], 10.00th=[  111], 20.00th=[  116],
     | 30.00th=[  119], 40.00th=[  125], 50.00th=[  131], 60.00th=[  139],
     | 70.00th=[  155], 80.00th=[11456], 90.00th=[30336], 95.00th=[55552],
     | 99.00th=[134144], 99.50th=[177152], 99.90th=[305152], 99.95th=[374784],
     | 99.99th=[569344]
    lat (usec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
    lat (usec) : 100=0.03%, 250=77.16%, 500=0.22%, 750=0.12%, 1000=0.12%
    lat (msec) : 2=0.23%, 4=0.02%, 10=1.18%, 20=6.65%, 50=8.54%
    lat (msec) : 100=3.84%, 250=1.70%, 500=0.17%, 750=0.01%, 1000=0.01%
    lat (msec) : 2000=0.01%
  cpu          : usr=0.47%, sys=1.64%, ctx=70653446, majf=0, minf=17
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=71335660/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8
partial: (groupid=0, jobs=1): err= 0: pid=14677: Sat Jul 22 21:34:12 2017
  read : io=283280MB, bw=3357.4KB/s, iops=839, runt=86400074msec
    slat (usec): min=3, max=330, avg=12.44, stdev= 6.98
    clat (usec): min=2, max=1515.9K, avg=9514.83, stdev=28128.86
     lat (usec): min=71, max=1515.2K, avg=9527.92, stdev=28129.29
    clat percentiles (usec):
     |  1.00th=[  103],  5.00th=[  109], 10.00th=[  112], 20.00th=[  118],
     | 30.00th=[  123], 40.00th=[  126], 50.00th=[  133], 60.00th=[  141],
     | 70.00th=[  157], 80.00th=[11328], 90.00th=[29824], 95.00th=[55040],
     | 99.00th=[132096], 99.50th=[173056], 99.90th=[292864], 99.95th=[362496],
     | 99.99th=[544768]
    lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=0.03%
    lat (usec) : 250=77.29%, 500=0.23%, 750=0.11%, 1000=0.12%
    lat (msec) : 2=0.23%, 4=0.02%, 10=1.18%, 20=6.65%, 50=8.49%
    lat (msec) : 100=3.81%, 250=1.66%, 500=0.15%, 750=0.01%, 1000=0.01%
    lat (msec) : 2000=0.01%
  cpu          : usr=0.47%, sys=1.67%, ctx=71794214, majf=0, minf=15
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=72519640/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8
 
Run status group 0 (all jobs):
   READ: io=561935MB, aggrb=6659KB/s, minb=3302KB/s, maxb=3357KB/s, mint=86400074msec, maxt=86400091msec
 
Disk stats (read/write):
    dm-2: ios=143855123/29, merge=0/0, ticks=1380761492/1976, in_queue=1380772508, util=100.00%, aggrios=47953627/25157, aggrmerge=0/0, aggrticks=459927326/7329, aggrin_queue=459937080, aggrutil=100.00%
    dm-1: ios=314/70172, merge=0/0, ticks=40/15968, in_queue=16008, util=0.01%, aggrios=110839338/72760, aggrmerge=0/2691, aggrticks=15300392/17100, aggrin_queue=15315432, aggrutil=16.92%
  xvdc: ios=110839338/72760, merge=0/2691, ticks=15300392/17100, in_queue=15315432, util=16.92%
  dm-0: ios=110839024/5279, merge=0/0, ticks=15308540/1768, in_queue=15312588, util=16.93%
    dm-3: ios=33021544/20, merge=0/0, ticks=1364473400/4252, in_queue=1364482644, util=100.00%, aggrios=33021544/19, aggrmerge=0/1, aggrticks=1364468920/4076, aggrin_queue=1364476064, aggrutil=100.00%
  xvdd: ios=33021544/19, merge=0/1, ticks=1364468920/4076, in_queue=1364476064, util=100.00%

So, 1,664 average IOPS (825 + 839), 77.05% (110,839,338 / (71,335,660 + 72,519,640)*100) cache hit rate.

Not sure I can be bothered to run a multi-day test on this now!

bcache and lvmcache

Background ^

Over at BitFolk we offer both SSD-backed storage and HDD-backed archive storage. The SSDs we use are really nice and basically have removed all IO performance problems we have ever encountered in the past. I can’t begin to describe how pleasant it is to just never have to think about that whole class of problems.

The main downside of course is that SSD capacity is still really expensive. That’s why we introduced the HDD-backed archive storage: for bulk storage of things that didn’t need to have high performance.

You’d really think though that by now there would be some kind of commodity tiered storage that would allow a relatively small amount of fast SSD storage to accelerate a much larger pool of HDDs. Of course there are various enterprise solutions, and there is also ZFS where SSDs could be used for the ZIL and L2ARC while HDDs are used for the pool.

ZFS is quite appealing but I’m not ready to switch everything to that yet, and I’m certainly not going to buy some enterprise storage solution. I also don’t necessarily want to commit to putting all storage behind such a system.

I decided to explore Linux’s own block device caching solutions.

Scope ^

I’ve restricted the scope to the two solutions which are part of the mainline Linux kernel as of July 2017, these being bcache and lvmcache.

lvmcache is based upon dm-cache which has been included with the mainline kernel since April 2013. It’s quite conservative, and having been around for quite a while is considered stable. It has the advantage that it can work with any LVM logical volume no matter what the contents. That brings the disadvantage that you do need to run LVM.

bcache has been around for a little longer but is a much more ambitious project. Being completely dedicated to accelerating slow block devices with fast ones it is claimed to be able to achieve higher performance than other caching solutions, but as it’s much more complicated than dm-cache there are still bugs being found. Also it requires you format your block devices as bcache before you use them for anything.

Test environment ^

I’m testing this on a Debian testing (buster) Xen virtual machine with a 20GiB xvda virtual disk containing the main operating system. That disk is backed by a software (md) RAID-10 composed of two Samsung sm863 SSDs. It was also used for testing the baseline SSD performance from the directory /srv/ssd.

The virtual machine had 1GiB of memory but the pagecache was cleared between each test run in an attempt to prevent anything being cached in memory.

A 5GiB xvdc virtual disk was provided, backed again on the SSD RAID. This was used for the cache role both in bcache and lvmcache.

A 50GiB xvdd virtual disk was provided, backed by a pair of Seagate ST4000LM016-1N2170 HDDs in software RAID-1. This was used for the HDD backing store in each of the caching implementations. The resulting cache device was mounted at /srv/cache.

Finally a 50GiB xvde virtual disk also backed on HDD was used to test baseline HDD performance, mounted at /srv/slow.

The filesystem in use in all cases was ext4 with default options. In dom0, deadline scheduler was used in all cases.

TL;DR, I just want graphs ^

In case you can’t be bothered to read the rest of this article, here’s just the graphs with some attempt at interpreting them. Down at the tests section you’ll find details of the actual testing process and more commentary on why certain graphs were produced.

git test graphs ^

Times to git clone and git grep.

fio IOPS graphs ^

These are graphs of IOPS across the 30 minutes of testing. There’s two important things to note about these graphs:

  1. They’re a Bezier curve fitted to the data points which are one per second. The actual data points are all over the place, because achieved IOPS depends on how many cache hits/misses there were, which is statistical.
  2. Only the IOPS for the first job is graphed. Even when using the per_job_logs=0 setting my copy of fio writes a set of results for each job. I couldn’t work out how to easily combine these so I’ve shown only the results for the first job.

    For all tests except bcache (sequential_cutoff=0) you just have to bear in mind that there is a second job working in parallel doing pretty much the same amount of IOPS. Strangely for that second bcache test the second job only managed a fraction of the IOPS (though still more than 10k IOPS) and I don’t know why.

IOPS over time for all tests

Well, those results are so extreme that it kind of makes it hard to distinguish between the low-end results.

A couple of observations:

  • SSD is incredibly and consistently fast.
  • For everything else there is a short steep section at the beginning which is likely to be the effect of HDD drive cache.
  • With sequential_cutoff set to 0, bcache very quickly reaches near-SSD performance for this workload (4k reads, 90% hitting 10% of data that fits entirely in the bcache). This is probably because the initial write put data straight into cache as it’s set to writeback.
  • When starting with a completely empty cache, lvmcache is no slouch either. It’s not quite as performant as bcache but that is still up near the 48k IOPS per process region, and very predictable.
  • When sequential_cutoff is left at its default of 4M, bcache performs much worse though still blazing compared to an HDD on its own. At the end of this 30 minute test performance was still increasing so it might be worth performing a longer test
  • The performance of lvmcache when starting with a cache already full of junk data seems to be not that much better than HDD baseline.
IOPS over time for low-end results

Leaving the high-performers out to see if there is anything interesting going on near the bottom of the previous graph.

Apart from the initial spike, HDD results are flat as expected.

Although the lvmcache (full cache) results in the previous graph seemed flat too, looking closer we can see that performance is still increasing, just very slowly. It may be interesting to test for longer to see if performance does continue to increase.

Both HDD and lvmcache have a very similar spike at the start of the test so let’s look closer at that.

IOPS for first 30 seconds

For all the lower-end performers the first 19 seconds are steeper and I can only think this is the effect of HDD drive cache. Once that is filled, HDD remains basically flat, lvmcache (full cache) increases performance more slowly and bcache with the default sequential_cutoff starts to take off.

SSDs don’t have the same sort of cache and bcache with no sequential_cutoff spikes up too quickly to really be noticeable at this scale.

3-hour lvmcache test

Since it seemed like lvmcache with a full cache device was still slowly increasing in performance I did a 3-hour testing on that one.

Skipping the first 20 minutes which show stronger growth, even after 3 hours there is still some performance increase happening. It seems like even a full cache would eventually promote read hot spots, but it could take a very very long time.

Continue reading “bcache and lvmcache”

XFS, Reflinks and Deduplication

btrfs Past ^

This post is about XFS but it’s about features that first hit Linux in btrfs, so we need to talk about btrfs for a bit first.

For a long time now, btrfs has had a useful feature called reflinks. Basically this is exposed as cp --reflink=always and takes advantage of extents and copy-on-write in order to do a quick copy of data by merely adding another reference to the extents that the data is currently using, rather than having to read all the data and write it out again, as would be the case in other filesystems.

Here’s an excerpt from the man page for cp:

When –reflink[=always] is specified, perform a lightweight copy, where the data blocks are copied only when modified. If this is not possible the copy fails, or if –reflink=auto is specified, fall back to a standard copy.

Without reflinks a common technique for making a quick copy of a file is the hardlink. Hardlinks have a number of disadvantages though, mainly due to the fact that since there is only one inode all hardlinked copies must have the same metadata (owner, group, permissions, etc.). Software that might modify the files also needs to be aware of hardlinks: naive modification of a hardlinked file modifies all copies of the file.

With reflinks, life becomes much easier:

  • Each copy has its own inode so can have different metadata. Only the data extents are shared.
  • The filesystem ensures that any write causes a copy-on-write, so applications don’t need to do anything special.
  • Space is saved on a per-extent basis so changing one extent still allows all the other extents to remain shared. A change to a hardlinked file requires a new copy of the whole file.

Another feature that extents and copy-on-write allow is block-level out-of-band deduplication.

  • Deduplication – the technique of finding and removing duplicate copies of data.
  • Block-level – operating on the blocks of data on storage, not just whole files.
  • Out-of-band – something that happens only when triggered or scheduled, not automatically as part of the normal operation of the filesystem.

btrfs has an ioctl that a userspace program can use—presumably after finding a sequence of blocks that are identical—to tell the kernel to turn one into a reference to the other, thus saving some space.

It’s necessary that the kernel does it so that any IO that may be going on at the same time that may modify the data can be dealt with. Modifications after the data is reflinked will just case a copy-on-write. If you tried to do it all in a userspace app then you’d risk something else modifying the files at the same time, but by having the kernel do it then in theory it becomes completely safe to do it at any time. The kernel also checks that the sequence of extents really are identical.

In-band deduplication is a feature that’s being worked on in btrfs. It already exists in ZFS though, and there is it rarely recommended for use as it requires a huge amount of memory for keeping hashes of data that has been written. It’s going to be the same story with btrfs, so out-of-band deduplication is still something that will remain useful. And it exists as a feature right now, which is always a bonus.

XFS Future ^

So what has all this got to do with XFS?

Well, in recognition that there might be more than one Linux filesystem with extents and so that reflinks might be more generally useful, the extent-same ioctl got lifted up to be in the VFS layer of the kernel instead of just in btrfs. And the good news is that XFS recently became able to make use of it.

When I say “recently” I do mean really recently. I mean like kernel release 4.9.1 which came out on 2017-01-04. At the moment it comes with massive EXPERIMENTAL warnings, requires a new filesystem to be created with a special format option, and will need an xfsprogs compiled from recent git in order to have a mkfs.xfs that can create such a filesystem.

So before going further, I’m going to assume you’ve compiled a new enough kernel and booted into it, then compiled up a new enough xfsprogs. Both of these are quite simple things to do, for example the Debian documentation for building kernel packages from upstream code works fine.

XFS Reflink Demo ^

Make yourself a new filesystem, with the reflink=1 format option.

# mkfs.xfs -L reflinkdemo -m reflink=1 /dev/xvdc
meta-data=/dev/xvdc              isize=512    agcount=4, agsize=3276800 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=0, rmapbt=0, reflink=1
data     =                       bsize=4096   blocks=13107200, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=6400, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Put it in /etc/fstab for convenience, and mount it somewhere.

# echo "LABEL=reflinkdemo /mnt/xfs xfs relatime 0 2" >> /etc/fstab
# mkdir -vp /mnt/xfs
mkdir: created directory ‘/mnt/xfs’
# mount /mnt/xfs
# df -h /mnt/xfs
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvdc        50G  339M   50G   1% /mnt/xfs

Create a few files with random data.

# mkdir -vp /mnt/xfs/reflink
mkdir: created directory ‘/mnt/xfs/reflink’
# chown -c andy: /mnt/xfs/reflink
changed ownership of ‘/mnt/xfs/reflink’ from root:root to andy:andy
# exit
$ for i in {1..5}; do
> echo "Writing $i…"; dd if=/dev/urandom of=/mnt/xfs/reflink/$i bs=1M count=1024;
> done
Writing 1…
1024+0 records in 
1024+0 records out
1073741824 bytes (1.1 GB) copied, 4.34193 s, 247 MB/s
Writing 2…
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 4.33207 s, 248 MB/s
Writing 3…
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 4.33527 s, 248 MB/s
Writing 4…
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 4.33362 s, 248 MB/s
Writing 5…
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 4.32859 s, 248 MB/s
$ df -h /mnt/xfs
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvdc        50G  5.4G   45G  11% /mnt/xfs
$ du -csh /mnt/xfs
5.0G    /mnt/xfs
5.0G    total

Copy a file and as expected usage will go up by 1GiB. And it will take a little while, even on my nice fast SSDs.

$ time cp -v /mnt/xfs/reflink/{,copy_}1
‘/mnt/xfs/reflink/1’ -> ‘/mnt/xfs/reflink/copy_1’
 
real    0m3.420s
user    0m0.008s
sys     0m0.676s
$ df -h /mnt/xfs; du -csh /mnt/xfs/reflink
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvdc        50G  6.4G   44G  13% /mnt/xfs
6.0G    /mnt/xfs/reflink
6.0G    total

So what about a reflink copy?

$ time cp -v --reflink=always /mnt/xfs/reflink/{,reflink_}1
‘/mnt/xfs/reflink/1’ -> ‘/mnt/xfs/reflink/reflink_1’
 
real    0m0.003s
user    0m0.000s
sys     0m0.004s
$ df -h /mnt/xfs; du -csh /mnt/xfs/reflink
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvdc        50G  6.4G   44G  13% /mnt/xfs
7.0G    /mnt/xfs/reflink
7.0G    total

The apparent usage went up by 1GiB but the amount of free space as shown by df stayed the same. No more actual storage was used because the new copy is a reflink. And the copy got done in 4ms as opposed to 3,420ms.

Can we tell more about how these files are laid out? Yes, we can use the filefrag -v command to tell us more.

$ filefrag -v /mnt/xfs/reflink/{,copy_,reflink_}1
Filesystem type is: 58465342
File size of /mnt/xfs/reflink/1 is 1073741824 (262144 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..  262143:    1572884..   1835027: 262144:             last,shared,eof
/mnt/xfs/reflink/1: 1 extent found
File size of /mnt/xfs/reflink/copy_1 is 1073741824 (262144 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..  262143:     917508..   1179651: 262144:             last,eof
/mnt/xfs/reflink/copy_1: 1 extent found
File size of /mnt/xfs/reflink/reflink_1 is 1073741824 (262144 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..  262143:    1572884..   1835027: 262144:             last,shared,eof
/mnt/xfs/reflink/reflink_1: 1 extent found

What we can see here is that all three files are composed of a single extent which is 262,144 4KiB blocks in size, but it also tells us that /mnt/xfs/reflink/1 and /mnt/xfs/reflink/reflink_1 are using the same range of physical blocks: 1572884..1835027.

XFS Deduplication Demo ^

We’ve demonstrated that you can use cp --reflink=always to take a cheap copy of your data, but what about data that may already be duplicates without your knowledge? Is there any way to take advantage of the extent-same ioctl for deduplication?

There’s a couple of software solutions for out-of-band deduplication in btrfs, but one I know that works also in XFS is duperemove. You will need to use a git checkout of duperemove for this to work.

A quick reminder of the storage use before we start.

$ df -h /mnt/xfs; du -csh /mnt/xfs/reflink
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvdc        50G  6.4G   44G  13% /mnt/xfs
7.0G    /mnt/xfs/reflink
7.0G    total
$ filefrag -v /mnt/xfs/reflink/{,copy_,reflink_}1
Filesystem type is: 58465342
File size of /mnt/xfs/reflink/1 is 1073741824 (262144 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..  262143:    1572884..   1835027: 262144:             last,shared,eof
/mnt/xfs/reflink/1: 1 extent found
File size of /mnt/xfs/reflink/copy_1 is 1073741824 (262144 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..  262143:     917508..   1179651: 262144:             last,eof
/mnt/xfs/reflink/copy_1: 1 extent found
File size of /mnt/xfs/reflink/reflink_1 is 1073741824 (262144 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..  262143:    1572884..   1835027: 262144:             last,shared,eof
/mnt/xfs/reflink/reflink_1: 1 extent found

Run duperemove.

# duperemove -hdr --hashfile=/var/tmp/dr.hash /mnt/xfs/reflink
Using 128K blocks
Using hash: murmur3
Gathering file list...
Adding files from database for hashing.
Loading only duplicated hashes from hashfile.
Using 2 threads for dedupe phase
Kernel processed data (excludes target files): 4.0G
Comparison of extent info shows a net change in shared extents of: 1.0G
$ df -h /mnt/xfs; du -csh /mnt/xfs/reflink
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvdc        50G  5.4G   45G  11% /mnt/xfs
7.0G    /mnt/xfs/reflink
7.0G    total
$ filefrag -v /mnt/xfs/reflink/{,copy_,reflink_}1
Filesystem type is: 58465342
File size of /mnt/xfs/reflink/1 is 1073741824 (262144 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..  262143:    1572884..   1835027: 262144:             last,shared,eof
/mnt/xfs/reflink/1: 1 extent found
File size of /mnt/xfs/reflink/copy_1 is 1073741824 (262144 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..  262143:    1572884..   1835027: 262144:             last,shared,eof
/mnt/xfs/reflink/copy_1: 1 extent found
File size of /mnt/xfs/reflink/reflink_1 is 1073741824 (262144 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..  262143:    1572884..   1835027: 262144:             last,shared,eof
/mnt/xfs/reflink/reflink_1: 1 extent found

The output of du remained the same, but df says that there’s now 1GiB more free space, and filefrag confirms that what’s changed is that copy_1 now uses the same extents as 1 and reflink_1. The duplicate data in copy_1 that in theory we did not know was there, has been discovered and safely reference-linked to the extent from 1, saving us 1GiB of storage.

By the way, I told duperemove to use a hash file because otherwise it will keep that in RAM. For the sake of 7 files that won’t matter but it will if I have millions of files so it’s a habit I get into. It uses that hash file to avoid having to repeatedly re-hash files that haven’t changed.

All that has been demonstrated so far though is whole-file deduplication, as copy_1 was just a regular copy of 1. What about when a file is only partially composed of duplicate data? Well okay.

$ cat /mnt/xfs/reflink/{1,2} > /mnt/xfs/reflink/1_2
$ ls -lah /mnt/xfs/reflink/{1,2,1_2}
-rw-r--r-- 1 andy andy 1.0G Jan 10 15:41 /mnt/xfs/reflink/1
-rw-r--r-- 1 andy andy 2.0G Jan 10 16:55 /mnt/xfs/reflink/1_2
-rw-r--r-- 1 andy andy 1.0G Jan 10 15:41 /mnt/xfs/reflink/2
$ df -h /mnt/xfs; du -csh /mnt/xfs/reflink
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvdc        50G  7.4G   43G  15% /mnt/xfs
9.0G    /mnt/xfs/reflink
9.0G    total
$ filefrag -v /mnt/xfs/reflink/{1,2,1_2}
Filesystem type is: 58465342
File size of /mnt/xfs/reflink/1 is 1073741824 (262144 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..  262143:    1572884..   1835027: 262144:             last,shared,eof
/mnt/xfs/reflink/1: 1 extent found
File size of /mnt/xfs/reflink/2 is 1073741824 (262144 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..  262127:         20..    262147: 262128:            
   1:   262128..  262143:    2129908..   2129923:     16:     262148: last,eof
/mnt/xfs/reflink/2: 2 extents found
File size of /mnt/xfs/reflink/1_2 is 2147483648 (524288 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..  262127:     262164..    524291: 262128:            
   1:   262128..  524287:     655380..    917539: 262160:     524292: last,eof
/mnt/xfs/reflink/1_2: 2 extents found

I’ve concatenated 1 and 2 together into a file called 1_2 and as expected, usage goes up by 2GiB. filefrag confirms that the physical extents in 1_2 are new. We should be able to do better because this 1_2 file does not contain any new unique data.

$ duperemove -hdr --hashfile=/var/tmp/dr.hash /mnt/xfs/reflink
Using 128K blocks
Using hash: murmur3
Gathering file list...
Adding files from database for hashing.
Using 2 threads for file hashing phase
Kernel processed data (excludes target files): 4.0G
Comparison of extent info shows a net change in shared extents of: 3.0G
$ df -h /mnt/xfs; du -csh /mnt/xfs/reflink
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvdc        50G  5.4G   45G  11% /mnt/xfs
9.0G    /mnt/xfs/reflink
9.0G    total

We can. Apparent usage stays at 9GiB but real usage went back to 5.4GiB which is where we were before we created 1_2.

And the physical layout of the files?

$ filefrag -v /mnt/xfs/reflink/{1,2,1_2}
Filesystem type is: 58465342
File size of /mnt/xfs/reflink/1 is 1073741824 (262144 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..  262143:    1572884..   1835027: 262144:             last,shared,eof
/mnt/xfs/reflink/1: 1 extent found
File size of /mnt/xfs/reflink/2 is 1073741824 (262144 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..  262127:         20..    262147: 262128:             shared
   1:   262128..  262143:    2129908..   2129923:     16:     262148: last,shared,eof
/mnt/xfs/reflink/2: 2 extents found
File size of /mnt/xfs/reflink/1_2 is 2147483648 (524288 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..  262143:    1572884..   1835027: 262144:             shared
   1:   262144..  524271:         20..    262147: 262128:    1835028: shared
   2:   524272..  524287:    2129908..   2129923:     16:     262148: last,shared,eof
/mnt/xfs/reflink/1_2: 3 extents found

It shows that 1_2 is now made up from the same extents as 1 and 2 combined, as expected.

Less of the urandom ^

These synthetic demonstrations using a handful of 1GiB blobs of data from /dev/urandom are all very well, but what about something a little more like the real world?

Okay well let’s see what happens when I take ~30GiB of backup data created by rsnapshot on another host.

rsnapshot is a backup program which makes heavy use of hardlinks. It runs periodically and compares the previous backup data with the new. If they are identical then instead of storing an identical copy it makes a hardlink. This saves a lot of space but does have a lot of limitations as discussed previously.

This won’t be the best example because in some ways there is expected to be more duplication; this data is composed of multiple backups of the same file trees. But on the other hand there shouldn’t be as much because any truly identical files have already been hardlinked together by rsnapshot. But it is a convenient source of real-world data.

So, starting state:

(I deleted all the reflink files)

$ df -h /mnt/xfs; sudo du -csh /mnt/xfs/rsnapshot
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvdc        50G   30G   21G  59% /mnt/xfs
29G     /mnt/xfs/rsnapshot
29G     total

A small diversion about how rsnapshot lays out its backups may be useful here. They are stored like this:

  • rsnapshot_root / [iteration a] / [client foo] / [directory structure from client foo]
  • rsnapshot_root / [iteration a] / [client bar] / [directory structure from client bar]
  • rsnapshot_root / [iteration b] / [client foo] / [directory structure from client foo]
  • rsnapshot_root / [iteration b] / [client bar] / [directory structure from client bar]

The iterations are commonly things like daily.0, daily.1daily.6. As a consequence, the paths:

rsnapshot/daily.*/client_foo

would be backups only from host foo, and:

rsnapshot/daily.0/*

would be backups from all hosts but only the most recent daily sync.

Let’s first see what the savings would be like in looking for duplicates in just one client’s backups.

Here’s the backups I have in this blob of data. The names of the clients are completely made up, though they are real backups.

Client Size (MiB)
darbee 14,504
achorn 11,297
spader 2,612
reilly 2,276
chino 2,203
audun 2,184

So let’s try deduplicating all of the biggest one’s—darbee‘s—backups:

$ df -h /mnt/xfs
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvdc        50G   30G   21G  59% /mnt/xfs
# time duperemove -hdr --hashfile=/var/tmp/dr.hash /mnt/xfs/rsnapshot/*/darbee
Using 128K blocks
Using hash: murmur3
Gathering file list...
Kernel processed data (excludes target files): 8.8G
Comparison of extent info shows a net change in shared extents of: 6.8G
9.85user 78.70system 3:27.23elapsed 42%CPU (0avgtext+0avgdata 23384maxresident)k
50703656inputs+790184outputs (15major+20912minor)pagefaults 0swaps
$ df -h /mnt/xfs
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvdc        50G   25G   26G  50% /mnt/xfs

3m27s of run time, somewhere between 5 and 6.8GiB saved. That’s 35%!

Now to deduplicate the lot.

# time duperemove -hdr --hashfile=/var/tmp/dr.hash /mnt/xfs/rsnapshot
Using 128K blocks
Using hash: murmur3
Gathering file list...
Kernel processed data (excludes target files): 5.4G
Comparison of extent info shows a net change in shared extents of: 3.4G
29.12user 188.08system 5:02.31elapsed 71%CPU (0avgtext+0avgdata 34040maxresident)k
34978360inputs+572128outputs (18major+45094minor)pagefaults 0swaps
$ df -h /mnt/xfs
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvdc        50G   23G   28G  45% /mnt/xfs

5m02 used this time, and another 2–3.4G saved.

Since the actual deduplication does take some time (the kernel having to read the extents, mainly), and most of it was already done in the first pass, a full pass would more likely take the sum of the times, i.e. more like 8m29s.

Still, a total of about 7GiB was saved which is 23%.

It would be very interesting to try this on one of my much larger backup stores.

Why Not Just Use btrfs? ^

Using a filesystem that already has all of these features would certainly seem easier, but I personally don’t think btrfs is stable enough yet. I use it at home in a relatively unexciting setup (8 devices, raid1 for data and metadata, no compression or deduplication) and I wish I didn’t. I wouldn’t dream of using it in a production environment yet.

I’m on the btrfs mailing list and there are way too many posts regarding filesystems that give ENOSPC and become unavailable for writes, or systems that were unexpectedly powered off and when powered back on the btrfs filesystem is completely lost.

I expect the reflink feature in XFS to become non-experimental before btrfs is stable enough for production use.

ZFS? ^

ZFS is great. It doesn’t have out-of-band deduplication or reflinks though, and they don’t plan to any time soon.

Supermicro SATA DOM flash devices don’t report lifetime writes correctly

I’m playing around with a pair of Supermicro SATA DOM flash devices at the moment, evaluating them for use as the operating system storage for servers (as opposed to where customer data goes).

They’re flash devices with a limited write endurance. The smallest model (16GB), for example, is good for 17TB of writes. Therefore it’s important to know how much you’ve actually written to it.

Many SSDs and other flash devices expose the total amount written through the SMART attribute 241, Total_LBAs_Written. The SATA DOM devices do seem to expose this attribute, but right now they say this:

$ for dom in $(sudo lsblk --paths -d -o NAME,MODEL --noheadings |
    awk '/SATA SSD/ { print $1 }')
do
    echo -n "$dom: "
    sudo smartctl -A "$dom" |
      awk '/^241/ { print $10 * 512 * 1.0e-9, "GB" }'
done
/dev/sda: 0.00856934 GB
/dev/sdb: 0.00881715 GB

This being after install and (as of now) more than a week of uptime, ~9MB of lifetime writes isn’t credible.

Another place we can look for amount of bytes written is /proc/diskstats. The 10th column is the number of (512-byte) sectors written, so:

$ for dom in $(sudo lsblk -d -o NAME,MODEL --noheadings |
    awk '/SATA SSD/ { print $1 }')
do
     awk "/$dom / {
        print \$3, \$10 / 2 * 1.0e-6, \"GB\"
    }" /proc/diskstats
done
sda 3.93009 GB
sdb 3.93009 GB

Almost 4GB is a lot more believable, so can we just use /proc/diskstats? Well, the problem there is that those figures are only since boot. That won’t include, for example, all the data written during install.

Okay, so, are these figures even consistent? Let’s write 100MB and see what changes.

Since the figure provided by SMART attribute 241 apparently isn’t actually 512-byte blocks we’ll just print the raw value there.

Before:

$ for dom in $(sudo lsblk -d -o NAME,MODEL --noheadings |
    awk '/SATA SSD/ { print $1 }')
do
     awk "/$dom / {
        print \$3, \$10 / 2 * 1.0e-6, \"GB\"
    }" /proc/diskstats
done
sda 4.03076 GB
sdb 4.03076 GB
$ for dom in $(sudo lsblk --paths -d -o NAME,MODEL --noheadings |
  awk '/SATA SSD/ { print $1 }')
do
    echo -n "$dom: "
    sudo smartctl -A "$dom" |
      awk '/^241/ { print $10 }'
done
/dev/sda: 16835
/dev/sdb: 17318

Write 100MB:

$ dd if=/dev/urandom bs=1MB count=100 > /var/tmp/one_hundred_megabytes
100+0 records in
100+0 records out
100000000 bytes (100 MB) copied, 7.40454 s, 13.5 MB/s

(I used /dev/urandom just in case some compression might take place or something)

After:

$ for dom in $(sudo lsblk -d -o NAME,MODEL --noheadings |
    awk '/SATA SSD/ { print $1 }')
do
     awk "/$dom / {
        print \$3, \$10 / 2 * 1.0e-6, \"GB\"
    }" /proc/diskstats
done
sda 4.13046 GB
sdb 4.13046 GB
$ for dom in $(sudo lsblk --paths -d -o NAME,MODEL --noheadings |
  awk '/SATA SSD/ { print $1 }')
do
    echo -n "$dom: "
    sudo smartctl -A "$dom" |
      awk '/^241/ { print $10 }'
done
/dev/sda: 16932
/dev/sdb: 17416

Well, alright, all is apparently not lost: SMART attribute 241 went up by ~100 and diskstats agrees that ~100MB was written too, so it looks like it does actually report lifetime writes, but it’s reporting them as megabytes (109 bytes), not 512-byte sectors.

Every reference I can find says that Total_LBAs_Written is the number of 512-byte sectors, though, so in reporting units of 1MB I feel that these devices are doing the wrong thing.

Anyway, I’m a little alarmed that ~0.1% of the lifetime has gone already, although a lot of that would have been the install. I probably should take this opportunity to get rid of a lot of writes by tracking down logging of mundane garbage. Also this is the smallest model; the devices are rated for 1 DWPD so just over-provisioning by using a larger model than necessary will help.

Using a TOTP app for multi-factor SSH auth

I’ve been playing around with enabling multi-factor authentication (MFA) on web services and went with TOTP. It’s pretty simple to implement in Perl, and there are plenty of apps for it including Google Authenticator, 1Password and others.

I also wanted to use the same multi-factor auth for SSH logins. Happily, from Debian jessie onwards libpam-google-authenticator is packaged. To enable it for SSH you would just add the following:

auth required pam_google_authenticator.so

to /etc/pam.d/sshd (put it just after @include common-auth).

and ensure that:

ChallengeResponseAuthentication yes

is in /etc/ssh/sshd_config.

Not all my users will have MFA enabled though, so to skip prompting for these I use:

auth required pam_google_authenticator.so nullok

Finally, I only wanted users in a particular Unix group to be prompted for an MFA token so (assuming that group was totp) that would be:

auth [success=1 default=ignore] pam_succeed_if.so quiet user notingroup totp
auth required pam_google_authenticator.so nullok

If the pam_succeed_if conditions are met then the next line is skipped, so that causes pam_google_authenticator to be skipped for users not in the group totp.

Each user will require a TOTP secret key generating and storing. If you’re only setting this up for SSH then you can use the google-authenticator binary from the libpam-google-authenticator package. This asks you some simple questions and then populates the file $HOME/.google_authenticator with the key and some configuration options. That looks like:

T6Z2KSDCG7CEWPD6EPA6BICBFD4KYKCSGO2JEQVII7ZJNCXECRZPJ4GJHD3CWC43FZIKQUSV5LR2LFFP
" RATE_LIMIT 3 30 1462548404
" DISALLOW_REUSE 48751610
" TOTP_AUTH
11494760
25488108
33980423
43620625
84061586

The first line is the secret key; the five numbers are emergency codes that will always work (once each) if locked out.

If generating keys elsewhere then you can just populate this file yourself. If the file isn’t present then that’s when “nullok” applies; without “nullok” authentication would fail.

Note that despite the repeated mentions of “google” here, this is not a Google-specific service and no data is sent to Google. Google are the authors of the open source Google Authenticator mobile app and the libpam-google-authenticator PAM module, but (as evidenced by the Perl example) this is an open standard and client and server sides can be implemented in any language.

So that is how you can make a web service and an SSH service use the same TOTP multi-factor authentication.