lvmcache with a 4.12.3 kernel

Background ^

In the previous two articles I had discovered that lvmcache had amazing performance on an empty cache but then on every run after that (i.e. when the cache device was full of junk) performed scarcely better than baseline HDD.

A few days ago I happened across an email on the linux-lvm list where Mike Snitzer advised:

the [CentOS] 7.4 dm-cache will be much more performant than the 7.3 cache you appear to be using.

…and…

It could be that your workload isn’t accessing the data enough to warrant promotion to the cache. dm-cache is a “hotspot” cache. If you aren’t accessing the data repeatedly then you won’t see much benefit (particularly with the 7.3 and earlier releases).

Just to get a feel, you could try the latest upstream 4.12 kernel to see how effective the 7.4 dm-cache will be for your setup.

I don’t know what kernel version CentOS 7.3 uses, but the VM I’m testing with is Debian testing (buster), so some version of 4.11.x plus backported patches.

That seemed pretty new, but Mike is suggesting 4.12.x so I thought I’d re-test lvmcache with the latest stable upstream kernel, which at the time of writing is version 4.12.3.

Test methodology ^

This time around I only focused on fio tests, using the same settings as before:

[partial]
ioengine=libaio
direct=1
iodepth=8
numjobs=2
readwrite=randread
random_distribution=zipf:1.2
bs=4k
size=2g
unlink=1
runtime=20m
time_based=1
per_job_logs=1
log_avg_msec=500
write_iops_log=/var/tmp/fio-${FIOTEST}

The only changes were:

  1. to reduce the run time to 20 minutes from 30 minutes, since all the interesting things happened within the first 20 minutes last time.
  2. to write an IOPS log entry every 500ms instead of every 1000ms, as the log files were quite small and some higher resolution might help smooth the graphs out.

Last time there was a dramatic difference between the initial run with an empty cache and subsequent runs with a cache volume full of junk, so I did a test for each of those conditions, as well as tests for the baseline SSD and HDD.

The virtual machine had been upgraded from Debian 9 (stretch) to testing (buster), so it still had packaged kernel versions 4.9.30-2 and 4.11.6-1 lying around to test things with. In addition I compiled up version 4.12.3 by copying the .config from 4.11.6-1 then doing make oldconfig, accepting all defaults.
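
That process looks roughly like this. It’s only a sketch: the tarball location is the usual kernel.org one, and the path of the packaged kernel’s config (/boot/config-4.11.0-1-amd64 here) is an assumption for illustration.

$ wget https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.12.3.tar.xz
$ tar xf linux-4.12.3.tar.xz && cd linux-4.12.3
$ cp /boot/config-4.11.0-1-amd64 .config   # config from the packaged 4.11.6-1 kernel
$ yes '' | make oldconfig                  # accept defaults for all new options
$ make -j"$(nproc)" deb-pkg                # build installable .deb packages
$ sudo dpkg -i ../linux-image-4.12.3*.deb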

Results ^

Although the fio job spec was essentially the same as in the previous article, I have since worked out how to merge the IOPS logs from both jobs so the graphs will seem to show about double the IOPS as they did before.
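
For what it’s worth, one way to do that merge (not necessarily how it was done here; the log file names and the fixed column layout are assumptions) is to sum the IOPS column of the two per-job logs, assuming their timestamps line up:

$ # each log line is roughly "time_in_ms, iops, direction, blocksize"
$ paste -d, fio-test_iops.1.log fio-test_iops.2.log |
    awk -F, '{ printf "%d,%d\n", $1, $2 + $6 }' > fio-test_iops.merged.log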

All-in-one ^

Well that’s an interesting set of graphs but rather hard to distinguish. Let’s break them down by kernel version.

Baseline SSD by kernel version ^

A couple of weird things here:

  1. 4.12.3 and 4.11.6-1 are actually fairly consistent, but 4.9.30-2 varies rather a lot.
  2. All kernels show a sharp dip a few minutes in. I don’t know what that is about.

Although these lines do look quite far apart, bear in mind that this graph’s y axis starts at 92k IOPS. The average IOPS didn’t vary that much:

Average IOPS by kernel version
Kernel:    4.9.30-2    4.11.6-1     4.12.3
IOPS:       102,325     102,742    104,352

So there was actually only a 1.9% difference between the worst performer and the best.

Baseline HDD by kernel version ^

4.9.30-2 and 4.12.3 are close enough here to probably be within the margin of error, but there is something weird going on with 4.11.6-1.

Its average IOPS across the 20 minute test were only 56% of those for 4.12.3 and 53% of those for 4.9.30-2, which is quite a big disparity. I re-ran these tests 5 times to check it wasn’t some anomaly, but no, it’s reproducible.

Maybe something to look into another day.

lvmcache by kernel version ^

Dragging things back to the point of this article: previously we discovered that lvmcache worked great the first time through, when its cache volume was completely empty, but then subsequent runs all absolutely sucked. They didn’t perform significantly better than HDD baseline.

Let’s graph all the lvmcache results for each kernel version against the SSD baseline for that kernel to see if things changed at all.

lvmcache 4.9.30-2 ^

This is similar to what we saw before: an empty cache volume produces decent results of around 47k IOPS, although it’s interesting that the second job is down around 1k IOPS. Again the results on a full cache are poor. In fact the results for the second job on the empty cache are about the same as the results for both jobs on a full cache.

lvmcache 4.11.6-1 ^

Same story again here, although the performance is a little higher. Again the first job on an empty cache is getting the big results of almost 60k IOPS while the second job—and both jobs on a full cache—get only around 1k IOPS.

lvmcache 4.12.3 ^

Wow. Something dramatic has been fixed. The performance on an empty cache is still better, but crucially the performance on a full cache pretty quickly becomes very close to baseline SSD.

Also the runs against both the empty and full cache device result in both jobs getting roughly the same IOPS performance rather than the first job being great and all others very poor.

What’s next? ^

It’s really encouraging that the performance is so much better with 4.12.3. It’s changed lvmcache from a “hmm, maybe” option to one that I would strongly consider using anywhere I could.

It’s a shame though that such a new kernel is required. The kernel version in Debian testing (buster) is currently 4.11.6-1. Debian experimental’s linux-image-4.12.0-trunk-amd64 package currently has version 4.12.2-1, so I should test whether that is new enough.

Failing that I think I should git bisect or similar in order to find out exactly which changeset fixed this, so I could have some chance of knowing when it hits a packaged version.
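
If it comes to that, the process would look something like the following. This is only a sketch: marking v4.11 as the old behaviour and v4.12 as the new is an assumption based on the results above, and each step means building and booting the candidate kernel and re-running the fio job.

$ git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
$ cd linux
$ git bisect start --term-old=slow --term-new=fast
$ git bisect fast v4.12     # assumed to contain the improvement
$ git bisect slow v4.11     # assumed to behave like the older kernels
$ # ...build, boot, benchmark, then mark each suggested commit with
$ # "git bisect fast" or "git bisect slow" until the changeset is found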


12 hours of lvmcache

In the previous post I noted that the performance of lvmcache was still increasing and it might be worth testing it for longer than 3 hours.

Here’s a 12 hour test ^

$ cd /srv/cache/fio && FIOTEST=lvmcache-12h fio ~/lvmcache-12h.fio
partial: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=8
...
fio-2.16
Starting 2 processes
partial: Laying out IO file(s) (1 file(s) / 2048MB)
partial: Laying out IO file(s) (1 file(s) / 2048MB)
Jobs: 2 (f=2): [r(2)] [100.0% done] [6272KB/0KB/0KB /s] [1568/0/0 iops] [eta 00m:00s]
partial: (groupid=0, jobs=1): err= 0: pid=11130: Fri Jul 21 16:37:30 2017
  read : io=136145MB, bw=3227.2KB/s, iops=806, runt=43200062msec
    slat (usec): min=3, max=586402, avg=14.27, stdev=619.54
    clat (usec): min=2, max=1517.9K, avg=9897.80, stdev=29334.14
     lat (usec): min=71, max=1517.9K, avg=9912.72, stdev=29344.74
    clat percentiles (usec):
     |  1.00th=[  103],  5.00th=[  110], 10.00th=[  113], 20.00th=[  119],
     | 30.00th=[  124], 40.00th=[  129], 50.00th=[  133], 60.00th=[  143],
     | 70.00th=[  157], 80.00th=[11840], 90.00th=[30848], 95.00th=[56576],
     | 99.00th=[136192], 99.50th=[179200], 99.90th=[309248], 99.95th=[382976],
     | 99.99th=[577536]
    lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=0.03%
    lat (usec) : 250=76.84%, 500=0.26%, 750=0.13%, 1000=0.13%
    lat (msec) : 2=0.25%, 4=0.02%, 10=1.19%, 20=6.67%, 50=8.59%
    lat (msec) : 100=3.93%, 250=1.77%, 500=0.18%, 750=0.02%, 1000=0.01%
    lat (msec) : 2000=0.01%
  cpu          : usr=0.44%, sys=1.63%, ctx=34524570, majf=0, minf=17
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=34853153/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8
partial: (groupid=0, jobs=1): err= 0: pid=11131: Fri Jul 21 16:37:30 2017
  read : io=134521MB, bw=3188.7KB/s, iops=797, runt=43200050msec
    slat (usec): min=3, max=588479, avg=14.35, stdev=613.38
    clat (usec): min=2, max=1530.3K, avg=10017.42, stdev=29196.28
     lat (usec): min=70, max=1530.3K, avg=10032.43, stdev=29207.06
    clat percentiles (usec):
     |  1.00th=[  103],  5.00th=[  109], 10.00th=[  112], 20.00th=[  118],
     | 30.00th=[  124], 40.00th=[  127], 50.00th=[  133], 60.00th=[  143],
     | 70.00th=[  157], 80.00th=[12352], 90.00th=[31360], 95.00th=[57600],
     | 99.00th=[138240], 99.50th=[179200], 99.90th=[301056], 99.95th=[370688],
     | 99.99th=[561152]
    lat (usec) : 4=0.01%, 20=0.01%, 50=0.01%, 100=0.04%, 250=76.56%
    lat (usec) : 500=0.26%, 750=0.12%, 1000=0.13%
    lat (msec) : 2=0.26%, 4=0.02%, 10=1.20%, 20=6.75%, 50=8.65%
    lat (msec) : 100=4.01%, 250=1.82%, 500=0.17%, 750=0.01%, 1000=0.01%
    lat (msec) : 2000=0.01%
  cpu          : usr=0.45%, sys=1.60%, ctx=34118324, majf=0, minf=15
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=34437257/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8
 
Run status group 0 (all jobs):
   READ: io=270666MB, aggrb=6415KB/s, minb=3188KB/s, maxb=3227KB/s, mint=43200050msec, maxt=43200062msec
 
Disk stats (read/write):
    dm-2: ios=69290239/1883, merge=0/0, ticks=690078104/246240, in_queue=690329868, util=100.00%, aggrios=23098543/27863, aggrmerge=0/0, aggrticks=229728200/637782, aggrin_queue=230366965, aggrutil=100.00%
    dm-1: ios=247/64985, merge=0/0, ticks=36/15464, in_queue=15504, util=0.02%, aggrios=53025553/63449, aggrmerge=0/7939, aggrticks=7413340/14760, aggrin_queue=7427028, aggrutil=16.42%
  xvdc: ios=53025553/63449, merge=0/7939, ticks=7413340/14760, in_queue=7427028, util=16.42%
  dm-0: ios=53025306/6403, merge=0/0, ticks=7417028/1852, in_queue=7419784, util=16.42%
    dm-3: ios=16270078/12201, merge=0/0, ticks=681767536/1896032, in_queue=683665608, util=100.00%, aggrios=16270077/12200, aggrmerge=1/1, aggrticks=681637744/1813744, aggrin_queue=683453224, aggrutil=100.00%
  xvdd: ios=16270077/12200, merge=1/1, ticks=681637744/1813744, in_queue=683453224, util=100.00%

It’s still going up, slowly. The cache hit rate was 76.53%. In the 30 minute test the hit rate was 73.64%.
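
As a cross-check, dividing the xvdc reads by the dm-2 reads in the disk stats above gives the same 76.53% figure. dm-cache also exposes its hit/miss counters directly through device-mapper; a sketch, with a made-up logical volume name:

$ sudo dmsetup ls --target cache       # find the cached LV's device-mapper name
$ sudo dmsetup status cachevg-slowlv
$ # the cache target's status line includes, among other counters,
$ # "<read hits> <read misses> <write hits> <write misses>"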

Over 30 minutes the average IOPS was 1,484.

Over 12 hours the average IOPS was 1,603.

I was kind of hoping to reach the point where the hit rate is so high that it just takes off like bcache does and we start to see tens of thousands of IOPS, but it wasn’t to be.

24 hours of lvmcache ^

…so I went ahead and ran the same thing for 24 hours.

I’ve skipped the first 2 hours of results since we know what they look like. It appears to still be going up, although the results past 20 hours leave some doubt there.

Here’s the full fio output.

$ cd /srv/cache/fio && FIOTEST=lvmcache-24h fio ~/lvmcache-24h.fio
partial: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=8
...
fio-2.16
Starting 2 processes
partial: Laying out IO file(s) (1 file(s) / 2048MB)
partial: Laying out IO file(s) (1 file(s) / 2048MB)
Jobs: 2 (f=2): [r(2)] [100.0% done] [7152KB/0KB/0KB /s] [1788/0/0 iops] [eta 00m:00s]
partial: (groupid=0, jobs=1): err= 0: pid=14676: Sat Jul 22 21:34:12 2017
  read : io=278655MB, bw=3302.6KB/s, iops=825, runt=86400091msec
    slat (usec): min=3, max=326, avg=12.43, stdev= 6.97
    clat (usec): min=1, max=1524.1K, avg=9673.02, stdev=28748.45
     lat (usec): min=71, max=1525.7K, avg=9686.11, stdev=28748.87
    clat percentiles (usec):
     |  1.00th=[  103],  5.00th=[  106], 10.00th=[  111], 20.00th=[  116],
     | 30.00th=[  119], 40.00th=[  125], 50.00th=[  131], 60.00th=[  139],
     | 70.00th=[  155], 80.00th=[11456], 90.00th=[30336], 95.00th=[55552],
     | 99.00th=[134144], 99.50th=[177152], 99.90th=[305152], 99.95th=[374784],
     | 99.99th=[569344]
    lat (usec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
    lat (usec) : 100=0.03%, 250=77.16%, 500=0.22%, 750=0.12%, 1000=0.12%
    lat (msec) : 2=0.23%, 4=0.02%, 10=1.18%, 20=6.65%, 50=8.54%
    lat (msec) : 100=3.84%, 250=1.70%, 500=0.17%, 750=0.01%, 1000=0.01%
    lat (msec) : 2000=0.01%
  cpu          : usr=0.47%, sys=1.64%, ctx=70653446, majf=0, minf=17
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=71335660/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8
partial: (groupid=0, jobs=1): err= 0: pid=14677: Sat Jul 22 21:34:12 2017
  read : io=283280MB, bw=3357.4KB/s, iops=839, runt=86400074msec
    slat (usec): min=3, max=330, avg=12.44, stdev= 6.98
    clat (usec): min=2, max=1515.9K, avg=9514.83, stdev=28128.86
     lat (usec): min=71, max=1515.2K, avg=9527.92, stdev=28129.29
    clat percentiles (usec):
     |  1.00th=[  103],  5.00th=[  109], 10.00th=[  112], 20.00th=[  118],
     | 30.00th=[  123], 40.00th=[  126], 50.00th=[  133], 60.00th=[  141],
     | 70.00th=[  157], 80.00th=[11328], 90.00th=[29824], 95.00th=[55040],
     | 99.00th=[132096], 99.50th=[173056], 99.90th=[292864], 99.95th=[362496],
     | 99.99th=[544768]
    lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=0.03%
    lat (usec) : 250=77.29%, 500=0.23%, 750=0.11%, 1000=0.12%
    lat (msec) : 2=0.23%, 4=0.02%, 10=1.18%, 20=6.65%, 50=8.49%
    lat (msec) : 100=3.81%, 250=1.66%, 500=0.15%, 750=0.01%, 1000=0.01%
    lat (msec) : 2000=0.01%
  cpu          : usr=0.47%, sys=1.67%, ctx=71794214, majf=0, minf=15
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=72519640/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=8
 
Run status group 0 (all jobs):
   READ: io=561935MB, aggrb=6659KB/s, minb=3302KB/s, maxb=3357KB/s, mint=86400074msec, maxt=86400091msec
 
Disk stats (read/write):
    dm-2: ios=143855123/29, merge=0/0, ticks=1380761492/1976, in_queue=1380772508, util=100.00%, aggrios=47953627/25157, aggrmerge=0/0, aggrticks=459927326/7329, aggrin_queue=459937080, aggrutil=100.00%
    dm-1: ios=314/70172, merge=0/0, ticks=40/15968, in_queue=16008, util=0.01%, aggrios=110839338/72760, aggrmerge=0/2691, aggrticks=15300392/17100, aggrin_queue=15315432, aggrutil=16.92%
  xvdc: ios=110839338/72760, merge=0/2691, ticks=15300392/17100, in_queue=15315432, util=16.92%
  dm-0: ios=110839024/5279, merge=0/0, ticks=15308540/1768, in_queue=15312588, util=16.93%
    dm-3: ios=33021544/20, merge=0/0, ticks=1364473400/4252, in_queue=1364482644, util=100.00%, aggrios=33021544/19, aggrmerge=0/1, aggrticks=1364468920/4076, aggrin_queue=1364476064, aggrutil=100.00%
  xvdd: ios=33021544/19, merge=0/1, ticks=1364468920/4076, in_queue=1364476064, util=100.00%

So, 1,664 average IOPS (825 + 839), 77.05% (110,839,338 / (71,335,660 + 72,519,640)*100) cache hit rate.

Not sure I can be bothered to run a multi-day test on this now!

bcache and lvmcache

Background ^

Over at BitFolk we offer both SSD-backed storage and HDD-backed archive storage. The SSDs we use are really nice and basically have removed all IO performance problems we have ever encountered in the past. I can’t begin to describe how pleasant it is to just never have to think about that whole class of problems.

The main downside of course is that SSD capacity is still really expensive. That’s why we introduced the HDD-backed archive storage: for bulk storage of things that didn’t need to have high performance.

You’d really think though that by now there would be some kind of commodity tiered storage that would allow a relatively small amount of fast SSD storage to accelerate a much larger pool of HDDs. Of course there are various enterprise solutions, and there is also ZFS where SSDs could be used for the ZIL and L2ARC while HDDs are used for the pool.

ZFS is quite appealing but I’m not ready to switch everything to that yet, and I’m certainly not going to buy some enterprise storage solution. I also don’t necessarily want to commit to putting all storage behind such a system.

I decided to explore Linux’s own block device caching solutions.

Scope ^

I’ve restricted the scope to the two solutions which are part of the mainline Linux kernel as of July 2017, these being bcache and lvmcache.

lvmcache is based upon dm-cache which has been included with the mainline kernel since April 2013. It’s quite conservative, and having been around for quite a while is considered stable. It has the advantage that it can work with any LVM logical volume no matter what the contents. That brings the disadvantage that you do need to run LVM.

bcache has been around for a little longer but is a much more ambitious project. Being completely dedicated to accelerating slow block devices with fast ones it is claimed to be able to achieve higher performance than other caching solutions, but as it’s much more complicated than dm-cache there are still bugs being found. Also it requires you to format your block devices as bcache before you use them for anything.
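
Setting it up looks roughly like this (a sketch using the device names from the test environment below, not necessarily the exact invocation used for these tests):

# make-bcache -C /dev/xvdc -B /dev/xvdd   # format cache and backing devices; they attach to each other
# mkfs.ext4 /dev/bcache0                  # the cached device appears as /dev/bcache0
# mount /dev/bcache0 /srv/cache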

Test environment ^

I’m testing this on a Debian testing (buster) Xen virtual machine with a 20GiB xvda virtual disk containing the main operating system. That disk is backed by a software (md) RAID-10 composed of two Samsung sm863 SSDs. It was also used for testing the baseline SSD performance from the directory /srv/ssd.

The virtual machine had 1GiB of memory but the pagecache was cleared between each test run in an attempt to prevent anything being cached in memory.
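
For reference, dropping the page cache between runs is typically done with the drop_caches sysctl, something like:

# sync
# echo 3 > /proc/sys/vm/drop_caches   # drop page cache, dentries and inodes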

A 5GiB xvdc virtual disk was provided, backed again on the SSD RAID. This was used for the cache role both in bcache and lvmcache.

A 50GiB xvdd virtual disk was provided, backed by a pair of Seagate ST4000LM016-1N2170 HDDs in software RAID-1. This was used for the HDD backing store in each of the caching implementations. The resulting cache device was mounted at /srv/cache.
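
An lvmcache stack over devices like these is usually assembled along the following lines (illustrative volume group and LV names and sizes, not necessarily the exact commands used here):

# pvcreate /dev/xvdc /dev/xvdd
# vgcreate cachevg /dev/xvdc /dev/xvdd
# lvcreate -n slowlv -L 45G cachevg /dev/xvdd                      # HDD-backed data LV
# lvcreate --type cache-pool -n fastpool -L 4G cachevg /dev/xvdc   # SSD-backed cache pool
# lvconvert --type cache --cachepool cachevg/fastpool cachevg/slowlv
# mkfs.ext4 /dev/cachevg/slowlv && mount /dev/cachevg/slowlv /srv/cache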

Finally a 50GiB xvde virtual disk also backed on HDD was used to test baseline HDD performance, mounted at /srv/slow.

The filesystem in use in all cases was ext4 with default options. In the dom0, the deadline IO scheduler was used in all cases.

TL;DR, I just want graphs ^

In case you can’t be bothered to read the rest of this article, here’s just the graphs with some attempt at interpreting them. Down at the tests section you’ll find details of the actual testing process and more commentary on why certain graphs were produced.

git test graphs ^

Times to git clone and git grep.

fio IOPS graphs ^

These are graphs of IOPS across the 30 minutes of testing. There’s two important things to note about these graphs:

  1. They’re a Bezier curve fitted to the data points which are one per second. The actual data points are all over the place, because achieved IOPS depends on how many cache hits/misses there were, which is statistical.
  2. Only the IOPS for the first job is graphed. Even when using the per_job_logs=0 setting my copy of fio writes a set of results for each job. I couldn’t work out how to easily combine these so I’ve shown only the results for the first job.

    For all tests except bcache (sequential_cutoff=0) you just have to bear in mind that there is a second job working in parallel doing pretty much the same amount of IOPS. Strangely for that second bcache test the second job only managed a fraction of the IOPS (though still more than 10k IOPS) and I don’t know why.

IOPS over time for all tests

Well, those results are so extreme that it kind of makes it hard to distinguish between the low-end results.

A couple of observations:

  • SSD is incredibly and consistently fast.
  • For everything else there is a short steep section at the beginning which is likely to be the effect of HDD drive cache.
  • With sequential_cutoff set to 0, bcache very quickly reaches near-SSD performance for this workload (4k reads, 90% hitting 10% of data that fits entirely in the bcache). This is probably because the initial write put data straight into cache as it’s set to writeback. (See the sysfs sketch after this list for how those settings are made.)
  • When starting with a completely empty cache, lvmcache is no slouch either. It’s not quite as performant as bcache but that is still up near the 48k IOPS per process region, and very predictable.
  • When sequential_cutoff is left at its default of 4M, bcache performs much worse though still blazing compared to an HDD on its own. At the end of this 30 minute test performance was still increasing, so it might be worth performing a longer test.
  • The performance of lvmcache when starting with a cache already full of junk data seems to be not that much better than HDD baseline.
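
The bcache settings mentioned above are sysfs attributes of the cached device; changing them looks something like this (bcache0 being the device, root required):

# echo 0 > /sys/block/bcache0/bcache/sequential_cutoff    # cache all IO regardless of sequentiality
# echo writeback > /sys/block/bcache0/bcache/cache_mode   # default is writethrough
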
IOPS over time for low-end results

Leaving the high-performers out to see if there is anything interesting going on near the bottom of the previous graph.

Apart from the initial spike, HDD results are flat as expected.

Although the lvmcache (full cache) results in the previous graph seemed flat too, looking closer we can see that performance is still increasing, just very slowly. It may be interesting to test for longer to see if performance does continue to increase.

Both HDD and lvmcache have a very similar spike at the start of the test so let’s look closer at that.

IOPS for first 30 seconds

For all the lower-end performers the first 19 seconds are steeper and I can only think this is the effect of HDD drive cache. Once that is filled, HDD remains basically flat, lvmcache (full cache) increases performance more slowly and bcache with the default sequential_cutoff starts to take off.

SSDs don’t have the same sort of cache and bcache with no sequential_cutoff spikes up too quickly to really be noticeable at this scale.

3-hour lvmcache test

Since it seemed like lvmcache with a full cache device was still slowly increasing in performance I did a 3-hour test on that one.

Skipping the first 20 minutes which show stronger growth, even after 3 hours there is still some performance increase happening. It seems like even a full cache would eventually promote read hot spots, but it could take a very very long time.


XFS, Reflinks and Deduplication

btrfs Past ^

This post is about XFS but it’s about features that first hit Linux in btrfs, so we need to talk about btrfs for a bit first.

For a long time now, btrfs has had a useful feature called reflinks. Basically this is exposed as cp --reflink=always and takes advantage of extents and copy-on-write in order to do a quick copy of data by merely adding another reference to the extents that the data is currently using, rather than having to read all the data and write it out again, as would be the case in other filesystems.

Here’s an excerpt from the man page for cp:

When --reflink[=always] is specified, perform a lightweight copy, where the data blocks are copied only when modified. If this is not possible the copy fails, or if --reflink=auto is specified, fall back to a standard copy.

Without reflinks a common technique for making a quick copy of a file is the hardlink. Hardlinks have a number of disadvantages though, mainly due to the fact that since there is only one inode all hardlinked copies must have the same metadata (owner, group, permissions, etc.). Software that might modify the files also needs to be aware of hardlinks: naive modification of a hardlinked file modifies all copies of the file.

With reflinks, life becomes much easier:

  • Each copy has its own inode so can have different metadata. Only the data extents are shared.
  • The filesystem ensures that any write causes a copy-on-write, so applications don’t need to do anything special.
  • Space is saved on a per-extent basis so changing one extent still allows all the other extents to remain shared. A change to a hardlinked file requires a new copy of the whole file.

Another feature that extents and copy-on-write allow is block-level out-of-band deduplication.

  • Deduplication – the technique of finding and removing duplicate copies of data.
  • Block-level – operating on the blocks of data on storage, not just whole files.
  • Out-of-band – something that happens only when triggered or scheduled, not automatically as part of the normal operation of the filesystem.

btrfs has an ioctl that a userspace program can use—presumably after finding a sequence of blocks that are identical—to tell the kernel to turn one into a reference to the other, thus saving some space.

It’s necessary that the kernel does it so that any IO that may be going on at the same time that may modify the data can be dealt with. Modifications after the data is reflinked will just cause a copy-on-write. If you tried to do it all in a userspace app then you’d risk something else modifying the files at the same time, but by having the kernel do it then in theory it becomes completely safe to do at any time. The kernel also checks that the sequences of extents really are identical.
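
If you want to poke at this by hand, the xfs_io utility from xfsprogs exposes the (VFS-level) ioctl as a dedupe command. A sketch only, with made-up file names and length:

$ # dedupe <src file> <src offset> <dst offset> <length>, run against the destination file
$ xfs_io -c "dedupe srcfile 0 0 1048576" dstfile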

In-band deduplication is a feature that’s being worked on in btrfs. It already exists in ZFS though, and there it is rarely recommended for use as it requires a huge amount of memory for keeping hashes of data that has been written. It’s going to be the same story with btrfs, so out-of-band deduplication is still something that will remain useful. And it exists as a feature right now, which is always a bonus.

XFS Future ^

So what has all this got to do with XFS?

Well, in recognition that there might be more than one Linux filesystem with extents and so that reflinks might be more generally useful, the extent-same ioctl got lifted up to be in the VFS layer of the kernel instead of just in btrfs. And the good news is that XFS recently became able to make use of it.

When I say “recently” I do mean really recently. I mean like kernel release 4.9.1 which came out on 2017-01-04. At the moment it comes with massive EXPERIMENTAL warnings, requires a new filesystem to be created with a special format option, and will need an xfsprogs compiled from recent git in order to have a mkfs.xfs that can create such a filesystem.

So before going further, I’m going to assume you’ve compiled a new enough kernel and booted into it, then compiled up a new enough xfsprogs. Both of these are quite simple things to do, for example the Debian documentation for building kernel packages from upstream code works fine.

XFS Reflink Demo ^

Make yourself a new filesystem, with the reflink=1 format option.

# mkfs.xfs -L reflinkdemo -m reflink=1 /dev/xvdc
meta-data=/dev/xvdc              isize=512    agcount=4, agsize=3276800 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=0, rmapbt=0, reflink=1
data     =                       bsize=4096   blocks=13107200, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=6400, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Put it in /etc/fstab for convenience, and mount it somewhere.

# echo "LABEL=reflinkdemo /mnt/xfs xfs relatime 0 2" >> /etc/fstab
# mkdir -vp /mnt/xfs
mkdir: created directory ‘/mnt/xfs’
# mount /mnt/xfs
# df -h /mnt/xfs
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvdc        50G  339M   50G   1% /mnt/xfs

Create a few files with random data.

# mkdir -vp /mnt/xfs/reflink
mkdir: created directory ‘/mnt/xfs/reflink’
# chown -c andy: /mnt/xfs/reflink
changed ownership of ‘/mnt/xfs/reflink’ from root:root to andy:andy
# exit
$ for i in {1..5}; do
> echo "Writing $i…"; dd if=/dev/urandom of=/mnt/xfs/reflink/$i bs=1M count=1024;
> done
Writing 1…
1024+0 records in 
1024+0 records out
1073741824 bytes (1.1 GB) copied, 4.34193 s, 247 MB/s
Writing 2…
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 4.33207 s, 248 MB/s
Writing 3…
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 4.33527 s, 248 MB/s
Writing 4…
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 4.33362 s, 248 MB/s
Writing 5…
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 4.32859 s, 248 MB/s
$ df -h /mnt/xfs
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvdc        50G  5.4G   45G  11% /mnt/xfs
$ du -csh /mnt/xfs
5.0G    /mnt/xfs
5.0G    total

Copy a file and as expected usage will go up by 1GiB. And it will take a little while, even on my nice fast SSDs.

$ time cp -v /mnt/xfs/reflink/{,copy_}1
‘/mnt/xfs/reflink/1’ -> ‘/mnt/xfs/reflink/copy_1’
 
real    0m3.420s
user    0m0.008s
sys     0m0.676s
$ df -h /mnt/xfs; du -csh /mnt/xfs/reflink
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvdc        50G  6.4G   44G  13% /mnt/xfs
6.0G    /mnt/xfs/reflink
6.0G    total

So what about a reflink copy?

$ time cp -v --reflink=always /mnt/xfs/reflink/{,reflink_}1
‘/mnt/xfs/reflink/1’ -> ‘/mnt/xfs/reflink/reflink_1’
 
real    0m0.003s
user    0m0.000s
sys     0m0.004s
$ df -h /mnt/xfs; du -csh /mnt/xfs/reflink
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvdc        50G  6.4G   44G  13% /mnt/xfs
7.0G    /mnt/xfs/reflink
7.0G    total

The apparent usage went up by 1GiB but the amount of free space as shown by df stayed the same. No more actual storage was used because the new copy is a reflink. And the copy got done in about 3ms as opposed to 3,420ms.

Can we tell more about how these files are laid out? Yes, we can use the filefrag -v command to tell us more.

$ filefrag -v /mnt/xfs/reflink/{,copy_,reflink_}1
Filesystem type is: 58465342
File size of /mnt/xfs/reflink/1 is 1073741824 (262144 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..  262143:    1572884..   1835027: 262144:             last,shared,eof
/mnt/xfs/reflink/1: 1 extent found
File size of /mnt/xfs/reflink/copy_1 is 1073741824 (262144 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..  262143:     917508..   1179651: 262144:             last,eof
/mnt/xfs/reflink/copy_1: 1 extent found
File size of /mnt/xfs/reflink/reflink_1 is 1073741824 (262144 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..  262143:    1572884..   1835027: 262144:             last,shared,eof
/mnt/xfs/reflink/reflink_1: 1 extent found

What we can see here is that all three files are composed of a single extent which is 262,144 4KiB blocks in size, but it also tells us that /mnt/xfs/reflink/1 and /mnt/xfs/reflink/reflink_1 are using the same range of physical blocks: 1572884..1835027.

XFS Deduplication Demo ^

We’ve demonstrated that you can use cp --reflink=always to take a cheap copy of your data, but what about data that may already be duplicates without your knowledge? Is there any way to take advantage of the extent-same ioctl for deduplication?

There’s a couple of software solutions for out-of-band deduplication in btrfs, but one I know that works also in XFS is duperemove. You will need to use a git checkout of duperemove for this to work.

A quick reminder of the storage use before we start.

$ df -h /mnt/xfs; du -csh /mnt/xfs/reflink
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvdc        50G  6.4G   44G  13% /mnt/xfs
7.0G    /mnt/xfs/reflink
7.0G    total
$ filefrag -v /mnt/xfs/reflink/{,copy_,reflink_}1
Filesystem type is: 58465342
File size of /mnt/xfs/reflink/1 is 1073741824 (262144 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..  262143:    1572884..   1835027: 262144:             last,shared,eof
/mnt/xfs/reflink/1: 1 extent found
File size of /mnt/xfs/reflink/copy_1 is 1073741824 (262144 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..  262143:     917508..   1179651: 262144:             last,eof
/mnt/xfs/reflink/copy_1: 1 extent found
File size of /mnt/xfs/reflink/reflink_1 is 1073741824 (262144 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..  262143:    1572884..   1835027: 262144:             last,shared,eof
/mnt/xfs/reflink/reflink_1: 1 extent found

Run duperemove.

# duperemove -hdr --hashfile=/var/tmp/dr.hash /mnt/xfs/reflink
Using 128K blocks
Using hash: murmur3
Gathering file list...
Adding files from database for hashing.
Loading only duplicated hashes from hashfile.
Using 2 threads for dedupe phase
Kernel processed data (excludes target files): 4.0G
Comparison of extent info shows a net change in shared extents of: 1.0G
$ df -h /mnt/xfs; du -csh /mnt/xfs/reflink
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvdc        50G  5.4G   45G  11% /mnt/xfs
7.0G    /mnt/xfs/reflink
7.0G    total
$ filefrag -v /mnt/xfs/reflink/{,copy_,reflink_}1
Filesystem type is: 58465342
File size of /mnt/xfs/reflink/1 is 1073741824 (262144 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..  262143:    1572884..   1835027: 262144:             last,shared,eof
/mnt/xfs/reflink/1: 1 extent found
File size of /mnt/xfs/reflink/copy_1 is 1073741824 (262144 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..  262143:    1572884..   1835027: 262144:             last,shared,eof
/mnt/xfs/reflink/copy_1: 1 extent found
File size of /mnt/xfs/reflink/reflink_1 is 1073741824 (262144 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..  262143:    1572884..   1835027: 262144:             last,shared,eof
/mnt/xfs/reflink/reflink_1: 1 extent found

The output of du remained the same, but df says that there’s now 1GiB more free space, and filefrag confirms that what’s changed is that copy_1 now uses the same extents as 1 and reflink_1. The duplicate data in copy_1 that in theory we did not know was there, has been discovered and safely reference-linked to the extent from 1, saving us 1GiB of storage.

By the way, I told duperemove to use a hash file because otherwise it will keep its hashes in RAM. For the sake of 7 files that won’t matter, but it will if I have millions of files, so it’s a habit I’ve got into. It uses that hash file to avoid having to repeatedly re-hash files that haven’t changed.

All that has been demonstrated so far though is whole-file deduplication, as copy_1 was just a regular copy of 1. What about when a file is only partially composed of duplicate data? Well okay.

$ cat /mnt/xfs/reflink/{1,2} > /mnt/xfs/reflink/1_2
$ ls -lah /mnt/xfs/reflink/{1,2,1_2}
-rw-r--r-- 1 andy andy 1.0G Jan 10 15:41 /mnt/xfs/reflink/1
-rw-r--r-- 1 andy andy 2.0G Jan 10 16:55 /mnt/xfs/reflink/1_2
-rw-r--r-- 1 andy andy 1.0G Jan 10 15:41 /mnt/xfs/reflink/2
$ df -h /mnt/xfs; du -csh /mnt/xfs/reflink
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvdc        50G  7.4G   43G  15% /mnt/xfs
9.0G    /mnt/xfs/reflink
9.0G    total
$ filefrag -v /mnt/xfs/reflink/{1,2,1_2}
Filesystem type is: 58465342
File size of /mnt/xfs/reflink/1 is 1073741824 (262144 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..  262143:    1572884..   1835027: 262144:             last,shared,eof
/mnt/xfs/reflink/1: 1 extent found
File size of /mnt/xfs/reflink/2 is 1073741824 (262144 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..  262127:         20..    262147: 262128:            
   1:   262128..  262143:    2129908..   2129923:     16:     262148: last,eof
/mnt/xfs/reflink/2: 2 extents found
File size of /mnt/xfs/reflink/1_2 is 2147483648 (524288 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..  262127:     262164..    524291: 262128:            
   1:   262128..  524287:     655380..    917539: 262160:     524292: last,eof
/mnt/xfs/reflink/1_2: 2 extents found

I’ve concatenated 1 and 2 together into a file called 1_2 and as expected, usage goes up by 2GiB. filefrag confirms that the physical extents in 1_2 are new. We should be able to do better because this 1_2 file does not contain any new unique data.

$ duperemove -hdr --hashfile=/var/tmp/dr.hash /mnt/xfs/reflink
Using 128K blocks
Using hash: murmur3
Gathering file list...
Adding files from database for hashing.
Using 2 threads for file hashing phase
Kernel processed data (excludes target files): 4.0G
Comparison of extent info shows a net change in shared extents of: 3.0G
$ df -h /mnt/xfs; du -csh /mnt/xfs/reflink
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvdc        50G  5.4G   45G  11% /mnt/xfs
9.0G    /mnt/xfs/reflink
9.0G    total

We can. Apparent usage stays at 9GiB but real usage went back to 5.4GiB which is where we were before we created 1_2.

And the physical layout of the files?

$ filefrag -v /mnt/xfs/reflink/{1,2,1_2}
Filesystem type is: 58465342
File size of /mnt/xfs/reflink/1 is 1073741824 (262144 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..  262143:    1572884..   1835027: 262144:             last,shared,eof
/mnt/xfs/reflink/1: 1 extent found
File size of /mnt/xfs/reflink/2 is 1073741824 (262144 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..  262127:         20..    262147: 262128:             shared
   1:   262128..  262143:    2129908..   2129923:     16:     262148: last,shared,eof
/mnt/xfs/reflink/2: 2 extents found
File size of /mnt/xfs/reflink/1_2 is 2147483648 (524288 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..  262143:    1572884..   1835027: 262144:             shared
   1:   262144..  524271:         20..    262147: 262128:    1835028: shared
   2:   524272..  524287:    2129908..   2129923:     16:     262148: last,shared,eof
/mnt/xfs/reflink/1_2: 3 extents found

It shows that 1_2 is now made up from the same extents as 1 and 2 combined, as expected.

Less of the urandom ^

These synthetic demonstrations using a handful of 1GiB blobs of data from /dev/urandom are all very well, but what about something a little more like the real world?

Okay well let’s see what happens when I take ~30GiB of backup data created by rsnapshot on another host.

rsnapshot is a backup program which makes heavy use of hardlinks. It runs periodically and compares the previous backup data with the new. If they are identical then instead of storing an identical copy it makes a hardlink. This saves a lot of space but does have a lot of limitations as discussed previously.

This won’t be the best example because in some ways there is expected to be more duplication; this data is composed of multiple backups of the same file trees. But on the other hand there shouldn’t be as much because any truly identical files have already been hardlinked together by rsnapshot. But it is a convenient source of real-world data.

So, starting state:

(I deleted all the reflink files)

$ df -h /mnt/xfs; sudo du -csh /mnt/xfs/rsnapshot
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvdc        50G   30G   21G  59% /mnt/xfs
29G     /mnt/xfs/rsnapshot
29G     total

A small diversion about how rsnapshot lays out its backups may be useful here. They are stored like this:

  • rsnapshot_root / [iteration a] / [client foo] / [directory structure from client foo]
  • rsnapshot_root / [iteration a] / [client bar] / [directory structure from client bar]
  • rsnapshot_root / [iteration b] / [client foo] / [directory structure from client foo]
  • rsnapshot_root / [iteration b] / [client bar] / [directory structure from client bar]

The iterations are commonly things like daily.0, daily.1, …, daily.6. As a consequence, the paths:

rsnapshot/daily.*/client_foo

would be backups only from host foo, and:

rsnapshot/daily.0/*

would be backups from all hosts but only the most recent daily sync.

Let’s first see what the savings would be like in looking for duplicates in just one client’s backups.

Here’s the backups I have in this blob of data. The names of the clients are completely made up, though they are real backups.

Client Size (MiB)
darbee 14,504
achorn 11,297
spader 2,612
reilly 2,276
chino 2,203
audun 2,184

So let’s try deduplicating all of the biggest one’s—darbee’s—backups:

$ df -h /mnt/xfs
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvdc        50G   30G   21G  59% /mnt/xfs
# time duperemove -hdr --hashfile=/var/tmp/dr.hash /mnt/xfs/rsnapshot/*/darbee
Using 128K blocks
Using hash: murmur3
Gathering file list...
Kernel processed data (excludes target files): 8.8G
Comparison of extent info shows a net change in shared extents of: 6.8G
9.85user 78.70system 3:27.23elapsed 42%CPU (0avgtext+0avgdata 23384maxresident)k
50703656inputs+790184outputs (15major+20912minor)pagefaults 0swaps
$ df -h /mnt/xfs
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvdc        50G   25G   26G  50% /mnt/xfs

3m27s of run time, somewhere between 5 and 6.8GiB saved. That’s 35%!

Now to deduplicate the lot.

# time duperemove -hdr --hashfile=/var/tmp/dr.hash /mnt/xfs/rsnapshot
Using 128K blocks
Using hash: murmur3
Gathering file list...
Kernel processed data (excludes target files): 5.4G
Comparison of extent info shows a net change in shared extents of: 3.4G
29.12user 188.08system 5:02.31elapsed 71%CPU (0avgtext+0avgdata 34040maxresident)k
34978360inputs+572128outputs (18major+45094minor)pagefaults 0swaps
$ df -h /mnt/xfs
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvdc        50G   23G   28G  45% /mnt/xfs

5m02s used this time, and another 2–3.4GiB saved.

Since the actual deduplication does take some time (the kernel having to read the extents, mainly), and most of it was already done in the first pass, a full pass would more likely take the sum of the times, i.e. more like 8m29s.

Still, a total of about 7GiB was saved which is 23%.

It would be very interesting to try this on one of my much larger backup stores.

Why Not Just Use btrfs? ^

Using a filesystem that already has all of these features would certainly seem easier, but I personally don’t think btrfs is stable enough yet. I use it at home in a relatively unexciting setup (8 devices, raid1 for data and metadata, no compression or deduplication) and I wish I didn’t. I wouldn’t dream of using it in a production environment yet.

I’m on the btrfs mailing list and there are way too many posts regarding filesystems that give ENOSPC and become unavailable for writes, or systems that were unexpectedly powered off and when powered back on the btrfs filesystem is completely lost.

I expect the reflink feature in XFS to become non-experimental before btrfs is stable enough for production use.

ZFS? ^

ZFS is great. It doesn’t have out-of-band deduplication or reflinks though, and there are no plans to add them any time soon.

Supermicro SATA DOM flash devices don’t report lifetime writes correctly

I’m playing around with a pair of Supermicro SATA DOM flash devices at the moment, evaluating them for use as the operating system storage for servers (as opposed to where customer data goes).

They’re flash devices with a limited write endurance. The smallest model (16GB), for example, is good for 17TB of writes. Therefore it’s important to know how much you’ve actually written to it.

Many SSDs and other flash devices expose the total amount written through the SMART attribute 241, Total_LBAs_Written. The SATA DOM devices do seem to expose this attribute, but right now they say this:

$ for dom in $(sudo lsblk --paths -d -o NAME,MODEL --noheadings |
    awk '/SATA SSD/ { print $1 }')
do
    echo -n "$dom: "
    sudo smartctl -A "$dom" |
      awk '/^241/ { print $10 * 512 * 1.0e-9, "GB" }'
done
/dev/sda: 0.00856934 GB
/dev/sdb: 0.00881715 GB

This being after install and (as of now) more than a week of uptime, ~9MB of lifetime writes isn’t credible.

Another place we can look for amount of bytes written is /proc/diskstats. The 10th column is the number of (512-byte) sectors written, so:

$ for dom in $(sudo lsblk -d -o NAME,MODEL --noheadings |
    awk '/SATA SSD/ { print $1 }')
do
     awk "/$dom / {
        print \$3, \$10 / 2 * 1.0e-6, \"GB\"
    }" /proc/diskstats
done
sda 3.93009 GB
sdb 3.93009 GB

Almost 4GB is a lot more believable, so can we just use /proc/diskstats? Well, the problem there is that those figures are only since boot. That won’t include, for example, all the data written during install.

Okay, so, are these figures even consistent? Let’s write 100MB and see what changes.

Since the figure provided by SMART attribute 241 apparently isn’t actually 512-byte blocks we’ll just print the raw value there.

Before:

$ for dom in $(sudo lsblk -d -o NAME,MODEL --noheadings |
    awk '/SATA SSD/ { print $1 }')
do
     awk "/$dom / {
        print \$3, \$10 / 2 * 1.0e-6, \"GB\"
    }" /proc/diskstats
done
sda 4.03076 GB
sdb 4.03076 GB
$ for dom in $(sudo lsblk --paths -d -o NAME,MODEL --noheadings |
  awk '/SATA SSD/ { print $1 }')
do
    echo -n "$dom: "
    sudo smartctl -A "$dom" |
      awk '/^241/ { print $10 }'
done
/dev/sda: 16835
/dev/sdb: 17318

Write 100MB:

$ dd if=/dev/urandom bs=1MB count=100 > /var/tmp/one_hundred_megabytes
100+0 records in
100+0 records out
100000000 bytes (100 MB) copied, 7.40454 s, 13.5 MB/s

(I used /dev/urandom just in case some compression might take place or something)

After:

$ for dom in $(sudo lsblk -d -o NAME,MODEL --noheadings |
    awk '/SATA SSD/ { print $1 }')
do
     awk "/$dom / {
        print \$3, \$10 / 2 * 1.0e-6, \"GB\"
    }" /proc/diskstats
done
sda 4.13046 GB
sdb 4.13046 GB
$ for dom in $(sudo lsblk --paths -d -o NAME,MODEL --noheadings |
  awk '/SATA SSD/ { print $1 }')
do
    echo -n "$dom: "
    sudo smartctl -A "$dom" |
      awk '/^241/ { print $10 }'
done
/dev/sda: 16932
/dev/sdb: 17416

Well, alright, all is apparently not lost: SMART attribute 241 went up by ~100 and diskstats agrees that ~100MB was written too, so it looks like it does actually report lifetime writes, but it’s reporting them as megabytes (10^6 bytes), not 512-byte sectors.

Note: A comment below says this is actually mebibytes (2^20 bytes).

Every reference I can find says that Total_LBAs_Written is the number of 512-byte sectors, though, so in reporting units of 1MB I feel that these devices are doing the wrong thing.
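
Assuming the comment is right, a corrected version of the earlier conversion would be something like this (a sketch; use 1.0e6 instead of 1048576 if the units really are decimal megabytes):

$ sudo smartctl -A /dev/sda |
    awk '/^241/ { print $10 * 1048576 * 1.0e-9, "GB" }'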

Anyway, I’m a little alarmed that ~0.1% of the lifetime has gone already, although a lot of that would have been the install. I probably should take this opportunity to get rid of a lot of writes by tracking down logging of mundane garbage. Also this is the smallest model; the devices are rated for 1 DWPD so just over-provisioning by using a larger model than necessary will help.

Using a TOTP app for multi-factor SSH auth

I’ve been playing around with enabling multi-factor authentication (MFA) on web services and went with TOTP. It’s pretty simple to implement in Perl, and there are plenty of apps for it including Google Authenticator, 1Password and others.

I also wanted to use the same multi-factor auth for SSH logins. Happily, from Debian jessie onwards libpam-google-authenticator is packaged. To enable it for SSH you would just add the following:

auth required pam_google_authenticator.so

to /etc/pam.d/sshd (put it just after @include common-auth).

and ensure that:

ChallengeResponseAuthentication yes

is in /etc/ssh/sshd_config.

Not all my users will have MFA enabled though, so to skip prompting for these I use:

auth required pam_google_authenticator.so nullok

Finally, I only wanted users in a particular Unix group to be prompted for an MFA token so (assuming that group was totp) that would be:

auth [success=1 default=ignore] pam_succeed_if.so quiet user notingroup totp
auth required pam_google_authenticator.so nullok

If the pam_succeed_if conditions are met then the next line is skipped, so that causes pam_google_authenticator to be skipped for users not in the group totp.

Each user will require a TOTP secret key generating and storing. If you’re only setting this up for SSH then you can use the google-authenticator binary from the libpam-google-authenticator package. This asks you some simple questions and then populates the file $HOME/.google_authenticator with the key and some configuration options. That looks like:

T6Z2KSDCG7CEWPD6EPA6BICBFD4KYKCSGO2JEQVII7ZJNCXECRZPJ4GJHD3CWC43FZIKQUSV5LR2LFFP
" RATE_LIMIT 3 30 1462548404
" DISALLOW_REUSE 48751610
" TOTP_AUTH
11494760
25488108
33980423
43620625
84061586

The first line is the secret key; the five numbers are emergency codes that will always work (once each) if locked out.

If generating keys elsewhere then you can just populate this file yourself. If the file isn’t present then that’s when “nullok” applies; without “nullok” authentication would fail.
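
If you want to sanity-check that a secret produces the same codes as your phone app, the oathtool utility (from the oath-toolkit package) can generate the current TOTP code from it, which should match what the app shows. This is just an illustration and not part of the PAM setup:

$ oathtool --totp -b T6Z2KSDCG7CEWPD6EPA6BICBFD4KYKCSGO2JEQVII7ZJNCXECRZPJ4GJHD3CWC43FZIKQUSV5LR2LFFP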

Note that despite the repeated mentions of “google” here, this is not a Google-specific service and no data is sent to Google. Google are the authors of the open source Google Authenticator mobile app and the libpam-google-authenticator PAM module, but (as evidenced by the Perl example) this is an open standard and client and server sides can be implemented in any language.

So that is how you can make a web service and an SSH service use the same TOTP multi-factor authentication.

rsync and sudo conundrum

Scenario:

  • You’re logged in to hostA
  • You need to rsync some files from hostB to hostA
  • The files on hostB are only readable by root and they must be written by root locally (hostA)
  • You have sudo access to root on both
  • You have ssh public key access to both
  • root can’t ssh between the two

Normally you’d do this:

hostA$ rsync -av hostB:/foo/ /foo/

but you can’t because your user can’t read /foo on hostB.

So then you might try making rsync run as root on hostB:

hostA$ rsync --rsync-path='sudo rsync' -av hostB:/foo/ /foo/

but that fails because ssh needs a pseudo-terminal to ask you for your sudo password on hostB:

sudo: no tty present and no askpass program specified
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: error in rsync protocol data stream (code 12) at io.c(226) [Receiver=3.1.1]

So then you can try giving it an askpass program:

hostA$ rsync \
       --rsync-path='SUDO_ASKPASS=/usr/bin/ssh-askpass sudo rsync' \
       -av hostB:/foo/ /foo/

and that nearly works! It pops up an askpass dialog (so you need X11 forwarding) which takes your password and does stuff as root on hostB. But ultimately fails because it’s running as your unprivileged user locally (hostA) and can’t write the files. So then you try running the lot under sudo:

hostA$ sudo rsync \
       --rsync-path='SUDO_ASKPASS=/usr/bin/ssh-askpass sudo rsync' \
       -av hostB:/foo/ /foo/

This fails because X11 forwarding doesn’t work through the local sudo. So become root locally first, then tell rsync to ssh as you:

hostA$ sudo -i
hostA# rsync \
       -e 'sudo -u youruser ssh' \
       --rsync-path 'SUDO_ASKPASS=/usr/bin/ssh-askpass sudo rsync'\
       -av hostB:/foo /foo

Success!

Answer cobbled together with help from dutchie, dne and dg12158. Any improvements? Not needing X11 forwarding would be nice.

Alternate methods:

  • Use tar:
    $ ssh \
      -t hostB 'sudo tar -C /foo -cf - .' \
      | sudo tar -C /foo -xvf -
  • Add public key access for root
  • Use filesystem ACLs to allow the unprivileged user to read the files on hostB (a sketch of this follows below).
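
A sketch of the ACL approach, assuming the files live under /foo and your user is youruser; run as root on hostB:

hostB# setfacl -R -m u:youruser:rX /foo
hostB# setfacl -R -d -m u:youruser:rX /foo   # default ACL so newly-created files are covered too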

Your Debian netboot suddenly can’t do Ext4?

If, like me, you’ve just done a Debian netboot install over PXE and discovered that the partitioner suddenly seems to have no option for Ext4 filesystem (leaving only btrfs and XFS), despite the fact that it worked fine a couple of weeks ago, do not be alarmed. You aren’t losing your mind. It seems to be a bug.

As the comment says, downloading netboot.tar.gz version 20150422+deb8u3 fixes it. You can find your version in the debian-installer/amd64/boot-screens/f1.txt file. I was previously using 20150422+deb8u1 and the commenter was using 20150422+deb8u2.

Looking at the dates on the files I’m guessing this broke on 23rd January 2016. There was a Debian point release around then, so possibly you are supposed to download a new netboot.tar.gz with each one – not sure. Although if this is the case it would still be nice to know you’re doing something wrong as opposed to having the installer appear to proceed normally except for denying the existence of any filesystems except XFS and btrfs.

Oh and don’t forget to restart your TFTP daemon. tftpd-hpa at least seems to cache things (or maybe hold the tftp directory open, as I had just moved the old directory out of the way), so I was left even more confused when it still seemed to be serving 20150422+deb8u1.
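
For what it’s worth, refreshing the netboot environment boils down to something like this. The TFTP root path is an assumption for illustration, and the mirror and suite should be adjusted to taste:

$ cd /srv/tftp
$ sudo mv debian-installer debian-installer.old
$ sudo wget http://ftp.debian.org/debian/dists/jessie/main/installer-amd64/current/images/netboot/netboot.tar.gz
$ sudo tar xf netboot.tar.gz
$ sudo service tftpd-hpa restart
$ grep deb8u debian-installer/amd64/boot-screens/f1.txt   # confirm the new version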

Disabling the default IPMI credentials on a Supermicro server

In an earlier post I mentioned that you should disable the default ADMIN / ADMIN credentials on the IPMI controller. Here’s how.

Install ipmitool ^

ipmitool is the utility that you will use from the command line of another machine in order to interact with the IPMI controllers on your servers.

# apt-get install ipmitool

List the current users ^

$ ipmitool -I lanplus -H 192.168.1.22 -U ADMIN -a user list
Password:
ID  Name             Callin  Link Auth  IPMI Msg   Channel Priv Limit
2   ADMIN            false   false      true       ADMINISTRATOR

Here you are specifying the IP address of the server’s IPMI controller. ADMIN is the IPMI user name you will use to log in, and it’s prompting you for the password which is also ADMIN by default.

Add a new user ^

You should add a new user with a name other than ADMIN.

I suppose it would be safe to just change the password of the existing ADMIN user, but there is no need to have it named that, so you may as well pick a new name.

$ ipmitool -I lanplus -H 192.168.1.22 -U ADMIN -a user set name 3 somename
Password:
$ ipmitool -I lanplus -H 192.168.1.22 -U ADMIN -a user set password 3
Password:
Password for user 3:
Password for user 3:
$ ipmitool -I lanplus -H 192.168.1.22 -U ADMIN -a channel setaccess 1 3 link=on ipmi=on callin=on privilege=4
Password:
$ ipmitool -I lanplus -H 192.168.1.22 -U ADMIN -a user enable 3
Password:

From this point on you can switch to using the new user instead.

$ ipmitool -I lanplus -H 192.168.1.22 -U somename -a user list
Password:
ID  Name             Callin  Link Auth  IPMI Msg   Channel Priv Limit
2   ADMIN            false   false      true       ADMINISTRATOR
3   somename         true    true       true       ADMINISTRATOR

Disable ADMIN user ^

Before doing this bit you may wish to check that the new user you added works for everything you need it to. Those things might include:

  • ssh to somename@192.168.1.22
  • Log in on web interface at https://192.168.1.22/
  • Various ipmitool commands like querying power status:
    $ ipmitool -I lanplus -H 192.168.1.22 -U somename -a power status
    Password:
    Chassis power is on

If all of that is okay then you can disable ADMIN:

$ ipmitool -I lanplus -H 192.168.1.22 -U somename -a user disable 2
Password:

If you are paranoid (or this is just the first time you’ve done this) you could now check to see that none of the above things now work when you try to use ADMIN / ADMIN.

Specifying the password ^

I have not done so in these examples but if you get bored of typing the password every time then you could put it in the IPMI_PASSWORD environment variable and use -E instead of -a on the ipmitool command line.

When setting the IPMI_PASSWORD environment variable you probably don’t want it logged in your shell’s history file. Depending on which shell you use there may be different ways to achieve that.

With bash, if you have ignorespace in the HISTCONTROL environment variable then commands prefixed by one or more spaces won’t be logged. Alternatively you could temporarily disable history logging with:

$ set +o history
$ sensitive command goes here
$ set -o history # re-enable history logging

So anyway…

$ echo $HISTCONTROL
ignoredups:ignorespace
$     export IPMI_PASSWORD=letmein
$ # ^ note the leading spaces here
$ # to prevent the shell logging it
$ ipmitool -I lanplus -H 192.168.1.22 -U somename -E power status
Chassis Power is on

Installing Debian by PXE using Supermicro IPMI Serial over LAN

Here’s how to install Debian jessie on a Supermicro server using PXE boot and the IPMI serial-over-LAN.

Using these instructions you will be able to complete an install of a remote machine, although you will initially need access to the BIOS to configure the IPMI part.

BIOS settings ^

This bit needs you to be in the same location as the machine, or else have someone who is there make the required changes.

Press DEL to go into the BIOS configuration.

Under Advanced > PCIe/PCI/PnP Configuration make sure that the network interface through which you’ll reach your PXE server has the “PXE” option ROM set:

BIOS PXE option ROM

Under Advanced > Serial Port Console Redirection you’ll want to enable SOL Console Redirection.

BIOS serial console redirection

(Pictured here is also COM1 Console Redirection. This is for the physical serial port on the machine, not the one in the IPMI.)

Under SOL Console Redirection Settings you may as well set the Bits per second to 115200.

BIOS SOL redirection settings

Now it’s time to configure the IPMI so you can interact with it over the network. Under IPMI > BMC Network Configuration, put the IPMI on your management network:

IPMI network configuration
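
If the machine already has an operating system running, the same network settings can often be applied from that OS instead of the BIOS. Here’s a rough sketch, assuming the BMC is on LAN channel 1, the ipmi_si and ipmi_devintf kernel modules are loaded, and a hypothetical gateway of 192.168.1.1:

# ipmitool lan set 1 ipsrc static
# ipmitool lan set 1 ipaddr 192.168.1.22
# ipmitool lan set 1 netmask 255.255.255.0
# ipmitool lan set 1 defgw ipaddr 192.168.1.1

You can check the result with ipmitool lan print 1.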

Connecting to the IPMI serial ^

With the above BIOS settings in place you should be able to save and reboot and then connect to the IPMI serial console. The default credentials are ADMIN / ADMIN, which you should of course change with ipmitool, but that is for a different post.

There are two ways to connect to the serial-over-LAN: you can ssh to the IPMI controller, or you can use ipmitool. Personally I prefer ssh, but the ipmitool way is like this (to end an ipmitool SOL session, press Enter, then ~, then .):

$ ipmitool -I lanplus -H 192.168.1.22 -U ADMIN -a sol activate

The ssh way:

$ ssh ADMIN@192.168.1.22
The authenticity of host '192.168.1.22 (192.168.1.22)' can't be established.
RSA key fingerprint is b7:e1:12:94:37:81:fc:f7:db:6f:1c:00:e4:e0:e1:c4.
Are you sure you want to continue connecting (yes/no)?
Warning: Permanently added '192.168.1.22,192.168.1.22' (RSA) to the list of known hosts.
ADMIN@192.168.1.22's password:
 
ATEN SMASH-CLP System Management Shell, version 1.05
Copyright (c) 2008-2009 by ATEN International CO., Ltd.
All Rights Reserved 
 
 
-> cd /system1/sol1
/system1/sol1
 
-> start
/system1/sol1
press <Enter>, <Esc>, and then <T> to terminate session
(press the keys in sequence, one after the other)

They both end up displaying basically the same thing.

The serial console should just be displaying the boot process, which won’t go anywhere.

DHCP and TFTP server ^

You will need to configure a DHCP and TFTP server on an already-existing machine on the same LAN as your new server. They can both run on the same host.

The DHCP server responds to the initial requests for IP address configuration and passes along where to get the boot environment from. The TFTP server serves up that boot environment. The boot environment here consists of a kernel, initramfs and some configuration for passing arguments to the bootloader/kernel. The boot environment is provided by the Debian project.

DHCP ^

I’m using isc-dhcp-server. Its configuration file is at /etc/dhcp/dhcpd.conf.

You’ll need to know the MAC address of the server, which can be obtained either from the front page of the IPMI controller’s web interface (i.e. https://192.168.1.22/ in this case) or else it is displayed on the serial console when it attempts to do a PXE boot. So, add a section for that:

subnet 192.168.2.0 netmask 255.255.255.0 {
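    # Left empty; dhcpd still needs a subnet declaration matching the
    # interface's network before it will answer requests on it.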
}
 
host foo {
    hardware ethernet 0C:C4:7A:7C:28:40;
    fixed-address 192.168.2.22;
    filename "pxelinux.0";
    next-server 192.168.2.251;
    option subnet-mask 255.255.255.0;
    option routers 192.168.2.1;
}

Here we set the network configuration of the new server with fixed-address, option subnet-mask and option routers. The IP address in next-server refers to the IP address of the TFTP server, and pxelinux.0 is what the new server will download from it.

Make sure the DHCP server is running:

# service isc-dhcp-server start

DHCP uses UDP port 67, so make sure that is allowed through your firewall.

TFTP ^

A number of different TFTP servers are available. I use tftpd-hpa, which is mostly configured by variables in /etc/default/tftpd-hpa:

TFTP_OPTIONS="--secure"
TFTP_USERNAME="tftp"
TFTP_DIRECTORY="/srv/tftp"
TFTP_ADDRESS="0.0.0.0:69"

TFTP_DIRECTORY is where you’ll put the files for the PXE environment. With --secure, tftpd-hpa chroots into that directory, so requested paths are relative to it.

Make sure that the TFTP server is running:

# service tftpd-hpa start

TFTP uses UDP port 69, so make sure that is allowed through your firewall.
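
If the DHCP/TFTP host runs a firewall, rules along these lines may be needed (a sketch assuming plain iptables; adapt for nftables, firewalld or whatever you actually use):

# iptables -I INPUT -p udp --dport 67 -j ACCEPT
# iptables -I INPUT -p udp --dport 69 -j ACCEPT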

Download the netboot files from your local Debian mirror:

$ cd /srv/tftp
$ curl -s http://ftp.YOUR-MIRROR.debian.org/debian/dists/jessie/main/installer-amd64/current/images/netboot/netboot.tar.gz | sudo tar zxvf -
./                
./version.info    
./ldlinux.c32     
./pxelinux.0      
./pxelinux.cfg    
./debian-installer/
./debian-installer/amd64/
./debian-installer/amd64/bootnetx64.efi
./debian-installer/amd64/grub/
./debian-installer/amd64/grub/grub.cfg
./debian-installer/amd64/grub/font.pf2
…

(This assumes you are installing a device with architecture amd64.)

At this point your TFTP server root should contain a debian-installer subdirectory and a couple of links into it:

$ ls -l .
total 8
drwxrwxr-x 3 root root 4096 Jun  4  2015 debian-installer
lrwxrwxrwx 1 root root   47 Jun  4  2015 ldlinux.c32 -> debian-installer/amd64/boot-screens/ldlinux.c32
lrwxrwxrwx 1 root root   33 Jun  4  2015 pxelinux.0 -> debian-installer/amd64/pxelinux.0
lrwxrwxrwx 1 root root   35 Jun  4  2015 pxelinux.cfg -> debian-installer/amd64/pxelinux.cfg
-rw-rw-r-- 1 root root   61 Jun  4  2015 version.info

You could now boot your server and it would call out to PXE to do its netboot, but would be displaying the installer process on the VGA output. If you intend to carry it out using the Remote Console facility of the IPMI interface then that may be good enough. If you want to do it over the serial-over-LAN though, you’ll need to edit some of the files that came out of the netboot.tar.gz to configure that.

Here’s a list of the files you need to edit. All you are doing in each one is telling it to use the serial console. The changes are quite mechanical so you can easily come up with a script to do it (a sketch follows the listings below), but here I will show the changes verbosely. All the files live in the debian-installer/amd64/boot-screens/ directory.

ttyS1 is used here because this system has a real serial port on ttyS0. 115200 is the baud rate of ttyS1 as configured in the BIOS earlier.

adtxt.cfg

From:

label expert
        menu label ^Expert install
        kernel debian-installer/amd64/linux
        append priority=low vga=788 initrd=debian-installer/amd64/initrd.gz --- 
include debian-installer/amd64/boot-screens/rqtxt.cfg
label auto
        menu label ^Automated install
        kernel debian-installer/amd64/linux
        append auto=true priority=critical vga=788 initrd=debian-installer/amd64/initrd.gz --- quiet

To:

label expert
        menu label ^Expert install
        kernel debian-installer/amd64/linux
        append priority=low console=ttyS1,115200n8 initrd=debian-installer/amd64/initrd.gz --- 
include debian-installer/amd64/boot-screens/rqtxt.cfg
label auto
        menu label ^Automated install
        kernel debian-installer/amd64/linux
        append auto=true priority=critical console=ttyS1,115200n8 initrd=debian-installer/amd64/initrd.gz --- quiet

rqtxt.cfg

From:

label rescue
        menu label ^Rescue mode
        kernel debian-installer/amd64/linux
        append vga=788 initrd=debian-installer/amd64/initrd.gz rescue/enable=true --- quiet

To:

label rescue
        menu label ^Rescue mode
        kernel debian-installer/amd64/linux
        append console=ttyS1,115200n8 initrd=debian-installer/amd64/initrd.gz rescue/enable=true --- quiet

syslinux.cfg

From:

# D-I config version 2.0
# search path for the c32 support libraries (libcom32, libutil etc.)
path debian-installer/amd64/boot-screens/
include debian-installer/amd64/boot-screens/menu.cfg
default debian-installer/amd64/boot-screens/vesamenu.c32
prompt 0
timeout 0

To:

serial 1 115200
console 1
# D-I config version 2.0
# search path for the c32 support libraries (libcom32, libutil etc.)
path debian-installer/amd64/boot-screens/
include debian-installer/amd64/boot-screens/menu.cfg
default debian-installer/amd64/boot-screens/vesamenu.c32
prompt 0
timeout 0

txt.cfg

From:

default install
label install
        menu label ^Install
        menu default
        kernel debian-installer/amd64/linux
        append vga=788 initrd=debian-installer/amd64/initrd.gz --- quiet

To:

default install
label install
        menu label ^Install
        menu default
        kernel debian-installer/amd64/linux
        append console=ttyS1,115200n8 initrd=debian-installer/amd64/initrd.gz --- quiet
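
As mentioned earlier, the above edits are mechanical enough to script. Here’s a minimal sketch of that idea, assuming GNU sed, the /srv/tftp layout from this post and the same ttyS1 / 115200 settings; inspect the results before trusting it:

#!/bin/sh
# Sketch: point the debian-installer boot menus at the serial console.
set -e
cd /srv/tftp/debian-installer/amd64/boot-screens

# Replace the VGA argument with the serial console argument,
# keeping a .orig backup of each file.
for f in adtxt.cfg rqtxt.cfg txt.cfg; do
    sed -i.orig 's/vga=788/console=ttyS1,115200n8/g' "$f"
done

# Prepend the serial directives to syslinux.cfg.
{ printf 'serial 1 115200\nconsole 1\n'; cat syslinux.cfg; } > syslinux.cfg.new
mv syslinux.cfg.new syslinux.cfg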

Perform the install ^

Connect to the serial-over-LAN and get started. If the server doesn’t have anything currently installed then it should go straight to trying PXE boot. If it does have something on the storage that it would boot then you will have to use F12 at the BIOS screen to convince it to jump straight to PXE boot.
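
Alternatively, most BMCs will accept a one-off boot device override, which saves having to catch F12 over the serial console; whether it is honoured depends on the firmware:

$ ipmitool -I lanplus -H 192.168.1.22 -U ADMIN -a chassis bootdev pxe
$ ipmitool -I lanplus -H 192.168.1.22 -U ADMIN -a chassis power reset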

$ ssh ADMIN@192.168.1.22
ADMIN@192.168.1.22's password:
 
ATEN SMASH-CLP System Management Shell, version 1.05
Copyright (c) 2008-2009 by ATEN International CO., Ltd.
All Rights Reserved 
 
 
-> cd /system1/sol1
/system1/sol1
 
-> start
/system1/sol1
press <Enter>, <Esc>, and then <T> to terminate session
(press the keys in sequence, one after the other)
 
Intel(R) Boot Agent GE v1.5.13                                                  
Copyright (C) 1997-2013, Intel Corporation                                      
 
CLIENT MAC ADDR: 0C C4 7A 7C 28 40  GUID: 00000000 0000 0000 0000 0CC47A7C2840  
CLIENT IP: 192.168.2.22  MASK: 255.255.255.0  DHCP IP: 192.168.2.252             
GATEWAY IP: 192.168.2.1
 
PXELINUX 6.03 PXE 20150107 Copyright (C) 1994-2014 H. Peter Anvin et al    
 
 
 
 
 
 
                 ┌───────────────────────────────────────┐
                 │ Debian GNU/Linux installer boot menu  │
                 ├───────────────────────────────────────┤
                 │ Install                               │
                 │ Advanced options                    > │
                 │ Help                                  │
                 │ Install with speech synthesis         │
                 │                                       │
                 │                                       │
                 │                                       │
                 │                                       │
                 │                                       │
                 │                                       │
                 └───────────────────────────────────────┘
 
 
 
              Press ENTER to boot or TAB to edit a menu entry
 
 
  ┌───────────────────────┤ [!!] Select a language ├────────────────────────┐
  │                                                                         │
  │ Choose the language to be used for the installation process. The        │
  │ selected language will also be the default language for the installed   │
  │ system.                                                                 │
  │                                                                         │
  │ Language:                                                               │
  │                                                                         │
  │                               C                                         │
  │                               English                                   │
  │                                                                         │
  │     <Go Back>                                                           │
  │                                                                         │
  └─────────────────────────────────────────────────────────────────────────┘
 
 
 
 
<Tab> moves; <Space> selects; <Enter> activates buttons

…and now the installation proceeds as normal.

At the end of this you should be left with a system that uses ttyS1 for its console. You may need to tweak that depending on whether you want the VGA console also.
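
For example, to keep both the VGA console and the serial console on the installed system, something like the following in /etc/default/grub (a sketch; adjust the unit and speed to match your setup), followed by running update-grub as root, should do it:

GRUB_CMDLINE_LINUX="console=tty0 console=ttyS1,115200n8"
GRUB_TERMINAL="console serial"
GRUB_SERIAL_COMMAND="serial --unit=1 --speed=115200"

The kernel sends its messages to every console listed, and the last console= named becomes /dev/console.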