Another disappointing btrfs experience

I’ve been using btrfs on my home fileserver for about 4½ years. I am not entirely happy with it and kind of wish I never did it; I will certainly not be introducing it anywhere else. I’m also pretty lazy though, which probably explains why I haven’t ripped it out and replaced it with something else yet.

I’ve had a few problems with it over the years. To be fair I’ve never lost any data; it’s really the availability aspects of it which I feel just aren’t ready yet. When I use multiple storage devices it’s generally to increase availability. I don’t expect device failure to stop me doing what I need to do, at least for small amounts of device failure.

Unfortunately btrfs has consistently not lived up to these expectations. Almost every single-disk failure I’ve had in the past has resulted in an “outage” of some sort. As this is just our data, at home, it may be strange to think of it as an outage, but that’s what it is. Our data became unavailable in some way for some period of time.

This time around, one of the drives started throwing up “Currently unreadable (pending)” and “Offline uncorrectable” sectors a few days ago. That means that there’s areas of the drive that it cannot read. Initially there were just a small number, and a scrub came back clean so that suggested the problem sectors were at that time outside of any filesystem.

In a more critical setting I’d have spare drives available and would just swap them, but for home use I’m usually comfortable with forcing the drive to reallocate these by forcing a write, before ordering a replacement if the problem doesn’t go away. Worst case, I have backups.

After a day or so though, the number of problem sectors was increasing and it was obvious the drive was going to die fairly soon. I ordered a replacement. About 6 hours before the replacement arrived the drive completely stopped responding.

Now, this drive was at the time one of five in the btrfs filesystem, and the filesystem has a raid1 storage policy so there should have been no issue with one device going missing. But apparently there was a problem. btrfs sits spewing the kernel log with errors about lost writes to a device that’s no longer there; the filesystem goes read-only.

The replacement drive arrives, but with the filesystem read-only I can’t add it. I can’t even unmount the filesystem (says it is busy but lsof doesn’t see any users). Nope, I had to reboot the fileserver, at which point the filesystem wouldn’t mount at all because you have to give it the degraded mount option if you want it to mount with any devices missing.

Add the replacement drive, btrfs device remove missing /path/to/fs to kick off a remove of the dead device. Things are at least up and running read-write while this is going on. In fact it’s still going on, because there was 1.2TiB of data on the dead device and reconstructing it is painfully slow. As I write this we’re now about 9 hours in and there’s still about 421GiB to go.

So, it’s not terrible. No data was lost (probably). A short outage due to a required reboot. But it is kind of disappointing and not really how I want to be spending my time just because a single HDD slipped its mortal coil. I am massively thankful that the operating system of that fileserver is still on four other HDDs on ext4+lvm+md and never give me any trouble. Otherwise I’d have to be booting into a rescue OS to fix this sort of thing. When the thing you’re glad of is that you didn’t use a filesystem, that isn’t a great advert for that filesystem.

I should probably try to find some time to play (again) with ZFS-on-Linux. I did actually give it a go last year but got bogged down trying to compare its performance against btrfs and ext4+lvm+md using fio, which proved quite difficult to do, and I moved on to other things.

One of the things that initially attracted me to btrfs is the possibility of using a mish-mash of differently-sized drives. Due to BitFolk constantly replacing hardware I have in my possession plenty of HDDs of differing sizes that are individually perfectly serviceable, but would be awkward to try to match up into identical sizes for conventional RAID arrays. Over the years of this btrfs filesystem it had started out with mostly 250G drives and just before this failure it was 1x 1TB, 3x 2TB and 1x 3TB.

I had thought that ZFS requires every device to be the same capacity (i.e. it would only use the smallest capacity) but I’ve since been informed that ZFS will just use the capacity of the smallest device in the vdev. So assuming mirror vdevs, I’d just need to pair the drives up (or accept that the capacity will be that of the smaller of the two).

That doesn’t seem too onerous at all, when considering the advantages that ZFS would bring. I’m most interested in the self-healing (checksums) and the storage tiering (through using faster devices like SSDs for L2ARC and ZIL). btrfs doesn’t have a good solution for tiering yet, unless you are insane and want to play with bcache(fs).

So, yeah, should stop being lazy and crack on with ZFS again. In my copious free time.

Disabling edge tiling on GNOME 3.28 / Debian testing (buster)

We’ve been here before ^

In an earlier post I mentioned how to disable edge tiling. That was for my desktop machine which at the time was running Ubuntu 17.10 and GNOME 3.26.

My laptop, however, currently runs Debian testing (buster) with GNOME 3.28, and this method does not work.

Things that work ^

In fact, one of the ways the Internet suggested that didn’t work for Ubuntu, does work on my Debian laptop. That is:

$ gsettings set org.gnome.shell.overrides edge-tiling false

I have no idea why, sorry.

Things that don’t work ^

So, for my Debian buster laptop running GNOME 3.28 under Xorg, the things that don’t work are:

$ dconf write /org/gnome/shell/extensions/classic-overrides/edge-tiling false
$ dconf write /org/gnome/mutter/edge-tiling false
$ dconf write /org/gnome/shell/overrides/edge-tiling false

Using a different theme for Mediawiki’s SyntaxHighlight extension

Probably the best syntax highlighting plugin for Mediawiki at the moment is the one simply called SyntaxHighlight. It uses Pygments to do the heavy lifting. What sets it apart from the other extensions is that it supports line numbers and picking out highlighted lines.

Unfortunately the default style (theme) is dark-on-light whereas for most of my syntax highlighting I am giving examples of either shell sessions or code. All my shell sessions and code are viewed as light-on-dark, so I would prefer that the wiki’s syntax highlighting followed suit.

I spent quite a while messing about with editing the extension itself but to little effect, until Robert pointed out that I just needed to edit the Common.css file inside the wiki itself. Then you get some decent results.

I used something like this to generate the correct CSS for the “native” style:

$ ./extensions/SyntaxHighlight_GeSHi/pygments/pygmentize -S native -f html|sed -e 's/^/.mw-highlight > pre /'
.mw-highlight > pre .hll { background-color: #404040 }
.mw-highlight > pre .c { color: #999999; font-style: italic } /* Comment */
.mw-highlight > pre .err { color: #a61717; background-color: #e3d2d2 } /* Error */
.mw-highlight > pre .esc { color: #d0d0d0 } /* Escape */
.mw-highlight > pre .g { color: #d0d0d0 } /* Generic */
.mw-highlight > pre .k { color: #6ab825; font-weight: bold } /* Keyword */
.mw-highlight > pre .l { color: #d0d0d0 } /* Literal */
.mw-highlight > pre .n { color: #d0d0d0 } /* Name */
.mw-highlight > pre .o { color: #d0d0d0 } /* Operator */
.mw-highlight > pre .x { color: #d0d0d0 } /* Other */
.mw-highlight > pre .p { color: #d0d0d0 } /* Punctuation */
.mw-highlight > pre .ch { color: #999999; font-style: italic } /* Comment.Hashbang */
.mw-highlight > pre .cm { color: #999999; font-style: italic } /* Comment.Multiline */
.mw-highlight > pre .cp { color: #cd2828; font-weight: bold } /* Comment.Preproc */
.mw-highlight > pre .cpf { color: #999999; font-style: italic } /* Comment.PreprocFile */
.mw-highlight > pre .c1 { color: #999999; font-style: italic } /* Comment.Single */
.mw-highlight > pre .cs { color: #e50808; font-weight: bold; background-color: #520000 } /* Comment.Special */
.mw-highlight > pre .gd { color: #d22323 } /* Generic.Deleted */
.mw-highlight > pre .ge { color: #d0d0d0; font-style: italic } /* Generic.Emph */
.mw-highlight > pre .gr { color: #d22323 } /* Generic.Error */
.mw-highlight > pre .gh { color: #ffffff; font-weight: bold } /* Generic.Heading */
.mw-highlight > pre .gi { color: #589819 } /* Generic.Inserted */
.mw-highlight > pre .go { color: #cccccc } /* Generic.Output */
.mw-highlight > pre .gp { color: #aaaaaa } /* Generic.Prompt */
.mw-highlight > pre .gs { color: #d0d0d0; font-weight: bold } /* Generic.Strong */
.mw-highlight > pre .gu { color: #ffffff; text-decoration: underline } /* Generic.Subheading */
.mw-highlight > pre .gt { color: #d22323 } /* Generic.Traceback */
.mw-highlight > pre .kc { color: #6ab825; font-weight: bold } /* Keyword.Constant */
.mw-highlight > pre .kd { color: #6ab825; font-weight: bold } /* Keyword.Declaration */
.mw-highlight > pre .kn { color: #6ab825; font-weight: bold } /* Keyword.Namespace */
.mw-highlight > pre .kp { color: #6ab825 } /* Keyword.Pseudo */
.mw-highlight > pre .kr { color: #6ab825; font-weight: bold } /* Keyword.Reserved */
.mw-highlight > pre .kt { color: #6ab825; font-weight: bold } /* Keyword.Type */
.mw-highlight > pre .ld { color: #d0d0d0 } /* Literal.Date */
.mw-highlight > pre .m { color: #3677a9 } /* Literal.Number */
.mw-highlight > pre .s { color: #ed9d13 } /* Literal.String */
.mw-highlight > pre .na { color: #bbbbbb } /* Name.Attribute */
.mw-highlight > pre .nb { color: #24909d } /* Name.Builtin */
.mw-highlight > pre .nc { color: #447fcf; text-decoration: underline } /* Name.Class */
.mw-highlight > pre .no { color: #40ffff } /* Name.Constant */
.mw-highlight > pre .nd { color: #ffa500 } /* Name.Decorator */
.mw-highlight > pre .ni { color: #d0d0d0 } /* Name.Entity */
.mw-highlight > pre .ne { color: #bbbbbb } /* Name.Exception */
.mw-highlight > pre .nf { color: #447fcf } /* Name.Function */
.mw-highlight > pre .nl { color: #d0d0d0 } /* Name.Label */
.mw-highlight > pre .nn { color: #447fcf; text-decoration: underline } /* Name.Namespace */
.mw-highlight > pre .nx { color: #d0d0d0 } /* Name.Other */
.mw-highlight > pre .py { color: #d0d0d0 } /* Name.Property */
.mw-highlight > pre .nt { color: #6ab825; font-weight: bold } /* Name.Tag */
.mw-highlight > pre .nv { color: #40ffff } /* Name.Variable */
.mw-highlight > pre .ow { color: #6ab825; font-weight: bold } /* Operator.Word */
.mw-highlight > pre .w { color: #666666 } /* Text.Whitespace */
.mw-highlight > pre .mb { color: #3677a9 } /* Literal.Number.Bin */
.mw-highlight > pre .mf { color: #3677a9 } /* Literal.Number.Float */
.mw-highlight > pre .mh { color: #3677a9 } /* Literal.Number.Hex */
.mw-highlight > pre .mi { color: #3677a9 } /* Literal.Number.Integer */
.mw-highlight > pre .mo { color: #3677a9 } /* Literal.Number.Oct */
.mw-highlight > pre .sa { color: #ed9d13 } /* Literal.String.Affix */
.mw-highlight > pre .sb { color: #ed9d13 } /* Literal.String.Backtick */
.mw-highlight > pre .sc { color: #ed9d13 } /* Literal.String.Char */
.mw-highlight > pre .dl { color: #ed9d13 } /* Literal.String.Delimiter */
.mw-highlight > pre .sd { color: #ed9d13 } /* Literal.String.Doc */
.mw-highlight > pre .s2 { color: #ed9d13 } /* Literal.String.Double */
.mw-highlight > pre .se { color: #ed9d13 } /* Literal.String.Escape */
.mw-highlight > pre .sh { color: #ed9d13 } /* Literal.String.Heredoc */
.mw-highlight > pre .si { color: #ed9d13 } /* Literal.String.Interpol */
.mw-highlight > pre .sx { color: #ffa500 } /* Literal.String.Other */
.mw-highlight > pre .sr { color: #ed9d13 } /* Literal.String.Regex */
.mw-highlight > pre .s1 { color: #ed9d13 } /* Literal.String.Single */
.mw-highlight > pre .ss { color: #ed9d13 } /* Literal.String.Symbol */
.mw-highlight > pre .bp { color: #24909d } /* Name.Builtin.Pseudo */
.mw-highlight > pre .fm { color: #447fcf } /* Name.Function.Magic */
.mw-highlight > pre .vc { color: #40ffff } /* Name.Variable.Class */
.mw-highlight > pre .vg { color: #40ffff } /* Name.Variable.Global */
.mw-highlight > pre .vi { color: #40ffff } /* Name.Variable.Instance */
.mw-highlight > pre .vm { color: #40ffff } /* Name.Variable.Magic */
.mw-highlight > pre .il { color: #3677a9 } /* Literal.Number.Integer.Long */

(Yes, I also need to do the light-on-dark thing here in this blog)

To get a list of available styles:

$ ./extensions/SyntaxHighlight_GeSHi/pygments/pygmentize -L styles
Pygments version 2.2.0, (c) 2006-2017 by Georg Brandl.
 
Styles:
~~~~~~~
* manni:
    A colorful style, inspired by the terminal highlighting style.
* igor:
    Pygments version of the official colors for Igor Pro procedures.
* lovelace:
    The style used in Lovelace interactive learning environment. Tries to avoid the "angry fruit salad" effect with desaturated and dim colours.
* xcode:
    Style similar to the Xcode default colouring theme.
* vim:
    Styles somewhat like vim 7.0
* autumn:
    A colorful style, inspired by the terminal highlighting style.
* abap:
 
* vs:
 
* rrt:
    Minimalistic "rrt" theme, based on Zap and Emacs defaults.
* native:
    Pygments version of the "native" vim theme.
* perldoc:
    Style similar to the style used in the perldoc code blocks.
* borland:
    Style similar to the style used in the borland IDEs.
* arduino:
    The Arduino® language style. This style is designed to highlight the Arduino source code, so exepect the best results with it.
* tango:
    The Crunchy default Style inspired from the color palette from the Tango Icon Theme Guidelines.
* emacs:
    The default style (inspired by Emacs 22).
* friendly:
    A modern style based on the VIM pyte theme.
* monokai:
    This style mimics the Monokai color scheme.
* paraiso-dark:
 
* colorful:
    A colorful style, inspired by CodeRay.
* murphy:
    Murphy's style from CodeRay.
* bw:
 
* pastie:
    Style similar to the pastie default style.
* rainbow_dash:
    A bright and colorful syntax highlighting theme.
* algol_nu:
 
* paraiso-light:
 
* trac:
    Port of the default trac highlighter design.
* default:
    The default style (inspired by Emacs 22).
* algol:
 
* fruity:
    Pygments version of the "native" vim theme.

Although you may find it easier looking at the Pygments style gallery.

Let’s Encrypt wildcard certificates, acme.sh and automated DNS verification

Let’s Encrypt’s wildcard certificates ^

Now that Let’s Encrypt can issue wildcard TLS certificates I found some time to look into that.

I already use a Lua script with haproxy which takes care of automatically answering http-01 ACME challenges, but to issue/renew a wildcard certificate you need to answer a dns-01 challenge. A different client/setup would be needed.

dns-01 ACME challenges ^

Most of the clients that support ACME v2 offer a range of integrations for DNS providers, plus a manual mode that prints out the DNS record that you need to add and then waits for you to indicate that you’ve done it. I run my own DNS infrastructure so the thing to do would be RFC2136 dynamic DNS updates.

One wrinkle here is that currently none of my DNS zones have dynamic updates enabled. At the moment I manage them as zone files (some are automatically generated by scripts though). After looking at a few of the client options I found that acme.sh supports an “alias zone”.

Basically, in your main zone you create a CNAME for the challenge record that points at another zone, and then enable dynamic updates in that other zone. The other zone is dedicated for this purpose, so the only updates which will be happening will be for the purpose of answering dns-01 ACME challenges. I made my dynamic zone a sub-zone of my main one:

strugglers.net zone file content ^

These records need to be added to the main zone for this to work.

.
.
.
; sub-zone purely used for dns-01 ACME challenges.
acmesh          NS a.authns.bitfolk.co.uk.
                NS b.authns.bitfolk.com.
                NS c.authns.bitfolk.com.
 
; Alias the dns-01 challenge record into the dedicated zone.
_acme-challenge CNAME _acme-challenge.acmesh.strugglers.net.
.
.
.

acmesh.strugglers.net zone file content ^

Initially this just needs to be an empty zone with only SOA and NS records, so this is the entire content of the file.

$ORIGIN .
$TTL 86400      ; 1 day
acmesh.strugglers.net   IN SOA  a.authns.bitfolk.co.uk. hostmaster.bitfolk.com. (
                                2018031905 ; serial
                                14400      ; refresh (4 hours)
                                7200       ; retry (2 hours)
                                1209600    ; expire (2 weeks)
                                43200      ; minimum (12 hours)
                                )
                        NS      a.authns.bitfolk.co.uk.
                        NS      b.authns.bitfolk.com.
                        NS      c.authns.bitfolk.com.

DNS server configuration ^

The DNS server needs to know a key by which it will authenticate acme.sh‘s updates, and also needs to be told that the new zone is a dynamic zone. I use BIND, so it goes as follows.

Generate a key for dynamic DNS updates ^

Use the dnssec-keygen command to generate a key suitable for authenticating DNS updates.

$ dnssec-keygen -r /dev/urandom -a HMAC-SHA512 -b 512 -n HOST DDNS_UPDATE

This creates two files named like Kddns_update.+165+14059.key and Kddns_update.+165+14059.private.

Put the key in the BIND config ^

Look in the private file and take the key from the line that starts “Key:”. Put that in some config file that you will load into your BIND like this:

key "strugglers" {
    algorithm hmac-sha512;
    secret "Sb8nvwpO8bhiU4haPB+NiJKoMO6vVJumrr29Bj3daSuB8hBoTKoqPKMBKTYLRUv12pbKPwJATgdsU6BtL4Hmcw==";
};

The thing in quotes after “key” is a symbolic name for this key and can be anything that makes sense to you. The “secret” is the key from the private file. You can delete the two Kddns_update.+165+14059.* files now.

Put the new zone into the BIND config ^

The config for the zone itself looks something like this:

zone "acmesh.strugglers.net" {
    type master;
    file "/path/to/acmesh.strugglers.net";
    allow-update {
        key "strugglers";
    };
};

Reload the DNS server ^

Once BIND has been reloaded the log file should indicate that the acemsh.strugglers.net zone was loaded correctly, and in my case that triggers DNS NOTIFY to my secondary servers which automatically begin zone transfers.

Check things out with nsupdate ^

At this point it might be worth using the nsupdate command to check that you can do dynamic DNS updates.

Just type the nsupdate line in the shell, the > is a prompt at which you will type the updates you wish to send. We’ll add a trivial TXT record. The -k argument is the path to the file containing the key.

$ nsupdate -k /path/to/strugglers.key -v
> server a.authns.bitfolk.co.uk
> debug yes
> zone acmesh.strugglers.net.
> update add foo.acmesh.strugglers.net. 86400 TXT "bar"
> show
Outgoing update query:
;; ->>HEADER<<- opcode: UPDATE, status: NOERROR, id:      0
;; flags:; ZONE: 0, PREREQ: 0, UPDATE: 0, ADDITIONAL: 0
;; ZONE SECTION:
;acmesh.strugglers.net.         IN      SOA
 
;; UPDATE SECTION:
foo.acmesh.strugglers.net. 86400 IN     TXT     "bar"
 
> send
Sending update to 85.119.80.222#53
Outgoing update query:
;; ->>HEADER<<- opcode: UPDATE, status: NOERROR, id:  19987
;; flags:; ZONE: 1, PREREQ: 0, UPDATE: 1, ADDITIONAL: 1
;; ZONE SECTION:
;acmesh.strugglers.net.         IN      SOA
 
;; UPDATE SECTION:
foo.acmesh.strugglers.net. 86400 IN     TXT     "bar"
 
;; TSIG PSEUDOSECTION:
strugglers.             0       ANY     TSIG    hmac-sha512. 1521454639 300 64 dPndp1/ZyqzmSEn0AKIsGR62HrsplJBhntWioM4oBdPlNXUIAwg7Jwpg DGSM2S3kY+5hfGTleNqwXZrMvnBhUQ== 19987 NOERROR 0 
 
 
Reply from update query:
;; ->>HEADER<<- opcode: UPDATE, status: NOERROR, id:  19987
;; flags: qr; ZONE: 1, PREREQ: 0, UPDATE: 0, ADDITIONAL: 1
;; ZONE SECTION:
;acmesh.strugglers.net.         IN      SOA
 
;; TSIG PSEUDOSECTION:
strugglers.             0       ANY     TSIG    hmac-sha512. 1521454639 300 64 NfH/78kvq6f+59RXnyJwC6kfFRLGjG6Rh9jdYRId7UjH0jwIbtRVpqCu xx4HToGmlJrDTUqpgbYZq2orUOZlkQ== 19987 NOERROR 0
 
> [Ctrl-D]

And to verify it really got added (though the status of NOERROR should be confirmation enough):

$ dig +short -t txt foo.acmesh.strugglers.net
"bar"

That it; you can do dynamic DNS updates.

acme.sh ^

I’m going to assume you’ve installed acme.sh according to one of its supported installation methods. Personally I am not into curl | sh so I:

  • Create a system user that can’t log in.
  • git clone the source.
  • acme.sh --install it as that user.

acme.sh doesn’t have to be run on the primary DNS server, because it’s going to use a dynamic DNS update to do all the DNS things. It just needs access to the dynamic DNS update key file. Either you can install acme.sh on each host that will need to generate/renew certificates and copy the DNS key there, or else do all the certificate generation/renewal in one place and copy the certificate files around.

However you manage it, make sure that the user you’re going to run acme.sh as can read the dynamic DNS update key file.

Issuing the first wildcard certificate ^

The first time you issue the certificate you need to set NSUPDATE_KEY and NSUPDATE_SERVER in your environment. After the first successful issuance acme.sh will store these variables in its configuration for use in the automated renewals.

$ NSUPDATE_SERVER=a.authns.bitfolk.co.uk NSUPDATE_KEY=/path/to/strugglers.key ./acme.sh --issue -d strugglers.net -d '*.strugglers.net' --challenge-alias acmesh.strugglers.net --dns dns_nsupdate
[Mon 19 Mar 09:19:00 UTC 2018] Multi domain='DNS:strugglers.net,DNS:*.strugglers.net'
[Mon 19 Mar 09:19:00 UTC 2018] Getting domain auth token for each domain
[Mon 19 Mar 09:19:03 UTC 2018] Getting webroot for domain='strugglers.net'
[Mon 19 Mar 09:19:03 UTC 2018] Getting webroot for domain='*.strugglers.net'
[Mon 19 Mar 09:19:04 UTC 2018] Found domain api file: /path/to/acmesh/dnsapi/dns_nsupdate.sh
[Mon 19 Mar 09:19:04 UTC 2018] adding _acme-challenge.acmesh.strugglers.net. 60 in txt "WmenhbXRtenhpNLYLOBjznyHcVvFk-jjxurCVTrhWc8"
[Mon 19 Mar 09:19:04 UTC 2018] Found domain api file: /path/to/acmesh/dnsapi/dns_nsupdate.sh
[Mon 19 Mar 09:19:04 UTC 2018] adding _acme-challenge.acmesh.strugglers.net. 60 in txt "fwZPUBHijOQkJJaoOF_nIn3Z_FtuVU9R635NDVz_hPA"
[Mon 19 Mar 09:19:04 UTC 2018] Sleep 120 seconds for the txt records to take effect

At this point a DNS update has been crafted and sent so you should see your zone update and zone transfer happen to any secondary servers. If that doesn’t happen within 120 seconds then when Let’s Encrypt tries to verify the challenge it might query a DNS server that doesn’t yet have the record. Your zone transfers need to be reliable.

[Mon 19 Mar 09:21:08 UTC 2018] Verifying:strugglers.net
[Mon 19 Mar 09:21:12 UTC 2018] Success
[Mon 19 Mar 09:21:12 UTC 2018] Verifying:*.strugglers.net
[Mon 19 Mar 09:21:15 UTC 2018] Success
[Mon 19 Mar 09:21:15 UTC 2018] Removing DNS records.
[Mon 19 Mar 09:21:15 UTC 2018] removing _acme-challenge.acmesh.strugglers.net. txt
[Mon 19 Mar 09:21:16 UTC 2018] removing _acme-challenge.acmesh.strugglers.net. txt
[Mon 19 Mar 09:21:16 UTC 2018] Verify finished, start to sign.
[Mon 19 Mar 09:21:18 UTC 2018] Cert success.
-----BEGIN CERTIFICATE-----
MIIFETCCA/mgAwIBAgISAz4ZQV27n1FgemVAEhIqiUZnMA0GCSqGSIb3DQEBCwUA
MEoxCzAJBgNVBAYTAlVTMRYwFAYDVQQKEw1MZXQncyBFbmNyeXB0MSMwIQYDVQQD
.
.
.
NeAmr5I=
-----END CERTIFICATE-----
[Mon 19 Mar 09:21:18 UTC 2018] Your cert is in  /path/to/acmesh/.acme.sh/strugglers.net/strugglers.net.cer 
[Mon 19 Mar 09:21:18 UTC 2018] Your cert key is in  /path/to/acmesh/.acme.sh/strugglers.net/strugglers.net.key 
[Mon 19 Mar 09:21:18 UTC 2018] The intermediate CA cert is in  /path/to/acmesh/.acme.sh/strugglers.net/ca.cer 
[Mon 19 Mar 09:21:18 UTC 2018] And the full chain certs is there:  /path/to/acmesh/.acme.sh/strugglers.net/fullchain.cer

Examining a certificate ^

Just for peace of mind…

$ openssl x509 -text -noout -certopt no_subject,no_header,no_version,no_serial,no_signame,no_subject,no_issuer,no_pubkey,no_sigdump,no_aux -in /path/to/acmesh/.acme.sh/strugglers.net/strugglers.net.cer
        Validity
            Not Before: Mar 19 08:21:17 2018 GMT
            Not After : Jun 17 08:21:17 2018 GMT
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment
            X509v3 Extended Key Usage: 
                TLS Web Server Authentication, TLS Web Client Authentication
            X509v3 Basic Constraints: critical
                CA:FALSE
            X509v3 Subject Key Identifier: 
                BF:C7:8E:F5:87:05:D0:6E:15:AC:7B:37:9F:82:05:C3:E3:11:B7:32
            X509v3 Authority Key Identifier: 
                keyid:A8:4A:6A:63:04:7D:DD:BA:E6:D1:39:B7:A6:45:65:EF:F3:A8:EC:A1
 
            Authority Information Access: 
                OCSP - URI:http://ocsp.int-x3.letsencrypt.org
                CA Issuers - URI:http://cert.int-x3.letsencrypt.org/
 
            X509v3 Subject Alternative Name: 
                DNS:*.strugglers.net, DNS:strugglers.net
            X509v3 Certificate Policies: 
                Policy: 2.23.140.1.2.1
                Policy: 1.3.6.1.4.1.44947.1.1.1
                  CPS: http://cps.letsencrypt.org
                  User Notice:
                    Explicit Text: This Certificate may only be relied upon by Relying Parties and only in accordance with the Certificate Policy found at https://letsencrypt.org/repository/

From the Subject Alternative Name we can see it is a wildcard certificate.

Disabling edge tiling on GNOME 3.26 / Ubuntu 17.10

Edge tiling? ^

It’s that thing where when you drag a window so it hits the edge of the screen, GNOME offers to maximise the window. Generally the number of times I will knowingly want to maximise a window by dragging it to the top of the screen is 0, while the number of times it happens accidentally is over 9,000 by lunch time.

Things that work ^

$ dconf write /org/gnome/mutter/edge-tiling false

It should take effect immediately.

If you like a pointy clicky way to do it then install dconf-editor package and run dconf-editor, but really all you will do is click down the tree orggnomemutter and then toggle edge-tiling so I don’t really see the point.

Things that people on the Internet say work, but don’t – a non-exhaustive list ^

These suggestions silently fail to do anything, as far as I can see. They may have been correct for earlier versions of GNOME, but I am using GNOME on Ubuntu 17.10 and they didn’t work for me.

dconf write /org/gnome/shell/extensions/classic-overrides/edge-tiling false
gsettings set org.gnome.shell.overrides edge-tiling false
dconf write /org/gnome/shell/overrides/edge-tiling false

When is a 64-bit counter not a 64-bit counter?

…when you run a Xen device backend (commonly dom0) on a kernel version earlier than 4.10, e.g. Debian stable.

TL;DR ^

Xen netback devices used 32-bit counters until that bug was fixed and released in kernel version 4.10.

On a kernel with that bug you will see counter wraps much sooner than you would expect, and if the interface is doing enough traffic for there to be multiple wraps in 5 minutes, your monitoring will no longer be accurate.

The problem ^

A high-bandwidth VPS customer reported that the bandwidth figures presented by BitFolk’s monitoring bore no resemblance to their own statistics gathered from inside their VPS. Their figures were a lot higher.

About octet counters ^

The Linux kernel maintains byte/octet counters for its network interfaces. You can view them in /sys/class/net/<interface>/statistics/*_bytes.

They’re a simple count of bytes transferred, and so the count always goes up. Typically these are 64-bit unsigned integers so their maximum value would be 18,446,744,073,709,551,615 (264-1).

When you’re monitoring bandwidth use the monitoring system records the value and the timestamp. The difference in value over a known period allows the monitoring system to work out the rate.

Wrapping ^

Monitoring of network devices is often done using SNMP. SNMP has 32-bit and 64-bit counters.

The maximum value that can be held in a 32-bit counter is 4,294,967,295. As that is a byte count, that represents 34,359,738,368 bits or 34,359.74 megabits. Divide that by 300 (seconds in 5 minutes) and you get 114.5. Therefore if the average bandwidth is above 114.5Mbit/s for 5 minutes, you will overflow a 32-bit counter. When the counter overflows it wraps back through zero.

Wrapping a counter once is fine. We have to expect that a counter will wrap eventually, and as counters never decrease, if a new value is smaller than the previous one then we know it has wrapped and can still work out what the rate should be.

The problem comes when the counter wraps more than once. There is no way to tell how many times it has wrapped so the monitoring system will have to assume the answer is once. Once traffic reaches ~229Mbit/s the counters will be wrapping at least twice in 5 minutes and the statistics become meaningless.

64-bit counters to the rescue ^

For that reason, network traffic is normally monitored using 64-bit counters. You would have to have a traffic rate of almost 492 Petabit/s to wrap a 64-bit byte counter in 5 minutes.

The thing is, I was already using 64-bit SNMP counters.

Examining the sysfs files ^

I decided to remove SNMP from the equation by going to the source of the data that SNMP uses: the kernel on the device being monitored.

As mentioned, the kernel’s interface byte counters are exposed in sysfs at /sys/class/net/<interface>/statistics/*_bytes. I dumped out those values every 10 seconds and watched them scroll in a terminal session.

What I observed was that these counters, for that particular customer, were wrapping every couple of minutes. I never observed a value greater than 8,469,862,875. That’s larger than a 32-bit counter would hold, but very close to what a 33 bit counter would hold (8,589,934,591).

64-bit counters not to the rescue ^

Once I realised that the kernel’s own counters were wrapping every couple of minutes inside the kernel it became clear that using 64-bit counters in SNMP was not going to help at all, and multiple wraps would be seen in 5 minutes.

What a difference a minute makes ^

To test the hypothesis I switched to 1-minute polling. Here’s what 12 hours of real data looks like under both 5- and 1-minute polling.

As you can see that is a pretty dramatic difference.

The bug ^

By this point, I’d realised that there must be a bug in Xen’s netback driver (the thing that makes virtual network interfaces in dom0).

I went searching through the source of the kernel and found that the counters had changed from an unsigned long in kernel version 4.9 to a u64 in kernel version 4.10.

Of course, once I knew what to search for it was easy to unearth a previous bug report. If I’d found that at the time of the initial report that would have saved 2 days of investigation!

Even so, the fix for this was only committed in February of this year so, unfortunately, is not present in the kernel in use by the current Debian stable. Nor in many other current distributions.

For Xen set-ups on Debian the bug could be avoided by using a backports kernel or packaging an upstream kernel.

Or you could do 1-minute polling as that would only wrap one time at an average bandwidth of ~572Mbit/s and should be safe from multiple wraps up to ~1.1Gbit/s.

Inside the VPS the counters are 64-bit so it isn’t an issue for guest administrators.

A slightly more realistic look at lvmcache

Recap ^

And then… ^

I decided to perform some slightly more realistic benchmarks against lvmcache.

The problem with the initial benchmark was that it only covered 4GiB of data with a 4GiB cache device. Naturally once lvmcache was working correctly its performance was awesome – the entire dataset was in the cache. But clearly if you have enough fast block device available to fit all your data then you don’t need to cache it at all and may as well just use the fast device directly.

I decided to perform some fio tests with varying data sizes, some of which were larger than the cache device.

Test methodology ^

Once again I used a Zipf distribution with a factor of 1.2, which should have caused about 90% of the hits to come from just 10% of the data. I kept the cache device at 4GiB but varied the data size. The following data sizes were tested:

  • 1GiB
  • 2GiB
  • 4GiB
  • 8GiB
  • 16GiB
  • 32GiB
  • 48GiB

With the 48GiB test I expected to see lvmcache struggling, as the hot 10% (~4.8GiB) would no longer fit within the 4GiB cache device.

A similar fio job spec to those from the earlier articles was used:

[cachesize-1g]
size=512m
ioengine=libaio
direct=1
iodepth=8
numjobs=2
readwrite=randread
random_distribution=zipf:1.2
bs=4k
unlink=1
runtime=30m
time_based=1
per_job_logs=1
log_avg_msec=500
write_iops_log=/var/tmp/fio-${FIOTEST}

…the only difference being that several different job files were used each with a different size= directive. Note that as there are two jobs, the size= is half the desired total data size: each job lays out a data file of the specified size.

For each data size I took care to fill the cache with data first before doing a test run, as unreproducible performance is still seen against a completely empty cache device. This produced IOPS logs and a completion latency histogram. Test were also run against SSD and HDD to provide baseline figures.

Results ^

IOPS graphs ^

All-in-one ^

Immediately we can see that for data sizes 4GiB and below performance converges quite quickly to near-SSD levels. That is very much what we would expect when the cache device is 4GiB, so big enough to completely cache everything.

Let’s just have a look at the lower-performing configurations.

Low-end performers ^

For 8, 16 and 32GiB data sizes performance clearly gets progressively worse, but it is still much better than baseline HDD. The 10% of hot data still fits within the cache device, so plenty of acceleration is still happening.

For the 48GiB data size it is a little bit of a different story. Performance is still better (on average) than baseline HDD, but there are periodic dips back down to roughly HDD figures. This is because not all of the 10% hot data fits into the cache device any more. Cache misses cause reads from HDD and consequently end up with HDD levels of performance for those reads.

The results no longer look quite so impressive, with even the 8GiB data set achieving only a few thousand IOPS on average. Are things as bad as they seem? Well no, I don’t think they are, and to see why we will have to look at the completion latency histograms.

Completion latency histograms ^

The above graphs are generated by fitting a Bezier curve to a scatter of data points each of which represents a 500ms average of IOPS achieved. The problem there is the word average.

It’s important to understand what effect averaging the figures gives. We’ve already seen that HDDs are really slow. Even if only a few percent of IOs end up missing cache and going to HDD, the massive latency of those requests will pull the average for the whole 500ms window way down.

Presumably we have a cache because we suspect we have hot spots of data, and we’ve been trying to evaluate that by doing most of the reads from only 10% of the data. Do we care what the average performance is then? Well it’s a useful metric but it’s not going to say much about the performance of reads from the hot data.

The histogram of completion latencies can be more useful. This shows how long it took between issuing the IO and completing the read for a certain percentage of issued IOs. Below I have focused on the 50% to 99% latency buckets, with the times for each bucket averaged between the two jobs. In the interests of being able to see anything at all I’ve had to double the height of the graph and still cut off the y axis for the three worst performers.

A couple of observations:

  • Somewhere between 70% and 80% of IOs complete with a latency that’s so close to SSD performance as to be near-indistinguishable, no matter what the data size. So what I think I am proving is that:

    you can cache a 48GiB slow backing device with 4GiB of fast SSD and if you have 10% hot data then you can expect it to be served up at near-SSD latencies 70%–80% of the time. If your hot spots are larger (not so hot) then you won’t achieve that. If your fast device is larger than 1/12th the backing device then you should do better than 70%–80%.

  • If the cache were perfect then we should expect the 90th percentile to be near SSD performance even for the 32GiB data set, as the 10% hot spot of ~3.2GiB fits inside the 4GiB cache. For whatever reason this is not achieved, but for that data size the 90th percentile latency is still about half that of HDD.
  • When the backing device is many times larger (32GiB+) than the cache device, the 99th percentile latencies can be slightly worse than for baseline HDD.

    I hesitate to suggest there is a problem here as there are going to be very few samples in the top 1%, so it could just be showing close to HDD performance.

Conclusion

Assuming you are okay with using a 4.12..x kernel, and assuming you are already comfortable using LVM, then at the moment it looks fairly harmless to deploy lvmcache.

Getting a decent performance boost out of it though will require you to check that your data really does have hot spots and size your cache appropriately.

Measuring your existing workload with something like blktrace is probably advisable, and these days you can feed the output of blktrace back into fio to see what performance might be like in a difference configuration.

Full test output

You probably want to stop reading here unless the complete output of all the fio runs is of interest to you.
Continue reading “A slightly more realistic look at lvmcache” ^

Tracking down the lvmcache fix

Background ^

In the previous article I covered how, in order to get decent performance out of lvmcache with a packaged Debian kernel, you’d have to use the 4.12.2-1~exp1 kernel from experimental. The kernels packaged in sid, testing (buster) and stable (stretch) aren’t new enough.

I decided to bisect the Linux kernel upstream git repository to find out exactly which commit(s) fixed things.

Results ^

Here’s a graph showing the IOPS over time for baseline SSD and lvmcache with a full cache under several different kernel versions. As in previous articles, the lines are actually Bezier curves fitted to the data which is scattered all over the place from 500ms averages.

What we can see here is that performance starts to improve with commit 4d44ec5ab751 authored by Joe Thornber:

dm cache policy smq: put newly promoted entries at the top of the multiqueue

This stops entries bouncing in and out of the cache quickly.

This is part of a set of commits authored by Joe Thornber on the drivers/md/dm-cache-policy-smq.c file and committed on 2017-05-14. By the time we reach commit 6cf4cc8f8b3b we have the long-term good performance that we were looking for.

The first of Joe Thornber’s commits on that day in the dm-cache area was 072792dcdfc8 and stepping back to the commit immediately prior to that one (2ea659a9ef48) we get a kernel representing the moment that Linus designated the v4.12-rc1 tag. Joe’s commits went into -rc1, and without them the performance of lvmcache under these test conditions isn’t much better than baseline HDD.

It seems like some of Joe’s changes helped a lot and then the last one really provided the long term performance.

git bisect procedure ^

Normally when you do a git bisect you’re starting with something that works and you’re looking for the commit that introduced a bug. In this case I was starting off with a known-good state and was interested in which commit(s) got me there. The normal bisect key words of “good” and “bad” in this case would be backwards to what I wanted. Dominic gave me the tip that I could alias the terms in order to reduce my confusion:

$ git bisect start --term-old broken --term-new fixed

From here on, when I encountered a test run that produced poor results I would issue:

$ git bisect broken

and when I had a test run with good results I would issue:

$ git bisect fixed

As I knew that the tag v4.13-rc1 produced a good run and v4.11 was bad, I could start off with:

$ git bisect reset v4.13-rc1
$ git bisect fixed
$ git bisect broken v4.11

git would then keep bisecting the search space of commits until I would find the one(s) that resulted in the high performance I was looking for.

Good and bad? ^

As before I’m using fio to conduct the testing, with the same job specification:

ioengine=libaio
direct=1
iodepth=8
numjobs=2
readwrite=randread
random_distribution=zipf:1.2
bs=4k
size=2g
unlink=1
runtime=15m
time_based=1
per_job_logs=1
log_avg_msec=500
write_iops_log=/var/tmp/fio-${FIOTEST}

The only difference from the other articles was that the run time was reduced to 15 minutes as all of the interesting behaviour happened within the first 11 minutes.

To recap, this fio job specification lays out two 2GiB files of random data and then starts two processes that perform 4kiB-sized reads against the files. Direct IO is used, in order to bypass the page cache.

A Zipfian distribution with a factor of 1.2 is used; this gives a 90/10 split where about 90% of the reads should come from about 10% of the data. The purpose of this is to simulate the hot spots of popular data that occur in real life. If the access pattern were to be perfectly and uniformly random then caching would not be effective.

In previous tests we had observed that dramatically different performance would be seen on the first run against an empty cache device compared to all other subsequent runs against what would be a full cache device. In the tests using kernels with the fix present the IOPS achieved would converge towards baseline SSD performance, whereas in kernels without the fix the performance would remain down near the level of baseline HDD. Therefore the fio tests were carried out twice.

Where to next? ^

I think I am going to see what happens when the cache device is pretty small in comparison to the working data.

All of the tests so far have used a 4GiB cache with 4GiB of data, so if everything got promoted it would entirely fit in cache. Not only that but the Zipf distribution makes most of the hits come from 10% of the data, so it’s actually just ~400MiB of hot data. I think it would be interesting to see what happens when the hot 10% is bigger than the cache device.

git bisect progress and test output ^

Unless you are particularly interested in the fio output and why I considered each one to be either fixed or broken, you probably want to stop reading now.

Continue reading “Tracking down the lvmcache fix”

lvmcache with a 4.12.3 kernel

Background ^

In the previous two articles I had discovered that lvmcache had amazing performance on an empty cache but then on every run after that (i.e. when the cache device was full of junk) went scarcely better than baseline HDD.

A few days ago I happened across an email on the linux-lvm list where Mike Snitzer advised:

the [CentOS] 7.4 dm-cache will be much more performant than the 7.3 cache you appear to be using.

…and…

It could be that your workload isn’t accessing the data enough to warrant promotion to the cache. dm-cache is a “hotspot” cache. If you aren’t accessing the data repeatedly then you won’t see much benefit (particularly with the 7.3 and earlier releases).

Just to get a feel, you could try the latest upstream 4.12 kernel to see how effective the 7.4 dm-cache will be for your setup.

I don’t know what kernel version CentOS 7.3 uses, but the VM I’m testing with is Debian testing (buster), so some version of 4.11.x plus backported patches.

That seemed pretty new, but Mike is suggesting 4.12.x so I thought I’d re-test lvmcache with the latest stable upstream kernel, which at the time of writing is version 4.12.3.

Test methodology ^

This time around I only focused on fio tests, using the same settings as before:

[partial]
ioengine=libaio
direct=1
iodepth=8
numjobs=2
readwrite=randread
random_distribution=zipf:1.2
bs=4k
size=2g
unlink=1
runtime=20m
time_based=1
per_job_logs=1
log_avg_msec=500
write_iops_log=/var/tmp/fio-${FIOTEST}

The only changes were:

  1. to reduce the run time to 20 minutes from 30 minutes, since all the interesting things happened within the first 20 minutes before.
  2. to write an IOPS log entry every 500ms instead of ever 1000ms, as the log files were quite small and some higher resolution might help smooth graphs out.

Last time there was a dramatic difference between the initial run with an empty cache and subsequent runs with a cache volume full of junk, so I did a test for each of those conditions, as well as tests for the baseline SSD and HDD.

The virtual machine had been upgraded from Debian 9 (stretch) to testing (buster), so it still had packaged kernel versions 4.9.30-2 and 4.11.6-1 laying around to test things with. In addition I compiled up version 4.12.3 by copying the .config from 4.11.6-1 then doing make oldconfig accepting all defaults.

Results ^

Although the fio job spec was essentially the same as in the previous article, I have since worked out how to merge the IOPS logs from both jobs so the graphs will seem to show about double the IOPS as they did before.

All-in-one ^

Well that’s an interesting set of graphs but rather hard to distinguish. Let’s try that by kernel version.

Baseline SSD by kernel version ^

A couple of weird things here:

  1. 4.12.3 and 4.11.6-1 are actually fairly consistent, but 4.9.30-2 varies rather a lot.
  2. All kernels show a sharp dip a few minutes in. I don’t know what that is about.

Although these lines do look quite far apart, bear in mind that this graph’s y axis starts at 92k IOPS. The average IOPS didn’t vary that much:

Average IOPS by kernel version
4.9.30-2 4.11.6-1 4.12.3
102,325 102,742 104,352

So there was actually only a 1.9% difference between the worst performer and the best.

Baseline HDD by kernel version ^

4.9.30-2 and 4.12.3 are close enough here to probably be within the margin of error, but there is something weird going on with 4.11.6-1.

Its average IOPS across the 20 minute test were only 56% of those for 4.12.3 and 53% of those for 4.9.30-2, which is quite a big disparity. I re-ran these tests 5 times to check it wasn’t some anomaly, but no, it’s reproducible.

Maybe something to look into another day.

lvmcache by kernel version ^

Dragging things back to the point of this article: previously we discovered that lvmcache worked great the first time through, when its cache volume was completely empty, but then subsequent runs all absolutely sucked. They didn’t perform significantly better than HDD baseline.

Let’s graph all the lvmcache results for each kernel version against the SSD baseline for that kernel to see if things changed at all.

lvmcache 4.9.30-2 ^

This is the similar to what we saw before: an empty cache volume produces decent results of around 47k IOPS. Although it’s interesting that the second job is down around 1k IOPS. Again the results on a full cache are poor. In fact the results for the second job of the empty cache are about the same as the results for both jobs on a full cache.

lvmcache 4.11.6-1 ^

Same story again here, although the performance is a little higher. Again the first job on an empty cache is getting the big results of almost 60k IOPS while the second job—and both jobs on a full cache—get only around 1k IOPS.

lvmcache 4.12.3 ^

Wow. Something dramatic has been fixed. The performance on an empty cache is still better, but crucially the performance on a full cache pretty quickly becomes very close to baseline SSD.

Also the runs against both the empty and full cache device result in both jobs getting roughly the same IOPS performance rather than the first job being great and all others very poor.

What’s next? ^

It’s really encouraging that the performance is so much better with 4.12.3. It’s changed lvmcache from a “hmm, maybe” option to one that I would strongly consider using anywhere I could.

It’s a shame though that such a new kernel is required. The kernel version in Debian testing (buster) is currently 4.11.6-1. Debian experimental’s linux-image-4.12.0-trunk-amd64 package currently has version 4.12.2-1 so I should test if that is new enough I tested to see if that was new enough.

Failing that I think I should git bisect or similar in order to find out exactly which changeset fixed this, so I could have some chance of knowing when it hits a packaged version.

Continue reading “lvmcache with a 4.12.3 kernel”