Debian-installer, mdadm configuration and the Bad Blocks Controversy

Updates! ^

Since this was posted on 2020-09-13 there was some interest in the comments and on Hacker News and I learned some things which required updates. I’ve tried to indicate them with struck out text.

Of particular note is the re-add method of removing BBLs.

MD and mdadm ^

MD is the Linux kernel driver that is used for running software RAID arrays. mdadm is the software that you run to manage MD devices. They are both part of the same project.

First, about the Bad Blocks List ^

Since about 2010, MD has had a bad blocks log (BBL) feature. When it fails to read from an underlying device it will (sometimes?) mark that block as bad and read the correct data from a different device, and then forever more redirect reads away from those bad blocks. This feature defaults to being on.

One problem with this feature is that read errors can occur for many reasons besides permanent failure of part of a storage device. For example, it could be a failure of the backplane or controller that causes many read errors on multiple devices, or the devices could be reached over a network of some sort and temporary network problems could propagate errors.

Even if the particular part of the device is unreadable, the operating system is supposed to try to write the correct data over the top. This write will either clear the problem or else be redirected to a spare sector on the drive by the drive’s firmware. The operating system is not supposed to be taking on this role, the drives are, and when the drives fail to do so then the redundancy of the array is supposed to save the day.

Even worse, there are apparently bugs somewhere in the BBL code that cause a device’s BBL to be copied onto a new device when the array is rebuilt or a device replaced. Clearly it does not make sense for a new device to get a copy of another device’s BBL because they are inherently a per-device thing. So far there has been no successful intentional reproduction of this, only people unwittingly hitting it at the worst possible moments. It has been reproduced that adding or replacing a device results in a BBL being copied. I am not aware of a formal bug report for this yet.

mdadm doesn’t even try particularly hard to warn you if a new bad block is found. Unlike when a device fails, it doesn’t send you an email. The MD driver writes in the syslog about the bad block(s). There’s also no change to /proc/mdstat. You have to examine some files in sysfs.

As a result the current situation is that:

No one seems to have made any progress on fixing any of this in 10 years.

Doing something about it ^

I’ll say right now that this story doesn’t (yet?) have a satisfying ending.

I’ve been aware of the “Bad Blocks Controversy” for about 5 years but I haven’t ever personally experienced any problems and it was always at the bottom of my list to look at. Roy’s recent thread spurred me into deciding that in future no MD array I created would have a BBL.

I also took the opportunity to deploy Sarah Newman’s Ansible role which checks all array components have an empty BBL. None of BitFolk‘s array components currently have any entries in their BBLs – phew!

Removing an existing BBL ^

Currently the only way to remove a BBL from an array component is to stop the array and then assemble it with an argument like this:

There are two ways to remove the BBL from the devices of existing arrays.

Fail and re-add each device with update

It doesn’t seem to be documented anywhere, but you can fail a device out of an array and re-add it with an update to remove the BBL on that device, like this:

# mdadm --fail /dev/md0 /dev/sdb1 \
        --remove /dev/sdb1 \
        --re-add /dev/sdb1 \
        --update=no-bbl
mdadm: set /dev/sdb1 faulty in /dev/md0                                              
mdadm: hot removed /dev/sdb1 from /dev/md0                   
mdadm: re-added /dev/sdb1

This will only work if your array has a bitmap, otherwise it will refuse to re-add. Most arrays do get a bitmap, but small arrays won’t by default. Fortunately you can easily add a bitmap like this:

# mdadm --grow --bitmap=internal /dev/md0

The downside of this approach is that your array will have reduced redundancy while it rebuilds. It should rebuild pretty quickly though as the bitmap will cause only changed parts to be rewritten.

(This won’t work if a BBL currently has any entries)

Stop the array and assemble again with update

The other way to remove BBL from devices is to stop the array and assemble it manually like this:

# mdadm --assemble /dev/mdX --update=no-bbl

The big problem with this is that stopping the array obviously causes downtime for whatever is using it. If your root filesystem is on an MD array (and why wouldn’t it be, if you use MD?) then that means the entire server, and you’re having to do this from sort of rescue environment.

I have suggested that a config option be added to remove a BBL on assembly, so that this will happen the next time the machine is rebooted. This does not appear to have provoked any interest.

This method is quicker since it operates on all devices and doesn’t require a rebuild, but personally I usually find downtime more painful so I’d be inclined to schedule an “at-risk” maintenance window and do it the re-add way.

Avoiding the BBL at creation time ^

So if the BBL cannot be easily removed, at least it can be prevented from ever existing, right? When Neil Brown, the previous MD maintainer, was asked in 2016 if the feature could be defaulted to off, Neil said that putting this in the config file was as good as that:

CREATE bbl=no

The thing is, it’s not as good as disabling it by default when you consider what many users’ experience is of running the mdadm command: they don’t run mdadm, something else runs it for them. I’d go as far as to say that the majority of uses of mdadm are done by helper scripts and installers, not by human beings.

If it’s a program that is running mdadm for you then you are going to have to find out how to set that mdadm.conf before it reads it.

Take for example my own process of installing Debian. I do it by booting the Debian Installer by PXE. I have some pre-seeding done to answer a lot of the installer questions, but actually I do still do the disk partitioning stage in the installer’s text interface.

So there I was thinking this is actually going to be quite simple, because the Debian Installer is really lovely about letting you execute a shell and poke around. Surely all I am going to need to do is open a shell once and edit /etc/mdadm/mdadm.conf and then go back into the mdcfg menu and carry on, right? Oh dear me no.

You can read the details of my wild ride that involved me uploading a binary of strace into the d-i to run mdadm under to work out what was going on, but just the relevant discoveries are in this article for those who’d rather not.

mdadm in d-i uses a config file at /tmp/mdadm.conf

After quite a bit of confusion over why even arrays I created manually with the mdadm command in the d-i shell still had a BBL, I discovered that the mdadm binary in d-i is compiled to have its config at /tmp/mdadm.conf. I don’t know why, but probably there is a good reason.

(At this point a number of people responded, “that’s because everything else will be set read-only.” That’s not the case with debian-installer which runs entirely off of a tmpfs. It’s all writeable.)

So just make the edit to /tmp/mdadm.conf then?

Oh ho ho no. Every time you go into the MD configuration section (mdcfg) it clobbers its own /tmp/mdadm.conf, and you can’t get to the “execute a shell” option without returning to the MD configuration section.

If you’re on something with multiple virtual consoles (like if you’re sitting in front of a conventional PC) then you could switch to one of those after you’ve entered the MD configuration part and modify /tmp/mdadm.conf then. I don’t have that option because I’m on a serial console.

I thought I didn’t have that option because I’m on a serial console, but it was pointed out to me that when the Debian installer detects it’s running in a serial console it runs itself under GNU Screen. So, by using the usual screen commands of ctrl+a n or ctrl+a p, one can switch backwards and forwards through the different virtual consoles. Neat!

There is also an earlier option to load an installer component that enables one to continue the installation process over SSH. If you select that then you can SSH in to the running installer system so if you do that after you’ve entered the MD configuration bit in your main console then I guess you can then edit the config file and continue.

By one of those methods of getting a shell, after you’ve already entered the array configuration part but before you’ve actually created any arrays, I think you could edit /tmp/mdadm.conf to have “CREATE bbl=no” and the installer’s mdadm binary would respect that when you switch back.

Alternatively you could just use the shell to create your arrays instead of using the Ddebian installer to do it. If it’s a simple case where you’ve just got an sda and an sdb disk identically partitioned and you want to make a bunch of arrays on them, it can be a fairly legible shell session like:

~ # mkdir -vp /etc/mdadm && echo "CREATE bbl=no" > /etc/mdadm/mdadm.conf
~ # for part in 1 2 3 5; do \
      mdadm --create \
            -v \
            --config=/etc/mdadm/mdadm.conf \
            /dev/md${part} \
            --level=1 \
            --raid-devices=2 \
            /dev/sd[ab]${part}; \
    done

Do not try this until you understand exactly what it is doing.

It iterates the list 1, 2, 3, 5 (I use the 4th partition for something else) and makes arrays called mdX out of sdaX and sdbX. The mdadm binary is forced to use our config file that disables creation of a BBL.

You can verify that a BBL does not exist on any of the array components like this:

~ # mdadm --examine-badblocks /dev/sda1
No bad-blocks list configured on /dev/sda1

You should get identical output for every component. If a component did have a BBL it would output something like this:

~ # mdadm --examine-badblocks /dev/sda1
Bad-blocks list is empty in /dev/sda1

You can then exit the d-i shell and go back to the disk partitioning section. You won’t need the MD configuration part now but even if you do go into it, it should detect all your manually-created arrays.

How to make progress? ^

All of this isn’t great but at least it’s fairly easy to pause the Debian installer and take some manual action. I suspect users of other Linux distributions may not be so lucky, and so I too think it would be a good idea if this buggy feature was disabled by default, or at least if there were a way to tell mdadm to remove the BBL on assembly.

In fact I would very much like to be able to tell it to remove the BBL on assembly so that I can disable the BBL feature on all my existing servers.

mdadm actually gets called by udev from inside the initramfs in incremental assembly mode, so I think the incremental assembly code needs to look in the config file for this “remove all the BBLs” directive and do it then during assembly as if update=no-bbl had been specified on a command line.

It should be possible to write a script that:

  1. Looks in /sys/block/md* to find device components of all arrays.
  2. Checks each one to see if it has a BBL.
  3. If any are found, add a bitmap if necessary.
  4. Do the fail/remove/re-add trick on each one in turn, waiting for the array to go back into sync each time.

i.e. it should be possible to automate this and run it at the end of an install so the entire install process can remain automated, or run it on a host any time after it’s been provisioned.

Experiments with RDRAND and EntropyKey

Entropy, when the shannons are gone and you can’t go on ^

The new release of Debian 10 (buster) brings with it some significant things related to entropy:

  1. systemd doesn’t trust entropy saved at last boot
  2. Many system daemons now use getrandom() which requires the CRNG be primed with good entropy
  3. The kernel by default trusts the CPU’s RDRAND instruction if it’s available

A lot of machines — especially virtual machines — don’t have access to a lot of entropy when they start up, and now that systemd isn’t accrediting stored entropy from the previous boot some essential services like ssh may take minutes to start up.

Back in 2011 or so, Intel added a CPU instruction called RDRAND which provides entropy, but there was some concern that it was an unauditable feature that could easily have been compromised, so it never did get used as the sole source of entropy on capable CPUs.

Later on, an option to trust the CPU for providing boot-time entropy was added, and this option was enabled by default in Debian kernels from 10.0 onwards.

I am okay with using RDRAND for boot-time entropy, but some people got very upset about it.

Out of interest I had a look at what effect the various kernel options related to RDRAND would have, and also what about when I use BitFolk’s entropy service.

(As of July 2019 this wiki article is in dire need of rewrite since I believe it states some untrue things about urandom, but the details of what the entropy service is and how to use it are correct)

Experiments ^

These experiments were carried out on a virtual machine which is a default install of Debian 10 (buster) on BitFolk. At package selection only “Standard system utilities” and “SSH server” were selected.

Default boot ^

SSH is available just over 1 second after boot.

[    1.072760] random: get_random_bytes called from start_kernel+0x93/0x52c with crng_init=0
[    1.138541] random: crng done (trusting CPU's manufacturer)

Don’t trust RDRAND for early entropy ^

If I tell the kernel not to trust RDRAND for early entropy by using random.trust_cpu=off on the kernel command line then SSH is available after about 4.5 seconds.

[    1.115416] random: get_random_bytes called from start_kernel+0x93/0x52c with crng_init=0
[    1.231606] random: fast init done
[    4.260130] random: systemd-random-: uninitialized urandom read (512 bytes read)
[    4.484274] random: crng init done

Don’t use RDRAND at all ^

If I completely disable the kernel’s use of RDRAND by using nordrand on the kernel command line then SSH is available after just under 49 seconds.

[    1.110475] random: get_random_bytes called from start_kernel+0x93/0x52c with crng_init=0
[    1.225991] random: fast init done
[    4.298185] random: systemd-random-: uninitialized urandom read (512 bytes read)
[    4.674676] random: dbus-daemon: uninitialized urandom read (12 bytes read)
[    4.682873] random: dbus-daemon: uninitialized urandom read (12 bytes read)
[   48.876084] random: crng init done

Use entropy service but not RDRAND ^

If I disable RDRAND but use BitFolk’s entropy service then SSH is available in just over 10 seconds. I suppose this is slower than with random.trust_cpu=off because in that case RDRAND is still allowed after initial seeding, and we must wait for a userland daemon to start.

Using the entropy service requires the network to be up so I’m not sure how easy it would be to decrease this delay, but 10 seconds is still a lot better than 49 seconds.

[    1.075910] random: get_random_bytes called from start_kernel+0x93/0x52c with crng_init=0
[    1.186650] random: fast init done
[    4.207010] random: systemd-random-: uninitialized urandom read (512 bytes read)
[    4.606789] random: dbus-daemon: uninitialized urandom read (12 bytes read)
[    4.613975] random: dbus-daemon: uninitialized urandom read (12 bytes read)
[   10.257513] random: crng init done

Use entropy service but don’t trust CPU for early seeding ^

This was no different to just random.trust_cpu=off (about 4.5s). I suspect because early seeding completed and then RDRAND supplied more entropy before the network came up and the entropy service daemon could start.

Thoughts ^

I’m glad that my CPUs have RDRAND and I’m prepared to use it for boot-time seeding of the CSPRNG, but not as the machines’ sole entropy source.

With RDRAND available, using the BitFolk entropy service probably doesn’t make that much sense as RDRAND will always be able to supply.

More paranoid customers may want to use random.trust_cpu=off but even then probably don’t need the entropy service since once the CSPRNG is seeded, RDRAND can be mixed in and away they go.

The truly paranoid may want to disable RDRAND in which case using the entropy service would be recommended since otherwise long delays at boot will happen and severe delays during times of high entropy demand could be seen.

For those who aren’t BitFolk customers and don’t have access to hardware entropy sources and don’t have a CPU with RDRAND support there are some tough choices. Every other option listed on Debian’s relevant wiki article has at least one expert who says it’s a bad choice.

Disabling edge tiling on GNOME 3.28 / Debian testing (buster)

We’ve been here before ^

In an earlier post I mentioned how to disable edge tiling. That was for my desktop machine which at the time was running Ubuntu 17.10 and GNOME 3.26.

My laptop, however, currently runs Debian testing (buster) with GNOME 3.28, and this method does not work.

Things that work ^

In fact, one of the ways the Internet suggested that didn’t work for Ubuntu, does work on my Debian laptop. That is:

$ gsettings set org.gnome.shell.overrides edge-tiling false

I have no idea why, sorry.

Things that don’t work ^

So, for my Debian buster laptop running GNOME 3.28 under Xorg, the things that don’t work are:

$ dconf write /org/gnome/shell/extensions/classic-overrides/edge-tiling false
$ dconf write /org/gnome/mutter/edge-tiling false
$ dconf write /org/gnome/shell/overrides/edge-tiling false

Let’s Encrypt wildcard certificates, acme.sh and automated DNS verification

Let’s Encrypt’s wildcard certificates ^

Now that Let’s Encrypt can issue wildcard TLS certificates I found some time to look into that.

I already use a Lua script with haproxy which takes care of automatically answering http-01 ACME challenges, but to issue/renew a wildcard certificate you need to answer a dns-01 challenge. A different client/setup would be needed.

dns-01 ACME challenges ^

Most of the clients that support ACME v2 offer a range of integrations for DNS providers, plus a manual mode that prints out the DNS record that you need to add and then waits for you to indicate that you’ve done it. I run my own DNS infrastructure so the thing to do would be RFC2136 dynamic DNS updates.

One wrinkle here is that currently none of my DNS zones have dynamic updates enabled. At the moment I manage them as zone files (some are automatically generated by scripts though). After looking at a few of the client options I found that acme.sh supports an “alias zone”.

Basically, in your main zone you create a CNAME for the challenge record that points at another zone, and then enable dynamic updates in that other zone. The other zone is dedicated for this purpose, so the only updates which will be happening will be for the purpose of answering dns-01 ACME challenges. I made my dynamic zone a sub-zone of my main one:

strugglers.net zone file content ^

These records need to be added to the main zone for this to work.

.
.
.
; sub-zone purely used for dns-01 ACME challenges.
acmesh          NS a.authns.bitfolk.co.uk.
                NS b.authns.bitfolk.com.
                NS c.authns.bitfolk.com.
 
; Alias the dns-01 challenge record into the dedicated zone.
_acme-challenge CNAME _acme-challenge.acmesh.strugglers.net.
.
.
.

acmesh.strugglers.net zone file content ^

Initially this just needs to be an empty zone with only SOA and NS records, so this is the entire content of the file.

$ORIGIN .
$TTL 86400      ; 1 day
acmesh.strugglers.net   IN SOA  a.authns.bitfolk.co.uk. hostmaster.bitfolk.com. (
                                2018031905 ; serial
                                14400      ; refresh (4 hours)
                                7200       ; retry (2 hours)
                                1209600    ; expire (2 weeks)
                                43200      ; minimum (12 hours)
                                )
                        NS      a.authns.bitfolk.co.uk.
                        NS      b.authns.bitfolk.com.
                        NS      c.authns.bitfolk.com.

DNS server configuration ^

The DNS server needs to know a key by which it will authenticate acme.sh‘s updates, and also needs to be told that the new zone is a dynamic zone. I use BIND, so it goes as follows.

Generate a key for dynamic DNS updates ^

Use the dnssec-keygen command to generate a key suitable for authenticating DNS updates.

$ dnssec-keygen -r /dev/urandom -a HMAC-SHA512 -b 512 -n HOST DDNS_UPDATE

This creates two files named like Kddns_update.+165+14059.key and Kddns_update.+165+14059.private.

Put the key in the BIND config ^

Look in the private file and take the key from the line that starts “Key:”. Put that in some config file that you will load into your BIND like this:

key "strugglers" {
    algorithm hmac-sha512;
    secret "Sb8nvwpO8bhiU4haPB+NiJKoMO6vVJumrr29Bj3daSuB8hBoTKoqPKMBKTYLRUv12pbKPwJATgdsU6BtL4Hmcw==";
};

The thing in quotes after “key” is a symbolic name for this key and can be anything that makes sense to you. The “secret” is the key from the private file. You can delete the two Kddns_update.+165+14059.* files now.

Put the new zone into the BIND config ^

The config for the zone itself looks something like this:

zone "acmesh.strugglers.net" {
    type master;
    file "/path/to/acmesh.strugglers.net";
    allow-update {
        key "strugglers";
    };
};

Reload the DNS server ^

Once BIND has been reloaded the log file should indicate that the acemsh.strugglers.net zone was loaded correctly, and in my case that triggers DNS NOTIFY to my secondary servers which automatically begin zone transfers.

Check things out with nsupdate ^

At this point it might be worth using the nsupdate command to check that you can do dynamic DNS updates.

Just type the nsupdate line in the shell, the > is a prompt at which you will type the updates you wish to send. We’ll add a trivial TXT record. The -k argument is the path to the file containing the key.

$ nsupdate -k /path/to/strugglers.key -v
> server a.authns.bitfolk.co.uk
> debug yes
> zone acmesh.strugglers.net.
> update add foo.acmesh.strugglers.net. 86400 TXT "bar"
> show
Outgoing update query:
;; ->>HEADER<<- opcode: UPDATE, status: NOERROR, id:      0
;; flags:; ZONE: 0, PREREQ: 0, UPDATE: 0, ADDITIONAL: 0
;; ZONE SECTION:
;acmesh.strugglers.net.         IN      SOA
 
;; UPDATE SECTION:
foo.acmesh.strugglers.net. 86400 IN     TXT     "bar"
 
> send
Sending update to 85.119.80.222#53
Outgoing update query:
;; ->>HEADER<<- opcode: UPDATE, status: NOERROR, id:  19987
;; flags:; ZONE: 1, PREREQ: 0, UPDATE: 1, ADDITIONAL: 1
;; ZONE SECTION:
;acmesh.strugglers.net.         IN      SOA
 
;; UPDATE SECTION:
foo.acmesh.strugglers.net. 86400 IN     TXT     "bar"
 
;; TSIG PSEUDOSECTION:
strugglers.             0       ANY     TSIG    hmac-sha512. 1521454639 300 64 dPndp1/ZyqzmSEn0AKIsGR62HrsplJBhntWioM4oBdPlNXUIAwg7Jwpg DGSM2S3kY+5hfGTleNqwXZrMvnBhUQ== 19987 NOERROR 0 
 
 
Reply from update query:
;; ->>HEADER<<- opcode: UPDATE, status: NOERROR, id:  19987
;; flags: qr; ZONE: 1, PREREQ: 0, UPDATE: 0, ADDITIONAL: 1
;; ZONE SECTION:
;acmesh.strugglers.net.         IN      SOA
 
;; TSIG PSEUDOSECTION:
strugglers.             0       ANY     TSIG    hmac-sha512. 1521454639 300 64 NfH/78kvq6f+59RXnyJwC6kfFRLGjG6Rh9jdYRId7UjH0jwIbtRVpqCu xx4HToGmlJrDTUqpgbYZq2orUOZlkQ== 19987 NOERROR 0
 
> [Ctrl-D]

And to verify it really got added (though the status of NOERROR should be confirmation enough):

$ dig +short -t txt foo.acmesh.strugglers.net
"bar"

That it; you can do dynamic DNS updates.

acme.sh ^

I’m going to assume you’ve installed acme.sh according to one of its supported installation methods. Personally I am not into curl | sh so I:

  • Create a system user that can’t log in.
  • git clone the source.
  • acme.sh --install it as that user.

acme.sh doesn’t have to be run on the primary DNS server, because it’s going to use a dynamic DNS update to do all the DNS things. It just needs access to the dynamic DNS update key file. Either you can install acme.sh on each host that will need to generate/renew certificates and copy the DNS key there, or else do all the certificate generation/renewal in one place and copy the certificate files around.

However you manage it, make sure that the user you’re going to run acme.sh as can read the dynamic DNS update key file.

Issuing the first wildcard certificate ^

The first time you issue the certificate you need to set NSUPDATE_KEY and NSUPDATE_SERVER in your environment. After the first successful issuance acme.sh will store these variables in its configuration for use in the automated renewals.

$ NSUPDATE_SERVER=a.authns.bitfolk.co.uk NSUPDATE_KEY=/path/to/strugglers.key ./acme.sh --issue -d strugglers.net -d '*.strugglers.net' --challenge-alias acmesh.strugglers.net --dns dns_nsupdate
[Mon 19 Mar 09:19:00 UTC 2018] Multi domain='DNS:strugglers.net,DNS:*.strugglers.net'
[Mon 19 Mar 09:19:00 UTC 2018] Getting domain auth token for each domain
[Mon 19 Mar 09:19:03 UTC 2018] Getting webroot for domain='strugglers.net'
[Mon 19 Mar 09:19:03 UTC 2018] Getting webroot for domain='*.strugglers.net'
[Mon 19 Mar 09:19:04 UTC 2018] Found domain api file: /path/to/acmesh/dnsapi/dns_nsupdate.sh
[Mon 19 Mar 09:19:04 UTC 2018] adding _acme-challenge.acmesh.strugglers.net. 60 in txt "WmenhbXRtenhpNLYLOBjznyHcVvFk-jjxurCVTrhWc8"
[Mon 19 Mar 09:19:04 UTC 2018] Found domain api file: /path/to/acmesh/dnsapi/dns_nsupdate.sh
[Mon 19 Mar 09:19:04 UTC 2018] adding _acme-challenge.acmesh.strugglers.net. 60 in txt "fwZPUBHijOQkJJaoOF_nIn3Z_FtuVU9R635NDVz_hPA"
[Mon 19 Mar 09:19:04 UTC 2018] Sleep 120 seconds for the txt records to take effect

At this point a DNS update has been crafted and sent so you should see your zone update and zone transfer happen to any secondary servers. If that doesn’t happen within 120 seconds then when Let’s Encrypt tries to verify the challenge it might query a DNS server that doesn’t yet have the record. Your zone transfers need to be reliable.

[Mon 19 Mar 09:21:08 UTC 2018] Verifying:strugglers.net
[Mon 19 Mar 09:21:12 UTC 2018] Success
[Mon 19 Mar 09:21:12 UTC 2018] Verifying:*.strugglers.net
[Mon 19 Mar 09:21:15 UTC 2018] Success
[Mon 19 Mar 09:21:15 UTC 2018] Removing DNS records.
[Mon 19 Mar 09:21:15 UTC 2018] removing _acme-challenge.acmesh.strugglers.net. txt
[Mon 19 Mar 09:21:16 UTC 2018] removing _acme-challenge.acmesh.strugglers.net. txt
[Mon 19 Mar 09:21:16 UTC 2018] Verify finished, start to sign.
[Mon 19 Mar 09:21:18 UTC 2018] Cert success.
-----BEGIN CERTIFICATE-----
MIIFETCCA/mgAwIBAgISAz4ZQV27n1FgemVAEhIqiUZnMA0GCSqGSIb3DQEBCwUA
MEoxCzAJBgNVBAYTAlVTMRYwFAYDVQQKEw1MZXQncyBFbmNyeXB0MSMwIQYDVQQD
.
.
.
NeAmr5I=
-----END CERTIFICATE-----
[Mon 19 Mar 09:21:18 UTC 2018] Your cert is in  /path/to/acmesh/.acme.sh/strugglers.net/strugglers.net.cer 
[Mon 19 Mar 09:21:18 UTC 2018] Your cert key is in  /path/to/acmesh/.acme.sh/strugglers.net/strugglers.net.key 
[Mon 19 Mar 09:21:18 UTC 2018] The intermediate CA cert is in  /path/to/acmesh/.acme.sh/strugglers.net/ca.cer 
[Mon 19 Mar 09:21:18 UTC 2018] And the full chain certs is there:  /path/to/acmesh/.acme.sh/strugglers.net/fullchain.cer

Examining a certificate ^

Just for peace of mind…

$ openssl x509 -text -noout -certopt no_subject,no_header,no_version,no_serial,no_signame,no_subject,no_issuer,no_pubkey,no_sigdump,no_aux -in /path/to/acmesh/.acme.sh/strugglers.net/strugglers.net.cer
        Validity
            Not Before: Mar 19 08:21:17 2018 GMT
            Not After : Jun 17 08:21:17 2018 GMT
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment
            X509v3 Extended Key Usage: 
                TLS Web Server Authentication, TLS Web Client Authentication
            X509v3 Basic Constraints: critical
                CA:FALSE
            X509v3 Subject Key Identifier: 
                BF:C7:8E:F5:87:05:D0:6E:15:AC:7B:37:9F:82:05:C3:E3:11:B7:32
            X509v3 Authority Key Identifier: 
                keyid:A8:4A:6A:63:04:7D:DD:BA:E6:D1:39:B7:A6:45:65:EF:F3:A8:EC:A1
 
            Authority Information Access: 
                OCSP - URI:http://ocsp.int-x3.letsencrypt.org
                CA Issuers - URI:http://cert.int-x3.letsencrypt.org/
 
            X509v3 Subject Alternative Name: 
                DNS:*.strugglers.net, DNS:strugglers.net
            X509v3 Certificate Policies: 
                Policy: 2.23.140.1.2.1
                Policy: 1.3.6.1.4.1.44947.1.1.1
                  CPS: http://cps.letsencrypt.org
                  User Notice:
                    Explicit Text: This Certificate may only be relied upon by Relying Parties and only in accordance with the Certificate Policy found at https://letsencrypt.org/repository/

From the Subject Alternative Name we can see it is a wildcard certificate.

Tracking down the lvmcache fix

Background ^

In the previous article I covered how, in order to get decent performance out of lvmcache with a packaged Debian kernel, you’d have to use the 4.12.2-1~exp1 kernel from experimental. The kernels packaged in sid, testing (buster) and stable (stretch) aren’t new enough.

I decided to bisect the Linux kernel upstream git repository to find out exactly which commit(s) fixed things.

Results ^

Here’s a graph showing the IOPS over time for baseline SSD and lvmcache with a full cache under several different kernel versions. As in previous articles, the lines are actually Bezier curves fitted to the data which is scattered all over the place from 500ms averages.

What we can see here is that performance starts to improve with commit 4d44ec5ab751 authored by Joe Thornber:

dm cache policy smq: put newly promoted entries at the top of the multiqueue

This stops entries bouncing in and out of the cache quickly.

This is part of a set of commits authored by Joe Thornber on the drivers/md/dm-cache-policy-smq.c file and committed on 2017-05-14. By the time we reach commit 6cf4cc8f8b3b we have the long-term good performance that we were looking for.

The first of Joe Thornber’s commits on that day in the dm-cache area was 072792dcdfc8 and stepping back to the commit immediately prior to that one (2ea659a9ef48) we get a kernel representing the moment that Linus designated the v4.12-rc1 tag. Joe’s commits went into -rc1, and without them the performance of lvmcache under these test conditions isn’t much better than baseline HDD.

It seems like some of Joe’s changes helped a lot and then the last one really provided the long term performance.

git bisect procedure ^

Normally when you do a git bisect you’re starting with something that works and you’re looking for the commit that introduced a bug. In this case I was starting off with a known-good state and was interested in which commit(s) got me there. The normal bisect key words of “good” and “bad” in this case would be backwards to what I wanted. Dominic gave me the tip that I could alias the terms in order to reduce my confusion:

$ git bisect start --term-old broken --term-new fixed

From here on, when I encountered a test run that produced poor results I would issue:

$ git bisect broken

and when I had a test run with good results I would issue:

$ git bisect fixed

As I knew that the tag v4.13-rc1 produced a good run and v4.11 was bad, I could start off with:

$ git bisect reset v4.13-rc1
$ git bisect fixed
$ git bisect broken v4.11

git would then keep bisecting the search space of commits until I would find the one(s) that resulted in the high performance I was looking for.

Good and bad? ^

As before I’m using fio to conduct the testing, with the same job specification:

ioengine=libaio
direct=1
iodepth=8
numjobs=2
readwrite=randread
random_distribution=zipf:1.2
bs=4k
size=2g
unlink=1
runtime=15m
time_based=1
per_job_logs=1
log_avg_msec=500
write_iops_log=/var/tmp/fio-${FIOTEST}

The only difference from the other articles was that the run time was reduced to 15 minutes as all of the interesting behaviour happened within the first 11 minutes.

To recap, this fio job specification lays out two 2GiB files of random data and then starts two processes that perform 4kiB-sized reads against the files. Direct IO is used, in order to bypass the page cache.

A Zipfian distribution with a factor of 1.2 is used; this gives a 90/10 split where about 90% of the reads should come from about 10% of the data. The purpose of this is to simulate the hot spots of popular data that occur in real life. If the access pattern were to be perfectly and uniformly random then caching would not be effective.

In previous tests we had observed that dramatically different performance would be seen on the first run against an empty cache device compared to all other subsequent runs against what would be a full cache device. In the tests using kernels with the fix present the IOPS achieved would converge towards baseline SSD performance, whereas in kernels without the fix the performance would remain down near the level of baseline HDD. Therefore the fio tests were carried out twice.

Where to next? ^

I think I am going to see what happens when the cache device is pretty small in comparison to the working data.

All of the tests so far have used a 4GiB cache with 4GiB of data, so if everything got promoted it would entirely fit in cache. Not only that but the Zipf distribution makes most of the hits come from 10% of the data, so it’s actually just ~400MiB of hot data. I think it would be interesting to see what happens when the hot 10% is bigger than the cache device.

git bisect progress and test output ^

Unless you are particularly interested in the fio output and why I considered each one to be either fixed or broken, you probably want to stop reading now.

Continue reading “Tracking down the lvmcache fix”

Using a TOTP app for multi-factor SSH auth

I’ve been playing around with enabling multi-factor authentication (MFA) on web services and went with TOTP. It’s pretty simple to implement in Perl, and there are plenty of apps for it including Google Authenticator, 1Password and others.

I also wanted to use the same multi-factor auth for SSH logins. Happily, from Debian jessie onwards libpam-google-authenticator is packaged. To enable it for SSH you would just add the following:

auth required pam_google_authenticator.so

to /etc/pam.d/sshd (put it just after @include common-auth).

and ensure that:

ChallengeResponseAuthentication yes

is in /etc/ssh/sshd_config.

Not all my users will have MFA enabled though, so to skip prompting for these I use:

auth required pam_google_authenticator.so nullok

Finally, I only wanted users in a particular Unix group to be prompted for an MFA token so (assuming that group was totp) that would be:

auth [success=1 default=ignore] pam_succeed_if.so quiet user notingroup totp
auth required pam_google_authenticator.so nullok

If the pam_succeed_if conditions are met then the next line is skipped, so that causes pam_google_authenticator to be skipped for users not in the group totp.

Each user will require a TOTP secret key generating and storing. If you’re only setting this up for SSH then you can use the google-authenticator binary from the libpam-google-authenticator package. This asks you some simple questions and then populates the file $HOME/.google_authenticator with the key and some configuration options. That looks like:

T6Z2KSDCG7CEWPD6EPA6BICBFD4KYKCSGO2JEQVII7ZJNCXECRZPJ4GJHD3CWC43FZIKQUSV5LR2LFFP
" RATE_LIMIT 3 30 1462548404
" DISALLOW_REUSE 48751610
" TOTP_AUTH
11494760
25488108
33980423
43620625
84061586

The first line is the secret key; the five numbers are emergency codes that will always work (once each) if locked out.

If generating keys elsewhere then you can just populate this file yourself. If the file isn’t present then that’s when “nullok” applies; without “nullok” authentication would fail.

Note that despite the repeated mentions of “google” here, this is not a Google-specific service and no data is sent to Google. Google are the authors of the open source Google Authenticator mobile app and the libpam-google-authenticator PAM module, but (as evidenced by the Perl example) this is an open standard and client and server sides can be implemented in any language.

So that is how you can make a web service and an SSH service use the same TOTP multi-factor authentication.

Your Debian netboot suddenly can’t do Ext4?

If, like me, you’ve just done a Debian netboot install over PXE and discovered that the partitioner suddenly seems to have no option for Ext4 filesystem (leaving only btrfs and XFS), despite the fact that it worked fine a couple of weeks ago, do not be alarmed. You aren’t losing your mind. It seems to be a bug.

As the comment says, downloading netboot.tar.gz version 20150422+deb8u3 fixes it. You can find your version in the debian-installer/amd64/boot-screens/f1.txt file. I was previously using 20150422+deb8u1 and the commenter was using 20150422+deb8u2.

Looking at the dates on the files I’m guessing this broke on 23rd January 2016. There was a Debian point release around then, so possibly you are supposed to download a new netboot.tar.gz with each one – not sure. Although if this is the case it would still be nice to know you’re doing something wrong as opposed to having the installer appear to proceed normally except for denying the existence of any filesystems except XFS and btrfs.

Oh and don’t forget to restart your TFTP daemon. tftpd-hpa at least seems to cache things (or maybe hold the tftp directory open, as I had just moved the old directory out of the way), so I was left even more confused when it still seemed to be serving 20150422+deb8u1.

Installing Debian by PXE using Supermicro IPMI Serial over LAN

Here’s how to install Debian jessie on a Supermicro server using PXE boot and the IPMI serial-over-LAN.

Using these instructions you will be able to complete an install of a remote machine, although you will initially need access to the BIOS to configure the IPMI part.

BIOS settings ^

This bit needs you to be in the same location as the machine, or else have someone who is make the required changes.

Press DEL to go into the BIOS configuration.

Under Advanced > PCIe/PCI/PnP Configuration make sure that the network interface through which you’ll reach your PXE server has the “PXE” option ROM set:

BIOS PSE option ROM

Under Advanced > Serial Port Console Redirection you’ll want to enable SOL Console Redirection.

BIOS serial console redirection

(Pictured here is also COM1 Console Redirection. This is for the physical serial port on the machine, not the one in the IPMI.)

Under SOL Console Redirection Settings you may as well set the Bits per second to 115200.

BIOS SOL redirection settings

Now it’s time to configure the IPMI so you can interact with it over the network. Under IPMI > BMC Network Configuration, put the IPMI on your management network:

IPMI network configuration

Connecting to the IPMI serial ^

With the above BIOS settings in place you should be able to save and reboot and then connect to the IPMI serial console. The default credentials are ADMIn / ADMIN which you should of course change with ipmitool, but that is for a different post.

There’s two ways to connect to the serial-over-LAN: You can ssh to the IPMI controller, or you can use ipmitool. Personally I prefer ssh, but the ipmitool way is like this:

$ ipmitool -I lanplus -H 192.168.1.22 -U ADMIN -a sol activate

The ssh way:

$ ssh ADMIN@192.168.1.22
The authenticity of host '192.168.1.22 (192.168.1.22)' can't be established.
RSA key fingerprint is b7:e1:12:94:37:81:fc:f7:db:6f:1c:00:e4:e0:e1:c4.
Are you sure you want to continue connecting (yes/no)?
Warning: Permanently added '192.168.1.22,192.168.1.22' (RSA) to the list of known hosts.
ADMIN@192.168.1.22's password:
 
ATEN SMASH-CLP System Management Shell, version 1.05
Copyright (c) 2008-2009 by ATEN International CO., Ltd.
All Rights Reserved 
 
 
-> cd /system1/sol1
/system1/sol1
 
-> start
/system1/sol1
press <Enter>, <Esc>, and then <T> to terminate session
(press the keys in sequence, one after the other)

They both end up displaying basically the same thing.

The serial console should just be displaying the boot process, which won’t go anywhere.

DHCP and TFTP server ^

You will need to configure a DHCP and TFTP server on an already-existing machine on the same LAN as your new server. They can both run on the same host.

The DHCP server responds to the initial requests for IP address configuration and passes along where to get the boot environment from. The TFTP server serves up that boot environment. The boot environment here consists of a kernel, initramfs and some configuration for passing arguments to the bootloader/kernel. The boot environment is provided by the Debian project.

DHCP ^

I’m using isc-dhcp-server. Its configuration file is at /etc/dhcp/dhcpd.conf.

You’ll need to know the MAC address of the server, which can be obtained either from the front page of the IPMI controller’s web interface (i.e. https://192.168.1.22/ in this case) or else it is displayed on the serial console when it attempts to do a PXE boot. So, add a section for that:

subnet 192.168.2.0 netmask 255.255.255.0 {
}
 
host foo {
    hardware ethernet 0C:C4:7A:7C:28:40;
    fixed-address 192.168.2.22;
    filename "pxelinux.0";
    next-server 192.168.2.251;
    option subnet-mask 255.255.255.0;
    option routers 192.168.2.1;
}

Here we set the network configuration of the new server with fixed-address, option subnet-mask and option routers. The IP address in next-server refers to the IP address of the TFTP server, and pxelinux.0 is what the new server will download from it.

Make sure that is running:

# service isc-dhcp-server start

DHCP uses UDP port 67, so make sure that is allowed through your firewall.

TFTP ^

A number of different TFTP servers are available. I use tftpd-hpa, which is mostly configured by variables in /etc/default/tftp-hpa:

TFTP_OPTIONS="--secure"
TFTP_USERNAME="tftp"
TFTP_DIRECTORY="/srv/tftp"
TFTP_ADDRESS="0.0.0.0:69"

TFTP_DIRECTORY is where you’ll put the files for the PXE environment.

Make sure that the TFTP server is running:

# service tftpd-hpa start

TFTP uses UDP port 69, so make sure that is allowed through your firewall.

Download the netboot files from your local Debian mirror:

$ cd /srv/tftp
$ curl -s http://ftp.YOUR-MIRROR.debian.org/debian/dists/jessie/main/installer-amd64/current/images/netboot/netboot.tar.gz | sudo tar zxvf -
./                
./version.info    
./ldlinux.c32     
./pxelinux.0      
./pxelinux.cfg    
./debian-installer/
./debian-installer/amd64/
./debian-installer/amd64/bootnetx64.efi
./debian-installer/amd64/grub/
./debian-installer/amd64/grub/grub.cfg
./debian-installer/amd64/grub/font.pf2
…

(This assumes you are installing a device with architecture amd64.)

At this point your TFTP server root should contain a debian-installer subdirectory and a couple of links into it:

$ ls -l .
total 8
drwxrwxr-x 3 root root 4096 Jun  4  2015 debian-installer
lrwxrwxrwx 1 root root   47 Jun  4  2015 ldlinux.c32 -> debian-installer/amd64/boot-screens/ldlinux.c32
lrwxrwxrwx 1 root root   33 Jun  4  2015 pxelinux.0 -> debian-installer/amd64/pxelinux.0
lrwxrwxrwx 1 root root   35 Jun  4  2015 pxelinux.cfg -> debian-installer/amd64/pxelinux.cfg
-rw-rw-r-- 1 root root   61 Jun  4  2015 version.info

You could now boot your server and it would call out to PXE to do its netboot, but would be displaying the installer process on the VGA output. If you intend to carry it out using the Remote Console facility of the IPMI interface then that may be good enough. If you want to do it over the serial-over-LAN though, you’ll need to edit some of the files that came out of the netboot.tar.gz to configure that.

Here’s a list of the files you need to edit. All you are doing in each one is telling it to use serial console. The changes are quite mechanical so you can easily come up with a script to do it, but here I will show the changes verbosely. All the files live in the debian-installer/amd64/boot-screens/ directory.

ttyS1 is used here because this system has a real serial port on ttyS0. 115200 is the baud rate of ttyS1 as configured in the BIOS earlier.

adtxt.cfg

From:

label expert
        menu label ^Expert install
        kernel debian-installer/amd64/linux
        append priority=low vga=788 initrd=debian-installer/amd64/initrd.gz --- 
include debian-installer/amd64/boot-screens/rqtxt.cfg
label auto
        menu label ^Automated install
        kernel debian-installer/amd64/linux
        append auto=true priority=critical vga=788 initrd=debian-installer/amd64/initrd.gz --- quiet

To:

label expert
        menu label ^Expert install
        kernel debian-installer/amd64/linux
        append priority=low console=ttyS1,115200n8 initrd=debian-installer/amd64/initrd.gz --- 
include debian-installer/amd64/boot-screens/rqtxt.cfg
label auto
        menu label ^Automated install
        kernel debian-installer/amd64/linux
        append auto=true priority=critical console=ttyS1,115200n8 initrd=debian-installer/amd64/initrd.gz --- quiet
rqtxt.cfg

From:

label rescue
        menu label ^Rescue mode
        kernel debian-installer/amd64/linux
        append vga=788 initrd=debian-installer/amd64/initrd.gz rescue/enable=true --- quiet

To:

label rescue
        menu label ^Rescue mode
        kernel debian-installer/amd64/linux
        append console=ttyS1,115200n8 initrd=debian-installer/amd64/initrd.gz rescue/enable=true --- quiet
syslinux.cfg

From:

# D-I config version 2.0
# search path for the c32 support libraries (libcom32, libutil etc.)
path debian-installer/amd64/boot-screens/
include debian-installer/amd64/boot-screens/menu.cfg
default debian-installer/amd64/boot-screens/vesamenu.c32
prompt 0
timeout 0

To:

serial 1 115200
console 1
# D-I config version 2.0
# search path for the c32 support libraries (libcom32, libutil etc.)
path debian-installer/amd64/boot-screens/
include debian-installer/amd64/boot-screens/menu.cfg
default debian-installer/amd64/boot-screens/vesamenu.c32
prompt 0
timeout 0
txt.cfg

From:

default install
label install
        menu label ^Install
        menu default
        kernel debian-installer/amd64/linux
        append vga=788 initrd=debian-installer/amd64/initrd.gz --- quiet

To:

default install
label install
        menu label ^Install
        menu default
        kernel debian-installer/amd64/linux
        append console=ttyS1,115200n8 initrd=debian-installer/amd64/initrd.gz --- quiet

Perform the install ^

Connect to the serial-over-LAN and get started. If the server doesn’t have anything currently installed then it should go straight to trying PXE boot. If it does have something on the storage that it would boot then you will have to use F12 at the BIOS screen to convince it to jump straight to PXE boot.

$ ssh ADMIN@192.168.1.22
ADMIN@192.168.1.22's password:
 
ATEN SMASH-CLP System Management Shell, version 1.05
Copyright (c) 2008-2009 by ATEN International CO., Ltd.
All Rights Reserved 
 
 
-> cd /system1/sol1
/system1/sol1
 
-> start
/system1/sol1
press <Enter>, <Esc>, and then <T> to terminate session
(press the keys in sequence, one after the other)
 
Intel(R) Boot Agent GE v1.5.13                                                  
Copyright (C) 1997-2013, Intel Corporation                                      
 
CLIENT MAC ADDR: 0C C4 7A 7C 28 40  GUID: 00000000 0000 0000 0000 0CC47A7C2840  
CLIENT IP: 192.168.2.22  MASK: 255.255.255.0  DHCP IP: 192.168.2.252             
GATEWAY IP: 192.168.2.1
 
PXELINUX 6.03 PXE 20150107 Copyright (C) 1994-2014 H. Peter Anvin et al    
 
 
 
 
 
 
                 ┌───────────────────────────────────────┐
                 │ Debian GNU/Linux installer boot menu  │
                 ├───────────────────────────────────────┤
                 │ Install                               │
                 │ Advanced options                    > │
                 │ Help                                  │
                 │ Install with speech synthesis         │
                 │                                       │
                 │                                       │
                 │                                       │
                 │                                       │
                 │                                       │
                 │                                       │
                 └───────────────────────────────────────┘
 
 
 
              Press ENTER to boot or TAB to edit a menu entry
 
 
  ┌───────────────────────┤ [!!] Select a language ├────────────────────────┐
  │                                                                         │
  │ Choose the language to be used for the installation process. The        │
  │ selected language will also be the default language for the installed   │
  │ system.                                                                 │
  │                                                                         │
  │ Language:                                                               │
  │                                                                         │
  │                               C                                         │
  │                               English                                   │
  │                                                                         │
  │     <Go Back>                                                           │
  │                                                                         │
  └─────────────────────────────────────────────────────────────────────────┘
 
 
 
 
<Tab> moves; <Space> selects; <Enter> activates buttons

…and now the installation proceeds as normal.

At the end of this you should be left with a system that uses ttyS1 for its console. You may need to tweak that depending on whether you want the VGA console also.

systemd on Debian, reading the persistent system logs as a user

All the documentation and guides I found say that to enable a persistent journal on Debian you just need to create /var/log/journal. It is true that once you create that directory you will get a persistent journal.

All the documentation and guides I found say that as long as you are in group adm (or sometimes they say group systemd-journal) it is possible to see all system logs by just typing journalctl, without having to run it as root. Having simply done mkdir /var/log/journal I can tell you that is not the case. All you will see is logs relating to your user.

The missing piece of info is contained in /usr/share/doc/systemd/README.Debian:


Enabling persistent logging in journald
=======================================

To enable persistent logging, create /var/log/journal and set up proper permissions:

install -d -g systemd-journal /var/log/journal
setfacl -R -nm g:adm:rx,d:g:adm:rx /var/log/journal

-- Tollef Fog Heen <tfheen@debian.org>, Wed, 12 Oct 2011 08:43:50 +0200

Without the above you will not have permission to read the /var/log/journal//system.journal file, and the ACL is necessary for journal files created in the future to also be readable.

Paranoid, Init

Having marvelled at the er… unique nature of MikeeUSA’s Systemd Blues: Took our thing (Wooo) blues homage to the perils of using systemd, I decided what the world actually needs is something from the metal genre.

So, here’s the lyrics to Paranoid, Init.

Default soon on Debian
This doesn’t help me with my mind
People think I’m insane
Because I am trolling all the time

All day long I fight Red Hat
And uphold UNIX philosophy
Think I’ll lose my mind
If I can’t use sysvinit on jessie

Can you help me
Terrorise pid 1?
Oh yeah!

Tried to show the committee
That things were wrong with this design
They can’t see Poettering’s plan in this
They must be blind

Some sick joke I could just cry
GNOME needs logind API
QR codes gave me a feel
Then binary logs just broke the deal

And so as you hear these words
Telling you now of my state
Can’t log off and enjoy life
I’ve another sock puppet to create