Mutt wins again – subject munging on display

TL;DR: ^

You can munge subjects for display only using the subjectrx Mutt configuration directive.

The Setup ^

I use the terminal-based email reader Mutt.

Many projects that I follow are switching away from email discussion lists in favour of web-first interfaces (“forums”, I think the youngsters are calling them now) like Discourse. That is fine—there’s lots of problems with trying to run a busy community over email—but Discourse offers a “mailing list mode” and I still find my Mutt email client to be a comfortable way to follow discussions. So all my accounts on the various Discourse instances are set to mailing list mode.

The Problem ^

One of the slight issues I have with this is the subject lines that Discourse uses. On an instance with a lot of categories and sub-categories, these will all be prepended to the subject line of each email using up quite a lot of screen space.

The same is true for legacy mailing list subject tags, but in that environment the admins were generally conscious that whatever text they chose would be prepended to every subject, so they tend to choose terse tags like “[users]” for example.

There was a time when subject line tags were controversial amongst experienced email users, because experienced email users know how to sort and filter their mails based on headers and don’t need a tag in the subject line to let them know what an email is. It doesn’t seem to be very controversial any more; I hypothesise that’s because new Internet users don’t use email as much and so don’t value spending much time working out how to get their filtering just right, and so on. So, most legacy mailing lists that I’m part of now do use terse subject tags and not many people complain about that.

Since the posts on Discourse are primarily intended for a web browser, the verbosity of the categories is not an issue. It’s not uncommon to see a category called, say, “Help & Support” and then within that a sub-category for a particular project, e.g. “Footronic 5.x”. When Discourse sends out an email for a post to such a category, it’ll look like this:

Subject: [Help & Support] [Footronic 5.x] Need some help getting my Foo into alignment after passing through a bar-quux transform

Lots of space used by that prefix, on every message, and pointlessly so for me since these mails will have been filtered into a folder so I always know which folder I’m looking at anyway: all of the messages in that folder would be for help and support on Footronic 5.x. Like most email clients, Mutt has an index view that shows an overview of all the emails with a single line for each. Long subjects are truncated at the edge of my terminal.

I’ve put up with this for years now but the last straw was the newly-launched Ansible forum. Their category names are large and there’s lots of sub-categories. Here’s an example of what that looks like in my 95 character wide terminal.

The index view of a Mutt email client
The index view of a Mutt email client

This is quite irritating! I wondered if it could be fixed.

The Fix ^

Of course the Mutt community thought of this, and years ago. subjectrx! You put it in your config, specifying a regular expression to match and what it should be replaced with. For example:

subjectrx '\[Ansible\] \[[^]]*\] *' '%L%R'

That matches any sequence of “[Ansible] ” followed by another set of “[]” that have anything inside them, and replaces all of that with the left side of the match (%L) and the right side of the match (%R). So that effectively discards the tags.

This happens only on display; it doesn’t modify the actual emails on storage.

Here’s what that looks like afterwards:

The index view of a Mutt email client, with tags removed from subject lines
The index view of a Mutt email client, with tags removed from subject lines

Much better, right?

And that’s one of the reasons why I continue to use Mutt.

Other Solutions ^

Off the top of my head, there are some other ways this could have been done.

Alter emails upon delivery ^

It would have been pretty simple to strip these tags out of emails as they were being delivered, but I really like to keep emails on storage the same as they were when they arrived. At the very least doing this will cause a DKIM failure as I would have modified the message after it was signed. That wouldn’t be an issue for my delivery since my DKIM check would happen before any such editing, but I’d still rather not.

Run the subject lines through an external filter program ^

The format of many things in Mutt is highly configurable and one such format is index_format, which controls how the lines on the index view are displayed.

Sadly there is not a builtin format specifier to search and replace in the subject tag (or any other tag), but you can run the whole thing through an external program, which could do anything you liked to it. That would involve fork()ing and exec()ing a process for every single mail in a mailbox though. Yuck.

On Discourse ^

This is not a gripe about Discourse. I think Discourse is a better way to run a busy community than email lists. At this point I’d be happy for most mailing lists I’m part of to switch to Discourse instances, especially the very busy ones. I’m impressed with the amount of work and features that Discourse now has.

The only exception to that I think is that purely question-answer support mailing lists might be better off with a StackOverflow-style approach like AskUbuntu. But failing that, I think Discourse is still many times better than a mailing list for that use case.

Not that you asked, but I think the primary problem with email as a community platform is that only old people use email. In the 21st century it’s an unacceptable barrier to entry.

The next most serious problem with email for running a community is that any decently-sized community will have a certain percentage of utter numpties; these utter numpties won’t be self-aware enough to know they are utter numpties, and they will post a lot of nonsense. The only way to counter a numpty posting nonsense is to reply to it and call them out. That is exhausting, unrewarding work, which frequently goes wrong, adding to the noise and ill-feeling. Problem posters do not get dealt with until they reach a level bad enough to warrant their posting rights being removed. Forums like Discourse scale their moderation tasks much better, with a lot of it being amenable to wide community input.

I could go on to list a lot more serious problems but those two are the worst in my opinion.

Happy birthday, /dev/sdd?

One of my hard drives reaches 120,000 hours of operation in about a month:

$ ~/src/blkleaderboard/
     sdd 119195 hours (13.59 years) 0.29TiB ST3320620AS
     sdb 114560 hours (13.06 years) 0.29TiB ST3320620AS
     sda 113030 hours (12.89 years) 0.29TiB ST3320620AS
     sdk  76904 hours ( 8.77 years) 2.73TiB WDC WD30EZRX-00D
     sdh  66018 hours ( 7.53 years) 0.91TiB Hitachi HUA72201
     sde  45746 hours ( 5.21 years) 0.91TiB SanDisk SDSSDH31
     sdc  39179 hours ( 4.46 years) 0.29TiB ST3320418AS
     sdf  28758 hours ( 3.28 years) 1.82TiB Samsung SSD 860
     sdj  28637 hours ( 3.26 years) 1.75TiB KINGSTON SUV5001
     sdg  23067 hours ( 2.63 years) 1.75TiB KINGSTON SUV5001
     sdi   9596 hours ( 1.09 years) 0.45TiB ST500DM002-1BD14

It’s a 320GB Seagate Barracuda 7200.10.

The machine these are in is a fileserver at my home. The four 320GB HDDs are what the operating system is installed on, whereas the hodge podge assortment of differently-sized HDDs and SSDs are the main storage for files.

That is not the most performant way to do things, but it’s only at home and doesn’t need best performance. It mostly just uses up discarded storage from other machines as they get replaced.

sdd has seen every release of Debian since 4.0 (etch) and several iterations of hardware, but this can’t go on much longer. The machine that the four 320GB HDDs are in now is very underpowered but any replacement I can think of won’t be needing four 3.5″ SATA devices inside it. More like 2x 2.5″ NVMe or M.2.

Then again, I’ve been saying that it must be replaced for about 5 years now, so who knows how long it will take me. sdd will definitely reach 120,000 hours barring hardware failure in the next month. is on GitHub, by the way.

PowerDNS Truncated SOA Response Problem

I recently upgraded bind9 on my primary nameserver and soon after I noticed that one particular zone would no longer transfer to my secondary nameservers, which run PowerDNS. All the PowerDNS servers were saying:

Nov 18 00:25:26 daiquiri pdns_server[32452]: While checking domain freshness: Query to '2001:ba8:1f1:f085::53' for SOA of '' did not return a SOA

The confusing thing was that manually using dig to query for this did work fine:

daiquiri$ dig +short -t soa @2001:ba8:1f1:f085::53 1668670704 28800 14400 3600000 86400

After scratching my head for several hours over this yesterday, I eventually broke out tcpdump and was surprised to see that the response to PowerDNS’s SOA query was indeed empty. And it was also truncated!

Back to dig, I could see that this zone was DNSSEC-signed and the SOA query with DNSSEC info was 2293 bytes in size:

daiquiri$ dig +dnssec -t soa @2001:ba8:1f1:f085::53 | grep MSG
;; MSG SIZE  rcvd: 2293

That’s bigger than a DNS response can be in UDP, so it truncates and the client is supposed to retry over TCP. dig has no problem doing that, but PowerDNS can’t (yet).

Specifically what has changed in bind9 is the EDNS buffer size, down from its previous default of 4096 bytes to 1232 bytes.

I can stop PowerDNS from doing the SOA check at all by upgrading all PowerDNS servers to v4.7.x and using the secondary-check-signature-freshness=no option.

I could put bind9’s EDNS buffer size back up to 4096, but it doesn’t seem advisable to go over about 1400 bytes and so that won’t help.

For now I have enabled the minimal-responses option in bind9, which drops extra records from the Authority and Additional sections of responses unless they are absolutely required. This reduces the response size of that SOA query to 685 bytes, so it no longer truncates and PowerDNS is happy.

I’m not sure if an SOA response can ever go above 1232 bytes now. Maybe as DNSSEC signatures get bigger. So this might not be a permanenet solution and hopefully PowerDNS will gain the ability to retry those SOA queries over TCP.

Blood pressure and glucose aggravation

I haven’t really been taking good care of myself which is highly inadvisable as someone with diabetes, but last week I had an extremely high blood pressure reading and so this cannot go on. At the same time I had a very high blood glucose reading although that wasn’t a surprise to me as I’ve long found it difficult to control this.

I have actually been taking my diet and level of activity half seriously and as a result I am currently at my lowest weight in over 25 years, though still at the lower end of “obese” by BMI standards. To put that into some context, in 2006 when first diagnosed with diabetes I was almost 140kg.

Anyway since my weight while not ideal is better than it’s ever been, the high blood pressure was even more of a worry. I was concerned that my kidneys might have packed up or something. Happily the result of yesterday’s bloods had my GP saying that my kidney and liver function were “perfect” (his words), which is very relieving but does leave me wondering what else I can do.

For now the GP has prescribed me some pills for blood pressure and told me to come back in a month for another blood test and blood pressure check, so it seems like he isn’t overly concerned. With the blood pressure as high as it is I was seriously wondering if he was going to call me with the results and say “go to hospital now”.

Clearly even though I thought I was doing okay with the weight loss I am going to have to step things up a bit.

Also I am hopelessly addicted to (sugar free) fizzy drinks but I’m going to have to do something about this. Although there’s no proven link to high blood pressure, it is a bit more likely that the artificial sweeteners are going to be playing havoc with my blood sugar levels and appetite.

My habit was at the ridiculous level of over 2 litres per day but since last week’s shock I’ve started by instituting a policy of one full pint of water between any fizzy drink. Small steps but I don’t feel like I can do cold turkey.

What I have found after sticking to this policy for the last 9 days is that I often can’t physically ingest any more liquid so I don’t reach for the fizzy drink, and that it really does seem to have reduced my appetite as well. Today I had my first fizzy drink of the day with dinner, whereas before I might have had 2 litres already by that time. At the moment I’m at around 500ml a day. I hope I can keep to something like that.

The caffeine withdrawal is not pleasant. I don’t think it is a good idea to try to find another caffeine source until my blood pressure is under control. By then I might not feel the need.

If I don’t get the blood pressure under control then my near future will feature a stroke or heart attack. If I don’t get the blood sugar under control then my near future will include insulin injections.

Exim: Adding the Autonomous System Number as a header in received emails

Updates ^

2022-11-05 ^

  • Added a bit about timeouts, as concern was expressed that I am “bonkers”.

The Problem ^

For statistical purposes I wanted to add the Autonomous System Number (ASN) for the IP address of the connecting host as a header in the received email, like this:

X-ASN: AS63949 2a01:7e01::/32

The Answer ^

You can obtain this information through a DNS query to Team Cymru:

$ sipcalc -r 2a01:7e01::f03c:92ff:fe32:a408
-[ipv6 : 2a01:7e01::f03c:92ff:fe32:a408] - 0
Reverse DNS (  -
$ dig +short -t txt
"63949 | 2a01:7e01::/32 | US | ripencc | 2011-02-01"

Or for legacy Internet addresses:

$ dig +noall +question -x
;   IN      PTR
$ dig +short -t txt
"13414 | | US | arin | 2010-11-23"

So for IPv6 addresses the process is:

  1. Expand the address out fully (2a01:7e01::f03c:92ff:fe32:a4082a01:7e01:0000:0000:f03c:92ff:fe32:a408)
  2. Remove the colons (2a01:7e01:0000:0000:f03c:92ff:fe32:a4082a017e0100000000f03c92fffe32a408)
  3. Reverse it (2a017e0100000000f03c92fffe32a408804a23efff29c30f0000000010e710a2)
  4. Add a dot after every hexadecimal number (804a23efff29c30f0000000010e710a28.0.4.a.2.3.e.f.f.f.2.9.c.3.0.f.
  5. Add on the end (
  6. Query that TXT record and parse out the first two values separated by ‘|’ in the response.

For legacy IP addresses the process is much simpler; reverse the octets, add on the end and query that.

An Exim Answer ^

In Exim configuration you can do it like this:

(This is meant to go inside an ACL like your check_rcpt or check_data. Maybe near the end of check_data at the point where you’ve already decided to accept the email. No point in doing this for an email you will reject.)

# Add X-ASN: header for IPv6 senders.
  warn message = X-ASN: AS${sg{${extract{1}{|}{$acl_m9}}}{\N\s+\N}{}} ${sg{${extract{2}{{|}{$acl_m9}}}{\N\s+\N}{}}
     condition = ${if isip6{$sender_host_address}}
    set acl_m9 = ${lookup dnsdb{txt=${reverse_ip:$sender_host_address}}}
# Add X-ASN: header for legacy IP senders.
  warn message = X-ASN: AS${sg{${extract{1}{|}{$acl_m9}}}{\N\s+\N}{}} ${sg{${extract{2}{{|}{$acl_m9}}}{\N\s+\N}{}}
     condition = ${if isip4{$sender_host_address}}
    set acl_m9 = ${lookup dnsdb{txt=${reverse_ip:$sender_host_address}}}

I dislike that I’ve had to use two tests that are almost exactly the same except they query slightly different DNS names ( vs I’m sure it could be done in one, but I’m not good enough with the Exim string evaluations. They send me cross-eyed. I couldn’t find a better way so I decided to use the time-honoured tactic of posting what I have in order to provoke people into correcting me. Please let me know if you can improve it!

The amount of nested {} will probably drive you mad, but basically:

  • ${reverse_ip:$sender_host_address} handles expanding and reversing an IP address into the form you would use for a reverse DNS query.
  • That gets queried in DNS with the correct suffix and the full response stored in $acl_m9.
  • warn message = X-ASN: adds a header to the email, the content of which is built from two fields extracted out of $acl_m9 with all whitespace removed (${sg{source}{regex}{replacement}}).

What about timeouts? ^

One piece of feedback I got was that I am “bonkers” to make my email delivery rely on a real time network lookup. I can kind of see the argument, but also not: this is a DNS query exactly the same as a typical DNSBL query (Team Cymru IP-to-ASN service is used exactly like a typical DNSBL).

Most people’s mail servers do multiple DNSBL queries already and nobody really is up in arms saying it’s bonkers to do so. My Exim already does a couple of DNSBL queries and then if it is going to deliver the email it will call out to SpamAssassin which does many DNSBL queries itself. If these hit a timeout then it would slow down my mail delivery.

In the past where a DNSBL has unceremoniously shut down and made its nameservers unresponsive I have seen problems, as it caused the delivery processes to pile up while they waited on their timeouts and then Exim would complain that there’s too many processes. That would be resolved by removing the errant DNSBL(s) from the configuration.

Query load is not a concern as DNS is highly scalable and my system is not going to add noticeable load to Team Cymru’s already public service. The SpamAssassin ASN plugin is already out there, hard coded to use this same service and must have many many users already.

As far as I can tell, in Exim dnsdb queries use the same timeouts and retries as dnslist queries do, that being controlled by the dns_retrans and dns_rety settings. These settings both default to 0, which means “operating system / resolver library default”. If you were worried you could explicitly set these to their minimum value:

If still worried then you would first have to either turn off all DNSBLs or make sure you had local copies of them (e.g. by arranging AXFR to your own local servers). Then to do the IP-to-ASN locally you’d arrange to have a local BGP feed that you could query. I think you’d need to have an absolutely huge mail server before these issues became real concerns.

set acl_m9 = ${lookup dnsdb{retrans_1s,retry_1,txt=${reverse_ip:$sender_host_address}}}

As for dnslist, the consequence of a time out is that you get no data, so it would just result in an empty header.

But Why? ^

I’ve actually been doing this for a while with SpamAssassin’s ASN plugin but I’ve changed the way in which I query SpamAssassin and now I don’t directly get the rewritten email that SpamAssassin makes (with its X-Spam-ASN: header in).

I use it for feeding into Bayes to see if there’s a particular prevalence of ASN amongst the email that is classified as spam, and I sometimes add a few points on manually for ASNs that are particularly bad. That is a lot less work than trying to track down all their IP addresses and keep that up to date.

Using Duplicity to back up to Amazon S3 over IPv6 (only)

Scenario ^

I have a server that I use for making backups. I also send backups from that server into Amazon S3 at the “Infrequent Access” storage class. That class is cheaper to store but expensive to access. It’s intended for backups of last resort that you only access in an emergency. I use Duplicity to handle the S3 part.

(I could save a bit more by using one of the “Glacier” classes but at the moment the cost is minimal and I’m not that brave.)

I recently decided to change which server I use for the backups. I noticed that renting a server with only IPv6 connectivity was cheaper, and as all the hosts I back up have IPv6 connectivity I decided to give that a go.

This mostly worked fine. The only thing I really noticed was when I tried to install some software from GitHub. GitHub doesn’t support IPv6, so I had to piggy back that download through another host.

Then I came to set up Duplicity again and found that I needed to make some non-obvious changes to make it work with S3 over IPv6-only.

S3 endpoint ^

The main issue is that the default S3 endpoint URL is https://s3.<region>, and this host only has an A (IPv4) record! For example:

$ host has address

If you run Duplicity with a target like s3://yourbucketname/path/to/backup then it will try that endpoint, get only an IPv4 address, and return Network unreachable.

S3 does actually support IPv6, but for that to work you need to use a dual stack endpoint! They look like this:

$ host has address has IPv6 address 2600:1fa0:80dc:5101:34d9:451e::

So we need to specify the S3 endpoint to use.

Specifying the S3 endpoint ^

In order to do this you need to switch Duplicity to the “boto3” backend. Assuming you’ve installed the correct package (python3-boto3 on Debian), this is as simple as changing the target from s3://… to boto3+s3://….

That then allows you to use the command line arguments --s3-region-name and --s3-endpoint-url so you can tell it which host to talk to. That ends up giving you both an IPv4 and an IPv6 address and your system correctly chooses the IPv6 one.

The full script ^

The new, working script now looks something like this:

export PASSPHRASE="highlysecret"
export AWS_ACCESS_KEY_ID="notquiteassecret"
export AWS_SECRET_ACCESS_KEY="extremelysecret"
# Somewhere with plenty of free space.
export TMPDIR=/var/tmp
duplicity --encrypt-key ABCDEF0123456789 \
          --asynchronous-upload \
          -v 4 \
          --archive-dir=/path/to/your/duplicity/archives \
          --s3-use-ia \
          --s3-use-multiprocessing \
          --s3-use-new-style \
          --s3-region-name "us-east-1" \
          --s3-endpoint-url "" \
          incr \
          --full-if-older-than 30D \
          /stuff/you/want/backed/up \

The previous version of the script looked a bit like:

# All the exports stayed the same
duplicity --encrypt-key ABCDEF0123456789 \
          --asynchronous-upload \
          -v 4 \
          --archive-dir=/path/to/your/duplicity/archives \
          --s3-use-ia \
          --s3-use-multiprocessing \
          incr \
          --full-if-older-than 30D \
          /stuff/you/want/backed/up \

Building BitFolk’s Rescue VM

Overview ^

BitFolk‘s Rescue VM is a live system based on the Debian Live project. You boot it, it finds its root filesystem over read-only NFS, and then it mounts a unionfs RAM disk over that so that you can make changes (e.g. install packages) that don’t persist. People generally use it to repair broken operating systems, reset root passwords etc.

Every few years I have to rebuild it, because it’s important that it’s new enough to be able to effectively poke around in guest filesystems. Each time I have to try to remember how I did it. It’s not that difficult but it’s well past time that I document how it’s done.

Basic concept of Debian Live ^

The idea is that everything under the config/ directory of your build area is either

  • a set of configuration options for the process itself,
  • some files to put in the image,
  • some scripts to run while building the image, or
  • some scripts to run while booting the image.

Install packages ^

Pick a host running at least the latest Debian stable. It might be possible to build a live image for a newer version of Debian, but the live-build system and its dependencies like debootstrap might end up being too old.

$ sudo apt install live-build live-boot live-config

Prepare the work directory ^

$ sudo mkdir -vp /srv/lb/auto
$ cd /srv/lb

Main configuration ^

All of these config options are described in the lb_config man page.

$ sudo tee auto/config >/dev/null <<'_EOF_'
set -e
lb config noauto \
    --architectures                     amd64 \
    --distribution                      bullseye \
    --binary-images                     netboot \
    --archive-areas                     main \
    --apt-source-archives               false \
    --apt-indices                       false \
    --backports                         true \
    --mirror-bootstrap                  "$main_mirror" \
    --mirror-chroot-security            "$sec_mirror" \
    --mirror-binary                     "$main_mirror" \
    --mirror-binary-security            "$sec_mirror" \
    --memtest                           none \
    --net-tarball                       true \

The variables at the top just save me having to repeat myself for all the mirrors. They make both the build process and the resulting image use BitFolk’s apt-cacher to proxy the mirror.

I’m not going to describe every config option as you can just look them up in the man page. The most important one is --binary-images netboot to make sure it builds an image that can be booted by network.

Extra packages ^

There’s some extra packages I want available in the rescue image. Here’s how to get them installed.

$ sudo tee config/package-lists/bitfolk_rescue.list.chroot > /dev/null <<_EOF_

Installing a backports kernel ^

I want the rescue system to be Debian 11 (bullseye), but with a bullseye-backports kernel.

We already used --backports true to make sure that we have access to the backports package mirrors but we need to run a script hook to actually install the backports kernel in the image while it’s being built.

$ sudo tee config/hooks/live/9000-install-backports-kernel.hook.chroot >/dev/null <<'_EOF_'
set -e
apt -y install -t bullseye-backports linux-image-amd64
apt -y purge -t bullseye linux-image-amd64
apt -y purge -t bullseye 'linux-image-5.10.*'

Set a static /etc/resolv.conf ^

This image will only be booted on one network where I know what the nameservers are, so may as well statically override them. If you were building an image to use on different networks you’d probably instead want to use one of the public resolvers or accept what DHCP gives you.

$ sudo tee config/includes.chroot/etc/resolv.conf >/dev/null <<_EOF_

Set an explanatory footer text in /etc/issue.footer ^

The people using this rescue image don’t necessarily know what it is and how to use it. I take the opportunity to put some basic info in the file /etc/issue.footer in the image, which will later end up in the real /etc/issue

$ sudo tee config/includes.chroot/etc/issue.footer >/dev/null <<_EOF_
BitFolk Rescue Environment -
Blah blah about what this is and how to use it

Set a random password at boot ^

By default a Debian Live image has a user name of “user” and a password of “live“. This isn’t suitable for a networked service that will have sshd active from the start, so we will install a hook script that sets a random password. This will be run near the end of the image’s boot process.

$ sudo tee config/includes.chroot/lib/live/config/2000-passwd >/dev/null <<'_EOF_'
set -e
echo -n " random-password "
NEWPASS=$(/usr/bin/pwgen -c -N 1)
printf "user:%s\n" "$NEWPASS" | chpasswd
    printf "****************************************\n";
    printf "Resetting user password to random value:\n";
    printf "\t${RED}New user password:${NORMAL} %s\n" "$NEWPASS";
    printf "****************************************\n";
    cat /etc/issue.footer
} >> /etc/issue

This script puts the random password and the footer text into the /etc/issue file which is displayed above the console login prompt, so the user can see what the password is.

Fix initial networking setup ^

This one’s a bit unfortunate and is a huge hack, but I’m not sure enough of the details to report a bug yet.

The live image when booted is supposed to be able to set up its network by a number of different ways. DHCP would be the most sensible for an image you take with you to different networks.

The BitFolk Rescue VM is only ever booted in one network though, and we don’t use DHCP. I want to set static networking through the ip=…s syntax of the kernel command line.

Unfortunately it doesn’t seem to work properly with live-boot as shipped. I had to hack the /lib/live/boot/ file to make it parse the values out of the kernel command line.

Here’s a diff. Copy /lib/live/boot/ to config/includes.chroot/usr/lib/live/boot/ and then apply that patch to it.

It’s simple enough that you could probably edit it by hand. All it does is comment out one section and replace it with some bits that parse IP setup out of the $STATICIP variable.

Fix the shutdown process ^

Again this is a horrible hack and I’m sure there is a better way to handle it, but I couldn’t work out anything better and this works.

This image will be running with its root filesystem on NFS. When a shutdown or halt command is issued however, systemd seems extremely keen to shut off the network as soon as possible. That leaves the shutdown process unable to continue because it can’t read or write its root filesystem any more. The shutdown process stalls forever.

As this is a read-only system with no persistent state I don’t care how brutal the shutdown process is. I care more that it does actually shut down. So, I have added a systemd service that issues systemctl –force –force poweroff any time that it’s about to shut down by any means.

$ sudo tee config/includes.chroot/etc/systemd/system/always-brutally-poweroff.service >/dev/null <<_EOF_
Description=Every kind of shutdown will be brutal poweroff
ExecStart=/usr/bin/systemctl --force --force poweroff

And to force it to be enabled at boot time:

$ sudo tee config/includes.chroot/etc/rc.local >/dev/null <<_EOF_
set -e
systemctl enable always-brutally-poweroff

Build it ^

At last we’re ready to build the image.

$ sudo lb clean && sudo lb config && sudo lb build

The “lb clean” is there because you probably won’t get this right first time and will want to iterate on it.

Once complete you’ll find the files to put on your NFS server in binary/ and the kernel and initramfs to boot on your client machine in tftpboot/live/

$ sudo rsync -av binary/ my.nfs.server:/srv/rescue/

Booting it ^

The details of exactly how I boot the client side (which in BitFolk’s case is a customer VM) are out of scope here, but this is sort of what the kernel command line looks like on the client (normally all on one line):



Get root filesystem from NFS.
Static IP configuration on kernel command line. Separated by colons:

  • Client’s IP
  • NFS server’s IP
  • Default gateway
  • Netmask
  • Host name
Host name.
Where to mount root from on NFS server.
NFS client options to use.
Tell live-boot that this is a live image.

Look for persistent data.

In action ^

Here’s an Asciinema of this image in action.

Improvements ^

There’s a few things in here which are hacks. What I have works but no doubt I am doing some things wrong. If you know better please do let me know in comments or whatever. Ideally I’d like to stick with Debian Live though because it’s got a lot of problems solved already.

btrfs compression wins

Some quite good btrfs compression results from my backup hosts (which back up customer data).

Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       64%       68G         105G         1.2T
none       100%       24G          24G         434G
zlib        54%       43G          80G         797G
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       74%       91G         123G         992G
none       100%       59G          59G         599G
lzo         50%       32G          63G         393G
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       73%       16G          22G         459G
none       100%       12G          12G         269G
lzo         40%      4.1G          10G         190G
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       71%      105G         148G         1.9T
none       100%       70G          70G         910G
zlib        40%       24G          60G         1.0T
lzo         58%       10G          17G          17G

So that’s 398G that takes up 280G, a 29.6% reduction.

The “none” type is incompressible files such as media that’s already compressed. I started off with lzo compression but I’m switching to zlib now as it compresses more and this data is rarely accessed so I’m not too concerned about performance. I need newer kernels on these before I can try zstd.

I’ve had serious concerns about btrfs before based on issues I’ve had using it at home, but these were mostly around multiple device usage. Here they get a single block device that has redundancy underneath so the only remotely interesting thing that btrfs is doing here is the compression.

Might try some offline deduplication next.

Resolving a sector offset to a logical volume

The Problem ^

Sometimes Linux logs interesting things with sector offsets. For example:

Jul 23 23:11:19 tanqueray kernel: [197925.429561] sg[22] phys_addr:0x00000015bac60000 offset:0 length:4096 dma_address:0x00000012cf47a000 dma_length:4096
Jul 23 23:11:19 tanqueray kernel: [197925.430323] sg[23] phys_addr:0x00000015bac5d000 offset:0 length:4608 dma_address:0x00000012cf47b000 dma_length:4608
Jul 23 23:11:19 tanqueray kernel: [197925.431052] sg[24] phys_addr:0x00000015bac5e200 offset:512 length:3584 dma_address:0x00000012cf47c200 dma_length:3584
Jul 23 23:11:19 tanqueray kernel: [197925.431824] sg[25] phys_addr:0x00000015bac2e000 offset:0 length:4096 dma_address:0x00000012cf47d000 dma_length:4096
Jul 23 23:11:19 tanqueray kernel: [197925.434447] Invalid SGL for payload:131072 nents:32
Jul 23 23:11:19 tanqueray kernel: [197925.454419] blk_update_request: I/O error, dev nvme0n1, sector 509505343 op 0x1:(WRITE) flags 0x800 phys_seg 32 prio class 0
Jul 23 23:11:19 tanqueray kernel: [197925.464644] md/raid1:md5: Disk failure on nvme0n1p5, disabling device.
Jul 23 23:11:19 tanqueray kernel: [197925.464644] md/raid1:md5: Operation continuing on 1 devices.

What is at sector 509505343 of /dev/nvme0n1p5 anyway? Well, that’s part of an md array and then on top of that is an lvm physical volume, which has a number of logical volumes.

I’d like to know which logical volume sector 509505343 of /dev/nvme0n1p5 corresponds to.

At the md level ^

Thankfully this is a RAID-1 so every device in it has the exact same layout.

$ grep -A 2 ^md5 /proc/mdstat 
md5 : active raid1 nvme0n1p5[0] sda5[1]
      3738534208 blocks super 1.2 [2/2] [UU]
      bitmap: 2/28 pages [8KB], 65536KB chunk

The superblock format of 1.2 also means that the RAID metadata is at the end of each device, so there is no offset there to worry about.

For all intents and purposes sector 509505343 of /dev/nvme0n1p5 is the same as sector 509505343 of /dev/md5.

If I’d been using a different RAID level like 5 or 6 then this would have been far more complicated as the data would have been striped across multiple devices at different offsets, together with parity. Some layouts of Linux RAID-10 would also have different offsets.

At the lvm level ^

LVM has physical volumes (PVs) that are split into extents, then one or more ranges of one or more extents make up a logical volume (LV). The physical volumes are just the underlying device, so in my case that’s /dev/md5.

Offset into the PV ^

LVM has some metadata at the start of the PV, so we first work out how far into the PV the extents can start:

$ sudo pvs --noheadings -o pe_start --units s /dev/md5

So, sector 509505343 is actually 509503295 sectors into this PV, because the first 2048 sectors are reserved for metadata.

How big is an extent? ^

Next we need to know how big an LVM extent is.

$ sudo pvdisplay --units s /dev/md5 | grep 'PE Size'
  PE Size               8192 Se

There’s 8192 sectors in each of the extents in this PV, so this sector is inside extent number 509503295 / 8192 = 62195.22644043.

It’s fractional because naturally the sector is not on an exact PE boundary. If I need to I could work out from the remainder how many sectors into PE 62195 this is, but I’m only interested in the LV name and each LV has an integer number of PEs, so that’s fine: PE 62195.

Look at the PV’s mappings ^

Now you can dump out a list of mappings for the PV. This will show you what each range of extents corresponds to. Note that there might be multiple ranges for an LV if it’s been grown later on.

$ sudo pvdisplay --maps /dev/md5 | grep -A1 'Physical extent'
  Physical extent 58934 to 71733:
    Logical volume      /dev/myvg/domu_backup4_xvdd
  Physical extent 71734 to 912726:

So, extent 62195 is inside /dev/myvg/domu_backup4_xvdd.

What’s going on here then? ^

I’m not sure, but there appears to be a kernel bug and it’s probably got something to do with the fact that this LV is a disk with an unaligned partition table:

$ sudo fdisk -u -l /dev/myvg/domu_backup4_xvdd
Disk /dev/myvg/domu_backup4_xvdd: 50 GiB, 53687091200 bytes, 104857600 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0x07c7ce4c

Device Boot Start End Sectors Size Id Type
/dev/myvg/domu_backup4_xvdd1 63 104857599 104857537 50G 83 Linux
Partition 1 does not start on physical sector boundary.

The Linux NVMe driver can only do IO in multiples of 4096 bytes. As seen in the initial logs, two of the requests were for 4608 and 3584 bytes respectively; these are not divisible by 4096 and thus hit a WARN().

Jul 23 23:11:19 tanqueray kernel: [197925.430323] sg[23] phys_addr:0x00000015bac5d000 offset:0 length:4608 dma_address:0x00000012cf47b000 dma_length:4608
Jul 23 23:11:19 tanqueray kernel: [197925.431052] sg[24] phys_addr:0x00000015bac5e200 offset:512 length:3584 dma_address:0x00000012cf47c200 dma_length:3584

Going further: finding the file ^

I’m not interested in doing this because it’s fairly likely that it’s because of the offset partition and many kinds of IO to it will cause this.

If you did want to though, you’d first have to look at the partition table to see where your filesystem starts. 0.22644043 * 8192 = 1855 sectors into the disk. Partition 1 starts at 63, so this file is at 1792 sectors.

You can then (for ext4) use debugfs to poke about and see which file that corresponds to.

Keeping firewall logs out of Linux’s kernel log with ulogd2

A few words about iptables vs nft ^

nftables is the new thing and iptables is deprecated, but I haven’t found time to convert everything to nft rules syntax yet.

I’m still using iptables rules but it’s the iptables frontend to nftables. All of this works both with legacy iptables and with nft but with different syntax.

Logging with iptables ^

As a contrived example let’s log inbound ICMP packets at a maximum rate of 1 per second:

-A INPUT -m limit --limit 1/s -p icmp -j LOG --log-level 7 --log-prefix "ICMP: "

The Problem ^

If you have logging rules in your firewall then they’ll log to your kernel log, which is available at /dev/kmsg. The dmesg command displays the contents of /dev/kmsg but /dev/kmsg is a fixed size circular buffer, so after a while your firewall logs will crowd out every other thing.

On a modern systemd system this stuff does get copied to the journal, so if you set that to be persistent then you can keep the kernel logs forever. Or you can additionally run a syslog daemon like rsyslogd, and have that keep things forever.

Either way though your dmesg or journalctl -k commands are only going to display the contents of the kernel’s ring buffer which will be a limited amount.

I’m not that interested in firewall logs. They’re nice to have and very occasionally valuable when debugging something, but most of the time I’d rather they weren’t in my kernel log.

An answer: ulogd2 ^

One answer to this problem is ulogd2. ulogd2 is a userspace logging daemon into which you can feed netfilter data and have it log it in a flexible way, to multiple different formats and destinations.

I actually already use it to log certain firewall things to a MariaDB database for monitoring purposes, but you can also emit plain text, JSON, netflow and all manner of things. Since I’m already running it I decided to switch my general firewall logging to it.

Configuring ulogd2 ^

I added the following to /etc/ulogd.conf:

# This one for logging to local file in emulated syslog format.

I already had a stack called log1 for logging to MariaDB, so I called the new one log2 with its output being emu1.

The log2 section can then be told to expect messages from netfilter group 2. Don’t worry about this, just know that this is what you refer to in your firewall rules, and you can’t use group 0 because that’s used for something else.

The emu1 section then says which file to write this stuff to.

That’s it. Restart the daemon.

Configuring iptables ^

Now it’s time to make iptables log to netfilter group 2 instead of its normal LOG target. As a reminder, here’s what the rule was like before:

-A INPUT -m limit --limit 1/s -p icmp -j LOG --log-level 7 --log-prefix "ICMP: "

And here’s what you’d change it to:

-A INPUT -m limit --limit 1/s -p icmp -j NFLOG --nflog-group 2 --nflog-prefix "ICMP:"

The --nflog-group 2 needs to match what you put in /etc/ulogd.conf.

You’re now logging with ulogd2 and none of this will be going to the kernel log buffer. Don’t forget to rotate the new log file! Or maybe you’d like to play with logging this as JSON or into a SQLite DB?