PowerDNS Truncated SOA Response Problem

I recently upgraded bind9 on my primary nameserver and soon after I noticed that one particular zone would no longer transfer to my secondary nameservers, which run PowerDNS. All the PowerDNS servers were saying:

Nov 18 00:25:26 daiquiri pdns_server[32452]: While checking domain freshness: Query to '2001:ba8:1f1:f085::53' for SOA of 'example.com' did not return a SOA

The confusing thing was that manually using dig to query for this did work fine:

daiquiri$ dig +short -t soa example.com @2001:ba8:1f1:f085::53
ns0.example.com. bind.example.com. 1668670704 28800 14400 3600000 86400

After scratching my head for several hours over this yesterday, I eventually broke out tcpdump and was surprised to see that the response to PowerDNS’s SOA query was indeed empty. And it was also truncated!

Back to dig, I could see that this zone was DNSSEC-signed and the SOA query with DNSSEC info was 2293 bytes in size:

daiquiri$ dig +dnssec -t soa example.com @2001:ba8:1f1:f085::53 | grep MSG
;; MSG SIZE  rcvd: 2293

That’s bigger than a DNS response can be in UDP, so it truncates and the client is supposed to retry over TCP. dig has no problem doing that, but PowerDNS can’t (yet).

Specifically what has changed in bind9 is the EDNS buffer size, down from its previous default of 4096 bytes to 1232 bytes.

I can stop PowerDNS from doing the SOA check at all by upgrading all PowerDNS servers to v4.7.x and using the secondary-check-signature-freshness=no option.

I could put bind9’s EDNS buffer size back up to 4096, but it doesn’t seem advisable to go over about 1400 bytes and so that won’t help.

For now I have enabled the minimal-responses option in bind9, which drops extra records from the Authority and Additional sections of responses unless they are absolutely required. This reduces the response size of that SOA query to 685 bytes, so it no longer truncates and PowerDNS is happy.

I’m not sure if an SOA response can ever go above 1232 bytes now. Maybe as DNSSEC signatures get bigger. So this might not be a permanenet solution and hopefully PowerDNS will gain the ability to retry those SOA queries over TCP.

Blood pressure and glucose aggravation

I haven’t really been taking good care of myself which is highly inadvisable as someone with diabetes, but last week I had an extremely high blood pressure reading and so this cannot go on. At the same time I had a very high blood glucose reading although that wasn’t a surprise to me as I’ve long found it difficult to control this.

I have actually been taking my diet and level of activity half seriously and as a result I am currently at my lowest weight in over 25 years, though still at the lower end of “obese” by BMI standards. To put that into some context, in 2006 when first diagnosed with diabetes I was almost 140kg.

Anyway since my weight while not ideal is better than it’s ever been, the high blood pressure was even more of a worry. I was concerned that my kidneys might have packed up or something. Happily the result of yesterday’s bloods had my GP saying that my kidney and liver function were “perfect” (his words), which is very relieving but does leave me wondering what else I can do.

For now the GP has prescribed me some pills for blood pressure and told me to come back in a month for another blood test and blood pressure check, so it seems like he isn’t overly concerned. With the blood pressure as high as it is I was seriously wondering if he was going to call me with the results and say “go to hospital now”.

Clearly even though I thought I was doing okay with the weight loss I am going to have to step things up a bit.

Also I am hopelessly addicted to (sugar free) fizzy drinks but I’m going to have to do something about this. Although there’s no proven link to high blood pressure, it is a bit more likely that the artificial sweeteners are going to be playing havoc with my blood sugar levels and appetite.

My habit was at the ridiculous level of over 2 litres per day but since last week’s shock I’ve started by instituting a policy of one full pint of water between any fizzy drink. Small steps but I don’t feel like I can do cold turkey.

What I have found after sticking to this policy for the last 9 days is that I often can’t physically ingest any more liquid so I don’t reach for the fizzy drink, and that it really does seem to have reduced my appetite as well. Today I had my first fizzy drink of the day with dinner, whereas before I might have had 2 litres already by that time. At the moment I’m at around 500ml a day. I hope I can keep to something like that.

The caffeine withdrawal is not pleasant. I don’t think it is a good idea to try to find another caffeine source until my blood pressure is under control. By then I might not feel the need.

If I don’t get the blood pressure under control then my near future will feature a stroke or heart attack. If I don’t get the blood sugar under control then my near future will include insulin injections.

Exim: Adding the Autonomous System Number as a header in received emails



  • Added a bit about timeouts, as concern was expressed that I am “bonkers”.

The Problem

For statistical purposes I wanted to add the Autonomous System Number (ASN) for the IP address of the connecting host as a header in the received email, like this:

X-ASN: AS63949 2a01:7e01::/32

The Answer

You can obtain this information through a DNS query to Team Cymru:

$ sipcalc -r 2a01:7e01::f03c:92ff:fe32:a408
-[ipv6 : 2a01:7e01::f03c:92ff:fe32:a408] - 0
Reverse DNS (ip6.arpa)  -
$ dig +short -t txt 8.0.4.a.2.3.e.f.f.f.2.9.c.3.0.f.
"63949 | 2a01:7e01::/32 | US | ripencc | 2011-02-01"

Or for legacy Internet addresses:

$ dig +noall +question -x
;   IN      PTR
$ dig +short -t txt
"13414 | | US | arin | 2010-11-23"

So for IPv6 addresses the process is:

  1. Expand the address out fully (2a01:7e01::f03c:92ff:fe32:a4082a01:7e01:0000:0000:f03c:92ff:fe32:a408)
  2. Remove the colons (2a01:7e01:0000:0000:f03c:92ff:fe32:a4082a017e0100000000f03c92fffe32a408)
  3. Reverse it (2a017e0100000000f03c92fffe32a408804a23efff29c30f0000000010e710a2)
  4. Add a dot after every hexadecimal number (804a23efff29c30f0000000010e710a28.0.4.a.2.3.e.f.f.f.2.9.c.3.0.f.
  5. Add origin6.asn.cymru.com on the end (8.0.4.a.2.3.e.f.f.f.2.9.c.3.0.f.
  6. Query that TXT record and parse out the first two values separated by ‘|’ in the response.

For legacy IP addresses the process is much simpler; reverse the octets, add origin.asn.cymru.com on the end and query that.

An Exim Answer

In Exim configuration you can do it like this:

(This is meant to go inside an ACL like your check_rcpt or check_data. Maybe near the end of check_data at the point where you’ve already decided to accept the email. No point in doing this for an email you will reject.)

# Add X-ASN: header for IPv6 senders.
  warn message = X-ASN: AS${sg{${extract{1}{|}{$acl_m9}}}{\N\s+\N}{}} ${sg{${extract{2}{{|}{$acl_m9}}}{\N\s+\N}{}}
     condition = ${if isip6{$sender_host_address}}
    set acl_m9 = ${lookup dnsdb{txt=${reverse_ip:$sender_host_address}.origin6.asn.cymru.com}}
# Add X-ASN: header for legacy IP senders.
  warn message = X-ASN: AS${sg{${extract{1}{|}{$acl_m9}}}{\N\s+\N}{}} ${sg{${extract{2}{{|}{$acl_m9}}}{\N\s+\N}{}}
     condition = ${if isip4{$sender_host_address}}
    set acl_m9 = ${lookup dnsdb{txt=${reverse_ip:$sender_host_address}.origin.asn.cymru.com}}

I dislike that I’ve had to use two tests that are almost exactly the same except they query slightly different DNS names (origin6.asn.cymru.com vs origin.asn.cymru.com). I’m sure it could be done in one, but I’m not good enough with the Exim string evaluations. They send me cross-eyed. I couldn’t find a better way so I decided to use the time-honoured tactic of posting what I have in order to provoke people into correcting me. Please let me know if you can improve it!

The amount of nested {} will probably drive you mad, but basically:

  • ${reverse_ip:$sender_host_address} handles expanding and reversing an IP address into the form you would use for a reverse DNS query.
  • That gets queried in DNS with the correct suffix and the full response stored in $acl_m9.
  • warn message = X-ASN: adds a header to the email, the content of which is built from two fields extracted out of $acl_m9 with all whitespace removed (${sg{source}{regex}{replacement}}).

What about timeouts?

One piece of feedback I got was that I am “bonkers” to make my email delivery rely on a real time network lookup. I can kind of see the argument, but also not: this is a DNS query exactly the same as a typical DNSBL query (Team Cymru IP-to-ASN service is used exactly like a typical DNSBL).

Most people’s mail servers do multiple DNSBL queries already and nobody really is up in arms saying it’s bonkers to do so. My Exim already does a couple of DNSBL queries and then if it is going to deliver the email it will call out to SpamAssassin which does many DNSBL queries itself. If these hit a timeout then it would slow down my mail delivery.

In the past where a DNSBL has unceremoniously shut down and made its nameservers unresponsive I have seen problems, as it caused the delivery processes to pile up while they waited on their timeouts and then Exim would complain that there’s too many processes. That would be resolved by removing the errant DNSBL(s) from the configuration.

Query load is not a concern as DNS is highly scalable and my system is not going to add noticeable load to Team Cymru’s already public service. The SpamAssassin ASN plugin is already out there, hard coded to use this same service and must have many many users already.

As far as I can tell, in Exim dnsdb queries use the same timeouts and retries as dnslist queries do, that being controlled by the dns_retrans and dns_rety settings. These settings both default to 0, which means “operating system / resolver library default”. If you were worried you could explicitly set these to their minimum value:

If still worried then you would first have to either turn off all DNSBLs or make sure you had local copies of them (e.g. by arranging AXFR to your own local servers). Then to do the IP-to-ASN locally you’d arrange to have a local BGP feed that you could query. I think you’d need to have an absolutely huge mail server before these issues became real concerns.

set acl_m9 = ${lookup dnsdb{retrans_1s,retry_1,txt=${reverse_ip:$sender_host_address}.origin6.asn.cymru.com}}

As for dnslist, the consequence of a time out is that you get no data, so it would just result in an empty header.

But Why?

I’ve actually been doing this for a while with SpamAssassin’s ASN plugin but I’ve changed the way in which I query SpamAssassin and now I don’t directly get the rewritten email that SpamAssassin makes (with its X-Spam-ASN: header in).

I use it for feeding into Bayes to see if there’s a particular prevalence of ASN amongst the email that is classified as spam, and I sometimes add a few points on manually for ASNs that are particularly bad. That is a lot less work than trying to track down all their IP addresses and keep that up to date.

Using Duplicity to back up to Amazon S3 over IPv6 (only)


I have a server that I use for making backups. I also send backups from that server into Amazon S3 at the “Infrequent Access” storage class. That class is cheaper to store but expensive to access. It’s intended for backups of last resort that you only access in an emergency. I use Duplicity to handle the S3 part.

(I could save a bit more by using one of the “Glacier” classes but at the moment the cost is minimal and I’m not that brave.)

I recently decided to change which server I use for the backups. I noticed that renting a server with only IPv6 connectivity was cheaper, and as all the hosts I back up have IPv6 connectivity I decided to give that a go.

This mostly worked fine. The only thing I really noticed was when I tried to install some software from GitHub. GitHub doesn’t support IPv6, so I had to piggy back that download through another host.

Then I came to set up Duplicity again and found that I needed to make some non-obvious changes to make it work with S3 over IPv6-only.

S3 endpoint

The main issue is that the default S3 endpoint URL is https://s3.<region>.amazonaws.com, and this host only has an A (IPv4) record! For example:

$ host s3.us-east-1.amazonaws.com
s3.us-east-1.amazonaws.com has address

If you run Duplicity with a target like s3://yourbucketname/path/to/backup then it will try that endpoint, get only an IPv4 address, and return Network unreachable.

S3 does actually support IPv6, but for that to work you need to use a dual stack endpoint! They look like this:

$ host s3.dualstack.us-east-1.amazonaws.com
s3.dualstack.us-east-1.amazonaws.com has address
s3.dualstack.us-east-1.amazonaws.com has IPv6 address 2600:1fa0:80dc:5101:34d9:451e::

So we need to specify the S3 endpoint to use.

Specifying the S3 endpoint

In order to do this you need to switch Duplicity to the “boto3” backend. Assuming you’ve installed the correct package (python3-boto3 on Debian), this is as simple as changing the target from s3://… to boto3+s3://….

That then allows you to use the command line arguments --s3-region-name and --s3-endpoint-url so you can tell it which host to talk to. That ends up giving you both an IPv4 and an IPv6 address and your system correctly chooses the IPv6 one.

The full script

The new, working script now looks something like this:

export PASSPHRASE="highlysecret"
export AWS_ACCESS_KEY_ID="notquiteassecret"
export AWS_SECRET_ACCESS_KEY="extremelysecret"
# Somewhere with plenty of free space.
export TMPDIR=/var/tmp
duplicity --encrypt-key ABCDEF0123456789 \
          --asynchronous-upload \
          -v 4 \
          --archive-dir=/path/to/your/duplicity/archives \
          --s3-use-ia \
          --s3-use-multiprocessing \
          --s3-use-new-style \
          --s3-region-name "us-east-1" \
          --s3-endpoint-url "https://s3.dualstack.us-east-1.amazonaws.com" \
          incr \
          --full-if-older-than 30D \
          /stuff/you/want/backed/up \

The previous version of the script looked a bit like:

# All the exports stayed the same
duplicity --encrypt-key ABCDEF0123456789 \
          --asynchronous-upload \
          -v 4 \
          --archive-dir=/path/to/your/duplicity/archives \
          --s3-use-ia \
          --s3-use-multiprocessing \
          incr \
          --full-if-older-than 30D \
          /stuff/you/want/backed/up \

Building BitFolk’s Rescue VM


BitFolk‘s Rescue VM is a live system based on the Debian Live project. You boot it, it finds its root filesystem over read-only NFS, and then it mounts a unionfs RAM disk over that so that you can make changes (e.g. install packages) that don’t persist. People generally use it to repair broken operating systems, reset root passwords etc.

Every few years I have to rebuild it, because it’s important that it’s new enough to be able to effectively poke around in guest filesystems. Each time I have to try to remember how I did it. It’s not that difficult but it’s well past time that I document how it’s done.

Basic concept of Debian Live

The idea is that everything under the config/ directory of your build area is either

  • a set of configuration options for the process itself,
  • some files to put in the image,
  • some scripts to run while building the image, or
  • some scripts to run while booting the image.

Install packages

Pick a host running at least the latest Debian stable. It might be possible to build a live image for a newer version of Debian, but the live-build system and its dependencies like debootstrap might end up being too old.

$ sudo apt install live-build live-boot live-config

Prepare the work directory

$ sudo mkdir -vp /srv/lb/auto
$ cd /srv/lb

Main configuration

All of these config options are described in the lb_config man page.

$ sudo tee auto/config >/dev/null <<'_EOF_'
set -e
lb config noauto \
    --architectures                     amd64 \
    --distribution                      bullseye \
    --binary-images                     netboot \
    --archive-areas                     main \
    --apt-source-archives               false \
    --apt-indices                       false \
    --backports                         true \
    --mirror-bootstrap                  "$main_mirror" \
    --mirror-chroot-security            "$sec_mirror" \
    --mirror-binary                     "$main_mirror" \
    --mirror-binary-security            "$sec_mirror" \
    --memtest                           none \
    --net-tarball                       true \

The variables at the top just save me having to repeat myself for all the mirrors. They make both the build process and the resulting image use BitFolk’s apt-cacher to proxy the deb.debian.org mirror.

I’m not going to describe every config option as you can just look them up in the man page. The most important one is --binary-images netboot to make sure it builds an image that can be booted by network.

Extra packages

There’s some extra packages I want available in the rescue image. Here’s how to get them installed.

$ sudo tee config/package-lists/bitfolk_rescue.list.chroot > /dev/null <<_EOF_

Installing a backports kernel

I want the rescue system to be Debian 11 (bullseye), but with a bullseye-backports kernel.

We already used --backports true to make sure that we have access to the backports package mirrors but we need to run a script hook to actually install the backports kernel in the image while it’s being built.

$ sudo tee config/hooks/live/9000-install-backports-kernel.hook.chroot >/dev/null <<'_EOF_'
set -e
apt -y install -t bullseye-backports linux-image-amd64
apt -y purge -t bullseye linux-image-amd64
apt -y purge -t bullseye 'linux-image-5.10.*'

Set a static /etc/resolv.conf

This image will only be booted on one network where I know what the nameservers are, so may as well statically override them. If you were building an image to use on different networks you’d probably instead want to use one of the public resolvers or accept what DHCP gives you.

$ sudo tee config/includes.chroot/etc/resolv.conf >/dev/null <<_EOF_

Set an explanatory footer text in /etc/issue.footer

The people using this rescue image don’t necessarily know what it is and how to use it. I take the opportunity to put some basic info in the file /etc/issue.footer in the image, which will later end up in the real /etc/issue

$ sudo tee config/includes.chroot/etc/issue.footer >/dev/null <<_EOF_
BitFolk Rescue Environment - https://tools.bitfolk.com/wiki/Rescue
Blah blah about what this is and how to use it

Set a random password at boot

By default a Debian Live image has a user name of “user” and a password of “live“. This isn’t suitable for a networked service that will have sshd active from the start, so we will install a hook script that sets a random password. This will be run near the end of the image’s boot process.

$ sudo tee config/includes.chroot/lib/live/config/2000-passwd >/dev/null <<'_EOF_'
set -e
echo -n " random-password "
NEWPASS=$(/usr/bin/pwgen -c -N 1)
printf "user:%s\n" "$NEWPASS" | chpasswd
    printf "****************************************\n";
    printf "Resetting user password to random value:\n";
    printf "\t${RED}New user password:${NORMAL} %s\n" "$NEWPASS";
    printf "****************************************\n";
    cat /etc/issue.footer
} >> /etc/issue

This script puts the random password and the footer text into the /etc/issue file which is displayed above the console login prompt, so the user can see what the password is.

Fix initial networking setup

This one’s a bit unfortunate and is a huge hack, but I’m not sure enough of the details to report a bug yet.

The live image when booted is supposed to be able to set up its network by a number of different ways. DHCP would be the most sensible for an image you take with you to different networks.

The BitFolk Rescue VM is only ever booted in one network though, and we don’t use DHCP. I want to set static networking through the ip=…s syntax of the kernel command line.

Unfortunately it doesn’t seem to work properly with live-boot as shipped. I had to hack the /lib/live/boot/9990-networking.sh file to make it parse the values out of the kernel command line.

Here’s a diff. Copy /lib/live/boot/9990-networking.sh to config/includes.chroot/usr/lib/live/boot/9990-networking.sh and then apply that patch to it.

It’s simple enough that you could probably edit it by hand. All it does is comment out one section and replace it with some bits that parse IP setup out of the $STATICIP variable.

Fix the shutdown process

Again this is a horrible hack and I’m sure there is a better way to handle it, but I couldn’t work out anything better and this works.

This image will be running with its root filesystem on NFS. When a shutdown or halt command is issued however, systemd seems extremely keen to shut off the network as soon as possible. That leaves the shutdown process unable to continue because it can’t read or write its root filesystem any more. The shutdown process stalls forever.

As this is a read-only system with no persistent state I don’t care how brutal the shutdown process is. I care more that it does actually shut down. So, I have added a systemd service that issues systemctl –force –force poweroff any time that it’s about to shut down by any means.

$ sudo tee config/includes.chroot/etc/systemd/system/always-brutally-poweroff.service >/dev/null <<_EOF_
Description=Every kind of shutdown will be brutal poweroff
ExecStart=/usr/bin/systemctl --force --force poweroff

And to force it to be enabled at boot time:

$ sudo tee config/includes.chroot/etc/rc.local >/dev/null <<_EOF_
set -e
systemctl enable always-brutally-poweroff

Build it

At last we’re ready to build the image.

$ sudo lb clean && sudo lb config && sudo lb build

The “lb clean” is there because you probably won’t get this right first time and will want to iterate on it.

Once complete you’ll find the files to put on your NFS server in binary/ and the kernel and initramfs to boot on your client machine in tftpboot/live/

$ sudo rsync -av binary/ my.nfs.server:/srv/rescue/

Booting it

The details of exactly how I boot the client side (which in BitFolk’s case is a customer VM) are out of scope here, but this is sort of what the kernel command line looks like on the client (normally all on one line):



Get root filesystem from NFS.
Static IP configuration on kernel command line. Separated by colons:

  • Client’s IP
  • NFS server’s IP
  • Default gateway
  • Netmask
  • Host name
Host name.
Where to mount root from on NFS server.
NFS client options to use.
Tell live-boot that this is a live image.

Look for persistent data.

In action

Here’s an Asciinema of this image in action.


There’s a few things in here which are hacks. What I have works but no doubt I am doing some things wrong. If you know better please do let me know in comments or whatever. Ideally I’d like to stick with Debian Live though because it’s got a lot of problems solved already.

btrfs compression wins

Some quite good btrfs compression results from my backup hosts (which back up customer data).

Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       64%       68G         105G         1.2T
none       100%       24G          24G         434G
zlib        54%       43G          80G         797G
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       74%       91G         123G         992G
none       100%       59G          59G         599G
lzo         50%       32G          63G         393G
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       73%       16G          22G         459G
none       100%       12G          12G         269G
lzo         40%      4.1G          10G         190G
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       71%      105G         148G         1.9T
none       100%       70G          70G         910G
zlib        40%       24G          60G         1.0T
lzo         58%       10G          17G          17G

So that’s 398G that takes up 280G, a 29.6% reduction.

The “none” type is incompressible files such as media that’s already compressed. I started off with lzo compression but I’m switching to zlib now as it compresses more and this data is rarely accessed so I’m not too concerned about performance. I need newer kernels on these before I can try zstd.

I’ve had serious concerns about btrfs before based on issues I’ve had using it at home, but these were mostly around multiple device usage. Here they get a single block device that has redundancy underneath so the only remotely interesting thing that btrfs is doing here is the compression.

Might try some offline deduplication next.

Resolving a sector offset to a logical volume

The Problem

Sometimes Linux logs interesting things with sector offsets. For example:

Jul 23 23:11:19 tanqueray kernel: [197925.429561] sg[22] phys_addr:0x00000015bac60000 offset:0 length:4096 dma_address:0x00000012cf47a000 dma_length:4096
Jul 23 23:11:19 tanqueray kernel: [197925.430323] sg[23] phys_addr:0x00000015bac5d000 offset:0 length:4608 dma_address:0x00000012cf47b000 dma_length:4608
Jul 23 23:11:19 tanqueray kernel: [197925.431052] sg[24] phys_addr:0x00000015bac5e200 offset:512 length:3584 dma_address:0x00000012cf47c200 dma_length:3584
Jul 23 23:11:19 tanqueray kernel: [197925.431824] sg[25] phys_addr:0x00000015bac2e000 offset:0 length:4096 dma_address:0x00000012cf47d000 dma_length:4096
Jul 23 23:11:19 tanqueray kernel: [197925.434447] Invalid SGL for payload:131072 nents:32
Jul 23 23:11:19 tanqueray kernel: [197925.454419] blk_update_request: I/O error, dev nvme0n1, sector 509505343 op 0x1:(WRITE) flags 0x800 phys_seg 32 prio class 0
Jul 23 23:11:19 tanqueray kernel: [197925.464644] md/raid1:md5: Disk failure on nvme0n1p5, disabling device.
Jul 23 23:11:19 tanqueray kernel: [197925.464644] md/raid1:md5: Operation continuing on 1 devices.

What is at sector 509505343 of /dev/nvme0n1p5 anyway? Well, that’s part of an md array and then on top of that is an lvm physical volume, which has a number of logical volumes.

I’d like to know which logical volume sector 509505343 of /dev/nvme0n1p5 corresponds to.

At the md level

Thankfully this is a RAID-1 so every device in it has the exact same layout.

$ grep -A 2 ^md5 /proc/mdstat 
md5 : active raid1 nvme0n1p5[0] sda5[1]
      3738534208 blocks super 1.2 [2/2] [UU]
      bitmap: 2/28 pages [8KB], 65536KB chunk

The superblock format of 1.2 also means that the RAID metadata is at the end of each device, so there is no offset there to worry about.

For all intents and purposes sector 509505343 of /dev/nvme0n1p5 is the same as sector 509505343 of /dev/md5.

If I’d been using a different RAID level like 5 or 6 then this would have been far more complicated as the data would have been striped across multiple devices at different offsets, together with parity. Some layouts of Linux RAID-10 would also have different offsets.

At the lvm level

LVM has physical volumes (PVs) that are split into extents, then one or more ranges of one or more extents make up a logical volume (LV). The physical volumes are just the underlying device, so in my case that’s /dev/md5.

Offset into the PV

LVM has some metadata at the start of the PV, so we first work out how far into the PV the extents can start:

$ sudo pvs --noheadings -o pe_start --units s /dev/md5

So, sector 509505343 is actually 509503295 sectors into this PV, because the first 2048 sectors are reserved for metadata.

How big is an extent?

Next we need to know how big an LVM extent is.

$ sudo pvdisplay --units s /dev/md5 | grep 'PE Size'
  PE Size               8192 Se

There’s 8192 sectors in each of the extents in this PV, so this sector is inside extent number 509503295 / 8192 = 62195.22644043.

It’s fractional because naturally the sector is not on an exact PE boundary. If I need to I could work out from the remainder how many sectors into PE 62195 this is, but I’m only interested in the LV name and each LV has an integer number of PEs, so that’s fine: PE 62195.

Look at the PV’s mappings

Now you can dump out a list of mappings for the PV. This will show you what each range of extents corresponds to. Note that there might be multiple ranges for an LV if it’s been grown later on.

$ sudo pvdisplay --maps /dev/md5 | grep -A1 'Physical extent'
  Physical extent 58934 to 71733:
    Logical volume      /dev/myvg/domu_backup4_xvdd
  Physical extent 71734 to 912726:

So, extent 62195 is inside /dev/myvg/domu_backup4_xvdd.

What’s going on here then?

I’m not sure, but there appears to be a kernel bug and it’s probably got something to do with the fact that this LV is a disk with an unaligned partition table:

$ sudo fdisk -u -l /dev/myvg/domu_backup4_xvdd
Disk /dev/myvg/domu_backup4_xvdd: 50 GiB, 53687091200 bytes, 104857600 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0x07c7ce4c

Device Boot Start End Sectors Size Id Type
/dev/myvg/domu_backup4_xvdd1 63 104857599 104857537 50G 83 Linux
Partition 1 does not start on physical sector boundary.

The Linux NVMe driver can only do IO in multiples of 4096 bytes. As seen in the initial logs, two of the requests were for 4608 and 3584 bytes respectively; these are not divisible by 4096 and thus hit a WARN().

Jul 23 23:11:19 tanqueray kernel: [197925.430323] sg[23] phys_addr:0x00000015bac5d000 offset:0 length:4608 dma_address:0x00000012cf47b000 dma_length:4608
Jul 23 23:11:19 tanqueray kernel: [197925.431052] sg[24] phys_addr:0x00000015bac5e200 offset:512 length:3584 dma_address:0x00000012cf47c200 dma_length:3584

Going further: finding the file

I’m not interested in doing this because it’s fairly likely that it’s because of the offset partition and many kinds of IO to it will cause this.

If you did want to though, you’d first have to look at the partition table to see where your filesystem starts. 0.22644043 * 8192 = 1855 sectors into the disk. Partition 1 starts at 63, so this file is at 1792 sectors.

You can then (for ext4) use debugfs to poke about and see which file that corresponds to.

Keeping firewall logs out of Linux’s kernel log with ulogd2

A few words about iptables vs nft

nftables is the new thing and iptables is deprecated, but I haven’t found time to convert everything to nft rules syntax yet.

I’m still using iptables rules but it’s the iptables frontend to nftables. All of this works both with legacy iptables and with nft but with different syntax.

Logging with iptables

As a contrived example let’s log inbound ICMP packets at a maximum rate of 1 per second:

-A INPUT -m limit --limit 1/s -p icmp -j LOG --log-level 7 --log-prefix "ICMP: "

The Problem

If you have logging rules in your firewall then they’ll log to your kernel log, which is available at /dev/kmsg. The dmesg command displays the contents of /dev/kmsg but /dev/kmsg is a fixed size circular buffer, so after a while your firewall logs will crowd out every other thing.

On a modern systemd system this stuff does get copied to the journal, so if you set that to be persistent then you can keep the kernel logs forever. Or you can additionally run a syslog daemon like rsyslogd, and have that keep things forever.

Either way though your dmesg or journalctl -k commands are only going to display the contents of the kernel’s ring buffer which will be a limited amount.

I’m not that interested in firewall logs. They’re nice to have and very occasionally valuable when debugging something, but most of the time I’d rather they weren’t in my kernel log.

An answer: ulogd2

One answer to this problem is ulogd2. ulogd2 is a userspace logging daemon into which you can feed netfilter data and have it log it in a flexible way, to multiple different formats and destinations.

I actually already use it to log certain firewall things to a MariaDB database for monitoring purposes, but you can also emit plain text, JSON, netflow and all manner of things. Since I’m already running it I decided to switch my general firewall logging to it.

Configuring ulogd2

I added the following to /etc/ulogd.conf:

# This one for logging to local file in emulated syslog format.

I already had a stack called log1 for logging to MariaDB, so I called the new one log2 with its output being emu1.

The log2 section can then be told to expect messages from netfilter group 2. Don’t worry about this, just know that this is what you refer to in your firewall rules, and you can’t use group 0 because that’s used for something else.

The emu1 section then says which file to write this stuff to.

That’s it. Restart the daemon.

Configuring iptables

Now it’s time to make iptables log to netfilter group 2 instead of its normal LOG target. As a reminder, here’s what the rule was like before:

-A INPUT -m limit --limit 1/s -p icmp -j LOG --log-level 7 --log-prefix "ICMP: "

And here’s what you’d change it to:

-A INPUT -m limit --limit 1/s -p icmp -j NFLOG --nflog-group 2 --nflog-prefix "ICMP:"

The --nflog-group 2 needs to match what you put in /etc/ulogd.conf.

You’re now logging with ulogd2 and none of this will be going to the kernel log buffer. Don’t forget to rotate the new log file! Or maybe you’d like to play with logging this as JSON or into a SQLite DB?

rsync and sudo without X forwarding

Five years ago I wrote about how to do rsync as root on both sides. That solution required using ssh-askpass which in turn requires X forwarding.

The main complication here is that sudo on the remote side is going to ask for a password, which either requires an interactive terminal or a forwarded X session.

I thought I would mention that if you’ve disabled tty_tickets in the sudo configuration then you can “prime” the sudo authentication with some harmless command and then do the real rsync without it asking for a sudo password:

local$ ssh -t you@remote.example.com sudo whoami
[sudo] password for you: 
local$ sudo rsync --rsync-path="sudo rsync" -av --delete \ 
  you@remote.example.com:/etc/secret/ /etc/secret/

This suggestion was already supplied as a comment on the earlier post five years ago, but I keep forgetting it.

I suggest this is only for ad hoc commands and not for automation. For automation you need to find a way to make sudo not ever ask for a password, and some would say to add configuration to sudo with a NOPASSWD directive to accomplish that.

I would instead suggest allowing a root login by ssh using a public key that is only for the specific purpose, as you can lock it down to only ever be able to execute that one script/program.

Also bear in mind that if you permanently allow “host A” to run rsync as root with unrestricted parameters on “host B” then a compromise of “host A” is also a compromise of “host B”, as full write access to filesystem is granted. Whereas if you only allow “host A” to run a specific script/program on “host B” then you’ve a better chance of things being contained.

grub-install: error: embedding is not possible, but this is required for RAID and LVM install

The Initial Problem

The recent security update of the GRUB bootloader did not want to install on my fileserver at home:

$ sudo apt dist-upgrade
Reading package lists... Done
Building dependency tree
Reading state information... Done
Calculating upgrade... Done
The following packages will be upgraded:
  grub-common grub-pc grub-pc-bin grub2-common
4 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
Need to get 4,067 kB of archives.
After this operation, 72.7 kB of additional disk space will be used.
Do you want to continue? [Y/n]
Setting up grub-pc (2.02+dfsg1-20+deb10u4) ...
Installing for i386-pc platform.
grub-install: warning: your core.img is unusually large.  It won't fit in the embedding area.
grub-install: error: embedding is not possible, but this is required for RAID and LVM install.
Installing for i386-pc platform.
grub-install: warning: your core.img is unusually large.  It won't fit in the embedding area.
grub-install: error: embedding is not possible, but this is required for RAID and LVM install.
Installing for i386-pc platform.
grub-install: warning: your core.img is unusually large.  It won't fit in the embedding area.
grub-install: error: embedding is not possible, but this is required for RAID and LVM install.
Installing for i386-pc platform.
grub-install: warning: your core.img is unusually large.  It won't fit in the embedding area.
grub-install: error: embedding is not possible, but this is required for RAID and LVM install.

Four identical error messages, because this server has four drives upon which the operating system is installed, and I’d decided to do a four way RAID-1 of a small first partition to make up /boot. This error is coming from grub-install.

Ancient History

This system came to life in 2006, so it’s 15 years old. It’s always been Debian stable, so right now it runs Debian buster and during those 15 years it’s been transplanted into several different iterations of hardware.

Choices were made in 2006 that were reasonable for 2006, but it’s not 2006 now. Some of these choices are now causing problems.

Aside: four way RAID-1 might seem excessive, but we’re only talking about the small /boot partition. Back in 2006 I chose a ~256M one so if I did the minimal thing of only having a RAID-1 pair I’d have 2x 256M spare on the two other drives, which isn’t very useful. I’d honestly rather have all four system drives with the same partition table and there’s hardly ever writes to /boot anyway.

Here’s what the identical partition tables of the drives /dev/sd[abcd] look like:

$ sudo fdisk -u -l /dev/sda
Disk /dev/sda: 298.1 GiB, 320069031424 bytes, 625134827 sectors
Disk model: ST3320620AS     
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x00000000
Device     Boot   Start       End   Sectors  Size Id Type
/dev/sda1  *         63    514079    514017  251M fd Linux raid autodetect
/dev/sda2        514080   6393869   5879790  2.8G fd Linux raid autodetect
/dev/sda3       6393870 625121279 618727410  295G fd Linux raid autodetect

Note that the first partition starts at sector 63, 32,256 bytes into the disk. Modern partition tools tend to start partitions at sector 2,048 (1,024KiB in), but this was acceptable in 2006 for me and worked up until a few days ago.

Those four partitions /dev/sd[abcd]1 make up an mdadm RAID-1 with metadata version 0.90. This was purposefully chosen because at the time of install GRUB did not have RAID support. This metadata version lives at the end of the member device so anything that just reads the device can pretend it’s an ext2 filesystem. That’s what people did many years ago to boot off of software RAID.

What’s Gone Wrong?

The last successful update of grub-pc seems to have been done on 7 February 2021:

$ ls -la /boot/grub/i386-pc/core.img
-rw-r--r-- 1 root root 31082 Feb  7 17:19 /boot/grub/i386-pc/core.img

I’ve got 62 sectors available for the core.img so that’s 31,744 bytes – just 662 bytes more than what is required.

The update of grub-pc appears to be detecting that my /boot partition is on a software RAID and is now including MD RAID support even though I don’t strictly require it. This makes the core.img larger than the space I have available for it.

I don’t think it is great that such a major change has been introduced as a security update, and it doesn’t seem like there is any easy way to tell it not to include the MD RAID support, but I’m sure everyone is doing their best here and it’s more important to get the security update out.

Possible Fixes

So, how to fix? It seems to me the choices are:

  1. Ignore the problem and stay on the older grub-pc
  2. Create a core.img with only the modules I need
  3. Rebuild my /boot partition

Option #1 is okay short term, especially if you don’t use Secure Boot as that’s what the security update was about.

Option #2 doesn’t seem that feasible as I can’t find a way to influence how Debian’s upgrade process calls grub-install. I don’t want that to become a manual process.

Option #3 seems like the easiest thing to do, as shaving ~1MiB off the size of my /boot isn’t going to cause me any issues.

Rebuilding My /boot

Take a backup

/boot is only relatively small so it seemed easiest just to tar it up ready to put it back later.

$ sudo tar -C /boot -cvf ~/boot.tar .

I then sent that tar file off to another machine as well, just in case the worst should happen.

Unmount /boot and stop the RAID array that it’s on

I’ve already checked in /etc/fstab that /boot is on /dev/md0.

$ sudo umount /boot
$ sudo mdadm --stop md0         
mdadm: stopped md0

At this point I would also recommend doing a wipefs -a on each of the partitions in order to remove the MD superblocks. I didn’t and it caused me a slight problem later as we shall see.

Delete and recreate first partition on each drive

I chose to use parted, but should be doable with fdisk or sfdisk or whatever you prefer.

I know from the fdisk output way above that the new partition needs to start at sector 2048 and end at sector 514,079.

$ sudo parted /dev/sda                                                             
GNU Parted 3.2
Using /dev/sda
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) unit s
(parted) rm 1
(parted) mkpart primary ext4 2048 514079s
(parted) set 1 raid on
(parted) set 1 boot on
(parted) p
Model: ATA ST3320620AS (scsi)
Disk /dev/sda: 625134827s
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags:
Number  Start     End         Size        Type     File system  Flags
 1      2048s     514079s     512032s     primary  ext4         boot, raid, lba
 2      514080s   6393869s    5879790s    primary               raid
 3      6393870s  625121279s  618727410s  primary               raid
(parted) q
Information: You may need to update /etc/fstab.

Do that for each drive in turn. When I got to /dev/sdd, this happened:

Error: Partition(s) 1 on /dev/sdd have been written, but we have been unable to
inform the kernel of the change, probably because it/they are in use.  As a result,
the old partition(s) will remain in use.  You should reboot now before making further changes.

The reason for this seems to be that something has decided that there is still a RAID signature on /dev/sdd1 and so it will try to incrementally assemble the RAID-1 automatically in the background. This is why I recommend a wipefs of each member device.

To get out of this situation without rebooting I needed to repeat my mdadm --stop /dev/md0 command and then do a wipefs -a /dev/sdd1. I was then able to partition it with parted.

Create md0 array again

I’m going to stick with metadata format 0.90 for this one even though it may not be strictly necessary.

$ sudo mdadm --create /dev/md0 \
             --metadata 0.9 \
             --level=1 \
             --raid-devices=4 \
mdadm: array /dev/md0 started.

Again, if you did not do a wipefs earlier then mdadm will complain that these devices already have a RAID array on them and ask for confirmation.

Get the Array UUID

$ sudo mdadm --detail /dev/md0
           Version : 0.90
     Creation Time : Sat Mar  6 03:20:10 2021
        Raid Level : raid1
        Array Size : 255936 (249.94 MiB 262.08 MB)
     Used Dev Size : 255936 (249.94 MiB 262.08 MB)
      Raid Devices : 4
     Total Devices : 4
   Preferred Minor : 0
       Persistence : Superblock is persistent
       Update Time : Sat Mar  6 03:20:16 2021
             State : clean
    Active Devices : 4
   Working Devices : 4
    Failed Devices : 0
     Spare Devices : 0
Consistency Policy : resync
              UUID : e05aa2fc:91023169:da7eb873:22131b12 (local to host specialbrew.localnet)            Events : 0.18
    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       2       8       33        2      active sync   /dev/sdc1
       3       8       49        3      active sync   /dev/sdd1

Change your /etc/mdadm/mdadm.conf for the updated UUID of md0:

$ grep md0 /etc/mdadm/mdadm.conf
ARRAY /dev/md0 level=raid1 num-devices=4 UUID=e05aa2fc:91023169:da7eb873:22131b12

Make a new filesystem on /dev/md0

$ sudo mkfs.ext4 -m0 -L boot /dev/md0
mke2fs 1.44.5 (15-Dec-2018)
Creating filesystem with 255936 1k blocks and 64000 inodes
Filesystem UUID: fdc611f2-e82a-4877-91d3-0f5f8a5dd31d
Superblock backups stored on blocks:
        8193, 24577, 40961, 57345, 73729, 204801, 221185
Allocating group tables: done
Writing inode tables: done
Creating journal (4096 blocks): done
Writing superblocks and filesystem accounting information: done

My /etc/fstab didn’t need a change because it mounted by device name, i.e. /dev/md0, but if yours uses UUID or label then you’ll need to update that now, too.

Mount it and put your files back

$ sudo mount /boot
$ sudo tar -C /boot -xvf ~/boot.tar

Reinstall grub-pc

$ sudo apt reinstall grub-pc
Setting up grub-pc (2.02+dfsg1-20+deb10u4) ...
Installing for i386-pc platform.
Installation finished. No error reported.
Installing for i386-pc platform.
Installation finished. No error reported.
Installing for i386-pc platform.
Installation finished. No error reported.
Installing for i386-pc platform.
Installation finished. No error reported.


You probably should reboot now to make sure it all works when you have time to fix any problems, as opposed to risking issues when you least expect it.

$ uprecords 
     #               Uptime | System                                     Boot up
     1   392 days, 16:45:55 | Linux 4.7.0               Thu Jun 14 16:13:52 2018
     2   325 days, 03:20:18 | Linux 3.16.0-0.bpo.4-amd  Wed Apr  1 14:43:32 2015
->   3   287 days, 16:03:12 | Linux 4.19.0-9-amd64      Fri May 22 12:33:27 2020     4   257 days, 07:31:42 | Linux 4.19.0-6-amd64      Sun Sep  8 05:00:38 2019
     5   246 days, 14:45:10 | Linux 4.7.0               Sat Aug  6 06:27:52 2016
     6   165 days, 01:24:22 | Linux 4.5.0-rc4-specialb  Sat Feb 20 18:18:47 2016
     7   131 days, 18:27:51 | Linux 3.16.0              Tue Sep 16 08:01:05 2014
     8    89 days, 16:01:40 | Linux 4.7.0               Fri May 26 18:28:40 2017
     9    85 days, 17:33:51 | Linux 4.7.0               Mon Feb 19 17:17:39 2018
    10    63 days, 18:57:12 | Linux 3.16.0-0.bpo.4-amd  Mon Jan 26 02:33:47 2015
1up in    37 days, 11:17:07 | at                        Mon Apr 12 15:53:46 2021
no1 in   105 days, 00:42:44 | at                        Sat Jun 19 05:19:23 2021
    up  2362 days, 06:33:25 | since                     Tue Sep 16 08:01:05 2014
  down     0 days, 14:02:09 | since                     Tue Sep 16 08:01:05 2014
   %up               99.975 | since                     Tue Sep 16 08:01:05 2014

My Kingdom For 7 Bytes

My new core.img is 7 bytes too big to fit before my original /boot:

$ ls -la /boot/grub/i386-pc/core.img
-rw-r--r-- 1 root root 31751 Mar  6 03:24 /boot/grub/i386-pc/core.img