Building BitFolk’s Rescue VM

Posted on January 28, 2022 by Andy

Contents

Overview
Basic concept of Debian Live
Install packages
Prepare the work directory
Main configuration
Extra packages
Installing a backports kernel
Set a static /etc/resolv.conf
Set an explanatory footer text in /etc/issue.footer
Set a random password at boot
Fix initial networking setup
Fix the shutdown process
Build it
Booting it
In action
Improvements

Overview ^

BitFolk‘s Rescue VM is a live system based on the Debian Live project. You boot it, it finds its root filesystem over read-only NFS, and then it mounts a unionfs RAM disk over that so that you can make changes (e.g. install packages) that don’t persist. People generally use it to repair broken operating systems, reset root passwords etc.

Every few years I have to rebuild it, because it’s important that it’s new enough to be able to effectively poke around in guest filesystems. Each time I have to try to remember how I did it. It’s not that difficult but it’s well past time that I document how it’s done.

Basic concept of Debian Live ^

The idea is that everything under the config/ directory of your build area is either

a set of configuration options for the process itself,
some files to put in the image,
some scripts to run while building the image, or
some scripts to run while booting the image.

Install packages ^

Pick a host running at least the latest Debian stable. It might be possible to build a live image for a newer version of Debian, but the live-build system and its dependencies like debootstrap might end up being too old.

$ sudo apt install live-build live-boot live-config

Prepare the work directory ^

$ sudo mkdir -vp /srv/lb/auto
$ cd /srv/lb

Main configuration ^

All of these config options are described in the lb_config man page.

$ sudo tee auto/config >/dev/null <<'_EOF_'
#!/bin/sh
 
set -e
 
cacher_prefix="apt-cacher.lon.bitfolk.com/debian"
mirror_host="deb.debian.org"
main_mirror="http://${cacher_prefix}/${mirror_host}/debian/"
sec_mirror="http://${cacher_prefix}/${mirror_host}/debian-security/"
 
lb config noauto \
    --architectures                     amd64 \
    --distribution                      bullseye \
    --binary-images                     netboot \
    --archive-areas                     main \
    --apt-source-archives               false \
    --apt-indices                       false \
    --backports                         true \
    --mirror-bootstrap                  "$main_mirror" \
    --mirror-chroot-security            "$sec_mirror" \
    --mirror-binary                     "$main_mirror" \
    --mirror-binary-security            "$sec_mirror" \
    --memtest                           none \
    --net-tarball                       true \
    "${@}"
_EOF_

The variables at the top just save me having to repeat myself for all the mirrors. They make both the build process and the resulting image use BitFolk’s apt-cacher to proxy the deb.debian.org mirror.

I’m not going to describe every config option as you can just look them up in the man page. The most important one is --binary-images netboot to make sure it builds an image that can be booted by network.

Extra packages ^

There’s some extra packages I want available in the rescue image. Here’s how to get them installed.

$ sudo tee config/package-lists/bitfolk_rescue.list.chroot > /dev/null <<_EOF_
pwgen
less
binutils
build-essential
bzip2
gnupg
openssh-client
openssh-server
perl
perl-modules
telnet
screen
tmux
rpm
_EOF_

Installing a backports kernel ^

I want the rescue system to be Debian 11 (bullseye), but with a bullseye-backports kernel.

We already used --backports true to make sure that we have access to the backports package mirrors but we need to run a script hook to actually install the backports kernel in the image while it’s being built.

$ sudo tee config/hooks/live/9000-install-backports-kernel.hook.chroot >/dev/null <<'_EOF_'
#!/bin/sh
 
set -e
 
apt -y install -t bullseye-backports linux-image-amd64
apt -y purge -t bullseye linux-image-amd64
apt -y purge -t bullseye 'linux-image-5.10.*'
_EOF_

Set a static /etc/resolv.conf ^

This image will only be booted on one network where I know what the nameservers are, so may as well statically override them. If you were building an image to use on different networks you’d probably instead want to use one of the public resolvers or accept what DHCP gives you.

$ sudo tee config/includes.chroot/etc/resolv.conf >/dev/null <<_EOF_
nameserver 85.119.80.232
nameserver 85.119.80.233
_EOF_

Set an explanatory footer text in /etc/issue.footer ^

The people using this rescue image don’t necessarily know what it is and how to use it. I take the opportunity to put some basic info in the file /etc/issue.footer in the image, which will later end up in the real /etc/issue

$ sudo tee config/includes.chroot/etc/issue.footer >/dev/null <<_EOF_
BitFolk Rescue Environment - https://tools.bitfolk.com/wiki/Rescue
 
Blah blah about what this is and how to use it
_EOF_

Set a random password at boot ^

By default a Debian Live image has a user name of “user” and a password of “live“. This isn’t suitable for a networked service that will have sshd active from the start, so we will install a hook script that sets a random password. This will be run near the end of the image’s boot process.

$ sudo tee config/includes.chroot/lib/live/config/2000-passwd >/dev/null <<'_EOF_'
#!/bin/sh
 
set -e
 
echo -n " random-password "
 
NEWPASS=$(/usr/bin/pwgen -c -N 1)
printf "user:%s\n" "$NEWPASS" | chpasswd
 
RED='\033[0;31m'
NORMAL='\033[0m'
 
{
    printf "****************************************\n";
    printf "Resetting user password to random value:\n";
    printf "\t${RED}New user password:${NORMAL} %s\n" "$NEWPASS";
    printf "****************************************\n";
    cat /etc/issue.footer
} >> /etc/issue
_EOF_

This script puts the random password and the footer text into the /etc/issue file which is displayed above the console login prompt, so the user can see what the password is.

Fix initial networking setup ^

This one’s a bit unfortunate and is a huge hack, but I’m not sure enough of the details to report a bug yet.

The live image when booted is supposed to be able to set up its network by a number of different ways. DHCP would be the most sensible for an image you take with you to different networks.

The BitFolk Rescue VM is only ever booted in one network though, and we don’t use DHCP. I want to set static networking through the ip=…s syntax of the kernel command line.

Unfortunately it doesn’t seem to work properly with live-boot as shipped. I had to hack the /lib/live/boot/9990-networking.sh file to make it parse the values out of the kernel command line.

Here’s a diff. Copy /lib/live/boot/9990-networking.sh to config/includes.chroot/usr/lib/live/boot/9990-networking.sh and then apply that patch to it.

It’s simple enough that you could probably edit it by hand. All it does is comment out one section and replace it with some bits that parse IP setup out of the $STATICIP variable.

Fix the shutdown process ^

Again this is a horrible hack and I’m sure there is a better way to handle it, but I couldn’t work out anything better and this works.

This image will be running with its root filesystem on NFS. When a shutdown or halt command is issued however, systemd seems extremely keen to shut off the network as soon as possible. That leaves the shutdown process unable to continue because it can’t read or write its root filesystem any more. The shutdown process stalls forever.

As this is a read-only system with no persistent state I don’t care how brutal the shutdown process is. I care more that it does actually shut down. So, I have added a systemd service that issues systemctl –force –force poweroff any time that it’s about to shut down by any means.

$ sudo tee config/includes.chroot/etc/systemd/system/always-brutally-poweroff.service >/dev/null <<_EOF_
[Unit]
Description=Every kind of shutdown will be brutal poweroff
DefaultDependencies=no
After=final.target
 
[Service]
Type=oneshot
ExecStart=/usr/bin/systemctl --force --force poweroff
 
[Install]
WantedBy=final.target
_EOF_

And to force it to be enabled at boot time:

$ sudo tee config/includes.chroot/etc/rc.local >/dev/null <<_EOF_
#!/bin/sh
 
set -e
 
systemctl enable always-brutally-poweroff
_EOF_

Build it ^

At last we’re ready to build the image.

$ sudo lb clean && sudo lb config && sudo lb build

The “lb clean” is there because you probably won’t get this right first time and will want to iterate on it.

Once complete you’ll find the files to put on your NFS server in binary/ and the kernel and initramfs to boot on your client machine in tftpboot/live/

$ sudo rsync -av binary/ my.nfs.server:/srv/rescue/

Booting it ^

The details of exactly how I boot the client side (which in BitFolk’s case is a customer VM) are out of scope here, but this is sort of what the kernel command line looks like on the client (normally all on one line):

root=/dev/nfs
ip=192.168.0.225:192.168.0.243:192.168.0.1:255.255.248.0:rescue
hostname=rescue
nfsroot=192.168.0.243:/srv/rescue
nfsopts=tcp
boot=live
persistent

Explained:

root=/dev/nfs

Get root filesystem from NFS.

ip=192.168.0.225:192.168.0.243:192.168.0.1:255.255.248.0:rescue

Static IP configuration on kernel command line. Separated by colons:

Client’s IP
NFS server’s IP
Default gateway
Netmask
Host name

hostname=rescue

Host name.

nfsroot=192.168.0.243:/srv/rescue

Where to mount root from on NFS server.

nfsopts=tcp

NFS client options to use.

boot=live

Tell live-boot that this is a live image.

persistent

Look for persistent data.

In action ^

Here’s an Asciinema of this image in action.

Improvements ^

There’s a few things in here which are hacks. What I have works but no doubt I am doing some things wrong. If you know better please do let me know in comments or whatever. Ideally I’d like to stick with Debian Live though because it’s got a lot of problems solved already.

btrfs compression wins

Posted on November 30, 2021November 30, 2021 by Andy

Some quite good btrfs compression results from my backup hosts (which back up customer data).

Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       64%       68G         105G         1.2T
none       100%       24G          24G         434G
zlib        54%       43G          80G         797G

Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       74%       91G         123G         992G
none       100%       59G          59G         599G
lzo         50%       32G          63G         393G

Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       73%       16G          22G         459G
none       100%       12G          12G         269G
lzo         40%      4.1G          10G         190G

Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       71%      105G         148G         1.9T
none       100%       70G          70G         910G
zlib        40%       24G          60G         1.0T
lzo         58%       10G          17G          17G

So that’s 398G that takes up 280G, a 29.6% reduction.

The “none” type is incompressible files such as media that’s already compressed. I started off with lzo compression but I’m switching to zlib now as it compresses more and this data is rarely accessed so I’m not too concerned about performance. I need newer kernels on these before I can try zstd.

I’ve had serious concerns about btrfs before based on issues I’ve had using it at home, but these were mostly around multiple device usage. Here they get a single block device that has redundancy underneath so the only remotely interesting thing that btrfs is doing here is the compression.

Might try some offline deduplication next.

Resolving a sector offset to a logical volume

Posted on July 24, 2021July 24, 2021 by Andy

Contents

The Problem
At the md level
At the lvm level

Offset into the PV
How big is an extent?
Look at the PV’s mappings

What’s going on here then?
Going further: finding the file

The Problem ^

Sometimes Linux logs interesting things with sector offsets. For example:

Jul 23 23:11:19 tanqueray kernel: [197925.429561] sg[22] phys_addr:0x00000015bac60000 offset:0 length:4096 dma_address:0x00000012cf47a000 dma_length:4096
Jul 23 23:11:19 tanqueray kernel: [197925.430323] sg[23] phys_addr:0x00000015bac5d000 offset:0 length:4608 dma_address:0x00000012cf47b000 dma_length:4608
Jul 23 23:11:19 tanqueray kernel: [197925.431052] sg[24] phys_addr:0x00000015bac5e200 offset:512 length:3584 dma_address:0x00000012cf47c200 dma_length:3584
Jul 23 23:11:19 tanqueray kernel: [197925.431824] sg[25] phys_addr:0x00000015bac2e000 offset:0 length:4096 dma_address:0x00000012cf47d000 dma_length:4096
.
.
.
Jul 23 23:11:19 tanqueray kernel: [197925.434447] Invalid SGL for payload:131072 nents:32
.
.
.
Jul 23 23:11:19 tanqueray kernel: [197925.454419] blk_update_request: I/O error, dev nvme0n1, sector 509505343 op 0x1:(WRITE) flags 0x800 phys_seg 32 prio class 0
Jul 23 23:11:19 tanqueray kernel: [197925.464644] md/raid1:md5: Disk failure on nvme0n1p5, disabling device.
Jul 23 23:11:19 tanqueray kernel: [197925.464644] md/raid1:md5: Operation continuing on 1 devices.

What is at sector 509505343 of /dev/nvme0n1p5 anyway? Well, that’s part of an md array and then on top of that is an lvm physical volume, which has a number of logical volumes.

I’d like to know which logical volume sector 509505343 of /dev/nvme0n1p5 corresponds to.

At the md level ^

Thankfully this is a RAID-1 so every device in it has the exact same layout.

$ grep -A 2 ^md5 /proc/mdstat 
md5 : active raid1 nvme0n1p5[0] sda5[1]
      3738534208 blocks super 1.2 [2/2] [UU]
      bitmap: 2/28 pages [8KB], 65536KB chunk

The superblock format of 1.2 also means that the RAID metadata is at the end of each device, so there is no offset there to worry about.

For all intents and purposes sector 509505343 of /dev/nvme0n1p5 is the same as sector 509505343 of /dev/md5.

If I’d been using a different RAID level like 5 or 6 then this would have been far more complicated as the data would have been striped across multiple devices at different offsets, together with parity. Some layouts of Linux RAID-10 would also have different offsets.

At the lvm level ^

LVM has physical volumes (PVs) that are split into extents, then one or more ranges of one or more extents make up a logical volume (LV). The physical volumes are just the underlying device, so in my case that’s /dev/md5.

Offset into the PV ^

LVM has some metadata at the start of the PV, so we first work out how far into the PV the extents can start:

$ sudo pvs --noheadings -o pe_start --units s /dev/md5
    2048S

So, sector 509505343 is actually 509503295 sectors into this PV, because the first 2048 sectors are reserved for metadata.

How big is an extent? ^

Next we need to know how big an LVM extent is.

$ sudo pvdisplay --units s /dev/md5 | grep 'PE Size'
  PE Size               8192 Se

There’s 8192 sectors in each of the extents in this PV, so this sector is inside extent number 509503295 / 8192 = 62195.22644043.

It’s fractional because naturally the sector is not on an exact PE boundary. If I need to I could work out from the remainder how many sectors into PE 62195 this is, but I’m only interested in the LV name and each LV has an integer number of PEs, so that’s fine: PE 62195.

Look at the PV’s mappings ^

Now you can dump out a list of mappings for the PV. This will show you what each range of extents corresponds to. Note that there might be multiple ranges for an LV if it’s been grown later on.

$ sudo pvdisplay --maps /dev/md5 | grep -A1 'Physical extent'
.
.
.
  Physical extent 58934 to 71733:
    Logical volume      /dev/myvg/domu_backup4_xvdd
--
  Physical extent 71734 to 912726:
    FREE

So, extent 62195 is inside /dev/myvg/domu_backup4_xvdd.

What’s going on here then? ^

I’m not sure, but there appears to be a kernel bug and it’s probably got something to do with the fact that this LV is a disk with an unaligned partition table:

$ sudo fdisk -u -l /dev/myvg/domu_backup4_xvdd Disk /dev/myvg/domu_backup4_xvdd: 50 GiB, 53687091200 bytes, 104857600 sectors Units: sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 4096 bytes I/O size (minimum/optimal): 4096 bytes / 4096 bytes Disklabel type: dos Disk identifier: 0x07c7ce4c

Device Boot Start End Sectors Size Id Type /dev/myvg/domu_backup4_xvdd1 63 104857599 104857537 50G 83 Linux Partition 1 does not start on physical sector boundary.

The Linux NVMe driver can only do IO in multiples of 4096 bytes. As seen in the initial logs, two of the requests were for 4608 and 3584 bytes respectively; these are not divisible by 4096 and thus hit a WARN().

.
.
.
Jul 23 23:11:19 tanqueray kernel: [197925.430323] sg[23] phys_addr:0x00000015bac5d000 offset:0 length:4608 dma_address:0x00000012cf47b000 dma_length:4608
Jul 23 23:11:19 tanqueray kernel: [197925.431052] sg[24] phys_addr:0x00000015bac5e200 offset:512 length:3584 dma_address:0x00000012cf47c200 dma_length:3584
.
.
.

Going further: finding the file ^

I’m not interested in doing this because it’s fairly likely that it’s because of the offset partition and many kinds of IO to it will cause this.

If you did want to though, you’d first have to look at the partition table to see where your filesystem starts. 0.22644043 * 8192 = 1855 sectors into the disk. Partition 1 starts at 63, so this file is at 1792 sectors.

You can then (for ext4) use debugfs to poke about and see which file that corresponds to.

Keeping firewall logs out of Linux’s kernel log with ulogd2

Posted on July 24, 2021 by Andy

Contents

A few words about iptables vs nft
Logging with iptables
The Problem
An answer: ulogd2

Configuring ulogd2
Configuring iptables

A few words about `iptables` vs `nft` ^

nftables is the new thing and iptables is deprecated, but I haven’t found time to convert everything to nft rules syntax yet.

I’m still using iptables rules but it’s the iptables frontend to nftables. All of this works both with legacy iptables and with nft but with different syntax.

Logging with `iptables` ^

As a contrived example let’s log inbound ICMP packets at a maximum rate of 1 per second:

-A INPUT -m limit --limit 1/s -p icmp -j LOG --log-level 7 --log-prefix "ICMP: "

The Problem ^

If you have logging rules in your firewall then they’ll log to your kernel log, which is available at /dev/kmsg. The dmesg command displays the contents of /dev/kmsg but /dev/kmsg is a fixed size circular buffer, so after a while your firewall logs will crowd out every other thing.

On a modern systemd system this stuff does get copied to the journal, so if you set that to be persistent then you can keep the kernel logs forever. Or you can additionally run a syslog daemon like rsyslogd, and have that keep things forever.

Either way though your dmesg or journalctl -k commands are only going to display the contents of the kernel’s ring buffer which will be a limited amount.

I’m not that interested in firewall logs. They’re nice to have and very occasionally valuable when debugging something, but most of the time I’d rather they weren’t in my kernel log.

An answer: `ulogd2` ^

One answer to this problem is ulogd2. ulogd2 is a userspace logging daemon into which you can feed netfilter data and have it log it in a flexible way, to multiple different formats and destinations.

I actually already use it to log certain firewall things to a MariaDB database for monitoring purposes, but you can also emit plain text, JSON, netflow and all manner of things. Since I’m already running it I decided to switch my general firewall logging to it.

Configuring `ulogd2` ^

I added the following to /etc/ulogd.conf:

# This one for logging to local file in emulated syslog format.
stack=log2:NFLOG,base1:BASE,ifi1:IFINDEX,ip2str1:IP2STR,print1:PRINTPKT,emu1:LOGEMU
 
[log2]
group=2
 
[emu1]
file="/var/log/iptables_ulogd2.log"
sync=1

I already had a stack called log1 for logging to MariaDB, so I called the new one log2 with its output being emu1.

The log2 section can then be told to expect messages from netfilter group 2. Don’t worry about this, just know that this is what you refer to in your firewall rules, and you can’t use group 0 because that’s used for something else.

The emu1 section then says which file to write this stuff to.

That’s it. Restart the daemon.

Configuring `iptables` ^

Now it’s time to make iptables log to netfilter group 2 instead of its normal LOG target. As a reminder, here’s what the rule was like before:

-A INPUT -m limit --limit 1/s -p icmp -j LOG --log-level 7 --log-prefix "ICMP: "

And here’s what you’d change it to:

-A INPUT -m limit --limit 1/s -p icmp -j NFLOG --nflog-group 2 --nflog-prefix "ICMP:"

The --nflog-group 2 needs to match what you put in /etc/ulogd.conf.

You’re now logging with ulogd2 and none of this will be going to the kernel log buffer. Don’t forget to rotate the new log file! Or maybe you’d like to play with logging this as JSON or into a SQLite DB?

rsync and sudo without X forwarding

Posted on April 10, 2021April 10, 2021 by Andy

Five years ago I wrote about how to do rsync as root on both sides. That solution required using ssh-askpass which in turn requires X forwarding.

The main complication here is that sudo on the remote side is going to ask for a password, which either requires an interactive terminal or a forwarded X session.

I thought I would mention that if you’ve disabled tty_tickets in the sudo configuration then you can “prime” the sudo authentication with some harmless command and then do the real rsync without it asking for a sudo password:

local$ ssh -t you@remote.example.com sudo whoami
[sudo] password for you: 
root
local$ sudo rsync --rsync-path="sudo rsync" -av --delete \ 
  you@remote.example.com:/etc/secret/ /etc/secret/

This suggestion was already supplied as a comment on the earlier post five years ago, but I keep forgetting it.

I suggest this is only for ad hoc commands and not for automation. For automation you need to find a way to make sudo not ever ask for a password, and some would say to add configuration to sudo with a NOPASSWD directive to accomplish that.

I would instead suggest allowing a root login by ssh using a public key that is only for the specific purpose, as you can lock it down to only ever be able to execute that one script/program.

Also bear in mind that if you permanently allow “host A” to run rsync as root with unrestricted parameters on “host B” then a compromise of “host A” is also a compromise of “host B”, as full write access to filesystem is granted. Whereas if you only allow “host A” to run a specific script/program on “host B” then you’ve a better chance of things being contained.

grub-install: error: embedding is not possible, but this is required for RAID and LVM install

Posted on March 6, 2021March 6, 2021 by Andy

Contents

The Initial Problem
Ancient History
What’s Gone Wrong?
Possible Fixes
Rebuilding My /boot

Take a backup
Unmount /boot and stop the RAID array that it’s on
Delete and recreate first partition on each drive
Create md0 array again
Get the Array UUID
Make a new filesystem on /dev/md0
Mount it and put your files back
Reinstall grub-pc
Reboot

My Kingdom For 7 Bytes

The Initial Problem ^

The recent security update of the GRUB bootloader did not want to install on my fileserver at home:

$ sudo apt dist-upgrade
Reading package lists... Done
Building dependency tree
Reading state information... Done
Calculating upgrade... Done
The following packages will be upgraded:
  grub-common grub-pc grub-pc-bin grub2-common
4 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
Need to get 4,067 kB of archives.
After this operation, 72.7 kB of additional disk space will be used.
Do you want to continue? [Y/n]
…
Setting up grub-pc (2.02+dfsg1-20+deb10u4) ...
Installing for i386-pc platform.
grub-install: warning: your core.img is unusually large.  It won't fit in the embedding area.
grub-install: error: embedding is not possible, but this is required for RAID and LVM install.
Installing for i386-pc platform.
grub-install: warning: your core.img is unusually large.  It won't fit in the embedding area.
grub-install: error: embedding is not possible, but this is required for RAID and LVM install.
Installing for i386-pc platform.
grub-install: warning: your core.img is unusually large.  It won't fit in the embedding area.
grub-install: error: embedding is not possible, but this is required for RAID and LVM install.
Installing for i386-pc platform.
grub-install: warning: your core.img is unusually large.  It won't fit in the embedding area.
grub-install: error: embedding is not possible, but this is required for RAID and LVM install.

Four identical error messages, because this server has four drives upon which the operating system is installed, and I’d decided to do a four way RAID-1 of a small first partition to make up /boot. This error is coming from grub-install.

Ancient History ^

This system came to life in 2006, so it’s 15 years old. It’s always been Debian stable, so right now it runs Debian buster and during those 15 years it’s been transplanted into several different iterations of hardware.

Choices were made in 2006 that were reasonable for 2006, but it’s not 2006 now. Some of these choices are now causing problems.

Aside: four way RAID-1 might seem excessive, but we’re only talking about the small /boot partition. Back in 2006 I chose a ~256M one so if I did the minimal thing of only having a RAID-1 pair I’d have 2x 256M spare on the two other drives, which isn’t very useful. I’d honestly rather have all four system drives with the same partition table and there’s hardly ever writes to /boot anyway.

Here’s what the identical partition tables of the drives /dev/sd[abcd] look like:

$ sudo fdisk -u -l /dev/sda
Disk /dev/sda: 298.1 GiB, 320069031424 bytes, 625134827 sectors
Disk model: ST3320620AS     
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x00000000
 
Device     Boot   Start       End   Sectors  Size Id Type
/dev/sda1  *         63    514079    514017  251M fd Linux raid autodetect
/dev/sda2        514080   6393869   5879790  2.8G fd Linux raid autodetect
/dev/sda3       6393870 625121279 618727410  295G fd Linux raid autodetect

Note that the first partition starts at sector 63, 32,256 bytes into the disk. Modern partition tools tend to start partitions at sector 2,048 (1,024KiB in), but this was acceptable in 2006 for me and worked up until a few days ago.

Those four partitions /dev/sd[abcd]1 make up an mdadm RAID-1 with metadata version 0.90. This was purposefully chosen because at the time of install GRUB did not have RAID support. This metadata version lives at the end of the member device so anything that just reads the device can pretend it’s an ext2 filesystem. That’s what people did many years ago to boot off of software RAID.

What’s Gone Wrong? ^

The last successful update of grub-pc seems to have been done on 7 February 2021:

$ ls -la /boot/grub/i386-pc/core.img
-rw-r--r-- 1 root root 31082 Feb  7 17:19 /boot/grub/i386-pc/core.img

I’ve got 62 sectors available for the core.img so that’s 31,744 bytes – just 662 bytes more than what is required.

The update of grub-pc appears to be detecting that my /boot partition is on a software RAID and is now including MD RAID support even though I don’t strictly require it. This makes the core.img larger than the space I have available for it.

I don’t think it is great that such a major change has been introduced as a security update, and it doesn’t seem like there is any easy way to tell it not to include the MD RAID support, but I’m sure everyone is doing their best here and it’s more important to get the security update out.

Possible Fixes ^

So, how to fix? It seems to me the choices are:

Ignore the problem and stay on the older grub-pc
Create a core.img with only the modules I need
Rebuild my /boot partition

Option #1 is okay short term, especially if you don’t use Secure Boot as that’s what the security update was about.

Option #2 doesn’t seem that feasible as I can’t find a way to influence how Debian’s upgrade process calls grub-install. I don’t want that to become a manual process.

Option #3 seems like the easiest thing to do, as shaving ~1MiB off the size of my /boot isn’t going to cause me any issues.

Rebuilding My /boot ^

Take a backup ^

/boot is only relatively small so it seemed easiest just to tar it up ready to put it back later.

$ sudo tar -C /boot -cvf ~/boot.tar .

I then sent that tar file off to another machine as well, just in case the worst should happen.

Unmount /boot and stop the RAID array that it’s on ^

I’ve already checked in /etc/fstab that /boot is on /dev/md0.

$ sudo umount /boot
$ sudo mdadm --stop md0         
mdadm: stopped md0

At this point I would also recommend doing a wipefs -a on each of the partitions in order to remove the MD superblocks. I didn’t and it caused me a slight problem later as we shall see.

Delete and recreate first partition on each drive ^

I chose to use parted, but should be doable with fdisk or sfdisk or whatever you prefer.

I know from the fdisk output way above that the new partition needs to start at sector 2048 and end at sector 514,079.

$ sudo parted /dev/sda                                                             
GNU Parted 3.2
Using /dev/sda
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) unit s
(parted) rm 1
(parted) mkpart primary ext4 2048 514079s
(parted) set 1 raid on
(parted) set 1 boot on
(parted) p
Model: ATA ST3320620AS (scsi)
Disk /dev/sda: 625134827s
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags:
 
Number  Start     End         Size        Type     File system  Flags
 1      2048s     514079s     512032s     primary  ext4         boot, raid, lba
 2      514080s   6393869s    5879790s    primary               raid
 3      6393870s  625121279s  618727410s  primary               raid
 
(parted) q
Information: You may need to update /etc/fstab.

Do that for each drive in turn. When I got to /dev/sdd, this happened:

Error: Partition(s) 1 on /dev/sdd have been written, but we have been unable to
inform the kernel of the change, probably because it/they are in use.  As a result,
the old partition(s) will remain in use.  You should reboot now before making further changes.
Ignore/Cancel?

The reason for this seems to be that something has decided that there is still a RAID signature on /dev/sdd1 and so it will try to incrementally assemble the RAID-1 automatically in the background. This is why I recommend a wipefs of each member device.

To get out of this situation without rebooting I needed to repeat my mdadm --stop /dev/md0 command and then do a wipefs -a /dev/sdd1. I was then able to partition it with parted.

Create `md0` array again ^

I’m going to stick with metadata format 0.90 for this one even though it may not be strictly necessary.

$ sudo mdadm --create /dev/md0 \
             --metadata 0.9 \
             --level=1 \
             --raid-devices=4 \
             /dev/sd[abcd]1
mdadm: array /dev/md0 started.

Again, if you did not do a wipefs earlier then mdadm will complain that these devices already have a RAID array on them and ask for confirmation.

Get the Array UUID ^

$ sudo mdadm --detail /dev/md0
/dev/md0:
           Version : 0.90
     Creation Time : Sat Mar  6 03:20:10 2021
        Raid Level : raid1
        Array Size : 255936 (249.94 MiB 262.08 MB)
     Used Dev Size : 255936 (249.94 MiB 262.08 MB)
      Raid Devices : 4
     Total Devices : 4
   Preferred Minor : 0
       Persistence : Superblock is persistent
 
       Update Time : Sat Mar  6 03:20:16 2021
             State : clean
    Active Devices : 4
   Working Devices : 4
    Failed Devices : 0
     Spare Devices : 0
 
Consistency Policy : resync
 
              UUID : e05aa2fc:91023169:da7eb873:22131b12 (local to host specialbrew.localnet)            Events : 0.18
 
    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       2       8       33        2      active sync   /dev/sdc1
       3       8       49        3      active sync   /dev/sdd1

Change your /etc/mdadm/mdadm.conf for the updated UUID of md0:

$ grep md0 /etc/mdadm/mdadm.conf
ARRAY /dev/md0 level=raid1 num-devices=4 UUID=e05aa2fc:91023169:da7eb873:22131b12

Make a new filesystem on `/dev/md0` ^

$ sudo mkfs.ext4 -m0 -L boot /dev/md0
mke2fs 1.44.5 (15-Dec-2018)
Creating filesystem with 255936 1k blocks and 64000 inodes
Filesystem UUID: fdc611f2-e82a-4877-91d3-0f5f8a5dd31d
Superblock backups stored on blocks:
        8193, 24577, 40961, 57345, 73729, 204801, 221185
 
Allocating group tables: done
Writing inode tables: done
Creating journal (4096 blocks): done
Writing superblocks and filesystem accounting information: done

My /etc/fstab didn’t need a change because it mounted by device name, i.e. /dev/md0, but if yours uses UUID or label then you’ll need to update that now, too.

Mount it and put your files back ^

$ sudo mount /boot
$ sudo tar -C /boot -xvf ~/boot.tar

Reinstall grub-pc ^

$ sudo apt reinstall grub-pc
…
Setting up grub-pc (2.02+dfsg1-20+deb10u4) ...
Installing for i386-pc platform.
Installation finished. No error reported.
Installing for i386-pc platform.
Installation finished. No error reported.
Installing for i386-pc platform.
Installation finished. No error reported.
Installing for i386-pc platform.
Installation finished. No error reported.

Reboot ^

You probably should reboot now to make sure it all works when you have time to fix any problems, as opposed to risking issues when you least expect it.

$ uprecords 
     #               Uptime | System                                     Boot up
----------------------------+---------------------------------------------------
     1   392 days, 16:45:55 | Linux 4.7.0               Thu Jun 14 16:13:52 2018
     2   325 days, 03:20:18 | Linux 3.16.0-0.bpo.4-amd  Wed Apr  1 14:43:32 2015
->   3   287 days, 16:03:12 | Linux 4.19.0-9-amd64      Fri May 22 12:33:27 2020     4   257 days, 07:31:42 | Linux 4.19.0-6-amd64      Sun Sep  8 05:00:38 2019
     5   246 days, 14:45:10 | Linux 4.7.0               Sat Aug  6 06:27:52 2016
     6   165 days, 01:24:22 | Linux 4.5.0-rc4-specialb  Sat Feb 20 18:18:47 2016
     7   131 days, 18:27:51 | Linux 3.16.0              Tue Sep 16 08:01:05 2014
     8    89 days, 16:01:40 | Linux 4.7.0               Fri May 26 18:28:40 2017
     9    85 days, 17:33:51 | Linux 4.7.0               Mon Feb 19 17:17:39 2018
    10    63 days, 18:57:12 | Linux 3.16.0-0.bpo.4-amd  Mon Jan 26 02:33:47 2015
----------------------------+---------------------------------------------------
1up in    37 days, 11:17:07 | at                        Mon Apr 12 15:53:46 2021
no1 in   105 days, 00:42:44 | at                        Sat Jun 19 05:19:23 2021
    up  2362 days, 06:33:25 | since                     Tue Sep 16 08:01:05 2014
  down     0 days, 14:02:09 | since                     Tue Sep 16 08:01:05 2014
   %up               99.975 | since                     Tue Sep 16 08:01:05 2014

My Kingdom For 7 Bytes ^

My new core.img is 7 bytes too big to fit before my original /boot:

$ ls -la /boot/grub/i386-pc/core.img
-rw-r--r-- 1 root root 31751 Mar  6 03:24 /boot/grub/i386-pc/core.img

Just had my COVID-19 first vaccination (Pfizer/BioNTech)

Posted on February 25, 2021February 25, 2021 by Andy

Just got back from having my first COVID-19 vaccination. Started queueing at 10:40, pre-screening questions at 10:50, all done by 10:53 then I poked at my phone for 15 minutes while waiting to check I wouldn’t keel over from anaphylactic shock (I didn’t).

I was first notified that I should book an appointment in the form of a text message from sender “GPSurgery” on Monday 22nd February 2021:

Dear MR SMITH,

You have been invited to book your COVID-19 vaccinations.

Please click on the link to book: https://accurx.thirdparty.nhs.uk/…
[Name of My GP Surgery]

The web site presented me with a wide variety of dates and times, the earliest being today, 3 days later, so I chose that. My booking was then confirmed by another text message, and another reminder message was sent yesterday. I assume these text messages were sent by some central service on behalf of my GP whose role was probably just submitting my details.

A very smooth process a 15 minute walk from my home, and I’m hearing the same about the rest of the country too.

Watching social media mentions from others saying they’ve had their vaccination and also looking at the demographics in the queue and waiting room with me, I’ve been struck by how many people have—like me—been called up for their vaccinations quite early unrelated to their age. I was probably in the bottom third age group in the queue and waiting area: I’m 45 and although most seemed older than me, there were plenty of people around my age and younger there.

It just goes to show how many people in the UK are relying on the NHS for the management of chronic health conditions that may not be obviously apparent to those around them. Which is why we must not let this thing that so many of us rely upon be taken away. I suspect that almost everyone reading either is in a position of relying upon the NHS or has nearest and dearest who do.

The NHS gets a lot of criticism for being a bottomless pit of expenditure that is inefficient and slow to embrace change. Yes, healthcare costs a lot of money especially with our ageing population, but per head we spend a lot less than many other countries: half what the US spends per capita or as a proportion of GDP; our care is universal and our life expectancy is slightly longer. In 2017 the Commonwealth Fund rated the NHS #1 in a comparison of 11 countries.

So the narrative that the NHS is poor value for money is not correct. We are getting a good financial deal. We don’t necessarily need to make it perform better, financially, although there will always be room for improvement. The NHS has a funding crisis because the government wants it to have a funding crisis. It is being deliberately starved of funding so that it fails.

The consequences of selling off the NHS will be that many people are excluded from care they need to stay alive or to maintain a tolerable standard of living. As we see with almost every private sector takeover of what were formerly public services, they strip the assets, run below-par services that just about scrape along, and then when there is any kind of downturn or unexpected event they fold and either beg for bailout or just leave the mess in the hands of the government. Either way, taxpayers pay more for less and make a small group of wealthy people even more wealthy.

We are such mugs here in UK that even other countries have realised that they can bid to take over our public services, provide a low standard of service at a low cost to run, charge a lot to the customer and make a hefty profit. Most of our train operating companies are owned by foreign governments.

The NHS as it is only runs as well as it does because the staff are driven to breaking point with an obscene amount of unpaid overtime and workplace stress.

If you’d like to learn some more about the state of the NHS in the form of an engaging read then I recommend Adam Kay’s book This is Going to Hurt: Secret Diaries of a Junior Doctor. It will make you laugh, it will make you cry and if you’ve a soul it will make you angry. Also it may indelibly sear the phrase “penis degloving injury” into your mind.

Do not accept the premise that the NHS is too expensive.

If the NHS does a poor job (and it sometimes does), understand that underfunding plays a big part.

Privatising any of it will not improve matters in any way, except for a very small number of already wealthy people.

Please think about this when you vote.

Intel may need me to sign an NDA before I can know the capacity of one of their SSDs

Posted on February 22, 2021February 22, 2021 by Andy

Apologies for the slightly clickbaity title! I could not resist. While an Intel employee did tell me this, they are obviously wrong.

Still, I found out some interesting things that I was previously unaware of.

I was thinking of purchasing some “3.84TB” Intel D3-S4610 SSDs for work. I already have some “3.84TB” Samsung SM883s so it would be good if the actual byte capacity of the Intel SSDs were at least as much as the Samsung ones, so that they could be used to replace a failed Samsung SSD.

To those with little tech experience you would think that two things which are described as X TB in capacity would be:

Actually X TB in size, where 1TB = 1,000 x 1,000 x 1,000 x 1,000 bytes, using powers of ten SI prefixes. Or;
Actually X TiB in size, where 1TiB = 1,024 x 1,024 x 1,024 x 1,024 bytes, using binary prefixes.

…and there was a period of time where this was mostly correct, in that manufacturers would prefer something like the former case, as it results in larger headline numbers.

The thing is, years ago, manufacturers used to pick a capacity that was at least what was advertised (in powers of 10 figures) but it wasn’t standardised.

If you used those drives in a RAID array then it was possible that a replacement—even from the same manufacturer—could be very slightly smaller. That would give you a bad day as you generally need devices that are all the same size. Larger is okay (you’ll waste some), but smaller won’t work.

So for those of us who like me are old, this is something we’re accustomed to checking, and I still thought it was the case. I wanted to find out the exact byte capacity of this Intel SSD. So I tried to ask Intel, in a live support chat.

Edgar (22/02/2021, 13:50:59): Hello. My name is Edgar and I’ll be helping you today.

Me (22/02/2021, 13:51:36): Hi Edgar, I have a simple request. Please could you tell me the exact byte capacity of a SSD-SSDSC2KG038T801 that is a 3.84TB Intel D3-S4610 SSD

Me (22/02/2021, 13:51:47): I need this information for matching capacities in a RAID set

Edgar (22/02/2021, 13:52:07): Hello, thank you for contacting Intel Technical Support. It is going to be my pleasure to help you.

Edgar (22/02/2021, 13:53:05): Allow me a moment to create a ticket for you.

Edgar (22/02/2021, 13:57:26): We have a calculation to get the decimal drive sectors of an SSD because the information you are asking for most probably is going to need a Non-Disclousre Agreement (NDA)

Yeah, an Intel employee told me that I might need to sign an NDA to know the usable capacity of an SSD. This is obviously nonsense. I don’t know whether they misunderstood and thought I was asking about the raw capacity of the flash chips or what.

Me (22/02/2021, 13:58:15): That seems a bit strange. If I buy this drive I can just plug it in and see the capacity in bytes. But if it’s too small then that is a wasted purchase which would be RMA’d

Edgar (22/02/2021, 14:02:48): It is 7,500,000,000

Edgar (22/02/2021, 14:03:17): Because you take the size of the SSD that is 3.84 in TB, in Byte is 3840000000000

Edgar (22/02/2021, 14:03:47): So we divide 3840000000000 / 512 which is the sector size for a total of 7,500,000,000 Bytes

Me (22/02/2021, 14:05:50): you must mean 7,500,000,000 sectors of 512byte, right?

Edgar (22/02/2021, 14:07:45): That is the total sector size, 512 byte

Edgar (22/02/2021, 14:08:12): So the total sector size of the SSD is 7,500,000,000

Me (22/02/2021, 14:08:26): 7,500,000,000 sectors is only 3,750GB so this seems rather unlikely

The reason why this seemed unlikely to me is that I have never seen an Intel or Samsung SSD that was advertised as X.Y TB capacity that did not have a usable capacity of at least X,Y00,000,000,000 bytes. So I would expect a “3.84TB” device to have at least 3,840,000,000,000 bytes of usable capacity.

Edgar was unable to help me further so the support chat was ended. I decided to ask around online to see if anyone actually had one of these devices running and could tell me the capacity.

Peter Corlett responded to me with:

As per IDEMA LBA1-03, the capacity is 1,000,194,048 bytes per marketing gigabyte plus 10,838,016 bytes. A marketing terabyte is 1000 marketing gigabytes.

3840 * 1000194048 + 10838016 = 3840755982336. Presumably your Samsung disk has that capacity, as should that Intel one you’re eyeing up.

My Samsung ones do! And every other SSD I’ve checked obeys this formula, which explains why things have seemed a lot more standard recently. I think this might have been standardised some time around 2014 / 2015. I can’t tell right now because the IDEMA web site is down!

So the interesting and previously unknown to me thing is that storage device sizes are indeed standardised now, albeit not to any sane definition of the units that they use.

What a relief.

Also that Intel live support sadly can’t be relied upon to know basic facts about Intel products. 🙁

Booting the CentOS/RHEL installer under Xen PVH mode

Posted on February 3, 2021February 3, 2021 by Andy

Contents

CentOS/RHEL and Xen
Credit
Overview
Detailed process

Extract the CentOS initrd.img
Copy modules from a working Xen guest
Add dracut hook to copy fs modules
Repack initrd
Use the Debian kernel

Boot this kernel/initrd as a Xen guest
Minimal kickstart file
Launch the guest
A better way?

CentOS/RHEL and Xen ^

As of the release of CentOS 8 / RHEL8, Red Hat disabled kernel support for running as a Xen PV or PVH guest, even though such support is enabled by default in the upstream Linux kernel.

As a result—unlike with all previous versions of CentOS/RHEL—you cannot boot the installer in Xen PV or PVH mode. You can still boot it in Xen HVM mode, or under KVM, but that is not very helpful if you don’t want to run HVM or KVM.

At BitFolk ever since the release of CentOS 8 we’ve had to tell customers to use the Rescue VM (a kind of live system) to unpack CentOS into a chroot.

Fortunately there is now a better way.

Credit ^

This method was worked out by Jon Fautley. Jon emailed me instructions and I was able to replicate them. Several people have since asked me how it was done and Jon was happy for me to write it up, but this was all worked out by Jon, not me.

Overview ^

The basic idea here is to:

take the installer initrd.img
unpack it
shove the modules from a Debian kernel into it
repack it
use a Debian kernel and this new frankeninitrd as the installer kernel and initrd
switch the installed OS to kernel-ml package from ELRepo so it has a working kernel when it boots

Detailed process ^

I’ll go into enough detail that you should be able to exactly replicate what I did to end up with something that works. This is quite a lot but it only needs to be done each time the real installer initrd.img changes, which isn’t that often. The resulting kernel and initrd.img can be used to install many guests.

Throughout the rest of this article I’ll refer to CentOS, but Jon initially made this work for RHEL 8. I’ve replicated it for CentOS 8 and will soon do so for RHEL 8 as well.

Extract the CentOS initrd.img ^

You will find this in the install ISO or on mirrors as images/pxeboot/initrd.img.

$ mkdir /var/tmp/frankeninitrd/initrd
$ cd /var/tmp/frankeninitrd/initrd
$ xz -dc /path/to/initrd.img > ../initrd.cpio
$ # root needed because this will do some mknod/mkdev.
$ sudo cpio -idv < ../initrd.cpio

Copy modules from a working Xen guest ^

I’m going to use the Xen guest that I’m doing this on, which at the time of writing is a Debian buster system running kernel 4.19.0-13. Even a system that is not currently running as a Xen guest will probably work, as they usually have modules available for everything.

At the time of writing the kernel version in the installer is 4.18.0-240.

If you’ve got different, adjust filenames accordingly.

$ sudo cp -r /lib/modules/4.19.0-13-amd64 lib/modules/
$ # You're not going to use the original modules
$ # so may as well delete them to save space.
$ sudo rm -vr lib/modules/4.18*

Add dracut hook to copy fs modules ^

$ cat > usr/lib/dracut/hooks/pre-pivot/99-move-modules.sh <<__EOF__
#!/bin/sh
 
mkdir -p /sysroot/lib/modules/$(uname -r)/kernel/fs
rm -r /sysroot/lib/modules/4.18*
cp -r /lib/modules/$(uname -r)/kernel/fs/* /sysroot/lib/modules/$(uname -r)/kernel/fs
cp /lib/modules/$(uname -r)/modules.builtin /sysroot/lib/modules/$(uname -r)/
depmod -a -b /sysroot
 
exit 0
__EOF__
$ chmod +x usr/lib/dracut/hooks/pre-pivot/99-move-modules.sh

Repack initrd ^

This will take a really long time because xz -9 is sloooooow.

$ sudo find . 2>/dev/null | \
  sudo cpio -o -H newc -R root:root | \
  xz -9 --format=lzma > ../centos8-initrd.img

Use the Debian kernel ^

Put the matching kernel next to your initrd.

$ cp /boot/vmlinuz-4.19.0-13-amd64 ../centos8-vmlinuz
$ ls -lah ../centos*
-rw-r--r-- 1 andy andy  81M Feb  1 04:43 ../centos8-initrd.img
-rw-r--r-- 1 andy andy 5.1M Feb  1 04:04 ../centos8-vmlinuz

Boot this kernel/initrd as a Xen guest ^

Copy the kernel and initrd to somewhere on your dom0 and create a guest config file that looks a bit like this:

name       = "centostest"
# CentOS 8 installer requires at least 2.5G RAM.
# OS will run with a lot less though.
memory     = 2560
vif        = [ "mac=00:16:5e:00:02:39, ip=192.168.82.225, vifname=v-centostest" ]
type       = "pvh"
kernel     = "/var/tmp/frankeninitrd/centos8-vmlinuz"
ramdisk    = "/var/tmp/frankeninitrd/centos8-initrd.img"
extra      = "console=hvc0 ip=192.168.82.225::192.168.82.1:255.255.255.0:centostest:eth0:none nameserver=8.8.8.8 inst.stage2=http://www.mirrorservice.org/sites/mirror.centos.org/8/BaseOS/x86_64/os/ inst.ks=http://example.com/yourkickstart.ks"
disk       = [ "phy:/dev/vg/centostest_xvda,xvda,w",
               "phy:/dev/vg/centostest_xvdb,xvdb,w" ]

Assumptions in the above:

vif and disk settings will be however you usually do that.
“extra” is for the kernel command line and here gives the installer static networking with the ip=IP address::default gateway:netmask:hostname:interface name:auto configuration type option.
inst.stage2 here goes to a public mirror but could be an unpacked installer iso file instead.
inst.ks points to a minimal kickstart file you’ll have to create (see below).

Minimal kickstart file ^

This kickstart file will:

Automatically wipe disks and partition. I use xvda for the OS and xvdb for swap. Adjust accordingly.
Install only minimal package set.
Switch the installed system over to kernel-ml from EPEL.
Force an SELinux autorelabel at first boot.

The only thing it doesn’t do is create any users. The installer will wait for you to do that. If you want an entirely automated install just add the user creation stuff to your kickstart file.

url --url="http://www.mirrorservice.org/sites/mirror.centos.org/8/BaseOS/x86_64/os"
text
 
# Clear all the disks.
clearpart --all --initlabel
zerombr
 
# A root filesystem that takes up all of xvda.
part /    --ondisk=xvda --fstype=xfs --size=1 --grow
 
# A swap partition that takes up all of xvdb.
part swap --ondisk=xvdb --size=1 --grow
 
bootloader --location=mbr --driveorder=xvda --append="console=hvc0"
firstboot --disabled
timezone --utc Etc/UTC --ntpservers="0.pool.ntp.org,1.pool.ntp.org,2.pool.ntp.org,3.pool.ntp.org"
keyboard --vckeymap=gb --xlayouts='gb'
lang en_GB.UTF-8
skipx
firewall --enabled --ssh
halt
 
%packages
@^Minimal install
%end 
 
%post --interpreter=/usr/bin/bash --log=/root/ks-post.log --erroronfail
 
# Switch to kernel-ml from EPEL. Necessary for Xen PV/PVH boot support.
rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
yum -y install https://www.elrepo.org/elrepo-release-8.el8.elrepo.noarch.rpm
yum --enablerepo=elrepo-kernel -y install kernel-ml
yum -y remove kernel-tools kernel-core kernel-modules
 
sed -i -e 's/DEFAULTKERNEL=.*/DEFAULTKERNEL=kernel-ml/' /etc/sysconfig/kernel
grub2-mkconfig -o /boot/grub2/grub.cfg
 
# Force SELinux autorelabel on first boot.
touch /.autorelabel
%end

Launch the guest ^

$ sudo xl create -c /etc/xen/centostest.conf

Obviously this guest config can only boot the installer. Once it’s actually installed and halts you’ll want to make a guest config suitable for normal booting. The kernel-ml does work in PVH mode so at BitFolk we use pvhgrub to boot these.

A better way? ^

The actual modifications needed to the stock installer kernel are quite small: just enable CONFIG_XEN_PVH kernel option and build. I don’t know the process to build a CentOS or RHEL installer kernel though, so that wasn’t an option for me.

If you do know how to do it please do send me any information you have.

If you’re running Ubuntu and/or using snaps, look into CVE-2020-27348

Posted on January 18, 2021 by Andy

I was reading an article about CVE-2020-27348 earlier, which is quite a nasty bug affecting a lot of snap packages.

My desktop runs Ubuntu 18.04 at the moment, and so does my partner’s laptop. I also have a Debian buster laptop but I’ve never installed snapd there. So it’s just my desktop and my partner’s laptop I’m concerned about.

If you run Ubuntu 20.04 or later I think there’s probably more concern, as I understand the software centre offers snap versions of things by default.

Anyway, I couldn’t recall ever installing a snap on purpose on my desktop except for a short while ago when I intentionally installed signal-desktop. But in fact I have quite a few snaps installed.

$ snap list
Name                  Version                     Rev    Tracking         Publisher     Notes
core                  16-2.48.2                   1058   latest/stable    canonical✓    core
core18                20201210                    1944   latest/stable    canonical✓    base 
gnome-3-26-1604       3.26.0.20200529             100    latest/stable/…  canonical✓    -
gnome-3-28-1804       3.28.0-19-g98f9e67.98f9e67  145    latest/stable    canonical✓    -
gnome-3-34-1804       0+git.3556cb3               66     latest/stable    canonical✓    -
gnome-calculator      3.38.0+git7.c840c69c        826    latest/stable/…  canonical✓    -
gnome-characters      v3.34.0+git9.eeab5f2        570    latest/stable/…  canonical✓    -
gnome-logs            3.34.0                      100    latest/stable/…  canonical✓    -
gnome-system-monitor  3.36.0-12-g35f88a56d7       148    latest/stable/…  canonical✓    -
gtk-common-themes     0.1-50-gf7627e4             1514   latest/stable/…  canonical✓    -
signal-desktop        1.39.5                      345    latest/stable    snapcrafters  -

I don’t know why gnome-calculator is there. It doesn’t appear to be the binary that’s run when I start the calculator.

So are any of them a security risk? Well…

$ grep -l \$LD_LIBRARY_PATH /snap/*/current/snap/snapcraft.yaml
/snap/gnome-calculator/current/snap/snapcraft.yaml
/snap/gnome-characters/current/snap/snapcraft.yaml
/snap/gnome-logs/current/snap/snapcraft.yaml
/snap/gnome-system-monitor/current/snap/snapcraft.yaml

Those are all the snaps on my system which include the value of the (empty) environment variable LD_LIBRARY_PATH, so are likely vulnerable to this.

But does this really end up with an empty item in the LD_LIBRARY_PATH list?

$ which gnome-system-monitor 
/snap/bin/gnome-system-monitor
$ gnome-system-monitor &
$ pgrep -f gnome-system-monitor
8259
$ tr '\0' '\n' < /proc/8259/environ | grep ^LD_LIBR | grep -q :: && echo "oh dear"
oh dear

Yes it really does.

(The tr is necessary above because the /proc/*/environ file is a NUL-separated string, so that modifies it to be one variable per line, then looks for the LD_LIBRARY_PATH line, and checks if it has an empty entry ::)

So yeah, my gnome-system-monitor is a local code execution vector.

As are my gnome-characters, gnome-logs and that gnome-calculator if I ever uninstall the non-snap version.

That CVE seems to have been published on 3 December 2020. I hope that the affected snaps will be fixed soon.

I don’t like that the CVE says the impact is:

If a user were tricked into installing a malicious snap or downloading a malicious library, under certain circumstances an attacker could exploit this to affect strict mode snaps that have access to the library and were launched from the directory containing the library.

My first thought upon reading is, “I’m safe, I haven’t been tricked into downloading any malicious snaps!” But I do have snaps that aren’t malicious, they are just insecure. The hardest part of the exploit is indeed getting a malicious file (a library) into my filesystem in a directory where I will run a snap from.

I'll get there one day.

Overview ^

Basic concept of Debian Live ^

Install packages ^

Prepare the work directory ^

Main configuration ^

Extra packages ^

Installing a backports kernel ^

Set a static /etc/resolv.conf ^

Set an explanatory footer text in /etc/issue.footer ^

Set a random password at boot ^

Fix initial networking setup ^

Fix the shutdown process ^

Build it ^

Booting it ^

In action ^

Improvements ^

The Problem ^

At the md level ^

At the lvm level ^

Offset into the PV ^

How big is an extent? ^

Look at the PV’s mappings ^

What’s going on here then? ^

Going further: finding the file ^

A few words about iptables vs nft ^

Logging with iptables ^

The Problem ^

An answer: ulogd2 ^

Configuring ulogd2 ^

Configuring iptables ^

The Initial Problem ^

Ancient History ^

What’s Gone Wrong? ^

Possible Fixes ^

Rebuilding My /boot ^

Take a backup ^

Unmount /boot and stop the RAID array that it’s on ^

Delete and recreate first partition on each drive ^

Create md0 array again ^

Get the Array UUID ^

Make a new filesystem on /dev/md0 ^

Mount it and put your files back ^

Reinstall grub-pc ^

Reboot ^

My Kingdom For 7 Bytes ^

CentOS/RHEL and Xen ^

Credit ^

Overview ^

Detailed process ^

Extract the CentOS initrd.img ^

Copy modules from a working Xen guest ^

Add dracut hook to copy fs modules ^

Repack initrd ^

Use the Debian kernel ^

Boot this kernel/initrd as a Xen guest ^

Minimal kickstart file ^

Launch the guest ^

A better way? ^

A few words about `iptables` vs `nft` ^

Logging with `iptables` ^

An answer: `ulogd2` ^

Configuring `ulogd2` ^

Configuring `iptables` ^

Create `md0` array again ^

Make a new filesystem on `/dev/md0` ^