Copying block devices between machines

Having a bunch of Linux servers that run Linux virtual machines, I often find myself having to move a virtual machine from one server to another. The tricky thing is that I’m not in a position to be using shared storage, i.e., the virtual machines’ storage is local to the machine they are running on. So, the data has to be moved first.

A naive approach ^

The naive approach is something like the following:

  1. Ensure that I can SSH as root, using an SSH key, from the source host to the destination host.
  2. Create a new LVM logical volume on the destination host that’s the same size as the origin.
  3. Shut down the virtual machine.
  4. Copy the data across using something like this:
    $ sudo dd bs=4M if=/dev/mapper/myvg-src_lv |
      sudo ssh root@dest-host 'dd bs=4M of=/dev/mapper/myvg-dest_lv'
    
  5. While that is copying, do any other configuration transfer that’s required.
  6. When it’s finished, start up the virtual machine on the destination host.

I also like to stick pv in the middle of that pipeline so I get a nice text mode progress bar (a bit like what you see with wget):

$ sudo dd bs=4M if=/dev/mapper/myvg-src_lv | pv -s 10g |
  sudo ssh root@dest-host 'dd bs=4M of=/dev/mapper/myvg-dest_lv'

The above transfers data between hosts via ssh, which will introduce some overhead since it will be encrypting everything. You may or may not wish to force it to do compression, or pipe it through a compressor (like gzip) first, or even avoid ssh entirely and just use nc.
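
If you did want to skip ssh entirely, the nc version looks something like this (the port number is arbitrary, and some netcat flavours want nc -l 1234 rather than nc -l -p 1234; start the listening side on the destination first):

$ nc -l -p 1234 | sudo dd bs=4M of=/dev/mapper/myvg-dest_lv     # on dest-host
$ sudo dd bs=4M if=/dev/mapper/myvg-src_lv | nc dest-host 1234  # on src-host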

Personally I don’t care about the ssh overhead; this is on the whole customer data and I’m happier if it’s encrypted. I also don’t bother compressing it unless it’s going over the Internet. Over a gigabit LAN I’ve found it fastest to use ssh with the -c arcfour option.

The above process works, but it has some fairly major limitations:

  1. The virtual machine needs to be shut down for the whole time it takes to transfer data from one host to another. For 10GiB of data that’s not too bad. For 100GiB of data it’s rather painful.
  2. It transfers the whole block device, even the empty bits. For example, if it’s a 10GiB block device with 2GiB of data on it, 10GiB still gets transferred.

Limitation #2 can be mitigated somewhat by compressing the data. But we can do better.

LVM snapshots ^

One of the great things about LVM is snapshots. You can take a snapshot of a virtual machine’s logical volume while it is still running, and transfer that using the above method.
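
Creating the snapshot is a one-liner. Here the names follow the earlier dd example, and the 2GiB of copy-on-write space is just a guess at how much the virtual machine will write while the copy runs:

$ sudo lvcreate --snapshot --size 2G --name src_lv_snap /dev/myvg/src_lv

The snapshot then shows up as /dev/mapper/myvg-src_lv_snap and can be fed to dd in place of the original volume.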

But what do you end up with? A destination host with an out of date copy of the data on it, and a source host that is still running a virtual machine that’s still updating its data. How to get just the differences from the source host to the destination?

Again there is a naive approach, which is to shut down the virtual machine and mount the logical volume on the host itself, do the same on the destination host, and use rsync to transfer the differences.

This will work, but again has major issues such as:

  1. It’s technically possible for a virtual machine admin to maliciously construct a filesystem that interferes with the host that mounts it. Mounting random filesystems is risky.
  2. Even if you’re willing to risk the above, you have to guess what the filesystem is going to be. Is it ext3? Will it have the same options that your host supports? Will your host even support whatever filesystem is on there?
  3. What if it isn’t a filesystem at all? It could well be a partitioned disk device, which you can still work with using kpartx, but it’s a major pain. Or it could even be a raw block device used by some tool you have no clue about.

The bottom line is, it’s a world of risk and hassle interfering with the data of virtual machines that you don’t admin.

Sadly rsync doesn’t support syncing a block device. There’s a --copy-devices patch that allows it to do so, but after applying it I found that while it could now read from a block device, it would still only write to a file.

Next I found a --write-devices patch by Darryl Dixon, which provides the other end of the functionality – it allows rsync to write to a block device instead of files in a filesystem. Unfortunately no matter what I tried, this would just send all the data every time, i.e., it was no more efficient than just using dd.

Read a bit, compare a bit ^

While searching about for a solution to this dilemma, I came across this horrendous and terrifying bodge of shell and Perl on serverfault.com:

ssh -i /root/.ssh/rsync_rsa $remote "
  perl -'MDigest::MD5 md5' -ne 'BEGIN{\$/=\1024};print md5(\$_)' $dev2 | lzop -c" |
  lzop -dc | perl -'MDigest::MD5 md5' -ne 'BEGIN{$/=\1024};$b=md5($_);
    read STDIN,$a,16;if ($a eq $b) {print "s"} else {print "c" . $_}' $dev1 | lzop -c |
ssh -i /root/.ssh/rsync_rsa $remote "lzop -dc |
  perl -ne 'BEGIN{\$/=\1} if (\$_ eq\"s\") {\$s++} else {if (\$s) {
    seek STDOUT,\$s*1024,1; \$s=0}; read ARGV,\$buf,1024; print \$buf}' 1<> $dev2"

Are you OK? Do you need to have a nice cup of tea and a sit down for a bit? Yeah. I did too.

I’ve rewritten this thing into a single Perl script so it’s a little bit more readable, but I’ll attempt to explain what the above abomination does.

Even though I do refer to this script in unkind terms like “abomination”, I will be the first to admit that I couldn’t have come up with it myself, and that I’m not going to show you my single Perl script version because it’s still nearly as bad. Sorry!

It connects to the destination host and starts a Perl script which begins reading the block device over there, 1024 bytes at a time, running that through md5 and piping the output to a Perl script running locally (on the source host).

The local Perl script is reading the source block device 1024 bytes at a time, doing md5 on that and comparing it to the md5 hashes it is reading from the destination side. If they’re the same then it prints “s” otherwise it prints “c” followed by the actual data from the source block device.

The output of the local Perl script is fed to a third Perl script running on the destination. It takes the sequence of “s” or “c” as instructions on whether to skip 1024 bytes (“s”) of the destination block device or whether to take 1024 bytes of data and write it to the destination block device (“c<1024 bytes of data>“).

The lzop bits are just doing compression and can be swapped for gzip or omitted entirely.

Hopefully you can see that this is behaving like a very very dumb version of rsync.

The thing is, it works really well. If you’re not convinced, run md5sum (or sha1sum or whatever you like) on both the source and destination block devices to verify that they’re identical.
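
For example, using the device names from the dd example earlier (and doing this after the final sync, with the virtual machine shut down):

$ sudo md5sum /dev/mapper/myvg-src_lv
$ sudo ssh root@dest-host md5sum /dev/mapper/myvg-dest_lv

Both should print the same hash.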

The process now becomes something like:

  1. Take an LVM snapshot of virtual machine block device while the virtual machine is still running.
  2. Create suitable logical volume on destination host.
  3. Use dd to copy the snapshot volume to the destination volume.
  4. Move over any other configuration while that’s taking place.
  5. When the initial copy is complete, shut down the virtual machine.
  6. Run the script of doom to sync over the differences from the real device to the destination.
  7. When that’s finished, start up the virtual machine on the destination host.
  8. Delete snapshot on source host.

1024 bytes seemed like rather a small buffer to be working with so I upped it to 1MiB.

I find that on a typical 10GiB block device there might only be a few hundred MiB of changes between snapshot and virtual machine shut down. The entire device does have to be read through of course, but the down time and data transferred is dramatically reduced.

There must be a better way ^

Is there a better way to do this, still without shared storage?

It’s getting difficult to sell the disk capacity that comes with the number of spindles I need for performance, so maybe I could do something with DRBD so that there’s always another server with a copy of the data?

This seems like it should work, but I’ve no experience of DRBD. Presumably the active node would have to be using the /dev/drbdX devices as disks. Does DRBD scale to having, say, 100 of those on one host? It seems like a lot of added complexity.

I’d love to hear any other ideas.

Adventures in entropy, part 2

Recap ^

Back in part 1 I discussed what entropy is as far as Linux is concerned, why I’ve started to look in to entropy as it relates to a Linux/Xen-based virtual hosting platform, how much entropy I have available, and how this might be improved.

If you didn’t read that part yet then you might want to do so, before carrying on with this part.

As before, click on any graph to see the full-size version.

Hosting server with an Entropy Key ^

Recently I colocated a new hosting server so it seemed like a good opportunity to try out the Entropy Key at the same time. Here’s what the available entropy looks like whilst ekeyd is running.

urquell.bitfolk.com available entropy with ekey, daily

First impressions: this is pretty impressive. It hovers very close to 4096 bits at all times. There is very little jitter.

Trying to deplete the entropy pool, while using an Entropy Key ^

As per Hugo’s comment in part 1, I tried watch -n 0.25 cat /proc/sys/kernel/random/entropy_avail to see if I could deplete the entropy pool, but it had virtually no effect. I then tried watch -n 0.1 cat /proc/sys/kernel/random/entropy_avail (so every tenth of a second) and the available entropy fluctuated mostly around 4000 bits, with a brief dip to ~3600 bits:

urquell.bitfolk.com available entropy with ekey, trying to deplete the pool

In the above graph, the first watch invocation was at ~1100 UTC. The second one was at ~1135 UTC.

Disabling the Entropy Key ^

Unfortunately I forgot to get graphs of urquell before the ekeyd was started, so I have no baseline for this machine.

I assumed it would be the same as all the other host machines, but decided to shut down ekeyd to verify that. Here’s what happened.

urquell.bitfolk.com available entropy with ekeyd shut down, daily

The huge chasm of very little entropy in the middle of this graph is urquell running without an ekeyd. At first I was at a loss to explain why it should only have ~400 bits of entropy by itself, when the other hosting servers manage somewhere between 3250 and 4096 bits.

I now believe that it’s because urquell is newly installed and has no real load. Looking into how modern Linux kernels obtain entropy, it’s basically:

  • keyboard interrupts;
  • mouse interrupts;
  • interrupts from other device drivers that set the IRQF_SAMPLE_RANDOM flag.

Bear in mind that headless servers usually don’t have a mouse or keyboard attached!

You can see which other drivers are candidates for filling up the entropy pool by looking at where the IRQF_SAMPLE_RANDOM identifier occurs in the kernel source:

http://www.cs.fsu.edu/~baker/devices/lxr/http/ident?i=IRQF_SAMPLE_RANDOM
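
Or, if you happen to have a kernel source tree lying around, just grep for it:

$ grep -rl IRQF_SAMPLE_RANDOM drivers/    # from the top of the kernel source tree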

(as an aside, in 2.4.x kernels, most of the network interface card drivers had IRQF_SAMPLE_RANDOM and then they all got removed through the 2.6.x cycle since it was decided that IRQF_SAMPLE_RANDOM is really only for interrupts that can’t be observed or tampered with by an outside party. That’s why a lot of people reported problems with lack of entropy after upgrading their kernels.)

My hosting servers are typically Supermicro motherboards with Intel gigabit NICs and a 3ware RAID controller. The most obvious device in the list that could be supplying entropy is probably block/xen-blkfront, since there’s one of those for each block device exported to a Xen virtual machine on the system.

To test the hypothesis that the other servers are getting entropy from busy Xen block devices, I shut down ekeyd and then hammered on a VM filesystem:

urquell.bitfolk.com available entropy with ekeyd shut down, hammering a VM filesystem

The increase you see towards the end of the graph was while I was hammering the virtual machine’s filesystem. I was able to raise the available entropy to a stable ~2000 bits doing this, so I’m satisfied that if urquell were as busy as the other servers then it would have similar available entropy to them, even without the Entropy Key.

Feeding entropy to other hosts ^

ekeyd by default feeds entropy from the key directly into the Linux kernel of the host it’s on, but it can be configured to listen on a Unix or TCP socket and mimic the egd protocol. I set it up this way and then put an instance of HAProxy into a VM with my ekeyd as a back end. So at this point I had a service IP which talks the egd protocol and which client machines can use to request entropy.
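
The HAProxy part is only a few lines of configuration. The fragment below is just a sketch: the addresses and port are invented, and the corresponding listening socket has to be set up in ekeyd’s own configuration file.

# /etc/haproxy/haproxy.cfg (fragment); addresses invented for illustration
listen entropy 192.0.2.10:8888
    mode tcp
    balance roundrobin
    server ekey-host0 192.0.2.1:8888 check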

On the client side, ekeyd-egd-linux can be found in Debian lenny-backports and in Debian squeeze, as well as Ubuntu universe since Jaunty. This daemon can read from a Unix or TCP socket using the egd protocol and will feed the received entropy into the Linux kernel.

I took a look at which of my VMs had the lowest available entropy and installed ekeyd-egd-linux on them, pointing it at my entropy service IP:

admin.obstler.bitfolk.com available entropy after hooking up to entropy service

panel0.bitfolk.com available entropy after hooking up to entropy service

spamd0.lon.bitfolk.com available entropy after hooking up to entropy service

Success!
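
For the record, the client side is just a package install plus telling the daemon the entropy service’s IP and port; on my Debian VMs that meant editing /etc/default/ekeyd-egd-linux, if I’m remembering the path correctly:

$ sudo apt-get install ekeyd-egd-linux
$ sudoedit /etc/default/ekeyd-egd-linux      # set the entropy service host and port
$ sudo /etc/init.d/ekeyd-egd-linux restart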

Where next? ^

  • Get some customers using it, explore the limits of how much entropy can be served.
  • Buy another Entropy Key so that it doesn’t all grind to a halt if one of them should die.
  • Investigate a way to get egd to read from another egd so I can serve the entropy directly from a VM and not have so many connections to my real hardware. Anyone interested in coding that?
  • Monitor the served entropy both for availability and for quality.

Adventures in entropy, part 1

A while back, a couple of BitFolk customers mentioned to me that they were having problems running out of entropy.

A brief explanation of entropy as it relates to computing ^

Where we say entropy, we could in layman’s terms say “randomness”. Computers need entropy for a lot of things, particularly cryptographic operations. You may not think that you do a lot of cryptography on your computer, and you personally probably don’t, but for example every time you visit a secure web site (https://…) your computer has to set up a cryptographic channel with the server. Cryptographic algorithms generally require a lot of random data and it has to be secure random data. For the purposes of this discussion, “secure” means that an attacker shouldn’t be able to guess or influence what the random data is.

Why would an attacker be able to guess or influence the random data if it is actually random? Because it’s not actually random. The computer has to get the data from somewhere. A lot of places it might be programmed to get it from may seem random but potentially aren’t. A silly implementation might just use the number of seconds the computer has been running as a basis for generating “random” numbers, but you can see that an attacker can guess this and may even be able to influence it, which could weaken any cryptographic algorithm that uses the “random” data.

Modern computers and operating systems generate entropy based on events like electrical noise, timings of data coming into the computer over the network, what’s going on with the disks, and so on, all fed into algorithms — what we call pseudo-random number generators (PRNGs). A lot of data goes in and a relatively small amount of entropy comes out, but it’s entropy you should be able to trust.

That works reasonably well for conventional computers and servers, but it doesn’t work so well for virtual servers. Virtual servers are running in an emulated environment, with very little access to “real” hardware. The random data that conventional computers get from their hardware doesn’t happen with emulated virtual hardware, so the prime source of entropy just isn’t present.

When you have an application that wants some entropy and the system has no more entropy to give, what usually happens is that the application blocks, doing nothing, until the system can supply some more entropy. Linux systems have two ways for applications to request entropy: there’s /dev/random and /dev/urandom. random is the high-quality one. When it runs out, it blocks until there is more available. urandom will supply high-quality entropy until it runs out, then it will generate more programmatically, so it doesn’t block, but it might not be as secure as random. I’m vastly simplifying how these interfaces work, but that’s the basic gist of it.
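
You can see the difference for yourself on a fairly idle machine (the byte counts here are arbitrary):

$ time head -c 64 /dev/urandom > /dev/null   # returns immediately
$ time head -c 64 /dev/random > /dev/null    # may sit there until more entropy arrives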

What to do when there’s no more entropy? ^

If you’re running applications that want a lot of high-quality entropy, and your system keeps running out, there are a few things you could do about it.

Nothing ^

So stuff slows down, who cares? It’s only applications that want high-quality entropy and they’re pretty specialised, right?

Well, no, not really. If you’re running a busy site with a lot of HTTPS connections then you probably don’t want it to be waiting around for more entropy when it could be serving your users. Another one that tends to use all the entropy is secure email – mail servers talking to each other using Transport Layer Security so the email is encrypted on the wire.

Use real hosting hardware ^

Most of BitFolk’s customers are using it for personal hosting, this problem is common to virtual hosting platforms (it’s not a BitFolk-specific issue), and BitFolk doesn’t provide dedicated/colo servers, so arguably I don’t need to consider this my problem to fix. If the customer could justify greater expense then they could move to a dedicated server or colo provider to host their stuff.

Tell the software to use urandom instead ^

In a lot of cases it’s possible to tell the applications to use urandom instead. Since urandom doesn’t block, but instead generates more lower-quality entropy on demand, there shouldn’t be a performance problem. There are obvious downsides to this:

  • If the application author wanted high-quality entropy, it might be unwise to not respect that.
  • Altering this may not be as simple as changing its configuration. You might find yourself having to recompile the software, which is a lot of extra work.

You could force this system-wide by replacing your /dev/random with /dev/urandom.
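
If you did want to do that, it’s just a device node swap; 1,8 and 1,9 are the standard major/minor numbers for random and urandom, and note that udev may well put the original node back at the next boot:

$ sudo mv /dev/random /dev/random.orig
$ sudo mknod /dev/random c 1 9    # make /dev/random behave like urandom
$ sudo chmod 666 /dev/random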

Customers could get some more entropy from somewhere else ^

It’s possible to feed your own data into your system’s pseudo-random number generator, so if you have a good source of entropy you can help yourself. People have used some weird and wonderful things for entropy sources. Some examples:

  • A sound card listening to electro-magnetic interference (“static”).
  • A web camera watching a lava lamp.
  • A web camera in a dark box, so it just sees noise on its CCD.

The problem for BitFolk customers of course is that all they have is a virtual server. They can’t attach web cams and sound cards to their servers! If they had real servers then they probably wouldn’t be having this issue at all.

BitFolk could get some entropy from somewhere else, and serve it to customers ^

BitFolk has the real servers, so I could do the above to get some extra entropy. I might not even need extra entropy; I could just serve the entropy that the real machines have. If it wasn’t for the existence of the Simtec Electronics Entropy Key then that’s probably what I’d be trying.

I haven’t got time to be playing about with sound cards listening to static, webcams in boxes and things like that, but buying a relatively cheap little gadget is well within the limit of things I’m prepared to risk wasting money on. 🙂

Customers would need to trust my entropy, of course. They already need to trust a lot of other things that I do though.

Entropy Key ^

Entropy Keys are very interesting little gadgets and I encourage you to read about how they work. It’s all a bit beyond me though, so for the purposes of this series of blog posts I’ll just take it as read that you plug an Entropy Key into a USB port, run ekeyd, and it feeds high-quality entropy into your PRNG.

I’d been watching the development of the Entropy Key with interest. When they were offered for cheap at the Debian-UK BBQ in 2009 I was sorely tempted, but I knew I wasn’t going to be able to attend, so I left it.

Then earlier this year, James at Jump happened to mention that he was doing a bulk order (I assume to fix this same issue for his own VPS customers) if anyone wanted in. Between the Debian BBQ and then I’d had a few more complaints about people running out of entropy so at ~£30 each I was thinking it was definitely worth exploring with one of them; perhaps buy more if it works.

How much entropy do I have anyway? ^

Before stuffing more entropy into my systems, I was curious how much I had available anyway. On Linux you can check this by looking at /proc/sys/kernel/random/entropy_avail. This value is in bits, and tops out at 4096 (the size of the kernel’s entropy pool). It’s not hard to plug this into your graphing system.
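
If you just want a quick look rather than full graphs, a trivial loop does the job:

$ while sleep 60; do
    echo "$(date '+%H:%M:%S') $(cat /proc/sys/kernel/random/entropy_avail)"
  done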

Click on the following images to see the full-size versions.

Typical host server, no Entropy Key ^

Here’s what some typical BitFolk VM hosting servers have in terms of available entropy.

barbar.bitfolk.com available entropy, daily

That’s pretty good. The available entropy hovers close to 4096 bits all the time. It’s what you’d expect from a typical piece of computer hardware. The weekly view shows the small jitter:

barbar.bitfolk.com available entropy, weekly

The lighter pink area is the highest 5-minute reading in each 30-minute sample. The dark line is the lowest 5-minute reading. You can see that there is a small amount of jitter where the available entropy fluctuates between about 3250 and 4096 bits.

Here’s a couple of the other host servers just to see the pattern:

corona.bitfolk.com available entropy, daily

corona.bitfolk.com available entropy, weekly

faustino.bitfolk.com available entropy, daily

faustino.bitfolk.com available entropy, weekly

No surprises here; they’re all much the same. If these were the only machines I was using then I’d probably decide that I have enough entropy.

Typical general purpose Xen-based paravirtualised virtual machine ^

Here’s a typical general purpose BitFolk VPS. It’s doing some crypto stuff, but there’s a good mix of every type of workload here.

bitfolk.com available entropy, daily

bitfolk.com available entropy, weekly

These graphs are very different. There’s much more jitter and a general lack of entropy to begin with. Still, it never appears to reach zero (although it’s important to realise that these graphs are at best 5-minute averages, so the minimum and maximum values will be lower and higher within that 5-minute span) so there doesn’t seem to be a huge problem here.

Virtual machines with more crypto ^

Here’s a couple of VMs which are doing more SSL work.

cacti.bitfolk.com available entropy, daily

cacti.bitfolk.com available entropy, weekly

This one has a fair number of web visitors and they’re all HTTPS. You can see that it’s even more jittery, and spends most of its time with less than 1024 bits of entropy available. It goes as low as ~140 bits from time to time, and because of the 5-minute sampling it’s possible that it does run out.

panel0.bitfolk.com available entropy, daily

panel0.bitfolk.com available entropy, weekly

Again, this one has some HTTPS traffic and is faring worse for entropy, with an average of only ~470 bits available. I ran a check every second for several hours and the available entropy at times was as low as 133 bits.

Summary so far ^

BitFolk doesn’t have any particularly busy crypto-heavy VMs so the above was the best I could do. I think that I’ve shown that virtual machines do have less entropy generally available, and that a moderate amount of crypto work can come close to draining it.

Based on the above results I probably wouldn’t personally take any action, since it seems none of my own VMs run out of entropy, although I am unsure whether the 133 bits I measured is merely as low as the pool is allowed to go before blocking happens. In any case, I am not really noticing poor performance.

Customers have reported running out of entropy though, so it might still be something I can fix, for them.

Where next? ^

Next:

  • See what effect using an Entropy Key has on a machine’s available entropy.
  • Assuming it has a positive effect, see if I can serve this entropy to other machines, particularly virtual ones.
  • Can I serve it from a virtual machine, so I don’t have customers interacting with my real hosts?
  • Does one Entropy Key give enough entropy for everyone that wants it?
  • Can I add extra keys and serve their entropy in a highly-available fashion?

Those are the things I’ll be looking into and will blog some more about in later parts. This isn’t high priority though so it might take a while. In the meantime, if you’re a BitFolk customer who actually is experiencing entropy exhaustion in a repeatable fashion then it’d be great if you could get in touch with me so we can see if it can be fixed.

In part 2 of this series of posts I do get the key working and serve entropy to my virtual machines.

Red Hat-based Linux under Xen, from Debian Etch

I find myself in the position of needing to overhaul BitFolk’s Red Hat-alike offerings, which at the moment are limited to Fedora Core 6. Apparently that is out of date now and Fedora aren’t supplying any updates for it anymore, and my understanding is that in Red Hat land one can’t just upgrade to the latest release without reinstalling. Sadly I don’t think I can convince every customer to upgrade to Debian or Ubuntu, so I need to move with the Red Hat times. I decided to start with CentOS 5 because that’s the one with the most demand.

The first thing I find is what looks like a very nice guide to installing CentOS under Xen. At the moment all of my dom0s are Debian Etch, some with Linux software RAID (md) and the newer ones with hardware RAID. All of the Xen domains currently have their block devices exported as individual LVM logical volumes from dom0.

Going by the HOWTO, that’s not the way to do it in CentOS (and by extension, every other Red Hat-alike) anymore. They want their storage as a disk image, exported as /dev/xvda, which they will partition themselves. A departure from my current setup, but it should be possible. Unfortunately it seems that there is some bug somewhere in md/LVM/Xen to do with trying to export an LV as a whole disk like this. Say this is my domU config file:

name = 'centos5'
memory = 120
kernel = '/boot/vmlinuz-centos5-install'
ramdisk = '/boot/initrd.img-centos5-install'
extra = 'text'
disk = [ 'phy:mapper/mainvg-domu_centos5_xvda,xvda,w' ]
vif = [ 'mac=00:16:3e:1b:b2:e7, bridge=xenbr0, vifname=v-centos5' ]

with /dev/mapper/mainvg-domu_centos5_xvda being an LV.

If I start this up, the installer kernel kicks in and begins the process, but any operations on the exported disk result in thousands of dom0 kernel messages like this:

Jan 19 01:54:43 montelena kernel: raid10_make_request bug: can't convert block across chunks or bigger than 64k 653279743 4

At some point the domU will see this as a corrupted block device, remount it read-only and the installation will fail.

Searching for more information on this problem leads me only to a similar report from January 2007 by Ask Bjørn Hansen with no follow-ups except from me, reporting the same thing. I mailed Ask to see if he solved the problem, but he told me he just moved to image files and then the problem went away. I do not want to move to image files.

The problem does not happen when I try it on a machine with hardware RAID. I still have too few machines with hardware RAID to accept that as a solution though.

I’m thinking that it wouldn’t happen if I did away with exporting whole disks and went back to exporting individual block devices. To test my theory I thought I’d go ahead and do the CentOS install on one of the machines with hardware RAID, which would leave me with a disk image on an LV that I could later split out into separate block devices.

I then encountered problems getting pygrub to work, so I may as well note down what I did to get that going in case it would be useful for anyone else.

For those who aren’t aware, pygrub is a Python tool which can look inside a disk image or a filesystem to find a GRUB menu.lst. It then emulates the GRUB menu and returns the kernel, initrd, boot arguments etc. etc. that GRUB would normally have selected. It’s a way to keep all the kernel stuff inside the guest filesystem so it can be managed as usual by the admin of the guest.

pygrub comes as part of Xen, and on Etch it can be found at /usr/lib/xen-3.0.3-1/bin/pygrub. To test it you can just run it with a disk or filesystem image/device as the parameter:

$ sudo /usr/lib/xen-3.0.3-1/bin/pygrub /dev/mainvg/domu_centos5_xvda

What that got me at first was a python error about “GrubConfig” not being found (sorry, the exact text has scrolled off my screen and I forgot to save it to a file…). That’s because of Debian bug #390678. To fix it just do as it says and edit the sys.path in /usr/lib/xen-3.0.3-1/lib/python/grub/GrubConf.py to be /usr/lib/xen-3.0.3-1/lib/python.

That got me a new error about being unable to read the filesystem. Searching around provided me with some clues that it would be a case of putting some python filesystem stuff in /usr/lib/xen-3.0.3-1/lib/python/grub/fsys/.

I downloaded the source of Xen 3.0.3, went into tools/pygrub/, installed the python2.4-dev and e2fslibs-dev packages, then did a “make”. That completed without error and left me with:

$ ls -la build/lib.linux-i686-2.4/grub/fsys/ext2/
total 36
drwxr-xr-x 2 andy andy  1024 2008-01-19 22:33 .
drwxr-xr-x 3 andy andy  1024 2008-01-19 22:33 ..
-rw-r--r-- 1 andy andy  1126 2006-10-15 12:22 __init__.py
-rwxr-xr-x 1 andy andy 30352 2008-01-19 22:33 _pyext2.so
-rw-r--r-- 1 andy andy   220 2006-10-15 12:22 test.py

which I copied into /usr/lib/xen-3.0.3-1/lib/python/grub/fsys/ext2/.
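
In command form, and assuming the Xen 3.0.3 source is unpacked in ~/xen-3.0.3, that build-and-copy dance was roughly:

$ sudo apt-get install python2.4-dev e2fslibs-dev
$ cd ~/xen-3.0.3/tools/pygrub
$ make
$ sudo cp -a build/lib.linux-i686-2.4/grub/fsys/ext2 \
    /usr/lib/xen-3.0.3-1/lib/python/grub/fsys/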

Success! I’ve now got Centos 5 in a disk image which can be booted from my Etch dom0 with the following domU config file:

name = 'centos5'
memory = 120
disk = [ 'phy:mapper/mainvg-domu_centos5_xvda,xvda,w' ]
vif = [ 'mac=00:16:3e:1b:b2:e7, bridge=xenbr0, vifname=v-centos5' ]
bootloader = '/usr/lib/xen-3.0.3-1/bin/pygrub'

(Remember this is only on a dom0 with hardware RAID; if I try to do this on LVM over md I suffer disk corruption in the exported LV.)

The next thing I wanted to do was get access to the first partition of that disk image from the dom0 so that I could take a copy of it and use it as the basis for another block device. It may not be immediately obvious to you how to get at a partition inside a disk image; the way to do it is with kpartx. On Debian that can be found in the package multipath-tools:

$ sudo kpartx -a /dev/mainvg/domu_centos5_xvda
$ sudo ls -la /dev/mapper/*centos5*
brw-rw---- 1 root disk 254, 96 2008-01-19 23:51 /dev/mapper/domu_centos5_xvda1
brw-rw---- 1 root disk 254, 97 2008-01-19 23:51 /dev/mapper/domu_centos5_xvda2
brw-rw---- 1 root disk 254, 95 2008-01-19 20:20 /dev/mapper/mainvg-domu_centos5_xvda

The first two were added by kpartx and can be mounted like any block device. I did so, took an archive of the contents, created a new test LV, and unpacked that tar into it. I then labelled the new LV as “/” (because that’s how Red Hatters find their root filesystem):

$ sudo e2label /dev/mainvg/domu_test_root /

and came up with a new domU config file:

name = 'centos5'
memory = 120
disk = [ 'phy:mapper/mainvg-domu_test_root,xvda1,w',
         'phy:mapper/mainvg-domu_test_swap,xvda2,w' ]
vif = [ 'mac=00:16:3e:1b:b2:e7, bridge=xenbr0, vifname=v-centos5' ]
bootloader = '/usr/lib/xen-3.0.3-1/bin/pygrub'

This works, no disk corruption.
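
For completeness, the archive-and-unpack steps described above look roughly like this in command form; the mount points, LV sizes and tar filename are made up:

$ sudo mkdir -p /mnt/centos5 /mnt/test_root
$ sudo mount /dev/mapper/domu_centos5_xvda1 /mnt/centos5
$ sudo tar -C /mnt/centos5 -cf /var/tmp/centos5-root.tar .
$ sudo umount /mnt/centos5
$ sudo lvcreate -L 4G -n domu_test_root mainvg
$ sudo lvcreate -L 512M -n domu_test_swap mainvg
$ sudo mkfs.ext3 /dev/mainvg/domu_test_root
$ sudo mkswap /dev/mainvg/domu_test_swap
$ sudo mount /dev/mainvg/domu_test_root /mnt/test_root
$ sudo tar -C /mnt/test_root -xf /var/tmp/centos5-root.tar
$ sudo umount /mnt/test_root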

So CentOS 5 is available now, and I will try to get around to Fedora 8 as well at some point.

If your Ubuntu Gutsy chroot/domU segfaults weirdly under Xen…

If your Ubuntu Gutsy chroot/domU segfaults weirdly under Xen, maybe you forgot to move /lib/tls out of the way?

I encountered this one yesterday as one of my customers asked me about problems debootstrapping under Xen. I have done several debootstraps before so wasn’t aware of any obvious problem, and this irritated me enough to make me delve into it further.

It was actually the dpkg --configure <package> step done for each package after debootstrap that was failing with a segfault. Having a look at the post-installation script for that first package, the part where it was segfaulting was when it tried to do:

update-rc.d module-init-tools start 15 S .

a quick strace of that revealed:

open(".", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 3
fstat64(3, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
fcntl64(3, F_SETFD, FD_CLOEXEC)         = 0
getdents64(3, /* 11 entries */, 4096)   = 368
getdents64(3, /* 0 entries */, 4096)    = 0
--- SIGSEGV (Segmentation fault) @ 0 (0) ---

Seeing how low-level this had become, I started to consider where the bug would have to be, and soon thought “what about libc?” There are well-known issues with 32-bit Xen and the TLS features of GNU libc, which is why Debian has provided a libc6-xen package since Etch. Sure enough, doing:

# mv /lib/tls /lib/tls.disabled

made this problem go away and resulted in a perfectly usable chroot environment.

TLS is normally one of the first things I check when doing things under Xen, but I didn’t think for one minute that Gutsy would still be shipping with a non-fixed libc.

I’m not really an Ubuntu user and am not aware of their packaging strategies for Xen stuff but I was also surprised to see that there is apparently no libc6-xen package in Gutsy. What is the recommended way for Ubuntu domUs to deal with this problem?

Moving the directory out of the way works, but it sucks as anything but a workaround as any libc update will replace the files. That happens to my users with older distributions all the time and they only find out because of the masses of warnings in the kernel log. Fair enough, but if it’s going to start causing weird segfaults as well then I don’t really want that support burden.

(Update: it was only because I didn’t have all the repositories in my /etc/apt/sources.list. Gutsy does have a libc6-xen package.)

It is perhaps time for me to increase the priority of moving to 64-bit. (Most of my servers already do 64-bit, but I run them 32-bit as some of my customers’ chosen distributions have iffy 64-bit architecture support — that may not be the case any longer.)