Copying block devices between machines

Having a bunch of Linux servers that run Linux virtual machines I often find myself having to move a virtual machine from one server to another. The tricky thing is that I’m not in a position to be using shared storage, i.e., the virtual machines’ storage is local to the machine they are running on. So, the data has to be moved first.

A naive approach ^

The naive approach is something like the following:

  1. Ensure that I can SSH as root using SSH key from the source host to the destination host.
  2. Create a new LVM logical volume on the destination host that’s the same size as the origin.
  3. Shut down the virtual machine.
  4. Copy the data across using something like this:
    $ sudo dd bs=4M if=/dev/mapper/myvg-src_lv |
      sudo ssh root@dest-host 'dd bs=4M of=/dev/mapper/myvg-dest_lv'
  5. While that is copying, do any other configuration transfer that’s required.
  6. When it’s finished, start up the virtual machine on the destination host.

I also like to stick pv in the middle of that pipeline so I get a nice text mode progress bar (a bit like what you see with wget):

$ sudo dd bs=4M if=/dev/mapper/myvg-src_lv | pv -s 10g |
  sudo ssh root@dest-host 'dd bs=4M of=/dev/mapper/myvg-dest_lv'

The above transfers data between hosts via ssh, which will introduce some overhead since it will be encrypting everything. You may or may not wish to force it to do compression, or pipe it through a compressor (like gzip) first, or even avoid ssh entirely and just use nc.

Personally I don’t care about the ssh overhead; this is on the whole customer data and I’m happier if it’s encrypted. I also don’t bother compressing it unless it’s going over the Internet. Over a gigabit LAN I’ve found it fastest to use ssh with the -c arcfour option.

The above process works, but it has some fairly major limitations:

  1. The virtual machine needs to be shut down for the whole time it takes to transfer data from one host to another. For 10GiB of data that’s not too bad. For 100GiB of data it’s rather painful.
  2. It transfers the whole block device, even the empty bits. For example, if it’s a 10GiB block device with 2GiB of data on it, 10GiB still gets transferred.

Limitation #2 can be mitigated somewhat by compressing the data. But we can do better.

LVM snapshots ^

One of the great things about LVM is snapshots. You can do a snapshot of a virtual machine’s logical volume while it is still running, and transfer that using the above method.

But what do you end up with? A destination host with an out of date copy of the data on it, and a source host that is still running a virtual machine that’s still updating its data. How to get just the differences from the source host to the destination?

Again there is a naive approach, which is to shut down the virtual machine and mount the logical volume on the host itself, do the same on the destination host, and use rsync to transfer the differences.

This will work, but again has major issues such as:

  1. It’s technically possible for a virtual machine admin to maliciously construct a filesystem that interferes with the host that mounts it. Mounting random filesystems is risky.
  2. Even if you’re willing to risk the above, you have to guess what the filesystem is going to be. Is it ext3? Will it have the same options that your host supports? Will your host even support whatever filesystem is on there?
  3. What if it isn’t a filesystem at all? It could well be a partitioned disk device, which you can still work with using kpartx, but it’s a major pain. Or it could even be a raw block device used by some tool you have no clue about.

The bottom line is, it’s a world of risk and hassle interfering with the data of virtual machines that you don’t admin.

Sadly rsync doesn’t support syncing a block device. There’s a --copy-devices patch that allows it to do so, but after applying it I found that while it can now read from a block device, it would still only write to a file.

Next I found a --write-devices patch by Darryl Dixon, which provides the other end of the functionality – it allows rsync to write to a block device instead of files in a filesystem. Unfortunately no matter what I tried, this would just send all the data every time, i.e., it was no more efficient than just using dd.

Read a bit, compare a bit ^

While searching about for a solution to this dilemma, I came across this horrendous and terrifying bodge of shell and Perl on

ssh -i /root/.ssh/rsync_rsa $remote "
  perl -'MDigest::MD5 md5' -ne 'BEGIN{\$/=\1024};print md5(\$_)' $dev2 | lzop -c" |
  lzop -dc | perl -'MDigest::MD5 md5' -ne 'BEGIN{$/=\1024};$b=md5($_);
    read STDIN,$a,16;if ($a eq $b) {print "s"} else {print "c" . $_}' $dev1 | lzop -c |
ssh -i /root/.ssh/rsync_rsa $remote "lzop -dc |
  perl -ne 'BEGIN{\$/=\1} if (\$_ eq\"s\") {\$s++} else {if (\$s) {
    seek STDOUT,\$s*1024,1; \$s=0}; read ARGV,\$buf,1024; print \$buf}' 1<> $dev2"

Are you OK? Do you need to have a nice cup of tea and a sit down for a bit? Yeah. I did too.

I’ve rewritten this thing into a single Perl script so it’s a little bit more readable, but I’ll attempt to explain what the above abomination does.

Even though I do refer to this script in unkind terms like “abomination”, I will be the first to admit that I couldn’t have come up with it myself, and that I’m not going to show you my single Perl script version because it’s still nearly as bad. Sorry!

It connects to the destination host and starts a Perl script which begins reading the block device over there, 1024 bytes at a time, running that through md5 and piping the output to a Perl script running locally (on the source host).

The local Perl script is reading the source block device 1024 bytes at a time, doing md5 on that and comparing it to the md5 hashes it is reading from the destination side. If they’re the same then it prints “s” otherwise it prints “c” followed by the actual data from the source block device.

The output of the local Perl script is fed to a third Perl script running on the destination. It takes the sequence of “s” or “c” as instructions on whether to skip 1024 bytes (“s”) of the destination block device or whether to take 1024 bytes of data and write it to the destination block device (“c<1024 bytes of data>“).

The lzop bits are just doing compression and can be changed for gzip or omitted entirely.

Hopefully you can see that this is behaving like a very very dumb version of rsync.

The thing is, it works really well. If you’re not convinced, run md5sum (or sha1sum or whatever you like) on both the source and destination block devices to verify that they’re identical.

The process now becomes something like:

  1. Take an LVM snapshot of virtual machine block device while the virtual machine is still running.
  2. Create suitable logical volume on destination host.
  3. Use dd to copy the snapshot volume to the destination volume.
  4. Move over any other configuration while that’s taking place.
  5. When the initial copy is complete, shut down the virtual machine.
  6. Run the script of doom to sync over the differences from the real device to the destination.
  7. When that’s finished, start up the virtual machine on the destination host.
  8. Delete snapshot on source host.

1024 bytes seemed like rather a small buffer to be working with so I upped it to 1MiB.

I find that on a typical 10GiB block device there might only be a few hundred MiB of changes between snapshot and virtual machine shut down. The entire device does have to be read through of course, but the down time and data transferred is dramatically reduced.

There must be a better way ^

Is there a better way to do this, still without shared storage?

It’s getting difficult to sell the disk capacity that comes with the number of spindles I need for performance, so maybe I could do something with DRBD so that there’s always another server with a copy of the data?

This seems like it should work, but I’ve no experience of DRBD. Presumably the active node would have to be using the /dev/drbdX devices as disks. Does DRBD scale to having say 100 of those on one host? It seems like a lot of added complexity.

I’d love to hear any other ideas.