Copying block devices between machines

Having a bunch of Linux servers that run Linux virtual machines, I often find myself having to move a virtual machine from one server to another. The tricky thing is that I’m not in a position to be using shared storage, i.e., the virtual machines’ storage is local to the machine they are running on, so the data has to be moved first.

A naive approach

The naive approach is something like the following:

  1. Ensure that I can SSH as root using an SSH key from the source host to the destination host.
  2. Create a new LVM logical volume on the destination host that’s the same size as the source volume (a sketch of this follows the list).
  3. Shut down the virtual machine.
  4. Copy the data across using something like this:
    $ sudo dd bs=4M if=/dev/mapper/myvg-src_lv |
      sudo ssh root@dest-host 'dd bs=4M of=/dev/mapper/myvg-dest_lv'
    
  5. While that is copying, do any other configuration transfer that’s required.
  6. When it’s finished, start up the virtual machine on the destination host.
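For step 2, something like this would do it (the volume group and LV names here are only examples; lvcreate rounds the size up to the next extent, which is fine since the destination just needs to be at least as big as the source):

$ SRC_BYTES=$(sudo blockdev --getsize64 /dev/mapper/myvg-src_lv)
$ sudo ssh root@dest-host "lvcreate -L ${SRC_BYTES}b -n dest_lv myvg"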

I also like to stick pv in the middle of that pipeline so I get a nice text mode progress bar (a bit like what you see with wget):

$ sudo dd bs=4M if=/dev/mapper/myvg-src_lv | pv -s 10g |
  sudo ssh root@dest-host 'dd bs=4M of=/dev/mapper/myvg-dest_lv'

The above transfers data between hosts via ssh, which will introduce some overhead since it will be encrypting everything. You may or may not wish to force it to do compression, or pipe it through a compressor (like gzip) first, or even avoid ssh entirely and just use nc.
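To illustrate those variants (same example device names as before; note that the nc version sends everything unencrypted, and that nc’s listen syntax differs between implementations, with traditional netcat shown here):

# compressing in flight; zero-filled regions compress very well
$ sudo dd bs=4M if=/dev/mapper/myvg-src_lv | gzip -1 |
  sudo ssh root@dest-host 'gzip -dc | dd bs=4M of=/dev/mapper/myvg-dest_lv'

# on the destination host, listen on a spare port (1234 is just an example)
$ nc -l -p 1234 | sudo dd bs=4M of=/dev/mapper/myvg-dest_lv
# then on the source host
$ sudo dd bs=4M if=/dev/mapper/myvg-src_lv | nc dest-host 1234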

Personally I don’t care about the ssh overhead; this is, on the whole, customer data and I’m happier if it’s encrypted. I also don’t bother compressing it unless it’s going over the Internet. Over a gigabit LAN I’ve found it fastest to use ssh with the -c arcfour option.

The above process works, but it has some fairly major limitations:

  1. The virtual machine needs to be shut down for the whole time it takes to transfer data from one host to another. For 10GiB of data that’s not too bad. For 100GiB of data it’s rather painful.
  2. It transfers the whole block device, even the empty bits. For example, if it’s a 10GiB block device with 2GiB of data on it, 10GiB still gets transferred.

Limitation #2 can be mitigated somewhat by compressing the data. But we can do better.

LVM snapshots

One of the great things about LVM is snapshots. You can take a snapshot of a virtual machine’s logical volume while it is still running, and transfer that using the above method.
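Something like this, for example (the snapshot name and the 2GiB of copy-on-write space are just illustrative; the snapshot needs enough space to absorb whatever the virtual machine writes while the copy is going on):

$ sudo lvcreate --snapshot --size 2G --name src_lv_snap /dev/myvg/src_lv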

But what do you end up with? A destination host with an out-of-date copy of the data on it, and a source host that is still running a virtual machine which is still updating its data. How do you get just the differences from the source host to the destination?

Again there is a naive approach, which is to shut down the virtual machine and mount the logical volume on the host itself, do the same on the destination host, and use rsync to transfer the differences.

This will work, but again has major issues such as:

  1. It’s technically possible for a virtual machine admin to maliciously construct a filesystem that interferes with the host that mounts it. Mounting random filesystems is risky.
  2. Even if you’re willing to risk the above, you have to guess what the filesystem is going to be. Is it ext3? Will it have the same options that your host supports? Will your host even support whatever filesystem is on there?
  3. What if it isn’t a filesystem at all? It could well be a partitioned disk device, which you can still work with using kpartx, but it’s a major pain. Or it could even be a raw block device used by some tool you have no clue about.
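For what it’s worth, the kpartx route looks roughly like this (a sketch only; the device, mount point and the names of the partition mappings that kpartx creates are all illustrative and may differ on your system):

$ sudo kpartx -av /dev/myvg/src_lv
$ sudo mount /dev/mapper/myvg-src_lv1 /mnt/guest
# rsync the mounted filesystem across, then undo it all:
$ sudo umount /mnt/guest
$ sudo kpartx -dv /dev/myvg/src_lv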

The bottom line is, it’s a world of risk and hassle interfering with the data of virtual machines that you don’t admin.

Sadly rsync doesn’t support syncing a block device. There’s a --copy-devices patch that allows it to do so, but after applying it I found that, while it could now read from a block device, it would still only write to a file.

Next I found a --write-devices patch by Darryl Dixon, which provides the other end of the functionality – it allows rsync to write to a block device instead of files in a filesystem. Unfortunately no matter what I tried, this would just send all the data every time, i.e., it was no more efficient than just using dd.

Read a bit, compare a bit

While searching about for a solution to this dilemma, I came across this horrendous and terrifying bodge of shell and Perl on serverfault.com:

ssh -i /root/.ssh/rsync_rsa $remote "
  perl -'MDigest::MD5 md5' -ne 'BEGIN{\$/=\1024};print md5(\$_)' $dev2 | lzop -c" |
  lzop -dc | perl -'MDigest::MD5 md5' -ne 'BEGIN{$/=\1024};$b=md5($_);
    read STDIN,$a,16;if ($a eq $b) {print "s"} else {print "c" . $_}' $dev1 | lzop -c |
ssh -i /root/.ssh/rsync_rsa $remote "lzop -dc |
  perl -ne 'BEGIN{\$/=\1} if (\$_ eq\"s\") {\$s++} else {if (\$s) {
    seek STDOUT,\$s*1024,1; \$s=0}; read ARGV,\$buf,1024; print \$buf}' 1<> $dev2"

Are you OK? Do you need to have a nice cup of tea and a sit down for a bit? Yeah. I did too.

I’ve rewritten this thing into a single Perl script so it’s a little bit more readable, but I’ll attempt to explain what the above abomination does.

Even though I do refer to this script in unkind terms like “abomination”, I will be the first to admit that I couldn’t have come up with it myself, and that I’m not going to show you my single Perl script version because it’s still nearly as bad. Sorry!

It connects to the destination host and starts a Perl script there which reads the destination block device 1024 bytes at a time, runs each block through md5 and pipes the resulting digests to a Perl script running locally (on the source host).

The local Perl script is reading the source block device 1024 bytes at a time, doing md5 on that and comparing it to the md5 hashes it is reading from the destination side. If they’re the same then it prints “s”; otherwise it prints “c” followed by the actual 1024 bytes of data from the source block device.

The output of the local Perl script is fed to a third Perl script running on the destination. It takes the sequence of “s” or “c” as instructions on whether to skip 1024 bytes (“s”) of the destination block device or whether to take 1024 bytes of data and write it to the destination block device (“c<1024 bytes of data>”).

The lzop bits are just doing compression and can be changed for gzip or omitted entirely.

Hopefully you can see that this is behaving like a very very dumb version of rsync.

The thing is, it works really well. If you’re not convinced, run md5sum (or sha1sum or whatever you like) on both the source and destination block devices to verify that they’re identical.
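Something along these lines is enough to check, once the virtual machine has been shut down and the final sync has run (example device names again):

$ sudo md5sum /dev/mapper/myvg-src_lv
$ sudo ssh root@dest-host 'md5sum /dev/mapper/myvg-dest_lv'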

The process now becomes something like:

  1. Take an LVM snapshot of the virtual machine’s block device while the virtual machine is still running.
  2. Create a suitable logical volume on the destination host.
  3. Use dd to copy the snapshot volume to the destination volume.
  4. Move over any other configuration while that’s taking place.
  5. When the initial copy is complete, shut down the virtual machine.
  6. Run the script of doom to sync over the differences from the real device to the destination.
  7. When that’s finished, start up the virtual machine on the destination host.
  8. Delete the snapshot on the source host (command sketched below).
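The snapshot removal in the last step is just this (continuing with the example snapshot name from earlier):

$ sudo lvremove -f /dev/myvg/src_lv_snap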

1024 bytes seemed like rather a small buffer to be working with, so I upped it to 1MiB. That just means changing each hard-coded 1024 in the Perl snippets to 1048576; the 16-byte MD5 reads and the one-byte instruction reads stay as they are.

I find that on a typical 10GiB block device there might only be a few hundred MiB of changes between snapshot and virtual machine shut down. The entire device does have to be read through of course, but the down time and data transferred is dramatically reduced.

There must be a better way

Is there a better way to do this, still without shared storage?

It’s getting difficult to sell the disk capacity that comes with the number of spindles I need for performance, so maybe I could do something with DRBD so that there’s always another server with a copy of the data?

This seems like it should work, but I’ve no experience of DRBD. Presumably the active node would have to be using the /dev/drbdX devices as disks. Does DRBD scale to having, say, 100 of those on one host? It seems like a lot of added complexity.

I’d love to hear any other ideas.

21 thoughts on “Copying block devices between machines”

  1. Why must it not be shared storage? I would have thought the ideal way to do this would be to use iSCSI on something like OpenFiler. Just give every customer a LUN with LVM on it, then create LVs for swap/root etc. as they want. OpenFiler will do remote block replication, so everything is kept in sync with a “hot spare” etc. Also means you can start doing live migrations…

    Just an idea!

  2. It’s really expensive to do shared storage right. It’s easy to say “just use openfiler”, but when you cost out how much it will be to do it redundantly and with enough performance, it starts to look like a bad deal.

    I would rather just install a disk chassis and connect two hosts to it over SAS, before going down the iSCSI route.

  3. im sure i read something about some guys building their own filer out of commodity parts and open sourcing the entire thing? if you have a lot of time on your hands then this might be some cheap storage?

    i havent heard great things about drbd tbh..not used it myself though!

  4. Just wondering if you thought about sharing your solution on ServerFault, might help other people out with the same problem. Nice solution!

    1. Hi Craig, I got the solution from ServerFault – I tried to make that clear. It’s not my own work. I linked to it in the article. 🙂

  5. Hi Andy, sorry it was me that wasn’t very clear. I was just wondering if you posted your re-worked single perl script solution back on serverfault.

  6. So you’re the kind of guy who prefers to expose “his skills” in private only? 😉
    Just kidding!

    Have you ever considered backing up your block devices to a Venti file server (hosted on Plan 9 from Bell Labs, the successor of Unix)? Plan 9 is a little confusing to discover at first (because it looks like Unix but is really not Unix), but it is really a sane operating system (little kernel, very well designed, etc.). The Venti file server is smart in the way that it addresses data blocks using their sha1 sum, so once a given block is written, subsequent writes of the same data block will cost nearly nothing. It is a write-once and incremental file server. http://en.wikipedia.org/wiki/Venti

    Nicolas

  7. Nice article, Andy. This sums up my experiences with transferring VM images around our systems nearly exactly. 🙂

    I just discovered recently that KVM supports block device migration when migrating running VMs from one host to another. You just need to specify ‘-b’ in the migrate command. It seems to work similar to the dd method, except since it’s part of the migration algorithm you can do it with near zero downtime.

  8. So after several hours of research, I’ve come to some conclusions.
    1) I can’t believe there is no OSS solution
    2) R1Soft’s CDP is the only thing I could find that supposedly does what I want. However, it also does a ton of other crap that I don’t need or want, and that I especially don’t want to purchase, like a web GUI. But it’s somewhat reasonable and I have to move on with some solution.

  9. So, any results? I find myself in a quite similar situation and would like to know if lvmsync is any good.

  10. Aljoscha, I have to admit that I haven’t actually implemented lvmsync yet. I really wish it wasn’t in Ruby, but we might be able to completely rewrite it.

  11. Here’s the problem:
    lvmsync seems to be a solution for moving a production system, with minimal downtime, by performing a secondary sync of changes after the initial sync has completed. The only way to use lvmsync to maintain your offsite backups would be to maintain a snapshot at all times, which is not an option for anyone remotely concerned with performance. Even then, you’d have to have a way to guarantee the moment of the new snapshot creation, and that the snapshot NEVER became overrun… not a solution.

    Andy is right, this script from ServerFault is crude, but I fail to see a better solution. I never saw anyone posting metrics of using this script vs the rsync --copy-devices option. I’ll try that next.

    The only other option I see is BlockCopy.py from http://www.bouncybouncy.net/programs/ . It requires a block device on both ends, and I would like the option to have .img files as the destination.

    1. Brian,

      What are you trying to achieve exactly?

      The script on serverfault and lvmsync work pretty much exactly the same, so if you agree that the crude script from serverfault suits your purposes then I’d instead use lvmsync since it’s written as a proper program in one file in one language.

      There is currently no way in rsync to copy a device and not have it do a full sync.

  12. We’ve been using the Perl method very successfully to replicate production virtuals to a DR server.

    There is a C+ version of lvmsync available but it suffers from the same problem as mentioned above, namely that one would need to suspend the virtual, export the snapshot to a patch file, delete and recreate the snapshot, resume the virtual, compress and transfer the patch to the DR server and apply it there. If this ever messed up we wouldn’t know, so one would need to periodically run the Perl method anyway.

    I’ve set up DRBD a couple of times in LAN environments and contacted Linbit regarding costs of the DRBD Proxy solution which provides remote replication. My problem is that I don’t have sufficient capacity to continually keep the virtuals in sync and prefer a specific date/time crash-point backup. DRBD unfortunately doesn’t currently support this type of synchronisation, so I got a quote to custom-design this for €13,430 (ouch).

    There may however be another method but it’s untested and completely theoretical:
    – Set up DRBD as a standalone volume, using LVM as its backing store, so that it generates and updates the metadata (which part of the block device has changed)
    – Periodically create a snapshot, then mount the snapshot as a DRBD volume
    – Create a snapshot on the DR server and connect the DRBD volumes so that they start syncing. As we’re syncing the snapshot we’re only using DRBD to sync the differences and not attempting to update data whilst it’s being committed.
    – Once sync is consistent disconnect the DRBD volumes and remove the snapshots

    If the source has a snapshot, the system may have restarted and DRBD needs to reconnect to finish the sync. Ditto if the destination has a snapshot.

    If DR needs to be tested one simply starts the virtuals off the DRBD volume or the snapshot (a volume with a snapshot would indicate that the current sync hasn’t finished and that the disc isn’t consistent with how it was at the time the snapshot was initiated).

  13. Herewith our Bash and Perl monster; it:
    - Wouldn’t run twice (when large changes result in syncs exceeding 24 hours)
    - Creates a snapshot on the destination which would simply fail if the snapshot already exists and would subsequently simply reuse it (recovers where it left off)
    - Creates a snapshot on the destination before starting the update, to have a consistent view until the sync finishes, at which point the snapshot can be removed.


    /etc/cron.daily/zzzzzz-network-kvm-backup
    #!/bin/sh

    network_kvm_backup () {
    src_vg=$1;
    src_lvm=$2;
    snapsize=$3;
    dst_host=$4;
    dst_vg=$5;
    num=1; while [ `lvs /dev/$src_vg/$src_lvm-snap$num 2> /dev/null | grep -Pc 'o +\d+'` -gt 0 ]; do num=$[$num+1]; done
    export dev1="/dev/$src_vg/$src_lvm-snap$num";
    export dev2="/dev/$dst_vg/$src_lvm-backup";
    export remote="root@$dst_host";

    logger "Starting to update $dev1 to $dst_host as $dev2";
    lvcreate -i 2 -L $snapsize /dev/$src_vg/$src_lvm -s -n $dev1 > /dev/null;
    ssh -i /root/.ssh/rsync_rsa -o StrictHostKeyChecking=no $remote "
    perl -'MDigest::MD5 md5' -ne 'BEGIN{\$/=\1024};print md5(\$_)' $dev2 | lzop -c" |
    lzop -dc | perl -'MDigest::MD5 md5' -ne 'BEGIN{$/=\1024};$b=md5($_);
    read STDIN,$a,16;if ($a eq $b) {print "s"} else {print "c" . $_}' $dev1 | lzop -c |
    ssh -i /root/.ssh/rsync_rsa -o StrictHostKeyChecking=no $remote "lzop -dc |
    perl -ne 'BEGIN{\$/=\1} if (\$_ eq\"s\") {\$s++} else {if (\$s) {
    seek STDOUT,\$s*1024,1; \$s=0}; read ARGV,\$buf,1024; print \$buf}' 1<> $dev2"
    logger "Finished updating $dev1 to $dst_host as $dev2";
    lvremove -f $dev1 > /dev/null;
    }

    # src_vg src_lvm snapsize dst_host dst_vg
    network_kvm_backup vg_kvm amserver 5G nas1.companysa.co.za lvm0
    network_kvm_backup vg_kvm sa-ha 30G nas1.companysa.co.za lvm0

    if [ `ps auxfww | grep 'ssh.*drc.companysa.co.za.*amserver-backup' | awk '$0 !~ /grep/ {print $2}' | wc -l` -lt 1 ]; then
    ssh -i /root/.ssh/rsync_rsa -o StrictHostKeyChecking=no drc.companysa.co.za "lvcreate -i 2 -s /dev/vg_kvm/amserver-backup -n amserver-backup-snap1 -L 5G";
    network_kvm_backup vg_kvm amserver 5G drc.companysa.co.za vg_kvm;
    ssh -i /root/.ssh/rsync_rsa -o StrictHostKeyChecking=no drc.companysa.co.za "lvremove -f /dev/vg_kvm/amserver-backup-snap1"; fi

    if [ `ps auxfww | grep 'ssh.*drc.companysa.co.za.*sa-ha-backup' | awk '$0 !~ /grep/ {print $2}' | wc -l` -lt 1 ]; then
    ssh -i /root/.ssh/rsync_rsa -o StrictHostKeyChecking=no drc.companysa.co.za "lvcreate -i 2 -s /dev/vg_kvm/sa-ha-backup -n sa-ha-backup-snap1 -L 75G";
    network_kvm_backup vg_kvm sa-ha 30G drc.companysa.co.za vg_kvm;
    ssh -i /root/.ssh/rsync_rsa -o StrictHostKeyChecking=no drc.companysa.co.za "lvremove -f /dev/vg_kvm/sa-ha-backup-snap1"; fi

  14. This Perl script saved my day a couple of years ago. Thank you for your suggestion.
    During a recent search, I found a relatively recent GPL-licensed tool. Blocksync[1] does the same task with the addition of some useful options (like --dryrun) and nice features like processed-block statistics[2], a complementary hash algorithm and an estimated time to completion (TTC).

    [1] https://github.com/theraser/blocksync
    [2] Starting worker #0 (pid: 13998)
    [worker 0] Block size is 1.0 MB
    [worker 0] Local fadvise: None
    [worker 0] Chunk size is 10240.0 MB, offset is 0
    [worker 0] Hash 1: sha512
    [worker 0] Running: ssh -c chacha20-poly1305@openssh.com […] -b 1048576 -d 3 -1 sha512
    [worker 0] Remote fadvise: None
    [worker 0] Start syncing 10240 blocks…
    [worker 0] same: 4, diff: 0, 4/10240, 3.5 MB/s (0:49:08 remaining)
    [worker 0] same: 8, diff: 0, 8/10240, 3.5 MB/s (0:48:41 remaining)
    [worker 0] same: 12, diff: 0, 12/10240, 3.5 MB/s (0:48:45 remaining)
    [worker 0] same: 16, diff: 0, 16/10240, 2.5 MB/s (0:53:28 remaining)

    1. Fefu, that looks interesting. I must admit I haven’t had to use this script for a long time as I’ve only had to sync LVM volumes, for which I’ve used lvmsync.
