Removing dead tracks from your Banshee library

I used MusicBrainz Picard to reorganize hundreds of files in my music collection. Banshee spotted that new files had appeared, but for some reason it didn’t remove the old ones from the library. My library ended up with hundreds of entries relating to files that no longer existed on disk.

I would have hoped that Banshee would simply skip a playlist entry whose file didn’t exist, but unfortunately (as of v2.4.1 in Ubuntu 12.04) it decides to stop playing at that point. I have to switch to it and double-click the next playlist entry, and even then it thinks it is still playing the file it previously tried to play, so I have to double-click again.

I have filed a bug about this.

In the meantime I wanted to remove all the dead entries from my library. I could have deleted my library and re-imported it all, but I wanted to keep some of the metadata. I knew it was an SQLite database, so I poked around a bit and came up with this. It seems to work, so maybe it will be useful for someone else.
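The idea was along these lines. This is a Python sketch rather than what I actually ran, and the database path and the CoreTracks/Uri schema are assumptions based on Banshee 2.x, so back up banshee.db before letting anything like this touch it:

```python
import os
import sqlite3
from urllib.parse import unquote, urlparse

# Assumed default location of the Banshee 2.x library database.
DB_PATH = os.path.expanduser("~/.config/banshee-1/banshee.db")

def remove_dead_tracks(db_path):
    """Delete CoreTracks rows whose file:// URI no longer exists on disk."""
    conn = sqlite3.connect(db_path)
    cur = conn.execute("SELECT TrackID, Uri FROM CoreTracks")
    dead = [
        track_id
        for track_id, uri in cur.fetchall()
        if uri.startswith("file://")
        and not os.path.exists(unquote(urlparse(uri).path))
    ]
    conn.executemany("DELETE FROM CoreTracks WHERE TrackID = ?",
                     [(t,) for t in dead])
    conn.commit()
    conn.close()
    return len(dead)
```

Run it with Banshee closed, otherwise Banshee may hold the database locked or overwrite the changes.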

Scanning for open recursive DNS resolvers

A few days ago we unfortunately received some abuse reports about customers whose DNS resolvers were being abused to participate in a distributed denial of service attack.

Amongst other issues, DNS servers which are misconfigured to allow arbitrary hosts to do recursive queries through them can be used by attackers to launch an amplified attack on a forged source address.

I try to scan our address space reasonably often but I must admit I hadn’t done so for some time. I kicked off another scan and found one more customer with a misconfigured resolver, which has since been fixed.

After mentioning that I would do a scan I was asked how I do that.

I use a Perl script I’ve hacked together over the last couple of years. I took a few minutes to tidy it up and add a small amount of documentation (run it with --man to read that), so here it is in case anyone finds it useful:

Update: This code has now been moved to GitHub. If you have any comments, problems or improvements please do submit them as an issue and I will get to it much quicker. The gist below is now out of date so please don’t use it.

Using the default 100 concurrent queries it scans a /21 in about 80 seconds (YMMV depending upon how many hosts you have that firewall 53/UDP). That scales roughly linearly with the number of concurrent queries, so using -q 200 for example will cut it down to about 40 seconds. It’s only a select loop though, so it will use more CPU if you do that.

Two things I’ve noticed since:

  • It doesn’t handle bgsend failing to create a socket, so for example if you run up against your limit of file descriptors (commonly ~1024 on Linux) the whole thing will get stuck at 100% CPU.
  • One person reported a similar situation (bgsend fails, stuck at 100% CPU) when they allowed it to try to send to a broadcast address. I haven’t been able to replicate that one yet.
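The core test the script performs is simple: send a query with the RD (recursion desired) bit set and see whether the reply comes back with the RA (recursion available) bit set. Here is a minimal Python sketch of that check, not the Perl script itself; the packet layout follows RFC 1035:

```python
import struct

def build_query(qname, txid=0x1234):
    """Build a minimal DNS query for an A record with the RD bit set."""
    # Header: ID, flags (RD = 0x0100), QDCOUNT=1, AN/NS/ARCOUNT=0
    packet = struct.pack(">HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    for label in qname.rstrip(".").split("."):
        packet += bytes([len(label)]) + label.encode("ascii")
    packet += b"\x00"                   # root label terminates the name
    packet += struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return packet

def offers_recursion(response):
    """True if a DNS response has the RA (recursion available) bit set."""
    if len(response) < 12:
        return False
    flags = struct.unpack(">H", response[2:4])[0]
    return bool(flags & 0x0080)         # RA is bit 7 of the flags word
```

To probe a host you would send build_query("example.com") to UDP port 53 and feed the reply to offers_recursion(); a host that answers with RA set for arbitrary clients is an open recursive resolver.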

Converting an IPv6 address to its reverse zone in Perl

I need to work out the IPv6 reverse zone for a given IPv6 CIDR prefix, that is, an address followed by a forward slash and the number of bits in the network part, e.g.:

  • 2001:ba8:1f1:f004::/64 → 4.0.0.f.1.f.1.0.8.a.b.0.1.0.0.2.ip6.arpa
  • 4:2::/32 → 2.0.0.0.4.0.0.0.ip6.arpa
  • 2001:ba8:1f1:400::/56 → 4.0.1.f.1.0.8.a.b.0.1.0.0.2.ip6.arpa

I had a quick look for a module that does it, but couldn’t find one, so I hacked this subroutine together:

Is there a more elegant way? Is there a module I can replace this with?

Must support:

  • Arbitrary prefix length
  • Use of ‘::’ anywhere legal in the address
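For comparison, here is a sketch of the same conversion in Python using the standard ipaddress module (which handles ‘::’ expansion for free). It only accepts nibble-aligned prefix lengths, since ip6.arpa zones delegate on 4-bit boundaries, so a /56 keeps the first 14 nibbles:

```python
import ipaddress

def ipv6_reverse_zone(prefix):
    """Return the ip6.arpa zone for an IPv6 CIDR prefix.

    The prefix length must be a multiple of 4, because reverse DNS
    delegations happen on nibble boundaries.
    """
    net = ipaddress.IPv6Network(prefix)
    if net.prefixlen % 4:
        raise ValueError("prefix length must be on a nibble boundary")
    # 32 hex nibbles of the fully expanded network address
    nibbles = net.network_address.exploded.replace(":", "")
    kept = nibbles[: net.prefixlen // 4]
    return ".".join(reversed(kept)) + ".ip6.arpa"
```

It isn’t a drop-in replacement for a Perl subroutine, of course, but it shows how little logic is actually needed once something else does the address parsing.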

Adding watermarks to a PDF with Perl’s PDF::API2

For ages I’ve been trying to work out how to programmatically add a watermark to an existing PDF using the Perl PDF::API2 module.

In case it’s useful for anyone else, here goes:

use warnings;
use strict;

use PDF::API2;

my $ifile = "/some/file.pdf";
my $ofile = "/some/newfile.pdf";
my $pdf   = PDF::API2->open($ifile);
my $fnt   = $pdf->ttfont('verdana.ttf', -encode => 'utf8');
# Only on the front page
my $page  = $pdf->openpage(1);
my $text  = $page->text;

$page->mediabox('A4');

# Create two extended graphic states; one for transparent content
# and one for normal content
my $eg_trans = $pdf->egstate();
my $eg_norm  = $pdf->egstate();

$eg_trans->transparency(0.7);
$eg_norm->transparency(0);

# Go transparent
$text->egstate($eg_trans);
# Black edges
$text->strokecolor('#000000');
# With light green fill
$text->fillcolor('#d3e6d2');
# Text render mode: fill text first then put the edges on top
$text->render(2);
# Diagonal transparent text
$text->textlabel(55, 235, $fnt, 56, "i'm in ur pdf", -rotate => 30);
# Back to non-transparent to do other stuff
$text->egstate($eg_norm);

# …

# Save
$pdf->saveas($ofile);
$pdf->end;

I think there may be better ways of doing this within Perl now, but I already had some code using PDF::API2.

Copying block devices between machines

Since I have a bunch of Linux servers that run Linux virtual machines, I often find myself having to move a virtual machine from one server to another. The tricky thing is that I’m not in a position to be using shared storage, i.e., the virtual machines’ storage is local to the machine they are running on. So, the data has to be moved first.

A naive approach

The naive approach is something like the following:

  1. Ensure that I can SSH as root using an SSH key from the source host to the destination host.
  2. Create a new LVM logical volume on the destination host that’s the same size as the original.
  3. Shut down the virtual machine.
  4. Copy the data across using something like this:
    $ sudo dd bs=4M if=/dev/mapper/myvg-src_lv |
      sudo ssh root@dest-host 'dd bs=4M of=/dev/mapper/myvg-dest_lv'
    
  5. While that is copying, do any other configuration transfer that’s required.
  6. When it’s finished, start up the virtual machine on the destination host.

I also like to stick pv in the middle of that pipeline so I get a nice text mode progress bar (a bit like what you see with wget):

$ sudo dd bs=4M if=/dev/mapper/myvg-src_lv | pv -s 10g |
  sudo ssh root@dest-host 'dd bs=4M of=/dev/mapper/myvg-dest_lv'

The above transfers data between hosts via ssh, which will introduce some overhead since it will be encrypting everything. You may or may not wish to force it to do compression, or pipe it through a compressor (like gzip) first, or even avoid ssh entirely and just use nc.

Personally I don’t care about the ssh overhead; this is, on the whole, customer data and I’m happier if it’s encrypted. I also don’t bother compressing it unless it’s going over the Internet. Over a gigabit LAN I’ve found it fastest to use ssh with the -c arcfour option.

The above process works, but it has some fairly major limitations:

  1. The virtual machine needs to be shut down for the whole time it takes to transfer data from one host to another. For 10GiB of data that’s not too bad. For 100GiB of data it’s rather painful.
  2. It transfers the whole block device, even the empty bits. For example, if it’s a 10GiB block device with 2GiB of data on it, 10GiB still gets transferred.

Limitation #2 can be mitigated somewhat by compressing the data. But we can do better.

LVM snapshots

One of the great things about LVM is snapshots. You can do a snapshot of a virtual machine’s logical volume while it is still running, and transfer that using the above method.

But what do you end up with? A destination host with an out of date copy of the data on it, and a source host that is still running a virtual machine that’s still updating its data. How to get just the differences from the source host to the destination?

Again there is a naive approach, which is to shut down the virtual machine and mount the logical volume on the host itself, do the same on the destination host, and use rsync to transfer the differences.

This will work, but again has major issues such as:

  1. It’s technically possible for a virtual machine admin to maliciously construct a filesystem that interferes with the host that mounts it. Mounting random filesystems is risky.
  2. Even if you’re willing to risk the above, you have to guess what the filesystem is going to be. Is it ext3? Will it have the same options that your host supports? Will your host even support whatever filesystem is on there?
  3. What if it isn’t a filesystem at all? It could well be a partitioned disk device, which you can still work with using kpartx, but it’s a major pain. Or it could even be a raw block device used by some tool you have no clue about.

The bottom line is, it’s a world of risk and hassle interfering with the data of virtual machines that you don’t admin.

Sadly rsync doesn’t support syncing a block device. There’s a --copy-devices patch that allows it to do so, but after applying it I found that while it could now read from a block device, it would still only write to a file.

Next I found a --write-devices patch by Darryl Dixon, which provides the other end of the functionality – it allows rsync to write to a block device instead of files in a filesystem. Unfortunately no matter what I tried, this would just send all the data every time, i.e., it was no more efficient than just using dd.

Read a bit, compare a bit

While searching about for a solution to this dilemma, I came across this horrendous and terrifying bodge of shell and Perl on serverfault.com:

ssh -i /root/.ssh/rsync_rsa $remote "
  perl -'MDigest::MD5 md5' -ne 'BEGIN{\$/=\1024};print md5(\$_)' $dev2 | lzop -c" |
  lzop -dc | perl -'MDigest::MD5 md5' -ne 'BEGIN{$/=\1024};$b=md5($_);
    read STDIN,$a,16;if ($a eq $b) {print "s"} else {print "c" . $_}' $dev1 | lzop -c |
ssh -i /root/.ssh/rsync_rsa $remote "lzop -dc |
  perl -ne 'BEGIN{\$/=\1} if (\$_ eq\"s\") {\$s++} else {if (\$s) {
    seek STDOUT,\$s*1024,1; \$s=0}; read ARGV,\$buf,1024; print \$buf}' 1<> $dev2"

Are you OK? Do you need to have a nice cup of tea and a sit down for a bit? Yeah. I did too.

I’ve rewritten this thing into a single Perl script so it’s a little bit more readable, but I’ll attempt to explain what the above abomination does.

Even though I do refer to this script in unkind terms like “abomination”, I will be the first to admit that I couldn’t have come up with it myself, and that I’m not going to show you my single Perl script version because it’s still nearly as bad. Sorry!

It connects to the destination host and starts a Perl script which begins reading the block device over there, 1024 bytes at a time, running that through md5 and piping the output to a Perl script running locally (on the source host).

The local Perl script is reading the source block device 1024 bytes at a time, doing md5 on that and comparing it to the md5 hashes it is reading from the destination side. If they’re the same then it prints “s” otherwise it prints “c” followed by the actual data from the source block device.

The output of the local Perl script is fed to a third Perl script running on the destination. It takes the sequence of “s” or “c” as instructions on whether to skip 1024 bytes (“s”) of the destination block device or whether to take 1024 bytes of data and write it to the destination block device (“c<1024 bytes of data>”).

The lzop bits are just doing compression and can be changed for gzip or omitted entirely.

Hopefully you can see that this is behaving like a very very dumb version of rsync.
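Stripped of the ssh plumbing and the compression, the core loop is easy to sketch. This is a local-only Python illustration, not the script itself: it compares the two “devices” chunk by chunk and rewrites only the chunks that differ, whereas the real thing compares md5 digests across the network rather than raw bytes:

```python
import os

CHUNK = 1024 * 1024  # 1 MiB chunks

def sync_device(src_path, dst_path, chunk=CHUNK):
    """Make dst identical to src, writing only the chunks that differ.

    Both paths must already be the same size and seekable, which is
    the case for a snapshot volume and its destination logical volume.
    Returns the number of chunks that had to be rewritten.
    """
    changed = 0
    with open(src_path, "rb") as src, open(dst_path, "r+b") as dst:
        while True:
            offset = src.tell()
            s = src.read(chunk)
            if not s:
                break
            dst.seek(offset)
            d = dst.read(len(s))
            if s != d:
                # Chunk differs: seek back and overwrite just this chunk.
                dst.seek(offset)
                dst.write(s)
                changed += 1
    return changed
```

The whole of both devices still gets read, but after an initial dd copy only the handful of chunks dirtied since the snapshot actually get written.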

The thing is, it works really well. If you’re not convinced, run md5sum (or sha1sum or whatever you like) on both the source and destination block devices to verify that they’re identical.

The process now becomes something like:

  1. Take an LVM snapshot of the virtual machine’s block device while the virtual machine is still running.
  2. Create a suitable logical volume on the destination host.
  3. Use dd to copy the snapshot volume to the destination volume.
  4. Move over any other configuration while that’s taking place.
  5. When the initial copy is complete, shut down the virtual machine.
  6. Run the script of doom to sync over the differences from the real device to the destination.
  7. When that’s finished, start up the virtual machine on the destination host.
  8. Delete the snapshot on the source host.

1024 bytes seemed like rather a small buffer to be working with so I upped it to 1MiB.

I find that on a typical 10GiB block device there might only be a few hundred MiB of changes between snapshot and virtual machine shut down. The entire device does have to be read through of course, but the down time and data transferred is dramatically reduced.

There must be a better way

Is there a better way to do this, still without shared storage?

It’s getting difficult to sell the disk capacity that comes with the number of spindles I need for performance, so maybe I could do something with DRBD so that there’s always another server with a copy of the data?

This seems like it should work, but I’ve no experience of DRBD. Presumably the active node would have to be using the /dev/drbdX devices as disks. Does DRBD scale to having say 100 of those on one host? It seems like a lot of added complexity.

I’d love to hear any other ideas.

Setting envelope sender when using Mail::Send

I just thought I would blog this so that people can find it in a search, as I could not when I tried.

I have a Perl script on a machine that sends out transactional emails with a from address like billing@example.com. The machine it sends from is not example.com; let’s call it admin.foobar.example.com. Since some misguided people think that sender verification callouts are a good idea, their MX hosts try to connect to admin.foobar.example.com to see if the billing address is there and accepting mail. This fails because admin.foobar.example.com does not run a public MTA, and I don’t want it to run one either.

It’s clear that the correct thing is for the script to set its envelope sender to billing@example.com, since that should be an email address that does actually work! How to do that with Perl’s Mail::Send, however, is not so clear.

After trying to work this out myself for a few hours I asked dg, who is not just a perl monger but a veritable perl Tesco, making mere perl mongers compromise on their margins and putting many of them completely out of business. Every little helps.

Based on his reading of the source of Mail/Util.pm, he thought it might be accomplished by setting the MAILADDRESS environment variable. He was wrong however:


<dg> oh, that only works for SMTP
<dg> how crap

The next suggestion was a no-go also, but third time lucky:


<dg> actually, $msg->open("sendmail", "-f ...");

and that works. So there you go.

Thanks dg.