Barclays’ strange priorities

Over the last week I’ve been receiving calls to my mobile from an 0800 number. They’ve been ringing off too quickly for me to catch them, and since they’re presenting an 0800 number and not leaving a message, I’ve been assuming they’re sales calls that I don’t need to return.

Today I caught one, and it turns out it’s Barclays doing a survey of their business banking. This seems to happen about twice a year, and while it’s nice and all, the fact is that I never have any contact with BitFolk’s business account manager.

In fact, the account manager who first set up BitFolk’s account moved on shortly afterwards, so we were assigned another one, who I’ve never had any contact with. Not even by email. I’m not complaining as I genuinely haven’t needed to talk to the guy.

This would be about the fourth of these phone surveys I’ve been given, though, and every time after the first I’ve just said, “Can I stop you there? I haven’t ever had any contact with my account manager, so I think most of your questions are going to be irrelevant”, and they’ve agreed with me and ended the call.

It seems like a large waste of effort to keep giving me these surveys. Don’t they have records about whether I’ve actually ever talked to the guy?

Still, thanks Barclays for not withholding CLI, and for actually caring what my experience is like.

Where are all the low power enterprise SATA drives?

It’s a bit annoying that there don’t seem to be many low power SATA enterprise drives.

The Western Digital 500GB Green Power RE2-GP ones were good for a while, but then they went end-of-life. The only enterprise Green Power drives now are the 2TB RE4-GP at ~£160+VAT a go. What do you do at the 750GB – 1TB scale?

Then again, according to figures from span.com, the power usage of say the 1TB Hitachi-HGST Ultrastar A7K2000 24×7 HUA722010CLA330 0A39289 isn’t that far off that of the 2TB WD RE4-GP:

Drive                                                         Capacity  Wattage  Wattage (Idle)  Wattage (Sleep)  Cost
Western Digital Caviar RE4-GP WD2002FYPS                      2TB       6.8 W    3.7 W           0.8 W            £161
Hitachi-HGST Ultrastar A7K2000 24×7 HUA722010CLA330 0A39289   1TB       8.4 W    3.9 W           0.8 W            £86

So maybe drives that aren’t specifically marketed as “low power” are getting better in that regard?

Copying block devices between machines

Since I have a bunch of Linux servers that run Linux virtual machines, I often find myself having to move a virtual machine from one server to another. The tricky thing is that I’m not in a position to be using shared storage, i.e., the virtual machines’ storage is local to the machine they are running on. So, the data has to be moved first.

A naive approach ^

The naive approach is something like the following:

  1. Ensure that I can SSH as root, using an SSH key, from the source host to the destination host.
  2. Create a new LVM logical volume on the destination host that’s the same size as the origin.
  3. Shut down the virtual machine.
  4. Copy the data across using something like this:
    $ sudo dd bs=4M if=/dev/mapper/myvg-src_lv |
      sudo ssh root@dest-host 'dd bs=4M of=/dev/mapper/myvg-dest_lv'
    
  5. While that is copying, do any other configuration transfer that’s required.
  6. When it’s finished, start up the virtual machine on the destination host.

I also like to stick pv in the middle of that pipeline so I get a nice text mode progress bar (a bit like what you see with wget):

$ sudo dd bs=4M if=/dev/mapper/myvg-src_lv | pv -s 10g |
  sudo ssh root@dest-host 'dd bs=4M of=/dev/mapper/myvg-dest_lv'

The above transfers data between hosts via ssh, which will introduce some overhead since it will be encrypting everything. You may or may not wish to force it to do compression, or pipe it through a compressor (like gzip) first, or even avoid ssh entirely and just use nc.

Personally I don’t care about the ssh overhead; this is on the whole customer data and I’m happier if it’s encrypted. I also don’t bother compressing it unless it’s going over the Internet. Over a gigabit LAN I’ve found it fastest to use ssh with the -c arcfour option.
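
For what it’s worth, the nc variant looks something like this. Treat it as a sketch: netcat syntax differs between the traditional and OpenBSD versions, and the port number here is just a placeholder:

# on the destination host, start a listener first (netcat-traditional syntax;
# the OpenBSD variant wants "nc -l 1234" instead)
$ nc -l -p 1234 | gunzip | sudo dd bs=4M of=/dev/mapper/myvg-dest_lv

# then on the source host
$ sudo dd bs=4M if=/dev/mapper/myvg-src_lv | gzip -1 | nc dest-host 1234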

The above process works, but it has some fairly major limitations:

  1. The virtual machine needs to be shut down for the whole time it takes to transfer data from one host to another. For 10GiB of data that’s not too bad. For 100GiB of data it’s rather painful.
  2. It transfers the whole block device, even the empty bits. For example, if it’s a 10GiB block device with 2GiB of data on it, 10GiB still gets transferred.

Limitation #2 can be mitigated somewhat by compressing the data. But we can do better.

LVM snapshots ^

One of the great things about LVM is snapshots. You can do a snapshot of a virtual machine’s logical volume while it is still running, and transfer that using the above method.
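
If you haven’t used them before, taking and copying from a snapshot is only a couple of commands. A sketch, using the same made-up volume names as above, with the snapshot sized to absorb whatever the guest writes while the copy runs:

# take a snapshot of the running VM's volume; 2G is a guess at how much
# will change during the copy (the snapshot fills up as the origin changes)
$ sudo lvcreate --snapshot --size 2G --name src_lv_snap /dev/myvg/src_lv

# copy from the snapshot while the VM carries on using /dev/myvg/src_lv
$ sudo dd bs=4M if=/dev/mapper/myvg-src_lv_snap | pv -s 10g |
  sudo ssh root@dest-host 'dd bs=4M of=/dev/mapper/myvg-dest_lv'

# and once the move is finished, throw the snapshot away
$ sudo lvremove /dev/myvg/src_lv_snap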

But what do you end up with? A destination host with an out of date copy of the data on it, and a source host that is still running a virtual machine that’s still updating its data. How to get just the differences from the source host to the destination?

Again there is a naive approach, which is to shut down the virtual machine and mount the logical volume on the host itself, do the same on the destination host, and use rsync to transfer the differences.

This will work, but again has major issues such as:

  1. It’s technically possible for a virtual machine admin to maliciously construct a filesystem that interferes with the host that mounts it. Mounting random filesystems is risky.
  2. Even if you’re willing to risk the above, you have to guess what the filesystem is going to be. Is it ext3? Will it have the same options that your host supports? Will your host even support whatever filesystem is on there?
  3. What if it isn’t a filesystem at all? It could well be a partitioned disk device, which you can still work with using kpartx, but it’s a major pain. Or it could even be a raw block device used by some tool you have no clue about.

The bottom line is, it’s a world of risk and hassle interfering with the data of virtual machines that you don’t admin.

Sadly rsync doesn’t support syncing a block device. There’s a --copy-devices patch that allows it to do so, but after applying it I found that while it can now read from a block device, it would still only write to a file.

Next I found a --write-devices patch by Darryl Dixon, which provides the other end of the functionality – it allows rsync to write to a block device instead of files in a filesystem. Unfortunately no matter what I tried, this would just send all the data every time, i.e., it was no more efficient than just using dd.

Read a bit, compare a bit ^

While searching about for a solution to this dilemma, I came across this horrendous and terrifying bodge of shell and Perl on serverfault.com:

ssh -i /root/.ssh/rsync_rsa $remote "
  perl -'MDigest::MD5 md5' -ne 'BEGIN{\$/=\1024};print md5(\$_)' $dev2 | lzop -c" |
  lzop -dc | perl -'MDigest::MD5 md5' -ne 'BEGIN{$/=\1024};$b=md5($_);
    read STDIN,$a,16;if ($a eq $b) {print "s"} else {print "c" . $_}' $dev1 | lzop -c |
ssh -i /root/.ssh/rsync_rsa $remote "lzop -dc |
  perl -ne 'BEGIN{\$/=\1} if (\$_ eq\"s\") {\$s++} else {if (\$s) {
    seek STDOUT,\$s*1024,1; \$s=0}; read ARGV,\$buf,1024; print \$buf}' 1<> $dev2"

Are you OK? Do you need to have a nice cup of tea and a sit down for a bit? Yeah. I did too.

I’ve rewritten this thing into a single Perl script so it’s a little bit more readable, but I’ll attempt to explain what the above abomination does.

Even though I do refer to this script in unkind terms like “abomination”, I will be the first to admit that I couldn’t have come up with it myself, and that I’m not going to show you my single Perl script version because it’s still nearly as bad. Sorry!

It connects to the destination host and starts a Perl script which begins reading the block device over there, 1024 bytes at a time, running that through md5 and piping the output to a Perl script running locally (on the source host).

The local Perl script is reading the source block device 1024 bytes at a time, doing md5 on that and comparing it to the md5 hashes it is reading from the destination side. If they’re the same then it prints “s” otherwise it prints “c” followed by the actual data from the source block device.

The output of the local Perl script is fed to a third Perl script running on the destination. It takes the sequence of “s” or “c” as instructions on whether to skip 1024 bytes (“s”) of the destination block device or whether to take 1024 bytes of data and write it to the destination block device (“c<1024 bytes of data>”).

The lzop bits are just doing compression and can be changed for gzip or omitted entirely.

Hopefully you can see that this is behaving like a very very dumb version of rsync.

The thing is, it works really well. If you’re not convinced, run md5sum (or sha1sum or whatever you like) on both the source and destination block devices to verify that they’re identical.
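
That check is just a matter of hashing each block device in full, with the virtual machine shut down so that neither side is changing underneath you (it reads every byte, so it takes a while):

# on the source host
$ sudo md5sum /dev/mapper/myvg-src_lv

# on the destination host
$ sudo md5sum /dev/mapper/myvg-dest_lv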

The process now becomes something like:

  1. Take an LVM snapshot of virtual machine block device while the virtual machine is still running.
  2. Create suitable logical volume on destination host.
  3. Use dd to copy the snapshot volume to the destination volume.
  4. Move over any other configuration while that’s taking place.
  5. When the initial copy is complete, shut down the virtual machine.
  6. Run the script of doom to sync over the differences from the real device to the destination.
  7. When that’s finished, start up the virtual machine on the destination host.
  8. Delete snapshot on source host.

1024 bytes seemed like rather a small buffer to be working with, so I upped it to 1MiB. In terms of the original one-liner that just means every literal 1024 (the $/ record size, the seek multiplier and the read length) becomes 1048576.

I find that on a typical 10GiB block device there might only be a few hundred MiB of changes between snapshot and virtual machine shut down. The entire device does have to be read through of course, but the down time and data transferred is dramatically reduced.

There must be a better way ^

Is there a better way to do this, still without shared storage?

It’s getting difficult to sell the disk capacity that comes with the number of spindles I need for performance, so maybe I could do something with DRBD so that there’s always another server with a copy of the data?

This seems like it should work, but I’ve no experience of DRBD. Presumably the active node would have to be using the /dev/drbdX devices as disks. Does DRBD scale to having say 100 of those on one host? It seems like a lot of added complexity.

I’d love to hear any other ideas.

Is there a better way to tell people to go away?

Being a seller of unmanaged hosting services, with a customer base that’s dominated by enthusiasts looking to host their personal projects, I often find myself in the position of being asked extremely basic systems administration questions.

I don’t like saying no, telling people to go away or implying that they need to work something out for themselves, but the fact is that if I spent my time answering such questions then I wouldn’t have time to get anything else done. That would be fine if the people asking the questions were paying me to train them, or do the work for them. That is always an option for them, it’s just that people who pay £8.99/month for hosting tend to object to being asked to pay £50/hour for consultancy.

When a sysadmin question arrives in my support queue and I consider it to be a general question (i.e. not anything related to the service itself) that will take more than a few seconds to answer, I usually say something like:

I’m sorry, this appears to be a general systems administration question and isn’t covered by the support included with this unmanaged service. You could ask about it on our users mailing list where other customers are usually happy to advise. If I find time I may also be able to advise there, where it will be publicly archived.

If their question is particularly involved but is something I know about then I may also offer some consultancy. Sometimes I am asked sysadmin questions which I have no clue about of course!

As an aside, something that surprises me is how small a proportion of these queries I later see asked on the mailing list. Is it because people are a lot more willing to ask me questions in private but don’t want to appear to lack knowledge in public? Is it because people find email clunky and unintuitive? (Let’s not have the forum debate again just yet though.)

It is a shame though, because aside from the fact that I really do reply there anyway when I find time, the other customers often go to great lengths to explain things in amazing detail to the less experienced.

After one iteration of the above response, most people take the hint. Some don’t though, and continue asking basic questions over and over. This is where I start to feel bad. I’m conscious of the fact that a lot of the time it would take me less time to answer their question than it would to type out the standard, “please ask on the mailing list” response. It also feels rude of me to keep saying the same thing to them.

My gut feeling is that it’s best not to cave in just because the questioner is persistent; if the original reasons for declaring their query outside the scope of support were valid then just because they keep asking does not change matters. Responding to them to repeat myself should have lower priority than any other support request. Otherwise they will learn that they just have to be persistent, and they have more time than I do.

As for there being better ways to phrase it, well, does anyone else have problems like this? How do you handle it?

(This all might seem straightforward and obvious to you, but dealing with people is something I find really hard. Yes, I have heard of Asperger syndrome. I don’t think it fits that well but that’s a conversation for another day.)

Adventures in entropy, part 2

Recap ^

Back in part 1 I discussed what entropy is as far as Linux is concerned, why I’ve started to look in to entropy as it relates to a Linux/Xen-based virtual hosting platform, how much entropy I have available, and how this might be improved.

If you didn’t read that part yet then you might want to do so, before carrying on with this part.

As before, click on any graph to see the full-size version.

Hosting server with an Entropy Key ^

Recently I colocated a new hosting server so it seemed like a good opportunity to try out the Entropy Key at the same time. Here’s what the available entropy looks like whilst ekeyd is running.

urquell.bitfolk.com available entropy with ekey, daily

First impressions: this is pretty impressive. It hovers very close to 4096 bytes at all times, with very little jitter.

Trying to deplete the entropy pool, while using an Entropy Key ^

As per Hugo’s comment in part 1, I tried watch -n 0.25 cat /proc/sys/kernel/random/entropy_avail to see if I could deplete the entropy pool, but it had virtually no effect. I tried with watch -n 0.1 cat /proc/sys/kernel/random/entropy_avail (so every tenth of a second) and the available entropy fluctuated mostly around 4000 bytes with a brief dip to ~3600 bytes:

urquell.bitfolk.com available entropy with ekey, trying to deplete the pool

In the above graph, the first watch invocation was at ~1100 UTC. The second one was at ~1135 UTC.

Disabling the Entropy Key ^

Unfortunately I forgot to get graphs of urquell before the ekeyd was started, so I have no baseline for this machine.

I assumed it would be the same as all the other host machines, but decided to shut down ekeyd to verify that. Here’s what happened.

urquell.bitfolk.com available entropy with ekeyd shut down, daily

The huge chasm of very little entropy in the middle of this graph is urquell running without an ekeyd. At first I was at a loss to explain why it should only have ~400 bytes of entropy by itself, when the other hosting servers manage somewhere between 3250 and 4096 bytes.

I now believe that it’s because urquell is newly installed and has no real load. Looking into how modern Linux kernels obtain entropy, it’s basically:

  • keyboard interrupts;
  • mouse interrupts;
  • other device driver interrupts with the flag IRQF_SAMPLE_RANDOM.

Bear in mind that headless servers usually don’t have a mouse or keyboard attached!

You can see which other drivers are candidates for filling up the entropy pool by looking where the IRQF_SAMPLE_RANDOM identifier occurs in the source of the kernel:

http://www.cs.fsu.edu/~baker/devices/lxr/http/ident?i=IRQF_SAMPLE_RANDOM
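
If you have a kernel tree unpacked locally, a grep gives much the same list (the version and path here are just an example):

$ grep -rl IRQF_SAMPLE_RANDOM /usr/src/linux-2.6.32/drivers/ | sort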

(as an aside, in 2.4.x kernels, most of the network interface card drivers had IRQF_SAMPLE_RANDOM and then they all got removed through the 2.6.x cycle since it was decided that IRQF_SAMPLE_RANDOM is really only for interrupts that can’t be observed or tampered with by an outside party. That’s why a lot of people reported problems with lack of entropy after upgrading their kernels.)

My hosting servers are typically Supermicro motherboards with Intel gigabit NICs and 3ware RAID controller. The most obvious device in the list that could be supplying entropy is probably block/xen-blkfront since there’s one of those for each block device exported to a Xen virtual machine on the system.

To test the hypothesis that the other servers are getting entropy from busy Xen block devices, I shut down ekeyd and then hammered on a VM filesystem:

urquell.bitfolk.com available entropy with ekeyd shut down, hammering a VM filesystem

The increase you see towards the end of the graph was while I was hammering the virtual machine’s filesystem. I was able to raise the available entropy to a stable ~2000 bytes doing this, so I’m satisfied that if urquell were as busy as the other servers then it would have similar available entropy to them, even without the Entropy Key.
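
“Hammering” wasn’t anything scientific, by the way. A loop along these lines inside the guest (hypothetical; any sustained bulk I/O would do) was enough to keep the host’s block devices busy:

# run inside the virtual machine: repeatedly write out and delete a 1GiB file,
# syncing so the I/O actually hits the (virtual) disk
$ while true; do
>   dd if=/dev/zero of=/var/tmp/junk bs=1M count=1024 conv=fsync
>   rm -f /var/tmp/junk
> done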

Feeding entropy to other hosts ^

ekeyd by default feeds entropy from the key directly into the Linux kernel of the host it’s on, but it can be configured to listen on a Unix or TCP socket and speak the egd protocol. I set it up this way and then put an instance of HAProxy into a VM with my ekeyd as a back end. So at this point I had a service IP talking the egd protocol, which client machines could use to request entropy.
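
The HAProxy end of that is only a few lines. A sketch, with a made-up service IP and port (my real configuration differs a bit), pointing at the host that has the key in it; the ekeyd TCP socket itself is configured separately in ekeyd’s own configuration:

# plain TCP proxying of egd protocol connections from the service IP in the
# VM through to the ekeyd listening on the real host
listen entropy 192.0.2.10:8888
        mode tcp
        balance roundrobin
        server ekey0 urquell.bitfolk.com:8888 check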

On the client side, ekeyd-egd-linux can be found in Debian lenny-backports and in Debian squeeze, as well as Ubuntu universe since Jaunty. This daemon can read from a Unix or TCP socket using the egd protocol and will feed the received entropy into the Linux kernel.

I took a look at which of my VMs had the lowest available entropy and installed ekeyd-egd-linux on them, pointing it at my entropy service IP:

admin.obstler.bitfolk.com available entropy after hooking up to entropy service

panel0.bitfolk.com available entropy after hooking up to entropy service

spamd0.lon.bitfolk.com available entropy after hooking up to entropy service

Success!

Where next? ^

  • Get some customers using it, explore the limits of how much entropy can be served.
  • Buy another Entropy Key so that it doesn’t all grind to a halt if one of them should die.
  • Investigate a way to get egd to read from another egd so I can serve the entropy directly from a VM and not have so many connections to my real hardware. Anyone interested in coding that?
  • Monitor the served entropy both for availability and for quality.

Adventures in entropy, part 1

A while back, a couple of BitFolk customers mentioned to me that they were having problems running out of entropy.

A brief explanation of entropy as it relates to computing ^

Where we say entropy, we could in layman’s terms say “randomness”. Computers need entropy for a lot of things, particularly cryptographic operations. You may not think that you do a lot of cryptography on your computer, and you personally probably don’t, but for example every time you visit a secure web site (https://…) your computer has to set up a cryptographic channel with the server. Cryptographic algorithms generally require a lot of random data and it has to be secure random data. For the purposes of this discussion, “secure” means that an attacker shouldn’t be able to guess or influence what the random data is.

Why would an attacker be able to guess or influence the random data if it is actually random? Because it’s not actually random. The computer has to get the data from somewhere. A lot of places it might be programmed to get it from may seem random but potentially aren’t. A silly implementation might just use the number of seconds the computer has been running as a basis for generating “random” numbers, but you can see that an attacker can guess this and may even be able to influence it, which could weaken any cryptographic algorithm that uses the “random” data.

Modern computers and operating systems generate entropy based on events like electrical noise, timings of data coming into the computer over the network, what’s going on with the disks, etc. fed into algorithms — what we call pseudo-random number generators (PRNGs). A lot of data goes in and a relatively small amount of entropy comes out, but it’s entropy you should be able to trust.

That works reasonably well for conventional computers and servers, but it doesn’t work so well for virtual servers. Virtual servers run in an emulated environment, with very little access to “real” hardware. The sources of randomness that conventional computers get from their hardware mostly aren’t there with emulated virtual hardware, so the prime source of entropy just isn’t present.

When you have an application that wants some entropy and the system has no more entropy to give, what usually happens is that the application blocks, doing nothing, until the system can supply some more entropy. Linux systems have two ways for applications to request entropy: there’s /dev/random and /dev/urandom. random is the high-quality one. When it runs out, it blocks until there is more available. urandom will supply high-quality entropy until it runs out, then it will generate more programmatically, so it doesn’t block, but it might not be as secure as random. I’m vastly simplifying how these interfaces work, but that’s the basic gist of it.
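
If you want to see the difference for yourself, something like this will do it (sizes picked arbitrarily). On an entropy-starved machine the second command can sit there for quite a while waiting for the pool to refill:

$ head -c 64 /dev/urandom | base64    # returns immediately
$ head -c 64 /dev/random | base64     # may block until enough entropy arrives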

What to do when there’s no more entropy? ^

If you’re running applications that want a lot of high-quality entropy, and your system keeps running out, there’s a few things you could do about it.

Nothing ^

So stuff slows down, who cares? It’s only applications that want high-quality entropy and they’re pretty specialised, right?

Well, no, not really. If you’re running a busy site with a lot of HTTPS connections then you probably don’t want it to be waiting around for more entropy when it could be serving your users. Another one that tends to use all the entropy is secure email – mail servers talking to each other using Transport Layer Security so the email is encrypted on the wire.

Use real hosting hardware ^

Most of BitFolk’s customers are using it for personal hosting, this problem is common to virtual hosting platforms (it’s not a BitFolk-specific issue), and BitFolk doesn’t provide dedicated/colo servers, so arguably I don’t need to consider this my problem to fix. If the customer could justify greater expense then they could move to a dedicated server or colo provider to host their stuff.

Tell the software to use urandom instead ^

In a lot of cases it’s possible to tell the applications to use urandom instead. Since urandom doesn’t block, but instead generates more lower-quality entropy on demand, there shouldn’t be a performance problem. There are obvious downsides to this:

  • If the application author wanted high-quality entropy, it might be unwise to not respect that.
  • Altering this may not be as simple as changing its configuration. You might find yourself having to recompile the software, which is a lot of extra work.

You could force this system-wide by replacing your /dev/random with /dev/urandom.
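
For the curious, that boils down to recreating /dev/random with urandom’s device numbers. A sketch only, and note that udev will typically put the real one back at the next boot:

# /dev/random is char device 1,8 and /dev/urandom is 1,9
$ sudo rm /dev/random
$ sudo mknod /dev/random c 1 9
$ sudo chmod 666 /dev/random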

Customers could get some more entropy from somewhere else ^

It’s possible to feed your own data into your system’s pseudo-random number generator, so if you have a good source of entropy you can help yourself. People have used some weird and wonderful things for entropy sources. Some examples:

  • A sound card listening to electro-magnetic interference (“static”).
  • A web camera watching a lava lamp.
  • A web camera in a dark box, so it just sees noise on its CCD.
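
Mechanically, feeding the data in is the easy part: anything written to /dev/random gets mixed into the pool, although on Linux it isn’t credited as entropy (daemons like rngd, or the ekeyd mentioned below, use the RNDADDENTROPY ioctl to both mix data in and credit it). The source device name here is made up:

$ sudo sh -c 'cat /dev/your-entropy-source > /dev/random'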

The problem for BitFolk customers of course is that all they have is a virtual server. They can’t attach web cams and sound cards to their servers! If they had real servers then they probably wouldn’t be having this issue at all.

BitFolk could get some entropy from somewhere else, and serve it to customers ^

BitFolk has the real servers, so I could do the above to get some extra entropy. I might not even need extra entropy; I could just serve the entropy that the real machines have. If it wasn’t for the existence of the Simtec Electronics Entropy Key then that’s probably what I’d be trying.

I haven’t got time to be playing about with sound cards listening to static, webcams in boxes and things like that, but buying a relatively cheap little gadget is well within the limit of things I’m prepared to risk wasting money on. 🙂

Customers would need to trust my entropy, of course. They already need to trust a lot of other things that I do though.

Entropy Key ^

Entropy Keys are very interesting little gadgets and I encourage you to read about how they work. It’s all a bit beyond me though, so for the purposes of this series of blog posts I’ll just take it as read that you plug an Entropy Key into a USB port, run ekeyd and it feeds high quality entropy into your PRNG.

I’d been watching the development of the Entropy Key with interest. When they were offered for cheap at the Debian-UK BBQ in 2009 I was sorely tempted, but I knew I wasn’t going to be able to attend, so I left it.

Then earlier this year, James at Jump happened to mention that he was doing a bulk order (I assume to fix this same issue for his own VPS customers) if anyone wanted in. Between the Debian BBQ and then I’d had a few more complaints about people running out of entropy so at ~£30 each I was thinking it was definitely worth exploring with one of them; perhaps buy more if it works.

How much entropy do I have anyway? ^

Before stuffing more entropy in to my systems, I was curious how much I had available anyway. On Linux you can check this by looking at /proc/sys/kernel/random/entropy_avail. I think this value is in bytes, and tops out at 4096. Not hard to plug this in to your graphing system.

Click on the following images to see the full-size versions.

Typical host server, no Entropy Key ^

Here’s what some typical BitFolk VM hosting servers have in terms of available entropy.

barbar.bitfolk.com available entropy, daily

That’s pretty good. The available entropy hovers close to 4096 bytes all the time. It’s what you’d expect from a typical piece of computer hardware. The weekly view shows the small jitter:

barbar.bitfolk.com available entropy, weekly

The lighter pink area is the highest 5-minute reading in each 30 minute sample. The dark line is the lowest 5-minute reading. You can see that there is a small amount of jitter where the available entropy fluctuates between about 3250 and 4096 bytes.

Here’s a couple of the other host servers just to see the pattern:

corona.bitfolk.com available entropy, daily

corona.bitfolk.com available entropy, weekly

faustino.bitfolk.com available entropy, daily

faustino.bitfolk.com available entropy, weekly

No surprises here; they’re all much the same. If these were the only machines I was using then I’d probably decide that I have enough entropy.

Typical general purpose Xen-based paravirtualised virtual machine ^

Here’s a typical general purpose BitFolk VPS. It’s doing some crypto stuff, but there’s a good mix of every type of workload here.

bitfolk.com available entropy, daily

bitfolk.com available entropy, weekly

These graphs are very different. There’s much more jitter and a general lack of entropy to begin with. Still, it never appears to reach zero (although it’s important to realise that these graphs are at best 5-minute averages, so the minimum and maximum values will be lower and higher within that 5-minute span) so there doesn’t seem to be a huge problem here.

Virtual machines with more crypto ^

Here’s a couple of VMs which are doing more SSL work.

cacti.bitfolk.com available entropy, daily

cacti.bitfolk.com available entropy, weekly

This one has a fair number of web visitors and they’re all HTTPS. You can see that it’s even more jittery, and spends most of its time with less than 1024 bytes of entropy available. It goes as low as ~140 bytes from time to time, and because of the 5-minute sampling it’s possible that it does run out.

panel0.bitfolk.com available entropy, daily

panel0.bitfolk.com available entropy, weekly

Again, this one has some HTTPS traffic and is faring worse for entropy, with an average of only ~470 bytes available. I ran a check every second for several hours and available entropy at times was as low as 133 bytes.

Summary so far ^

BitFolk doesn’t have any particularly busy crypto-heavy VMs so the above was the best I could do. I think that I’ve shown that virtual machines do have less entropy generally available, and that a moderate amount of crypto work can come close to draining it.

Based on the above results I probably wouldn’t personally take any action since it seems none of my own VMs run out of entropy, although I am unsure if the 133 bytes I measured was merely as low as the pool is allowed to go before blocking happens. In any case, I am not really noticing poor performance.

Customers have reported running out of entropy though, so it might still be something I can fix, for them.

Where next? ^

Next:

  • See what effect using an Entropy Key has on a machine’s available entropy.
  • Assuming it has a positive effect, see if I can serve this entropy to other machines, particularly virtual ones.
  • Can I serve it from a virtual machine, so I don’t have customers interacting with my real hosts?
  • Does one Entropy Key give enough entropy for everyone that wants it?
  • Can I add extra keys and serve their entropy in a highly-available fashion?

Those are the things I’ll be looking into and will blog some more about in later parts. This isn’t high priority though so it might take a while. In the meantime, if you’re a BitFolk customer who actually is experiencing entropy exhaustion in a repeatable fashion then it’d be great if you could get in touch with me so we can see if it can be fixed.

In part 2 of this series of posts I do get the key working and serve entropy to my virtual machines.

Linux software RAID hot swap disk replacement

One of BitFolk’s servers in the US has had first one and then two dead disks for quite some time. It has a four-disk software RAID-10, so by pure luck it was still running. Obviously as soon as a disk breaks you really should replace it, preferably with a hot spare. I was very lucky that the second disk failure wasn’t in the same half of the RAID-10 (which would have meant downtime and a restore from backup). There’s no customer data or customer-facing services on this machine though, so I let it slide for far too long.

Yesterday morning Graham was visiting the datacenter and kindly agreed to replace the disks for me. As it happens I don’t have that much experience of software RAID since the production machines I’ve worked on tend to have hardware RAID and the home ones tend not to be hot swap. It didn’t go entirely smoothly, but I think it was my fault.

The server chassis doesn’t support drive identification (e.g. by turning a light on) so I had to generate some disk activity so that Graham could see which drive lights weren’t blinking. It was easy enough for him to spot that slots 0 and 1 were still blinking away with slots 2 and 3 dead. I checked /proc/mdstat to ensure that those disks weren’t still present in any of the arrays. If they had been then I would have done:

$ sudo mdadm --fail /dev/mdX /dev/sdbX
$ sudo mdadm --remove /dev/mdX /dev/sdbX

to mark each one as failed and then remove it from the array.

They weren’t present, so I gave Graham the go-ahead to pull the hot swap drive trays out.

At first the server didn’t notice anything. I thought this was bad as I would like it to notice! This was confirmed to be bad when all disk IO blocked and the load went through the roof.

I think what I had forgotten to do was to remove the devices from the SCSI subsystem as described in this article. So for me, it would have been something like:

$ for disk in sd{a,b,c,d}; do echo -n "$disk: "; ls -d /sys/block/$disk/device/scsi_device*; done
sda: /sys/block/sda/device/scsi_device:0:0:0:0
sdb: /sys/block/sdb/device/scsi_device:0:0:1:0
sdc: /sys/block/sdc/device/scsi_device:1:0:0:0
sdd: /sys/block/sdd/device/scsi_device:1:0:1:0

From /proc/mdstat I knew it was sdb and sdd that were broken. I think I should have done:

$ sudo sh -c 'echo "scsi remove-single-device" 0 0 1 0 > /proc/scsi/scsi'
$ sudo sh -c 'echo "scsi remove-single-device" 1 0 1 0 > /proc/scsi/scsi'

Anyway, at the time what I had was a largely unresponsive server. I used Magic Sysrq to sync, mount filesystems read-only and then reboot. In Cernio’s console this would normally be “~b” to send a break, but Xen uses “ctrl-o”. So that was ctrl-o s to sync, ctrl-o u to remount read-only and then ctrl-o b to reboot the system.

Graham had by then taken the dead disks out of the caddies and replaced with new, re-inserted them and powered the server back on.

Happily it did come back up fine, I then had to set about adding the new disks to the arrays.

I’d already been forewarned that the new disks had 488397168 sectors whereas the existing ones had 490234752 — both described as 250GB of course! A difference of some 890MiB, despite them both being from the same manufacturer, from the same range even. I didn’t bother adding a swap partition on the two new disks which made them just about big enough for everything else.

$ sudo mdadm --add /dev/md1 /dev/sdb1
mdadm: Cannot open /dev/sdb1: Device or resource busy

Oh dear!

After lengthy googling, this article gave me a clue.

$ sudo multipath -l
SATA_WDC_WD2500SD-01WD-WCAL72844661dm-1 ATA,WDC WD2500SD-01K
[size=233G][features=0][hwhandler=0]
\_ round-robin 0 [prio=0][active]
 \_ 1:0:1:0 sdd 8:48  [active][undef]
SATA_WDC_WD2500SD-01WD-WCAL72802716dm-0 ATA,WDC WD2500SD-01K
[size=233G][features=0][hwhandler=0]
\_ round-robin 0 [prio=0][active]
 \_ 0:0:1:0 sdb 8:16  [active][undef]

There’s my disks!

Stopping the multipath daemon didn’t help. Running multipath -F (which flushes all unused multipath device maps) did.

The usual

$ sudo mdadm --add /dev/md1 /dev/sdb1
$ sudo mdadm --add /dev/md1 /dev/sdd1
$ sudo mdadm --add /dev/md3 /dev/sdb3
$ sudo mdadm --add /dev/md3 /dev/sdd3
$ sudo mdadm --add /dev/md5 /dev/sdb5
$ sudo mdadm --add /dev/md5 /dev/sdd5

worked fine after that.

I hope that was useful to someone. I’ll be practising it some more on some spare hardware here to see if the fiddling with /proc/scsi/scsi really does work.

Update:

Dominic (author of the linked article about dm-multipath) says:

I think there’s also a “remove” or “delete” file you can echo to in the /sys device directory, bit more friendly than talking to /proc/scsi/scsi.

and provides this snippet for multipath.conf which should disable multipath:

# Blacklist all devices by default. Remove this to enable multipathing
# on the default devices.
blacklist {
        devnode "*"
}
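
For completeness, I believe the /sys “delete” file he mentions is used something like this (sdb here is just an example):

# removes the SCSI device from the kernel's view before pulling the drive
$ echo 1 | sudo tee /sys/block/sdb/device/delete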

My root file system doesn’t show up in “df” anymore!

Earlier tonight I had a strange bug report from a customer. Ever since I’d moved his VPS from one host to another, he’d stopped being able to see how much disk space he had free.

At first I thought it was simply because when I had moved his VPS I had taken the opportunity to reconfigure it to the new way I was setting them up, which meant that his root file system would be mounted from /dev/xvda instead of /dev/xvda1 (or /dev/sda1). That would have accounted for it if his monitoring tool had been doing it by device name, but it turned out that it was more fundamental than that — neither mount nor df were showing his root file system at all!

This was highly confusing at first. /proc/mounts looked correct and anyway how does a machine boot if it doesn’t know where its root file system is?

The answer to that question was a bit of a clue really: the boot loader tells the kernel what device the root file system is on, and in this case it was doing it by UUID. The UUID in the boot loader configuration was not the same as the UUID listed in /etc/fstab. I had forgotten to update the customer’s /etc/fstab. 🙁

The machine was able to boot because the boot loader was correctly configured, but then after it had already mounted the root file system it was trying to mount everything in /etc/fstab and failing on a line for a UUID that wasn’t present. That line then never made it to /etc/mtab which is what mount and df are reading from.

After correcting the /etc/fstab, it is fixable without a reboot by just mounting / again over the top of the existing one. Or you could probably just edit /etc/mtab.
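
If you’d rather not hand-edit /etc/mtab, mount’s “fake” flag is a tidier way to do much the same thing: it goes through the motions, including writing the /etc/mtab entry, without making the actual mount system call:

# with /etc/fstab now correct, this records the root filesystem in /etc/mtab
# without re-mounting anything
$ sudo mount -f /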

More abject Abbey failure

Last year Abbey screwed up and caused me significant problems in my day to day BitFolk business. They were unable to convince me that they would improve in future, so I voted with my wallet and BitFolk sought another bank for business purposes. I had to keep the Abbey accounts open though, because some customers wouldn’t alter their standing orders etc. (even when repeatedly asked).

There were a few more incidents of poor service with Abbey, but I didn’t really care because the only thing BitFolk was using the account for was catching payments of people who were paying in unsupported ways.

A couple of weeks ago though, we moved business address. Changing our address everywhere was quite annoying, but would you like to guess which single organisation has still failed to accept a simple address change, one month after the move? The failing organisation is Abbey business banking.

Initially I went to the branch, figuring they’d need to do some form of security check (you’d hope). I was told that branches don’t deal with that and I had to write to them. So write I did, only to be sent back to the branch. Who then told me to write again and “make it clear they have to deal with it”. So write again I did.

I now have a letter from them saying that I really must go to the branch and produce all these documents (which is what I expected to do at first) and that until I do they’re going to send all correspondence to the old address. Which I now have no access to. I have no confidence at all that going to the branch again is going to get this sorted out. It’s an immense waste of my time, and now I have documents with sensitive information going to an address where some unknown people will receive them. Was I supposed to delay moving for months until one part of Abbey can work out how to communicate with the other part?

The easiest thing to do now seems to be to just close the accounts, and if anyone is still paying to the wrong place a year on then it’s going to have to be their problem. 🙁

I would strongly encourage anyone thinking of banking with this organisation to reconsider. You just can’t get anything done.