Tracking down the lvmcache fix

Background ^

In the previous article I covered how, in order to get decent performance out of lvmcache with a packaged Debian kernel, you’d have to use the 4.12.2-1~exp1 kernel from experimental. The kernels packaged in sid, testing (buster) and stable (stretch) aren’t new enough.

I decided to bisect the Linux kernel upstream git repository to find out exactly which commit(s) fixed things.

Results ^

Here’s a graph showing the IOPS over time for baseline SSD and lvmcache with a full cache under several different kernel versions. As in previous articles, the lines are actually Bezier curves fitted to the data points, which are 500ms averages and so scattered all over the place.

What we can see here is that performance starts to improve with commit 4d44ec5ab751 authored by Joe Thornber:

dm cache policy smq: put newly promoted entries at the top of the multiqueue

This stops entries bouncing in and out of the cache quickly.

This is part of a set of commits authored by Joe Thornber on the drivers/md/dm-cache-policy-smq.c file and committed on 2017-05-14. By the time we reach commit 6cf4cc8f8b3b we have the long-term good performance that we were looking for.

The first of Joe Thornber’s commits on that day in the dm-cache area was 072792dcdfc8 and stepping back to the commit immediately prior to that one (2ea659a9ef48) we get a kernel representing the moment that Linus designated the v4.12-rc1 tag. Joe’s commits went into -rc1, and without them the performance of lvmcache under these test conditions isn’t much better than baseline HDD.

It seems like some of Joe’s changes helped a lot and then the last one really provided the long term performance.

git bisect procedure ^

Normally when you do a git bisect you’re starting with something that works and you’re looking for the commit that introduced a bug. In this case I was starting from a known-good state and was interested in which commit(s) got it there, so the normal bisect keywords of “good” and “bad” would be backwards to what I wanted. Dominic gave me the tip that I could alias the terms in order to reduce my confusion:

$ git bisect start --term-old broken --term-new fixed

From here on, when I encountered a test run that produced poor results I would issue:

$ git bisect broken

and when I had a test run with good results I would issue:

$ git bisect fixed

As I knew that the tag v4.13-rc1 produced a good run and v4.11 was bad, I could start off with:

$ git bisect reset v4.13-rc1
$ git bisect fixed
$ git bisect broken v4.11

git would then keep bisecting the search space of commits until it narrowed down the one(s) that resulted in the high performance I was looking for.
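
In this case each step needed a kernel build, a reboot and a couple of 15-minute fio runs, so I drove it by hand, but for a test that can run unattended the whole loop can be handed to git bisect run. A sketch, assuming a hypothetical test-lvmcache.sh that exits 0 for a broken (low IOPS) run and 1 for a fixed (high IOPS) run: git bisect run maps exit status 0 to the “old” term (broken here), 1 to 127 except 125 to the “new” term (fixed), and 125 to “skip this commit”.

$ git bisect start --term-old broken --term-new fixed
$ git bisect fixed v4.13-rc1
$ git bisect broken v4.11
$ git bisect run ./test-lvmcache.sh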

Good and bad? ^

As before I’m using fio to conduct the testing, with the same job specification:

ioengine=libaio
direct=1
iodepth=8
numjobs=2
readwrite=randread
random_distribution=zipf:1.2
bs=4k
size=2g
unlink=1
runtime=15m
time_based=1
per_job_logs=1
log_avg_msec=500
write_iops_log=/var/tmp/fio-${FIOTEST}

The only difference from the other articles was that the run time was reduced to 15 minutes as all of the interesting behaviour happened within the first 11 minutes.

To recap, this fio job specification lays out two 2GiB files of random data and then starts two processes that perform 4kiB-sized reads against the files. Direct IO is used, in order to bypass the page cache.

A Zipfian distribution with a factor of 1.2 is used; this gives a 90/10 split where about 90% of the reads should come from about 10% of the data. The purpose of this is to simulate the hot spots of popular data that occur in real life. If the access pattern were to be perfectly and uniformly random then caching would not be effective.

In previous tests we had observed that dramatically different performance would be seen on the first run against an empty cache device compared to all other subsequent runs against what would be a full cache device. In the tests using kernels with the fix present the IOPS achieved would converge towards baseline SSD performance, whereas in kernels without the fix the performance would remain down near the level of baseline HDD. Therefore the fio tests were carried out twice.
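
For reference, each run was kicked off along these lines (a sketch; the job file name is illustrative, but FIOTEST is the variable that the write_iops_log line above interpolates, so it labels the IOPS log for each run):

$ FIOTEST=full-cache-run fio lvmcache.fio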

Where to next? ^

I think I am going to see what happens when the cache device is pretty small in comparison to the working data.

All of the tests so far have used a 4GiB cache with 4GiB of data, so if everything got promoted it would entirely fit in cache. Not only that but the Zipf distribution makes most of the hits come from 10% of the data, so it’s actually just ~400MiB of hot data. I think it would be interesting to see what happens when the hot 10% is bigger than the cache device.

git bisect progress and test output ^

Unless you are particularly interested in the fio output and why I considered each one to be either fixed or broken, you probably want to stop reading now.


lvmcache with a 4.12.3 kernel

Background ^

In the previous two articles I had discovered that lvmcache had amazing performance on an empty cache, but then on every run after that (i.e. when the cache device was full of junk) performed scarcely better than baseline HDD.

A few days ago I happened across an email on the linux-lvm list where Mike Snitzer advised:

the [CentOS] 7.4 dm-cache will be much more performant than the 7.3 cache you appear to be using.

…and…

It could be that your workload isn’t accessing the data enough to warrant promotion to the cache. dm-cache is a “hotspot” cache. If you aren’t accessing the data repeatedly then you won’t see much benefit (particularly with the 7.3 and earlier releases).

Just to get a feel, you could try the latest upstream 4.12 kernel to see how effective the 7.4 dm-cache will be for your setup.

I don’t know what kernel version CentOS 7.3 uses, but the VM I’m testing with is Debian testing (buster), so some version of 4.11.x plus backported patches.

That seemed pretty new, but Mike is suggesting 4.12.x so I thought I’d re-test lvmcache with the latest stable upstream kernel, which at the time of writing is version 4.12.3.

Test methodology ^

This time around I only focused on fio tests, using the same settings as before:

[partial]
ioengine=libaio
direct=1
iodepth=8
numjobs=2
readwrite=randread
random_distribution=zipf:1.2
bs=4k
size=2g
unlink=1
runtime=20m
time_based=1
per_job_logs=1
log_avg_msec=500
write_iops_log=/var/tmp/fio-${FIOTEST}

The only changes were:

  1. to reduce the run time to 20 minutes from 30 minutes, since all the interesting things happened within the first 20 minutes before.
  2. to write an IOPS log entry every 500ms instead of every 1000ms, as the log files were quite small and some higher resolution might help smooth the graphs out.

Last time there was a dramatic difference between the initial run with an empty cache and subsequent runs with a cache volume full of junk, so I did a test for each of those conditions, as well as tests for the baseline SSD and HDD.

The virtual machine had been upgraded from Debian 9 (stretch) to testing (buster), so it still had packaged kernel versions 4.9.30-2 and 4.11.6-1 lying around to test things with. In addition I compiled up version 4.12.3 by copying the .config from 4.11.6-1 and then doing make oldconfig, accepting all defaults.
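
That kind of build is roughly the following (a sketch; the config file name is a guess at the packaged kernel’s, and make olddefconfig is the non-interactive equivalent of running make oldconfig and accepting every default):

$ cd linux                                  # upstream stable git tree, v4.12.3 checked out
$ cp /boot/config-4.11.0-1-amd64 .config    # start from the packaged kernel's config
$ make olddefconfig
$ make -j"$(nproc)" deb-pkg                 # produces installable linux-image .deb packages
$ sudo dpkg -i ../linux-image-4.12.3*.deb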

Results ^

Although the fio job spec was essentially the same as in the previous article, I have since worked out how to merge the IOPS logs from both jobs, so the graphs will seem to show about double the IOPS that they did before.

All-in-one ^

Well, that’s an interesting set of graphs but rather hard to pick apart. Let’s break it down by kernel version.

Baseline SSD by kernel version ^

A couple of weird things here:

  1. 4.12.3 and 4.11.6-1 are actually fairly consistent, but 4.9.30-2 varies rather a lot.
  2. All kernels show a sharp dip a few minutes in. I don’t know what that is about.

Although these lines do look quite far apart, bear in mind that this graph’s y axis starts at 92k IOPS. The average IOPS didn’t vary that much:

Average IOPS by kernel version
  4.9.30-2    4.11.6-1     4.12.3
   102,325     102,742    104,352

So there was actually only a 1.9% difference between the worst performer and the best.

Baseline HDD by kernel version ^

4.9.30-2 and 4.12.3 are close enough here to probably be within the margin of error, but there is something weird going on with 4.11.6-1.

Its average IOPS across the 20 minute test were only 56% of those for 4.12.3 and 53% of those for 4.9.30-2, which is quite a big disparity. I re-ran these tests 5 times to check it wasn’t some anomaly, but no, it’s reproducible.

Maybe something to look into another day.

lvmcache by kernel version ^

Dragging things back to the point of this article: previously we discovered that lvmcache worked great the first time through, when its cache volume was completely empty, but then subsequent runs all absolutely sucked. They didn’t perform significantly better than HDD baseline.

Let’s graph all the lvmcache results for each kernel version against the SSD baseline for that kernel to see if things changed at all.

lvmcache 4.9.30-2 ^

This is similar to what we saw before: an empty cache volume produces decent results of around 47k IOPS, although it’s interesting that the second job is down around 1k IOPS. Again the results on a full cache are poor. In fact the results for the second job against the empty cache are about the same as the results for both jobs against a full cache.

lvmcache 4.11.6-1 ^

Same story again here, although the performance is a little higher. Again the first job on an empty cache is getting the big results of almost 60k IOPS while the second job—and both jobs on a full cache—get only around 1k IOPS.

lvmcache 4.12.3 ^

Wow. Something dramatic has been fixed. The performance on an empty cache is still better, but crucially the performance on a full cache pretty quickly becomes very close to baseline SSD.

Also the runs against both the empty and full cache device result in both jobs getting roughly the same IOPS performance rather than the first job being great and all others very poor.

What’s next? ^

It’s really encouraging that the performance is so much better with 4.12.3. It’s changed lvmcache from a “hmm, maybe” option to one that I would strongly consider using anywhere I could.

It’s a shame though that such a new kernel is required. The kernel version in Debian testing (buster) is currently 4.11.6-1. Debian experimental’s linux-image-4.12.0-trunk-amd64 package currently has version 4.12.2-1, so I should test whether that is new enough.

Failing that I think I should git bisect or similar in order to find out exactly which changeset fixed this, so I could have some chance of knowing when it hits a packaged version.


bcache and lvmcache

Background ^

Over at BitFolk we offer both SSD-backed storage and HDD-backed archive storage. The SSDs we use are really nice and basically have removed all IO performance problems we have ever encountered in the past. I can’t begin to describe how pleasant it is to just never have to think about that whole class of problems.

The main downside of course is that SSD capacity is still really expensive. That’s why we introduced the HDD-backed archive storage: for bulk storage of things that didn’t need to have high performance.

You’d really think though that by now there would be some kind of commodity tiered storage that would allow a relatively small amount of fast SSD storage to accelerate a much larger pool of HDDs. Of course there are various enterprise solutions, and there is also ZFS where SSDs could be used for the ZIL and L2ARC while HDDs are used for the pool.

ZFS is quite appealing but I’m not ready to switch everything to that yet, and I’m certainly not going to buy some enterprise storage solution. I also don’t necessarily want to commit to putting all storage behind such a system.

I decided to explore Linux’s own block device caching solutions.

Scope ^

I’ve restricted the scope to the two solutions which are part of the mainline Linux kernel as of July 2017, these being bcache and lvmcache.

lvmcache is based upon dm-cache which has been included with the mainline kernel since April 2013. It’s quite conservative, and having been around for quite a while is considered stable. It has the advantage that it can work with any LVM logical volume no matter what the contents. That brings the disadvantage that you do need to run LVM.
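
Purely as an illustration, converting an existing logical volume to a cached one goes roughly like this (a sketch with made-up names: vg0 is the volume group, slow is the HDD-backed LV and /dev/xvdc is the SSD-backed device providing the cache):

$ sudo pvcreate /dev/xvdc
$ sudo vgextend vg0 /dev/xvdc
$ sudo lvcreate -L 4G -n cache0 vg0 /dev/xvdc        # cache data LV on the fast device
$ sudo lvcreate -L 8M -n cache0_meta vg0 /dev/xvdc   # cache metadata LV
$ sudo lvconvert --type cache-pool --poolmetadata vg0/cache0_meta vg0/cache0
$ sudo lvconvert --type cache --cachepool vg0/cache0 vg0/slow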

bcache has been around for a little longer but is a much more ambitious project. Being completely dedicated to accelerating slow block devices with fast ones, it is claimed to be able to achieve higher performance than other caching solutions, but as it’s much more complicated than dm-cache there are still bugs being found. It also requires you to format your block devices as bcache before you can use them for anything.
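
Again purely as an illustration, the equivalent bcache setup might look like this (a sketch with the same made-up device names; make-bcache and bcache-super-show come from the bcache-tools package):

$ sudo make-bcache -C /dev/xvdc              # format the fast device as a cache device
$ sudo make-bcache -B /dev/xvdd              # format the slow device as a backing device
$ sudo bcache-super-show /dev/xvdc | grep cset.uuid
$ echo <cset-uuid> | sudo tee /sys/block/bcache0/bcache/attach
$ echo writeback | sudo tee /sys/block/bcache0/bcache/cache_mode
$ echo 0 | sudo tee /sys/block/bcache0/bcache/sequential_cutoff   # cache sequential IO as well
$ sudo mkfs.ext4 /dev/bcache0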

Test environment ^

I’m testing this on a Debian testing (buster) Xen virtual machine with a 20GiB xvda virtual disk containing the main operating system. That disk is backed by a software (md) RAID-10 composed of two Samsung sm863 SSDs. It was also used for testing the baseline SSD performance from the directory /srv/ssd.

The virtual machine had 1GiB of memory but the pagecache was cleared between each test run in an attempt to prevent anything being cached in memory.
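
Clearing the page cache is the usual drop_caches trick, something like:

$ sync
$ echo 3 | sudo tee /proc/sys/vm/drop_caches   # drop page cache, dentries and inodes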

A 5GiB xvdc virtual disk was provided, backed again on the SSD RAID. This was used for the cache role both in bcache and lvmcache.

A 50GiB xvdd virtual disk was provided, backed by a pair of Seagate ST4000LM016-1N2170 HDDs in software RAID-1. This was used for the HDD backing store in each of the caching implementations. The resulting cache device was mounted at /srv/cache.

Finally a 50GiB xvde virtual disk also backed on HDD was used to test baseline HDD performance, mounted at /srv/slow.

The filesystem in use in all cases was ext4 with default options. In dom0, deadline scheduler was used in all cases.
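
For reference, the scheduler for a device can be checked and set through sysfs in dom0, along these lines (with sdX standing in for whichever physical device is involved):

$ cat /sys/block/sdX/queue/scheduler           # the active scheduler is shown in [brackets]
$ echo deadline | sudo tee /sys/block/sdX/queue/scheduler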

TL;DR, I just want graphs ^

In case you can’t be bothered to read the rest of this article, here’s just the graphs with some attempt at interpreting them. Down at the tests section you’ll find details of the actual testing process and more commentary on why certain graphs were produced.

git test graphs ^

Times to git clone and git grep.

fio IOPS graphs ^

These are graphs of IOPS across the 30 minutes of testing. There are two important things to note about these graphs:

  1. They’re a Bezier curve fitted to the data points which are one per second. The actual data points are all over the place, because achieved IOPS depends on how many cache hits/misses there were, which is statistical.
  2. Only the IOPS for the first job is graphed. Even when using the per_job_logs=0 setting my copy of fio writes a set of results for each job. I couldn’t work out how to easily combine these so I’ve shown only the results for the first job.

    For all tests except bcache (sequential_cutoff=0) you just have to bear in mind that there is a second job working in parallel doing pretty much the same amount of IOPS. Strangely for that second bcache test the second job only managed a fraction of the IOPS (though still more than 10k IOPS) and I don’t know why.

IOPS over time for all tests

Well, those results are so extreme that it kind of makes it hard to distinguish between the low-end results.

A couple of observations:

  • SSD is incredibly and consistently fast.
  • For everything else there is a short steep section at the beginning which is likely to be the effect of HDD drive cache.
  • With sequential_cutoff set to 0, bcache very quickly reaches near-SSD performance for this workload (4k reads, 90% hitting 10% of data that fits entirely in the bcache). This is probably because the initial write put data straight into cache as it’s set to writeback.
  • When starting with a completely empty cache, lvmcache is no slouch either. It’s not quite as performant as bcache but that is still up near the 48k IOPS per process region, and very predictable.
  • When sequential_cutoff is left at its default of 4M, bcache performs much worse, though still blazing compared to an HDD on its own. At the end of this 30 minute test performance was still increasing, so it might be worth performing a longer test.
  • The performance of lvmcache when starting with a cache already full of junk data seems to be not that much better than HDD baseline.

IOPS over time for low-end results

Leaving the high-performers out to see if there is anything interesting going on near the bottom of the previous graph.

Apart from the initial spike, HDD results are flat as expected.

Although the lvmcache (full cache) results in the previous graph seemed flat too, looking closer we can see that performance is still increasing, just very slowly. It may be interesting to test for longer to see if performance does continue to increase.

Both HDD and lvmcache have a very similar spike at the start of the test so let’s look closer at that.

IOPS for first 30 seconds

For all the lower-end performers the first 19 seconds are steeper and I can only think this is the effect of HDD drive cache. Once that is filled, HDD remains basically flat, lvmcache (full cache) increases performance more slowly and bcache with the default sequential_cutoff starts to take off.

SSDs don’t have the same sort of cache and bcache with no sequential_cutoff spikes up too quickly to really be noticeable at this scale.

3-hour lvmcache test

Since it seemed like lvmcache with a full cache device was still slowly increasing in performance, I did a 3-hour test on that one.

Skipping the first 20 minutes which show stronger growth, even after 3 hours there is still some performance increase happening. It seems like even a full cache would eventually promote read hot spots, but it could take a very very long time.


Supermicro SATA DOM flash devices don’t report lifetime writes correctly

I’m playing around with a pair of Supermicro SATA DOM flash devices at the moment, evaluating them for use as the operating system storage for servers (as opposed to where customer data goes).

They’re flash devices with a limited write endurance. The smallest model (16GB), for example, is good for 17TB of writes. Therefore it’s important to know how much you’ve actually written to it.

Many SSDs and other flash devices expose the total amount written through the SMART attribute 241, Total_LBAs_Written. The SATA DOM devices do seem to expose this attribute, but right now they say this:

$ for dom in $(sudo lsblk --paths -d -o NAME,MODEL --noheadings |
    awk '/SATA SSD/ { print $1 }')
do
    echo -n "$dom: "
    sudo smartctl -A "$dom" |
      awk '/^241/ { print $10 * 512 * 1.0e-9, "GB" }'
done
/dev/sda: 0.00856934 GB
/dev/sdb: 0.00881715 GB

This being after install and (as of now) more than a week of uptime, ~9MB of lifetime writes isn’t credible.

Another place we can look for amount of bytes written is /proc/diskstats. The 10th column is the number of (512-byte) sectors written, so:

$ for dom in $(sudo lsblk -d -o NAME,MODEL --noheadings |
    awk '/SATA SSD/ { print $1 }')
do
     awk "/$dom / {
        print \$3, \$10 / 2 * 1.0e-6, \"GB\"
    }" /proc/diskstats
done
sda 3.93009 GB
sdb 3.93009 GB

Almost 4GB is a lot more believable, so can we just use /proc/diskstats? Well, the problem there is that those figures are only since boot. That won’t include, for example, all the data written during install.

Okay, so, are these figures even consistent? Let’s write 100MB and see what changes.

Since the figure provided by SMART attribute 241 apparently isn’t actually 512-byte blocks we’ll just print the raw value there.

Before:

$ for dom in $(sudo lsblk -d -o NAME,MODEL --noheadings |
    awk '/SATA SSD/ { print $1 }')
do
     awk "/$dom / {
        print \$3, \$10 / 2 * 1.0e-6, \"GB\"
    }" /proc/diskstats
done
sda 4.03076 GB
sdb 4.03076 GB
$ for dom in $(sudo lsblk --paths -d -o NAME,MODEL --noheadings |
  awk '/SATA SSD/ { print $1 }')
do
    echo -n "$dom: "
    sudo smartctl -A "$dom" |
      awk '/^241/ { print $10 }'
done
/dev/sda: 16835
/dev/sdb: 17318

Write 100MB:

$ dd if=/dev/urandom bs=1MB count=100 > /var/tmp/one_hundred_megabytes
100+0 records in
100+0 records out
100000000 bytes (100 MB) copied, 7.40454 s, 13.5 MB/s

(I used /dev/urandom just in case some compression might take place or something)

After:

$ for dom in $(sudo lsblk -d -o NAME,MODEL --noheadings |
    awk '/SATA SSD/ { print $1 }')
do
     awk "/$dom / {
        print \$3, \$10 / 2 * 1.0e-6, \"GB\"
    }" /proc/diskstats
done
sda 4.13046 GB
sdb 4.13046 GB
$ for dom in $(sudo lsblk --paths -d -o NAME,MODEL --noheadings |
  awk '/SATA SSD/ { print $1 }')
do
    echo -n "$dom: "
    sudo smartctl -A "$dom" |
      awk '/^241/ { print $10 }'
done
/dev/sda: 16932
/dev/sdb: 17416

Well, alright, all is apparently not lost: SMART attribute 241 went up by ~100 and diskstats agrees that ~100MB was written too, so it looks like it does actually report lifetime writes, but it’s reporting them as megabytes (10^9 bytes), not 512-byte sectors.

Note: A comment below says this is actually mebibytes (2^20 bytes).

Every reference I can find says that Total_LBAs_Written is the number of 512-byte sectors, though, so in reporting units of 1MB I feel that these devices are doing the wrong thing.

Anyway, I’m a little alarmed that ~0.1% of the lifetime has gone already, although a lot of that would have been the install. I probably should take this opportunity to get rid of a lot of writes by tracking down logging of mundane garbage. Also this is the smallest model; the devices are rated for 1 DWPD so just over-provisioning by using a larger model than necessary will help.

Using a TOTP app for multi-factor SSH auth

I’ve been playing around with enabling multi-factor authentication (MFA) on web services and went with TOTP. It’s pretty simple to implement in Perl, and there are plenty of apps for it including Google Authenticator, 1Password and others.

I also wanted to use the same multi-factor auth for SSH logins. Happily, from Debian jessie onwards libpam-google-authenticator is packaged. To enable it for SSH you would just add the following:

auth required pam_google_authenticator.so

to /etc/pam.d/sshd (put it just after @include common-auth).

and ensure that:

ChallengeResponseAuthentication yes

is in /etc/ssh/sshd_config.

Not all my users will have MFA enabled though, so to skip prompting for these I use:

auth required pam_google_authenticator.so nullok

Finally, I only wanted users in a particular Unix group to be prompted for an MFA token so (assuming that group was totp) that would be:

auth [success=1 default=ignore] pam_succeed_if.so quiet user notingroup totp
auth required pam_google_authenticator.so nullok

If the pam_succeed_if conditions are met then the next line is skipped, so that causes pam_google_authenticator to be skipped for users not in the group totp.

Each user will require a TOTP secret key generating and storing. If you’re only setting this up for SSH then you can use the google-authenticator binary from the libpam-google-authenticator package. This asks you some simple questions and then populates the file $HOME/.google_authenticator with the key and some configuration options. That looks like:

T6Z2KSDCG7CEWPD6EPA6BICBFD4KYKCSGO2JEQVII7ZJNCXECRZPJ4GJHD3CWC43FZIKQUSV5LR2LFFP
" RATE_LIMIT 3 30 1462548404
" DISALLOW_REUSE 48751610
" TOTP_AUTH
11494760
25488108
33980423
43620625
84061586

The first line is the secret key; the five numbers are emergency codes that will always work (once each) if locked out.

If generating keys elsewhere then you can just populate this file yourself. If the file isn’t present then that’s when “nullok” applies; without “nullok” authentication would fail.
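
If you want to sanity-check a secret against what your app shows, the oathtool utility from the oath-toolkit package will print the current code for a given base32 key. For example, using the example key above:

$ oathtool --totp -b T6Z2KSDCG7CEWPD6EPA6BICBFD4KYKCSGO2JEQVII7ZJNCXECRZPJ4GJHD3CWC43FZIKQUSV5LR2LFFP

The six digits it prints should match what the app displays for that key at the same moment.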

Note that despite the repeated mentions of “google” here, this is not a Google-specific service and no data is sent to Google. Google are the authors of the open source Google Authenticator mobile app and the libpam-google-authenticator PAM module, but (as evidenced by the Perl example) this is an open standard and client and server sides can be implemented in any language.

So that is how you can make a web service and an SSH service use the same TOTP multi-factor authentication.

Disabling the default IPMI credentials on a Supermicro server

In an earlier post I mentioned that you should disable the default ADMIN / ADMIN credentials on the IPMI controller. Here’s how.

Install ipmitool ^

ipmitool is the utility that you will use from the command line of another machine in order to interact with the IPMI controllers on your servers.

# apt-get install ipmitool

List the current users ^

$ ipmitool -I lanplus -H 192.168.1.22 -U ADMIN -a user list
Password:
ID  Name             Callin  Link Auth  IPMI Msg   Channel Priv Limit
2   ADMIN            false   false      true       ADMINISTRATOR

Here you are specifying the IP address of the server’s IPMI controller. ADMIN is the IPMI user name you will use to log in, and it’s prompting you for the password which is also ADMIN by default.

Add a new user ^

You should add a new user with a name other than ADMIN.

I suppose it would be safe to just change the password of the existing ADMIN user, but there is no need to have it named that, so you may as well pick a new name.

$ ipmitool -I lanplus -H 192.168.1.22 -U ADMIN -a user set name 3 somename
Password:
$ ipmitool -I lanplus -H 192.168.1.22 -U ADMIN -a user set password 3
Password:
Password for user 3:
Password for user 3:
$ ipmitool -I lanplus -H 192.168.1.22 -U ADMIN -a channel setaccess 1 3 link=on ipmi=on callin=on privilege=4
Password:
$ ipmitool -I lanplus -H 192.168.1.22 -U ADMIN -a user enable 3
Password:

From this point on you can switch to using the new user instead.

$ ipmitool -I lanplus -H 192.168.1.22 -U somename -a user list
Password:
ID  Name             Callin  Link Auth  IPMI Msg   Channel Priv Limit
2   ADMIN            false   false      true       ADMINISTRATOR
3   somename         true    true       true       ADMINISTRATOR

Disable ADMIN user ^

Before doing this bit you may wish to check that the new user you added works for everything you need it to. Those things might include:

  • ssh to somename@192.168.1.22
  • Log in on web interface at https://192.168.1.22/
  • Various ipmitool commands like querying power status:
    $ ipmitool -I lanplus -H 192.168.1.22 -U somename -a power status
    Password:
    Chassis power is on

If all of that is okay then you can disable ADMIN:

$ ipmitool -I lanplus -H 192.168.1.22 -U somename -a user disable 2
Password:

If you are paranoid (or this is just the first time you’ve done this) you could now check to see that none of the above things now work when you try to use ADMIN / ADMIN.

Specifying the password ^

I have not done so in these examples but if you get bored of typing the password every time then you could put it in the IPMI_PASSWORD environment variable and use -E instead of -a on the ipmitool command line.

When setting the IPMI_PASSWORD environment variable you probably don’t want it logged in your shell’s history file. Depending on which shell you use there may be different ways to achieve that.

With bash, if you have ignorespace in the HISTCONTROL environment variable then commands prefixed by one or more spaces won’t be logged. Alternatively you could temporarily disable history logging with:

$ set +o history
$ sensitive command goes here
$ set -o history # re-enable history logging

So anyway…

$ echo $HISTCONTROL
ignoredups:ignorespace
$     export IPMI_PASSWORD=letmein
$ # ^ note the leading spaces here
$ # to prevent the shell logging it
$ ipmitool -I lanplus -H 192.168.1.22 -U somename -E power status
Chassis Power is on

Installing Debian by PXE using Supermicro IPMI Serial over LAN

Here’s how to install Debian jessie on a Supermicro server using PXE boot and the IPMI serial-over-LAN.

Using these instructions you will be able to complete an install of a remote machine, although you will initially need access to the BIOS to configure the IPMI part.

BIOS settings ^

This bit needs you to be in the same location as the machine, or else have someone who is make the required changes.

Press DEL to go into the BIOS configuration.

Under Advanced > PCIe/PCI/PnP Configuration make sure that the network interface through which you’ll reach your PXE server has the “PXE” option ROM set:

BIOS PXE option ROM

Under Advanced > Serial Port Console Redirection you’ll want to enable SOL Console Redirection.

BIOS serial console redirection

(Pictured here is also COM1 Console Redirection. This is for the physical serial port on the machine, not the one in the IPMI.)

Under SOL Console Redirection Settings you may as well set the Bits per second to 115200.

BIOS SOL redirection settings

Now it’s time to configure the IPMI so you can interact with it over the network. Under IPMI > BMC Network Configuration, put the IPMI on your management network:

IPMI network configuration

Connecting to the IPMI serial ^

With the above BIOS settings in place you should be able to save and reboot and then connect to the IPMI serial console. The default credentials are ADMIN / ADMIN, which you should of course change with ipmitool, but that is for a different post.

There are two ways to connect to the serial-over-LAN: you can ssh to the IPMI controller, or you can use ipmitool. Personally I prefer ssh, but the ipmitool way is like this:

$ ipmitool -I lanplus -H 192.168.1.22 -U ADMIN -a sol activate

The ssh way:

$ ssh ADMIN@192.168.1.22
The authenticity of host '192.168.1.22 (192.168.1.22)' can't be established.
RSA key fingerprint is b7:e1:12:94:37:81:fc:f7:db:6f:1c:00:e4:e0:e1:c4.
Are you sure you want to continue connecting (yes/no)?
Warning: Permanently added '192.168.1.22,192.168.1.22' (RSA) to the list of known hosts.
ADMIN@192.168.1.22's password:
 
ATEN SMASH-CLP System Management Shell, version 1.05
Copyright (c) 2008-2009 by ATEN International CO., Ltd.
All Rights Reserved 
 
 
-> cd /system1/sol1
/system1/sol1
 
-> start
/system1/sol1
press <Enter>, <Esc>, and then <T> to terminate session
(press the keys in sequence, one after the other)

They both end up displaying basically the same thing.

The serial console should just be displaying the boot process, which won’t go anywhere.

DHCP and TFTP server ^

You will need to configure a DHCP and TFTP server on an already-existing machine on the same LAN as your new server. They can both run on the same host.

The DHCP server responds to the initial requests for IP address configuration and passes along where to get the boot environment from. The TFTP server serves up that boot environment. The boot environment here consists of a kernel, initramfs and some configuration for passing arguments to the bootloader/kernel. The boot environment is provided by the Debian project.

DHCP ^

I’m using isc-dhcp-server. Its configuration file is at /etc/dhcp/dhcpd.conf.

You’ll need to know the MAC address of the server, which can be obtained either from the front page of the IPMI controller’s web interface (i.e. https://192.168.1.22/ in this case) or else it is displayed on the serial console when it attempts to do a PXE boot. So, add a section for that:

subnet 192.168.2.0 netmask 255.255.255.0 {
}
 
host foo {
    hardware ethernet 0C:C4:7A:7C:28:40;
    fixed-address 192.168.2.22;
    filename "pxelinux.0";
    next-server 192.168.2.251;
    option subnet-mask 255.255.255.0;
    option routers 192.168.2.1;
}

Here we set the network configuration of the new server with fixed-address, option subnet-mask and option routers. The IP address in next-server refers to the IP address of the TFTP server, and pxelinux.0 is what the new server will download from it.

Make sure that is running:

# service isc-dhcp-server start

DHCP uses UDP port 67, so make sure that is allowed through your firewall.

TFTP ^

A number of different TFTP servers are available. I use tftpd-hpa, which is mostly configured by variables in /etc/default/tftpd-hpa:

TFTP_OPTIONS="--secure"
TFTP_USERNAME="tftp"
TFTP_DIRECTORY="/srv/tftp"
TFTP_ADDRESS="0.0.0.0:69"

TFTP_DIRECTORY is where you’ll put the files for the PXE environment.

Make sure that the TFTP server is running:

# service tftpd-hpa start

TFTP uses UDP port 69, so make sure that is allowed through your firewall.
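
If the machine uses iptables with a default-deny INPUT policy, that amounts to rules along these lines (port 67 for DHCP, port 69 for TFTP; in practice you would probably restrict them to the relevant interface or subnet):

# iptables -A INPUT -p udp --dport 67 -j ACCEPT
# iptables -A INPUT -p udp --dport 69 -j ACCEPT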

Download the netboot files from your local Debian mirror:

$ cd /srv/tftp
$ curl -s http://ftp.YOUR-MIRROR.debian.org/debian/dists/jessie/main/installer-amd64/current/images/netboot/netboot.tar.gz | sudo tar zxvf -
./                
./version.info    
./ldlinux.c32     
./pxelinux.0      
./pxelinux.cfg    
./debian-installer/
./debian-installer/amd64/
./debian-installer/amd64/bootnetx64.efi
./debian-installer/amd64/grub/
./debian-installer/amd64/grub/grub.cfg
./debian-installer/amd64/grub/font.pf2
…

(This assumes you are installing a device with architecture amd64.)

At this point your TFTP server root should contain a debian-installer subdirectory and a couple of links into it:

$ ls -l .
total 8
drwxrwxr-x 3 root root 4096 Jun  4  2015 debian-installer
lrwxrwxrwx 1 root root   47 Jun  4  2015 ldlinux.c32 -> debian-installer/amd64/boot-screens/ldlinux.c32
lrwxrwxrwx 1 root root   33 Jun  4  2015 pxelinux.0 -> debian-installer/amd64/pxelinux.0
lrwxrwxrwx 1 root root   35 Jun  4  2015 pxelinux.cfg -> debian-installer/amd64/pxelinux.cfg
-rw-rw-r-- 1 root root   61 Jun  4  2015 version.info

You could now boot your server and it would call out to PXE to do its netboot, but would be displaying the installer process on the VGA output. If you intend to carry it out using the Remote Console facility of the IPMI interface then that may be good enough. If you want to do it over the serial-over-LAN though, you’ll need to edit some of the files that came out of the netboot.tar.gz to configure that.

Here’s a list of the files you need to edit. All you are doing in each one is telling it to use the serial console. The changes are quite mechanical so you can easily come up with a script to do it (there’s a sketch of one after the file-by-file changes below), but here I will show the changes verbosely. All the files live in the debian-installer/amd64/boot-screens/ directory.

ttyS1 is used here because this system has a real serial port on ttyS0. 115200 is the baud rate of ttyS1 as configured in the BIOS earlier.

adtxt.cfg

From:

label expert
        menu label ^Expert install
        kernel debian-installer/amd64/linux
        append priority=low vga=788 initrd=debian-installer/amd64/initrd.gz --- 
include debian-installer/amd64/boot-screens/rqtxt.cfg
label auto
        menu label ^Automated install
        kernel debian-installer/amd64/linux
        append auto=true priority=critical vga=788 initrd=debian-installer/amd64/initrd.gz --- quiet

To:

label expert
        menu label ^Expert install
        kernel debian-installer/amd64/linux
        append priority=low console=ttyS1,115200n8 initrd=debian-installer/amd64/initrd.gz --- 
include debian-installer/amd64/boot-screens/rqtxt.cfg
label auto
        menu label ^Automated install
        kernel debian-installer/amd64/linux
        append auto=true priority=critical console=ttyS1,115200n8 initrd=debian-installer/amd64/initrd.gz --- quiet
rqtxt.cfg

From:

label rescue
        menu label ^Rescue mode
        kernel debian-installer/amd64/linux
        append vga=788 initrd=debian-installer/amd64/initrd.gz rescue/enable=true --- quiet

To:

label rescue
        menu label ^Rescue mode
        kernel debian-installer/amd64/linux
        append console=ttyS1,115200n8 initrd=debian-installer/amd64/initrd.gz rescue/enable=true --- quiet
syslinux.cfg

From:

# D-I config version 2.0
# search path for the c32 support libraries (libcom32, libutil etc.)
path debian-installer/amd64/boot-screens/
include debian-installer/amd64/boot-screens/menu.cfg
default debian-installer/amd64/boot-screens/vesamenu.c32
prompt 0
timeout 0

To:

serial 1 115200
console 1
# D-I config version 2.0
# search path for the c32 support libraries (libcom32, libutil etc.)
path debian-installer/amd64/boot-screens/
include debian-installer/amd64/boot-screens/menu.cfg
default debian-installer/amd64/boot-screens/vesamenu.c32
prompt 0
timeout 0
txt.cfg

From:

default install
label install
        menu label ^Install
        menu default
        kernel debian-installer/amd64/linux
        append vga=788 initrd=debian-installer/amd64/initrd.gz --- quiet

To:

default install
label install
        menu label ^Install
        menu default
        kernel debian-installer/amd64/linux
        append console=ttyS1,115200n8 initrd=debian-installer/amd64/initrd.gz --- quiet
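
As mentioned above, the changes are mechanical enough to script. Roughly something like this would do it (a sketch using GNU sed; do check the resulting files by hand afterwards):

$ cd /srv/tftp/debian-installer/amd64/boot-screens
$ # swap the VGA console argument for the serial console one in each boot entry
$ sudo sed -i 's/vga=788/console=ttyS1,115200n8/g' adtxt.cfg rqtxt.cfg txt.cfg
$ # and tell syslinux itself to drive the serial port
$ sudo sed -i -e '1i serial 1 115200' -e '1i console 1' syslinux.cfg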

Perform the install ^

Connect to the serial-over-LAN and get started. If the server doesn’t have anything currently installed then it should go straight to trying PXE boot. If it does have something on the storage that it would boot then you will have to use F12 at the BIOS screen to convince it to jump straight to PXE boot.

$ ssh ADMIN@192.168.1.22
ADMIN@192.168.1.22's password:
 
ATEN SMASH-CLP System Management Shell, version 1.05
Copyright (c) 2008-2009 by ATEN International CO., Ltd.
All Rights Reserved 
 
 
-> cd /system1/sol1
/system1/sol1
 
-> start
/system1/sol1
press <Enter>, <Esc>, and then <T> to terminate session
(press the keys in sequence, one after the other)
 
Intel(R) Boot Agent GE v1.5.13                                                  
Copyright (C) 1997-2013, Intel Corporation                                      
 
CLIENT MAC ADDR: 0C C4 7A 7C 28 40  GUID: 00000000 0000 0000 0000 0CC47A7C2840  
CLIENT IP: 192.168.2.22  MASK: 255.255.255.0  DHCP IP: 192.168.2.252             
GATEWAY IP: 192.168.2.1
 
PXELINUX 6.03 PXE 20150107 Copyright (C) 1994-2014 H. Peter Anvin et al    
 
 
 
 
 
 
                 ┌───────────────────────────────────────┐
                 │ Debian GNU/Linux installer boot menu  │
                 ├───────────────────────────────────────┤
                 │ Install                               │
                 │ Advanced options                    > │
                 │ Help                                  │
                 │ Install with speech synthesis         │
                 │                                       │
                 │                                       │
                 │                                       │
                 │                                       │
                 │                                       │
                 │                                       │
                 └───────────────────────────────────────┘
 
 
 
              Press ENTER to boot or TAB to edit a menu entry
 
 
  ┌───────────────────────┤ [!!] Select a language ├────────────────────────┐
  │                                                                         │
  │ Choose the language to be used for the installation process. The        │
  │ selected language will also be the default language for the installed   │
  │ system.                                                                 │
  │                                                                         │
  │ Language:                                                               │
  │                                                                         │
  │                               C                                         │
  │                               English                                   │
  │                                                                         │
  │     <Go Back>                                                           │
  │                                                                         │
  └─────────────────────────────────────────────────────────────────────────┘
 
 
 
 
<Tab> moves; <Space> selects; <Enter> activates buttons

…and now the installation proceeds as normal.

At the end of this you should be left with a system that uses ttyS1 for its console. You may need to tweak that depending on whether you want the VGA console also.

Supermicro IPMI remote console on Ubuntu 14.04 through SSH tunnel

I normally don’t like using the web interface of Supermicro IPMI because it’s extremely clunky, unintuitive and uses Java in some places.

The other day however I needed to look at the console of a machine which had been left running Memtest86+. You can make Memtest86+ output to serial which is generally preferred for looking at it remotely, but this wasn’t run in that mode so was outputting nothing to the IPMI serial-over-LAN. I would have to use the Java remote console viewer.

As an added wrinkle, the IPMI network interfaces are on a network that I can’t access except through an SSH jump host.

So, I just gave it a go without doing anything special other than launching an SSH tunnel:

$ ssh me@jumphost -L127.0.0.1:1443:192.168.1.21:443 -N

This tunnels my localhost port 1443 to port 443 of 192.168.1.21 as reachable from the jump host. Local port 1443 is used because binding low ports requires root privileges.

This allowed me to log in to the web interface of the IPMI at https://localhost:1443/, though it kept putting up a dialog which said I needed a newer JDK. Going to “Remote Control / Console Redirection” attempted to download a JNLP file and then said it failed to download.

This was with openjdk-7-jre and icedtea-7-plugin installed.

I decided maybe it would work better if I installed the Oracle Java 8 stuff (ugh). That was made easy by following these simple instructions. That’s an Ubuntu PPA which does everything for you, after you agree that you are a bad person who should feel bad (i.e. accept the license).

This time things got a little further, but still failed saying it couldn’t download a JAR file. I noticed that it was trying to download the JAR from https://127.0.0.1:443/ even though my tunnel was on port 1443.

I eventually did get the remote console viewer to work but I’m not 100% convinced it was because I switched to Oracle Java.

So, basic networking issue here. Maybe it really needs port 443?

Okay, ran SSH as root so it could bind port 443. Got a bit further but now says “connection failed” with no diagnostics as to exactly what connection had failed. Still, gut instinct was that this was the remote console app having started but not having access to some port it needed.

Okay, ran SSH as a SOCKS proxy instead, set the SOCKS proxy in my browser. Same problem.

Did a search to see what ports the Supermicro remote console needs. Tried a new SSH command:

$ sudo ssh me@jumphost \
-L127.0.0.1:443:192.168.1.21:443 \
-L127.0.0.1:5900:192.168.1.21:5900 \
-L127.0.0.1:5901:192.168.1.21:5901 \
-L127.0.0.1:5120:192.168.1.21:5120 \
-L127.0.0.1:5123:192.168.1.21:5123 -N

Apart from a few popup dialogs complaining about “MalformedURLException: unknown protocol: socket” (wtf?), this now appears to work.

Supermicro IPMI remote console

SSDs and Linux Native Command Queuing

Native Command Queuing ^

Native Command Queuing (NCQ) is an extension of the Serial ATA protocol that allows multiple requests to be sent to a drive, allowing the drive to order them in a way it considers optimal.

This is very handy for rotational media like conventional hard drives, because they have to move the head all over to do random IO, so in theory if they are allowed to optimise ordering then they may be able to do a better job of it. If the drive supports NCQ then it will advertise this fact to the operating system and Linux by default will enable it.

Queue depth ^

The maximum depth of the queue in SATA is 31 for practical purposes, and so if the drive supports NCQ then Linux will usually set the depth to 31. You can change the depth by writing a number between 1 and 31 to /sys/block/<device>/device/queue_depth. Writing 1 to the file effectively disables NCQ for that device.
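
For example, to check the current depth of a drive and then reduce it (using sda here):

$ cat /sys/block/sda/device/queue_depth                 # typically 31 when NCQ is active
$ echo 8 | sudo tee /sys/block/sda/device/queue_depth   # writing 1 effectively disables NCQ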

NCQ and SSDs ^

So what about SSDs? They aren’t rotational media; any access is in theory the same as any other access, so no need to optimally order the commands, right?

The sad fact is, many SSDs even today have incompatibilities with SATA drivers and chipsets such that NCQ does not reliably work. There’s advice all over the place that NCQ can be disabled with no ill effect, because supposedly SSDs do not benefit from it. Some posts even go as far as to suggest that NCQ might be detrimental to performance with SSDs.

Well, let’s see what fio has to say about that.

The setup ^

  • Two Intel DC s3610 1.6TB SSDs in an MD RAID-10 on Debian 8.1.
  • noop IO scheduler.
  • fio operating on a 4GiB test file that is on an ext4 filesystem backed by LVM.
  • fio set to do a 70/30% mix of read vs write operations with 128 simultaneous IO operations in flight.

The goal of this is to simulate a busy highly parallel server load, such as you might see with a database.

The fio command line looks like this:

fio --randrepeat=1 \
    --ioengine=libaio \
    --direct=1 \
    --gtod_reduce=1 \
    --name=ncq \
    --filename=test \
    --bs=4k \
    --iodepth=128 \
    --size=4G \
    --readwrite=randrw \
    --rwmixread=70

Expected output will be something like this:

ncq: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=128
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [m(1)] [100.0% done] [50805KB/21546KB/0KB /s] [12.8K/5386/0 iops] [eta 00m:00s]
ncq1: (groupid=0, jobs=1): err= 0: pid=11272: Sun Aug  9 06:29:33 2015
  read : io=2867.6MB, bw=44949KB/s, iops=11237, runt= 65327msec
  write: io=1228.5MB, bw=19256KB/s, iops=4813, runt= 65327msec
  cpu          : usr=4.39%, sys=25.20%, ctx=732814, majf=0, minf=6
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued    : total=r=734099/w=314477/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=128
 
Run status group 0 (all jobs):
   READ: io=2867.6MB, aggrb=44949KB/s, minb=44949KB/s, maxb=44949KB/s, mint=65327msec, maxt=65327msec
  WRITE: io=1228.5MB, aggrb=19255KB/s, minb=19255KB/s, maxb=19255KB/s, mint=65327msec, maxt=65327msec
 
Disk stats (read/write):
    dm-0: ios=732755/313937, merge=0/0, ticks=4865644/3457248, in_queue=8323636, util=99.97%, aggrios=734101/314673, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
    md4: ios=734101/314673, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=364562/313849, aggrmerge=2519/1670, aggrticks=2422422/2049132, aggrin_queue=4471730, aggrutil=94.37%
  sda: ios=364664/313901, merge=2526/1618, ticks=2627716/2223944, in_queue=4852092, util=94.37%
  sdb: ios=364461/313797, merge=2513/1722, ticks=2217128/1874320, in_queue=4091368, util=91.68%

The figures we’re interested in are the iops= ones, in this case 11237 and 4813 for read and write respectively.
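
As each run produces a wall of output like the above, it is handy to pull out just those figures; a trivial sketch, assuming the output has been saved to a file:

$ grep -Eo 'iops=[0-9]+' fio-output.txt
iops=11237
iops=4813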

Results ^

Here’s how different NCQ queue depths affected things. Click the graph image for the full size version.

Graph of the effect of NCQ queue depth on read/write IOPS

Conclusion ^

On this setup anything below a queue depth of about 8 is disastrous to performance. The aberration at a queue depth of 19 is interesting. This is actually repeatable. I have no explanation for it.

Don’t believe anyone who tells you that NCQ is unimportant for SSDs unless you’ve benchmarked that and proven it to yourself. Disabling NCQ on an Intel DC s3610 appears to reduce its performance to around 25% of what it would be with even a queue depth of 8. Modern SSDs, especially enterprise ones, have a parallel architecture that allows them to get multiple things done at once. They expect NCQ to be enabled.

It’s easy to guess why 8 might be the magic number for the DC s3610:

The top of the PCB has eight NAND emplacements and Intel’s proprietary eight-channel PC29AS21CB0 controller.

The newer NVMe devices are even more aggressive with this; while the SATA spec stops at one queue with a depth of 32, NVMe specifies up to 65k queues with a depth of up to 65k each! Modern SSDs are designed with this in mind.