Measuring Linux IO read/write mix and size

In this article I will show you how to use blktrace (part of the blktrace package on Debian) to measure the characteristics of production IO load.

Why measure IO size and mix?

Storage vendors often report the specifications of their devices based on how many 4KiB requests they can read or write, sequentially or randomly. The sequential speed is often quoted in megabytes or gigabytes per second; the random performance as the number of IO operations per second (IOPS). But is this representative of the real world?

IO request size is not the only common assumption. Benchmarks often choose either pure reads or pure writes, and when a mix is suggested a common split is something like 25% writes to 75% reads.

These tests do have value when the parameters are kept the same, in order to compare setups on a level field. The danger however is that none of the tests will be representative of production workload.

Measuring with blktrace

It’s rare that we can just put production load on a test setup. So what can we do? One thing we can do is to use blktrace.

Here is an example of me doing this on some of BitFolk’s busiest VM hosts. These have MD RAID-10 arrays on pairs of SSDs. I will trace IO on three of these arrays for one hour.

$ sudo blktrace /dev/md4 -w 3600 -o elephant_md4

This will store the (binary format) output into files in the current directory, one per CPU on the host. These can be quite large (in my case about 70MiB per hour) and it is important that you store them somewhere other than the device which you are tracing, otherwise the writes to these will interfere with your trace.

I did the same with elephant_md5 and on another host to produce hobgoblin_md3.

Interpreting the output

The binary files that blktrace has produced are just logs of IO requests. There is no analysis there. There are several different tools that can consume blktrace output, but the one that I will use here is called blkparse and is part of the same package.

When blkparse is run simply it will produce a log of IO to standard output and then a summary at the end, like this:

$ blkparse hobgoblin_md3
  9,3    1        1     0.000000000  7548  A  WS 2409885696 + 32 <- (253,63) 20006912
  9,3    1        2     0.000000774  7548  Q  WS 2409885696 + 32 [blkback.8.xvda]
  9,3    1        3     0.000669843  7548  A  WS 2398625368 + 16 <- (253,63) 8746584
  9,3    1        4     0.000670267  7548  Q  WS 2398625368 + 16 [blkback.8.xvda]
  9,3    1        5     0.000985592  7548  A FWS 0 + 0 <- (253,63) 0
CPU0 (hobgoblin_md3):
 Reads Queued:     578,712,    6,307MiB  Writes Queued:     661,480,    9,098MiB
 Read Dispatches:        0,        0KiB  Write Dispatches:        0,        0KiB
 Reads Requeued:         0               Writes Requeued:         0
 Reads Completed:        0,        0KiB  Writes Completed:        0,        0KiB
 Read Merges:            0,        0KiB  Write Merges:            0,        0KiB
 Read depth:             0               Write depth:             0
 IO unplugs:             0               Timer unplugs:           0
CPU1 (hobgoblin_md3):
 Reads Queued:     663,713,    7,499MiB  Writes Queued:       1,063K,   14,216MiB
 Read Dispatches:        0,        0KiB  Write Dispatches:        0,        0KiB
 Reads Requeued:         0               Writes Requeued:         0
 Reads Completed:        0,        0KiB  Writes Completed:        0,        0KiB
 Read Merges:            0,        0KiB  Write Merges:            0,        0KiB
 Read depth:             0               Write depth:             0
 IO unplugs:             0               Timer unplugs:           0
Total (hobgoblin_md3):
 Reads Queued:       1,242K,   13,806MiB         Writes Queued:       1,724K,   23,315MiB
 Read Dispatches:        0,        0KiB  Write Dispatches:        0,        0KiB
 Reads Requeued:         0               Writes Requeued:         0
 Reads Completed:        0,        0KiB  Writes Completed:        0,        0KiB
 Read Merges:            0,        0KiB  Write Merges:            0,        0KiB
 IO unplugs:             0               Timer unplugs:           0

Read/write mix

Most of the items in the summary at the end are zero because the MD array is a logical device that doesn’t actually see most of the IO – it gets remapped to a lower layer. In order to see full details you would need to run blktrace against the constituent components of each MD array. It does give enough for the purpose of showing the read/write mix though:

$ for f in elephant_md4 elephant_md5 hobgoblin_md3; do echo $f; blkparse $f | grep Queued: | tail -1; done
elephant_md4
 Reads Queued:     526,458,   18,704MiB  Writes Queued:     116,753,  969,524KiB
elephant_md5
 Reads Queued:     424,788,    5,170MiB  Writes Queued:     483,372,    6,895MiB
hobgoblin_md3
 Reads Queued:       1,242K,   13,806MiB         Writes Queued:       1,724K,   23,315MiB

Simple maths tells us that the read/write mix is therefore approximately:

Array           # of reads   # of writes   Total IOs   Read %
elephant_md4       526,458       116,753     643,211      82%
elephant_md5       424,788       483,372     908,160      47%
hobgoblin_md3    1,242,000     1,724,000   2,966,000      42%
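
If you would rather not do that maths by hand, here is a minimal Python sketch that pulls the two counts out of a "Queued" summary line and works out the read percentage. The function names are my own invention, and it assumes (as the table above does) that blkparse's "K" suffix means thousands.

```python
import re


def parse_count(text):
    """Convert a blkparse count such as '526,458' or '1,242K' to an int."""
    text = text.strip().replace(",", "")
    if text.endswith("K"):
        # Assumption: 'K' is an abbreviation for thousands.
        return int(text[:-1]) * 1000
    return int(text)


def read_percentage(summary_line):
    """Return (reads, writes, read %) from a 'Reads Queued: ...' line."""
    match = re.search(
        r"Reads Queued:\s*([\d,]+K?),.*Writes Queued:\s*([\d,]+K?),",
        summary_line,
    )
    if match is None:
        raise ValueError("not a 'Queued' summary line")
    reads = parse_count(match.group(1))
    writes = parse_count(match.group(2))
    return reads, writes, 100.0 * reads / (reads + writes)


line = (" Reads Queued:     526,458,   18,704MiB"
        "  Writes Queued:     116,753,  969,524KiB")
reads, writes, pct = read_percentage(line)
print(f"{reads} reads, {writes} writes, {pct:.0f}% reads")
```

Feeding it the last "Queued" line for each trace gives the numbers in the table.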

There is quite a disparity here for one of the arrays on host elephant at 82% reads. I looked into this and it was in fact one customer doing something incredibly read-intensive. Checking a few other arrays on different hosts I found that 20-50% reads is fairly normal. The conclusions I would draw here are:

  1. Checking something around a 40/60 split of read/write is most important for my workload.
  2. Something like the traditional 75/25 split of read/write is still worth examining.

On another note we can take some comfort that elephant is apparently nowhere near its IO limits as the single array on hobgoblin is doing more than both arrays on elephant combined. The extra array was added to elephant purely because it ran out of storage capacity, not because it was hitting IO limits.

IO size

What about the size of individual IOs?

For the impatient, here are some histograms. See the rest of this section for how they were made.

An excursion into blkparse output

If we filter the blkparse output to only show IO of the type we’re interested in then we get something like this (restricting to the first 10ms just for brevity):

$ blkparse hobgoblin_md3 -a write -w 0.01
Input file hobgoblin_md3.blktrace.0 added
Input file hobgoblin_md3.blktrace.1 added
  9,3    1        1     0.000000000  7548  A  WS 2409885696 + 32 <- (253,63) 20006912
  9,3    1        2     0.000000774  7548  Q  WS 2409885696 + 32 [blkback.8.xvda]
  9,3    1        3     0.000669843  7548  A  WS 2398625368 + 16 <- (253,63) 8746584
  9,3    1        4     0.000670267  7548  Q  WS 2398625368 + 16 [blkback.8.xvda]
  9,3    1        5     0.000985592  7548  A FWS 0 + 0 <- (253,63) 0
  9,3    1        6     0.000986169  7548  Q FWS [blkback.8.xvda]
  9,3    1        7     0.001238245  7548  A  WS 2398625384 + 8 <- (253,63) 8746600
  9,3    1        8     0.001238808  7548  Q  WS 2398625384 + 8 [blkback.8.xvda]
  9,3    1        9     0.001502839  7548  A FWS 0 + 0 <- (253,63) 0
  9,3    1       10     0.001503419  7548  Q FWS [blkback.8.xvda]
  9,3    1       11     0.002004499  7548  A  WS 2409903328 + 32 <- (253,63) 20024544
  9,3    1       12     0.002005074  7548  Q  WS 2409903328 + 32 [blkback.8.xvda]
  9,3    1       13     0.002334404  7548  A FWS 0 + 0 <- (253,63) 0
  9,3    1       14     0.002334913  7548  Q FWS [blkback.8.xvda]
CPU1 (hobgoblin_md3):
 Reads Queued:           0,        0KiB  Writes Queued:           7,       44KiB
 Read Dispatches:        0,        0KiB  Write Dispatches:        0,        0KiB
 Reads Requeued:         0               Writes Requeued:         0
 Reads Completed:        0,        0KiB  Writes Completed:        0,        0KiB
 Read Merges:            0,        0KiB  Write Merges:            0,        0KiB
 Read depth:             0               Write depth:             0
 IO unplugs:             0               Timer unplugs:           0
Throughput (R/W): 0KiB/s / 0KiB/s
Events (hobgoblin_md3): 14 entries
Skips: 0 forward (0 -   0.0%)

You can look up in the blkparse man page what all this means, but briefly for the first two lines:

9,3
The major and minor number of the block device. Typically major 9 means MD and minor 3 is the array's number, hence md3.
1
The CPU number that the IO happened on.
1 (then 2 on the second line)
The sequence number of the IO.
0.000000000
Time delta (seconds and nanoseconds) into the trace.
7548
The process ID that the IO was on behalf of.

In this case process 7548 on hobgoblin is a Xen disk backend called [blkback.8.xvda]. That means it is the backend for disk xvda on Xen domain ID 8.

A
IO action: remap. An IO came in from or is going out to a stacked device. The details for this action will show what is going where. Or;
Q
IO action: queue. Declaring the intention to queue an IO, but it has not actually happened yet.

And for the details of the actions for the first two lines:

A WS 2409885696 + 32 <- (253,63) 20006912
WS
The "RWBS" data. In this case indicating a Synchronous Write.
2409885696 + 32
The IO starts at sector 2409885696 and is 32 blocks in size. For this device the block size is 512 bytes, so this represents a 16KiB IO.
<- (253,63) 20006912
This IO came from a higher level device with major,minor (253,63) starting at sector 20006912 on that device.

Major number 253 here is device-mapper, which is what LVM logical volumes are built on. The command dmsetup ls will list all device-mapper devices (logical volumes included) with their major and minor numbers, so that enables me to tell which LV this IO came from. That information in this case is of course already known since I know from the other details which Xen disk backend process this is associated with.

Q WS 2409885696 + 32 [blkback.8.xvda]
[blkback.8.xvda]
The only different data remaining. It's the name of the process related to this IO.
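
To make the field layout concrete, here is a rough Python sketch that splits one queue ("Q") line into the fields described above and converts the size into KiB. The function and key names are mine, not anything from blkparse, and it only understands lines shaped like the ones in this trace.

```python
SECTOR_BYTES = 512  # the sizes in these traces are counts of 512-byte sectors


def parse_queue_line(line):
    """Parse a blkparse 'Q' line; return None for anything else."""
    fields = line.split()
    # A sized queue line looks like:
    #   9,3 1 2 0.000000774 7548 Q WS 2409885696 + 32 [blkback.8.xvda]
    if len(fields) < 10 or fields[5] != "Q" or fields[8] != "+":
        return None
    major, minor = fields[0].split(",")
    sectors = int(fields[9])
    return {
        "device": (int(major), int(minor)),
        "cpu": int(fields[1]),
        "sequence": int(fields[2]),
        "time": float(fields[3]),
        "pid": int(fields[4]),
        "rwbs": fields[6],
        "start_sector": int(fields[7]),
        "sectors": sectors,
        "size_kib": sectors * SECTOR_BYTES / 1024,
        "process": fields[10].strip("[]") if len(fields) > 10 else "",
    }


sample = ("  9,3    1        2     0.000000774  7548  Q  WS "
          "2409885696 + 32 [blkback.8.xvda]")
parsed = parse_queue_line(sample)
print(parsed["rwbs"], parsed["size_kib"], parsed["process"])
```

Lines with no sector range, such as the "Q FWS [blkback.8.xvda]" ones, come back as None and can simply be skipped.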

That covers all output shown except for some IOs which have RWBS data starting with "F". This does not appear to be documented in the blkparse man page but a look at the blkparse source indicates that an F before the W marks a flush request (an F after it would mark a Force Unit Access). We can surmise that this is a result of the process issuing a synchronous write IO; the flush is used to ensure that the data has hit stable storage.

If a blktrace had been done on a lower-level device — for example, one of the SSDs that actually makes up these arrays — then there would be a lot of other types of IO and actions present in the blkparse output. What's present is enough though to tell us about the types and sizes of IO that are being queued.

Generating an IO size histogram

Here's a quick and dirty Perl script to parse the blkparse output further and generate a textual histogram of IO sizes.

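#!/usr/bin/env perl
use strict;
use warnings;
use Math::Round;
my $val;
my %size;
my $total = 0;
# Each input line is one IO size, in 512-byte sectors.
while(<>) {
    chomp;
    $val = $_;
    $total++;
    # Convert sectors to KiB and count how many IOs had that size.
    $size{$val * 512 / 1024}++;
}
foreach my $s (sort { $a <=> $b } keys %size) {
    printf "%3d:\t%6u (%2u%%)\n", $s, $size{$s},
        round($size{$s} / $total * 100);
}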

Which would be run like this:

$ blkparse hobgoblin_md3 -a read | awk '/ Q / { print $10 }' | ./
  0:        13 ( 0%)
  4:    874810 (70%)
  8:    151457 (12%)
 12:     15811 ( 1%)
 16:     58101 ( 5%)
 20:      6487 ( 1%)
 24:      9638 ( 1%)
 28:      4619 ( 0%)
 32:     56070 ( 5%)
 36:      3031 ( 0%)
 40:      4036 ( 0%)
 44:      2681 ( 0%)
 48:      3722 ( 0%)
 52:      2311 ( 0%)
 56:      2492 ( 0%)
 60:      2241 ( 0%)
 64:      4999 ( 0%)
 68:      1972 ( 0%)
 72:      2052 ( 0%)
 76:      1331 ( 0%)
 80:      1588 ( 0%)
 84:      1151 ( 0%)
 88:      1389 ( 0%)
 92:      1022 ( 0%)
 96:      1430 ( 0%)
100:      1057 ( 0%)
104:      1457 ( 0%)
108:       991 ( 0%)
112:      1375 ( 0%)
116:       913 ( 0%)
120:      1221 ( 0%)
124:       731 ( 0%)
128:     20226 ( 2%)

Although if you like pretty pictures (who doesn't?) I recommend ditching the Perl script and just dumping the data from blkparse into a CSV file like this:

$ (echo "IO_Size"; blkparse hobgoblin_md3 -a read | awk '/ Q / { if ($10 != 0) { print $10/2 } else print 0 }') > hobgoblin_md3_read_sizes.csv

And then doing a bit of Python/Pandas:

#!/usr/bin/env python
# coding: utf-8
# All this is just boilerplate that I do in every Pandas thing.
# It was exported from a notebook, so it needs IPython to run.
get_ipython().run_line_magic('matplotlib', 'notebook')
get_ipython().run_line_magic('precision', '3')
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator, AutoMinorLocator
plt.rcParams['figure.figsize'] = [9, 5]
import pandas
# End boilerplate.
# Read in this single column of numbers with a header line at
# the top. So not really a CSV, but reading it as a CSV works.
df = pandas.read_csv("hobgoblin_md3_read_sizes.csv")
total_ios = len(df.index)
min_size = 0
max_size = df["IO_Size"].max()
# mode() returns a Series because there can be ties; take the first.
mode = df["IO_Size"].mode().iloc[0]
mean = df["IO_Size"].mean()
# Generate a histogram.
# The 'bins' thing uses bins of size 4 offset by 2 either way, so
# the first bin will be -2 to 2, the next one 2 to 6, then 6 to
# 10 and so on. The purpose of that is so that the bar centres on
# the interesting value. For example, 4KiB IO will fall into the
# "2 to 6" bin, the middle of which is 4.
# np.arange is necessary because it is possible to get IO in an
# odd number of blocks, e.g. 1 or 3 blocks, which when converted
# to KiB will be non-integer.
# np.arange leaves out its stop value, so go to max_size + 6 to be
# sure the last bin edge is beyond the largest IO.
# subplots=True makes plot() return an array of axes, which is why
# everything below uses ax_list[0].
ax_list = df.plot(kind='hist',
                  subplots=True,
                  title='hobgoblin md3 read IO sizes',
                  bins=np.arange(min_size - 2, max_size + 6, 4))
ax_list[0].set_xlim(0, max_size)
ax_list[0].set_xlabel('IO size (KiB)')
ax_list[0].tick_params(axis='x', rotation=-45)
# Major ticks every 4KiB, minor every 2 KiB in between.
ax_list[0].xaxis.set_major_locator(MultipleLocator(4))
ax_list[0].xaxis.set_minor_locator(AutoMinorLocator(2))
# Convert y tick labels from counts to percentages of all IOs.
y_vals = ax_list[0].get_yticks()
ax_list[0].set_yticks(y_vals)
ax_list[0].set_yticklabels(['{:,.1%}'.format(v/total_ios) for v in y_vals])
ax_list[0].grid(True, linestyle=':', color='#666666')
ax_list[0].legend(["Read IOs"])
# x axis label doesn't seem to fit on unless the bottom is
# extended a little. (The 0.15 is an arbitrary choice.)
plt.subplots_adjust(bottom=0.15)
# Add an info box with mean/mode values in. The position is in axes
# coordinates and is also arbitrary.
props = dict(boxstyle='round', facecolor='wheat', alpha=0.5)
ax_list[0].text(0.2, 0.95,
                "Mode: %3uKiB\nMean: %6.2fKiB" % (mode, mean),
                transform=ax_list[0].transAxes,
                verticalalignment='top',
                bbox=props)

Results in the histogram images you saw above.

Conclusions

For me, there is quite some variance between arrays as to how they are being used. While 4KiB IO is the most common, the average size is being pulled up to around 14KiB or in some cases near 36KiB. It is probably worth examining 4, 8, 16 and 32KiB IO.

While a 75/25 split of read/write random IO might be worth looking at, for my workload 40/60 is actually more relevant.

Comparing Versions in Ansible Templates

In the last few days, Debian archived their jessie release and removed the jessie-updates suite from the distribution mirrors. Those hosts which still reference jessie-updates and do an apt update will see something like:

W: Failed to fetch  404  Not Found [IP: 2001:ba8:1f1:f079::2 80]

This is because the suite and all of its files were removed from the mirrors. The files have now been archived and will be picked up by just using the jessie suite, so there is no longer any need to reference jessie-updates.

In order to not see that every time you should remove the jessie-updates line from /etc/apt/sources.list.

My /etc/apt/sources.list is built by Ansible from a template, and the relevant part of the template looked a bit like this:

{% if ansible_distribution_version >= 8.0 %}
deb {{ aptcacher_prefix }}{{ debian_mirror }} {{ ansible_distribution_release }}-updates   main contrib non-free
{% endif %}

ansible_distribution_version and ansible_distribution_release are host variables, and for Debian jessie currently evaluate as the strings “8.11” and “jessie” respectively.

As there is now only an -updates for Debian stable (version 9.x, “stretch”) the “if” statement should be testing against “9.0”, right? So I changed it to:

{% if ansible_distribution_version >= 9.0 %}
deb {{ aptcacher_prefix }}{{ debian_mirror }} {{ ansible_distribution_release }}-updates   main contrib non-free
{% endif %}

Well, that made no difference. jessie-updates was still being included.

The reason why this didn’t work is that ansible_distribution_version is the string “8.11”, and the template compares it against the number 9.0. Under Python 2 at least, any string counts as bigger than any number, so “8.11” is actually bigger! This is a very common mistake. In order to fix it the values could be cast, but a better idea is to use the version test (previously known as version_comparison):

{% if ansible_distribution_version is version('9.0', '>=') %}
deb {{ aptcacher_prefix }}{{ debian_mirror }} {{ ansible_distribution_release }}-updates   main contrib non-free
{% endif %}
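
To illustrate why a proper version comparison matters, here is a small Python sketch contrasting a plain string comparison with a component-by-component one. This is not how Ansible implements its version test; the helper names are made up and it only copes with purely numeric dotted versions.

```python
def version_tuple(version):
    """Turn '8.11' into (8, 11) so that components compare as numbers."""
    return tuple(int(part) for part in version.split("."))


def version_at_least(version, minimum):
    """True if version >= minimum, comparing component by component."""
    return version_tuple(version) >= version_tuple(minimum)


# Plain string comparison goes character by character, so "10.0"
# looks smaller than "9.0" because "1" sorts before "9".
print("10.0" >= "9.0")                   # False
print(version_at_least("10.0", "9.0"))   # True

# It also gets minor versions wrong: "8.11" sorts before "8.9".
print("8.11" >= "8.9")                   # False
print(version_at_least("8.11", "8.9"))   # True

# And the case from the template above:
print(version_at_least("8.11", "9.0"))   # False
```

So even casting both sides to strings would only have worked by accident, up until the first two-digit major version.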

Looking through all of my playbooks it seems that I’d figured this out long ago for the playbooks themselves — every test of ansible_distribution_version in YAML files was using version() — but some of my templates were still directly trying to use “>” or “>=”.

Of course, since jessie has now been archived it is only receiving security support from the Debian LTS effort and hosts running jessie should be upgraded as soon as possible.

Forcing the source address of an SNMP client (e.g. snmpwalk)

I was using snmpwalk earlier and it kept using the “wrong” IP address to send packets from. The destination was firewalled to only accept packets from certain sources, and I didn’t want to poke another hole just because snmpwalk was being stupid.

I read a lot of man pages to try to find out how to specify the source address but couldn’t find anything anywhere. Eventually I uncovered a post from 2004 saying that you can use the clientaddr directive in snmp.conf.

So, just so this is easy to find the next time I need it, you can force it on the command line with:

$ snmpwalk --clientaddr= …

And you do have to put the ‘=’ in there.

Ansible, ARA and MariaDB (or MySQL)


A short while back I switched to Ansible for configuration management.

One of the first things I needed was a way to monitor Ansible runs for problems, you know, so I wasn’t relying on seeing abnormal logging to tell me that a particular host was unreachable, or that tasks were failing somewhere.

I know I could use Ansible Tower for this, but it seemed like such a heavyweight solution for what seems like a simple requirement. That was when I found ARA.

Honestly even ARA is more than I need, and I don’t typically find myself using the web interface even though it’s very pretty. No, I’m only really interested in poking around in the database.

It’s great that it’s simple to install and get started with, so very quickly I was recording details about my Ansible runs in a SQLite database, which is the default setup. From there it was a simple SQL query to check if a particular host had suffered any task failures. It was easy to knock up a Nagios-compatible check plugin to call from my Icinga2 monitoring.
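
For a flavour of what such a check looks like, here is a minimal Python sketch. The table and column names (task_results, host, status) are hypothetical stand-ins, not ARA's real schema, and it uses an in-memory SQLite database so that it runs on its own. The exit codes are the usual Nagios plugin ones: 0 for OK and 2 for CRITICAL.

```python
import sqlite3

OK, CRITICAL = 0, 2  # Nagios plugin exit codes


def check_host(conn, host):
    """Return (exit code, message) for one host's task results."""
    # Hypothetical schema: one row per task result, with a status column.
    (failures,) = conn.execute(
        "SELECT COUNT(*) FROM task_results"
        " WHERE host = ? AND status = 'failed'",
        (host,),
    ).fetchone()
    if failures:
        return CRITICAL, f"CRITICAL: {failures} failed task(s) on {host}"
    return OK, f"OK: no failed tasks on {host}"


# Made-up data, purely to exercise the check.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_results (host TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO task_results VALUES (?, ?)",
    [("web1", "ok"), ("web1", "failed"), ("db1", "ok")],
)
print(check_host(conn, "web1"))
print(check_host(conn, "db1"))
```

A real plugin would print the message and pass the code to sys.exit() so that the monitoring system can pick it up.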

Issues

Excess data

The first problem I noted was that the SQLite database file was starting to grow in size quite rapidly. Around one week of run data used around 800MiB of storage. It’s not huge, but it was relentlessly growing and I could see very little value in keeping that data as I never looked at data from runs more than a few days previous. So, I wrote a script to delete old stuff from the database, keeping the last week’s worth.

Locking

Next up I started seeing SQLite locking problems.

Checks from Icinga2 were connecting to the SQLite DB, and so was the prune script, and Ansible itself too. The ARA Ansible plugin started complaining about locking:

[WARNING]: Failure using method (v2_runner_on_ok) in callback plugin                                                   
packages/ara/plugins/callbacks/log_ara.CallbackModule object at                                                         
0x7f2a84834b10>): (sqlite3.OperationalError) database is locked [SQL: u'SELECT                                  AS tasks_id, tasks.playbook_id AS tasks_playbook_id, tasks.play_id AS                                          
tasks_play_id, AS tasks_name, tasks.sortkey AS tasks_sortkey,                                                
tasks.action AS tasks_action, tasks.tags AS tasks_tags, tasks.is_handler AS                                             
tasks_is_handler, tasks.file_id AS tasks_file_id, tasks.lineno AS tasks_lineno,                                         
tasks.time_start AS tasks_time_start, tasks.time_end AS tasks_time_end \nFROM                                           
tasks \nWHERE = ?'] [parameters:                                                                               
('5f4506f7-95ac-4468-bea3-672d399d4507',)] (Background on this error at:                                       

The Ansible playbook run itself isn’t quick, taking about 13 minutes at the moment, and it seems like sometimes when that was running the check script too was running into locking issues resulting in alerts that were not actionable.

The ARA Ansible reporting plugin is the only thing that should be writing to the database so I thought it should be simple to let that have a write lock while everything else is freely able to read, but I couldn’t get to the bottom of this. No matter what I tried I was getting lock errors not just from ARA but also from my check script.

The basic problem here is that SQLite really isn’t designed for multiple concurrent access. I needed to move to a proper database engine.

At this point I think people really should be looking at PostgreSQL. My postgres knowledge is really rusty though, and although MySQL has its serious issues I felt like I had a handle on them. Rather than have this be the only postgres thing I have deployed I decided I’d do it with MariaDB.

MariaDB

So, MariaDB. Debian stretch. That comes with v10.1. Regular install, create the ARA database, point Ansible at it, do a playbook run and let ARA create its tables on first run. Oh.

UTF8 what?

Every play being run by Ansible was giving this warning:

PLAY [all] **************************************************************************
/opt/ansible/venv/ansible/local/lib/python2.7/site-packages/pymysql/ Warning: (1300, u"Invalid utf8 character string: '9C9D56'")                             
  result = self._query(query)

A bit of searching around suggested that my problem here was that the database was defaulting to utf8mb3 character set when these days MariaDB (and MySQL) should really be using utf8mb4.

Easy enough, right? Just switch to utf8mb4. Oh.

[WARNING]: Skipping plugin (/opt/ansible/venv/ansible/lib/python2.7/site-             
packages/ara/plugins/callbacks/ as it seems to be invalid:                
(pymysql.err.InternalError) (1071, u'Specified key was too long; max key length is
767 bytes') [SQL: u'\nCREATE TABLE files (\n\tid VARCHAR(36) NOT NULL,
\n\tplaybook_id VARCHAR(36), \n\tpath VARCHAR(255), \n\tcontent_id VARCHAR(40),
\n\tis_playbook BOOL, \n\tPRIMARY KEY (id), \n\tFOREIGN KEY(content_id) REFERENCES
file_contents (id), \n\tFOREIGN KEY(playbook_id) REFERENCES playbooks (id) ON DELETE
RESTRICT, \n\tUNIQUE (playbook_id, path), \n\tCHECK (is_playbook IN (0, 1))\n)\n\n']
(Background on this error at:

The problem now is that by default InnoDB has a maximum key length of 767 bytes, and with the utf8mb4 character set it is possible for each character to use up 4 bytes. A VARCHAR(255) column might need 1020 bytes.
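
The arithmetic is easy to check. Here is a tiny Python sketch of it, using the 767 byte limit from the error message and the worst-case bytes per character for each character set; the names are just for illustration.

```python
INNODB_MAX_KEY_BYTES = 767  # the limit quoted in the error message

# Worst-case storage for a single character in each character set.
MAX_BYTES_PER_CHAR = {"utf8mb3": 3, "utf8mb4": 4}


def max_key_bytes(varchar_length, charset):
    """Worst-case size in bytes of an index key on a VARCHAR column."""
    return varchar_length * MAX_BYTES_PER_CHAR[charset]


for charset in ("utf8mb3", "utf8mb4"):
    needed = max_key_bytes(255, charset)
    verdict = "fits" if needed <= INNODB_MAX_KEY_BYTES else "too long"
    print(f"VARCHAR(255) in {charset}: {needed} bytes ({verdict})")
```

A VARCHAR(255) only just squeezes under the limit at 3 bytes per character (765 bytes), which is why the problem only appears after switching to utf8mb4.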

Bigger keys in InnoDB

It is possible to increase the maximum key size in InnoDB to 3,000 and something bytes. Here’s the configuration options needed:


But sadly in MariaDB 10.1 even that is not enough, because the default InnoDB row format in v10.1 is compact. In order to use large prefixes you need to use either compressed or dynamic.

In MariaDB v10.2 the default changes to dynamic and there is an option to change the default on a server-wide basis, but in v10.1 as packaged with the current Debian stable the only way to change the row format is to override it on the CREATE TABLE command.
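
Overriding it on the CREATE TABLE command means adding a ROW_FORMAT table option to the end of the statement. As a rough sketch, here is a Python helper (my own, and the example table is made up) that does that to a statement held in a string:

```python
def with_row_format(create_table_sql, row_format="DYNAMIC"):
    """Append a ROW_FORMAT table option to a CREATE TABLE statement."""
    statement = create_table_sql.rstrip().rstrip(";")
    return f"{statement} ROW_FORMAT={row_format};"


# A made-up table, loosely modelled on the one in the error above.
sql = """CREATE TABLE example (
    id VARCHAR(36) NOT NULL,
    path VARCHAR(255),
    PRIMARY KEY (id),
    UNIQUE (path)
)"""
print(with_row_format(sql))
```

In practice you would not be editing SQL strings by hand like this, because the schema is generated for you. That is where the awkwardness comes in.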


Well, this is a little awkward. ARA creates its own database schema when you run it for the first time. Or, you can tell it not to do that, but then you need to create the database schema yourself.

I could have extracted the schema out of the files in site-packages/ara/db/versions/ (there’s only two files), but for a quick hack it was easier to change them. Each op.create_table( needs to have a line added right at the end, e.g. changing from this:

def upgrade():
    sa.Column('id', sa.String(length=36), nullable=False),                       
    sa.Column('playbook_id', sa.String(length=36), nullable=True),
    sa.Column('key', sa.String(length=255), nullable=True),
    sa.Column('value', models.CompressedData((2 ** 32) - 1), nullable=True),
    sa.ForeignKeyConstraint(['playbook_id'], [''], ondelete='RESTRICT'),
    sa.UniqueConstraint('playbook_id', 'key')
    ### end Alembic commands ###

To this:

def upgrade():
    sa.Column('id', sa.String(length=36), nullable=False),                       
    sa.Column('playbook_id', sa.String(length=36), nullable=True),
    sa.Column('key', sa.String(length=255), nullable=True),
    sa.Column('value', models.CompressedData((2 ** 32) - 1), nullable=True),
    sa.ForeignKeyConstraint(['playbook_id'], [''], ondelete='RESTRICT'),
    sa.UniqueConstraint('playbook_id', 'key'),
    mysql_row_format='DYNAMIC'
    ### end Alembic commands ###

That’s not a good general purpose patch because this is specific to MariaDB/MySQL whereas ARA supports SQLite, MySQL and PostgreSQL. I’m not familiar enough with sqlalchemy/alembic to know how it should properly be done. Maybe the ARA project would take the view that it shouldn’t be done, since the default row format changed already in upstream InnoDB. But I need this to work now, with a packaged version of MariaDB.

Wrapping up

So there we go, that all works now, no concurrency issues, ARA is able to create its schema and insert its data properly and my scripts were trivial to port from SQLite syntax to MariaDB syntax. The database still needs pruning, but that was always going to be the case.

It was a little more work than I really wanted, having to run a full database engine just to achieve the simple task of monitoring runs for failures, but I think it’s still less work than installing Ansible Tower.

It is a shame that ARA doesn’t work with a Debian stable package of MariaDB or MySQL just by following ARA’s install instructions. I have a feeling this should be considered a bug in ARA but I’m not sure enough to just go ahead and file it, and it seems that the only discussion forums are IRC and Slack. Anyway, once the newer versions of MariaDB and MySQL find their way into stable releases hopefully the character set issues will be a thing of the past.

This limited key length thing (and the whole of MySQL’s history with UTF8 in general) is just another annoyance in a very very long list of MySQL annoyances though, most of them stemming from basic design problems. That’s why I will reiterate: if you’re in a position to do so, I recommend picking PostgreSQL over MySQL.

Another disappointing btrfs experience

I’ve been using btrfs on my home fileserver for about 4½ years. I am not entirely happy with it and kind of wish I never did it; I will certainly not be introducing it anywhere else. I’m also pretty lazy though, which probably explains why I haven’t ripped it out and replaced it with something else yet.

I’ve had a few problems with it over the years. To be fair I’ve never lost any data; it’s really the availability aspects of it which I feel just aren’t ready yet. When I use multiple storage devices it’s generally to increase availability. I don’t expect device failure to stop me doing what I need to do, at least for small amounts of device failure.

Unfortunately btrfs has consistently not lived up to these expectations. Almost every single-disk failure I’ve had in the past has resulted in an “outage” of some sort. As this is just our data, at home, it may be strange to think of it as an outage, but that’s what it is. Our data became unavailable in some way for some period of time.

This time around, one of the drives started throwing up “Currently unreadable (pending)” and “Offline uncorrectable” sectors a few days ago. That means that there’s areas of the drive that it cannot read. Initially there were just a small number, and a scrub came back clean so that suggested the problem sectors were at that time outside of any filesystem.

In a more critical setting I’d have spare drives available and would just swap them, but for home use I’m usually comfortable with forcing the drive to reallocate these by forcing a write, before ordering a replacement if the problem doesn’t go away. Worst case, I have backups.

After a day or so though, the number of problem sectors was increasing and it was obvious the drive was going to die fairly soon. I ordered a replacement. About 6 hours before the replacement arrived the drive completely stopped responding.

Now, this drive was at the time one of five in the btrfs filesystem, and the filesystem has a raid1 storage policy so there should have been no issue with one device going missing. But apparently there was a problem. btrfs sits spewing the kernel log with errors about lost writes to a device that’s no longer there; the filesystem goes read-only.

The replacement drive arrives, but with the filesystem read-only I can’t add it. I can’t even unmount the filesystem (says it is busy but lsof doesn’t see any users). Nope, I had to reboot the fileserver, at which point the filesystem wouldn’t mount at all because you have to give it the degraded mount option if you want it to mount with any devices missing.

Add the replacement drive, btrfs device remove missing /path/to/fs to kick off a remove of the dead device. Things are at least up and running read-write while this is going on. In fact it’s still going on, because there was 1.2TiB of data on the dead device and reconstructing it is painfully slow. As I write this we’re now about 9 hours in and there’s still about 421GiB to go.

So, it’s not terrible. No data was lost (probably). A short outage due to a required reboot. But it is kind of disappointing and not really how I want to be spending my time just because a single HDD slipped its mortal coil. I am massively thankful that the operating system of that fileserver is still on four other HDDs on ext4+lvm+md, which never give me any trouble. Otherwise I’d have to be booting into a rescue OS to fix this sort of thing. When the thing you’re glad of is that you didn’t use a filesystem, that isn’t a great advert for that filesystem.

I should probably try to find some time to play (again) with ZFS-on-Linux. I did actually give it a go last year but got bogged down trying to compare its performance against btrfs and ext4+lvm+md using fio, which proved quite difficult to do, and I moved on to other things.

One of the things that initially attracted me to btrfs is the possibility of using a mish-mash of differently-sized drives. Due to BitFolk constantly replacing hardware I have in my possession plenty of HDDs of differing sizes that are individually perfectly serviceable, but would be awkward to try to match up into identical sizes for conventional RAID arrays. Over the years of this btrfs filesystem it had started out with mostly 250G drives and just before this failure it was 1x 1TB, 3x 2TB and 1x 3TB.

I had thought that ZFS requires every device to be the same capacity (i.e. it would only use the smallest capacity) but I’ve since been informed that ZFS will just use the capacity of the smallest device in the vdev. So assuming mirror vdevs, I’d just need to pair the drives up (or accept that the capacity will be that of the smaller of the two).
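
As a back-of-the-envelope sketch of what pairing up would give me, here is some Python that sorts the drives by size, pairs neighbours into two-way mirrors and adds up the smaller drive of each pair. It is just one simple pairing strategy with names I made up, not anything ZFS does for you.

```python
def mirror_pairs(sizes_tb):
    """Pair drives of similar size; an odd drive out is left over."""
    ordered = sorted(sizes_tb, reverse=True)
    pairs = [(ordered[i], ordered[i + 1])
             for i in range(0, len(ordered) - 1, 2)]
    leftover = ordered[-1:] if len(ordered) % 2 else []
    return pairs, leftover


def usable_capacity(pairs):
    """Each mirror only offers the capacity of its smaller drive."""
    return sum(min(pair) for pair in pairs)


# The drives in the btrfs filesystem just before this failure.
drives = [1, 2, 2, 2, 3]
pairs, leftover = mirror_pairs(drives)
print(pairs)                   # [(3, 2), (2, 2)]
print(leftover)                # [1]
print(usable_capacity(pairs))  # 4
```

So those five drives would come out at 4TB usable, with the 1TB drive left without a partner and 1TB of the 3TB drive going unused.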

That doesn’t seem too onerous at all, when considering the advantages that ZFS would bring. I’m most interested in the self-healing (checksums) and the storage tiering (through using faster devices like SSDs for L2ARC and ZIL). btrfs doesn’t have a good solution for tiering yet, unless you are insane and want to play with bcache(fs).

So, yeah, should stop being lazy and crack on with ZFS again. In my copious free time.

Disabling edge tiling on GNOME 3.28 / Debian testing (buster)

We’ve been here before ^

In an earlier post I mentioned how to disable edge tiling. That was for my desktop machine which at the time was running Ubuntu 17.10 and GNOME 3.26.

My laptop, however, currently runs Debian testing (buster) with GNOME 3.28, and this method does not work.

Things that work ^

In fact, one of the ways the Internet suggested, which didn’t work for Ubuntu, does work on my Debian laptop. That is:

$ gsettings set edge-tiling false

I have no idea why, sorry.

Things that don’t work ^

So, for my Debian buster laptop running GNOME 3.28 under Xorg, the things that don’t work are:

$ dconf write /org/gnome/shell/extensions/classic-overrides/edge-tiling false
$ dconf write /org/gnome/mutter/edge-tiling false
$ dconf write /org/gnome/shell/overrides/edge-tiling false

Using a different theme for Mediawiki’s SyntaxHighlight extension

Probably the best syntax highlighting plugin for Mediawiki at the moment is the one simply called SyntaxHighlight. It uses Pygments to do the heavy lifting. What sets it apart from the other extensions is that it supports line numbers and picking out highlighted lines.

Unfortunately the default style (theme) is dark-on-light whereas for most of my syntax highlighting I am giving examples of either shell sessions or code. All my shell sessions and code are viewed as light-on-dark, so I would prefer that the wiki’s syntax highlighting followed suit.

I spent quite a while messing about with editing the extension itself but to little effect, until Robert pointed out that I just needed to edit the Common.css file inside the wiki itself. Then you get some decent results.

I used something like this to generate the correct CSS for the “native” style:

$ ./extensions/SyntaxHighlight_GeSHi/pygments/pygmentize -S native -f html|sed -e 's/^/.mw-highlight > pre /'
.mw-highlight > pre .hll { background-color: #404040 }
.mw-highlight > pre .c { color: #999999; font-style: italic } /* Comment */
.mw-highlight > pre .err { color: #a61717; background-color: #e3d2d2 } /* Error */
.mw-highlight > pre .esc { color: #d0d0d0 } /* Escape */
.mw-highlight > pre .g { color: #d0d0d0 } /* Generic */
.mw-highlight > pre .k { color: #6ab825; font-weight: bold } /* Keyword */
.mw-highlight > pre .l { color: #d0d0d0 } /* Literal */
.mw-highlight > pre .n { color: #d0d0d0 } /* Name */
.mw-highlight > pre .o { color: #d0d0d0 } /* Operator */
.mw-highlight > pre .x { color: #d0d0d0 } /* Other */
.mw-highlight > pre .p { color: #d0d0d0 } /* Punctuation */
.mw-highlight > pre .ch { color: #999999; font-style: italic } /* Comment.Hashbang */
.mw-highlight > pre .cm { color: #999999; font-style: italic } /* Comment.Multiline */
.mw-highlight > pre .cp { color: #cd2828; font-weight: bold } /* Comment.Preproc */
.mw-highlight > pre .cpf { color: #999999; font-style: italic } /* Comment.PreprocFile */
.mw-highlight > pre .c1 { color: #999999; font-style: italic } /* Comment.Single */
.mw-highlight > pre .cs { color: #e50808; font-weight: bold; background-color: #520000 } /* Comment.Special */
.mw-highlight > pre .gd { color: #d22323 } /* Generic.Deleted */
.mw-highlight > pre .ge { color: #d0d0d0; font-style: italic } /* Generic.Emph */
.mw-highlight > pre .gr { color: #d22323 } /* Generic.Error */
.mw-highlight > pre .gh { color: #ffffff; font-weight: bold } /* Generic.Heading */
.mw-highlight > pre .gi { color: #589819 } /* Generic.Inserted */
.mw-highlight > pre .go { color: #cccccc } /* Generic.Output */
.mw-highlight > pre .gp { color: #aaaaaa } /* Generic.Prompt */
.mw-highlight > pre .gs { color: #d0d0d0; font-weight: bold } /* Generic.Strong */
.mw-highlight > pre .gu { color: #ffffff; text-decoration: underline } /* Generic.Subheading */
.mw-highlight > pre .gt { color: #d22323 } /* Generic.Traceback */
.mw-highlight > pre .kc { color: #6ab825; font-weight: bold } /* Keyword.Constant */
.mw-highlight > pre .kd { color: #6ab825; font-weight: bold } /* Keyword.Declaration */
.mw-highlight > pre .kn { color: #6ab825; font-weight: bold } /* Keyword.Namespace */
.mw-highlight > pre .kp { color: #6ab825 } /* Keyword.Pseudo */
.mw-highlight > pre .kr { color: #6ab825; font-weight: bold } /* Keyword.Reserved */
.mw-highlight > pre .kt { color: #6ab825; font-weight: bold } /* Keyword.Type */
.mw-highlight > pre .ld { color: #d0d0d0 } /* Literal.Date */
.mw-highlight > pre .m { color: #3677a9 } /* Literal.Number */
.mw-highlight > pre .s { color: #ed9d13 } /* Literal.String */
.mw-highlight > pre .na { color: #bbbbbb } /* Name.Attribute */
.mw-highlight > pre .nb { color: #24909d } /* Name.Builtin */
.mw-highlight > pre .nc { color: #447fcf; text-decoration: underline } /* Name.Class */
.mw-highlight > pre .no { color: #40ffff } /* Name.Constant */
.mw-highlight > pre .nd { color: #ffa500 } /* Name.Decorator */
.mw-highlight > pre .ni { color: #d0d0d0 } /* Name.Entity */
.mw-highlight > pre .ne { color: #bbbbbb } /* Name.Exception */
.mw-highlight > pre .nf { color: #447fcf } /* Name.Function */
.mw-highlight > pre .nl { color: #d0d0d0 } /* Name.Label */
.mw-highlight > pre .nn { color: #447fcf; text-decoration: underline } /* Name.Namespace */
.mw-highlight > pre .nx { color: #d0d0d0 } /* Name.Other */
.mw-highlight > pre .py { color: #d0d0d0 } /* Name.Property */
.mw-highlight > pre .nt { color: #6ab825; font-weight: bold } /* Name.Tag */
.mw-highlight > pre .nv { color: #40ffff } /* Name.Variable */
.mw-highlight > pre .ow { color: #6ab825; font-weight: bold } /* Operator.Word */
.mw-highlight > pre .w { color: #666666 } /* Text.Whitespace */
.mw-highlight > pre .mb { color: #3677a9 } /* Literal.Number.Bin */
.mw-highlight > pre .mf { color: #3677a9 } /* Literal.Number.Float */
.mw-highlight > pre .mh { color: #3677a9 } /* Literal.Number.Hex */
.mw-highlight > pre .mi { color: #3677a9 } /* Literal.Number.Integer */
.mw-highlight > pre .mo { color: #3677a9 } /* Literal.Number.Oct */
.mw-highlight > pre .sa { color: #ed9d13 } /* Literal.String.Affix */
.mw-highlight > pre .sb { color: #ed9d13 } /* Literal.String.Backtick */
.mw-highlight > pre .sc { color: #ed9d13 } /* Literal.String.Char */
.mw-highlight > pre .dl { color: #ed9d13 } /* Literal.String.Delimiter */
.mw-highlight > pre .sd { color: #ed9d13 } /* Literal.String.Doc */
.mw-highlight > pre .s2 { color: #ed9d13 } /* Literal.String.Double */
.mw-highlight > pre .se { color: #ed9d13 } /* Literal.String.Escape */
.mw-highlight > pre .sh { color: #ed9d13 } /* Literal.String.Heredoc */
.mw-highlight > pre .si { color: #ed9d13 } /* Literal.String.Interpol */
.mw-highlight > pre .sx { color: #ffa500 } /* Literal.String.Other */
.mw-highlight > pre .sr { color: #ed9d13 } /* Literal.String.Regex */
.mw-highlight > pre .s1 { color: #ed9d13 } /* Literal.String.Single */
.mw-highlight > pre .ss { color: #ed9d13 } /* Literal.String.Symbol */
.mw-highlight > pre .bp { color: #24909d } /* Name.Builtin.Pseudo */
.mw-highlight > pre .fm { color: #447fcf } /* Name.Function.Magic */
.mw-highlight > pre .vc { color: #40ffff } /* Name.Variable.Class */
.mw-highlight > pre .vg { color: #40ffff } /* Name.Variable.Global */
.mw-highlight > pre .vi { color: #40ffff } /* Name.Variable.Instance */
.mw-highlight > pre .vm { color: #40ffff } /* Name.Variable.Magic */
.mw-highlight > pre .il { color: #3677a9 } /* Literal.Number.Integer.Long */

(Yes, I also need to do the light-on-dark thing here in this blog)
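The sed step in that pipeline only prefixes each generated rule with the selector that the wiki wraps around highlighted code. If sed isn’t to hand, the same transformation could be sketched in Python as below. The function name prefix_rules is made up for this example, and unlike the sed command it leaves blank lines alone.

```python
def prefix_rules(css, selector=".mw-highlight > pre"):
    """Prefix every non-blank line of Pygments-generated CSS with a selector."""
    return "\n".join(
        f"{selector} {line}" if line.strip() else line
        for line in css.splitlines()
    )


# Two rules in the form that `pygmentize -S native -f html` prints them.
sample = (
    ".hll { background-color: #404040 }\n"
    ".k { color: #6ab825; font-weight: bold } /* Keyword */"
)
print(prefix_rules(sample))
```

The result is what gets pasted into Common.css.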

To get a list of available styles:

$ ./extensions/SyntaxHighlight_GeSHi/pygments/pygmentize -L styles
Pygments version 2.2.0, (c) 2006-2017 by Georg Brandl.
* manni:
    A colorful style, inspired by the terminal highlighting style.
* igor:
    Pygments version of the official colors for Igor Pro procedures.
* lovelace:
    The style used in Lovelace interactive learning environment. Tries to avoid the "angry fruit salad" effect with desaturated and dim colours.
* xcode:
    Style similar to the Xcode default colouring theme.
* vim:
    Styles somewhat like vim 7.0
* autumn:
    A colorful style, inspired by the terminal highlighting style.
* abap:
* vs:
* rrt:
    Minimalistic "rrt" theme, based on Zap and Emacs defaults.
* native:
    Pygments version of the "native" vim theme.
* perldoc:
    Style similar to the style used in the perldoc code blocks.
* borland:
    Style similar to the style used in the borland IDEs.
* arduino:
    The Arduino® language style. This style is designed to highlight the Arduino source code, so exepect the best results with it.
* tango:
    The Crunchy default Style inspired from the color palette from the Tango Icon Theme Guidelines.
* emacs:
    The default style (inspired by Emacs 22).
* friendly:
    A modern style based on the VIM pyte theme.
* monokai:
    This style mimics the Monokai color scheme.
* paraiso-dark:
* colorful:
    A colorful style, inspired by CodeRay.
* murphy:
    Murphy's style from CodeRay.
* bw:
* pastie:
    Style similar to the pastie default style.
* rainbow_dash:
    A bright and colorful syntax highlighting theme.
* algol_nu:
* paraiso-light:
* trac:
    Port of the default trac highlighter design.
* default:
    The default style (inspired by Emacs 22).
* algol:
* fruity:
    Pygments version of the "native" vim theme.

Although you may find it easier looking at the Pygments style gallery.

Let’s Encrypt wildcard certificates, and automated DNS verification

Let’s Encrypt’s wildcard certificates ^

Now that Let’s Encrypt can issue wildcard TLS certificates I found some time to look into that.

I already use a Lua script with haproxy which takes care of automatically answering http-01 ACME challenges, but to issue/renew a wildcard certificate you need to answer a dns-01 challenge. A different client/setup would be needed.

dns-01 ACME challenges ^

Most of the clients that support ACME v2 offer a range of integrations for DNS providers, plus a manual mode that prints out the DNS record that you need to add and then waits for you to indicate that you’ve done it. I run my own DNS infrastructure so the thing to do would be RFC2136 dynamic DNS updates.

One wrinkle here is that currently none of my DNS zones have dynamic updates enabled. At the moment I manage them as zone files (some are automatically generated by scripts though). After looking at a few of the client options I found one that supports an “alias zone”.

Basically, in your main zone you create a CNAME for the challenge record that points at another zone, and then enable dynamic updates in that other zone. The other zone is dedicated for this purpose, so the only updates which will be happening will be for the purpose of answering dns-01 ACME challenges. I made my dynamic zone a sub-zone of my main one.

zone file content ^

These records need to be added to the main zone for this to work.

; sub-zone purely used for dns-01 ACME challenges.
acmesh          NS
; Alias the dns-01 challenge record into the dedicated zone.
_acme-challenge CNAME
. zone file content ^

Initially this just needs to be an empty zone with only SOA and NS records, so this is the entire content of the file.

$TTL 86400      ; 1 day   IN SOA (
                                2018031905 ; serial
                                14400      ; refresh (4 hours)
                                7200       ; retry (2 hours)
                                1209600    ; expire (2 weeks)
                                43200      ; minimum (12 hours)

DNS server configuration ^

The DNS server needs to know a key by which it will authenticate the client’s updates, and also needs to be told that the new zone is a dynamic zone. I use BIND, so it goes as follows.

Generate a key for dynamic DNS updates ^

Use the dnssec-keygen command to generate a key suitable for authenticating DNS updates.

$ dnssec-keygen -r /dev/urandom -a HMAC-SHA512 -b 512 -n HOST DDNS_UPDATE

This creates two files named like Kddns_update.+165+14059.key and Kddns_update.+165+14059.private.

Put the key in the BIND config ^

Look in the private file and take the key from the line that starts “Key:”. Put that in some config file that you will load into your BIND like this:

key "strugglers" {
    algorithm hmac-sha512;
    secret "Sb8nvwpO8bhiU4haPB+NiJKoMO6vVJumrr29Bj3daSuB8hBoTKoqPKMBKTYLRUv12pbKPwJATgdsU6BtL4Hmcw==";
};

The thing in quotes after “key” is a symbolic name for this key and can be anything that makes sense to you. The “secret” is the key from the private file. You can delete the two Kddns_update.+165+14059.* files now.
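If you’d rather not copy and paste the secret by hand, a small script can pull out the “Key:” line and print the stanza. This is only a sketch: the function name key_stanza is made up, the sample file content is invented (I’m assuming the private file has one “Key:” line among other “Name: value” lines), and the secret in it is not a real key.

```python
def key_stanza(private_text, name, algorithm="hmac-sha512"):
    """Build a BIND key stanza from the text of a K*.private file."""
    for line in private_text.splitlines():
        if line.startswith("Key:"):
            # Everything after the first colon is the base64 secret.
            secret = line.split(":", 1)[1].strip()
            break
    else:
        raise ValueError("no 'Key:' line found")
    return (
        f'key "{name}" {{\n'
        f"    algorithm {algorithm};\n"
        f'    secret "{secret}";\n'
        f"}};\n"
    )


# Invented example content; a real file holds your actual secret.
example = (
    "Private-key-format: v1.3\n"
    "Algorithm: 165 (HMAC_SHA512)\n"
    "Key: c2VjcmV0\n"
)
print(key_stanza(example, "strugglers"))
```

In real use you would read the text from the .private file and write the output somewhere only BIND (and whatever sends the updates) can read.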

Put the new zone into the BIND config ^

The config for the zone itself looks something like this:

zone "" {
    type master;
    file "/path/to/";
    allow-update {
        key "strugglers";
    };
};

Reload the DNS server ^

Once BIND has been reloaded the log file should indicate that the zone was loaded correctly, and in my case that triggers DNS NOTIFY to my secondary servers which automatically begin zone transfers.

Check things out with nsupdate ^

At this point it might be worth using the nsupdate command to check that you can do dynamic DNS updates.

Just type the nsupdate line in the shell; the > is a prompt at which you will type the updates you wish to send. We’ll add a trivial TXT record. The -k argument is the path to the file containing the key.

$ nsupdate -k /path/to/strugglers.key -v
> server
> debug yes
> zone
> update add 86400 TXT "bar"
> show
Outgoing update query:
;; ->>HEADER<<- opcode: UPDATE, status: NOERROR, id:      0
;; flags:; ZONE: 0, PREREQ: 0, UPDATE: 0, ADDITIONAL: 0
;         IN      SOA
;; UPDATE SECTION: 86400 IN     TXT     "bar"
> send
Sending update to
Outgoing update query:
;; ->>HEADER<<- opcode: UPDATE, status: NOERROR, id:  19987
;; flags:; ZONE: 1, PREREQ: 0, UPDATE: 1, ADDITIONAL: 1
;         IN      SOA
;; UPDATE SECTION: 86400 IN     TXT     "bar"
strugglers.             0       ANY     TSIG    hmac-sha512. 1521454639 300 64 dPndp1/ZyqzmSEn0AKIsGR62HrsplJBhntWioM4oBdPlNXUIAwg7Jwpg DGSM2S3kY+5hfGTleNqwXZrMvnBhUQ== 19987 NOERROR 0 
Reply from update query:
;; ->>HEADER<<- opcode: UPDATE, status: NOERROR, id:  19987
;; flags: qr; ZONE: 1, PREREQ: 0, UPDATE: 0, ADDITIONAL: 1
;         IN      SOA
strugglers.             0       ANY     TSIG    hmac-sha512. 1521454639 300 64 NfH/78kvq6f+59RXnyJwC6kfFRLGjG6Rh9jdYRId7UjH0jwIbtRVpqCu xx4HToGmlJrDTUqpgbYZq2orUOZlkQ== 19987 NOERROR 0
> [Ctrl-D]

And to verify it really got added (though the status of NOERROR should be confirmation enough):

$ dig +short -t txt

That’s it; you can do dynamic DNS updates.

I’m going to assume you’ve installed the client according to one of its supported installation methods. Personally I am not into curl | sh so I:

  • Create a system user that can’t log in.
  • git clone the source.
  • --install it as that user.

The client doesn’t have to be run on the primary DNS server, because it’s going to use a dynamic DNS update to do all the DNS things. It just needs access to the dynamic DNS update key file. Either you can install it on each host that will need to generate/renew certificates and copy the DNS key there, or else do all the certificate generation/renewal in one place and copy the certificate files around.

However you manage it, make sure that the user you’re going to run the client as can read the dynamic DNS update key file.

Issuing the first wildcard certificate ^

The first time you issue the certificate you need to set NSUPDATE_KEY and NSUPDATE_SERVER in your environment. After the first successful issuance the client will store these variables in its configuration for use in the automated renewals.

$ NSUPDATE_KEY=/path/to/strugglers.key ./ --issue -d -d '*' --challenge-alias --dns dns_nsupdate
[Mon 19 Mar 09:19:00 UTC 2018] Multi domain=',DNS:*'
[Mon 19 Mar 09:19:00 UTC 2018] Getting domain auth token for each domain
[Mon 19 Mar 09:19:03 UTC 2018] Getting webroot for domain=''
[Mon 19 Mar 09:19:03 UTC 2018] Getting webroot for domain='*'
[Mon 19 Mar 09:19:04 UTC 2018] Found domain api file: /path/to/acmesh/dnsapi/
[Mon 19 Mar 09:19:04 UTC 2018] adding 60 in txt "WmenhbXRtenhpNLYLOBjznyHcVvFk-jjxurCVTrhWc8"
[Mon 19 Mar 09:19:04 UTC 2018] Found domain api file: /path/to/acmesh/dnsapi/
[Mon 19 Mar 09:19:04 UTC 2018] adding 60 in txt "fwZPUBHijOQkJJaoOF_nIn3Z_FtuVU9R635NDVz_hPA"
[Mon 19 Mar 09:19:04 UTC 2018] Sleep 120 seconds for the txt records to take effect

At this point a DNS update has been crafted and sent so you should see your zone update and zone transfer happen to any secondary servers. If that doesn’t happen within 120 seconds then when Let’s Encrypt tries to verify the challenge it might query a DNS server that doesn’t yet have the record. Your zone transfers need to be reliable.

[Mon 19 Mar 09:21:08 UTC 2018]
[Mon 19 Mar 09:21:12 UTC 2018] Success
[Mon 19 Mar 09:21:12 UTC 2018] Verifying:*
[Mon 19 Mar 09:21:15 UTC 2018] Success
[Mon 19 Mar 09:21:15 UTC 2018] Removing DNS records.
[Mon 19 Mar 09:21:15 UTC 2018] removing txt
[Mon 19 Mar 09:21:16 UTC 2018] removing txt
[Mon 19 Mar 09:21:16 UTC 2018] Verify finished, start to sign.
[Mon 19 Mar 09:21:18 UTC 2018] Cert success.
[Mon 19 Mar 09:21:18 UTC 2018] Your cert is in  /path/to/acmesh/ 
[Mon 19 Mar 09:21:18 UTC 2018] Your cert key is in  /path/to/acmesh/ 
[Mon 19 Mar 09:21:18 UTC 2018] The intermediate CA cert is in  /path/to/acmesh/ 
[Mon 19 Mar 09:21:18 UTC 2018] And the full chain certs is there:  /path/to/acmesh/

Examining a certificate ^

Just for peace of mind…

$ openssl x509 -text -noout -certopt no_subject,no_header,no_version,no_serial,no_signame,no_subject,no_issuer,no_pubkey,no_sigdump,no_aux -in /path/to/acmesh/
            Not Before: Mar 19 08:21:17 2018 GMT
            Not After : Jun 17 08:21:17 2018 GMT
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment
            X509v3 Extended Key Usage: 
                TLS Web Server Authentication, TLS Web Client Authentication
            X509v3 Basic Constraints: critical
            X509v3 Subject Key Identifier: 
            X509v3 Authority Key Identifier: 
            Authority Information Access: 
                OCSP - URI:
                CA Issuers - URI:
            X509v3 Subject Alternative Name: 
            X509v3 Certificate Policies: 
                  User Notice:
                    Explicit Text: This Certificate may only be relied upon by Relying Parties and only in accordance with the Certificate Policy found at

From the Subject Alternative Name we can see it is a wildcard certificate.
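The validity dates are worth a glance as well, since they tell you how soon the automated renewal has to work. A minimal sketch of the arithmetic, using the two dates from the output above (the function name validity_days is just for this example, and the month abbreviations assume an English/C locale):

```python
from datetime import datetime


def validity_days(not_before, not_after):
    """Whole days between two dates in the format `openssl x509 -text` prints."""
    fmt = "%b %d %H:%M:%S %Y GMT"
    delta = datetime.strptime(not_after, fmt) - datetime.strptime(not_before, fmt)
    return delta.days


print(validity_days("Mar 19 08:21:17 2018 GMT", "Jun 17 08:21:17 2018 GMT"))
```

For this certificate that works out to 90 days.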

Disabling edge tiling on GNOME 3.26 / Ubuntu 17.10

Edge tiling? ^

It’s that thing where when you drag a window so it hits the edge of the screen, GNOME offers to maximise the window. Generally the number of times I will knowingly want to maximise a window by dragging it to the top of the screen is 0, while the number of times it happens accidentally is over 9,000 by lunch time.

Things that work ^

$ dconf write /org/gnome/mutter/edge-tiling false

It should take effect immediately.

If you like a pointy clicky way to do it then install the dconf-editor package and run dconf-editor, but really all you will do is click down the tree org → gnome → mutter and then toggle edge-tiling so I don’t really see the point.

Things that people on the Internet say work, but don’t – a non-exhaustive list ^

These suggestions silently fail to do anything, as far as I can see. They may have been correct for earlier versions of GNOME, but I am using GNOME on Ubuntu 17.10 and they didn’t work for me.

dconf write /org/gnome/shell/extensions/classic-overrides/edge-tiling false
gsettings set edge-tiling false
dconf write /org/gnome/shell/overrides/edge-tiling false

When is a 64-bit counter not a 64-bit counter?

…when you run a Xen device backend (commonly dom0) on a kernel version earlier than 4.10, e.g. Debian stable.


Xen netback devices used 32-bit counters until that bug was fixed and released in kernel version 4.10.

On a kernel with that bug you will see counter wraps much sooner than you would expect, and if the interface is doing enough traffic for there to be multiple wraps in 5 minutes, your monitoring will no longer be accurate.

The problem ^

A high-bandwidth VPS customer reported that the bandwidth figures presented by BitFolk’s monitoring bore no resemblance to their own statistics gathered from inside their VPS. Their figures were a lot higher.

About octet counters ^

The Linux kernel maintains byte/octet counters for its network interfaces. You can view them in /sys/class/net/<interface>/statistics/*_bytes.

They’re a simple count of bytes transferred, and so the count always goes up. Typically these are 64-bit unsigned integers so their maximum value would be 18,446,744,073,709,551,615 (2^64-1).

When you’re monitoring bandwidth use, the monitoring system records the value and the timestamp. The difference in value over a known period allows the monitoring system to work out the rate.

Wrapping ^

Monitoring of network devices is often done using SNMP. SNMP has 32-bit and 64-bit counters.

The maximum value that can be held in a 32-bit counter is 4,294,967,295, so it wraps after 4,294,967,296 bytes. As that is a byte count, that represents 34,359,738,368 bits or 34,359.74 megabits. Divide that by 300 (seconds in 5 minutes) and you get 114.5. Therefore if the average bandwidth is above 114.5Mbit/s for 5 minutes, you will overflow a 32-bit counter. When the counter overflows it wraps back through zero.
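That arithmetic is easy to reproduce. The helper below is only an illustration (the name wrap_threshold_mbit is made up here); it takes a counter width and a polling interval and returns the average rate needed to wrap the counter once.

```python
def wrap_threshold_mbit(counter_bits, interval_s):
    """Average rate in Mbit/s needed to wrap a byte counter once in interval_s."""
    bits_to_wrap = (2 ** counter_bits) * 8  # counter holds bytes, rate is in bits
    return bits_to_wrap / interval_s / 1_000_000


# 32-bit counter, polled every 5 minutes.
print(round(wrap_threshold_mbit(32, 300), 1))
```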

Wrapping a counter once is fine. We have to expect that a counter will wrap eventually, and as counters never decrease, if a new value is smaller than the previous one then we know it has wrapped and can still work out what the rate should be.

The problem comes when the counter wraps more than once. There is no way to tell how many times it has wrapped so the monitoring system will have to assume the answer is once. Once traffic reaches ~229Mbit/s the counters will be wrapping at least twice in 5 minutes and the statistics become meaningless.
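Here is a sketch of how a monitoring system might turn two samples into a rate while allowing for a single wrap, and of why more than one wrap can’t be detected. The function name rate_bps and the sample values are invented for the example; real monitoring software will differ in the details.

```python
def rate_bps(prev, curr, interval_s, counter_bits=32):
    """Bits per second between two samples, assuming at most one wrap."""
    # Modular subtraction handles curr < prev (a single wrap) for free.
    delta_bytes = (curr - prev) % (2 ** counter_bits)
    return delta_bytes * 8 / interval_s


prev = 4_294_967_000

# Case 1: 1,000 bytes transferred, which takes the counter through zero once.
curr = (prev + 1_000) % 2**32
print(curr, rate_bps(prev, curr, 300))

# Case 2: the counter also went all the way round twice more in between.
# The sample is identical, so the computed rate is identical (and far too low).
curr = (prev + 2 * 2**32 + 1_000) % 2**32
print(curr, rate_bps(prev, curr, 300))
```

Both cases produce the same sample and therefore the same rate, which is exactly the ambiguity described above.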

64-bit counters to the rescue ^

For that reason, network traffic is normally monitored using 64-bit counters. You would have to have a traffic rate of almost 492 Petabit/s to wrap a 64-bit byte counter in 5 minutes.
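To put that number in perspective, the sketch below checks the figure and also turns it around to ask how long a 64-bit byte counter would take to wrap at a given rate. The 100Gbit/s rate and the function names are arbitrary choices for illustration.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
BITS_TO_WRAP_64 = (2 ** 64) * 8  # a byte counter, expressed in bits


def rate_to_wrap_pbit(interval_s):
    """Rate in Pbit/s needed to wrap a 64-bit byte counter in interval_s."""
    return BITS_TO_WRAP_64 / interval_s / 1e15


def years_to_wrap(rate_gbit):
    """Years a 64-bit byte counter takes to wrap at a constant rate."""
    return BITS_TO_WRAP_64 / (rate_gbit * 1e9) / SECONDS_PER_YEAR


print(round(rate_to_wrap_pbit(300), 1))  # Pbit/s needed to wrap in 5 minutes
print(round(years_to_wrap(100), 1))      # years to wrap at a steady 100Gbit/s
```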

The thing is, I was already using 64-bit SNMP counters.

Examining the sysfs files ^

I decided to remove SNMP from the equation by going to the source of the data that SNMP uses: the kernel on the device being monitored.

As mentioned, the kernel’s interface byte counters are exposed in sysfs at /sys/class/net/<interface>/statistics/*_bytes. I dumped out those values every 10 seconds and watched them scroll in a terminal session.
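Something along these lines would do for the dumping. It is a sketch rather than what I actually ran: the function names are made up, and the interface name in the usage comment is hypothetical.

```python
import os
import time


def read_counters(iface, base="/sys/class/net"):
    """Return the rx/tx byte counters for one interface as integers."""
    stats_dir = os.path.join(base, iface, "statistics")
    counters = {}
    for name in ("rx_bytes", "tx_bytes"):
        with open(os.path.join(stats_dir, name)) as f:
            counters[name] = int(f.read())
    return counters


def watch(iface, interval_s=10, samples=6):
    """Print timestamped counter values every interval_s seconds."""
    for _ in range(samples):
        c = read_counters(iface)
        print(time.strftime("%H:%M:%S"), c["rx_bytes"], c["tx_bytes"])
        time.sleep(interval_s)


# Example (hypothetical interface name): watch("vif1.0", interval_s=10, samples=30)
```

A value that is lower than the one printed before it means the counter has wrapped.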

What I observed was that these counters, for that particular customer, were wrapping every couple of minutes. I never observed a value greater than 8,469,862,875. That’s larger than a 32-bit counter would hold, but very close to what a 33-bit counter would hold (8,589,934,591).
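A quick way to check how many bits a value needs is Python’s int.bit_length(). This only shows that the largest value I saw fits in 33 bits and not in 32; it says nothing about why the counter behaves that way.

```python
largest_seen = 8_469_862_875

# Number of bits needed to represent the largest observed value.
print(largest_seen.bit_length())

# Compare against the 32-bit and 33-bit maximums.
print(2**32 - 1 < largest_seen <= 2**33 - 1)
```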

64-bit counters not to the rescue ^

Once I realised that the kernel’s own counters were wrapping every couple of minutes inside the kernel it became clear that using 64-bit counters in SNMP was not going to help at all, and multiple wraps would be seen in 5 minutes.

What a difference a minute makes ^

To test the hypothesis I switched to 1-minute polling. Here’s what 12 hours of real data looks like under both 5- and 1-minute polling.

As you can see that is a pretty dramatic difference.

The bug ^

By this point, I’d realised that there must be a bug in Xen’s netback driver (the thing that makes virtual network interfaces in dom0).

I went searching through the source of the kernel and found that the counters had changed from an unsigned long in kernel version 4.9 to a u64 in kernel version 4.10.

Of course, once I knew what to search for it was easy to unearth a previous bug report. If I’d found that at the time of the initial report that would have saved 2 days of investigation!

Even so, the fix for this was only committed in February of this year so, unfortunately, is not present in the kernel in use by the current Debian stable. Nor in many other current distributions.

For Xen set-ups on Debian the bug could be avoided by using a backports kernel or packaging an upstream kernel.

Or you could do 1-minute polling as that would only wrap one time at an average bandwidth of ~572Mbit/s and should be safe from multiple wraps up to ~1.1Gbit/s.
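Turning the sum around gives the longest polling interval that is still safe from a second wrap at a given average rate. A rough sketch, assuming an effectively 32-bit byte counter; the function name and the list of rates are just for illustration.

```python
def max_safe_interval_s(rate_mbit, counter_bits=32):
    """Longest polling interval in seconds before a second wrap becomes possible."""
    bits_for_two_wraps = 2 * (2 ** counter_bits) * 8
    return bits_for_two_wraps / (rate_mbit * 1_000_000)


for rate in (100, 229, 572, 1100):
    print(f"{rate} Mbit/s: poll at least every {max_safe_interval_s(rate):.0f}s")
```

At 1.1Gbit/s the answer is only just over a minute, so 1-minute polling is a workaround with not much headroom rather than a fix.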

Inside the VPS the counters are 64-bit so it isn’t an issue for guest administrators.