Measuring Linux IO read/write mix and size

In this article I will show you how to use blktrace (part of the blktrace package on Debian) to measure the characteristics of production IO load.

Why measure IO size and mix? ^

Storage vendors often report the specifications of their devices based on how many 4KiB requests they can read or write, sequentially or randomly. The sequential speed is often quoted in megabytes or gigabytes per second; the random performance as the number of IO operations per second (IOPS). But is this representative of the real world?
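
To see why the assumed request size matters so much, consider how it links IOPS to throughput. Here is a quick back-of-the-envelope sketch; the 100,000 IOPS figure is made up for illustration, not taken from any vendor's datasheet:

# Sketch: the same IOPS figure means very different throughput depending
# on the request size. 100,000 IOPS is a made-up number for illustration.
iops = 100000
for size_kib in (4, 16, 64):
    print("%2dKiB requests x %d IOPS = %6.1f MiB/s"
          % (size_kib, iops, size_kib * iops / 1024.0))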

IO request size is not the only assumption baked into these figures. Benchmarks often choose either pure reads or pure writes, and when a mix is suggested a common split is something like 25% writes to 75% reads.

These tests do have value when the parameters are kept the same, in order to compare setups on a level playing field. The danger, however, is that none of the tests will be representative of your production workload.

Measuring with blktrace ^

It’s rare that we can just put production load on a test setup. So what can we do? One thing we can do is to use blktrace.

Here is an example of me doing this on some of BitFolk's busiest VM hosts. These have MD RAID-10 arrays on pairs of SSDs. I will trace IO on three of these arrays for one hour.

$ sudo blktrace /dev/md4 -w 3600 -o elephant_md4

This will store the (binary format) output into files in the current directory, one per CPU on the host. These can be quite large (in my case about 70MiB per hour) and it is important that you store them somewhere other than the device which you are tracing, otherwise the writes to these will interfere with your trace.

I did the same with elephant_md5 and on another host to produce hobgoblin_md3.

Interpreting the output ^

The binary files that blktrace has produced are just logs of IO requests. There is no analysis there. There are several different tools that can consume blktrace output, but the one that I will use here is called blkparse and is part of the same package.

When blkparse is run with just the trace name it will produce a log of IO to standard output and then a summary at the end, like this:

$ blkparse hobgoblin_md3
  9,3    1        1     0.000000000  7548  A  WS 2409885696 + 32 <- (253,63) 20006912
  9,3    1        2     0.000000774  7548  Q  WS 2409885696 + 32 [blkback.8.xvda]
  9,3    1        3     0.000669843  7548  A  WS 2398625368 + 16 <- (253,63) 8746584
  9,3    1        4     0.000670267  7548  Q  WS 2398625368 + 16 [blkback.8.xvda]
  9,3    1        5     0.000985592  7548  A FWS 0 + 0 <- (253,63) 0
[…]
CPU0 (hobgoblin_md3):
 Reads Queued:     578,712,    6,307MiB  Writes Queued:     661,480,    9,098MiB
 Read Dispatches:        0,        0KiB  Write Dispatches:        0,        0KiB
 Reads Requeued:         0               Writes Requeued:         0
 Reads Completed:        0,        0KiB  Writes Completed:        0,        0KiB
 Read Merges:            0,        0KiB  Write Merges:            0,        0KiB
 Read depth:             0               Write depth:             0
 IO unplugs:             0               Timer unplugs:           0
CPU1 (hobgoblin_md3):
 Reads Queued:     663,713,    7,499MiB  Writes Queued:       1,063K,   14,216MiB
 Read Dispatches:        0,        0KiB  Write Dispatches:        0,        0KiB
 Reads Requeued:         0               Writes Requeued:         0
 Reads Completed:        0,        0KiB  Writes Completed:        0,        0KiB
 Read Merges:            0,        0KiB  Write Merges:            0,        0KiB
 Read depth:             0               Write depth:             0
 IO unplugs:             0               Timer unplugs:           0
 
Total (hobgoblin_md3):
 Reads Queued:       1,242K,   13,806MiB         Writes Queued:       1,724K,   23,315MiB
 Read Dispatches:        0,        0KiB  Write Dispatches:        0,        0KiB
 Reads Requeued:         0               Writes Requeued:         0
 Reads Completed:        0,        0KiB  Writes Completed:        0,        0KiB
 Read Merges:            0,        0KiB  Write Merges:            0,        0KiB
 IO unplugs:             0               Timer unplugs:           0

Read/write mix ^

Most of the items in the summary at the end are zero because the MD array is a logical device that doesn’t actually see most of the IO – it gets remapped to a lower layer. In order to see full details you would need to run blktrace against the constituent components of each MD array. It does give enough for the purpose of showing the read/write mix though:

$ for f in elephant_md4 elephant_md5 hobgoblin_md3; do echo $f; blkparse $f | grep Queued: | tail -1; done
elephant_md4
 Reads Queued:     526,458,   18,704MiB  Writes Queued:     116,753,  969,524KiB
elephant_md5
 Reads Queued:     424,788,    5,170MiB  Writes Queued:     483,372,    6,895MiB
hobgoblin_md3
 Reads Queued:       1,242K,   13,806MiB         Writes Queued:       1,724K,   23,315MiB

Simple maths tells us that the read/write mix is therefore approximately:

Array           # of reads   # of writes   Total IOs   Read %
elephant_md4       526,458       116,753     643,211      82%
elephant_md5       424,788       483,372     908,160      47%
hobgoblin_md3    1,242,000     1,724,000   2,966,000      42%
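
If you would rather not do that arithmetic by hand, here is a small sketch that pulls the totals out of the blkparse summary and computes the read percentage. It assumes the "Reads Queued: … Writes Queued: …" summary format shown above, and the readmix.py name is purely for illustration; note that blkparse rounds large counts to thousands (the trailing "K"), so the result is approximate for busy arrays:

#!/usr/bin/env python
# Sketch: work out the read/write mix from a blkparse summary, e.g.
#   blkparse elephant_md4 | grep Queued: | tail -1 | ./readmix.py
# Counts like "1,242K" are blkparse's rounded thousands, so the
# percentages are approximate for busy arrays.
import re
import sys

for line in sys.stdin:
    m = re.search(r'Reads Queued:\s+([\d,K]+).*Writes Queued:\s+([\d,K]+)', line)
    if not m:
        continue
    reads, writes = (int(s.replace(',', '').replace('K', '000'))
                     for s in m.groups())
    total = reads + writes
    print("reads=%d writes=%d total=%d read%%=%.0f%%"
          % (reads, writes, total, 100.0 * reads / total))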

There is quite a disparity here for one of the arrays on host elephant at 82% reads. I looked into this and it was in fact one customer doing something incredibly read-intensive. Checking a few other arrays on different hosts I found that 20-50% reads is fairly normal. The conclusions I would draw here are:

  1. Checking something around a 40/60 split of read/write is most important for my workload.
  2. Something like the traditional 75/25 split of read/write is still worth examining.

On another note, we can take some comfort that elephant is apparently nowhere near its IO limits, as the single array on hobgoblin is doing more IO than both arrays on elephant combined. The extra array was added to elephant purely because it ran out of storage capacity, not because it was hitting IO limits.

IO size ^

What about the size of individual IOs?

For the impatient, here are some histograms; click on them for the full-size versions. See the rest of this section for how they were made.

An excursion into blkparse output ^

If we filter the blkparse output to only show IO of the type we’re interested in then we get something like this (restricting to the first 10ms just for brevity):

$ blkparse hobgoblin_md3 -a write -w 0.01
Input file hobgoblin_md3.blktrace.0 added
Input file hobgoblin_md3.blktrace.1 added
  9,3    1        1     0.000000000  7548  A  WS 2409885696 + 32 <- (253,63) 20006912
  9,3    1        2     0.000000774  7548  Q  WS 2409885696 + 32 [blkback.8.xvda]
  9,3    1        3     0.000669843  7548  A  WS 2398625368 + 16 <- (253,63) 8746584
  9,3    1        4     0.000670267  7548  Q  WS 2398625368 + 16 [blkback.8.xvda]
  9,3    1        5     0.000985592  7548  A FWS 0 + 0 <- (253,63) 0
  9,3    1        6     0.000986169  7548  Q FWS [blkback.8.xvda]
  9,3    1        7     0.001238245  7548  A  WS 2398625384 + 8 <- (253,63) 8746600
  9,3    1        8     0.001238808  7548  Q  WS 2398625384 + 8 [blkback.8.xvda]
  9,3    1        9     0.001502839  7548  A FWS 0 + 0 <- (253,63) 0
  9,3    1       10     0.001503419  7548  Q FWS [blkback.8.xvda]
  9,3    1       11     0.002004499  7548  A  WS 2409903328 + 32 <- (253,63) 20024544
  9,3    1       12     0.002005074  7548  Q  WS 2409903328 + 32 [blkback.8.xvda]
  9,3    1       13     0.002334404  7548  A FWS 0 + 0 <- (253,63) 0
  9,3    1       14     0.002334913  7548  Q FWS [blkback.8.xvda]
CPU1 (hobgoblin_md3):
 Reads Queued:           0,        0KiB  Writes Queued:           7,       44KiB
 Read Dispatches:        0,        0KiB  Write Dispatches:        0,        0KiB
 Reads Requeued:         0               Writes Requeued:         0
 Reads Completed:        0,        0KiB  Writes Completed:        0,        0KiB
 Read Merges:            0,        0KiB  Write Merges:            0,        0KiB
 Read depth:             0               Write depth:             0
 IO unplugs:             0               Timer unplugs:           0
 
Throughput (R/W): 0KiB/s / 0KiB/s
Events (hobgoblin_md3): 14 entries
Skips: 0 forward (0 -   0.0%)

You can look up in the blkparse man page what all this means, but briefly for the first two lines:

9,3
The major and minor numbers of the block device. Major number 9 is MD and minor number 3 identifies the array, hence md3.
1
The CPU number that the IO happened on.
1
The sequence number of the IO.
0.000000000
Time delta (seconds and nanoseconds) into the trace.
7548
The process ID that the IO was on behalf of.

In this case process 7548 on hobgoblin is a Xen disk backend called [blkback.8.xvda]. That means it is the backend for disk xvda on Xen domain ID 8.

A
IO action: remap. An IO came in from or is going out to a stacked device. The details for this action will show what is going where.
Q
IO action: queue. Declaring the intention to queue an IO, but it has not actually happened yet.

And for the details of the actions for the first two lines:

A WS 2409885696 + 32 <- (253,63) 20006912
WS
The "RWBS" data. In this case indicating a Synchronous Write.
2409885696 + 32
The IO starts at sector 2409885696 and is 32 blocks in size. For this device the block size is 512 bytes, so this represents a 16KiB IO.
<- (253,63) 20006912
This IO came from a higher level device with major,minor (253,63) starting at sector 20006912 on that device.

Major number 253 is for LVM logical volumes. The command dmsetup ls will list all logical volumes with their major and minor numbers, so that enables me to tell which LV this IO came from. That information in this case is of course already known since I know from the other details which Xen disk backend process this is associated with.
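
As an aside, the same mapping can be read out of sysfs without running dmsetup. Here is a small sketch, assuming a device-mapper device and a kernel that provides /sys/dev/block; the dm_name helper is mine, purely for illustration:

#!/usr/bin/env python
# Sketch: resolve a (major, minor) pair such as (253, 63) to its
# device-mapper name via sysfs. LVM logical volumes are DM devices,
# so this gives the LV name that dmsetup ls would show.
def dm_name(major, minor):
    with open("/sys/dev/block/%d:%d/dm/name" % (major, minor)) as f:
        return f.read().strip()

print(dm_name(253, 63))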

Q WS 2409885696 + 32 [blkback.8.xvda]
[blkback.8.xvda]
The only new piece of data: the name of the process that this IO is associated with.
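
To make the field layout concrete, here is a small sketch (my own, not part of the blktrace package) that pulls those fields out of one of the Q lines shown above. The field positions are an assumption based on the default blkparse output format:

#!/usr/bin/env python
# Sketch: split one default-format blkparse "Q" line into the fields
# described above.
line = "  9,3    1        2     0.000000774  7548  Q  WS 2409885696 + 32 [blkback.8.xvda]"

fields = line.split()
dev, cpu, seq, timestamp, pid, action, rwbs = fields[:7]
sector, nblocks, process = fields[7], int(fields[9]), fields[10]

print("%s on CPU %s at %ss: pid %s action %s (%s)"
      % (dev, cpu, timestamp, pid, action, rwbs))
print("sector %s, %d blocks = %dKiB, process %s"
      % (sector, nblocks, nblocks * 512 // 1024, process))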

That covers all output shown except for the IOs whose RWBS data begins with "F" (the "FWS" entries). This does not appear to be documented in the blkparse man page, but a look at the blkparse source shows that a leading "F" indicates a flush request (an "F" appearing after the R or W instead denotes Force Unit Access). We can surmise that these zero-length flushes are a consequence of the process issuing synchronous write IO; the flush is used to ensure that the data has hit stable storage.

If a blktrace had been done on a lower-level device — for example, one of the SSDs that actually makes up these arrays — then there would be a lot of other types of IO and actions present in the blkparse output. What's present is enough though to tell us about the types and sizes of IO that are being queued.
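
Should you want to trace those component devices, their names can be read from sysfs. A small sketch, assuming the kernel exposes a slaves directory for the array (MD and DM devices do):

#!/usr/bin/env python
# Sketch: list the component devices of an MD array via sysfs, e.g. the
# SSD partitions underneath md3, so each could be traced with blktrace.
import os

def md_components(md):
    return sorted(os.listdir("/sys/block/%s/slaves" % md))

print(md_components("md3"))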

Generating an IO size histogram ^

Here's a quick and dirty Perl script to parse the blkparse output further and generate a textual histogram of IO sizes.

#!/usr/bin/env perl
 
use strict;
use warnings;
 
use Math::Round;
 
my $val;
my %size;
my $total;
 
# Read one IO size per line (in 512-byte sectors, as printed by the
# awk filter on the blkparse output), convert it to KiB and count how
# many IOs of each size we see.
while(<>) {
    $val = $_;
    chomp($val);
    $size{$val * 512 / 1024}++;
    $total++;
}
 
# Print each size bucket with its count and percentage of all IOs.
foreach my $s (sort { $a <=> $b } keys %size) {
    printf "%3d:\t%6u (%2u%%)\n", $s, $size{$s},
        round($size{$s}/$total*100);
}

Which would be run like this:

$ blkparse hobgoblin_md3 -a read | awk '/ Q / { print $10 }' | ./hist.pl
  0:        13 ( 0%)
  4:    874810 (70%)
  8:    151457 (12%)
 12:     15811 ( 1%)
 16:     58101 ( 5%)
 20:      6487 ( 1%)
 24:      9638 ( 1%)
 28:      4619 ( 0%)
 32:     56070 ( 5%)
 36:      3031 ( 0%)
 40:      4036 ( 0%)
 44:      2681 ( 0%)
 48:      3722 ( 0%)
 52:      2311 ( 0%)
 56:      2492 ( 0%)
 60:      2241 ( 0%)
 64:      4999 ( 0%)
 68:      1972 ( 0%)
 72:      2052 ( 0%)
 76:      1331 ( 0%)
 80:      1588 ( 0%)
 84:      1151 ( 0%)
 88:      1389 ( 0%)
 92:      1022 ( 0%)
 96:      1430 ( 0%)
100:      1057 ( 0%)
104:      1457 ( 0%)
108:       991 ( 0%)
112:      1375 ( 0%)
116:       913 ( 0%)
120:      1221 ( 0%)
124:       731 ( 0%)
128:     20226 ( 2%)
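
The 4KiB bucket clearly dominates, but the tail (particularly that 128KiB bucket) pulls the mean well above the mode. As a rough check, here is the weighted mean of the bucket counts above; it is approximate, because the bucket labels printed by the script are truncated to whole KiB:

#!/usr/bin/env python
# Sketch: approximate mean read IO size for hobgoblin_md3, computed from
# the textual histogram above (size in KiB -> number of IOs).
counts = {
      0: 13,      4: 874810,   8: 151457,  12: 15811,   16: 58101,
     20: 6487,   24: 9638,    28: 4619,   32: 56070,   36: 3031,
     40: 4036,   44: 2681,    48: 3722,   52: 2311,    56: 2492,
     60: 2241,   64: 4999,    68: 1972,   72: 2052,    76: 1331,
     80: 1588,   84: 1151,    88: 1389,   92: 1022,    96: 1430,
    100: 1057,  104: 1457,   108: 991,   112: 1375,   116: 913,
    120: 1221,  124: 731,    128: 20226,
}
total_ios = sum(counts.values())
mean_kib = sum(size * n for size, n in counts.items()) / float(total_ios)
# Prints roughly 1,242,425 IOs with a mean of about 11.1KiB for this array.
print("%d IOs, mean size %.1f KiB" % (total_ios, mean_kib))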

Although if you like pretty pictures (who doesn't?) I recommend ditching the Perl script and just dumping the data from blkparse into a CSV file like this:

$ (echo "IO_Size"; blkparse hobgoblin_md3 -a read | awk '/ Q / { if ($10 != 0) { print $10/2 } else print 0 }') > hobgoblin_md3_read_sizes.csv

And then doing a bit of Python/Pandas:

#!/usr/bin/env python
# coding: utf-8
 
# All this is just boilerplate that I do in every Pandas thing.
get_ipython().run_line_magic('matplotlib', 'notebook')
get_ipython().run_line_magic('precision', '3')
 
from IPython.display import set_matplotlib_formats
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
 
import matplotlib
 
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [9, 5]
 
import pandas
 
# End boilerplate.
 
# Read in this single column of numbers with a header line at
# the top. So not really a CSV, but reading it as a CSV works.
df=pandas.read_csv("/home/andy/hobgoblin_md3_read_sizes.csv",
                   delim_whitespace=True)
 
total_ios = len(df.index)
min_size = 0
max_size = df["IO_Size"].max()
# mode() returns a Series of the most common value(s); take the first.
mode = df["IO_Size"].mode().iloc[0]
mean = df["IO_Size"].mean()
 
# Generate a histogram.
#
# The 'bins' thing uses bins of size 4 offset by 2 either way, so
# the first bin will be -2 to 2, the next one 2 to 6, then 6 to
# 10 and so on. The purpose of that is so that the bar centres on
# the interesting value. For example, 4KiB IO will fall into the
# "2 to 6" bin, the middle of which is 4.
#
# np.arange is necessary because it is possible to get IO in an
# odd number of blocks, e.g. 1 or 3 blocks, which when converted
# to KiB will be non-integer. The upper bound is max_size + 6 so
# that the final bin edge always lies beyond max_size; otherwise
# the largest IOs can fall outside the last bin and be dropped.
ax_list = df.plot(kind='hist',
                  subplots=True,
                  sharex=True,
                  sharey=True,
                  title='hobgoblin md3 read IO sizes',
                  color='g',
                  bins=np.arange(min_size - 2, max_size + 6, 4))
ax_list[0].set_xlim(0, max_size)
ax_list[0].set_xlabel('IO size (KiB)')
ax_list[0].tick_params(axis='x', rotation=-45)
# Major ticks every 4KiB, minor every 2 KiB in between.
ax_list[0].xaxis.set_major_locator(plt.MultipleLocator(4))
ax_list[0].xaxis.set_minor_locator(plt.MultipleLocator(2))
# Convert y tick labels to percentages by multiplying them by
# 100.
y_vals = ax_list[0].get_yticks()
ax_list[0].set_yticklabels(['{:,.1%}'.format(v/total_ios) for v in y_vals])
 
ax_list[0].grid(b=True, linestyle=':', color='#666666')
ax_list[0].legend(["Read IOs"])
 
# x axis label doesn't seem to fit on unless the bottom is
# extended a little.
plt.subplots_adjust(bottom=0.15)
 
# Add an info box with mean/mode values in.
props = dict(boxstyle='round', facecolor='wheat', alpha=0.5)
ax_list[0].text(0.2,
                0.95,
                "Mode: %3uKiB\nMean: %6.2fKiB" % (mode, mean),
                transform=ax_list[0].transAxes,
                fontsize=14,
                verticalalignment='top',
                bbox=props)

This results in the histogram images you saw above.

Conclusions ^

For me, there is quite a lot of variance between arrays in how they are being used. While 4KiB IO is the most common size, the average is pulled up to around 14KiB, or in some cases near 36KiB. It is probably worth examining 4, 8, 16 and 32KiB IO.

While a 75/25 split of read/write random IO might be worth looking at, for my workload 40/60 is actually more relevant.