For file integrity testing, you’re wasting your time with md5

Every time I go to test file integrity — e.g. are these two files the same? Is this file the same as a backup copy of this file? — muscle memory makes me type md5sum. Then my brain reprimands me:

Wait! md5 is insecure! It’s broken! Use SHA256!

Hands are wrong and brain is wrong. Just another day in the computer mines.

Well, if it was a secure hash function you were looking for, where someone might tamper with these files, then brain is not so wrong: MD5 has long been known to be too weak and is trivially subject to collision attacks.

But for file integrity on trusted data, like where you are checking for bitrot, cosmic rays or just everyday changes, you don’t need a cryptographically secure hash function. md5sum is safe enough for this, but in terms of performance it sucks. There have been better hash functions around, packaged in major operating systems, for years. Such as xxHash!

Maybe like me you reach for md5sum because…

  • You always have!
  • It’s right there!
  • It’s pretty fast though right?

On Debian, xxhash is right there after you have typed:

$ sudo apt install xxhash
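
For the "are these two files the same?" case, something like the following might do. It is only a sketch: the function names file_hash and same_file are made up for illustration, and the fallback to md5sum when xxh128sum is missing is my assumption, there so the snippet still runs on a machine without xxhash.

```shell
#!/bin/sh
# Prefer xxh128sum; fall back to md5sum if xxhash is not installed.
# (The fallback is an assumption of this sketch.)
if command -v xxh128sum >/dev/null 2>&1; then
    HASHER=xxh128sum
else
    HASHER=md5sum
fi

# file_hash FILE: print only the hex digest of FILE.
# Feeding the file on stdin keeps its name out of the output line.
file_hash() {
    "$HASHER" < "$1" | cut -d' ' -f1
}

# same_file A B: exit 0 when both files produce the same digest, non-zero otherwise.
same_file() {
    [ "$(file_hash "$1")" = "$(file_hash "$2")" ]
}
```

You would then call it like `same_file photo.raw /backup/photo.raw && echo identical`, where both paths are placeholders.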

Here’s me hashing the first 1GiB of one of my desktop machine’s NVMe drives.

Hash Function   CPU seconds (user+kernel)   %CPU
XXH128          0.21                        10
XXH64           0.21                        11
MD5             1.38                        56
SHA1            1.72                        62
SHA512          2.36                        70
SHA256          3.76                        80

I think this scenario was a good test as NVMe drives are really fast, so this focuses on the cost of the algorithm rather than the IO. But if you want to see similar for slow storage, here is me doing the same by reading 10GiB off a pair of 7,200RPM SATA drives:

Hash Function   CPU seconds (user+kernel)   %CPU
XXH128          2.44                        5
XXH64           4.76                        10
MD5             16.62                       35
SHA1            18.00                       38
SHA512          23.74                       51
SHA256          35.99                       69

Here are the commands and raw output behind the first (1GiB) table:

$ for sum in md5 sha1 sha256 sha512 xxh64 xxh128; do \
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'; \
printf "# %ssum\n" "$sum"; \
sudo dd if=/dev/sda bs=1M count=1024 status=none \
| /usr/bin/time -f 'CPU time %Us (user), %Ss (kernel); %P total CPU' "${sum}sum"; \
done
# md5sum
c5515c49de5116184a980a51c7783d9f  -
CPU time 1.28s (user), 0.10s (kernel); 56% total CPU
# sha1sum
60ecdfefb6d95338067b52118d2c7144b9dc2d63  -
CPU time 1.62s (user), 0.10s (kernel); 62% total CPU
# sha256sum
7fbffa1d96ae2232aa754111597634e37e5fd9b28ec692fb6deff2d020cb5bce  -
CPU time 3.68s (user), 0.08s (kernel); 80% total CPU
# sha512sum
eb4bffafc0dbdf523cc5229ba379c08916f0d25e762b60b2f52597acb040057a4b6795aa10dd098929bde61cffc7a7de1ed38fc53d5bd9e194e3a84b90fd9a21  -
CPU time 2.29s (user), 0.07s (kernel); 70% total CPU
# xxh64sum
d43417824bd6ef3a  stdin
CPU time 0.06s (user), 0.15s (kernel); 11% total CPU
# xxh128sum
e339b1c3c5c1e44db741a2e08e76fe66  stdin
CPU time 0.02s (user), 0.19s (kernel); 10% total CPU

Just use xxhsum!
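
For the bitrot case, where you want to re-check a whole directory against checksums recorded earlier, a manifest is probably the easiest approach. This is a rough sketch: make_manifest and check_manifest are hypothetical names, and the md5sum fallback is again my assumption for machines without xxhash. It relies on the -c (check) option, which xxhsum shares with md5sum.

```shell
#!/bin/sh
# Prefer xxh128sum; fall back to md5sum if xxhash is not installed.
# (The fallback is an assumption of this sketch.)
if command -v xxh128sum >/dev/null 2>&1; then
    HASHER=xxh128sum
else
    HASHER=md5sum
fi

# make_manifest DIR MANIFEST: record one checksum line per regular file under DIR.
# MANIFEST must be an absolute path outside DIR, because the hashing runs after
# a cd into DIR so that the recorded file names are relative to it.
make_manifest() {
    ( cd "$1" && find . -type f -exec "$HASHER" {} + ) > "$2"
}

# check_manifest DIR MANIFEST: re-hash every file listed in MANIFEST.
# Exits non-zero if any file is missing or no longer matches its checksum.
check_manifest() {
    ( cd "$1" && "$HASHER" -c "$2" )
}
```

Run `make_manifest /srv/photos /var/tmp/photos.xxh128` once, then `check_manifest /srv/photos /var/tmp/photos.xxh128` whenever you want to verify. Both paths are placeholders.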