Every time I go to test file integrity (e.g. are these two files the same? Is this file the same as a backup copy of this file?) muscle memory makes me type `md5sum`. Then my brain reprimands me:
Wait! `md5` is insecure! It’s broken! Use `SHA256`!
Hands are wrong and brain is wrong. Just another day in the computer mines.
Well, if it was a secure hash function you were looking for, where someone might tamper with these files, then brain is not so wrong: MD5 has long been known to be too weak and is trivially subject to collision attacks.
But for file integrity on trusted data, like where you are checking for bitrot, cosmic rays or just everyday changes, you don’t need a cryptographically secure hash function. `md5sum` is safe enough for this, but in terms of performance it sucks. There have been better hash functions around, packaged in major operating systems, for years. Such as xxHash!
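For the "are these two files the same?" case, a minimal sketch might look like the following. The helper names `pick_hasher` and `same_content` are made up for illustration, and the fallback to `md5sum` is just an assumption about what you would want on a machine where xxHash is not installed yet.

```shell
#!/bin/sh
# Sketch only: pick_hasher and same_content are hypothetical helper names.

# Prefer xxh128sum (from the xxhash package); fall back to md5sum if it is missing.
pick_hasher() {
    if command -v xxh128sum >/dev/null 2>&1; then
        echo xxh128sum
    else
        echo md5sum
    fi
}

# same_content FILE1 FILE2
# Returns 0 if the hashes match, 1 if they differ, 2 if a file is missing or unreadable.
same_content() {
    for f in "$1" "$2"; do
        [ -f "$f" ] && [ -r "$f" ] || return 2
    done
    hasher=$(pick_hasher)
    # Hash via stdin so the output is "<hash>  <name>"; keep only the hash field.
    hash1=$("$hasher" < "$1" | cut -d' ' -f1)
    hash2=$("$hasher" < "$2" | cut -d' ' -f1)
    [ "$hash1" = "$hash2" ]
}
```

Something like `same_content notes.txt /mnt/backup/notes.txt && echo identical` (paths invented here) would then do the check. This is only appropriate for trusted data, as discussed above.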
Maybe like me you reach for `md5sum` because…
- You always have!
- It’s right there!
- It’s pretty fast though, right?
On Debian, `xxhash` is right there after you have typed:

    $ sudo apt install xxhash
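Once it is installed, checking a backup for bitrot could be sketched roughly like this. The function names `make_manifest` and `verify_manifest` and the `HASHER` variable are hypothetical, not part of xxHash. The sketch assumes `xxh128sum` accepts `-c` to verify a checksum file in the same way `md5sum` does.

```shell
#!/bin/sh
# Sketch only: make_manifest, verify_manifest and HASHER are hypothetical names.
# HASHER defaults to xxh128sum; any tool with md5sum-style output and -c should work.

# make_manifest DIR MANIFEST
# Hash every regular file under DIR, writing "<hash>  ./relative/path" lines to MANIFEST.
# Keep MANIFEST outside DIR so it does not end up hashing itself.
make_manifest() {
    hasher=${HASHER:-xxh128sum}
    ( cd "$1" && find . -type f -exec "$hasher" {} + ) > "$2"
}

# verify_manifest DIR MANIFEST
# Re-hash the files listed in MANIFEST relative to DIR; non-zero exit on any mismatch.
verify_manifest() {
    hasher=${HASHER:-xxh128sum}
    # Make the manifest path absolute, because we change directory before reading it.
    case $2 in
        /*) manifest=$2 ;;
        *)  manifest=$PWD/$2 ;;
    esac
    ( cd "$1" && "$hasher" -c "$manifest" )
}
```

You would run `make_manifest` once against known-good data, then `verify_manifest` against the original or a backup copy whenever you want to check for changes.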
Here’s me hashing the first 1GiB of one of my desktop machine’s NVMe drives.
Hash Function | CPU seconds (user+kernel) | %CPU |
---|---|---|
XXH128 | 0.21 | 10 |
XXH64 | 0.21 | 11 |
MD5 | 1.38 | 56 |
SHA1 | 1.72 | 62 |
SHA512 | 2.36 | 70 |
SHA256 | 3.76 | 80 |
I think this scenario was a good test as NVMe drives are really fast, so this focuses on the cost of the algorithm rather than the IO. But if you want to see something similar for slow storage, here is me doing the same by reading 10GiB off a pair of 7,200RPM SATA drives:
Hash Function | CPU seconds (user+kernel) | %CPU |
---|---|---|
XXH128 | 2.44 | 5 |
XXH64 | 4.76 | 10 |
MD5 | 16.62 | 35 |
SHA1 | 18.00 | 38 |
SHA512 | 23.74 | 51 |
SHA256 | 35.99 | 69 |
This is the command and output behind the 1GiB numbers:

    $ for sum in md5 sha1 sha256 sha512 xxh64 xxh128; do \
        sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'; \
        printf "# %ssum\n" "$sum"; \
        sudo dd if=/dev/sda bs=1M count=1024 status=none \
            | /usr/bin/time -f 'CPU time %Us (user), %Ss (kernel); %P total CPU' "${sum}sum"; \
    done
    # md5sum
    c5515c49de5116184a980a51c7783d9f  -
    CPU time 1.28s (user), 0.10s (kernel); 56% total CPU
    # sha1sum
    60ecdfefb6d95338067b52118d2c7144b9dc2d63  -
    CPU time 1.62s (user), 0.10s (kernel); 62% total CPU
    # sha256sum
    7fbffa1d96ae2232aa754111597634e37e5fd9b28ec692fb6deff2d020cb5bce  -
    CPU time 3.68s (user), 0.08s (kernel); 80% total CPU
    # sha512sum
    eb4bffafc0dbdf523cc5229ba379c08916f0d25e762b60b2f52597acb040057a4b6795aa10dd098929bde61cffc7a7de1ed38fc53d5bd9e194e3a84b90fd9a21  -
    CPU time 2.29s (user), 0.07s (kernel); 70% total CPU
    # xxh64sum
    d43417824bd6ef3a  stdin
    CPU time 0.06s (user), 0.15s (kernel); 11% total CPU
    # xxh128sum
    e339b1c3c5c1e44db741a2e08e76fe66  stdin
    CPU time 0.02s (user), 0.19s (kernel); 10% total CPU
Just use `xxhsum`!