Unix error detection – cksum, md5sum, and shasum commands

This post is a tutorial on error checking in Unix. The commands covered here are available on every Unix system I have tested, though their names and output formats differ slightly from one system to the next.

Error detection codes are used to detect errors caused either by disturbances in a noisy communication channel or by deliberate modification by malicious users. There are many methods of error checking. One is called a checksum or hash: a function whose outputs are spread roughly evenly over its range is applied to the bytes (or blocks) of a file at both ends of the channel, producing a much shorter value called the sum. The sums from both ends are then compared to see whether they match (with the commands in this post, usually by a person reading them rather than by a program). If they don’t match, something has gone wrong in transmission, and the file needs to be retransmitted.
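As a rough illustration of the idea, here is a minimal Python sketch of a toy checksum (the sum of all bytes, modulo 65536). This is not what cksum, md5sum, or shasum compute; the name toy_checksum and the sample message are made up for the example.

```python
def toy_checksum(data: bytes) -> int:
    """Toy checksum: sum of all byte values, reduced modulo 2**16."""
    return sum(data) % 65536

# The sender computes a sum before transmission.
sent = b"hello"
sent_sum = toy_checksum(sent)

# Simulate channel noise: flip the lowest bit of the first byte.
received = bytes([sent[0] ^ 0x01]) + sent[1:]
received_sum = toy_checksum(received)

print(sent_sum, received_sum)  # 532 533
print("match" if sent_sum == received_sum else "mismatch: retransmit")
```

A sum this simple is easy to fool (reordering the bytes leaves it unchanged, so b"ab" and b"ba" get the same sum), which is one reason real tools use stronger functions.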

The first error detection command we will look at is cksum. It produces a checksum of a file by computing the remainder of a polynomial division of its contents, a technique known as a cyclic redundancy check (CRC). It then prints the checksum and the size of the file side by side.

bash-3.2$ cksum hackers.png
1967572667 13865 hackers.png

Here the first value is the checksum and the second value is the size of the file, followed of course by the name of the file.
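To make the polynomial division less abstract, here is a short Python sketch of the CRC as I understand it from the POSIX description of cksum: the polynomial 0x04C11DB7, processed most significant bit first, with the file length appended to the data and the result inverted at the end. Treat it as a sketch under those assumptions, not as the source of any particular cksum implementation; the function names are my own.

```python
POLY = 0x04C11DB7  # generator polynomial (the x^32 term is implicit)

def _feed(crc: int, byte: int) -> int:
    """Shift one byte into the 32-bit remainder, most significant bit first."""
    crc ^= byte << 24
    for _ in range(8):
        if crc & 0x80000000:
            crc = ((crc << 1) ^ POLY) & 0xFFFFFFFF
        else:
            crc = (crc << 1) & 0xFFFFFFFF
    return crc

def posix_cksum(data: bytes) -> int:
    """CRC over the data followed by its length, then bitwise inverted."""
    crc = 0
    for byte in data:
        crc = _feed(crc, byte)
    # Append the length, least significant byte first, using as few
    # bytes as needed (none at all for an empty input).
    length = len(data)
    while length:
        crc = _feed(crc, length & 0xFF)
        length >>= 8
    return crc ^ 0xFFFFFFFF

# Same layout as the command's output: checksum, then size.
print(posix_cksum(b""), 0)  # 4294967295 0
```

To try it on a real file, something like posix_cksum(open("hackers.png", "rb").read()) should be comparable with the first number cksum prints, assuming your cksum follows the same definition.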

We now compare these values to the ones supplied by the site we downloaded the file from. If they match, the file almost certainly downloaded correctly (it’s not an absolute guarantee, but the probability of an undetected accidental error is negligible).

cksum is used because it is fast and simple to implement. It has a couple of problems, though. First, implementations vary: different CRC tools use different polynomials and conventions. When using cksum, you have to make sure you and the supplier of the file are using the same CRC algorithm.

The second problem is that while CRCs protect a file from accidental modification by noise, they do not protect it from deliberate modification by malicious users. Multiple files can have the same checksum, and with a CRC it is easy to construct them on purpose, so a middleman can inject a fake file into the network that differs from the one we are trying to download but has the same checksum. To guard against this sort of attack, we need what is called a one-way hash: a hash that is computationally very difficult to reverse. Different files can still share a one-way hash, but finding a file that produces a given hash value is impractical.
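One way to see why a CRC is easy to forge is that its output has a simple algebraic structure. The Python sketch below uses zlib.crc32, which is a different CRC-32 variant from the one cksum uses (an assumption made for convenience; the same kind of structure applies to CRCs in general). It shows that the CRC of three equal-length messages XORed together can be predicted from their individual CRCs alone. The helper xor_bytes and the sample messages are made up for the example.

```python
import zlib

def xor_bytes(*chunks: bytes) -> bytes:
    """XOR equal-length byte strings together, position by position."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        assert len(chunk) == len(out), "inputs must have equal length"
        for i, byte in enumerate(chunk):
            out[i] ^= byte
    return bytes(out)

# Three arbitrary messages of the same length (8 bytes each).
a = b"AAAAAAAA"
b = b"hackers!"
c = b"12345678"

# Predict the CRC of the combined message without ever hashing it...
predicted = zlib.crc32(a) ^ zlib.crc32(b) ^ zlib.crc32(c)
# ...and compare with the CRC actually computed over the combination.
actual = zlib.crc32(xor_bytes(a, b, c))

print(predicted == actual)  # True
```

An attacker can exploit this predictability to adjust a few bytes of a fake file until its CRC comes out to whatever value is wanted. A one-way hash is designed not to have this kind of exploitable structure.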

There are a couple of these algorithms in common use on Unix systems. One is the MD5 algorithm, and the other is the SHA series of algorithms (the shasum command defaults to SHA-1). Each of these message digest algorithms, as they are called, has at least two command names: the one used on BSD systems (including Mac OS X) and the one used on Linux. Here is a table showing each of them:

Algorithm   Linux form   BSD form
MD5         md5sum       md5
SHA-1       sha1sum      shasum

Let’s look at a sample command in Mac OS X:

bash-3.2$ md5 hackers.png
MD5 (hackers.png) = d28903b7e06cd169f5e4ff59be348fb6

Unlike the cksum output, which is in decimal, the MD5/SHA-1 output is in hexadecimal. It is also much longer. Again, we compare this value to the one given by the providers of the file to make sure it is correct.
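If you would rather not compare long hexadecimal strings by eye, the comparison can be scripted. Here is a minimal sketch using Python's standard hashlib module; the function names file_digest and matches_published are my own, and the demo hashes a throwaway temporary file rather than hackers.png.

```python
import hashlib
import os
import tempfile

def file_digest(path: str, algorithm: str = "md5", chunk_size: int = 65536) -> str:
    """Return the hex digest of a file, reading it in chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def matches_published(path: str, published_hex: str, algorithm: str = "md5") -> bool:
    """Compare a file's digest with a published one, ignoring case and whitespace."""
    return file_digest(path, algorithm) == published_hex.strip().lower()

# Demo on a temporary file containing the five bytes "hello".
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"hello")
print(file_digest(demo_path))  # 5d41402abc4b2a76b9719d911017c592
print(matches_published(demo_path, "5D41402ABC4B2A76B9719D911017C592"))  # True
os.remove(demo_path)
```

For the file in this post, the call would look like matches_published("hackers.png", "d28903b7e06cd169f5e4ff59be348fb6"); pass "sha1" as the third argument to check a SHA-1 value instead.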

EDIT (November 11, 2016): Man, I’ve been putting off correcting this for too long. I said something really stupid in this post, which was that a one-way hash has only one file mapped onto each hash. Obviously, this is mathematically impossible: there are far more possible files than possible hash values, so no function from files to hashes can be one-to-one. I have corrected it now. God, this is a cringeworthy mistake, and I’m totally embarrassed by it.