message digest exercise

Create a working file

Create a file containing a megabyte of random values:

dd if=/dev/urandom of=~/myfile1M bs=1024 count=1024

Examine it in hex if you like:

xxd -g1 myfile1M

We'll manipulate the content of files with message digest utilities and refer to the files' contents as "messages."

Generate digests

You can produce a string whose value is 1) practically unique to this file, 2) dependent on its whole content, and 3) short. Such a string is called a "digest" of the content of the file. There are a couple of different utilities that can do it: md5sum and sha1sum. They use different algorithms ("Message Digest" and "Secure Hash Algorithm") that both output strings with the properties mentioned. Try both:

md5sum myfile1M
sha1sum myfile1M

In the first case the md5 algorithm's output string, derived from a megabyte of input, is itself just 32 characters. sha1 outputs just 40 characters. (The digests' actual sizes are a constant 16 bytes and 20 respectively. The encoded outputs are double that since normal hexadecimal encoding uses 2 ascii characters to represent a single byte.)

Digest unique to a given message

Different messages produce different digests, and the same message always produces the same digest. A digest utility, given the same stuff from different files or alternatively a pipe, will produce the same digest every time.

echo "How now brown cow" > mytestfile1
echo "How now brown cow" > mytestfile2
md5sum mytestfile1
md5sum mytestfile2
echo "How now brown cow" | md5sum

Note that the digest is the same each time (namely 5be031ac67e80d243ff758ca436f6d3a), because so is the message. The digest fingerprints the file. Please don't try to prove that other messages produce other digests by trying them all. You don't have time.

Digest depends on "every bit" of the message

The digest is sensitive to the whole message, producing highly dissimilar digests even for highly similar messages. Consider the message "ABC" in ascii. In binary A is 01000001, B is 01000010, and C is 01000011. So the message "ABB" differs from "ABC" by just a single bit (the last one); while the other 23 are all the same. Same with "ABA," which differs from "ABC" solely in the next-to-last bit. Generate the 3 corresponding digests: 

echo -n "ABC" | md5sum
echo -n "ABB" | md5sum
echo -n "ABA" | md5sum

echo -n "ABC" | sha1sum
echo -n "ABB" | sha1sum
echo -n "ABA" | sha1sum

The inputs are nearly identical; what about the 3 corresponding digests?

If you receive a file and know the digest value the sender generated for the original, running the digest utility on your copy will reveal if yours departs from his in the slightest (i.e., if it changed in transit). Even by just one bit. Digests are used this way for data validation like checksums.

Digest size for large files remains short

Digest size is constant. A few dozen bits only, characteristic of the algorithms. They don't diverge for large files. So a digest of the same small size represents either a huge or tiny file. Here is a shell script you can use to generate some files ranging from 1 to 64 megabytes in size.

#!/bin/bash
rm -f ~/myfile?
dd if=/dev/urandom of=~/myfile1 bs=1024 count=1024 2> /dev/null
i=1
while [ $i -le 6 ]
do
  echo "2 x myfile$i -> myfile$[$i+1]"
  cat ~/myfile$i ~/myfile$i >> ~/myfile$[$i+1]
  let i=i+1
done
ls -l ~/myfile?

Key it in as "makebigfiles" or download it per your instructor. Make it executable.

chmod +x makebigfiles

Run it. Note the range of file sizes it produced. Now produce digests for those files large and small:

md5sum myfile?
sha1sum myfile?

All the digests are of constant size. It doesn't matter how big the file itself.