A Python 3 implementation of the MD5 Hash

Today we'll see a Python 3 implementation of the MD5 hashing algorithm, which was originally designed by Ron Rivest. Even though the MD5 algorithm has been compromised and is not longer secure, it is still in use today. For example, CS ISOs of Linux distributions often provide MD5 sums in addition to SHA256 sums.

There are a lot of C implementations of MD5 out there. However, they are not easy to read, especially for someone who is not used to C. Python 3, being closer to "math" rather than the computer, reads more like a definition of the algorithm. Of course, the Python implementation is slower than the C one.

This Python implementation is ideal for experimenting with MD5 and learning about it. The code I give below is not the most compact possible, but I believe it is easy to understand.

The complete source code will be at the end of the post, after we go through the code little by little. I assume that you have read or are willing to read the complete description of the MD5 algorithm in RFC 1321 or some similar source, so I'll be brief.

We'll start with some basic utilities:

The binascii module will help us convert bytes to integers and vice-versa. The others are just so we can use our program on files of our choosing on the command line. To start the actual program, let's define a bunch of constants.

What are these constants? Element i of this array is given by 2**32*abs(math.sin(i+1)), in other words, the first 32 binary digits of $\sin(i)$ for $i=1,2\dots,65$ after the zero. These values are sort of like pseudo-random numbers that will be mashed with the message in various byzantine ways in the compression function. It's pretty typical for hash functions to use the bits of irrational numbers like the square roots of primes or the digits of pi for these constants. The reasoning is that these numbers have some desirable pseudo-random like properties and have a compact description.

One aspect of MD5 is that it has some rudimentary permuting in terms of circular shifts. So let's define a "left-circular shift" function that cuts off a bunch of the most significant bytes and places them in the least significant part:

What's with all the 32s? The MD5 algorithm is designed to operate on 32-bit integers. The left circular shift is as a 32-bit integer. That is important, because the binary string 1000 circularly shifted to the left as a four-bit integer is 0001 but as a 32 bit integer it is 10000.

The next thing is that MD5 processes a padded message in 512-bit blocks. But in the actual compression, we divide this 512-bit block further into 16 32-bit blocks. So, let's suppose we have a message block in binary, and we want to divide it. Let's write a function for that:

Here, we're using the int.from_bytes function to take a sequence of bytes and convert it to an integer. The important thing here is that we specify byteorder="little". That's because this is in the MD5 RFC 1321 specification:

[…] a sequence of bytes can be interpreted as a sequence of 32-bit words, where each consecutive group of four bytes is interpreted as a word with the low-order (least significant) byte given first.

If we used byteorder="big", then we would end up with a different hash function that is very similar to MD5. Now we have to define four bitwise functions that will be used in the compression function of MD5:

These will mix up the data of the message with the previous output of the hash in iterations over blocks. To aid in doing so we define more functions:

Here, a,b,c,d will be the previous outputs of the compression function and M will be one 32-bit piece of a 512-bit block. The value t will be one of the 64 constants we defined earlier in the array SV and the s will be some circular shift value.

Let's define two more helper functions:

The function bitlen is self-explanatory. The function fmt8 formats a number as a hexadecimal value with the smallest bytes first. That's also in the MD5 specification. We will pass four numbers to this function, each representing a 32-bit integer. The final four 32-bit integers will be concatenated to produce the 128-bit hash value.

Let's define the main hashing algorithm now:

The algorithm first pads the message with a '1' bit, then a bunch of zeros until the padded message length in bits is congruent to 448 modulo 512. Finally, a 64 bit representation of the length of the original message modulo $2^{64}$ is appended to the message in byteorder="little" convention.

The message is then mushed in various intricate ways, one 512-bit block at a time. Each transformation FF, GG, HH, or II is called a round, and there are 64 of them. The results of these 64 rounds are added to a tally modulo $2^{32}$, and the final result of the four tallies makes up the hash. To get this to work as a script, we can add:

Here is a test of this algorithm, assuming this code is in a file called md5sum.py:

The second output is of the coreutils md5sum program written in C that is present on my computer. Here is the entire code; feel free to use it for your own education:

Leave a comment

Fields marked with * are required. LaTeX snippets may be entered by surrounding them with single dollar signs. Use double dollar signs for display equations.