Cryptographic Hash Functions

This is a lead-up article to How Blockchains Work that I will publish soon. Cryptographic hash functions are vital to the blockchain, so I'm writing this to give a little bit of background. However, these functions are useful in so many applications, that it merits its own explanation.

Note: Cryptographic hash functions are complex mathematical functions, well defined by academic and industry researchers. Every aspect of it has precise terminology, and involves advanced math. This article is a simplification of those terms, and is meant as an easy introduction only.

What We Need

The main task that hashes do best is uniquely identify things. Identification is an important ability, whether for humans or for computers. What "things" are we talking about? For computers, it's any digital object. It could be a string of text - a single letter "a", for example - or it could be a large file several gigabytes in size. As long as it can be represented by 1s and 0s, it can be hashed.

Hash Functions

Let's start with regular hash functions. A hash function is a program, or an algorithm, that takes one input - the thing to be hashed - and produces one output: a short string of text. Technically, it's a number, but it's often represented as text. Think of it as a black box.

  Curabitur at arcu at arcu   |
  bibendum ultrices quis eu   |
  dui. Quisque gravida mi     |
  erat, eu finibus nisl eli   |
  aliquet id. Aliquam vitae   | → ■ ⇒ fb3e35968deeb5331a8553482288f50cb79b8a52
  viverra, ligula et tempus   |
  aliquet, sem dui iaculis    |
  ullamcorper eros, non dui   |
  faucibus a elit tellus...   |

This black box is a hash function if it has these properties:

the output string is always the same length
it produces the exact same string every time a particular input is given
it can produce an output very quickly, even if the input is big
the outputs are generally distributed evenly, and not "clumped together" (this is not as strict of a requirement)

So let's take a simple example. As we'll see, it's not a very good one. Suppose we have a function that can hash English words. Words go in, hashes come out. And let's suppose also that what the function does is return the length of the word. We'll call it lengthhash.

   lengthhash
apple    → ■ ⇒ 5
ball     → ■ ⇒ 4
cat      → ■ ⇒ 3
dog      → ■ ⇒ 3
elephant → ■ ⇒ 8

Does it meet all our requirements? The output string won't always be the same length - some words might have ten or more letters - but this can be fixed by adding a couple zeroes at the beginning of the number. Apparently, the longest English word has 189 819 letters, so let's set the length at six. Instead of "5", we'll get "000005", and instead of "10", we'll output "000010", and so on. Will it produce the same output every time the same input is given? Yes, inputting "apple" will always produce "000005", no matter how many times we try. Finding the length of a word is always very quick, even for long words.

Lastly, are the outputs distributed roughly evenly? No, they're not. This is because the average length of an English word is 5 letters, and most words are close to that. This means there were will be some hashes (like "000005" and "000003") that represent many words, and other hashes (like "029884") that represent few or no words. This is not a very good hash function.

Because many common words will produce the same output hash, lengthhash is not good at uniquely identifyting words. We could try a similar function that instead converts each letter into a number, then adds all those numbers together, but sumhash would have a very similar problem to lengthhash. Also, any function that involves random numbers wouldn't work, since it would produce different outputs when given the same input.

There are many (real) hashing functions that are much better at this, and do fit all the properties.

Cryptographic Hash Functions

This brings us to cryptographic hash functions. They are hash functions, with the same properties, but with a couple extra:

outputs "appear" random, meaning
- finding a "collision" is near impossible
- it is extremely difficult to find out what the input was when looking at just the output
changing a single bit of the input will make the output very different

A "collision" is when two different inputs just so happen to produce the same output. In our previous lengthhash example, both "cat" and "dog" yielded "000003" as their hashes. This is a collision. Collisions are guaranteed to happen, since there is a limited number of possible hashes, and a near-infinite number of possible inputs. However, some hash functions make it so hard to find two inputs that result in the same output, that we basically say that it's impossible.

Another extra property of cryptographic hash functions is that it is difficult to find out what the input was if someone shows you a hash. Because a hash function usually produces an output that's smaller than the input, you're losing information, and that partially fulfills this property. If I gave you the string "000003", and it was produced by lengthhash, it's rather difficult to know what word was the input. The problem is though, that it does narrow the possible inputs to all three-lettered words. A true cryptographic hash function should not narrow down the possibilities.

To demonstrate the last extra property, let me show you some example outputs from a real cryptographic hash function called SHA-1. This function has been around a while, and is well known. It's not suitable for extreme security situations, but it is useful in many applications.

input	SHA-1 output
My secret	bc19e826f52e2496f3682e15dad1d4647e0acb18
my secret	6c45f3c983aebe6cb9f185602d23ba3f3589ff6c

Notice how the inputs are only different by one letter - the 'm' is uppercase in one, and lowercase in the other - and yet, the outputs are totally different.

Examples

While we're at it, here are some other examples of outputs of a related function, SHA-256. Look for the properties of hash functions we discussed previously.

input	SHA-256 output
a	ca978112ca1bbdcafac231b39a23dc4da786eff8147c4e72b9807785afee48bb
a	ca978112ca1bbdcafac231b39a23dc4da786eff8147c4e72b9807785afee48bb
A	559aead08264d5795d3909718cdd05abd49572e84fe55590eef31a88a08fdffd
abc	ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
0123456789	84d89877f0d4041efb6bf91a16f0248f2fd573e6af05c19f96bedb9f882f7882
The quick grey fox jumps over the lazy brown dog	99c5c1e9107f7b8a7a547d97ba59aa89eba1d52c4ea1e080ec0957e7f07c5407
The quick grey fox jumps over the lazy brown dog.	09aac223b5393722436d2867f830a9d0c0d261bdc151dbbedc368302e0c1ab56
(this pdf copy of the bitcoin whitepaper)	b1674191a88ec5cdd733e4240a81803105dc412d6c6708d53ab94fc248f4f553

Applications

Here are some areas where cryptographic hash functions might be used:

making sure a file was downloaded correctly (without downloading the whole file again)
as a check if you need to download a file at all (if both computers have a file with the same hash, then there's no need to copy it over)
to identify versions of files in a code version control system, like git
for a website to safely store their users' login passwords

...and in blockchains, which we will discuss later!

Resources

For more information, try these two easy-to-understand videos from Computerphile on YouTube: the first is a general overview of hash functions, and the second is a broad explanation of how the SHA-1 algorithm works under the hood. I got most of the list of applications from wikipedia.

Try it yourself! You can find any number of online converters that will take any string you type (or file that you upload), and hash it. For the above examples, I used this site and this site.

Cryptographic Hash Functions

What We Need

Hash Functions

Cryptographic Hash Functions

Examples

Applications

Resources

Add a comment

Archives

Syndicate