This is a lead-up article to How Blockchains Work that I will publish soon. Cryptographic hash functions are vital to the blockchain, so I'm writing this to give a little bit of background. However, these functions are useful in so many applications, that it merits its own explanation.
Note: Cryptographic hash functions are complex mathematical functions, well defined by academic and industry researchers. Every aspect of it has precise terminology, and involves advanced math. This article is a simplification of those terms, and is meant as an easy introduction only.
The main task that hashes do best is uniquely identify things. Identification is an important ability, whether for humans or for computers. What "things" are we talking about? For computers, it's any digital object. It could be a string of text - a single letter "a", for example - or it could be a large file several gigabytes in size. As long as it can be represented by 1s and 0s, it can be hashed.
Let's start with regular hash functions. A hash function is a program, or an algorithm, that takes one input - the thing to be hashed - and produces one output: a short string of text. Technically, it's a number, but it's often represented as text. Think of it as a black box.
Curabitur at arcu at arcu | bibendum ultrices quis eu | dui. Quisque gravida mi | erat, eu finibus nisl eli | aliquet id. Aliquam vitae | → ■ ⇒ fb3e35968deeb5331a8553482288f50cb79b8a52 viverra, ligula et tempus | aliquet, sem dui iaculis | ullamcorper eros, non dui | faucibus a elit tellus... |
This black box is a hash function if it has these properties:
So let's take a simple example. As we'll see, it's not a very good one. Suppose we have a function that can hash English words. Words go in, hashes come out. And let's suppose also that what the function does is return the length of the word. We'll call it
lengthhash apple → ■ ⇒ 5 ball → ■ ⇒ 4 cat → ■ ⇒ 3 dog → ■ ⇒ 3 elephant → ■ ⇒ 8
Does it meet all our requirements? The output string won't always be the same length - some words might have ten or more letters - but this can be fixed by adding a couple zeroes at the beginning of the number. Apparently, the longest English word has 189 819 letters, so let's set the length at six. Instead of "5", we'll get "000005", and instead of "10", we'll output "000010", and so on. Will it produce the same output every time the same input is given? Yes, inputting "apple" will always produce "000005", no matter how many times we try. Finding the length of a word is always very quick, even for long words.
Lastly, are the outputs distributed roughly evenly? No, they're not. This is because the average length of an English word is 5 letters, and most words are close to that. This means there were will be some hashes (like "000005" and "000003") that represent many words, and other hashes (like "029884") that represent few or no words. This is not a very good hash function.
Because many common words will produce the same output hash,
lengthhash is not good at uniquely identifyting words. We could try a similar function that instead converts each letter into a number, then adds all those numbers together, but
sumhash would have a very similar problem to
lengthhash. Also, any function that involves random numbers wouldn't work, since it would produce different outputs when given the same input.
There are many (real) hashing functions that are much better at this, and do fit all the properties.
This brings us to cryptographic hash functions. They are hash functions, with the same properties, but with a couple extra:
A "collision" is when two different inputs just so happen to produce the same output. In our previous
lengthhash example, both "cat" and "dog" yielded "000003" as their hashes. This is a collision. Collisions are guaranteed to happen, since there is a limited number of possible hashes, and a near-infinite number of possible inputs. However, some hash functions make it so hard to find two inputs that result in the same output, that we basically say that it's impossible.
Another extra property of cryptographic hash functions is that it is difficult to find out what the input was if someone shows you a hash. Because a hash function usually produces an output that's smaller than the input, you're losing information, and that partially fulfills this property. If I gave you the string "000003", and it was produced by
lengthhash, it's rather difficult to know what word was the input. The problem is though, that it does narrow the possible inputs to all three-lettered words. A true cryptographic hash function should not narrow down the possibilities.
To demonstrate the last extra property, let me show you some example outputs from a real cryptographic hash function called
SHA-1. This function has been around a while, and is well known. It's not suitable for extreme security situations, but it is useful in many applications.
Notice how the inputs are only different by one letter - the 'm' is uppercase in one, and lowercase in the other - and yet, the outputs are totally different.
While we're at it, here are some other examples of outputs of a related function,
SHA-256. Look for the properties of hash functions we discussed previously.
|The quick grey fox jumps over the lazy brown dog||99c5c1e9107f7b8a7a547d97ba59aa89eba1d52c4ea1e080ec0957e7f07c5407|
|The quick grey fox jumps over the lazy brown dog.||09aac223b5393722436d2867f830a9d0c0d261bdc151dbbedc368302e0c1ab56|
|(this pdf copy of the bitcoin whitepaper)||b1674191a88ec5cdd733e4240a81803105dc412d6c6708d53ab94fc248f4f553|
Here are some areas where cryptographic hash functions might be used:
...and in blockchains, which we will discuss later!
For more information, try these two easy-to-understand videos from Computerphile on YouTube: the first is a general overview of hash functions, and the second is a broad explanation of how the
SHA-1 algorithm works under the hood. I got most of the list of applications from wikipedia.