There has been a lot of news lately about nefarious-sounding backdoors being inserted into cryptographic standards and toolkits. One algorithm, a pseudo-random bit generator, Dual_EC_DRBG, was ratified by the National Institute of Standards and Technology (NIST) in 2007 and is attracting a lot of attention for having a potential backdoor. This is the algorithm into which the NSA allegedly inserted a backdoor and then paid RSA to use.
So how is that possible? This is a technical primer that explains what a backdoor is, how easy it can be to create your own, and the dangerous consequences of using a random number generator that was designed to have a backdoor. This is necessarily a long technical discussion, but hopefully by the end it should be clear why Dual_EC_DRBG has such a bad reputation.
The concept of a backdoor has cast a shadow over the security industry for a long time. A backdoor is an intentional flaw in a cryptographic algorithm or implementation that allows an individual to bypass the security mechanisms the system was designed to enforce. A backdoor is a way for someone to get something out of the system that they otherwise would not be able to. If a security system is wall, a backdoor is a secret tunnel underneath it.
Backdoors can be inserted by lazy programmers who want to bypass their own security systems for debugging reasons, or they can be created to intentionally weaken a system used by others. Government agencies have been known to insert backdoors into commonly used software to enable mass surveillance. Backdoors can be built into software, hardware, or even built into the design of an algorithm.
In theory, a well-designed cryptographic system does not include a backdoor. In practice, it is hard to guarantee that a piece of software is backdoor-free. A backdoor was recently found in a widely-deployed version of D-Link router firmware. The backdoor allows anyone with knowledge of a secret user agent string to log in and modify settings on any router running the vulnerable software. The D-Link backdoor took a long time to find because the source code for the router software was not available to security researchers to examine. With open source software, a researcher can look directly at the part of the code that verifies authentication and check for backdoors.
Open source is great tool for understanding how code works but it is not a cure-all for finding backdoors in software. It can be difficult and time-consuming to fully analyze all the code in a complicated codebase. The International Obfuscated C Code Contest shows how code can be made extremely hard to understand. The Underhanded C Contest takes this even further, showing that benign looking code can hide malicious behavior.
The translation step between human programming languages and machine code can also be used to insert a backdoor. The classic article "Reflections on Trusting Trust" introduced this idea back in 1984. The cryptographic community has recently banded together to audit the open source disk encryption software TrueCrypt for backdoors. One of the key steps in this audit is verifying that the machine code distributed online for TrueCrypt matches the source code. This requires re-building the audited source code with a fully open source compiler and making sure the machine code matches. Reproducible binaries help demonstrate that a backdoor was not inserted in the program's machine code by a malicious person or compiler.
In some cases, even this might not be enough. For example, TrueCrypt, like most cryptographic systems, use the system's random number generator to create secret keys. If an attacker can control or predict the random numbers produced by a system, they can often break otherwise secure cryptographic algorithms. Any predictability in a system's random number generator can render it vulnerable to attacks.
Examples of security systems being bypassed using flaws (intentionally created or otherwise) in random number generators are very common. Some recent examples:
- A flaw in a random number generator allowed people to hijack Hacker News accounts.
- A broken random number generator in Android allowed attackers to hijack thousands of dollars worth of bitcoins.
- The version of OpenSSL on the Debian distribution had a random number generator problem that could allow attackers to guess private keys created on these systems
It's absolutely essential to have an unpredictable source of random numbers in secure systems that rely on them. This includes SSL/TLS, the fundamental security layer of the internet where session keys are generated using random numbers. If you design a random number generator that allows you to predict the output, and convince someone to use it, you can break their system. This kind of algorithmic backdoor is what we will create in this blog post.
A random stream that isn't
The digits of pi are quite random looking but they don't make a very good random number generator because they are predictable. Anyone who knows that someone is using the digits of pi as their source of randomness can use that against them. Convincing someone to use a pi-based random number generator is a difficult challenge.
Many pseudo-random number generators start with a number called a seed. The seed is the starting point for the internal state of the algorithm. The algorithm generates a stream of random numbers using some mathematical operation on the internal state. As long as the seed (and the subsequent internal state) are kept secret, the pseudo-random numbers output by the algorithm are unpredictable to any observer. Conversely, anyone who knows the state will be able to predict the output.
Linux uses a pool of numbers as the internal state of /dev/random, its pseudo-random number generator. Every time a program requests random data from the system, Linux returns a cryptographic hash of its internal state using the algorithm SHA-1. This hash function is designed to be one-way, it is easy to compute but very difficult to find the input given an output. It is so difficult, no person has ever published an inversion of a SHA-1 hash without knowing the input beforehand. This keeps the internal state of the random number generator secret.
The random data extracted by the hash function is then mixed back into internal state. Periodically, the hashes of the timestamps of "unpredictable" system events like clicks and key presses are also mixed in.
Here's a diagram of the basic pseudo-random number generator construction:
In this diagram
F = SHA1
and G = SHA1 + mix with XOR
This construction is pretty standard. The internal state is kept secret, data is output via a one-way function, and the internal state is updated by mixing the data back into the state.
At any point, if an attacker can figure out the internal state, they can predict the output. The strategic choices for F and G here are what make this construction safe. You do not lose the randomness in the pool by XOR-ing with something else, entropy always goes up.
If F and G were chosen to be two completely independent one-way functions, it would probably still be safe. Having SHA-1 as F and MD5 (a different hash function) as G would not be too unreasonable of a choice. The key here is in the word independent, but first a sidestep into elliptic curves.
Elliptic Curves and one-way functions
In a previous blog post we gave a gentle introduction to elliptic curve cryptography. We talked about how this class of curves can be used for encryption and digital signature algorithms. We also hinted that elliptic curves could be used for generating random numbers. That is what we we will describe here.
The reason elliptic curves are used in cryptography is the strongly one-way function they enable. As described previously, there is a geometrically intuitive way to define an arithmetic on the points of an elliptic curve.
Any two points on an elliptic curve can be "dotted" ("multiplied") together to get a new point on the curve. Dotting a point with itself any number of times is fast easy to do, but going back to the original point takes a lot of computation. This operation can be used to create a nice and simple one-way function from a point P1:
Given a number n, output another number m:
- Dot P1 with itself n times to get another point Q
- Output the x-coordinate of Q as m
It's hard to go back from m to n, because that would be enough to solve the elliptic curve discrete logarithm problem, which is thought to be very, very hard to do.
The metaphor used in the previous post was that the one way function in elliptic curves is like playing a peculiar game of billiards. If someone were locked alone in a room they could play a certain number of shots and the ball would end up at a particular location. However, if you entered the room at some point and simply saw the position of the ball it would be very difficult to determine the number of shots the player had taken without playing through the whole game again yourself.
With this billiards analogy, we can think of this random number generator as a new bizarro game of pool. Consider two balls on the infinite elliptic curve billiards table, the yellow ball called P1 and the blue ball called P2.
These two balls have specific points on the curve where they start. This is a two person game where one person is called the generator and the other is the observer. The generator has a secret number "n". The generator takes the ball P1 and performs n shots, and lets the observer see its final location. Then it takes P2 and performs n shots, taking the final location of P2 as a new value for n. Then P1 and P2 are reset to their original location and that's the end of the turn. Each turn the observer sees a new pseudo-random location for P1, and that's the output of the game.
There's a trap door in your one-way functions
In the Linux random number generator example above, SHA-1 is used as the one-way function. Let's consider what happens when we use our elliptic curve one way function instead.
Looking back at the construction for a pseudo-random number generator above, we need to choose two functions to serve as F and G. The elliptic curve one-way function above seems to fit the bill, so let's use the functions defined by two points on the curve, P1 and P2. Each one-way function is hard to reverse, and if P1 and P2 are chosen randomly, they should be independent.
So how do we add a backdoor? The key is to choose P1 and P2 so that to any outside observer they look random and independent, but in reality they have a special relationship that only we know.
Suppose we choose P2 to be P1 dotted with itself s times, where s is secret number. Then P1 and P2 are related but it is hard to prove how since finding s requires solving the elliptic curve discrete logarithm problem.
Given an initial state n, let's look at what the output becomes and what the state gets updated to.
The output is the x-coordinate of:
Q = P1 ◦ P1 ◦ … ◦ P1 (n times)
Then we get that the state S gets updated to:
P2 ◦ P2 ◦ … ◦ P2 (n times)
But P2 is just P1 dotted with itself s times, so the state is really
(P1 ◦ P1 ◦ … ◦ P1 (s times)) ◦ … ◦ (P1 ◦ P1 ◦ … ◦ P1 (s times)) (n times)
P1 ◦ P1 ◦ … ◦ P1 (s ◦ n times)
(P1 ◦ P1 ◦ … ◦ P1 (n times)) ◦ … ◦ (P1 ◦ P1 ◦ … ◦ P1 (n times)) (s times)
Seeing that P1 dotted with itself n times is the output Q, we can write this as:
Q ◦ Q ◦ … ◦ Q (s times)
And since we know s and the output (and therefore Q), we can calculate the next internal state of the algorithm. The state is revealed and all subsequent bytes can be predicted. In just one round! Since given P1 and P2, finding s requires solving the discrete logarithm problem, you get to be the only one who knows this mathematical backdoor.
This can be described in the terms of the billiards game from the last section. Remember the output of one turn of the game is the location of P1 after n shots and generator's secret number comes from the location of P2 after n shots. Knowing the value s is like knowing how many shots it takes to go from P1 to P2. This lets the observer cheat at the game. If you know where P1 lands after n shots, you can shoot s times from that location to get the location of P2 after n shots. This gives you the generator's secret number and allows you to predict the next turn of the game.
Back to the real world
This toy random number generator may seem very simple and the backdoor might even seem obvious. The amazing fact is that our toy random number generator described above is Dual_EC_DRBG, almost exactly. It was published by the NSA with two "random" looking points P1 and P2. There is no indication of how these values were generated.
The values for the points P1 and P2 could have been chosen randomly or they could have been chosen with a deliberate relationship. If they were chosen deliberately, there is a backdoor. If they truly were chosen randomly, then finding the internal state is as difficult as breaking elliptic curve cryptography. Unfortunately, there is no way to identify if the two points were chosen together or randomly without either solving the elliptic curve discrete logarithm function, or catching the algorithm's author with the secret backdoor value. This is the nature of a one-way trapdoor function.
The authors did not provide any proof of randomness for the two points P1 and P2. This could have easily been done by choosing P1 and P2 as outputs of a hash function, but they did not. This is just one of many flaws in the design of this algorithm.
The evidence is mounting for Dual_EC_DRBG being well-suited for use as a back door. A working proof of concept backdoor was published in late 2013 using OpenSSL, and a patent for using the construction as "key escrow" (another term for backdoor) was filed back in 2006.
Up until recently, Dual_EC_DRBG was the default random number generator for several cryptographic products from RSA (the security division of EMC), even though cryptographers have long been skeptical of the algorithm's design. There are reports of impropriety connecting a $10 million investment by the United States government and RSA's decision to use this obscure and widely maligned algorithm in their widely-distributed products.
It is very difficult to implement a secure system. Backdoors can be introduced at the software, hardware or even algorithm level. Algorithms backed by standards are not necessarily safe or free of backdoors. Some lessons to take away from this exercise are:
- Even secure cryptographic functions can be weakened if there isn't a good source of randomness
- Randomness in deterministic systems like computers is very hard to do correctly;
- Adding unpredictable sources of entropy can help increase randomness and, in turn, secure algorithms from these types of attacks
At CloudFlare, we understand this fact and are working on ways to make sure that the randomness in our cryptographic systems is truly random. Steps include extracting entropy from the physical world, monitoring system entropy levels, using a hardware random number generator to mix in extra entropy, and not relying on a single random number generator as the source of all randomness.