
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Sat, 04 Apr 2026 13:09:48 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Know your SCM_RIGHTS]]></title>
            <link>https://blog.cloudflare.com/know-your-scm_rights/</link>
            <pubDate>Thu, 29 Nov 2018 09:54:22 GMT</pubDate>
            <description><![CDATA[ As TLS 1.3 was ratified earlier this year, I was recollecting how we got started with it here at Cloudflare. We made the decision to be early adopters of TLS 1.3 a little over two years ago. It was a very important decision, and we took it very seriously. ]]></description>
            <content:encoded><![CDATA[ <p>As TLS 1.3 was ratified earlier <a href="/rfc-8446-aka-tls-1-3/">this year</a>, I was recollecting how we got started with it here at Cloudflare. We made the decision to be early adopters of <a href="/introducing-tls-1-3/">TLS 1.3</a> a little over two years ago. It was a very important decision, and we took it very seriously.</p><p>It is no secret that Cloudflare uses <a href="/end-of-the-road-for-cloudflare-nginx/">nginx</a> to handle user traffic. A lesser-known fact is that we have several instances of nginx running. I won’t go into detail, but there is one instance whose job is to accept connections on port 443, and proxy them to another instance of nginx that actually handles the requests. It has pretty limited functionality otherwise. We fondly call it nginx-ssl.</p><p>Back then we were using OpenSSL for TLS and crypto in nginx, but OpenSSL (and BoringSSL) had yet to announce a timeline for TLS 1.3 support, therefore we had to implement our own TLS 1.3 stack. Obviously we wanted an implementation that would not affect any customer or client that did not enable TLS 1.3. We also needed something that we could iterate on quickly, because the spec was very fluid back then, and something that we could release frequently without worrying about the rest of the Cloudflare stack.</p><p>The obvious solution was to implement it on top of OpenSSL. The OpenSSL version we were using was 1.0.2, but not only were we looking ahead to replace it with version 1.1.0 or with <a href="/make-ssl-boring-again/">BoringSSL</a> (which we eventually did), but it was also so ingrained in our stack and so fragile that we wouldn’t be able to achieve our stated goals without risking serious bugs.</p><p>Instead, Filippo Valsorda and Brendan McMillion suggested that the easier path would be to implement TLS 1.3 on top of the Go TLS library and make a Go replica of nginx-ssl (go-ssl). 
Go is very easy to iterate and prototype in, it has a powerful standard library, and we had a great pool of Go talent to draw on, so it made a lot of sense. Thus <a href="https://github.com/cloudflare/tls-tris">tls-tris</a> was born.</p><p>The question remained: how would we have Go handle only TLS 1.3 while letting nginx handle all prior versions of TLS?</p><p>And herein lies the problem. Both TLS 1.3 and older versions of TLS communicate on port 443, and it is common knowledge that only one application can listen on a given TCP port. That application is nginx, which would still handle the bulk of the TLS traffic. We could pipe all the TCP data into another connection in Go, effectively creating an additional proxy layer, but where is the fun in that? It also seemed a little inefficient.</p>
    <div>
      <h2>Meet SCM_RIGHTS</h2>
      <a href="#meet-scm_rights">
        
      </a>
    </div>
    <p>So how do you make two different processes, written in two different programming languages, share the same TCP socket?</p><p>Fortunately, Linux (or rather UNIX) provides us with just the tool that we need. You can use UNIX-domain sockets to pass file descriptors between applications, and, like everything else in UNIX, connections are files. Looking at <code>man 7 unix</code> we see the following:</p>
            <pre><code>   Ancillary messages
       Ancillary  data  is  sent and received using sendmsg(2) and recvmsg(2).
       For historical reasons the ancillary message  types  listed  below  are
       specified with a SOL_SOCKET type even though they are AF_UNIX specific.
       To send them  set  the  cmsg_level  field  of  the  struct  cmsghdr  to
       SOL_SOCKET  and  the cmsg_type field to the type.  For more information
       see cmsg(3).

       SCM_RIGHTS
              Send or receive a set of  open  file  descriptors  from  another
              process.  The data portion contains an integer array of the file
              descriptors.  The passed file descriptors behave as though  they
              have been created with dup(2).</code></pre>
            <blockquote><p>Technically you do not send “file descriptors”. The “file descriptors” you handle in the code are simply indices into the process's local file descriptor table, which in turn points into the OS's open file table, which finally points to the vnode representing the file. Thus the “file descriptor” observed by the other process will most likely have a different numeric value, despite pointing to the same file.</p></blockquote><p>We can also check <code>man 3 cmsg</code> as suggested, to find a handy example of how to use SCM_RIGHTS:</p>
            <pre><code>   struct msghdr msg = { 0 };
   struct cmsghdr *cmsg;
   int myfds[NUM_FD];  /* Contains the file descriptors to pass */
   int *fdptr;
   union {         /* Ancillary data buffer, wrapped in a union
                      in order to ensure it is suitably aligned */
       char buf[CMSG_SPACE(sizeof(myfds))];
       struct cmsghdr align;
   } u;

   msg.msg_control = u.buf;
   msg.msg_controllen = sizeof(u.buf);
   cmsg = CMSG_FIRSTHDR(&amp;msg);
   cmsg-&gt;cmsg_level = SOL_SOCKET;
   cmsg-&gt;cmsg_type = SCM_RIGHTS;
   cmsg-&gt;cmsg_len = CMSG_LEN(sizeof(int) * NUM_FD);
   fdptr = (int *) CMSG_DATA(cmsg);    /* Initialize the payload */
   memcpy(fdptr, myfds, NUM_FD * sizeof(int));</code></pre>
            <p>And that is what we decided to use. We let OpenSSL read the “Client Hello” message from an established TCP connection. If the “Client Hello” indicated TLS version 1.3, we would use SCM_RIGHTS to send it to the Go process. The Go process would in turn try to parse the rest of the “Client Hello”: if successful, it would proceed with the TLS 1.3 connection, and upon failure it would give the file descriptor back to OpenSSL to handle as usual.</p><p>So how exactly do you implement something like that?</p><p>Since in our case we established that the C process will listen for TCP connections, our other process will have to listen on a UNIX socket for the connections C wants to forward.</p><p>For example in Go:</p>
            <pre><code>type scmListener struct {
	*net.UnixListener
}

type scmConn struct {
	*net.UnixConn
}

var path = "/tmp/scm_example.sock"

func listenSCM() (*scmListener, error) {
	syscall.Unlink(path)

	addr, err := net.ResolveUnixAddr("unix", path)
	if err != nil {
		return nil, err
	}

	ul, err := net.ListenUnix("unix", addr)
	if err != nil {
		return nil, err
	}

	err = os.Chmod(path, 0777)
	if err != nil {
		return nil, err
	}

	return &amp;scmListener{ul}, nil
}

func (l *scmListener) Accept() (*scmConn, error) {
	uc, err := l.AcceptUnix()
	if err != nil {
		return nil, err
	}
	return &amp;scmConn{uc}, nil
}</code></pre>
            <p>Then in the C process, for each connection we want to pass, we will connect to that socket first:</p>
            <pre><code>int connect_unix()
{
    struct sockaddr_un addr = {.sun_family = AF_UNIX,
                               .sun_path = "/tmp/scm_example.sock"};

    int unix_sock = socket(AF_UNIX, SOCK_STREAM, 0);
    if (unix_sock == -1)
        return -1;

    if (connect(unix_sock, (struct sockaddr *)&amp;addr, sizeof(addr)) == -1)
    {
        close(unix_sock);
        return -1;
    }

    return unix_sock;
}</code></pre>
            <p>To actually pass a file descriptor we utilize the example from <code>man 3 cmsg</code>:</p>
            <pre><code>int send_fd(int unix_sock, int fd)
{
    struct iovec iov = {.iov_base = ":)", // Must send at least one byte
                        .iov_len = 2};

    union {
        char buf[CMSG_SPACE(sizeof(fd))];
        struct cmsghdr align;
    } u;

    struct msghdr msg = {.msg_iov = &amp;iov,
                         .msg_iovlen = 1,
                         .msg_control = u.buf,
                         .msg_controllen = sizeof(u.buf)};

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&amp;msg);
    *cmsg = (struct cmsghdr){.cmsg_level = SOL_SOCKET,
                             .cmsg_type = SCM_RIGHTS,
                             .cmsg_len = CMSG_LEN(sizeof(fd))};

    memcpy(CMSG_DATA(cmsg), &amp;fd, sizeof(fd));

    return sendmsg(unix_sock, &amp;msg, 0);
}</code></pre>
            <p>Then to receive the file descriptor in Go:</p>
            <pre><code>func (c *scmConn) ReadFD() (*os.File, error) {
	msg, oob := make([]byte, 2), make([]byte, 128)

	_, oobn, _, _, err := c.ReadMsgUnix(msg, oob)
	if err != nil {
		return nil, err
	}

	cmsgs, err := syscall.ParseSocketControlMessage(oob[0:oobn])
	if err != nil {
		return nil, err
	} else if len(cmsgs) != 1 {
		return nil, errors.New("invalid number of cmsgs received")
	}

	fds, err := syscall.ParseUnixRights(&amp;cmsgs[0])
	if err != nil {
		return nil, err
	} else if len(fds) != 1 {
		return nil, errors.New("invalid number of fds received")
	}

	fd := os.NewFile(uintptr(fds[0]), "")
	if fd == nil {
		return nil, errors.New("could not open fd")
	}

	return fd, nil
}</code></pre>
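            <p>To sanity-check the whole round trip, the two halves can be exercised in a single Go process. The sketch below is illustrative and not part of go-ssl (names like <code>demoPassFD</code> are made up): it passes the read end of a pipe over a <code>socketpair</code>, which stands in for the UNIX listener/connect pair above:</p>
            <pre><code>package main

import (
	"fmt"
	"os"
	"syscall"
)

// demoPassFD sends the read end of a pipe over a UNIX socketpair
// using SCM_RIGHTS, then reads the pipe's contents through the
// received descriptor.
func demoPassFD() (string, error) {
	// A connected pair of UNIX-domain sockets, standing in for the
	// listener/connect pair above.
	sp, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_STREAM, 0)
	if err != nil {
		return "", err
	}
	defer syscall.Close(sp[0])
	defer syscall.Close(sp[1])

	// The descriptor to pass: the read end of a pipe with data in it.
	r, w, err := os.Pipe()
	if err != nil {
		return "", err
	}
	w.WriteString("hello")
	w.Close()
	defer r.Close()

	// Sender: wrap the fd in an SCM_RIGHTS control message.
	rights := syscall.UnixRights(int(r.Fd()))
	if err := syscall.Sendmsg(sp[0], []byte(":)"), rights, nil, 0); err != nil {
		return "", err
	}

	// Receiver: read the control message and parse the fd back out,
	// just like ReadFD above.
	buf, oob := make([]byte, 2), make([]byte, 128)
	_, oobn, _, _, err := syscall.Recvmsg(sp[1], buf, oob, 0)
	if err != nil {
		return "", err
	}
	cmsgs, err := syscall.ParseSocketControlMessage(oob[:oobn])
	if err != nil {
		return "", err
	}
	fds, err := syscall.ParseUnixRights(&amp;cmsgs[0])
	if err != nil {
		return "", err
	}

	// The received fd behaves like a dup(2) of the original:
	// reading from it drains the very same pipe.
	f := os.NewFile(uintptr(fds[0]), "passed-pipe")
	defer f.Close()
	data := make([]byte, 5)
	n, err := f.Read(data)
	if err != nil {
		return "", err
	}
	return string(data[:n]), nil
}

func main() {
	s, err := demoPassFD()
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}</code></pre>
            <p>Note that the fd number received on the other side usually differs from the one sent, yet both drain the same pipe, which is exactly the dup(2) behavior the man page describes.</p>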
            
    <div>
      <h2>Rust</h2>
      <a href="#rust">
        
      </a>
    </div>
    <p>We can also do this in Rust. The Rust standard library does not yet support sending ancillary data over UNIX sockets, but it does let you call the C library directly via the <a href="https://rust-lang.github.io/libc/x86_64-unknown-linux-gnu/libc/">libc</a> crate. Warning: unsafe code ahead! (Note that the example below uses the BSD/macOS flavor of the libc structs; the <code>sun_len</code> field and the <code>__error()</code> errno accessor do not exist on Linux, where you would drop <code>sun_len</code> and use <code>__errno_location()</code> instead.)</p><p>First we want to implement some UNIX socket functionality in Rust:</p>
            <pre><code>use libc::*;
use std::io::prelude::*;
use std::net::TcpStream;
use std::os::unix::io::FromRawFd;
use std::os::unix::io::RawFd;

fn errno_str() -&gt; String {
    let strerr = unsafe { strerror(*__error()) };
    let c_str = unsafe { std::ffi::CStr::from_ptr(strerr) };
    c_str.to_string_lossy().into_owned()
}

pub struct UNIXSocket {
    fd: RawFd,
}

pub struct UNIXConn {
    fd: RawFd,
}

impl Drop for UNIXSocket {
    fn drop(&amp;mut self) {
        unsafe { close(self.fd) };
    }
}

impl Drop for UNIXConn {
    fn drop(&amp;mut self) {
        unsafe { close(self.fd) };
    }
}

impl UNIXSocket {
    pub fn new() -&gt; Result&lt;UNIXSocket, String&gt; {
        match unsafe { socket(AF_UNIX, SOCK_STREAM, 0) } {
            -1 =&gt; Err(errno_str()),
            fd @ _ =&gt; Ok(UNIXSocket { fd }),
        }
    }

    pub fn bind(self, address: &amp;str) -&gt; Result&lt;UNIXSocket, String&gt; {
        assert!(address.len() &lt; 104);

        let mut addr = sockaddr_un {
            sun_len: std::mem::size_of::&lt;sockaddr_un&gt;() as u8,
            sun_family: AF_UNIX as u8,
            sun_path: [0; 104],
        };

        for (i, c) in address.chars().enumerate() {
            addr.sun_path[i] = c as i8;
        }

        match unsafe {
            unlink(&amp;addr.sun_path as *const i8);
            bind(
                self.fd,
                &amp;addr as *const sockaddr_un as *const sockaddr,
                std::mem::size_of::&lt;sockaddr_un&gt;() as u32,
            )
        } {
            -1 =&gt; Err(errno_str()),
            _ =&gt; Ok(self),
        }
    }

    pub fn listen(self) -&gt; Result&lt;UNIXSocket, String&gt; {
        match unsafe { listen(self.fd, 50) } {
            -1 =&gt; Err(errno_str()),
            _ =&gt; Ok(self),
        }
    }

    pub fn accept(&amp;self) -&gt; Result&lt;UNIXConn, String&gt; {
        match unsafe { accept(self.fd, std::ptr::null_mut(), std::ptr::null_mut()) } {
            -1 =&gt; Err(errno_str()),
            fd @ _ =&gt; Ok(UNIXConn { fd }),
        }
    }
}</code></pre>
            <p>And the code to extract the file descriptor:</p>
            <pre><code>#[repr(C)]
pub struct ScmCmsgHeader {
    cmsg_len: c_uint,
    cmsg_level: c_int,
    cmsg_type: c_int,
    fd: c_int,
}

impl UNIXConn {
    pub fn recv_fd(&amp;self) -&gt; Result&lt;RawFd, String&gt; {
        let mut iov = iovec {
            iov_base: std::ptr::null_mut(),
            iov_len: 0,
        };

        let mut scm = ScmCmsgHeader {
            cmsg_len: 0,
            cmsg_level: 0,
            cmsg_type: 0,
            fd: 0,
        };

        let mut mhdr = msghdr {
            msg_name: std::ptr::null_mut(),
            msg_namelen: 0,
            msg_iov: &amp;mut iov as *mut iovec,
            msg_iovlen: 1,
            msg_control: &amp;mut scm as *mut ScmCmsgHeader as *mut c_void,
            msg_controllen: std::mem::size_of::&lt;ScmCmsgHeader&gt;() as u32,
            msg_flags: 0,
        };

        let n = unsafe { recvmsg(self.fd, &amp;mut mhdr, 0) };

        if n == -1
            || scm.cmsg_len as usize != std::mem::size_of::&lt;ScmCmsgHeader&gt;()
            || scm.cmsg_level != SOL_SOCKET
            || scm.cmsg_type != SCM_RIGHTS
        {
            Err("Invalid SCM message".to_string())
        } else {
            Ok(scm.fd)
        }
    }
}</code></pre>
            
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>SCM_RIGHTS is a very powerful tool that can be used for many purposes. In our case we used it to introduce a new service in an unobtrusive fashion. Other uses include:</p><ul><li><p>A/B testing</p></li><li><p>Phasing out an old C-based service in favor of a new Go or Rust one</p></li><li><p>Passing connections from a privileged process to an unprivileged one</p></li></ul><p>And more.</p><p>You can find the full example <a href="https://github.com/vkrasnov/scm_sample">here</a>.</p> ]]></content:encoded>
            <category><![CDATA[TLS 1.3]]></category>
            <category><![CDATA[TLS]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[OpenSSL]]></category>
            <category><![CDATA[Go]]></category>
            <category><![CDATA[NGINX]]></category>
            <guid isPermaLink="false">6FPYG7UeMVoKyzz6mA3tpB</guid>
            <dc:creator>Vlad Krasnov</dc:creator>
        </item>
        <item>
            <title><![CDATA[On the dangers of Intel's frequency scaling]]></title>
            <link>https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/</link>
            <pubDate>Fri, 10 Nov 2017 11:06:58 GMT</pubDate>
            <description><![CDATA[ While I was writing the post comparing the new Qualcomm server chip, Centriq, to our current stock of Intel Skylake-based Xeons, I noticed a disturbing phenomenon. ]]></description>
            <content:encoded><![CDATA[ <p>While I was writing the post <a href="/arm-takes-wing/">comparing the new Qualcomm server chip, Centriq, to our current stock of Intel Skylake-based Xeons</a>, I noticed a disturbing phenomenon.</p><p>When benchmarking OpenSSL 1.1.1dev, I discovered that the performance of the cipher <a href="/do-the-chacha-better-mobile-performance-with-cryptography/">ChaCha20-Poly1305</a> does not scale very well. On a single thread, it performed at approximately 2.89GB/s, whereas on 24 cores and 48 threads it performed at just over 35 GB/s.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/jlnzEn5W40VfjAjikYdYW/1de112201351e15272d2005504081f9a/3618979283_5d117e956f_o.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by-sa/2.0/">CC BY-SA 2.0</a> <a href="https://www.flickr.com/photos/blumblaum/3618979283/in/photolist-6vNdsx-oqxCB4-VPRfyG-93tyBR-F8JBFo-qiRno8-dPuTWR-823ZS4-9hqkuL-9mFdCJ-nho2bN-8SooRN-bEYfa2-VpkZWA-diuHi4-daq1Zg-qiThYa-o9Pnb2-b8G3ND-dPotf8-yEgMt-7GMuJQ-dc3AXG-WqA4iw-fokagU-qTcnqd-csBTnw-efpWJk-e8UN3x-e8UMye-e8UMEe-dgeUKv-54A2Ve-8kh2we-54A2Pa-e91rzh-q7baBG-54A2Kv-umff5-7LUaZQ-ntyDPD-bPcRhV-e8UMKT-e8UMwR-athi9i-8aKsm6-oWFjAF-e4Uvop-69kwK4-e4UvCH">image</a> by <a href="https://www.flickr.com/photos/blumblaum/">blumblaum</a></p><p>Now this is a very high number, but I would like to see something closer to 69GB/s. 35GB/s is just 1.46GB/s/core, or roughly 50% of the single core performance. AES-GCM scales much better, to 80% of single core performance, which is understandable, because the CPU can sustain higher frequency turbo on a single core, but not all cores.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/V6Oda0nbgjQEdLGhXue4A/09c9be06c793682fc218c046974a0b1b/gcm-chacha-core.png" />
            
            </figure><p>Why is the scaling of ChaCha20-Poly1305 so poor? Meet AVX-512. AVX-512 is a new Intel instruction set that adds many new 512-bit wide SIMD instructions and promotes most of the existing ones to 512-bit. The problem with such wide instructions is that they consume power. A lot of power. Imagine a single instruction that does the work of 64 regular byte instructions, or 8 full blown 64-bit instructions.</p><p>To keep power in check Intel introduced something called dynamic frequency scaling. It reduces the base frequency of the processor whenever AVX2 or AVX-512 instructions are used. This is not new, and has existed since Haswell introduced AVX2 three years ago.</p><p>The scaling gets worse when more cores execute AVX-512 and when multiplication is used.</p><p>If you only run AVX-512 code, then everything is good. The frequency is lower, but your overall productivity is higher, because each instruction does more work.</p><p>OpenSSL 1.1.1dev implements several variants of ChaCha20-Poly1305, including AVX2 and AVX-512 variants. BoringSSL implements a different AVX2 version of ChaCha20-Poly1305. It is understandable then why BoringSSL achieves only 1.6GB/s on a single core, compared to the 2.89GB/s OpenSSL does.</p><p>So how does this affect you, if you mix a little AVX-512 with your real workload? We use the Xeon Silver 4116 CPUs, with a base frequency 2.1GHz, in a dual socket configuration. From a figure I found on <a href="https://en.wikichip.org/wiki/intel/xeon_silver/4116">wikichip</a> it seems that running AVX-512 even just on one core on this CPU will reduce the base frequency to 1.8GHz. Running AVX-512 on all cores will reduce it to just 1.4GHz.</p><p>Now imagine you run a webserver with Apache or NGINX. In addition you have many other services, performing some real, important work. What happens if you start encrypting your traffic with ChaCha20-Poly1305 using AVX-512? 
That is the question I asked myself.</p><p>I compiled two versions of NGINX, one with OpenSSL1.1.1dev and the other with BoringSSL, and installed it on our server with two Xeon Silver 4116 CPUs, for a total of 24 cores.</p><p>I configured the server to serve a medium sized HTML page, and perform some meaningful work on it. I used LuaJIT to remove line breaks and extra spaces, and brotli to compress the file.</p><p>I then monitored the number of requests per second served under full load. This is what I got:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/stzkAbRIIvTH3SI7jw8Cv/40b2d9d0fc3c16717b58066fea0cb824/gcm-chacha-1.png" />
            
            </figure><p>By using ChaCha20-Poly1305 over AES-128-GCM, the server that uses OpenSSL serves 10% fewer requests per second. And that is a huge number! It is equivalent to giving up two cores, for nothing. One might think that this is due to ChaCha20-Poly1305 being inherently slower. But that is not the case.</p><p>First, BoringSSL performs equivalently well with AES-GCM and ChaCha20-Poly1305.</p><p>Second, even when only 20% of the requests use ChaCha20-Poly1305, the server throughput drops by more than 7%, and by 5.5% when 10% of the requests are ChaCha20-Poly1305. For reference, 15% of the TLS requests Cloudflare handles are ChaCha20-Poly1305.</p><p>Finally, according to <code>perf</code>, the AVX-512 workload consumes only 2.5% of the CPU time when all the requests are ChaCha20-Poly1305, and less than 0.3% when doing ChaCha20-Poly1305 for 10% of the requests. Regardless, the CPU throttles down, because that is what it does when it sees AVX-512 running on all cores.</p><p>It is hard to say just how much each core is throttled at any given time, but doing some sampling using <code>lscpu</code>, I found out that when executing the <code>openssl speed -evp chacha20-poly1305 -multi 48</code> benchmark, it shows <code>CPU MHz: 1199.963</code>; for OpenSSL with all AES-GCM connections I got <code>CPU MHz: 2399.926</code>, and for OpenSSL with all ChaCha20-Poly1305 connections I saw <code>CPU MHz: 2184.338</code>, which is obviously 9% slower.</p><p>Another interesting distinction is that ChaCha20-Poly1305 with AVX2 is slightly slower in OpenSSL but is the same in BoringSSL. Why might that be? The reason here is that the BoringSSL code does not use AVX2 multiplication instructions for Poly1305, and only uses simple xor, shift and add operations for ChaCha20, which allows it to run at the base frequency.</p><p>OpenSSL 1.1.1dev is still in development, therefore I suspect no one is affected by this issue yet. 
We switched to BoringSSL months ago, and our server performance is not affected by this issue.</p><p>What the future holds is unclear. Intel announced very cool new ISA extensions for the future generation of CPUs, which are expected to improve crypto performance even further. Those extensions include AVX512+VAES, AVX512+VPCLMULQDQ and AVX512IFMA. But if the frequency scaling issue is not resolved by then, using those for general purpose cryptography libraries will do (much) more harm than good.</p><p>The problem is not with cryptography libraries alone. OpenSSL did nothing wrong by trying to get the best possible performance; on the contrary, I wrote a decent amount of AVX-512 code for OpenSSL myself. The observed behavior is a sad side effect. There are many libraries that use AVX and AVX2 instructions out there; they will probably be updated to AVX-512 at some point, and users are not likely to be aware of the implementation details. If you do not require AVX-512 for some specific high performance tasks, I suggest you disable AVX-512 execution on your server or desktop, to avoid accidental AVX-512 throttling.</p> ]]></content:encoded>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[OpenSSL]]></category>
            <category><![CDATA[SSL]]></category>
            <category><![CDATA[Cryptography]]></category>
            <guid isPermaLink="false">VFpy4PLKXP7Xxo1BX6Kdp</guid>
            <dc:creator>Vlad Krasnov</dc:creator>
        </item>
        <item>
            <title><![CDATA[ARM Takes Wing:  Qualcomm vs. Intel CPU comparison]]></title>
            <link>https://blog.cloudflare.com/arm-takes-wing/</link>
            <pubDate>Wed, 08 Nov 2017 20:03:14 GMT</pubDate>
            <description><![CDATA[ One of the nicer perks I have here at Cloudflare is access to the latest hardware, long before it even reaches the market. Until recently I mostly played with Intel hardware.  ]]></description>
            <content:encoded><![CDATA[ <p>One of the nicer perks I have here at Cloudflare is access to the latest hardware, long before it even reaches the market.</p><p>Until recently I mostly played with Intel hardware. For example Intel supplied us with an engineering sample of their Skylake based Purley platform back in August 2016, to give us time to evaluate it and optimize our software. As a former Intel Architect, who did a lot of work on Skylake (as well as Sandy Bridge, Ivy Bridge and Icelake), I really enjoy that.</p><p>Our previous generation of servers was based on the Intel Broadwell micro-architecture. Our configuration includes dual-socket Xeons E5-2630 v4, with 10 cores each, running at 2.2GHz, with a 3.1GHz turboboost and hyper-threading enabled, for a total of 40 threads per server.</p><p>Since Intel was, and still is, the undisputed leader of the server CPU market with greater than 98% market share, our upgrade process until now was pretty straightforward: every year Intel releases a new generation of CPUs, and every year we buy them. In the process we usually get two extra cores per socket, and all the extra architectural features such upgrade brings: hardware AES and CLMUL in Westmere, AVX in Sandy Bridge, AVX2 in Haswell, etc.</p><p>In the current upgrade cycle, our next server processor ought to be the Xeon Silver 4116, also in a dual-socket configuration. In fact, we have already purchased a significant number of them. Each CPU has 12 cores, but it runs at a lower frequency of 2.1GHz, with 3.0GHz turboboost. It also has smaller last level cache: 1.375 MiB/core, compared to 2.5 MiB the Broadwell processors had. In addition, the Skylake based platform supports 6 memory channels and the AVX-512 instruction set.</p><p>As we head into 2018, however, change is in the air. 
For the first time in a while, Intel has serious competition in the server market: Qualcomm and Cavium both have new server platforms based on the ARMv8 64-bit architecture (aka aarch64 or arm64). Qualcomm has the Centriq platform (code name Amberwing), based on the Falkor core, and Cavium has the ThunderX2 platform, based on the ahm ... ThunderX2 core?</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2OerKvEoy4NXTwjEpDA1tY/d8d9c4dad28ab5bc7100710b3dcf644e/25704115174_061e907e57_o.jpg" />
            
            </figure><p>The majestic Amberwing powered by the Falkor CPU <a href="https://creativecommons.org/licenses/by-sa/2.0/">CC BY-SA 2.0</a> <a href="https://www.flickr.com/photos/drphotomoto/25704115174/in/photolist-FaoiVy-oqK2j2-f6A8TL-6PTBkR-gRasf9-f2R2Hz-7bZeUp-fmxSzZ-o9fogQ-8evb42-f4tgSX-eGzXYi-6umTDd-8evd8H-gdCU2L-uhCbnz-fmxSsX-oxnuko-wb7in9-oqsJSH-uxxAS2-CzS4Eh-6y8KQA-brLjKf-YT2jrY-eGG5QJ-8pLnKt-8eyvgY-cnQqJs-fXYs9f-f2R2jK-28ahBA-fXYjkD-a9K25u-289gvW-PHrqDS-cmkggf-Ff9NXa-EhMcP4-f36dMm-289xP7-Ehrz1y-f2QZRZ-GqT3vt-uUeHBq-xUDQoa-ymMxE9-wWFi3q-MDva8W-8CWG5X">image</a> by <a href="https://www.flickr.com/photos/drphotomoto/">DrPhotoMoto</a></p><p>Recently, both Qualcomm and Cavium provided us with engineering samples of their ARM based platforms, and in this blog post I would like to share my findings about Centriq, the Qualcomm platform.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4t48heslIVcJiaqGTuxwhl/a968a4b7f0a6a5dd25fc8481029db354/IMG_7222.jpg" />
            
            </figure><p>The actual Amberwing in question</p>
    <div>
      <h3>Overview</h3>
      <a href="#overview">
        
      </a>
    </div>
    <p>I tested the Qualcomm Centriq server, and compared it with our newest Intel Skylake based server and previous Broadwell based server.</p><table><tr><td><p><b>Platform</b></p></td><td><p><b>Grantley
(Intel)</b></p></td><td><p><b>Purley
(Intel)</b></p></td><td><p><b>Centriq
(Qualcomm)</b></p></td></tr><tr><td><p>Core</p></td><td><p>Broadwell</p></td><td><p>Skylake</p></td><td><p>Falkor</p></td></tr><tr><td><p>Process</p></td><td><p>14nm</p></td><td><p>14nm</p></td><td><p>10nm</p></td></tr><tr><td><p>Issue</p></td><td><p>8 µops/cycle</p></td><td><p>8 µops/cycle</p></td><td><p>8 instructions/cycle</p></td></tr><tr><td><p>Dispatch</p></td><td><p>4 µops/cycle</p></td><td><p>5 µops/cycle</p></td><td><p>4 instructions/cycle</p></td></tr><tr><td><p># Cores</p></td><td><p>10 x 2S + HT (40 threads)</p></td><td><p>12 x 2S + HT (48 threads)</p></td><td><p>46</p></td></tr><tr><td><p>Frequency</p></td><td><p>2.2GHz (3.1GHz turbo)</p></td><td><p>2.1GHz (3.0GHz turbo)</p></td><td><p>2.5 GHz</p></td></tr><tr><td><p>LLC</p></td><td><p>2.5 MB/core</p></td><td><p>1.375 MB/core</p></td><td><p>1.25 MB/core</p></td></tr><tr><td><p>Memory Channels</p></td><td><p>4</p></td><td><p>6</p></td><td><p>6</p></td></tr><tr><td><p>TDP</p></td><td><p>170W (85W x 2S)</p></td><td><p>170W (85W x 2S)</p></td><td><p>120W</p></td></tr><tr><td><p>Other features</p></td><td><p>AES
CLMUL
AVX2</p></td><td><p>AES
CLMUL
AVX512</p></td><td><p>AES
CLMUL
NEON
Trustzone
CRC32</p></td></tr></table><p>Overall on paper Falkor looks very competitive. In theory a Falkor core can process 8 instructions/cycle, same as Skylake or Broadwell, and it has higher base frequency at a lower TDP rating.</p>
    <div>
      <h3>Ecosystem readiness</h3>
      <a href="#ecosystem-readiness">
        
      </a>
    </div>
    <p>Up until now, a major obstacle to the deployment of ARM servers was the lack of, or weak, support from the majority of software vendors. In the past two years, ARM’s enablement efforts have paid off, as most Linux distros, as well as most popular libraries, support the 64-bit ARM architecture. Driver availability, however, is unclear at this point.</p><p>At Cloudflare, we run a complex software stack that consists of many integrated services, and running each of them efficiently is top priority.</p><p>On the edge we have the NGINX server software, which does support ARMv8. NGINX is written in C, and it also uses several libraries written in C, such as zlib and BoringSSL, therefore solid C compiler support is very important.</p><p>In addition, our flavor of NGINX is highly integrated with the <a href="https://github.com/openresty/lua-nginx-module">lua-nginx-module</a>, and we rely a lot on <a href="/pushing-nginx-to-its-limit-with-lua/">LuaJIT</a>.</p><p>Finally, a lot of our services, such as our DNS server, <a href="/what-weve-been-doing-with-go/#sts=RRDNS">RRDNS</a>, are written in Go.</p><p>The good news is that both gcc and clang not only support ARMv8 in general, but have optimization profiles for the Falkor core.</p><p>Go has official support for ARMv8 as well, and the arm64 backend is improving constantly.</p><p>As for LuaJIT, the stable version, 2.0.5, does not support ARMv8, but the beta version, 2.1.0, does. Let’s hope it gets out of beta soon.</p>
    <div>
      <h3>Benchmarks</h3>
      <a href="#benchmarks">
        
      </a>
    </div>
    
    <div>
      <h4>OpenSSL</h4>
      <a href="#openssl">
        
      </a>
    </div>
    <p>The first benchmark I wanted to perform, was OpenSSL version 1.1.1 (development version), using the bundled <code>openssl speed</code> tool. Although we recently switched to BoringSSL, I still prefer OpenSSL for benchmarking, because it has almost equally well optimized assembly code paths for both ARMv8 and the latest Intel processors.</p><p>In my opinion handcrafted assembly is the best measure of a CPU’s potential, as it bypasses the compiler bias.</p>
    <div>
      <h4>Public key cryptography</h4>
      <a href="#public-key-cryptography">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5jqlU0sjyAGrZaT2Y1x8JR/9fc0bccc931f0e8f8cbf1c8fcd2bdecf/pub_key_1_core-2.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/19nPDhEQXutllHbY08nO6O/0da82dc7260b6dbd732cc3cf79455ba8/pub_key_all_core-2.png" />
            
            </figure><p>Public key cryptography is all about raw ALU performance. It is interesting, but not surprising, to see that in the single core benchmark, the Broadwell core is faster than Skylake, and both in turn are faster than Falkor. This is because Broadwell runs at a higher frequency, while architecturally it is not much inferior to Skylake.</p><p>Falkor is at a disadvantage here. First, in a single core benchmark, the turbo is engaged, meaning the Intel processors run at a higher frequency. Second, in Broadwell, Intel introduced two special instructions to accelerate big number multiplication: ADCX and ADOX. These perform two independent add-with-carry operations per cycle, whereas ARM can only do one. Similarly, the ARMv8 instruction set does not have a single instruction that produces the full 128-bit result of a 64-bit multiplication; instead it uses a pair of instructions, MUL and UMULH, to obtain the low and high halves of the product.</p><p>Nevertheless, at the SoC level, Falkor wins big time. It is only marginally slower than Skylake at RSA2048 signatures, and only because RSA2048 does not have an optimized implementation for ARM. The <a href="https://www.cloudflare.com/learning/dns/dnssec/ecdsa-and-dnssec/">ECDSA</a> performance is ridiculously fast. A single Centriq chip can satisfy the ECDSA needs of almost any company in the world.</p><p>It is also very interesting to see Skylake outperform Broadwell by a 30% margin, despite losing the single core benchmark and having only 20% more cores. This can be explained by more efficient all-core turbo and improved hyper-threading.</p>
    <div>
      <h4>Symmetric key cryptography</h4>
      <a href="#symmetric-key-cryptography">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5c1gb11NBOBVq2PyFyxHIJ/05ddba2c522eecce2fe1fceca405c3ac/sym_key_1_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3HzScBUWaLPW2l7t0bvsuo/8c235d691f67a6fd9bbfe2176e3037d4/sym_key_all_core.png" />
            
            </figure><p>Symmetric key performance of the Intel cores is outstanding.</p><p>AES-GCM uses a combination of special hardware instructions to accelerate AES and CLMUL (carryless multiplication). Intel first introduced those instructions back in 2010, with their Westmere CPU, and has improved their performance with every generation since. ARM introduced a set of similar instructions only recently, with their 64-bit instruction set, and as an optional extension. Fortunately, every hardware vendor I know of has implemented them. It is very likely that Qualcomm will improve the performance of the cryptographic instructions in future generations.</p><p>ChaCha20-Poly1305 is a more generic algorithm, designed to better utilize wide SIMD units. The Qualcomm CPU only has the 128-bit wide NEON SIMD, while Broadwell has 256-bit wide AVX2, and Skylake has 512-bit wide AVX-512. This explains the huge lead Skylake has over both in single core performance. In the all-cores benchmark the Skylake lead narrows, because it has to lower the clock speed when executing AVX-512 workloads. When executing AVX-512 on all cores, the base frequency goes down to just 1.4GHz, so keep that in mind if you are mixing AVX-512 and other code.</p><p>The bottom line for symmetric crypto is that although Skylake has the lead, Broadwell and Falkor both have good enough performance for any real life scenario, especially considering the fact that on our edge, RSA consumes more CPU time than all the other crypto algorithms combined.</p>
    <div>
      <h3>Compression</h3>
      <a href="#compression">
        
      </a>
    </div>
    <p>The next benchmark I wanted to see was compression. This is for two reasons. First, it is a very important workload on the edge, as better compression saves bandwidth and helps deliver content faster to the client. Second, it is a very demanding workload, with a high rate of branch mispredictions.</p><p>Obviously the first benchmark would be the popular zlib library. At Cloudflare, we use an <a href="/cloudflare-fights-cancer/">improved version of the library</a>, optimized for 64-bit Intel processors, and although it is written mostly in C, it does use some Intel-specific intrinsics. Comparing this optimized version to the generic zlib library wouldn’t be fair. Not to worry, with little effort I <a href="https://github.com/cloudflare/zlib/tree/vlad/aarch64">adapted the library</a> to work very well on the ARMv8 architecture, with the use of NEON and CRC32 intrinsics. The result is twice as fast as the generic library for some files.</p><p>The second benchmark is the emerging brotli library; it is written in C, which allows for a level playing field across all platforms.</p><p>All the benchmarks are performed on the HTML of <a href="/">blog.cloudflare.com</a>, in memory, similar to the way NGINX performs streaming compression. The size of the specific version of the HTML file is 29,329 bytes, making it a good representative of the type of files we usually compress. The parallel benchmark compresses multiple files in parallel, as opposed to compressing a single file on many threads, also similar to the way NGINX works.</p>
    <div>
      <h4>gzip</h4>
      <a href="#gzip">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2pi68sRQQHPJLXZO5vIbhW/72c6e8568f4d95e3e8d592dd72577ea8/gzip_1_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/S2VrG4fcrmbDUeOtJ453S/bbc5092e3565339bb7cb44ba7c8dd14d/gzip_all_core.png" />
            
            </figure><p>When using gzip, at the single core level Skylake is the clear winner. Despite running at a lower frequency than Broadwell, its lower penalty for branch misprediction seems to help it pull ahead. The Falkor core is not far behind, especially with lower quality settings. At the system level Falkor performs significantly better, thanks to the higher core count. Note how well gzip scales on multiple cores.</p>
    <div>
      <h4>brotli</h4>
      <a href="#brotli">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5LA2CLLkYa9iCe1eQzgs6x/1ff12aa354d80166407fe4f9fd538d78/brot_1_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1cuPwzCY4eJzgsOXmNrLXu/f7be60b7ac3497c13b3c0a2abe6ac522/brot_all_core.png" />
            
            </figure><p>With brotli on a single core the situation is similar. Skylake is the fastest, but Falkor is not far behind, and with quality setting 9, Falkor is actually faster. Brotli with quality level 4 performs very similarly to gzip at level 5, while actually compressing slightly better (8,010B vs 8,187B).</p><p>When performing many-core compression, the situation becomes a bit messy. For levels 4, 5 and 6 brotli scales very well. At levels 7 and 8 we start seeing lower performance per core, bottoming out at level 9, where running on all cores yields less than 3x the single core performance.</p><p>My understanding is that at those quality levels brotli consumes significantly more memory, and starts thrashing the cache. The scaling improves again at levels 10 and 11.</p><p>The bottom line for brotli: Falkor wins, since we would not consider going above quality 7 for dynamic compression.</p>
    <div>
      <h3>Golang</h3>
      <a href="#golang">
        
      </a>
    </div>
    <p>Golang is another very important language for Cloudflare. It is also one of the first languages to offer ARMv8 support, so one would expect good performance. I used some of the built-in benchmarks, but modified them to run on multiple goroutines.</p>
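<p>A sketch of how a built-in benchmark can be modified to run on multiple goroutines, using <code>testing.Benchmark</code> and <code>b.RunParallel</code> (the <code>strings.Contains</code> body is an illustrative stand-in for the library code actually benchmarked):</p>

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// parallelBenchmark runs the same body on one goroutine per GOMAXPROCS,
// the way the built-in library benchmarks were adapted for this post.
func parallelBenchmark() testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		b.RunParallel(func(pb *testing.PB) {
			for pb.Next() {
				// Stand-in workload; the real benchmarks exercise
				// crypto, compress/gzip, regexp and strings.
				strings.Contains("hello, cloudflare", "flare")
			}
		})
	})
}

func main() {
	r := parallelBenchmark()
	fmt.Printf("ran %d iterations across all cores\n", r.N)
}
```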
    <div>
      <h4>Go crypto</h4>
      <a href="#go-crypto">
        
      </a>
    </div>
    <p>I would like to start the benchmarks with crypto performance. Thanks to OpenSSL we have good reference numbers, and it is interesting to see just how good the Go library is.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/64U9LsIbiAKp7Ut75mMmhj/c73c1142233e260e8733b079dac06ff8/go_pub_key_1_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5e9MFqm53p78KdB211Z8rZ/bc16992088b6aa826819bec645f99581/go_pub_key_all_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3TZyK91xQAJk4yfvYLnVjZ/c33931e0ead0f425c157746b9d637fc5/go_sym_key_1_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1S2LKr04vDIi1WrrZNaJn1/827505e17425e0521142cb397e209504/go_sym_key_all_core.png" />
            
            </figure><p>As far as Go crypto is concerned, ARM and Intel are not even on the same playing field. Go has very optimized assembly code for ECDSA, AES-GCM and ChaCha20-Poly1305 on Intel. It also has Intel-optimized math functions, used in RSA computations. All of those are missing for ARMv8, putting it at a big disadvantage.</p><p>Nevertheless, the gap can be bridged with a relatively small effort, and we know that with the right optimizations, performance can be on par with OpenSSL. Even a very minor change, such as implementing the function <a href="https://go-review.googlesource.com/c/go/+/76270">addMulVVW</a> in assembly, led to an over tenfold improvement in RSA performance, putting Falkor ahead of both Broadwell and Skylake, with 8,009 signatures/second.</p><p>Another interesting thing to note is that on Skylake, the Go ChaCha20-Poly1305 code, which uses AVX2, performs almost identically to the OpenSSL AVX-512 code; this is again due to AVX2 running at higher clock speeds.</p>
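<p>For context, <code>addMulVVW</code> computes z&nbsp;+=&nbsp;x*y (a vector of limbs times a single word), the inner loop of the multiplications behind RSA. A portable <code>uint64</code> sketch of what it does (the real function in <code>math/big</code> operates on <code>big.Word</code>, and the speedup came from writing exactly this loop in ARMv8 assembly):</p>

```go
package main

import (
	"fmt"
	"math/bits"
)

// addMulVVW adds x*y to z limb by limb and returns the final carry.
// Each iteration is one MUL/UMULH pair plus two add-with-carry steps,
// which is why the ADCX/ADOX discussion above matters so much for RSA.
func addMulVVW(z, x []uint64, y uint64) (carry uint64) {
	for i := range z {
		hi, lo := bits.Mul64(x[i], y)
		lo, c := bits.Add64(lo, z[i], 0)
		hi += c
		lo, c = bits.Add64(lo, carry, 0)
		hi += c
		z[i] = lo
		carry = hi
	}
	return carry
}

func main() {
	z := []uint64{1, 2}
	c := addMulVVW(z, []uint64{3, 4}, 5)
	fmt.Println(z, c) // [16 22] 0
}
```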
    <div>
      <h4>Go gzip</h4>
      <a href="#go-gzip">
        
      </a>
    </div>
    <p>Next in Go performance is gzip. Here again we have a reference point in pretty well-optimized code, and we can compare it to Go. In the case of the Go gzip library, there are no Intel-specific optimizations in place.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1GbKt2HcdJT5eNMdfsLw4d/7c55522afbfa58653a91c28479de8c3d/go_gzip_1_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/59TTlWALNqIkryEc7IwOnB/73ee80ab27313c3f55d3cb22c06b6704/go_gzip_all_core.png" />
            
            </figure><p>Gzip performance is pretty good. The single core Falkor performance is way below both Intel processors, but at the system level it manages to outperform Broadwell, and lags behind Skylake. Since we already know that Falkor outperforms both when C is used, it can only mean that Go’s backend for ARMv8 is still pretty immature compared to gcc.</p>
    <div>
      <h4>Go regexp</h4>
      <a href="#go-regexp">
        
      </a>
    </div>
    <p>Regexp is widely used in a variety of tasks, so its performance is quite important too. I ran the built-in benchmarks on 32KB strings.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2AhmqRQhO37Z03KXfz9PUk/9b39646f7e800c9a7c8565134d516815/go_regexp_easy_1_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6PH0IDtvODvYcGbECgOvEt/fbb7c4a099ad2b478aa12b8f3b7fb0a0/go_regexp_easy_all_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7eAhDauk00ZR4A37NPqQm5/8f1c94f9d34087b57b8672a4928e76a7/go_regexp_comp_1_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3ytsG1skhi4B1v24tqlrWw/d0d803c543a990854bd60916ed03b089/go_regexp_comp_all_core.png" />
            
            </figure><p>Go regexp performance is not very good on Falkor. In the medium and hard tests it takes second place, thanks to the higher core count, but Skylake is significantly faster still.</p><p>Profiling shows that a lot of the time is spent in the function bytes.IndexByte. This function has an assembly implementation on amd64 (runtime.indexbytebody), but only a generic Go implementation on arm64. The easy regexp tests spend most of their time in this function, which explains the even wider gap there.</p>
    <div>
      <h4>Go strings</h4>
      <a href="#go-strings">
        
      </a>
    </div>
    <p>Another important library for a web server is the Go strings library. I only tested the basic Replacer type here.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/G20n7Ru5izVdUBOMM8af7/9a257ef47b5a38e3f745a22da7f36da5/go_str_1_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2W6GTABFY1t23YOIhyLc1A/ce51cb7081718a724b797f39e7ff80e2/go_str_all_core.png" />
            
            </figure><p>In this test again, Falkor lags behind, and loses even to Broadwell. Profiling shows significant time is spent in the function runtime.memmove. Guess what? It has highly optimized assembly code for amd64 that uses AVX2, but only very simple ARM assembly that copies 8 bytes at a time. By changing three lines in that code, and using the LDP/STP instructions (load pair/store pair) to copy 16 bytes at a time, I improved the performance of memmove by 30%, which resulted in 20% faster EscapeString and UnescapeString performance. And that is just scratching the surface.</p>
    <div>
      <h4>Go conclusion</h4>
      <a href="#go-conclusion">
        
      </a>
    </div>
    <p>Go support for aarch64 is quite disappointing. I am very happy to say that everything compiles and works flawlessly, but on the performance side, things should get better. It seems like the enablement effort so far has concentrated on the compiler back end, while the library was left largely untouched. There are a lot of low-hanging optimization fruits out there, as my 20-minute fix for <a href="https://go-review.googlesource.com/c/go/+/76270">addMulVVW</a> clearly shows. Qualcomm and other ARMv8 vendors intend to put significant engineering resources into amending this situation, but really anyone can contribute to Go. So if you want to leave your mark, now is the time.</p>
    <div>
      <h3>LuaJIT</h3>
      <a href="#luajit">
        
      </a>
    </div>
    <p>Lua is the glue that holds Cloudflare together.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/eTdeElawEbvCLpjM9XMUa/8527a83cf68acc1e6a48af94860b190c/luajit_1_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ohY0uFyotdk0eJt9lU725/472f32803de7ecd6e2d95f27948aa85a/luajit_all_cores.png" />
            
            </figure><p>Except for the binary_trees benchmark, the performance of LuaJIT on ARM is very competitive. It wins two benchmarks, and almost ties in a third.</p><p>That being said, binary_trees is a very important benchmark, because it triggers many memory allocations and garbage collection cycles. It will require deeper investigation in the future.</p>
    <div>
      <h3>NGINX</h3>
      <a href="#nginx">
        
      </a>
    </div>
    <p>For the NGINX workload, I decided to generate a load that would resemble an actual server.</p><p>I set up a server that serves the HTML file used in the gzip benchmark, over https, with the ECDHE-ECDSA-AES128-GCM-SHA256 cipher suite.</p><p>It also uses LuaJIT to redirect the incoming request, remove all line breaks and extra spaces from the HTML file, while adding a timestamp. The HTML is then compressed using brotli with quality 5.</p><p>Each server was configured to work with as many workers as it has virtual CPUs. 40 for Broadwell, 48 for Skylake and 46 for Falkor.</p><p>As the client for this test, I used the <a href="https://github.com/rakyll/hey">hey</a> program, running from 3 Broadwell servers.</p><p>Concurrently with the test, we took power readings from the respective BMC units of each server.</p>
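<p>A rough sketch of what such a test server could look like; the paths, certificate names and Lua body below are illustrative only, not our actual configuration (the brotli directives come from the ngx_brotli module, and the Lua block relies on lua-nginx-module):</p>

```nginx
worker_processes 48;  # one worker per virtual CPU (40/48/46 in the test)

http {
    server {
        listen 443 ssl;
        ssl_certificate     /etc/nginx/ecdsa.crt;   # illustrative paths
        ssl_certificate_key /etc/nginx/ecdsa.key;
        ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256;

        brotli on;             # ngx_brotli
        brotli_comp_level 5;

        location / {
            # Minify the HTML and stamp it, as described above.
            content_by_lua_block {
                local f = io.open("/srv/blog.html", "rb")
                local body = f:read("*a")
                f:close()
                body = body:gsub("%s+", " ") .. "<!-- " .. ngx.now() .. " -->"
                ngx.say(body)
            }
        }
    }
}
```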
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3d3OUyJ1m0q11IoWUNCyTp/5cde6f742d596d71d14e5f1355a4b570/nginx.png" />
            
            </figure><p>With the NGINX workload, Falkor handled almost the same amount of requests as the Skylake server, and both significantly outperformed Broadwell. The power readings, taken from the BMC, show that it did so while consuming less than half the power of the other processors. That means Falkor managed 214 requests/watt, vs Skylake’s 99 requests/watt and Broadwell’s 77.</p><p>I was a bit surprised to see Skylake and Broadwell consume about the same amount of power, given that both are manufactured with the same process, and Skylake has more cores.</p><p>The low power consumption of Falkor is not surprising: Qualcomm processors are known for their great power efficiency, which has allowed them to become a dominant player in the mobile phone CPU market.</p>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>The engineering sample of Falkor we got certainly impressed me a lot. This is a huge step up from any previous attempt at ARM based servers. Certainly core for core, the Intel Skylake is far superior, but when you look at the system level the performance becomes very attractive.</p><p>The production version of the Centriq SoC will feature up to 48 Falkor cores, running at a frequency of up to 2.6GHz, for a potential additional 8% of performance.</p><p>Obviously the Skylake server we tested is not the flagship Platinum unit that has 28 cores, but those 28 cores come with both a big price tag and an over 200W TDP, whereas we are interested in improving our bang for buck metric, and performance per watt.</p><p>Currently, my main concern is the weak Go language performance, but that is bound to improve quickly once ARM based servers start gaining some market share.</p><p>Both C and LuaJIT performance is very competitive, and in many cases outperforms the Skylake contender. In almost every benchmark Falkor shows itself as a worthy upgrade from Broadwell.</p><p>The largest win by far for Falkor is the low power consumption. Although it has a TDP of 120W, during my tests it never went above 89W (for the Go benchmark). In comparison, Skylake and Broadwell both went over 160W, while the TDP of the two CPUs is 170W.</p><p><i>If you enjoy testing and selecting hardware on behalf of millions of Internet properties, come <a href="https://www.cloudflare.com/careers/">join us</a>.</i></p> ]]></content:encoded>
            <category><![CDATA[SSL]]></category>
            <category><![CDATA[OpenSSL]]></category>
            <category><![CDATA[Compression]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[LUA]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Go]]></category>
            <category><![CDATA[Cryptography]]></category>
            <guid isPermaLink="false">3Rk7Ip66PBd0Hn31OM4NuO</guid>
            <dc:creator>Vlad Krasnov</dc:creator>
        </item>
        <item>
            <title><![CDATA[Go crypto: bridging the performance gap]]></title>
            <link>https://blog.cloudflare.com/go-crypto-bridging-the-performance-gap/</link>
            <pubDate>Thu, 07 May 2015 10:06:00 GMT</pubDate>
            <description><![CDATA[ It is no secret that we at CloudFlare love Go. We use it, and we use it a LOT. There are many things to love about Go, but what I personally find appealing is the ability to write assembly code! ]]></description>
            <content:encoded><![CDATA[ <p>It is no secret that we at CloudFlare love Go. We use it, and we use it a LOT. There are many things to love about Go, but what I personally find appealing is the ability to write assembly code!</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7KXkUfdeZkkc1sWRj7cMWM/0169fe43ac4fca720cc5e3c048bd913f/4238725400_58e0df7d8a_z.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a> <a href="https://www.flickr.com/photos/curns/4238725400/in/photolist-7syzoq-bX6MBP-7MB6AP-bV6qYW-JKJyC-jwtSkq-eaezFL-cYsmkE-ndmMRH-2EMNd1-oW5DRv-N7SWP-62CbPG-mm4X-5hUghN-bUdHXE-awERSL-5rx4ST-nUJ7Za-ewYVyy-diiyTv-pprT55-6tuZgp-fQDkYD-fQDkUF-bMRznn-3VAiJx-bvGxVF-7S8RZX-pcqyM9-6g4iYb-8NwRYU-cwiigy-9PkavU-61G39m-4bnYXH-d2QTGb-32J1co-q2VB1P-nGbBjk-rnm8s8-nT6b6W-88u81w-9aWDrY-9C5Anx-nT6TBk-oah7x8-oarCdA-pb3eXW-68dr8i">image</a> by <a href="https://www.flickr.com/photos/curns/">Jon Curnow</a></p><p>That is probably not the first thing that pops to your mind when you think of Go, but yes, it does allow you to write code "close to the metal" if you need the performance!</p><p>Another thing we do a lot in CloudFlare is... cryptography. To keep your data safe we encrypt everything. And everything in CloudFlare is a LOT.</p><p>Unfortunately the built-in cryptography libraries in Go do not perform nearly as well as state-of-the-art implementations such as OpenSSL. That is not acceptable at CloudFlare's scale, therefore we created assembly implementations of <a href="/ecdsa-the-digital-signature-algorithm-of-a-better-internet/">Elliptic Curves</a> and AES-GCM for Go on the amd64 architecture, supporting the AES and CLMUL NI to bring performance up to par with the OpenSSL implementation we use for <a href="/universal-ssl-how-it-scales/">Universal SSL</a>.</p><p>We have been using those improved implementations for a while, and attempting to make them part of the official Go build for the good of the community. For now you can use our <a href="https://github.com/cloudflare/go">special fork of Go</a> to enjoy the improved performance.</p><p>Both implementations are constant-time and side-channel protected. 
In addition the fork includes small improvements to Go's RSA implementation.</p><p>The performance benefits of this fork over the standard Go 1.4.2 library on the tested Haswell CPU are:</p>
            <pre><code>                         CloudFlare          Go 1.4.2        Speedup
AES-128-GCM           2,138.4 MB/sec          91.4 MB/sec     23.4X

P256 operations:
Base Multiply           26,249 ns/op        737,363 ns/op     28.1X
Multiply/ECDH comp     110,003 ns/op      1,995,365 ns/op     18.1X
Generate Key/ECDH gen   31,824 ns/op        753,174 ns/op     23.7X
ECDSA Sign              48,741 ns/op      1,015,006 ns/op     20.8X
ECDSA Verify           146,991 ns/op      3,086,282 ns/op     21.0X

RSA2048:
Sign                 3,733,747 ns/op      7,979,705 ns/op      2.1X
Sign 3-prime         1,973,009 ns/op      5,035,561 ns/op      2.6X</code></pre>
            
    <div>
      <h3>AES-GCM in brief</h3>
      <a href="#aes-gcm-in-a-brief">
        
      </a>
    </div>
    <p>So what is AES-GCM and why do we care? Well, it is an AEAD - Authenticated Encryption with Associated Data. Specifically, an AEAD is a special combination of a cipher and a MAC (Message Authentication Code) algorithm into a single robust algorithm, using a single key. This is different from the other method of performing authenticated encryption, "encrypt-then-MAC" (or as TLS does it with CBC-SHAx, "MAC-then-encrypt"), where you can use any combination of cipher and MAC.</p><p>Using a dedicated AEAD reduces the dangers of bad combinations of ciphers and MACs, and other mistakes, such as using related keys for encryption and authentication.</p><p>Given the <i>many</i> vulnerabilities related to the use of AES-CBC with HMAC, and the weakness of RC4, AES-GCM is the de-facto secure standard on the web right now, as the only IETF-approved AEAD to use with TLS at the moment.</p><p>Another AEAD you may have heard of is <a href="/do-the-chacha-better-mobile-performance-with-cryptography/">ChaCha20-Poly1305</a>, which CloudFlare also supports, but it is not a standard just yet.</p><p>That is why we use AES-GCM as the preferred cipher for customer HTTPS, prioritizing ChaCha20-Poly1305 only for mobile browsers that support it. You can see it in our <a href="https://github.com/cloudflare/sslconfig">configuration</a>. As such, today more than 60% of our client-facing traffic is encrypted with AES-GCM, and about 10% is encrypted with ChaCha20-Poly1305. This percentage grows every day, as browser support improves. We also use AES-GCM to encrypt all the traffic between our data centers.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1YfENR53QhCOS7e6p5hxww/82283b52e997187cfa7c4a3ed9b80448/2564482934_d26c31c011_z-1.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a> <a href="https://www.flickr.com/photos/319/2564482934/in/photolist-5FYmSD-7zvQwN-6ATEuo-4VgpgZ-kUSNp3-6wAN3r-c3LXyY-c3LXpW-4S2DaR-4UBDr1-4UxqxR-82A7UM-4VgH6r-6cXGRY/">image</a> by <a href="https://www.flickr.com/photos/319/">3:19</a></p>
    <div>
      <h3>AES-GCM as an AEAD</h3>
      <a href="#aes-gcm-as-an-aead">
        
      </a>
    </div>
    <p>As I mentioned, AEAD is a special combination of a cipher and a MAC. In the case of AES-GCM the cipher is the AES block cipher in <a href="http://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#Counter_.28CTR.29">Counter Mode</a> (AES-CTR). For the MAC it uses a universal hash called <a href="http://en.wikipedia.org/wiki/Galois/Counter_Mode#Encryption_and_authentication">GHASH</a>, encrypted with AES-CTR.</p><p>The inputs to the AES-GCM AEAD encryption are as follows:</p><ul><li><p>The secret key (K), which may be 128, 192 or 256 bits long. In TLS, the key is usually valid for the entire connection.</p></li><li><p>A special unique value called the IV (initialization value) - in TLS it is 96 bits. The IV is not secret, but the same IV may not be used for more than one message with the same key under any circumstance! To achieve that, usually part of the IV is generated as a nonce value, and the rest of it is incremented as a counter. In TLS the IV counter is also the record sequence number. The IV of GCM only needs to be unique, unlike the IV in CBC mode, which must also be unpredictable. The disadvantage of using this type of IV is that, in order to avoid collisions, one must change the encryption key before the IV counter overflows.</p></li><li><p>The additional data (A) - this data is not secret, and therefore not encrypted, but it is authenticated by GHASH. In TLS the additional data is 13 bytes, and includes data such as the record sequence number, type, length and the protocol version.</p></li><li><p>The plaintext (P) - this is the secret data; it is both encrypted and authenticated.</p></li></ul><p>The operation outputs the ciphertext (C) and the authentication tag (T). The length of C is identical to that of P, and the length of T is 128 bits (although some applications allow for shorter tags). 
The tag T is computed over A and C, so if even a single bit of either of them is changed, the decryption process will detect the tampering attempt and refuse to use the data. In TLS, the tag T is attached at the end of the ciphertext C.</p><p>When decrypting the data, the function receives A, C and T, and computes the authentication tag of the received A and C. It compares the resulting value to T, and only if the two are equal does it output the plaintext P.</p><p>By supporting the two state-of-the-art AEADs - AES-GCM and ChaCha20-Poly1305 - together with <a href="https://www.cloudflare.com/learning/dns/dnssec/ecdsa-and-dnssec/">ECDSA</a> and ECDH algorithms, CloudFlare is able to provide the fastest, most flexible and most secure TLS experience possible to date on all platforms, be it a PC or a mobile phone.</p>
    <div>
      <h3>Bottom line</h3>
      <a href="#bottom-line">
        
      </a>
    </div>
    <p>Go is very easy to learn and fun to use, yet it is one of the most powerful languages for systems programming. It allows us to deliver robust web-scale software in short time frames. With the performance improvements CloudFlare brings to its crypto stack, Go can now be used for high performance TLS servers as well!</p> ]]></content:encoded>
            <category><![CDATA[TLS]]></category>
            <category><![CDATA[Elliptic Curves]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[OpenSSL]]></category>
            <category><![CDATA[SSL]]></category>
            <category><![CDATA[Go]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Cryptography]]></category>
            <guid isPermaLink="false">2pfrw28XfVhe2XpPETRNAm</guid>
            <dc:creator>Vlad Krasnov</dc:creator>
        </item>
        <item>
            <title><![CDATA[OpenSSL Security Advisory of 19 March 2015]]></title>
            <link>https://blog.cloudflare.com/openssl-security-advisory-of-19-march-2015/</link>
            <pubDate>Thu, 19 Mar 2015 15:15:54 GMT</pubDate>
            <description><![CDATA[ Today there were multiple vulnerabilities released in OpenSSL, a cryptographic library used by CloudFlare (and most sites on the Internet). ]]></description>
            <content:encoded><![CDATA[ <p>Today there were <a href="http://openssl.org/news/secadv_20150319.txt">multiple vulnerabilities</a> released in <a href="https://www.openssl.org/">OpenSSL</a>, a cryptographic library used by CloudFlare (and most sites on the Internet). There has been advance notice that an announcement would be forthcoming, although the contents of the vulnerabilities were kept closely controlled and shared only with major operating system vendors until this notice.</p><p>Based on our analysis of the vulnerabilities and how CloudFlare uses the OpenSSL library, this batch of vulnerabilities primarily affects CloudFlare as a "Denial of Service" possibility (it can cause CloudFlare's proxy servers to crash), rather than as an information disclosure vulnerability. Customer traffic and customer SSL keys continue to be protected.</p><p>As is good security practice, we have quickly tested the patched version and begun a push to our production environment, to be completed within the hour. 
We encourage all customers to upgrade to the latest patched versions of OpenSSL on their own servers, particularly if they are using the 1.0.2 branch of the OpenSSL library.</p><p>The individual vulnerabilities included in this announcement are:</p><ul><li><p>OpenSSL 1.0.2 ClientHello sigalgs DoS (CVE-2015-0291)</p></li><li><p>Reclassified: RSA silently downgrades to EXPORT_RSA [Client] (CVE-2015-0204)</p></li><li><p>Multiblock corrupted pointer (CVE-2015-0290)</p></li><li><p>Segmentation fault in DTLSv1_listen (CVE-2015-0207)</p></li><li><p>Segmentation fault in ASN1_TYPE_cmp (CVE-2015-0286)</p></li><li><p>Segmentation fault for invalid PSS parameters (CVE-2015-0208)</p></li><li><p>ASN.1 structure reuse memory corruption (CVE-2015-0287)</p></li><li><p>PKCS7 NULL pointer dereferences (CVE-2015-0289)</p></li><li><p>Base64 decode (CVE-2015-0292)</p></li><li><p>DoS via reachable assert in SSLv2 servers (CVE-2015-0293)</p></li><li><p>Empty CKE with client auth and DHE (CVE-2015-1787)</p></li><li><p>Handshake with unseeded PRNG (CVE-2015-0285)</p></li><li><p>Use After Free following d2i_ECPrivatekey error (CVE-2015-0209)</p></li><li><p>X509_to_X509_REQ NULL pointer deref (CVE-2015-0288)</p></li></ul><p>We thank the OpenSSL project and the individual vulnerability reporters for finding, disclosing, and remediating these problems. All software has bugs, sometimes security critical bugs, and having a good process for handling them once identified is a necessary part of the world of computer software.</p> ]]></content:encoded>
            <category><![CDATA[Bugs]]></category>
            <category><![CDATA[OpenSSL]]></category>
            <category><![CDATA[Vulnerabilities]]></category>
            <category><![CDATA[SSL]]></category>
            <guid isPermaLink="false">5iDk819SWpIq72Z4POaLdw</guid>
            <dc:creator>Ryan Lackey</dc:creator>
        </item>
        <item>
            <title><![CDATA[No upgrade needed: CloudFlare sites already protected from FREAK]]></title>
            <link>https://blog.cloudflare.com/cloudflare-sites-are-protected-from-freak/</link>
            <pubDate>Wed, 04 Mar 2015 00:32:33 GMT</pubDate>
            <description><![CDATA[ The newly announced FREAK vulnerability is not a concern for CloudFlare's SSL customers. We do not support 'export grade' cryptography (which, by its nature, is weak) and we upgraded to the non-vulnerable version of OpenSSL the day it was released in early January. ]]></description>
            <content:encoded><![CDATA[ <p>The newly announced <a href="https://freakattack.com/">FREAK</a> vulnerability is not a concern for CloudFlare's SSL customers. We do not support 'export grade' cryptography (which, by its nature, is weak) and we upgraded to the non-vulnerable version of OpenSSL the day it was released in early January.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/uVaqAZ0XJXO6DxkvToLfS/0216ddc24eda47918aa558f3b011f2bd/2659844503_53300bee78_z.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a> <a href="https://www.flickr.com/photos/misteraitch/2659844503/in/photolist-gV2DMN-ptyGjN-fcodhg-fuZbgi-61FE16-79YYZ7-61KRhh-5Y47Hp-51MYa-kAwFbH-pDgkmD-LKX6Z-2waqHT-hatm4-6Q4TC9-6TpKh3-8fhUYR-fc5qT3-543p6Z-kawfu9-kAwiva-9XZNWS-5ZnzS6-6cXsT8-6TkJUM-4uwecE-9eq76J-6Ggjau-64wVhu-75sbte-dL1uiE-dKUZ3r-dKUYVn-9qbjr4-bUSBBR-9qemcf-3mCMyo-9qbioF-75w3Jm-jJpQ-nm4fcg-ns22aJ-6TkJxM-6TkJDF-6TpJEN-6bfM38-pVwfSV-K9fwy-6cjnT9-2h6VS">image</a> by <a href="https://www.flickr.com/photos/misteraitch/">Stuart Heath</a></p><p>Our OpenSSL configuration is freely available on our Github account <a href="https://github.com/cloudflare/sslconfig">here</a> as are our patches to OpenSSL 1.0.2.</p><p>We strive to stay on top of vulnerabilities as they are announced; in this case no action was necessary as we were already protected by decisions to eliminate cipher suites and upgrade software.</p><p>We are also pro-active about disabling protocols and ciphers that are outdated (such as <a href="/sslv3-support-disabled-by-default-due-to-vulnerability/">SSLv3</a>, <a href="/end-of-the-road-for-rc4/">RC4</a>) and keep up to date with the latest and most secure ciphers (such as <a href="/do-the-chacha-better-mobile-performance-with-cryptography/">ChaCha-Poly</a>, <a href="/staying-on-top-of-tls-attacks/">forward secrecy</a> and <a href="/a-relatively-easy-to-understand-primer-on-elliptic-curve-cryptography/">elliptic curves</a>).</p> ]]></content:encoded>
            <category><![CDATA[Vulnerabilities]]></category>
            <category><![CDATA[OpenSSL]]></category>
            <category><![CDATA[SSL]]></category>
            <category><![CDATA[Attacks]]></category>
            <category><![CDATA[RC4]]></category>
            <category><![CDATA[Elliptic Curves]]></category>
            <guid isPermaLink="false">7uKcAGhyhliqaeL58CcNx6</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
        <item>
            <title><![CDATA[New OpenSSL vulnerabilities: CloudFlare systems patched]]></title>
            <link>https://blog.cloudflare.com/new-openssl-vulnerabilities-cloudflare-systems-patch/</link>
            <pubDate>Thu, 05 Jun 2014 04:00:00 GMT</pubDate>
            <description><![CDATA[ The OpenSSL team announced seven vulnerabilities covering OpenSSL 0.9.8, 1.0.0, 1.0.1 and 1.0.2 (i.e. all versions) earlier today. ]]></description>
            <content:encoded><![CDATA[ <p>The OpenSSL team announced <a href="https://www.openssl.org/news/secadv_20140605.txt">seven vulnerabilities</a> covering OpenSSL 0.9.8, 1.0.0, 1.0.1 and 1.0.2 (i.e. all versions) earlier today.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3nKsfD6YkIfVaMWibZb8ce/f1b3ee3d2870d39f0264187ebffd84c0/logo.png" />
            
            </figure><p>The most serious of these is a potential on-path attack, <a href="https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2014-0224">CVE-2014-0224</a>, which is being referred to as <a href="http://ccsinjection.lepidum.co.jp/">CCS Injection</a>. Both Google's <a href="https://www.imperialviolet.org/2014/06/05/earlyccs.html">Adam Langley</a> and the <a href="http://ccsinjection.lepidum.co.jp/blog/2014-06-05/CCS-Injection-en/index.html">original reporter</a> of the problem have write-ups that give more technical detail.</p><p>We have applied the required patch to all CloudFlare servers and customers are protected against CVE-2014-0224 and all the other vulnerabilities announced today.</p><p>Everyone who uses OpenSSL in their software or on their server should upgrade as soon as possible; the OpenSSL team has released <a href="https://www.openssl.org/">new versions</a> today.</p> ]]></content:encoded>
            <category><![CDATA[OpenSSL]]></category>
            <category><![CDATA[Vulnerabilities]]></category>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[SSL]]></category>
            <guid isPermaLink="false">7FCffEnIEF5ZH2V85pJf73</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
        <item>
            <title><![CDATA[Tracking our SSL configuration]]></title>
            <link>https://blog.cloudflare.com/tracking-our-ssl-configuration/</link>
            <pubDate>Sat, 03 May 2014 11:23:00 GMT</pubDate>
            <description><![CDATA[ Over time we've updated the SSL configuration we use for serving HTTPS as the security landscape has changed. In the past we've documented those changes in blog posts. ]]></description>
            <content:encoded><![CDATA[ <p>Over time we've updated the SSL configuration we use for serving HTTPS as the security landscape has changed. In the past we've documented those changes in blog posts; to make things simpler to track, and so that people can stay up to date on the configuration we've chosen, I've created a Github repository called <a href="https://github.com/cloudflare/sslconfig/">sslconfig</a>. I've recreated the history of our SSL configuration from an internal repository and going forward we'll synchronize this repo with the configuration we are using.</p><p>Our SSL configuration has changed because attacks on SSL/TLS have appeared: <a href="/new-ssl-vulnerabilities-cloudflare-users-prot">Lucky 13</a>, <a href="/taming-beast-better-ssl-now-available-across">BEAST</a>, and <a href="/staying-on-top-of-tls-attacks">biases in RC4</a>.</p><p>Not long ago we modified OpenSSL to <a href="/killing-rc4">prevent the use of RC4 for TLS 1.1 and above</a> and introduced <a href="/ecdsa-the-digital-signature-algorithm-of-a-better-internet">ECDSA</a> and we continue to examine the right set of ciphers to use so that our customers are as secure as possible (such as using <a href="/staying-on-top-of-tls-attacks">Perfect Forward Secrecy</a>).</p><p>Stay tuned for further announcements, and keep an eye on <a href="https://github.com/cloudflare/sslconfig/">sslconfig</a> for the <a href="https://support.cloudflare.com/hc/en-us/articles/200933580-What-cipher-suites-does-CloudFlare-use-for-SSL-">latest configuration</a>.</p><p>PS As with any of our <a href="http://cloudflare.github.io/">open source</a> efforts, comments, criticisms and pull requests are most welcome.</p> ]]></content:encoded>
            <category><![CDATA[OpenSSL]]></category>
            <category><![CDATA[SSL]]></category>
            <guid isPermaLink="false">d5yiumJgV8kZJIdKEpzAo</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
        <item>
            <title><![CDATA[Searching for The Prime Suspect: How Heartbleed Leaked Private Keys]]></title>
            <link>https://blog.cloudflare.com/searching-for-the-prime-suspect-how-heartbleed-leaked-private-keys/</link>
            <pubDate>Sun, 27 Apr 2014 22:00:00 GMT</pubDate>
            <description><![CDATA[ Within a few hours of CloudFlare launching its Heartbleed Challenge the truth was out. Not only did Heartbleed leak private session information (such as cookies and other data that SSL should have been protecting), but the crown jewels of an HTTPS web server were also vulnerable. ]]></description>
            <content:encoded><![CDATA[ <p>Within a few hours of CloudFlare launching its <a href="/answering-the-critical-question-can-you-get-private-ssl-keys-using-heartbleed">Heartbleed Challenge</a> the truth was out. Not only did Heartbleed leak private session information (such as cookies and other data that SSL should have been protecting), but the crown jewels of an HTTPS web server were also vulnerable: the private SSL keys were accessible through Heartbleed messages.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3n71WKRKu8a2cIoaEWTCPJ/37bba296da398ddee502753e8a9d3b91/Prime-Suspect_BBC_cast-e1315077167915.jpg" />
            
            </figure><p>When we launched the challenge we were unsure if private keys could be accessed, but had started the process of revoking and recreating all the SSL private keys that we manage. When the challenge was defeated in a matter of hours it became obvious that it was fairly easy to find the prime numbers that are at the heart of an RSA private key.</p><p>Most of the people who obtained the private SSL key of the challenge server did so by searching the results returned in Heartbleed messages for prime numbers. Testing for prime numbers itself is pretty simple (although slow): find a number and see if it has any divisors. Since the length of the prime numbers (in bits) was known it was just a matter of finding all blocks of 1,024 bits and seeing whether each was actually prime.</p><p>Finding one prime is enough to break the private SSL key.</p><p>The question we wanted to answer was: "Why was it apparently so easy to find prime numbers in Heartbleed results?" To get to the answer I instrumented a vulnerable version of OpenSSL (1.0.1f) and began the search for the primes and where they end up in memory. Strangely, it seemed that OpenSSL was set up to cleanse memory of primes (and other sensitive values) when they were no longer being used. So, what was going on?</p><p>But first a detour into the RSA algorithm and how it is implemented efficiently.</p>
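<p>As an illustration of how simple the search is, here is a small Python sketch (our own illustration, not any challenge winner's actual code; the function name and the little-endian assumption are ours): slide a window of the right width across a leaked block of memory, interpret each window as a number, and check whether it divides the public modulus N.</p>

```python
def find_factor(memory: bytes, n: int, prime_bits: int = 1024):
    """Scan leaked memory for a prime_bits-wide factor of the modulus n.

    Any window that divides n exactly must be one of the two secret
    primes (p or q), since n has no other divisors of that size.
    Finding one is enough: the other prime is simply n // p.
    """
    width = prime_bits // 8
    for offset in range(len(memory) - width + 1):
        # OpenSSL stores BIGNUMs as little-endian word arrays on x86,
        # so interpret each window as a little-endian integer.
        candidate = int.from_bytes(memory[offset:offset + width], "little")
        if candidate.bit_length() == prime_bits and n % candidate == 0:
            return candidate
    return None
```

<p>As a toy example with 8-bit "primes" p = 251 and q = 241 (so N = 60491), a leaked buffer containing the byte 251 gives up the key immediately: <code>find_factor(bytes([0, 17, 200, 251]), 60491, prime_bits=8)</code> returns 251, and 60491 // 251 recovers q = 241.</p>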
    <div>
      <h3>RSA and the Montgomery Reduction</h3>
      <a href="#rsa-and-the-montgomery-reduction">
        
      </a>
    </div>
    <p>The full details of the RSA cryptographic scheme can be read <a href="http://en.wikipedia.org/wiki/RSA_(cryptosystem)">here</a>, but the short version is that two large prime numbers (almost always called p and q) are chosen and kept secret. The product of those two numbers is referred to as the modulus, typically called N (N = p x q).</p><p>It's safe to make N public (it is part of what's called the public key) because it is believed (by mathematicians and computer scientists) that working out p and q from N is difficult (i.e. it is believed that factoring N is hard as long as p and q are very large).</p><p>So, when a browser connects to a site that's using SSL it obtains the public key from the site and the site keeps p and q secret. But the site needs p and q to perform its private key operations (such as decrypting and signing) and so they are typically (but not always) held in the memory of the server handling the web site.</p><p>In the Heartbleed attack it was possible to find one of p or q in the Heartbleed packets that were returned containing blocks of memory from the vulnerable server. Since N is public, once you have either of p or q the other can easily be worked out by dividing N by the prime you've found.</p><p>Note that in one Heartbleed attack a researcher used <a href="http://en.wikipedia.org/wiki/Coppersmith's_Attack">Coppersmith's Attack</a>, a complex mathematical attack which makes it possible to split N into p and q if you find just part of a prime number. Read the details of that <a href="http://www.lightbluetouchpaper.org/2014/04/25/heartbleed-and-rsa-private-keys/">here</a>.</p><p>In practice, the mathematics used in RSA is rather slow and various schemes have been invented to speed it up. One, which is used by OpenSSL, is known as the <a href="http://en.wikipedia.org/wiki/Montgomery_reduction">Montgomery Reduction</a>.</p><p>Deep inside RSA the fundamental operation that has to be performed is a multiplication modulo N. 
If you are not familiar with the term "modulo N", it simply means "divide by N and take the remainder". For example, here's 13 x 29 modulo 11 (which is 13 x 29 divided by 11, keeping only the remainder):</p>
<pre><code>13 x 29 modulo 11
= 377 modulo 11
= 3    (377/11 = 34 remainder 3)</code></pre>
<p>This is a very expensive operation for a computer to perform (and in RSA it is performed repeatedly) because it involves multiplication and division (if you remember long division from school you'll remember how laborious it is: computers find it similarly difficult when compared to addition or subtraction).</p><p>In 1985, mathematician Peter Montgomery showed that it was possible to perform 'multiplication modulo N' <a href="http://dx.doi.org/10.1090/S0025-5718-1985-0777282-X">very quickly</a> as follows. At each step the following is performed:</p><ol><li><p>A digit from the left number is multiplied by the right number and added to a temporary variable (usually called an accumulator);</p></li><li><p>The accumulator is divided by 10 and the decimals are discarded;</p></li><li><p>If there are digits remaining in the left number then add to the accumulator the number r where r x 11 plus whatever is currently in the accumulator would be divisible by 10;</p></li><li><p>Move on to the next digit if there is one.</p></li></ol><p>For example, 13 x 29 mod 11 can be calculated like this:</p>
<pre><code>Accumulator
 0
87           Added 3 x 29
 8           Divided by 10
10           Added 2 because 2 x 11 + 8 = 30 (divisible by 10)
39           Added 1 x 29
 3           Divided by 10</code></pre>
<p>The answer, 3, is in the accumulator. What's important here is that only 'easy' calculations were performed: small multiplications by a single digit and division by 10. In the computer implementation all of this is done in binary with division by 2 instead of 10. Division by 2 is very, very easy to implement on a computer.</p>
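<p>For the curious, here is a toy sketch of that binary version in Python. This is our own illustration, not OpenSSL's implementation: at each step it adds in one bit of the left number times the right number, adds N if needed to make the accumulator even (N is odd), and halves it. The result carries a leftover factor of R<sup>-1</sup> (where R = 2<sup>bits</sup>), which a second Montgomery multiplication by R<sup>2</sup> mod N removes.</p>

```python
def mont_mul(x, y, n, nbits):
    """Montgomery multiplication in base 2: returns x*y*R^-1 mod n,
    where R = 2^nbits and n is odd. Inputs must already be reduced mod n.
    Only additions, comparisons and halvings are used -- never a division by n."""
    acc = 0
    for i in range(nbits):
        acc += ((x >> i) & 1) * y   # add this digit (bit) of x times y
        if acc & 1:                 # make the accumulator even;
            acc += n                # n is odd, so adding it always works
        acc >>= 1                   # exact division by 2
    return acc - n if acc >= n else acc

# Redo the worked example, 13 x 29 mod 11, with R = 2^4 = 16.
n, nbits = 11, 4
r2 = (1 << (2 * nbits)) % n             # R^2 mod n
t = mont_mul(13 % n, 29 % n, n, nbits)  # = 13 * 29 * R^-1 mod 11
result = mont_mul(t, r2, n, nbits)      # multiplying by R^2 cancels the R^-1
```

<p>Running this gives <code>result == 3</code>, agreeing with the decimal walkthrough. Real implementations convert numbers into "Montgomery form" once per exponentiation so that every multiplication inside the loop stays cheap.</p>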
    <div>
      <h3>Under the hood of OpenSSL</h3>
      <a href="#under-the-hood-of-openssl">
        
      </a>
    </div>
    <p>To understand how primes are used inside OpenSSL I instrumented the code to output information about all the memory allocated by OpenSSL and to highlight blocks of memory used for prime numbers. I then processed this information using a small program to produce a picture of memory.</p><p>Here's the state of the memory used by OpenSSL when running inside NGINX 1.5.13 just after it has started up. Green blocks are memory in use, black areas are unused and the two red bars are the two prime numbers p and q that form part of the RSA key.</p><p>Each row of the picture is 1,024 bytes. The picture shows a total of about 280KB of memory in use. The primes are being stored in a single location (delving into the source of OpenSSL, this is the location of the primes when loaded in the function <code>ssl_set_pkey</code> in <a href="https://github.com/openssl/openssl/blob/master/ssl/ssl_rsa.c#L190">ssl_rsa.c</a>).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2VEiEETMlSBS0q12JWFt3R/c6c3d968ffaae448d00ae7dc0f369e68/startup.png" />
            
            </figure><p>In contrast, something interesting has happened by the time NGINX/OpenSSL processes its first request. A second copy of the primes has appeared. OpenSSL is now using more memory and two long red lines (two complete primes) are in memory along with some fragments of prime numbers that have been partially overwritten (those are vulnerable to Coppersmith's Attack).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5S7P8HmimvnpSsHUlImMzN/fe5285e8bed3a9be1845fe8f3a427ed1/first.png" />
            
            </figure><p>Delving into the code once more it becomes apparent that the two new primes are copies of p and q made for the Montgomery reduction code in <a href="https://github.com/openssl/openssl/blob/master/crypto/bn/bn_mont.c">bn_mont.c</a>. The fragments of primes are left over from where p and q were used in calculations.</p><p>After ten identical requests the memory layout has changed a little more. The fragments of primes have been overwritten, but the Montgomery reduction copies are still there. And some new fragments have appeared.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4LoRhqkEerZKwaEzchIM8E/67b24a99a82bbcdac71dd32d622bcc13/ten.png" />
            
            </figure><p>After running a more realistic simulated load through the web server the memory has grown and most of the prime fragments have been overwritten leaving the original primes and the Montgomery reduction primes in memory.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5kiHxM20Td8UonaVHwuPh/ea644194c164c6c22f22f792e6d7e57e/load.png" />
            
            </figure><p>To get a sense of how OpenSSL copies the primes around, here's a slightly different visualization. It shows all the memory that stored primes in red (whether or not it was later overwritten). As you can see OpenSSL is making copies of the primes all over its memory space.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6bAcd8A5rn32nna7vi2oOt/9992cce4d2d7f7fd70c2891ade0eebe3/load-overwrite.png" />
            
            </figure><p>And therein lies the heart of Heartbleed: there are two copies of the primes that never move (the original location where they are loaded and the location of the Montgomery reduction copy) and lots of temporary copies of the primes all over OpenSSL's memory.</p>
    <div>
      <h3>OpenSSL patches</h3>
      <a href="#openssl-patches">
        
      </a>
    </div>
    <p>It's possible to clean up OpenSSL's memory so that copies of primes are not left behind by applying the following patch:</p>
<pre><code>diff -ur openssl-1.0.1g/crypto/bn/bn_lib.c openssl-1.0.1g-sanitise/crypto/bn/bn_lib.c
--- openssl-1.0.1g/crypto/bn/bn_lib.c   2014-03-17 09:14:20.000000000 -0700
+++ openssl-1.0.1g-sanitise/crypto/bn/bn_lib.c  2014-04-19 07:57:53.489932751 -0700
@@ -431,7 +431,13 @@
 	{
 	BN_ULONG *a = bn_expand_internal(b, words);
 	if(!a) return NULL;
-	if(b-&gt;d) OPENSSL_free(b-&gt;d);
+	if (b-&gt;d != NULL)
+		{
+		OPENSSL_cleanse(b-&gt;d,b-&gt;dmax*sizeof(b-&gt;d[0]));
+		OPENSSL_free(b-&gt;d);
+		}
 	b-&gt;d=a;
 	b-&gt;dmax=words;
 	}</code></pre>
<p>This patch fixes a problem where the <code>bn_expand2</code> function in <a href="https://github.com/openssl/openssl/blob/master/crypto/bn/bn_lib.c#L430">bn_lib.c</a> freed memory potentially containing a prime without erasing it first. The patch adds a call to the <code>OPENSSL_cleanse</code> function that safely erases the number from memory before freeing it. It was sent to the OpenSSL team on April 19 and independently discovered by <a href="http://www.lightbluetouchpaper.org/2014/04/25/heartbleed-and-rsa-private-keys/">Rubin Xu</a>.</p><p>This particular function is the cause of primes left in freed memory. Whenever OpenSSL needed to resize a number stored in one of its special <a href="https://github.com/openssl/openssl/blob/master/crypto/bn/bn.h#L300">BIGNUM</a> structures it would free the now too small BIGNUM without erasing the memory.</p><p>That alone is not enough to prevent Heartbleed from getting the primes as there are other copies in memory. In testing, I found that the Montgomery reduction primes were particularly easy to extract using Heartbleed messages. To prevent them from being created it's possible to disable caching of the Montgomery parameters by removing the <code>RSA_FLAG_CACHE_PRIVATE</code> flag in <a href="https://github.com/openssl/openssl/blob/master/crypto/engine/eng_rsax.c#L130">eng_rsax.c</a> (a similar flag exists in <a href="https://github.com/openssl/openssl/blob/master/crypto/rsa/rsa_eay.c#L972">rsa_eay.c</a>).</p><p>The Montgomery code also doesn't clean up memory that it frees. The following patch corrects that problem.</p>
<pre><code>diff -ur openssl-1.0.1g/crypto/bn/bn_mont.c openssl-1.0.1g-sanitise/crypto/bn/bn_mont.c
--- openssl-1.0.1g/crypto/bn/bn_mont.c  2014-03-17 09:14:20.000000000 -0700
+++ openssl-1.0.1g-sanitise/crypto/bn/bn_mont.c 2014-04-24 17:57:31.445316346 -0700
@@ -345,9 +345,9 @@
 	if(mont == NULL)
 		return;
-	BN_free(&amp;(mont-&gt;RR));
-	BN_free(&amp;(mont-&gt;N));
-	BN_free(&amp;(mont-&gt;Ni));
+	BN_clear_free(&amp;(mont-&gt;RR));
+	BN_clear_free(&amp;(mont-&gt;N));
+	BN_clear_free(&amp;(mont-&gt;Ni));
 	if (mont-&gt;flags &amp; BN_FLG_MALLOCED)
 		OPENSSL_free(mont);
 	}</code></pre>
<p>Finally, there's always the possibility that the original location the primes were in might get exposed in a Heartbleed message. To protect against that it would be necessary to change the way OpenSSL memory is allocated so that memory allocated for sensitive data (like private keys) is kept far away from the memory buffers used for messages. A <a href="http://marc.info/?l=openssl-dev&amp;m=139800505608917&amp;w=2">patch for this</a> has been submitted to OpenSSL.</p>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>A more radical solution is to not store the private keys inside OpenSSL at all. If the private key is not stored in the same process as OpenSSL it will not be accessible via a fault like Heartbleed.</p><p>CloudFlare is already testing that configuration. More on that in another blog post.</p><p>PS I am very grateful to <a href="http://matthewarcus.wordpress.com/about/">Matthew Arcus</a> for corresponding with me about this and checking some of my assertions. It was Matthew who first pointed out that the Montgomery Reduction was another source of leaked primes.</p> ]]></content:encoded>
            <category><![CDATA[TLS]]></category>
            <category><![CDATA[HTTPS]]></category>
            <category><![CDATA[OpenSSL]]></category>
            <category><![CDATA[Vulnerabilities]]></category>
            <category><![CDATA[Security]]></category>
            <guid isPermaLink="false">3ULcr0wvNmc7BydQA3uukI</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
        <item>
            <title><![CDATA[The Hidden Costs of Heartbleed]]></title>
            <link>https://blog.cloudflare.com/the-hard-costs-of-heartbleed/</link>
            <pubDate>Thu, 17 Apr 2014 10:00:00 GMT</pubDate>
            <description><![CDATA[ A quick followup to our last blog post on our decision to reissue and revoke all of CloudFlare's customers' SSL certificates. One question we've received is why we didn't just reissue and revoke all SSL certificates as soon as we got word about the Heartbleed vulnerability? ]]></description>
            <content:encoded><![CDATA[ <p>A quick followup to our <a href="/the-heartbleed-aftermath-all-cloudflare-certificates-revoked-and-reissued">last blog post</a> on our decision to reissue and revoke all of CloudFlare's customers' SSL certificates. One question we've received is why we didn't just reissue and revoke all SSL certificates as soon as we got word about the Heartbleed vulnerability? The answer is that the revocation process for SSL certificates is far from perfect and imposes a significant cost on the Internet's infrastructure.</p><p>Today, after having done a mass reissuance and revocation, we have a tangible sense of that cost. To understand it, you need to understand a bit about how your browser checks if an SSL certificate has been revoked.</p>
    <div>
      <h4>OCSP &amp; CRL</h4>
      <a href="#ocsp-crl">
        
      </a>
    </div>
    <p>When most browsers visit web pages over HTTPS they perform a check using one of two certificate revocation methods: Online Certificate Status Protocol (OCSP) or Certificate Revocation List (CRL). For OCSP, the browser pings the certificate authority and asks whether a particular site's certificate has been revoked. For CRL, the browser pings the certificate authority (CA) and downloads a complete list of all the certificates that have been revoked by that CA.</p><p>There are pluses and minuses to both systems. OCSP imposes a lighter bandwidth cost, but a higher number of requests and backend lookups. CRL doesn't generate as many requests, but, as the CRL becomes large, can impose a significant bandwidth burden. These costs are borne by visitors to websites, whose experience will be slower as a result, but even more so by the CAs who need significant resources in place to handle these requests.</p>
    <div>
      <h4>Technical Costs of Revocation</h4>
      <a href="#technical-costs-of-revocation">
        
      </a>
    </div>
    <p>Yesterday, CloudFlare completed the process of reissuing all the SSL certificates we manage for our customers. Once that was complete, we revoked all previously used certificates. You can see the <a href="https://isc.sans.edu/crls.html">spike in global CRL activity we generated:</a></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3CNjOM7MCaMDYxuCGiSmWL/84c419b302d21af761426202bb4904fa/2.png" />
            
            </figure><p>What you can't see is the spike in bandwidth that imposed. Globalsign, who is CloudFlare's primary CA partner, saw their CRL grow to approximately 4.7MB in size from approximately 22KB on Monday. The activity of browsers downloading the Globalsign CRL generated around 40Gbps of net new traffic across the Internet. If you assume that the global average price for bandwidth is around $10/Mbps, just supporting the traffic to deliver the CRL would have added $400,000 USD to Globalsign's monthly bandwidth bill.</p><p>Lest you think that’s an overestimate, to make the total costs more accurate, we ran the numbers using AWS’s CloudFront price calculator using a mix of traffic across regions that approximates what we see at CloudFlare. The total cost to Globalsign, if they were using AWS’s infrastructure, would be at least $952,992.40/month. Undoubtedly they’d give some additional discounts above the pricing they list publicly, but any way you slice it the costs are significant.</p>
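<p>The arithmetic behind that first estimate is easy to check (the $10/Mbps transit price is the assumption quoted above):</p>

```python
# Back-of-the-envelope check of the CRL bandwidth bill quoted above.
crl_traffic_mbps = 40 * 1000       # 40 Gbps of CRL downloads, expressed in Mbps
price_per_mbps_month = 10          # assumed global average, USD per Mbps per month
monthly_cost = crl_traffic_mbps * price_per_mbps_month
print(monthly_cost)                # 400000
```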
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5YmAaiDr57652FzPmoRra9/f5569bf2ee96800507060561c2c59897/3.png" />
            
            </figure><p>Beyond the cost, many CAs are not set up to handle this increased load. Revoking SSL certificates threatens to create a sort of denial of service attack on their own infrastructure. Thankfully, CloudFlare helps <a href="/cloudflare-works-with-globalsign-to-make-ssl">power Globalsign's OCSP and CRL infrastructure</a>. We were able to bear the brunt of the load, allowing us to move forward with revocation as quickly as we did. And, no, we didn’t charge them anything extra.</p><p>So, if you're wondering why some people are dragging their feet on mass certificate revocation, now you know why: it imposes a real cost. And if you're a CA who's wondering what you're going to do when you inevitably have to revoke all the certs you've issued over the last year, <a href="http://www.cloudflare.com/enterprise-service-request">we're happy to help</a>.</p> ]]></content:encoded>
            <category><![CDATA[OCSP]]></category>
            <category><![CDATA[HTTPS]]></category>
            <category><![CDATA[Vulnerabilities]]></category>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[SSL]]></category>
            <category><![CDATA[OpenSSL]]></category>
            <category><![CDATA[Security]]></category>
            <guid isPermaLink="false">2txTmq7enKYTm52KaMZyPr</guid>
            <dc:creator>Matthew Prince</dc:creator>
        </item>
        <item>
            <title><![CDATA[The Heartbleed Aftermath: all CloudFlare certificates revoked and reissued]]></title>
            <link>https://blog.cloudflare.com/the-heartbleed-aftermath-all-cloudflare-certificates-revoked-and-reissued/</link>
            <pubDate>Thu, 17 Apr 2014 00:44:00 GMT</pubDate>
            <description><![CDATA[ Eleven days ago the Heartbleed vulnerability was publicly announced. Last Friday, we issued the CloudFlare Challenge: Heartbleed and simultaneously started the process of revoking and reissuing all the SSL certificates. ]]></description>
            <content:encoded><![CDATA[ <p>Eleven days ago the Heartbleed vulnerability was publicly announced.</p><p>Last Friday, we issued the <a href="/answering-the-critical-question-can-you-get-private-ssl-keys-using-heartbleed">CloudFlare Challenge: Heartbleed</a> and simultaneously started the process of revoking and reissuing all the SSL certificates that CloudFlare manages for our customers.</p><p>That process is now complete. We have revoked and reissued every single certificate we manage and all the certificates we use.</p><p>That mass revocation showed up dramatically on the ISC's <a href="https://isc.sans.edu/crls.html">CRL activity chart</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6yfntI4FqlbLdPJthNbSiB/60e1914356a0fa4a376bb3b0b2b9dd07/Screen_Shot_2014-04-17_at_10.53.46.png" />
            
            </figure><p>Customers who use custom certificates are encouraged to rekey, upload their new certificates, and revoke the previous custom certificates.</p>
    <div>
      <h3>Background</h3>
      <a href="#background">
        
      </a>
    </div>
    <p>We <a href="/staying-ahead-of-openssl-vulnerabilities">announced</a> last Monday that we had patched a bug in OpenSSL and our customers were safe. We did not know then that CloudFlare was among the few to whom the bug was disclosed before the public announcement. In fact, we did not even know the bug's name. At that time we had simply removed TLS heartbeat functionality completely from OpenSSL by recompiling with the <code>OPENSSL_NO_HEARTBEATS</code> option.</p><p>After learning the full extent of the bug and that it had been live on the Internet for two years, we started an investigation to see whether our private keys and those of our customers were at risk.</p><p>We started our investigation by attempting to see what sort of information we could get through Heartbleed. We set up a test server on a local machine and bombarded it with Heartbleed attacks, saving the blocks of memory it returned. We scanned that memory for copies of the private key and, after extensive scanning, we could not find a trace of it. Still, we were not fully convinced that it was impossible, so we decided to crowdsource the problem by setting up the <a href="https://www.cloudflarechallenge.com/heartbleed">CloudFlare Heartbleed Challenge</a>.</p><p>Nine hours into the challenge, the first instance of the key <a href="https://twitter.com/indutny/status/454761620259225600">was revealed on Twitter</a>. Over the five days of the challenge we saw over 75 million Heartbleed attacks against our server and 8,000 submissions.</p><p>The winners of the challenge used two different techniques to obtain the private key.</p>
    <div>
      <h3>How the private key leaks</h3>
      <a href="#how-the-private-key-leaks">
        
      </a>
    </div>
    <p>From <a href="https://github.com/openssl/openssl/blob/master/crypto/rsa/rsa_eay.c">looking at the code</a> we knew that some pieces of the key are duplicated and manipulated during the computation of the private key operation. Following the code down into <a href="https://github.com/openssl/openssl/blob/master/crypto/bn/bn_div.c#L240">bn_div.c</a>, we find that it makes a copy of the divisor (in a modulus operation):</p><pre><code>/* First we normalise the numbers */
norm_shift=BN_BITS2-((BN_num_bits(divisor))%BN_BITS2);
if (!(BN_lshift(sdiv,divisor,norm_shift))) goto err;</code></pre><p>The value stored in <code>sdiv</code> remains in memory until something overwrites it. The divisor is one of the two primes p and q needed for RSA; more on that below. Here's a diagram showing memory inside OpenSSL after 26 HTTPS requests (of varying sizes) have been sent through. Memory in black is unused, green is in use, red contains private keys and yellow contains the remnants of private keys.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5dpnaj0pPq5oAQcU254Alx/0d119b94a67a25a9c643a0b3209992ce/Screen_Shot_2014-04-17_at_15.04.27.png" />
            
            </figure><p>The remnants are what enabled Heartbleed to leak private keys. A great deal of the green (in use) memory is being used by OpenSSL as buffers (memory where requests and responses are handled). If some of that green memory near the yellow key remnants is used for a Heartbleed request, the yellow memory may be copied into the Heartbleed response and the key may leak.</p><p>Here's what that looks like in memory:</p><pre><code>(gdb) x/128xb 0xbd1560
0xbd1560:   0xe0    0x09    0xbd    0x00    0x00    0x00    0x00    0x00
0xbd1568:   0xf0    0x34    0xbb    0x00    0x00    0x00    0x00    0x00
0xbd1570:   0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
0xbd1578:   0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
0xbd1580:   0x0f    0x47    0x67    0xe0    0x03    0x11    0x56    0x7d
0xbd1588:   0x1c    0xdf    0x92    0x9d    0xfe    0x41    0x59    0xfa
0xbd1590:   0x6e    0x2a    0x99    0x18    0x5f    0x1d    0x2a    0x48
0xbd1598:   0x99    0xaf    0x31    0x68    0x1e    0x36    0xc8    0x75
0xbd15a0:   0x3f    0xe2    0x29    0x36    0x53    0xb7    0x8e    0x8c
0xbd15a8:   0xa2    0x9a    0x9b    0x39    0x3c    0x18    0x94    0x1a
0xbd15b0:   0x0e    0x92    0xd5    0x1b    0x60    0xc7    0x13    0x60
0xbd15b8:   0x8a    0xb9    0x0b    0x31    0x95    0x6d    0x46    0x19
0xbd15c0:   0x40    0xa9    0x89    0x4e    0xc4    0x6d    0x26    0x74
0xbd15c8:   0x77    0xfe    0x5b    0x90    0xb1    0x5d    0xa9    0x79
0xbd15d0:   0xce    0xed    0x52    0x4d    0xbe    0x17    0xcc    0x37
0xbd15d8:   0x87    0xcd    0xfa    0xd9    0x4e    0x3d    0xb1    0x55</code></pre><p>The data from one of the primes is in the middle of that block of memory.</p><p>A nagging question is why these chunks of keys are being found in memory when OpenSSL has functions to cleanse it. We are continuing to investigate and, if a bug is found, will submit a patch to OpenSSL.</p><p>The more HTTPS traffic the server serves, the more likely some of these intermediate values end up on the heap where Heartbleed can read them. Unfortunately for us, our test server was on our local machine and did not have a large amount of HTTPS traffic. This made it a lot less likely that we would find private keys in our experimental setting. The CloudFlare Challenge site was serving a lot of traffic, making key extraction more likely. There were some other tricks we learned from the winners about efficiently finding keys in memory. To explain these, we will go a bit into how RSA works.</p>
    <div>
      <h3>The RSA Cryptosystem</h3>
      <a href="#the-rsa-cryptosystem">
        
      </a>
    </div>
    <p>The <a href="http://en.wikipedia.org/wiki/RSA_(cryptosystem)#Key_generation">RSA cryptosystem</a> is based on picking two prime numbers (which are habitually called p and q) and then using various mathematical properties of prime numbers to create a secure system. First a number n (called the modulus) is calculated. It is just p x q.</p><p>The other two vital values in RSA are called exponents and typically called e and d. In common RSA cryptosystems the number e is 65537. The public key is (n, e) (i.e. the modulus and the exponent e), the private key is (n, d) (i.e. the modulus and the exponent d). Of course, the prime numbers p and q also have to be kept secret because everything depends on them.</p><p>In textbook RSA, you can perform the private key operation with only d and n, but it's slow.</p><p>In OpenSSL, RSA’s private key operation also uses the primes p and q to speed up the private key operation using <a href="http://en.wikipedia.org/wiki/Montgomery_reduction">Montgomery arithmetic</a> and other tricks. If either p or q is found, it can be used to derive the private key d. Searching for p and q was the approach taken by most of the successful contestants of the challenge.</p><p>The most common trick for identifying the prime factors from random data was to take all blocks of data that match the size of a prime factor (128 bytes in the case of our 2,048-bit RSA key) and convert them to a number. Some contestants did a primality check to see if the number was prime, and some tried to divide the number into the public modulus n. If the number turned out to be a prime factor, it could be used to extract the private key d. This was the approach taken by Fedor Indutny as published on <a href="https://blog.indutny.com/9.heartbleed">his blog</a>.</p><p>The second and slightly more elegant approach was taken by Rubin Xu. He started by attempting to find prime factors but quickly gave up on finding them. 
He then decided to use <a href="http://en.wikipedia.org/wiki/Coppersmith's_Attack">Coppersmith's Attack</a> to derive the private exponent. In this attack, only pieces of the private key are needed and the private key is reconstructed mathematically. This attack is so efficient that he managed to extract the private key from only the first 50 dumps!</p>
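    <p>As a rough sketch of the factor-scanning technique (illustrative Python, not any contestant's actual code; the helper names are invented here): slide a window the size of one RSA prime across a leaked memory dump, interpret it as a big-endian integer, and test whether it divides the public modulus n. Textbook-sized numbers keep the example self-contained; a real 2,048-bit key has 128-byte primes, so real scans use 128-byte windows.</p>

```python
# Sketch of the factor-scanning approach described above (illustrative; not
# any contestant's actual code, and the helper names are invented here).

def find_prime_factor(dump, n, prime_bytes):
    """Scan a memory dump for a window holding a prime factor of n."""
    for i in range(len(dump) - prime_bytes + 1):
        candidate = int.from_bytes(dump[i:i + prime_bytes], "big")
        if 1 < candidate < n and n % candidate == 0:
            return candidate
    return None

def derive_private_exponent(n, e, p):
    """Given one leaked prime p, recover the RSA private exponent d."""
    q = n // p
    phi = (p - 1) * (q - 1)
    return pow(e, -1, phi)        # modular inverse (Python 3.8+)

# Demo with the textbook key p=61, q=53, e=17 (so n=3233, d=2753).
dump = bytes([0x07, 0xC8, 61, 0x63])   # "leaked memory" containing the prime 61
p = find_prime_factor(dump, 3233, 1)   # a 1-byte window suffices at this scale
d = derive_private_exponent(3233, 17, p)
assert pow(pow(42, 17, 3233), d, 3233) == 42   # recovered key round-trips
```

    <p>Either prime works: once one is known, the other falls out by division, and the private exponent follows from the modular inverse of e.</p>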
    <div>
      <h3>The Winners</h3>
      <a href="#the-winners">
        
      </a>
    </div>
    <p>Here are the winners of the challenge in order of when they found the private key:</p><ol><li><p>Fedor Indutny (<a href="http://twitter.com/indutny">@indutny</a>) Developer</p></li><li><p>Ilkka Mattila, Information Security Adviser</p></li><li><p>Rubin Xu (<a href="http://twitter.com/xurubin">@xurubin</a>), Security PhD Student</p></li><li><p>Ben Murphy (<a href="http://twitter.com/benmmurphy">@benmmurphy</a>), Security Researcher</p></li><li><p>Steve Hunter (<a href="http://twitter.com/nonaxiomatic">@nonaxiomatic</a>)</p></li><li><p>Xavier Martin (<a href="http://twitter.com/xav">@xav</a>), Security Researcher</p></li><li><p>no name given</p></li><li><p>Jeremi Gosney (<a href="http://twitter.com/jmgosney">@jmgosney</a>), CEO, Stricture Group</p></li><li><p>Michele Guerini Rocco (<a href="http://twitter.com/Rnhmjoj">@Rnhmjoj</a>), Student</p></li><li><p>David Gervais (<a href="http://twitter.com/davidgervais">@davidgervais</a>), Software Engineer</p></li><li><p>Christian Bürgi (<a href="http://twitter.com/buergich">@buergich</a>)</p></li><li><p>Daniel Burkard (<a href="http://twitter.com/hiptomcat">@hiptomcat</a>)</p></li><li><p>mancha</p></li><li><p>mwgamera</p></li></ol><p>A big congratulations goes out to all of them!</p>
    <div>
      <h3>The Findings</h3>
      <a href="#the-findings">
        
      </a>
    </div>
    <p>Based on these findings, we believe that within two hours a dedicated attacker could retrieve a private key from a vulnerable server. Since the allocation of temporary key material is done by OpenSSL itself, and is not special to NGINX, we expect these attacks to work on different server software including non-web servers that use OpenSSL.</p><p>All administrators running servers using vulnerable versions of OpenSSL should patch OpenSSL with version 1.0.1g (or later) and also reissue and revoke all their private keys. As mentioned above, CloudFlare has now done this for all the customers where we manage SSL on their behalf.</p><p>As a follow up to the Heartbleed bug and the CloudFlare Challenge, I hosted a webinar with Ben Murphy from Fonix. Ben was one of the winners in the CloudFlare Heartbleed challenge. Ben and I answered many questions; a recording of the webinar is posted on <a href="https://www.youtube.com/user/CloudFlare">CloudFlare's YouTube channel</a> and can be seen below.</p> ]]></content:encoded>
            <category><![CDATA[TLS]]></category>
            <category><![CDATA[HTTPS]]></category>
            <category><![CDATA[OpenSSL]]></category>
            <category><![CDATA[Vulnerabilities]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Cryptography]]></category>
            <guid isPermaLink="false">2rRfedSZIXoTgc00YzTmIX</guid>
            <dc:creator>Nick Sullivan</dc:creator>
        </item>
        <item>
            <title><![CDATA[Answering the Critical Question: Can You Get Private SSL Keys Using Heartbleed?]]></title>
            <link>https://blog.cloudflare.com/answering-the-critical-question-can-you-get-private-ssl-keys-using-heartbleed/</link>
            <pubDate>Fri, 11 Apr 2014 02:27:00 GMT</pubDate>
            <description><![CDATA[ Below is what we thought as of 12:27pm UTC. To verify our belief we crowdsourced the investigation. It turns out we were wrong. While it takes effort, it is possible to extract private SSL keys. ]]></description>
            <content:encoded><![CDATA[
    <div>
      <h3>Update:</h3>
      <a href="#update">
        
      </a>
    </div>
    <p><i>Below is what we thought as of 12:27pm UTC. To verify our belief we crowdsourced the investigation. It turns out we were wrong. While it takes effort, it is possible to extract private SSL keys. The challenge was solved by Software Engineer </i><a href="https://twitter.com/indutny"><i>Fedor Indutny</i></a><i> and Ilkka Mattila at NCSC-FI roughly 9 hours after the challenge was first published. Fedor sent 2.5 million requests over the course of the day and Ilkka sent around 100K requests. Our recommendation based on this finding is that everyone reissue and revoke their private keys. CloudFlare has accelerated this effort on behalf of the customers whose SSL keys we manage. </i><a href="/the-results-of-the-cloudflare-challenge"><i>You can read more here</i></a><i>.</i></p><p>The widely-used open source library OpenSSL revealed on Monday it had a major bug, now known as “Heartbleed”. By sending a specially crafted packet to a vulnerable server running an unpatched version of OpenSSL, an attacker can get up to 64kB of the server’s working memory. This is the result of a classic implementation bug known as a <a href="https://www.owasp.org/index.php/Buffer_over-read">buffer over-read</a>.</p><p>There has been speculation that this <a href="https://www.cloudflare.com/the-net/oss-attack-detection/">vulnerability</a> could expose server certificate private keys, making those sites vulnerable to impersonation. This would be the disaster scenario, requiring virtually every service to reissue and revoke its <a href="https://www.cloudflare.com/application-services/products/ssl/">SSL certificates</a>. Note that simply reissuing certificates is not enough; you must revoke them as well.</p><p>Unfortunately, the certificate revocation process is <a href="http://news.netcraft.com/archives/2013/05/13/how-certificate-revocation-doesnt-work-in-practice.html">far from perfect</a> and was never built for revocation at mass scale. 
If every site revoked its certificates, it would impose a significant burden and performance penalty on the Internet. At CloudFlare scale the reissuance and revocation process could break the CA infrastructure. So, we’ve spent a significant amount of time talking to our CA partners in order to ensure that we can safely and successfully revoke and reissue our customers' certificates.</p><p>While the vulnerability seems likely to put private key data at risk, to date there have been no verified reports of actual private keys being exposed. At CloudFlare, we received early warning of the Heartbleed vulnerability and patched our systems 12 days ago. We’ve spent much of the time running extensive tests to figure out what can be exposed via Heartbleed and, specifically, to understand if private SSL key data was at risk.</p><p>Here’s the good news: after extensive testing on our software stack, we have been unable to successfully use Heartbleed on a vulnerable server to retrieve any private key data. Note that is not the same as saying it is impossible to use Heartbleed to get private keys. We do not yet feel comfortable saying that. However, if it is possible, it is at a minimum very hard. And, we have reason to believe based on the data structures used by OpenSSL and the modified version of NGINX that we use, that it may in fact be impossible.</p><p>To get more eyes on the problem, we have created a site so the world can challenge this hypothesis:</p><p><a href="https://www.cloudflarechallenge.com/heartbleed"><b>CloudFlare Challenge: Heartbleed</b></a></p><p>This site was created by CloudFlare engineers to be intentionally vulnerable to heartbleed. It is not running behind CloudFlare’s network. We encourage everyone to attempt to get the private key from this website. 
If someone is able to steal the private key from this site using heartbleed, we will post the full details here.</p><p>While we believe it is unlikely that private key data was exposed, we are proceeding with an abundance of caution. We’ve begun the process of reissuing and revoking the keys CloudFlare manages on behalf of our customers. In order to ensure that we don’t overburden the certificate authority resources, we are staging this process. We expect that it will be complete by early next week.</p><p>In the meantime, we’re hopeful we can get more assurance that SSL keys are safe through our crowd-sourced effort to hack them. To get everyone started, we wanted to outline the process we’ve embarked on to date in order to attempt to hack them.</p>
    <div>
      <h3>The bug</h3>
      <a href="#the-bug">
        
      </a>
    </div>
    <p>A heartbeat is a message that is sent to the server just so the server can send it back. This lets a client know that the server is still connected and listening. The heartbleed bug was a mistake in the implementation of the response to a heartbeat message.</p><p>Here is the offending code:</p><pre><code>p = &amp;s-&gt;s3-&gt;rrec.data[0]

[...]

hbtype = *p++;
n2s(p, payload);
pl = p;

[...]

buffer = OPENSSL_malloc(1 + 2 + payload + padding);
bp = buffer;

[...]

memcpy(bp, pl, payload);</code></pre><p>The incoming message is stored in a structure called <code>rrec</code>, which contains the incoming request data. The code reads the type (finding out that it's a heartbeat) from the first byte, then reads the next two bytes, which indicate the length of the heartbeat payload. In a valid heartbeat request, this length matches the length of the payload sent in the request.</p><p>The major problem (and the cause of heartbleed) is that the code does not check that this claimed length matches the actual length of the heartbeat request, allowing the request to ask for more data than it should be able to retrieve. The code then copies the amount of data indicated by the length from the incoming message to the outgoing message. If the length is longer than the incoming message, the software just keeps copying data past the end of the message. Since the length variable is 16 bits, you can request up to 65,535 bytes from memory. The data that lives past the end of the incoming message is from a kind of no-man’s land that the program should not be accessing and may contain data left behind from other parts of OpenSSL.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2cb01fhCapiWLF2XGkA8PE/b6d8d9ce2c3b9f685a700ff50e775799/blog-illustration-heartbleed.png" />
            
            </figure><p>When processing a request that contains a longer length than the request payload, some of this unknown data is copied into the response and sent back to the client. This extra data can contain sensitive information like session cookies and passwords, as we describe in the next section.</p><p>The fix for this bug is simple: check that the length of the message actually matches the length of the incoming request. If it is too long, return nothing. That’s exactly what the OpenSSL patch does.</p>
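    <p>A toy simulation (written for this post's explanation, not OpenSSL's actual code) makes the over-read and the fix concrete: the broken handler trusts the attacker-supplied length field, while the patched handler compares it against the size of the record that actually arrived.</p>

```python
# Toy simulation of the Heartbleed over-read. "memory" models the heap: the
# heartbeat record followed by unrelated leftover data. Byte 0 is the message
# type; bytes 1-2 are the attacker-supplied payload length, big-endian.

def heartbeat_response(memory, record_len, patched):
    claimed = int.from_bytes(memory[1:3], "big")
    # The fix rejects requests whose claimed payload, plus the 1-byte type,
    # 2-byte length field, and 16 bytes of minimum padding, exceeds the
    # record that actually arrived.
    if patched and 1 + 2 + claimed + 16 > record_len:
        return None  # silently discard the malformed heartbeat
    return memory[3:3 + claimed]  # over-reads past the record if claimed is too big

secrets = b"Cookie: session=s3cr3t"                    # leftover heap data
record = b"\x01" + (100).to_bytes(2, "big") + b"hi"    # claims 100 bytes, sends 2
memory = record + secrets

assert secrets in heartbeat_response(memory, len(record), patched=False)
assert heartbeat_response(memory, len(record), patched=True) is None
```

    <p>The unpatched path copies whatever happens to sit past the end of the record into the response, which is exactly the leak described above.</p>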
    <div>
      <h3>Malloc and the Heap</h3>
      <a href="#malloc-and-the-heap">
        
      </a>
    </div>
    <p>So what sort of data can live past the end of the request? The technical answer is “heap data,” but the more realistic answer is that it’s platform dependent.</p><p>On most computer systems, each process has its own set of working memory. Typically this is split into two data structures: the stack and the heap. This is the case on Linux, the operating system that CloudFlare runs on its servers.</p><p>The memory address with the highest value is where the stack data lives. This includes local working variables and non-persistent data storage for running a program. The lowest portion of the address space typically contains the program’s code, followed by static data needed by the program. Right above that is the heap, where all dynamically allocated data lives.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6HWKap7AiqX5LqnPsONfIl/965c80d25eecde1ac67c6f761331f819/image00.gif" />
            
            </figure><p>Managing data on the heap is done with the library calls <code>malloc</code> (used to get memory) and <code>free</code> (used to give it back when no longer needed). When you call <code>malloc</code>, the program picks some unused space in the heap area and returns the address of the first part of it to you. Your program is then able to store data at that location. When you call <code>free</code>, the memory space is marked as unused. In most cases, the data that was stored in that space is just left there unmodified.</p><p>Every new allocation needs some unused space from the heap. Typically this is chosen to be at the lowest possible address that has enough room for the new allocation. A heap typically grows upwards: later allocations get higher addresses, unless a big early block is freed and reused.</p><p>This is of direct relevance because both the incoming message request (<code>s-&gt;s3-&gt;rrec.data</code>) and the certificate private key are allocated on the heap with <code>malloc</code>. The exploit reads data from the address of the incoming message. For previous requests that were allocated and freed, their data (including passwords and cookies) may still be in memory. If they are stored less than 65,536 bytes higher in the address space than the current request, the details can be revealed to an attacker.</p><p>Requests come and go, recycling memory near the top of the heap. This makes extracting previous request data very likely with this attack, which is important in understanding what you can and cannot get at using the vulnerability. Previous requests could contain password data, cookies or other exploitable data. Private keys are a different story, due to the way the heap is structured. The good news is this means that it is much less likely private SSL keys would be exposed.</p>
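    <p>The allocator behaviour described above can be mimicked with a toy first-fit heap (a deliberate simplification written for this explanation, not how glibc's malloc is actually implemented): freeing a block only marks it reusable without erasing it, and fresh allocations grow upward past everything allocated earlier, such as a key loaded at startup.</p>

```python
# Toy first-fit heap illustrating why freed request data lingers: free()
# only marks a block unused, it does not erase it, and new allocations land
# above a block (like a private key) that was allocated at startup.

class ToyHeap:
    def __init__(self, size):
        self.mem = bytearray(size)
        self.blocks = []          # list of (offset, size, in_use)

    def malloc(self, size):
        # First fit: reuse the lowest free block that is big enough.
        for i, (off, sz, used) in enumerate(self.blocks):
            if not used and sz >= size:
                self.blocks[i] = (off, sz, True)
                return off
        off = sum(sz for _, sz, _ in self.blocks)   # otherwise grow upward
        self.blocks.append((off, size, True))
        return off

    def free(self, off):
        for i, (o, sz, _) in enumerate(self.blocks):
            if o == off:
                self.blocks[i] = (o, sz, False)     # data left untouched
                return

    def write(self, off, data):
        self.mem[off:off + len(data)] = data

heap = ToyHeap(256)
key = heap.malloc(16)             # private key loaded first: lowest address
heap.write(key, b"PRIVATE-KEY-0123")
req = heap.malloc(32)             # request buffers land above the key
heap.write(req, b"password=hunter2" + b"\x00" * 16)
heap.free(req)                    # request done; its bytes are not cleared

assert req > key                  # requests sit above the early-loaded key
assert b"hunter2" in heap.mem     # freed request data is still readable
```

    <p>Because the over-read starts at a request buffer and only reads upward, it is far more likely to hit recycled request data than the key sitting below it.</p>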
    <div>
      <h3>Read up, not down</h3>
      <a href="#read-up-not-down">
        
      </a>
    </div>
    <p>In NGINX, the keys are loaded immediately when the process is started, which puts the keys very low in the memory space. This makes it unlikely that incoming requests will be allocated with a lower address space. We tested this experimentally.</p><p>We modified our test version of NGINX to print out the location in memory of each request (<code>s-&gt;s3-&gt;rrec.data</code>), whenever there was an incoming heartbeat. We compared this to the location in memory where the private key is stored and found that we could never get a request to be at a lower address than our private keys regardless of the number of requests we sent. Since the exploit only reads higher addresses, it could not be used to obtain private keys.</p><p>Here is a video of what searching for private keys looks like:</p><p>If NGINX is reloaded, it starts a new process and loads the keys right away, putting them at a low address. Getting a request to be allocated even lower in the memory space than the early-loaded keys is very unlikely.</p><p>We not only checked the location of the private keys, we wrote a tool to repeatedly extract extra data and write the results to file for analysis. We searched through gigabytes of these responses for private key information but did not find any. The most interesting things we found related to certificates were the occasional copy of the public certificate (from a previous output buffer) and some NGINX configuration data. 
However, the private keys were nowhere to be found.</p><p>To get an idea of what is happening inside the heap used by OpenSSL inside NGINX, we wrote another tool to create a graphic showing the location of private keys (red pixels), memory that has never been used (black), memory that has been used but is now sitting idle because of a call to free (blue), and memory that is in use (green).</p><p>This picture shows the state of the heap memory (from left to right) immediately after NGINX has loaded and has yet to serve a request, after a single request, after two requests, and after millions of requests. As described above, the critical thing to note is that when the first request is made, new memory is allocated far beyond the place where the private key is stored. Each 2x2 pixel square represents a single byte of memory; each row is 256 bytes.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4kyjDrvwVyh42FxEm4R0jg/3efd8109ad24e6113734dc280855fa56/Screen_Shot_2014-04-11_at_12.45.44_1.png" />
            
            </figure><p>Eagle-eyed readers will have noticed a block of memory that was allocated at a lower memory location than the private key. That's true. We looked into it and it is not being used to store the heartbleed (or other) TLS packet data. And it is much more than 64k away from the private key.</p>
    <div>
      <h3>What can you get?</h3>
      <a href="#what-can-you-get">
        
      </a>
    </div>
    <p>We said above that it's possible to get sensitive data from HTTP and TLS requests that the server has handled, even if the private key looks inaccessible.</p><p>Here, for example, is a dump showing some HTTP headers from a previous request to a running NGINX server. These headers would have been transmitted securely over HTTPS but Heartbleed means that an attacker can read them. That’s a big problem because the headers might contain login credentials or a cookie.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2GZ7K9jJ6nAKaFzBvmyZuZ/edbac730a324c4983d126938136887a8/image04_3.png" />
            
            </figure><p>And here’s a copy of the public part of a certificate (as would be sent as part of the TLS/SSL handshake) sitting in memory and readable. Since it’s public this is not in itself dangerous -- by design, you can get the public key of a website even without the vulnerability and doing so does not create risk due to the nature of public/private key cryptography.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4ldHLent9feNqC46j7fgYH/fa83d90b96c4a065ed69ad3f5d912e18/image03_3.png" />
            
            </figure><p>We have not fully ruled out the possibility, albeit slim, that some early elements of the heap get reused when NGINX is restarted. In theory, the old memory of the previous process might be available to a newly restarted NGINX. However, after extensive testing, we have not been able to reproduce this situation with an NGINX server on Linux. If a private key is available, it is most likely only available on the first request after restart. After that the chance that the memory is still available is extremely low.</p><p>There have been <a href="https://twitter.com/kennwhite/status/453944475459805184">reports of a private key being stolen</a> from Apache servers, but only on the first request. This fits with our hypothesis that restarting a server may cause the key to be revealed briefly. Apache also creates some special data structures in order to load private keys that are encrypted with a passphrase, which may make it more likely for private keys to appear in the vulnerable portion of the heap.</p><p>At CloudFlare we do not restart our NGINX instances very often, so the likelihood that an attacker hit our server with this exploit on the first request after a restart is extremely low. Even if they did, the likelihood of seeing private key material on that request is very low. Moreover, NGINX, which is what CloudFlare’s system is based on, does not create the same special structures for HTTPS processing, making it less likely keys would ever appear in a vulnerable portion of the heap.</p>
    <div>
      <h3>Conclusions</h3>
      <a href="#conclusions">
        
      </a>
    </div>
    <p>We think that stealing private keys from most NGINX servers is at least extremely hard and, likely, impossible. Even with Apache, which we think may be slightly more vulnerable (and which we do not use at CloudFlare), we believe the likelihood of private SSL keys being revealed with the Heartbleed vulnerability is very low. That’s about the only good news of the last week.</p><p>We want others to test our results, so we created the <a href="https://www.cloudflarechallenge.com/heartbleed">Heartbleed Challenge</a>. Aristotle struggled with the problem of disproving the existence of something that doesn’t exist. You can’t prove a negative, so through experimental results we will never be absolutely sure there’s not a condition we haven’t tested. However, the more eyes we get on the problem, the more confident we will be that, in spite of a number of other ways the Heartbleed vulnerability was extremely bad, we may have gotten lucky and been spared the worst of the potential consequences.</p><p>That said, we’re proceeding assuming the worst. With respect to private keys held by CloudFlare, we patched the vulnerability before the public had knowledge of it, making it unlikely that attackers were able to obtain private keys. Still, to be safe, as outlined at the beginning of this post, we are executing on a plan to reissue and revoke potentially affected certificates, including the cloudflare.com certificate.</p><p>Vulnerabilities like this one are challenging because people have imperfect information about the risks they pose. It is important that the community works together to identify the real risks and work towards a safer Internet. We’ll monitor the results of the Heartbleed Challenge and immediately publicize results that challenge any of the above. I will be giving a webinar about this topic next week with updates.</p><p>You can register for that <a href="https://cc.readytalk.com/r/it1914v5pbc0&amp;eom">here</a>.</p> ]]></content:encoded>
            <category><![CDATA[Vulnerabilities]]></category>
            <category><![CDATA[Bugs]]></category>
            <category><![CDATA[SSL]]></category>
            <category><![CDATA[OpenSSL]]></category>
            <category><![CDATA[Reliability]]></category>
            <guid isPermaLink="false">7CRHqE4p5vmUjnwj1ujrbO</guid>
            <dc:creator>Nick Sullivan</dc:creator>
        </item>
        <item>
            <title><![CDATA[Staying ahead of OpenSSL vulnerabilities]]></title>
            <link>https://blog.cloudflare.com/staying-ahead-of-openssl-vulnerabilities/</link>
            <pubDate>Mon, 07 Apr 2014 09:00:00 GMT</pubDate>
            <description><![CDATA[ Today a new vulnerability was announced in OpenSSL 1.0.1 that allows an attacker to reveal up to 64kB of memory to a connected client or server (CVE-2014-0160). We fixed this vulnerability last week before it was made public.  ]]></description>
            <content:encoded><![CDATA[ <p>Today a new vulnerability was announced in OpenSSL 1.0.1 that allows an attacker to reveal up to 64kB of memory to a connected client or server (<a href="http://www.openssl.org/news/vulnerabilities.html#2014-0160">CVE-2014-0160</a>). We fixed this vulnerability last week before it was made public. All sites that use CloudFlare for SSL have received this fix and are automatically protected.</p><p>OpenSSL is the core cryptographic library CloudFlare uses for SSL/TLS connections. If your site is on CloudFlare, every connection made to the HTTPS version of your site goes through this library. As one of the largest deployments of OpenSSL on the Internet today, CloudFlare has a responsibility to be vigilant about fixing these types of bugs before they go public and attackers start exploiting them, putting our customers at risk.</p><p>We encourage everyone else running a server that uses OpenSSL to upgrade to version 1.0.1g to be protected from this vulnerability. For previous versions of OpenSSL, re-compiling with the <code>OPENSSL_NO_HEARTBEATS</code> flag enabled will protect against this vulnerability. OpenSSL 1.0.2 will be fixed in 1.0.2-beta2.</p><p>This bug fix is a successful example of what is called responsible disclosure. Instead of disclosing the vulnerability to the public right away, the people notified of the problem tracked down the appropriate stakeholders and gave them a chance to fix the vulnerability before it went public. This model helps keep the Internet safe. A big thank you goes out to our partners for disclosing this vulnerability to us in a safe, transparent, and responsible manner. We will announce more about our responsible disclosure policy shortly.</p><p>Just another friendly reminder that CloudFlare is on top of things and making sure your sites stay as safe as possible.</p> ]]></content:encoded>
            <category><![CDATA[TLS]]></category>
            <category><![CDATA[Bugs]]></category>
            <category><![CDATA[OpenSSL]]></category>
            <category><![CDATA[Vulnerabilities]]></category>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[SSL]]></category>
            <guid isPermaLink="false">1DeIj3hZtGDLAbL3EpBkp4</guid>
            <dc:creator>Nick Sullivan</dc:creator>
        </item>
        <item>
            <title><![CDATA[Killing RC4 (softly)]]></title>
            <link>https://blog.cloudflare.com/killing-rc4/</link>
            <pubDate>Wed, 29 Jan 2014 12:00:00 GMT</pubDate>
            <description><![CDATA[ Back in 2011, the BEAST attack on the cipher block chaining (CBC) encryption mode used in TLS v1.0 was demonstrated. At the time the advice of experts (including our own) was to prioritize the use of RC4-based cipher suites. ]]></description>
            <content:encoded><![CDATA[ <p>Back in 2011, the BEAST attack on the cipher block chaining (CBC) encryption mode used in TLS v1.0 was demonstrated. At the time, the advice of <a href="https://community.qualys.com/blogs/securitylabs/2011/10/17/mitigating-the-beast-attack-on-tls">experts</a> (including <a href="/taming-beast-better-ssl-now-available-across">our own</a>) was to prioritize the use of RC4-based cipher suites.</p><p>The BEAST vulnerability itself had already been fixed in TLS v1.1 a few years before, but in 2011 the adoption of TLS v1.1 was virtually non-existent, and web server administrators (and companies like CloudFlare) started preferring RC4 over AES-CBC ciphers in order to mitigate the attack.</p><p>Fast-forward to 2013, and attacks on <a href="/staying-on-top-of-tls-attacks">RC4 have been demonstrated</a>, making the preference for RC4 problematic. Unfortunately, at the time, TLS v1.1 and above still weren't very popular, which meant we had to choose between mitigating BEAST and mitigating the RC4 attacks.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2AtJUXVRMHkKJKWpqeav6u/b66223af9c016bc0d9bee9b1dde40fef/rock-and-hard-place.jpg" />
            
            </figure><p>Since then, all modern browsers have started supporting TLS v1.2, which means that in theory we could support RC4 only for connections using TLS v1.0 in order to protect against BEAST attack and use AES-GCM or AES-CBC for connections using TLS v1.1 and above in order to protect against RC4 attack. Unfortunately, open-source web servers (and OpenSSL) don't allow for such fine-grained control over which ciphers should be supported for which protocol version.</p><p>To make that possible, we are releasing a <a href="https://raw.github.com/cloudflare/openssl-deprecate-rc4/master/disable_rc4.patch">patch</a> for OpenSSL which disables RC4-based cipher suites for connections using TLS v1.1 and above, while leaving them there to protect users still using TLS v1.0.</p>
            <pre><code>--- a/ssl/s3_lib.c
+++ b/ssl/s3_lib.c
@@ -3816,6 +3816,11 @@
                        (TLS1_get_version(s) &lt; TLS1_2_VERSION))
                        continue;

+               /* Disable RC4 for TLS v1.1+ */
+               if ((c-&gt;algorithm_enc == SSL_RC4) &amp;&amp;
+                       (TLS1_get_version(s) &gt;= TLS1_1_VERSION))
+                       continue;
+
                ssl_set_cert_masks(cert,c);
                mask_k = cert-&gt;mask_k;
                mask_a = cert-&gt;mask_a;</code></pre>
            <p>SSL Labs have updated their testing to penalize the use of RC4 on TLS v1.1 and v1.2 connections as detailed here. If a site allows RC4 with TLS v1.1 or v1.2 then the following warning will appear in the SSL Labs report:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/223p8ooPHBOGbfJiXLYqW9/e086b8f84cacf7480c05218e1ed1439b/Screen_Shot_2014-01-22_at_6.06.13_PM.png" />
            
            </figure><p>SSL Labs have introduced warnings that will lower a web site's score. Warnings are given for a lack of forward secrecy, missing secure renegotiation, or the use of RC4 on TLS v1.1 and v1.2. Web sites are also heavily penalized for using key sizes below <a href="/cloudflare-prism-secure-ciphers">2048 bits</a>.</p><p>Customers of CloudFlare using our SSL options will get an A or A+ rating with no warnings from SSL Labs.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7FfNYdifUULQwA1AVNIQgz/223bd6adcec8dcc48bd27521b6cbd4aa/Screen_Shot_2014-01-29_at_8.17.14_PM.png" />
            
            </figure><p>You can download the <a href="https://github.com/cloudflare/openssl-deprecate-rc4/">patch here</a>.</p> ]]></content:encoded>
            <category><![CDATA[RC4]]></category>
            <category><![CDATA[OpenSSL]]></category>
            <category><![CDATA[Encryption]]></category>
            <category><![CDATA[Vulnerabilities]]></category>
            <category><![CDATA[Cryptography]]></category>
            <guid isPermaLink="false">6hXAqNoEHNETkJspo2IEOW</guid>
            <dc:creator>Piotr Sikora</dc:creator>
        </item>
        <item>
            <title><![CDATA[Keeping our open source promise]]></title>
            <link>https://blog.cloudflare.com/keeping-our-open-source-promise/</link>
            <pubDate>Sun, 22 Dec 2013 17:00:00 GMT</pubDate>
            <description><![CDATA[ Back in October I wrote a blog post about CloudFlare and open source software titled CloudFlare And Open Source Software: A Two-Way Street which detailed the many ways in which we use and support open source software. ]]></description>
            <content:encoded><![CDATA[ <p>Back in October I wrote a blog post about CloudFlare and open source software titled <a href="/open-source-two-way-street">CloudFlare And Open Source Software: A Two-Way Street</a> which detailed the many ways in which we use and support open source software.</p><p>Since then we've pushed out quite a lot of new open source projects, as well as continuing to support the likes of <a href="http://luajit.org/sponsors.html#sponsorship_perf">LuaJIT</a> and <a href="http://openresty.org/">OpenResty</a>. All our open source projects are available via our dedicated GitHub page <a href="http://cloudflare.github.io/">cloudflare.github.io</a>. Here are a few that have been recently updated or released:</p><ul><li><p><a href="https://github.com/cloudflare/redoctober">Red October</a>: our implementation of the 'two man rule' for control of secrets (and nuclear weapons :-). It was large enough to warrant its own <a href="/red-october-cloudflares-open-source-implementation-of-the-two-man-rule">blog post</a>.</p></li><li><p><a href="https://github.com/cloudflare/lua-resty-shcache">Lua Resty Shcache</a>: a shared cache implementation for OpenResty and Nginx Lua.</p></li><li><p><a href="https://github.com/agentzh/lua-resty-core">Lua Resty Core</a>: a new FFI-based Lua API for the ngx_lua module (part of work CloudFlare is doing on optimization).</p></li><li><p><a href="https://github.com/cloudflare/golibs">Go Libs</a>: at the other end of the size spectrum: golibs is where we'll be dumping small Go libraries that might be useful to others.</p></li><li><p><a href="https://github.com/cloudflare/bm">BM</a>: a Go implementation of Bentley-McIlroy long string compression (which is used as part of the compression performed by <a href="https://www.cloudflare.com/railgun">Railgun</a>).</p></li><li><p><a href="https://github.com/cloudflare/lua-raven">Lua Raven</a>: a Lua interface for sending error reports to <a 
href="https://getsentry.com/">Sentry</a>.</p></li><li><p><a href="https://github.com/cloudflare/lua-resty-logger-socket">Lua Resty Logger Socket</a>: a non-blocking logger for Nginx Lua that supports remote loggers.</p></li><li><p><a href="https://github.com/cloudflare/conf">conf</a>: a very simple key=value configuration file parser written in Go.</p></li><li><p><a href="https://github.com/cloudflare/lua-resty-kyototycoon">Lua Resty Kyoto Tycoon</a>: a non-blocking Lua interface for <a href="http://fallabs.com/kyototycoon/">Kyoto Tycoon</a>.</p></li><li><p><a href="https://github.com/cloudflare/golz4">Go LZ4</a>: a Go package for LZ4 compression.</p></li><li><p><a href="https://github.com/cloudflare/ahocorasick">Aho Corasick</a>: a Go package that does Aho-Corasick string matching.</p></li><li><p><a href="https://github.com/cloudflare/kt_fdw">kt-fdw</a>: a foreign data wrapper for Postgres that interfaces to Kyoto Tycoon. More details in its <a href="/kyoto_tycoon_with_postgresql">blog post</a>.</p></li><li><p><a href="https://github.com/cloudflare/lua-resty-cookie">Lua Resty Cookie</a>: a Lua Resty package that improves Cookie handling in OpenResty and Nginx Lua.</p></li><li><p><a href="https://github.com/terinjokes/docker-npmjs">Docker NPMJS</a>: a Docker image for a private npmjs repository.</p></li></ul><p>We've also been contributing to projects like <a href="http://git.openssl.org/gitweb/?p=openssl.git;a=search;s=Piotr+Sikora;st=author">OpenSSL</a>, <a href="https://sourceware.org/git/gitweb.cgi?p=systemtap.git;a=search;s=Yichun+Zhang+(agentzh);st=author">SystemTap</a> and <a href="https://code.google.com/p/phantomjs/source/detail?r=3ae632e70441125e6830ddb8e6535a3627f9444a&amp;path=/test/webpage-spec.js">PhantomJS</a>, and the names of CloudFlare employees Yichun and Piotr appear regularly in the Nginx <a href="http://nginx.org/en/CHANGES">changelog</a>.</p><p>And, on a personal level, I've been keeping a repository of all the talks I've given (including 
any associated code or data) in <a href="https://github.com/cloudflare/jgc-talks">jgc-talks</a>. The most recent was a talk in London about <a href="https://github.com/cloudflare/jgc-talks/tree/master/Other/BigO_London">rolling hashes</a>.</p><p>We'll keep pushing projects, large and small, to GitHub as we're ready to share them. We've pushed out so much open source code that we've recently reorganized our GitHub page by language/software. There are now sections for <a href="http://cloudflare.github.io/#cat-JavaScript">JavaScript</a>, <a href="http://cloudflare.github.io/#cat-Go">Go</a>, <a href="http://cloudflare.github.io/#cat-Nginx%20and%20Lua">Nginx and Lua</a>, and <a href="http://cloudflare.github.io/#cat-Postgres">Postgres</a>.</p><p>Right now there's a large and innovative security project getting ready for release. Expect to hear about it (and see its source code) in the New Year.</p> ]]></content:encoded>
            <category><![CDATA[OpenSSL]]></category>
            <category><![CDATA[Open Source]]></category>
            <category><![CDATA[LUA]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">35Aa0QeOHEMoZlmyo9HO8i</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
    </channel>
</rss>