
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Tue, 07 Apr 2026 17:49:36 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Go and enhance your calm: demolishing an HTTP/2 interop problem]]></title>
            <link>https://blog.cloudflare.com/go-and-enhance-your-calm/</link>
            <pubDate>Fri, 31 Oct 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ HTTP/2 implementations often respond to suspected attacks by closing the connection with an ENHANCE_YOUR_CALM error code. Learn how a common pattern of using Go's HTTP/2 client can lead to unintended errors and the solution to avoiding them. ]]></description>
<content:encoded><![CDATA[ <p>In September 2025, a thread popped up in our internal engineering chat room asking, "Which part of our stack would be responsible for sending <code>ErrCode=ENHANCE_YOUR_CALM</code> to an HTTP/2 client?" Two internal microservices were experiencing a critical error preventing their communication, and the team needed a timely answer.</p><p>In this blog post, we give background on well-known HTTP/2 attacks and the Cloudflare defenses they trigger, which close connections. We then document an easy-to-make mistake with Go's standard library that can cause clients to inadvertently send PING floods, and how you can avoid it.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2OL6xA151F9JNR4S0wE0DG/f6ae4b0a261da5d189d82ccfa401104e/image1.png" />
          </figure>
    <div>
      <h3>HTTP/2 is powerful – but it can be easy to misuse</h3>
      <a href="#http-2-is-powerful-but-it-can-be-easy-to-misuse">
        
      </a>
    </div>
    <p><a href="https://www.rfc-editor.org/rfc/rfc9113"><u>HTTP/2</u></a> defines a binary wire format for encoding <a href="https://www.rfc-editor.org/rfc/rfc9110.html"><u>HTTP semantics</u></a>. Request and response messages are encoded as a series of HEADERS and DATA frames, each associated with a logical stream, sent over a TCP connection using TLS. There are also control frames that relate to the management of streams or the connection as a whole. For example, SETTINGS frames advertise properties of an endpoint, WINDOW_UPDATE frames provide flow control credit to a peer so that it can send data, RST_STREAM can be used to cancel or reject a request or response, while GOAWAY can be used to signal graceful or immediate connection closure.</p><p>HTTP/2 provides many powerful features that have legitimate uses. However, with great power comes responsibility and opportunity for accidental or intentional misuse. The specification details a number of <a href="https://datatracker.ietf.org/doc/html/rfc9113#section-10.5"><u>denial-of-service considerations</u></a>. Implementations are advised to harden themselves: "An endpoint that doesn't monitor use of these features exposes itself to a risk of denial of service. Implementations SHOULD track the use of these features and set limits on their use."</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/71AXf477FP8u9znjdqXIh9/d6d22049a3f5f5488b8b7e9ac7f832a9/image3.png" />
          </figure><p>Cloudflare implements many different HTTP/2 defenses, developed over years in order to protect our systems and our customers. Some notable examples include mitigations added in 2019 to address "<a href="https://blog.cloudflare.com/on-the-recent-http-2-dos-attacks/"><u>Netflix vulnerabilities</u></a>" and in 2023 to mitigate <a href="https://blog.cloudflare.com/technical-breakdown-http2-rapid-reset-ddos-attack/"><u>Rapid Reset</u></a> and <a href="https://blog.cloudflare.com/madeyoureset-an-http-2-vulnerability-thwarted-by-rapid-reset-mitigations/"><u>similar</u></a> style attacks.</p><p>When Cloudflare detects that HTTP/2 client behaviour is likely malicious, we close the connection using the GOAWAY frame and include the error code <a href="https://www.rfc-editor.org/rfc/rfc9113#ENHANCE_YOUR_CALM"><code><u>ENHANCE_YOUR_CALM</u></code></a>.</p><p>One of the well-known and common attacks is <a href="https://www.cve.org/CVERecord?id=CVE-2019-9512"><u>CVE-2019-9512</u></a>, aka PING flood: "The attacker sends continual pings to an HTTP/2 peer, causing the peer to build an internal queue of responses. Depending on how efficiently this data is queued, this can consume excess CPU, memory, or both." Sending a <a href="https://www.rfc-editor.org/rfc/rfc9113#section-6.7"><u>PING frame</u></a> causes the peer to respond with a PING acknowledgement (indicated by an ACK flag). This allows for checking the liveness of the HTTP connection, along with measuring the layer 7 round-trip time – both useful things. The requirement to acknowledge a PING, however, provides the potential attack vector since it generates work for the peer.</p><p>A client that PINGs the Cloudflare edge too frequently will trigger our <a href="https://www.cve.org/CVERecord?id=CVE-2019-9512"><u>CVE-2019-9512</u></a> mitigations, causing us to close the connection. 
Shortly after we <a href="https://blog.cloudflare.com/road-to-grpc/"><u>launched support for gRPC</u></a> in 2020, we encountered interoperability issues with some gRPC clients that sent many PINGs as part of a <a href="https://grpc.io/blog/grpc-go-perf-improvements/#bdp-estimation-and-dynamic-flow-control-window"><u>performance optimization for window tuning</u></a>. We also discovered that the Rust Hyper crate had a feature called Adaptive Window that emulated the design and triggered a similar <a href="https://github.com/hyperium/hyper/issues/2526"><u>problem</u></a> until Hyper made a <a href="https://github.com/hyperium/hyper/pull/2550"><u>fix</u></a>.</p>
    <div>
      <h3>Solving a microservice miscommunication mystery</h3>
      <a href="#solving-a-microservice-miscommunication-mystery">
        
      </a>
    </div>
    <p>When that thread popped up asking which part of our stack was responsible for sending the <code>ENHANCE_YOUR_CALM</code> error code, it was regarding a client communicating over HTTP/2 between two internal microservices.</p><p>We suspected that this was an HTTP/2 mitigation issue and confirmed it was a PING flood mitigation in our logs. But taking a step back, you may wonder why two internal microservices are communicating over the Cloudflare edge at all, and therefore hitting our mitigations. In this case, communicating over the edge provides us with several advantages:</p><ol><li><p>We get to dogfood our edge infrastructure and discover issues like this!</p></li><li><p>We can use Cloudflare Access for authentication. This allows our microservices to be accessed securely by both other services (using service tokens) and engineers (which is invaluable for debugging).</p></li><li><p>Internal services that are written with Cloudflare Workers can easily communicate with services that are accessible at the edge.</p></li></ol>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6v9oBdn5spdqw1BjDW1bS3/a9b36ef9252d580a79c978eb366f7a7a/image2.png" />
          </figure><p>The question remained: Why was this client behaving this way? We traded some ideas as we attempted to get to the bottom of the issue.</p><p>The client had a configuration that would indicate that it didn't need to PING very frequently:</p>
            <pre><code>t2.PingTimeout = 2 * time.Second
t2.ReadIdleTimeout = 5 * time.Second</code></pre>
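            <p>For context, a <code>t2</code> like this typically comes from upgrading a standard <code>http.Transport</code> via the <code>golang.org/x/net/http2</code> package. The following sketch (our assumption of roughly how such a client is wired up, not the service's actual code) enables the transport's health check: if no frames are read for <code>ReadIdleTimeout</code>, the client sends a PING, and closes the connection if the PING is not acknowledged within <code>PingTimeout</code>:</p>
            <pre><code>package main

import (
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

func newClient() (*http.Client, error) {
	t1 := &amp;http.Transport{}
	// ConfigureTransports wires HTTP/2 support into t1 and returns the
	// http2.Transport so its HTTP/2-specific knobs can be set.
	t2, err := http2.ConfigureTransports(t1)
	if err != nil {
		return nil, err
	}
	t2.ReadIdleTimeout = 5 * time.Second // send a PING after 5s with no frames read
	t2.PingTimeout = 2 * time.Second     // close the connection if no ACK within 2s
	return &amp;http.Client{Transport: t1}, nil
}</code></pre>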
            <p>However, in situations like this it is generally a good idea to establish ground truth about what is really happening "on the wire." For instance, grabbing a packet capture that can be dissected and explored in Wireshark can provide unequivocal evidence of precisely what was sent over the network. The next best option is detailed/trace logging at the sender or receiver, although sometimes logging can be misleading, so caveat emptor.</p><p>In our particular case, it was simpler to use logging with <code>GODEBUG=http2debug=2</code>. We built a minimal reproduction of the client that triggered the error, helping to eliminate other potential variables. We analyzed the logs together and dived into some of the Go standard library code to understand what it was really doing. Isaac Asimov is commonly credited with the quote "The most exciting phrase to hear in science, the one that heralds new discoveries, is not 'Eureka!' but 'That's funny...'" and sure enough, within the hour someone declared:

<i>the funny part I see is this:</i></p>
            <pre><code>2025/09/02 17:33:18 http2: Framer 0x14000624540: wrote RST_STREAM stream=9 len=4 ErrCode=CANCEL
2025/09/02 17:33:18 http2: Framer 0x14000624540: wrote PING len=8 ping="j\xe7\xd6R\xdaw\xf8+"</code></pre>
            <p><i>every ping seems to be preceded by a RST_STREAM</i></p><p>Observant readers will recall the earlier mention of Rapid Reset. However, our logs clearly indicated ENHANCE_YOUR_CALM being triggered due to the PING flood. A bit of searching landed us on this <a href="https://groups.google.com/g/grpc-io/c/sWYYQJXHCAQ/m/SWFHxw9IAgAJ"><u>mailing list thread</u></a> and the comment "Sending a PING frame along with an RST_STREAM allows a client to distinguish between an unresponsive server and a slow response." That seemed quite relevant. We also found <a href="https://go-review.googlesource.com/c/net/+/632995"><u>a change that was committed</u></a> related to this topic. This partly answered why there were so many PINGs, but it also raised a new question: Why so many stream resets?

So we went back to the logs and built up a little more context about the interaction:</p>
            <pre><code>2025/09/02 17:33:18 http2: Transport received DATA flags=END_STREAM stream=47 len=0 data=""
2025/09/02 17:33:18 http2: Framer 0x14000624540: wrote RST_STREAM stream=47 len=4 ErrCode=CANCEL
2025/09/02 17:33:18 http2: Framer 0x14000624540: wrote PING len=8 ping="\x97W\x02\xfa&gt;\xa8\xabi"</code></pre>
            <p>The interesting thing here is that the server had sent a DATA frame with the END_STREAM flag set. Per the HTTP/2 stream <a href="https://www.rfc-editor.org/rfc/rfc9113#section-5.1"><u>state machine</u></a>, the stream should have transitioned to <b>closed</b> when a frame with END_STREAM was processed. The client doesn't need to do anything in this state – sending a RST_STREAM is entirely unnecessary.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/R0Bbw1SFYwcyb280RdwjY/578628489c97f67a5ac877a55f4f3e3b/image6.png" />
          </figure><p>A little more digging and noodling and an engineer proclaimed:

<i>I noticed that the reset+ping only happens when you call </i><code><i>resp.Body.Close()</i></code></p><p><i>I believe Go's HTTP library doesn't actually read the response body automatically, but keeps the stream open for you to use until you call </i><code><i>resp.Body.Close()</i></code><i>, which you can do at any point you like.</i></p><p>The hilarious thing in our example was that there wasn't actually any HTTP body to read. From the earlier example: <code>received DATA flags=END_STREAM stream=47 len=0 data=""</code>.</p><p>Science and engineering are at times weird and counterintuitive. We decided to tweak our client to read the (absent) body via <code>io.Copy(io.Discard, resp.Body)</code> before closing it.</p><p>Sure enough, this immediately stopped the client sending both a useless RST_STREAM and, by association, a PING frame.</p><p>Mystery solved?</p><p>To prove we had fixed the root cause, the production client was updated with a similar fix. A few hours later, all the ENHANCE_YOUR_CALM closures were eliminated.</p>
    <div>
      <h3>Reading bodies in Go can be unintuitive</h3>
      <a href="#reading-bodies-in-go-can-be-unintuitive">
        
      </a>
    </div>
    <p>It’s worth noting that ensuring the response body is always fully read can be unintuitive in Go. At first glance, it appears that the response body will always be read in the following example:</p>
            <pre><code>resp, err := http.DefaultClient.Do(req)
if err != nil {
	return err
}
defer resp.Body.Close()

if err := json.NewDecoder(resp.Body).Decode(&amp;respBody); err != nil {
	return err
}</code></pre>
            <p>However, <code>json.Decoder</code> stops reading as soon as it decodes a complete JSON document or encounters an error. If the response body contains multiple JSON documents, trailing data, or invalid JSON, the body may still not be fully read.</p><p>Therefore, in our clients, we’ve started replacing <code>defer resp.Body.Close()</code> with the following pattern to ensure that response bodies are always fully read:</p>
            <pre><code>resp, err := http.DefaultClient.Do(req)
if err != nil {
	return err
}
defer func() {
	io.Copy(io.Discard, resp.Body)
	resp.Body.Close()
}()

if err := json.NewDecoder(resp.Body).Decode(&amp;respBody); err != nil {
	return err
}</code></pre>
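            <p>To see why draining matters, here is a small, self-contained demonstration (ours, for illustration) that <code>json.Decoder</code> can succeed while leaving bytes unread. The decoder may buffer input, so the leftover bytes sit in <code>dec.Buffered()</code> and/or the underlying reader:</p>
            <pre><code>package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
)

func main() {
	// A body containing a complete JSON document followed by trailing data.
	body := bytes.NewReader([]byte(`{"ok":true}{"extra":1}`))

	var v struct {
		OK bool `json:"ok"`
	}
	dec := json.NewDecoder(body)
	if err := dec.Decode(&amp;v); err != nil {
		panic(err)
	}

	// Decode returned successfully, but the input is not fully consumed.
	rest, _ := io.ReadAll(io.MultiReader(dec.Buffered(), body))
	fmt.Printf("unread after Decode: %q\n", rest) // prints: unread after Decode: "{\"extra\":1}"
}</code></pre>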
            
    <div>
      <h2>Actions to take if you encounter ENHANCE_YOUR_CALM</h2>
      <a href="#actions-to-take-if-you-encounter-enhance_your_calm">
        
      </a>
    </div>
    <p>HTTP/2 is a protocol with many powerful features. Many implementations harden themselves against misuse of those features, and that hardening can trigger a connection to be closed. The recommended error code for closing connections in such conditions is ENHANCE_YOUR_CALM. There are numerous HTTP/2 implementations and APIs, and some may drive the use of HTTP/2 features in unexpected ways that can look like attacks.</p><p>If you have an HTTP/2 client that encounters closures with ENHANCE_YOUR_CALM, we recommend that you try to establish ground truth with packet captures (including TLS decryption keys via mechanisms like <a href="https://wiki.wireshark.org/TLS#using-the-pre-master-secret"><u>SSLKEYLOGFILE</u></a>) and/or detailed trace logging. Look for patterns of frequent or repeated frames that resemble malicious traffic. Adjusting your client may help it avoid being misclassified as an attacker.</p><p>If you use Go, we recommend always reading HTTP/2 response bodies (even if empty) in order to avoid sending unnecessary RST_STREAM and PING frames. This is especially important if you use a single connection for multiple requests, which can cause a high frequency of these frames.</p><p>This was also a great reminder of the advantages of dogfooding our own products within our internal services. When we run into issues like this one, our learnings can benefit our customers with similar setups.</p>
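    <p>If your client is written in Go, the standard library can produce those TLS decryption keys directly: <code>tls.Config.KeyLogWriter</code> writes session secrets in the format Wireshark expects. A minimal sketch (the file name is our choice for illustration):</p>
            <pre><code>package main

import (
	"crypto/tls"
	"net/http"
	"os"
)

func main() {
	// WARNING: never enable key logging in production; it defeats TLS
	// confidentiality. Use it only for debugging captures.
	f, err := os.Create("sslkeylog.txt")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	client := &amp;http.Client{
		Transport: &amp;http.Transport{
			TLSClientConfig: &amp;tls.Config{
				KeyLogWriter: f, // point Wireshark's "(Pre)-Master-Secret log filename" here
			},
		},
	}
	_ = client // make requests as usual; secrets are appended to the file
}</code></pre>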
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6doWYOkW3zbafkANt31knv/b00387716b1971d61eb8b4915ee58783/image5.png" />
          </figure><p></p> ]]></content:encoded>
            <category><![CDATA[HTTP2]]></category>
            <category><![CDATA[Go]]></category>
            <category><![CDATA[DDoS]]></category>
            <guid isPermaLink="false">sucWiZHlaWXeLFddHtkk1</guid>
            <dc:creator>Lucas Pardue</dc:creator>
            <dc:creator>Zak Cutner</dc:creator>
        </item>
        <item>
            <title><![CDATA[How we found a bug in Go's arm64 compiler]]></title>
            <link>https://blog.cloudflare.com/how-we-found-a-bug-in-gos-arm64-compiler/</link>
            <pubDate>Wed, 08 Oct 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ 84 million requests a second means even rare bugs appear often. We'll reveal how we discovered a race condition in the Go arm64 compiler and got it fixed. ]]></description>
<content:encoded><![CDATA[ <p>Every second, 84 million HTTP requests hit Cloudflare across our fleet of data centers in 330 cities. At that scale, even the rarest of bugs can show up frequently, and it was exactly that scale that recently led us to discover a bug in Go's arm64 compiler which causes a race condition in the generated code.</p><p>This post breaks down how we first encountered the bug, investigated it, and ultimately drove to the root cause.</p>
    <div>
      <h2>Investigating a strange panic</h2>
      <a href="#investigating-a-strange-panic">
        
      </a>
    </div>
    <p>We run a service in our network which configures the kernel to handle traffic for some products like <a href="https://www.cloudflare.com/network-services/products/magic-transit/"><u>Magic Transit</u></a> and <a href="https://www.cloudflare.com/network-services/products/magic-wan/"><u>Magic WAN</u></a>. Our monitoring watches this closely, and it started to observe very sporadic panics on arm64 machines.</p><p>We first saw one with a fatal error stating that <a href="https://github.com/golang/go/blob/c0ee2fd4e309ef0b8f4ab6f4860e2626c8e00802/src/runtime/traceback.go#L566"><u>traceback did not unwind completely</u></a>. That error suggests that invariants were violated when traversing the stack, likely because of stack corruption. After a brief investigation we decided that it was probably rare stack memory corruption. This was a largely idle control plane service where unplanned restarts have negligible impact, and so we felt that following up was not a priority unless it kept happening.</p><p>And then it kept happening. </p>
    <div>
      <h4>Coredumps per hour</h4>
      <a href="#coredumps-per-hour">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7ohMD2ph3I9l7lbZQWZ4xl/243e679d8293565fc6a55a670c9f3583/BLOG-2906_2.png" />
</figure><p>When we first saw this bug, we noticed that the fatal errors correlated with recovered panics. These were caused by some old code which used panic/recover as error handling.</p><p>At this point, our theory was:</p><ol><li><p>All of the fatal panics happen within stack unwinding.</p></li><li><p>We correlated an increased volume of recovered panics with these fatal panics.</p></li><li><p>Recovering a panic unwinds goroutine stacks to call deferred functions.</p></li><li><p>A related <a href="https://github.com/golang/go/issues/73259"><u>Go issue (#73259)</u></a> reported an arm64 stack unwinding crash.</p></li><li><p>Let’s stop using panic/recover for error handling and wait out the upstream fix?</p></li></ol><p>So we did that, and watched the fatal panics stop as the release rolled out. Fatal panics gone, our theoretical mitigation seemed to work, and this was no longer our problem. We subscribed to the upstream issue so we could update our code when it was resolved, and put it out of our minds.</p><p>But this turned out to be a much stranger bug than expected. Putting it out of our minds was premature, as the same class of fatal panics came back at a much higher rate. A month later, we were seeing up to 30 daily fatal panics with no discernible cause; while that might account for only one machine a day in less than 10% of our data centers, we found it concerning that we didn’t understand the cause. The first thing we checked was the number of recovered panics, to match our previous pattern, but there were none. More interestingly, we could not correlate this increased rate of fatal panics with anything. A release? Infrastructure changes? The position of Mars?</p><p>At this point we felt we needed to dive deeper to understand the root cause. Pattern matching and hoping was clearly insufficient.</p><p>We saw two classes of this bug: a crash while accessing invalid memory, and an explicitly checked fatal error.</p>
    <div>
      <h4>Fatal Error</h4>
      <a href="#fatal-error">
        
      </a>
    </div>
    
            <pre><code>runtime: g8221077: frame.sp=0x4015d784c0 top=0x4015d89fd0
       stack=[0x4015d6a000-0x4015d8a000
fatal error: traceback did not unwind completely
runtime stack:
runtime.throw({0x55558de6aa27?, 0x7ff97fffe638?})
       /usr/local/go/src/runtime/panic.go:1073 +0x38 fp=0x7ff97fffe490 sp=0x7ff97fffe460 pc=0x55558d403388
runtime.(*unwinder).finishInternal(0x7ff97fffe4f8?)
       /usr/local/go/src/runtime/traceback.go:566 +0x110 fp=0x7ff97fffe4d0 sp=0x7ff97fffe490 pc=0x55558d3eed40
runtime.(*unwinder).next(0x7ff97fffe5b0?)
       /usr/local/go/src/runtime/traceback.go:447 +0x2ac fp=0x7ff97fffe560 sp=0x7ff97fffe4d0 pc=0x55558d3eeb7c
runtime.scanstack(0x4014494380, 0x400005bc50)
       /usr/local/go/src/runtime/mgcmark.go:887 +0x290 fp=0x7ff97fffe6a0 sp=0x7ff97fffe560 pc=0x55558d3acaa0
runtime.markroot.func1()
       /usr/local/go/src/runtime/mgcmark.go:238 +0xa8 fp=0x7ff97fffe6f0 sp=0x7ff97fffe6a0 pc=0x55558d3ab578
runtime.markroot(0x400005bc50, 0x17e6, 0x1)
       /usr/local/go/src/runtime/mgcmark.go:212 +0x1c8 fp=0x7ff97fffe7a0 sp=0x7ff97fffe6f0 pc=0x55558d3ab248
runtime.gcDrain(0x400005bc50, 0x7)
       /usr/local/go/src/runtime/mgcmark.go:1188 +0x434 fp=0x7ff97fffe810 sp=0x7ff97fffe7a0 pc=0x55558d3ad514
runtime.gcDrainMarkWorkerIdle(...)
       /usr/local/go/src/runtime/mgcmark.go:1102
runtime.gcBgMarkWorker.func2()
       /usr/local/go/src/runtime/mgc.go:1508 +0x68 fp=0x7ff97fffe860 sp=0x7ff97fffe810 pc=0x55558d3a9408
runtime.systemstack(0x0)
       /usr/local/go/src/runtime/asm_arm64.s:244 +0x6c fp=0x7ff97fffe870 sp=0x7ff97fffe860 pc=0x55558d4098fc
goroutine 153 gp=0x4000105340 m=324 mp=0x400639ea08 [GC worker (active)]:</code></pre>
            <p></p>
    <div>
      <h4>Segmentation fault</h4>
      <a href="#segmentation-fault">
        
      </a>
    </div>
    
            <pre><code>SIGSEGV: segmentation violation
PC=0x55557e2bea58 m=13 sigcode=1 addr=0x118
goroutine 0 gp=0x40003af880 m=13 mp=0x40003ca008 [idle]:
runtime.(*unwinder).next(0x7fff2afde5b0)
       /usr/local/go/src/runtime/traceback.go:458 +0x188 fp=0x7fff2afde560 sp=0x7fff2afde4d0 pc=0x55557e2bea58
runtime.scanstack(0x40042cc000, 0x4000059750)
       /usr/local/go/src/runtime/mgcmark.go:887 +0x290 fp=0x7fff2afde6a0 sp=0x7fff2afde560 pc=0x55557e27caa0
runtime.markroot.func1()
       /usr/local/go/src/runtime/mgcmark.go:238 +0xa8 fp=0x7fff2afde6f0 sp=0x7fff2afde6a0 pc=0x55557e27b578
runtime.markroot(0x4000059750, 0xb8, 0x1)
       /usr/local/go/src/runtime/mgcmark.go:212 +0x1c8 fp=0x7fff2afde7a0 sp=0x7fff2afde6f0 pc=0x55557e27b248
runtime.gcDrain(0x4000059750, 0x3)
       /usr/local/go/src/runtime/mgcmark.go:1188 +0x434 fp=0x7fff2afde810 sp=0x7fff2afde7a0 pc=0x55557e27d514
runtime.gcDrainMarkWorkerDedicated(...)
       /usr/local/go/src/runtime/mgcmark.go:1112
runtime.gcBgMarkWorker.func2()
       /usr/local/go/src/runtime/mgc.go:1489 +0x94 fp=0x7fff2afde860 sp=0x7fff2afde810 pc=0x55557e279434
runtime.systemstack(0x0)
       /usr/local/go/src/runtime/asm_arm64.s:244 +0x6c fp=0x7fff2afde870 sp=0x7fff2afde860 pc=0x55557e2d98fc
goroutine 187 gp=0x40003aea80 m=13 mp=0x40003ca008 [GC worker (active)]:</code></pre>
            <p></p><p></p><p>Now we could observe some clear patterns. Both errors occur when unwinding the stack in <code>(*unwinder).next</code>. In one case we saw an intentional<a href="https://github.com/golang/go/blob/b3251514531123d7fd007682389bce7428d159a0/src/runtime/traceback.go#L566"> <u>fatal error</u></a> as the runtime identified that unwinding could not complete and the stack was in a bad state. In the other case there was a direct memory access error that happened while trying to unwind the stack. The segfault was discussed in the <a href="https://github.com/golang/go/issues/73259#issuecomment-2786818812"><u>GitHub issue</u></a> and a Go engineer identified it as a dereference of the Go scheduler struct <a href="https://github.com/golang/go/blob/924fe98902cdebf20825ab5d1e4edfc0fed2966f/src/runtime/runtime2.go#L536"><u>m</u></a> when<a href="https://github.com/golang/go/blob/b3251514531123d7fd007682389bce7428d159a0/src/runtime/traceback.go#L458"> <u>unwinding</u></a>.</p>
    <div>
      <h3>A review of Go scheduler structs</h3>
      <a href="#a-review-of-go-scheduler-structs">
        
      </a>
    </div>
    <p>Go uses a lightweight userspace scheduler to manage concurrency. Many goroutines are scheduled on a smaller number of kernel threads – this is often referred to as M:N scheduling. Any individual goroutine can be scheduled on any kernel thread. The scheduler has three core types – <a href="https://github.com/golang/go/blob/924fe98902cdebf20825ab5d1e4edfc0fed2966f/src/runtime/runtime2.go#L394"><code><u>g</u></code></a>  (the goroutine), <a href="https://github.com/golang/go/blob/924fe98902cdebf20825ab5d1e4edfc0fed2966f/src/runtime/runtime2.go#L536"><code><u>m</u></code></a> (the kernel thread, or “machine”), and <a href="https://github.com/golang/go/blob/924fe98902cdebf20825ab5d1e4edfc0fed2966f/src/runtime/runtime2.go#L644"><code><u>p</u></code></a> (the physical execution context, or  “processor”). For a goroutine to be scheduled a free <code>m</code> must acquire a free <code>p</code>, which will execute a g. Each <code>g</code> contains a field for its m if it is currently running, otherwise it will be <b>nil</b>. This is all the context needed for this post but the <a href="https://github.com/golang/go/blob/master/src/runtime/HACKING.md#gs-ms-ps"><u>go runtime docs</u></a> explore this more comprehensively. </p><p>At this point we can start to make inferences on what’s happening: the program crashes because we try to unwind a goroutine stack which is invalid. In the first backtrace, if a <a href="https://github.com/golang/go/blob/b3251514531123d7fd007682389bce7428d159a0/src/runtime/traceback.go#L446"><u>return address is null, we call </u><code><u>finishInternal</u></code><u> and abort because the stack was not fully unwound</u></a>. The segmentation fault case in the second backtrace is a bit more interesting: if instead the return address is non-zero but not a function then the unwinder code assumes that the goroutine is currently running. 
It'll then dereference <code>m</code> and fault by accessing <code>m.incgo</code> (the offset of <code>incgo</code> into <code>struct m</code> is 0x118, the faulting memory access).</p><p>What, then, is causing this corruption? The traces were difficult to get anything useful from – our service has hundreds if not thousands of active goroutines. It was fairly clear from the beginning that the panic was remote from the actual bug. The crashes were all observed while unwinding the stack, and if this were an issue any time a stack was unwound on arm64, we would be seeing it in many more services. We felt pretty confident that the stack unwinding was happening correctly but on an invalid stack.</p><p>Our investigation stalled for a while at this point – making guesses, testing guesses, trying to infer if the panic rate went up or down, or if nothing changed. There was <a href="https://github.com/golang/go/issues/73259"><u>a known issue</u></a> on Go’s GitHub issue tracker which matched our symptoms almost exactly, but what they discussed was mostly what we already knew. At some point when looking through the linked stack traces we realized that their crash referenced an old version of a library that we were also using – Go Netlink.</p>
            <pre><code>goroutine 1267 gp=0x4002a8ea80 m=nil [runnable (scan)]:
runtime.asyncPreempt2()
        /usr/local/go/src/runtime/preempt.go:308 +0x3c fp=0x4004cec4c0 sp=0x4004cec4a0 pc=0x46353c
runtime.asyncPreempt()
        /usr/local/go/src/runtime/preempt_arm64.s:47 +0x9c fp=0x4004cec6b0 sp=0x4004cec4c0 pc=0x4a6a8c
github.com/vishvananda/netlink/nl.(*NetlinkSocket).Receive(0x14360300000000?)
        /go/pkg/mod/github.com/!data!dog/netlink@v1.0.1-0.20240223195320-c7a4f832a3d1/nl/nl_linux.go:803 +0x130 fp=0x4004cfc710 sp=0x4004cec6c0 pc=0xf95de0
</code></pre>
            <p></p><p>We spot-checked a few stack traces and confirmed the presence of this Netlink library. Querying our logs showed that not only did we share a library – every single segmentation fault we observed had happened while preempting<a href="https://github.com/vishvananda/netlink/blob/e1e260214862392fb28ff72c9b11adc84df73e2c/nl/nl_linux.go#L880"> <code><u>NetlinkSocket.Receive</u></code></a>.</p>
    <div>
      <h3>What’s (async) preemption?</h3>
      <a href="#whats-async-preemption">
        
      </a>
    </div>
    <p>In the prehistoric era of Go (&lt;=1.13), the runtime was cooperatively scheduled. A goroutine would run until it decided it was ready to yield to the scheduler – usually due to explicit calls to <code>runtime.Gosched()</code> or injected yield points at function calls/IO operations. Since<a href="https://go.dev/doc/go1.14#runtime"> <u>Go 1.14</u></a> the runtime instead does async preemption. The Go runtime has a thread <code>sysmon</code> which tracks the runtime of goroutines and will preempt any that run for longer than 10ms (at time of writing). It does this by sending <code>SIGURG</code> to the OS thread and in the signal handler will modify the program counter and stack to mimic a call to <code>asyncPreempt</code>.</p><p>At this point we had two broad theories:</p><ul><li><p>This is a Go Netlink bug – likely due to <code>unsafe.Pointer</code> usage which invoked undefined behavior but is only actually broken on arm64</p></li><li><p>This is a Go runtime bug and we're only triggering it in <code>NetlinkSocket.Receive</code> for some reason</p></li></ul><p>After finding the same bug publicly reported upstream, we were feeling confident this was caused by a Go runtime bug. However, upon seeing that both issues implicated the same function, we felt more skeptical – notably the Go Netlink library uses <code>unsafe.Pointer</code>, so memory corruption was a plausible explanation even if we didn't understand why.</p><p>After an unsuccessful code audit we had hit a wall. The crashes were rare and remote from the root cause. Maybe these crashes were caused by a runtime bug, maybe they were caused by a Go Netlink bug. It seemed clear that there was something wrong with this area of the code, but code auditing wasn’t going anywhere.</p>
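    <p>Async preemption itself is easy to observe. In this small experiment (ours, for illustration), a goroutine spins in a loop with no function calls or allocations, so it has no cooperative yield points; with a single P, the program reaching its final line relies on <code>sysmon</code> asynchronously preempting the busy goroutine:</p>
            <pre><code>package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	runtime.GOMAXPROCS(1) // one P: main can only run again if the loop is preempted

	go func() {
		for i := 0; ; i++ { // no calls, no allocations: no cooperative yield points
			_ = i
		}
	}()

	time.Sleep(100 * time.Millisecond)
	// On Go &gt;= 1.14, sysmon preempts the busy goroutine (it runs &gt; 10ms
	// without yielding), so main gets scheduled again and we reach this line.
	fmt.Println("main was scheduled despite the busy loop")
}</code></pre>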
    <div>
      <h2>Breakthrough</h2>
      <a href="#breakthrough">
        
      </a>
    </div>
    <p>At this point we had a fairly good understanding of what was crashing but very little understanding of <b>why</b> it was happening. It was clear that the root cause of the stack unwinder crashing was remote from the actual crash, and that it had to do with <code>(*NetlinkSocket).Receive</code>, but why? We were able to capture a <b>coredump</b> of a production crash and view it in a debugger. The backtrace confirmed what we already knew – that there was a segmentation fault when unwinding a stack. The crux of the issue revealed itself when we looked at the goroutine which had been preempted while calling <code>(*NetlinkSocket).Receive</code>.     </p>
            <pre><code>(dlv) bt
0  0x0000555577579dec in runtime.asyncPreempt2
   at /usr/local/go/src/runtime/preempt.go:306
1  0x00005555775bc94c in runtime.asyncPreempt
   at /usr/local/go/src/runtime/preempt_arm64.s:47
2  0x0000555577cb2880 in github.com/vishvananda/netlink/nl.(*NetlinkSocket).Receive
   at
/vendor/github.com/vishvananda/netlink/nl/nl_linux.go:779
3  0x0000555577cb19a8 in github.com/vishvananda/netlink/nl.(*NetlinkRequest).Execute
   at 
/vendor/github.com/vishvananda/netlink/nl/nl_linux.go:532
4  0x0000555577551124 in runtime.heapSetType
   at /usr/local/go/src/runtime/mbitmap.go:714
5  0x0000555577551124 in runtime.heapSetType
   at /usr/local/go/src/runtime/mbitmap.go:714
...
(dlv) disass -a 0x555577cb2878 0x555577cb2888
TEXT github.com/vishvananda/netlink/nl.(*NetlinkSocket).Receive(SB) /vendor/github.com/vishvananda/netlink/nl/nl_linux.go
        nl_linux.go:779 0x555577cb2878  fdfb7fa9        LDP -8(RSP), (R29, R30)
        nl_linux.go:779 0x555577cb287c  ff430191        ADD $80, RSP, RSP
        nl_linux.go:779 0x555577cb2880  ff434091        ADD $(16&lt;&lt;12), RSP, RSP
        nl_linux.go:779 0x555577cb2884  c0035fd6        RET
</code></pre>
            <p></p><p>The goroutine was paused between two opcodes in the function epilogue. Since the process of unwinding a stack relies on the stack frame being in a consistent state, it felt immediately suspicious that we preempted in the middle of adjusting the stack pointer. The goroutine had been paused at 0x555577cb2880, between <code>ADD $80, RSP, RSP</code> and <code>ADD $(16&lt;&lt;12), RSP, RSP</code>.</p><p>We queried the service logs to confirm our theory. This wasn’t isolated – the majority of stack traces showed that this same opcode was preempted. This was no longer a weird production crash we couldn’t reproduce. A crash happened when the Go runtime preempted between these two stack pointer adjustments. We had our smoking gun.</p>
    <div>
      <h2>Building a minimal reproducer</h2>
      <a href="#building-a-minimal-reproducer">
        
      </a>
    </div>
    <p>At this point we felt pretty confident that this was actually just a runtime bug and it should be reproducible in an isolated environment without any dependencies. The theory at this point was:</p><ol><li><p>Stack unwinding is triggered by garbage collection</p></li><li><p>Async preemption between a split stack pointer adjustment causes a crash</p></li><li><p>What if we make a function which splits the adjustment and then call it in a loop?</p></li></ol>
            <pre><code>package main

import (
	"runtime"
)

//go:noinline
func big_stack(val int) int {
	var big_buffer = make([]byte, 1 &lt;&lt; 16)

	sum := 0
	// prevent the compiler from optimizing out the stack
	for i := 0; i &lt; (1&lt;&lt;16); i++ {
		big_buffer[i] = byte(val)
	}
	for i := 0; i &lt; (1&lt;&lt;16); i++ {
		sum ^= int(big_buffer[i])
	}
	return sum
}

func main() {
	go func() {
		for {
			runtime.GC()
		}
	}()
	for {
		_ = big_stack(1000)
	}
}
</code></pre>
            <p></p><p>This function ends up with a stack frame slightly larger than can be represented in 16 bits, and so on arm64 the Go compiler will split the stack pointer adjustment into two opcodes. If the runtime preempts between these opcodes then the stack unwinder will read an invalid stack pointer and crash. </p>
            <pre><code>; epilogue for main.big_stack
ADD $8, RSP, R29
ADD $(16&lt;&lt;12), R29, R29
ADD $16, RSP, RSP
; preemption is problematic between these opcodes
ADD $(16&lt;&lt;12), RSP, RSP
RET
</code></pre>
            <p></p><p>After running this for a few minutes the program panicked as expected!</p>
            <pre><code>SIGSEGV: segmentation violation
PC=0x60598 m=8 sigcode=1 addr=0x118

goroutine 0 gp=0x400019c540 m=8 mp=0x4000198708 [idle]:
runtime.(*unwinder).next(0x400030fd10)
        /home/thea/sdk/go1.23.4/src/runtime/traceback.go:458 +0x188 fp=0x400030fcc0 sp=0x400030fc30 pc=0x60598
runtime.scanstack(0x40000021c0, 0x400002f750)
        /home/thea/sdk/go1.23.4/src/runtime/mgcmark.go:887 +0x290 

[...]

goroutine 1 gp=0x40000021c0 m=nil [runnable (scan)]:
runtime.asyncPreempt2()
        /home/thea/sdk/go1.23.4/src/runtime/preempt.go:308 +0x3c fp=0x40003bfcf0 sp=0x40003bfcd0 pc=0x400cc
runtime.asyncPreempt()
        /home/thea/sdk/go1.23.4/src/runtime/preempt_arm64.s:47 +0x9c fp=0x40003bfee0 sp=0x40003bfcf0 pc=0x75aec
main.big_stack(0x40003cff38?)
        /home/thea/dev/stack_corruption_reproducer/main.go:29 +0x94 fp=0x40003cff00 sp=0x40003bfef0 pc=0x77c04
Segmentation fault (core dumped)

real    1m29.165s
user    4m4.987s
sys     0m43.212s</code></pre>
            <p></p><p>A reproducible crash with standard library only? This felt like conclusive evidence that our problem was a runtime bug.</p><p>This was an extremely particular reproducer! Even now with a good understanding of the bug and its fix, some of the behavior is still puzzling. It's a one-instruction race condition, so it’s unsurprising that small changes could have a large impact. For example, this reproducer was originally written and tested on Go 1.23.4, but did not crash when compiled with 1.23.9 (the version in production), even though we could <code>objdump</code> the binary and see the split ADD still present! We don’t have a definite explanation for this behavior – even with the bug present there remain a few unknown variables which affect the likelihood of hitting the race condition.</p>
    <div>
      <h2>A single-instruction race condition window</h2>
      <a href="#a-single-instruction-race-condition-window">
        
      </a>
    </div>
    <p>arm64 is a fixed-length 4-byte instruction set architecture. This has a lot of implications for codegen, but most relevant to this bug is the fact that immediate length is limited.<a href="https://developer.arm.com/documentation/ddi0596/2020-12/Base-Instructions/ADD--immediate---Add--immediate--"> <code><u>add</u></code></a> gets a 12-bit immediate,<a href="https://developer.arm.com/documentation/dui0802/a/A64-General-Instructions/MOV--wide-immediate-"> <code><u>mov</u></code></a> gets a 16-bit immediate, etc. How does the architecture handle this when the operands don't fit? It depends – <code>ADD</code> in particular reserves a bit for "shift left by 12", so any 24-bit addition can be decomposed into two opcodes. Other instructions are decomposed similarly, or just require loading an immediate into a register first.</p><p>The very last step of the Go compiler before emitting machine code involves transforming the program into <code>obj.Prog</code> structs. It's a very low-level intermediate representation (IR) that mostly serves to be translated into machine code.</p>
            <pre><code>//https://github.com/golang/go/blob/fa2bb342d7b0024440d996c2d6d6778b7a5e0247/src/cmd/internal/obj/arm64/obj7.go#L856

// Pop stack frame.
// ADD $framesize, RSP, RSP
p = obj.Appendp(p, c.newprog)
p.As = AADD
p.From.Type = obj.TYPE_CONST
p.From.Offset = int64(c.autosize)
p.To.Type = obj.TYPE_REG
p.To.Reg = REGSP
p.Spadj = -c.autosize
</code></pre>
            <p></p><p>Notably, this IR is not aware of immediate length limitations. Instead, this happens in <a href="https://github.com/golang/go/blob/2f653a5a9e9112ff64f1392ff6e1d404aaf23e8c/src/cmd/internal/obj/arm64/asm7.go"><u>asm7.go</u></a> when Go's internal intermediate representation is translated into arm64 machine code. The assembler will classify an immediate in <a href="https://github.com/golang/go/blob/2f653a5a9e9112ff64f1392ff6e1d404aaf23e8c/src/cmd/internal/obj/arm64/asm7.go#L1905"><u>conclass</u></a> based on bit size and then use that classification when emitting instructions – emitting extra instructions if needed.</p><p>The Go assembler uses a combination of (<code>mov, add</code>) opcodes for some adds that fit in 16-bit immediates, and prefers (<code>add, add + lsl 12</code>) opcodes for larger immediates.</p><p>Compare a stack of (slightly larger than) <code>1&lt;&lt;15</code>:</p>
            <pre><code>; //go:noinline
; func big_stack() byte {
; 	var big_stack = make([]byte, 1&lt;&lt;15)
; 	return big_stack[0]
; }
MOVD $32776, R27
ADD R27, RSP, R29
MOVD $32784, R27
ADD R27, RSP, RSP
RET
</code></pre>
            <p></p><p>With a stack of <code>1&lt;&lt;16</code>:</p>
            <pre><code>; //go:noinline
; func big_stack() byte {
; 	var big_stack = make([]byte, 1&lt;&lt;16)
; 	return big_stack[0]
; } 
ADD $8, RSP, R29
ADD $(16&lt;&lt;12), R29, R29
ADD $16, RSP, RSP
ADD $(16&lt;&lt;12), RSP, RSP
RET
</code></pre>
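<p>The arithmetic behind that split can be sketched in a few lines (a simplified model of the decomposition, not the assembler's actual code): the low 12 bits fit ADD's immediate field, and the high part is emitted using the "shift left by 12" bit.</p>

```go
package main

import "fmt"

// splitAddImm models how a stack adjustment too large for ADD's 12-bit
// immediate is decomposed into two ADDs. Illustrative arithmetic only –
// not the Go assembler's real code.
func splitAddImm(v uint32) (lo, hi uint32) {
	lo = v & 0xFFF         // ADD $lo, RSP, RSP
	hi = (v >> 12) & 0xFFF // ADD $(hi<<12), RSP, RSP
	return lo, hi
}

func main() {
	frame := uint32(1<<16 + 16) // a frame slightly larger than 1<<16 bytes
	lo, hi := splitAddImm(frame)
	fmt.Printf("ADD $%d, RSP, RSP; ADD $(%d<<12), RSP, RSP\n", lo, hi)
	fmt.Println(lo+hi<<12 == frame) // true: the two ADDs reassemble the adjustment
}
```

<p>For a frame of <code>1&lt;&lt;16 + 16</code> bytes this yields exactly the <code>ADD $16</code> / <code>ADD $(16&lt;&lt;12)</code> pair in the epilogue above.</p>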
            <p>In the larger stack case, there is a point between the <code>ADD x, RSP, RSP</code> opcodes where the stack pointer is not pointing to the tip of a stack frame. We thought at first that this was a matter of memory corruption – that in handling async preemption the runtime would push a function call on the stack and corrupt the middle of the stack. However, this goroutine is already in the function epilogue – any data we corrupt is actively in the process of being thrown away. What's the issue then?  </p><p>The Go runtime often needs to <b>unwind</b> the stack, which means walking backwards through the chain of function calls. For example: garbage collection uses it to find live references on the stack, panicking relies on it to evaluate <code>defer</code> functions, and generating stack traces needs to print the call stack. For this to work, the stack pointer <b>must be accurate during unwinding</b>, because Go dereferences <code>sp</code> to determine the calling function. If the stack pointer is partially modified, the unwinder will look for the calling function in the middle of the stack. The underlying data there is meaningless when interpreted as a pointer to a parent stack frame, and the runtime will likely crash.</p>
            <pre><code>//https://github.com/golang/go/blob/66536242fce34787230c42078a7bbd373ef8dcb0/src/runtime/traceback.go#L373

if innermost &amp;&amp; frame.sp &lt; frame.fp || frame.lr == 0 {
    lrPtr = frame.sp
    frame.lr = *(*uintptr)(unsafe.Pointer(lrPtr))
}
</code></pre>
            <p></p><p>When async preemption happens, it pushes a function call onto the stack, but the recorded parent stack frame is no longer correct because <code>sp</code> was only partially adjusted at the moment of preemption. The crash flow looks something like this:  </p><ol><li><p>Async preemption happens between the two opcodes that <code>add x, rsp</code> expands to</p></li><li><p>Garbage collection triggers stack unwinding (to check for heap object liveness)</p></li><li><p>The unwinder starts traversing the stack of the problematic goroutine and correctly unwinds up to the problematic function</p></li><li><p>The unwinder dereferences <code>sp</code> to determine the parent function</p></li><li><p>Almost certainly the data behind <code>sp</code> is not a function</p></li><li><p>Crash</p></li></ol>
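<p>A toy model (an illustration, not the runtime's actual unwinder) shows why a half-adjusted stack pointer is fatal: unwinding follows frame-to-frame links, and a pointer that lands mid-frame reads garbage instead of a valid link.</p>

```go
package main

import "fmt"

// unwind walks a toy stack. Each frame's start slot records where its
// caller's frame starts; 0 terminates the walk. An sp that is not a
// frame boundary has no valid link – the crash case.
func unwind(links map[int]int, sp int) ([]int, bool) {
	var frames []int
	for sp != 0 {
		caller, ok := links[sp]
		if !ok {
			// sp points mid-frame: dereferencing it yields garbage
			return frames, false
		}
		frames = append(frames, sp)
		sp = caller
	}
	return frames, true
}

func main() {
	links := map[int]int{
		100: 0,   // outermost frame
		60:  100, // called by the frame at 100
		20:  60,  // innermost frame, frame size 40
	}
	// Correct unwind from the innermost frame:
	frames, ok := unwind(links, 20)
	fmt.Println(frames, ok) // [20 60 100] true

	// Preempted mid-epilogue: only the first of two ADDs ran, so
	// sp is 20+16 = 36 instead of 20+40 = 60.
	frames, ok = unwind(links, 36)
	fmt.Println(frames, ok) // [] false
}
```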
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5LKOBC6lQzuKvwy0vyEfk2/054c9aaedc14d155294a682f7de3a610/BLOG-2906_3.png" />
          </figure><p>We saw earlier a faulting stack trace which ended in <code>(*NetlinkSocket).Receive</code> – in this case stack unwinding faulted while it was trying to determine the parent frame.    </p>
            <pre><code>goroutine 90 gp=0x40042cc000 m=nil [preempted (scan)]:
runtime.asyncPreempt2()
/usr/local/go/src/runtime/preempt.go:306 +0x2c fp=0x40060a25d0 sp=0x40060a25b0 pc=0x55557e299dec
runtime.asyncPreempt()
/usr/local/go/src/runtime/preempt_arm64.s:47 +0x9c fp=0x40060a27c0 sp=0x40060a25d0 pc=0x55557e2dc94c
github.com/vishvananda/netlink/nl.(*NetlinkSocket).Receive(0xff48ce6e060b2848?)
/vendor/github.com/vishvananda/netlink/nl/nl_linux.go:779 +0x130 fp=0x40060b2820 sp=0x40060a27d0 pc=0x55557e9d2880
</code></pre>
            <p></p><p>Once we discovered the root cause we reported it with a reproducer and the bug was quickly fixed. This bug is fixed in <a href="https://github.com/golang/go/commit/e8794e650e05fad07a33fb6e3266a9e677d13fa8"><u>go1.23.12</u></a>, <a href="https://github.com/golang/go/commit/6e1c4529e4e00ab58572deceab74cc4057e6f0b6"><u>go1.24.6</u></a>, and <a href="https://github.com/golang/go/commit/f7cc61e7d7f77521e073137c6045ba73f66ef902"><u>go1.25.0</u></a>. Previously, the Go compiler emitted a single <code>add x, rsp</code> instruction and relied on the assembler to split immediates into multiple opcodes as necessary. After this change, stacks larger than 1&lt;&lt;12 will build the offset in a temporary register and then add that to <code>rsp</code> in a single, indivisible opcode. A goroutine can be preempted before or after the stack pointer modification, but never during. This means that the stack pointer is always valid and there is no race condition.</p>
            <pre><code>LDP -8(RSP), (R29, R30)
MOVD $32, R27
MOVK $(1&lt;&lt;16), R27
ADD R27, RSP, RSP
RET</code></pre>
            <p></p><p>This was a very fun problem to debug. We don’t often see bugs where you can accurately blame the compiler. Debugging it took weeks and we had to learn about areas of the Go runtime that people don’t usually need to think about. It’s a nice example of a rare race condition, the sort of bug that can only really be quantified at a large scale.</p><p>We’re always looking for people who enjoy this kind of detective work. <a href="https://www.cloudflare.com/careers/jobs/?department=Engineering"><u>Our engineering teams are hiring</u></a>.   </p> ]]></content:encoded>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Go]]></category>
            <category><![CDATA[Programming]]></category>
            <guid isPermaLink="false">12E3V053vhNrZrU5I2AAV1</guid>
            <dc:creator>Thea Heinen</dc:creator>
        </item>
        <item>
            <title><![CDATA[Building Jetflow: a framework for flexible, performant data pipelines at Cloudflare]]></title>
            <link>https://blog.cloudflare.com/building-jetflow-a-framework-for-flexible-performant-data-pipelines-at-cloudflare/</link>
            <pubDate>Wed, 23 Jul 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ Faced with a data-ingestion challenge at a massive scale, Cloudflare's Business Intelligence team built a new framework called Jetflow. ]]></description>
            <content:encoded><![CDATA[ <p>The Cloudflare Business Intelligence team manages a <a href="https://simple.wikipedia.org/wiki/Petabyte"><u>petabyte</u></a>-scale <a href="https://www.cloudflare.com/learning/cloud/what-is-a-data-lake/"><u>data lake</u></a> and ingests thousands of tables every day from many different sources. These include internal databases such as Postgres and ClickHouse, as well as external SaaS applications such as Salesforce. These tasks are often complex and tables may have hundreds of millions or billions of rows of new data each day. They are also business-critical for product decisions, growth plannings, and internal monitoring. In total, about <b>141 billion rows</b> are ingested every day.</p><p>As Cloudflare has grown, the data has become ever larger and more complex. Our existing <a href="https://www.ibm.com/think/topics/elt"><u>Extract Load Transform (ELT)</u></a> solution could no longer meet our technical and business requirements. After evaluating other common ELT solutions, we concluded that their performance generally did not surpass our current system, either.</p><p>It became clear that we needed to build our own framework to cope with our unique requirements — and so <b>Jetflow</b> was born. </p>
    <div>
      <h2>What we achieved</h2>
      <a href="#what-we-achieved">
        
      </a>
    </div>
    <p><b>Over 100x efficiency improvement in GB-s</b>:</p><ul><li><p>Our longest running job with 19 billion rows was taking <b>48 hours</b> using <b>300 GB of memory</b>, and now completes in <b>5.5 hours</b> using <b>4 GB of memory</b></p></li><li><p>We estimate that ingestion of 50 TB from Postgres via <b>Jetflow</b> could cost under $100 based on rates published by commercial cloud providers</p></li></ul><p><b>&gt;10x performance improvement:</b></p><ul><li><p>Our largest dataset was ingesting <b>60-80,000</b> rows per second; it now ingests <b>2-5 million</b> rows per second per database connection.</p></li><li><p>In addition, these numbers scale well with multiple database connections for some databases.</p></li></ul><p><b>Extensibility: </b></p><ul><li><p>The modular design makes it easy to extend and test. Today <b>Jetflow</b> works with ClickHouse, Postgres, Kafka, many different SaaS APIs, Google BigQuery and many others. It has continued to work well and remain flexible with the addition of new use cases.</p></li></ul>
    <div>
      <h2>How did we do this?</h2>
      <a href="#how-did-we-do-this">
        
      </a>
    </div>
    
    <div>
      <h3>Requirements</h3>
      <a href="#requirements">
        
      </a>
    </div>
    <p>The first step to designing our new framework had to be a clear understanding of the problems we were aiming to solve, with clear requirements to stop us creating new ones.</p>
    <div>
      <h5>Performant &amp; efficient</h5>
      <a href="#performant-efficient">
        
      </a>
    </div>
    <p>We needed to be able to move more data in less time as some ingestion jobs were taking ~24 hours, and our data will only grow. The data should be ingested in a streaming fashion and use less memory and compute resources than our existing solution.</p>
    <div>
      <h5>Backwards compatible </h5>
      <a href="#backwards-compatible">
        
      </a>
    </div>
    <p>Given the daily ingestion of thousands of tables, the chosen solution needed to allow for the migration of individual tables as needed. Due to our usage of <a href="https://spark.apache.org/"><u>Spark</u></a> downstream and Spark's limitations in merging disparate <a href="https://parquet.apache.org/"><u>Parquet</u></a> schemas, the chosen solution had to offer the flexibility to generate the precise schemas needed in each case to match the legacy output.</p><p>We also required seamless integration with our custom metadata system, used for dependency checks and job status information.</p>
    <div>
      <h5>Ease of use</h5>
      <a href="#ease-of-use">
        
      </a>
    </div>
    <p>We want a configuration file that can be version-controlled, without introducing bottlenecks on repositories with many concurrent changes.</p><p>To increase accessibility for different roles within the team, another requirement was no-code (or configuration as code) in the vast majority of cases. Users should not have to worry about availability or translation of data types between source and target systems, or writing new code for each new ingestion. The configuration needed should also be minimal — for example, data schema should be inferred from the source system and not need to be supplied by the user.</p>
    <div>
      <h5>Customizable</h5>
      <a href="#customizable">
        
      </a>
    </div>
    <p>Striking a balance with the no-code requirement above: although we want a low barrier to entry, we also want the option to tune and override settings if desired, via a flexible and optional configuration layer. For example, writing Parquet files is often more expensive than reading from the database, so we want to be able to allocate more resources and concurrency as needed. </p><p>Additionally, we wanted to allow for control over where the work is executed, with the ability to spin up concurrent workers in different threads, different containers, or on different machines. The execution of workers and communication of data was abstracted away behind an interface, and different implementations can be written and injected, controlled via the job configuration. </p>
    <div>
      <h5>Testable</h5>
      <a href="#testable">
        
      </a>
    </div>
    <p>We wanted a solution capable of running locally in a containerized environment, which would allow us to write tests for every stage of the pipeline. With “black box” solutions, testing often means validating the output after making a change, which is a slow feedback loop, risks not testing all edge cases as there isn’t good visibility of all code paths internally, and makes debugging issues painful.</p>
    <div>
      <h3>Designing a flexible framework </h3>
      <a href="#designing-a-flexible-framework">
        
      </a>
    </div>
    <p>To build a truly flexible framework, we broke the pipeline down into distinct stages, and then created a config layer to define the composition of the pipeline from these stages, along with any configuration overrides. Every pipeline configuration that makes sense logically should execute correctly, and users should not be able to create pipeline configs that do not work. </p>
    <div>
      <h5>Pipeline configuration</h5>
      <a href="#pipeline-configuration">
        
      </a>
    </div>
    <p>This led us to a design where we created stages which were classified according to the meaningfully different categories of:</p><ul><li><p>Consumers</p></li><li><p>Transformers</p></li><li><p>Loaders</p></li></ul>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ALi9AXyo5v1Y7cIjK619V/3c0735e2d5b36d5660072f92fe551ed3/image3.png" />
          </figure><p>The pipeline was constructed via a <a href="https://yaml.org/"><u>YAML</u></a> file that required a consumer, zero or more transformers, and at least one loader. Consumers create a data stream (via reading from the source system), Transformers (e.g. data transformations, validations) take a data stream input and output a data stream conforming to the same API so that they can be chained, and Loaders have the same data streaming interface, but are the stages with persistent effects — i.e. stages where data is saved to an external system. </p><p>This modular design means that each stage is independently testable, with shared behaviour (such as error handling and concurrency) inherited from shared base stages, significantly decreasing development time for new use cases and increasing confidence in code correctness.</p>
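<p>As a sketch of this composition rule (the type names here are illustrative, not Jetflow's actual API), stages that share one streaming contract can be chained freely:</p>

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical stage types: one consumer, zero or more transformers,
// and at least one loader, all exchanging the same stream type so any
// logically valid pipeline composes.
type Batch []string

type Consumer func() <-chan Batch
type Transformer func(<-chan Batch) <-chan Batch
type Loader func(<-chan Batch)

// run wires the stages together: transformers chain because their
// input and output contracts are identical.
func run(c Consumer, ts []Transformer, l Loader) {
	stream := c()
	for _, t := range ts {
		stream = t(stream)
	}
	l(stream)
}

func main() {
	consume := Consumer(func() <-chan Batch {
		out := make(chan Batch, 1)
		out <- Batch{"a", "b"} // stand-in for rows read from a source system
		close(out)
		return out
	})
	upper := Transformer(func(in <-chan Batch) <-chan Batch {
		out := make(chan Batch)
		go func() {
			defer close(out)
			for b := range in {
				for i := range b {
					b[i] = strings.ToUpper(b[i]) // an example transformation
				}
				out <- b
			}
		}()
		return out
	})
	load := Loader(func(in <-chan Batch) {
		for b := range in {
			fmt.Println(b) // prints "[A B]" – stand-in for persisting data
		}
	})
	run(consume, []Transformer{upper}, load)
}
```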
    <div>
      <h5>Data divisions</h5>
      <a href="#data-divisions">
        
      </a>
    </div>
    <p>Next, we designed a breakdown for the data that would allow the pipeline to be idempotent both on whole pipeline re-run and also on internal retry of any data partition due to transient error. We decided on a design that let us parallelize processing, while maintaining meaningful data divisions that allowed the pipeline to perform cleanups of data where required for a retry.</p><ul><li><p><b>RunInstance</b>: the least granular division, corresponding to a business unit for a single run of the pipeline (e.g. one month/day/hour of data). </p></li><li><p><b>Partition</b>: a division of the RunInstance that allows each row to be allocated to a partition in a way that is deterministic and self-evident from the row data without external state, and is therefore idempotent on retry. (e.g. an accountId range, a 10-minute interval)</p></li><li><p><b>Batch</b>: a division of the partition data that is non-deterministic and used only to break the data down into smaller chunks for streaming/parallel processing for faster processing with fewer resources. (e.g. 10k rows, 50 MB)</p></li></ul><p>The options that the user configures in the consumer stage YAML both construct the query that is used to retrieve the data from the source system, and also encode the semantic meaning of this data division in a system agnostic way, so that later stages understand what this data represents — e.g. this partition contains the data for all accounts IDs 0-500. This means that we can do targeted data cleanup and avoid, for example, duplicate data entries if a single data partition is retried due to error.</p>
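<p>For instance, a partition keyed on an account ID range can be computed from the row alone, with no external state (the numbers and names here are illustrative, not Jetflow's actual configuration):</p>

```go
package main

import "fmt"

// A row's partition is derivable from the row data itself, so a retry
// of partition N can delete and rewrite exactly the same rows,
// keeping the pipeline idempotent.
const partitionSize = 500 // account IDs per partition (illustrative)

func partitionFor(accountID int) int {
	return accountID / partitionSize
}

func main() {
	for _, id := range []int{0, 499, 500, 1234} {
		fmt.Printf("account %d -> partition %d\n", id, partitionFor(id))
	}
}
```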
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1NQsitJmwRwSpiLkj2Hoig/81db1750523268bb427d51e1d2746a46/image2.png" />
          </figure>
    <div>
      <h3>Framework implementation</h3>
      <a href="#framework-implementation">
        
      </a>
    </div>
    
    <div>
      <h5>Standard internal state for stage compatibility </h5>
      <a href="#standard-internal-state-for-stage-compatibility">
        
      </a>
    </div>
    <p>Our most common use case is something like: read from a database, convert to Parquet format, and then save to <a href="https://www.cloudflare.com/learning/cloud/what-is-object-storage/">object storage</a>, with each of these steps being a separate stage. As more use cases were onboarded to <b>Jetflow</b>, we had to make sure that if someone wrote a new stage it would be compatible with the other stages. We don’t want to create a situation where new code needs to be written for every output format and target system, or where you end up with a custom pipeline for every different use case.</p><p>The way we have solved this problem is by having our stage extractor class only allow output data in a single format. As long as a downstream stage supports this format as both its input and output format, it is compatible with the rest of the pipeline. This seems obvious in retrospect, but internally was a painful learning experience, as we originally created a custom type system and struggled with stage interoperability. </p><p>For this internal format, we chose to use <a href="https://arrow.apache.org/"><u>Arrow</u></a>, an in-memory columnar data format. The key benefits of this format for us are:</p><ul><li><p><b>Arrow ecosystem</b>: Many data projects now support Arrow as an output format. This means when we write extractor stages for new data sources, it is often trivial to produce Arrow output.</p></li><li><p><b>No serialisation overhead</b>: This makes it easy to move Arrow data between machines and even programming languages with minimum overhead. <b>Jetflow</b> was designed from the start to have the flexibility to run in a wide range of systems via a job controller interface, so this efficiency in data transmission means there’s minimal compromise on performance when creating distributed implementations.</p></li><li><p><b>Reserve memory in large fixed-size batches to avoid memory allocations</b>: As Go is a garbage collected (GC) language, and GC cycle times are affected mostly by the number of heap objects rather than their sizes, fewer allocations reduce CPU time spent garbage collecting significantly, even if the total size is the same. For example, with 8192 rows of 10 columns each, Arrow requires only 10 allocations (one contiguous buffer per column), versus the 8192 allocations of most drivers that allocate row by row – fewer objects, and therefore shorter GC cycles.</p></li></ul>
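<p>The allocation argument can be illustrated with plain Go slices (a simplified stand-in for Arrow's column buffers, not Arrow itself):</p>

```go
package main

import "fmt"

const rows, cols = 8192, 10

// allocRowBased models a typical row-oriented driver: one heap object
// per row, so 8192 small allocations for the GC to track.
func allocRowBased() [][]int64 {
	data := make([][]int64, rows)
	for i := range data {
		data[i] = make([]int64, cols)
	}
	return data
}

// allocColumnar models Arrow-style buffers: one contiguous object per
// column, so only 10 allocations for the same total data.
func allocColumnar() [][]int64 {
	data := make([][]int64, cols)
	for i := range data {
		data[i] = make([]int64, rows)
	}
	return data
}

func main() {
	// Same total memory either way, but roughly 800x fewer heap
	// objects for the GC to scan in the columnar layout.
	fmt.Println(len(allocRowBased()), len(allocColumnar())) // 8192 10
}
```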
    <div>
      <h5>Converting rows to columns</h5>
      <a href="#converting-rows-to-columns">
        
      </a>
    </div>
    <p>Another important performance optimization was reducing the number of conversion steps that happen when reading and processing data. Most data ingestion frameworks internally represent data as rows. In our case, we are mostly writing data in Parquet format, which is column-based. When reading data from column-based sources (e.g. ClickHouse, where most drivers receive RowBinary format), converting into a row-based in-memory representation for the specific language implementation is inefficient. This is then converted again from rows to columns to write Parquet files. These conversions result in a significant performance impact.</p><p><b>Jetflow</b> instead reads data from column-based sources in columnar formats (e.g. ClickHouse's native Block format) and then copies this data into Arrow column format. Parquet files are then written directly from Arrow columns. The simplification of this process improves performance.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5HEuO0Cn6Wob7tuR9hjnSP/852cdc44244f107b4289fb3b3553d213/image1.png" />
          </figure>
    <div>
      <h3>Writing each pipeline stage</h3>
      <a href="#writing-each-pipelines-stage">
        
      </a>
    </div>
    
    <div>
      <h5>Case study: ClickHouse</h5>
      <a href="#case-study-clickhouse">
        
      </a>
    </div>
    <p>When testing an initial version of <b>Jetflow</b>, we discovered that due to the architecture of ClickHouse, using additional connections would not be of any benefit, since ClickHouse was reading faster than we were receiving data. It should then be possible, with a more optimized database driver, to take better advantage of that single connection and read a much larger number of rows per second, without needing additional connections.</p><p>Initially, a custom database driver was written for ClickHouse, but we ended up switching to the excellent <a href="https://github.com/ClickHouse/ch-go"><u>ch-go low level library</u></a>, which directly reads <a href="https://clickhouse.com/docs/development/architecture#block"><u>Blocks</u></a> from ClickHouse in a columnar format. This had a dramatic effect on performance in comparison to the standard Go driver. Combined with the framework optimisations above, we now <b>ingest millions of rows per second</b> with a single ClickHouse connection.</p><p>A valuable lesson learned is that, as with any software, tradeoffs are often made for the sake of convenience or a common use case that may not match your own. Most database drivers tend not to be optimized for reading large batches of rows, and have high per-row overhead.</p>
    <div>
      <h5>Case study: Postgres</h5>
      <a href="#case-study-postgres">
        
      </a>
    </div>
    <p>For Postgres, we use the excellent <a href="https://github.com/jackc/pgx"><u>jackc/pgx</u></a> driver, but instead of using the database/sql Scan interface, we directly receive the raw bytes for each row and use the jackc/pgx internal scan functions for each Postgres OID (Object Identifier) type.</p><p>The database/sql Scan interface in Go uses reflection to understand the type passed to the function and then also uses reflection to set each field with the column value received from Postgres. In typical scenarios, this is fast enough and easy to use, but falls short for our use cases in terms of performance. The <a href="https://github.com/jackc/pgx"><u>jackc/pgx</u></a> driver reuses the row bytes produced each time the next Postgres row is requested, resulting in zero allocations per row. This allows us to write high-performance, low-allocation code within Jetflow. With this design, we are able to achieve nearly <b>600,000 rows per second</b> per Postgres connection for most tables, with very low memory usage.</p>
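<p>The difference between the two scan paths can be sketched as follows (an illustration of the general technique, not pgx's actual code):</p>

```go
package main

import (
	"fmt"
	"reflect"
)

// scanReflect mimics the database/sql style: the destination type is
// discovered and set via reflection, with interface boxing per value.
func scanReflect(dst any, v int64) {
	reflect.ValueOf(dst).Elem().SetInt(v)
}

// scanDirect mimics the direct approach: when the column's type is
// known up front, the raw big-endian wire bytes decode straight into a
// typed destination with no reflection and no per-row allocation.
func scanDirect(dst *int64, raw [8]byte) {
	var v int64
	for _, b := range raw {
		v = v<<8 | int64(b)
	}
	*dst = v
}

func main() {
	var a, b int64
	scanReflect(&a, 42)
	scanDirect(&b, [8]byte{0, 0, 0, 0, 0, 0, 0, 42})
	fmt.Println(a, b) // 42 42
}
```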
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>As of early July 2025, the team ingests <b>77 billion</b> records per day via <b>Jetflow</b>. The remaining jobs are in the process of being migrated to <b>Jetflow</b>, which will bring the total daily ingestion to 141 billion records. The framework has allowed us to ingest tables in cases that would not otherwise have been possible, and provided significant cost savings due to ingestions running for less time and with fewer resources. </p><p>In the future, we plan to open source the project, and if you are interested in joining our team to help develop tools like this, then open roles can be found at <a href="https://www.cloudflare.com/en-gb/careers/jobs/"><u>https://www.cloudflare.com/careers/jobs/</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[Data]]></category>
            <category><![CDATA[Go]]></category>
            <category><![CDATA[Performance]]></category>
            <category><![CDATA[Design]]></category>
            <category><![CDATA[Engineering]]></category>
            <guid isPermaLink="false">4wAX6JGDuRNIJwVqwdgrP8</guid>
            <dc:creator>Harry Hough</dc:creator>
            <dc:creator>Rebecca Walton-Jones </dc:creator>
            <dc:creator>Andy Fan</dc:creator>
            <dc:creator>Ricardo Margalhau</dc:creator>
            <dc:creator>Uday Sharma</dc:creator>
        </item>
        <item>
            <title><![CDATA[Over 700 million events/second: How we make sense of too much data]]></title>
            <link>https://blog.cloudflare.com/how-we-make-sense-of-too-much-data/</link>
            <pubDate>Mon, 27 Jan 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ Here we explain how we made our data pipeline scale to 700 million events per second while becoming more resilient than ever before. We share some math behind our approach and some of the designs of  ]]></description>
            <content:encoded><![CDATA[ <p>Cloudflare's network provides an enormous array of services to our customers. We collect and deliver associated data to customers in the form of event logs and aggregated analytics. As of December 2024, our data pipeline is ingesting up to 706M events per second generated by Cloudflare's services, and that represents 100x growth since our <a href="https://blog.cloudflare.com/http-analytics-for-6m-requests-per-second-using-clickhouse/"><u>2018 data pipeline blog post</u></a>. </p><p>At peak, we are moving 107 <a href="https://simple.wikipedia.org/wiki/Gibibyte"><u>GiB</u></a>/s of compressed data, either pushing it directly to customers or subjecting it to additional queueing and batching.</p><p>All of these data streams power things like <a href="https://developers.cloudflare.com/logs/"><u>Logs</u></a>, <a href="https://developers.cloudflare.com/analytics/"><u>Analytics</u></a>, and billing, as well as other products, such as training machine learning models for bot detection. This blog post is focused on techniques we use to efficiently and accurately deal with the high volume of data we ingest for our Analytics products. A previous <a href="https://blog.cloudflare.com/cloudflare-incident-on-november-14-2024-resulting-in-lost-logs/"><u>blog post</u></a> provides a deeper dive into the data pipeline for Logs. </p><p>The pipeline can be roughly described by the following diagram.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5ihv6JXx19nJiEyfCaCg8V/ad7081720514bafd070cc38a04bc7097/BLOG-2486_2.jpg" />
          </figure><p>The data pipeline has multiple stages, and each can and will naturally break or slow down because of hardware failures or misconfiguration. And when that happens, there is just too much data to be able to buffer it all for very long. Eventually some will get dropped, causing gaps in analytics and a degraded product experience unless proper mitigations are in place.</p>
    <div>
      <h3>Dropping data to retain information</h3>
      <a href="#dropping-data-to-retain-information">
        
      </a>
    </div>
    <p>How does one retain valuable information from more than half a billion events per second, when some must be dropped? Drop it in a controlled way, by downsampling.</p><p>Here is a visual analogy showing the difference between uncontrolled data loss and downsampling. In both cases the same number of pixels were delivered. One is a higher resolution view of just a small portion of a popular painting, while the other shows the full painting, albeit blurry and highly pixelated.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4kUGB4RLQzFb7cphMpHqAg/e7ccf871c73e0e8ca9dcac32fe265f18/Screenshot_2025-01-24_at_10.57.17_AM.png" />
          </figure><p>As we noted above, any point in the pipeline can fail, so we want the ability to downsample at any point as needed. Some services proactively downsample data at the source before it even hits Logfwdr. This makes the information extracted from that data a little bit blurry, but much more useful than what otherwise would be delivered: random chunks of the original with gaps in between, or even nothing at all. The amount of "blur" is outside our control (we make our best effort to deliver full data), but there is a robust way to estimate it, as discussed in the <a href="/how-we-make-sense-of-too-much-data/#extracting-value-from-downsampled-data"><u>next section</u></a>.</p><p>Logfwdr can decide to downsample data sitting in the buffer when it overflows. Logfwdr handles many data streams at once, so we need to prioritize them by assigning each data stream a weight and then applying <a href="https://en.wikipedia.org/wiki/Max-min_fairness"><u>max-min fairness</u></a> to better utilize the buffer. It allows each data stream to store as much as it needs, as long as the whole buffer is not saturated. Once it is saturated, streams divide it fairly according to their weighted size.</p><p>In our implementation (Go), each data stream is driven by a goroutine, and they cooperate via channels. They consult a single tracker object every time they allocate and deallocate memory. The tracker uses a <a href="https://en.wikipedia.org/wiki/Heap_(data_structure)"><u>max-heap</u></a> to always know who the heaviest participant is and what the total usage is. Whenever the total usage goes over the limit, the tracker repeatedly sends the "please shed some load" signal to the heaviest participant, until the usage is again under the limit.</p><p>The effect of this is that healthy streams, which buffer a tiny amount, allocate whatever they need without losses. 
But any lagging streams split the remaining memory allowance fairly.</p><p>We downsample more or less uniformly, by always taking some of the least downsampled batches from the buffer (using min-heap to find those) and merging them together upon downsampling.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/15VP0VYkrvkQboX9hrOy0q/e3d087fe704bd1b0ee41eb5b7a24b899/BLOG-2486_4.png" />
          </figure><p><sup><i>Merging keeps the batches roughly the same size and their number under control.</i></sup></p><p>Downsampling is cheap, but since data in the buffer is compressed, it causes recompression, which is the single most expensive thing we do to the data. But using extra CPU time is the last thing you want to do when the system is under heavy load! We compensate for the recompression costs by starting to downsample the fresh data as well (before it gets compressed for the first time) whenever the stream is in the "shed the load" state.</p><p>We called this approach "bottomless buffers", because you can squeeze effectively infinite amounts of data in there, and it will just automatically be thinned out. Bottomless buffers resemble <a href="https://en.wikipedia.org/wiki/Reservoir_sampling"><u>reservoir sampling</u></a>, where the buffer is the reservoir and the population comes as the input stream. But there are some differences. First is that in our pipeline the input stream of data never ends, while reservoir sampling assumes it ends to finalize the sample. Secondly, the resulting sample also never ends.</p><p>Let's look at the next stage in the pipeline: Logreceiver. It sits in front of a distributed queue. The purpose of logreceiver is to partition each stream of data by a key that makes it easier for Logpush, Analytics inserters, or some other process to consume.</p><p>Logreceiver proactively performs adaptive sampling of analytics. This improves the accuracy of analytics for small customers (receiving on the order of 10 events per day), while more aggressively downsampling large customers (millions of events per second). Logreceiver then pushes the same data at multiple resolutions (100%, 10%, 1%, etc.) into different topics in the distributed queue. 
This allows it to keep pushing something rather than nothing when the queue is overloaded, by just skipping writing the high-resolution samples of data.</p><p>The same goes for Inserters: they can skip <i>reading or writing</i> high-resolution data. The Analytics APIs can skip <i>reading</i> high resolution data. The analytical database might be unable to read high resolution data because of overload or degraded cluster state or because there is just too much to read (very wide time range or very large customer). Adaptively dropping to lower resolutions allows the APIs to return <i>some</i> results in all of those cases.</p>
    <div>
      <h3>Extracting value from downsampled data</h3>
      <a href="#extracting-value-from-downsampled-data">
        
      </a>
    </div>
    <p>Okay, we have some downsampled data in the analytical database. It looks like the original data, but with some rows missing. How do we make sense of it? How do we know if the results can be trusted?</p><p>Let's look at the math.</p>Since the amount of sampling can vary over time and between nodes in the distributed system, we need to store this information along with the data. With each event $x_i$ we store its sample interval, which is the reciprocal of its inclusion probability $\pi_i = \frac{1}{\text{sample interval}}$. For example, if we sample 1 in every 1,000 events, each of the events included in the resulting sample will have its $\pi_i = 0.001$, so the sample interval will be 1,000. When we further downsample that batch of data, the inclusion probabilities (and the sample intervals) multiply together: a 1 in 1,000 sample from a 1 in 1,000 sample is a 1 in 1,000,000 sample of the original population. The sample interval of an event can also be interpreted roughly as the number of original events that this event represents, so in the literature it is known as weight $w_i = \frac{1}{\pi_i}$.
<p></p>
We rely on the <a href="https://en.wikipedia.org/wiki/Horvitz%E2%80%93Thompson_estimator">Horvitz-Thompson estimator</a> (HT, <a href="https://www.stat.cmu.edu/~brian/905-2008/papers/Horvitz-Thompson-1952-jasa.pdf">paper</a>) in order to derive analytics about $x_i$. It gives two estimates: the analytical estimate (e.g. the population total or size) and the estimate of the variance of that estimate. The latter enables us to figure out how accurate the results are by building <a href="https://en.wikipedia.org/wiki/Confidence_interval">confidence intervals</a>. They define ranges that cover the true value with a given probability <i>(confidence level)</i>. A typical confidence level is 0.95, at which a confidence interval (a, b) tells you that you can be 95% sure the true SUM or COUNT is between a and b.
<p></p><p>So far, we know how to use the HT estimator for doing SUM, COUNT, and AVG.</p>Given a sample of size $n$, consisting of values $x_i$ and their inclusion probabilities $\pi_i$, the HT estimator for the population total (i.e. SUM) would be

$$\widehat{T}=\sum_{i=1}^n{\frac{x_i}{\pi_i}}=\sum_{i=1}^n{x_i w_i}.$$

The variance of $\widehat{T}$ is:

$$\widehat{V}(\widehat{T}) = \sum_{i=1}^n{x_i^2 \frac{1 - \pi_i}{\pi_i^2}} + \sum_{i \neq j}^n{x_i x_j \frac{\pi_{ij} - \pi_i \pi_j}{\pi_{ij} \pi_i \pi_j}},$$

where $\pi_{ij}$ is the probability of both $i$-th and $j$-th events being sampled together.
<p></p>
We use <a href="https://en.wikipedia.org/wiki/Poisson_sampling">Poisson sampling</a>, where each event is subjected to an independent <a href="https://en.wikipedia.org/wiki/Bernoulli_trial">Bernoulli trial</a> ("coin toss") which determines whether the event becomes part of the sample. Since each trial is independent, we can equate $\pi_{ij} = \pi_i \pi_j$, which when plugged in the variance estimator above turns the right-hand sum to zero:

$$\widehat{V}(\widehat{T}) = \sum_{i=1}^n{x_i^2 \frac{1 - \pi_i}{\pi_i^2}} + \sum_{i \neq j}^n{x_i x_j \frac{0}{\pi_{ij} \pi_i \pi_j}},$$

thus

$$\widehat{V}(\widehat{T}) = \sum_{i=1}^n{x_i^2 \frac{1 - \pi_i}{\pi_i^2}} = \sum_{i=1}^n{x_i^2 w_i (w_i-1)}.$$

For COUNT we use the same estimator, but plug in $x_i = 1$. This gives us:

$$\begin{align}
\widehat{C} &amp;= \sum_{i=1}^n{\frac{1}{\pi_i}} = \sum_{i=1}^n{w_i},\\
\widehat{V}(\widehat{C}) &amp;= \sum_{i=1}^n{\frac{1 - \pi_i}{\pi_i^2}} = \sum_{i=1}^n{w_i (w_i-1)}.
\end{align}$$

For AVG we would use

$$\begin{align}
\widehat{\mu} &amp;= \frac{\widehat{T}}{N},\\
\widehat{V}(\widehat{\mu}) &amp;= \frac{\widehat{V}(\widehat{T})}{N^2},
\end{align}$$

if we could, but the original population size $N$ is not known: it is not stored anywhere, and it cannot even be stored, because of custom filtering at query time. Plugging in $\widehat{C}$ instead of $N$ only partially works. It gives a valid estimator for the mean itself, but not for its variance, so the constructed confidence intervals are unusable.
<p></p>
In all cases the corresponding pair of estimates is used as the $\mu$ and $\sigma^2$ of the normal distribution (because of the <a href="https://en.wikipedia.org/wiki/Central_limit_theorem">central limit theorem</a>), and then the bounds for the confidence interval (of confidence level $\alpha$) are:

$$\Big( \mu - \Phi^{-1}\big(\frac{1 + \alpha}{2}\big) \cdot \sigma, \quad \mu + \Phi^{-1}\big(\frac{1 + \alpha}{2}\big) \cdot \sigma\Big).$$<p>We do not know $N$, but there is a workaround: simultaneous confidence intervals. Construct confidence intervals for SUM and COUNT independently, and then combine them into a confidence interval for AVG. This is known as the <a href="https://www.sciencedirect.com/topics/mathematics/bonferroni-method"><u>Bonferroni method</u></a>. It requires generating wider intervals for SUM and COUNT, each at half the allowed error rate (for example, 97.5% confidence each to yield 95% for AVG). Here is a simplified visual representation, but the actual estimator will have to take into account the possibility of the orange area going below zero.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/69Vvi2CHSW8Gew0TWHSndj/1489cfe1ff57df4e7e1ca3c31a8444a5/BLOG-2486_5.png" />
          </figure><p>In SQL, the estimators and confidence intervals look like this:</p>
            <pre><code>WITH sum(x * _sample_interval)                              AS t,
     sum(x * x * _sample_interval * (_sample_interval - 1)) AS vt,
     sum(_sample_interval)                                  AS c,
     sum(_sample_interval * (_sample_interval - 1))         AS vc,
     -- ClickHouse does not expose the erf⁻¹ function, so we precompute some magic numbers,
     -- (only for 95% confidence, will be different otherwise):
     --   1.959963984540054 = Φ⁻¹((1+0.950)/2) = √2 * erf⁻¹(0.950)
     --   2.241402727604945 = Φ⁻¹((1+0.975)/2) = √2 * erf⁻¹(0.975)
     1.959963984540054 * sqrt(vt) AS err950_t,
     1.959963984540054 * sqrt(vc) AS err950_c,
     2.241402727604945 * sqrt(vt) AS err975_t,
     2.241402727604945 * sqrt(vc) AS err975_c
SELECT t - err950_t AS lo_total,
       t            AS est_total,
       t + err950_t AS hi_total,
       c - err950_c AS lo_count,
       c            AS est_count,
       c + err950_c AS hi_count,
       (t - err975_t) / (c + err975_c) AS lo_average,
       t / c                           AS est_average,
       (t + err975_t) / (c - err975_c) AS hi_average
FROM ...</code></pre>
            <p>Construct a confidence interval for each timeslot on the timeseries, and you get a confidence band, clearly showing the accuracy of the analytics. The figure below shows an example of such a band in shading around the line.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4JEnnC6P4BhM8qB8J5yKqt/3635835967085f9b24f64a5731457ddc/BLOG-2486_6.png" />
          </figure>
    <div>
      <h3>Sampling is easy to screw up</h3>
      <a href="#sampling-is-easy-to-screw-up">
        
      </a>
    </div>
    <p>We started using confidence bands on our internal dashboards, and after a while noticed something scary: a systematic error! For one particular website the "total bytes served" estimate was higher than the true control value obtained from rollups, and the confidence bands were way off. See the figure below, where the true value (blue line) is outside the yellow confidence band at all times.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/CHCyKyXqPMj8DnMpBUf3N/772fb61f02b79c59417f66d9dc0b5d19/BLOG-2486_7.png" />
          </figure><p>We checked the stored data for corruption, it was fine. We checked the math in the queries, it was fine. It was only after reading through the source code for all of the systems responsible for sampling that we found a candidate for the root cause.</p><p>We used simple random sampling everywhere, basically "tossing a coin" for each event, but in Logreceiver sampling was done differently. Instead of sampling <i>randomly</i> it would perform <i>systematic sampling</i> by picking events at equal intervals starting from the first one in the batch.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4xUwjxdylG5ARlFlDtv1OC/76db68677b7ae072b0a065f59d82c6f2/BLOG-2486_8.png" />
          </figure><p>Why would that be a problem?</p>There are two reasons. The first is that we can no longer claim $\pi_{ij} = \pi_i \pi_j$, so the simplified variance estimator stops working and confidence intervals cannot be trusted. But even worse, the estimator for the total becomes biased. To understand exactly why, we wrote a short reproduction in Python:
<br /><p></p>
            <pre><code>import itertools

def take_every(src, period):
    for i, x in enumerate(src):
        if i % period == 0:
            yield x

pattern = [10, 1, 1, 1, 1, 1]
sample_interval = 10 # bad if it has common factors with len(pattern)
true_mean = sum(pattern) / len(pattern)

orig = itertools.cycle(pattern)
sample_size = 10000
sample = itertools.islice(take_every(orig, sample_interval), sample_size)

sample_mean = sum(sample) / sample_size

print(f"{true_mean=} {sample_mean=}")</code></pre>
            <p>After playing with different values for <code><b>pattern</b></code> and <code><b>sample_interval</b></code> in the code above, we realized where the bias was coming from.</p><p>Imagine a person opening a huge generated HTML page with many small/cached resources, such as icons. The first response will be big, immediately followed by a burst of small responses. If the website is not visited that much, responses will tend to end up all together at the start of a batch in Logfwdr. Logreceiver does not cut batches, only concatenates them. The first response remains first, so it always gets picked and skews the estimate up.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2WZUzqCwr2A6WgX1T5UE8z/7a2e08b611fb64e64a61e3d5c792fe23/BLOG-2486_9.png" />
          </figure><p>We checked the hypothesis against the raw unsampled data that we happened to have because that particular website was also using one of the <a href="https://developers.cloudflare.com/logs/"><u>Logs</u></a> products. We took all events in a given time range, and grouped them by cutting at gaps of at least one minute. In each group, we ranked all events by time and looked at the variable of interest (response size in bytes), and put it on a scatter plot against the rank inside the group.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2IXtqGkRjV0xs3wvwx609A/81e67736cacbccdd839c2177769ee4fe/BLOG-2486_10.png" />
          </figure><p>A clear pattern! The first response is much more likely to be larger than average.</p><p>We fixed the issue by making Logreceiver shuffle the data before sampling. As we rolled out the fix, the estimation and the true value converged.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4TL1pKDLw7MA6yGMSCahJN/227cb22054e0e8fe65c7766aa6e4b541/BLOG-2486_11.png" />
          </figure><p>Now, after battle testing it for a while, we are confident the HT estimator is implemented properly and we are using the correct sampling process.</p>
    <div>
      <h3>Using Cloudflare's analytics APIs to query sampled data</h3>
      <a href="#using-cloudflares-analytics-apis-to-query-sampled-data">
        
      </a>
    </div>
    <p>We already power most of our analytics datasets with sampled data. For example, the <a href="https://developers.cloudflare.com/analytics/analytics-engine/"><u>Workers Analytics Engine</u></a> exposes the <a href="https://developers.cloudflare.com/analytics/analytics-engine/sql-api/#sampling"><u>sample interval</u></a> in SQL, allowing our customers to build their own dashboards with confidence bands. In the GraphQL API, all of the data nodes that have "<a href="https://developers.cloudflare.com/analytics/graphql-api/sampling/#adaptive-sampling"><u>Adaptive</u></a>" in their name are based on sampled data, and the sample interval is exposed as a field there as well, though it is not possible to build confidence intervals from that alone. We are working on exposing confidence intervals in the GraphQL API, and as an experiment have added them to the count and edgeResponseBytes (sum) fields on the httpRequestsAdaptiveGroups nodes. This is available under <code><b>confidence(level: X)</b></code>.</p><p>Here is a sample GraphQL query:</p>
            <pre><code>query HTTPRequestsWithConfidence(
  $zoneTag: string
  $datetimeStart: string
  $datetimeEnd: string
) {
  viewer {
    zones(filter: { zoneTag: $zoneTag }) {
      httpRequestsAdaptiveGroups(
        filter: {
          datetime_geq: $datetimeStart
          datetime_leq: $datetimeEnd
        }
        limit: 100
      ) {
        confidence(level: 0.95) {
          level
          count {
            estimate
            lower
            upper
            sampleSize
          }
          sum {
            edgeResponseBytes {
              estimate
              lower
              upper
              sampleSize
            }
          }
        }
      }
    }
  }
}
</code></pre>
            <p>The query above asks for the estimates and the 95% confidence intervals for <code><b>SUM(edgeResponseBytes)</b></code> and <code><b>COUNT</b></code>. The results will also show the sample size, which is good to know, as we rely on the <a href="https://en.wikipedia.org/wiki/Central_limit_theorem"><u>central limit theorem</u></a> to build the confidence intervals, thus small samples don't work very well.</p><p>Here is the response from this query:</p>
            <pre><code>{
  "data": {
    "viewer": {
      "zones": [
        {
          "httpRequestsAdaptiveGroups": [
            {
              "confidence": {
                "level": 0.95,
                "count": {
                  "estimate": 96947,
                  "lower": "96874.24",
                  "upper": "97019.76",
                  "sampleSize": 96294
                },
                "sum": {
                  "edgeResponseBytes": {
                    "estimate": 495797559,
                    "lower": "495262898.54",
                    "upper": "496332219.46",
                    "sampleSize": 96294
                  }
                }
              }
            }
          ]
        }
      ]
    }
  },
  "errors": null
}
</code></pre>
            <p>The response shows the estimated count is 96947, and we are 95% confident that the true count lies in the range 96874.24 to 97019.76. Similarly, the estimate and range for the sum of response bytes are provided.</p><p>The estimates are based on a sample size of 96294 rows, which is plenty of samples to calculate good confidence intervals.</p>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>We have discussed what kept our data pipeline scalable and resilient despite doubling in size every 1.5 years, how the math works, and how it is easy to mess up. We are constantly working on better ways to keep the data pipeline, and the products based on it, useful to our customers. If you are interested in doing things like that and want to help us build a better Internet, check out our <a href="http://www.cloudflare.com/careers"><u>careers page</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[Bugs]]></category>
            <category><![CDATA[Analytics]]></category>
            <category><![CDATA[Data]]></category>
            <category><![CDATA[GraphQL]]></category>
            <category><![CDATA[SQL]]></category>
            <category><![CDATA[Go]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Sampling]]></category>
            <guid isPermaLink="false">64DSvKdN853gq5Bx3Cyfij</guid>
            <dc:creator>Constantin Pan</dc:creator>
            <dc:creator>Jim Hawkridge</dc:creator>
        </item>
        <item>
            <title><![CDATA[Improving platform resilience at Cloudflare through automation]]></title>
            <link>https://blog.cloudflare.com/improving-platform-resilience-at-cloudflare/</link>
            <pubDate>Wed, 09 Oct 2024 13:00:00 GMT</pubDate>
            <description><![CDATA[ We realized that we need a way to automatically heal our platform from an operations perspective, and designed and built a workflow orchestration platform to provide these self-healing capabilities   ]]></description>
            <content:encoded><![CDATA[ <p>Failure is an expected state in production systems, and no predictable failure of either software or hardware components should result in a negative experience for users. The exact failure mode may vary, but certain remediation steps must be taken after detection. A common example is when an error occurs on a server, rendering it unfit for production workloads and requiring action to recover.</p><p>When operating at Cloudflare’s scale, it is important to ensure that our platform is able to recover from faults seamlessly. It can be tempting to rely on the expertise of world-class engineers to remediate these faults, but this would be manual, repetitive, unlikely to produce enduring value, and would not scale. In one word: toil; not a viable solution at our scale and rate of growth.</p><p>In this post we discuss how we built the foundations to enable a more scalable future, and what problems it has immediately allowed us to solve.</p>
    <div>
      <h2>Growing pains</h2>
      <a href="#growing-pains">
        
      </a>
    </div>
    <p>The Cloudflare <a href="https://en.wikipedia.org/wiki/Site_reliability_engineering"><u>Site Reliability Engineering (SRE)</u></a> team builds and manages the platform that helps product teams deliver our extensive suite of offerings to customers. One important component of this platform is the collection of servers that power critical products such as Durable Objects, Workers, and DDoS mitigation. We also build and maintain foundational software services that power our product offerings, such as configuration management, provisioning, and IP address allocation systems.</p><p>As part of tactical operations work, we are often required to respond to failures in any of these components to minimize impact to users. Impact can vary from lack of access to a specific product feature, to total unavailability. The level of response required is determined by the priority, which is usually a reflection of the severity of impact on users. Lower-priority failures are more common — a server may run too hot, or experience an unrecoverable hardware error. Higher-priority failures are rare and are typically resolved via a well-defined incident response process, requiring collaboration with multiple other teams.</p><p>The commonality of lower-priority failures makes it obvious when the response required, as defined in runbooks, is “toilsome”. To reduce this toil, we had previously implemented a plethora of solutions to automate runbook actions such as manually-invoked shell scripts, cron jobs, and ad-hoc software services. These had grown organically over time and provided solutions on a case-by-case basis, which led to duplication of work, tight coupling, and lack of context awareness across the solutions.</p><p>We also care about how long it takes to resolve any potential impact on users. A resolution process which involves the manual invocation of a script relies on human action, increasing the Mean-Time-To-Resolve (MTTR) and leaving room for human error. 
This risks increasing the amount of errors we serve to users and degrading trust.</p><p>These problems proved that we needed a way to automatically heal these platform components. This especially applies to our servers, for which failure can cause impact across multiple product offerings. While we have <a href="https://blog.cloudflare.com/unimog-cloudflares-edge-load-balancer"><u>mechanisms to automatically steer traffic away</u></a> from these degraded servers, in some rare cases the breakage is sudden enough to be visible.</p>
    <div>
      <h2>Solving the problem</h2>
      <a href="#solving-the-problem">
        
      </a>
    </div>
    <p>To provide a more reliable platform, we needed a new component that provides a common ground for remediation efforts. This would remove duplication of work, provide unified context-awareness, and increase development speed, which ultimately saves hours of engineering time and effort.</p><p>A good solution would not only allow the SRE team to auto-remediate; it would empower the entire company. The key to adding self-healing capability was a generic interface for all teams to self-service and quickly remediate failures at various levels: machine, service, network, or dependencies.</p><p>A good way to think about auto-remediation is in terms of workflows. A workflow is a sequence of steps to get to a desired outcome. This is not dissimilar to a manual shell script which executes what a human would otherwise do via runbook instructions. Because of this logical fit with workflows and durable execution, we decided to adopt an open-source platform called <a href="https://github.com/temporalio/temporal"><u>Temporal</u></a>. </p><p>The concept of durable execution is useful to gracefully manage infrastructure failures such as network outages and transient failures in external service endpoints. This capability meant we only needed to build a way to schedule “workflow” tasks and have the code provide reliability guarantees by default, using Temporal. This allowed us to focus on building out the orchestration system to support the control and flow of workflow execution in our data centers. </p><p><a href="https://learn.temporal.io/getting_started/go/first_program_in_go/"><u>Temporal’s documentation</u></a> provides a good introduction to writing Temporal workflows.</p>
    <div>
      <h2>Building an Automatic Remediation System</h2>
      <a href="#building-an-automatic-remediation-system">
        
      </a>
    </div>
    <p>Below, we describe how our automatic remediation system works. It is essentially a way to schedule tasks across our global network with built-in reliability guarantees. With this system, teams can serve their customers more reliably. An unexpected failure mode can be recognized and immediately mitigated, while the root cause can be determined later via a more detailed analysis.</p>
    <div>
      <h3>Step one: we need a coordinator</h3>
      <a href="#step-one-we-need-a-coordinator">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/FOpkEE13QcgwHJ9vhcIZj/5b0c5328ee5326794329a4a07c5db065/Building_on_Temporal_process.png" />
          </figure><p>After our initial testing of Temporal, it was now possible to write workflows. But we needed a way to schedule workflow tasks from other internal services. The coordinator was built to serve this purpose, and became the primary mechanism for the authorisation and scheduling of workflows. </p><p>The most important roles of the coordinator are authorisation, workflow task routing, and safety constraints enforcement. Each consumer is authorized via <a href="https://www.cloudflare.com/learning/access-management/what-is-mutual-tls/"><u>mTLS authentication</u></a>, and the coordinator uses an ACL to determine whether to permit the execution of a workflow. An ACL configuration looks like the following example.</p>
            <pre><code>server_config {
    enable_tls = true
    [...]
    route_rule {
      name  = "global_get"
      method = "GET"
      route_patterns = ["/*"]
      uris = ["spiffe://example.com/worker-admin"]
    }
    route_rule {
      name = "global_post"
      method = "POST"
      route_patterns = ["/*"]
      uris = ["spiffe://example.com/worker-admin"]
      allow_public = true
    }
    route_rule {
      name = "public_access"
      method = "GET"
      route_patterns = ["/metrics"]
      uris = []
      allow_public = true
      skip_log_match = true
    }
}
</code></pre>
            <p>Each workflow specifies two key characteristics, declared in an <a href="https://github.com/hashicorp/hcl"><u>HCL</u></a> configuration file: where to run the tasks, and the safety constraints. Example constraints could be whether to run only on a specific node type (such as a database), or whether multiple parallel executions are allowed: if a task has been triggered too many times, that is a sign of a wider problem that might require human intervention. The coordinator uses the Temporal <a href="https://docs.temporal.io/visibility"><u>Visibility API</u></a> to determine the current state of the executions in the Temporal cluster.</p><p>An example of a configuration file is shown below:</p>
            <pre><code>task_queue_target = "&lt;target&gt;"

# The following entries will ensure that
# 1. This workflow is not run at the same time in a 15m window.
# 2. This workflow will not run more than once an hour.
# 3. This workflow will not run more than 3 times in one day.
#
constraint {
    kind = "concurrency"
    value = "1"
    period = "15m"
}

constraint {
    kind = "maxExecution"
    value = "1"
    period = "1h"
}

constraint {
    kind = "maxExecution"
    value = "3"
    period = "24h"
    is_global = true
}
</code></pre>
            
    <div>
      <h3>Step two: Task Routing is amazing</h3>
      <a href="#step-two-task-routing-is-amazing">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4wOIwKfkKzp7k46Z6tPuhW/f35667a65872cf7a90fc9c03f38a48d5/Task_Routing_is_amazing_process.png" />
          </figure><p>An unforeseen benefit of using a central Temporal cluster was the discovery of Task Routing. This feature allows us to schedule a Workflow/Activity on any server that has a running Temporal Worker, and further segment by the type of server, its location, etc. For this reason, we have three primary task queues: the general queue, in which tasks can be executed by any worker in the datacenter; the node type queue, in which tasks can only be executed by a specific node type in the datacenter; and the individual node queue, where we target a specific node for task execution.</p><p>We rely on this heavily to ensure the speed and efficiency of automated remediation. Certain tasks can be run in datacenters with known low latency to an external resource, or on a node type with better performance than others (due to differences in the underlying hardware). This reduces the overall failure rate and latency of task executions. Some tasks are also constrained to run only on a certain node type, such as a database.</p><p>Task Routing also means that we can configure certain task queues to have a higher priority for execution, although this is not a feature we have needed so far. A drawback of task routing is that every Workflow/Activity needs to be registered to the target task queue, which is a common gotcha. Thankfully, it is possible to catch this failure condition with proper testing.</p>
    <div>
      <h3>Step three: when/how to self-heal?</h3>
      <a href="#step-three-when-how-to-self-heal">
        
      </a>
    </div>
    <p>None of this would be relevant if we didn’t put it to good use. A primary design goal for the platform was to ensure we had easy, quick ways to trigger workflows on the most important failure conditions. The next step was to determine the best sources for triggering those actions. The answer was simple: we could trigger workflows from anywhere, as long as the triggers are properly authorized and detect the failure conditions accurately.</p><p>Example triggers are an alerting system, a log tailer, a health check daemon, or an authorized engineer via a chatbot. Such flexibility allows a high level of reuse and lets us invest more in workflow quality and reliability.</p><p>As part of the solution, we built a daemon that can poll a signal source for any unwanted condition and trigger a configured workflow. We initially found <a href="https://blog.cloudflare.com/how-cloudflare-runs-prometheus-at-scale"><u>Prometheus</u></a> useful as a source because it contains both service-level and hardware/system-level metrics. We are also exploring more event-based trigger mechanisms, which could eliminate the need to use precious system resources to poll for metrics.</p><p>We already had internal services that can detect widespread failure conditions for our customers, but they were only able to page a human. With the adoption of auto-remediation, these systems can now react automatically. This ability to create an automatic feedback loop with our customers is the cornerstone of these self-healing capabilities, and we continue to work on stronger signals, faster reaction times, and better prevention of future occurrences.</p><p>The most exciting part, however, is what comes next. Every customer cares about avoiding negative impact from Cloudflare. With this platform we can onboard several services (especially those that are foundational for the critical path) and ensure we react quickly to any failure conditions, even before there is any visible impact.</p>
    <div>
      <h3>Step four: packaging and deployment</h3>
      <a href="#step-four-packaging-and-deployment">
        
      </a>
    </div>
    <p>The whole system is written in <a href="https://go.dev/"><u>Go</u></a>, and a single binary can fulfil each role. We distribute it as an apt package or a container for maximum ease of deployment.</p><p>We deploy a Temporal-based worker to every server we intend to run tasks on, and a daemon in datacenters where we intend to automatically trigger workflows based on local conditions. The coordinator deployment is more nuanced: since we rely on task routing, we can trigger everything from a central coordinator, but we have also found value in running coordinators locally in the datacenters. This is especially useful in datacenters with less capacity or degraded performance, as it removes the need for a round trip to schedule workflows.</p>
    <div>
      <h3>Step five: test, test, test</h3>
      <a href="#step-five-test-test-test">
        
      </a>
    </div>
    <p>Temporal provides native mechanisms to test an entire workflow, via a <a href="https://docs.temporal.io/develop/go/testing-suite"><u>comprehensive test suite</u></a> that supports end-to-end, integration, and unit testing, which we used extensively to prevent regressions while developing. We also ensured proper test coverage for all the critical platform components, especially the coordinator.</p><p>Despite the ease of writing tests, we quickly discovered that they were not enough. After writing workflows, engineers need an environment as close as possible to the target conditions. This is why we configured our staging environments to support quick and efficient testing. These environments receive the latest changes and point to a different (staging) Temporal cluster, which enables experimentation and easy validation of changes.</p><p>After a workflow is validated in the staging environment, we can then do a full release to production. It seems obvious, but catching simple configuration errors before releasing has saved us many hours of development and change-management time.</p>
    <div>
      <h2>Deploying to production</h2>
      <a href="#deploying-to-production">
        
      </a>
    </div>
    <p>As you can guess from the title of this post, we put this in production to automatically react to server-specific errors and unrecoverable failures. To this end, we have a set of services that can detect single-server failure conditions based on analyzed traffic data. Since deployment, we have successfully mitigated potential impact by taking errant single points of failure out of production.</p><p>We have also created a set of workflows to reduce internal toil and improve efficiency. These workflows can automatically test pull requests on target machines, wipe and reset servers after experiments are concluded, and eliminate manual processes that cost many hours of toil.</p><p>Building a system that is maintained by several SRE teams has allowed us to iterate faster and rapidly tackle long-standing problems. We have set ambitious goals for toil elimination and are on course to achieve them, which will allow us to scale faster by removing the human bottleneck.</p>
    <div>
      <h2>Looking to the future</h2>
      <a href="#looking-to-the-future">
        
      </a>
    </div>
    <p>Our immediate plans are to leverage this system to provide a more reliable platform for our customers and drastically reduce operational toil, freeing up engineering resources to tackle larger-scale problems. We also intend to leverage more Temporal features such as <a href="https://docs.temporal.io/develop/go/versioning"><u>Workflow Versioning</u></a>, which will simplify the process of making changes to workflows by ensuring that triggered workflows run expected versions. </p><p> We are also interested in how others are solving problems using durable execution platforms such as Temporal, and general strategies to eliminate toil. If you would like to discuss this further, feel free to reach out on the <a href="https://community.cloudflare.com"><u>Cloudflare Community</u></a> and start a conversation!</p><p>If you’re interested in contributing to projects that help build a better Internet, <a href="https://www.cloudflare.com/en-gb/careers/jobs/?department=Engineering&amp;location=default"><u>our engineering teams are hiring</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[Edge]]></category>
            <category><![CDATA[Engineering]]></category>
            <category><![CDATA[Serverless]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Go]]></category>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <guid isPermaLink="false">2i9tHkPAfy7GxYAioGM100</guid>
            <dc:creator>Opeyemi Onikute</dc:creator>
        </item>
        <item>
            <title><![CDATA[Intelligent, automatic restarts for unhealthy Kafka consumers]]></title>
            <link>https://blog.cloudflare.com/intelligent-automatic-restarts-for-unhealthy-kafka-consumers/</link>
            <pubDate>Tue, 24 Jan 2023 14:00:00 GMT</pubDate>
            <description><![CDATA[ At Cloudflare, we take steps to ensure we are resilient against failure at all levels of our infrastructure. This includes Kafka, which we use for critical workflows such as sending time-sensitive emails and alerts. ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7eWbGD5pEX9bKf2p58iqOw/b55ba4bfd305da7ed38cf66fe770585c/image3-8-2.png" />
            
            </figure><p>At Cloudflare, we take steps to ensure we are resilient against failure at all levels of our infrastructure. This includes Kafka, which we use for critical workflows such as sending time-sensitive emails and alerts.</p><p>We learned a lot about keeping our applications that leverage Kafka healthy, so that they are always operational. Application health checks are notoriously hard to implement: what determines whether an application is healthy? How can we keep services operational at all times?</p><p>Health checks can be implemented in many ways. We’ll talk about an approach that allowed us to considerably reduce incidents with unhealthy applications while requiring less manual intervention.</p>
    <div>
      <h3>Kafka at Cloudflare</h3>
      <a href="#kafka-at-cloudflare">
        
      </a>
    </div>
    <p><a href="/using-apache-kafka-to-process-1-trillion-messages/">Cloudflare is a big adopter of Kafka</a>. We use Kafka as a way to decouple services due to its asynchronous nature and reliability. It allows different teams to work effectively without creating dependencies on one another. You can also read more about how other teams at Cloudflare use Kafka in <a href="/http-analytics-for-6m-requests-per-second-using-clickhouse/">this</a> post.</p><p>Kafka is used to send and receive messages. Messages represent some kind of event, like a credit card payment or the details of a new user created in your platform. These messages can be represented in multiple ways: JSON, Protobuf, Avro and so on.</p><p>Kafka organises messages in topics. A topic is an ordered log of events in which each message is marked with a progressive offset. When an external system writes an event, it is appended to the end of that topic. These events are not deleted from the topic by default (retention can be applied).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2KUYbqCCL74YZVU8NXOThl/4ec5024168993a2300add7221016af0d/1-4.png" />
            
            </figure><p>Topics are stored as log files on disk, which are finite in size. Partitions are a systematic way of breaking the one topic log file into many logs, each of which can be hosted on a separate server, enabling topics to scale.</p><p>Topics are managed by brokers, the nodes in a Kafka cluster. These are responsible for writing new events to partitions, serving reads, and replicating partitions among themselves.</p><p>Messages can be consumed by individual consumers or co-ordinated groups of consumers, known as consumer groups.</p><p>Consumers use a unique id (consumer id) that allows them to be identified by the broker as an application which is consuming from a specific topic.</p><p>Each topic can be read by any number of different consumers, as long as each uses a different id, and each consumer can replay the same messages as many times as it wants.</p><p>When a consumer starts consuming from a topic, it will process all messages, starting from a selected offset, from each partition. With a consumer group, the partitions are divided amongst each consumer in the group. This division is determined by the consumer group leader. This leader will receive information about the other consumers in the group and will decide which consumers will receive messages from which partitions (partition strategy).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6Qe2Qe5nQ5gcHyhV0zpTWw/5182eea9de66164a36a28e92270fdb3f/2-3.png" />
            
            </figure><p>The offset of a consumer’s commit can demonstrate whether the consumer is working as expected. Committing a processed offset is the way a consumer and its consumer group report to the broker that they have processed a particular message.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/29Y9mQiHkvGKUzc3RGF1sk/09d2987f53eef026c164e6c49cacc95c/unnamed-6.png" />
            
            </figure><p>A standard measurement of whether a consumer is processing fast enough is lag. We use this to measure how far behind the newest message we are. This tracks time elapsed between messages being written to and read from a topic. When a service is lagging behind, it means that the consumption is at a slower rate than new messages being produced.</p><p>Due to Cloudflare’s scale, message rates typically end up being very large and a lot of requests are time-sensitive so monitoring this is vital.</p><p>At Cloudflare, our applications using Kafka are deployed as microservices on Kubernetes.</p>
    <div>
      <h3>Health checks for Kubernetes apps</h3>
      <a href="#health-checks-for-kubernetes-apps">
        
      </a>
    </div>
    <p>Kubernetes uses <a href="https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/">probes</a> to understand if a service is healthy and is ready to receive traffic or to run. When a liveness probe fails and the bounds for retrying are exceeded, Kubernetes restarts the services.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4FagbTygES9L7dmEQ6ratD/0a6f0d4c5ac117b723ad726a12d3936a/4-3.png" />
            
            </figure><p>When a readiness probe fails and the bounds for retrying are exceeded, Kubernetes stops sending HTTP traffic to the targeted pods. For Kafka applications this is not relevant, as they don’t run an HTTP server. For this reason, we’ll cover only liveness checks.</p><p>A classic Kafka liveness check on a consumer verifies the status of the connection with the broker. It’s often best practice to keep these checks simple and perform only basic operations; in this case, something like listing topics. If this check fails consistently, for instance because the broker returns a TLS error, Kubernetes terminates the service and starts a new pod of the same service, forcing a new connection. Simple Kafka liveness checks do a good job of detecting when the connection with the broker is unhealthy.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6gNWb3Rit0MmTutsurm7sf/70355c422fab7ebce7d59d8c2c682d6d/5-2.png" />
            
            </figure>
    <div>
      <h3>Problems with Kafka health checks</h3>
      <a href="#problems-with-kafka-health-checks">
        
      </a>
    </div>
    <p>Due to Cloudflare’s scale, a lot of our Kafka topics are divided into multiple partitions (in some cases this can be hundreds!) and in many cases the replica count of our consuming service doesn’t necessarily match the number of partitions on the Kafka topic. This means that in many scenarios this simple approach to health checking is not quite enough!</p><p>Microservices that consume from Kafka topics are healthy if they are consuming and committing offsets at regular intervals when messages are being published to a topic. When such services are not committing offsets as expected, it means that the consumer is in a bad state, and it will start accumulating lag. An approach we often take is to manually terminate and restart the service in Kubernetes; this causes a reconnection and a rebalance.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/N4YalYdgNRxYJK7PVAlzY/26b55fc38c53855a6c28c71b25cdac02/lag.png" />
            
            </figure><p>When a consumer joins or leaves a consumer group, a rebalance is triggered and the consumer group leader must re-assign which consumers will read from which partitions.</p><p>When a rebalance happens, each consumer is notified to stop consuming. Some consumers might get their assigned partitions taken away and re-assigned to another consumer. We noticed that within our library implementation, if the consumer doesn’t acknowledge this command, it waits indefinitely for new messages from a partition that it’s no longer assigned to, ultimately leading to a deadlock. Usually a manual restart of the faulty client-side app is needed to resume processing.</p>
    <div>
      <h3>Intelligent health checks</h3>
      <a href="#intelligent-health-checks">
        
      </a>
    </div>
    <p>As we were seeing consumers reporting as “healthy” but sitting idle, it occurred to us that maybe we were focusing on the wrong thing in our health checks. Just because the service is connected to the Kafka broker and can read from the topic, it does not mean the consumer is actively processing messages.</p><p>Therefore, we realised we should be focused on message ingestion, using the offset values to ensure that forward progress was being made.</p>
    <div>
      <h4>The PagerDuty approach</h4>
      <a href="#the-pagerduty-approach">
        
      </a>
    </div>
    <p>PagerDuty wrote an excellent <a href="https://www.pagerduty.com/eng/kafka-health-checks/">blog</a> on this topic which we used as inspiration when coming up with our approach.</p><p>Their approach used the current (latest) offset and the committed offset values. The current offset signifies the last message that was sent to the topic, while the committed offset is the last message that was processed by the consumer.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2fwem7NtBnO6M1RMhrezr8/af4cbbd7a63d3145f5c7fe9f405bd04d/pasted-image-0-4.png" />
            
            </figure><p>We check that the consumer is moving forward by ensuring that the latest offset is changing (new messages are being received) and the committed offset is changing as well (the new messages are being processed).</p><p>Therefore, the solution we came up with:</p><ul><li><p>If we cannot read the current offset, fail liveness probe.</p></li><li><p>If we cannot read the committed offset, fail liveness probe.</p></li><li><p>If the committed offset == the current offset, pass liveness probe.</p></li><li><p>If the value for the committed offset has not changed since the last run of the health check, fail liveness probe.</p></li></ul>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5r76n2Iew7pSqA8vYNZzIy/c9e0f6a113a34d0c36a216c054e4d840/pasted-image-0--1--3.png" />
            
            </figure><p>To measure whether the committed offset is changing, we need to store the value from the previous run; we do this using an in-memory map keyed by partition number. This means each instance of our service only has a view of the partitions it is currently consuming from, and will run the health check for each.</p>
    <div>
      <h4>Problems</h4>
      <a href="#problems">
        
      </a>
    </div>
    <p>When we first rolled out our smart health checks, we started to notice cascading failures some time after release. After initial investigations, we realised this happened whenever a rebalance occurred. It would initially affect one replica, then quickly cause the others to report as unhealthy.</p><p>Because we stored the previous value of the committed offset in memory, a rebalance could take a partition away from a replica and re-assign it elsewhere. The replica would then incorrectly assume that the committed offset for that partition had not changed (since it was no longer the one updating the value) and start to report itself as unhealthy. The failing liveness probe would cause a restart, which would in turn trigger another rebalance in Kafka, causing other replicas to face the same issue.</p>
    <div>
      <h4>Solution</h4>
      <a href="#solution">
        
      </a>
    </div>
    <p>To fix this issue we needed to ensure that each replica only kept track of the offsets for the partitions it was consuming from at that moment. Luckily, the Shopify Sarama library, which we use internally, has functionality to observe when a rebalancing happens. This meant we could use it to rebuild the in-memory map of offsets so that it would only include the relevant partition values.</p><p>This is handled by receiving the signal from the session context channel:</p>
            <pre><code>for {
  select {
  case message, ok := &lt;-claim.Messages(): // &lt;-- Message received

     // Store latest received offset in-memory
     offsetMap[message.Partition] = message.Offset


     // Handle message
     handleMessage(ctx, message)


     // Commit message offset
     session.MarkMessage(message, "")


  case &lt;-session.Context().Done(): // &lt;-- Rebalance happened

     // Remove rebalanced partition from in-memory map
     delete(offsetMap, claim.Partition())
  }
}</code></pre>
            <p>Verifying this solution was straightforward: we just needed to trigger a rebalance. To test that this worked in all possible scenarios, we spun up a single replica of a service consuming from multiple partitions, then proceeded to scale up the number of replicas until it matched the partition count, then scaled back down to a single replica. By doing this we verified that the health checks could safely handle new partitions being assigned as well as partitions being taken away.</p>
    <div>
      <h3>Takeaways</h3>
      <a href="#takeaways">
        
      </a>
    </div>
    <p>Probes in Kubernetes are very easy to set up and can be a powerful tool to ensure your application is running as expected. Well-implemented probes can often be the difference between engineers being called out to fix trivial issues (sometimes outside working hours) and a service that is self-healing.</p><p>However, without proper thought, “dumb” health checks can also lead to a false sense of security that a service is running as expected even when it’s not. One thing we have learnt from this is to think more about the specific behaviour of the service and decide what being unhealthy means in each instance, instead of just ensuring that dependent services are connected.</p>
            <category><![CDATA[Kafka]]></category>
            <category><![CDATA[Observability]]></category>
            <category><![CDATA[Go]]></category>
            <category><![CDATA[Kubernetes]]></category>
            <guid isPermaLink="false">7s1ijlG7zMlxJPI6Hcs3zl</guid>
            <dc:creator>Chris Shepherd</dc:creator>
            <dc:creator>Andrea Medda</dc:creator>
        </item>
        <item>
            <title><![CDATA[Production ready eBPF, or how we fixed the BSD socket API]]></title>
            <link>https://blog.cloudflare.com/tubular-fixing-the-socket-api-with-ebpf/</link>
            <pubDate>Thu, 17 Feb 2022 17:02:54 GMT</pubDate>
            <description><![CDATA[ We are open sourcing the production tooling we’ve built for the sk_lookup hook we contributed to the Linux kernel, called tubular ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3qt80mUTCxJp6nLkenADAL/29eb329be3097752997f3ac4b9a00f25/tubular-1.png" />
            
            </figure><p>As we develop new products, we often push our operating system, Linux, beyond what is commonly possible. A common theme has been relying on <a href="https://ebpf.io/what-is-ebpf/">eBPF</a> to build technology that would otherwise have required modifying the kernel. For example, we’ve built <a href="/l4drop-xdp-ebpf-based-ddos-mitigations/">DDoS mitigation</a> and a <a href="/unimog-cloudflares-edge-load-balancer/">load balancer</a> and use it to <a href="/introducing-ebpf_exporter/">monitor our fleet of servers</a>.</p><p>This software usually consists of a small-ish eBPF program written in C, executed in the context of the kernel, and a larger user space component that loads the eBPF into the kernel and manages its lifecycle. We’ve found that the ratio of eBPF code to userspace code differs by an order of magnitude or more. We want to shed some light on the issues that a developer has to tackle when dealing with eBPF and present our solutions for building rock-solid, production-ready applications which contain eBPF.</p><p>For this purpose we are open sourcing the production tooling we’ve built for the <a href="https://www.kernel.org/doc/html/latest/bpf/prog_sk_lookup.html">sk_lookup hook</a> we contributed to the Linux kernel, called <b>tubular</b>. It exists because <a href="/its-crowded-in-here/">we’ve outgrown the BSD sockets API</a>. To deliver some products we need features that are just not possible using the standard API.</p><ul><li><p>Our services are available on millions of IPs.</p></li><li><p>Multiple services using the same port on different addresses have to coexist, e.g. 
<a href="https://1.1.1.1/">1.1.1.1</a> resolver and our authoritative DNS.</p></li><li><p>Our Spectrum product <a href="/how-we-built-spectrum/">needs to listen on all 2^16 ports</a>.</p></li></ul><p>The source code for tubular is at <a href="https://github.com/cloudflare/tubular">https://github.com/cloudflare/tubular</a>, and it allows you to do all the things mentioned above. Maybe the most interesting feature is that you can change the addresses of a service on the fly:</p><div></div>
    <div>
      <h2>How tubular works</h2>
      <a href="#how-tubular-works">
        
      </a>
    </div>
    <p><code>tubular</code> sits at a critical point in the Cloudflare stack, since it has to inspect every connection terminated by a server and decide which application should receive it.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6ZFfhU9ui5dR4KpbOcofqr/c61fde7a87d189167a3b72ca90b61e20/unnamed.png" />
            
            </figure><p>Failure to do so will drop or misdirect connections hundreds of times per second. So it has to be incredibly robust during day to day operations. We had the following goals for tubular:</p><ul><li><p><b>Releases must be unattended and happen online.</b> tubular runs on thousands of machines, so we can’t babysit the process or take servers out of production.</p></li><li><p><b>Releases must fail safely.</b> A failure in the process must leave the previous version of tubular running, otherwise we may drop connections.</p></li><li><p><b>Reduce the impact of (userspace) crashes.</b> When the inevitable bug comes along we want to minimise the blast radius.</p></li></ul><p>In the past we had built a proof-of-concept control plane for sk_lookup called <a href="https://github.com/majek/inet-tool">inet-tool</a>, which proved that we could get away without a persistent service managing the eBPF. Similarly, tubular has <code>tubectl</code>: short-lived invocations make the necessary changes and persisting state is handled by the kernel in the form of <a href="https://www.kernel.org/doc/html/latest/bpf/maps.html">eBPF maps</a>. Following this design gave us crash resiliency by default, but left us with the task of mapping the user interface we wanted to the tools available in the eBPF ecosystem.</p>
    <div>
      <h2>The tubular user interface</h2>
      <a href="#the-tubular-user-interface">
        
      </a>
    </div>
    <p>tubular consists of a BPF program that attaches to the sk_lookup hook in the kernel and userspace Go code which manages the BPF program. The <code>tubectl</code> command wraps both in a way that is easy to distribute.</p><p><code>tubectl</code> manages two kinds of objects: bindings and sockets. A binding encodes a rule against which an incoming packet is matched. A socket is a reference to a TCP or UDP socket that can accept new connections or packets.</p><p>Bindings and sockets are "glued" together via arbitrary strings called labels. Conceptually, a binding assigns a label to some traffic. The label is then used to find the correct socket.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3G1ZM7iKvkGurKMBYqonW5/f6612068b8fb404a8cc0af1d02108c43/unnamed--2-.png" />
            
            </figure>
    <div>
      <h3>Adding bindings</h3>
      <a href="#adding-bindings">
        
      </a>
    </div>
    <p>To create a binding that steers port 80 (aka HTTP) traffic destined for 127.0.0.1 to the label “foo” we use <code>tubectl bind</code>:</p>
            <pre><code>$ sudo tubectl bind "foo" tcp 127.0.0.1 80</code></pre>
            <p>Due to the power of sk_lookup we can have much richer constructs than the BSD API allows. For example, we can redirect connections to all IPs in 127.0.0.0/24 to a single socket:</p>
            <pre><code>$ sudo tubectl bind "bar" tcp 127.0.0.0/24 80</code></pre>
            <p>A side effect of this power is that it's possible to create bindings that "overlap":</p>
            <pre><code>1: tcp 127.0.0.1/32 80 -&gt; "foo"
2: tcp 127.0.0.0/24 80 -&gt; "bar"</code></pre>
            <p>The first binding says that HTTP traffic to localhost should go to “foo”, while the second asserts that HTTP traffic in the localhost subnet should go to “bar”. This creates a contradiction: which binding should we choose? tubular resolves this by defining precedence rules for bindings:</p><ol><li><p>A prefix with a longer mask is more specific, e.g. 127.0.0.1/32 wins over 127.0.0.0/24.</p></li><li><p>A port is more specific than the port wildcard, e.g. port 80 wins over "all ports" (0).</p></li></ol><p>Applying this to our example, HTTP traffic to all IPs in 127.0.0.0/24 will be directed to “bar”, except for 127.0.0.1, which goes to “foo”.</p>
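<p>The precedence rules above can be sketched as a comparison function. This is an illustrative sketch, not tubular's actual code; the <code>Binding</code> type and <code>moreSpecific</code> function are invented for the example:</p>

```go
package main

import "fmt"

// Binding is a simplified stand-in for a tubular binding: an IPv4
// prefix length, a port (0 meaning "all ports") and a label.
type Binding struct {
	PrefixLen int
	Port      int
	Label     string
}

// moreSpecific applies the precedence rules from the post, assuming
// both bindings match the incoming packet: a longer mask wins first,
// then a concrete port beats the wildcard.
func moreSpecific(a, b Binding) bool {
	if a.PrefixLen != b.PrefixLen {
		return a.PrefixLen > b.PrefixLen // rule 1: longer mask wins
	}
	return a.Port != 0 && b.Port == 0 // rule 2: port beats wildcard
}

func main() {
	foo := Binding{PrefixLen: 32, Port: 80, Label: "foo"} // tcp 127.0.0.1/32 80
	bar := Binding{PrefixLen: 24, Port: 80, Label: "bar"} // tcp 127.0.0.0/24 80
	all := Binding{PrefixLen: 24, Port: 0, Label: "baz"}  // tcp 127.0.0.0/24, all ports
	fmt.Println(moreSpecific(foo, bar)) // true: /32 beats /24
	fmt.Println(moreSpecific(bar, all)) // true: port 80 beats the wildcard
}
```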
    <div>
      <h3>Getting ahold of sockets</h3>
      <a href="#getting-ahold-of-sockets">
        
      </a>
    </div>
    <p><code>sk_lookup</code> needs a reference to a TCP or a UDP socket to redirect traffic to it. However, a socket is usually accessible only by the process which created it with the socket syscall. For example, an HTTP server creates a TCP listening socket bound to port 80. How can we gain access to the listening socket?</p><p>A fairly well-known solution is to make processes cooperate by passing socket file descriptors via <a href="/know-your-scm_rights/">SCM_RIGHTS</a> messages to a tubular daemon. That daemon can then take the necessary steps to hook up the socket with <code>sk_lookup</code>. This approach has several drawbacks:</p><ol><li><p>It requires modifying processes to send SCM_RIGHTS</p></li><li><p>It requires a tubular daemon, which may crash</p></li></ol><p>There is another way of getting at sockets via systemd, provided <a href="https://www.freedesktop.org/software/systemd/man/systemd.socket.html">socket activation</a> is used. It works by creating an additional service unit with the correct <a href="https://www.freedesktop.org/software/systemd/man/systemd.service.html#Sockets=">Sockets</a> setting. In other words: we can use a systemd oneshot service, executed when the socket unit is created, to register the socket with tubular. For example:</p>
            <pre><code>[Unit]
Requisite=foo.socket

[Service]
Type=oneshot
Sockets=foo.socket
ExecStart=tubectl register "foo"</code></pre>
            <p>Since we can rely on systemd to execute <code>tubectl</code> at the correct times we don't need a daemon of any kind. However, the reality is that a lot of popular software doesn't use systemd socket activation. Dealing with systemd sockets is complicated and doesn't invite experimentation. Which brings us to the final trick: <a href="https://www.man7.org/linux/man-pages/man2/pidfd_getfd.2.html">pidfd_getfd</a>:</p><blockquote><p>The <code>pidfd_getfd()</code> system call allocates a new file descriptor in the calling process. This new file descriptor is a duplicate of an existing file descriptor, targetfd, in the process referred to by the PID file descriptor pidfd.</p></blockquote><p>We can use it to iterate all file descriptors of a foreign process, and pick the socket we are interested in. To return to our example, we can use the following command to find the TCP socket bound to 127.0.0.1 port 8080 in the httpd process and register it under the "foo" label:</p>
            <pre><code>$ sudo tubectl register-pid "foo" $(pidof httpd) tcp 127.0.0.1 8080</code></pre>
            <p>It's easy to wire this up using systemd's <a href="https://www.freedesktop.org/software/systemd/man/systemd.service.html#ExecStartPre=">ExecStartPost</a> if the need arises.</p>
            <pre><code>[Service]
Type=forking # or notify
ExecStart=/path/to/some/command
ExecStartPost=tubectl register-pid "foo" $MAINPID tcp 127.0.0.1 8080</code></pre>
            
    <div>
      <h2>Storing state in eBPF maps</h2>
      <a href="#storing-state-in-ebpf-maps">
        
      </a>
    </div>
    <p>As mentioned previously, tubular relies on the kernel to store state, using <a href="https://prototype-kernel.readthedocs.io/en/latest/bpf/ebpf_maps.html">BPF key / value data structures also known as maps</a>. Using the <a href="https://www.kernel.org/doc/html/latest/userspace-api/ebpf/syscall.html">BPF_OBJ_PIN syscall</a> we can persist them in /sys/fs/bpf:</p>
            <pre><code>/sys/fs/bpf/4026532024_dispatcher
├── bindings
├── destination_metrics
├── destinations
├── sockets
└── ...</code></pre>
            <p>The way the state is structured differs from how the command line interface presents it to users. Labels like “foo” are convenient for humans, but they are of variable length. Dealing with variable length data in BPF is cumbersome and slow, so the BPF program never references labels at all. Instead, the user space code allocates numeric IDs, which are then used in the BPF. Each ID represents a (<code>label</code>, <code>domain</code>, <code>protocol</code>) tuple, internally called <code>destination</code>.</p><p>For example, adding a binding for "foo" <code>tcp 127.0.0.1</code> ... allocates an ID for ("<code>foo</code>", <code>AF_INET</code>, <code>TCP</code>). Including domain and protocol in the destination allows simpler data structures in the BPF. Each allocation also tracks how many bindings reference a destination so that we can recycle unused IDs. This data is persisted into the destinations hash table, which is keyed by (Label, Domain, Protocol) and contains (ID, Count). Metrics for each destination are tracked in destination_metrics in the form of per-CPU counters.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5JI7BzZmFdOS5DO6n2cpjh/3bf1a320954e9de2e4b60e64e0a3b375/unnamed--1--5.png" />
            
            </figure><p><code>bindings</code> is a <a href="https://en.wikipedia.org/wiki/Trie">longest prefix match (LPM) trie</a> which stores a mapping from (<code>protocol</code>, <code>port</code>, <code>prefix</code>) to (<code>ID</code>, <code>prefix length</code>). The ID is used as a key to the sockets map which contains pointers to kernel socket structures. IDs are allocated in a way that makes them suitable as an array index, which allows using the simpler BPF sockmap (an array) instead of a socket hash table. The prefix length is duplicated in the value to work around shortcomings in the BPF API.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4FJXwiooeaLRbRriCrETia/656cc5c0f78ca393627335cec1064755/unnamed--3-.png" />
            
            </figure>
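<p>The ID allocation described above (a small numeric ID per destination, reference-counted so unused IDs can be recycled) can be sketched in userspace Go. This is a simplified illustration, not tubular's implementation; in reality the state lives in the pinned BPF maps:</p>

```go
package main

import "fmt"

// destination mirrors the (label, domain, protocol) tuple from the post.
type destination struct {
	label, domain, protocol string
}

// allocator hands out small numeric IDs for destinations and recycles
// them once no binding references the destination anymore.
type allocator struct {
	ids   map[destination]uint32
	count map[destination]uint32
	free  []uint32 // IDs returned by released destinations
	next  uint32
}

func newAllocator() *allocator {
	return &allocator{
		ids:   make(map[destination]uint32),
		count: make(map[destination]uint32),
	}
}

// acquire returns the ID for d, allocating a fresh one (or recycling a
// freed one) if necessary, and bumps the reference count.
func (a *allocator) acquire(d destination) uint32 {
	if id, ok := a.ids[d]; ok {
		a.count[d]++
		return id
	}
	var id uint32
	if n := len(a.free); n > 0 {
		id, a.free = a.free[n-1], a.free[:n-1]
	} else {
		id = a.next
		a.next++
	}
	a.ids[d] = id
	a.count[d] = 1
	return id
}

// release drops one reference; at zero the ID becomes reusable.
func (a *allocator) release(d destination) {
	if a.count[d]--; a.count[d] == 0 {
		a.free = append(a.free, a.ids[d])
		delete(a.ids, d)
		delete(a.count, d)
	}
}

func main() {
	al := newAllocator()
	foo := destination{"foo", "AF_INET", "TCP"}
	fmt.Println(al.acquire(foo)) // 0: first allocation
	fmt.Println(al.acquire(foo)) // 0: same destination, refcount is now 2
	al.release(foo)
	al.release(foo)
	bar := destination{"bar", "AF_INET", "TCP"}
	fmt.Println(al.acquire(bar)) // 0: foo's ID was recycled
}
```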
    <div>
      <h2>Encoding the precedence of bindings</h2>
      <a href="#encoding-the-precedence-of-bindings">
        
      </a>
    </div>
    <p>As discussed, bindings have a precedence associated with them. To repeat the earlier example:</p>
            <pre><code>1: tcp 127.0.0.1/32 80 -&gt; "foo"
2: tcp 127.0.0.0/24 80 -&gt; "bar"</code></pre>
            <p>The first binding should be matched before the second one. We need to encode this in the BPF somehow. One idea is to generate some code that executes the bindings in order of specificity, a technique we’ve used to great effect in <a href="/l4drop-xdp-ebpf-based-ddos-mitigations/">l4drop</a>:</p>
            <pre><code>1: if (mask(ip, 32) == 127.0.0.1) return "foo"
2: if (mask(ip, 24) == 127.0.0.0) return "bar"
...</code></pre>
            <p>This has the downside that the program gets longer the more bindings are added, which slows down execution. It's also difficult to introspect and debug such long programs. Instead, we use a specialised BPF longest prefix match (LPM) map to do the hard work. This allows inspecting the contents from user space to figure out which bindings are active, which is very difficult if we had compiled bindings into BPF. The LPM map uses a trie behind the scenes, so <a href="https://en.wikipedia.org/wiki/Trie#Searching">lookup has complexity proportional to the length of the key</a> instead of linear complexity for the “naive” solution.</p><p>However, using a map requires a trick for encoding the precedence of bindings into a key that we can look up. Here is a simplified version of this encoding, which ignores IPv6 and uses labels instead of IDs. To insert the binding <code>tcp 127.0.0.0/24 80</code> into a trie we first convert the IP address into a number.</p>
            <pre><code>127.0.0.0    = 0x7f 00 00 00</code></pre>
            <p>Since we're only interested in the first 24 bits of the address, we can write the whole prefix as</p>
            <pre><code>127.0.0.0/24 = 0x7f 00 00 ??</code></pre>
            <p>where “?” means that the value is not specified. We choose the number 0x01 to represent TCP and prepend it and the port number (80 decimal is 0x50 hex) to create the full key:</p>
            <pre><code>tcp 127.0.0.0/24 80 = 0x01 50 7f 00 00 ??</code></pre>
            <p>Converting <code>tcp 127.0.0.1/32 80</code> happens in exactly the same way. Once the converted values are inserted into the trie, the LPM trie conceptually contains the following keys and values.</p>
            <pre><code>LPM trie:
        0x01 50 7f 00 00 ?? = "bar"
        0x01 50 7f 00 00 01 = "foo"</code></pre>
            <p>To find the binding for a TCP packet destined for 127.0.0.1:80, we again encode a key and perform a lookup.</p>
            <pre><code>input:  0x01 50 7f 00 00 01   TCP packet to 127.0.0.1:80
---------------------------
LPM trie:
        0x01 50 7f 00 00 ?? = "bar"
           y  y  y  y  y
        0x01 50 7f 00 00 01 = "foo"
           y  y  y  y  y  y
---------------------------
result: "foo"

y = byte matches</code></pre>
            <p>The trie returns “foo” since its key shares the longest prefix with the input. Note that we stop comparing keys once we reach unspecified “?” bytes, but conceptually “bar” is still a valid result. The distinction becomes clear when looking up the binding for a TCP packet to 127.0.0.255:80.</p>
            <pre><code>input:  0x01 50 7f 00 00 ff   TCP packet to 127.0.0.255:80
---------------------------
LPM trie:
        0x01 50 7f 00 00 ?? = "bar"
           y  y  y  y  y
        0x01 50 7f 00 00 01 = "foo"
           y  y  y  y  y  n
---------------------------
result: "bar"

n = byte doesn't match</code></pre>
            <p>In this case "foo" is discarded since the last byte doesn't match the input. However, "bar" is returned since its last byte is unspecified and therefore considered to be a valid match.</p>
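<p>The whole walkthrough condenses into a short Go sketch: encode keys as in the simplified example above (one protocol byte, one port byte, then the IPv4 address) and emulate the trie with a linear longest-prefix scan. Real keys use a 16-bit port and also handle IPv6, and the kernel uses a real LPM trie rather than a scan; the function names here are invented for illustration:</p>

```go
package main

import (
	"fmt"
	"net"
)

// entry is one trie entry in the simplified scheme from the post:
// key = protocol byte, port byte, IPv4 address; prefixLen is the number
// of meaningful bits (16 for protocol+port, plus the IP mask).
type entry struct {
	key       []byte
	prefixLen int
	label     string
}

// encode builds a key. The single port byte follows the post's
// simplified example; real keys use a 16-bit port.
func encode(proto, port byte, ip string, maskBits int) ([]byte, int) {
	key := append([]byte{proto, port}, net.ParseIP(ip).To4()...)
	return key, 16 + maskBits
}

// matches reports whether the first prefixLen bits of e.key equal the
// corresponding bits of input; trailing "?" bits are never compared.
func matches(e entry, input []byte) bool {
	for i := 0; i < e.prefixLen; i++ {
		mask := byte(1) << (7 - i%8)
		if e.key[i/8]&mask != input[i/8]&mask {
			return false
		}
	}
	return true
}

// lookup emulates the kernel's LPM trie: among the entries whose prefix
// matches the input, return the one with the longest prefix.
func lookup(entries []entry, input []byte) string {
	best, bestLen := "", -1
	for _, e := range entries {
		if e.prefixLen > bestLen && matches(e, input) {
			best, bestLen = e.label, e.prefixLen
		}
	}
	return best
}

func main() {
	const tcp = 0x01
	barKey, barLen := encode(tcp, 80, "127.0.0.0", 24) // tcp 127.0.0.0/24 80
	fooKey, fooLen := encode(tcp, 80, "127.0.0.1", 32) // tcp 127.0.0.1/32 80
	trie := []entry{{barKey, barLen, "bar"}, {fooKey, fooLen, "foo"}}

	pkt1, _ := encode(tcp, 80, "127.0.0.1", 32)   // TCP packet to 127.0.0.1:80
	pkt2, _ := encode(tcp, 80, "127.0.0.255", 32) // TCP packet to 127.0.0.255:80
	fmt.Println(lookup(trie, pkt1)) // foo
	fmt.Println(lookup(trie, pkt2)) // bar
}
```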
    <div>
      <h2>Observability with minimal privileges</h2>
      <a href="#observability-with-minimal-privileges">
        
      </a>
    </div>
    <p>Linux has the powerful ss tool (part of iproute2) available to inspect socket state:</p>
            <pre><code>$ ss -tl src 127.0.0.1
State      Recv-Q      Send-Q           Local Address:Port           Peer Address:Port
LISTEN     0           128                  127.0.0.1:ipp                 0.0.0.0:*</code></pre>
            <p>With tubular in the picture this output is no longer accurate. <code>tubectl bindings</code> makes up for this shortcoming:</p>
            <pre><code>$ sudo tubectl bindings tcp 127.0.0.1
Bindings:
 protocol       prefix port label
      tcp 127.0.0.1/32   80   foo</code></pre>
            <p>Running this command requires super-user privileges, despite in theory being safe for any user to run. While this is acceptable for casual inspection by a human operator, it's a dealbreaker for observability via pull-based monitoring systems like Prometheus. The usual approach is to expose metrics via an HTTP server, which would have to run with elevated privileges and be accessible to the Prometheus server somehow. Instead, BPF gives us the tools to enable read-only access to tubular state with minimal privileges.</p><p>The key is to carefully set file ownership and mode for state in /sys/fs/bpf. Creating and opening files in /sys/fs/bpf uses <a href="https://www.kernel.org/doc/html/latest/userspace-api/ebpf/syscall.html#bpf-subcommand-reference">BPF_OBJ_PIN and BPF_OBJ_GET</a>. Calling BPF_OBJ_GET with BPF_F_RDONLY is roughly equivalent to open(O_RDONLY) and allows accessing state in a read-only fashion, provided the file permissions are correct. tubular gives the owner full access but restricts read-only access to the group:</p>
            <pre><code>$ sudo ls -l /sys/fs/bpf/4026532024_dispatcher | head -n 3
total 0
-rw-r----- 1 root root 0 Feb  2 13:19 bindings
-rw-r----- 1 root root 0 Feb  2 13:19 destination_metrics</code></pre>
            <p>It's easy to choose which user and group should own state when loading tubular:</p>
            <pre><code>$ sudo -u root -g tubular tubectl load
created dispatcher in /sys/fs/bpf/4026532024_dispatcher
loaded dispatcher into /proc/self/ns/net
$ sudo ls -l /sys/fs/bpf/4026532024_dispatcher | head -n 3
total 0
-rw-r----- 1 root tubular 0 Feb  2 13:42 bindings
-rw-r----- 1 root tubular 0 Feb  2 13:42 destination_metrics</code></pre>
            <p>There is one more obstacle: <a href="https://github.com/systemd/systemd/blob/b049b48c4b6e60c3cbec9d2884f90fd4e7013219/src/shared/mount-setup.c#L111-L112">systemd mounts /sys/fs/bpf</a> in a way that makes it inaccessible to anyone but root. Adding the executable bit to the directory fixes this.</p>
            <pre><code>$ sudo chmod -v o+x /sys/fs/bpf
mode of '/sys/fs/bpf' changed from 0700 (rwx------) to 0701 (rwx-----x)</code></pre>
            <p>Finally, we can export metrics without privileges:</p>
            <pre><code>$ sudo -u nobody -g tubular tubectl metrics 127.0.0.1 8080
Listening on 127.0.0.1:8080
^C</code></pre>
            <p>There is a caveat, unfortunately: truly unprivileged access requires unprivileged BPF to be enabled. Many distros have taken to disabling it via the unprivileged_bpf_disabled sysctl, in which case scraping metrics does require CAP_BPF.</p>
    <div>
      <h2>Safe releases</h2>
      <a href="#safe-releases">
        
      </a>
    </div>
    <p>tubular is distributed as a single binary, but really consists of two pieces of code with widely differing lifetimes. The BPF program is loaded into the kernel once and then may be active for weeks or months, until it is explicitly replaced. In fact, a reference to the program (and link, see below) is persisted into /sys/fs/bpf:</p>
            <pre><code>/sys/fs/bpf/4026532024_dispatcher
├── link
├── program
└── ...</code></pre>
            <p>The user space code is executed for seconds at a time and is replaced whenever the binary on disk changes. This means that user space has to be able to deal with an "old" BPF program in the kernel somehow. The simplest way to achieve this is to compare what is loaded into the kernel with the BPF shipped as part of tubectl. If the two don't match we return an error:</p>
            <pre><code>$ sudo tubectl bind foo tcp 127.0.0.1 80
Error: bind: can't open dispatcher: loaded program #158 has differing tag: "938c70b5a8956ff2" doesn't match "e007bfbbf37171f0"</code></pre>
            <p><code>tag</code> is the truncated hash of the instructions making up a BPF program, which the kernel makes available for every loaded program:</p>
            <pre><code>$ sudo bpftool prog list id 158
158: sk_lookup  name dispatcher  tag 938c70b5a8956ff2
...</code></pre>
            <p>By comparing the tag, tubular asserts that it is dealing with a supported version of the BPF program. Of course, just returning an error isn't enough. There needs to be a way to update the kernel program so that it's once again safe to make changes. This is where the persisted link in /sys/fs/bpf comes into play. <code>bpf_links</code> are used to attach programs to various BPF hooks. "Enabling" a BPF program is a two-step process: first, load the BPF program; next, attach it to a hook using a bpf_link. Afterwards the program will execute the next time the hook fires. By updating the link we can change the program on the fly, in an atomic manner.</p>
            <pre><code>$ sudo tubectl upgrade
Upgraded dispatcher to 2022.1.0-dev, program ID #159
$ sudo bpftool prog list id 159
159: sk_lookup  name dispatcher  tag e007bfbbf37171f0
…
$ sudo tubectl bind foo tcp 127.0.0.1 80
bound foo#tcp:[127.0.0.1/32]:80</code></pre>
            <p>Behind the scenes the upgrade procedure is slightly more complicated, since we have to update the pinned program reference in addition to the link. We pin the new program into /sys/fs/bpf:</p>
            <pre><code>/sys/fs/bpf/4026532024_dispatcher
├── link
├── program
├── program-upgrade
└── ...</code></pre>
            <p>Once the link is updated we <a href="https://www.man7.org/linux/man-pages/man2/rename.2.html">atomically rename</a> program-upgrade to replace program. In the future we may be able to <a href="https://lkml.kernel.org/netdev/20211028094724.59043-5-lmb@cloudflare.com/t/">use RENAME_EXCHANGE</a> to make upgrades even safer.</p>
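<p>The rename step is easy to sketch with a temp directory standing in for the pinned dispatcher directory: <code>rename()</code> within one filesystem is atomic, so a reader sees either the old pin or the new one, never a missing file. This is an illustration, not tubular's code:</p>

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// demo writes an "old" pin and an upgrade candidate, then atomically
// swaps them with rename(), just like replacing program-upgrade over
// program in /sys/fs/bpf. It returns the pin's contents after the swap.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "dispatcher")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	program := filepath.Join(dir, "program")
	upgrade := filepath.Join(dir, "program-upgrade")
	if err := os.WriteFile(program, []byte("old"), 0o640); err != nil {
		return "", err
	}
	if err := os.WriteFile(upgrade, []byte("new"), 0o640); err != nil {
		return "", err
	}

	// Atomically replace program with program-upgrade.
	if err := os.Rename(upgrade, program); err != nil {
		return "", err
	}
	data, err := os.ReadFile(program)
	return string(data), err
}

func main() {
	contents, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println(contents) // new
}
```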
    <div>
      <h2>Preventing state corruption</h2>
      <a href="#preventing-state-corruption">
        
      </a>
    </div>
    <p>So far we’ve completely neglected the fact that multiple invocations of <code>tubectl</code> could modify the state in /sys/fs/bpf at the same time. It’s very hard to reason about what would happen in this case, so in general it’s best to prevent this from ever occurring. A common solution to this is <a href="https://gavv.github.io/articles/file-locks/#differing-features">advisory file locks</a>. Unfortunately it seems like BPF maps don't support locking.</p>
            <pre><code>$ sudo flock /sys/fs/bpf/4026532024_dispatcher/bindings echo works!
flock: cannot open lock file /sys/fs/bpf/4026532024_dispatcher/bindings: Input/output error</code></pre>
            <p>This led to a bit of head scratching on our part. Luckily it is possible to flock the directory instead of individual maps:</p>
            <pre><code>$ sudo flock --exclusive /sys/fs/bpf/foo echo works!
works!</code></pre>
            <p>Each <code>tubectl</code> invocation likewise invokes <a href="https://www.man7.org/linux/man-pages//man2/flock.2.html"><code>flock()</code></a>, thereby guaranteeing that only a single process is ever making changes.</p>
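<p>The flock-the-directory trick is easy to demonstrate from Go with only the standard library. The sketch below locks a temp directory (locking /sys/fs/bpf itself requires privileges) and shows that a second, non-blocking attempt fails while the lock is held:</p>

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// demo takes an exclusive flock() on a directory, then checks that a
// second open file description (think: another tubectl invocation)
// cannot acquire the lock without blocking.
func demo() (conflict bool, err error) {
	dir, err := os.MkdirTemp("", "tubular")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(dir)

	holder, err := os.Open(dir) // directories can be flock()ed
	if err != nil {
		return false, err
	}
	defer holder.Close()
	if err := syscall.Flock(int(holder.Fd()), syscall.LOCK_EX); err != nil {
		return false, err
	}

	// A separate open file description contends for the same lock.
	contender, err := os.Open(dir)
	if err != nil {
		return false, err
	}
	defer contender.Close()
	err = syscall.Flock(int(contender.Fd()), syscall.LOCK_EX|syscall.LOCK_NB)
	return err == syscall.EWOULDBLOCK, nil
}

func main() {
	conflict, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println(conflict) // true: the second locker would have to wait
}
```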
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>tubular is in production at Cloudflare today and has simplified the deployment of <a href="https://www.cloudflare.com/products/cloudflare-spectrum/">Spectrum</a> and our <a href="https://www.cloudflare.com/dns/">authoritative DNS</a>. It allowed us to leave behind limitations of the BSD socket API. However, its most powerful feature is that <a href="https://research.cloudflare.com/publications/Fayed2021/">the addresses a service is available on can be changed on the fly</a>. In fact, we have built tooling that automates this process across our global network. Need to listen on another million IPs on thousands of machines? No problem, it’s just an HTTP POST away.</p><p><i>Interested in working on tubular and our L4 load balancer</i> <a href="/unimog-cloudflares-edge-load-balancer/"><i>unimog</i></a><i>? We are</i> <a href="https://boards.greenhouse.io/cloudflare/jobs/3232234?gh_jid=3232234"><i>hiring in our European offices</i></a><i>.</i></p> ]]></content:encoded>
            <category><![CDATA[eBPF]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Go]]></category>
            <guid isPermaLink="false">7ofIShaWHxqlp4ZmHyNRs</guid>
            <dc:creator>Lorenz Bauer</dc:creator>
        </item>
        <item>
            <title><![CDATA[Pairings in CIRCL]]></title>
            <link>https://blog.cloudflare.com/circl-pairings-update/</link>
            <pubDate>Wed, 13 Oct 2021 12:59:30 GMT</pubDate>
            <description><![CDATA[ Our Go cryptographic library CIRCL announces support for pairing-based cryptography. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>In 2019, we announced the release of <a href="https://github.com/cloudflare/circl/">CIRCL</a>, an open-source cryptographic library written in Go that provides optimized implementations of several primitives for key exchange and digital signatures. We are pleased to announce a major update of our library: we have included more packages for elliptic curve-based cryptography (ECC), pairing-based cryptography, and quantum-resistant algorithms.</p><p>All of these packages are the foundation of work we’re doing on bringing the benefits of cutting edge research to Cloudflare. In the past we’ve <a href="/the-tls-post-quantum-experiment/">experimented with post-quantum algorithms</a>, used pairings to keep <a href="/geo-key-manager-how-it-works/">keys safe around the world</a>, and implemented <a href="/introducing-circl/">advanced elliptic curves</a>. Now we’re continuing that work, and sharing the foundation with everyone.</p><p>In this blog post we’re going to focus on pairing-based cryptography and give you a brief overview of some properties that make this topic so pleasant. If you are not so familiar with elliptic curves, we recommend this <a href="/a-relatively-easy-to-understand-primer-on-elliptic-curve-cryptography/">primer on ECC</a>.</p><p>Otherwise, let’s get ready, pairings have arrived!</p><p>What are pairings?</p><hr /><p>Elliptic curve cryptography enables an efficient instantiation of several cryptographic applications: public-key encryption, signatures, zero-knowledge proofs, and many other more exotic applications like <a href="https://en.wikipedia.org/wiki/Oblivious_transfer">oblivious transfer</a> and <a href="https://datatracker.ietf.org/doc/draft-irtf-cfrg-voprf/">OPRF</a>s. With all of those applications you might wonder what is the additional value that pairings offer? 
To see that, we first need to understand the basic properties of an elliptic curve system, and from there we can highlight the gap that pairings fill.</p><p>Conventional elliptic curve systems work with a single group \( \mathbb{G} \): the points of an elliptic curve \(E\). In this group, usually denoted additively, we can add the points \(P\) and \(Q\) and get another point on the curve \(R=P+Q\); also, we can multiply a point \(P\) by an integer scalar \(k\) by repeatedly adding: $$ kP = \underbrace{P+P+\dots+P}_{k \text{ terms}} $$This operation is known as scalar multiplication, which resembles exponentiation, and there are efficient algorithms for it. But given the points \(Q=kP\) and \(P\), it is very hard for an adversary that doesn’t know \(k\) to recover it. This is the Elliptic Curve Discrete Logarithm problem (ECDLP).</p><p>Now we show a property of scalar multiplication that can help us to understand the properties of pairings.</p><p><b><i>Scalar Multiplication is a Linear Map</i></b></p><p>Note the following equivalences:</p><p>\( (a+b)P = aP + bP \)</p><p>\( b (aP) = a (bP) \).</p><p>These are very useful properties for many protocols: for example, the last identity allows Alice and Bob to arrive at the same value when following the Diffie-Hellman key-agreement protocol.</p><p>But while point addition and scalar multiplication are nice, it’s also useful to be able to multiply points: if we had a point \(P\) and \(aP\) and \(bP\), getting \(abP\) out would be very cool and let us do all sorts of things. Unfortunately Diffie-Hellman would immediately be insecure, so we can’t get what we want.</p><p>Guess what? 
Pairings provide an efficient, useful <i>sort of intermediary</i> point multiplication.</p><p>It’s intermediate multiplication because although the operation takes two points as operands, the result of a pairing is not a point, but an element of a different group; thus, in a pairing there are more groups involved and all of them must contain the same number of elements.</p><p>Pairing is defined as  $$ e \colon\; \mathbb{G}_1 \times \mathbb{G}_2 \rightarrow \mathbb{G}_T $$Groups \(\mathbb{G}_1\) and \(\mathbb{G}_2\) contain points of an elliptic curve \(E\). More specifically, they are the \(r\)-torsion points, for a fixed prime \(r\). Some pairing instances fix \(\mathbb{G}_1=\mathbb{G}_2\), but it is common to use disjoint sets for efficiency reasons. The third group \(\mathbb{G}_T\) has notable differences. First, it is written multiplicatively, unlike the other two groups. \(\mathbb{G}_T\) is not the set of points on an elliptic curve. It’s instead a subgroup of the multiplicative group over some larger finite field. It contains the elements that satisfy \(x^r=1\), better known as the \(r\)-roots of unity.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/nmjAHa9jrnE1lnmilM8rR/f0c5e06aaadf4fbd276a970e5426eaef/2ai5WUAviNg3iAxw9x-IMAv3mVWnrMrDOgvJEIHsKqqNNn3OzQdeDJd6kMo578rqUTqoCPdQcFz6rcrTiXlyfdxL_iDLQ9v6zvswJiz5ini3lLgSMfHbFIpIEwS7.png" />
            
            </figure><p>Source: “<i>Pairings are not dead, just resting</i>” by Diego Aranha <a href="https://ecc2017.cs.ru.nl/slides/ecc2017-aranha.pdf">ECC-2017</a> (inspired by Avanzi’s talk at <a href="https://www.hyperelliptic.org/SPEED/slides09/avanzi.pdf">SPEED-2009</a>).</p><p>While every elliptic curve has a pairing, very few have ones that are efficiently computable. Those that do, we call them <i>pairing-friendly curves</i>.</p><p><b>The Pairing Operation is a Bilinear Map</b></p><p>What makes pairings special is that \(e\) is a <i>bilinear map</i>. Yes, the linear property of the scalar multiplication is present twice, one per group. Let’s see the first linear map.</p><p>For points \(P, Q, R\) and scalars \(a\) and \(b\) we have:</p><p>\( e(P+Q, R) = e(P, R) * e(Q, R) \)</p><p>\( e(aP, Q) = e(P, Q)^a \).</p><p>So, a scalar \(a\) acting in the first operand as \(aP\), finds its way out and escapes from the input of the pairing and appears in the output of the pairing as an exponent in \(\mathbb{G}_T\). The same linear map is observed for the second group:</p><p>\( e(P, Q+R) = e(P, Q) * e(P, R) \)</p><p>\( e(P, bQ) = e(P, Q)^b \).</p><p>Hence, the pairing is bilinear. We will see below how this property becomes useful.</p><p><b>Can bilinear pairings help solving ECDLP?</b></p><p>The <a href="https://www.dima.unige.it/~morafe/MaterialeCTC/p80-menezes.pdf">MOV</a> (by Menezes, Okamoto, and Vanstone) attack reduces the discrete logarithm problem on elliptic curves to finite fields. An attacker with knowledge of \(kP\) and public points \(P\) and \(Q\) can recover \(k\) by computing:</p><p>\( g = e(P, Q) \),</p><p>\( g_k = e(kP, Q) = e(P, Q)^k \),</p><p>\( k = \log_g(g_k) = \log_g(g^k) \).</p><p>Note that the discrete logarithm to be solved was moved from \(\mathbb{G}_1\) to \(\mathbb{G}_T\). 
The attack only pays off if the discrete logarithm is easier to solve in \(\mathbb{G}_T\), and surprisingly, for some curves this is the case.</p><p>Fortunately, pairings do not present a threat for standard curves (such as the NIST curves or Curve25519) because these curves are constructed in such a way that \(\mathbb{G}_T\) gets very large, which makes the pairing operation no longer efficient.</p><p>This attack strategy was one of the first applications of pairings in cryptanalysis, as a tool to solve the discrete logarithm. Later, people noticed that the properties of pairings are so useful that they can also be used constructively to do cryptography. One of the fascinating truisms of cryptography is that one person's sledgehammer is another person's brick: while pairings yield a generic attack strategy for the ECDLP, they can also be used as a building block in a ton of useful applications.</p>
    <div>
      <h3>Applications of Pairings</h3>
      <a href="#applications-of-pairings">
        
      </a>
    </div>
    <p>During the 2000s, a wave of research applied pairings to many practical problems. An iconic pairing-based system was created by Antoine Joux, who constructed a <a href="https://link.springer.com/chapter/10.1007/10722028_23">one-round Diffie-Hellman</a> key exchange for three parties.</p><p>Let’s first see how a three-party Diffie-Hellman is done without pairings. Alice, Bob and Charlie want to agree on a shared key, so they compute, respectively, \(aP\), \(bP\) and \(cP\) for a public point \(P\). Then, they send each other the points they computed. So Alice receives \(cP\) from Charlie and sends \(aP\) to Bob, who can then send \(baP\) to Charlie and get \(acP\) from Alice, and so on. After all this is done, they can all compute \(k=abcP\). Can this be performed in a single round trip?</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1pGrpJQPbIktFdaxQz81fd/a0bcccab1b5efa958bb221de5a220f43/vnfAPEoM5vefRaP6wysBktjTyxvh4sxXzYY3HUz3ErengPo01exd3LDLZ38SQi_lYTKoJZQEPuvGYaLBE7Kg7ExNErj8Yu5k06klm8cQoXdwgtOZkDR3umqZWa6o.png" />
            
            </figure><p>Two round Diffie-Hellman without pairings.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/hIYImdhvrm8dbikA9HnOv/9b5970d153b0c0146f844b310330609e/n8lyvbn43kHVis2z6mlUO5Msq5jVy2OixIA1xoZ9H3S0XZIczsdmnipD71gvvRDF6hZG10EWY0V803za-sihFWFMf32EF_0rRAlaA_zVXivOv0zYH1KOeMW4qZun.png" />
            
            </figure><p>One round Diffie-Hellman with pairings.</p><p>The 3-party Diffie-Hellman protocol needs two communication rounds (on the top), but with the use of pairings a one-round trip protocol is possible.</p><p>Affirmative! Antoine Joux <a href="https://doi.org/10.1007/s00145-004-0312-y">showed how</a> to agree on a shared secret in a single round of communication. Alice announces \(aP\), gets \(bP\) and \(cP\) from Bob and Charlie respectively, and then computes \(k = e(bP, cP)^a\). Likewise Bob computes \(e(aP,cP)^b\) and Charlie does \(e(aP,bP)^c\). It’s not difficult to convince yourself that all these values are equivalent, just by looking at the bilinear property.</p><p>\( e(bP,cP)^a  = e(aP,cP)^b  = e(aP,bP)^c = e(P,P)^{abc}\)</p><p>With pairings we’ve done in one round what would otherwise take two.</p><p>Another application in cryptography addresses a problem posed by Shamir in 1984: does there exist an encryption scheme in which the public key is an arbitrary string? Imagine if your public key were your email address. It would be easy to remember, and certificate authorities and certificate management would be unnecessary.</p><p>A solution to this problem came some years later, in 2001, and is the Identity-based Encryption (IBE) scheme proposed by <a href="https://crypto.stanford.edu/~dabo/papers/bfibe.pdf">Boneh and Franklin</a>, which uses bilinear pairings as the main tool.</p><p>Nowadays, pairings are used for the <a href="https://eprint.iacr.org/2012/215">zk-SNARKs</a> that make Zcash an anonymous currency, and are also used in <a href="https://developers.cloudflare.com/randomness-beacon/about">drand</a> to generate publicly verifiable randomness. Pairings and the compact, aggregatable BLS signatures are used in Ethereum. 
We have used pairings to build Geo Key Manager: pairings let us implement a compact broadcast and negative broadcast scheme that together make <a href="/geo-key-manager-how-it-works/">Geo Key Manager</a> work.</p><p>In order to make these schemes, we have to implement pairings, and to do that we need to understand the mathematics behind them.</p>
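<p>The bilinear bookkeeping of Joux's protocol is easy to check numerically with a toy map. The sketch below is not a real pairing and offers no security whatsoever (a "point" \(aP\) is represented by the bare scalar \(a\)): it merely instantiates \(e(aP, bP) = g^{ab}\) in the order-11 subgroup generated by \(g=2\) modulo 23 and verifies that all three parties derive the same key.</p>

```go
package main

import "fmt"

const (
	p = 23 // small prime field
	r = 11 // order of the subgroup generated by g; r divides p-1
	g = 2  // generator of the order-r subgroup mod p
)

// pow computes base^exp mod p.
func pow(base, exp int) int {
	result := 1
	for i := 0; i < exp; i++ {
		result = result * base % p
	}
	return result
}

// e is the toy bilinear map e(aP, bP) = g^{ab}. A "point" aP is just
// the scalar a mod r, which is exactly why this toy has no security,
// only the right algebra.
func e(a, b int) int {
	return pow(g, a*b%r)
}

func main() {
	a, b, c := 3, 5, 7    // Alice's, Bob's and Charlie's secret scalars
	kA := pow(e(b, c), a) // Alice:   e(bP, cP)^a
	kB := pow(e(a, c), b) // Bob:     e(aP, cP)^b
	kC := pow(e(a, b), c) // Charlie: e(aP, bP)^c
	fmt.Println(kA, kB, kC) // all three equal g^{abc} mod p
}
```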
    <div>
      <h3>Where do pairings come from?</h3>
      <a href="#where-do-pairings-come-from">
        
      </a>
    </div>
    <p>In order to deeply understand pairings we must understand the interplay of geometry and arithmetic, and the origins of the group law. The starting point is the concept of a <i>divisor</i>, a formal combination of points on the curve.</p><p>\( D = \sum n_i P_i \)</p><p>The sum of all the coefficients \(n_i\) is the degree of the divisor. If we have a function on the curve that has poles and zeros, we can count them with multiplicity to get a <i>principal divisor</i>. Poles are counted as negative, zeros as positive. For example, if we take the projective line, the function \(x\) has the divisor \((0)-(\infty)\).</p><p>All principal divisors have degree equal to \(0\). The group of degree zero divisors modulo the principal divisors is the Jacobian. This means that we take all the degree zero divisors, and freely add or subtract principal divisors, constructing an abelian variety called the Jacobian.</p><p>Until now our constructions have worked for any curve. Elliptic curves have a special property: since a line intersects the curve in three points, it’s always possible to turn an element of the Jacobian into one of the form \((P)-(O)\) for a point \(P\). This is where the addition law of elliptic curves comes from.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4eisQYqRXhDNLTdMbiFZRO/0db5b48f5cd06ea377cd944c5120036a/DZUPeC6dl9fx4W7ggusFvWzN6-zjcQAroZZAcu2BsDWuhOd1j9_YjfjyCpCI2jOwLi9IoCDgKSNEyb4Zc3q7cp5wVM6_gyNMBDHfttqQ_cTxedlMzBLS8EOb2miG.png" />
            
            </figure><p>The relation between addition on an elliptic curve and the geometry of the curve. Source file <a href="https://commons.wikimedia.org/wiki/File:ECClines.svg">ECClines.svg</a></p><p>Given a function \(f\) we can evaluate it on a divisor \(D=\sum n_i P_i\) by taking the product \(\prod f(P_i)^{n_i}\). And if two functions \(f\) and \(g\) have disjoint divisors, we have the <i>Weil duality</i>:</p><p>\( f(\text{div}(g)) = g(\text{div}(f)) \),</p><p>The existence of Weil duality is what gives us the bilinear map we seek. Given an \(r\)-torsion point \(T\) we have a function \(f\) whose divisor is \(r(T)-r(O)\). We can write down an auxiliary function \(g\) such that \(f(rP)=g^r(P)\) for any \(P\). We then get a pairing by taking:</p><p>\(e_r(S,T)=\frac{g(X+S)}{g(X)}\).</p><p>The auxiliary point \(X\) is any point that makes the numerator and denominator defined.</p><p>In practice, the pairing we have defined above, the Weil pairing, is little used. It was historically the first pairing and is extremely important in the mathematics of elliptic curves, but faster alternatives based on more complicated underlying mathematics are used today. These faster pairings have different \(\mathbb{G}_1\) and \(\mathbb{G}_2\), while the Weil pairing made them the same.</p>
    <div>
      <h3>Shift in parameters</h3>
      <a href="#shift-in-parameters">
        
      </a>
    </div>
    <p>As we saw earlier, the discrete logarithm problem can be attacked either on the group of elliptic curve points or on the third group (an extension of a prime field), whichever is weaker. This is why the parameters that define a pairing must balance the security of the three groups.</p><p>Before 2015 a good balance between the extension degree, the size of the prime field, and the security of the scheme was achieved by the family of <a href="https://eprint.iacr.org/2005/133.pdf">Barreto-Naehrig</a> (BN) curves. For 128 bits of security, BN curves use an extension of degree 12, and have a prime of size 256 bits; as a result they are an efficient choice for implementation.</p><p>A breakthrough for pairings occurred in 2015 when Kim and Barbulescu <a href="https://ia.cr/2015/1027">published a result</a> that accelerated the attacks in finite fields. This resulted in increasing the size of fields to comply with standard security levels. Just as short hashes like MD5 were deprecated as they became insecure and \(2^{64}\) was no longer enough, and RSA-1024 was replaced with RSA-2048, we regularly change parameters to deal with improved attacks.</p><p>For pairings this implied the use of larger primes for all the groups. Roughly speaking, the previous 192-bit security level becomes the new 128-bit level after this attack. Also, this shift in parameters brings the family of <a href="https://eprint.iacr.org/2002/088.pdf">Barreto-Lynn-Scott</a> (BLS) curves to the stage because pairings on BLS curves are faster than on BN curves in this new setting. Hence, currently BLS curves using an extension of degree 12 and primes of around 384 bits provide the equivalent of 128-bit security.</p><p>The IETF draft (currently in preparation) <a href="https://datatracker.ietf.org/doc/draft-irtf-cfrg-pairing-friendly-curves/">draft-irtf-cfrg-pairing-friendly-curves</a> specifies secure pairing-friendly elliptic curves. It includes parameters for BN and BLS families of curves. 
It also targets different security levels to provide crypto agility for some applications relying on pairing-based cryptography.</p>
    <div>
      <h2>Implementing Pairings in Go</h2>
      <a href="#implementing-pairings-in-go">
        
      </a>
    </div>
    <p>Historically, notable examples of software libraries implementing pairings include <a href="https://crypto.stanford.edu/pbc/">PBC</a> by Ben Lynn, <a href="https://github.com/miracl/core">Miracl</a> by Michael Scott, and <a href="https://github.com/relic-toolkit/relic">Relic</a> by Diego Aranha. All of them are written in C/C++ and some ports and wrappers to other languages exist.</p><p>Among Go’s supplementary crypto libraries we can find the <a href="https://pkg.go.dev/golang.org/x/crypto/bn256">golang.org/x/crypto/bn256</a> package by Adam Langley, an implementation of a pairing using a BN curve with a 256-bit prime. Our colleague Brendan McMillion built <a href="https://github.com/cloudflare/bn256">github.com/cloudflare/bn256</a>, which dramatically improves the speed of pairing operations for that curve. See the <a href="https://youtu.be/eHqIq1kFJJo?t=791">RWC-2018</a> talk for our use case of pairing-based cryptography. This time we wanted to go one step further, so we started looking at alternative pairing implementations.</p><p>Although one can find many libraries that implement pairings, our goal was to rely on one that is efficient, includes protection against side channels, and exposes a flexible API oriented towards applications, permitting generality while avoiding common security pitfalls. This motivated us to include pairings in <a href="https://github.com/cloudflare/circl/tree/master/ecc/bls12381">CIRCL</a>. We followed best practices on secure code development and we want to share with you some details about the implementation.</p><p>We started by choosing a pairing-friendly curve. Due to the attack previously mentioned, the BN256 curve does not meet the 128-bit security level, so there is a need for stronger curves. One such curve is <a href="https://electriccoin.co/blog/new-snark-curve/">BLS12-381</a>, which is widely used in zk-SNARK protocols and short signature schemes. 
Using this curve allows us to make our Go pairing implementation interoperable with other implementations available in other programming languages, so other projects can benefit from using CIRCL too.</p><p>This code snippet tests the linearity property of pairings and shows how easy it is to use our library.</p><div><pre><span>import</span> (
    <span>"crypto/rand"</span>
    <span>"fmt"</span>
    e <span>"github.com/cloudflare/circl/ecc/bls12381"</span>
)

<span>func</span> ExamplePairing() {
    P,  Q <span>:=</span> e.G1Generator(), e.G2Generator()
    a,  b <span>:=</span> <span>new</span>(e.Scalar), <span>new</span>(e.Scalar)
    aP, bQ <span>:=</span> <span>new</span>(e.G1), <span>new</span>(e.G2)
    ea, eb <span>:=</span> <span>new</span>(e.Gt), <span>new</span>(e.Gt)

    a.Random(rand.Reader)
    b.Random(rand.Reader)

    aP.ScalarMult(a, P)
    bQ.ScalarMult(b, Q)

    g  <span>:=</span> e.Pair( P, Q)
    ga <span>:=</span> e.Pair(aP, Q)
    gb <span>:=</span> e.Pair( P,bQ)

    ea.Exp(g, a)
    eb.Exp(g, b)
    linearLeft <span>:=</span> ea.IsEqual(ga) <span>// e(P,Q)^a == e(aP,Q)</span>
    linearRight<span>:=</span> eb.IsEqual(gb) <span>// e(P,Q)^b == e(P,bQ)</span>

    fmt.Print(linearLeft <span>&amp;&amp;</span> linearRight)
    <span>// Output: true</span>
}
</pre></div>
<p>We applied several optimizations to improve the performance and security of the implementation. In fact, because the parameters of the curve are fixed, some optimizations become easier to apply; for example, in the code for prime field arithmetic and in the towering construction for extension fields, as we detail next.</p>
    <div>
      <h3>Formally-verified arithmetic using fiat-crypto</h3>
      <a href="#formally-verified-arithmetic-using-fiat-crypto">
        
      </a>
    </div>
    <p>One of the more difficult parts of a cryptography library to implement correctly is the prime field arithmetic. Typically people specialize it for speed, but there are many tedious constraints on what the inputs to operations can be to ensure correctness. Vulnerabilities have happened when people get it wrong, across many libraries. However, this code is perfect for machines to write and check. One such tool is <a href="https://github.com/mit-plv/fiat-crypto">fiat-crypto</a>.</p><p>Using fiat-crypto to generate the prime field arithmetic means that we have a formal verification that the code does what we need. The fiat-crypto tool is <a href="https://github.com/cloudflare/circl/blob/5115a7384c00c16f684872cd8019e82a7b385f30/ecc/bls12381/ff/gen.go">invoked in this script</a>, and produces Go code for addition, subtraction, multiplication, and squaring over the 381-bit prime field used in the BLS12-381 curve. Other operations are not covered, but those are much easier to check and analyze by hand.</p><p>Another advantage is that it avoids relying on the generic <a href="https://golang.org/pkg/math/big/#Int">big.Int</a> package, which is slow, frequently spills variables to the heap causing dynamic memory allocations, and most importantly, does not run in constant-time. Instead, the code produced is straight-line code, no branches at all, and relies on the <a href="https://golang.org/pkg/math/bits">math/bits</a> package for accessing machine-level instructions. Automated code generation also means that it’s easier to apply new techniques to all primitives.</p>
    <div>
      <h3>Tower Field Arithmetic</h3>
      <a href="#tower-field-arithmetic">
        
      </a>
    </div>
    <p>In addition to prime field arithmetic, say integers modulo a prime number p, a pairing also requires high-level arithmetic operations over extension fields.</p><p>To better understand what an extension field is, think of the analogous case of going from the reals to the complex numbers: the operations, addition, multiplication and division \((+, -, \times, /)\), are denoted as usual; however, they are computed quite differently.</p><p>The complex numbers are a quadratic extension over the reals, so imagine a two-level house. The first floor is where the real numbers live; however, they cannot access the second floor by themselves. On the other hand, the complex numbers can access the entire house through the use of a staircase. The equation \(f(x)=x^2+1\) was not solvable over the reals, but is solvable over the complex numbers, since they have the number \(i\), which satisfies \(i^2=-1\). And because they have the number \(i\), they also have numbers like \(3i\) and \(5+i\) that solve other equations that weren’t solvable over the reals either. This second story has given the roots of the polynomials a place to live.</p><p>Algebraically we can view the complex numbers as \(\mathbb{R}[x]/(x^2+1)\), the space of polynomials where we consider \(x^2=-1\). Given a polynomial like \(x^3+5x+1\), we can turn it into \(4x+1\), which is another way of writing \(1+4i\). In this new field \(x^2+1=0\) holds automatically, and we have added \(x\) as a root of the polynomial we picked. This process of writing down a field extension by adding a root of a polynomial works over any field, including finite fields.</p><p>Following this analogy, one can construct not a house but a tower, say a \(k=12\) floor building where the ground floor is the prime field \(\mathbb{F}_p\). The reason to build such a large tower is that we want to host VIP guests: namely a group called \(\mu_r\), <i>the</i> \(r\)<i>-roots of unity</i>. 
There are exactly \(r\) members and they behave as an (algebraic) group, i.e. for all \(x,y \in \mu_r\), it follows that \(x*y \in \mu_r\) and \(x^r = 1\).</p><p>One particularity is that performing operations on the top floor can be costly. Assume that whenever an operation is needed on the top floor, members who live on the ground floor are required to provide their assistance, so an operation on the top floor needs to go all the way down to the ground floor. For this reason, our tower needs more than a staircase: it needs an efficient way to move between the levels, something even better than an elevator.</p><p>What if we use portals? Imagine anyone on the twelfth floor using a portal to go down immediately to the sixth floor, then another one connecting the sixth to the second floor, and finally another portal connecting to the ground floor. Thus, one can reach the ground floor much faster than by descending a long staircase.</p><p>The building analogy can be used to understand and construct a tower of finite fields. We use only some extensions to build a twelfth extension from the (ground) prime field \(\mathbb{F}_p\).</p><p>\(\mathbb{F}_{p}\) ⇒  \(\mathbb{F}_{p^2}\) ⇒  \(\mathbb{F}_{p^6}\) ⇒  \(\mathbb{F}_{p^{12}}\)</p><p>In fact, the extension of finite fields is as follows:</p><ul><li><p>\(\mathbb{F}_{p^2}\) is built as polynomials in \(\mathbb{F}_p[u]\) reduced modulo \(u^2+1=0\).</p></li><li><p>\(\mathbb{F}_{p^6}\) is built as polynomials in \(\mathbb{F}_{p^2}[v]\) reduced modulo \(v^3+u+1=0\).</p></li><li><p>\(\mathbb{F}_{p^{12}}\) is built as polynomials in \(\mathbb{F}_{p^6}[w]\) reduced modulo \(w^2+v=0\), or as polynomials in \(\mathbb{F}_{p^4}[w]\) reduced modulo \(w^3+v=0\).</p></li></ul><p>The portals here are the polynomials used as moduli, as they allow us to move from one extension to the other.</p><p>Different constructions for higher extensions have an impact on the number of operations performed. 
Thus, we implemented the latter tower field for \(\mathbb{F}_{p^{12}}\) as it results in a lower number of operations. The arithmetic operations are quite easy to implement and manually verify, so at this level formal verification is not as effective as in the case of prime field arithmetic. However, having an automated tool that generates code for this arithmetic would be useful for developers not familiar with the internals of field towering. The fiat-crypto tool keeps track of this idea [<a href="https://github.com/mit-plv/fiat-crypto/issues/904">Issue 904</a>, <a href="https://github.com/mit-plv/fiat-crypto/issues/851">Issue 851</a>].</p><p>Now, we describe more details about the main core operations of a bilinear pairing.</p>
    <div>
      <h3>The Miller loop and the final exponentiation</h3>
      <a href="#the-miller-loop-and-the-final-exponentiation">
        
      </a>
    </div>
    <p>The pairing function we implemented is the <a href="https://eprint.iacr.org/2008/096">optimal r-ate</a> pairing, which is defined as:</p><p>\( e(P,Q) = f_Q(P)^{\text{exp}} \)</p><p>That is, a function \(f\) is constructed from \(Q\), then evaluated at a point \(P\), and the result is raised to a specific power. The function evaluation is performed efficiently using “the Miller loop”, an algorithm devised by <a href="https://crypto.stanford.edu/miller/miller.pdf">Victor Miller</a> that has a similar structure to a double-and-add algorithm for scalar multiplication.</p><p>The computed value \(f_Q(P)\) is an element of \(\mathbb{F}_{p^{12}}\); however, it is not yet an \(r\)-root of unity. Turning it into one is the task of the final exponentiation. Since the exponent is constant for each curve, special algorithms can be tuned for it.</p><p>One interesting acceleration opportunity presents itself: in the Miller loop the elements of \(\mathbb{F}_{p^{12}}\) that we have to multiply by are special — as polynomials, they have no linear term and their constant term lives in \(\mathbb{F}_{p^{2}}\). We created a specialized multiplication that skips the partial products whose inputs are known to be zero. This specialization accelerated the pairing computation by 12%.</p><p>So far, we have described how the internal operations of a pairing are performed. There are still other functions involved in the use of pairings in cryptographic protocols. It is also important to optimize these functions, and we will now discuss some of them.</p>
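<p>Schematically, the Miller loop has the same skeleton as double-and-add. The sketch below uses hypothetical helpers, lineDouble and lineAdd, for the line functions, whose details depend on the curve model and pairing variant:</p>

```
f ← 1;  T ← Q
for each bit b of the loop parameter, from the second-highest down to the lowest:
    f ← f² · lineDouble(T, P)      # tangent line at T, evaluated at P
    T ← 2T
    if b = 1:
        f ← f · lineAdd(T, Q, P)   # line through T and Q, evaluated at P
        T ← T + Q
return f^exp                       # the final exponentiation
```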
    <div>
      <h3>Product of Pairings</h3>
      <a href="#product-of-pairings">
        
      </a>
    </div>
    <p>Often protocols will want to evaluate a product of pairings, rather than a single pairing. This is the case if we’re evaluating multiple signatures, or if the protocol uses cancellation between different equations to ensure security, as in the <i>dual system encoding</i> approach to designing protocols. If each pairing was evaluated individually, this would require multiple evaluations of the final exponentiation. However, we can evaluate the product first, and then evaluate the final exponentiation once. This requires a different interface that can take vectors of points.</p><p>Occasionally, there is a sign or an exponent in the factors of the product. It’s very easy to deal with a sign explicitly by negating one of the input points, almost for free. General exponents are more complicated, especially when considering the need for side channel protection. But since we expose the interface, later work on the library will accelerate it without changing applications.</p><p>Regarding API exposure, one of the trickiest and most error prone aspects of software engineering is input validation. So we must check that raw binary inputs decode correctly as the points used for a pairing. Part of this verification includes subgroup membership testing which is the topic we discuss next.</p>
    <div>
      <h3>Subgroup Membership Testing</h3>
      <a href="#subgroup-membership-testing">
        
      </a>
    </div>
    <p>Checking that a point is on the curve is easy, but checking that it has the right order is not: the classical way to do this is an entire expensive scalar multiplication. But implementing pairings involves the use of many clever tricks that help to make things run faster.</p><p>One example is <i>twisting</i>: the \(\mathbb{G}_2\) group consists of points with coordinates in \(\mathbb{F}_{p^{12}}\); however, one can use a smaller field to reduce the number of operations. The trick here is using an associated curve \(E’\), a twist of the original curve \(E\), which allows us to work in the subfield \(\mathbb{F}_{p^{2}}\), where operations are cheaper.</p><p>Additionally, twisting the curve over \(\mathbb{G}_2\) carries some efficiently computable endomorphisms coming from the field structure. For the cost of two field multiplications, we can compute an additional endomorphism, dramatically decreasing the cost of scalar multiplication.</p><p>By searching for the smallest combination of scalars that could zero out the \(r\)-torsion points, <a href="https://eprint.iacr.org/2019/814">Sean Bowe</a> came up with a much more efficient way to do subgroup checks. We implement his trick, with a big reduction in the complexity of some applications.</p><p>As can be seen, implementing a pairing is full of subtleties. We just saw that point validation in the pairing setting is a bit more challenging than in the conventional case of elliptic curve cryptography. This kind of reformulation also applies to other operations that require special care in their implementation. Another example is how to encode binary strings as elements of the group \(\mathbb{G}_1\) (or \(\mathbb{G}_2\)). Although this operation might sound simple, implementing it securely needs to take several aspects into consideration; thus we expand more on this topic.</p>
    <div>
      <h3>Hash to Curve</h3>
      <a href="#hash-to-curve">
        
      </a>
    </div>
    <p>An important piece of the <a href="https://crypto.stanford.edu/~dabo/papers/bfibe.pdf">Boneh-Franklin</a> Identity-based Encryption scheme is a special hash function that maps an arbitrary string — the identity, e.g., an email address — to a point on an elliptic curve, and that still behaves as a conventional cryptographic hash function (such as SHA-256) that is hard to invert and collision-resistant. This operation is commonly known as <i>hashing to curve</i>.</p><p>Boneh and Franklin found a particular way to perform hashing to curve: apply a conventional hash function to the input bitstring, interpret the result as the \(y\)-coordinate of a point, and then from the curve equation \(y^2=x^3+b\), find the \(x\)-coordinate as \(x=\sqrt[3]{y^2-b}\). The cube root always exists in fields of characteristic \(p\equiv 2 \bmod{3}\). But this algorithm does not apply to other fields in general, which restricts the parameters that can be used.</p><p>Another popular algorithm, which we must remark is an insecure way to perform hashing to curve, is the following. Let the hash of the input be the \(x\)-coordinate, and from it find the \(y\)-coordinate by computing a square root \(y= \sqrt{x^3+b}\). Note that the square root does not exist for every \(x\)-coordinate, which means the algorithm may fail; thus, it is a probabilistic algorithm. To make sure it always works, a counter can be added to \(x\) and increased each time a square root is not found. Although this algorithm always finds a point on the curve, it runs in variable time, i.e., it is a non-constant-time algorithm. The lack of this property in implementations of cryptographic algorithms makes them susceptible to timing attacks. 
The <a href="https://wpa3.mathyvanhoef.com/">DragonBlood</a> attack is an example of how a non-constant-time hashing algorithm resulted in a full key recovery of WPA3 Wi-Fi passwords.</p><p>Secure hash to curve algorithms must guarantee several properties. It must be ensured that any input passed to the hash produces a point in the targeted group. That is, no special inputs may trigger exceptional cases, and the output point must belong to the <i>correct</i> group. We emphasize the correct group because in certain applications the target group is the entire set of points of an elliptic curve, but in other cases, such as in the pairing setting, it is a subgroup of the curve; recall that \(\mathbb{G}_1\) and \(\mathbb{G}_2\) are groups of \(r\)-torsion points. Finally, some cryptographic protocols are proven secure provided that the hash to curve function behaves as a random oracle of points. This requirement adds another level of complexity to the hash to curve function.</p><p>Fortunately, several researchers have addressed most of these problems, and there is an ongoing effort to define a concrete specification of secure algorithms for hashing to curves, extending the sort of geometric trick that worked for the Boneh-Franklin curve. We have participated in the Crypto Forum Research Group (CFRG) at the IETF on the work-in-progress Internet draft <a href="https://datatracker.ietf.org/doc/draft-irtf-cfrg-hash-to-curve/">draft-irtf-cfrg-hash-to-curve</a>. This document specifies secure hashing algorithms targeting several elliptic curves, including the BLS12-381 curve. 
At Cloudflare, we are actively collaborating in several working groups of the IETF; see Jonathan Hoyland’s <a href="/cloudflare-and-the-ietf">post</a> to learn more about it.</p><p>Our implementation complies with the recommendations given in the hash to curve draft and also includes many implementation techniques and tricks derived from a vast number of academic research articles in pairing-based cryptography. A good compilation of most of these tricks is delightfully explained by <a href="https://www.craigcostello.com.au/s/PairingsForBeginners.pdf">Craig Costello</a>.</p><p>We hope this post sheds some light on, and offers guidance for, the development of pairing-based cryptography, as it has become much more relevant these days. We will soon share an interesting use case in which pairing-based cryptography helps us to harden the security of our infrastructure.</p>
    <div>
      <h2>What’s next?</h2>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>We invite you to use our <a href="https://github.com/cloudflare/circl/">CIRCL</a> library, now equipped with bilinear pairings. But there is more: look at other primitives already available such as <a href="https://github.com/cloudflare/circl/tree/master/hpke">HPKE</a>, <a href="https://github.com/cloudflare/circl/tree/master/oprf">VOPRF</a>, and <a href="https://github.com/cloudflare/circl/tree/master/kem">Post-Quantum</a> algorithms. On our side, we will continue improving the performance and security of our library. If any of your projects uses CIRCL, let us know; we would like to hear about your use case. Reach us at <a href="https://research.cloudflare.com">research.cloudflare.com</a>.</p><p>You’ll soon hear more about how we’re using CIRCL across Cloudflare.</p> ]]></content:encoded>
            <category><![CDATA[Research]]></category>
            <category><![CDATA[Cryptography]]></category>
            <category><![CDATA[Go]]></category>
            <guid isPermaLink="false">78HsUIApwC8CnQLeWxAi07</guid>
            <dc:creator>Armando Faz-Hernández</dc:creator>
            <dc:creator>Watson Ladd</dc:creator>
        </item>
        <item>
            <title><![CDATA[Know your SCM_RIGHTS]]></title>
            <link>https://blog.cloudflare.com/know-your-scm_rights/</link>
            <pubDate>Thu, 29 Nov 2018 09:54:22 GMT</pubDate>
            <description><![CDATA[ As TLS 1.3 was ratified earlier this year, I was recollecting how we got started with it here at Cloudflare. We made the decision to be early adopters of TLS 1.3 a little over two years ago. It was a very important decision, and we took it very seriously. ]]></description>
            <content:encoded><![CDATA[ <p>As TLS 1.3 was ratified earlier <a href="/rfc-8446-aka-tls-1-3/">this year</a>, I was recollecting how we got started with it here at Cloudflare. We made the decision to be early adopters of <a href="/introducing-tls-1-3/">TLS 1.3</a> a little over two years ago. It was a very important decision, and we took it very seriously.</p><p>It is no secret that Cloudflare uses <a href="/end-of-the-road-for-cloudflare-nginx/">nginx</a> to handle user traffic. A lesser-known fact is that we have several instances of nginx running. I won’t go into detail, but there is one instance whose job is to accept connections on port 443, and proxy them to another instance of nginx that actually handles the requests. It has pretty limited functionality otherwise. We fondly call it nginx-ssl.</p><p>Back then we were using OpenSSL for TLS and crypto in nginx, but OpenSSL (and BoringSSL) had yet to announce a timeline for TLS 1.3 support, therefore we had to implement our own TLS 1.3 stack. Obviously we wanted an implementation that would not affect any customer or client that would not enable TLS 1.3. We also needed something that we could iterate on quickly, because the spec was very fluid back then, and also something that we could release frequently without worrying about the rest of the Cloudflare stack.</p><p>The obvious solution was to implement it on top of OpenSSL. The OpenSSL version we were using was 1.0.2, but not only were we looking ahead to replace it with version 1.1.0 or with <a href="/make-ssl-boring-again/">BoringSSL</a> (which we eventually did), it was also so ingrained in our stack and so fragile that we wouldn’t be able to achieve our stated goals without risking serious bugs.</p><p>Instead, Filippo Valsorda and Brendan McMillion suggested that the easier path would be to implement TLS 1.3 on top of the Go TLS library and make a Go replica of nginx-ssl (go-ssl). 
Go is very easy to iterate and prototype in, comes with a powerful standard library, and we had a great pool of Go talent to draw on, so it made a lot of sense. Thus <a href="https://github.com/cloudflare/tls-tris">tls-tris</a> was born.</p><p>The question remained: how would we have Go handle only TLS 1.3 while letting nginx handle all prior versions of TLS?</p><p>And herein lies the problem. Both TLS 1.3 and older versions of TLS communicate on port 443, and it is common knowledge that only one application can listen on a given TCP port, and that application is nginx, which would still handle the bulk of the TLS traffic. We could pipe all the TCP data into another connection in Go, effectively creating an additional proxy layer, but where is the fun in that? Also it seemed a little inefficient.</p>
    <div>
      <h2>Meet SCM_RIGHTS</h2>
      <a href="#meet-scm_rights">
        
      </a>
    </div>
    <p>So how do you make two different processes, written in two different programming languages, share the same TCP socket?</p><p>Fortunately, Linux (or rather UNIX) provides us with just the tool that we need. You can use UNIX-domain sockets to pass file descriptors between applications, and, like everything else in UNIX, connections are files. Looking at <code>man 7 unix</code> we see the following:</p>
            <pre><code>   Ancillary messages
       Ancillary  data  is  sent and received using sendmsg(2) and recvmsg(2).
       For historical reasons the ancillary message  types  listed  below  are
       specified with a SOL_SOCKET type even though they are AF_UNIX specific.
       To send them  set  the  cmsg_level  field  of  the  struct  cmsghdr  to
       SOL_SOCKET  and  the cmsg_type field to the type.  For more information
       see cmsg(3).

       SCM_RIGHTS
              Send or receive a set of  open  file  descriptors  from  another
              process.  The data portion contains an integer array of the file
              descriptors.  The passed file descriptors behave as though  they
              have been created with dup(2).</code></pre>
            <blockquote><p>Technically you do not send “file descriptors”. The “file descriptors” you handle in the code are simply indices into the process’s local file descriptor table, which in turn points into the OS’s open file table, which finally points to the vnode representing the file. Thus the “file descriptor” observed by the other process will most likely have a different numeric value, despite pointing to the same file.</p></blockquote><p>We can also check <code>man 3 cmsg</code> as suggested, to find a handy example of how to use SCM_RIGHTS:</p>
            <pre><code>   struct msghdr msg = { 0 };
   struct cmsghdr *cmsg;
   int myfds[NUM_FD];  /* Contains the file descriptors to pass */
   int *fdptr;
   union {         /* Ancillary data buffer, wrapped in a union
                      in order to ensure it is suitably aligned */
       char buf[CMSG_SPACE(sizeof(myfds))];
       struct cmsghdr align;
   } u;

   msg.msg_control = u.buf;
   msg.msg_controllen = sizeof(u.buf);
   cmsg = CMSG_FIRSTHDR(&amp;msg);
   cmsg-&gt;cmsg_level = SOL_SOCKET;
   cmsg-&gt;cmsg_type = SCM_RIGHTS;
   cmsg-&gt;cmsg_len = CMSG_LEN(sizeof(int) * NUM_FD);
   fdptr = (int *) CMSG_DATA(cmsg);    /* Initialize the payload */
   memcpy(fdptr, myfds, NUM_FD * sizeof(int));</code></pre>
            <p>And that is what we decided to use. We let OpenSSL read the “Client Hello” message from an established TCP connection. If the “Client Hello” indicated TLS version 1.3, we would use SCM_RIGHTS to send it to the Go process. The Go process would in turn try to parse the rest of the “Client Hello”; if successful, it would proceed with the TLS 1.3 connection, and upon failure it would give the file descriptor back to OpenSSL to handle as usual.</p><p>So how exactly do you implement something like that?</p><p>Since in our case we established that the C process will listen for TCP connections, our other process will have to listen on a UNIX socket for connections C will want to forward.</p><p>For example in Go:</p>
            <pre><code>type scmListener struct {
	*net.UnixListener
}

type scmConn struct {
	*net.UnixConn
}

var path = "/tmp/scm_example.sock"

func listenSCM() (*scmListener, error) {
	syscall.Unlink(path)

	addr, err := net.ResolveUnixAddr("unix", path)
	if err != nil {
		return nil, err
	}

	ul, err := net.ListenUnix("unix", addr)
	if err != nil {
		return nil, err
	}

	err = os.Chmod(path, 0777)
	if err != nil {
		return nil, err
	}

	return &amp;scmListener{ul}, nil
}

func (l *scmListener) Accept() (*scmConn, error) {
	uc, err := l.AcceptUnix()
	if err != nil {
		return nil, err
	}
	return &amp;scmConn{uc}, nil
}</code></pre>
            <p>Then in the C process, for each connection we want to pass, we will connect to that socket first:</p>
            <pre><code>int connect_unix()
{
    struct sockaddr_un addr = {.sun_family = AF_UNIX,
                               .sun_path = "/tmp/scm_example.sock"};

    int unix_sock = socket(AF_UNIX, SOCK_STREAM, 0);
    if (unix_sock == -1)
        return -1;

    if (connect(unix_sock, (struct sockaddr *)&amp;addr, sizeof(addr)) == -1)
    {
        close(unix_sock);
        return -1;
    }

    return unix_sock;
}</code></pre>
            <p>To actually pass a file descriptor we utilize the example from <code>man 3 cmsg</code>:</p>
            <pre><code>int send_fd(int unix_sock, int fd)
{
    struct iovec iov = {.iov_base = ":)", // Must send at least one byte
                        .iov_len = 2};

    union {
        char buf[CMSG_SPACE(sizeof(fd))];
        struct cmsghdr align;
    } u;

    struct msghdr msg = {.msg_iov = &amp;iov,
                         .msg_iovlen = 1,
                         .msg_control = u.buf,
                         .msg_controllen = sizeof(u.buf)};

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&amp;msg);
    *cmsg = (struct cmsghdr){.cmsg_level = SOL_SOCKET,
                             .cmsg_type = SCM_RIGHTS,
                             .cmsg_len = CMSG_LEN(sizeof(fd))};

    memcpy(CMSG_DATA(cmsg), &amp;fd, sizeof(fd));

    return sendmsg(unix_sock, &amp;msg, 0);
}</code></pre>
            <p>Then to receive the file descriptor in Go:</p>
            <pre><code>func (c *scmConn) ReadFD() (*os.File, error) {
	msg, oob := make([]byte, 2), make([]byte, 128)

	_, oobn, _, _, err := c.ReadMsgUnix(msg, oob)
	if err != nil {
		return nil, err
	}

	cmsgs, err := syscall.ParseSocketControlMessage(oob[0:oobn])
	if err != nil {
		return nil, err
	} else if len(cmsgs) != 1 {
		return nil, errors.New("invalid number of cmsgs received")
	}

	fds, err := syscall.ParseUnixRights(&amp;cmsgs[0])
	if err != nil {
		return nil, err
	} else if len(fds) != 1 {
		return nil, errors.New("invalid number of fds received")
	}

	fd := os.NewFile(uintptr(fds[0]), "")
	if fd == nil {
		return nil, errors.New("could not open fd")
	}

	return fd, nil
}</code></pre>
            
    <div>
      <h2>Rust</h2>
      <a href="#rust">
        
      </a>
    </div>
    <p>We can also do this in Rust. The Rust standard library does not yet support this use of UNIX sockets, but we can call into the C library directly via the <a href="https://rust-lang.github.io/libc/x86_64-unknown-linux-gnu/libc/">libc</a> crate. Warning: unsafe code ahead!</p><p>First we want to implement some UNIX socket functionality in Rust:</p>
            <pre><code>use libc::*;
use std::io::prelude::*;
use std::net::TcpStream;
use std::os::unix::io::FromRawFd;
use std::os::unix::io::RawFd;

fn errno_str() -&gt; String {
    let strerr = unsafe { strerror(*__error()) };
    let c_str = unsafe { std::ffi::CStr::from_ptr(strerr) };
    c_str.to_string_lossy().into_owned()
}

pub struct UNIXSocket {
    fd: RawFd,
}

pub struct UNIXConn {
    fd: RawFd,
}

impl Drop for UNIXSocket {
    fn drop(&amp;mut self) {
        unsafe { close(self.fd) };
    }
}

impl Drop for UNIXConn {
    fn drop(&amp;mut self) {
        unsafe { close(self.fd) };
    }
}

impl UNIXSocket {
    pub fn new() -&gt; Result&lt;UNIXSocket, String&gt; {
        match unsafe { socket(AF_UNIX, SOCK_STREAM, 0) } {
            -1 =&gt; Err(errno_str()),
            fd @ _ =&gt; Ok(UNIXSocket { fd }),
        }
    }

    pub fn bind(self, address: &amp;str) -&gt; Result&lt;UNIXSocket, String&gt; {
        assert!(address.len() &lt; 104);

        let mut addr = sockaddr_un {
            sun_len: std::mem::size_of::&lt;sockaddr_un&gt;() as u8,
            sun_family: AF_UNIX as u8,
            sun_path: [0; 104],
        };

        for (i, c) in address.chars().enumerate() {
            addr.sun_path[i] = c as i8;
        }

        match unsafe {
            unlink(&amp;addr.sun_path as *const i8);
            bind(
                self.fd,
                &amp;addr as *const sockaddr_un as *const sockaddr,
                std::mem::size_of::&lt;sockaddr_un&gt;() as u32,
            )
        } {
            -1 =&gt; Err(errno_str()),
            _ =&gt; Ok(self),
        }
    }

    pub fn listen(self) -&gt; Result&lt;UNIXSocket, String&gt; {
        match unsafe { listen(self.fd, 50) } {
            -1 =&gt; Err(errno_str()),
            _ =&gt; Ok(self),
        }
    }

    pub fn accept(&amp;self) -&gt; Result&lt;UNIXConn, String&gt; {
        match unsafe { accept(self.fd, std::ptr::null_mut(), std::ptr::null_mut()) } {
            -1 =&gt; Err(errno_str()),
            fd @ _ =&gt; Ok(UNIXConn { fd }),
        }
    }
}</code></pre>
            <p>And the code to extract the file descriptor:</p>
            <pre><code>#[repr(C)]
pub struct ScmCmsgHeader {
    cmsg_len: c_uint,
    cmsg_level: c_int,
    cmsg_type: c_int,
    fd: c_int,
}

impl UNIXConn {
    pub fn recv_fd(&amp;self) -&gt; Result&lt;RawFd, String&gt; {
        let mut iov = iovec {
            iov_base: std::ptr::null_mut(),
            iov_len: 0,
        };

        let mut scm = ScmCmsgHeader {
            cmsg_len: 0,
            cmsg_level: 0,
            cmsg_type: 0,
            fd: 0,
        };

        let mut mhdr = msghdr {
            msg_name: std::ptr::null_mut(),
            msg_namelen: 0,
            msg_iov: &amp;mut iov as *mut iovec,
            msg_iovlen: 1,
            msg_control: &amp;mut scm as *mut ScmCmsgHeader as *mut c_void,
            msg_controllen: std::mem::size_of::&lt;ScmCmsgHeader&gt;() as u32,
            msg_flags: 0,
        };

        let n = unsafe { recvmsg(self.fd, &amp;mut mhdr, 0) };

        if n == -1
            || scm.cmsg_len as usize != std::mem::size_of::&lt;ScmCmsgHeader&gt;()
            || scm.cmsg_level != SOL_SOCKET
            || scm.cmsg_type != SCM_RIGHTS
        {
            Err("Invalid SCM message".to_string())
        } else {
            Ok(scm.fd)
        }
    }
}</code></pre>
            
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>SCM_RIGHTS is a very powerful tool that can be used for many purposes. In our case we used it to introduce a new service in an unobtrusive fashion. Other uses include:</p><ul><li><p>A/B testing</p></li><li><p>Phasing out an old C-based service in favor of a new Go or Rust one</p></li><li><p>Passing connections from a privileged process to an unprivileged one</p></li></ul><p>And more.</p><p>You can find the full example <a href="https://github.com/vkrasnov/scm_sample">here</a>.</p> ]]></content:encoded>
            <category><![CDATA[TLS 1.3]]></category>
            <category><![CDATA[TLS]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[OpenSSL]]></category>
            <category><![CDATA[Go]]></category>
            <category><![CDATA[NGINX]]></category>
            <guid isPermaLink="false">6FPYG7UeMVoKyzz6mA3tpB</guid>
            <dc:creator>Vlad Krasnov</dc:creator>
        </item>
        <item>
            <title><![CDATA[Graceful upgrades in Go]]></title>
            <link>https://blog.cloudflare.com/graceful-upgrades-in-go/</link>
            <pubDate>Thu, 11 Oct 2018 14:30:20 GMT</pubDate>
            <description><![CDATA[ The idea behind graceful upgrades is to swap out the configuration and code of a process while it is running, without anyone noticing it. If this sounds error-prone, dangerous, undesirable and in general a bad idea – I’m with you. ]]></description>
            <content:encoded><![CDATA[ <p><sup><i>Dingle Dangle! by </i></sup><a href="https://www.flickr.com/photos/grant_subaru/14175646490"><sup><i>Grant C.</i></sup></a><sup><i> (CC-BY 2.0)</i></sup></p><p>The idea behind graceful upgrades is to swap out the configuration and code of a process while it is running, without anyone noticing it. If this sounds error-prone, dangerous, undesirable and in general a bad idea – I’m with you. However, sometimes you really need them. Usually this happens in an environment where there is no load balancing layer. We have these at Cloudflare, which led to us investigating and implementing various solutions to this problem.</p><p>Coincidentally, implementing graceful upgrades involves some fun low-level systems programming, which is probably why there are already a bajillion options out there. Read on to learn what trade-offs there are, and why you should really use the Go library we are about to open source. For the impatient, the code is on <a href="https://github.com/cloudflare/tableflip">GitHub</a> and you can read the <a href="https://godoc.org/github.com/cloudflare/tableflip">documentation on godoc</a>.</p>
    <div>
      <h3>The basics</h3>
      <a href="#the-basics">
        
      </a>
    </div>
    <p>So what does it mean for a process to perform a graceful upgrade? Let’s use a web server as an example: we want to be able to fire HTTP requests at it, and never see an error because a graceful upgrade is happening.</p><p>We know that HTTP uses TCP under the hood, and that we interface with TCP using the BSD socket API. We have told the OS that we’d like to receive connections on port 80, and the OS has given us a listening socket, on which we call <code>Accept()</code> to wait for new clients.</p><p>A new client will be refused if the OS doesn’t know of a listening socket for port 80, or nothing is calling <code>Accept()</code> on it. The trick of a graceful upgrade is to make sure that neither of these two things occurs while we somehow restart our service. Let’s look at all the ways we could achieve this, from simple to complex.</p>
    <div>
      <h3>Just <code>Exec()</code></h3>
      <a href="#just-exec">
        
      </a>
    </div>
    <p>OK, how hard can it be? Let’s just <code>Exec()</code> the new binary (without doing a fork first). This does exactly what we want, by replacing the currently running code with the new code from disk.</p>
            <pre><code>// The following is pseudo-Go.

func main() {
	var ln net.Listener
	if isUpgrade {
		ln = net.FileListener(os.NewFile(uintptr(fdNumber), "listener"))
	} else {
		ln = net.Listen(network, address)
	}
	
	go handleRequests(ln)

	&lt;-waitForUpgradeRequest

	syscall.Exec(os.Args[0], os.Args, os.Environ())
}</code></pre>
            <p>Unfortunately this has a fatal flaw since we can’t “undo” the exec. Imagine a configuration file with too much white space in it or an extra semicolon. The new process would try to read that file, get an error and exit.</p><p>Even if the exec call works, this solution assumes that initialisation of the new process is practically instantaneous. We can get into a situation where the kernel refuses new connections because the <a href="https://veithen.github.io/2014/01/01/how-tcp-backlog-works-in-linux.html">listen queue is overflowing</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/46DpPmhtCjeHZmXVkySQTP/e644645ace6a5b37c1e0c4c09fc86814/Example1-1.png" />
            
            </figure><p><i>New connections may be dropped if </i><code><i>Accept()</i></code><i> is not called regularly enough</i></p><p>Specifically, the new binary is going to spend some time after <code>Exec()</code> to initialise, which delays calls to  <code>Accept()</code>. This means the backlog of new connections grows until some are dropped. Plain exec is out of the game.</p>
    <div>
      <h3><code>Listen()</code> all the things</h3>
      <a href="#listen-all-the-things">
        
      </a>
    </div>
    <p>Since just using exec is out of the question, we can try the next best thing. Let’s fork and exec a new process which then goes through its usual start-up routine. At some point it will create a few sockets by listening on some addresses, except that won’t work out of the box: <code>bind()</code> fails with <code>EADDRINUSE</code>, otherwise known as Address Already In Use. The kernel is preventing us from listening on the address and port combination used by the old process.</p><p>Of course, there is a flag to fix that: <code>SO_REUSEPORT</code>. This tells the kernel to ignore the fact that there is already a listening socket for a given address and port, and just allocate a new one.</p>
            <pre><code>func main() {
	ln := net.ListenWithReusePort(network, address)

	go handleRequests(ln)

	&lt;-waitForUpgradeRequest

	cmd := exec.Command(os.Args[0], os.Args[1:]...)
	cmd.Start()

	&lt;-waitForNewProcess
}</code></pre>
            <p>Now both processes have working listening sockets and the upgrade works. Right?</p><p><code>SO_REUSEPORT</code> is a little bit peculiar in what it does inside the kernel. As systems programmers, we tend to think of a socket as the file descriptor that is returned by the socket call. The kernel however makes a distinction between the data structure of a socket, and one or more file descriptors pointing at it. It creates a separate socket structure if you bind using <code>SO_REUSEPORT</code>, not just another file descriptor. The old and the new process are thus referring to two separate sockets, which happen to share the same address. This leads to an unavoidable race condition: new-but-not-yet-accepted connections on the socket used by the old process will be orphaned and terminated by the kernel. GitHub wrote <a href="https://githubengineering.com/glb-part-2-haproxy-zero-downtime-zero-delay-reloads-with-multibinder/#haproxy-almost-safe-reloads">an excellent blog post about this problem</a>.</p><p>The engineers at GitHub solved the problems with <code>SO_REUSEPORT</code> by using an obscure feature of the sendmsg syscall <a href="http://man7.org/linux/man-pages/man0/sys_socket.h.0p.html">called ancillary data</a>. It turns out that ancillary data can include file descriptors. Using this API made sense for GitHub, since it allowed them to integrate elegantly with HAProxy. Since we have the luxury of changing the program, we can use simpler alternatives.</p>
    <div>
      <h3>NGINX: share sockets via fork and exec</h3>
      <a href="#nginx-share-sockets-via-fork-and-exec">
        
      </a>
    </div>
    <p>NGINX is the tried and trusted workhorse of the Internet, and happens to support graceful upgrades. As a bonus we also use it at Cloudflare, so we were confident in its implementation.</p><p>It is written in a process-per-core model, which means that instead of spawning a bunch of threads NGINX runs a process per logical CPU core. Additionally, there is a primary process which orchestrates graceful upgrades.</p><p>The primary is responsible for creating all listen sockets used by NGINX and sharing them with the workers. This is fairly straightforward: first, the <code>FD_CLOEXEC</code> bit is cleared on all listen sockets. This means that they are not closed when the <code>exec()</code> syscall is made. The primary then does the customary <code>fork()</code> / <code>exec()</code> dance to spawn the workers, passing the file descriptor numbers as an environment variable.</p><p>Graceful upgrades make use of the same mechanism. We can spawn a new primary process (PID 1176) by <a href="http://nginx.org/en/docs/control.html#upgrade">following the NGINX documentation</a>. This inherits the existing listeners from the old primary process (PID 1017) just like workers do. The new primary then spawns its own workers:</p>
            <pre><code> CGroup: /system.slice/nginx.service
       	├─1017 nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
       	├─1019 nginx: worker process
       	├─1021 nginx: worker process
       	├─1024 nginx: worker process
       	├─1026 nginx: worker process
       	├─1027 nginx: worker process
       	├─1028 nginx: worker process
       	├─1029 nginx: worker process
       	├─1030 nginx: worker process
       	├─1176 nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
       	├─1187 nginx: worker process
       	├─1188 nginx: worker process
       	├─1190 nginx: worker process
       	├─1191 nginx: worker process
       	├─1192 nginx: worker process
       	├─1193 nginx: worker process
       	├─1194 nginx: worker process
       	└─1195 nginx: worker process</code></pre>
            <p>At this point there are two completely independent NGINX processes running. PID 1176 might be a new version of NGINX, or could use an updated config file. When a new connection arrives for port 80, one of the 16 worker processes is chosen by the kernel.</p><p>After executing the remaining steps, we end up with a fully replaced NGINX:</p>
            <pre><code>   CGroup: /system.slice/nginx.service
       	├─1176 nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
       	├─1187 nginx: worker process
       	├─1188 nginx: worker process
       	├─1190 nginx: worker process
       	├─1191 nginx: worker process
       	├─1192 nginx: worker process
       	├─1193 nginx: worker process
       	├─1194 nginx: worker process
       	└─1195 nginx: worker process</code></pre>
            <p>Now, when a request arrives the kernel chooses between one of the eight remaining processes.</p><p>This process is rather fickle, so NGINX has a safeguard in place. Try requesting a second upgrade while the first hasn’t finished, and you’ll find the following message in the error log:</p>
            <pre><code>[crit] 1176#1176: the changing binary signal is ignored: you should shutdown or terminate before either old or new binary's process</code></pre>
            <p>This is very sensible: there is no good reason why there should be more than two processes at any given point in time. Ideally, we want the same behaviour from our Go solution.</p>
    <div>
      <h3>Graceful upgrade wishlist</h3>
      <a href="#graceful-upgrade-wishlist">
        
      </a>
    </div>
    <p>The way NGINX has implemented graceful upgrades is very nice. There is a clear life cycle which determines valid actions at any point in time:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/71WT8Hg8Bi4AAgf3yGdN8I/e30876bbf1c15d1e68143278b52f4bba/upgrade-lifecycle.svg" />
            
            </figure><p>It also solves the problems we’ve identified with the other approaches. Really, we’d like NGINX-style graceful upgrades as a Go library.</p><ul><li><p>No old code keeps running after a successful upgrade</p></li><li><p>The new process can crash during initialisation, without bad effects</p></li><li><p>Only a single upgrade is active at any point in time</p></li></ul><p>Of course, the Go community has produced some fine libraries just for this occasion. We looked at</p><ul><li><p><a href="https://github.com/alext/tablecloth">github.com/alext/tablecloth</a> (hat tip for the great name)</p></li><li><p><a href="https://godoc.org/github.com/astaxie/beego/grace">github.com/astaxie/beego/grace</a></p></li><li><p><a href="https://github.com/facebookgo/grace">github.com/facebookgo/grace</a></p></li><li><p><a href="https://github.com/crawshaw/littleboss">github.com/crawshaw/littleboss</a></p></li></ul><p>just to name a few. Each of them is different in its implementation and trade-offs, but none of them ticked all of our boxes. The most common problem is that they are designed to gracefully upgrade an http server. This makes their API much nicer, but removes flexibility that we need to support other socket based protocols. So really, there was absolutely no choice but to write our own library, called tableflip. Having fun was not part of the equation.</p>
    <div>
      <h3>tableflip</h3>
      <a href="#tableflip">
        
      </a>
    </div>
    <p>tableflip is a Go library for NGINX-style graceful upgrades. Here is what using it looks like:</p>
            <pre><code>upg, _ := tableflip.New(tableflip.Options{})
defer upg.Stop()

// Do an upgrade on SIGHUP
go func() {
    sig := make(chan os.Signal, 1)
    signal.Notify(sig, syscall.SIGHUP)
    for range sig {
        _ = upg.Upgrade()
    }
}()

// Start a HTTP server
ln, _ := upg.Fds.Listen("tcp", "localhost:8080")
server := http.Server{}
go server.Serve(ln)

// Tell the parent we are ready
_ = upg.Ready()

// Wait to be replaced with a new process
&lt;-upg.Exit()

// Wait for connections to drain.
server.Shutdown(context.TODO())</code></pre>
            <p>Calling <code>Upgrader.Upgrade</code> spawns a new process with the necessary net.Listeners, and waits for the new process to signal that it has finished initialisation, to die, or to time out. Calling it while an upgrade is ongoing returns an error.</p><p><code>Upgrader.Fds.Listen</code> is inspired by <code>facebookgo/grace</code> and allows inheriting a net.Listener easily. Behind the scenes, <code>Fds</code> makes sure that unused inherited sockets are cleaned up. This includes UNIX sockets, which are tricky due to <a href="https://golang.org/pkg/net/#UnixListener.SetUnlinkOnClose">UnlinkOnClose</a>. You can also pass straight up <code>*os.File</code> to the new process if you desire.</p><p>Finally, <code>Upgrader.Ready</code> cleans up unused fds and signals the parent process that initialisation is done. The parent can then exit, which completes the graceful upgrade cycle.</p>
            <category><![CDATA[Go]]></category>
            <category><![CDATA[Programming]]></category>
            <guid isPermaLink="false">7qaSxnaXNYj34tyA0WUanS</guid>
            <dc:creator>Lorenz Bauer</dc:creator>
        </item>
        <item>
            <title><![CDATA[Porting Our Software to ARM64]]></title>
            <link>https://blog.cloudflare.com/porting-our-software-to-arm64/</link>
            <pubDate>Sun, 02 Sep 2018 07:35:00 GMT</pubDate>
            <description><![CDATA[ As we enable more ARM64[1] machines in our network, I want to give some technical insight into the process we went through to reach software parity in our multi-architecture environment. ]]></description>
            <content:encoded><![CDATA[ <p>As we enable more ARM64<a href="#fn1">[1]</a> machines in our network, I want to give some technical insight into the process we went through to reach software parity in our multi-architecture environment.</p><p>To give some idea of the scale of this task, it’s necessary to describe the software stack we run on our servers. The foundation is the Linux kernel. Then, we use the Debian distribution as our base operating system. Finally, we install hundreds of packages that we build ourselves. Some packages are based on open-source software, often tailored to better meet our needs. Other packages were written from scratch within Cloudflare.</p><p>Industry support for ARM64 is very active, so a lot of open-source software has already been ported. This includes the Linux kernel. Additionally, Debian made ARM64 a <a href="https://wiki.debian.org/Arm64Port#Status">first-class release architecture starting with Stretch</a> in 2017. This meant that upon obtaining our ARM64 hardware, a few engineers were able to bring Debian up quickly and smoothly. Our attention then turned to getting all our in-house packages to build and run for ARM64.</p><p>Our stack uses a diverse range of programming languages, including C, C++, Go, Lua, Python, and Rust. Different languages have different porting requirements, with some being easier than others.</p>
    <div>
      <h3>Porting Go Code</h3>
      <a href="#porting-go-code">
        
      </a>
    </div>
    <p>Cross-compiling Go code is relatively simple, since <a href="https://github.com/golang/go/wiki/GoArm">ARM64 is a first-class citizen</a>. Go compiles and links static binaries using the system’s crossbuild toolchain, meaning the only additional Debian package we had to install on top of <code>build-essential</code> is <code>crossbuild-essential-arm64</code>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/35Jkf2QztHDpwjzR2x6hw3/3cee03287c46dca64e8ad270c54839d2/photo-1509815963-90cab26a868f" />
            
            </figure><p>Photo by <a href="https://unsplash.com/@tomcoe?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">tom coe</a> / <a href="https://unsplash.com/?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">Unsplash</a></p><p>After installing the crossbuild toolchain, we then replaced every <code>go build</code> invocation with a loop of <code>GOARCH=&lt;arch&gt; CGO_ENABLED=1 go build</code>, where <code>&lt;arch&gt;</code> iterates through <code>amd64</code> and <code>arm64</code>. Forcing <code>CGO_ENABLED=1</code> is required, as <code>cgo</code> is <a href="https://github.com/golang/go/blob/8fbbf63cc61ffbb953a6d44649b644e4211c5d83/src/cmd/cgo/doc.go#L115-L124">disabled by default for cross-compilation</a>. The generated binaries are then run through our testing framework.</p>
    <div>
      <h3>Porting Rust Code</h3>
      <a href="#porting-rust-code">
        
      </a>
    </div>
    <p>Rust also has mature support for ARM64. The steps for porting start with installing <code>crossbuild-essential-arm64</code> and defining the <code>--target</code> triple for <code>rustc</code> or <code>cargo</code>. Different targets are bucketed into <a href="https://forge.rust-lang.org/platform-support.html">different tiers of completeness</a>. Full instructions are well-documented at <a href="https://github.com/japaric/rust-cross">rust-cross</a>.</p><p>One thing to note, however, is that any crates pulled in by a package must also be cross-compiled. The more crates used, the higher the chance of running into one that does not cross-compile well.</p>
    <div>
      <h3>Testing, Plus Porting Other Code</h3>
      <a href="#testing-plus-porting-other-code">
        
      </a>
    </div>
    <p>Other languages are less cooperative when it comes to cross-compilation. Fiddling with <code>CC</code> and <code>LD</code> values didn’t seem to be the best use of engineering resources. What we really wanted was an emulation layer, which would leverage all of our <code>x86_64</code> machines, from our distributed compute behemoths to developers’ laptops, for the purposes of both building and testing code.</p><p>Enter QEMU.</p><p>QEMU is an emulator with multiple modes, including both full system emulation and user-space emulation. Our compute nodes are beefy enough to handle system-level emulation, but for developers’ laptops, user-space emulation provides most of the benefits with less overhead.</p><p>For user-space emulation, our porting team did not want to intrude too much into our developers’ normal workflow. Our internal build system already uses Docker as a backend, so it would be ideal to be able to <code>docker run</code> into an ARM environment, like so:</p>
            <pre><code>host$ uname -m
x86_64
host$ docker run --rm -it stretch-arm64/master:latest
guest# uname -m
aarch64</code></pre>
            <p>Fortunately, we were not the first ones to come up with this idea: folks over at resin.io have <a href="https://resin.io/blog/building-arm-containers-on-any-x86-machine-even-dockerhub/">solved this problem already</a>! They’ve also submitted a <a href="https://patchwork.ozlabs.org/patch/582756/">patch to <code>qemu-user</code></a> that prepends the emulator to every <code>execve</code> call, similar to how <a href="https://www.kernel.org/doc/html/v4.14/admin-guide/binfmt-misc.html"><code>binfmt_misc</code> is implemented</a><a href="#fn2">[2]</a>. By prepending the emulator, you’re essentially forcing every new process to also be emulated, resulting in a nice self-contained environment.</p><p>With the <code>execve</code> patch built into <code>qemu-user</code>, all we had to do was copy the emulator into an ARM64 container and set the appropriate entrypoint:</p>
            <pre><code># internal build of qemu with patch
FROM qemu-aarch64/master:latest as qemu

# arm64v8/debian:stretch-slim at 2018-02-12T13:02:00Z
FROM arm64v8/debian@sha256:841bbe6f4132526be95c91bec6757831c76e603309a47992e6444de6a0b6521a

COPY --from=qemu /qemu-aarch64 /qemu-aarch64
SHELL ["/qemu-aarch64", "/bin/sh", "-c"]

# setcap is required for `sudo` to work, but breaks environment variable passing
# run `setcap -r /qemu-aarch64` to break sudo, but allow environment variable passing
# maybe one day we’ll have our cake and eat it too
RUN apt-get update &amp;&amp; apt-get install -y --no-install-recommends libcap2-bin &amp;&amp; \
    setcap cap_setuid,cap_setgid+ep /qemu-aarch64 &amp;&amp; \
    apt-get remove --purge -y libcap2-bin &amp;&amp; apt-get autoremove -y &amp;&amp; \
    rm -rf /var/lib/apt/lists/*

ENTRYPOINT ["/qemu-aarch64", "--execve", "/qemu-aarch64"]
CMD ["/bin/bash"]</code></pre>
            <p>This Dockerfile resulted in the cross-architecture output we were looking for earlier.</p><p>Now that we had a self-contained ARM64 environment, we could build and test most of our code relatively smoothly.</p>
    <div>
      <h3>Technical Gotchas</h3>
      <a href="#technical-gotchas">
        
      </a>
    </div>
    <p>Of course, there are always a few blockers on the road to perfection. Upon releasing this modified Debian image to our developers, they returned with a few interesting problems:</p><ul><li><p>tests were failing due to <code>LD_LIBRARY_PATH</code> not working</p></li><li><p>Go programs segfaulting at indeterminate intervals</p></li><li><p>system-installed libraries were taking precedence over user-installed libraries</p></li><li><p>slow builds and sporadic test case failures, speeding up our plan for native builds and CI</p></li></ul>
    <div>
      <h4>LD_LIBRARY_PATH and Friends</h4>
      <a href="#ld_library_path-and-friends">
        
      </a>
    </div>
    <p>It turns out that <code>LD_LIBRARY_PATH</code> was not the only environment variable that failed to work correctly. <i>All</i> environment variables, either set on the command line or via other means (e.g. <code>export</code>), would fail to propagate into the <code>qemu-user</code> process.</p><p>Through bisection of known good code, we found that it was the <code>setcap</code> in our Dockerfile which prevented the environment variable passthrough. Unfortunately, this <code>setcap</code> is the same one that allows us to call <code>sudo</code>, so we have a caveat for our developers that they can either run <code>sudo</code> inside their containers, or have environment variable passing, but not both.</p>
    <div>
      <h4>Intermittent Go Failures</h4>
      <a href="#intermittent-go-failures">
        
      </a>
    </div>
    <p>With a decent amount of Go code running through our CI system, it was easy to spot a trend of intermittent segfaults.</p><p>Going on a hunch, we confirmed a hypothesis that non-deterministic failures are generally due to threading issues. Unfortunately, opinion on <a href="https://github.com/golang/go/issues/20763">the issue tracker</a> showed that Go / QEMU incompatibilities aren’t a priority, so we were left without an upstream fix.</p><p>The workaround we came up with is simple: if the problem is threading-related, limit where the threads can run! When we package our internal <code>go</code> binaries, we add a <a href="https://gist.github.com/ahrex/9a84f32a33aadc197a688d2158d7e2ea"><code>.deb</code> post-install script</a> to detect if we’re running under ARM64 emulation, and if so, reduce the number of CPUs the <code>go</code> binary can run under to one. We lose performance by pinning to one CPU, but this slowdown is negligible when we’re already running under emulation, and slow code is better than non-working code.</p><p>With the workaround in place, reports of intermittent crashes dropped to zero. Onto the next problem!</p>
    <div>
      <h4>Shared Library Mixups</h4>
      <a href="#shared-library-mixups">
        
      </a>
    </div>
    <p>We like to be at the forefront of technology, from suggesting improvements to what would become TLS 1.3, to partnering with Mozilla to <a href="https://blog.nightly.mozilla.org/2018/08/28/firefox-nightly-secure-dns-experimental-results/">make DNS queries more secure</a>, and everything in between. To be able to support these new technologies, our software has to be at the cutting edge.</p><p>On the other hand, we also need a reliable platform to build on. One of the reasons we chose Debian is its long-term support lifecycle, versus other rolling release operating systems.</p><p>With these two ideas in tension, we opted not to overwrite system libraries in <code>/usr/lib</code> with our cutting-edge versions, but instead to supplement the defaults by installing into <code>/usr/local/lib</code>.</p><p>The same development team that reported the <code>LD_LIBRARY_PATH</code> issue also came to us saying the ARM64 version of their software would fail to load shared object symbols. A debugging session was launched and we eventually isolated it to the <a href="https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=685706">ordering of <code>/etc/ld.so.conf.d/</code> in Debian</a>.</p><p>Compare an <code>x86_64</code> machine:</p>
            <pre><code>$ uname -m
x86_64
$ ls /etc/ld.so.conf.d/
libc.conf  x86_64-linux-gnu.conf
$ cat /etc/ld.so.conf.d/libc.conf
# libc default configuration
/usr/local/lib
$ cat /etc/ld.so.conf.d/x86_64-linux-gnu.conf
# Multiarch support
/lib/x86_64-linux-gnu
/usr/lib/x86_64-linux-gnu</code></pre>
            <p>With an <code>arm64</code> machine:</p>
            <pre><code>$ uname -m
aarch64
$ ls /etc/ld.so.conf.d/
aarch64-linux-gnu.conf  libc.conf
$ cat /etc/ld.so.conf.d/aarch64-linux-gnu.conf
# Multiarch support
/lib/aarch64-linux-gnu
/usr/lib/aarch64-linux-gnu
$ cat /etc/ld.so.conf.d/libc.conf
# libc default configuration
/usr/local/lib</code></pre>
            <p>By traversing <code>/etc/ld.so.conf.d/</code> in alphabetical order, shared libraries in <code>/usr/local/lib</code> would be loaded before <code>/usr/lib/$(uname -m)-linux-gnu</code> on <code>x86_64</code>, while the opposite is true for <code>arm64</code>!</p><p>Internal discussion resulted in us not changing the system default search order, but instead using the <a href="https://en.wikipedia.org/wiki/Rpath">linker flag <code>--rpath</code></a> to request that the runtime loader explicitly search our <code>/usr/local/lib</code> location first.</p><p>This issue applies to both the emulated and physical ARM64 environments, which is a boon for the emulation framework we’ve just put together.</p>
    <div>
      <h4>Native Builds and CI</h4>
      <a href="#native-builds-and-ci">
        
      </a>
    </div>
    <p>Cross- and emulated compilation brought over 99% of our edge codebase, but there were still a handful of packages that did not fit the models we defined. Some packages, e.g. <code>llvm</code>, parallelize their build so well that the cost of userspace emulation slowed the build time to over 6 hours. Other packages called more esoteric functions which QEMU was not prepared to handle.</p><p>Rather than devote resources to emulating the long tail, we allocated a few ARM64 machines for developer access, and one machine for a native CI agent. Maintainers of the long tail could iterate in peace, knowing their failing test cases were never due to the emulation layer. When ready, CI would pick up their changes and build an official package, post-review.</p><p>While native compilation is the least error-prone build method, the limited supply of machines made this option unattractive: every machine we allocate for development and CI is a machine we take away from our proving grounds.</p><p>Ideally, we should follow the <a href="https://github.com/golang/go/issues/13024#issuecomment-164894790">Go team’s recommendation of running code natively</a>, but as long as our developers iterate on their <code>x86_64</code> laptops, supporting emulation is necessary for us.</p>
    <div>
      <h3>Other Challenges</h3>
      <a href="#other-challenges">
        
      </a>
    </div>
    <p>With the most glaring blockers out of the way, we have now given our developers an even footing to easily build for multiple architectures.</p><p>The rest of the time was spent coordinating over a hundred packages, split between dozens of tech teams. At the beginning, responsibility for building ARM64 packages lay with the porting team. Working on a changing codebase required close collaboration between maintainer and porter.</p><p>Once we deemed our ARM64 platform production-ready, a self-guided procedure was created to use the build methods listed above, and a request was sent out to all of engineering to support ARM64 as a first-class citizen.</p><p>The end result of our stack is currently being tested, profiled, and optimized, with results coming soon!</p><hr /><h6>Many more opportunities exist for systems integration, debugging deep dives, cross-team collaboration, and internal tools development. <a href="https://www.cloudflare.com/careers/">Come join us</a> if you’re interested!</h6><hr /><ol><li><p>ARM64 is sometimes used interchangeably with aarch64 and ARMv8 <a href="#fnref1">↩︎</a></p></li><li><p><code>binfmt_misc</code> is also what <a href="https://docs.docker.com/docker-for-mac/multi-arch/">Docker for Mac uses to leverage multi-architecture support</a>; we’re supporting something very similar, albeit from Linux to Linux, versus macOS to Linux. <a href="#fnref2">↩︎</a></p></li></ol> ]]></content:encoded>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[Rust]]></category>
            <category><![CDATA[Go]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <guid isPermaLink="false">28OIXFL38zGjyvlunsoGOY</guid>
            <dc:creator>Alexander Huynh</dc:creator>
        </item>
        <item>
            <title><![CDATA[Copenhagen & London developers, join us for five events this May]]></title>
            <link>https://blog.cloudflare.com/copenhagen-london-developers/</link>
            <pubDate>Thu, 26 Apr 2018 05:32:00 GMT</pubDate>
            <description><![CDATA[ Are you based in Copenhagen or London? Drop by some talks we're hosting about the use of Go, Kubernetes, and Cloudflare’s Mobile SDK. ]]></description>
            <content:encoded><![CDATA[ <p>Photo by <a href="https://unsplash.com/@nickkarvounis?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">Nick Karvounis</a> / <a href="https://unsplash.com/?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">Unsplash</a></p><p>Are you based in Copenhagen or London? Drop by one or all of these five events.</p><p><a href="https://twitter.com/0xRLG">Ross Guarino</a> and <a href="https://twitter.com/terinjokes">Terin Stock</a>, both Systems Engineers at Cloudflare, are traveling to Europe to lead Go and Kubernetes talks in Copenhagen. They'll then join <a href="https://twitter.com/IcyApril">Junade Ali</a> and lead talks on their use of Go, Kubernetes, and Cloudflare’s Mobile SDK at Cloudflare's London office.</p><p>My Developer Relations teammates and I are visiting these cities over the next two weeks to produce these events with Ross, Terin, and Junade. We’d love to meet you and invite you along.</p><p>Our trip will begin with two meetups and a conference talk in Copenhagen.</p>
    <div>
      <h3>Event #1 (Copenhagen): 6 Cloud Native Talks, 1 Evening: Special KubeCon + CloudNativeCon EU Meetup</h3>
      <a href="#event-1-copenhagen-6-cloud-native-talks-1-evening-special-kubecon-cloudnativecon-eu-meetup">
        
      </a>
    </div>
    
            <figure>
            <a href="https://www.meetup.com/GOTO-Nights-CPH/events/249895973/">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3PkIfrgFfKT97ZlXTCkM8N/b29595b25228a344e0c1b66c880d004f/GOTO.jpeg.jpeg" />
            </a>
            </figure><p><b>Tuesday, 1 May</b>: 17:00-21:00</p><p><b>Location</b>: <a href="https://trifork.com/">Trifork Copenhagen</a> - <a href="https://www.google.com/maps/place/Borgergade+24b,+1300+K%C3%B8benhavn,+Denmark/@55.684785,12.5840548,17z/data=!3m1!4b1!4m5!3m4!1s0x46525318dfb3b89d:0x855a7fb57181604f!8m2!3d55.684785!4d12.5862435">Borgergade 24B, 1300 København K</a></p><p>How to extend your Kubernetes cluster</p><p>A brief introduction to controllers, webhooks and CRDs. Ross and Terin will talk about how Cloudflare’s internal platform builds on Kubernetes.</p><p><b>Speakers</b>: Ross Guarino and Terin Stock</p><p><a href="https://www.meetup.com/GOTO-Nights-CPH/events/249895973/">View Event Details &amp; Register Here »</a></p>
    <div>
      <h3>Event #2 (Copenhagen): Gopher Meetup At Falcon.io: Building Go With Bazel &amp; Internationalization in Go</h3>
      <a href="#event-2-copenhagen-gopher-meetup-at-falcon-io-building-go-with-bazel-internationalization-in-go">
        
      </a>
    </div>
    
            <figure>
            <a href="https://www.meetup.com/Go-Cph/events/249830850/">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1YCo27Er0ncIR677a3eKFL/ae8b09422245f7c111d27f23607e16ff/Viking-Gopher.png" />
            </a>
            </figure><p><b>Wednesday, 2 May</b>: 18:00-21:00</p><p><b>Location</b>: <a href="https://www.falcon.io/">Falcon.io</a> - <a href="https://www.google.com/maps/place/H.+C.+Andersens+Blvd.+27,+1553+K%C3%B8benhavn,+Denmark/@55.674143,12.5629923,15z/data=!4m5!3m4!1s0x4652531251d1c86d:0xd1f236f0ffef562e!8m2!3d55.674143!4d12.571747">H.C. Andersen Blvd. 27, København</a></p><p>Talk 1: Building Go with Bazel</p><p>Fast and Reproducible go builds with Bazel. Learn how to remove Makefiles from your repositories.</p><p><b>Speaker</b>: Ross Guarino</p><p>Talk 2: Internationalization in Go</p><p>Explore making effective use of Go’s internationalization and localization packages and easily making your applications world-friendly.</p><p><b>Speaker</b>: Terin Stock</p><p><a href="https://www.meetup.com/Go-Cph/events/249830850/">View Event Details &amp; Register Here »</a></p>
    <div>
      <h3>Event #3 (Copenhagen): Controllers: Lambda Functions for Extending your Infrastructure at KubeCon + CloudNativeCon 2018</h3>
      <a href="#event-3-copenhagen-controllers-lambda-functions-for-extending-your-infrastructure-at-kubecon-cloudnativecon-2018">
        
      </a>
    </div>
    
            <figure>
            <a href="http://sched.co/DqwM">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6DfqqwLIbsg89kEglVpmNr/5ee34593d6192c6cdb9bb3bac21da2e9/Screen-Shot-2018-04-25-at-2.41.41-PM.png" />
            </a>
            </figure><p><b>Friday, 4 May</b>: 14:45-15:20</p><p><b>Location</b>: <a href="https://events.linuxfoundation.org/events/kubecon-cloudnativecon-europe-2018/">KubeCon + CloudNativeCon 2018</a> - <a href="https://www.google.com/maps/place/Bella+Center/@55.6385357,12.5433961,13z/data=!3m1!5s0x465254a4eeec0777:0x55f95a7fe9ed3f83!4m13!1m5!2m4!1sBella+Center,+Center+Blvd.+5,+2300+K%C3%B8benhavn!5m2!5m1!1s2018-04-27!3m6!1s0x465254a363269c3d:0x61db300fc92fb898!5m1!1s2018-04-27!8m2!3d55.6375044!4d12.5785932">Bella Center, Center Blvd. 5, 2300 København</a></p><p>If you happen to be attending <a href="https://events.linuxfoundation.org/events/kubecon-cloudnativecon-europe-2018/">KubeCon + CloudNativeCon 2018</a>, check out Terin and Ross’s conference talk as well.</p><p>This session demonstrates how to leverage Kubernetes Controllers and Initializers as a framework for building transparent extensions of your Kubernetes cluster. Using a live coding exercise and demo, this presentation will showcase the possibilities of the basic programming paradigms the Kubernetes API server makes easy.</p><p><b>Speakers</b>: Ross Guarino and Terin Stock</p><p><a href="http://sched.co/DqwM">View Event Details &amp; Register Here »</a></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5NWrQuZUZgMIy7uJZv0OWr/4fe475faee0e83937f10542f86e6bb4c/photo-1508808402998-ec38e4bf0fd0" />
            
            </figure><p>Photo by <a href="https://unsplash.com/@photobuffs?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">Paul Buffington</a> / <a href="https://unsplash.com/?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">Unsplash</a></p><p>When <a href="https://events.linuxfoundation.org/events/kubecon-cloudnativecon-europe-2018/">KubeCon + CloudNativeCon 2018</a> concludes, we're all heading to the Cloudflare London office where we are hosting two more meetups.</p>
    <div>
      <h3>Event #4 (London): Kubernetes Controllers: Lambda Functions for Extending your Infrastructure</h3>
      <a href="#event-4-london-kubernetes-controllers-lambda-functions-for-extending-your-infrastructure">
        
      </a>
    </div>
    
            <figure>
            <a href="https://kubernetes-controlers.eventbrite.com">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4RvxFZQ2Thh4N7J4zBOcSu/4c02fcfd43a2fd108b19fac3c0e5aab8/Cloudflare-London.jpg" />
            </a>
            </figure><p><b>Wednesday, 9 May</b>: 18:00-20:00</p><p><b>Location</b>: Cloudflare London - <a href="https://www.google.com/maps/place/25+Lavington+St,+London+SE1+0NZ,+UK/@51.5047963,-0.1024043,17z/data=!3m1!4b1!4m5!3m4!1s0x487604a8a2b9c4f1:0x1126c5560c56cc41!8m2!3d51.5047963!4d-0.1002156">25 Lavington St, Second floor | SE1 0NZ London</a></p><p>This session demonstrates how to leverage Kubernetes Controllers and Initializers as a framework for building transparent extensions of your Kubernetes cluster. Using a live coding exercise and demo, this presentation will showcase the possibilities of the basic programming paradigms the Kubernetes API server makes easy. As an SRE, learn to build custom integrations directly into the Kubernetes API that transparently enhance the developer experience.</p><p><b>Speakers</b>: Ross Guarino and Terin Stock</p><p><a href="https://kubernetes-controlers.eventbrite.com">View Event Details &amp; Register Here »</a></p>
    <div>
      <h3>Event #5 (London): Architecture for Network Failure, Developing for Mobile Performance</h3>
      <a href="#event-5-london-architecture-for-network-failure-developing-for-mobile-performance">
        
      </a>
    </div>
    
            <figure>
            <a href="https://mobilearchitectureandperformance.eventbrite.com">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/atA3RsFZPPZbbHcZKMhZW/1c5d56415e2513375ef5e6db50344910/Screen-Shot-2018-04-25-at-8.59.10-AM.png" />
            </a>
            </figure><p><b>Thursday, 10 May</b>: 18:00-20:00</p><p><b>Location</b>: Cloudflare London - <a href="https://www.google.com/maps/place/25+Lavington+St,+London+SE1+0NZ,+UK/@51.5047963,-0.1024043,17z/data=!3m1!4b1!4m5!3m4!1s0x487604a8a2b9c4f1:0x1126c5560c56cc41!8m2!3d51.5047963!4d-0.1002156">25 Lavington St, Second floor | SE1 0NZ London</a></p><p>Whether you're building an <a href="https://www.cloudflare.com/ecommerce/">e-commerce app</a> or a new mobile game, chances are you'll need some network functionality at some point. Network performance can vary dramatically between carriers, networks, and APIs, but far too often mobile apps are tested in consistent conditions with the same decent network performance. Fortunately, we can iterate on our apps by collecting real-life performance measurements from users; unfortunately, existing mobile app analytics platforms only provide visibility into in-app performance and have no knowledge of outgoing network calls.</p><p>This talk will cover how you can easily collect vital performance data from your users at no cost and then use this data to improve your apps' reliability and experience, discussing the tips and tricks needed to <a href="https://www.cloudflare.com/solutions/ecommerce/optimization/">boost app performance</a>.</p><p><b>Speaker</b>: Junade Ali</p><p><a href="https://mobilearchitectureandperformance.eventbrite.com">View Event Details &amp; Register Here »</a></p>
    <div>
      <h3>More About the Speakers</h3>
      <a href="#more-about-the-speakers">
        
      </a>
    </div>
    <p><a href="https://twitter.com/0xRLG">Ross Guarino</a> is a Systems Engineer at Cloudflare in charge of the technical direction of the internal platform. He’s determined to improve the lives of developers building and maintaining everything from a simple function to complex globally distributed systems.</p><p><a href="https://twitter.com/terinjokes">Terin Stock</a> is a long-time engineer at Cloudflare, currently working on building an internal Kubernetes cluster. By night, he hacks on new hardware projects. Terin is also a member of the <a href="https://gulpjs.com/">gulp.js</a> core team and the author of the <a href="https://github.com/terinjokes/StickersStandard">Sticker Standard</a>.</p><p><a href="https://twitter.com/IcyApril">Junade Ali</a> is a software engineer who specialises in computer security and software architecture. Currently, Junade works at Cloudflare as a polymath, and helps make the Internet more secure and faster; prior to this, he was a technical lead at some of the UK's leading digital agencies before moving into architecting software for mission-critical road-safety systems.</p><p>We hope to meet you soon!</p> ]]></content:encoded>
            <category><![CDATA[Events]]></category>
            <category><![CDATA[Community]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[United Kingdom]]></category>
            <category><![CDATA[Kubernetes]]></category>
            <category><![CDATA[Go]]></category>
            <category><![CDATA[Mobile SDK]]></category>
            <category><![CDATA[Cloudflare Meetups]]></category>
            <category><![CDATA[MeetUp]]></category>
            <guid isPermaLink="false">4KmQqQHsaL4hmb1fHLo2VX</guid>
            <dc:creator>Andrew Fitch</dc:creator>
        </item>
        <item>
            <title><![CDATA[Using Go as a scripting language in Linux]]></title>
            <link>https://blog.cloudflare.com/using-go-as-a-scripting-language-in-linux/</link>
            <pubDate>Tue, 20 Feb 2018 19:49:17 GMT</pubDate>
            <description><![CDATA[ At Cloudflare we like Go. We use it in many in-house software projects as well as parts of bigger pipeline systems. But can we take Go to the next level and use it as a scripting language for our favourite operating system, Linux? ]]></description>
            <content:encoded><![CDATA[ <p>At Cloudflare we like Go. We use it in many <a href="/what-weve-been-doing-with-go/">in-house software projects</a> as well as parts of <a href="/meet-gatebot-a-bot-that-allows-us-to-sleep/">bigger pipeline systems</a>. But can we take Go to the next level and use it as a scripting language for our favourite operating system, Linux?</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4xPPT6bpUh23aWgb47ugak/64a8b0f1ad6d51a000034829b8d26fd4/gopher-tux-1.png" />
            
            </figure><p><a href="https://golang.org/doc/gopher/gophercolor.png">gopher image</a> <a href="https://creativecommons.org/licenses/by/3.0/">CC BY 3.0</a> <a href="http://reneefrench.blogspot.com/">Renee French</a><a href="https://pixabay.com/en/linux-penguin-tux-2025536/">Tux image</a> <a href="https://creativecommons.org/publicdomain/zero/1.0/deed.en">CC0 BY</a> <a href="https://pixabay.com/en/users/OpenClipart-Vectors-30363/">OpenClipart-Vectors</a></p>
    <div>
      <h3>Why consider Go as a scripting language</h3>
      <a href="#why-consider-go-as-a-scripting-language">
        
      </a>
    </div>
    <p>Short answer: why not? Go is relatively easy to learn, not too verbose, and there is a huge ecosystem of libraries which can be reused to avoid writing all the code from scratch. Some other potential advantages it might bring:</p><ul><li><p>Go-based build system for your Go project: the <code>go build</code> command is mostly suitable for small, self-contained projects. More complex projects usually adopt some build system/set of scripts. Why not have these scripts written in Go as well, then?</p></li><li><p>Easy non-privileged package management out of the box: if you want to use a third-party library in your script, you can simply <code>go get</code> it. And because the code will be installed in your <code>GOPATH</code>, getting a third-party library does not require administrative privileges on the system (unlike some other scripting languages). This is especially useful in large corporate environments.</p></li><li><p>Quick code prototyping in early project stages: when you're writing the first iteration of the code, it usually takes a lot of edits even to make it compile, and you have to waste a lot of keystrokes on the <i>"edit-&gt;build-&gt;check"</i> cycle. Instead, you can skip the "build" part and just immediately execute your source file.</p></li><li><p>Strongly-typed scripting language: if you make a small typo somewhere in the middle of the script, most scripts will execute everything up to that point and fail on the typo itself. This might leave your system in an inconsistent state. With strongly-typed languages, many typos can be caught at compile time, so the buggy script will not run in the first place.</p></li></ul>
    <div>
      <h3>Current state of Go scripting</h3>
      <a href="#current-state-of-go-scripting">
        
      </a>
    </div>
    <p>At first glance, Go scripts seem easy to implement given Unix's support for <a href="https://en.wikipedia.org/wiki/Shebang_(Unix)">shebang lines</a>. A shebang line is the first line of a script, which starts with <code>#!</code> and specifies the interpreter to be used to execute the script (for example, <code>#!/bin/bash</code> or <code>#!/usr/bin/env python</code>), so the system knows exactly how to execute the script regardless of the programming language used. And Go already supports interpreter-like invocation for <code>.go</code> files with the <code>go run</code> command, so it should be just a matter of adding a proper shebang line, something like <code>#!/usr/bin/env go run</code>, to any <code>.go</code> file, setting the executable bit, and we're good to go.</p><p>However, there are problems around using <code>go run</code> directly. <a href="https://gist.github.com/posener/73ffd326d88483df6b1cb66e8ed1e0bd">This great post</a> describes in detail all the issues around <code>go run</code> and potential workarounds, but the gist is:</p><ul><li><p><code>go run</code> does not properly return the script's error code back to the operating system, and this is important for scripts, because error codes are one of the most common ways multiple scripts interact with each other and the operating system environment.</p></li><li><p>you can't have a shebang line in a valid <code>.go</code> file, because Go does not know how to process lines starting with <code>#</code>. Other scripting languages do not have this problem, because for most of them <code>#</code> is a way to specify comments, so the final interpreter just ignores the shebang line, but Go comments start with <code>//</code> and <code>go run</code> on invocation will just produce an error like:</p></li></ul>
            <pre><code>package main:
helloscript.go:1:1: illegal character U+0023 '#'</code></pre>
            <p><a href="https://gist.github.com/posener/73ffd326d88483df6b1cb66e8ed1e0bd">The post</a> describes several workarounds for the above issues, including using a custom wrapper program <a href="https://github.com/erning/gorun">gorun</a> as an interpreter, but none of them provides an ideal solution. You either:</p><ul><li><p>have to use a non-standard shebang line, which starts with <code>//</code>. This is technically not even a shebang line, but rather relies on the way the <code>bash</code> shell processes executable text files, so this solution is <code>bash</code>-specific. Also, because of the specific behaviour of <code>go run</code>, this line is rather complex and not obvious (see the <a href="https://gist.github.com/posener/73ffd326d88483df6b1cb66e8ed1e0bd">original post</a> for examples).</p></li><li><p>have to use a custom wrapper program <a href="https://github.com/erning/gorun">gorun</a> in the shebang line, which works well; however, you end up with <code>.go</code> files which are not compilable with the standard <code>go build</code> command because of the illegal <code>#</code> character.</p></li></ul>
    <div>
      <h3>How Linux executes files</h3>
      <a href="#how-linux-executes-files">
        
      </a>
    </div>
    <p>OK, it seems the shebang approach does not provide us with an all-rounder solution. Is there anything else we could use? Let's take a closer look at how the Linux kernel executes binaries in the first place. When you try to execute a binary/script (or any file for that matter which has the executable bit set), your shell in the end will just use the Linux <code>execve</code> <a href="https://en.wikipedia.org/wiki/System_call">system call</a>, passing it the filesystem path of the binary in question, command line parameters and currently defined environment variables. Then the kernel is responsible for correct parsing of the file and creating a new process with the code from the file. Most of us know that Linux (and many other Unix-like operating systems) use the <a href="https://en.wikipedia.org/wiki/Executable_and_Linkable_Format">ELF binary format</a> for their executables.</p><p>However, one of the core principles of Linux kernel development is to avoid "vendor/format lock-in" for any subsystem which is part of the kernel. Therefore, Linux implements a "pluggable" system, which allows any binary format to be supported by the kernel: all you have to do is write a module which can parse the format of your choosing. And if you take a closer look at the kernel source code, you'll see that Linux supports more binary formats out of the box. For example, for the recent <code>4.14</code> Linux kernel <a href="https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/tree/fs?h=linux-4.14.y">we can see</a> that it supports at least 7 binary formats (in-tree modules for various binary formats usually have <code>binfmt_</code> prefix in their names). 
It is worth noting the <a href="https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/tree/fs/binfmt_script.c?h=linux-4.14.y">binfmt_script</a> module, which is responsible for parsing the above-mentioned shebang lines and executing scripts on the target system (not everyone knows that shebang support is actually implemented in the kernel itself and not in the shell or another daemon/process).</p>
    <div>
      <h3>Extending supported binary formats from userspace</h3>
      <a href="#extending-supported-binary-formats-from-userspace">
        
      </a>
    </div>
    <p>But since we concluded that the shebang approach is not the best option for our Go scripting, it seems we need something else. Surprisingly, the Linux kernel already has a <a href="https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/tree/fs/binfmt_misc.c?h=linux-4.14.y">"something else" binary support module</a>, aptly named <code>binfmt_misc</code>. The module allows an administrator to dynamically add support for various executable formats directly from userspace through a well-defined <code>procfs</code> interface and is <a href="https://www.kernel.org/doc/html/v4.14/admin-guide/binfmt-misc.html">well-documented</a>.</p><p>Let's follow <a href="https://www.kernel.org/doc/html/v4.14/admin-guide/binfmt-misc.html">the documentation</a> and try to set up a binary format description for <code>.go</code> files. First of all, the guide tells you to mount the special <code>binfmt_misc</code> filesystem to <code>/proc/sys/fs/binfmt_misc</code>. If you're using a relatively recent systemd-based Linux distribution, it is highly likely the filesystem is already mounted for you, because systemd by default installs special <a href="https://github.com/systemd/systemd/blob/master/units/proc-sys-fs-binfmt_misc.mount">mount</a> and <a href="https://github.com/systemd/systemd/blob/master/units/proc-sys-fs-binfmt_misc.automount">automount</a> units for this purpose. To double-check, just run:</p>
            <pre><code>$ mount | grep binfmt_misc
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=27,pgrp=1,timeout=0,minproto=5,maxproto=5,direct)</code></pre>
            <p>Another way is to check if you have any files in <code>/proc/sys/fs/binfmt_misc</code>: a properly mounted <code>binfmt_misc</code> filesystem will contain at least two special files named <code>register</code> and <code>status</code> in that directory.</p><p>Next, since we do want our <code>.go</code> scripts to be able to properly pass the exit code to the operating system, we need the custom <a href="https://github.com/erning/gorun">gorun</a> wrapper as our "interpreter":</p>
            <pre><code>$ go get github.com/erning/gorun
$ sudo mv ~/go/bin/gorun /usr/local/bin/</code></pre>
            <p>Technically we don't need to move <code>gorun</code> to <code>/usr/local/bin</code> or any other system path, as <code>binfmt_misc</code> requires the full path to the interpreter anyway, but the system may run this executable with arbitrary privileges, so it is a good idea to limit access to the file from a security perspective.</p><p>At this point let's create a simple toy Go script <code>helloscript.go</code> and verify we can successfully "interpret" it. The script:</p>
            <pre><code>package main

import (
	"fmt"
	"os"
)

func main() {
	s := "world"

	if len(os.Args) &gt; 1 {
		s = os.Args[1]
	}

	fmt.Printf("Hello, %v!", s)
	fmt.Println("")

	if s == "fail" {
		os.Exit(30)
	}
}</code></pre>
            <p>Checking if parameter passing and error handling works as intended:</p>
            <pre><code>$ gorun helloscript.go
Hello, world!
$ echo $?
0
$ gorun helloscript.go gopher
Hello, gopher!
$ echo $?
0
$ gorun helloscript.go fail
Hello, fail!
$ echo $?
30</code></pre>
            <p>Now we need to tell the <code>binfmt_misc</code> module how to execute our <code>.go</code> files with <code>gorun</code>. Following <a href="https://www.kernel.org/doc/html/v4.14/admin-guide/binfmt-misc.html">the documentation</a>, we need this configuration string: <code>:golang:E::go::/usr/local/bin/gorun:OC</code>, which basically tells the system: "if you encounter an executable file with the <code>.go</code> extension, please execute it with the <code>/usr/local/bin/gorun</code> interpreter". The <code>OC</code> flags at the end of the string make sure that the script will be executed according to the owner information and permission bits set on the script itself, and not the ones set on the interpreter binary. This makes Go script execution behaviour the same as that of the rest of the executables and scripts in Linux.</p><p>Let's register our new Go script binary format:</p>
            <pre><code>$ echo ':golang:E::go::/usr/local/bin/gorun:OC' | sudo tee /proc/sys/fs/binfmt_misc/register
:golang:E::go::/usr/local/bin/gorun:OC</code></pre>
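<p>For reference, the fields of the registration string (delimited by the leading <code>:</code> character, following the general shape described in the kernel's binfmt-misc documentation) break down as follows:</p>

```
:name:type:offset:magic:mask:interpreter:flags

name        golang                format name; appears as /proc/sys/fs/binfmt_misc/golang
type        E                     match files by extension (M would match magic bytes instead)
offset      (empty)               only used for magic matching
magic       go                    for type E, the file extension to match (*.go)
mask        (empty)               only used for magic matching
interpreter /usr/local/bin/gorun  full path to the program that runs matched files
flags       OC                    run according to the owner/credentials of the script,
                                  not of the interpreter
```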
            <p>If the system successfully registered the format, a new file <code>golang</code> should appear under <code>/proc/sys/fs/binfmt_misc</code> directory. Finally, we can natively execute our <code>.go</code> files:</p>
            <pre><code>$ chmod u+x helloscript.go
$ ./helloscript.go
Hello, world!
$ ./helloscript.go gopher
Hello, gopher!
$ ./helloscript.go fail
Hello, fail!
$ echo $?
30</code></pre>
            <p>That's it! Now we can edit <code>helloscript.go</code> to our liking, and the changes will be immediately visible the next time the file is executed. Moreover, unlike the previous shebang approach, we can compile this file at any time into a real executable with <code>go build</code>.</p><hr /><p><i>Whether you like Go or digging into Linux internals, we have positions for either of these, and even both at once. Check out </i><a href="https://www.cloudflare.com/careers/"><i>our careers page.</i></a></p> ]]></content:encoded>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Go]]></category>
            <category><![CDATA[Tech Talks]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">5Mg46SluRhqUD4mQaKqahI</guid>
            <dc:creator>Ignat Korchagin</dc:creator>
        </item>
        <item>
            <title><![CDATA[Go, don't collect my garbage]]></title>
            <link>https://blog.cloudflare.com/go-dont-collect-my-garbage/</link>
            <pubDate>Mon, 13 Nov 2017 10:31:09 GMT</pubDate>
            <description><![CDATA[ Not long ago I needed to benchmark the performance of Golang on a many-core machine. I took several of the benchmarks that are bundled with the Go source code, copied them, and modified them to run on all available threads. ]]></description>
            <content:encoded><![CDATA[ <p>Not long ago I needed to benchmark the performance of Golang on a many-core machine. I took several of the benchmarks that are bundled with the Go source code, copied them, and modified them to run on all available threads. In this case the machine has 24 cores and 48 threads.</p>
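            <p>The exact benchmark code is not reproduced here, but a minimal sketch of such a harness, built on the standard library's <code>crypto/ecdsa</code>, might look like this (the <code>signOpsPerSec</code> helper is my own naming, not from the original benchmarks):</p>

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/sha256"
	"fmt"
	"sync"
	"time"
)

// signOpsPerSec runs opsPerG ECDSA P-256 signatures on each of n
// goroutines and returns the aggregate signatures per second.
func signOpsPerSec(n, opsPerG int) float64 {
	key, _ := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	digest := sha256.Sum256([]byte("benchmark payload"))

	var wg sync.WaitGroup
	start := time.Now()
	for g := 0; g < n; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < opsPerG; i++ {
				ecdsa.SignASN1(rand.Reader, key, digest[:])
			}
		}()
	}
	wg.Wait()
	return float64(n*opsPerG) / time.Since(start).Seconds()
}

func main() {
	for _, n := range []int{1, 2, 4, 48} {
		fmt.Printf("ECDSA-P256 Sign, %d goroutines: %.2f op/s\n", n, signOpsPerSec(n, 1000))
	}
}
```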
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1im9xk1mwSCF71CvOgrQYP/b7fef48c9fd09fe3fa09ff5b3bababfb/36963798223_b4da5151aa_k.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by-sa/2.0/">CC BY-SA 2.0</a> <a href="https://www.flickr.com/photos/147079914@N03/36963798223/in/photolist-ZhPe1A-Yjn8pV-ZhPcm3-YfT3xL-ZkG9kP-qLEURD-4rPJbB-uwsT5-9aWgeA-92n4h-5LWz68-92n6p-5gE7TJ-3Scj6p-duNYgz-4rTMKJ-8P3YZ3-8QLKYc-CHFrLH-MuQT2E">image</a> by <a href="https://www.flickr.com/photos/147079914@N03/">sponki25</a></p><p>I started with ECDSA P256 Sign, probably because I have a warm feeling for that function, since I <a href="/go-crypto-bridging-the-performance-gap/">optimized it for amd64</a>.</p><p>First, I ran the benchmark on a single goroutine: <code>ECDSA-P256 Sign,30618.50, op/s</code></p><p>That looks good; next I ran it on 48 goroutines: <code>ECDSA-P256 Sign,78940.67, op/s</code>.</p><p>OK, that is not what I expected. Just over 2X speedup, from 24 physical cores? I must be doing something wrong. Maybe Go only uses two cores? I ran <code>top</code>, and it showed 2,266% utilization. That is not the 4,800% I expected, but it is also way above 400%.</p><p>How about taking a step back and running the benchmark on two goroutines? <code>ECDSA-P256 Sign,55966.40, op/s</code>. Almost double, so pretty good. How about four goroutines? <code>ECDSA-P256 Sign,108731.00, op/s</code>. That is actually faster than 48 goroutines. What is going on?</p><p>I ran the benchmark for every number of goroutines from 1 to 48:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4gTjIIpNxaG4OoM8U4M2V3/0e0c43feb83f2b1e462513ad930651a9/goroutines2.png" />
            
            </figure><p>Looks like the number of signatures per second peaks at 274,622 with 17 goroutines, and starts dropping rapidly after that.</p><p>Time to do some profiling.</p>
            <pre><code>(pprof) top 10
Showing nodes accounting for 47.53s, 50.83% of 93.50s total
Dropped 224 nodes (cum &lt;= 0.47s)
Showing top 10 nodes out of 138
      flat  flat%   sum%        cum   cum%
     9.45s 10.11% 10.11%      9.45s 10.11%  runtime.procyield /state/home/vlad/go/src/runtime/asm_amd64.s
     7.55s  8.07% 18.18%      7.55s  8.07%  runtime.futex /state/home/vlad/go/src/runtime/sys_linux_amd64.s
     6.77s  7.24% 25.42%     19.18s 20.51%  runtime.sweepone /state/home/vlad/go/src/runtime/mgcsweep.go
     4.20s  4.49% 29.91%     16.28s 17.41%  runtime.lock /state/home/vlad/go/src/runtime/lock_futex.go
     3.92s  4.19% 34.11%     12.58s 13.45%  runtime.(*mspan).sweep /state/home/vlad/go/src/runtime/mgcsweep.go
     3.50s  3.74% 37.85%     15.92s 17.03%  runtime.gcDrain /state/home/vlad/go/src/runtime/mgcmark.go
     3.20s  3.42% 41.27%      4.62s  4.94%  runtime.gcmarknewobject /state/home/vlad/go/src/runtime/mgcmark.go
     3.09s  3.30% 44.58%      3.09s  3.30%  crypto/elliptic.p256OrdSqr /state/home/vlad/go/src/crypto/elliptic/p256_asm_amd64.s
     3.09s  3.30% 47.88%      3.09s  3.30%  runtime.(*lfstack).pop /state/home/vlad/go/src/runtime/lfstack.go
     2.76s  2.95% 50.83%      2.76s  2.95%  runtime.(*gcSweepBuf).push /state/home/vlad/go/src/runtime/mgcsweepbuf.go</code></pre>
            <p>Clearly Go spends a disproportionate amount of time collecting garbage. All my benchmark does is generate signatures and then dump them.</p><p>So what are our options? The Go runtime states the following:</p><blockquote><p>The GOGC variable sets the initial garbage collection target percentage. A collection is triggered when the ratio of freshly allocated data to live data remaining after the previous collection reaches this percentage. The default is GOGC=100. Setting GOGC=off disables the garbage collector entirely. The runtime/debug package's SetGCPercent function allows changing this percentage at run time. See <a href="https://golang.org/pkg/runtime/debug/#SetGCPercent">https://golang.org/pkg/runtime/debug/#SetGCPercent</a>.</p></blockquote><blockquote><p>The GODEBUG variable controls debugging variables within the runtime. It is a comma-separated list of name=val pairs setting these named variables:</p></blockquote><p>Let’s see what setting <code>GODEBUG</code> to <code>gctrace=1</code> does.</p>
            <pre><code>gc 1 @0.021s 0%: 0.15+0.37+0.25 ms clock, 3.0+0.19/0.39/0.60+5.0 ms cpu, 4-&gt;4-&gt;0 MB, 5 MB goal, 48 P
gc 2 @0.024s 0%: 0.097+0.94+0.16 ms clock, 0.29+0.21/1.3/0+0.49 ms cpu, 4-&gt;4-&gt;1 MB, 5 MB goal, 48 P
gc 3 @0.027s 1%: 0.10+0.43+0.17 ms clock, 0.60+0.48/1.5/0+1.0 ms cpu, 4-&gt;4-&gt;0 MB, 5 MB goal, 48 P
gc 4 @0.028s 1%: 0.18+0.41+0.28 ms clock, 0.18+0.69/2.0/0+0.28 ms cpu, 4-&gt;4-&gt;0 MB, 5 MB goal, 48 P
gc 5 @0.031s 1%: 0.078+0.35+0.29 ms clock, 1.1+0.26/2.0/0+4.4 ms cpu, 4-&gt;4-&gt;0 MB, 5 MB goal, 48 P
gc 6 @0.032s 1%: 0.11+0.50+0.32 ms clock, 0.22+0.99/2.3/0+0.64 ms cpu, 4-&gt;4-&gt;0 MB, 5 MB goal, 48 P
gc 7 @0.034s 1%: 0.18+0.39+0.27 ms clock, 0.18+0.56/2.2/0+0.27 ms cpu, 4-&gt;4-&gt;0 MB, 5 MB goal, 48 P
gc 8 @0.035s 2%: 0.12+0.40+0.27 ms clock, 0.12+0.63/2.2/0+0.27 ms cpu, 4-&gt;4-&gt;0 MB, 5 MB goal, 48 P
gc 9 @0.036s 2%: 0.13+0.41+0.26 ms clock, 0.13+0.52/2.2/0+0.26 ms cpu, 4-&gt;4-&gt;0 MB, 5 MB goal, 48 P
gc 10 @0.038s 2%: 0.099+0.51+0.20 ms clock, 0.19+0.56/1.9/0+0.40 ms cpu, 4-&gt;5-&gt;0 MB, 5 MB goal, 48 P
gc 11 @0.039s 2%: 0.10+0.46+0.20 ms clock, 0.10+0.23/1.3/0.005+0.20 ms cpu, 4-&gt;4-&gt;0 MB, 5 MB goal, 48 P
gc 12 @0.040s 2%: 0.066+0.46+0.24 ms clock, 0.93+0.40/1.7/0+3.4 ms cpu, 4-&gt;4-&gt;0 MB, 5 MB goal, 48 P
gc 13 @0.041s 2%: 0.099+0.30+0.20 ms clock, 0.099+0.60/1.7/0+0.20 ms cpu, 4-&gt;4-&gt;0 MB, 5 MB goal, 48 P
gc 14 @0.042s 2%: 0.095+0.45+0.24 ms clock, 0.38+0.58/2.0/0+0.98 ms cpu, 4-&gt;5-&gt;0 MB, 5 MB goal, 48 P
gc 15 @0.044s 2%: 0.095+0.45+0.21 ms clock, 1.0+0.78/1.9/0+2.3 ms cpu, 4-&gt;4-&gt;0 MB, 5 MB goal, 48 P
gc 16 @0.045s 3%: 0.10+0.45+0.23 ms clock, 0.10+0.70/2.1/0+0.23 ms cpu, 4-&gt;5-&gt;0 MB, 5 MB goal, 48 P
gc 17 @0.046s 3%: 0.088+0.40+0.17 ms clock, 0.088+0.45/1.9/0+0.17 ms cpu, 4-&gt;4-&gt;0 MB, 5 MB goal, 48 P
.
.
.
.
gc 6789 @9.998s 12%: 0.17+0.91+0.24 ms clock, 0.85+1.8/5.0/0+1.2 ms cpu, 4-&gt;6-&gt;1 MB, 6 MB goal, 48 P
gc 6790 @10.000s 12%: 0.086+0.55+0.24 ms clock, 0.78+0.30/4.2/0.043+2.2 ms cpu, 4-&gt;5-&gt;1 MB, 6 MB goal, 48 P
</code></pre>
            <p>The first round of GC kicks in at 0.021s, then it starts collecting every 3ms, and then every 1ms. That is insane: the benchmark runs for 10 seconds, and I saw 6,790 rounds of GC. The number that starts with @ is the time since program start, followed by a percentage that supposedly states the amount of time spent collecting garbage. This number is clearly misleading, because the performance indicates at least 90% of the time is wasted (indirectly) on GC, not 12%. The synchronization overhead is not taken into account. What really is interesting are the three numbers separated by arrows. They show the size of the heap at GC start, GC end, and the live heap size. Remember that a collection is triggered when the ratio of freshly allocated data to live data remaining after the previous collection reaches the GOGC percentage, which defaults to 100%.</p><p>I am running a benchmark where all allocated data is immediately discarded, and collected at the next GC cycle. The only live heap is fixed to the Go runtime, and having more goroutines does not add to the live heap. In contrast, the freshly allocated data grows much faster with each additional goroutine, triggering increasingly frequent and expensive GC cycles.</p><p>Clearly what I needed to do next was to run the benchmark with the GC disabled, by setting <code>GOGC=off</code>. This led to a dramatic improvement: <code>ECDSA-P256 Sign,413740.30, op/s</code>.</p><p>But still not the number I was looking for, and running an application without garbage collection is unsustainable in the long run. I started playing with the <code>GOGC</code> variable. First I set it to 2,400, which made sense since we have 24 cores: perhaps collecting garbage 24 times less frequently would do the trick: <code>ECDSA-P256 Sign,671538.90, op/s</code>. Oh my, that is getting better.</p><p>What if I tried 4,800, for the number of threads? <code>ECDSA-P256 Sign,685810.90, op/s</code>. Getting warmer.</p><p>I ran a script to find the best value, from 100 to 20,000, in increments of 100. This is what I got:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2zXh0sUUSQ5Qjuf9RaTPl1/226f072ccfc516fbe92dc0531d0bfa71/gogc.png" />
            
            </figure><p>Looks like the optimal value for <code>GOGC</code> in this case is 11,300, and it gets us 691,054 signatures/second. That is 22.56X faster than the single-core score, and overall pretty good for a 24-core processor. Remember that when running on a single core, the CPU frequency is 3.0GHz, and only 2.1GHz when running on all cores.</p><p>Per-goroutine performance when running with <code>GOGC=11300</code> now looks like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4Qjl1zChKIwSlMcLjVanpE/66d0e438bb5cd24ba15442ae0484e193/goroutines3.png" />
            
            </figure><p>The scaling looks much better, and even past 24 goroutines, when we run out of physical cores and start sharing them with hyper-threading, the overall performance still improves.</p><p>The bottom line here is that although this type of benchmarking is definitely an edge case for garbage collection, where 48 threads allocate large amounts of short-lived data, this situation can occur in real-world scenarios. As many-core CPUs become a commodity, one should be aware of the pitfalls.</p><p>Most languages with garbage collection offer some sort of garbage collection control. Go has the GOGC variable, which can also be controlled with the SetGCPercent function in the runtime/debug package. Don't be afraid to tune the GC to suit your needs.</p><p>We're always looking for Go programmers, so if you found this blog post interesting, why not check out our <a href="https://www.cloudflare.com/careers/">jobs page</a>?</p> ]]></content:encoded>
            <category><![CDATA[Go]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Programming]]></category>
            <guid isPermaLink="false">mPdR2DMS30TQDNQOzb1He</guid>
            <dc:creator>Vlad Krasnov</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare Wants to Buy Your Meetup Group Pizza]]></title>
            <link>https://blog.cloudflare.com/cloudflare-wants-to-buy-your-meetup-group-pizza/</link>
            <pubDate>Fri, 10 Nov 2017 15:00:00 GMT</pubDate>
            <description><![CDATA[ If you’re a web dev / devops / etc. meetup group that also works toward building a faster, safer Internet, I want to support your awesome group by buying you pizza.  ]]></description>
            <content:encoded><![CDATA[ <p></p><p>If you’re a web dev / devops / etc. meetup group that also works toward building a faster, safer Internet, I want to support your awesome group by buying you pizza. If your group’s focus falls within one of the subject categories below and you’re willing to give us a 30 second shout out and tweet a photo of your group and <a href="https://twitter.com/Cloudflare">@Cloudflare</a>, your meetup’s pizza expense will be reimbursed.</p><p><a href="https://docs.google.com/document/d/1wvJjiS1iuzf_XYNaah20xTwNfjcBQeiqjmrHi9CEFWM/edit?usp=sharing">Get Your Pizza $ Reimbursed »</a></p>
    <div>
      <h3>Developer Relations at Cloudflare &amp; why we’re doing this</h3>
      <a href="#developer-relations-at-cloudflare-why-were-doing-this">
        
      </a>
    </div>
    <p>I’m <a href="https://twitter.com/fitchaj">Andrew Fitch</a> and I work on the Developer Relations team at <a href="https://www.cloudflare.com/">Cloudflare</a>. One of the things I like most about working in DevRel is empowering community members who are already doing great things out in the world. Whether they’re starting conferences, hosting local meetups, or writing educational content, I think it’s important to support them in their efforts and reward them for doing what they do. Community organizers are the glue that holds developers together socially. Let’s support them and make their lives easier by taking care of the pizza part of the equation.</p>
            <figure>
            <a href="https://www.cloudflare.com/apps/">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2LFJE5nWDC74sln8mrtauZ/de98413ff1b6f611461c5c4649444c60/FingerPhotography-Cloudfare-8733.jpg" />
            </a>
            </figure>
    <div>
      <h4>What’s in it for Cloudflare?</h4>
      <a href="#whats-in-it-for-cloudflare">
        
      </a>
    </div>
    <ol><li><p>We want web developers to target the <a href="https://www.cloudflare.com/apps/">apps platform</a></p></li><li><p>We want more people to think about <a href="https://www.cloudflare.com/careers/">working at Cloudflare</a></p></li><li><p>Some people only know of <a href="https://www.cloudflare.com/">Cloudflare</a> as a CDN, but it’s actually a security company and does much more than that. General awareness helps tell the story of what Cloudflare is about.</p></li></ol>
    <div>
      <h4>What kinds of groups we most want to support</h4>
      <a href="#what-kinds-of-groups-we-most-want-to-support">
        
      </a>
    </div>
    <p>We want to work with groups that are closely aligned with Cloudflare, groups which focus on web development, web security, devops, or tech ops. We often work with language-specific meetups (such as Go London User Group) or diversity &amp; inclusion meetups (such as Women Who Code Portland). We’d also love to work with college student groups and any other coding groups which are focused on sharing technical knowledge with the community.</p><p>To get sponsored pizza, groups must be focused on...</p><ul><li><p>Web performance &amp; optimization</p></li><li><p>Front-end frameworks</p></li><li><p>Language-specific (JavaScript / PHP / Go / Lua / etc.)</p></li><li><p>Diversity &amp; inclusion meetups for web devs / devops / tech ops</p></li><li><p>Workshops / classes for web devs / devops / tech ops</p></li></ul>
    <div>
      <h4>How it works</h4>
      <a href="#how-it-works">
        
      </a>
    </div>
    <ol><li><p>Interested groups need to read our <a href="https://docs.google.com/document/d/1wvJjiS1iuzf_XYNaah20xTwNfjcBQeiqjmrHi9CEFWM/edit">Cloudflare Meetup Group Pizza Reimbursement Rules document</a>.</p></li><li><p>If the group falls within the requirements, the group should proceed to schedule their meetup and pull the <a href="https://docs.google.com/presentation/d/1afluXpf2eYeQ7SkmV2ONrojeZwosMTNifhnZR1chNTw/edit?usp=sharing">Cloudflare Introductory Slides</a> ahead of time.</p></li><li><p>At the event, an organizer should present the Cloudflare Introductory Slides at the beginning, giving us a 30 second shout out, and take a photo of the group in front of the slides. Someone from the group should Tweet the photo and @Cloudflare, so our Community Manager may retweet.</p></li><li><p>After the event, an organizer will need to fill out our <a href="https://docs.google.com/forms/d/e/1FAIpQLSe7iY-mjum-hfEtu8W08uf8e6H5IMDevRgCZ_YXtzaEwkLEag/viewform">reimbursement form</a> where they will upload a <a href="https://www.irs.gov/pub/irs-pdf/fw9.pdf">completed W-9</a>, their mailing address, group details, and the link to the Tweet.</p></li><li><p>Within a month, the organizer should have a check from Cloudflare (and some laptop stickers for their group, if they want) in their hands.</p></li></ol>
    <div>
      <h4>Here are some examples of groups we’ve worked with so far</h4>
      <a href="#here-are-some-examples-of-groups-weve-worked-with-so-far">
        
      </a>
    </div>
    <p><a href="https://www.meetup.com/preview/Go-London-User-Group"><b>Go London User Group (GLUG)</b></a></p>
            <figure>
            <a href="https://twitter.com/LondonGophers/status/920353032746995712">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4oWPqXK3ZLOH0BA3mbJO7M/ddc7337f74e01171d7e91655dfa35f0e/Screen-Shot-2017-11-01-at-2.32.00-PM.png" />
            </a>
            </figure><p>The Go London User Group is a London-based community for anyone interested in the Go programming language.</p><p>GLUG provides offline opportunities to:</p><ul><li><p>Discuss Go and related topics</p></li><li><p>Socialize with friendly people who are interested in Go</p></li><li><p>Find or fill Go-related jobs</p></li></ul><p>They want GLUG to be a diverse and inclusive community. As such all attendees, organizers and sponsors are required to follow the <a href="https://golang.org/conduct">community code of conduct</a>.</p><p>This group’s mission is very much in alignment with Cloudflare’s guidelines, so we sponsored the pizza for their <a href="https://www.meetup.com/preview/Go-London-User-Group/events/243800263">October Gophers</a> event.</p><p><a href="https://www.meetup.com/preview/Women-Who-Code-Portland"><b>Women Who Code Portland</b></a></p>
            <figure>
            <a href="https://twitter.com/superissarah/status/923370521684688896">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1t1S26RRHOCUC2b3YwEAMD/5aec722ff115aab5edbb7da5700bee35/Screen-Shot-2017-11-01-at-2.37.09-PM.png" />
            </a>
            </figure><p>Women Who Code’s mission statement</p><p>Women Who Code is a global nonprofit dedicated to inspiring women to excel in technology careers by creating a global, connected community of women in technology.</p><p>What they offer</p><ul><li><p>Monthly Networking Nights</p></li><li><p>Study Nights (JavaScript, React, DevOps, Design + Product, Algorithms)</p></li><li><p>Technical Workshops</p></li><li><p>Community Events (Hidden Figures screening, Women Who Strength Train, Happy Hours, etc.)</p></li><li><p>Free or discounted tickets to conferences</p></li></ul><p>Women Who Code are a great example of an inclusive technical group. Cloudflare reimbursed them for their pizza at their <a href="https://www.meetup.com/preview/Women-Who-Code-Portland/events/242929022">October JavaScript Study Night</a>.</p><p><a href="https://www.meetup.com/preview/PHX-Android"><b>GDG Phoenix / PHX Android</b></a></p>
            <figure>
            <a href="https://twitter.com/GDGPhoenix/status/924314879875465216">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2PyMqJ00eEnGypcoxjmQST/878da7d678babf0a506f8ff099449db8/Screen-Shot-2017-11-01-at-2.41.46-PM.png" />
            </a>
            </figure><p>The GDG Phoenix / PHX Android group is for anyone interested in Android design and development. All skill levels are welcome. Participants are welcome to come to the meetup to hang out and watch, or to be part of the conversation at events.</p><p>This group is associated with the <a href="https://developers.google.com/groups/">Google Developer Group</a> program and coordinates activities with the national organization. Group activities are not exclusively Android-related, however; they can cover any technology.</p><p>Cloudflare sponsored their October event “<a href="https://www.meetup.com/preview/PHX-Android/events/242897401">How espresso works . . . Android talk with Michael Bailey</a>”.</p><hr /><p>I hope a lot of awesome groups around the world will take advantage of our offer. <b>Enjoy the pizza!</b></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5kKeJF3uYWwJyAkZ67TmQg/1f948496f05932879b8165e87730e456/giphy-downsized.gif" />
            
            </figure><p><a href="https://docs.google.com/document/d/1wvJjiS1iuzf_XYNaah20xTwNfjcBQeiqjmrHi9CEFWM/edit?usp=sharing">Get Your Pizza $ Reimbursed »</a></p><p>© 2017 Cloudflare, Inc. All rights reserved. The Cloudflare logo and marks are trademarks of Cloudflare. All other company and product names may be trademarks of the respective companies with which they are associated.</p> ]]></content:encoded>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[JavaScript]]></category>
            <category><![CDATA[php]]></category>
            <category><![CDATA[LUA]]></category>
            <category><![CDATA[Go]]></category>
            <category><![CDATA[MeetUp]]></category>
            <category><![CDATA[Cloudflare Meetups]]></category>
            <category><![CDATA[Community]]></category>
            <guid isPermaLink="false">5luIi07exijfTyC4t91k0H</guid>
            <dc:creator>Andrew Fitch</dc:creator>
        </item>
        <item>
            <title><![CDATA[ARM Takes Wing:  Qualcomm vs. Intel CPU comparison]]></title>
            <link>https://blog.cloudflare.com/arm-takes-wing/</link>
            <pubDate>Wed, 08 Nov 2017 20:03:14 GMT</pubDate>
            <description><![CDATA[ One of the nicer perks I have here at Cloudflare is access to the latest hardware, long before it even reaches the market. Until recently I mostly played with Intel hardware.  ]]></description>
            <content:encoded><![CDATA[ <p>One of the nicer perks I have here at Cloudflare is access to the latest hardware, long before it even reaches the market.</p><p>Until recently I mostly played with Intel hardware. For example, Intel supplied us with an engineering sample of their Skylake based Purley platform back in August 2016, to give us time to evaluate it and optimize our software. As a former Intel Architect, who did a lot of work on Skylake (as well as Sandy Bridge, Ivy Bridge and Icelake), I really enjoy that.</p><p>Our previous generation of servers was based on the Intel Broadwell micro-architecture. Our configuration includes dual-socket Xeons E5-2630 v4, with 10 cores each, running at 2.2GHz, with a 3.1GHz turboboost and hyper-threading enabled, for a total of 40 threads per server.</p><p>Since Intel was, and still is, the undisputed leader of the server CPU market with greater than 98% market share, our upgrade process until now was pretty straightforward: every year Intel releases a new generation of CPUs, and every year we buy them. In the process we usually get two extra cores per socket, and all the extra architectural features such an upgrade brings: hardware AES and CLMUL in Westmere, AVX in Sandy Bridge, AVX2 in Haswell, etc.</p><p>In the current upgrade cycle, our next server processor ought to be the Xeon Silver 4116, also in a dual-socket configuration. In fact, we have already purchased a significant number of them. Each CPU has 12 cores, but it runs at a lower frequency of 2.1GHz, with 3.0GHz turboboost. It also has a smaller last-level cache: 1.375 MiB/core, compared to the 2.5 MiB the Broadwell processors had. In addition, the Skylake based platform supports 6 memory channels and the AVX-512 instruction set.</p><p>As we head into 2018, however, change is in the air. For the first time in a while, Intel has serious competition in the server market: Qualcomm and Cavium both have new server platforms based on the ARMv8 64-bit architecture (aka aarch64 or arm64). Qualcomm has the Centriq platform (code name Amberwing), based on the Falkor core, and Cavium has the ThunderX2 platform, based on the ahm ... ThunderX2 core?</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2OerKvEoy4NXTwjEpDA1tY/d8d9c4dad28ab5bc7100710b3dcf644e/25704115174_061e907e57_o.jpg" />
            
            </figure><p>The majestic Amberwing powered by the Falkor CPU <a href="https://creativecommons.org/licenses/by-sa/2.0/">CC BY-SA 2.0</a> <a href="https://www.flickr.com/photos/drphotomoto/25704115174/in/photolist-FaoiVy-oqK2j2-f6A8TL-6PTBkR-gRasf9-f2R2Hz-7bZeUp-fmxSzZ-o9fogQ-8evb42-f4tgSX-eGzXYi-6umTDd-8evd8H-gdCU2L-uhCbnz-fmxSsX-oxnuko-wb7in9-oqsJSH-uxxAS2-CzS4Eh-6y8KQA-brLjKf-YT2jrY-eGG5QJ-8pLnKt-8eyvgY-cnQqJs-fXYs9f-f2R2jK-28ahBA-fXYjkD-a9K25u-289gvW-PHrqDS-cmkggf-Ff9NXa-EhMcP4-f36dMm-289xP7-Ehrz1y-f2QZRZ-GqT3vt-uUeHBq-xUDQoa-ymMxE9-wWFi3q-MDva8W-8CWG5X">image</a> by <a href="https://www.flickr.com/photos/drphotomoto/">DrPhotoMoto</a></p><p>Recently, both Qualcomm and Cavium provided us with engineering samples of their ARM based platforms, and in this blog post I would like to share my findings about Centriq, the Qualcomm platform.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4t48heslIVcJiaqGTuxwhl/a968a4b7f0a6a5dd25fc8481029db354/IMG_7222.jpg" />
            
            </figure><p>The actual Amberwing in question</p>
    <div>
      <h3>Overview</h3>
      <a href="#overview">
        
      </a>
    </div>
    <p>I tested the Qualcomm Centriq server, and compared it with our newest Intel Skylake based server and previous Broadwell based server.</p><table><tr><td><p><b>Platform</b></p></td><td><p><b>Grantley
(Intel)</b></p></td><td><p><b>Purley
(Intel)</b></p></td><td><p><b>Centriq
(Qualcomm)</b></p></td></tr><tr><td><p>Core</p></td><td><p>Broadwell</p></td><td><p>Skylake</p></td><td><p>Falkor</p></td></tr><tr><td><p>Process</p></td><td><p>14nm</p></td><td><p>14nm</p></td><td><p>10nm</p></td></tr><tr><td><p>Issue</p></td><td><p>8 µops/cycle</p></td><td><p>8 µops/cycle</p></td><td><p>8 instructions/cycle</p></td></tr><tr><td><p>Dispatch</p></td><td><p>4 µops/cycle</p></td><td><p>5 µops/cycle</p></td><td><p>4 instructions/cycle</p></td></tr><tr><td><p># Cores</p></td><td><p>10 x 2S + HT (40 threads)</p></td><td><p>12 x 2S + HT (48 threads)</p></td><td><p>46</p></td></tr><tr><td><p>Frequency</p></td><td><p>2.2GHz (3.1GHz turbo)</p></td><td><p>2.1GHz (3.0GHz turbo)</p></td><td><p>2.5 GHz</p></td></tr><tr><td><p>LLC</p></td><td><p>2.5 MB/core</p></td><td><p>1.375 MB/core</p></td><td><p>1.25 MB/core</p></td></tr><tr><td><p>Memory Channels</p></td><td><p>4</p></td><td><p>6</p></td><td><p>6</p></td></tr><tr><td><p>TDP</p></td><td><p>170W (85W x 2S)</p></td><td><p>170W (85W x 2S)</p></td><td><p>120W</p></td></tr><tr><td><p>Other features</p></td><td><p>AES
CLMUL
AVX2</p></td><td><p>AES
CLMUL
AVX512</p></td><td><p>AES
CLMUL
NEON
Trustzone
CRC32</p></td></tr></table><p>Overall, on paper, Falkor looks very competitive. In theory a Falkor core can process 8 instructions/cycle, the same as Skylake or Broadwell, and it has a higher base frequency at a lower TDP rating.</p>
    <div>
      <h3>Ecosystem readiness</h3>
      <a href="#ecosystem-readiness">
        
      </a>
    </div>
    <p>Up until now, a major obstacle to the deployment of ARM servers was the lack of support, or weak support, from the majority of software vendors. In the past two years, ARM’s enablement efforts have paid off, as most Linux distros, as well as most popular libraries, support the 64-bit ARM architecture. Driver availability, however, is unclear at this point.</p><p>At Cloudflare, we run a complex software stack that consists of many integrated services, and running each of them efficiently is a top priority.</p><p>On the edge we have the NGINX server software, which does support ARMv8. NGINX is written in C, and it also uses several libraries written in C, such as zlib and BoringSSL, therefore solid C compiler support is very important.</p><p>In addition, our flavor of NGINX is highly integrated with the <a href="https://github.com/openresty/lua-nginx-module">lua-nginx-module</a>, and we rely a lot on <a href="/pushing-nginx-to-its-limit-with-lua/">LuaJIT</a>.</p><p>Finally, a lot of our services, such as our DNS server, <a href="/what-weve-been-doing-with-go/#sts=RRDNS">RRDNS</a>, are written in Go.</p><p>The good news is that both gcc and clang not only support ARMv8 in general, but have optimization profiles for the Falkor core.</p><p>Go has official support for ARMv8 as well, and the arm64 backend improves constantly.</p><p>As for LuaJIT, the stable version, 2.0.5, does not support ARMv8, but the beta version, 2.1.0, does. Let’s hope it gets out of beta soon.</p>
    <div>
      <h3>Benchmarks</h3>
      <a href="#benchmarks">
        
      </a>
    </div>
    
    <div>
      <h4>OpenSSL</h4>
      <a href="#openssl">
        
      </a>
    </div>
    <p>The first benchmark I wanted to perform was OpenSSL version 1.1.1 (development version), using the bundled <code>openssl speed</code> tool. Although we recently switched to BoringSSL, I still prefer OpenSSL for benchmarking, because it has almost equally well-optimized assembly code paths for both ARMv8 and the latest Intel processors.</p><p>In my opinion, handcrafted assembly is the best measure of a CPU’s potential, as it bypasses the compiler bias.</p>
    <div>
      <h4>Public key cryptography</h4>
      <a href="#public-key-cryptography">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5jqlU0sjyAGrZaT2Y1x8JR/9fc0bccc931f0e8f8cbf1c8fcd2bdecf/pub_key_1_core-2.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/19nPDhEQXutllHbY08nO6O/0da82dc7260b6dbd732cc3cf79455ba8/pub_key_all_core-2.png" />
            
            </figure><p>Public key cryptography is all about raw ALU performance. It is interesting, but not surprising, to see that in the single-core benchmark, the Broadwell core is faster than Skylake, and both in turn are faster than Falkor. This is because Broadwell runs at a higher frequency, while architecturally it is not much inferior to Skylake.</p><p>Falkor is at a disadvantage here. First, in a single-core benchmark, the turbo is engaged, meaning the Intel processors run at a higher frequency. Second, in Broadwell, Intel introduced two special instructions to accelerate big number multiplication: ADCX and ADOX. These perform two independent add-with-carry operations per cycle, whereas ARM can only do one. Similarly, the ARMv8 instruction set does not have a single instruction to perform 64-bit multiplication; instead it uses a pair of MUL and UMULH instructions.</p><p>Nevertheless, at the SoC level, Falkor wins big time. It is only marginally slower than Skylake at RSA2048 signatures, and only because RSA2048 does not have an optimized implementation for ARM. The <a href="https://www.cloudflare.com/learning/dns/dnssec/ecdsa-and-dnssec/">ECDSA</a> performance is ridiculously fast. A single Centriq chip can satisfy the ECDSA needs of almost any company in the world.</p><p>It is also very interesting to see Skylake outperform Broadwell by a 30% margin, despite losing the single-core benchmark and only having 20% more cores. This can be explained by more efficient all-core turbo, and improved hyper-threading.</p>
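            <p>To make the multiply-with-carry point concrete, here is one limb step of a big-number multiply-accumulate written in portable Go (a sketch for illustration, not OpenSSL's actual code): <code>bits.Mul64</code> maps to a single wide multiply on Intel, but lowers to the MUL and UMULH pair on ARMv8, and the add-with-carry chain that follows is where ADCX and ADOX help Intel.</p>

```go
package main

import (
	"fmt"
	"math/bits"
)

// mulAddLimb computes acc + x*y + carryIn for one 64-bit limb,
// returning the low word and the carry for the next limb.
func mulAddLimb(acc, x, y, carryIn uint64) (lo, carryOut uint64) {
	hi, low := bits.Mul64(x, y)        // one MULX on amd64; a MUL+UMULH pair on arm64
	low, c1 := bits.Add64(low, acc, 0) // add-with-carry chain
	low, c2 := bits.Add64(low, carryIn, 0)
	return low, hi + c1 + c2 // hi+c1+c2 cannot overflow 64 bits
}

func main() {
	lo, carry := mulAddLimb(1, 3, 5, 0)
	fmt.Println(lo, carry) // 3*5 + 1 = 16, carry 0
}
```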
    <div>
      <h4>Symmetric key cryptography</h4>
      <a href="#symmetric-key-cryptography">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5c1gb11NBOBVq2PyFyxHIJ/05ddba2c522eecce2fe1fceca405c3ac/sym_key_1_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3HzScBUWaLPW2l7t0bvsuo/8c235d691f67a6fd9bbfe2176e3037d4/sym_key_all_core.png" />
            
            </figure><p>Symmetric key performance of the Intel cores is outstanding.</p><p>AES-GCM uses a combination of special hardware instructions to accelerate AES and CLMUL (carryless multiplication). Intel first introduced those instructions back in 2010, with their Westmere CPU, and every generation since they have improved their performance. ARM introduced a set of similar instructions just recently, with their 64-bit instruction set, and as an optional extension. Fortunately every hardware vendor I know of implemented those. It is very likely that Qualcomm will improve the performance of the cryptographic instructions in future generations.</p><p>ChaCha20-Poly1305 is a more generic algorithm, designed in such a way as to better utilize wide SIMD units. The Qualcomm CPU only has the 128-bit wide NEON SIMD, while Broadwell has 256-bit wide AVX2, and Skylake has 512-bit wide AVX-512. This explains the huge lead Skylake has over both in single core performance. In the all-cores benchmark the Skylake lead lessens, because it has to lower the clock speed when executing AVX-512 workloads. When executing AVX-512 on all cores, the base frequency goes down to just 1.4GHz---keep that in mind if you are mixing AVX-512 and other code.</p><p>The bottom line for symmetric crypto is that although Skylake has the lead, Broadwell and Falkor both have good enough performance for any real life scenario, especially considering the fact that on our edge, RSA consumes more CPU time than all the other crypto algorithms combined.</p>
    <div>
      <h3>Compression</h3>
      <a href="#compression">
        
      </a>
    </div>
    <p>The next benchmark I wanted to see was compression. This is for two reasons. First, it is a very important workload on the edge, as having better compression saves bandwidth, and helps deliver content faster to the client. Second, it is a very demanding workload, with a high rate of branch mispredictions.</p><p>Obviously the first benchmark would be the popular zlib library. At Cloudflare, we use an <a href="/cloudflare-fights-cancer/">improved version of the library</a>, optimized for 64-bit Intel processors, and although it is written mostly in C, it does use some Intel-specific intrinsics. Comparing this optimized version to the generic zlib library wouldn’t be fair. Not to worry: with little effort I <a href="https://github.com/cloudflare/zlib/tree/vlad/aarch64">adapted the library</a> to work very well on the ARMv8 architecture, with the use of NEON and CRC32 intrinsics. As a result, it is twice as fast as the generic library for some files.</p><p>The second benchmark is the emerging brotli library; it is written in C, which allows for a level playing field for all platforms.</p><p>All the benchmarks are performed on the HTML of <a href="/">blog.cloudflare.com</a>, in memory, similar to the way NGINX performs streaming compression. The size of the specific version of the HTML file is 29,329 bytes, making it a good representative of the type of files we usually compress. The parallel benchmark compresses multiple files in parallel, as opposed to compressing a single file on many threads, also similar to the way NGINX works.</p>
    <div>
      <h4>gzip</h4>
      <a href="#gzip">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2pi68sRQQHPJLXZO5vIbhW/72c6e8568f4d95e3e8d592dd72577ea8/gzip_1_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/S2VrG4fcrmbDUeOtJ453S/bbc5092e3565339bb7cb44ba7c8dd14d/gzip_all_core.png" />
            
</figure><p>When using gzip, at the single core level Skylake is the clear winner. Despite running at a lower frequency than Broadwell, its lower branch misprediction penalty seems to help it pull ahead. The Falkor core is not far behind, especially at lower quality settings. At the system level Falkor performs significantly better, thanks to its higher core count. Note how well gzip scales on multiple cores.</p>
    <div>
      <h4>brotli</h4>
      <a href="#brotli">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5LA2CLLkYa9iCe1eQzgs6x/1ff12aa354d80166407fe4f9fd538d78/brot_1_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1cuPwzCY4eJzgsOXmNrLXu/f7be60b7ac3497c13b3c0a2abe6ac522/brot_all_core.png" />
            
</figure><p>With brotli on a single core the situation is similar. Skylake is the fastest, but Falkor is not far behind, and at quality setting 9, Falkor is actually faster. Brotli at quality level 4 performs very similarly to gzip at level 5, while actually compressing slightly better (8,010B vs 8,187B).</p><p>When performing many-core compression, the situation becomes a bit messy. For levels 4, 5 and 6 brotli scales very well. At levels 7 and 8 we start seeing lower performance per core, bottoming out at level 9, where we get less than 3x the single-core performance when running on all cores.</p><p>My understanding is that at those quality levels brotli consumes significantly more memory, and starts thrashing the cache. The scaling improves again at levels 10 and 11.</p><p>Bottom line for brotli: Falkor wins, since we would not consider going above quality 7 for dynamic compression.</p>
    <div>
      <h3>Golang</h3>
      <a href="#golang">
        
      </a>
    </div>
    <p>Golang is another very important language for Cloudflare. It is also one of the first languages to offer ARMv8 support, so one would expect good performance. I used some of the built-in benchmarks, but modified them to run on multiple goroutines.</p>
    <div>
      <h4>Go crypto</h4>
      <a href="#go-crypto">
        
      </a>
    </div>
    <p>I would like to start the benchmarks with crypto performance. Thanks to OpenSSL we have good reference numbers, and it is interesting to see just how good the Go library is.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/64U9LsIbiAKp7Ut75mMmhj/c73c1142233e260e8733b079dac06ff8/go_pub_key_1_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5e9MFqm53p78KdB211Z8rZ/bc16992088b6aa826819bec645f99581/go_pub_key_all_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3TZyK91xQAJk4yfvYLnVjZ/c33931e0ead0f425c157746b9d637fc5/go_sym_key_1_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1S2LKr04vDIi1WrrZNaJn1/827505e17425e0521142cb397e209504/go_sym_key_all_core.png" />
            
</figure><p>As far as Go crypto is concerned, ARM and Intel are not even on the same playing field. Go has highly optimized assembly code for ECDSA, AES-GCM and ChaCha20-Poly1305 on Intel. It also has Intel-optimized math functions, used in RSA computations. All of those are missing for ARMv8, putting it at a big disadvantage.</p><p>Nevertheless, the gap can be bridged with relatively small effort, and we know that with the right optimizations, performance can be on par with OpenSSL. Even a very minor change, such as implementing the function <a href="https://go-review.googlesource.com/c/go/+/76270">addMulVVW</a> in assembly, led to a more than tenfold improvement in RSA performance, putting Falkor ahead of both Broadwell and Skylake, with 8,009 signatures/second.</p><p>Another interesting thing to note: on Skylake, the Go ChaCha20-Poly1305 code, which uses AVX2, performs almost identically to the OpenSSL AVX-512 code; this is again due to AVX2 running at higher clock speeds.</p>
    <div>
      <h4>Go gzip</h4>
      <a href="#go-gzip">
        
      </a>
    </div>
    <p>Next in Go performance is gzip. Here again we have a reference point to pretty well optimized code, and we can compare it to Go. In the case of the gzip library, there are no Intel specific optimizations in place.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1GbKt2HcdJT5eNMdfsLw4d/7c55522afbfa58653a91c28479de8c3d/go_gzip_1_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/59TTlWALNqIkryEc7IwOnB/73ee80ab27313c3f55d3cb22c06b6704/go_gzip_all_core.png" />
            
</figure><p>Gzip performance is pretty good. The single core Falkor performance is way below both Intel processors, but at the system level it manages to outperform Broadwell, though it still lags behind Skylake. Since we already know that Falkor outperforms both when C is used, this can only mean that Go’s backend for ARMv8 is still pretty immature compared to gcc.</p>
    <div>
      <h4>Go regexp</h4>
      <a href="#go-regexp">
        
      </a>
    </div>
    <p>Regexp is widely used in a variety of tasks, so its performance is quite important too. I ran the built-in benchmarks on 32KB strings.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2AhmqRQhO37Z03KXfz9PUk/9b39646f7e800c9a7c8565134d516815/go_regexp_easy_1_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6PH0IDtvODvYcGbECgOvEt/fbb7c4a099ad2b478aa12b8f3b7fb0a0/go_regexp_easy_all_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7eAhDauk00ZR4A37NPqQm5/8f1c94f9d34087b57b8672a4928e76a7/go_regexp_comp_1_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3ytsG1skhi4B1v24tqlrWw/d0d803c543a990854bd60916ed03b089/go_regexp_comp_all_core.png" />
            
</figure><p>Go regexp performance is not very good on Falkor. In the medium and hard tests it takes second place, thanks to its higher core count, but Skylake is significantly faster still.</p><p>Profiling shows that a lot of the time is spent in the function bytes.IndexByte. This function has an assembly implementation for amd64 (runtime.indexbytebody), but only a generic Go implementation on ARM. The easy regexp tests spend most of their time in this function, which explains the even wider gap there.</p>
    <div>
      <h4>Go strings</h4>
      <a href="#go-strings">
        
      </a>
    </div>
    <p>Another important library for a web server is the Go strings library. I only tested the basic Replacer class here.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/G20n7Ru5izVdUBOMM8af7/9a257ef47b5a38e3f745a22da7f36da5/go_str_1_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2W6GTABFY1t23YOIhyLc1A/ce51cb7081718a724b797f39e7ff80e2/go_str_all_core.png" />
            
</figure><p>In this test again, Falkor lags behind, and loses even to Broadwell. Profiling shows significant time is spent in the function runtime.memmove. Guess what? It has highly optimized assembly code for amd64 that uses AVX2, but only very simple ARM assembly that copies 8 bytes at a time. By changing three lines in that code to use the LDP/STP instructions (load pair/store pair) and copy 16 bytes at a time, I improved the performance of memmove by 30%, which resulted in 20% faster EscapeString and UnescapeString performance. And that is just scratching the surface.</p>
    <div>
      <h4>Go conclusion</h4>
      <a href="#go-conclusion">
        
      </a>
    </div>
    <p>Go support for aarch64 is quite disappointing. I am very happy to say that everything compiles and works flawlessly, but on the performance side, things should get better. It seems like the enablement effort so far was concentrated on the compiler back end, while the library was left largely untouched. There is a lot of low-hanging optimization fruit out there, as my 20-minute fix for <a href="https://go-review.googlesource.com/c/go/+/76270">addMulVVW</a> clearly shows. Qualcomm and other ARMv8 vendors intend to put significant engineering resources into amending this situation, but really anyone can contribute to Go. So if you want to leave your mark, now is the time.</p>
    <div>
      <h3>LuaJIT</h3>
      <a href="#luajit">
        
      </a>
    </div>
    <p>Lua is the glue that holds Cloudflare together.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/eTdeElawEbvCLpjM9XMUa/8527a83cf68acc1e6a48af94860b190c/luajit_1_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ohY0uFyotdk0eJt9lU725/472f32803de7ecd6e2d95f27948aa85a/luajit_all_cores.png" />
            
</figure><p>Except for the binary_trees benchmark, the performance of LuaJIT on ARM is very competitive. It wins two benchmarks, and is nearly tied in a third.</p><p>That being said, binary_trees is a very important benchmark, because it triggers many memory allocations and garbage collection cycles. It will require deeper investigation in the future.</p>
    <div>
      <h3>NGINX</h3>
      <a href="#nginx">
        
      </a>
    </div>
    <p>For the NGINX workload, I decided to generate a load that would resemble an actual server.</p><p>I set up a server that serves the HTML file used in the gzip benchmark, over HTTPS, with the ECDHE-ECDSA-AES128-GCM-SHA256 cipher suite.</p><p>It also uses LuaJIT to redirect the incoming request and remove all line breaks and extra spaces from the HTML file, while adding a timestamp. The HTML is then compressed using brotli with quality 5.</p><p>Each server was configured to run as many workers as it has virtual CPUs: 40 for Broadwell, 48 for Skylake and 46 for Falkor.</p><p>As the client for this test, I used the <a href="https://github.com/rakyll/hey">hey</a> program, running from 3 Broadwell servers.</p><p>Concurrently with the test, we took power readings from the respective BMC units of each server.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3d3OUyJ1m0q11IoWUNCyTp/5cde6f742d596d71d14e5f1355a4b570/nginx.png" />
            
</figure><p>With the NGINX workload Falkor handled almost the same number of requests as the Skylake server, and both significantly outperform Broadwell. The power readings taken from the BMC show that it did so while consuming less than half the power of the other processors. That means Falkor managed 214 requests/watt vs Skylake’s 99 requests/watt and Broadwell’s 77.</p><p>I was a bit surprised to see Skylake and Broadwell consume about the same amount of power, given that both are manufactured with the same process, and Skylake has more cores.</p><p>The low power consumption of Falkor is not surprising: Qualcomm processors are known for their great power efficiency, which has allowed them to be a dominant player in the mobile phone CPU market.</p>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>The engineering sample of Falkor we got certainly impressed me a lot. This is a huge step up from any previous attempt at ARM based servers. Certainly core for core, the Intel Skylake is far superior, but when you look at the system level the performance becomes very attractive.</p><p>The production version of the Centriq SoC will feature up to 48 Falkor cores, running at a frequency of up to 2.6GHz, for a potential additional 8% better performance.</p><p>Obviously the Skylake server we tested is not the flagship Platinum unit that has 28 cores, but those 28 cores come both with a big price and over 200W TDP, whereas we are interested in improving our bang for buck metric, and performance per watt.</p><p>Currently, my main concern is weak Go language performance, but that is bound to improve quickly once ARM based servers start gaining some market share.</p><p>Both C and LuaJIT performance is very competitive, and in many cases outperforms the Skylake contender. In almost every benchmark Falkor shows itself as a worthy upgrade from Broadwell.</p><p>The largest win by far for Falkor is the low power consumption. Although it has a TDP of 120W, during my tests it never went above 89W (for the go benchmark). In comparison, Skylake and Broadwell both went over 160W, while the TDP of the two CPUs is 170W.</p><p><i>If you enjoy testing and selecting hardware on behalf of millions of Internet properties, come </i><a href="https://www.cloudflare.com/careers/"><i>join us</i></a><i>.</i></p> ]]></content:encoded>
            <category><![CDATA[SSL]]></category>
            <category><![CDATA[OpenSSL]]></category>
            <category><![CDATA[Compression]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[LUA]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Go]]></category>
            <category><![CDATA[Cryptography]]></category>
            <guid isPermaLink="false">3Rk7Ip66PBd0Hn31OM4NuO</guid>
            <dc:creator>Vlad Krasnov</dc:creator>
        </item>
        <item>
            <title><![CDATA[WHOIS going to be at the Grace Hopper Celebration?]]></title>
            <link>https://blog.cloudflare.com/ghc/</link>
            <pubDate>Tue, 03 Oct 2017 10:00:00 GMT</pubDate>
            <description><![CDATA[ Ubuntu us are doing the round trip! It’s time to live - WAN you arrive at GHC, come meet us and say HELO (we love GNU faces, we’ll be very api to meet you). When you’re exhausted like IPv4, git over to the Cloudflare corner to reboot. ]]></description>
            <content:encoded><![CDATA[ <p><a href="https://en.wikipedia.org/wiki/Ubuntu_(operating_system)">Ubuntu</a> us are doing the <a href="https://en.wikipedia.org/wiki/Round-trip_delay_time">round trip</a>! It’s <a href="https://en.wikipedia.org/wiki/Time_to_live">time to live</a> - <a href="https://en.wikipedia.org/wiki/Wide_area_network">WAN</a> you arrive at GHC, come meet us and say <a href="https://en.wikipedia.org/wiki/Simple_Mail_Transfer_Protocol">HELO</a> (we love <a href="https://en.wikipedia.org/wiki/GNU">GNU</a> faces, we’ll be very <a href="https://en.wikipedia.org/wiki/Application_programming_interface">api</a> to meet you). When you’re exhausted like <a href="https://en.wikipedia.org/wiki/IPv4">IPv4</a>, <a href="https://en.wikipedia.org/wiki/Git">git</a> over to the Cloudflare corner to reboot –– we’ll have chargers and Wi-Fi (it’s not a <a href="https://en.wikipedia.org/wiki/Transmission_Control_Protocol#CONNECTION-ESTABLISHMENT">SYN</a> to <a href="https://en.wikipedia.org/wiki/Representational_state_transfer">REST</a>). <a href="https://en.wikipedia.org/wiki/R_(programming_language)">R</a> booth can be your <a href="https://en.wikipedia.org/wiki/Esc_key">ESC</a>. 
Then Thursday morning <a href="https://www.eventbrite.com/e/grace-hopper-better-together-breakfast-sponsored-by-cloudflare-zendesk-tickets-38404240116">we’re hosting a breakfast</a> <a href="https://en.wikipedia.org/wiki/Bash_(Unix_shell)">bash</a> with Zendesk –– it will be quite the <a href="https://en.wikipedia.org/wiki/Assembly_language">Assembly</a>, you should definitely <a href="https://en.wikipedia.org/wiki/Go_(programming_language)">Go</a>, <a href="https://en.wikipedia.org/wiki/Compiler">compile</a> a bowl of <a href="https://en.wikipedia.org/wiki/Serial_communication">serial</a>, drink a <a href="https://en.wikipedia.org/wiki/Bit">bit</a> of <a href="https://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing">CIDR</a> or a cup of <a href="https://en.wikipedia.org/wiki/Tee_(command)">tee</a>.</p><p>I’m also speaking at 1:30PM on Wednesday in OCCC W414 <a href="https://en.wikipedia.org/wiki/Cryptographic_hash_function">hashing</a> out encryption and updates for IoT –– <a href="https://en.wikipedia.org/wiki/Data_Encryption_Standard">DES</a> should be a fun <a href="https://en.wikipedia.org/wiki/Session_(computer_science)">session</a>.</p><p><a href="https://en.wikipedia.org/wiki/Transmission_Control_Protocol">ACK!</a> I did <a href="https://en.wikipedia.org/wiki/Network_address_translation">NAT</a> tell you how to find us. <a href="https://en.wikipedia.org/wiki/Checksum">Check for sum</a> women in capes a few <a href="https://en.wikipedia.org/wiki/Hop_(networking)">hops away</a> from the booths with the lava <a href="https://en.wikipedia.org/wiki/LAMP_(software_bundle)">LAMP</a> stack. I'm the one with <a href="https://en.wikipedia.org/wiki/CURL">cURLs</a>.</p><p>In <a href="https://en.wikipedia.org/wiki/D_(programming_language)">D</a> air! Excited to <a href="https://en.wikipedia.org/wiki/Local_area_network">LAN</a>d. <a href="https://en.wikipedia.org/wiki/C_(programming_language)">C</a> you soon.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1tjItgsesKtLjcPjox0PI0/231114e3a4bc152a74ecc2b9f77a0534/gradient-Corgi.png" />
            
            </figure> ]]></content:encoded>
            <category><![CDATA[Events]]></category>
            <category><![CDATA[Grace Hopper]]></category>
            <category><![CDATA[API]]></category>
            <category><![CDATA[IPv4]]></category>
            <category><![CDATA[Go]]></category>
            <guid isPermaLink="false">6tkfLf5eLUToHtJa1JZTXr</guid>
            <dc:creator>Dani Grant</dc:creator>
        </item>
        <item>
            <title><![CDATA[Go Hack Nights at Cloudflare]]></title>
            <link>https://blog.cloudflare.com/go-hack-nights/</link>
            <pubDate>Sun, 01 Oct 2017 17:00:00 GMT</pubDate>
            <description><![CDATA[ Recently we launched an internal monthly Go Hack Night at our San Francisco office, open to anyone who works at Cloudflare regardless of their department or position.  ]]></description>
            <content:encoded><![CDATA[ <p>At Cloudflare we're extensively using the <a href="https://golang.org/">Go programming language</a> to build a better Internet. Go is a free and open source programming language created by Google in 2007 and open sourced in 2009. Earlier this year, Go made news when it entered the list of top 10 programming languages on the <a href="https://www.tiobe.com/tiobe-index/go/">TIOBE Index</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5hCaWYCUgyRkHAUTHVqxoD/bd0e22393405c711621296474dd4aa86/IMG_6782-pixelized.JPG.jpeg" />
            
</figure><p>Recently we launched an internal monthly Go Hack Night at our San Francisco office, open to anyone who works at Cloudflare regardless of their department or position. Everyone from newbie programmers to our most experienced Go engineers is encouraged to attend, and experienced engineers are asked to throw on a mentor badge and help guide colleagues with installing and learning Go.</p><p>We had over 30 attendees at our inaugural Go Hack Night, and our survey reveals some great stats:</p><ul><li><p>26% of attendees were completely new to programming</p></li><li><p>61% of attendees were experienced in other languages but new to Go</p></li><li><p>Every attendee said they learned something!</p></li></ul><p>We actively encourage an inclusive learning culture and we're super excited to make the Go programming language more accessible to our entire company.</p><p>If you're interested in working with Go and helping to build a better Internet, <a href="https://www.cloudflare.com/careers/">we're hiring</a>!</p><p>P.S. if you're attending the <a href="https://ghc.anitaborg.org/2017-attend/">2017 Grace Hopper Celebration</a> this week, stop by the Cloudflare booth to say hello!</p> ]]></content:encoded>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[Go]]></category>
            <category><![CDATA[Hackathon]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Events]]></category>
            <guid isPermaLink="false">ScENo9yfkgQmkK8Gu4qWa</guid>
            <dc:creator>Ryan Djurovich</dc:creator>
        </item>
        <item>
            <title><![CDATA[The complete guide to Go net/http timeouts]]></title>
            <link>https://blog.cloudflare.com/the-complete-guide-to-golang-net-http-timeouts/</link>
            <pubDate>Wed, 29 Jun 2016 13:09:27 GMT</pubDate>
            <description><![CDATA[ When writing an HTTP server or client in Go, timeouts are amongst the easiest and most subtle things to get wrong: there’s many to choose from, and a mistake can have no consequences for a long time, until the network glitches and the process hangs. ]]></description>
<content:encoded><![CDATA[ <p>When writing an HTTP server or client in Go, timeouts are amongst the easiest and most subtle things to get wrong: there are many to choose from, and a mistake can have no consequences for a long time, until the network glitches and the process hangs.</p><p><a href="https://www.cloudflare.com/learning/ddos/glossary/hypertext-transfer-protocol-http/">HTTP</a> is a complex multi-stage protocol, so there's no one-size-fits-all solution to timeouts. Think about a <a href="https://www.cloudflare.com/learning/video/what-is-streaming/">streaming</a> endpoint versus a JSON API versus a <a href="https://en.wikipedia.org/wiki/Comet_%28programming%29">Comet</a> endpoint. Indeed, the defaults are often not what you want.</p><p>In this post I’ll take apart the various stages you might need to apply a timeout to, and look at the different ways to do it, on both the Server and the Client side.</p>
    <div>
      <h3>SetDeadline</h3>
      <a href="#setdeadline">
        
      </a>
    </div>
    <p>First, you need to know about the network primitive that Go exposes to implement timeouts: Deadlines.</p><p>Exposed by <a href="https://golang.org/pkg/net/#Conn"><code>net.Conn</code></a> with the <code>Set[Read|Write]Deadline(time.Time)</code> methods, a Deadline is an absolute point in time which, once reached, makes all I/O operations fail with a timeout error.</p><p><b>Deadlines are not timeouts.</b> Once set they stay in force forever (or until the next call to <code>SetDeadline</code>), regardless of whether and how the connection is used in the meantime. So to build a timeout with <code>SetDeadline</code> you'll have to call it before <i>every</i> <code>Read</code>/<code>Write</code> operation.</p><p>You probably don't want to call <code>SetDeadline</code> yourself; instead, let <code>net/http</code> call it for you, using its higher-level timeouts. However, keep in mind that all timeouts are implemented in terms of Deadlines, so they <b>do NOT reset every time data is sent or received</b>.</p>
    <div>
      <h3>Server Timeouts</h3>
      <a href="#server-timeouts">
        
      </a>
    </div>
    <p><i>The </i><a href="/exposing-go-on-the-internet/"><i>"So you want to expose Go on the Internet" post</i></a><i> has more information on server timeouts, in particular about HTTP/2 and Go 1.7 bugs.</i></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/sOlH0CkRueHYFHY9JhYag/3688eee879f10624eb42b75302000929/Timeouts-001.png" />
            
            </figure><p>It's critical for an HTTP server exposed to the Internet to enforce timeouts on client connections. Otherwise very slow or disappearing clients might leak file descriptors and eventually result in something along the lines of:</p>
            <pre><code>http: Accept error: accept tcp [::]:80: accept4: too many open files; retrying in 5ms</code></pre>
            <p>There are two timeouts exposed in <code>http.Server</code>: <code>ReadTimeout</code> and <code>WriteTimeout</code>. You set them by explicitly using a Server:</p>
            <pre><code>srv := &amp;http.Server{
    ReadTimeout: 5 * time.Second,
    WriteTimeout: 10 * time.Second,
}
log.Println(srv.ListenAndServe())</code></pre>
            <p><code>ReadTimeout</code> covers the time from when the connection is accepted to when the request body is fully read (if you do read the body, otherwise to the end of the headers). It's implemented in <code>net/http</code> by calling <code>SetReadDeadline</code> <a href="https://github.com/golang/go/blob/3ba31558d1bca8ae6d2f03209b4cae55381175b3/src/net/http/server.go#L750">immediately after Accept</a>.</p><p><code>WriteTimeout</code> normally covers the time from the end of the request header read to the end of the response write (a.k.a. the lifetime of the ServeHTTP), by calling <code>SetWriteDeadline</code> <a href="https://github.com/golang/go/blob/3ba31558d1bca8ae6d2f03209b4cae55381175b3/src/net/http/server.go#L753-L755">at the end of readRequest</a>.</p><p>However, when the connection is <a href="https://www.cloudflare.com/learning/ssl/what-is-https/">HTTPS</a>, <code>SetWriteDeadline</code> is called <a href="https://github.com/golang/go/blob/3ba31558d1bca8ae6d2f03209b4cae55381175b3/src/net/http/server.go#L1477-L1483">immediately after Accept</a> so that it also covers the packets written as part of the TLS handshake. Annoyingly, this means that (in that case only) <code>WriteTimeout</code> ends up including the header read and the first byte wait.</p><p>You should set both timeouts when you deal with untrusted clients and/or networks, so that a client can't hold up a connection by being slow to write or read.</p><p>Finally, there's <a href="https://golang.org/pkg/net/http/#TimeoutHandler"><code>http.TimeoutHandler</code></a>. It’s not a Server parameter, but a Handler wrapper that limits the maximum duration of <code>ServeHTTP</code> calls. It works by buffering the response, and sending a <i>504 Gateway Timeout</i> instead if the deadline is exceeded. Note that it is <a href="https://github.com/golang/go/issues/15327">broken in 1.6 and fixed in 1.6.2</a>.</p>
    <div>
      <h4>http.ListenAndServe is doing it wrong</h4>
      <a href="#http-listenandserve-is-doing-it-wrong">
        
      </a>
    </div>
    <p>Incidentally, this means that the package-level convenience functions that bypass <code>http.Server</code> like <code>http.ListenAndServe</code>, <code>http.ListenAndServeTLS</code> and <code>http.Serve</code> are unfit for public Internet servers.</p><p>Those functions leave the Timeouts to their default off value, with no way of enabling them, so if you use them you'll soon be leaking connections and run out of file descriptors. I've made this mistake at least half a dozen times.</p><p>Instead, create a <code>http.Server</code> instance with ReadTimeout and WriteTimeout and use its corresponding methods, like in the example a few paragraphs above.</p>
    <div>
      <h4>About streaming</h4>
      <a href="#about-streaming">
        
      </a>
    </div>
    <p>Very annoyingly, there is no way of accessing the underlying <code>net.Conn</code> from <code>ServeHTTP</code> so a server that intends to stream a response is forced to unset the <code>WriteTimeout</code> (which is also possibly why they are 0 by default). This is because without <code>net.Conn</code> access, there is no way of calling <code>SetWriteDeadline</code> before each <code>Write</code> to implement a proper idle (not absolute) timeout.</p><p>Also, there's no way to cancel a blocked <code>ResponseWriter.Write</code> since <code>ResponseWriter.Close</code> (which you can access via an interface upgrade) is not documented to unblock a concurrent Write. So there's no way to build a timeout manually with a Timer, either.</p><p>Sadly, this means that streaming servers can't really defend themselves from a slow-reading client.</p><p>I submitted <a href="https://github.com/golang/go/issues/16100">an issue with some proposals</a>, and I welcome feedback there.</p>
    <div>
      <h3>Client Timeouts</h3>
      <a href="#client-timeouts">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/255U6cjjLCH1vUEJoA9QWo/b73eae7e11ea4e42c847a5ee9c9c8f96/Timeouts-002.png" />
            
            </figure><p>Client-side timeouts can be simpler or much more complex, depending on which ones you use, but they are just as important for preventing leaked resources or stuck requests.</p><p>The easiest one to use is the <code>Timeout</code> field of <a href="https://golang.org/pkg/net/http/#Client"><code>http.Client</code></a>. It covers the entire exchange, from Dial (if a connection is not reused) to reading the body.</p>
            <pre><code>c := &amp;http.Client{
    Timeout: 15 * time.Second,
}
resp, err := c.Get("https://blog.filippo.io/")</code></pre>
            <p>Like the server-side case above, the package level functions such as <code>http.Get</code> use <a href="https://golang.org/pkg/net/http/#DefaultClient">a Client without timeouts</a>, so are dangerous to use on the open Internet.</p><p>For more granular control, there are a number of other more specific timeouts you can set:</p><ul><li><p><code>net.Dialer.Timeout</code> limits the time spent establishing a TCP connection (if a new one is needed).</p></li><li><p><code>http.Transport.TLSHandshakeTimeout</code> limits the time spent performing the TLS handshake.</p></li><li><p><code>http.Transport.ResponseHeaderTimeout</code> limits the time spent reading the headers of the response.</p></li><li><p><code>http.Transport.ExpectContinueTimeout</code> limits the time the client will wait between sending the request headers <i>when including an </i><code><i>Expect: 100-continue</i></code> and receiving the go-ahead to send the body. Note that setting this in 1.6 <a href="https://github.com/golang/go/issues/14391">will disable HTTP/2</a> (<code>DefaultTransport</code> <a href="https://github.com/golang/go/commit/406752b640fcc56a9287b8454564cffe2f0021c1#diff-6951e7593bfb1e773c9121df44df1c36R179">is special-cased from 1.6.2</a>).</p></li></ul>
            <pre><code>c := &amp;http.Client{
    Transport: &amp;http.Transport{
        Dial: (&amp;net.Dialer{
            Timeout:   30 * time.Second,
            KeepAlive: 30 * time.Second,
        }).Dial,
        TLSHandshakeTimeout:   10 * time.Second,
        ResponseHeaderTimeout: 10 * time.Second,
        ExpectContinueTimeout: 1 * time.Second,
    },
}</code></pre>
            <p>As far as I can tell, there's no way to limit the time spent sending the request specifically. The time spent reading the response body can be controlled manually with a <code>time.Timer</code>, since it happens after the Client method returns (see below for how to cancel a request).</p><p>Finally, new in 1.7, there's <code>http.Transport.IdleConnTimeout</code>. It does not control a blocking phase of a client request, but how long an idle connection is kept in the connection pool.</p><p>Note that a Client will follow redirects by default. <code>http.Client.Timeout</code> includes all time spent following redirects, while the granular timeouts are specific to each request, since <code>http.Transport</code> is a lower-level system that has no concept of redirects.</p>
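            <p>Setting <code>IdleConnTimeout</code> is just another <code>Transport</code> field; a small sketch (the 90 second and 15 second values are arbitrary examples):</p>
            <pre><code>package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	t := &amp;http.Transport{
		// How long an idle (keep-alive) connection may sit in the
		// pool before being closed. Not a per-request timeout.
		IdleConnTimeout: 90 * time.Second,
	}
	c := &amp;http.Client{
		Transport: t,
		Timeout:   15 * time.Second, // still bound the whole exchange
	}
	fmt.Println(c.Timeout) // 15s
}</code></pre>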
    <div>
      <h4>Cancel and Context</h4>
      <a href="#cancel-and-context">
        
      </a>
    </div>
    <p><code>net/http</code> offers two ways to cancel a client request: <code>Request.Cancel</code> and, new in 1.7, Context.</p><p><code>Request.Cancel</code> is an optional channel that, when set and then closed, causes the request to abort as if <code>Client.Timeout</code> had been hit. (They are actually implemented through the same mechanism, and while writing this post I <a href="https://github.com/golang/go/issues/16094">found a bug</a> in 1.7 where all cancellations would be returned as timeout errors.)</p><p>We can use <code>Request.Cancel</code> and <code>time.Timer</code> to build a more granular timeout that allows streaming, pushing the deadline back every time we successfully read some data from the Body:</p>
            <pre><code>package main

import (
	"io"
	"io/ioutil"
	"log"
	"net/http"
	"time"
)

func main() {
	c := make(chan struct{})
	timer := time.AfterFunc(5*time.Second, func() {
		close(c)
	})

	// Serve 256 bytes every second.
	req, err := http.NewRequest("GET", "http://httpbin.org/range/2048?duration=8&amp;chunk_size=256", nil)
	if err != nil {
		log.Fatal(err)
	}
	req.Cancel = c

	log.Println("Sending request...")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	log.Println("Reading body...")
	for {
		timer.Reset(2 * time.Second)
		// Try instead: timer.Reset(50 * time.Millisecond)
		_, err = io.CopyN(ioutil.Discard, resp.Body, 256)
		if err == io.EOF {
			break
		} else if err != nil {
			log.Fatal(err)
		}
	}
}</code></pre>
            <p>In the example above, we put a timeout of 5 seconds on the Do phase of the request, but then we spend at least 8 seconds reading the body in 8 rounds, each with a timeout of 2 seconds. We could go on streaming like this forever without risk of getting stuck. If we stopped receiving body data for more than 2 seconds, <code>io.CopyN</code> would return <code>net/http: request canceled</code>.</p><p>In 1.7 the <code>context</code> package graduated to the standard library. There's <a href="https://blog.golang.org/context">a lot to learn about Contexts</a>, but for our purposes you should know that they replace and deprecate <code>Request.Cancel</code>.</p><p>To use Contexts to cancel a request, we obtain a new Context and its <code>cancel()</code> function with <code>context.WithCancel</code> and create a Request bound to it with <code>Request.WithContext</code>. When we want to cancel the request, we cancel the Context by calling <code>cancel()</code> (instead of closing the Cancel channel):</p>
            <pre><code>ctx, cancel := context.WithCancel(context.TODO())
timer := time.AfterFunc(5*time.Second, func() {
	cancel()
})

req, err := http.NewRequest("GET", "http://httpbin.org/range/2048?duration=8&amp;chunk_size=256", nil)
if err != nil {
	log.Fatal(err)
}
req = req.WithContext(ctx)</code></pre>
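            <p>For a fixed deadline like ours, <code>context.WithTimeout</code> rolls the <code>WithCancel</code> call and the timer into one. A self-contained sketch, with an <code>httptest</code> server standing in for a slow remote (the 100ms deadline and 2 second delay are illustrative):</p>
            <pre><code>package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

func main() {
	// A test server that won't answer within our deadline.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(2 * time.Second)
	}))
	defer srv.Close()

	// WithTimeout is WithCancel plus an automatic cancel() after 100ms.
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	req, err := http.NewRequest("GET", srv.URL, nil)
	if err != nil {
		panic(err)
	}
	req = req.WithContext(ctx)

	_, err = http.DefaultClient.Do(req)
	fmt.Println(err != nil) // true: the context expired before the response
}</code></pre>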
            <p>Contexts have the advantage that if the parent context (the one we passed to <code>context.WithCancel</code>) is canceled, ours will be, too, propagating the cancellation down the entire pipeline.</p><p>This is all. I hope I didn't exceed your <code>ReadDeadline</code>!</p><p>If this kind of deep dive into the Go standard library sounds entertaining to you, know that <a href="https://www.cloudflare.com/join-our-team/">we are hiring in London, Austin (TX), Champaign (IL), San Francisco and Singapore.</a></p> ]]></content:encoded>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[API]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[Go]]></category>
            <guid isPermaLink="false">5IUTFQotCX6VHDqnrENjnx</guid>
            <dc:creator>Filippo Valsorda</dc:creator>
        </item>
        <item>
            <title><![CDATA[Building the simplest Go static analysis tool]]></title>
            <link>https://blog.cloudflare.com/building-the-simplest-go-static-analysis-tool/</link>
            <pubDate>Wed, 27 Apr 2016 15:01:15 GMT</pubDate>
            <description><![CDATA[ Go native vendoring (a.k.a. GO15VENDOREXPERIMENT) allows you to freeze dependencies by putting them in a vendor folder in your project. The compiler will then look there before searching the GOPATH. ]]></description>
            <content:encoded><![CDATA[ <p><a href="https://docs.google.com/document/d/1Bz5-UB7g2uPBdOx-rw5t9MxJwkfpx90cqG9AFL0JAYo/edit">Go native vendoring</a> (a.k.a. GO15VENDOREXPERIMENT) allows you to freeze dependencies by putting them in a <code>vendor</code> folder in your project. The compiler will then look there before searching the GOPATH.</p><p>The only annoyance compared to using a per-project GOPATH, which is what we used to do, is that you might forget to vendor a package that you have in your GOPATH. The program will build for you, but it won't for anyone else. Back to the <a href="https://www.urbandictionary.com/define.php?term=wfm">WFM</a> times!</p><p>I decided I wanted something, a tool, to check that all my (non-stdlib) dependencies were vendored.</p><p>At first I thought of using <a href="https://golang.org/cmd/go/#hdr-List_packages"><code>go list</code></a>, which Dave Cheney appropriately called a <a href="http://dave.cheney.net/2014/09/14/go-list-your-swiss-army-knife">swiss army knife</a>, but while it can show the entire recursive dependency tree (format <code>.Deps</code>), there's no way to know from the templating engine if a dependency is in the standard library.</p><p>We could just pass each output back into <code>go list</code> to check for <code>.Standard</code>, but I thought this would be a good occasion to build a very simple static analysis tool. Go's simplicity and libraries make it a very easy task, as you will see.</p>
    <div>
      <h3>First, loading the program</h3>
      <a href="#first-loading-the-program">
        
      </a>
    </div>
    <p>We use <a href="https://godoc.org/golang.org/x/tools/go/loader"><code>golang.org/x/tools/go/loader</code></a> to load the packages passed as arguments on the command line, including the test files based on a flag.</p>
            <pre><code>var conf loader.Config
for _, p := range flag.Args() {
    if *tests {
        conf.ImportWithTests(p)
    } else {
        conf.Import(p)
    }
}
prog, err := conf.Load()
if err != nil {
    log.Fatal(err)
}
for p := range prog.AllPackages {
    fmt.Println(p.Path())
}</code></pre>
            <p>With these few lines we already replicated <code>go list -f {{ .Deps }}</code>!</p><p>The only missing loading feature here is wildcard (<code>./...</code>) support. That code <a href="https://github.com/golang/go/blob/87bca88c703c1f14fe8473dc2f07dc521cf2b989/src/cmd/go/main.go#L365">is in the go tool source</a>, but it's unexported. There's an <a href="https://github.com/golang/go/issues/8768">issue</a> about exposing it, but for now tools <a href="https://github.com/golang/lint/blob/58f662d2fc0598c6c36a92ae29af1caa6ec89d7a/golint/import.go">just copy-paste it</a>. We'll use a packaged version of that code, <a href="https://github.com/kisielk/gotool"><code>github.com/kisielk/gotool</code></a>:</p>
            <pre><code>for _, p := range gotool.ImportPaths(flag.Args()) {</code></pre>
            <p>Finally, since we are only interested in the dependency tree today, we instruct the parser to only go as far as the import statements, and we ignore the resulting "not used" errors:</p>
            <pre><code>conf.ParserMode = parser.ImportsOnly
conf.AllowErrors = true
conf.TypeChecker.Error = func(error) {}</code></pre>
            
    <div>
      <h3>Then, the actual logic</h3>
      <a href="#then-the-actual-logic">
        
      </a>
    </div>
    <p>We now have a <code>loader.Program</code> object, which holds references to various <code>loader.PackageInfo</code> objects, which in turn are a combination of package, AST and types information: everything you need to perform any kind of complex analysis. Not that we are going to do that today :)</p><p>We'll just replicate <a href="https://github.com/golang/go/blob/87bca88c703c1f14fe8473dc2f07dc521cf2b989/src/cmd/go/pkg.go#L183-L194">the <code>go list</code> logic to recognize stdlib packages</a> and remove the packages passed on the command line from the list:</p>
            <pre><code>initial := make(map[*loader.PackageInfo]bool)
for _, pi := range prog.InitialPackages() {
    initial[pi] = true
}

var packages []*loader.PackageInfo
for _, pi := range prog.AllPackages {
    if initial[pi] {
        continue
    }
    if len(pi.Files) == 0 {
        continue // virtual stdlib package
    }
    filename := prog.Fset.File(pi.Files[0].Pos()).Name()
    if !strings.HasPrefix(filename, build.Default.GOROOT) ||
        !isStandardImportPath(pi.Pkg.Path()) {
        packages = append(packages, pi)
    }
}</code></pre>
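            <p>The snippet above leans on <code>isStandardImportPath</code>, which the linked go tool source defines roughly as: an import path is standard if its first path element contains no dot. A self-contained sketch of that helper (adapted from the go tool source):</p>
            <pre><code>package main

import (
	"fmt"
	"strings"
)

// isStandardImportPath reports whether the import path names a standard
// library package: its first path element contains no dot, so "net/http"
// is standard while "github.com/kisielk/gotool" is not.
func isStandardImportPath(path string) bool {
	elem := path
	if i := strings.Index(path, "/"); i >= 0 {
		elem = path[:i]
	}
	return !strings.Contains(elem, ".")
}

func main() {
	fmt.Println(isStandardImportPath("net/http"))                  // true
	fmt.Println(isStandardImportPath("github.com/kisielk/gotool")) // false
}</code></pre>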
            <p>Then we just have to print a warning if any remaining package is not in a <code>/vendor/</code> folder:</p>
            <pre><code>for _, pi := range packages {
    if !strings.Contains(pi.Pkg.Path(), "/vendor/") {
        fmt.Println("[!] dependency not vendored:", pi.Pkg.Path())
    }
}</code></pre>
            <p>Done! You can find the tool here: <a href="https://github.com/FiloSottile/vendorcheck">https://github.com/FiloSottile/vendorcheck</a></p>
    <div>
      <h3>Further reading</h3>
      <a href="#further-reading">
        
      </a>
    </div>
    <p><a href="https://github.com/golang/example/tree/master/gotypes#gotypes-the-go-type-checker">This document</a> maintained by Alan Donovan will tell you more than I'll ever know about Go's static analysis tooling.</p><p>Note that you might be tempted to use <code>go/importer</code> and <code>types.Importer[From]</code> instead of <code>x/go/loader</code>. Don't do that. It doesn't load the source but reads compiled <code>.a</code> files, which <b>can be stale or missing</b>. Static analysis tools that spit out "package not found" for existing packages or, worse, incorrect results because of this are a pet peeve of mine.</p><p><i>If you now feel the urge to write static analysis tools, know that the CloudFlare Go team </i><a href="https://www.cloudflare.com/join-our-team/"><i>is hiring in London, San Francisco and Singapore</i></a><i>!</i></p> ]]></content:encoded>
            <category><![CDATA[Tools]]></category>
            <category><![CDATA[Go]]></category>
            <category><![CDATA[Programming]]></category>
            <guid isPermaLink="false">7f5NBXh02bwJ9WyQmBdtZK</guid>
            <dc:creator>Filippo Valsorda</dc:creator>
        </item>
    </channel>
</rss>