It's well known that we're heavy users of the Go programming language at CloudFlare. Our work often involves delving into the standard library source code to understand internal code paths, error handling and performance characteristics.
Recently, I looked at how the standard library's built-in HTTP client handles connections to remote servers in order to provide minimal roundtrip latency.
Connection pooling
A common pattern that aims to avoid connection setup costs (such as the TCP handshake and TLS negotiation), and to keep control over the number of concurrently established connections, is to pool them. net/http maintains a pool of connections to each remote host that supports Connection: keep-alive. The default size of the pool is two idle connections per remote host.
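To see the knobs involved, here's a minimal sketch of tuning that pool through the exported http.Transport fields; the pool size of 16 and the example.com endpoint are arbitrary placeholders:

package main

import (
    "fmt"
    "io"
    "io/ioutil"
    "net/http"
)

func main() {
    // The Transport owns the connection pool. MaxIdleConnsPerHost
    // raises the per-host idle limit above the default,
    // http.DefaultMaxIdleConnsPerHost (2).
    t := &http.Transport{
        MaxIdleConnsPerHost: 16,
    }
    client := &http.Client{Transport: t}

    // example.com is a placeholder endpoint.
    resp, err := client.Get("http://example.com/")
    if err != nil {
        fmt.Println("request failed:", err)
        return
    }
    // Drain and close the body so the connection can go back
    // into the idle pool for reuse.
    io.Copy(ioutil.Discard, resp.Body)
    resp.Body.Close()
}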
More interestingly, when you make a request with net/http, a race happens. Races in code are often an unwanted side effect, but in this case it's intentional. Two goroutines operate in parallel: one tries to dial a connection to the remote host, and another tries to retrieve an idle connection from the connection pool. The fastest goroutine wins.
To illustrate, let's look at the code executed when transport.RoundTrip(req) is called:
go func() {
    pc, err := t.dialConn(cm)
    dialc <- dialRes{pc, err}
}()

idleConnCh := t.getIdleConnCh(cm)
select {
case v := <-dialc:
    // Our dial finished.
    return v.pc, v.err
case pc := <-idleConnCh:
    // Another request finished first and its net.Conn
    // became available before our dial. Or somebody
    // else's dial that they didn't use.
    // But our dial is still going, so give it away
    // when it finishes:
    handlePendingDial()
    return pc, nil
case <-req.Cancel:
    handlePendingDial()
    return nil, errors.New("net/http: request canceled while waiting for connection")
case <-cancelc:
    handlePendingDial()
    return nil, errors.New("net/http: request canceled while waiting for connection")
}
First, a goroutine is started to dial a new connection. Then we select over idleConnCh (down which idle connections are passed) and dialc, which delivers the result of our dial. Cancellation of the request is also possible, if the caller decides RoundTrip has taken too long.
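To make the shape of that race clearer outside the transport's internals, here's a self-contained sketch of the same pattern; dialRemote, idleCh, connOrErr and discardDial are hypothetical stand-ins, not the real net/http API:

package main

import (
    "errors"
    "net"
)

// connOrErr plays the role of dialRes in the excerpt above.
type connOrErr struct {
    conn net.Conn
    err  error
}

// getConn races a fresh dial against an idle pool, in the style of
// transport.RoundTrip. dialRemote and idleCh stand in for
// t.dialConn and t.getIdleConnCh.
func getConn(dialRemote func() (net.Conn, error), idleCh <-chan net.Conn, cancel <-chan struct{}) (net.Conn, error) {
    dialc := make(chan connOrErr, 1) // buffered: the dial goroutine never blocks on send

    go func() {
        c, err := dialRemote()
        dialc <- connOrErr{c, err}
    }()

    // discardDial stands in for handlePendingDial: the dial is
    // still in flight, so dispose of its result when it lands.
    discardDial := func() {
        go func() {
            if v := <-dialc; v.err == nil {
                v.conn.Close() // a real pool would keep this for reuse
            }
        }()
    }

    select {
    case v := <-dialc:
        // Our dial finished first.
        return v.conn, v.err
    case c := <-idleCh:
        // An idle connection became available before our dial.
        discardDial()
        return c, nil
    case <-cancel:
        discardDial()
        return nil, errors.New("request canceled while waiting for connection")
    }
}

func main() {
    idle := make(chan net.Conn) // empty pool, so the dial will win
    dial := func() (net.Conn, error) { return net.Dial("tcp", "example.com:80") }
    if c, err := getConn(dial, idle, nil); err == nil {
        c.Close()
    }
}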
Late binding
So if an idle connection is available, retrieving it from the pool should win against dialing a new one. A similar approach (alongside other optimizations) has been used in Chromium, where it's referred to as late binding.
This echoes a mechanism we use in Railgun, CloudFlare's dynamic content optimizer, to ensure that an incoming request is serviced as quickly as possible. Idle connections to the Railgun component running at an origin server are periodically pruned after a timeout or may be closed because of errors, so an established connection from CloudFlare's edge is not always available. In this case, a direct request is made without Railgun whilst, in parallel, a persistent connection is initiated for use by subsequent requests.
As long as you're confident that your connection manager cleans up bad connections and cancelled requests properly, connection pooling and late binding can play an important part in reducing roundtrip latency. sync.Pool in the standard library may be a useful starting point if you need to pool something other than HTTP connections.
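As a taste of that, here's a minimal sync.Pool sketch pooling byte buffers, a common use for it; note that sync.Pool may drop idle objects at any garbage collection, so it suits cheap-to-recreate values better than stateful network connections:

package main

import (
    "bytes"
    "fmt"
    "sync"
)

// bufPool hands out reusable buffers. New is called when the pool
// is empty, so Get always returns something usable.
var bufPool = sync.Pool{
    New: func() interface{} { return new(bytes.Buffer) },
}

func main() {
    buf := bufPool.Get().(*bytes.Buffer)
    buf.Reset() // state from a previous user may linger
    buf.WriteString("hello")
    fmt.Println(buf.String())
    bufPool.Put(buf) // return it for reuse
}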
If you found this exploration interesting, we're hiring engineers in London, San Francisco and Singapore.