I work on Cloudflare Tunnel, which lets customers quickly connect their private services and networks through the Cloudflare network without having to expose their public IPs or ports through their firewall. Tunnel is managed for users by cloudflared, a tool that runs on the same network as the private services. It proxies traffic for these services via Cloudflare, and users can then access these services securely through the Cloudflare network.

Recently, I was trying to get Cloudflare Tunnel to connect to the Cloudflare network using a UDP protocol, QUIC. While doing this, I ran into an interesting connectivity problem unique to UDP. In this post I will talk about how I went about debugging this connectivity issue beyond the land of firewalls, and how some interesting differences between UDP and TCP came into play when sending network packets.

How does Cloudflare Tunnel work?

cloudflared works by opening several connections to different servers on the Cloudflare edge. Currently, these are long-lived TCP-based connections proxied over HTTP/2 frames. When Cloudflare receives a request to a hostname, it is proxied through these connections to the local service behind cloudflared.

While our HTTP/2 protocol mode works great, we’d like to improve a few things. First, TCP traffic sent over HTTP/2 is susceptible to Head of Line (HoL) blocking — this affects both HTTP traffic and traffic from WARP routing. Additionally, it is currently not possible to initiate communication from cloudflared’s HTTP/2 server in an efficient way. With the current Go implementation of HTTP/2, we could use Server-Sent Events, but this is not very useful in the scheme of proxying L4 traffic.

The upgrade to QUIC solves possible HoL blocking issues and opens up avenues that allow us to initiate communication from cloudflared to a different cloudflared in the future.

Naturally, QUIC required a UDP-based listener on our edge servers which cloudflared could connect to. We already connect to a TCP-based listener for the existing protocols, so this should be nice and easy, right?

Failed to dial to the edge

Things weren’t as straightforward as they first looked. I added a QUIC listener on the edge, and the ability for cloudflared to connect to this new UDP-based listener. I tried to run my brand new QUIC tunnel and this happened.

$  cloudflared tunnel run --protocol quic my-tunnel
2021-09-17T18:44:11Z ERR Failed to create new quic connection, err: failed to dial to edge: timeout: no recent network activity

cloudflared wasn’t even establishing a connection to the edge. I started looking at the obvious places first. Did I add a firewall rule allowing traffic to this port? Check. Did I have iptables rules ACCEPTing or DROPping appropriate traffic for this port? Check. They seemed to be in order. So what else could I do?

tcpdump all the packets

I started by logging for UDP traffic on the machine my server was running on to see what could be happening.

$  sudo tcpdump -n -i eth0 port 7844 and udp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
14:44:27.742629 IP 173.13.13.10.50152 > 198.41.200.200.7844: UDP, length 1252
14:44:27.743298 IP 203.0.113.0.7844 > 173.13.13.10.50152: UDP, length 37

Looking at this tcpdump helped me understand why I had no connectivity! Not only was this port getting UDP traffic but I was also seeing traffic flow out. But there seemed to be something strange afoot. Incoming packets were being sent to 198.41.200.200:7844 while responses were being sent back from 203.0.113.0:7844 (this is an example IP used for illustration purposes)  instead.

Why is this a problem? If a host (in this case, the server) chooses an address from a network unable to communicate with a public Internet host, it is likely that the return half of the communication will never arrive. But wait a minute. Why is some other IP getting prioritized over a source address my packets were already being sent to? Let’s take a deeper look at some IP addresses. (Note that I’ve deliberately oversimplified and scrambled results to minimally illustrate the problem)

$  ip addr list
eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1600 qdisc noqueue state UP group default qlen 1000
inet 203.0.113.0/32 scope global eth0
inet 198.41.200.200/32 scope global eth0 

$ ip route show
default via 203.0.113.0 dev eth0

So this was clearly why the server was working fine on my machine but not on the Cloudflare edge servers. It looks like I have multiple IPs on the interface my service is bound to. The IP that is the default route is being sent back as the source address of the packet.

Why does this work for TCP but not UDP?

Connection-oriented protocols, like TCP, initiate a connection (connect()) with a three-way handshake. The kernel therefore maintains a state about ongoing connections and uses this to determine the source IP address at the time of a response.

Because UDP (unless SOCK_SEQPACKET is involved) is connectionless, the kernel cannot maintain state like TCP does. The recvfrom  system call is invoked from the server side and tells who the data comes from. Unfortunately, recvfrom  does not tell us which IP this data is addressed for. Therefore, when the UDP server invokes the sendto system call to respond to the client, we can only tell it which address to send the data to. The responsibility of determining the source-address IP then falls to the kernel. The kernel has certain heuristics that it uses to determine the source address. This may or may not work, and in the ip routes example above, these heuristics did not work. The kernel naturally (and wrongly) picks the address of the default route to respond with.

Telling the kernel what to do

I had to rely on my application to set the source address explicitly and therefore not rely on kernel heuristics.

Linux has some generic I/O system calls, namely recvmsg  and sendmsg. Their function signatures allow us to both read or write additional out-of-band data we can pass the source address to. This control information is passed via the msghdr struct’s msg_control field.

ssize_t sendmsg(int socket, const struct msghdr *message, int flags)
ssize_t recvmsg(int socket, struct msghdr *message, int flags);
 
struct msghdr {
     void    *   msg_name;   /* Socket name          */
     int     msg_namelen;    /* Length of name       */
     struct iovec *  msg_iov;    /* Data blocks          */
     __kernel_size_t msg_iovlen; /* Number of blocks     */
     void    *   msg_control;    /* Per protocol magic (eg BSD file descriptor passing) */
    __kernel_size_t msg_controllen; /* Length of cmsg list */
     unsigned int    msg_flags;
};

We can now copy the control information we’ve gotten from recvmsg back when calling sendmsg, providing the kernel with information about the source address.The library I used (https://github.com/lucas-clemente/quic-go) had a recent update that did exactly this! I pulled the changes into my service and gave it a spin.

But alas. It did not work! A quick tcpdump showed that the same source address was being sent back. It seemed clear from reading the source code that the recvmsg and sendmsg were being called with the right values. It did not make sense.

So I had to see for myself if these system calls were being made.

strace all the system calls

strace is an extremely useful tool that tracks all system calls and signals sent/received by a process. Here’s what it had to say. I've removed all the information not relevant to this specific issue.

17:39:09.130346 recvmsg(3, {msg_name={sa_family=AF_INET6,
sin6_port=htons(35224), inet_pton(AF_INET6, "::ffff:171.54.148.10", 
&sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, msg_namelen=112->28, msg_iov=
[{iov_base="_\5S\30\273]\[email protected]\34\24\322\243{2\361\312|\325\n\1\314\316`\3
03\250\301X\20", iov_len=1452}], msg_iovlen=1, msg_control=[{cmsg_len=36, 
cmsg_level=SOL_IPV6, cmsg_type=0x32}, {cmsg_len=28, cmsg_level=SOL_IP, 
cmsg_type=IP_PKTINFO, cmsg_data={ipi_ifindex=if_nametoindex("eth0"),
ipi_spec_dst=inet_addr("198.41.200.200"),ipi_addr=inet_addr("198.41.200.200")}},
{cmsg_len=17, cmsg_level=SOL_IP, 
cmsg_type=IP_TOS, cmsg_data=[0]}], msg_controllen=96, msg_flags=0}, 0) = 28 <0.000007>
17:39:09.165160 sendmsg(3, {msg_name={sa_family=AF_INET6, 
sin6_port=htons(35224), inet_pton(AF_INET6, "::ffff:171.54.148.10", 
&sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, msg_namelen=28, 
msg_iov=[{iov_base="Oe4\37:3\344 &\243W\10~c\\\316\2640\255*\231 
OY\326b\26\300\264&\33\""..., iov_len=1302}], msg_iovlen=1, msg_control=
[{cmsg_len=28, cmsg_level=SOL_TCP, cmsg_type=0x8}], msg_controllen=28, 
msg_flags=0}, 0) = 1302 <0.000054>

Let's start with recvmsg . We can clearly see that the ipi_addr for the source is being passed correctly: ipi_addr=inet_addr("172.16.90.131"). This part works as expected. Looking at sendmsg  almost instantly tells us where the problem is. The field we want, ip_spec_dst is not being set as we make this system call. So the kernel continues to make wrong guesses as to what the source address may be.

This turned out to be a bug where the library was using IPROTO_TCP instead of IPPROTO_IPV4 as the control message level while making the sendmsg call. Was that it? Seemed a little anticlimactic. I submitted a slightly more typesafe fix and sure enough, straces now showed me what I was expecting to see.

18:22:08.334755 sendmsg(3, {msg_name={sa_family=AF_INET6, 
sin6_port=htons(37783), inet_pton(AF_INET6, "::ffff:171.54.148.10", 
&sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, msg_namelen=28, 
msg_iov=
[{iov_base="Ki\20NU\242\211Y\254\337\3107\224\201\233\242\2647\245}6jlE\2
70\227\3023_\353n\364"..., iov_len=33}], msg_iovlen=1, msg_control=
[{cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, cmsg_data=
{ipi_ifindex=if_nametoindex("eth0"), 
ipi_spec_dst=inet_addr("198.41.200.200"),ipi_addr=inet_addr("0.0.0.0")}}
], msg_controllen=32, msg_flags=0}, 0) =
33 <0.000049>

cloudflared is now able to connect with UDP (QUIC) to the Cloudflare network from anywhere in the world!

$  cloudflared tunnel --protocol quic run sudarsans-tunnel
2021-09-21T11:37:30Z INF Starting tunnel tunnelID=a72e9cb7-90dc-499b-b9a0-04ee70f4ed78
2021-09-21T11:37:30Z INF Version 2021.9.1
2021-09-21T11:37:30Z INF GOOS: darwin, GOVersion: go1.16.5, GoArch: amd64
2021-09-21T11:37:30Z INF Settings: map[p:quic protocol:quic]
2021-09-21T11:37:30Z INF Initial protocol quic
2021-09-21T11:37:32Z INF Connection 3ade6501-4706-433e-a960-c793bc2eecd4 registered connIndex=0 location=AMS

While the programmatic bug causing this issue was a trivial one, the journey into systematically discovering the issue and understanding how Linux internals worked for UDP along the way turned out to be very rewarding for me. It also reiterated my belief that tcpdump and strace are indeed invaluable tools in anybody’s arsenal when debugging network problems.

What’s next?

You can give this a try with the latest cloudflared release at https://github.com/cloudflare/cloudflared/releases/latest. Just remember to set the protocol flag to quic. We plan to leverage this new mode to roll out some exciting new features for Cloudflare Tunnel. So upgrade away and keep watching this space for more information on how you can take advantage of this.