"multas per gentes et multa per aequora" 
The life of a request to CloudFlare begins and ends at the edge. But the afterlife! Like Catullus to Bithynia, the log generated by an HTTP request or a DNS query has much, much further to go.
This post comes from CloudFlare's Data Team. It reports on the state of processing these edge logs, including what's worked well for us and what remains a challenge, in the time since our last post in April 2015.
In an edge network, where HTTP and DNS clients connect to thousands of servers distributed across the world, the key is to distribute those servers across many carefully picked points of presence—and with over 85 PoPs, no network has better representation than CloudFlare. The reverse of this distribution, however, has to happen for our network's logs. After anycast has scattered requests (and queries) to thousands of nodes at the edge, it's the Data Team's job to gather the resulting logs to a small number of central points and consolidate them for easy use by our customers.
The charts above depict (with some artifacts due to counter resets) the total structured logs sent from the edge to one of these central points yesterday, July 11th. Yesterday we saw:
- An average of 3.6M HTTP logs per second, with peaks over 4.5M logs/s
- An average of 750K DNS logs per second, with peaks over 1M logs/s
This is a typical, ordinary day, with the edge serving hundreds of Gbps in any given minute and carrying traffic for more than 128M distinct IP addresses in any given hour.
Such a day results in nearly 360TB of raw, Cap’n Proto event logs. Brokering this data requires two Kafka clusters, comprising:
- 1196 cores,
- 170 10G NICs,
- 10.6TB RAM, and
- 4.3PB disk.
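The figures above hang together with a little arithmetic. As a sanity check, here is a short Go sketch that derives the daily log count from the quoted average rates and the implied size of a single raw event log; all inputs are the rounded numbers from this post, so the outputs are approximate:

```go
package main

import "fmt"

// Back-of-the-envelope check of the figures above: total logs per day
// from the average rates, and the implied bytes per raw event log.
func main() {
	const (
		httpLogsPerSec = 3.6e6  // average HTTP logs/s
		dnsLogsPerSec  = 0.75e6 // average DNS logs/s
		secondsPerDay  = 86400.0
		rawBytesPerDay = 360e12 // ~360 TB of raw event logs per day
	)

	logsPerDay := (httpLogsPerSec + dnsLogsPerSec) * secondsPerDay
	bytesPerLog := rawBytesPerDay / logsPerDay

	fmt.Printf("logs/day: %.1f billion\n", logsPerDay/1e9) // ~375.8 billion
	fmt.Printf("avg bytes/log: %.0f\n", bytesPerLog)       // ~958 bytes
}
```

Roughly 376 billion logs a day, at just under a kilobyte each of raw Cap'n Proto, lines up with the 360TB figure.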
Downstream from Kafka sits roughly the same amount of hardware, some shared between Mesos + Docker and HDFS, some wholly dedicated to services like CitusDB.
We expect to see a significant increase in these numbers by the end of 2016.
Things that work (and that don't break)
What's worked well in this system, and why?
There are more, but space permits naming only five.
The log forwarder. The graphs above use metrics from all instances of our log forwarding service running on the edge. This internal software, written in Go, handles structured logs both at the edge (from nginx and from our DNS service) and on the data center side of the pipeline. Its plugin architecture has made it remarkably easy to add new log types and Kafka endpoints, and we've had great visibility into its operation thanks to the metrics we collect from it on ingress, drops, and buffer sizes.
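The forwarder itself is internal and unpublished, but the shape of a plugin architecture like the one described is easy to sketch. Every name below is invented for illustration; the idea is only that adding a new log type is a matter of implementing a small interface and registering it:

```go
package main

import "fmt"

// Hypothetical sketch of a plugin-style log forwarder. A LogType maps one
// kind of structured edge log (e.g. an nginx or DNS event) to the Kafka
// topic its records should be produced to.
type LogType interface {
	Name() string
	Topic() string
}

type httpLog struct{}

func (httpLog) Name() string  { return "http" }
func (httpLog) Topic() string { return "requests" }

type dnsLog struct{}

func (dnsLog) Name() string  { return "dns" }
func (dnsLog) Topic() string { return "queries" }

// registry maps a log type's name to its implementation. Supporting a new
// log type is one Register call, which is what makes a design like this
// easy to extend.
var registry = map[string]LogType{}

func Register(t LogType) { registry[t.Name()] = t }

func main() {
	Register(httpLog{})
	Register(dnsLog{})
	for name, t := range registry {
		fmt.Printf("%s -> topic %q\n", name, t.Topic())
	}
}
```

A real forwarder would also carry per-plugin counters (ingress, drops, buffer depth), which is where the operational visibility mentioned above comes from.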
Kafka. CloudFlare runs several Kafka clusters, including one with just under 80 brokers. Although some failure modes remain hard to automate, there's nothing else on the market that can do what we need it to do, and it does that every day remarkably well.
Persistence, and to a greatly improved extent, HTTP retrieval, for log sharing. CloudFlare offers log storage and retrieval to enterprise customers (known as Log Share or ELS), using a dedicated Kafka consumer, HDFS, and several Go services. Thanks to more monitoring, improved resiliency to many kinds of network failures, better runbooks, and a lot of hard work, Log Share today offers significantly better availability than it did at the end of 2015. (The story of this work, by everyone in the team and a number of others besides, deserves a post of its own.)
CitusDB. Few things in a data center may seem less glamorous than sharded PostgreSQL, but CitusDB continues to work great as a performant and easily managed service for persisting aggregated data. Support has been quite good, and the fact that it's "just PostgreSQL" under the hood greatly simplified a zero-downtime CitusDB major version upgrade and migration to a completely new cluster.
The platform and SRE! None of these parts could work well without the hard work done by Data's sister team, the Platform Team. We're exceptionally fortunate to have some of the best tools at our disposal, including OpenTSDB + Grafana for metrics, Prometheus for alerting, Sentry for exception reports, ElasticSearch + Kibana for logs, Marathon + Mesos + Docker for orchestration, and nginx + zoidberg for load balancing. In the same way, we remain most grateful to CloudFlare's fantastic SREs, who continue to work with us in making the data center, different from the edge though it is, a reliable and efficient place to work.
What's yet to be done
What do we want to improve or expect to add in the near future?
Time, not space, restricts these. In rough order:
Make each service more reliable for the customer. This work includes dozens of small things, from better runbooks and automation to capacity planning and wholesale updates to existing architecture. And it runs a continuum from external services like Log Share, which are squarely about providing data to customers, to internal ones, which are squarely about improving automation and visibility at the edge.
New analytics. Customers rely on CloudFlare to know what happens at the edge. We're working today to build analytics systems for all of CloudFlare's offerings and to extend our existing analytics so they tell customers more of what they need. Much of the technology we use in these new systems remains up for grabs, although one piece we have up (and handling a million DNS event logs per second) is Spark Streaming.
New data pipelines. Both customers and CloudFlare itself need systems that get request and query logs from one point to another. Such pipelines include external ones that push raw logs to the customer, as well as internal ones that let us push logs between data centers for disaster recovery and the like.
Better support for complex analysis. Finally, there will always be customers and internal users who need stronger tooling for analyzing high-dimension data at scale. Making it easy for people to write such analyses, whether from a UI or as jobs run against a cluster, is a huge challenge and one we look forward to proving out.
For all these services, which must be built so they can work for everyone, a key challenge is making sure that the design (1) works well for the customer, and (2) can be implemented in an economical number of nodes. As much as we can, we work to build solutions that perform well on dozens to hundreds of nodes, not thousands.
Til we meet again
In this post from the Data Team, we talked about what's working and what we hope to work on next. We'd also love to talk with you! If you have thoughts on this post, or on what you'd like the Data Team to write about next, please do tell.
In addition: data at CloudFlare is a small team, so if these problems interest you, you stand to have some great problems to work on. Check out the Systems Software Engineer for Data and Data Analyst roles, and let us know what you think.
Finally, if you're headed to GopherCon, CloudFlare has three Gophers attending this week, including one from the Data Team, Alan Braithwaite @Caust1c. (Many, many services in data, including all production services today, are built with Go.) Look us up!
"through many peoples and across many seas," the beginning of Catullus 101. ↩︎