One of our large scale data infrastructure challenges here at Cloudflare is around providing HTTP traffic analytics to our customers. HTTP Analytics is available to all our customers via two options:
- Analytics tab in Cloudflare dashboard
- Zone Analytics API with 2 endpoints
- Dashboard endpoint
- Co-locations endpoint (Enterprise plan only)
In this blog post I'm going to talk about the exciting evolution of the Cloudflare analytics pipeline over the last year. I'll start with a description of the old pipeline and the challenges that we experienced with it. Then, I'll describe how we leveraged ClickHouse to form the basis of a new and improved pipeline. In the process, I'll share details about how we went about schema design and performance tuning for ClickHouse. Finally, I'll look forward to what the Data team is thinking of providing in the future.
Let's start with the old data pipeline.
Old data pipeline
The previous pipeline was built in 2014. It has been mentioned previously in Scaling out PostgreSQL for CloudFlare Analytics using CitusDB and More data, more data blog posts from the Data team.
It had following components:
- Log forwarder - collected Cap'n Proto formatted logs from the edge, notably DNS and Nginx logs, and shipped them to Kafka in Cloudflare central datacenter.
- Kafka cluster - consisted of 106 brokers with x3 replication factor, 106 partitions, ingested Cap'n Proto formatted logs at average rate 6M logs per second.
- Kafka consumers - each of 106 partitions had dedicated Go consumer (a.k.a. Zoneagg consumer), which read logs and produced aggregates per partition per zone per minute and then wrote them into Postgres.
Postgres database - single instance PostgreSQL database (a.k.a. RollupDB), accepted aggregates from Zoneagg consumers and wrote them into temporary tables per partition per minute. It then rolled-up the aggregates into further aggregates with aggregation cron. More specifically:
- Aggregates per partition, minute, zone → aggregates data per minute, zone
- Aggregates per minute, zone → aggregates data per hour, zone
- Aggregates per hour, zone → aggregates data per day, zone
- Aggregates per day, zone → aggregates data per month, zone
- Citus Cluster - consisted of Citus main and 11 Citus workers with x2 replication factor (a.k.a. Zoneagg Citus cluster), the storage behind Zone Analytics API and our BI internal tools. It had replication cron, which did remote copy of tables from Postgres instance into Citus worker shards.
- Zone Analytics API - served queries from internal PHP API. It consisted of 5 API instances written in Go and queried Citus cluster, and was not visible to external users.
- PHP API - 3 instances of proxying API, which forwarded public API queries to internal Zone Analytics API, and had some business logic on zone plans, error messages, etc.
- Load Balancer - nginx proxy, forwarded queries to PHP API/Zone Analytics API.
Cloudflare has grown tremendously since this pipeline was originally designed in 2014. It started off processing under 1M requests per second and grew to current levels of 6M requests per second. The pipeline had served us and our customers well over the years, but began to split at the seams. Any system should be re-engineered after some time, when requirements change.
Some specific disadvantages of the original pipeline were:
- Postgres SPOF - single PostgreSQL instance was a SPOF (Single Point of Failure), as it didn't have replicas or backups and if we were to lose this node, whole analytics pipeline could be paralyzed and produce no new aggregates for Zone Analytics API.
- Citus main SPOF - Citus main was the entrypoint to all Zone Analytics API queries and if it went down, all our customers' Analytics API queries would return errors.
- Complex codebase - thousands of lines of bash and SQL for aggregations, and thousands of lines of Go for API and Kafka consumers made the pipeline difficult to maintain and debug.
- Many dependencies - the pipeline consisted of many components, and failure in any individual component could result in halting the entire pipeline.
- High maintenance cost - due to its complex architecture and codebase, there were frequent incidents, which sometimes took engineers from the Data team and other teams many hours to mitigate.
Over time, as our request volume grew, the challenges of operating this pipeline became more apparent, and we realized that this system was being pushed to its limits. This realization inspired us to think about which components would be ideal candidates for replacement, and led us to build new data pipeline.
Our first design of an improved analytics pipeline centred around the use of the Apache Flink stream processing system. We had previously used Flink for other data pipelines, so it was a natural choice for us. However, these pipelines had been at a much lower rate than the 6M requests per second we needed to process for HTTP Analytics, and we struggled to get Flink to scale to this volume - it just couldn't keep up with ingestion rate per partition on all 6M HTTP requests per second.
Our colleagues on our DNS team had already built and productionized DNS analytics pipeline atop ClickHouse. They wrote about it in "How Cloudflare analyzes 1M DNS queries per second" blog post. So, we decided to take a deeper look at ClickHouse.
"ClickHouse не тормозит."
Translation from Russian: ClickHouse doesn't have brakes (or isn't slow)
© ClickHouse core developers
When exploring additional candidates for replacing some of the key infrastructure of our old pipeline, we realized that using a column oriented database might be well suited to our analytics workloads. We wanted to identify a column oriented database that was horizontally scalable and fault tolerant to help us deliver good uptime guarantees, and extremely performant and space efficient such that it could handle our scale. We quickly realized that ClickHouse could satisfy these criteria, and then some.
ClickHouse is an open source column-oriented database management system capable of real time generation of analytical data reports using SQL queries. It is blazing fast, linearly scalable, hardware efficient, fault tolerant, feature rich, highly reliable, simple and handy. ClickHouse core developers provide great help on solving issues, merging and maintaining our PRs into ClickHouse. For example, engineers from Cloudflare have contributed a whole bunch of code back upstream:
- Aggregate function topK by Marek Vavruša
- IP prefix dictionary by Marek Vavruša
- SummingMergeTree engine optimizations by Marek Vavruša
- Kafka table Engine by Marek Vavruša. We're thinking to replace Kafka Go consumers with this engine when it will be stable enough and ingest from Kafka into ClickHouse directly.
- Aggregate function sumMap by Alex Bocharov. Without this function it would be impossible to build our new Zone Analytics API.
- Mark cache fix by Alex Bocharov
- uniqHLL12 function fix for big cardinalities by Alex Bocharov. The description of the issue and its fix should be an interesting reading.
Along with filing many bug reports, we also report about every issue we face in our cluster, which we hope will help to improve ClickHouse in future.
Even though DNS analytics on ClickHouse had been a great success, we were still skeptical that we would be able to scale ClickHouse to the needs of the HTTP pipeline:
- Kafka DNS topic has on average 1.5M messages per second vs 6M messages per second for HTTP requests topic.
- Kafka DNS topic average uncompressed message size is 130B vs 1630B for HTTP requests topic.
- DNS query ClickHouse record consists of 40 columns vs 104 columns for HTTP request ClickHouse record.
After unsuccessful attempts with Flink, we were skeptical of ClickHouse being able to keep up with the high ingestion rate. Luckily, early prototype showed promising performance and we decided to proceed with old pipeline replacement. The first step in replacing the old pipeline was to design a schema for the new ClickHouse tables.
ClickHouse schema design
Once we identified ClickHouse as a potential candidate, we began exploring how we could port our existing Postgres/Citus schemas to make them compatible with ClickHouse.
For our Zone Analytics API we need to produce many different aggregations for each zone (domain) and time period (minutely / hourly / daily / monthly). For deeper dive about specifics of aggregates please follow Zone Analytics API documentation or this handy spreadsheet.
These aggregations should be available for any time range for the last 365 days. While ClickHouse is a really great tool to work with non-aggregated data, with our volume of 6M requests per second we just cannot afford yet to store non-aggregated data for that long.
To give you an idea of how much data is that, here is some "napkin-math" capacity planning. I'm going to use an average insertion rate of 6M requests per second and $100 as a cost estimate of 1 TiB to calculate storage cost for 1 year in different message formats:
|Metric||Cap'n Proto||Cap'n Proto (zstd)||ClickHouse|
|Avg message/record size||1630 B||360 B||36.74 B|
|Storage requirements for 1 year||273.93 PiB||60.5 PiB||18.52 PiB (RF x3)|
|Storage cost for 1 year||$28M||$6.2M||$1.9M|
Even though storage requirements are quite scary, we're still considering to store raw (non-aggregated) requests logs in ClickHouse for 1 month+. See "Future of Data APIs" section below.
Non-aggregated requests table
We store over 100+ columns, collecting lots of different kinds of metrics about each request passed through Cloudflare. Some of these columns are also available in our Enterprise Log Share product, however ClickHouse non-aggregated requests table has more fields.
With so many columns to store and huge storage requirements we've decided to proceed with the aggregated-data approach, which worked well for us before in old pipeline and which will provide us with backward compatibility.
Aggregations schema design #1
According to the API documentation, we need to provide lots of different requests breakdowns and to satisfy these requirements we decided to test the following approach:
- Create Cickhouse materialized views with ReplicatedAggregatingMergeTree engine pointing to non-aggregated requests table and containing minutely aggregates data for each of the breakdowns:
- Requests totals - containing numbers like total requests, bytes, threats, uniques, etc.
- Requests by colo - containing requests, bytes, etc. breakdown by edgeColoId - each of 120+ Cloudflare datacenters
- Requests by http status - containing breakdown by HTTP status code, e.g. 200, 404, 500, etc.
- Requests by content type - containing breakdown by response content type, e.g. HTML, JS, CSS, etc.
- Requests by country - containing breakdown by client country (based on IP)
- Requests by threat type - containing breakdown by threat type
- Requests by browser - containing breakdown by browser family, extracted from user agent
- Requests by ip class - containing breakdown by client IP class
- Write the code gathering data from all 8 materialized views, using two approaches:
- Querying all 8 materialized views at once using JOIN
- Querying each one of 8 materialized views separately in parallel
- Run performance testing benchmark against common Zone Analytics API queries
Schema design #1 didn't work out well. ClickHouse JOIN syntax forces to write monstrous query over 300 lines of SQL, repeating the selected columns many times because you can do only pairwise joins in ClickHouse.
As for querying each of materialized views separately in parallel, benchmark showed prominent, but moderate results - query throughput would be a little bit better than using our Citus based old pipeline.
Aggregations schema design #2
In our second iteration of the schema design, we strove to keep a similar structure to our existing Citus tables. To do this, we experimented with the SummingMergeTree engine, which is described in detail by the excellent ClickHouse documentation:
In addition, a table can have nested data structures that are processed in a special way. If the name of a nested table ends in 'Map' and it contains at least two columns that meet the following criteria... then this nested table is interpreted as a mapping of key => (values...), and when merging its rows, the elements of two data sets are merged by 'key' with a summation of the corresponding (values...).
We were pleased to find this feature, because the SummingMergeTree engine allowed us to significantly reduce the number of tables required as compared to our initial approach. At the same time, it allowed us to match the structure of our existing Citus tables. The reason was that the ClickHouse Nested structure ending in 'Map' was similar to the Postgres hstore data type, which we used extensively in the old pipeline.
However, there were two existing issues with ClickHouse maps:
- SummingMergeTree does aggregation for all records with same primary key, but final aggregation across all shards should be done using some aggregate function, which didn't exist in ClickHouse.
- For storing uniques (uniques visitors based on IP), we need to use AggregateFunction data type, and although SummingMergeTree allows you to create column with such data type, it will not perform aggregation on it for records with same primary keys.
To resolve problem #1, we had to create a new aggregation function sumMap. Luckily, ClickHouse source code is of excellent quality and its core developers are very helpful with reviewing and merging requested changes.
As for problem #2, we had to put uniques into separate materialized view, which uses the ReplicatedAggregatingMergeTree Engine and supports merge of AggregateFunction states for records with the same primary keys. We're considering adding the same functionality into SummingMergeTree, so it will simplify our schema even more.
We also created a separate materialized view for the Colo endpoint because it has much lower usage (5% for Colo endpoint queries, 95% for Zone dashboard queries), so its more dispersed primary key will not affect performance of Zone dashboard queries.
Once schema design was acceptable, we proceeded to performance testing.
ClickHouse performance tuning
We explored a number of avenues for performance improvement in ClickHouse. These included tuning index granularity, and improving the merge performance of the SummingMergeTree engine.
By default ClickHouse recommends to use 8192 index granularity. There is nice article explaining ClickHouse primary keys and index granularity in depth.
While default index granularity might be excellent choice for most of use cases, in our case we decided to choose the following index granularities:
- For the main non-aggregated requests table we chose an index granularity of 16384. For this table, the number of rows read in a query is typically on the order of millions to billions. In this case, a large index granularity does not make a huge difference on query performance.
- For the aggregated requests_* stables, we chose an index granularity of 32. A low index granularity makes sense when we only need to scan and return a few rows. It made a huge difference in API performance - query latency decreased by 50% and throughput increased by ~3 times when we changed index granularity 8192 → 32.
Not relevant to performance, but we also disabled the min_execution_speed setting, so queries scanning just a few rows won't return exception because of "slow speed" of scanning rows per second.
On the aggregation/merge side, we've made some ClickHouse optimizations as well, like increasing SummingMergeTree maps merge speed by x7 times, which we contributed back into ClickHouse for everyone's benefit.
Once we had completed the performance tuning for ClickHouse, we could bring it all together into a new data pipeline. Next, we describe the architecture for our new, ClickHouse-based data pipeline.
New data pipeline
The new pipeline architecture re-uses some of the components from old pipeline, however it replaces its most weak components.
New components include:
- Kafka consumers - 106 Go consumers per each partition consume Cap'n Proto raw logs and extract/prepare needed 100+ ClickHouse fields. Consumers no longer do any aggregation logic.
- ClickHouse cluster - 36 nodes with x3 replication factor. It handles non-aggregate requests logs ingestion and then produces aggregates using materialized views.
- Zone Analytics API - rewritten and optimized version of API in Go, with many meaningful metrics, healthchecks, failover scenarios.
As you can see the architecture of new pipeline is much simpler and fault-tolerant. It provides Analytics for all our 7M+ customers' domains totalling more than 2.5 billion monthly unique visitors and over 1.5 trillion monthly page views.
On average we process 6M HTTP requests per second, with peaks of upto 8M requests per second.
Average log message size in Cap’n Proto format used to be ~1630B, but thanks to amazing job on Kafka compression by our Platform Operations Team, it decreased significantly. Please see "Squeezing the firehose: getting the most from Kafka compression" blog post with deeper dive into those optimisations.
Benefits of new pipeline
- No SPOF - removed all SPOFs and bottlenecks, everything has at least x3 replication factor.
- Fault-tolerant - it's more fault-tolerant, even if Kafka consumer or ClickHouse node or Zone Analytics API instance fails, it doesn't impact the service.
- Scalable - we can add more Kafka brokers or ClickHouse nodes and scale ingestion as we grow. We are not so confident about query performance when cluster will grow to hundreds of nodes. However, Yandex team managed to scale their cluster to 500+ nodes, distributed geographically between several data centers, using two-level sharding.
- Reduced complexity - due to removing messy crons and consumers which were doing aggregations and refactoring API code we were able to:
- Shutdown Postgres RollupDB instance and free it up for reuse.
- Shutdown Citus cluster 12 nodes and free it up for reuse. As we won't use Citus for serious workload anymore we can reduce our operational and support costs.
- Delete tens of thousands of lines of old Go, SQL, Bash, and PHP code.
- Remove WWW PHP API dependency and extra latency.
- Improved API throughput and latency - with previous pipeline Zone Analytics API was struggling to serve more than 15 queries per second, so we had to introduce temporary hard rate limits for largest users. With new pipeline we were able to remove hard rate limits and now we are serving ~40 queries per second. We went further and did intensive load testing for new API and with current setup and hardware we are able serve up to ~150 queries per second and this is scalable with additional nodes.
- Easier to operate - with shutdown of many unreliable components, we are finally at the point where it's relatively easy to operate this pipeline. ClickHouse quality helps us a lot in this matter.
- Decreased amount of incidents - with new more reliable pipeline, we now have fewer incidents than before, which ultimately has reduced on-call burden. Finally, we can sleep peacefully at night :-).
Recently, we've improved the throughput and latency of the new pipeline even further with better hardware. I'll provide details about this cluster below.
Our ClickHouse cluster
In total we have 36 ClickHouse nodes. The new hardware is a big upgrade for us:
- Chassis - Quanta D51PH-1ULH chassis instead of Quanta D51B-2U chassis (2x less physical space)
- CPU - 40 logical cores E5-2630 v3 @ 2.40 GHz instead of 32 cores E5-2630 v4 @ 2.20 GHz
- RAM - 256 GB RAM instead of 128 GB RAM
- Disks - 12 x 10 TB Seagate ST10000NM0016-1TT101 disks instead of 12 x 6 TB Toshiba TOSHIBA MG04ACA600E
- Network - 2 x 25G Mellanox ConnectX-4 in MC-LAG instead of 2 x 10G Intel 82599ES
Our Platform Operations team noticed that ClickHouse is not great at running heterogeneous clusters yet, so we need to gradually replace all nodes in the existing cluster with new hardware, all 36 of them. The process is fairly straightforward, it's no different than replacing a failed node. The problem is that ClickHouse doesn't throttle recovery.
Here is more information about our cluster:
- Avg insertion rate - all our pipelines bring together 11M rows per second.
- Avg insertion bandwidth - 47 Gbps.
- Avg queries per second - on average cluster serves ~40 queries per second with frequent peaks up to ~80 queries per second.
- CPU time - after recent hardware upgrade and all optimizations, our cluster CPU time is quite low.
- Max disk IO (device time) - it's low as well.
In order to make the switch to the new pipeline as seamless as possible, we performed a transfer of historical data from the old pipeline. Next, I discuss the process of this data transfer.
Historical data transfer
As we have 1 year storage requirements, we had to do one-time ETL (Extract Transfer Load) from the old Citus cluster into ClickHouse.
At Cloudflare we love Go and its goroutines, so it was quite straightforward to write a simple ETL job, which:
- For each minute/hour/day/month extracts data from Citus cluster
- Transforms Citus data into ClickHouse format and applies needed business logic
- Loads data into ClickHouse
The whole process took couple of days and over 60+ billions rows of data were transferred successfully with consistency checks. The completion of this process finally led to the shutdown of old pipeline. However, our work does not end there, and we are constantly looking to the future. In the next section, I'll share some details about what we are planning.
Future of Data APIs
We're currently working on something called "Log Push". Log push allows you to specify a desired data endpoint and have your HTTP request logs sent there automatically at regular intervals. At the moment, it's in private beta and going to support sending logs to:
- Amazon S3 bucket
- Google Cloud Service bucket
- Other storage services and platforms
It's expected to be generally available soon, but if you are interested in this new product and you want to try it out please contact our Customer Support team.
Logs SQL API
We're also evaluating possibility of building new product called Logs SQL API. The idea is to provide customers access to their logs via flexible API which supports standard SQL syntax and JSON/CSV/TSV/XML format response.
Queries can extract:
- Raw requests logs fields (e.g. SELECT field1, field2, ... FROM requests WHERE ...)
- Aggregated data from requests logs (e.g. SELECT clientIPv4, count() FROM requests GROUP BY clientIPv4 ORDER BY count() DESC LIMIT 10)
Google BigQuery provides similar SQL API and Amazon has product callled Kinesis Data analytics with SQL API support as well.
Another option we're exploring is to provide syntax similar to DNS Analytics API with filters and dimensions.
We're excited to hear your feedback and know more about your analytics use case. It can help us a lot to build new products!
All this could not be possible without hard work across multiple teams! First of all thanks to other Data team engineers for their tremendous efforts to make this all happen. Platform Operations Team made significant contributions to this project, especially Ivan Babrou and Daniel Dao. Contributions from Marek Vavruša in DNS Team were also very helpful.
Finally, Data team at Cloudflare is a small team, so if you're interested in building and operating distributed services, you stand to have some great problems to work on. Check out the Distributed Systems Engineer - Data and Data Infrastructure Engineer roles in London, UK and San Francisco, US, and let us know what you think.