
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Sun, 05 Apr 2026 19:15:13 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Scaling out PostgreSQL for CloudFlare Analytics using CitusDB]]></title>
            <link>https://blog.cloudflare.com/scaling-out-postgresql-for-cloudflare-analytics-using-citusdb/</link>
            <pubDate>Thu, 09 Apr 2015 17:32:05 GMT</pubDate>
            <description><![CDATA[ When I joined CloudFlare about 18 months ago, we had just started to build out our new Data Platform. At that point, the log processing and analytics pipeline built in the early days of the company had reached its limits.  ]]></description>
            <content:encoded><![CDATA[ <p>When I joined CloudFlare about 18 months ago, we had just started to build out our new Data Platform. At that point, the log processing and analytics pipeline built in the early days of the company had reached its limits. This was due to the rapidly increasing log volume from our Edge Platform, where we’ve had to deal with traffic growth in excess of 400% annually.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4AxkHPDBZrwj6QJQVWuQcX/fec02af530de1ab2f8a1f516ece59057/keepcalm_scaled.png" />
            
            </figure><p>Our log processing pipeline started out like most everybody else’s: compressed log files shipped to a central location for aggregation by a motley collection of Perl scripts and C++ programs with a single PostgreSQL instance to store the aggregated data. Since then, CloudFlare has grown to serve millions of requests per second for millions of sites. Apart from the hundreds of terabytes of log data that has to be aggregated every day, we also face some unique challenges in providing detailed analytics for each of the millions of sites on CloudFlare.</p><p>For the next iteration of our Customer Analytics application, we wanted to get something up and running quickly, try out Kafka, write the aggregation application in Go, and see what could be done to scale out our trusty go-to database, PostgreSQL, from a single machine to a cluster of servers without requiring us to deal with sharding in the application.</p><p>As we were analyzing our scaling requirements for PostgreSQL, we came across <a href="https://www.citusdata.com/">Citus Data</a>, one of the companies to launch out of <a href="https://www.ycombinator.com/">Y Combinator</a> in the summer of 2011. Citus Data builds a database called CitusDB that scales out PostgreSQL for real-time workloads. Because CitusDB enables both real-time data ingest and sub-second queries across billions of rows, it has become a crucial part of our analytics infrastructure.</p>
    <div>
      <h4>Log Processing Pipeline for Analytics</h4>
      <a href="#log-processing-pipeline-for-analytics">
        
      </a>
    </div>
    <p>Before jumping into the details of our database backend, let’s review the pipeline that takes a log event from CloudFlare’s Edge to our analytics database.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4I3yKJFMKyL4M3gtS2SlTy/b34a4a58d3da74e788950e6af1699582/image01.png" />
            
            </figure><p>An HTTP access log event proceeds through the CloudFlare data pipeline as follows:</p><ol><li><p>A web browser makes a request (e.g., an HTTP GET request).</p></li><li><p>An Nginx web server running <a href="/pushing-nginx-to-its-limit-with-lua/">Lua code</a> handles the request and generates a binary log event in <a href="https://capnproto.org">Cap’n Proto format</a>.</p></li><li><p>A Go program akin to <a href="https://github.com/mozilla-services/heka">Heka</a> receives the log event from Nginx over a UNIX socket, batches it with other events, compresses the batch using a fast algorithm like <a href="https://github.com/google/snappy">Snappy</a> or <a href="https://github.com/Cyan4973/lz4">LZ4</a>, and sends it to our data center over a TLS-encrypted TCP connection.</p></li><li><p>Another Go program (the Kafka shim) receives the log event stream, decrypts it, decompresses the batches, and produces the events into a Kafka topic with partitions replicated on many servers.</p></li><li><p>Go aggregators (one process per partition) consume the topic-partitions and insert aggregates (not individual events) with 1-minute granularity into the CitusDB database. Further rollups to 1-hour and 1-day granularity occur later to reduce the amount of data to be queried and to speed up queries over intervals spanning many hours or days.</p></li></ol>
    <div>
      <h4>Why Go?</h4>
      <a href="#why-go">
        
      </a>
    </div>
    <p>Previous blog <a href="/what-weve-been-doing-with-go/">posts</a> and <a href="https://www.youtube.com/watch?v=8igk2ylk_X4">talks</a> have covered <a href="/go-at-cloudflare/">various CloudFlare projects that have been built using Go</a>. We’ve found that Go is a great language for teams to use when building the kinds of distributed systems needed at CloudFlare, and this is true regardless of an engineer’s level of experience with Go. Our Customer Analytics team is made up of engineers that have been using Go since before its 1.0 release as well as complete Go newbies. Team members that were new to Go were able to spin up quickly, and the code base has remained maintainable even as we’ve continued to build many more data processing and aggregation applications such as a new version of <a href="https://www.hakkalabs.co/articles/optimizing-go-3k-requestssec-480k-requestssec">our Layer 7 DDoS attack mitigation system</a>.</p><p>Another factor that makes Go great is the ever-expanding ecosystem of third party libraries. We used <a href="https://github.com/glycerine/go-capnproto">go-capnproto</a> to generate Go code to handle binary log events in Cap’n Proto format from a common schema shared between Go, C++, and <a href="/introducing-lua-capnproto-better-serialization-in-lua/">Lua projects</a>. Go support for Kafka with <a href="https://godoc.org/github.com/Shopify/sarama">Shopify’s Sarama</a> library, support for ZooKeeper with <a href="https://github.com/samuel/go-zookeeper">go-zookeeper</a>, support for PostgreSQL/CitusDB through <a href="http://golang.org/pkg/database/sql/">database/sql</a> and the <a href="https://github.com/lib/pq">lib/pq driver</a> are all very good.</p>
    <div>
      <h4>Why Kafka?</h4>
      <a href="#why-kafka">
        
      </a>
    </div>
    <p>As we started building our new data processing applications in Go, we had some additional requirements for the pipeline:</p><ol><li><p>Use a queue with persistence to allow short periods of downtime for downstream servers and/or consumer services.</p></li><li><p>Make the data available for processing in real time by <a href="https://github.com/mumrah/kafka-python">scripts</a> written by members of our Site Reliability Engineering team.</p></li><li><p>Allow future aggregators to be built in other languages like Java, <a href="https://github.com/edenhill/librdkafka">C or C++</a>.</p></li></ol><p>After extensive testing, we selected <a href="https://kafka.apache.org/">Kafka</a> as the first stage of the log processing pipeline.</p>
    <div>
      <h4>Why Postgres?</h4>
      <a href="#why-postgres">
        
      </a>
    </div>
    <p>As we mentioned when <a href="http://www.postgresql.org/about/press/presskit93/">PostgreSQL 9.3 was released</a>, PostgreSQL has long been an important part of our stack, and for good reason.</p><p>Foreign data wrappers and other extension mechanisms make PostgreSQL an excellent platform for storing lots of data, or as a gateway to other NoSQL data stores, without having to give up the power of SQL. PostgreSQL also has great performance and documentation. Lastly, PostgreSQL has a large and active community, and we've had the privilege of meeting many of the PostgreSQL contributors at meetups held at the CloudFlare office and elsewhere, organized by <a href="http://www.meetup.com/postgresql-1/">The San Francisco Bay Area PostgreSQL Meetup Group</a>.</p>
    <div>
      <h4>Why CitusDB?</h4>
      <a href="#why-citusdb">
        
      </a>
    </div>
    <p>CloudFlare has been using PostgreSQL since day one. We trust it, and we wanted to keep using it. However, CloudFlare's data has been growing rapidly, and we were running into the limitations of a single PostgreSQL instance. Our team was tasked with scaling out our analytics database in a short time, so we started by defining the criteria that are important to us:</p><ol><li><p><b>Performance</b>: Our system powers the Customer Analytics dashboard, so typical queries need to return in less than a second even when dealing with data from many customer sites over long time periods.</p></li><li><p><b>PostgreSQL</b>: We have extensive experience running PostgreSQL in production. We also find several extensions useful, e.g., Hstore enables us to store semi-structured data and HyperLogLog (HLL) makes unique count approximation queries fast.</p></li><li><p><b>Scaling</b>: We need to dynamically scale out our cluster, both for performance and to store huge amounts of data. That is, if we realize that our cluster is becoming overutilized, we want to solve the problem by just adding new machines.</p></li><li><p><b>High availability</b>: This cluster needs to be highly available. As such, the cluster needs to automatically recover from failures like disks dying or servers going down.</p></li><li><p><b>Business intelligence queries</b>: In addition to sub-second responses for customer queries, we need to be able to perform business intelligence queries that may need to analyze billions of rows of analytics data.</p></li></ol><p>At first, we evaluated what it would take to build an application that deals with sharding on top of stock PostgreSQL. 
We investigated using the <a href="http://www.postgresql.org/docs/9.4/static/postgres-fdw.html">postgres_fdw</a> extension to provide a unified view on top of a number of independent PostgreSQL servers, but this solution did not deal well with servers going down.</p><p>Research into the major players in the PostgreSQL space indicated that CitusDB had the potential to be a great fit for us. On the performance point, they already had customers running real-time analytics with queries running in parallel across a large cluster in tens of milliseconds.</p><p>CitusDB has also maintained compatibility with PostgreSQL, not by forking the code base like other vendors, but by extending it to plan and execute distributed queries. Furthermore, CitusDB used the concept of many logical shards so that if we were to add new machines to our cluster, we could easily rebalance the shards in the cluster by calling a simple PostgreSQL user-defined function.</p><p>With CitusDB, we could replicate logical shards to independent machines in the cluster, and automatically fail over between replicas even during queries. In case of a hardware failure, we could also use the rebalance function to re-replicate shards in the cluster.</p>
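<p>As a rough illustration of what this looks like from the application's point of view, distribution and rebalancing are both plain SQL function calls. The table name below is invented, and the exact UDF names vary between CitusDB versions, so treat this as a sketch rather than the API we ran:</p>

```sql
-- Hypothetical example; the table name and exact UDF names vary by version.
-- Distribute a minutely aggregates table across the cluster, sharded by site:
SELECT master_create_distributed_table('site_aggregates_1min', 'site_id', 'hash');

-- After adding worker nodes, spread existing shards onto them:
SELECT rebalance_table_shards('site_aggregates_1min');
```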
    <div>
      <h4>CitusDB Architecture</h4>
      <a href="#citusdb-architecture">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/74Tl2hP3Tfk1HNzpF1MV4i/121b31b62700289180653b647a653edb/image00.png" />
            
            </figure><p>CitusDB follows an architecture similar to Hadoop to scale out Postgres: one primary node holds authoritative metadata about shards in the cluster and parallelizes incoming queries. The worker nodes then do all the actual work of running the queries.</p><p>In CloudFlare's case, the cluster holds about 1 million shards and each shard is replicated to multiple machines. When the application sends a query to the cluster, the primary node first prunes away unrelated shards and finds the specific shards relevant to the query. The primary node then transforms the query into many smaller queries for <a href="http://www.citusdata.com/blog/19-ozgun/114-how-to-build-your-distributed-database">parallel execution</a> and ships those smaller queries to the worker nodes.</p><p>Finally, the primary node receives intermediate results from the workers, merges them, and returns the final results to the application. This takes anywhere from 25 milliseconds to 2 seconds for queries in the CloudFlare analytics cluster, depending on whether some or all of the data is available in page cache.</p><p>From a high availability standpoint, when a worker node fails, the primary node automatically fails over to the replicas, even during a query. The primary node holds slowly changing metadata, making it a good fit for continuous backups or PostgreSQL's streaming replication feature. Citus Data is currently working on further improvements to make it easy to replicate the primary metadata to all the other nodes.</p><p>At CloudFlare, we love the CitusDB architecture because it enabled us to continue using PostgreSQL. Our analytics dashboard and BI tools connect to Citus using standard PostgreSQL connectors, and tools like <code>pg_dump</code> and <code>pg_upgrade</code> just work. Two features that stand out for us are CitusDB’s PostgreSQL extensions that power our analytics dashboards, and CitusDB’s ability to parallelize the logic in those extensions out of the box.</p>
    <div>
      <h4>Postgres Extensions on CitusDB</h4>
      <a href="#postgres-extensions-on-citusdb">
        
      </a>
    </div>
    <p>PostgreSQL extensions are pieces of software that add functionality to the core database itself. Some examples are data types, user-defined functions, operators, aggregates, and custom index types. PostgreSQL has more than 150 publicly available official extensions. We’d like to highlight two of these extensions that might be of general interest. It’s worth noting that with CitusDB all of these extensions automatically scale to many servers without any changes.</p>
    <div>
      <h4>HyperLogLog</h4>
      <a href="#hyperloglog">
        
      </a>
    </div>
    <p><a href="https://en.wikipedia.org/wiki/HyperLogLog">HyperLogLog</a> is a sophisticated algorithm developed for doing unique count approximations quickly. Since an <a href="https://github.com/aggregateknowledge/postgresql-hll">HLL implementation for PostgreSQL</a> was open sourced by the good folks at Aggregate Knowledge, we could use it with CitusDB unchanged, because CitusDB is compatible with most (if not all) Postgres extensions.</p><p>HLL was important for our application because we needed to compute unique IP counts across various time intervals in real time, and we didn’t want to store the unique IPs themselves. With this extension, we could, for example, count the number of unique IP addresses accessing a customer site in a minute, but still have an accurate count when further rolling up the aggregated data into a 1-hour aggregate.</p>
    <div>
      <h4>Hstore</h4>
      <a href="#hstore">
        
      </a>
    </div>
    <p>The <a href="http://www.postgresql.org/docs/9.4/static/hstore.html">hstore data type</a> stores sets of key/value pairs within a single PostgreSQL value. This can be helpful in various scenarios such as with rows with many attributes that are rarely examined, or to represent semi-structured data. We use the hstore data type to hold counters for sparse categories (e.g. country, HTTP status, data center).</p><p>With the hstore data type, we save ourselves from the burden of denormalizing our table schema into hundreds or thousands of columns. For example, we have one hstore data type that holds the number of requests coming in from different data centers per minute per CloudFlare customer. With millions of customers and hundreds of data centers, this counter data ends up being very sparse. Thanks to hstore, we can efficiently store that data, and thanks to CitusDB, we can efficiently parallelize queries of that data.</p><p>For future applications, we are also investigating other extensions such as the Postgres columnar store extension <a href="https://github.com/citusdata/cstore_fdw">cstore_fdw</a> that Citus Data has open sourced. This will allow us to compress and store even more historical analytics data in a smaller footprint.</p>
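<p>A brief sketch of the hstore operations involved, using standard hstore syntax (the table, column, and key names are invented for illustration):</p>

```sql
-- Hypothetical schema: one row per site per minute, sparse counters in hstore.
CREATE TABLE site_minutes (
    site_id     bigint,
    minute      timestamptz,
    dc_requests hstore  -- e.g. '"SFO"=>"1042", "AMS"=>"17"'
);

-- Increment one data center's counter; '||' merges key/value pairs:
UPDATE site_minutes
SET dc_requests = dc_requests ||
    hstore('SFO', ((coalesce(dc_requests -> 'SFO', '0'))::int + 5)::text)
WHERE site_id = 42 AND minute = date_trunc('minute', now());
```

<p>Data centers a site never receives traffic from simply never appear as keys, which is what keeps the sparse representation compact.</p>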
    <div>
      <h4>Conclusion</h4>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>CitusDB has been working very well for us as the new backend for our Customer Analytics system. We have also found many uses for the analytics data in a business intelligence context. The ease with which we can run distributed queries on the data allows us to quickly answer new questions about the CloudFlare network that arise from anyone in the company, from the SRE team through to Sales.</p><p>We are looking forward to features available in the recently released <a href="https://www.citusdata.com/citus-products/citusdb-software">CitusDB 4.0</a>, especially the performance improvements and the new shard rebalancer. We’re also excited about using the JSONB data type with CitusDB 4.0, along with all the other improvements that come standard as part of <a href="http://www.postgresql.org/docs/9.4/static/release-9-4.html">PostgreSQL 9.4</a>.</p><p>Finally, if you’re interested in building and operating distributed services like Kafka or CitusDB and writing Go as part of a dynamic team dealing with big (nay, gargantuan) amounts of data, <a href="https://www.cloudflare.com/join-our-team">CloudFlare is hiring</a>.</p> ]]></content:encoded>
            <category><![CDATA[Analytics]]></category>
            <category><![CDATA[SQL]]></category>
            <category><![CDATA[Postgres]]></category>
            <category><![CDATA[Kafka]]></category>
            <category><![CDATA[LUA]]></category>
            <category><![CDATA[DDoS]]></category>
            <guid isPermaLink="false">4WkjJAXrP1iZH5uthDDnAh</guid>
            <dc:creator>Albert Strasheim</dc:creator>
        </item>
        <item>
            <title><![CDATA[It's Go Time on Linux]]></title>
            <link>https://blog.cloudflare.com/its-go-time-on-linux/</link>
            <pubDate>Wed, 05 Mar 2014 00:00:00 GMT</pubDate>
            <description><![CDATA[ Some interesting changes related to timekeeping in the upcoming Go 1.3 release inspired us to take a closer look at how Go programs keep time with the help of the Linux kernel. Timekeeping is a complex topic and determining the current time isn’t as simple as it might seem at first glance. ]]></description>
            <content:encoded><![CDATA[ <p>Some interesting changes related to timekeeping in the upcoming Go 1.3 release inspired us to take a closer look at how Go programs keep time with the help of the Linux kernel. Timekeeping is a complex topic and determining the current time isn’t as simple as it might seem at first glance.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4VWH2QXVaqkSnUkzLetIdn/5bfae1adbf588346480801608a09be9e/The_Persistence_of_Memory.jpg" />
            
            </figure><p><a href="http://golang.org/">Go</a> running on the Linux kernel has been used to build <a href="/go-at-cloudflare/">many</a> <a href="/red-october-cloudflares-open-source-implementation-of-the-two-man-rule/">important</a> <a href="/what-weve-been-doing-with-go/">systems</a> like RRDNS (our DNS server) at CloudFlare. Accurately, precisely and efficiently determining the time is an important part of many of these systems.</p><p>To see why time is important, consider that humans have had some trouble convincing computers to keep time for them in the recent past. It’s been a bit more than a decade since we had to dust off our best COBOL programmers to tackle <a href="https://en.wikipedia.org/wiki/Year_2000_problem">Y2K</a>.</p><p>More recently, a leap second handling bug that propagated through the Network Time Protocol (NTP) took many systems offline. As we've seen in recent days, NTP is very useful for synchronizing computer clocks and/or <a href="/technical-details-behind-a-400gbps-ntp-amplification-ddos-attack">DDoSing them</a>. The leap second bug <a href="http://www.somebits.com/weblog/tech/bad/leap-second-2012.html">received</a> <a href="http://www.datastax.com/dev/blog/linux-cassandra-and-saturdays-leap-second-problem">extensive</a> <a href="http://arstechnica.com/business/2012/07/one-day-later-the-leap-second-v-the-internet-scorecard/">coverage</a>. <a href="http://googleblog.blogspot.com/2011/09/time-technology-and-leaping-seconds.html">Google was ready</a> but many other popular sites were taken offline.</p><p>We also have the <a href="https://en.wikipedia.org/wiki/Year_2038_problem">Year 2038 problem</a> to look forward to. Hopefully there will still be a few engineers around then that remember what this 32-bit thing was all about.</p>
    <div>
      <h2>Time in Go</h2>
      <a href="#time-in-go">
        
      </a>
    </div>
    <p>Everything starts with the <a href="http://golang.org/pkg/time/">time</a> package that is part of Go’s standard library. The time package provides types for <a href="http://golang.org/pkg/time/#Time">Time</a>, <a href="http://golang.org/pkg/time/#Duration">Duration</a>, <a href="http://golang.org/pkg/time/#Ticker">Ticker</a>, <a href="http://golang.org/pkg/time/#Timer">Timer</a> and various utility functions for manipulating these types.</p><p>The most commonly used function in this package is probably the <a href="http://golang.org/pkg/time/#Now">time.Now</a> function, which returns the current time as a Time struct. The Time struct has 3 fields:</p>
            <pre><code>type Time struct {
	sec  int64
	nsec uintptr
	loc  *Location
}</code></pre>
            <p><a href="http://golang.org/pkg/time/#Location">Location</a> contains the timezone information for the time.</p><p><a href="http://golang.org/pkg/time/#Duration">Duration</a> is used to express the difference between two Times and to configure timers and tickers.</p><p><a href="http://golang.org/pkg/time/#Timer">Timer</a> is useful for implementing a timeout, typically as part of a <a href="http://golang.org/ref/spec#Select_statements">select</a> statement. <a href="http://golang.org/pkg/time/#Ticker">Ticker</a> can be used to wake up periodically, usually when you are using select in a for loop.</p><p>Go’s time package is used in many other places in the Go standard library. When dealing with socket connections that may go slow or stop sending data completely, one uses the SetDeadline functions that are part of the <a href="http://golang.org/pkg/net/#Conn">net.Conn</a> interface.</p><p>We love writing tests at CloudFlare, and having unit tests that include some kind of random component can turn up interesting issues. You can use the current time to seed random number generators in tests, using:</p>
            <pre><code>rand.New(rand.NewSource(time.Now().UnixNano()))</code></pre>
            <p>If you’re generating random numbers for a secure application, you really want to be using the <a href="http://golang.org/pkg/crypto/rand/">crypto/rand</a> package. Interestingly, even the initialization of <a href="http://golang.org/pkg/crypto/rand/#pkg-variables">crypto/rand.Reader</a> <a href="http://golang.org/src/pkg/crypto/rand/rand_unix.go#L114">incorporates</a> the current time.</p><p>The current time also features when one <a href="http://golang.org/src/pkg/log/log.go#L131">logs</a> something using the <a href="http://golang.org/pkg/log">log</a> package.</p><p>A very useful service called <a href="https://sourcegraph.com/code.google.com/p/go/symbols/go/code.google.com/p/go/src/pkg/time/Now">Sourcegraph</a> turns up more than 6000 examples where time.Now is used. For example, the <a href="http://camlistore.org/">Camlistore</a> code base calls time.Now in about 130 different places.</p><p>With time.Now being as pervasive as it is, have you ever wondered how it works? Time to dive deeper.</p>
    <div>
      <h2>System calls</h2>
      <a href="#system-calls">
        
      </a>
    </div>
    <p>The most important change to the way in which Go programs keep time on Linux was committed on 8 November 2012 in <a href="https://code.google.com/p/go/source/detail?r=42c8d3aadc40">changeset 42c8d3aadc40</a>. Let’s analyse the commit message and the code for some clues:</p>
            <pre><code>runtime: use vDSO clock_gettime for time.now &amp; runtime.nanotime
on Linux/amd64. Performance improvement aside, time.Now() now 
gets real nanosecond resolution on supported systems.</code></pre>
            <p>To understand this commit message better, we first need to review the system calls available on Linux for obtaining the value of the clock.</p><p>In the beginning, there were <a href="http://man7.org/linux/man-pages/man2/time.2.html">time</a> and <a href="http://man7.org/linux/man-pages/man2/settimeofday.2.html">gettimeofday</a>, which existed in SVr4 and 4.3BSD and were described in POSIX.1-2001. time returns the number of seconds since the <a href="https://en.wikipedia.org/wiki/Unix_time">Unix epoch</a>, 1970-01-01 00:00:00 UTC, and is defined in C as:</p>
            <pre><code>time_t time(time_t *t)</code></pre>
            <p>time_t is 4 bytes on 32-bit platforms and 8 bytes on 64-bit platforms, hence the Y2038 problem mentioned above.</p><p>gettimeofday returns the number of seconds and microseconds since the epoch and is defined in C as:</p>
            <pre><code>int gettimeofday(struct timeval *tv, struct timezone *tz)</code></pre>
            <p>gettimeofday populates a <code>struct timeval</code>, which has the following fields:</p>
            <pre><code>struct timeval {
	time_t tv_sec; /* seconds */
	suseconds_t tv_usec; /* microseconds */
}</code></pre>
            <p>gettimeofday yields timestamps that only have microsecond precision. POSIX.1-2008 marks gettimeofday as obsolete, recommending the use of <a href="http://man7.org/linux/man-pages/man2/clock_gettime.2.html">clock_gettime</a> instead, which is defined in C as:</p>
            <pre><code>int clock_gettime(clockid_t clk_id, struct timespec *tp)</code></pre>
            <p>clock_gettime populates a <code>struct timespec</code>, which has the following fields:</p>
            <pre><code>struct timespec {
	time_t tv_sec; /* seconds */
	long tv_nsec; /* nanoseconds */
}</code></pre>
            <p>clock_gettime can yield timestamps that have nanosecond precision. The clock ID parameter determines the type of clock to use. Of interest to us are:</p><ul><li><p><code>CLOCK_REALTIME</code>: a system-wide clock that measures real time. This clock is affected by discontinuous jumps in the system time and by incremental adjustments made using the <a href="http://man7.org/linux/man-pages/man3/adjtime.3.html">adjtime</a> function or NTP.</p></li><li><p><code>CLOCK_MONOTONIC</code>: a clock that represents monotonic time since some unspecified starting point. This clock is not affected by discontinuous jumps in the system time, but is affected by adjtime and NTP.</p></li><li><p><code>CLOCK_MONOTONIC_RAW</code>: similar to <code>CLOCK_MONOTONIC</code>, but not subject to adjustment by adjtime or NTP.</p></li></ul>
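<p>These clocks can be sampled directly from Go via the raw system call interface. The sketch below is Linux-only and deliberately takes the slow path (a real system call, not the fast path the rest of this post is about); the clock ID constants are copied from the kernel headers because the <code>syscall</code> package does not export them:</p>

```go
package main

import (
	"fmt"
	"syscall"
	"unsafe"
)

// Clock IDs from <linux/time.h>.
const (
	clockRealtime     = 0
	clockMonotonic    = 1
	clockMonotonicRaw = 4
)

// clockGettime makes the raw clock_gettime system call on Linux,
// bypassing glibc and the vDSO.
func clockGettime(clockid uintptr) (syscall.Timespec, error) {
	var ts syscall.Timespec
	_, _, errno := syscall.Syscall(syscall.SYS_CLOCK_GETTIME,
		clockid, uintptr(unsafe.Pointer(&ts)), 0)
	if errno != 0 {
		return ts, errno
	}
	return ts, nil
}

func main() {
	for name, id := range map[string]uintptr{
		"CLOCK_REALTIME":      clockRealtime,
		"CLOCK_MONOTONIC":     clockMonotonic,
		"CLOCK_MONOTONIC_RAW": clockMonotonicRaw,
	} {
		ts, err := clockGettime(id)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%-20s %d.%09d\n", name, ts.Sec, ts.Nsec)
	}
}
```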
    <div>
      <h2>time.Now</h2>
      <a href="#time-now">
        
      </a>
    </div>
    <p>With this background we can look at the code for time.Now. (Hint: click on the <a href="http://golang.org/pkg/time/#Now">function name</a> in godoc to look at the code yourself.)</p><p>The time.Now function is implemented using a function called now that is internal to package time, but is actually provided by the Go runtime. In other words, there is no code for the function in package time itself.</p><p>Let’s take a closer look at the Linux implementations for the <a href="http://golang.org/src/pkg/runtime/sys_linux_386.s#L107">386</a> and <a href="http://golang.org/src/pkg/runtime/sys_linux_amd64.s#L104">amd64</a> platforms. We see that these functions are implemented in assembler and call a function to retrieve the current time. You might have been expecting to see a system call, i.e. an <code>INT 0x80</code> instruction on 386 or the <code>SYSCALL</code> instruction on amd64, to the kernel at this point, but Go does something much more interesting on Linux.</p><p>The Linux kernel provides Virtual Dynamically linked Shared Objects (vDSO) as a way for user space applications to make low-overhead calls to functions that would normally involve a system call to the kernel.</p><p>If you’re writing your application in a language that uses glibc, you are probably already getting your time via vDSO. Go doesn’t use glibc, so it has to implement this functionality in its runtime. The relevant code is in <a href="http://golang.org/src/pkg/runtime/vdso_linux_amd64.c">vdso_linux_amd64.c</a> in the <a href="http://golang.org/pkg/runtime/">runtime package</a>.</p><p>Finally, if you’re the kind of person that likes to stare into the bowels of your operating system, here’s the <a href="http://lxr.free-electrons.com/source/arch/x86/vdso/vclock_gettime.c">kernel side of the vDSO</a>.</p><p>vDSO support for time functions is currently 64-bit only, but a <a href="http://lwn.net/Articles/583963/">kernel patch</a> is in the works to add them on 32-bit platforms. 
When this happens, the Go runtime code will have to be updated to take advantage of this.</p>
    <div>
      <h2>Benchmarks</h2>
      <a href="#benchmarks">
        
      </a>
    </div>
    <p>When frequently calling a function to determine the time, you may be interested to know how long it takes to return. In other words, how now is “now” really? For benchmarking purposes, you can make these system calls directly from Go code. We have prepared <a href="https://github.com/cloudflare/autobench/commit/7d1effaf1fe0669ac28ee7ebe67216ee95f8a1b5">a patch</a> to Dave Cheney’s excellent <a href="https://github.com/cloudflare/autobench">autobench</a> project so that you may benchmark these system calls and other time-related functions yourself.</p><p>Benchmarks can also help us measure the time saved by calling gettimeofday and clock_gettime via the vDSO mechanism instead of the traditional system call path.</p><p>We’ll also use autobench to compare the performance of different versions of Go for the same set of time functions.</p><p>All benchmarks numbers below were obtained on an Intel Core i7-3540M CPU running at its maximum clock speed of 3 GHz. The CPU frequency scaling governor was set to performance mode to ensure reliable benchmark results.</p><p>We’ll use the Go 1.2 stable release as a baseline.</p><p>BenchmarkSyscallTime and BenchmarkVDSOTime measure the time it takes to make a time system call and vDSO call, respectively:</p>
            <pre><code>BenchmarkSyscallTime 38.2 ns/op
BenchmarkVDSOTime    3.85 ns/op</code></pre>
            <p>BenchmarkSyscallGettimeofday and BenchmarkVDSOGettimeofday measure the time it takes to make a gettimeofday system call and vDSO call, respectively:</p>
            <pre><code>BenchmarkSyscallGettimeofday 59.3 ns/op
BenchmarkVDSOGettimeofday    23.4 ns/op</code></pre>
            <p>BenchmarkTimeNow measures the time it takes to call time.Now, which makes an underlying vDSO call to <code>clock_gettime(CLOCK_REALTIME)</code> and converts the returned value to a time.Time struct:</p>
            <pre><code>BenchmarkTimeNow 23.6 ns/op</code></pre>
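            <p>If you want to reproduce the time.Now measurement without the full autobench setup, Go’s testing package can run a benchmark directly from an ordinary program. A minimal sketch (the numbers you see will vary with your CPU and clock source):</p>

```go
package main

import (
	"fmt"
	"testing"
	"time"
)

// benchmarkTimeNow times repeated calls to time.Now using the standard
// testing package, which picks a suitable iteration count b.N on its own.
func benchmarkTimeNow() testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			_ = time.Now()
		}
	})
}

func main() {
	res := benchmarkTimeNow()
	fmt.Printf("BenchmarkTimeNow %d iterations, %d ns/op\n", res.N, res.NsPerOp())
}
```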
            <p>Using autobench, we can also compare different Go versions against each other. To see how far we've come in the last few years, we also compared Go 1.2 to Go 1.0.3, which was released on Sep 27, 2012. The major difference was in the benchmark for time.Now:</p>
            <pre><code>benchmark        old ns/op  new ns/op  delta
BenchmarkTimeNow 406        23         -94.19%</code></pre>
            <p>To repeat this test with Go 1.0.3, you'll need the fix in <a href="https://code.google.com/p/go/source/detail?r=419dcca62a3d">changeset 419dcca62a3d</a> to compile if you are using a recent version of GCC.</p>
    <div>
      <h2>Clock Sources</h2>
      <a href="#clock-sources">
        
      </a>
    </div>
    <p>The speed at which you can tell time also depends on the clock source being used by your kernel. To see which clock sources you have available, run the following command:</p>
            <pre><code>$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc hpet acpi_pm</code></pre>
            <p>To see which clock source is currently in use, run this command:</p>
            <pre><code>$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc</code></pre>
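            <p>If a program needs to know the active clock source, it can read the same sysfs file. A small sketch that degrades gracefully when sysfs is unavailable (the path is Linux-specific):</p>

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// currentClocksource reads the kernel's active clock source from sysfs,
// returning "unknown" when the file is unavailable (non-Linux systems,
// restricted containers, and so on).
func currentClocksource() string {
	b, err := os.ReadFile("/sys/devices/system/clocksource/clocksource0/current_clocksource")
	if err != nil {
		return "unknown"
	}
	return strings.TrimSpace(string(b))
}

func main() {
	fmt.Println("clock source:", currentClocksource())
}
```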
            <p><a href="https://en.wikipedia.org/wiki/Time_Stamp_Counter">Time Stamp Counter (TSC)</a> is a 64-bit register that can be read by the RDTSC instruction. It is faster to read than the <a href="https://en.wikipedia.org/wiki/High_Precision_Event_Timer">High Precision Event Timer (HPET)</a>. The ACPI Power Management Timer (ACPI PMT) is another timer that is found on many motherboards.</p><p>We can also use the same benchmarks above to compare the TSC clock source and an HPET clock source. Doing so requires booting Linux with the <code>clocksource=hpet</code> kernel command line parameter. Here are the results:</p>
            <pre><code>benchmark                              tsc ns/op   hpet ns/op  delta
BenchmarkSyscallGettimeofday              59          645      +987.69%
BenchmarkVDSOGettimeofday                 23          598      +2455.56%
BenchmarkSyscallClockGettimeRealtime      58          642      +995.56%
BenchmarkSyscallClockGettimeMonotonic     57          641      +1012.85%
BenchmarkTimeNow                          23          598      +2433.90%</code></pre>
            <p>As you can see, querying the HPET clock source takes significantly longer.</p><p>Not all CPUs are created equal. To see which TSC features your CPU supports, run the following command:</p>
            <pre><code>$ cat /proc/cpuinfo | grep tsc</code></pre>
            <p>You will see some or all of the following CPU flags related to the TSC:</p><ul><li><p>tsc</p></li><li><p>rdtscp</p></li><li><p>constant_tsc</p></li><li><p>nonstop_tsc</p></li></ul><p>The <i>tsc</i> flag indicates that your CPU has the 64-bit TSC register, which has been present since the Pentium.</p><p>The <i>rdtscp</i> flag indicates that your CPU supports the newer RDTSCP instruction, in addition to the RDTSC instruction. Intel has an interesting whitepaper on <a href="http://www.intel.com/content/www/us/en/intelligent-systems/embedded-systems-training/ia-32-ia-64-benchmark-code-execution-paper.html">How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures</a> with more details about the differences.</p><p>The <i>constant_tsc</i> flag indicates that the TSC runs at constant frequency irrespective of the current frequency, voltage or throttling state of the CPU, commonly referred to as its P- and T-state.</p><p>The <i>nonstop_tsc</i> flag indicates that TSC does not stop, irrespective of the CPU’s power saving mode, referred to as its C-state.</p><p>These features work in conjunction to provide an invariant TSC. There is more discussion over at the <a href="http://software.intel.com/en-us/forums/topic/280440">Intel Software forums</a> if you’re interested.</p>
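    <p>Since the flags line in /proc/cpuinfo is plain text, checking for these TSC features programmatically is just a matter of string parsing. A small sketch, using a hardcoded sample line rather than reading the real file:</p>

```go
package main

import (
	"fmt"
	"strings"
)

// tscFlags extracts the TSC-related CPU flags from /proc/cpuinfo content.
// The input here is a sample string; on a real system you would read
// /proc/cpuinfo instead.
func tscFlags(cpuinfo string) []string {
	want := map[string]bool{
		"tsc": true, "rdtscp": true,
		"constant_tsc": true, "nonstop_tsc": true,
	}
	var found []string
	for _, line := range strings.Split(cpuinfo, "\n") {
		if !strings.HasPrefix(line, "flags") {
			continue
		}
		for _, f := range strings.Fields(line) {
			if want[f] {
				found = append(found, f)
			}
		}
		break // the flags line is the same for every CPU entry
	}
	return found
}

func main() {
	sample := "flags\t\t: fpu tsc msr rdtscp constant_tsc nonstop_tsc"
	fmt.Println(tscFlags(sample)) // [tsc rdtscp constant_tsc nonstop_tsc]
}
```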
    <div>
      <h2>Coming up in Go 1.3</h2>
      <a href="#coming-up-in-go-1-3">
        
      </a>
    </div>
    <p>There has also been some interesting work on the time front in the upcoming Go 1.3 release. The time.Sleep function, Ticker, and Timer now use <code>clock_gettime(CLOCK_MONOTONIC)</code> on Linux and other platforms. This work has been a good example of the broader community contributing to improve the core of Go, as can be seen in <a href="https://code.google.com/p/go/issues/detail?id=6007">issue 6007</a>, <a href="https://codereview.appspot.com/53010043/">CL 53010043</a> and <a href="https://groups.google.com/d/topic/golang-dev/gVFa7DC8UI0/discussion">discussion on golang-dev</a>.</p>
    <div>
      <h2>Further Reading</h2>
      <a href="#further-reading">
        
      </a>
    </div>
    <p>That’s it for today. If kernels and clocks excite you, CloudFlare is always looking to hire great Go and Linux kernel engineers. See our <a href="https://www.cloudflare.com/join-our-team">Careers</a> page.</p><p>It would not have been possible to write this article without the help from some other sources. If you want to know more, here’s some recommended reading:</p><p>The book <a href="http://www.amazon.com/Understanding-Linux-Kernel-Third-Edition/dp/0596005652">Understanding the Linux Kernel</a> by Daniel Bovet and Marco Cesati has an entire chapter on the timekeeping architecture in the kernel.</p><p>More about timers <a href="http://stackoverflow.com/questions/10921210/cpu-tsc-fetch-operation-especially-in-multicore-multi-processor-environment">on StackOverflow</a></p><p>Read more about <a href="http://www.linuxjournal.com/content/creating-vdso-colonels-other-chicken">vDSOs</a>.</p><p>Read more about <a href="http://juliusdavies.ca/posix_clocks/clock_realtime_linux_faq.html">clock_gettime</a>.</p><p>Sometimes bug reports can be a great source of information on details of the kernel and your CPU. <a href="https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=474091">Red Hat Bugzilla #474091</a> was a gold mine of information on the CPU flags for the TSC.</p><p>More at Quora on the benefits of <a href="http://www.quora.com/Linux/What-are-the-advantages-and-disadvantages-of-TSC-and-HPET-as-a-clocksource">TSC vs HPET</a>.</p><p>StackOverflow also had some information on the <a href="https://stackoverflow.com/questions/7987671/what-is-the-acpi-pm-linux-clocksource-used-for-what-hardware-implements-it">ACPI PM clock source</a>.</p><p>Finally, some discussion of the <a href="https://twistedmatrix.com/trac/ticket/2424#comment:23">differences between CLOCK_MONOTONIC and CLOCK_MONOTONIC_RAW</a> as used in the Twisted framework for Python.</p> ]]></content:encoded>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Go]]></category>
            <category><![CDATA[DDoS]]></category>
            <category><![CDATA[Google]]></category>
            <category><![CDATA[Attacks]]></category>
            <guid isPermaLink="false">7xktWOfH0yiQAj7YiLjfh1</guid>
            <dc:creator>Albert Strasheim</dc:creator>
        </item>
    </channel>
</rss>