Ivan Babrou

Keepalives considered harmful

2020-03-19

You’d think keepalives would always be helpful, but it turns out reality isn’t always what you expect it to be. It really helps if you read “Why does one NGINX worker take all the load?” first....

Introducing ebpf_exporter

2018-08-24

Speed & Reliability, eBPF, Linux, Programming

Here at Cloudflare we use Prometheus to collect operational metrics. We run it on hundreds of servers and ingest millions of metrics per second to get insight into our network and provide the best possible service to our customers....

Tracing System CPU on Debian Stretch

2018-05-13

Speed & Reliability, Kafka, eBPF, Linux

How an innocent OS upgrade triggered a cascade of issues and forced us into tracing Linux networking internals....

Squeezing the firehose: getting the most from Kafka compression

2018-03-05

Compression, Speed & Reliability, Kafka

How Cloudflare was able to save hundreds of gigabits of network bandwidth and terabytes of storage from Kafka....

Manage Cloudflare records with Salt

2016-12-14

GitHub, API, DNS, Reliability, Salt

We use Salt to manage our ever-growing global fleet of machines. Salt is great for managing configurations and being the source of truth. We use it for remote command execution and for network automation tasks....

Debugging war story: the mystery of NXDOMAIN

2016-12-07

DNS, Reliability

The following blog post describes a debugging adventure on Cloudflare's Mesos-based cluster. This internal cluster is primarily used to process log file information so that Cloudflare customers have analytics, and for our systems that detect and respond to attacks....
