Keeping the Cloudflare API 'all green' using Python-based testing
March 07, 2023 7:00PM
Scout is an automated system that provides constant end-to-end testing and monitoring of live APIs across different environments and resources. It does so by periodically running self-explanatory Python tests...
Using Apache Kafka to process 1 trillion inter-service messages
July 19, 2022 2:00PM
Engineering
Kafka
Productivity
We learnt a lot about Kafka on the way to 1 trillion messages, and built some interesting internal tools to ease adoption that will be explored in this blog post...
A Byzantine failure in the real world
November 27, 2020 12:00PM
Post Mortem
API
Postgres
Outage
Engineering
When we review design documents at Cloudflare, we are always on the lookout for Single Points of Failure (SPOFs). In this post, we present a timeline of a real-world incident, and how an interesting failure mode known as a Byzantine fault played a role in a cascading series of events....
Cloudflare outage on July 17, 2020
July 18, 2020 2:22AM
Post Mortem
Outage
Engineering
Today a configuration error in our backbone network caused an outage for Internet properties and Cloudflare services that lasted 27 minutes. We saw traffic drop by about 50% across our network....
Using data science and machine learning for improved customer support
June 15, 2020 12:00PM
Machine Learning
Data
Support
Engineering
In this blog post we’ll explore three data science tricks that helped us solve real problems for our customer support group and our customers: two for natural language processing in a customer support context and one for identifying Internet attack traffic....
Helping sites get back online: the origin monitoring intern project
April 13, 2020 12:00PM
Over the course of ten weeks, our team of three interns (two engineering, one product management) went from a problem statement to a new feature, which is still working in production for all Cloudflare customers....