Using Apache Kafka to process 1 trillion inter-service messages
July 19, 2022 2:00PM
We learnt a lot about Kafka on the way to 1 trillion messages and built some interesting internal tools to ease adoption, which we explore in this blog post...
A Byzantine failure in the real world
November 27, 2020 12:00PM
Post Mortem
API
Postgres
Outage
Engineering
When we review design documents at Cloudflare, we are always on the lookout for Single Points of Failure (SPOFs). In this post, we present a timeline of a real-world incident and show how an interesting failure mode known as a Byzantine fault played a role in a cascading series of events...
Cloudflare outage on July 17, 2020
July 18, 2020 2:22AM
Post Mortem
Outage
Engineering
Today a configuration error in our backbone network caused an outage for Internet properties and Cloudflare services that lasted 27 minutes. We saw traffic drop by about 50% across our network...
Using data science and machine learning for improved customer support
June 15, 2020 12:00PM
Machine Learning
Data
Support
Engineering
In this blog post we’ll explore three data science tricks that helped us solve real problems for our customer support group and our customers: two for natural language processing in a customer support context and one for identifying Internet attack traffic...
Helping sites get back online: the origin monitoring intern project
April 13, 2020 12:00PM
Internship Experience
Monitoring
Engineering
Life @ Cloudflare
Over the course of ten weeks, our team of three interns (two engineering, one product management) went from a problem statement to a new feature, which is still working in production for all Cloudflare customers...
Internship Experience: Cryptography Engineer
April 09, 2020 12:00PM
Back in the summer of 2017 I was an intern at Cloudflare. During the academic year I was a grad student at Berkeley working on automorphic forms and computational Langlands...