This is part 2 of a six part series based on a talk I gave in Trento, Italy. Part 1 is here.
It’s always best to speak plainly and honestly about the situation you are in. Or as Matthew Prince likes to put it “Panic Early”. Long ago I started a company in Silicon Valley which had the most beautiful code. We could have taught a computer science course from the code base. But we had hardly any customers and we failed to “Panic Early” and not face up to the fact that our market was too small.
Ironically, the CEO of that company used to tell people “Get bad news out fast”. This is a good maxim to live by, if you have bad news then deliver it quickly and clearly. If you don’t the bad news won’t go away, and the situation will likely get worse.
Cloudbleed
Cloudflare had a very, very serious security problem back in 2017. This problem became known as Cloudbleed. We had, without knowing it, been leaking memory from inside our machines into responses returned to web browsers. And because our machines are shared across millions of web sites, that meant that HTTP requests containing potentially very sensitive information could have been leaked.
Worse this information was being cached by search engines. So, anyone could go to Google or Bing or Baidu and look for sensitive information just by knowing a few keywords. Luckily, for us, Google’s Project Zero discovered that we were leaking by looking at Google’s crawler cache. They informed us and we were quickly able to stop the leak.
But that didn’t diminish the fact that private information (much of which would have been transmitted encrypted) had been cached by search engines. Although we stopped the leak within 45 minutes the cleanup task was massive. It was massive firstly because we had to find what had been leaked and secondly because we had to find all the search engines with caches and somehow ask them to delete cached data.
None of the search engines had handled something like this before. We were asking for mass deletion of data and it took a long time (at least it felt like a long time) to get to the right people and start to get cached data deleted.
From the very first night of Cloudbleed I started collecting information to be able to write the public disclosure. Ultimately, when Project Zero wanted to go public, we were ready with a long, transparent blog post on the subject and were able to talk about it.
It was, by far, the most difficult week of my career. Firstly, we had the bug itself, secondly, we had the cleanup, and then we had to tell people what had happened. Throughout that week I barely slept (and I am not exaggerating) and a large team of people across Cloudflare in the US, UK and elsewhere kept in contact constantly. We learnt that it’s possible to keep a Google Hangout between two offices running, literally, for days without interruption.
Known Unknowns
The hardest thing was that we seriously did not know, at the beginning, whether Cloudflare would survive. Right at the start it looked terrible, it was terrible, and we had two questions: “What private data has actually been leaked and cached?” and “Did anyone find this and actively exploit it?”.
We answered both by extensive searching and collating of information from search engines. Ultimately, myself and others called customers and spoke to them on the phone. We were able to tell them what we’d found and statistically what was likely to have leaked.
The second question was answered by looking for evidence of exploitation in our logging systems. But there was something very tricky: Cloudflare had long limited the amount of data it logs for privacy reasons. So, we had to dig into statistical analysis of all sorts of data (crash rates, saved core dumps, errors in Sentry, sampled data) to look for exploitation.
We split into separate teams to look for different evidence and only myself and Matthew Prince knew what each team was seeing. We did that because we didn’t want one team to influence another. We wanted to be sure that we were right before publishing our second blog with more detailed information.
We didn’t find evidence of exploitation. And while serious, the data cached in search engines was found to contain little really private information. But it was very, very serious and we all knew that this could have been worse.
Although I look back at those two weeks as the worst of my career, to quote Charles Dickens: “It was the best of times, it was the worst of times”. Most of the company didn’t know Cloudbleed had happened until we went public. The morning it became public I showered very early and took a cab to the office.
Normally, the office is quite quiet in the morning and I was stunned to walk into an office full of people. People who asked me “What can we do?”. It was an incredible feeling. We printed a large poster of Winston Churchill staring down at the team saying, “If you’re going through Hell, keep going!”. Everyone pitched in.
In the middle of it someone from the press, the BBC I think, asked me if I’d changed any passwords because of Cloudbleed. I said I had not. And that was true. I didn’t change anything personally. But in the middle of that firestorm I took a lot of criticism from armchair critics for that.
Although terrible, Cloudbleed reinforced the culture of Cloudflare: openness and helping others. We were all in together and we got through it. And our customers saw that: we didn’t lose major customers, in fact, we gained customers who told us “We want to work with you because you were so open”.
Helping to Build Cloudflare
Part 2: The Most Difficult Fortnight (you are here)