Talk Transcript: How Cloudflare Thinks About Security

This is the text I used for a talk at artificial intelligence powered translation platform, Unbabel, in Lisbon on September 25, 2019.

Bom dia. Eu sou John Graham-Cumming o CTO do Cloudflare. E agora eu vou falar em inglês.

Thanks for inviting me to talk about Cloudflare and how we think about security. I’m about to move to Portugal permanently so I hope I’ll be able to do this talk in Portuguese in a few months.

I know that most of you don’t have English as a first language so I’m going to speak a little more deliberately than usual. And I’ll make the text of this talk available for you to read.

But there are no slides today.

I’m going to talk about how Cloudflare thinks about internal security, how we protect ourselves and how we secure our day to day work. This isn’t a talk about Cloudflare’s products.

Culture

Let’s begin with culture.

Many companies have culture statements. I think almost 100% of these are pure nonsense. Culture is how you act every day, not words written in the wall.

One significant piece of company culture is the internal Security Incident mailing list which anyone in the company can send a message to. And they do! So far this month there have been 55 separate emails to that list reporting a security problem.

These mails come from all over the company, from every department. Two to three per day. And each mail is investigated by the internal security team. Each mail is assigned a Security Incident issue in our internal Atlassian Jira instance.

People send: reports that their laptop or phone has been stolen (their credentials get immediately invalidated), suspicions about a weird email that they’ve received (it might be phishing or malware in an attachment), a concern about physical security (for example, someone wanders into the office and starts asking odd questions), that they clicked on a bad link, that they lost their access card, and, occasionally, a security concern about our product.

Things like stolen or lost laptops and phones happen way more often than you’d imagine. We seem to lose about two per month. For that reason and many others we use full disk encryption on devices, complex passwords and two factor auth on every service employees need to access. And we discourage anyone storing anything on their laptop and ask them to primarily use cloud apps for work. Plus we centrally manage machines and can remote wipe.

We have a 100% blame free culture. You clicked on a weird link? We’ll help you. Lost your phone? We’ll help you. Think you might have been phished? We’ll help you.

This has led to a culture of reporting problems, however minor, when they occur. It’s our first line of internal defense.

Just this month I clicked on a link that sent my web browser crazy hopping through redirects until I ended up at a bad place. I reported that to the mailing list.

I’ve never worked anywhere with such a strong culture of reporting security problems big and small.

Hackers

We also use HackerOne to let people report security problems from the outside. This month we’ve received 14 reports of security problems. To be honest, most of what we receive through HackerOne is very low priority. People run automated scanning tools and report the smallest of configuration problems, or, quite often, things that they don’t understand but that look like security problems to them. But we triage and handle them all.

And people do on occasion report things that we need to fix.

We also have a private paid bug bounty program where we work with a group of individual hackers (around 150 right now) who get paid for the vulnerabilities that they’ve found.

We’ve found that this combination of a public responsible disclosure program and then a private paid program is working well. We invite the best hackers who come in through the public program to work with us closely in the private program.

Identity

So, that’s all about people, internal and external, reporting problems, vulnerabilities, or attacks. A very short step from that is knowing who the people are.

And that’s where identity and authentication become critical. In fact, as an industry trend identity management and authentication are one of the biggest areas of spending by CSOs and CISOs. And Cloudflare is no different.

OK, well it is different, instead of spending a lot of identity and authentication we’ve built our own solutions.

We did not always have good identity practices. In fact, for many years our systems had different logins and passwords and it was a complete mess. When a new employee started accounts had to be made on Google for email and calendar, on Atlassian for Jira and Wiki, on the VPN, on the WiFi network and then on a myriad of other systems for the blog, HR, SSH, build systems, etc. etc.

And when someone left all that had to be undone. And frequently this was done incorrectly. People would leave and accounts would still be left running for a period of time. This was a huge headache for us and is a huge headache for literally every company.

If I could tell companies one thing they can do to improve their security it would be: sort out identity and authentication. We did and it made things so much better.

This makes the process of bringing someone on board much smoother and the same when they leave. We can control who accesses what systems from a single control panel.

I have one login via a product we built called Cloudflare Access and I can get access to pretty much everything. I looked in my LastPass Vault while writing this talk and there are a total of just five username and password combination and two of those needed deleting because we’ve migrated those systems to Access.

So, yes, we use password managers. And we lock down everything with high quality passwords and two factor authentication. Everyone at Cloudflare has a Yubikey and access to TOTP (such as Google Authenticator). There are three golden rules: all passwords should be created by the password manager, all authentication has to have a second factor and the second factor cannot be SMS.

We had great fun rolling out Yubikeys to the company because we did it during our annual retreat in a single company wide sitting. Each year Cloudflare gets the entire company together (now over 1,000 people) in a hotel for two to three days of working together, learning from outside experts and physical and cultural activities.

Last year the security team gave everyone a pair of physical security tokens (a Yubikey and a Titan Key from Google for Bluetooth) and in an epic session configured everyone’s accounts to use them.

Note: do not attempt to get 500 people to sync Bluetooth devices in the same room at the same time. Bluetooth cannot cope.

Another important thing we implemented is automatic timeout of access to a system. If you don’t use access to a system you lose it. That way we don’t have accounts that might have access to sensitive systems that could potentially be exploited.

Openness

To return to the subject of Culture for a moment an important Cloudflare trait is openness.

Some of you may know that back in 2017 Cloudflare had a horrible bug in our software that became called Cloudbleed. This bug leaked memory from inside our servers into people’s web browsing. Some of that web browsing was being done by search engine crawlers and ended up in the caches of search engines like Google.

We had to do two things: stop the actual bug (this was relatively easy and was done in under an hour) and then clean up the equivalent of an oil spill of data. That took longer (about a week to ten days) and was very complicated.

But from the very first night when we were informed of the problem we began documenting what had happened and what were doing. I opened an EMACS buffer in the dead of night and started keeping a record.

That record turned into a giant disclosure blog post that contained the gory details of the error we made, its consequences and how we reacted once the error was known.

We followed up a few days later with a further long blog post assessing the impact and risk associated with the problem.

This approach to being totally open ended up being a huge success for us. It increased trust in our product and made people want to work with us more.

I was on my way to Berlin to give a talk to a large retailer about Cloudbleed when I suddenly realized that the company I was giving the talk at was NOT a customer. And I asked the salesperson I was with what I was doing.

I walked in to their 1,000 person engineering team all assembled to hear my talk. Afterwards the VP of Engineering thanked me saying that our transparency had made them want to work with us rather than their current vendor. My talk was really a sales pitch.

Similarly, at RSA last year I gave a talk about Cloudbleed and a very large company’s CSO came up and asked to use my talk internally to try to encourage their company to be so open.

When on July 2 this year we had an outage, which wasn’t security related, we once again blogged in incredible detail about what happened. And once again we heard from people about how our transparency mattered to them.

The lesson is that being open about mistakes increases trust. And if people trust you then they’ll tend to tell you when there are problems. I get a ton of reports of potential security problems via Twitter or email.

Change

After Cloudbleed we started changing how we write software. Cloudbleed was caused, in part, by the use of memory-unsafe languages. In that case it was C code that could run past the end of a buffer.

We didn’t want that to happen again and so we’ve prioritized languages where that simply cannot happen. Such as Go and Rust. We were very well known for using Go. If you’ve ever visited a Cloudflare website, or used an app (and you have because of our scale) that uses us for its API then you’ve first done a DNS query to one of our servers.

That DNS query will have been responded to by a Go program called RRDNS.

There’s also a lot of Rust being written at Cloudflare and some of our newer products are being created using it. For example, Firewall Rules which do arbitrary filtering of requests to our customers are handled by a Rust program that needs to be low latency, stable and secure.

Security is a company wide commitment

The other post-Cloudbleed change was that any crashes on our machines came under the spotlight from the very top. If a process crashes I personally get emailed about it. And if the team doesn’t take those crashes seriously they get me poking at them until they do.

We missed the fact that Cloudbleed was crashing our machines and we won’t let that happen again. We use Sentry to correlate information about crashes and the Sentry output is one of the first things I look at in the morning.

Which, I think, brings up an important point. I spoke earlier about our culture of “If you see something weird, say something” but it’s equally important that security comes from the top down.

Our CSO, Joe Sullivan, doesn’t report to me, he reports to the CEO. That sends a clear message about where security sits in the company. But, also, the security team itself isn’t sitting quietly in the corner securing everything.

They are setting standards, acting as trusted advisors, and helping deal with incidents. But their biggest role is to be a source of knowledge for the rest of the company. Everyone at Cloudflare plays a role in keeping us secure.

You might expect me to have access to our all our systems, a passcard that gets me into any room, a login for any service. But the opposite is true: I don’t have access to most things. I don’t need it to get my job done and so I don’t have it.

This makes me a less attractive target for hackers, and we apply the same rule to everyone. If you don’t need access for your job you don’t get it. That’s made a lot easier by the identity and authentication systems and by our rule about timing out access if you don’t use a service. You probably didn’t need it in the first place.

The flip side of all of us owning security is that deliberately doing the wrong thing has severe consequences.

Making a mistake is just fine. The person who wrote the bad line of code that caused Cloudbleed didn’t get fired, the person who wrote the bad regex that brought our service to a halt on July 2 is still with us.‌‌

Detection and Response‌‌

Naturally, things do go wrong internally. Things that didn’t get reported. To do with them we need to detect problems quickly. This is an area where the security team does have real expertise and data.‌‌

We do this by collecting data about how our endpoints (my laptop, a company phone, servers on the edge of our network) are behaving. And this is fed into a homebuilt data platform that allows the security team to alert on anomalies.‌‌

It also allows them to look at historical data in case of a problem that occurred in the past, or to understand when a problem started. ‌‌

Initially the team was going to use a commercial data platform or SIEM but they quickly realized that these platforms are incredibly expensive and they could build their own at a considerably lower price.‌‌

Also, Cloudflare handles a huge amount of data. When you’re looking at operating system level events on machines in 194 cities plus every employee you’re dealing with a huge stream. And the commercial data platforms love to charge by the size of that stream.‌‌

We are integrating internal DNS data, activity on individual machines, network netflow information, badge reader logs and operating system level events to get a complete picture of what’s happening on any machine we own.‌‌

When someone joins Cloudflare they travel to our head office in San Francisco for a week of training. Part of that training involves getting their laptop and setting it up and getting familiar with our internal systems and security.‌‌

During one of these orientation weeks a new employee managed to download malware while setting up their laptop. Our internal detection systems spotted this happening and the security team popped over to the orientation room and helped the employee get a fresh laptop.‌‌

The time between the malware being downloaded and detected was about 40 minutes.‌‌

If you don’t want to build something like this yourself, take a look at Google’s Chronicle product. It’s very cool. ‌‌

One really rich source of data about your organization is DNS. For example, you can often spot malware just by the DNS queries it makes from a machine. If you do one thing then make sure all your machines use a single DNS resolver and get its logs.‌‌‌‌

Edge Security‌‌

In some ways the most interesting part of Cloudflare is the least interesting from a security perspective. Not because there aren’t great technical challenges to securing machines in 194 cities but because some of the more apparently mundane things I’ve talked about how such huge impact.‌‌

Identity, Authentication, Culture, Detection and Response.‌‌

But, of course, the edge needs securing. And it’s a combination of physical data center security and software. ‌‌

To give you one example let’s talk about SSL private keys. Those keys need to be distributed to our machines so that when an SSL connection is made to one of our servers we can respond. But SSL private keys are… private!‌‌

And we have a lot of them. So we have to distribute private key material securely. This is a hard problem. We encrypt the private keys while at rest and in transport with a separate key that is distributed to our edge machines securely. ‌‌

Access to that key is tightly controlled so that no one can start decrypting keys in our database. And if our database leaked then the keys couldn’t be decrypted since the key needed is stored separately.‌‌

And that key is itself GPG encrypted.‌‌

But wait… there’s more!‌‌

We don’t actually want to have decrypted keys stored in any process that accessible from the Internet. So we use a technology called Keyless SSL where the keys are kept by a separate process and accessed only when needed to perform operations.‌‌

And Keyless SSL can run anywhere. For example, it doesn’t have to be on the same machine as the machine handling an SSL connection. It doesn’t even have to be in the same country. Some of our customers make use of that to specify where their keys are distributed to).

Use Cloudflare to secure Cloudflare

One key strategy of Cloudflare is to eat our own dogfood. If you’ve not heard that term before it’s quite common in the US. The idea is that if you’re making food for dogs you should be so confident in its quality that you’d eat it yourself.

Cloudflare does the same for security. We use our own products to secure ourselves. But more than that if we see that there’s a product we don’t currently have in our security toolkit then we’ll go and build it.

Since Cloudflare is a cybersecurity company we face the same challenges as our customers, but we can also build our way out of those challenges. In this way, our internal security team is also a product team. They help to build or influence the direction of our own products.

The team is also a Cloudflare customer using our products to secure us and we get feedback internally on how well our products work. That makes us more secure and our products better.

Our customers data is more precious than ours‌‌

The data that passes through Cloudflare’s network is private and often very personal. Just think of your web browsing or app use. So we take great care of it.‌‌

We’re handling that data on behalf of our customers. They are trusting us to handle it with care and so we think of it as more precious than our own internal data.‌‌

Of course, we secure both because the security of one is related to the security of the other. But it’s worth thinking about the data you have that, in a way, belongs to your customer and is only in your care.‌‌‌‌

Finally‌‌

I hope this talk has been useful. I’ve tried to give you a sense of how Cloudflare thinks about security and operates. We don’t claim to be the ultimate geniuses of security and would love to hear your thoughts, ideas and experiences so we can improve.‌‌

Security is not static and requires constant attention and part of that attention is listening to what’s worked for others.‌‌

Thank you.‌‌‌‌‌‌‌‌‌‌‌‌

The Cloudflare Blog