How Cloudflare security responded to Log4j 2 vulnerability

At Cloudflare, when we learn about a new security vulnerability, we quickly bring together teams to answer two distinct questions: (1) what can we do to ensure our customers’ infrastructures are protected, and (2) what can we do to ensure that our own environment is secure. Yesterday, December 9, 2021, when a serious vulnerability in the popular Java-based logging package Log4j was publicly disclosed, our security teams jumped into action to help respond to the first question and answer the second question. This post explores the second.

We cover the details of how this vulnerability works in a separate blog post, Inside the Log4j2 vulnerability (CVE-2021-44228), but in summary, this vulnerability allows an attacker to execute code on a remote server. Because of the widespread use of Java and Log4j, this is likely one of the most serious vulnerabilities on the Internet since Heartbleed and ShellShock. The CVE description states that the vulnerability affects Log4j2 <=2.14.1 and is patched in 2.15. The vulnerability additionally impacts all versions of log4j 1.x; however, that branch is End of Life and has other security vulnerabilities that will not be fixed. Upgrading to 2.15 is the recommended action to take. You can also read about how we updated our WAF rules to help protect our customers in this post: CVE-2021-44228 - Log4j RCE 0-day mitigation.

Timeline

One of the first things we do whenever we respond to an incident is start drafting a timeline of events we need to review and understand within the context of the situation. Some examples from our timeline here include:

2021-12-09 16:57 UTC - HackerOne report received regarding Log4j RCE on developers.cloudflare.com
2021-12-10 09:56 UTC - First WAF rule shipped to Cloudflare Specials ruleset
2021-12-10 10:00 UTC - Formal engineering INCIDENT is opened and work begins to identify areas where we need to patch Log4j
2021-12-10 10:33 UTC - Logstash deployed with patch to mitigate vulnerability
2021-12-10 10:44 UTC - Second WAF rule is live as part of Cloudflare managed rules
2021-12-10 10:50 UTC - Elasticsearch restart begins with patch to mitigate vulnerability
2021-12-10 11:05 UTC - Elasticsearch restart concludes and is no longer vulnerable
2021-12-10 11:45 UTC - Bitbucket is patched and no longer vulnerable
2021-12-10 21:22 UTC - HackerOne report closed as Informative after it could not be reproduced

Addressing internal impact

An important question when dealing with any software vulnerability, and perhaps the hardest one every company has to answer in this particular case, is: where are all the places that the vulnerable software is actually running?

If the vulnerability is in a proprietary piece of software licensed by one company to the rest of the world, that is easy to answer - you just find that one piece of software. But in this case it was much harder. Log4j is widely used, but people who are not Java developers are unlikely to be familiar with it. Our first action was to refamiliarize ourselves with all places in our infrastructure where we were running software on the JVM, in order to determine which software components could be vulnerable to this issue.

We were able to create an inventory of all software we have running on the JVM using our centralized code repositories. We used this information to research and determine each individual Java application we had, whether it contained Log4j, and which version of Log4j was compiled into it.
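The version check described above can be sketched in a few lines of Python. This is only an illustration, not the tooling we used: it assumes the bundled library can be recognized by a jar file name of the form log4j-core-<version>.jar, and the function names are hypothetical. It covers only the Log4j 2 range (2.0 through 2.14.1) and ignores pre-release suffixes.

```python
import re

# Range stated in this post: Log4j 2 versions 2.0 through 2.14.1.
FIRST_VULNERABLE = (2, 0, 0)
LAST_VULNERABLE = (2, 14, 1)

# Matches names such as "log4j-core-2.14.1.jar" or "log4j-core-2.3.jar".
# Assumption: the version is encoded in the jar file name.
JAR_PATTERN = re.compile(r"log4j-core-(\d+)\.(\d+)(?:\.(\d+))?")


def parse_version(jar_name):
    """Return (major, minor, patch) parsed from a jar name, or None."""
    match = JAR_PATTERN.search(jar_name)
    if match is None:
        return None
    major, minor, patch = match.groups()
    # A missing patch component (e.g. "2.3") is treated as 0.
    return (int(major), int(minor), int(patch or 0))


def is_vulnerable(jar_name):
    """True if the jar name indicates a version in the affected range."""
    version = parse_version(jar_name)
    if version is None:
        return False
    return FIRST_VULNERABLE <= version <= LAST_VULNERABLE
```

A file name check like this misses copies of Log4j that have been renamed or repackaged inside another archive, so it is a starting point for an inventory rather than a complete answer.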

We discovered that our Elasticsearch, Logstash, and Bitbucket deployments contained instances of the vulnerable Log4j package, at versions between 2.0 and 2.14.1. We were able to use the mitigation strategies described in the official Log4j security documentation to patch the issue. For each instance of Log4j we either removed the JndiLookup class from the classpath:

zip -q -d log4j-core-*.jar org/apache/logging/log4j/core/lookup/JndiLookup.class

Or we set the mitigating system property to true in the Log4j configuration:

log4j2.formatMsgNoLookups=true

We were able to quickly mitigate this issue in these packages using these strategies while waiting for new versions of the packages to be released.
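After removing the class with zip, it is worth confirming that the jar no longer ships it. The following Python sketch shows one way to do that with the standard zipfile module. The function name is hypothetical, and it only inspects the top level of a single jar, so it will not see a log4j-core jar nested inside another archive.

```python
import zipfile

# Path of the class removed by the zip command shown above.
JNDI_LOOKUP_ENTRY = "org/apache/logging/log4j/core/lookup/JndiLookup.class"


def jar_contains_jndi_lookup(jar_path):
    """Return True if the jar at jar_path still contains JndiLookup.class."""
    # A jar is a zip archive, so the standard zipfile module can list it.
    with zipfile.ZipFile(jar_path) as jar:
        return JNDI_LOOKUP_ENTRY in jar.namelist()
```

Running this over every log4j-core jar found in the inventory gives a quick pass/fail list for the first mitigation.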

Reviewing External Reports

Even before we had finished listing the internal places where the vulnerable software was running, we started looking at external reports suggesting that we might be at risk: reports from HackerOne, our bug bounty program, and a public post on GitHub.

We identified at least two reports that seemed to indicate that Cloudflare was vulnerable and compromised. One of the reports included the following screenshot:

This example is targeting our developer documentation hosted at https://developers.cloudflare.com. On the right-hand side, the attacker demonstrates that a DNS query was received for the payload they sent to our server. However, the IP address flagged here is 173.194.95.134, a member of a Google-owned IPv4 subnet (173.194.94.0/23).
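The subnet membership claim is easy to check with Python's standard ipaddress module. This sketch only confirms that the address falls inside the /23 range quoted above; who owns the range is a separate lookup (for example via WHOIS) that the code does not perform. The helper name is hypothetical.

```python
import ipaddress


def ip_in_subnet(ip, subnet):
    """Return True if the IP address string falls inside the CIDR subnet."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(subnet)


# The address from the report and the subnet quoted in the text.
print(ip_in_subnet("173.194.95.134", "173.194.94.0/23"))  # prints True
```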

Cloudflare’s developer documentation is hosted as a Cloudflare Worker and only serves static assets. The repository is public. The Worker relies on Google’s analytics library, as seen here. We therefore hypothesize that the attacker was receiving a request not from Cloudflare but from Google's servers.

Our backend servers receive logs from Workers, but exploitation was also not possible in this instance as we leverage robust Kubernetes egress policies that prevent calling out to the Internet. The only communication allowed is to a curated set of internal services.

When we received a similar report in our vulnerability disclosure program while gathering more information, the researcher was unable to reproduce the issue. This further reinforced our hypothesis that third-party servers were involved, and that they may have since patched the issue.

Was Cloudflare compromised?

While we were running versions of the software as described above, thanks to our speed of response and defense-in-depth approach, we do not believe Cloudflare was compromised. We have invested significant effort into validating this, and we will continue this work until everything is known about this vulnerability. Here is a bit about that part of our efforts.

As we were working to evaluate and isolate all the contexts in which the vulnerable software might be running and remediate them, we started a separate workstream to analyze whether any of those instances had been exploited. Our detection and response methodology follows industry standard Incident Response practices and was thoroughly deployed to validate whether any of our assets were indeed compromised. We followed a multi-pronged approach described next.

Reviewing Internal Data

Our asset inventory and code scanning tooling allowed us to identify all applications and services reliant on Apache Log4j. While these applications were being reviewed and upgraded if needed, we were performing a thorough scan of these services and hosts. Specifically, the exploit for CVE-2021-44228 relies on particular patterns in log messages and parameters, for example \$\{jndi:(ldap[s]?|rmi|dns):/[^\n]+. For each potentially impacted service, we performed a log analysis to expose any attempts at exploitation.
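A log sweep for that pattern can be sketched as follows. The regular expression is the one quoted above. The function name and the idea of feeding it an iterable of log lines are assumptions made for illustration, not a description of our log pipeline.

```python
import re

# Pattern quoted in the text: a JNDI lookup using ldap, ldaps, rmi, or dns.
JNDI_PATTERN = re.compile(r"\$\{jndi:(ldap[s]?|rmi|dns):/[^\n]+")


def find_exploit_attempts(lines):
    """Return (line_number, matched_text) for each line containing the pattern.

    Line numbers start at 1. Only the first match on each line is reported.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        match = JNDI_PATTERN.search(line)
        if match:
            hits.append((number, match.group(0)))
    return hits
```

This matches only the literal form of the payload. Attackers can disguise the string, for instance by nesting other lookups inside it, so a production search would need additional patterns beyond this one.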

Reviewing Network Analytics

Our network analytics allow us to identify suspicious network behavior that may be indicative of attempted or actual exploitation of our infrastructure. We scrutinised our network data to identify the following:

Suspicious Inbound and Outbound Activity: By analyzing suspicious inbound and outbound connections, we were able to sweep our environment and identify whether any of our systems were displaying signs of active compromise.
Targeted Systems & Services: By leveraging pattern analytics against our network data, we uncovered systems and services targeted by threat actors. This allowed us to perform correlative searches against our asset inventory, and drill down to each host to determine if any of those machines were exposed to the vulnerability or actively exploited.
Network Indicators: From the aforementioned analysis, we gained insight into the infrastructure of various threat actors and identified network indicators being utilized in attempted exploitation of this vulnerability. Outbound activity to these indicators was blocked in Cloudflare Gateway.
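As a rough illustration of the indicator matching described in the last item, the sketch below filters connection records against a set of known-bad destinations. Everything about the data shape is an assumption: the record fields (direction, src_ip, dst_ip) and the function name are hypothetical, and the addresses in the usage example come from ranges reserved for documentation.

```python
def flag_outbound_to_indicators(connections, indicator_ips):
    """Return outbound connection records whose destination is a known indicator.

    connections: iterable of dicts with hypothetical keys
                 "direction", "src_ip", and "dst_ip".
    indicator_ips: set of IP address strings identified as attacker infrastructure.
    """
    return [
        conn
        for conn in connections
        if conn["direction"] == "outbound" and conn["dst_ip"] in indicator_ips
    ]


# Example with documentation-range addresses (not real indicators).
indicators = {"203.0.113.7"}
records = [
    {"direction": "outbound", "src_ip": "10.0.0.5", "dst_ip": "203.0.113.7"},
    {"direction": "outbound", "src_ip": "10.0.0.6", "dst_ip": "198.51.100.20"},
    {"direction": "inbound", "src_ip": "203.0.113.7", "dst_ip": "10.0.0.5"},
]
flagged = flag_outbound_to_indicators(records, indicators)
```

Each flagged record can then be correlated with the asset inventory through its source address to decide which host needs a closer look.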

Reviewing endpoints

We were able to correlate our log analytics and network analytics workflows to supplement our endpoint analysis. From our findings from both of those analyses, we were able to craft endpoint scanning criteria to identify any additional potentially impacted systems and analyze individual endpoints for signs of active compromise. We utilized the following techniques:

Signature Based Scanning

We are in the process of deploying custom YARA detection rules to alert on exploitation of the vulnerability. These rules will be deployed in the Endpoint Detection and Response agent running on all of our infrastructure and in our centralized Security Information and Event Management (SIEM) tool.

Anomalous Process Execution and Persistence Analysis

Cloudflare continuously collects and analyzes endpoint process events from our infrastructure. We used these events to search for post-exploitation techniques like download of second stage exploits, anomalous child processes, etc.
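One simple form of this search is to look for a Java process spawning a shell or a download tool, which a logging library has no reason to do. The sketch below assumes a hypothetical event format (dicts with parent_path and child_path keys) and an illustrative, far from complete list of suspicious child binaries.

```python
import posixpath

# Illustrative list only: binaries a JVM rarely needs to launch.
SUSPICIOUS_CHILDREN = {"sh", "bash", "curl", "wget", "nc"}


def find_anomalous_children(events, parent_name="java"):
    """Return process events where parent_name spawned a suspicious child.

    events: iterable of dicts with hypothetical keys
            "parent_path" and "child_path" (absolute POSIX paths).
    """
    flagged = []
    for event in events:
        parent = posixpath.basename(event["parent_path"])
        child = posixpath.basename(event["child_path"])
        if parent == parent_name and child in SUSPICIOUS_CHILDREN:
            flagged.append(event)
    return flagged
```

Some Java applications legitimately launch shells, so hits from a rule like this are leads for an analyst to review rather than proof of compromise.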

Using all of these approaches, we have found no evidence of compromise.

Third-Party risk

In the analysis above, we focused on reviewing code and data we generate ourselves. But like most companies, we also rely on software that we have licensed from third parties. When we started our investigation into this matter, we also partnered with the company’s information technology team to pull together a list of each and every primary third-party provider and all sub-processors to inquire about whether they were affected. We’re in the process of receiving and reviewing responses from the providers. Any providers who we deem critical that are impacted by this vulnerability will be disabled and blocked until they are fully remediated.

Validation that our defense-in-depth approach worked

As we responded to this incident, we found several places where our defense-in-depth approach worked.

1. Restricting outbound traffic

Restricting the ability to call home breaks an essential step of the kill chain and makes exploitation of vulnerabilities much harder. As noted above, we leverage Kubernetes network policies to restrict egress to the Internet on our deployments. In this context, the policy blocks the next stage of the attack: the network connection to attacker-controlled resources is dropped.

All of our externally facing services are protected by Cloudflare. The origin servers for these services are set up via authenticated origin pulls. This means that none of the servers are exposed directly to the Internet.

2. Using Cloudflare to secure Cloudflare

All of our internal services are protected by our Zero Trust product, Cloudflare Access. Therefore, once we had patched the limited attack surface we had identified, any exploit attempts against Cloudflare’s systems or customers leveraging Access would have required the attacker to authenticate.

And because we have the Cloudflare WAF product deployed as part of our effort to secure Cloudflare using Cloudflare, we benefited from all the work being done to protect our customers. All new WAF rules written to protect against this vulnerability were updated with a default action of BLOCK. Like every other customer who has the WAF deployed, we are now receiving protection without any action required on our side.

Conclusion

While our response to this challenging situation continues, we hope that this outline of our efforts helps others. We are grateful for all the support we have received from within and outside of Cloudflare.

Thank you to Evan Johnson, Anjum Ahuja, Sourov Zaman, David Haynes, and Jackie Keith, who also contributed to this blog.

The Cloudflare Blog