
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Sat, 04 Apr 2026 09:54:54 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Using data science and machine learning for improved customer support]]></title>
            <link>https://blog.cloudflare.com/using-data-science-and-machine-learning-for-improved-customer-support/</link>
            <pubDate>Mon, 15 Jun 2020 11:00:00 GMT</pubDate>
            <description><![CDATA[ In this blog post we’ll explore three data science tricks that helped us solve real problems for our customer support group and our customers: two for natural language processing in a customer support context and one for identifying Internet attack traffic. ]]></description>
            <content:encoded><![CDATA[ <p>In this blog post we’ll explore three data science tricks that helped us solve real problems for our customer support group and our customers: two for natural language processing in a customer support context and one for identifying Internet attack traffic.</p><p>Through these examples, we hope to demonstrate how invaluable data processing tricks, visualisations and tools can be before putting data into a <a href="https://www.cloudflare.com/learning/ai/what-is-machine-learning/">machine learning algorithm</a>. By refining data prior to processing, we are able to achieve dramatically improved results without needing to change the underlying machine learning strategies which are used.</p>
    <div>
      <h3>Know the Limits (Language Classification)</h3>
      <a href="#know-the-limits-language-classification">
        
      </a>
    </div>
    <p>When browsing a social media site, you may find the site prompts you to translate a post even though it is in your language.</p><p>We recently came across a similar problem at Cloudflare when we were looking into language classification for chat support messages. Using an off-the-shelf classification algorithm, users with short messages often had their chats classified incorrectly, and our analysis found a correlation between the length of a message and the accuracy of the classification (based on the browser <i>Accept-Language</i> header and the languages of the country where the request was submitted):</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1H7x1KifG0I4tDOfvdAdgK/1dc413df90de1351c60ec043b5807d5c/image2-5.png" />
            
            </figure><p>On a subset of tickets, comparing the classified language against the web browser <i>Accept-Language</i> header, we found there was broad agreement between these two properties. When we considered the languages associated with the user’s country, we found another signal.</p><p>In 67% of our sample, we found agreement between these three signals. In 15% of instances the classified language agreed with only the <i>Accept-Language</i> header and in 5% of cases there was only agreement with the languages associated with the user’s country.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4YKHwax3NQcfFkdDXYqmT9/9c3fdefcf530069d05fc8c74cbedd6bd/pie-chart.png" />
            
            </figure><p>We decided the ideal approach was to train a machine learning model that would take all three signals (plus the confidence rate from the language classification algorithm) and use them to make a prediction. By knowing the limits of a given classification algorithm, we were able to develop an approach that helped complement it.</p><p>A naive approach may not even need a trained model: simply requiring agreement between two of the three properties (classified language, <i>Accept-Language</i> header and the languages of the user’s country) is enough to make a decision about the right language to use.</p>
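The naive two-of-three agreement check can be sketched in a few lines of Python (a hypothetical helper for illustration, not the production model):

```python
def pick_language(classified, accept_header, country_lang):
    """Return a language when at least two of the three signals agree.

    Signals: the classifier's output, the browser Accept-Language header,
    and the dominant language of the user's country.
    """
    signals = [classified, accept_header, country_lang]
    for lang in set(signals):
        if signals.count(lang) >= 2:
            return lang
    return None  # no agreement: fall back to the classifier's confidence score
```

A trained model can then take over in the `None` case, using the classifier's confidence as an additional feature.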
    <div>
      <h3>Hold Your Fire (Fuzzy String Matching)</h3>
      <a href="#hold-your-fire-fuzzy-string-matching">
        
      </a>
    </div>
    <p>Fuzzy String Matching is often used in natural language processing when trying to extract information from human text. For example, this can be used for extracting error messages from customer support tickets to do automatic classification. At Cloudflare, we use this as one signal in our natural language processing pipeline for support tickets.</p><p>Engineers often use the <a href="https://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein distance</a> algorithm for string matching; for example, this algorithm is implemented in the Python <a href="https://github.com/seatgeek/fuzzywuzzy">fuzzywuzzy</a> library. This approach has a high computational overhead (for two strings of length <i>k</i> and <i>l</i>, the algorithm runs in <i>O(k * l)</i> time).</p><p>To understand the performance of different string matching algorithms in a customer support context, we compared multiple algorithms (<a href="https://en.wikipedia.org/wiki/Cosine_similarity">Cosine</a>, Dice, <a href="https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance">Damerau</a>, <a href="https://en.wikipedia.org/wiki/Longest_common_subsequence_problem">LCS</a> and Levenshtein) and measured the true positive rate (TP), false positive rate (FP) and the ratio of false positives to true positives (FP/TP).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2jtTXpLIyC8gFEXY0IoVUR/b81cc32f76a51eaeb9980e78b54a5d12/image4-1.png" />
            
            </figure><p>We opted for the Cosine algorithm, not just because it outperformed the Levenshtein algorithm, but also because its computational complexity is reduced to <i>O(k + l)</i> time. The Cosine similarity algorithm is very simple: it represents words or phrases as vectors in a multidimensional vector space, where each unique letter of an alphabet is a separate dimension. The smaller the angle between two vectors, the closer one word is to the other.</p><p>The mathematical definitions of each string similarity algorithm and a scientific comparison can be found in our paper: <i>M. Pikies and J. Ali, "String similarity algorithms for a ticket classification system," 2019 6th International Conference on Control, Decision and Information Technologies (CoDIT), Paris, France, 2019, pp. 36-41.</i> <a href="https://doi.org/10.1109/CoDIT.2019.8820497"><i>https://doi.org/10.1109/CoDIT.2019.8820497</i></a></p><p>There were other optimisations we introduced to the fuzzy string matching approach: the similarity threshold is determined by evaluating the true positive and false positive rates on various sample data. We further devised a new tokenisation approach for handling phrases and numeric strings, whilst using the <a href="https://fasttext.cc/">FastText</a> natural language processing library to determine candidate values for fuzzy string matching and to improve overall accuracy. We will share more about these optimisations in a future blog post.</p>
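As an illustration of the idea (not our production implementation), cosine similarity over per-letter frequency vectors can be computed in linear time:

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    # Represent each string as a vector of letter frequencies; each unique
    # character is one dimension of the vector space.
    va, vb = Counter(a.lower()), Counter(b.lower())
    dot = sum(va[ch] * vb[ch] for ch in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0
```

Building both vectors takes O(k + l) time, versus the O(k * l) dynamic-programming table that Levenshtein requires. Note that strings containing the same letters in a different order score 1.0, which is why tokenisation and candidate selection matter.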
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5Ooy4FyAn96ohSAClqKA1G/e5483e8b44779a151ea64662a900de4c/image1-6.png" />
            
            </figure>
    <div>
      <h3>“Beyond it is Another Dimension” (Threat Identification)</h3>
      <a href="#beyond-it-is-another-dimension-threat-identification">
        
      </a>
    </div>
    <p>Attack alerting is particularly important at Cloudflare - it is useful both for monitoring the overall status of our network and for providing proactive support to particular at-risk customers.</p><p>DDoS attacks can be characterised by a few different features, including differences in request or error rates over a temporal baseline, the relationship between errors and request volumes, and other metrics that indicate attack behaviour. One example of a metric we use to differentiate between a customer under a low-volume attack and one experiencing another issue is the relationship between 499 and 5xx HTTP status codes. Cloudflare’s network edge returns a <a href="https://support.cloudflare.com/hc/en-us/articles/115003014512-4xx-Client-Error">499 status code</a> when the client disconnects before the origin web server has an opportunity to respond, whilst <a href="https://support.cloudflare.com/hc/en-us/articles/115003011431/">5xx status codes</a> indicate an error handling the request. In the chart below, the x-axis measures the differential increase in 5xx errors over a baseline, whilst the y-axis represents the rate of 499 responses (each point represents a 15-minute interval). During a DDoS attack we notice a linear correlation between these criteria, whilst origin issues typically show an increase in one metric but not the other:</p>
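A toy decision rule capturing the shape of that relationship might look like the following sketch (the thresholds and function are purely illustrative, not our alerting logic):

```python
def classify_interval(delta_5xx, rate_499, threshold=0.5):
    """Classify one 15-minute interval from two normalised metrics.

    delta_5xx: differential increase in 5xx errors over the baseline (0..1)
    rate_499:  rate of 499 responses (0..1)
    """
    if delta_5xx > threshold and rate_499 > threshold:
        return "possible DDoS"          # both metrics rise together
    if delta_5xx > threshold or rate_499 > threshold:
        return "possible origin issue"  # only one metric rises
    return "normal"
```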
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/66cEIbNUGVASqffqgo3r4q/2486d947b796704b2235bde389d0ee38/image7-1.png" />
            
            </figure><p>The next question is how this data can be used in more complicated situations - take the following example of identifying a credential stuffing attack in aggregate. We looked at a small number of anonymised data fields for the most prolific <a href="https://www.cloudflare.com/learning/security/how-to-improve-wordpress-security/">attackers</a> of WordPress login portals. The data is based purely on HTTP headers; in total we saw 820 unique IPs making requests towards 16,248 distinct zones (the IPs were hashed and requests were placed into “buckets” as they were collected). As WordPress returns an HTTP 200 when a login fails and an HTTP 302 on a successful login (redirecting to the login panel), we’re able to analyse this just from the status code returned.</p><p>On the left-hand chart, the x-axis represents a normalised number of unique zones that are <a href="https://www.cloudflare.com/ddos/under-attack/">under attack</a> (0 means the attacker is hitting the same site whilst 1 means the attacker is hitting all different sites) and the y-axis represents the success rate (using HTTP status codes to identify the chance of a successful login). The right-hand chart switches the x-axis out for something called the “variety ratio” - this measures the rate of abnormal 4xx/5xx HTTP status codes (i.e. firewall blocks, rate limiting HTTP headers or 5xx status codes). We see clear clusters on both charts:</p>
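The two axes can be derived from raw request logs roughly as follows (hypothetical helpers illustrating the normalisation, not Cloudflare's pipeline):

```python
def zone_diversity(zones):
    # 0.0 when every request hits the same zone; 1.0 when every request
    # hits a different zone.
    if len(zones) <= 1:
        return 0.0
    return (len(set(zones)) - 1) / (len(zones) - 1)

def login_success_rate(status_codes):
    # WordPress returns HTTP 200 on a failed login and HTTP 302 on success,
    # so the success rate is the share of 302s among login attempts.
    attempts = [s for s in status_codes if s in (200, 302)]
    if not attempts:
        return 0.0
    return attempts.count(302) / len(attempts)
```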
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3lDpsL4cBqJFPCIthPHJTu/788fc21ce1c833e7847e95d8105a1ccd/image6-3.png" />
            
            </figure><p>By plotting this data in three dimensions, with all three fields represented, the clusters become even clearer. They can then be grouped using an unsupervised clustering algorithm (agglomerative hierarchical clustering):</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/18RKc8AxZxggjhuM59bUR8/9331e094e73b00d8fb240a817ef27253/image8.png" />
            
            </figure><p>Cluster 1 has 99.45% of requests from the same country and 99.45% from the same User-Agent. The approach shows its advantages, however, when looking at other clusters - for example, Cluster 0 had 89% of requests coming from three User-Agents (75%, 12.3% and 1.7%, respectively). By using this approach we are able to correlate such attacks together even when they would be hard to identify on a request-to-request basis (as they are being made from different IPs and with different request headers). Such strategies allow us to fingerprint attacks regardless of whether attackers are continuously changing how they make these requests to us.</p><p>By aggregating data together and then representing it in multiple dimensions, we are able to gain visibility into the data that would ordinarily not be possible on a request-to-request basis. In product-level functionality, it is often important to make decisions on a request-by-request basis (“should this request be challenged whilst this one is allowed?”), but by looking at the data in aggregate we are able to focus on the interesting clusters and provide alerting systems which identify anomalies. Performing this in multiple dimensions provides the tools to reduce false positives dramatically.</p>
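To make the clustering step concrete, here is a minimal single-linkage agglomerative clustering over 3D feature points (a pure-Python sketch; a production pipeline would use a library implementation and likely a different linkage):

```python
import math

def agglomerative_cluster(points, n_clusters):
    """Merge the closest pair of clusters until n_clusters remain."""
    clusters = [[p] for p in points]

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(p, q) for p in a for q in b)

    while len(clusters) > n_clusters:
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i].extend(clusters.pop(j))
    return clusters
```

Each point here would be a tuple of (zone diversity, success rate, variety ratio) for one attacker bucket.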
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>From natural language processing to intelligent threat fingerprinting, using data science techniques has improved our ability to build new functionality. Recently, new machine learning approaches and strategies have been designed to process this data more efficiently and effectively; however, preprocessing of data remains a vital tool for doing this. When seeking to optimise data processing pipelines, it often helps to look not just at the tools being used, but also the input and structure of the data you seek to process.</p><p>If you're interested in using data science techniques to identify threats on a large scale network, we're hiring for <a href="https://www.cloudflare.com/careers/jobs/">Support Engineers</a> (including Security Operations, Technical Support and Support Operations Engineering) in San Francisco, Austin, Champaign, London, Lisbon, Munich and Singapore.</p> ]]></content:encoded>
            <category><![CDATA[Data]]></category>
            <category><![CDATA[Support]]></category>
            <category><![CDATA[AI]]></category>
            <guid isPermaLink="false">2Fa1UWMwjIIBrrGpflk5Cj</guid>
            <dc:creator>Junade Ali</dc:creator>
            <dc:creator>Malgorzata Pikies</dc:creator>
            <dc:creator>Andronicus Riyono</dc:creator>
        </item>
        <item>
            <title><![CDATA[Time-Based One-Time Passwords for Phone Support]]></title>
            <link>https://blog.cloudflare.com/time-based-one-time-passwords-for-phone-support/</link>
            <pubDate>Fri, 17 Apr 2020 12:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare Enterprise customers can now authenticate themselves for phone support using TOTP tokens, either by using an authenticator app or generating single-use tokens from the Cloudflare Dashboard. ]]></description>
            <content:encoded><![CDATA[ <p>As part of Cloudflare’s support offering, we provide phone support to Enterprise customers who are experiencing critical business issues.</p><p>For account security, specific account settings and sensitive details are not discussed via phone. From today, we are providing Enterprise customers with the ability to configure phone authentication to allow for greater support to be offered over the phone without the need to perform validation through support tickets.</p><p>After providing your email address to a Cloudflare Support representative, you can now provide a token generated from the Cloudflare dashboard or via a 2FA app like Google Authenticator. This allows a customer to prove over the phone that they are who they say they are.</p>
    <div>
      <h3>Configuring Phone Authentication</h3>
      <a href="#configuring-phone-authentication">
        
      </a>
    </div>
    <p>If you are an existing Enterprise customer interested in phone support, please contact your Customer Success Manager for eligibility information and set-up. If you are interested in our Enterprise offering, please get in contact via our <a href="https://www.cloudflare.com/enterprise/">Enterprise plan</a> page.</p><p>If you already have phone support eligibility, you can generate single-use tokens from the Cloudflare dashboard or configure an authenticator app to do the same remotely.</p><p>On the support page, you will see a card called “Emergency Phone Support Hotline – Authentication”. From here you can generate a Single-Use Token for authenticating a single call or configure an Authenticator App to generate tokens from a 2FA app.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4fdSWKipRIJzxrRX8Dg6GC/da0686642c91f04169806f776754e7ae/1.png" />
            
            </figure><p>For more detailed instructions, please see the “Emergency Phone” section of the <a href="https://support.cloudflare.com/hc/en-us/articles/200172476-Contacting-Cloudflare-Support">Contacting Cloudflare Support</a> article on the Cloudflare Knowledge Base.</p>
    <div>
      <h3>How it Works</h3>
      <a href="#how-it-works">
        
      </a>
    </div>
    <p>A standardised approach for generating TOTPs (Time-Based One-Time Passwords) is described in <a href="https://tools.ietf.org/html/rfc6238">RFC 6238</a> – this is the approach that is often used for setting up Two Factor Authentication on websites.</p><p>When configuring a TOTP authenticator app, you are usually asked to scan a QR code or input a long alphanumeric string. This is a randomly generated secret that is shared between your local authenticator app and the web service where you are configuring TOTP. After TOTP is configured, this secret is stored by both the web server and your local device.</p><p>TOTP password generation relies on two key inputs: the shared secret and the number of seconds since the Unix epoch (<a href="https://en.wikipedia.org/wiki/Unix_time">Unix time</a>). The timestamp is integer-divided by a validity period (often 30 seconds) and this value is put into a cryptographic hash function alongside the secret to generate an output. The hexadecimal output is then truncated to provide the decimal digits which are shown to the user. The <a href="https://en.wikipedia.org/wiki/Avalanche_effect">Avalanche Effect</a> means that whenever the inputs to the hash function change slightly (e.g. the timestamp increments), a completely different hash output is generated.</p><p>This approach is fairly widely used and is available in a number of libraries depending on your preferred programming language. However, as our phone validation functionality offers both authenticator app support and generation of a single-use token from the dashboard (where no shared secret exists), some deviation was required.</p><p>We generate a single-use token by creating a hash of an internal user ID combined with a Cloudflare-internal secret, which in turn is used to generate <a href="https://tools.ietf.org/html/rfc6238">RFC 6238</a> compliant time-based one-time passwords. This means the service can generate random passwords for any user without needing to store additional secrets. The token is then surfaced to the user every 30 seconds via a JavaScript request without exposing the secret used to generate it.</p>
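The RFC 6238 construction described above can be sketched in a few lines of Python using only the standard library (this is the standard SHA-1 algorithm, not Cloudflare's internal service):

```python
import hashlib
import hmac
import struct

def totp(secret, unix_time, step=30, digits=6):
    # Integer-divide the timestamp by the validity period to get the counter.
    counter = unix_time // step
    # HMAC the big-endian counter with the shared secret.
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation (RFC 4226): take 4 bytes at an offset given by the
    # low nibble of the last byte, then keep the requested decimal digits.
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % (10 ** digits)).zfill(digits)
```

With the RFC 6238 Appendix B test secret `b"12345678901234567890"` and timestamp 59, this produces the published vector `94287082` (8 digits).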
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1Dxw2QbU0IphZQTOxuQWDU/8447e7b258f87524c2674b8130c070cd/2.png" />
            
            </figure><p>One question you may be asking yourself after all of this is why we don’t simply use the 2FA mechanism users log in with for phone validation too. Firstly, we don’t want to accustom users to providing their 2FA tokens to anyone else (they should purely be used for logging in). Secondly, as you may have noticed, we recently began <a href="/cloudflare-now-supports-security-keys-with-web-authentication-webauthn/">supporting WebAuthn</a> keys for logging in; as these are physical tokens used for website authentication, they aren’t suited to usage on a mobile device.</p><p>To improve user experience during a phone call, we also validate tokens from the previous time step in the event a token has expired by the time the user has read it out (indeed, RFC 6238 provides that “at most one time step is allowed as the network delay”). This means a token can be valid for up to one minute.</p><p>The APIs powering this service are wrapped with API gateways that offer audit logging both for customer actions and actions completed by staff members. This provides a clear audit trail for customer authentication.</p>
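Accepting the previous time step can be sketched as follows (a hypothetical helper built on a standard RFC 6238 generator, included here so the snippet is self-contained):

```python
import hashlib
import hmac
import struct

def totp(secret, unix_time, step=30, digits=6):
    # Standard RFC 6238 SHA-1 TOTP with dynamic truncation.
    digest = hmac.new(secret, struct.pack(">Q", unix_time // step),
                      hashlib.sha1).digest()
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % (10 ** digits)).zfill(digits)

def verify(secret, submitted, now, step=30):
    # Accept the current time step and the previous one, so a token read
    # out near the end of its 30-second window is still valid; RFC 6238
    # allows at most one step of transmission delay.
    for drift in (0, 1):
        expected = totp(secret, now - drift * step)
        if hmac.compare_digest(expected, submitted):
            return True
    return False
```

Constant-time comparison (`hmac.compare_digest`) avoids leaking information about the expected token through timing.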
    <div>
      <h3>Future Work</h3>
      <a href="#future-work">
        
      </a>
    </div>
    <p>Authentication is a critical component of securing customer support interactions. Authentication tooling must develop alongside support contact channels: from web forms behind logins, to using <a href="https://support.zendesk.com/hc/en-us/articles/360022185314-Enabling-authenticated-visitors-in-the-Chat-widget">JWT tokens</a> for validating live chat sessions, and now TOTP phone authentication. This is complemented by technical support engineers who manage risk by routing certain issues into traditional support tickets and referring some cases to named customer success managers for approval.</p><p>We are constantly advancing our support experience; for example, we plan to further improve our Enterprise Phone Support by giving users the ability to request a callback from a support agent within our dashboard. As always, right here on our blog we’ll keep you up-to-date with improvements in our service.</p>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Support]]></category>
            <category><![CDATA[Authentication]]></category>
            <guid isPermaLink="false">4gyjAYowUFQrLkJBDNEmfl</guid>
            <dc:creator>Junade Ali</dc:creator>
            <dc:creator>Andronicus Riyono</dc:creator>
        </item>
        <item>
            <title><![CDATA[Project Crossbow: Lessons from Refactoring a Large-Scale Internal Tool]]></title>
            <link>https://blog.cloudflare.com/project-crossbow-lessons-from-refactoring-a-large-scale-internal-tool/</link>
            <pubDate>Tue, 07 Apr 2020 07:00:00 GMT</pubDate>
            <description><![CDATA[ Crossbow is a tool that allows Cloudflare’s Technical Support Engineers to perform diagnostic activities, from running commands (like traceroutes, cURL requests and DNS queries) to debugging product features and performance using bespoke tools. ]]></description>
            <content:encoded><![CDATA[ 
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4DD8CfPAPxEZXvXNS1u6cW/46988ed6b45099cd2827d63a190ee721/Crossbow-tool_2x-1.png" />
          </figure><p>Cloudflare’s <a href="https://www.cloudflare.com/network/">global network</a> currently spans 200 cities in more than 90 countries. Engineers working in product, technical support and operations often need to be able to debug network issues from particular locations or individual servers.</p><p>Crossbow is the internal tool for doing just this; allowing Cloudflare’s Technical Support Engineers to perform diagnostic activities from running commands (like traceroutes, cURL requests and DNS queries) to debugging product features and performance using bespoke tools.</p><p>In September last year, an Engineering Manager at Cloudflare asked to transition Crossbow from a Product Engineering team to the Support Operations team. The tool had been a secondary focus and had been transitioned through multiple engineering teams without developing subject matter knowledge.</p><p>The Support Operations team at Cloudflare is closely aligned with Cloudflare’s Technical Support Engineers; developing diagnostic tooling and Natural Language Processing technology to drive efficiency. Based on this alignment, it was decided that Support Operations was the best team to own this tool.</p>
    <div>
      <h3>Learning from Sisyphus</h3>
      <a href="#learning-from-sisyphus">
        
      </a>
    </div>
    <p>Whilst seeking advice on the transition process, an SRE Engineering Manager in Cloudflare suggested reading: “<a href="https://landing.google.com/sre/resources/practicesandprocesses/case-study-community-driven-software-adoption/">A Case Study in Community-Driven Software Adoption</a>”. This book proved a truly invaluable read for anyone thinking of doing internal tool development or contributing to such tooling. The book describes why multiple tools are often created for the same purpose by different autonomous teams and how this issue can be overcome. The book also describes challenges and approaches to gaining adoption of tooling, especially where this requires some behaviour change for engineers who use such tools.</p><p>That said, there are some things we learnt along the way of taking over Crossbow and performing a refactor and revamp of a large-scale internal tool. This blog post seeks to be an addendum to such guidance and provide some further practical advice.</p><p>In this blog post we won’t dwell too much on the work of the Cloudflare Support Operations team, but this can be found in the SRECon talk: “<a href="https://www.usenix.org/conference/srecon19emea/presentation/ali">Support Operations Engineering: Scaling Developer Products to the Millions</a>”. The software development methodology used in Cloudflare’s Support Operations Group closely resembles <a href="http://www.extremeprogramming.org/">Extreme Programming</a>.</p>
    <div>
      <h3>Cutting The Fat</h3>
      <a href="#cutting-the-fat">
        
      </a>
    </div>
    <p>There were two ways of using Crossbow: a CLI (command line interface) and a UI embedded in Cloudflare’s internal tool for Technical Support Engineers. Maintaining both interfaces clearly had significant overhead for improvement efforts, and we took the decision to deprecate one of them. This allowed us to focus our efforts on one platform to achieve large-scale improvements across technology, usability and functionality.</p><p>We set up a poll to allow engineering, operations, solutions engineering and technical support teams to provide feedback on how they used the tooling. Polling was not only critical for understanding how different teams used the tool, but also ensured that, prior to deprecation, people knew their views had been taken on board. We polled not only on the option people preferred, but also on which options they felt were necessary to them and why.</p><p>We found that the reasons for favouring the web UI primarily revolved around the absence of documentation and training. By contrast, we discovered that those who used the CLI found it far more critical to their workflow. Product Engineering teams do not routinely have access to the support UI but some found it necessary to use Crossbow for their jobs, and users wanted to be able to automate commands with shell scripts.</p><p>Technically, the UI was in JavaScript with an <a href="https://www.cloudflare.com/learning/security/api/what-is-an-api-gateway/">API Gateway</a> service that converted HTTP requests to gRPC, alongside some configuration to allow it to work in the support UI. The CLI directly interfaced with the gRPC API, so it was a simpler system. Given the Cloudflare Support Operations team primarily works on Systems Engineering projects and had limited UI resources, the decision to deprecate the UI was also in our own interest.</p><p>We rolled out a new internal Crossbow user group, trained up teams and created new documentation, provided advance notification of deprecation, and retired the source code of these services. We also dramatically improved the user experience of the CLI through simple improvements to the help information and easier CLI usage.</p>
    <div>
      <h3>Rearchitecting Pub/Sub with Cloudflare Access</h3>
      <a href="#rearchitecting-pub-sub-with-cloudflare-access">
        
      </a>
    </div>
    <p>One of the primary challenges we encountered was how the system architecture for Crossbow was designed many years ago. A gRPC API ran commands at Cloudflare’s edge network using a configuration management tool which the SRE team expressed a desire to deprecate (with Crossbow being the last user of it).</p><p>During a visit to the Singapore office, the local Edge SRE Engineering Manager wanted his team to understand Crossbow and how to contribute to it. During this meeting, we provided an overview of the current architecture and the team there were forthcoming in providing potential refactoring ideas to handle global network stability and move away from the old pipeline. This provided invaluable insight into the common issues experienced between technical approaches and instances where the tool would fail, requiring Technical Support Engineers to consult the SRE team.</p><p>We decided to adopt a simpler pub/sub pipeline: instead, the edge network would expose a gRPC daemon that would listen for new jobs, execute them and then make a callback to the API service with the results (which would be relayed to the client).</p><p>For authentication between the API service and the client, or the API service and the network edge, we implemented a <a href="https://developers.cloudflare.com/access/setting-up-access/json-web-token/">JWT authentication</a> scheme. For a CLI user, the authentication was done by querying an HTTP endpoint behind Cloudflare Access <a href="https://developers.cloudflare.com/access/cli/connecting-from-cli/">using cloudflared</a>, which provided a JWT the client could use for <a href="https://grpc.io/docs/guides/auth/">authentication with gRPC</a>. 
In practice, this looks something like this:</p><ol><li><p>CLI makes request to authentication server using cloudflared</p></li><li><p>Authentication server responds with signed JWT token</p></li><li><p>CLI makes gRPC request with JWT authentication token to API service</p></li><li><p>API service validates token using a public key</p></li></ol><p>The gRPC API endpoint was placed on <a href="https://www.cloudflare.com/products/cloudflare-spectrum/">Cloudflare Spectrum</a>; as users were authenticated using Cloudflare Access, we could remove the requirement for users to be on the company VPN to use the tool. The new authentication pipeline, combined with a single user interface, also allowed us to improve the collection of metrics and usage logs of the tool.</p>
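The token issue-and-verify steps above can be illustrated with a minimal JWT sketch (HS256 and the standard library only, for brevity; the real service signs with a private key and validates with a public key, and a production system would use a vetted JWT library):

```python
import base64
import hashlib
import hmac
import json

def _b64url(data):
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_jwt(payload, secret):
    # Step 2: the authentication server issues a signed token.
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = _b64url(json.dumps(payload).encode())
    sig = _b64url(hmac.new(secret, f"{header}.{body}".encode(),
                           hashlib.sha256).digest())
    return f"{header}.{body}.{sig}"

def verify_jwt(token, secret):
    # Step 4: the API service checks the signature before trusting claims.
    header, body, sig = token.split(".")
    expected = _b64url(hmac.new(secret, f"{header}.{body}".encode(),
                                hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        raise ValueError("invalid signature")
    return json.loads(base64.urlsafe_b64decode(body + "=" * (-len(body) % 4)))
```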
    <div>
      <h3>Risk Management</h3>
      <a href="#risk-management">
        
      </a>
    </div>
    <blockquote><p>Risk is inherent in the activities undertaken by engineering professionals, meaning that members of the profession have a significant role to play in managing and limiting it.
- <a href="https://www.engc.org.uk/standards-guidance/guidance/guidance-on-risk/">Guidance on Risk</a>, Engineering Council</p></blockquote><p>As with all engineering projects, it was critical to manage risk. However, the risk to manage differs between engineering projects. Availability wasn’t the largest factor, given that Technical Support Engineers could escalate issues to the SRE team if the tool wasn’t available. The main risk was the security of the Cloudflare network and ensuring Crossbow did not affect the availability of any other services. To this end we took methodical steps to improve isolation and engaged the InfoSec team early to assist with specification and code reviews of the new pipeline. Where a risk to availability existed, we ensured it was properly communicated to the support team and the internal Crossbow user group, making clear the risk/reward trade-off.</p>
    <div>
      <h3>Feedback, Build, Refactor, Measure</h3>
      <a href="#feedback-build-refactor-measure">
        
      </a>
    </div>
    <p>The Support Operations team at Cloudflare works using a methodology based on Extreme Programming. A key tenet of Extreme Programming is Test Driven Development, often described as “<a href="https://www.codecademy.com/articles/tdd-red-green-refactor">red-green-refactor</a>”. First the engineer enshrines the requirements in tests, then makes those tests pass, and finally refactors to improve code quality before shipping the software.</p><p>As we took on this project, the Cloudflare Support and SRE teams were working on Project Baton - an effort to allow Technical Support Engineers to handle more customer escalations without handover to the SRE teams.</p><p>As part of this effort, they had already created an invaluable resource in the form of a feature wish list for Crossbow. We associated JIRAs with all of these items and prioritised delivering them using a Test Driven Development workflow and the introduction of Continuous Integration. Critically, we measured such improvements once deployed. Adding simple functionality like support for MTR (a Linux network diagnostic tool) and exposing support for different cURL flags provided measurable improvements in usage.</p><p>We were also able to embed support in Crossbow for tools created by other teams at the network edge, allowing those teams to maintain their tools while exposing features to Crossbow users. Through an improved development environment and documentation, we were able to drive Product Engineering teams to contribute functionality in the mutual interest of themselves and the customer support team.</p><p>Finally, we owned a number of tools used by Technical Support Engineers to discover what Cloudflare configuration was applied to a given URL and to perform distributed performance testing; we deprecated these tools and rolled them into Crossbow.
Another tool, Edge Worker Debug, owned by the <a href="https://workers.cloudflare.com/">Cloudflare Workers</a> team, was likewise rolled into Crossbow and deprecated.</p>
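    <p>To make the red-green-refactor loop concrete, here is a minimal pass in Python, using a hypothetical hostname-validation helper (illustrative only, not Crossbow's code):</p>

```python
# Red: write the failing test first, enshrining the requirement.
def test_hostname_validation():
    assert is_valid_hostname("blog.cloudflare.com")
    assert not is_valid_hostname("-bad-.example")
    assert not is_valid_hostname("")

# Green: the simplest implementation that makes the test pass.
# Refactor: tighten the per-label rules while keeping the test green.
def is_valid_hostname(name: str) -> bool:
    if not name or len(name) > 253:
        return False
    return all(
        label and len(label) <= 63
        and not label.startswith("-") and not label.endswith("-")
        and all(c.isalnum() or c == "-" for c in label)
        for label in name.split(".")
    )

test_hostname_validation()
```

    <p>The test is written before the implementation exists, so it fails first; the implementation then evolves under the protection of that test.</p>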
    <div>
      <h3>Results</h3>
      <a href="#results">
        
      </a>
    </div>
    <p>From implementing user analytics on the tool on 16 December 2019 to the week ending 22 January 2020, we measured a usage increase of 4.5x. This growth primarily happened within a four-week period; by adding the most wanted functionality, we were able to achieve a critical saturation of usage amongst Technical Support Engineers.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/48JaaCMLrwL9EWw9WtDNP5/4958cf9e7a94f0759983f475ca168d4b/image1.png" />
          </figure><p>Beyond this point, it became critical to use the number of checks being run as a metric for how useful the tool was. For example, the week starting January 27 saw no meaningful increase in unique users (a 14% increase over the previous week - within the normal fluctuation of stable usage). However, over the same timeframe, we saw a 2.6x increase in the number of tests being run - coinciding with the introduction of a number of new high-usage features.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ndjgHwGG3rVpuwOczuc6L/8158b2a71dbf87aa7dfeaf4650118ed8/pasted-image-0--6-.png" />
          </figure>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Through removing low-value/high-maintenance functionality and merciless refactoring, we were able to dramatically improve the quality of Crossbow and therefore the velocity of delivery. We drove up usage by measuring it, running feedback loops with users for feature requests, and practising test-driven development. Consolidation of tooling reduced the overhead of developing support tooling across the business, providing a common framework for developing and exposing functionality for Technical Support Engineers.</p><p>There are two key counterintuitive learnings from this project. The first is that cutting functionality can drive usage, provided this is done intelligently. In our case, the web UI contained no functionality that wasn’t in the CLI, yet caused substantial engineering overhead for maintenance. By deprecating this functionality, we were able to reduce technical debt and thereby improve the velocity of delivering more important functionality. This effort requires effective communication of the decision-making process and involvement from those impacted by the decision.</p><p>Secondly, tool development efforts are often guided by user feedback but lack a means of objectively measuring improvements. When logging is added, it is often done purely for security and audit purposes. Whilst feedback loops with users are invaluable, it is critical to have an objective measure of how successful a feature is and how it is used.
Effective measurement drives the decision-making process for future tooling and therefore, in the long run, the usage data can be more important than the original feature itself.</p><p>If you're interested in debugging interesting technical problems on a network with these tools, we're hiring for <a href="https://www.cloudflare.com/careers/jobs/?department=Customer+Support">Support Engineers</a> (including Security Operations, Technical Support and Support Operations Engineering) in San Francisco, Austin, Champaign, London, Lisbon, Munich and Singapore.</p> ]]></content:encoded>
            <category><![CDATA[Tools]]></category>
            <category><![CDATA[Cloudflare Access]]></category>
            <category><![CDATA[Support]]></category>
            <category><![CDATA[Spectrum]]></category>
            <guid isPermaLink="false">17EMKPLIbfOeVwXlT4cDK8</guid>
            <dc:creator>Junade Ali</dc:creator>
            <dc:creator>Peter Weaver</dc:creator>
        </item>
        <item>
            <title><![CDATA[Pwned Passwords Padding (ft. Lava Lamps and Workers)]]></title>
            <link>https://blog.cloudflare.com/pwned-passwords-padding-ft-lava-lamps-and-workers/</link>
            <pubDate>Wed, 04 Mar 2020 13:00:00 GMT</pubDate>
            <description><![CDATA[ Starting today, we are offering a new security advancement in the Pwned Passwords API - API clients can receive responses padded with random data. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>The Pwned Passwords API (part of Troy Hunt’s <a href="https://haveibeenpwned.com/">Have I Been Pwned</a> service) is used tens of millions of times each day to alert users if their credentials have been breached, in a variety of online services, browser extensions and applications. Using Cloudflare, the API serves around 99% of requests from cache, making it very efficient to run.</p><p>From today, we are offering a new security advancement in the Pwned Passwords API - API clients can receive responses padded with random data. This protects against attack vectors which seek to use passive analysis of the size of API responses to identify which anonymised bucket a user is querying. I am hugely grateful to security researcher Matt Weir, whom I met at <a href="https://passwordscon.org/">PasswordsCon</a> in Stockholm, who has explored <a href="https://github.com/lakiw/pwnedpasswords_padding">proof-of-concept</a> analysis of unpadded API responses in Pwned Passwords and driven some of the work to consider the addition of padded responses.</p><p>Now, by passing a header of “Add-Padding” with a value of “true”, Pwned Passwords API users are able to request padded API responses (to a minimum of 800 entries, with additional padding of a further 0-200 entries). The padding consists of randomly generated hash suffixes with the usage count field set to “0”.</p><p>Clients using this approach should exclude 0-usage hash suffixes from breach validation. Given most implementations of Pwned Passwords simply do string matching on the suffix of a hash, there is no real performance implication of searching through the padding data. The false positive risk of a real hash suffix matching a randomly generated response is very low: 619/(2<sup>35*4</sup>) ≈ 4.44 x 10<sup>-40</sup>.
This means you’d need to do about 10<sup>40</sup> queries (roughly a query for every two atoms in the universe) to have a 44.4% probability of a collision.</p><p>In the future, non-padded responses will be deprecated outright (and all responses will be padded) once clients have had a chance to update.</p><p>You can see an example padded response by running the following curl request:</p>
            <pre><code>curl -H Add-Padding:true https://api.pwnedpasswords.com/range/FFFFF</code></pre>
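    <p>Putting the pieces together client-side: hash the password, send only the first 5 hex characters, then match the remaining 35 characters locally while ignoring zero-count padding lines. A sketch of a hypothetical client helper (not an official library):</p>

```python
import hashlib

def breach_count(password: str, response_text: str) -> int:
    """Check a password against a Pwned Passwords range response.
    Only the 5-char prefix is ever sent to the API; the remaining
    35 hex chars are matched locally. ':0' entries are padding."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    prefix, suffix = digest[:5], digest[5:]  # prefix is what was queried
    for line in response_text.splitlines():
        candidate, _, count = line.partition(":")
        if candidate == suffix and int(count) > 0:  # skip padding entries
            return int(count)
    return 0
```

    <p>Because padding entries always carry a count of 0, the same string-matching loop works unchanged whether or not the response was padded.</p>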
            
    <div>
      <h2>API Structure</h2>
      <a href="#api-structure">
        
      </a>
    </div>
    <p>The high level structure of the Pwned Passwords API is discussed in my original blog post “<a href="/validating-leaked-passwords-with-k-anonymity/">Validating Leaked Passwords with k-Anonymity</a>”. In essence, a client queries the API for the first 5 hexadecimal characters of a SHA-1 hashed password (amounting to 20 bits), a list of responses is returned with the remaining 35 hexadecimal characters of the hash (140 bits) of every breached password in the dataset. Each hash suffix is appended with a colon (“:”) and the number of times that given hash is found in the breached data.</p><p>An example query for <i>FFFFF</i> can be seen below, with the structure represented:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3NE4abRyEojOo5caiuNQ0D/9d824a10e4a833f7b8acc10dd52eeb5b/pwned_passwords_curl.png" />
            
            </figure><p>Without padding, the message length varies given the amount of hash suffixes in the bucket that is queried. It is known that it is possible to fingerprint TLS traffic based on the encrypted message length - fortunately this padding can be inserted in the API responses themselves (in the HTTP content). We can see the difference in download size between two unpadded buckets by running:</p>
            <pre><code>$ curl -so /dev/null https://api.pwnedpasswords.com/range/E0812 -w '%{size_download} bytes\n'
17022 bytes
$ curl -so /dev/null https://api.pwnedpasswords.com/range/834EF -w '%{size_download} bytes\n'
25118 bytes</code></pre>
            <p>The randomised padding entries can be found with the “:0” suffix (indicating a usage count of zero); for example, below the top three entries are real entries whilst the last three represent padding data:</p>
            <pre><code>FF1A63ACC70BEA924C5DBABEE4B9B18C82D:10
FF8A0382AA9C8D9536EFBA77F261815334D:12
FFEE791CBAC0F6305CAF0CEE06BBE131160:2
2F811DCB8FF6098B838DDED4D478B0E4032:0
A1BABA501C55ACB6BDDC6D150CF585F20BE:0
9F31397459FF46B347A376F58506E420A58:0</code></pre>
            
    <div>
      <h2>Compression and Randomisation</h2>
      <a href="#compression-and-randomisation">
        
      </a>
    </div>
    <p>Cloudflare supports both GZip and Brotli for compression. Compression benefits the Pwned Passwords API as responses are hexadecimal data represented in ASCII. That said, compression is somewhat limited given the Avalanche Effect in hashing algorithms (a small change in an input results in a completely different hash output) - each range searched has dramatically different input passwords, and the remaining 35 characters of the SHA-1 hashes are similarly different, with no expected similarity between them.</p><p>Accordingly, if one were simply to pad messages with null data (say “000...”), compression could mean that responses padded to the same length could still be differentiated after compression. Similarly, even without compression, padding messages with the same data could still permit credible attacks.</p><p>Instead, padding is generated with randomly generated entries. In order not to break clients, such padding is generated to look like legitimate hash suffixes. It is possible, however, to identify such messages as randomised padding: as the Pwned Passwords API contains a count field (distinguished by a colon after the remainder of the hex, followed by a numerical count), randomised entries can be distinguished by a usage count of 0.</p>
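    <p>The effect is easy to demonstrate: constant padding collapses under compression, while randomised hexadecimal barely compresses, so only randomised padding keeps equal-length responses equal-sized on the wire. A quick illustration (not the production code):</p>

```python
import random
import zlib

# Roughly 1000 fake 35-char suffixes' worth of padding, two ways.
constant = b"0" * 36000
rng = random.Random(1)  # seeded for reproducibility
randomised = "".join(rng.choice("0123456789ABCDEF") for _ in range(36000)).encode()

# The constant block compresses to a few dozen bytes; the random block
# keeps roughly half its size (hex carries ~4 bits of entropy per byte).
# Equal-length responses padded with nulls would therefore still differ
# in size after compression, defeating the padding.
assert len(zlib.compress(constant)) < len(zlib.compress(randomised)) // 10
```
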
    <div>
      <h2>Lava Lamps and Workers</h2>
      <a href="#lava-lamps-and-workers">
        
      </a>
    </div>
    <p>I’ve written before about the <a href="/optimising-caching-on-pwnedpasswords/">cache optimisation of Pwned Passwords</a> (including the use of Cloudflare Workers). Cloudflare Workers has the additional benefit that Workers run before elements are pulled from cache.</p><p>This allows randomised entries to be generated dynamically on a request-by-request basis instead of being cached. This means the resulting randomised padding can differ from request to request (and with it the number of entries in a given response, and the size of the response).</p><p>Cloudflare Workers supports the <a href="https://developers.cloudflare.com/workers/reference/apis/web-crypto/">Web Crypto API</a>, which exposes a cryptographically sound random number generator. This random number generator is used to decide the variable amount of padding added to each response. Whilst a cryptographically secure random number generator is used to determine the amount of padding, the random hexadecimal padding itself does not need to be indistinguishable from the real hashes, so for computational performance we use the non-cryptographically secure <i>Math.random()</i> to generate the actual content of the padding.</p><p>Famously, one of the sources of entropy used in Cloudflare servers is <a href="/lavarand-in-production-the-nitty-gritty-technical-details/">sourced from Lava Lamps</a>. By filming a wall of lava lamps in our San Francisco office (with individual photoreceptors picking up random noise beyond the movement of the lava), we are able to generate random seed data used in servers (complemented by other sources of entropy along the way). This lava lamp entropy is used alongside the randomness sources on individual servers to seed <i>cryptographically secure pseudorandom number generator</i> (CSPRNG) algorithms when generating random numbers.
The Cloudflare Workers runtime uses the <a href="https://developers.cloudflare.com/workers/about/how-it-works/">V8 engine</a> for JavaScript, with randomness <a href="https://github.com/v8/v8/blob/master/src/base/utils/random-number-generator.cc#L63">sourced</a> from <i>/dev/urandom</i> on the server itself.</p><p>Each response is padded to a minimum of 800 hash suffixes plus a randomly generated amount of additional padding (up to 200 further entries).</p><p>This can be seen in two ways. Firstly, repeating the same request to the same endpoint (with the underlying response being cached) yields a randomised number of lines between 800 and 1000:</p>
            <pre><code>$ for run in {1..10}; do curl -s -H Add-Padding:true https://api.pwnedpasswords.com/range/FFFFF | wc -l; done
     831
     956
     870
     980
     932
     868
     856
     961
     912
     827</code></pre>
            <p>Secondly, we can see a randomised download size in each response:</p>
            <pre><code>$ for run in {1..10}; do curl -so /dev/null -H Add-Padding:true https://api.pwnedpasswords.com/range/FFFFF -w '%{size_download} bytes\n'; done
35572 bytes
37358 bytes
38194 bytes
33596 bytes
32304 bytes
37168 bytes
32532 bytes
37928 bytes
35154 bytes
33178 bytes</code></pre>
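            <p>Translated into Python for illustration, the padding scheme above (an 800-entry minimum, up to 200 extra entries, a CSPRNG for the amount, and a fast non-crypto RNG for the content) might look like this sketch; the function name is hypothetical and this is not the Worker's actual code:</p>

```python
import random
import secrets

def pad_response(real_entries: list) -> list:
    """Pad a bucket to a randomised total of 800-1000 entries.
    A secure RNG picks the amount of padding (as the Worker does via
    Web Crypto); a fast non-crypto RNG fills in the fake suffixes
    (as the Worker does via Math.random())."""
    target = 800 + secrets.randbelow(201)  # 800..1000 total entries
    padded = list(real_entries)
    while len(padded) < target:
        fake = "".join(random.choice("0123456789ABCDEF") for _ in range(35))
        padded.append(fake + ":0")  # a count of 0 marks padding
    return padded
```

    <p>Because the amount of padding is re-drawn on every call, repeated requests for the same bucket return differently sized responses, as the curl output above shows.</p>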
            
    <div>
      <h2>Future Work and Conclusion</h2>
      <a href="#future-work-and-conclusion">
        
      </a>
    </div>
    <p>There has been a considerable amount of research complementing the anonymity approach in Pwned Passwords. For example, Google and Stanford have written a paper about their approach implemented in Google Password Checkup, “Protecting accounts from credential stuffing with password breach alerting” [<a href="https://www.usenix.org/system/files/sec19-thomas.pdf">Usenix</a>].</p><p>We have done a significant amount of work exploring more advanced protocols for Pwned Passwords; some of this work can be found in a paper we worked on with academics at Cornell University, “Protocols for Checking Compromised Credentials” [<a href="https://dl.acm.org/doi/abs/10.1145/3319535.3354229">ACM</a> or <a href="https://arxiv.org/pdf/1905.13737.pdf">arXiv preprint</a>]. This research offers two new protocols (FSB, frequency smoothing bucketization, and IDB, identifier-based bucketization) to further reduce information leakage in the APIs.</p><p>Further work is needed before these protocols gain the production worthiness we’d like before they are shipped - but, as always, we’ll keep you updated here on our blog.</p> ]]></content:encoded>
            <category><![CDATA[Passwords]]></category>
            <category><![CDATA[Serverless]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Cryptography]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[LavaRand]]></category>
            <guid isPermaLink="false">3ikC6g6s6oo3MKLxFvsn3h</guid>
            <dc:creator>Junade Ali</dc:creator>
        </item>
        <item>
            <title><![CDATA[Concise Christmas Cryptography Challenges 2019]]></title>
            <link>https://blog.cloudflare.com/christmas-cryptography-challenges-2019/</link>
            <pubDate>Tue, 25 Dec 2018 17:42:02 GMT</pubDate>
            <description><![CDATA[ We've put together some Christmas Cryptography questions. Do you think you can solve them? ]]></description>
            <content:encoded><![CDATA[ <p>Last year <a href="/concise-post-christmas-cryptographic-challenges/">we published some crypto challenges</a> to keep you momentarily occupied from the festivities. This year, we're doing the same. Whether you're bored or just want to learn a bit more about the technologies that encrypt the Internet, feel free to give these short cryptography quizzes a go.</p><p>We're withholding answers until the start of the new year, to give you a chance to solve them without spoilers. Before we reveal the answers, we'll be giving some Cloudflare swag to the first 5 people who get the answers right. Fill out your <a href="https://goo.gl/forms/KexYSIfTgVaiPPKK2">answers and details using this form</a> so we know where to send it.</p><p>Have fun!</p><p><b>UPDATE: This quiz is now closed.</b> Thank you to everyone who played. We received many responses, 15 of which got all the answers right; we will shortly be sending out some swag to those respondents.</p><p><b>NOTE:</b> Hints, <i>now followed by solutions</i>, are below the questions; avoid scrolling too far if you want to avoid spoilers.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3nhuBXAXCdwxp6tYl5HUqt/eec550fa185800edd1bcec9183a5b47f/Family_Double_Dare_spaghetti_challenge-1.jpg" />
            
            </figure>
    <div>
      <h2>Challenges</h2>
      <a href="#challenges">
        
      </a>
    </div>
    
    <div>
      <h3>Client says Hello</h3>
      <a href="#client-says-hello">
        
      </a>
    </div>
    <p>Client says hello, as follows:</p><blockquote><p>00000c07ac01784f437dbfc70800450000f2560140004006db58ac1020c843c7f80cd1f701bbc8b2af3449b598758018102a72a700000101080a675bce16787abd8716030100b9010000b503035c1ea569d5f64df3d8630de8bdddd1152e75f528ae577d2436949ce8deb7108600004400ffc02cc02bc024c023c00ac009c008c030c02fc028c027c014c013c012009f009e006b0067003900330016009d009c003d003c0035002f000a00af00ae008d008c008b010000480000000b000900000663666c2e7265000a00080006001700180019000b00020100000d0012001004010201050106010403020305030603000500050100000000001200000017000052305655494338795157524c656d6443436c5246574651675430346754456c4f52564d674e434242546b51674e513d3d</p></blockquote><p>[<a href="https://gist.github.com/IcyApril/bb95c93a333aef24368242fe2af4c5ad">Raw puzzle without text wrap</a>]</p>
    <div>
      <h3>Time-Based One-Time Password</h3>
      <a href="#time-based-one-time-password">
        
      </a>
    </div>
    <p>A user has an authenticator device to generate one time passwords for logins to their banking website. The implementation contains a fatal flaw.</p><p>At the following times, the following codes are generated (all in GMT/UTC):</p><ul><li><p>Friday, 21 December 2018 16:29:28 - <b>084342</b></p></li><li><p>Saturday, 22 December 2018 13:11:53 - <b>411907</b></p></li><li><p>Tuesday, 25 December 2018 12:15:03 - <b>617041</b></p></li></ul><p>What code will be generated at precisely midnight of the 1st of January 2019?</p>
    <div>
      <h3>RPKI</h3>
      <a href="#rpki">
        
      </a>
    </div>
    <p>At Cloudflare, we just set up <a href="/rpki-details/">RPKI</a>: we signed a few hundred prefixes in order to reduce route leaks. But some of the prefixes hide a secret message. Find the ROAs that look different and decode the word!</p>
    <div>
      <h2>Hints</h2>
      <a href="#hints">
        
      </a>
    </div>
    
    <div>
      <h3>Client says Hello</h3>
      <a href="#client-says-hello">
        
      </a>
    </div>
    <p>This challenge has 3 hints, as follows:</p><ul><li><p>Challenge is based on a network capture</p></li><li><p><a href="/encrypted-sni/">https://blog.cloudflare.com/encrypted-sni/</a></p></li><li><p>What's weird about the Frame?</p></li></ul>
    <div>
      <h3>TOTP</h3>
      <a href="#totp">
        
      </a>
    </div>
    <p>The Time-Based One-Time Password algorithm is described in <a href="https://tools.ietf.org/html/rfc6238">RFC 6238</a>, which was based on <a href="https://tools.ietf.org/html/rfc4226">RFC 4226</a> (providing the algorithm for HOTP). The TOTP algorithm requires two important inputs, the time and a shared secret - could one be missing?</p><p>The implementation used to generate the TOTP codes for the challenge uses SHA-1 as the digest algorithm.</p>
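    <p>The algorithm itself is short enough to sketch in full with the standard library (HMAC-SHA-1 over a 30-second interval counter, per RFC 6238), which makes clear that the time and the shared secret are its only real inputs:</p>

```python
import hashlib
import hmac
import struct

def totp(secret: bytes, timestamp: int, digits: int = 6, step: int = 30) -> str:
    """RFC 6238 TOTP built on RFC 4226 HOTP, with a SHA-1 digest."""
    counter = struct.pack(">Q", timestamp // step)  # time-based counter
    mac = hmac.new(secret, counter, hashlib.sha1).digest()
    offset = mac[-1] & 0x0F                         # dynamic truncation
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)
```

    <p>With a missing or well-known secret, everything except the clock is fixed - which is the flaw to look for in this challenge.</p>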
    <div>
      <h3>RPKI</h3>
      <a href="#rpki">
        
      </a>
    </div>
    <p><b>Note:</b> This challenge will no longer be valid after mid-January 2019.</p><p>This challenge has 4 hints, as follows:</p><ul><li><p>Hint #0: Four or six? Probably six.</p></li><li><p>Hint #1: If only there was a way of listing only our IPs!</p></li><li><p>Hint #2: What is the only part of the ROA where we can hide information into</p></li><li><p>Hint #3: Subtract the reserve, the char will show itself</p></li></ul>
    <div>
      <h2>Solutions</h2>
      <a href="#solutions">
        
      </a>
    </div>
    <p>If you prefer video form, someone has created a YouTube video of how to solve the problems; otherwise, the written solutions are below:</p>
    <div>
      <h3>Client says Hello</h3>
      <a href="#client-says-hello">
        
      </a>
    </div>
    <p>The string was (mostly) a capture from Wireshark of a Client Hello frame in a TLS 1.2 handshake; as such, it reveals the Server Name where the connection is intended to go - in this case cfl.re.</p><p>There is a string suffixed to this hex stream which shouldn't be there; it's a base64 encoded string <code>R0VUIC8yQWRLemdCClRFWFQgT04gTElORVMgNCBBTkQgNQ==</code>. Decoding this string reveals:</p><blockquote><p>GET /2AdKzgB<br>TEXT ON LINES 4 AND 5</p></blockquote><p>Accordingly, <a href="https://cfl.re/2AdKzgB">https://cfl.re/2AdKzgB</a> redirects to <a href="https://www.cloudflare.com/robots.txt">https://www.cloudflare.com/robots.txt</a>; on lines 4 and 5 is the phrase: "Dear robot be nice".</p><p>The GET request would ordinarily not be appended to the Client Hello like this; however, SNI information would be. You can find more about the work Cloudflare is doing to encrypt such information, so attackers cannot see which site you're visiting, in the following post: <a href="/esni/">Encrypting SNI: Fixing One of the Core Internet Bugs</a></p>
    <div>
      <h3>TOTP</h3>
      <a href="#totp">
        
      </a>
    </div>
    <p>For this part, I'm going to use the <a href="https://github.com/pyauth/pyotp">pyotp</a> library to demonstrate how the challenge is set up:</p>
            <pre><code>&gt;&gt;&gt; import pyotp
&gt;&gt;&gt; totp = pyotp.TOTP('')
&gt;&gt;&gt; print totp.at(1545409768)
084342
&gt;&gt;&gt; print totp.at(1545484313)
411907
&gt;&gt;&gt; print totp.at(1545740103)
617041</code></pre>
    <p>Note that the argument to the TOTP function is set to an empty string; this means that there is no secret in place, and the one-time passwords are generated solely from a hash of the time. Accordingly, the TOTP generated at midnight on New Year is 301554.</p><p>Whilst this may seem like an improbable position for a developer to end up in, searching GitHub, I was even able to find implementations that used a default secret (<i>base32secret3232</i>) for all users wanting to authenticate to a website. This means that any user's one-time password is valid for any other account, and the secret could likely be breached fairly easily (as it isn't randomly generated).</p>
    <div>
      <h3>RPKI</h3>
      <a href="#rpki">
        
      </a>
    </div>
    <p>Cloudflare can only generate ROAs for its own prefixes. The IPv6 prefixes are listed here: <a href="https://cloudflare.com/ips-v6">https://cloudflare.com/ips-v6</a>.</p><p>Using any RPKI-validated prefix list (<a href="https://rpki.cloudflare.com/rpki.json">https://rpki.cloudflare.com/rpki.json</a>, or the GUI of RIPE’s RPKI Validator), test our IPv6 prefixes. Some of them will appear to originate from ASNs reserved for private use:</p><ul><li><p>2803:f800:cfcf:cfcf:cfcf:cfcf:cfcf:1 - <b>B</b></p></li><li><p>2803:f800:cfcf:cfcf:cfcf:cfcf:cfcf:2 - <b>R</b></p></li><li><p>2803:f800:cfcf:cfcf:cfcf:cfcf:cfcf:3 - <b>A</b></p></li><li><p>2803:f800:cfcf:cfcf:cfcf:cfcf:cfcf:4 - <b>V</b></p></li><li><p>2803:f800:cfcf:cfcf:cfcf:cfcf:cfcf:5 - <b>O</b></p></li></ul><p>Subtract 4200000000 from each origin ASN; this gives one byte for each character of the secret word.</p><p>Repeat until the word is decoded.</p><p><i>Interested in helping build a better Internet and drive security online? </i><a href="https://www.cloudflare.com/careers/"><i>Cloudflare is hiring</i></a><i>.</i></p> ]]></content:encoded>
            <category><![CDATA[Cryptography]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Holidays]]></category>
            <category><![CDATA[Fun]]></category>
            <guid isPermaLink="false">68O8V73wAfSlxmCs2J88C7</guid>
            <dc:creator>Junade Ali</dc:creator>
            <dc:creator>Louis Poinsignon</dc:creator>
        </item>
        <item>
            <title><![CDATA[Ten new data centers: Cloudflare expands global network to 165 cities]]></title>
            <link>https://blog.cloudflare.com/ten-new-data-centers/</link>
            <pubDate>Thu, 20 Dec 2018 16:13:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare is excited to announce the addition of ten new data centers across the United States, Bahrain, Russia, Vietnam, Pakistan and France (Reunion).   ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Cloudflare is excited to announce the addition of ten new data centers across the United States, Bahrain, Russia, Vietnam, Pakistan and France (Réunion). We're delighted to help improve the performance and security of over 12 million domains across these diverse countries that collectively represent about half a billion Internet users.</p><p>Our global network now spans 165 cities, with <a href="/tag/march-of-cloudflare/">46 new cities</a> added just this year, and several dozen additional locations being actively worked on.</p>
    <div>
      <h3>United States of America</h3>
      <a href="#united-states-of-america">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5wnKguGqzNUNeAbmaoZ1Kj/355c42cb03326a8f50669808b74e8aab/Charlotte---Columbus.png" />
            
            </figure><p>Our expansion begins in the United States, where Cloudflare's 36th and 37th data centers in the nation serve <b>Charlotte</b> (North Carolina) and <b>Columbus</b> (Ohio) respectively. They are promising markets for interconnection, and join our existing deployments in Ashburn, Atlanta, Boston, Chicago, Dallas, Denver, Detroit, Houston, Indianapolis, Jacksonville, Kansas City, Las Vegas, Los Angeles, McAllen, Memphis, Miami, Minneapolis, Montgomery, Nashville, Newark, Norfolk, Omaha, Philadelphia, Portland, Richmond, Sacramento, Salt Lake City, San Diego, San Jose, Seattle, St. Louis, Tallahassee, and Tampa.</p>
    <div>
      <h3>Bahrain</h3>
      <a href="#bahrain">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5NeYXoKoCPOCmWZPk5IgrS/c30ab4520479b4ccc8872ae4cbe1e36c/Manama.png" />
            
            </figure><p>Cloudflare's <b>Manama</b> (Bahrain) data center, our 158th globally, further expands our <a href="https://www.cloudflare.com/network/">Middle East</a> coverage. A growing hub for cloud computing, including public sector adoption (with the Kingdom's "Cloud First" policy), Bahrain is attracting <a href="https://startupbahrain.com/about/">talent</a> and investment in innovative companies.</p>
    <div>
      <h3>Russia</h3>
      <a href="#russia">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3HAn8shGKNAQdXajgDg0Xv/8797f4072083bf8bf89b502c4341f4fa/St.-Petersburg.png" />
            
            </figure><p>Cloudflare's new <b>St. Petersburg</b> deployment serves as a point of redundancy to our existing <a href="/moscow/">Moscow</a> facility, while also expanding our surface area to withstand DDoS attacks and reducing latency for local Internet users. (Hint: If you live in Novosibirsk or other parts of Russia, stay tuned for upcoming Cloudflare deployments near you).</p>
    <div>
      <h3>Vietnam</h3>
      <a href="#vietnam">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/18UylfgT5boq2PjOb0r1dF/362b4625eb827c098344e49c4e5a2963/Hanoi-Ho---Chi-Minh-City.png" />
            
            </figure><p><b>Hànội and Hồ Chí Minh City,</b> the two most populated cities in Vietnam with estimated populations of 8 million and 9 million respectively, now host Cloudflare's 160th and 161st data centers.</p><p>On November 19, 1997, the Internet officially became available in Vietnam. Since then, several telecommunication companies - including VNPT, FPT, Viettel, CMC, VDC, and NetNam - have played a critical role in integrating the Internet into government systems, businesses, schools, and many other organizations. With our new data centers in place, we are delighted to help provide a faster and safer Internet experience.</p>
    <div>
      <h3>Pakistan</h3>
      <a href="#pakistan">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/dGRKOOCwYSyCI6xr6jjea/fd94bf68fb3b06fd91c5bdd38c37a212/Islamabad---Karachi---Lahore.png" />
            
            </figure><p>The world's sixth most populous country, Pakistan is a land of delicious food, breathtaking natural beauty, poetry and, of course, cricket. Its natural beauty is exemplified by its being home to 5 of the world's 14 mountains that are at least 8,000m high, including K2, the second highest peak in the world. Pakistan's rich history includes the 5,000-year-old lost civilization of <a href="https://www.youtube.com/watch?v=QUng-iHhSzU">Mohenjo-daro</a>, with incredible design ranging from complex architecture on a grid layout to advanced water and sewage systems.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/15pB2d3jaIKdkAyaXU1LpT/6762cfb8ffbdc00bd41bf435e8f88632/image-28.png" />
            
</figure><p><a href="https://en.wikipedia.org/wiki/File:Nanga_Parbat_The_Killer_Mountain.jpg">Nanga Parbat</a> - Creative Commons Attribution-Share Alike 3.0 Unported</p><p>Today, Cloudflare is unveiling three new data centers in Pakistan: one in each of its two most populous cities - <b>Karachi</b> and <b>Lahore</b> - alongside an additional data center in the capital city, <b>Islamabad</b>. We are already seeing latency per request decrease by over 3x (as much as 150ms), and expect this to improve further as we tune routing for all our customers.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5NCC5ia9vYhfgwsZJAU02S/6447ee8bfdb1dae08db24da1f5bb2cf4/Pakistan_Latency.png" />
            
            </figure><p>Latency from PTCL to Cloudflare customers reduces by over 3x across Pakistan. Courtesy: Cedexis</p>
    <div>
      <h3>Réunion (France)</h3>
      <a href="#reunion-france">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/KZfCBUf3ysFggupCSVzNx/f7189508c4cda08337fee267aef17541/-Sainte-Marie-Re-union.png" />
            
</figure><p>8,000 miles away, the final stop in today's expansion is <b>Sainte-Marie</b> on the island of Réunion, the overseas department of France off the coast of Madagascar (which can also expect some Cloudflare servers very soon!)</p>
    <div>
      <h3>Expansion ahead!</h3>
      <a href="#expansion-ahead">
        
      </a>
    </div>
    <p>Even beyond these, we are working on at least six new cities in each of Africa, Asia, Europe, North America, and South America. Guess 20 upcoming locations to receive Cloudflare swag.</p> ]]></content:encoded>
            <category><![CDATA[Cloudflare Network]]></category>
            <category><![CDATA[Data Center]]></category>
            <category><![CDATA[Middle East]]></category>
            <category><![CDATA[North America]]></category>
            <category><![CDATA[USA]]></category>
            <category><![CDATA[Russia]]></category>
            <guid isPermaLink="false">2FknYP8WrpWDPr7i7DyeCX</guid>
            <dc:creator>Nitin Rao</dc:creator>
            <dc:creator>Junade Ali</dc:creator>
            <dc:creator>Tuyen Dinh</dc:creator>
        </item>
        <item>
            <title><![CDATA[Banking-Grade Credential Stuffing: The Futility of Partial Password Validation]]></title>
            <link>https://blog.cloudflare.com/banking-grade-credential-stuffing-the-true-effectiveness-of-a-partial-password-validation/</link>
            <pubDate>Thu, 20 Dec 2018 13:00:00 GMT</pubDate>
<description><![CDATA[ Recently, when logging into one of my credit card providers, I was greeted by a familiar screen. After entering my username, the service asked me to supply 3 random characters from my password to validate ownership of my account. ]]></description>
<content:encoded><![CDATA[ <p>Recently, when logging into one of my credit card providers, I was greeted by a familiar screen. After entering my username, the service asked me to supply 3 random characters from my password to validate ownership of my account.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7sKJCKq5lJ8TWVbwBVPCpp/2129c72a003c8412e81625c1ce71348e/image-4.png" />
            
</figure><p>It is increasingly common knowledge in the InfoSec community that this practice is the antithesis of what we now understand to be secure password management.</p><p>For starters, sites prompting you for Partial Password Validation cannot store your passwords securely using one-way hashing algorithms like BCrypt or Argon2; they need access to individual characters, so the password must be stored in a recoverable form. If the service provider is ever breached, such plain-text passwords can be used to log in to other sites where the account holder uses the same password (known as a Credential Stuffing attack).</p><p>The increased difficulty of using long, randomly generated passwords from Password Managers also leads users to favour their memory over securely generated unique passwords. Those using Password Managers must extract their password from their vault, paste it somewhere else and then work out the correct characters to enter. This added friction further incentivises users to (re-)use simple passwords they can remember and count off on their fingers (and likely repeat on other sites).</p><p>This is not too distinct from the thinking that originally brought us complex password composition rules, which also ultimately incentivised password re-use. I have previously written about how <a href="/how-developers-got-password-security-so-wrong/">broken password management</a> policies came about and the collaborative solution I worked on with security researcher Troy Hunt to <a href="/validating-leaked-passwords-with-k-anonymity/">dissuade password reuse</a> (examples of <a href="https://www.troyhunt.com/pwned-passwords-in-practice-real-world-examples-of-blocking-the-worst-passwords/">this implementation</a> can be found on Troy’s blog).</p><p>In this blog post, however, I want to follow through some of the logic of Partial Password Validation as used by so many websites, including banks and services which certainly contain sensitive data. Let’s see how effective this strategy really is.</p>
    <div>
      <h2>How secure is partial password data?</h2>
      <a href="#how-secure-is-partial-password-data">
        
      </a>
    </div>
<p>In one data breach, we saw that <a href="https://www.troyhunt.com/86-of-passwords-are-terrible-and-other-statistics/">86% of users were using passwords already in data breaches</a>. So, if we have certain characters of a password, how easy is it to identify that password from a data breach? In other words, how much information can 3 characters give us about a password a user logs in with (assuming that password is breached)?</p><p>By randomly selecting 10,000 passwords from a public database of 488,129 breached passwords (after excluding passwords less than 8 characters long), I selected 3 random characters from each password and counted how many passwords in the complete dataset had the same characters in the same positions as the original password. This simulation was run 11 times per password considered.</p><p>In the 110,000 simulations, 58% of tested 3-character password segments were only valid for that password in the database. A further 28% could be associated with only one other password, and 8% with two other passwords.</p><p>As such, we can demonstrate that in this database, knowing only 3 characters of a password is sufficient to breach a significant proportion of accounts. Using slow brute force attacks, hundreds of passwords can be checked against an account before triggering any form of rate limiting (if any exists); combining password segments with common password databases brings the number of search candidates down substantially.</p>
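<p>As a rough illustration of how that simulation works, here is a sketch in JavaScript. The corpus, function names and sample passwords are illustrative assumptions, not the original experiment's code or data:</p>

```javascript
// Count how many passwords in a breach corpus match the target password's
// characters at a given set of challenge positions.
function matchingPasswords(corpus, target, positions) {
  return corpus.filter(pw =>
    positions.every(i => i < pw.length && pw[i] === target[i])
  ).length;
}

// Draw k distinct random positions from the password (partial Fisher-Yates).
function randomPositions(password, k = 3) {
  const idx = [...password.split('').keys()];
  for (let i = 0; i < k; i++) {
    const j = i + Math.floor(Math.random() * (idx.length - i));
    [idx[i], idx[j]] = [idx[j], idx[i]];
  }
  return idx.slice(0, k);
}

// Toy corpus: 'password1' and 'password2' differ only in the last position,
// so most 3-position challenges match both and fail to identify the password.
const corpus = ['password1', 'password2', 'letmein99', 'trustno1!'];
const hits = matchingPasswords(corpus, 'password1', randomPositions('password1'));
console.log(hits); // 2 for most draws; 1 only when position 8 is drawn
```

<p>Run over a real breach corpus with many trials per password, counting how often the 3-character segment matches exactly one entry gives the uniqueness percentages reported above.</p>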
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3Tp7HfUBAEmf26AhZQCwI5/9d0dbb6f928d61ede2b336b161eee65c/image-8.png" />
            
</figure><p>These tactics have been studied beyond simulations alone. Prior work by the University of Edinburgh and the University of Aberdeen, in <a href="http://groups.inf.ed.ac.uk/security/passwords/pps.pdf">a paper on Partial Password implementations and attacks</a>, created a password projection dictionary using a larger data breach (RockYou) alongside recording credentials during login attempts to user accounts. Result collection was based on a survey of user logins to online banking sites and showed a worryingly high success rate in predicting solutions to challenge pages (in particular, against banking websites which required no, or only a weak, second credential):</p><blockquote><p>We find that with 6 guesses, an attacker can respond correctly to 2-place challenges on 6-digit PINs with a success rate of 30%. Recording up to 4 runs, an attacker can succeed over 60% of the time, or by combining guessing and recording, over 90%. Alphanumeric passwords do somewhat better: responding to 3-place challenges on 8-character alphanumeric passwords, with up to 10 guesses, the attacker can achieve a success rate of 5.5%. Combining guessing and recording increases that to 25% with one recorded run and at least 80% with four runs.</p></blockquote><p>Partial Password Validation arguably exists to prevent key loggers (in software or hardware form) from learning your complete password; let's put that hypothesis to the test next.</p>
    <div>
      <h2>Getting the full password</h2>
      <a href="#getting-the-full-password">
        
      </a>
    </div>
<p>Every time my banking site asks for a set of random characters of my password on login, someone who knows my keystrokes can learn even more. With personal computer ownership at an all-time high (e.g. household computer ownership <a href="https://www.statista.com/statistics/289191/household-penetration-of-home-computers-in-the-uk/">at 88% in the UK</a>), it is increasingly likely that users will repeatedly log in to their services from a single device.</p><p>Let's consider the threat model of whether Partial Password Validation actually helps. How many times would a user need to log in before their password is fully revealed? This problem is a modified form of the <a href="https://en.wikipedia.org/wiki/Coupon_collector's_problem">coupon collector's problem</a> (with multiple selections per draw), and as such <a href="https://math.stackexchange.com/questions/131664/coupon-collector-problem-with-batched-selections">approximations can be calculated</a> mathematically; here, though, I'll demonstrate it using real password data and simulations.</p><p>It is trivial to calculate how many ways 3 random characters can be selected from a password using binomial coefficients; there are 220 ways of selecting 3 characters from a 12 character password. But what fundamentally matters is that, at the end of the day, only 12 character positions need to be revealed to breach the password. Every time a user is prompted for more random characters, we have a greater chance of learning their complete password.</p><p>Taking the lengths of random passwords (over 8 characters in length), I ran a simulation to see how many logins it would take for the entire password to become known. 110,000 simulations were run collectively on 10,000 passwords.</p><p><b>In simulation, 58% of passwords are revealed in their entirety after 7 logins, 90% after 12 and 99% after 19 logins.</b> These results would be even more stark with a maximum password length constraint (and/or including passwords less than 8 characters), as is imposed by many online banks using such an approach.</p>
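<p>The simulation can be sketched along these lines. This is illustrative JavaScript, not the original code; it tracks which character positions a keylogger has observed across successive 3-character challenges until the whole password is known:</p>

```javascript
// Count logins until every character position has appeared in a challenge.
function loginsUntilFullReveal(passwordLength, k = 3) {
  const seen = new Set();
  let logins = 0;
  const draws = Math.min(k, passwordLength);
  while (seen.size < passwordLength) {
    logins++;
    // Draw k distinct positions for this login's challenge (partial shuffle).
    const pool = Array.from({ length: passwordLength }, (_, i) => i);
    for (let i = 0; i < draws; i++) {
      const j = i + Math.floor(Math.random() * (pool.length - i));
      [pool[i], pool[j]] = [pool[j], pool[i]];
      seen.add(pool[i]);
    }
  }
  return logins;
}

// Estimate the distribution for a 12-character password over many trials.
const trials = Array.from({ length: 10000 }, () => loginsUntilFullReveal(12))
  .sort((a, b) => a - b);
console.log('median:', trials[5000], '90th percentile:', trials[9000]);
```

<p>Running this over the lengths of real breached passwords, rather than a fixed length, yields the percentages above.</p>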
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1HcjRHHCXJUqYomqPEJ87i/3f069cf91455b95a8ba18e0e847a8b80/image-12.png" />
            
            </figure>
    <div>
      <h2>The Road to Multi-Factor Authentication</h2>
      <a href="#the-road-to-multi-factor-authentication">
        
      </a>
    </div>
<p>There we have it: Partial Password Validation offers no substantive keylogger protection; all you get is a guarantee that the service you're dealing with isn't capable of properly securing your passwords (or is too lazy to do so).</p><p>Password Managers are a blessing; they provide a huge boost in security by allowing users to set strong, unique passwords on a per-site basis, and Partial Password Validation poses a massive usability barrier to users adopting such practices. In the words of the British National Cyber Security Centre: "<a href="https://www.ncsc.gov.uk/blog-post/let-them-paste-passwords">Let them paste passwords</a>".</p><p>So what should you do instead? Instead of merely using what your customers <i>know</i> to secure their accounts, additionally use something they <i>have</i>. The industry standard for keylogger protection is Two Factor Authentication (or Multi Factor Authentication): using either apps on your users' smartphones to generate cryptographically secure tokens, or, as is increasingly the case with banks, giving your customers hardware Two Factor token generators to validate their login attempts. Hardware token generators are a distinct piece of hardware dedicated to generating secure tokens; users cannot use them to log in to their online accounts, which keeps the two factors separate.</p><p>Increasingly, SMS is used to send users a code to validate their login attempt, in addition to the password the user enters. SMS isn't technically Second Factor Authentication; it instead represents a One Time Password. The same <a href="https://blog.1password.com/totp-for-1password-users/">can also be said</a> of users logging in to a site, using a password, on the same device they use to generate their one time token. Additionally, SMS tokens are not cryptographically generated on the device itself using the <a href="https://tools.ietf.org/html/rfc6238">RFC 6238 standard</a> employed by 2FA apps, but are sent over a protocol with <a href="https://www.theverge.com/2017/9/18/16328172/sms-two-factor-authentication-hack-password-bitcoin">numerous security shortcomings</a>. In practice, this means that interception can reveal the One Time Password you receive.</p><p>Whilst there are better alternatives to SMS, and indeed users who are educated in such tactics should be allowed to use them, there is a usability gap to setting up Two Factor Authentication and somewhat of a cost and operational barrier to issuing users hardware devices. However, at a time when the vast majority of users guard their accounts with nothing more than a password, it is somewhat excusable to opt users in to SMS One Time Passwords by default, whilst providing the option of Two Factor Authentication if desired. The UK tax office (Her Majesty's Revenue &amp; Customs) has gone down this route, <a href="https://developer.service.hmrc.gov.uk/api-documentation/docs/authorisation/two-step-verification">requiring users to set up some form of Multi-Step Verification</a> before being able to proceed with managing their accounts; either via SMS, landline or an authenticator app.</p><p>In conclusion: whilst SMS isn't the most secure way of delivering a One Time Password, it offers huge advantages over a user relying on a password alone. Any form of One Time Password is better than none at all, and it provides a path for users to handle the security threats that Partial Password Validation promises to address, but falls short on.</p><p><i>Want to help businesses, small and big, protect their users from security threats, small and large? The Cloudflare technical support team is </i><a href="https://www.cloudflare.com/careers/departments/customer-support/"><i>hiring</i></a><i>.</i></p> ]]></content:encoded>
            <category><![CDATA[Passwords]]></category>
            <category><![CDATA[Security]]></category>
            <guid isPermaLink="false">74Z1CAmW05crCbV74miJbi</guid>
            <dc:creator>Junade Ali</dc:creator>
        </item>
        <item>
            <title><![CDATA[Optimising Caching on Pwned Passwords (with Workers)]]></title>
            <link>https://blog.cloudflare.com/optimising-caching-on-pwnedpasswords/</link>
            <pubDate>Thu, 09 Aug 2018 15:42:50 GMT</pubDate>
            <description><![CDATA[ In February, Troy Hunt unveiled Pwned Passwords v2. Containing over half a billion real world leaked passwords, this database provides a vital tool for correcting the course of how the  ]]></description>
<content:encoded><![CDATA[ <p>In February, Troy Hunt unveiled <a href="https://www.troyhunt.com/ive-just-launched-pwned-passwords-version-2/">Pwned Passwords v2</a>. Containing over half a billion real world leaked passwords, this database provides a vital tool for correcting the course of how the industry combats modern threats against password security.</p><p>In supporting this project, I built a <a href="/validating-leaked-passwords-with-k-anonymity/">k-Anonymity model</a> to add a layer of security to the queries performed. This model allows for enhanced caching by mapping multiple leaked password hashes to a single hash prefix, and by operating in a deterministic, HTTP-friendly way (which allows caching, whereas other implementations of Private Set Intersection require a degree of randomness).</p><p>Since launch, Pwned Passwords, using this anonymity model and delivered by Cloudflare, has been widely implemented across a variety of platforms - from sites like EVE Online and Kogan to tools like 1Password and Okta's PassProtect. The anonymity model is also used by Firefox Monitor when checking if an email is in a data breach.</p><p>Since its adoption, Troy has tweeted about the high cache hit ratio, and people have been asking me about my "secret ways" of achieving it. Over time I have touched various pieces of Cloudflare's caching systems; in late 2016 I worked to bring <a href="/caching-anonymous-page-views/">Bypass Cache on Cookie functionality</a> to our self-service Business plan users and wrestled with the <a href="/the-curious-case-of-caching-csrf-tokens/">cache implications of CSRF tokens</a> - but Pwned Passwords was far more fun: a chance to show the power of Cloudflare's cache functionality from the perspective of a user.</p><blockquote><p>Looks like Pwned Passwords traffic has started to double over the norm, trending around 8M requests a day now. 
<a href="https://twitter.com/IcyApril?ref_src=twsrc%5Etfw">@IcyApril</a> made a cache change to improve stability but reduce hit ratio around the 10th, but that's improving again now with higher volumes (94% for the last week). <a href="https://t.co/HwMDLlmBEY">pic.twitter.com/HwMDLlmBEY</a></p><p>— Troy Hunt (@troyhunt) <a href="https://twitter.com/troyhunt/status/1011139442847899648?ref_src=twsrc%5Etfw">June 25, 2018</a></p></blockquote><blockquote><p>Will <a href="https://twitter.com/IcyApril?ref_src=twsrc%5Etfw">@IcyApril</a> secret ways ever be released?!</p><p>— Neal (@tun35) <a href="https://twitter.com/tun35/status/993279245295144962?ref_src=twsrc%5Etfw">May 7, 2018</a></p></blockquote><p>It is worth noting that Pwned Passwords is not like a typical website in terms of caching - it contains 16^5 possible API queries (every possible five-character hexadecimal prefix, over a million in total) in order to guarantee k-Anonymity in the API. Whilst the API guarantees <i>k</i>-Anonymity, it does not guarantee <i>l</i>-Diversity, meaning some queries occur more frequently than others.</p><p>For ordinary websites, with fewer assets, the cache hit ratio can be far greater. An example of this is another site Troy set up using our barebones free plan; by simply configuring a <a href="https://support.cloudflare.com/hc/en-us/articles/200172256-How-do-I-cache-static-HTML-">Page Rule with the Cache Everything option</a> (and setting an <a href="/edge-cache-expire-ttl-easiest-way-to-override/">Edge Cache TTL option</a>, should the Cache-Control headers from your origin not do so), you are able to cache static HTML easily.</p><blockquote><p>When I've written about really high cache-hit ratios on <a href="https://twitter.com/haveibeenpwned?ref_src=twsrc%5Etfw">@haveibeenpwned</a> courtesy of <a href="https://twitter.com/Cloudflare?ref_src=twsrc%5Etfw">@Cloudflare</a>, some people have suggested it's due to higher-level plans. 
Here's <a href="https://t.co/Y4GlsInvu2">https://t.co/Y4GlsInvu2</a> running on the *free* plan: 99.0% cache hit ratio on requests and 99.5% on bandwidth. Free! <a href="https://t.co/pP0wo7qKF3">pic.twitter.com/pP0wo7qKF3</a></p><p>— Troy Hunt (@troyhunt) <a href="https://twitter.com/troyhunt/status/1024383718083809280?ref_src=twsrc%5Etfw">July 31, 2018</a></p></blockquote>
    <div>
      <h3>Origin Headers</h3>
      <a href="#origin-headers">
        
      </a>
    </div>
<p>Indeed, the fact that these are API queries makes a substantial difference. When optimising caching, the most important thing to look for is instances where the same asset is stored multiple times under different cache keys; for some assets this may involve selectively <a href="https://support.cloudflare.com/hc/en-us/articles/200168256-What-are-Cloudflare-s-caching-levels-">ignoring query strings</a> for cache purposes, but for APIs the devil is more in the detail.</p><p>When an HTTP request is made from a JavaScript asset (as happens when Pwned Passwords is directly implemented in login forms), the site will also send an <code>Origin</code> header to indicate where the fetch originates from.</p><p>When you make a search on <a href="https://haveibeenpwned.com/Passwords">haveibeenpwned.com/Passwords</a>, a bit of JavaScript applies the k-Anonymity model: it SHA-1 hashes the password, truncates the hash to the first five characters, sends that prefix off to <a href="https://api.pwnedpasswords.com/range/A94A8">https://api.pwnedpasswords.com/range/A94A8</a>, and then checks whether any of the suffixes in the response match.</p><p>In the headers of this request to PwnedPasswords.com, you can see that the request contains an <code>Origin</code> header of the querying site.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7bz8fMgdMDbhy4ZE2FkyoA/d3c727377e82c525d57a67da2a78d19f/Screen-Shot-2018-08-08-at-22.20.05.png" />
            
</figure><p>This header is often useful for mitigating Cross-Site Request Forgery (CSRF) vulnerabilities by only allowing certain <code>Origin</code>s to make HTTP requests using Cross-Origin Resource Sharing (CORS).</p><p>In the context of an API, this does not necessarily make sense where there is no state (i.e. cookies). However, Cloudflare's default Cache Key contains this header for those who wish to use it. This means Cloudflare will store a new cached copy of the asset whenever a different <code>Origin</code> header is present. Whilst this is ordinarily not a problem (most sites see one <code>Origin</code> header, or just a handful when using CORS), Pwned Passwords receives <code>Origin</code> headers from websites all over the internet.</p><p>As Pwned Passwords will always respond with the same content for a given request, regardless of the <code>Origin</code> header, we are able to remove this header from the Cache Key using our Custom Cache Key functionality.</p><p>Incidentally, JavaScript CDNs will frequently have assets fetched as sub-resources from another JavaScript asset - removing the <code>Origin</code> header from their Cache Key can have similar benefits:</p><blockquote><p>Just applied some <a href="https://twitter.com/Cloudflare?ref_src=twsrc%5Etfw">@Cloudflare</a> cache magic I experimented with to get <a href="https://twitter.com/troyhunt?ref_src=twsrc%5Etfw">@troyhunt</a>'s Pwned Passwords API cache hit ratio to ~91%, to a large JS CDN (<a href="https://twitter.com/unpkg?ref_src=twsrc%5Etfw">@unpkg</a>) during a slow traffic period. Traffic 30mins post deploy shows a growing ~94% Cache Hit Ratio (with a planned cache purge!). <a href="https://t.co/ZQmfzEi4Y2">pic.twitter.com/ZQmfzEi4Y2</a></p><p>— Junade Ali (@IcyApril) <a href="https://twitter.com/IcyApril/status/993276068210540544?ref_src=twsrc%5Etfw">May 6, 2018</a></p></blockquote>
    <div>
      <h3>Case Insensitivity</h3>
      <a href="#case-insensitivity">
        
      </a>
    </div>
<p>One thing I realised after speaking to <a href="https://twitter.com/stebets">Stefán Jökull Sigurðarson</a> from EVE Online was that different users were querying assets using different casing; for example, instead of <code>range/A94A8</code>, a request to <code>range/a94a8</code> would return the same asset. As the Cache Key was case-sensitive, the asset would be cached twice.</p><p>Unfortunately, the API was already public, with both forms of casing acceptable, by the time I started these optimisations.</p>
    <div>
      <h4>Enter Cloudflare Workers</h4>
      <a href="#enter-cloudflare-workers">
        
      </a>
    </div>
    <p>Instead of adjusting the cache key to solve this problem, I decided to use Cloudflare Workers - allowing me to adjust cache behaviour using JavaScript.</p><p>Troy initially had a simple worker on the site to enable CORS:</p>
            <pre><code>addEventListener('fetch', event =&gt; {
    event.respondWith(checkAndDispatchReports(event.request))
})

async function checkAndDispatchReports(req) {
    if(req.method === 'OPTIONS') {
        let responseHeaders = setCorsHeaders(new Headers())
        return new Response('', {headers:responseHeaders})
    } else {
        return await fetch(req)
    }
}

function setCorsHeaders(headers) {
    headers.set('Access-Control-Allow-Origin', '*')
    headers.set('Access-Control-Allow-Methods', 'GET')
    headers.set('Access-Control-Allow-Headers', 'access-control-allow-headers')
    headers.set('Access-Control-Max-Age', 1728000)
    return headers
}</code></pre>
<p>I added to this worker to ensure that when a request left Workers, the hash prefix would always be upper case. Additionally, I used the <code>cacheKey</code> flag to set the Cache Key directly in Workers when making the request (instead of using our internal Custom Cache Key configuration):</p>
            <pre><code>addEventListener('fetch', event =&gt; {
  event.respondWith(handleRequest(event.request));
})

/**
 * Fetch request after making casing of hash prefix uniform
 * @param {Request} request
 */
async function handleRequest(request) {
      
  if(request.method === 'OPTIONS') {
    let responseHeaders = setCorsHeaders(new Headers())
    return new Response('', {headers:responseHeaders})
  }

  const url = new URL(request.url);

  if (!url.pathname.startsWith("/range/")) {
    const response = await fetch(request)
    return response;
  }

  const prefix = url.pathname.substr(7);
  const newRequest = "https://api.pwnedpasswords.com/range/" + prefix.toUpperCase()

  if (prefix === prefix.toUpperCase()) {
    const response = await fetch(request, { cf: { cacheKey: newRequest } })
    return response;
  }

  const init = {
      method: request.method,
      headers: request.headers
  }
  
  const modifiedRequest = new Request(newRequest, init)
  const response = await fetch(modifiedRequest, { cf: { cacheKey: newRequest } })
  return response
}

function setCorsHeaders(headers) {
    headers.set('Access-Control-Allow-Origin', '*')
    headers.set('Access-Control-Allow-Methods', 'GET')
    headers.set('Access-Control-Allow-Headers', 'access-control-allow-headers')
    headers.set('Access-Control-Max-Age', 1728000)
    return headers
}</code></pre>
<p>Incidentally, our Workers team is working on some really cool stuff around controlling our cache APIs at a fine-grained level; you'll be able to see some of it in due course by following this blog.</p>
    <div>
      <h3>Argo</h3>
      <a href="#argo">
        
      </a>
    </div>
<p>Finally, Argo plays an important part in improving the cache hit ratio. It is best known for optimising the speed at which traffic travels around the internet - but it also means that when traffic is routed from one Cloudflare data center to another, if an asset is cached closer to the origin web server, the asset will be served from that data center. In essence, it offers Tiered Cache functionality: when traffic comes through a less used Cloudflare data center, it can still utilise the cache of a data center receiving greater traffic (and therefore more likely to have the asset in cache). This prevents a request from having to travel all the way around the world to the origin whilst still being served from cache (even if not optimally close to the user).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2WSIsI6N3zdhf1UFaBcVg7/91a6e29d33fa31eddec4ff3bc962e9ca/Argo-infographic-1.jpg" />
            
            </figure>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
<p>By using Cloudflare's caching functionality, we are able to reduce the number of times a single asset is cached due to accidental variations in request parameters. Workers offers a mechanism to control the caching of assets on Cloudflare, with more fine-grained controls under active development.</p><p>By implementing this on Pwned Passwords, we are able to provide developers a simple and fast interface to reduce password reuse amongst their users, thereby limiting the effects of Credential Stuffing attacks on their systems. If only Irene Adler had used a password manager.</p><p>Interested in helping debug performance, cache and security issues for websites of all sizes? We're hiring for Support Engineers to join us in <a href="https://boards.greenhouse.io/cloudflare/jobs/584856?gh_jid=584856">London</a>, and additionally those speaking <a href="https://boards.greenhouse.io/cloudflare/jobs/1093007?gh_jid=1093007">Japanese</a>, <a href="https://boards.greenhouse.io/cloudflare/jobs/1149892?gh_jid=1149892">Korean</a> or <a href="https://boards.greenhouse.io/cloudflare/jobs/584859?gh_jid=584859">Mandarin</a> in our Singapore office.</p> ]]></content:encoded>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Cache]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Serverless]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Developers]]></category>
            <guid isPermaLink="false">5ePWlgP8TponwYDpOj97hF</guid>
            <dc:creator>Junade Ali</dc:creator>
        </item>
        <item>
            <title><![CDATA[Going Proactive on Security: Driving Encryption Adoption Intelligently]]></title>
            <link>https://blog.cloudflare.com/being-proactive/</link>
            <pubDate>Tue, 24 Jul 2018 17:32:43 GMT</pubDate>
            <description><![CDATA[ It's no secret that Cloudflare operates at a huge scale. Cloudflare provides security and performance to over 9 million websites all around the world, from small businesses and WordPress blogs to Fortune 500 companies. That means one in every 10 web requests goes through our network. ]]></description>
            <content:encoded><![CDATA[ <p>It's no secret that Cloudflare operates at a huge scale. Cloudflare provides security and performance to over 9 million websites all around the world, from small businesses and WordPress blogs to Fortune 500 companies. That means one in every 10 web requests goes through our network.</p><p>However, hidden behind the scenes, we offer support in using our platform to all our customers - whether they're on our free plan or on our Enterprise offering. This blog post dives into some of the technology that helps make this possible and how we're using it to drive encryption and build a better web.</p>
    <div>
      <h3>Why Now?</h3>
      <a href="#why-now">
        
      </a>
    </div>
<p>Recently, web browser vendors have been working to extend encryption on the internet. Traditionally, they used positive indicators to mark encrypted traffic as secure: when traffic was served securely over HTTPS, a green padlock in your browser would indicate that this was the case. In moving to standardise encryption online, Google Chrome has been leading the charge in marking insecure page loads as "Not Secure". Today, this UI change has been pushed out to all Google Chrome users globally for all websites: any website loaded over HTTP will be marked as insecure.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4kSP68dcfVP7ZePmZVcX5m/52364a548f6f0540eb2b39d04edc41ac/chrome68.png" />
            
            </figure><p>That's not all though; all resources loaded by a website need to be loaded over HTTPS and such sites need to be configured properly to avoid mixed-content warnings, not to mention correctly configuring secure cryptography at the web server. Cloudflare helped bring widespread adoption of HTTPS to the internet by offering <a href="https://www.cloudflare.com/application-services/products/ssl/">free of charge SSL certificates</a>; in doing so we've become experts at knowing where web developers trip up in configuring HTTPS on their websites. HTTPS is now important for everyone who builds on the web, not just those with an interest in cryptography.</p>
    <div>
      <h3>Meet HelperBot</h3>
      <a href="#meet-helperbot">
        
      </a>
    </div>
    <p>In recent months, we’ve taken this expertise to help our Cloudflare customers avoid common mistakes. One of the things my team and I have been working on is building intelligent systems which automatically triage support tickets and present relevant debugging information upfront to the agent assigned to the ticket.</p><p>We use a custom-built Natural Language Processing model to determine the issues related to what the customer is discussing, and then we run technical tests in a Chain-of-Responsibility (with the tests most relevant to the customer running first) to determine what's going wrong. We then automatically triage the ticket and present this information to the support agent in the ticket.</p><p>Here's an example of a piece of the information we present upfront:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5NpVZj6VaEibiI2EjHoWwE/e0d7c8520532f96ff1770fb7a928f58a/Screen-Shot-2018-07-20-at-22.32.15.png" />
            
            </figure><p>Whilst we initially built automated debugging tests by hand, we soon used Search Based Software Engineering strategies to have the system write its own debugging automations based on various data points (such as the underlying technologies powering a site, their configuration or their error rates). When we detect anomalies, we are able to present this information upfront to our support agents to reduce the manual debugging they must conduct. In essence, we are able to get the software to write itself from test behaviour, within reason.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7DyM6G9zIBhPfuHQKGLrYV/700d3be7e165f2bb67f111e786aca410/Screen-Shot-2018-07-20-at-22.32.26.png" />
            
            </figure><p>Whilst this data is largely used internally, we are starting to A/B test new versions of our support ticket submission form which present a subset of this information upfront to users before they write in to us - allowing them to get answers to their problems quicker.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2Ai05dz42OOzBvTPyZbKCB/f5a3457bf73ef074f847680ab19c31e3/Screen-Shot-2018-07-20-at-22.42.01.png" />
            
            </figure>
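    <p>The triage flow described above can be sketched as a simple chain of responsibility, with each automated check running in order of its relevance to the issue the NLP model predicted. The class and check names below are hypothetical and purely illustrate the pattern - they are not our actual implementation:</p>

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Diagnostic:
    """One automated check; `relevance` would come from the NLP issue classifier."""
    name: str
    relevance: float
    check: Callable[[dict], Optional[str]]  # returns a finding, or None if the test passes

@dataclass
class TriageChain:
    diagnostics: list = field(default_factory=list)

    def run(self, ticket: dict) -> list:
        """Run every diagnostic, most relevant first, and collect the findings."""
        findings = []
        for diag in sorted(self.diagnostics, key=lambda d: d.relevance, reverse=True):
            result = diag.check(ticket)
            if result is not None:
                findings.append((diag.name, result))
        return findings

# Hypothetical usage: for an SSL-related ticket, a mixed-content check outranks a DNS check.
chain = TriageChain([
    Diagnostic("dns-misconfiguration", 0.2, lambda t: None),
    Diagnostic("mixed-content", 0.9,
               lambda t: "insecure subresources found" if t.get("http_assets") else None),
])
findings = chain.run({"http_assets": True})
```

    <p>Note that unlike a strict chain of responsibility, which stops at the first handler, every check here runs to completion so the agent sees all findings upfront.</p>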
    <div>
      <h3>Being Proactive About Security</h3>
      <a href="#being-proactive-about-security">
        
      </a>
    </div>
    <p>To help drive adoption of a more secure internet - and drive down common misconfigurations of SSL - we have started testing emailing customers proactively about Mixed Content errors and Redirect Loops associated with HTTPS web server misconfigurations.</p><p>By joining forces with our Marketing team, we were able to run an ongoing campaign testing how users respond to proactive security advice. Users receive messages similar to the one below.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7iIYIvaExMtNBkBPxwqdyU/b432461153c27b5dead0a9773320d300/Screen-Shot-2018-07-20-at-22.49.18.png" />
            
            </figure><p>With this capability, we decided to expose the functionality to a wider audience, including those not already using Cloudflare.</p>
    <div>
      <h3>SSL Test Tool (Powered by HelperBot-External)</h3>
      <a href="#ssl-test-tool-powered-by-helperbot-external">
        
      </a>
    </div>
    
            <figure>
            <a href="https://www.cloudflare.com/lp/ssl-test?utm_medium=website&amp;utm_source=blog&amp;utm_campaign=chrome68">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3z3r2uTJBYPDFo2AXJ2Lla/bcdc986185ffaff6959524f32943e3b0/Screen-Shot-2018-07-21-at-00.53.26.png" />
            </a>
            </figure><p>To help website owners make the transition to HTTPS, we've launched <a href="https://www.cloudflare.com/lp/ssl-test?utm_medium=website&amp;utm_source=blog&amp;utm_campaign=chrome68">the SSL Test Tool</a>. We internally codenamed the backend as HelperBot-External, after the internal HelperBot service. We decided to take a subset of the SSL tests we use internally and allow someone to run a basic version of the scan on their own site. This helps users understand what they need to do to move their site to HTTPS by detecting the most common issues. By doing so, we seek to help users who are struggling to get over the line in enabling HTTPS on their sites by providing them with some dynamic guidance in plain English.</p><p>The tool runs 12 tests across three key categories of errors: HTTPS Disabled, Client Errors and Cryptography Errors. Unlike other tools, these tests are based on the questions we see real users ask about their SSL configuration and the tasks they most struggle with. This is a tool designed to support all web developers in enabling HTTPS, not just those with an interest in cryptography. For example, by educating users about mixed content errors, we are able to make the case for them enabling HTTP Strict Transport Security, thereby <a href="https://www.cloudflare.com/learning/security/glossary/website-security-checklist/">improving the security practices</a> they adopt.</p><p>Further, these tests are available to everyone. We believe it’s important that the entire Internet be safer, not only for our customers and their visitors (although, admittedly, Cloudflare’s SSL and crypto features make it very simple to be HTTPS-ready).</p>
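    <p>To illustrate the kind of check involved, a minimal mixed-content detector can be written in a few lines: scan the HTML of a page served over HTTPS for subresources still referenced over plain HTTP. This sketch is illustrative only, not the tool's actual implementation - a production check would parse the HTML properly and exclude plain navigation links:</p>

```python
import re

# Subresource attributes pointing at plain-HTTP URLs. Anchors (<a href>) are
# navigation, not mixed content, so a real check would filter those out.
_HTTP_ASSET = re.compile(r'''(?:src|href)\s*=\s*["'](http://[^"']+)["']''', re.IGNORECASE)

def find_mixed_content(html: str) -> list:
    """Return the plain-HTTP asset URLs that would trigger mixed-content warnings."""
    return _HTTP_ASSET.findall(html)
```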
    <div>
      <h3>Conclusion: Just the Beginning</h3>
      <a href="#conclusion-just-the-beginning">
        
      </a>
    </div>
    <p>As we grow our intelligence capabilities, we do so to provide better performance and security to our customers. We want to build a better internet and make our users more successful on our platform. Whilst there's still plenty of ground left to cover in building out our intelligent capability for supporting customers, we're developing rapidly and are focussed on using those skills to improve things our customers care about.</p>
            <category><![CDATA[HTTPS]]></category>
            <category><![CDATA[Chrome]]></category>
            <category><![CDATA[SSL]]></category>
            <category><![CDATA[Support]]></category>
            <category><![CDATA[Security]]></category>
            <guid isPermaLink="false">7ipQzFpDytxabYUE49T7jE</guid>
            <dc:creator>Junade Ali</dc:creator>
        </item>
        <item>
            <title><![CDATA[DNS-Over-TLS Built-In & Enforced - 1.1.1.1 and the GL.iNet GL-AR750S]]></title>
            <link>https://blog.cloudflare.com/dns-over-tls-built-in/</link>
            <pubDate>Sat, 14 Jul 2018 17:13:00 GMT</pubDate>
            <description><![CDATA[ Back in April, I wrote about how it was possible to modify a router to encrypt DNS queries over TLS using Cloudflare's 1.1.1.1 DNS Resolver and a GL.iNet router; the folks at GL.iNet read that blog post and decided to bake DNS-Over-TLS support into their new router using the 1.1.1.1 resolver. ]]></description>
            <content:encoded><![CDATA[ <p>GL.iNet GL-AR750S in black, same form-factor as the prior white GL.iNet GL-AR750. Credit card for comparison.</p><p>Back in April, I wrote about how it was possible to <a href="/dns-over-tls-for-openwrt/">modify a router to encrypt DNS queries over TLS</a> using Cloudflare's 1.1.1.1 DNS Resolver. For this, I used the GL.iNet GL-AR750 because it was pre-installed with OpenWRT (LEDE). The folks at GL.iNet read that blog post and decided to bake DNS-Over-TLS support into their new router using the 1.1.1.1 resolver; they sent me one to take a look at before it was available for pre-release. Their new router can also be configured to force DNS traffic to be encrypted before leaving your local network, which is particularly useful for any IoT or mobile device with hard-coded DNS settings that would ordinarily ignore your router's DNS settings and send <a href="https://www.cloudflare.com/learning/dns/what-is-dns/">DNS queries</a> in plain-text.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/20ZCmvqywZDLMhksEzdyXC/1339330e29fe38bb61f4d61aba058a0a/Screen-Shot-2018-07-14-at-04.36.11.png" />
            
            </figure><p>In <a href="/dns-over-tls-for-openwrt/">my previous blog post</a> I discussed how DNS was often the weakest link in the chain when it came to browsing privacy; whilst HTTP traffic is increasingly encrypted, this is seldom the case for DNS traffic. This makes it relatively trivial for an intermediary to work out what site you're sending traffic to. In that post, I went through the technical steps required to modify a router using OpenWRT to support DNS Privacy using the DNS-Over-TLS protocol.</p><p>GL.iNet had been in contact since I wrote the original blog post and were very supportive of encrypting DNS queries at the router level. Last week, whilst working in Cloudflare's San Francisco office, they reached out to me over Twitter to let me know they were soon to launch a new product with a new web UI containing a "DNS over TLS from Cloudflare" feature and offered to send me the new router before it was even available for pre-order.</p><p>On arriving back at our London office, I found a package from Hong Kong waiting for me. Aside from the difference in colour, the AR750<b>S</b> itself is identical in form-factor to the AR750 and was packaged up very similarly. They both have capacity for external storage, an OpenVPN client and can be powered over USB; amongst many other useful functionalities. Alongside the <i>S</i> suffixing the model number, I did notice the new model had some upgraded specs, but I won't dwell on that here.</p><p>Below you can see the white AR750 and the new black AR750S router together for comparison. Both have a WAN ethernet port, 2 LAN ethernet ports, a USB port for external storage (plus a micro SD port) and a micro USB power port.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6xi7DHkcKlmlyhATCPwXY5/ff76634b1660bd6273c04d551b057a39/IMG_4222-COLLAGE--1-.jpg" />
            
            </figure><p>The UI is where the real changes come. In the <i>More Settings</i> tab, there's an option to configure DNS with some nice options.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/44yMMCQEQYU0gwF2Mw7sJy/a1541ecc5226bd91ff4a0b4371234e3c/Screen-Shot-2018-07-14-at-03.46.56.png" />
            
            </figure><p>One notable option is the <i>DNS over TLS from Cloudflare</i> toggle. This option uses the TLS security protocol for encrypting DNS queries, helping increase privacy and prevent eavesdropping.</p><p>Another option, <i>Override DNS Settings for All Clients</i>, forcibly overrides the DNS configuration on all clients so that queries are encrypted to the WAN. Unencrypted DNS traffic is intercepted by the router, and by forcing traffic to use its own local resolver, it is able to transparently rewrite traffic to be encrypted before leaving the router and heading out into the public internet to the upstream resolver - 1.1.1.1.</p><p>This option is particularly useful when dealing with embedded systems or IoT devices which don't have configurable DNS options; Smart TVs, TV boxes, <a href="/iot-security-anti-patterns/">your toaster</a>, etc. As this router can proxy traffic over to other Wi-Fi networks (and is portable), this is particularly useful when connecting out to an ordinarily insecure Wi-Fi network; the router can sit in the middle and transparently upgrade unencrypted DNS queries. This is even useful when dealing with phones and tablets where you can't install a DNS-Over-TLS client.</p><p>These options both come disabled by default, but can easily be toggled in the UI. As before, you can configure other DNS resolvers by toggling "Manual DNS Server Settings" and entering in any other DNS servers.</p><p>There are a number of other cool features I've noticed in this router; for example, the <i>More Settings</i> &gt; <i>Advanced</i> option takes you into a standard LuCi UI that ordinarily comes bundled with LEDE routers.
Like previous routers, you can easily <a href="https://www.cloudflare.com/learning/access-management/what-is-ssh/">SSH</a> into the device and install various programs and perform customisations.</p><p>For example, after installing TCPDump on the router, I am able to run <code>tcpdump -n -i wlan-sta 'port 853'</code> to see encrypted DNS traffic leaving the router. When I run a DNS query over an unencrypted resolver (using <code>dig A junade.com</code> on my local computer), I can see the outgoing DNS traffic upgraded to encrypted queries on 1.1.1.1 and 1.0.0.1.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1xYgg1X0pt0JApfVWfLNo2/d7273e83795f4640db9903809b025233/Screen-Shot-2018-07-14-at-17.21.37.png" />
            
            </figure><p>If you're interested in learning how to configure 1.1.1.1 on other routers, your computer or your phone - check out the project landing page at <a href="https://1.1.1.1/">https://1.1.1.1/</a>. If you're a developer and want to learn about how you can integrate 1.1.1.1 into your project with either DNS-Over-TLS or DNS-Over-HTTPS, checkout the <a href="https://developers.cloudflare.com/1.1.1.1/">1.1.1.1 Developer Documentation</a>.</p> ]]></content:encoded>
            <category><![CDATA[1.1.1.1]]></category>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[TLS]]></category>
            <category><![CDATA[Privacy]]></category>
            <category><![CDATA[Resolver]]></category>
            <category><![CDATA[IoT]]></category>
            <category><![CDATA[Security]]></category>
            <guid isPermaLink="false">6qlpMBGbFi8esdYxM4LWxG</guid>
            <dc:creator>Junade Ali</dc:creator>
        </item>
        <item>
            <title><![CDATA[Privacy-Protecting Portable Router: Adding DNS-Over-TLS support to OpenWRT (LEDE) with Unbound]]></title>
            <link>https://blog.cloudflare.com/dns-over-tls-for-openwrt/</link>
            <pubDate>Mon, 09 Apr 2018 19:20:32 GMT</pubDate>
            <description><![CDATA[ This blog post explains how you can configure an OpenWRT router to encrypt DNS traffic to Cloudflare Resolver using DNS-over-TLS. ]]></description>
            <content:encoded><![CDATA[ <p><i>If you want to skip ahead to instructions, </i><a href="#settingupdnsovertls"><i>scroll to the next section</i></a><i>. But I, like a TLS handshake, am very verbose so please enjoy this opener.</i></p><p>Imagine this scenario - I'm at a restaurant and need to have a private phone conversation but unfortunately my phone's battery is drained. To get around this problem, I borrow my friend's phone and dial the number - to protect my privacy I walk outside. When I'm done with the call, I come back inside and return the phone.</p><p>Whilst the phone itself doesn't store the conversation I've had, it does have a log of the recently dialed number; if the friend from whom I borrowed the phone wanted to, they could easily see who I actually called - even if they don't specifically know the topic of conversation.</p><p>Sometimes, the data about who you've spoken to can tell an awful lot about the conversation - if someone was to call an emotional support hotline or a debt collector, you could probably infer a lot about the conversation from the caller ID.</p><p>When we browse the internet, we use encryption to try and protect the conversations we have. When you connect to a website over HTTPS, a green padlock lights up on your browser and lets you know that your conversation is encrypted such that it is computationally difficult for an adversary sitting between you and the website's server to see what you're talking about.</p><p>I've previously blogged about how, under certain circumstances, it is possible to <a href="/performing-preventing-ssl-stripping-a-plain-english-primer/">strip away this encryption</a> and the mitigations that websites can use to prevent this.
Unfortunately, there is a far more fundamental problem to privacy online.</p><p>As is common IT knowledge, before your browser makes an HTTP connection to a website (say, cloudflare.com), your client needs to make a DNS query to work out the IP Address where the HTTP connection should be made. The same is true for any other application-layer protocol when you connect using a hostname instead of an IP Address. For a primer on DNS, we have an article on <a href="https://www.cloudflare.com/learning/dns/what-is-dns/">the basics of DNS</a> on our Learning Centre.</p>
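    <p>That hostname-to-address step is easy to observe from any standard library; in Python, for example, the lookup that precedes an HTTP connection looks roughly like this (the hostname is just an example):</p>

```python
import socket

def resolve(hostname: str) -> list:
    """Return the addresses an HTTP client would connect to for `hostname`.
    Without DNS-over-TLS, this is the query that leaves the network in plain text."""
    infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); the address is
    # the first element of sockaddr for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})
```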
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6HlZXQ1B1RUBPioetMFMAu/2d903b592a7d53903b93a8c16db7d569/dns-lookup-diagram.png" />
            
            </figure><p>Whilst encryption technologies have been fairly long-standing for HTTP itself, only recently have such encryption techniques been standardised for DNS. Chances are, if you don't know if your DNS traffic is encrypted - it isn't.</p><p>In practice this means that when you connect to a website that uses HTTPS, even though your conversation is encrypted - someone able to intercept your connection is able to see what website you're looking for and (depending on how the site is secured) even manipulate the response to get you to communicate with a different server.</p><p>This is particularly useful for eavesdroppers; be they the network that's running the free Wi-Fi hotspot looking to sell your data to targeted advertisers or the hacker sipping on a latte whilst intercepting your network traffic (ironically dressed in a black hoodie and a balaclava).</p><p>By switching your DNS resolver to use <a href="https://1.1.1.1/">Cloudflare's DNS Resolver</a>, you get a faster browsing experience whilst ensuring that the people who run your DNS resolver aren't selling off that data to target you with ads. However, whilst Cloudflare Resolver supports both DNS-over-HTTPS and DNS-over-TLS, to make sure the connection between Cloudflare Resolver and you is encrypted, you may need to follow some additional configuration steps like enabling a <a href="https://developers.cloudflare.com/1.1.1.1/dns-over-https/cloudflared-proxy/">DNS over HTTPS client</a>.</p><p>This blog post explains how you can configure an OpenWRT router to encrypt outbound traffic to Cloudflare Resolver. This is particularly useful when you want to protect the traffic for the devices in your house which may not support encrypted DNS protocols; such as your TV or <a href="/iot-security-anti-patterns/">IoT enabled toaster</a>. Whilst local clients may still explicitly override your local DNS resolver on your router, many will default to using it.</p>
    <div>
      <h3>OpenWRT (LEDE)</h3>
      <a href="#openwrt-lede">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3LxdO125L6OSPxEmCZCgFu/6b6faeecf730ddd2d0cff6bb097d6e86/IMG-3335-1.JPG.jpeg" />
            
            </figure><p>Over the weekend, prior to writing this post, I ordered a new wireless router, the GL.iNet GL-AR750. This router has a very small form-factor, is marketed as a "Travel Router" and can act as a Wi-Fi repeater as well as a traditional Wi-Fi Router. At its longest edge, the router itself is around the length of my index finger:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2sI1sP3XQaTzuWqFRFZ6Wy/7284e9bab35195f0f3abe380ba77d606/IMG-3360.JPG.jpeg" />
            
            </figure><p>I didn't just order this specific router because of its form-factor; it also comes pre-installed with OpenWRT - an embedded Linux-based operating system that's well suited for routers. In May 2016, OpenWRT was forked as LEDE (the Linux Embedded Development Environment) and was re-merged with the OpenWRT project in January 2018.</p><p>For those of you without a router with LEDE pre-installed, you can follow along with this blog post on any other router that supports being flashed with the OpenWRT firmware; more information can be found on the <a href="https://openwrt.org/supported_devices">OpenWRT Support Devices</a> page. Though, please be aware that, depending on your device, this may carry some risk.</p>
    <div>
      <h3>Support for DNS-over-TLS (or, the lack of)</h3>
      <a href="#support-for-dns-over-tls-or-the-lack-of">
        
      </a>
    </div>
    <p>The router I'm playing with has a configuration option to set the upstream DNS resolver that it will use when a query isn't cached in its own internal resolver. This local resolver is then suggested to clients that connect to the router.</p><p>For the sake of experimentation - through the web UI, I am able to configure this router to use <code>1.1.1.1</code>, <code>1.0.0.1</code>, <code>2606:4700:4700::1111</code> and <code>2606:4700:4700::1001</code> as the upstream DNS servers (with the IPv6 addresses omitted if the network doesn't support them):</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2L5uK2C6pOpSC9EREcgcMh/0320649b7ee8b21641c9e1008e430b33/Screen-Shot-2018-04-09-at-13.15.07.png" />
            
            </figure><p>By connecting the router's <a href="https://www.cloudflare.com/learning/network-layer/what-is-a-wan/">WAN</a> port to my computer, I am able to sniff traffic as it leaves the router by using Wireshark before it goes out to the actual WAN. When a DNS query isn't in my router's cache, it is forwarded to <code>1.1.1.1</code>. As my router is sending these queries unencrypted instead of using DNS-over-TLS, I am able to see these DNS queries being sent around the internet in unencrypted form:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7MwYdpfJPivptjLrWfJSqN/24674ce3f0431057f119ef1cb6220d57/dns_unencrypted.png" />
            
            </figure><p>Although Cloudflare Resolver supports DNS-over-TLS, unfortunately my router doesn't and will simply send all queries unencrypted.</p>
    <div>
      <h3>Setting Up DNS-Over-TLS</h3>
      <a href="#setting-up-dns-over-tls">
        
      </a>
    </div>
    <p>By default, LEDE comes with Dnsmasq pre-installed as its internal resolver and therefore doesn't support DNS-over-TLS. So that we can get our requests encrypted, we're going to replace Dnsmasq with Unbound and odhcpd. I've based the steps I'm following on the very useful <a href="https://github.com/openwrt/packages/tree/master/net/unbound/files#unbound-and-odhcpd">OpenWRT Unbound package documentation</a>.</p><p>Before we can get started, we need to <a href="https://www.cloudflare.com/learning/access-management/what-is-ssh/">SSH</a> into our router; if you're prompted for a password, this will likely be identical to the one you set up for the web portal:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4jwntgIewgDBLPjv2dlCra/a2e28851b4cac59151435fee144264bc/Screen-Shot-2018-04-09-at-13.06.26.png" />
            
            </figure><p>LEDE uses <code>opkg</code> as its package manager of choice. Firstly, let's update the package list; then we install Unbound with Unbound-Control and the full version of odhcpd:</p>
            <pre><code>opkg update
opkg install unbound odhcpd unbound-control
opkg remove dnsmasq</code></pre>
            <p>Note that you can additionally install the Luci app for Unbound should you wish to control it with the standard user interface.</p>
            <pre><code>opkg install luci-app-unbound</code></pre>
            <p>As my router isn't currently running vanilla LEDE, its user interface won't be altered if I were to install this, and I haven't tested this module myself.</p><p>With Unbound in place, we can add some configuration to ensure Unbound uses <code>1.1.1.1</code>, <code>1.0.0.1</code>, <code>2606:4700:4700::1111</code> and <code>2606:4700:4700::1001</code> as the DNS resolvers with TLS encryption. I've done this by appending some configuration to <code>/etc/unbound/unbound_ext.conf</code> using Vim:</p>
            <pre><code>forward-zone:
  name: "."
  forward-addr: 1.1.1.1@853                   
  forward-addr: 1.0.0.1@853                             
  forward-addr: 2606:4700:4700::1111@853
  forward-addr: 2606:4700:4700::1001@853
  forward-ssl-upstream: yes   </code></pre>
            <p>In the Unbound configuration file at <code>/etc/config/unbound</code>, I've added some required configuration parameters as outlined in the package documentation. In my case, I backed up the configuration file and simply used the following:</p>
            <pre><code>config unbound
  option add_local_fqdn '1'
  option add_wan_fqdn '1'
  option dhcp_link 'odhcpd'
  option dhcp4_slaac6 '1'
  option domain 'lan'
  option domain_type 'static'
  option listen_port '53'
  option rebind_protection '1'
  option unbound_control '1'</code></pre>
            <p>If you do have additional parameters in the file, ensure that nothing overrides the parameters set - being especially cautious about the <code>unbound_control</code> parameter.</p><p>I've also merged the following configuration with <code>/etc/config/dhcp</code> (leaving some existing entries alone):</p>
            <pre><code>config dhcp 'lan'
        option dhcpv4 'server'
        option dhcpv6 'server'
        option interface 'lan'
        option leasetime '12h'
        option ra 'server'
        option ra_management '1'

config odhcpd 'odhcpd'
        option maindhcp '1'
        option leasefile '/var/lib/odhcpd/dhcp.leases'
        option leasetrigger '/usr/lib/unbound/odhcpd.sh'
...</code></pre>
            <p>Finally, we can enable autostart on Unbound and start it:</p>
            <pre><code>service unbound enable
service unbound start</code></pre>
            <p>Here's the proof of the pudding; when we intercept DNS queries between our router and the wider internet, we'll notice they are encrypted with TLS v1.2:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4UhGWT8fE7lL2nOPxLxaEd/145dcb85616f1cc87dc9b9c3ff29d58c/dns_encrypted.png" />
            
            </figure>
    <div>
      <h4>Conclusion</h4>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>In this blog post, we've discussed how encrypting your DNS traffic can help protect the privacy of your internet browsing. By replacing Dnsmasq with Unbound, we are able to allow OpenWRT to take advantage of DNS-over-TLS to encrypt our DNS traffic.</p>
            <category><![CDATA[1.1.1.1]]></category>
            <category><![CDATA[Privacy]]></category>
            <category><![CDATA[Encryption]]></category>
            <category><![CDATA[TLS]]></category>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[Security]]></category>
            <guid isPermaLink="false">W44zpqS8lPDlJkmzRXMAY</guid>
            <dc:creator>Junade Ali</dc:creator>
        </item>
        <item>
            <title><![CDATA[Validating Leaked Passwords with k-Anonymity]]></title>
            <link>https://blog.cloudflare.com/validating-leaked-passwords-with-k-anonymity/</link>
            <pubDate>Wed, 21 Feb 2018 19:00:44 GMT</pubDate>
            <description><![CDATA[ Today, v2 of Pwned Passwords was released as part of the Have I Been Pwned service offered by Troy Hunt. Containing over half a billion real world leaked passwords, this database provides a vital tool for correcting the course of how the industry combats modern threats against password security. ]]></description>
            <content:encoded><![CDATA[ <p>Today, <a href="https://www.troyhunt.com/ive-just-launched-pwned-passwords-version-2/">v2 of <i>Pwned Passwords</i> was released</a> as part of the <i>Have I Been Pwned</i> service offered by Troy Hunt. Containing over half a billion real-world leaked passwords, this database provides a vital tool for correcting the course of how the industry combats modern threats against password security.</p><p>I have written about how we need to rethink password security and <i>Pwned Passwords v2</i> in the following post: <a href="/how-developers-got-password-security-so-wrong/"><i>How Developers Got Password Security So Wrong</i></a>. Instead, in this post I want to discuss one of the technical contributions Cloudflare has made towards protecting user information when using this tool.</p><p>Cloudflare continues to support <i>Pwned Passwords</i> by providing <a href="https://www.cloudflare.com/learning/cdn/what-is-a-cdn/">CDN</a> and security functionality such that the data can easily be made available for download in raw form to organisations to protect their customers. Further, as part of the second iteration of this project, I have also worked with Troy on designing and implementing <a href="https://www.cloudflare.com/learning/security/api/what-is-api-endpoint/">API endpoints</a> that support anonymised <i>range queries</i>, which function as an additional, client-visible layer of security for those consuming the API.</p><p>This contribution allows <i>Pwned Passwords</i> clients to use <i>range queries</i> to search for breached passwords, without having to disclose a complete unsalted password hash to the service.</p>
    <div>
      <h3>Getting Password Security Right</h3>
      <a href="#getting-password-security-right">
        
      </a>
    </div>
    <p>Over time, the industry has realised that complex password composition rules (such as requiring a minimum number of special characters) have done little to improve user behaviour in making stronger passwords; they have done little to prevent users from putting personal information in passwords, from choosing common passwords or from reusing previously breached passwords<a href="#fn1">[1]</a>. <a href="https://www.cloudflare.com/learning/bots/what-is-credential-stuffing/">Credential Stuffing</a> has become a real threat recently; usernames and passwords are obtained from compromised websites and then injected into other websites until compromised user accounts are found.</p><p>This fundamentally works because users reuse passwords across different websites; when one set of credentials is breached on one site, it can be reused on other websites. Here are some examples of how credentials can be breached from insecure websites:</p><ul><li><p>Websites which don't use <a href="https://www.cloudflare.com/learning/bots/what-is-rate-limiting/">rate limiting</a> or challenge login requests can have a user's log-in credentials breached using brute-force attacks of common passwords for a given user,</p></li><li><p>database dumps from hacked websites can be taken offline and the password hashes can be cracked; modern GPUs make this very efficient for dictionary passwords (even with algorithms like Argon2, PBKDF2 and BCrypt),</p></li><li><p>many websites continue not to use any form of password hashing; once breached, passwords can be captured in raw form,</p></li><li><p>Proxy Attacks or hijacking a web server can allow for capturing passwords before they're hashed.</p></li></ul><p>This becomes a problem with password reuse; having obtained real-life username/password combinations, they can be injected into other websites (such as payment gateways, social networks, etc) until access is obtained to more accounts (often of a higher value than the original compromised
site).</p><p>Under <a href="https://pages.nist.gov/800-63-3/sp800-63b.html">recent NIST guidance</a>, it is a requirement, when storing or updating passwords, to ensure they do not contain values which are commonly used, expected or compromised<a href="#fn2">[2]</a>. Research has found that 88.41% of users who received a <i>fear appeal</i> later set unique passwords, whilst only 4.45% of users who did not receive a fear appeal set a unique password<a href="#fn3">[3]</a>.</p><p>Unfortunately, there are a lot of leaked passwords out there; the downloadable raw data from <i>Pwned Passwords</i> currently contains over 30 GB of password hashes.</p>
    <div>
      <h3>Anonymising Password Hashes</h3>
      <a href="#anonymising-password-hashes">
        
      </a>
    </div>
    <p>The key problem in checking passwords against the old <i>Pwned Passwords</i> API (and all similar services) lies in how passwords are checked: users are effectively required to submit unsalted hashes of their passwords to identify whether a password is breached. The hashes must be unsalted, as salting them makes them computationally difficult to search quickly.</p><p>Currently, there are two choices available for validating whether or not a password has been leaked:</p><ul><li><p>Submit the password (as an unsalted hash) to a third-party service, where the hash can potentially be stored for later cracking or analysis. For example, if a WordPress plugin makes an API call for a leaked password to a third-party API service, the IP of the request can be used to identify the WordPress installation and then breach it once the password is cracked (such as after a later disclosure); or,</p></li><li><p>download the entire list of password hashes, uncompress the dataset and run a search to see if your password hash is listed.</p></li></ul><p>Needless to say, this conflict can seem like being placed between a <a href="https://www.cloudflare.com/learning/security/how-to-improve-wordpress-security/">security-conscious rock</a> and an insecure hard place.</p>
    <div>
      <h3>The Middle Way</h3>
      <a href="#the-middle-way">
        
      </a>
    </div>
    
    <div>
      <h4>The Private Set Intersection (PSI) Problem</h4>
      <a href="#the-private-set-intersection-psi-problem">
        
      </a>
    </div>
    <p>Academic computer scientists have considered the problem of how two (or more) parties can compute the intersection of their data sets (which may be of unequal size) without either side sharing information about what it holds. Whilst this work is exciting, these techniques are unfortunately new: they haven't been subject to long-term review by the cryptography community, and their cryptographic primitives have not been implemented in any major libraries. Additionally (but critically), PSI implementations have substantially higher overhead than our <i>k</i>-Anonymity approach (particularly for communication<a href="#fn4">[4]</a>). Even the current academic state-of-the-art is not within acceptable performance bounds for an API service, with the communication overhead being equivalent to downloading the entire set of data.</p>
    <div>
      <h4>k-Anonymity</h4>
      <a href="#k-anonymity">
        
      </a>
    </div>
    <p>Instead, our approach adds an additional layer of security by utilising a mathematical property known as <i>k</i>-Anonymity and applying it to password hashes in the form of <i>range queries</i>. As such, the <i>Pwned Passwords</i> API service never gains enough information about a non-breached password hash to be able to breach it later.</p><p><i>k</i>-Anonymity is used in multiple fields to release anonymised but workable datasets; for example, so that hospitals can release patient information for medical research whilst withholding personally identifying information. Formally, a data set can be said to hold the property of <i>k</i>-Anonymity if, for every record in a released table, there are at least <code>k − 1</code> other records identical to it.</p><p>By using this property, we are able to separate hashes into anonymised "buckets". A client is able to anonymise the user-supplied hash and then download all leaked hashes in the same anonymised "bucket" as that hash, then do an offline check to see if the user-supplied hash is in that breached bucket.</p><p>In more concrete terms:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6FkAe2DhLMl2u6Dd4ZXx0P/fc76ccce5fe7570981625977f7d85ace/hash-bucket.png" />
            
            </figure><p>In essence, we turn the tables on password derivation functions; instead of seeking to salt hashes to the point at which they are unique (against identical inputs), we instead introduce ambiguity into what the client is requesting.</p><p>Given hashes are essentially fixed-length hexadecimal values, we are able to simply truncate them, instead of having to resort to a decision tree structure to filter down the data. This does mean buckets are of unequal sizes, but it allows clients to query in a single API request.</p><p>This approach can be implemented in a trivial way. Suppose a user enters the password <code>test</code> into a login form and the service they’re logging into is programmed to validate whether their password is in a database of leaked password hashes. Firstly, the client will generate a hash (in our example using SHA-1) of <code>a94a8fe5ccb19ba61c4c0873d391e987982fbbd3</code>. The client will then truncate the hash to a predetermined number of characters (for example, 5), resulting in a Hash Prefix of <code>a94a8</code>. This Hash Prefix is then used to query the remote database for all hashes starting with that prefix (for example, by making an HTTP request to <code>example.com/a94a8.txt</code>). The entire hash list is then downloaded, and each downloaded hash is compared to see if any match the locally generated hash. If so, the password is known to have been leaked.</p><p>As this can easily be implemented over HTTP, client-side caching can easily be used for performance purposes; the API is simple enough for developers to implement with little pain.</p><p>Below is a simple Bash implementation of how the <i>Pwned Passwords</i> API can be queried using <i>range queries</i> (<a href="https://gist.github.com/IcyApril/56c3fdacb3a640f37c245e5813b98b99">Gist</a>):</p>
            <pre><code>#!/bin/bash

echo -n Password:
read -s password
echo
hash="$(printf '%s' "$password" | openssl sha1 | awk '{print $NF}')"
upperCase="$(echo "$hash" | tr '[a-z]' '[A-Z]')"
prefix="${upperCase:0:5}"
response=$(curl -s https://api.pwnedpasswords.com/range/$prefix)
while read -r line; do
  lineOriginal="$prefix$line"
  if [ "${lineOriginal:0:40}" == "$upperCase" ]; then
    echo "Password breached."
    exit 1
  fi
done &lt;&lt;&lt; "$response"

echo "Password not found in breached database."
exit 0</code></pre>
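<p>The same range-query flow can be sketched in Python (a hedged sketch; the helper names are mine, but the prefix length and response format match the API described above, where each response line is a hash suffix and a breach count separated by a colon):</p>

```python
# Sketch of a k-anonymity range-query client for a Pwned Passwords-style
# API: only the first 5 characters of the SHA-1 hash ever leave the machine.
import hashlib

PREFIX_LENGTH = 5

def split_hash(password: str) -> tuple[str, str]:
    """Hash the password with SHA-1 and split the uppercase hex digest into
    the Hash Prefix (sent to the server) and the suffix (kept locally)."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:PREFIX_LENGTH], digest[PREFIX_LENGTH:]

def is_breached(suffix: str, response_text: str) -> bool:
    """Check the local suffix against a response listing one
    'SUFFIX:COUNT' pair per line for every hash sharing the prefix."""
    for line in response_text.splitlines():
        candidate, _, _count = line.partition(":")
        if candidate.strip() == suffix:
            return True
    return False
```

<p>The response itself could be fetched with, for example, <code>urllib.request.urlopen(f"https://api.pwnedpasswords.com/range/{prefix}")</code>; the comparison then happens entirely offline, exactly as in the Bash version.</p>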
            
    <div>
      <h3>Implementation</h3>
      <a href="#implementation">
        
      </a>
    </div>
    <p>Hashes (even in unsalted form) have two properties that are useful in anonymising data.</p><p>Firstly, the Avalanche Effect means that a small change in the input results in a very different hash; this means that you can't infer the contents of one hash from another, similar hash. This is true even in truncated form.</p><p>For example; the Hash Prefix <code>21BD1</code> contains 475 seemingly unrelated passwords, including:</p>
            <pre><code>lauragpe
alexguo029
BDnd9102
melobie
quvekyny</code></pre>
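<p>The Avalanche Effect is easy to demonstrate: hashing two near-identical inputs yields digests that agree in almost no positions. A small illustrative sketch (not from the original dataset):</p>

```python
# Demonstrate the avalanche effect of SHA-1: a one-character change in
# the input produces a completely unrelated digest, even when truncated.
import hashlib

def sha1_hex(s: str) -> str:
    return hashlib.sha1(s.encode("utf-8")).hexdigest().upper()

a = sha1_hex("password1")
b = sha1_hex("password2")

# Count hex positions where the two 40-character digests happen to agree;
# for unrelated hex strings we expect only around 40/16 = 2.5 matches.
matching = sum(1 for x, y in zip(a, b) if x == y)
```

<p>Even the 5-character Hash Prefixes of the two digests differ, which is why a truncated prefix leaks nothing useful about similar passwords.</p>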
            <p>Further, hashes are fairly uniformly distributed. If we were to count the original 320 million leaked passwords (in Troy's dataset) by the first hexadecimal character of the hash, the difference between the number of hashes associated with the largest and the smallest Hash Prefix is ≈ 1%. The chart below shows hash count by their first hexadecimal digit:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7IqJBkvelNZ0QVFoDEfrKY/96682f7d36961752b3f2c14bc6235f20/hashes_by_hash_prefix.png" />
            
            </figure><p>Algorithm 1 provides a simple check to determine by how much we should truncate hashes to ensure every "bucket" contains more than one hash. This requires every hash to be sorted by hexadecimal value. This algorithm, including an initial merge sort, runs in roughly <code>O(n log n + n)</code> time (worst-case):</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/rdI6HJEupI0hSGAsDvr4m/74045dd1a84842e41127e4a9d9213480/Screen-Shot-2018-02-18-at-23.37.15.png" />
            
            </figure><p>After identifying the Maximum Hash Prefix length, it is fairly easy to separate the hashes into distinct buckets, as described in Algorithm 3:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3peODDMG8GqOHrkA48VOPt/168f1da40b0b41342d571791c3789541/Screen-Shot-2018-02-18-at-23.02.02.png" />
            
            </figure><p>This implementation was originally evaluated on a dataset of over 320 million breached passwords, and we found the Maximum Prefix Length to which all hashes can be truncated, whilst maintaining the property of k-anonymity, is 5 characters. When hashes are grouped together by a Hash Prefix of 5 characters, the median number of hashes associated with a Hash Prefix is 305. With the range of response sizes for a query varying from 8.6KB to 16.8KB (with a median of 12.2KB), the dataset is usable in many practical scenarios and is certainly a reasonable response size for an API client.</p><p>On the new <i>Pwned Passwords</i> dataset (with over half a billion passwords), whilst keeping the Hash Prefix length at 5, the average number of hashes returned is 478 - with the smallest buckets holding 381 hashes (<code>E0812</code> and <code>E613D</code>) and the largest holding 584 (<code>00000</code> and <code>4A4E8</code>).</p><p>Splitting the hashes into buckets by a Hash Prefix of 5 means a maximum of 16^5 = 1,048,576 buckets would be utilised (for SHA-1), assuming that every possible Hash Prefix contains at least one hash. In the datasets we found this to be the case: the number of distinct Hash Prefix values was equal to the highest possible number of buckets. Whilst for secure hashing algorithms it is computationally infeasible to invert the hash function, it is worth noting that as a SHA-1 hash is 40 hexadecimal characters long and 5 characters are utilised by the Hash Prefix, the total number of possible hashes associated with a Hash Prefix is 16^35 ≈ 1.39E42.</p>
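<p>The check performed by Algorithm 1 can be sketched as follows (a hedged sketch with hypothetical names, not the post's exact implementation): for increasing truncation lengths, group the hashes into buckets by prefix and stop as soon as any bucket falls below <code>k = 2</code> entries; since longer prefixes only split buckets further, the first failing length bounds the answer.</p>

```python
# Sketch of Algorithm 1: find the longest Hash Prefix length at which
# every occupied bucket still holds at least k hashes (k-anonymity).
from collections import Counter

def max_prefix_length(hashes: list[str], k: int = 2) -> int:
    best = 0
    for length in range(1, len(hashes[0]) + 1):
        # Bucket the hashes by their first `length` hex characters.
        buckets = Counter(h[:length] for h in hashes)
        if min(buckets.values()) >= k:
            best = length
        else:
            break  # longer prefixes can only split buckets further
    return best
```

<p>Run on the full sorted dataset, this is the check that yields the Maximum Prefix Length of 5 described above.</p>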
    <div>
      <h3>Important Caveats</h3>
      <a href="#important-caveats">
        
      </a>
    </div>
    <p>It is important to note that where a user's password is already breached, an API call for a specific range of breached passwords can reduce the search candidates used in a <a href="https://www.cloudflare.com/learning/bots/brute-force-attack/">brute-force attack</a>. Whilst users with already-breached passwords are vulnerable to brute-force attacks regardless, searching for a specific range can reduce the number of search candidates - although the API service has no way of determining whether or not the client was searching for a password that was breached. Using a deterministic algorithm to run queries for other Hash Prefixes can help reduce this risk.</p><p>One reason this is important is that this implementation does not currently guarantee <i>l</i>-diversity, meaning a bucket may contain a hash which is in substantially higher use than others. In the future we hope to use percentile-based usage information from the original breached data to better guarantee this property.</p><p>For general users, <i>Pwned Passwords</i> is usually exposed via a web interface, which uses a JavaScript client to run this process; if the origin web server were hijacked to change the JavaScript being returned, this computation could be removed (and the password sent to the hijacked origin server). Whilst JavaScript is somewhat transparent to the client (a developer can inspect it), this should not be depended upon; for technical users, non-web clients are preferable.</p><p>The original use-case for this service was to be deployed privately in a Cloudflare data centre where our services can use it to enhance user security, and use <i>range queries</i> to complement the existing transport security used. 
Depending on your risks, it may be safer to deploy this service yourself (in your own data centre) and use the <i>k</i>-anonymity approach to validate passwords for services which do not themselves have the resources to store an entire database of leaked password hashes.</p><p>I would strongly recommend against storing the <i>range queries</i> made by users of your service; if you must for whatever reason, store them only as aggregate analytics such that they cannot be linked back to any given user's password.</p>
    <div>
      <h3>Final Thoughts</h3>
      <a href="#final-thoughts">
        
      </a>
    </div>
    <p>Going forward, as we test this technology more, Cloudflare is looking into how we can use a private deployment of this service to better offer security functionality, both for log-in requests to our dashboard and for customers who want to protect against credential stuffing on their own websites using our edge network. We also seek to consider how we can incorporate recent work on the Private Set Intersection Problem alongside considering <i>l</i>-diversity for additional security guarantees. As always, we'll keep you updated right here, on our blog.</p><hr /><ol><li><p>Campbell, J., Ma, W. and Kleeman, D., 2011. Impact of restrictive composition policy on user password choices. Behaviour &amp; Information Technology, 30(3), pp.379-388. <a href="#fnref1">↩︎</a></p></li><li><p>Grassi, P. A., Fenton, J. L., Newton, E. M., Perlner, R. A., Regenscheid, A. R., Burr, W. E., Richer, J. P., Lefkovitz, N. B., Danker, J. M., Choong, Y.-Y., Greene, K. K., and Theofanos, M. F. (2017). NIST Special Publication 800-63B Digital Identity Guidelines, chapter Authentication and Lifecycle Management. National Institute of Standards and Technology, U.S. Department of Commerce. <a href="#fnref2">↩︎</a></p></li><li><p>Jenkins, Jeffrey L., Mark Grimes, Jeffrey Gainer Proudfoot, and Paul Benjamin Lowry. "Improving password cybersecurity through inexpensive and minimally invasive means: Detecting and deterring password reuse through keystroke-dynamics monitoring and just-in-time fear appeals." Information Technology for Development 20, no. 2 (2014): 196-213. <a href="#fnref3">↩︎</a></p></li><li><p>De Cristofaro, E., Gasti, P. and Tsudik, G., 2012, December. Fast and private computation of cardinality of set intersection and union. In International Conference on Cryptology and Network Security (pp. 218-231). Springer, Berlin, Heidelberg. <a href="#fnref4">↩︎</a></p></li></ol> ]]></content:encoded>
            <category><![CDATA[Passwords]]></category>
            <category><![CDATA[Best Practices]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Salt]]></category>
            <guid isPermaLink="false">4k6ry6xuTMJwpexgEzV2hB</guid>
            <dc:creator>Junade Ali</dc:creator>
        </item>
        <item>
            <title><![CDATA[How Developers got Password Security so Wrong]]></title>
            <link>https://blog.cloudflare.com/how-developers-got-password-security-so-wrong/</link>
            <pubDate>Wed, 21 Feb 2018 19:00:11 GMT</pubDate>
            <description><![CDATA[ Both in our real lives, and online, there are times where we need to authenticate ourselves - where we need to confirm we are who we say we are. This can be done using three things. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Both in our real lives, and online, there are times where we need to authenticate ourselves - where we need to confirm we are who we say we are. This can be done using three things:</p><ul><li><p>Something you <i>know</i></p></li><li><p>Something you <i>have</i></p></li><li><p>Something you <i>are</i></p></li></ul><p>Passwords are an example of something you <i>know</i>; they were introduced in 1961 for computer authentication for a time-share computer in MIT. Shortly afterwards, a PhD researcher breached this system (by being able to simply download a list of unencrypted passwords) and used the time allocated to others on the computer.</p><p>As time has gone on; developers have continued to store passwords insecurely, and users have continued to set them weakly. Despite this, no viable alternative has been created for password security. To date, no system has been created that retains all the benefits that passwords offer as researchers have rarely considered real world constraints<a href="#fn1">[1]</a>. For example; when using fingerprints for authentication, engineers often forget that there is a sizable percentage of the population that do not have usable fingerprints or hardware upgrade costs.</p>
    <div>
      <h3>Cracking Passwords</h3>
      <a href="#cracking-passwords">
        
      </a>
    </div>
    <p>In the 1970s, people started thinking about how to better store passwords and cryptographic hashing started to emerge.</p><p>Cryptographic hashes work like trapdoors; whilst it's easy to hash a password, it's far harder to turn that "hash" back into the original input (computationally difficult for an ideal hashing algorithm). They are used in many things, from speeding up file searches to the One Time Password generators used by banks.</p><p>Passwords should ideally use specialised hashing functions like Argon2, BCrypt or PBKDF2, which are modified to prevent Rainbow Table attacks.</p><p>If you were to hash the password <code>p4$$w0rd</code> using the SHA-1 hashing algorithm, the output would be <code>6c067b3288c1b5c791afa04e12fb013ed2e84d10</code>. This output is the same every time the algorithm is run. As a result, attackers are able to create Rainbow Tables which contain the hashes of common passwords, and this information is then used to break password hashes (where the password and hash are listed in a Rainbow Table).</p><p>Algorithms like BCrypt essentially salt passwords using a random string before they hash them. This random string is stored alongside the password hash and is used to help make the password harder to crack by making the output unique. The hashing process is repeated many times (defined by a difficulty variable), each time adding the random salt onto the output of the hash and rerunning the hash computation.</p><p>For example, the BCrypt hash <code>$2a$10$N9qo8uLOickgx2ZMRZoMyeIjZAgcfl7p92ldGxad68LJZdL17lhWy</code> starts with <code>$2a$10$</code>, which indicates the algorithm used is BCrypt; it contains a random salt of <code>N9qo8uLOickgx2ZMRZoMye</code> and a resulting hash of <code>IjZAgcfl7p92ldGxad68LJZdL17lhWy</code>. 
Storing the salt allows the password hash to be regenerated identically when the input is known.</p><p>Unfortunately, salting is no longer enough; passwords can be cracked ever more quickly using modern GPUs (which specialise in doing the same task over and over). When a site suffers a security breach, users' passwords can be taken offline in database dumps and cracked offline.</p><p>Additionally, websites that fail to rate limit login requests or use captchas can be targeted by brute-force attacks. For a given user, an attacker will repeatedly try different (but common) passwords until they gain access to that user's account.</p><p>Sometimes sites will lock users out after a handful of failed login attempts; attacks can instead be targeted to move on quickly to a new account after the most common set of passwords has been attempted. Lists like the following (in some cases with many, many more passwords) can be attempted to breach an account:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5uMRyoSntS2dkOn7C9Zc6v/c882908f5dfc1b8b5c257e8583f4b27a/common-weak-passwords.png" />
            
            </figure><p>The industry has tried to combat this problem with password composition rules: requiring users to comply with complex rules (such as a minimum number of digits or punctuation symbols) before setting passwords. Research has shown that this hasn't helped combat the problem of password reuse, weak passwords or users putting personal information in passwords.</p>
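<p>The salting-and-stretching pattern described above can be sketched with PBKDF2 (one of the specialised functions named earlier; this is an illustrative sketch, not BCrypt itself): a random per-password salt is stored alongside the derived key, and the iteration count plays the role of the work factor.</p>

```python
# Sketch of salted, iterated password hashing using PBKDF2-HMAC-SHA256.
import hashlib
import hmac
import os

def hash_password(password: str, iterations: int = 600_000) -> tuple[bytes, bytes]:
    """Return (salt, derived_key); both must be stored to verify later."""
    salt = os.urandom(16)  # random per-password salt defeats Rainbow Tables
    key = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, key

def verify_password(password: str, salt: bytes, key: bytes,
                    iterations: int = 600_000) -> bool:
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"),
                                    salt, iterations)
    # Constant-time comparison avoids leaking information via timing.
    return hmac.compare_digest(candidate, key)
```

<p>Raising the iteration count, like raising BCrypt's work factor, slows down exactly the repeated-guess workloads that GPUs excel at.</p>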
    <div>
      <h4>Credential Stuffing</h4>
      <a href="#credential-stuffing">
        
      </a>
    </div>
    <p>Whilst it may seem that this is only a problem for websites that store passwords weakly, Credential Stuffing makes it even worse.</p><p>It is common for users to reuse passwords from site to site, meaning a username and password from a compromised website can be used to breach far more important accounts - like online banking gateways or government logins. When a password is reused, it takes just one website being breached to gain access to the others that a user has credentials for.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6l6Zm2PWsT6cXcXHpPevKX/87148160a2ed6d72bfd6288e156ecd28/this-is-not-fine-009-a7b6e6.png" />
            
            </figure><p> <a href="https://thenib.com/this-is-not-fine">This Is Not Fine - The Nib</a></p>
    <div>
      <h3>Fixing Passwords</h3>
      <a href="#fixing-passwords">
        
      </a>
    </div>
    <p>There are fundamentally three things that need to be done to fix this problem:</p><ul><li><p>Good UX to improve user decisions</p></li><li><p>Improved developer education</p></li><li><p>Eliminating reuse of breached passwords</p></li></ul>
    <div>
      <h4>How Can I Secure Myself (or my Users)?</h4>
      <a href="#how-can-i-secure-myself-or-my-users">
        
      </a>
    </div>
    <p>Before discussing the things we're doing, I wanted to briefly discuss what you can do to help protect yourself now. For most users, there are three steps you can immediately take to help yourself.</p><p>Use a Password Manager (like 1Password or LastPass) to set random, unique passwords for every site. Additionally, look to enable Two-Factor Authentication where possible; this uses something you <i>have</i>, in addition to the password you <i>know</i>, to validate you. This will mean, alongside your password, you have to enter a short-lived password from a device like your phone before being able to log in to any site.</p><p>Two-Factor Authentication is supported on many of the world's most popular social media, banking and shopping sites. You can find out how to enable it on popular websites at <a href="https://www.turnon2fa.com/tutorials/">turnon2fa.com</a>. If you are a developer, you should make the effort to ensure you support Two-Factor Authentication.</p><p>Set a secure memorable password for your password manager; and yes, turn on Two-Factor Authentication for it (and keep your backup codes safe). You can find additional security tips (including tips on how to create a secure main password) in my blog post: <a href="/cyber-security-advice-for-your-parents/">Simple Cyber Security Tips</a>.</p><p>Developers should look to abolish bad-practice composition rules (and simplify them as much as possible). Password expiration policies do more harm than good, so seek to do away with them. 
For further information refer to the blog post by the UK's National Cyber Security Centre: <a href="https://www.ncsc.gov.uk/articles/problems-forcing-regular-password-expiry">The problems with forcing regular password expiry</a>.</p><p>Finally, Troy Hunt has an excellent blog post on passwords for users and developers alike: <a href="https://www.troyhunt.com/passwords-evolved-authentication-guidance-for-the-modern-era/">Passwords Evolved: Authentication Guidance for the Modern Era</a>.</p>
    <div>
      <h4>Improving Developer Education</h4>
      <a href="#improving-developer-education">
        
      </a>
    </div>
    <p>Developers should seek to build a culture of security in the organisations where they work; try and talk about security, talk about the benefits of challenging malicious login requests and talk about password hashing in simple terms.</p><p>If you're working on an open-source project that handles authentication, expose easy password hashing APIs - for example the <code>password_hash</code>, <code>password_needs_rehash</code> &amp; <code>password_verify</code> functions in modern PHP versions.</p>
    <div>
      <h4>Eliminating Password Reuse</h4>
      <a href="#eliminating-password-reuse">
        
      </a>
    </div>
    <p>We know that complex password composition rules are largely ineffective, and recent guidance has followed suit. A better alternative to composition rules is to block users from signing up with passwords which are known to have been breached. Under <a href="https://pages.nist.gov/800-63-3/sp800-63b.html">recent NIST guidance</a>, it is a requirement, when storing or updating passwords, to ensure they do not contain values which are commonly used, expected or compromised<a href="#fn2">[2]</a>.</p><p>This is easier said than done; the recent version of Troy Hunt's <i>Pwned Passwords</i> database contains over half a billion passwords (over 30 GB uncompressed). Whilst developers can use API services to check if a password is reused, this requires sending either the raw password or the password as an unsalted hash. This can be especially problematic when multiple services handle authentication in a business, and each has to store a large quantity of passwords.</p><p>This is a problem I've started looking into recently; as part of our contribution to Troy Hunt's <i>Pwned Passwords</i> database, I have designed a <i>range search</i> API that allows developers to check if a password is reused without needing to share the password (even in hashed form) - instead only needing to send a short segment of the cryptographic hash used. You can find more information on this contribution in the post: <a href="/validating-leaked-passwords-with-k-anonymity/">Validating Leaked Passwords with k-Anonymity</a>.</p><p>Version 2 of <i>Pwned Passwords</i> is now available - you can find more information on how it works on Troy Hunt's blog post "<a href="https://www.troyhunt.com/ive-just-launched-pwned-passwords-version-2/">I've Just Launched Pwned Passwords, Version 2</a>".</p><hr /><ol><li><p>Bonneau, J., Herley, C., Van Oorschot, P.C. and Stajano, F., 2012, May. The quest to replace passwords: A framework for comparative evaluation of web authentication schemes. 
In Security and Privacy (SP), 2012 IEEE Symposium on (pp. 553-567). IEEE. <a href="#fnref1">↩︎</a></p></li><li><p>Grassi, P. A., Fenton, J. L., Newton, E. M., Perlner, R. A., Regenscheid, A. R., Burr, W. E., Richer, J. P., Lefkovitz, N. B., Danker, J. M., Choong, Y.-Y., Greene, K. K., and Theofanos, M. F. (2017). NIST Special Publication 800-63B Digital Identity Guidelines, chapter Authentication and Lifecycle Management. National Institute of Standards and Technology, U.S. Department of Commerce. <a href="#fnref2">↩︎</a></p></li></ol> ]]></content:encoded>
            <category><![CDATA[Passwords]]></category>
            <category><![CDATA[Best Practices]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Salt]]></category>
            <guid isPermaLink="false">6lsrdeZvgI7O6CidQWI49j</guid>
            <dc:creator>Junade Ali</dc:creator>
        </item>
        <item>
            <title><![CDATA[Concise (Post-Christmas) Cryptography Challenges]]></title>
            <link>https://blog.cloudflare.com/concise-post-christmas-cryptographic-challenges/</link>
            <pubDate>Tue, 26 Dec 2017 16:53:14 GMT</pubDate>
            <description><![CDATA[ It's the day after Christmas; or, depending on your geography, Boxing Day. With the festivities over, you may still find yourself stuck at home and somewhat bored.

 ]]></description>
            <content:encoded><![CDATA[ <p>It's the day after Christmas; or, depending on your geography, Boxing Day. With the festivities over, you may still find yourself stuck at home and somewhat bored.</p><p>Either way, here are three relatively short cryptography challenges you can use to keep yourself momentarily occupied. Other than the hints (and some internet searching), you shouldn't require particularly deep knowledge of cryptography to start diving into these challenges. For hints and spoilers, scroll down below the challenges!</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1NDEf7FkrvdEvYwcaz3lF0/76ff5e219e31a49f8b98266413de167e/Family_Double_Dare_spaghetti_challenge-1.jpg" />
            
            </figure>
    <div>
      <h3>Challenges</h3>
      <a href="#challenges">
        
      </a>
    </div>
    
    <div>
      <h4>Password Hashing</h4>
      <a href="#password-hashing">
        
      </a>
    </div>
    <p>The first one is simple enough to explain; here are 5 hashes (from user passwords): crack them.</p>
            <pre><code>$2y$10$TYau45etgP4173/zx1usm.uO34TXAld/8e0/jKC5b0jHCqs/MZGBi
$2y$10$qQVWugep3jGmh4ZHuHqw8exczy4t8BZ/Jy6H4vnbRiXw.BGwQUrHu
$2y$10$DuZ0T/Qieif009SdR5HD5OOiFl/WJaDyCDB/ztWIM.1koiDJrN5eu
$2y$10$0ClJ1I7LQxMNva/NwRa5L.4ly3EHB8eFR5CckXpgRRKAQHXvEL5oS
$2y$10$LIWMJJgX.Ti9DYrYiaotHuqi34eZ2axl8/i1Cd68GYsYAG02Icwve</code></pre>
            
    <div>
      <h4>HTTP Strict Transport Security</h4>
      <a href="#http-strict-transport-security">
        
      </a>
    </div>
    <p>A website works by redirecting its <code>www.</code> subdomain to a regional subdomain (e.g. <code>uk.</code>); the site uses HSTS to prevent SSLStrip attacks. Below you can see cURL requests showing the headers from the redirects; how would you practically go about stripping HTTPS in this example?</p>
            <pre><code>$ curl -i http://www.example.com
HTTP/1.1 302 Moved Temporarily
Server: nginx
Date: Tue, 26 Dec 2017 12:26:51 GMT
Content-Type: text/html
Transfer-Encoding: chunked
Connection: keep-alive
location: https://uk.example.com/</code></pre>
            
            <pre><code>$ curl -i http://uk.example.com
HTTP/1.1 200 OK
Server: nginx
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
Keep-Alive: timeout=5
Vary: Accept-Encoding
Cache-Control: no-cache
Date: Tue, 26 Dec 2017 12:26:53 GMT
Strict-Transport-Security: max-age=31536000
...</code></pre>
            
    <div>
      <h4>AES-256</h4>
      <a href="#aes-256">
        
      </a>
    </div>
    <p>The images below are encrypted using AES-256 in CTR mode. Can you work out what the original images show?</p><p><b>1_enc.bmp</b></p>
            <figure>
            
            <img src="https://assets.ctfassets.net/zkvhlag99gkb/5HyLz76gSjc1mYgtc2pvvo/def3a370ca49350a4cb7caf2112350b8/1_enc.bmp.bmp" />
            
            </figure><p><b>2_enc.bmp</b></p>
            <figure>
            
            <img src="https://assets.ctfassets.net/zkvhlag99gkb/7MR7bRAZ9SyFRYmnaB40ym/2f8e2e22e7d53c675a0c32d1fc9d26a7/2_enc.bmp.bmp" />
            
            </figure>
    <div>
      <h3>Hints</h3>
      <a href="#hints">
        
      </a>
    </div>
    
    <div>
      <h4>Password Hashing</h4>
      <a href="#password-hashing">
        
      </a>
    </div>
    <p>What kind of hash algorithm has been used here? Given the input is from humans, how would you go about cracking it efficiently?</p>
    <div>
      <h4>HTTP Strict Transport Security</h4>
      <a href="#http-strict-transport-security">
        
      </a>
    </div>
    <p>All the information you could possibly need on this topic is in the following blog post: <a href="/performing-preventing-ssl-stripping-a-plain-english-primer/">Performing &amp; Preventing SSL Stripping: A Plain-English Primer</a>.</p>
    <div>
      <h4>AES-256</h4>
      <a href="#aes-256">
        
      </a>
    </div>
    <p>The original images, <code>1.bmp</code> and <code>2.bmp</code>, were encrypted using a command similar to the one below; the only thing that changed between the two runs was the file names:</p>
            <pre><code>openssl enc -aes-256-ctr -e -in 1.bmp -out 1_enc.bmp -k SomePassword -iv 0010011 -nosalt</code></pre>
            <p>Another hint: both encrypted images come from the same source image; each contains a different layer of the original.</p>
    <div>
      <h3>Solutions (Spoiler Alert!)</h3>
      <a href="#solutions-spoiler-alert">
        
      </a>
    </div>
    
    <div>
      <h4>Password Hashing</h4>
      <a href="#password-hashing">
        
      </a>
    </div>
    <p>The hashes start with a <code>$2y$</code> identifier, which indicates they were created using BCrypt (a <code>$2*$</code> prefix usually indicates BCrypt). The hashes also indicate they were generated using a somewhat decent work factor of 10.</p><p>Despite the more recent arrival of the Argon2 key derivation function, BCrypt is still generally considered secure. Although it's infeasible to brute-force BCrypt itself, users have likely not been as security-conscious when setting their passwords.</p><p>Let's try using a password dictionary to crack these hashes. I'll start with a small dictionary of <a href="https://github.com/danielmiessler/SecLists/blob/master/Passwords/10_million_password_list_top_10000.txt">about 10,000 passwords</a>.</p><p>Using this password list, we can crack these hashes (the mode number <code>3200</code> represents BCrypt):</p>
            <pre><code>hashcat -a 0 -m 3200 hashes.txt ~/Downloads/10_million_password_list_top_10000.txt --force</code></pre>
            <p>Even running on my laptop, it takes only 90 seconds to crack these hashes (the output below is in the format <code>hash:plaintext</code>):</p>
            <pre><code>$2y$10$DuZ0T/Qieif009SdR5HD5OOiFl/WJaDyCDB/ztWIM.1koiDJrN5eu:password1
$2y$10$TYau45etgP4173/zx1usm.uO34TXAld/8e0/jKC5b0jHCqs/MZGBi:password
$2y$10$qQVWugep3jGmh4ZHuHqw8exczy4t8BZ/Jy6H4vnbRiXw.BGwQUrHu:hotdog
$2y$10$0ClJ1I7LQxMNva/NwRa5L.4ly3EHB8eFR5CckXpgRRKAQHXvEL5oS:88888888
$2y$10$LIWMJJgX.Ti9DYrYiaotHuqi34eZ2axl8/i1Cd68GYsYAG02Icwve:hello123</code></pre>
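            <p>As an illustration of what those hash strings encode, here is a minimal Python sketch (stdlib only; the function name is my own) that splits one of the cracked hashes into its modular-crypt-format fields:</p>

```python
def parse_bcrypt(hash_str):
    """Split a bcrypt hash in modular crypt format into its fields.

    Layout: $<version>$<cost>$<22-char salt><31-char digest>
    The bcrypt base64 alphabet never contains '$', so a plain split works.
    """
    _, version, cost, rest = hash_str.split("$")
    return {
        "version": version,   # 2a/2b/2y all indicate bcrypt
        "cost": int(cost),    # work factor: 2**cost key-expansion rounds
        "salt": rest[:22],    # 128-bit salt
        "digest": rest[22:],  # 184-bit digest
    }

fields = parse_bcrypt(
    "$2y$10$DuZ0T/Qieif009SdR5HD5O"
    "OiFl/WJaDyCDB/ztWIM.1koiDJrN5eu"
)
print(fields["version"], fields["cost"])  # 2y 10
```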
            
    <div>
      <h4>HTTP Strict Transport Security</h4>
      <a href="#http-strict-transport-security">
        
      </a>
    </div>
    <p>The first redirect is performed over plain-text HTTP and doesn't have HSTS enabled, but the redirect goes to a subdomain that does have HSTS enabled.</p><p>If we were to try stripping away the HTTPS in the redirect (i.e. forcibly changing HTTPS to HTTP in an on-path attack), we wouldn't have much luck, due to HSTS being enabled on the subdomain we're redirecting to.</p><p>Instead, we need to rewrite the <code>uk.</code> subdomain to point to a subdomain which doesn't have HSTS enabled - <code>uk_secure.</code>, for example. In the worst case, we can simply redirect to an entirely phoney domain. You'll need some degree of DNS interception to do this.</p><p>From the phoney subdomain or domain, you can then proxy traffic back to the original domain - all the while acting as an on-path attacker for any private information crossing the wire.</p>
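    <p>To make the rewrite concrete, here is a deliberately minimal Python sketch of the header manipulation described above. The decoy hostname is hypothetical, and a real attack would also need the DNS-interception and proxying pieces; this only shows the <code>Location</code> rewrite itself:</p>

```python
from urllib.parse import urlsplit, urlunsplit

def strip_redirect(location, hsts_hosts, decoy):
    """Rewrite an https:// redirect target to a lookalike, non-HSTS host.

    If the Location header points at a host known to send HSTS, an
    on-path attacker can't simply downgrade it to http:// - the browser
    would refuse. Instead, the redirect is pointed at a decoy host the
    attacker controls, which proxies traffic onward over plain HTTP.
    """
    parts = urlsplit(location)
    if parts.scheme == "https" and parts.hostname in hsts_hosts:
        parts = parts._replace(scheme="http", netloc=decoy)
    return urlunsplit(parts)

# The attacker intercepts the 302 from www.example.com and rewrites it:
rewritten = strip_redirect(
    "https://uk.example.com/",
    hsts_hosts={"uk.example.com"},
    decoy="uk-secure.example.net",  # hypothetical attacker-run host
)
print(rewritten)  # http://uk-secure.example.net/
```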
    <div>
      <h4>AES-256</h4>
      <a href="#aes-256">
        
      </a>
    </div>
    
    <div>
      <h5>Generating Encrypted Files</h5>
      <a href="#generating-encrypted-files">
        
      </a>
    </div>
    <p>Before we learn to crack this puzzle, let me explain how I set things up.</p><p>The original image looked like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6ivxdjSAOH3h4ZoYXzwIne/100bf91893370426e79670fd78c4ac79/cf-logo-v-rgb.jpg" />
            
            </figure><p>Prior to encryption, I split this image into two distinct parts.</p><p><b>1.bmp</b></p>
            <figure>
            
            <img src="https://assets.ctfassets.net/zkvhlag99gkb/61AybCHMoZqcN2Bkgsg7aK/a2337c143b159d49a410467ddb5f71f1/1.bmp.bmp" />
            
            </figure><p><b>2.bmp</b></p>
            <figure>
            
            <img src="https://assets.ctfassets.net/zkvhlag99gkb/5U4x5ZtneWOzJFDndGszk3/1f0783b5b46969e0008e0de2db896d3a/2.bmp.bmp" />
            
            </figure><p>The first image was encrypted with the following command:</p>
            <pre><code>openssl enc -aes-256-ctr -e -in 1.bmp -out 1_enc.bmp -k hunter2#security -iv 0010011 -nosalt</code></pre>
            <p>Beyond the choice of cipher, there are two important options here:</p><ul><li><p><b>-iv</b> - We require OpenSSL to use a specific nonce instead of a dynamically generated one</p></li><li><p><b>-nosalt</b> - Encryption keys in OpenSSL are normally salted prior to hashing; this option prevents that from happening</p></li></ul><p>Using a hex editor, I added the headers to ensure the encrypted files were correctly rendered as BMP files. This resulted in the files you saw in the challenge above.</p>
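    <p>The consequence of <code>-nosalt</code> is worth spelling out: with no salt, key derivation becomes fully deterministic. A sketch of <code>EVP_BytesToKey</code> as OpenSSL 1.0.x-era <code>enc</code> used it (a single MD5 round; note that newer OpenSSL versions default to SHA-256 for this derivation):</p>

```python
import hashlib

def evp_bytes_to_key_md5(password: bytes, key_len: int = 32) -> bytes:
    """EVP_BytesToKey with MD5, one round, empty salt (-nosalt).

    D_1 = MD5(password); D_i = MD5(D_{i-1} || password); the key is the
    concatenation of the D_i truncated to key_len bytes.
    """
    derived = b""
    block = b""
    while len(derived) < key_len:
        block = hashlib.md5(block + password).digest()
        derived += block
    return derived[:key_len]

k1 = evp_bytes_to_key_md5(b"SomePassword")
k2 = evp_bytes_to_key_md5(b"SomePassword")
# With no salt, the same password yields the identical 256-bit key on
# every run - and therefore the identical CTR keystream.
assert k1 == k2 and len(k1) == 32
```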
    <div>
      <h5>Reversing the Encryption</h5>
      <a href="#reversing-the-encryption">
        
      </a>
    </div>
    <p>The critical mistake above is that the encryption key and the nonce are reused: both ciphertexts were generated from identical nonces using identical keys (unsalted before they were hashed).</p><p>Additionally, AES is used in CTR mode, which, when a keystream is reused like this, is vulnerable to the “two-time pad” attack.</p><p>We first run an XOR operation on the two files:</p>
            <pre><code>$ git clone https://github.com/scangeo/xor-files.git
$ cd xor-files/source
$ gcc xor-files.c -o xor-files
$ ./xor-files /some/place/1_enc.bmp /some/place/2_enc.bmp &gt; /some/place/xor.bmp </code></pre>
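            <p>The property this step exploits can be sketched in a few lines of Python: when the same CTR keystream encrypts two plaintexts, XORing the ciphertexts cancels the keystream entirely and leaves only the XOR of the plaintexts. (The byte strings below are toy stand-ins for the image data.)</p>

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

# Toy keystream reuse: the same "keystream" encrypts both "images".
keystream = bytes(range(16))
p1 = b"layer-one-pixels"
p2 = b"layer-two-pixels"
c1 = xor_bytes(p1, keystream)
c2 = xor_bytes(p2, keystream)

# c1 XOR c2 == p1 XOR p2: the keystream cancels out, no key needed.
assert xor_bytes(c1, c2) == xor_bytes(p1, p2)
```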
            <p>The next step is to add BMP image headers such that we can display the image:</p>
            <figure>
            
            <img src="https://assets.ctfassets.net/zkvhlag99gkb/5E2BDRSvJzAvuK7oU6iENY/21061c2147d3a97447e4575e6ace9d96/xor.bmp.bmp" />
            
            </figure><p>The resulting image is then inverted (using ImageMagick):</p>
            <pre><code>convert xor.bmp -negate xor_invert.bmp</code></pre>
            <p>We then obtain the original input:</p>
            <figure>
            
            <img src="https://assets.ctfassets.net/zkvhlag99gkb/5penOP4SB5bhOcCEvbrGib/d47e5a2969a84a890c0aeb83e765ce4d/xor_invert.bmp.bmp" />
            
            </figure>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>If you're interested in debugging cybersecurity challenges on a network that sees ~10% of global internet traffic, we're hiring for <a href="https://www.cloudflare.com/careers/jobs/?department=Engineering">security &amp; cryptography engineers</a> and <a href="https://www.cloudflare.com/careers/departments/customer-support/">Support Engineers</a> globally.</p><p>We hope you've had a relaxing holiday season, and wish you a very happy new year!</p> ]]></content:encoded>
            <category><![CDATA[Community]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Cryptography]]></category>
            <category><![CDATA[Research]]></category>
            <guid isPermaLink="false">5yBqjMTbaeWCe6t8w4HGWZ</guid>
            <dc:creator>Junade Ali</dc:creator>
        </item>
        <item>
            <title><![CDATA[Simple Cyber Security Tips (for your Parents)]]></title>
            <link>https://blog.cloudflare.com/cyber-security-advice-for-your-parents/</link>
            <pubDate>Mon, 25 Dec 2017 15:32:37 GMT</pubDate>
            <description><![CDATA[ Today, December 25th, Cloudflare offices around the world are taking a break. From San Francisco to London and Singapore; engineers have retreated home for the holidays (albeit with those engineers on-call closely monitoring their mobile phones). ]]></description>
            <content:encoded><![CDATA[ <p>Today, December 25th, Cloudflare offices around the world are taking a break. From San Francisco to London and Singapore, engineers have retreated home for the holidays (albeit with those engineers on-call closely monitoring their mobile phones).</p><blockquote><p>Software engineering pro-tip:</p><p>Do not, I repeat, do not deploy this week. That is how you end up debugging a critical issue from your parent's wifi in your old bedroom while your spouse hates you for abandoning them with your racist uncle.</p><p>— Chris Albon (@chrisalbon) <a href="https://twitter.com/chrisalbon/status/943342608742604801?ref_src=twsrc%5Etfw">December 20, 2017</a></p></blockquote><p>Whilst our Support and SRE teams operated on a schedule to ensure fingers were on keyboards, on Saturday I headed out of London, bound for the Warwickshire countryside. Away from the barracks of the London tech scene, it didn't take long for the following conversation to happen:</p><ul><li><p>Family member: "So what do you do nowadays?"</p></li><li><p>Me: "I work in Cyber Security."</p></li><li><p>Family member: "There seems to be a new cyber attack every day on the news! What can I possibly do to keep myself safe?"</p></li></ul><p>If you work in the tech industry, you may find a family member asking you for advice on <a href="https://www.cloudflare.com/learning/security/what-is-cyber-security/">cybersecurity</a>. This blog post will hopefully save you from stuttering whilst trying to formulate advice (like I did).</p>
    <div>
      <h3>The Basics</h3>
      <a href="#the-basics">
        
      </a>
    </div>
    <p>The WannaCry Ransomware Attack was one of the most high-profile cyberattacks of 2017. In essence, ransomware works by infecting a computer, then encrypting files - preventing users from being able to access them. Users then see a window on their screen demanding payment with the promise of decrypting their files. Multiple copycat viruses also sprang up, using the same exploit as WannaCry.</p><p>It is worth noting that even after paying, you're unlikely to get your files back (don't expect an honest transaction from criminals).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4lRcjVSkC9AzNgSL3acz7W/9f9db36a007f34dfd1a6e8dc360a62f4/Wana_Decrypt0r_screenshot.png" />
            
            </figure><p>WannaCry was estimated to have infected over 300,000 computers around the world, including high-profile government agencies and corporations; the UK's National Health Service was one notable victim.</p><p>Despite the wide-ranging impact of this attack, a lot of victims could have protected themselves fairly easily. Security patches had already been available to fix the bug that allowed this attack to happen, and installing anti-virus software could have contained the spread of this ransomware.</p><p>For consumers, it is generally a good idea to install updates, particularly security updates. Platforms like Windows XP no longer receive security updates and therefore shouldn't be used, even with every previously released patch installed.</p><p>Of course, it is also essential to back up your most indispensable files, and not just because of the damage security vulnerabilities can do.</p>
    <div>
      <h3>Don't put your eggs in one Basket</h3>
      <a href="#dont-put-your-eggs-in-one-basket">
        
      </a>
    </div>
    <p>It may not be Easter, but you certainly should not be putting all your eggs in one basket. For this reason, it is not a good idea to use the same password across multiple sites.</p><p>Passwords have been around since 1961; however, no alternative has been found which keeps all their benefits. Users continue to set passwords weakly, and website developers continue to store them insecurely.</p><p>When developers store passwords, they should do so in a way that lets them check a password is correct without ever being able to recover the original password. Unfortunately, many websites (including some popular ones) implement password storage poorly. When they get hacked, a password dump can be leaked with everyone's emails/usernames alongside their passwords.</p><p>If the same email/username and password combination are used on multiple sites, hackers can automatically use the breached user data from one site to attempt logins against other websites you use online.</p><p>For this reason, it's absolutely critical to use a unique password for every site. Password manager apps like <a href="https://www.lastpass.com/">LastPass</a> or <a href="https://1password.com/">1Password</a> allow you to use unique randomly-generated passwords for each site but manage them from one encrypted wallet using a master password.</p><p>Simple passwords, based on personal information or on individual dictionary words, are far from safe too. Computers can repeatedly go through common passwords in order to crack them. Similarly, adding numbers and symbols (i.e. changing <i>password</i> to <i>p4$$w0rd</i>) will do little to help either.</p><p>When you have to choose a password you need to remember, you can create strong passwords from sentences.
For example: <i>"At Christmas my dog stole 2 pairs of Doc Martens shoes!"</i> can become <i>ACmds2poDMs!</i> Passwords based on simple sentences can be long, but still easy to remember.</p><p>Another approach is to simply select four random dictionary words, for example: <i>WindySoapLongBoulevard</i>. (For obvious reasons, don't actually use that as your password.) Although this password uses solely letters, it is more secure than a shorter password that would also use numbers and symbols.</p>
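    <p>The four-random-word approach can be sketched with Python's <code>secrets</code> module. The tiny word list here is an illustrative stand-in; with a real dictionary of several thousand words, four words give roughly 50 bits of guessing entropy:</p>

```python
import secrets

# Illustrative mini-dictionary; a real one would have thousands of words.
WORDS = ["windy", "soap", "long", "boulevard", "copper", "lantern",
         "marble", "thicket", "orbit", "velvet", "glacier", "ember"]

def four_word_password(wordlist=WORDS, n=4):
    """Pick n words uniformly at random with a CSPRNG and join them."""
    return "".join(secrets.choice(wordlist).capitalize() for _ in range(n))

print(four_word_password())  # e.g. "OrbitSoapEmberMarble"
```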
    <div>
      <h3>Layering Security</h3>
      <a href="#layering-security">
        
      </a>
    </div>
    <p>Authentication is how computers confirm you are who you say you are. Fundamentally, this is done using one of:</p><ul><li><p>Something you know</p></li><li><p>Something you have</p></li><li><p>Something you are</p></li></ul><p>A password is an example of how you can log in using "something you <i>know</i>"; if someone is able to gain access to that password, it's game-over for that online account.</p><p>Instead, it is possible to use "something you <i>have</i>" as well. This means, should your password be intercepted or disclosed, you still have another safeguard to protect your account.</p><p>In practice, this means that after entering your password into a website, you may also be prompted for another code that you need to read off an app on your phone. This is known as Two-Factor Authentication.</p>
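    <p>The rolling code from a phone app is typically a TOTP (RFC 6238): an HMAC-SHA1 of the current 30-second interval under a secret shared between the site and the app, truncated to six digits. A stdlib-only sketch (the base32 secret below is a hypothetical example):</p>

```python
import base64
import hashlib
import hmac
import struct
import time

def totp(secret_b32, t=None, digits=6):
    """RFC 6238 TOTP: HMAC-SHA1 over the 30-second counter, dynamically
    truncated (RFC 4226) and reduced to the requested number of digits."""
    key = base64.b32decode(secret_b32)
    counter = int((time.time() if t is None else t) // 30)
    mac = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

# Server and phone share the secret, so both derive the same code.
print(totp("JBSWY3DPEHPK3PXP"))
```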
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7fn1UkBgYFczT7NTNiKRZO/90d3170bf9d14915912e442608c1dca8/Screen-Shot-2017-12-25-at-14.00.33.png" />
            
            </figure><p><a href="https://support.cloudflare.com/hc/en-us/articles/200167906-Securing-user-access-with-two-factor-authentication-2FA-">Two-Factor Authentication</a> is supported on many of the world's most popular social media, banking and shopping sites.</p>
    <div>
      <h3>Know who you talk to</h3>
      <a href="#know-who-you-talk-to">
        
      </a>
    </div>
    <p>When you browse to a website online, you may notice a lock symbol light up in your address bar. This indicates that your connection to the website is encrypted, which is important in order to prevent interception.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6euuW53rJuMZCoDqiMMpFs/5b5e4aa6041ae66cac81c87acd542634/green_lock.png" />
            
            </figure><p>When inputting personal information into websites, it is important you check this green lock appears and that the website address starts with "<i>http</i><b><i>s</i></b><i>://</i>".</p><p>It is, however, important to double check the address bar you're putting your personal information into. Is it <i>cloudflare.com</i> or have you been redirected away to a dodgy website at <i>cloudflair.com</i> or <i>cloudflare.net</i>?</p><p>Despite how common encrypted web traffic has become, on many sites it still remains relatively easy to strip away this encryption - by pointing internet traffic to a different address. I describe how this can be done in: <a href="/performing-preventing-ssl-stripping-a-plain-english-primer/">Performing &amp; Preventing SSL Stripping: A Plain-English Primer</a></p><p>It is also good guidance to be careful about the links you see in emails; are they legitimate emails from your bank, or is someone trying to capture your personal information with a fake "phishing" website that looks just like your bank's? Just because someone has a little bit of information about you doesn't mean they are who they say they are. When in doubt, avoid following links directly in emails, and check the validity of the email independently (such as by directly going to your banking website). A correct-looking "from" address isn't enough to prove an email is coming from who it says it's from.</p>
    <div>
      <h3>Conclusions</h3>
      <a href="#conclusions">
        
      </a>
    </div>
    <p>We always hear of new and innovative security vulnerabilities, but for most users, remembering a handful of simple security tips is enough to protect against the majority of security threats.</p>
    <div>
      <h4>In Summary:</h4>
      <a href="#in-summary">
        
      </a>
    </div>
    <ul><li><p>As a rule-of-thumb, install the latest security patches</p></li><li><p>Don't use obsolete software which no longer receives security patches</p></li><li><p>Use well-trusted anti-virus software</p></li><li><p>Back up the files and folders you can't afford to lose</p></li><li><p>Use a Password Manager to set random, unique passwords for every site</p></li><li><p>Don't use common keywords or personal information as passwords</p></li><li><p>Adding numbers and symbols to passwords often doesn't add security, but does make them harder to remember</p></li><li><p>Enable Two-Factor Authentication on sites which support it</p></li><li><p>Check the address bar when inputting personal information; make sure the connection is encrypted and the site address is correct</p></li><li><p>Don't believe everything you see in your email inbox or trust every link sent through email, even if the sender has some information about you</p></li></ul><p>Finally, from everyone at Cloudflare, we wish you a wonderful and safe holiday season. For further reading, check out the <a href="/imdb-2017/">Internet Mince Pie Database</a>.</p> ]]></content:encoded>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Best Practices]]></category>
            <category><![CDATA[HTTPS]]></category>
            <guid isPermaLink="false">3mw7QGYcjAUR8wP9W3N8TJ</guid>
            <dc:creator>Junade Ali</dc:creator>
        </item>
        <item>
            <title><![CDATA[The Curious Case of Caching CSRF Tokens]]></title>
            <link>https://blog.cloudflare.com/the-curious-case-of-caching-csrf-tokens/</link>
            <pubDate>Wed, 13 Dec 2017 14:00:00 GMT</pubDate>
            <description><![CDATA[ It is now commonly accepted as fact that web performance is critical for business. Slower sites can affect conversion rates on e-commerce stores, they can affect your sign-up rate on your SaaS service and lower the readership of your content. ]]></description>
            <content:encoded><![CDATA[ <p>It is now commonly accepted as fact that web performance is critical for business. Slower sites can affect conversion rates on <a href="https://www.cloudflare.com/ecommerce/">e-commerce stores</a>, they can affect your sign-up rate on your SaaS service and lower the readership of your content.</p><p>In the run-up to Thanksgiving and Black Friday, e-commerce sites turned to services like Cloudflare to help <a href="https://www.cloudflare.com/solutions/ecommerce/optimization/">optimise their performance</a> and withstand the traffic spikes of the shopping season.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1Bees9K4tKCi83vUBqpmsf/0436f1a0d72b5230e17b5a8ee2b90889/23910AEA00000578-2852835-Shoppers_scramble_to_pick_up_one-8_1417169462181.jpg" />
            
            </figure><p>In preparation, an e-commerce customer joined Cloudflare on the 9th November, a few weeks before the shopping season. Instead of joining via our Enterprise plan, they were a self-serve customer who signed up by subscribing to our Business plan online and switching their nameservers over to us.</p><p>Their site was running Magento, a notably slow e-commerce platform - filled with lots of interesting PHP, with a considerable amount of soft code in XML. Running version 1.9, the platform was somewhat outdated (Magento was totally rewritten in version 2.0 and subsequent releases).</p><p>Despite the somewhat dated technology, the e-commerce site was "good enough" for this customer and had done its job for many years.</p><p>They were the first to notice an interesting technical issue surrounding how performance and security can often feel at odds with each other. Although they were the first to highlight this issue, in the run-up to Black Friday we ultimately saw around a dozen customers on Magento 1.8/1.9 have similar issues.</p>
    <div>
      <h3>Initial Optimisations</h3>
      <a href="#initial-optimisations">
        
      </a>
    </div>
    <p>After signing up for Cloudflare, the site owners attempted to make some changes to ensure their site was loading quickly.</p><p>The website developers had already ensured the site was loading over HTTPS; in doing so, they were able to ensure their site was loading over the new HTTP/2 protocol, and they made some changes to ensure their site was optimised for HTTP/2 (for details, see our blog post on <a href="/http-2-for-web-developers/">HTTP/2 For Web Developers</a>).</p><p>At Cloudflare we've taken steps to ensure that there isn't a latency overhead for establishing a secure TLS connection; here is a non-exhaustive list of optimisations we use:</p><ul><li><p><a href="/tls-session-resumption-full-speed-and-secure/">TLS Session Resumption</a></p></li><li><p><a href="/high-reliability-ocsp-stapling/">OCSP Stapling</a></p></li><li><p>Fast <a href="/a-relatively-easy-to-understand-primer-on-elliptic-curve-cryptography/">Elliptic Curve Cryptography</a> prioritised</p></li><li><p><a href="/optimizing-tls-over-tcp-to-reduce-latency/">Dynamic TLS Record Sizing</a></p></li></ul><p>Additionally, they had enabled <a href="/announcing-support-for-http-2-server-push-2/">HTTP/2 Server Push</a> to ensure critical CSS/JS assets could be pushed to clients when they made their first request. Without Server Push, a client has to download the HTML response, interpret it and then work out which assets it needs to download.</p><p>Big images were lazy-loaded, downloading only when they needed to be seen by the user. Additionally, they had enabled a Cloudflare feature called <a href="/a-very-webp-new-year-from-cloudflare/">Polish</a>. With this enabled, Cloudflare dynamically works out whether it's faster to serve an image in <a href="https://developers.google.com/speed/webp/">WebP</a> (a new image format developed by Google) or whether it's faster to serve it in a different format.</p><p>These optimisations did make some improvement to performance, but their site was still slow.</p>
    <div>
      <h3>Respect The TTFB</h3>
      <a href="#respect-the-ttfb">
        
      </a>
    </div>
    <p>In web performance, there are a few different things which can affect response times - I've crudely summarised them into the following three categories:</p><ul><li><p><b>Connection &amp; Request Time</b> - Before a request can be sent off for a website to load something, a few things need to happen: DNS queries, a TCP handshake to establish the connection with the web server and a TLS handshake to establish a secure connection</p></li><li><p><b>Page Render</b> - A dynamic site needs to query databases, call APIs, write logs, render views, etc before a response can be made to a client</p></li><li><p><b>Response Speed</b> - Downloading the response from the web server, browser-side rendering of the HTML and pulling the other resources linked in the HTML</p></li></ul><p>The e-commerce site had taken steps to improve their <i>Response Speed</i> by enabling HTTP/2 and performing other on-site optimisations. They had also optimised their <i>Connection &amp; Request Time</i> by using a <a href="https://www.cloudflare.com/learning/cdn/what-is-a-cdn/">CDN service</a> like Cloudflare to provide fast DNS and reduce latency when optimising TLS/TCP connections.</p><p>However, they now realised the critical step they needed to optimise was the <i>Page Render</i> that would happen on their web server.</p><p>By looking at a Waterfall View of how their site loaded (similar to the one below) they could see the main constraint.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3xj4dTK90yYqCe8l77rAtH/a85111cd2d50928073dc90d7233baac1/time-to-first-byte.png" />
            
            </figure><p>Example Waterfall view from WebSiteOptimization.com</p><p>On the initial request, you can see the green "Time to First Byte" view taking a very long time.</p><p>Many browsers have tools for viewing Waterfall Charts like the one above; Google provide some excellent documentation for Chrome on doing this: <a href="https://developers.google.com/web/tools/chrome-devtools/network-performance/">Get Started with Analyzing Network Performance in Chrome DevTools</a>. You can also generate these graphs fairly easily from site speed test tools like <a href="https://www.webpagetest.org/">WebPageTest.org</a>.</p><p>Time to First Byte is an often misunderstood metric and frequently can't be attributed to a single fault. For example, using a CDN service like Cloudflare may increase TTFB by a few milliseconds, but do so to the benefit of the overall load time. This can be because the CDN is adding additional compression functionality to speed up the response, or simply because it has to establish a connection back to the origin web server (which isn't visible to the client).</p><p>There are instances where it is important to debug why TTFB is a problem. In this instance, for example, the e-commerce platform was taking upwards of 3 seconds just to generate the HTML response. In this case, it was clear the constraint was the server-side <i>Page Render</i>.</p><p>When the web server was generating dynamic content, it was having to query databases and perform logic before a request could be served. In most instances (e.g. a product page) the page would be identical to every other request. Only when someone added something to their shopping cart would the site really become dynamic.</p>
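    <p>As a sketch of measuring TTFB yourself, the following Python snippet (stdlib only; not how the browser tools above work internally) times how long the first response byte takes to arrive after a minimal HTTP request is sent. Connection setup is deliberately excluded from the timer, so this roughly isolates server-side render plus one network round trip:</p>

```python
import socket
import time

def time_to_first_byte(host, port=80, path="/"):
    """Send a minimal HTTP/1.1 request and time until the first
    response byte arrives. DNS/TCP setup happens before the timer
    starts; TLS is not handled, so this is plain-HTTP only."""
    with socket.create_connection((host, port), timeout=10) as sock:
        request = (f"GET {path} HTTP/1.1\r\n"
                   f"Host: {host}\r\nConnection: close\r\n\r\n")
        start = time.monotonic()
        sock.sendall(request.encode())
        sock.recv(1)  # blocks until the first byte of the response
        return time.monotonic() - start
```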
    <div>
      <h3>Enabling Cookie-Based Caching</h3>
      <a href="#enabling-cookie-based-caching">
        
      </a>
    </div>
    <p>Before someone logs into the Magento admin panel or adds something to their shopping cart, the page view is anonymous and will be served up identically to every visitor. Only when an anonymous visitor logs in or adds something to their shopping cart will they see a page that's dynamic and unlike every other page that's been rendered.</p><p>It is therefore possible to cache those anonymous requests so that Magento on an origin server doesn't need to constantly regenerate the HTML.</p><p>Cloudflare users on our Business Plan are able to cache anonymous page views when using Magento via our Bypass Cache on Cookie functionality. This allows static HTML to be cached at our edge, with no need for it to be regenerated from request to request.</p><p>This provides a huge performance boost for the first few page visits of a visitor, and still allows them to interact with the dynamic site when they need to. Additionally, it helps keep load down on the origin server in the event of traffic spikes, sparing precious server CPU time for those who need it to complete dynamic actions like paying for an order.</p><p>Here's an example of how this can be configured in Cloudflare using the Page Rules functionality:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7pBjTSCiHvMVLi5J1spu0E/9731ebfb1521b397c7fa22e19fbc263a/Screen-Shot-2017-11-30-at-13.57.40.png" />
            
            </figure><p>The Page Rule configuration above instructs Cloudflare to "Cache Everything" (including HTML), but to bypass the cache when it sees a request containing any of the cookies: <code>external_no_cache</code>, <code>PHPSESSID</code> or <code>adminhtml</code>. The final <code>Edge Cache TTL</code> setting just instructs Cloudflare to keep HTML files in cache for a month; this is necessary because Magento by default uses headers to discourage caching.</p><p>The site administrator configured their site to work something like this:</p><ol><li><p>On the first request, the user is anonymous and their request indistinguishable from any other - their page can be served from the Cloudflare cache</p></li><li><p>When the customer adds something to their shopping cart, they do that via a <code>POST</code> request - as methods like <code>POST</code>, <code>PUT</code> and <code>DELETE</code> are intended to change a resource, they bypass the Cloudflare cache</p></li><li><p>On the <code>POST</code> request to add something to their shopping cart, Magento will set a cookie called <code>external_no_cache</code></p></li><li><p>As the site owner has configured Cloudflare to bypass the cache when we see a request containing the <code>external_no_cache</code> cookie, all subsequent requests go direct to origin</p></li></ol><p>This behaviour can be summarised in the following crude diagram:</p>
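            <p>As a sketch (not Cloudflare's actual edge code), the bypass decision boils down to a pattern test against the request's <code>Cookie</code> header, mirroring the cookie names in the Page Rule above:</p>

```python
import re

# Cookie names from the Page Rule: any one of them forces a cache bypass.
BYPASS_COOKIES = re.compile(r"(external_no_cache|PHPSESSID|adminhtml)=")

def should_bypass_cache(cookie_header):
    """Bypass the edge cache when a session/admin cookie is present;
    otherwise the anonymous page view can be served from cache."""
    return bool(BYPASS_COOKIES.search(cookie_header or ""))

print(should_bypass_cache("frontend=abc123"))                  # False
print(should_bypass_cache("frontend=abc123; external_no_cache=1"))  # True
```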
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/ZuitM2bncUoJNYpHFC1A1/e6678ca0768bf540ea8a23acbb09a6f3/Screen-Shot-2017-12-12-at-14.00.59.png" />
            
            </figure><p>The site administrators initially enabled this configuration on a subdomain for testing purposes, but noticed something rather strange. When they would add something to the cart on their test site, the cart would show up empty. If they then tried again to add something to the cart, the item would be added successfully.</p><p>The customer reported one additional, interesting piece of information - when they tried to mimic this cookie-based caching behaviour internally using Varnish, they faced the exact same issue.</p><p>In essence, the <i>Add to Cart</i> functionality would fail, but only on the first request. This was indeed odd behaviour, and the customer reached out to Cloudflare Support.</p>
    <div>
      <h3>Debugging</h3>
      <a href="#debugging">
        
      </a>
    </div>
    <p>The customer wrote in just as our Singapore office were finishing up their afternoon and was initially triaged by a Support Engineer in that office.</p><p>The Support Agent evaluated what the problem was and initially identified that if the <code>frontend</code> cookie was missing, the <i>Add to Cart</i> functionality would fail.</p><p>No matter which page you access on Magento, it will attempt to set a <code>frontend</code> cookie, even if it doesn't add an <code>external_no_cache</code> cookie.</p><p>When Cloudflare caches static content, the default behaviour is to strip away any cookies coming from the server if the file is going to end up in cache - this is a security safeguard to prevent customers accidentally caching private session cookies. This applies when a cached response contains a <code>Set-Cookie</code> header, but does not apply when the cookie is set via JavaScript - in order to allow functionality like Google Analytics to work.</p><p>They had identified that the caching logic at our network edge was working fine, but for whatever reason Magento would refuse to add something to a shopping cart without a valid <code>frontend</code> cookie. Why was this?</p><p>As Singapore handed their shift work over to London, the Support Engineer working on this ticket decided to escalate the ticket up to me. This was largely because, towards the end of last year, I had owned the re-pricing of this feature (which opened it up to our self-service Business plan users, instead of being Enterprise-only). That said, I had not touched Magento in many years; even when I was working in digital agencies I wasn't the most enthusiastic to build on it.</p><p>The Support Agent provided some internal comments that described the issue in detail and their own debugging steps, with an effective "TL;DR" summary:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/78Icys6dPunTt7NhKgPJA0/b8e0e03814e7ca602a7445f6cecc0ec3/Screen-Shot-2017-11-30-at-20.42.09.png" />
            
            </figure><p>Debugging these kinds of customer issues is not as simple as putting breakpoints into a codebase. Often, for our Support Engineers, the customer's origin server acts as a black box; there can be many moving parts, and they of course have to manage the expectations of a real customer at the other end. This level of problem-solving fun is one of the reasons I still like answering customer support tickets when I get a chance.</p><p>Before attempting to debug anything, I double-checked that the Support Agent was correct that nothing had gone wrong on our end - I trusted their judgement, and no other customers were reporting that their caching functionality had broken, but it is always best to cross-check manual debugging work. I ran some checks to ensure that there were no regressions in our Lua codebase that controls caching logic:</p><ul><li><p>Checked that there were no changes to this logic in our internal code repository</p></li><li><p>Checked that automated tests were still in place and built successfully</p></li><li><p>Ran checks on production to verify that caching behaviour still worked as normal</p></li></ul><p>As Cloudflare has customers across so many platforms, I also checked to ensure that there were no breaking changes in the Magento codebase that would cause this bug to occur. Occasionally we find our customers accidentally come across bugs in CMS platforms which are unreported. This, fortunately, was not one of those instances.</p><p>The next step was to attempt to replicate the issue locally, away from the customer's site. I spun up a vanilla instance of Magento 1.9 and set it up with an identical Cloudflare configuration. The experiment was successful and I was able to replicate the customer's issue.</p><p>I had an instinctive feeling that it was the Cross-Site Request Forgery protection functionality that was at fault here, and I started tweaking my own test Magento installation to see if this was the case.</p><p>Cross-Site Request Forgery attacks work by exploiting the fact that one site on the Internet can get a client to make requests to another site.</p><p>For example, suppose you have an online bank account with the ability to send money to other accounts. Once logged in, there is a form to send money which uses the following HTML:</p>
            <pre><code>&lt;form action="https://example.com/send-money"&gt;
Account Name:
&lt;input type="text" name="account_name" /&gt;
Amount:
&lt;input type="text" name="amount" /&gt;
&lt;input type="submit" /&gt;
&lt;/form&gt;</code></pre>
            <p>After logging in and doing your transactions, you don't log out of the website - you simply navigate elsewhere online. Whilst browsing around, you come across a button on a website that contains the text "Click me! Why not?". You click the button, and £10,000 goes from your bank account to mine.</p><p>This happens because the button you clicked was connected to an endpoint on the banking website, and contained hidden fields instructing it to send me £10,000 of your cash:</p>
            <pre><code>&lt;form action="https://example.com/send-money"&gt;
&lt;input type="hidden" name="account_name" value="Junade Ali" /&gt;
&lt;input type="hidden" name="amount" value="10,000" /&gt;
&lt;input type="submit" value="Click me! Why not?" /&gt;
&lt;/form&gt;</code></pre>
            <p>In order to prevent these attacks, CSRF Tokens are inserted as hidden fields into web forms:</p>
            <pre><code>&lt;form action="https://example.com/send-money"&gt;
Account Name:
&lt;input type="text" name="account_name" /&gt;
Amount:
&lt;input type="text" name="amount" /&gt;
&lt;input type="hidden" name="csrf_protection" value="hunter2" /&gt;
&lt;input type="submit" /&gt;
&lt;/form&gt;</code></pre>
            <p>A cookie containing a random session identifier is first set on the client's computer. When a form is served to the client, a CSRF token is generated using that cookie. The server will check that the CSRF token submitted in the HTML form actually matches the session cookie and, if it doesn't, block the request.</p><p>In this instance, as no session cookie was ever set (Cloudflare would strip it out before the response entered cache), the <code>POST</code> request to the <i>Add to Cart</i> functionality could never pass CSRF token verification and the request would fail.</p><p>Due to CSRF vulnerabilities, Magento applied CSRF protection to all forms; this broke Full Page Cache implementations in Magento 1.8.x/1.9.x. You can find all the details in the <a href="https://magento.com/security/patches/supee-6285">SUPEE-6285 patch documentation</a> from Magento.</p>
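The session-bound token scheme described above can be sketched in a few lines of Python. This is a minimal illustration of the general technique, not Magento's actual implementation; the secret key and function names are hypothetical:

```python
import hashlib
import hmac
import secrets

SECRET_KEY = b"server-side secret"  # hypothetical server-held secret


def issue_csrf_token(session_id: str) -> str:
    """Derive a CSRF token from the visitor's session identifier."""
    return hmac.new(SECRET_KEY, session_id.encode(), hashlib.sha256).hexdigest()


def is_valid_csrf_token(session_id: str, submitted_token: str) -> bool:
    """Check the token from the form against the session cookie.

    If no session cookie was ever set (for example, because a caching
    proxy stripped the Set-Cookie header), validation can never succeed.
    """
    expected = issue_csrf_token(session_id)
    return hmac.compare_digest(expected, submitted_token)


session = secrets.token_hex(16)    # normally delivered via Set-Cookie
token = issue_csrf_token(session)  # normally embedded as a hidden field
```

The token only validates when the form submission arrives alongside the same session cookie it was derived from - which is exactly why stripping the cookie breaks the form.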
    <div>
      <h3>Caching Content with CSRF Protected Forms</h3>
      <a href="#caching-content-with-csrf-protected-forms">
        
      </a>
    </div>
    <p>To validate that CSRF tokens were definitely at fault here, I completely disabled CSRF protection in Magento. Obviously you should never do this in production; I found it slightly odd that there is even a UI toggle for it!</p><p>Another method, created in the Magento community, was an extension to disable CSRF protection just for the <i>Add to Cart</i> functionality (<a href="https://github.com/deivisonarthur/Inovarti_FixAddToCartMage18/blob/master/README.md">Inovarti_FixAddToCartMage18</a>), under the argument that CSRF risks are far reduced when we're talking about <i>Add to Cart</i> functionality. This is still not ideal; we should have CSRF protection on every form whose action changes site state.</p><p>There is, however, a third way. I did some digging and identified a Magento plugin that effectively uses JavaScript to inject a dynamic CSRF token the moment a user clicks the <i>Add to Cart</i> button, just before the request is actually submitted. There's quite a lengthy GitHub thread which outlines this issue and references the Pull Requests which fixed this behaviour in <a href="https://github.com/nexcess/magento-turpentine/issues/345">the Magento Turpentine plugin</a>. I won't repeat the set-up instructions here, but they can be found in an article I've written on the Cloudflare Knowledge Base: <a href="https://support.cloudflare.com/hc/en-us/articles/236168808-Caching-Static-HTML-with-Magento-version-1-2-">Caching Static HTML with Magento (version 1 &amp; 2)</a></p><p>Effectively, the dynamic CSRF token is only injected into the web page the moment it's needed. This is the behaviour implemented in other e-commerce platforms and in Magento 2.0+, allowing Full Page Caching to be implemented quite easily. We had to recommend this plugin as it wouldn't be practical for the site owner to simply update to Magento 2.</p><p>One thing to be wary of when exposing CSRF tokens via an AJAX endpoint is <a href="https://stackoverflow.com/questions/2669690/why-does-google-prepend-while1-to-their-json-responses">JSON Hijacking</a>. There are some tips on how you can prevent this in the <a href="https://www.owasp.org/index.php/AJAX_Security_Cheat_Sheet#Always_return_JSON_with_an_Object_on_the_outside">OWASP AJAX Security Cheat Sheet</a>. Iain Collins has a Medium post with further discussion on the security merits of <a href="https://medium.com/@iaincollins/csrf-tokens-via-ajax-a885c7305d4a">CSRF Tokens via AJAX</a> (that said, however you're performing CSRF prevention, <a href="https://developer.mozilla.org/en-US/docs/Web/Security/Same-origin_policy">Same Origin Policies</a> and <a href="https://www.owasp.org/index.php/HttpOnly">HTTPOnly cookies</a> FTW!).</p><p>There is an even cooler way you can do this using Cloudflare's <a href="/introducing-cloudflare-workers/">Edge Workers</a> offering. Soon this will allow you to run JavaScript at our Edge network, and you can use that to dynamically insert CSRF tokens into cached content (and then perform cryptographic validation of the CSRF token either at our Edge or at the origin itself using a shared secret).</p>
    <div>
      <h3>But this has been a problem since 2015?</h3>
      <a href="#but-this-has-been-a-problem-since-2015">
        
      </a>
    </div>
    <p>Another interesting observation is that the Magento patch which caused this behaviour had been around since July 7, 2015. Why did our Support Team only see this issue in the run-up to Black Friday in 2017? What's more, we ultimately saw around a dozen support tickets about this exact issue on Magento 1.8/1.9 over the course of six weeks.</p><p>When an Enterprise customer ordinarily joins Cloudflare, there is a named Solutions Engineer who gets them up and running and ensures there is no pain; however, when you sign up online with a credit card, you forgo this privilege.</p><p>Last year, we released Bypass Cache on Cookie to self-serve users when a lot of e-commerce customers were in their Christmas/New Year release freeze and not making changes to their websites. Since then, there were no major shopping events, and most of the sites enabling this feature were newly built websites using Magento 2, where this wasn't an issue.</p><p>In the run-up to Black Friday, performance and coping under load became a key consideration for developers working on legacy e-commerce websites - and they turned to Cloudflare. Given the large, but steady, influx of e-commerce websites joining Cloudflare, even the low overall percentage of those on Magento 1.8/1.9 became noticeable.</p>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Caching anonymous page views is an important, and in some cases essential, mechanism to dramatically improve site performance and substantially reduce server load, especially during traffic spikes. Whilst aggressively caching content for anonymous users, you can bypass the cache for authenticated users and allow them to use the dynamic functionality your site has to offer.</p><p>When you need to insert dynamic state into cached content, JavaScript offers a nice compromise. JavaScript allows us to cache HTML for anonymous page visits, but insert state when users interact in a certain way - in essence, defusing this conflict between performance and security. In the future you'll be able to run this JavaScript logic at our network edge using Cloudflare <a href="/introducing-cloudflare-workers/">Edge Workers</a>.</p><p>It also remains important to respect the RESTful properties of HTTP: ensure <code>GET</code>, <code>OPTIONS</code> and <code>HEAD</code> requests remain safe, and instead use <code>POST</code>, <code>PUT</code>, <code>PATCH</code> and <code>DELETE</code> as necessary.</p><p>If you're interested in debugging interesting technical problems on a network that sees around 10% of global Internet traffic, <a href="https://www.cloudflare.com/careers/jobs/?department=Customer+Support">we're hiring for Support Engineers</a> in San Francisco, London, Austin and Singapore.</p> ]]></content:encoded>
            <category><![CDATA[Page Rules]]></category>
            <category><![CDATA[SaaS]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[HTTPS]]></category>
            <category><![CDATA[Cache]]></category>
            <guid isPermaLink="false">3eU2FDtvxwJSqAIlRad4rB</guid>
            <dc:creator>Junade Ali</dc:creator>
        </item>
        <item>
            <title><![CDATA[The New DDoS Landscape]]></title>
            <link>https://blog.cloudflare.com/the-new-ddos-landscape/</link>
            <pubDate>Thu, 23 Nov 2017 03:28:00 GMT</pubDate>
            <description><![CDATA[ News outlets and blogs will frequently compare DDoS attacks by the volume of traffic that a victim receives. Surely this makes some sense, right? The greater the volume of traffic a victim receives, the harder to mitigate an attack - right?  ]]></description>
            <content:encoded><![CDATA[ <p>News outlets and blogs will frequently compare DDoS attacks by the volume of traffic that a victim receives. Surely this makes some sense, right? The greater the volume of traffic a victim receives, the harder to mitigate an attack - right?</p><p>At least, this is how things used to work. An attacker would gain capacity and then use that capacity to launch an attack. With enough capacity, an attack would overwhelm the victim's network hardware with junk traffic such that it can no longer serve legitimate requests. If your web traffic is served by a server with a 100 Gbps port and someone sends you 200 Gbps, your network will be saturated and the website will be unavailable.</p><p>Recently, this dynamic has shifted as attackers have become far more sophisticated. The practical realities of the modern Internet have increased the amount of effort required to clog up the network capacity of a DDoS victim - attackers have noticed this and are now choosing to perform attacks higher up the network stack.</p><p>In recent months, Cloudflare has seen a dramatic reduction in simple attempts to flood our network with junk traffic. Whilst we continue to see large network-level attacks, in excess of 300 and 400 Gbps, such attacks have in general become far less common (the largest recent attack was just over 0.5 Tbps). This has been especially true since the end of September, when we made official a policy that we <a href="/unmetered-mitigation/">would not remove any customers from our network merely for receiving a DDoS attack that's too big</a>, including those on our free plan.</p><p>Far from attackers simply closing shop, we see a trend whereby attackers are moving to more advanced application-layer attack strategies. This trend is not only seen in metrics from our automated attack mitigation systems, but has also been the experience of our frontline customer support engineers. 
Whilst we continue to see very large network level attacks, note that they are occurring less frequently since the introduction of Unmetered Mitigation:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4j6SZYixPDOxuWY5hwbHXE/36e14812f66afffc9c63960a2c6e1244/syn.png" />
            
            </figure><p>To understand how the landscape has made such a dramatic shift, we must first understand how DDoS attacks are performed.</p>
    <div>
      <h3>Performing a DDoS</h3>
      <a href="#performing-a-ddos">
        
      </a>
    </div>
    <blockquote><p>From presentation by <a href="https://twitter.com/IcyApril?ref_src=twsrc%5Etfw">@IcyApril</a> - First thing you absolutely need for a successful DDoS - is a cool costume. <a href="https://t.co/WIC0LjF4ka">pic.twitter.com/WIC0LjF4ka</a></p><p>— majek04 (@majek04) <a href="https://twitter.com/majek04/status/933332376159162370?ref_src=twsrc%5Etfw">November 22, 2017</a></p></blockquote><p>The first thing you need before you can carry out a DDoS Attack is capacity. You need the network resources to be able to overwhelm your victim.</p><p>To build up capacity, attackers have a few mechanisms at their disposal; three such examples are Botnets, IoT Devices and DNS Amplification:</p>
    <div>
      <h4>Botnets</h4>
      <a href="#botnets">
        
      </a>
    </div>
    <p>Computer viruses are deployed for multiple reasons: for example, they can harvest private information from users, or blackmail users into paying money to get their precious files back. Another use of computer viruses is building capacity to perform DDoS attacks.</p><p>A Botnet is a network of infected computers that are centrally controlled by an attacker; these zombie computers can then be used to send spam emails or perform DDoS attacks.</p><p>Consumers have access to faster Internet than ever before. In November 2016, the average UK broadband upload speed reached 4.3Mbps - this means a Botnet which has infected a little under 2,400 computers can launch an attack of around 10Gbps. Such capacity is plenty to saturate the end networks that power most websites online.</p><p>On August 17th, 2017, multiple networks online were subject to significant attacks from a botnet known as WireX. Researchers from a variety of organisations - including Akamai, Cloudflare, Flashpoint, Google, Oracle Dyn, RiskIQ and Team Cymru - cooperated to combat this botnet, eventually leading to hundreds of Android apps being removed and a process being started to remove the malware-ridden apps from all devices.</p>
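The arithmetic behind the botnet figure above is easy to check: at the November 2016 average UK upload speed of 4.3 Mbps, a little under 2,400 infected machines are enough for a 10 Gbps flood.

```python
# Back-of-the-envelope check of the botnet capacity figure quoted above.
avg_upload_mbps = 4.3  # average UK broadband upload speed, November 2016
target_gbps = 10       # attack size needed to saturate a typical end network

bots_needed = target_gbps * 1000 / avg_upload_mbps
print(f"Bots needed for a {target_gbps} Gbps flood: {bots_needed:.0f}")
# A little under 2,400 infected machines, matching the estimate above.
```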
    <div>
      <h4>IoT Devices</h4>
      <a href="#iot-devices">
        
      </a>
    </div>
    <p>More and more of our everyday appliances are being embedded with Internet connectivity. Like other types of technology, they can be taken over with malware and controlled to launch large-scale DDoS attacks.</p><p>Towards the end of last year, we began to see Internet-connected cameras start to launch <a href="/say-cheese-a-snapshot-of-the-massive-ddos-attacks-coming-from-iot-cameras/">large DDoS attacks</a>. Video cameras were advantageous to attackers in the sense that they need to be connected to networks with enough bandwidth to stream video.</p><p>Mirai was one such botnet which targeted Internet-connected cameras and Internet routers. It would start by logging into the web dashboard of a device using a table of 60 default usernames and passwords, then installing malware on the device.</p><p>Where users have set their own passwords instead of leaving the default values in place, other pieces of malware can use Dictionary Attacks to repeatedly guess simple user-configured passwords, using a list of common passwords like the one shown below. I have self-censored some of the passwords; apparently users can be in a fairly angry state of mind when setting passwords:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3lxX56Rn4vULNewXMTViZ8/f7e7739a0d93d0762854ca96c6802116/Screen-Shot-2017-11-24-at-16.22.59.png" />
            
            </figure><p>Passwords aside, back in May, I blogged specifically about some examples of security risks we are starting to see which are specific to IoT devices: <a href="/iot-security-anti-patterns/">IoT Security Anti-Patterns</a>.</p>
    <div>
      <h4>DNS Amplification</h4>
      <a href="#dns-amplification">
        
      </a>
    </div>
    <p>DNS is the phonebook of the Internet; in order to reach this site, your local computer used DNS to look up which IP address would serve traffic for <code>blog.cloudflare.com</code>. I can perform this DNS query from my command line using <code>dig A blog.cloudflare.com</code>:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6Hh5Ef8wCUFfsLgQFSTDnL/1c1a6f2b786c35624b8ac302552f0005/Screen-Shot-2017-11-24-at-16.37.47.png" />
            
            </figure><p>Firstly, notice that the response is pretty big - certainly bigger than the question we asked.</p><p>DNS is built on a transport protocol called UDP. When using UDP, it's easy to forge the source of a query, as UDP doesn't require a handshake before a response is sent.</p><p>Due to these two factors, someone is able to make a DNS query on behalf of someone else: we can make a relatively small DNS query which will then result in a long response being sent somewhere else.</p><p>Online there are open DNS resolvers that will take a request from anyone and send the response to anyone else. In most cases we should not be exposing open DNS resolvers to the Internet (most are open due to configuration mistakes). However, when intentionally exposing DNS resolvers online, security steps should be taken - StrongArm has <a href="https://strongarm.io/blog/secure-open-dns-resolver/">a primer on securing open DNS resolvers</a> on their blog.</p><p>Let's use a hypothetical to illustrate this point. Imagine that you wrote to a mail order retailer requesting a catalogue (if you still know what one is). You'd send a relatively short postcard with your contact information and your request - you'd then get back quite a big catalogue. Now imagine you did the same, but sent hundreds of these postcards and instead included the address of a friend. Assuming the retailer was obliging in sending such a vast number of catalogues, your friend could wake up one day with their front door blocked by catalogues.</p><p>In 2013, we blogged about how one DNS Amplification attack we faced <a href="/the-ddos-that-almost-broke-the-internet/">almost broke the Internet</a>; however, in recent times, aside from one exceptionally large attack (occurring just after we launched <a href="/unmetered-mitigation/">Unmetered DDoS Mitigation</a>), DNS Amplification attacks have generally made up a low proportion of the attacks we see:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4zWEAcupocMXEFYGTN7qfk/9ef20d45969ecfb65ab4661beb390918/dns.png" />
            
            </figure><p>Whilst we're seeing fewer of these attacks, you can find a more detailed overview in our learning centre: <a href="https://www.cloudflare.com/learning/ddos/dns-amplification-ddos-attack/">DNS Amplification Attack</a></p>
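The appeal of amplification comes down to a simple ratio: how many bytes arrive at the victim for each byte the attacker sends. A small sketch, with illustrative (hypothetical) packet sizes:

```python
def amplification_factor(query_bytes: int, response_bytes: int) -> float:
    """Bytes delivered to the victim per byte the attacker spends."""
    return response_bytes / query_bytes


# Illustrative sizes only: a small query for a record whose answer is
# large (e.g. a query against a zone with many records in the response).
query = 64       # bytes the attacker sends, with a spoofed source address
response = 3072  # bytes the open resolver sends to the victim

factor = amplification_factor(query, response)
print(f"Each spoofed byte yields {factor:.0f}x bytes at the victim")
```

With these example sizes the attacker multiplies their own upload capacity 48-fold, which is why open resolvers are so attractive for this class of attack.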
    <div>
      <h3>DDoS Mitigation: The Old Priorities</h3>
      <a href="#ddos-mitigation-the-old-priorities">
        
      </a>
    </div>
    <p>Using the capacity an attacker has built up, they can send junk traffic to a web property. This is referred to as a Layer 3/4 attack; it primarily seeks to saturate the network capacity of the victim's network.</p><p>Above all, mitigating these attacks requires capacity. If you receive an attack of 600 Gbps and you only have 10 Gbps of capacity, you either need to pay an intermediary network to filter traffic for you or have your network go offline due to the force of the attack.</p><p>As a network, Cloudflare works by passing a customer's traffic through our network; in doing so, we are able to apply performance optimisations and security filtering to the traffic we see. One such security filter is removing junk traffic associated with Layer 3/4 DDoS attacks.</p><p>Cloudflare's network was built when large-scale DDoS attacks were becoming a reality. Huge network capacity, spread out over the world in many different data centres, makes it easier to absorb large attacks. We currently have over 15 Tbps of capacity, and this figure keeps growing fast.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/64a0Lhlr7C7Fj66wdHTi4K/c119b1c033db44582bda9643fe58cf34/network-map-gradient.png" />
            
            </figure><p>Preventing DDoS attacks needs a little more sophistication than just capacity, though. While traditional Content Delivery Networks are built using Unicast technology, Cloudflare's network is built using an Anycast design.</p><p>In essence, this means that network traffic is routed to the nearest available Point-of-Presence, and it is not possible for an attacker to override this routing behaviour - the routing is effectively performed using BGP, the routing protocol of the Internet.</p><p>Unicast networks will frequently use technology like DNS to steer traffic to close data centres. This routing can easily be overridden by an attacker, allowing them to force attack traffic to a single data centre. This is not possible with Cloudflare's Anycast network, meaning we maintain full control of how our traffic is routed (providing intermediary ISPs respect our routes). With this network design, we have the ability to rapidly update routing decisions, even against ISPs which ordinarily do not respect cache expiration times for DNS records (TTLs).</p><p>Cloudflare's network also maintains an Open Peering Policy; we are open to interconnecting our network with any other network without cost. This means we tend to eliminate intermediary networks between attackers and ourselves. When we are under attack, we usually have a very short network path from the attacker to us - meaning there are no intermediary networks to suffer collateral damage.</p>
    <div>
      <h3>The New Landscape</h3>
      <a href="#the-new-landscape">
        
      </a>
    </div>
    <p>I started this blog post with a chart which demonstrates the frequency of a type of network-layer attack known as a SYN Flood against the Cloudflare network. You'll notice how the largest attacks are further spaced out over the past few months:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6UvuxVBCEB5uFaqbDnN1XH/05d3fb315da365196f34578b57ea9cff/syn.png" />
            
            </figure><p>You can also see that this trend does not hold when compared to a graph of Application Layer DDoS attacks, which we continue to see coming in:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/0o48AirpaxWXQkjV6lmcZ/39598752e41f9188e77a70c5edad67ca/layer7.png" />
            
            </figure><p>The chart above has an important caveat: an Application Layer attack is defined by what <b>we</b> determine to be an attack. Application Layer (Layer 7) attacks are far harder to distinguish from real traffic than Layer 3/4 attacks; they effectively resemble normal web requests, rather than junk traffic.</p><p>Attackers can order their Botnets to perform attacks against websites using "Headless Browsers", which have no user interface. Headless Browsers work exactly like normal browsers, except that they are controlled programmatically instead of via a window on a user's screen.</p><p>Botnets can use Headless Browsers to make HTTP requests that load and behave just like ordinary web requests. As this can be done programmatically, they can order bots to repeat these HTTP requests rapidly - effectively taking up the entire capacity of a website and taking it offline for ordinary visitors.</p><p>This is a non-trivial problem to solve. At Cloudflare, we have specific services like <a href="/meet-gatebot-a-bot-that-allows-us-to-sleep/">Gatebot</a> which identify DDoS attacks by picking up on anomalies in network traffic. We have tooling like <a href="https://support.cloudflare.com/hc/en-us/articles/200170076-What-does-I-m-Under-Attack-Mode-do-">"I'm Under Attack Mode"</a> to analyse traffic and ensure the visitor is human. This is, however, only part of the story.</p><p>A $20/month server running a resource-intensive e-commerce platform may not be able to cope with more than a dozen concurrent HTTP requests before being unable to serve any more traffic.</p><p>An attack which can take down a small e-commerce site will likely not even be a drop in the ocean for Cloudflare's network, which sees around 10% of Internet requests.</p><p>The chart below outlines DDoS attacks per day against Cloudflare customers; but it is important to bear in mind that this includes what <b>we</b> define as an attack. In recent times, Cloudflare has built specific products to help customers define what they think an attack looks like and how much traffic they feel they should cope with.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/jDOUHwa7eNwq42lzB65M8/dee1a55374ae1f37e7f60c85472662cc/cfddos.png" />
            
            </figure>
    <div>
      <h3>A Web Developer's Guide to Defeating an Application Layer DDoS Attack</h3>
      <a href="#a-web-developers-guide-to-defeating-an-application-layer-ddos-attack">
        
      </a>
    </div>
    <p>One of the reasons why Application Layer DDoS attacks are so attractive is the uneven balance between the computational cost of requesting a web page and the computational cost of serving one. Serving a dynamic website requires all kinds of operations: fetching information from a database, firing off API requests to separate services, rendering a page, writing log lines and potentially even pushing data down a message queue.</p><p>Fundamentally, there are two ways of dealing with this problem:</p><ul><li><p>making the balance between requester and server less asymmetric, by making it easier to serve web requests</p></li><li><p>limiting requests which are in such excess that they are blatantly abusive</p></li></ul><p>It remains critical that you have a high-capacity DDoS mitigation network in front of your web application; one of the reasons why Application Layer attacks are increasingly attractive to attackers is that networks have gotten good at mitigating volumetric attacks at the network layer.</p><p>Cloudflare has found that, whilst performing Application Layer attacks, attackers will sometimes pick cryptographic ciphers that are the hardest for servers to compute and are not usually accelerated in hardware. What this means is that attackers will try to consume more of your server's resources by using the very fact that you offer encrypted HTTPS connections against you. Having a proxy in front of your web traffic has the added benefit that it establishes a brand new secure connection to your origin web server - effectively meaning you don't have to worry about Presentation Layer attacks.</p><p>Additionally, offloading services which don't have custom application logic (like DNS) to managed providers can help ensure you have less surface area to worry about at the Application Layer.</p>
    <div>
      <h4>Aggressive Caching</h4>
      <a href="#aggressive-caching">
        
      </a>
    </div>
    <p>One of the ways to make it easier to serve web requests is to use some form of caching. There are multiple forms of caching; here, I'm going to talk about how you enable caching for HTTP requests.</p><p>Suppose you're using a CMS (Content Management System) to run your blog. The vast majority of visitors to your blog will see a page identical to that served to every other visitor. It is only when an anonymous visitor logs in or leaves a comment that they will see a page that's dynamic and unlike every other page that's been rendered.</p><p>Despite the vast majority of HTTP requests to specific URLs being identical, your CMS has to regenerate the page for every single request as if it were brand new. Application Layer DDoS attacks exploit this amplification to make their attacks more brutal.</p><p>Caching proxies like NGINX and services like Cloudflare allow you to specify that, until a user has a browser cookie that de-anonymises them, content can be served from cache. Alongside performance benefits, these configuration changes can prevent the crudest Application Layer DDoS attacks.</p><p>For further information, you can consult the NGINX guide to caching, or alternatively see my blog post on caching anonymous page views:</p><ul><li><p><a href="https://www.nginx.com/blog/nginx-caching-guide/">NGINX Caching Guide</a></p></li><li><p><a href="/caching-anonymous-page-views/">Caching Anonymous Page Views at the Edge with Cloudflare</a></p></li></ul>
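The cache-versus-bypass decision described above can be sketched as a simple predicate. This is an illustration only; the cookie names are hypothetical, and real platforms use names like <code>PHPSESSID</code> or Magento's <code>frontend</code> cookie:

```python
# Hypothetical list of cookies that de-anonymise a visitor.
SESSION_COOKIES = ("session_id", "logged_in")


def should_serve_from_cache(method: str, cookies: dict) -> bool:
    """Decide whether a request can be answered from cache."""
    # Only safe methods are cacheable at all.
    if method not in ("GET", "HEAD"):
        return False
    # A de-anonymising cookie means the page may be personalised,
    # so the request must go through to the origin.
    if any(name in cookies for name in SESSION_COOKIES):
        return False
    return True
```

Anonymous `GET` traffic - including a crude Layer 7 flood - is absorbed by the cache, while logged-in users still reach the dynamic origin.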
    <div>
      <h4>Rate Limiting</h4>
      <a href="#rate-limiting">
        
      </a>
    </div>
    <p>Caching isn't enough; unsafe HTTP requests like POST, PUT and DELETE cannot be served from cache, so making these requests bypasses the caching efforts used to prevent Application Layer DDoS attacks. Additionally, attackers can vary URLs in an attempt to bypass advanced caching behaviour.</p><p>Software exists for web servers to perform rate limiting before anything hits dynamic logic; examples of such tools include <a href="https://www.fail2ban.org/wiki/index.php/Main_Page">Fail2Ban</a> and <a href="https://httpd.apache.org/docs/trunk/mod/mod_ratelimit.html">Apache mod_ratelimit</a>.</p><p>If you do rate limiting on your server itself, be sure to configure your edge network to cache the rate-limit block pages, such that attackers cannot continue Application Layer attacks once blocked. This can be done by caching responses with a 429 status code against a Custom Cache Key based on the IP address.</p><p>Services like Cloudflare offer rate limiting at their edge network; see, for example, <a href="https://www.cloudflare.com/rate-limiting/">Cloudflare Rate Limiting</a>.</p>
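Rate limiting of this kind is commonly built on a token bucket: each client has a budget of tokens that refills at a steady rate, and a request that finds the bucket empty is rejected with a 429. A minimal sketch of the general algorithm, not how any particular product implements it:

```python
import time


class TokenBucket:
    """Minimal token-bucket rate limiter: allow `rate` requests per
    second per client, with bursts of up to `capacity` requests."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start with a full bucket
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller would serve a 429 here


bucket = TokenBucket(rate=5, capacity=10)  # e.g. one bucket per client IP
```

In practice you would keep one bucket per client key (IP address, session, or URL pattern) and cache the 429 response at the edge, as described above.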
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>As the capacity of networks like Cloudflare continues to grow, attackers are moving from attempting DDoS attacks at the network layer to performing DDoS attacks targeted at applications themselves.</p><p>For applications to be resilient to DDoS attacks, it is no longer enough to use a large network. A large network must be complemented with tooling that is able to filter malicious Application Layer attack traffic, even when attackers are able to make such attacks look near-legitimate.</p> ]]></content:encoded>
            <category><![CDATA[DDoS]]></category>
            <category><![CDATA[Attacks]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[Mitigation]]></category>
            <category><![CDATA[DNS]]></category>
            <guid isPermaLink="false">4o1cEOKQJRXCb4DQJy1fWR</guid>
            <dc:creator>Junade Ali</dc:creator>
        </item>
        <item>
            <title><![CDATA[Performing & Preventing SSL Stripping: A Plain-English Primer]]></title>
            <link>https://blog.cloudflare.com/performing-preventing-ssl-stripping-a-plain-english-primer/</link>
            <pubDate>Fri, 20 Oct 2017 16:23:00 GMT</pubDate>
            <description><![CDATA[ Over the past few days we learned about a new attack that exploited a serious weakness in the encryption protocol used to secure all modern Wi-Fi networks. ]]></description>
            <content:encoded><![CDATA[ <p>Over the past few days we learned about a new attack that exploited a serious weakness in the encryption protocol used to secure all modern Wi-Fi networks. The <a href="https://www.krackattacks.com/">KRACK Attack</a> effectively allows interception of traffic on wireless networks secured by the WPA2 protocol. Whilst it is possible to patch implementations to mitigate this vulnerability, security updates are rarely installed universally.</p><p>Prior to this vulnerability, there was no shortage of wireless networks that were vulnerable to interception attacks. Some wireless networks continue to use a dated security protocol (called WEP) that is demonstrably "totally insecure" <a href="#fn1">[1]</a>; other wireless networks, such as those in <a href="https://www.cloudflare.com/learning/access-management/coffee-shop-networking/">coffee shops</a> and airports, remain completely open and do not authenticate users. Once an attacker gains access to a network, they can act as a proxy to intercept connections over the network (using tactics known as ARP Cache Poisoning and <a href="https://www.cloudflare.com/learning/security/global-dns-hijacking-threat/">DNS Hijacking</a>). And yes, these interception tactics can easily be deployed against wired networks where someone gains access to an Ethernet port.</p><p>With all this known, it is beyond doubt that it is simply not secure to blindly trust the medium that connects your users to the internet. HTTPS was created to allow HTTP traffic to be transmitted in encrypted form; however, the authors of the <a href="https://www.cloudflare.com/learning/security/what-is-a-krack-attack/">KRACK Attack</a> presented a <a href="https://www.youtube.com/watch?v=Oh4WURZoR98">video demonstration</a> of how the encryption could be completely stripped away on a popular dating site (despite the website supporting HTTPS). This blog post presents a plain-English primer on how HTTPS protection can be stripped, and on mechanisms for mitigating this.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3jWqWYOetyD3ddaLc52xeu/c08a0a6b5ce219b60d8b5af4011f40e6/green-lock-4.png" />
            
            </figure>
    <div>
      <h3>HTTP over TLS</h3>
      <a href="#http-over-tls">
        
      </a>
    </div>
    <p>The internet is built on a patchwork of standards, with components being <a href="https://www.cloudflare.com/learning/cloud/how-to-refactor-applications/">refactored</a> and rebuilt in new published standards. When one standard is found to be flawed, it is later patched or replaced by a new standard. As flawed standards are replaced by better ones, the internet as a whole becomes better.</p><p>The HTTP protocol was originally specified to communicate data in the clear over the internet. Prior to the official introduction of HTTP 1.0, the first documented version of HTTP was known as HTTP/0.9 and was published in 1991. Netscape were the first to recognise the need for greater security assurance over the internet, and in mid-1994 HTTPS was implemented in the Netscape browser. In order to implement greater security assurance, a technology called SSL (Secure Sockets Layer) was created.</p><p>SSL 1.0 was short-lived (and not even officially standardised) due to a number of security concerns and shortcomings. This protocol was incrementally updated in SSL 2.0 and SSL 3.0; this was then iteratively superseded by the TLS (Transport Layer Security) standards.</p><table><tr><td><p><b>Protocol</b></p></td><td><p>SSL 1.0</p></td><td><p>SSL 2.0</p></td><td><p>SSL 3.0</p></td><td><p>TLS 1.0</p></td><td><p>TLS 1.1</p></td><td><p>TLS 1.2</p></td><td><p>TLS 1.3</p></td></tr><tr><td><p><b>Release</b></p></td><td><p>N/A</p></td><td><p>1995</p></td><td><p>1996</p></td><td><p>1999</p></td><td><p>2006</p></td><td><p>2008</p></td><td><p>Draft</p></td></tr></table><p>Each of these versions comes with different security constraints, and support varies from browser to browser. Additionally, the encryption ciphers used can be configured somewhat independently of the protocol version. It is therefore vital to ensure any HTTPS-enabled web server is set up to use a configuration that's optimised to balance browser support against security. I won't go into detail on this in this post; however, you can read about <a href="https://github.com/ssllabs/research/wiki/SSL-and-TLS-Deployment-Best-Practices">SSL and TLS Deployment Best Practices</a> in documentation provided by SSL Labs.</p><p>At a high level, the end result of HTTP over TLS is that when a site is requested over <code>https://</code> instead of <code>http://</code>, the connection is completed in an encrypted manner. This process provides a reasonable guarantee of both privacy and integrity; in other words, we don't just encrypt the messages we're sending, we also make sure the messages we receive aren't altered over the wire. When a secure connection is established, web browsers can indicate this to their users by lighting the browser bar green.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3OOOnZ0z171LSKcz1mh1i4/4fd8b068131fccd104049d2577bcbe97/green_lock.png" />
            
            </figure><p>As <a href="https://www.cloudflare.com/application-services/products/ssl/">SSL Certificates</a> themselves are signed by Certificate Authorities, a degree of "domain validation" is carried out - the Certificate Authority makes sure they are only validating a certificate which is owned by someone who has the ability to make changes to the website. This provides a degree of assurance that certificates aren't being issued to attackers who can then seem legitimate when intercepting web traffic. In the event a certificate ends up in the wrong hands, a <a href="https://en.wikipedia.org/wiki/Certificate_revocation_list">Certificate Revocation List</a> can be used to retract the certificate. Such lists are then automatically downloaded by modern Operating Systems to ensure that when a revoked certificate is served, it is marked as insecure in the browser. As there are a considerable number (&gt;100) of Certificate Authorities with the power to issue SSL certificates, it is possible to allowlist which Certificate Authorities should issue certificates for a given domain by <a href="https://support.cloudflare.com/hc/en-us/articles/115000310792-Configuring-CAA-Records-">configuring CAA DNS records</a>.</p>
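    <p>For illustration, CAA records are ordinary DNS records; in zone-file syntax, a hypothetical policy for <code>example.com</code> might look like the following (the CA domain and contact address here are assumptions for the sketch):</p>

```
; Only permit the named CA to issue certificates for this domain, and ask
; other CAs to report violating issuance requests to the given address.
example.com.  IN  CAA 0 issue "letsencrypt.org"
example.com.  IN  CAA 0 iodef "mailto:security@example.com"
```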
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3WvGYZMKozvz5WI3uLW5dC/b193758930ef7d083b24ad95b32a76d1/manually-configuring-ssl--1-.png" />
            
            </figure><p>Nowadays, this process can be simplified; for example, when your traffic is proxied through the Cloudflare network, we will dynamically manage renewing and signing your certificate for you (whilst using Cloudflare's <a href="/cloudflare-ca-encryption-origin/">Origin CA</a> to generate a certificate to encrypt traffic back to the origin web server). Similarly, the EFF offer a tool called Certbot to make it relatively easy to install and <a href="https://certbot.eff.org">generate Let's Encrypt certificates from the command line</a>.</p><p>When using HTTPS, it is important that the entire content of the website is then loaded over HTTPS - not just the login pages. It used to be common practice for websites to initially present the login page over a secure encrypted connection, then when the user was logged in, they would degrade the connection back to HTTP. Once the user is logged into a website, a session cookie is stored in the local browser to allow the website to confirm the user is logged in.</p><p>In 2010, Eric Butler demonstrated how insecure this was by building a simple interception tool called Firesheep. By eavesdropping on wireless connections, Firesheep would capture the login sessions for common websites. Whilst the attacker would not necessarily be able to capture the user's password, they would be able to capture the login session and act on websites as if they were logged in. They would also be able to intercept traffic whilst the user was logged in.</p><p>When connecting to a website that uses SSL, the first request should usually redirect the user to a secure version of the website. For example, when you first visit <code>http://www.cloudflare.com/</code>, a HTTP 301 redirect is used to send you to the HTTPS version of the site, <code>https://www.cloudflare.com/</code>.</p><p>This raises an important question: if someone is able to intercept the unencrypted request to the HTTP version of the site, couldn't they then strip away the encryption and serve the site back to the user without encryption? This was a question explored by Moxie Marlinspike, which later led to the creation of HSTS.</p>
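    <p>For reference, the initial HTTP-to-HTTPS redirect described above is typically a few lines of web server configuration; a minimal NGINX sketch, with the hostname assumed for illustration, might look like this:</p>

```nginx
# Redirect every plain-HTTP request to the HTTPS version of the site,
# preserving the originally requested path and query string.
server {
    listen 80;
    server_name www.example.com;
    return 301 https://www.example.com$request_uri;
}
```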
    <div>
      <h3>HTTP Strict Transport Security (HSTS)</h3>
      <a href="#http-strict-transport-security-hsts">
        
      </a>
    </div>
    <p>In 2009 at Blackhat DC, Moxie Marlinspike presented a tool known as SSLStrip. This tool would intercept HTTP traffic and whenever it spotted redirects or links to sites using HTTPS, it would transparently strip them away.</p><p>Instead of the victim connecting directly to a website, the victim would connect to the attacker, and the attacker would initiate the connection back to the website. This attack is known as an on-path attack.</p><p>The magic of SSLStrip was that whenever it would spot a link to a HTTPS webpage on an unencrypted HTTP connection, it would replace the HTTPS link with an HTTP one and sit in the middle to intercept the connection. The interceptor would make the encrypted connection back to the web server over HTTPS, and serve the traffic back to the site visitor unencrypted (logging any interesting passwords or credit card information in the process).</p><p>In response, a protocol called HTTP Strict Transport Security (HSTS) was created in 2012 and specified in <a href="https://tools.ietf.org/html/rfc6797">RFC 6797</a>. The protocol works by the server responding with a special header called <code>Strict-Transport-Security</code>, which tells the client that whenever it reconnects to the site, it must use HTTPS. This response contains a <code>"max-age"</code> field which defines how long this rule should last, counted from when it was last seen.</p><p>Whilst this provided an improvement in preventing interception attacks, it wasn't perfect and there remain a number of shortcomings.</p>
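    <p>To make the header concrete, here is a minimal sketch of sending an HSTS policy from NGINX; the one-year <code>max-age</code> and the optional directives are illustrative choices, not recommendations.</p>

```nginx
# Send the HSTS policy on responses served over TLS. max-age is in
# seconds (31536000 = one year); includeSubDomains is an optional
# directive that should only be added deliberately, as it binds the
# policy to every subdomain as well.
add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
```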
    <div>
      <h3>HSTS Preloading</h3>
      <a href="#hsts-preloading">
        
      </a>
    </div>
    <p>One of the shortcomings of HSTS is the fact that it requires a previous connection to know to always connect securely to a particular site. When the visitor first connects to the website, they won't have received the HSTS rule that tells them to always use HTTPS. Only on subsequent connections will the visitor's browser be aware of the HSTS rule that requires them to connect over HTTPS.</p><p>Other mechanisms of attacking HSTS have been explored; for example, by hijacking the protocol used to sync a computer's clock (NTP), it can be possible to set a computer's date and time to a point in the future at which the HSTS rule has expired, thereby bypassing HSTS<a href="#fn2">[2]</a>.</p><p>HSTS Preload Lists are one potential solution to help with these issues; they effectively work by hard-coding a list of websites that must be connected to over HTTPS only. Sites with HSTS enabled can be submitted to the Chrome HSTS Preload List at <a href="https://hstspreload.org/">hstspreload.org</a>; which is also used as the basis of the preload lists used in other browsers.</p><p>Inside the source code of Google Chrome, there is a hard-coded file listing the HSTS properties for all domains in the Preload List. Each entry is formatted in JSON and looks something like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6KPdamMOutH5y8JbvjkwW9/bb3c723fcf881f77fe144e028f0b7d8b/cloudflare_api_hsts.png" />
            
            </figure><p>Even with preload, things still aren't perfect. Suppose someone is reading a blog about books and on that blog there is a link to purchase a book from an online retailer. Despite the fact the online retailer enforces HTTPS using HSTS, it is possible to conduct an on-path attack, provided the blog linking to the online retailer does not use HTTPS.</p>
    <div>
      <h3>More Still to be Done</h3>
      <a href="#more-still-to-be-done">
        
      </a>
    </div>
    <p>Leonardo Nve revived SSLStrip in a new version called <a href="https://github.com/LeonardoNve/sslstrip2">SSLStrip+</a>, with the ability to avoid HSTS. When a site is connected to over an unencrypted HTTP connection, SSLStrip+ will look for links to HTTPS sites. When a link is found to a HTTPS site, it is rewritten to HTTP and critically the domain is rewritten to an artificial domain which is not on the HSTS Preload list.</p><p>For example, suppose a site contains a link to <code>https://example.com/</code>; the HSTS protection could be stripped by rewriting the URL to <code>http://example.org/</code>, with the attacker sitting in the middle, receiving traffic from <code>http://example.org/</code> and proxying it to <code>https://example.com/</code>.</p><p>Such an attack can also be performed against redirects; suppose <code>http://example.net/</code> is loaded over HTTP but then redirects to <code>https://example.com/</code> which is loaded over HTTPS. At the point the redirect is carried out, the legitimate HSTS-protected site can be redirected to a phony domain which the attacker uses to serve traffic over HTTP and intercept it.</p><p>As more and more of the internet moves to HTTPS, the surface area of this attack will get smaller as there is less unencrypted HTTP traffic to intercept.</p><p>In the newly-released version of Google Chrome (version 62), websites which serve input forms (such as credit card forms and password fields) on insecure connections are flagged as "Not Secure" to the user, instead of a neutral message. When in Incognito (private browsing) mode, Chrome will flag any website as insecure if it does not use HTTPS.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3MovGWvLCtKhF4kT0JUKhf/16ac51a2b0bc541e46196a61e01039a6/password_not_secure.png" />
            
            </figure><p>This change helps make it clearer to users when HTTPS has been stripped away from a webpage as they try to log in. Additionally, in making this change, it is hoped that more websites will adopt HTTPS - thereby improving the security of the internet as a whole.</p>
    <div>
      <h3>Final Remarks</h3>
      <a href="#final-remarks">
        
      </a>
    </div>
    <p>This blog post has discussed mechanisms of stripping HTTPS away from websites and in particular how HSTS can help mitigate this. It is worth noting that there are other potential attack vectors within various HTTPS specifications and in certain ciphers; this blog post hasn't gone into them.</p><p>Despite HTTPS offering a mechanism for encrypting web traffic, it is important to implement technologies such as HTTP Strict Transport Security to ensure it is enforced, and preferably submit your site to HSTS Preload lists. As more websites do this, the security of the internet overall is improved.</p><p>To learn how you can implement HTTPS and HSTS in practice, I'd highly recommend Troy Hunt's blog post: <a href="https://www.troyhunt.com/the-6-step-happy-path-to-https/">The 6-Step "Happy Path" to HTTPS</a>. His blog post goes into how you can enable strong HTTPS in practice, and additionally touches on a technology that I didn't mention here known as CSP (Content Security Policy). CSP allows you to automatically upgrade or block insecure HTTP requests when loading pages over HTTPS, as mixed content poses another attack vector.</p><ol><li><p>Stubblefield, A., Ioannidis, J. and Rubin, A.D., 2002, February. <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.16.2068&amp;rep=rep1&amp;type=pdf">Using the Fluhrer, Mantin, and Shamir Attack to Break WEP</a>. In NDSS. <a href="#fnref1">↩︎</a></p></li><li><p>Selvi, J., 2014. <a href="https://www.blackhat.com/docs/eu-14/materials/eu-14-Selvi-Bypassing-HTTP-Strict-Transport-Security-wp.pdf">Bypassing HTTP strict transport security</a>. Black Hat Europe. <a href="#fnref2">↩︎</a></p></li></ol><p></p> ]]></content:encoded>
            <category><![CDATA[SSL]]></category>
            <category><![CDATA[HTTPS]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Attacks]]></category>
            <category><![CDATA[TLS]]></category>
            <category><![CDATA[Cryptography]]></category>
            <guid isPermaLink="false">1AdZEg14yX6UIzaTyjKaS6</guid>
            <dc:creator>Junade Ali</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare London Meetup Recap]]></title>
            <link>https://blog.cloudflare.com/cloudflare-london-meetup-recap/</link>
            <pubDate>Thu, 12 Oct 2017 14:17:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare helps make over 6 million websites faster and more secure. In doing so, Cloudflare has a vast and diverse community of users throughout the world.  ]]></description>
            <content:encoded><![CDATA[ <p>Cloudflare helps make over 6 million websites faster and more secure. In doing so, Cloudflare has a vast and diverse community of users throughout the world. Whether discussing Cloudflare on social media, browsing our community forums or following Pull Requests on our open-source projects, there is no shortage of lively discussions amongst Cloudflare users. Occasionally, however, it is important to move these discussions out from cyberspace and take time to connect in person.</p><p>A little while ago, we did exactly this and ran a meetup in the Cloudflare London office. Ivan Ristić from Hardenize was our guest speaker; he demonstrated how Hardenize developed a Cloudflare App to help build a culture of security. I presented two other talks: a primer on how the Cloudflare network is architected, and a wrap-up discussion on how you can build and monetise your very own Cloudflare App.</p><p>Since we ran this meet-up, I've received a few requests to share the videos of all the talks. You can find all three of the talks from our last London office meet-up in this blog post.</p>
    <div>
      <h3>How Cloudflare Works</h3>
      <a href="#how-cloudflare-works">
        
      </a>
    </div>
    
    <div>
      <h3>App Highlight: Hardenize by Ivan Ristić</h3>
      <a href="#app-highlight-hardenize-by-ivan-ristic">
        
      </a>
    </div>
    
    <div>
      <h3>Introduction to Building with Cloudflare Apps</h3>
      <a href="#introduction-to-building-with-cloudflare-apps">
        
      </a>
    </div>
    <hr />
    <div>
      <h3>Learn More...</h3>
      <a href="#learn-more">
        
      </a>
    </div>
    <p>If you're interested in the kinds of Denial-of-Service attacks that Cloudflare faces, and how we help mitigate them, check out our Learning Centre for further information on <a href="https://www.cloudflare.com/learning/ddos/what-is-a-ddos-attack/">DDoS mitigation</a>.</p><p>The Cloudflare Apps platform allows developers to ship their code to any of the 6 million websites on the Cloudflare network. When building on our apps platform, your code can be previewed and deployed on any Cloudflare site in seconds. Browse the apps documentation to learn more about <a href="https://www.cloudflare.com/apps/developers">building on Cloudflare Apps</a>.</p><p>We'd love for you to join us at our next meet-up. To stay up-to-date on local events, you can join our Meetup groups in <a href="https://www.meetup.com/Cloudflare-Meetups/">San Francisco</a>, <a href="https://www.meetup.com/Cloudflare-Austin/">Austin</a> and, of course, <a href="https://www.meetup.com/Cloudflare-London/">London</a>. Want to request a Cloudflare meetup or workshop in your city? Please drop a line to <a href="mailto:community@cloudflare.com">community@cloudflare.com</a>.</p> ]]></content:encoded>
            <category><![CDATA[MeetUp]]></category>
            <category><![CDATA[Cloudflare Meetups]]></category>
            <category><![CDATA[Events]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Cloudflare Apps]]></category>
            <guid isPermaLink="false">1bW6wU132PcLXvT2H6kvhG</guid>
            <dc:creator>Junade Ali</dc:creator>
        </item>
        <item>
            <title><![CDATA[A New API Binding: cloudflare-php]]></title>
            <link>https://blog.cloudflare.com/cloudflare-php-api-binding/</link>
            <pubDate>Sat, 23 Sep 2017 00:01:35 GMT</pubDate>
            <description><![CDATA[ Back in May last year, one of my colleagues blogged about the introduction of our Python binding for the Cloudflare API and drew reference to our other bindings in Go and Node. Today we are complementing this range by introducing a new official binding, this time in PHP. ]]></description>
            <content:encoded><![CDATA[ <p>Back in May last year, one of my colleagues blogged about the introduction of our <a href="/python-cloudflare/">Python binding for the Cloudflare API</a> and drew reference to our other bindings in <a href="https://github.com/cloudflare/cloudflare-go">Go</a> and <a href="https://github.com/cloudflare/node-cloudflare">Node</a>. Today we are complementing this range by introducing a new official binding, this time in <a href="https://github.com/cloudflare/cloudflare-php">PHP</a>.</p><p>This binding is available via Packagist as <a href="https://packagist.org/packages/cloudflare/sdk">cloudflare/sdk</a>; you can install it using Composer simply by running <code>composer require cloudflare/sdk</code>. We have documented various use-cases in our <a href="https://support.cloudflare.com/hc/en-us/articles/115001661191">"Cloudflare PHP API Binding" KB article</a> to help you get started.</p><p>Alternatively, should you wish to help contribute, or just give us a star on GitHub, feel free to browse to the <a href="https://github.com/cloudflare/cloudflare-php">cloudflare-php source code</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1cb8c0aCRK1QxtwReVtMDc/de48b0c9191df68bab840c53f5efea8b/installing-cloudflare-php.png" />
            
            </figure><p>PHP is a controversial language, and there is no doubt there are elements of bad design within the language (as is the case with many other languages). However, love it or hate it, PHP is a language of high adoption; as of September 2017 <a href="https://w3techs.com/technologies/overview/programming_language/all">W3Techs</a> report that PHP is used by 82.8% of all the websites whose server-side programming language is known. In creating this binding, the question clearly wasn't one of the merits of PHP, but of whether we wanted to help drive improvements to the developer experience for the sizeable number of developers integrating with us whilst using PHP.</p><p>In order to help those looking to contribute or build upon this library, I'm writing this blog post to explain some of the design decisions made in putting this together.</p>
    <div>
      <h3>Exclusively for PHP 7</h3>
      <a href="#exclusively-for-php-7">
        
      </a>
    </div>
    <p>PHP 5 initially introduced the ability for type hinting on the basis of classes and interfaces; this opened up (albeit seldom used) parametric polymorphic behaviour in PHP. Type hinting on the basis of interfaces made it easier for those developing in PHP to follow the Gang of Four's famous guidance: "Program to an 'interface', not an 'implementation'."</p><p>Type hinting has slowly developed in PHP; in PHP 7.0, Scalar Type Hinting was released after a few rounds of RFCs. Additionally, PHP 7.0 introduced Return Type Declarations, allowing return values to be type hinted in a similar way to argument type hinting. In this library we extensively use Scalar Type Hinting and Return Type Declarations, thereby giving up the backward compatibility that's available with PHP 5.</p><p>Had we retained backward compatibility, these improvements to type hinting simply would not have been usable and the associated benefits would be lost. With Active Support <a href="http://php.net/supported-versions.php">no longer being offered to PHP 5.6</a> and Security Support little over a year away from disappearing for the entirety of PHP 5.x, we decided the additional coverage wasn't worth the cost.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/RJX9jaiMJuc3J7e8OBCor/2998f5d9da54264b9774aa8e18fe00b6/php-support.png" />
            
            </figure>
    <div>
      <h3>Object Composition</h3>
      <a href="#object-composition">
        
      </a>
    </div>
    <blockquote><p>What do we mean by a software architecture? To me the term architecture conveys a notion of the core elements of the system, the pieces that are difficult to change. A foundation on which the rest must be built. <a href="https://www.martinfowler.com/articles/designDead.html">Martin Fowler</a></p></blockquote><p>When getting started with this package, you'll notice there are 3 classes you'll need to instantiate:</p>
            <pre><code>$key     = new \Cloudflare\API\Auth\APIKey('user@example.com', 'apiKey');
$adapter = new \Cloudflare\API\Adapter\Guzzle($key);
$user    = new \Cloudflare\API\Endpoints\User($adapter);
    
echo $user-&gt;getUserID();</code></pre>
            <p>The first class being instantiated is called <code>APIKey</code> (a few other classes for authentication are available). We then proceed to instantiate the <code>Guzzle</code> class and the <code>APIKey</code> object is then injected into the constructor of the <code>Guzzle</code> class. The <code>Auth</code> interface that the <code>APIKey</code> class implements is fairly simple:</p>
            <pre><code>namespace Cloudflare\API\Auth;

interface Auth
{
    public function getHeaders(): array;
}</code></pre>
            <p>The <code>Adapter</code> interface (which the <code>Guzzle</code> class implements) makes explicit that an object built on the <code>Auth</code> interface is expected to be injected into the constructor:</p>
            <pre><code>namespace Cloudflare\API\Adapter;

use Cloudflare\API\Auth\Auth;
use Psr\Http\Message\ResponseInterface;

interface Adapter
{
...
    public function __construct(Auth $auth, string $baseURI);
...
}</code></pre>
            <p>In doing so, we define that classes which implement the <code>Adapter</code> interface are to be composed using objects made from classes which implement the <code>Auth</code> interface.</p><p>So why am I explaining basic Dependency Injection here? It is critical to understand that, as the design of our API changes, the mechanisms for authentication may vary independently of the HTTP Client or indeed the API Endpoints themselves. Similarly, the HTTP Client or the API Endpoints may vary independently of the other elements involved. Indeed, this package already contains three classes for the purpose of authentication (<code>APIKey</code>, <code>UserServiceKey</code> and <code>None</code>) which need to be used interchangeably. This package therefore considers the possibility for changes to different components in the API and seeks to allow these components to vary independently.</p><p>Dependency Injection is also used where the parameters for an API Endpoint become more complicated than what is permitted by simpler variable types; for example, this is done for defining the Target or Configuration when configuring a Page Rule:</p>
            <pre><code>require_once('vendor/autoload.php');

$key = new \Cloudflare\API\Auth\APIKey('mjsa@junade.com', 'apiKey');
$adapter = new \Cloudflare\API\Adapter\Guzzle($key);
$zones = new \Cloudflare\API\Endpoints\Zones($adapter);

$zoneID = $zones-&gt;getZoneID("junade.com");

$pageRulesTarget = new \Cloudflare\API\Configurations\PageRulesTargets('https://junade.com/noCache/*');

$pageRulesConfig = new \Cloudflare\API\Configurations\PageRulesActions();
$pageRulesConfig-&gt;setCacheLevel('bypass');

$pageRules = new \Cloudflare\API\Endpoints\PageRules($adapter);
$pageRules-&gt;createPageRule($zoneID, $pageRulesTarget, $pageRulesConfig, true, 6);</code></pre>
            <p>The structure of this project is overall based on simple object composition; this provides a far simpler object model for the long term and a design with greater flexibility. For example, should we later want to create an Endpoint class which is a <a href="https://en.wikipedia.org/wiki/Composite_pattern">composite</a> of other Endpoints, it becomes fairly trivial for us to build this by implementing the same interface as the other Endpoint classes. As more code is added, we are able to keep the design of the software relatively thinly layered.</p>
    <div>
      <h3>Testing/Mocking HTTP Requests</h3>
      <a href="#testing-mocking-http-requests">
        
      </a>
    </div>
    <p>If you're interested in helping contribute to this repository, there are two key ways you can help:</p><ol><li><p>Building out coverage of endpoints on our API</p></li><li><p>Building out test coverage of those endpoint classes</p></li></ol><p>The PHP-FIG (PHP Framework Interop Group) put together a standard on how HTTP responses can be represented in an interface; this is described in the <a href="http://www.php-fig.org/psr/psr-7/">PSR-7 standard</a>. This response interface is utilised by our HTTP <code>Adapter</code> interface, in which responses to API requests are type hinted to this interface (<code>Psr\Http\Message\ResponseInterface</code>).</p><p>By using this standard, it's easier to add further abstractions for additional HTTP clients and mock HTTP responses for unit testing. Let's assume the JSON response is stored in the <code>$response</code> variable and we want to test the <code>listIPs</code> method in the <code>IPs</code> Endpoint class:</p>
            <pre><code>public function testListIPs() {
  // $response initially holds the JSON fixture; wrap it in a PSR-7 body.
  $stream = GuzzleHttp\Psr7\stream_for($response);
  $response = new GuzzleHttp\Psr7\Response(200, ['Content-Type' =&gt; 'application/json'], $stream);

  // Mock the Adapter so no real HTTP request is made.
  $mock = $this-&gt;getMockBuilder(\Cloudflare\API\Adapter\Adapter::class)-&gt;getMock();
  $mock-&gt;method('get')-&gt;willReturn($response);

  // Expect exactly one GET to the 'ips' route with no query parameters.
  $mock-&gt;expects($this-&gt;once())
    -&gt;method('get')
    -&gt;with($this-&gt;equalTo('ips'), $this-&gt;equalTo([]));

  $ips = new \Cloudflare\API\Endpoints\IPs($mock);
  $ips = $ips-&gt;listIPs();
  $this-&gt;assertObjectHasAttribute("ipv4_cidrs", $ips);
  $this-&gt;assertObjectHasAttribute("ipv6_cidrs", $ips);
}</code></pre>
            <p>We are able to build a simple mock of our <code>Adapter</code> interface by using the standardised PSR-7 response format; when we do so, we can define which parameters PHPUnit expects to be passed to the mock. With a mock <code>Adapter</code> class in place, we are able to test the <code>IPs</code> Endpoint class as if it were using a real HTTP client.</p>
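            <p>The contract being mocked can be pictured as a small interface whose methods return PSR-7 responses. This is a simplified sketch for illustration only: the test above exercises just <code>get</code>, and the exact method signatures in the library may differ:</p><pre><code>use Psr\Http\Message\ResponseInterface;

interface Adapter
{
    // Perform a GET against an API route, returning a PSR-7 response.
    public function get(string $uri, array $query = []): ResponseInterface;
}</code></pre><p>Because every Endpoint class depends only on this interface, swapping a Guzzle-backed implementation for a PHPUnit mock requires no changes to the Endpoint classes themselves.</p>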
    <div>
      <h3>Conclusions</h3>
      <a href="#conclusions">
        
      </a>
    </div>
    <p>By building on modern versions of PHP, applying good Object-Oriented Programming theory and allowing for effective testing, we hope our PHP API binding provides a developer experience that is pleasant to build upon.</p><p>If you're interested in helping improve the design of this codebase, I'd encourage you to take a look at the <a href="https://github.com/cloudflare/cloudflare-php">PHP API binding source code</a> on GitHub (and optionally give us a star).</p><p>If you work with Go or PHP and you're interested in helping Cloudflare turn our high-traffic customer-facing API into an ever more modern service-oriented environment, we're hiring for Web Engineers in <a href="https://boards.greenhouse.io/cloudflare/jobs/745994#.WcWKLtOGPys">San Francisco</a>, <a href="https://boards.greenhouse.io/cloudflare/jobs/682927#.WcWKPdOGPys">Austin</a> and <a href="https://boards.greenhouse.io/cloudflare/jobs/574636#.WcWKQdOGPys">London</a>.</p> ]]></content:encoded>
            <category><![CDATA[php]]></category>
            <category><![CDATA[API]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Python]]></category>
            <guid isPermaLink="false">1NRqPHr4Wm8QNvqlHrAE4c</guid>
            <dc:creator>Junade Ali</dc:creator>
        </item>
    </channel>
</rss>