
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Sat, 04 Apr 2026 12:27:23 GMT</lastBuildDate>
        <item>
            <title><![CDATA[HTTP Analytics for 6M requests per second using ClickHouse]]></title>
            <link>https://blog.cloudflare.com/http-analytics-for-6m-requests-per-second-using-clickhouse/</link>
            <pubDate>Tue, 06 Mar 2018 13:00:00 GMT</pubDate>
            <description><![CDATA[ One of our large-scale data infrastructure challenges here at Cloudflare is providing HTTP traffic analytics to our customers. HTTP Analytics is available to all our customers via two options: ]]></description>
            <content:encoded><![CDATA[ <p>One of our large-scale data infrastructure challenges here at Cloudflare is providing HTTP traffic analytics to our customers. HTTP Analytics is available to all our customers via two options:</p><ul><li><p>The Analytics tab in the Cloudflare dashboard</p></li><li><p>The Zone Analytics API with 2 endpoints</p><ul><li><p><a href="https://api.cloudflare.com/#zone-analytics-dashboard">Dashboard endpoint</a></p></li><li><p><a href="https://api.cloudflare.com/#zone-analytics-analytics-by-co-locations">Co-locations endpoint</a> (Enterprise plan only)</p></li></ul></li></ul><p>In this blog post I'm going to talk about the exciting evolution of the Cloudflare analytics pipeline over the last year. I'll start with a description of the old pipeline and the challenges that we experienced with it. Then, I'll describe how we leveraged ClickHouse to form the basis of a new and improved pipeline. In the process, I'll share details about how we went about schema design and performance tuning for ClickHouse. Finally, I'll look ahead to what the Data team is planning to provide in the future.</p><p>Let's start with the old data pipeline.</p>
    <div>
      <h3>Old data pipeline</h3>
      <a href="#old-data-pipeline">
        
      </a>
    </div>
    <p>The previous pipeline was built in 2014. It has been mentioned previously in <a href="https://blog.cloudflare.com/scaling-out-postgresql-for-cloudflare-analytics-using-citusdb/">Scaling out PostgreSQL for CloudFlare Analytics using CitusDB</a> and <a href="https://blog.cloudflare.com/more-data-more-data/">More data, more data</a> blog posts from the Data team.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5MY5UJgdkM35pwDy4mMOxt/2c08b73e37001788547db620f00a5a92/Old-system-architecture.jpg" />
          </figure><p>It had the following components:</p><ul><li><p><b>Log forwarder</b> - collected Cap'n Proto formatted logs from the edge, notably DNS and Nginx logs, and shipped them to Kafka in Cloudflare's central datacenter.</p></li><li><p><b>Kafka cluster</b> - consisted of 106 brokers with a x3 replication factor and 106 partitions, and ingested Cap'n Proto formatted logs at an average rate of 6M logs per second.</p></li><li><p><b>Kafka consumers</b> - each of the 106 partitions had a dedicated Go consumer (a.k.a. Zoneagg consumer), which read logs, produced aggregates per partition per zone per minute, and then wrote them into Postgres.</p></li><li><p><b>Postgres database</b> - a single-instance PostgreSQL database (a.k.a. RollupDB), which accepted aggregates from the Zoneagg consumers and wrote them into temporary tables per partition per minute. It then rolled up the aggregates into further aggregates with an aggregation cron. More specifically:</p><ul><li><p>Aggregates per partition, minute, zone → aggregates per minute, zone</p></li><li><p>Aggregates per minute, zone → aggregates per hour, zone</p></li><li><p>Aggregates per hour, zone → aggregates per day, zone</p></li><li><p>Aggregates per day, zone → aggregates per month, zone</p></li></ul></li><li><p><b>Citus cluster</b> - consisted of a Citus main node and 11 Citus workers with a x2 replication factor (a.k.a. Zoneagg Citus cluster); it was the storage behind the Zone Analytics API and our internal BI tools. It had a replication cron, which remotely copied tables from the Postgres instance into Citus worker shards.</p></li><li><p><b>Zone Analytics API</b> - served queries from the internal PHP API. It consisted of 5 API instances written in Go that queried the Citus cluster, and was not visible to external users.</p></li><li><p><b>PHP API</b> - 3 instances of a proxying API, which forwarded public API queries to the internal Zone Analytics API and held some business logic on zone plans, error messages, etc.</p></li><li><p><b>Load balancer</b> - an nginx proxy, which forwarded queries to the PHP API/Zone Analytics API.</p></li></ul><p>Cloudflare has grown tremendously since this pipeline was originally designed in 2014. It started off processing under 1M requests per second and grew to the current level of 6M requests per second. The pipeline had served us and our customers well over the years, but began to split at the seams.
Any system needs to be re-engineered after some time, once requirements change.</p><p>Some specific disadvantages of the original pipeline were:</p><ul><li><p><b>Postgres SPOF</b> - the single PostgreSQL instance was a SPOF (single point of failure), as it had no replicas or backups; if we lost this node, the whole analytics pipeline could be paralyzed and produce no new aggregates for the Zone Analytics API.</p></li><li><p><b>Citus main SPOF</b> - the Citus main node was the entrypoint for all Zone Analytics API queries, and if it went down, all our customers' Analytics API queries would return errors.</p></li><li><p><b>Complex codebase</b> - thousands of lines of bash and SQL for aggregations, and thousands of lines of Go for the API and Kafka consumers, made the pipeline difficult to maintain and debug.</p></li><li><p><b>Many dependencies</b> - the pipeline consisted of many components, and a failure in any individual component could halt the entire pipeline.</p></li><li><p><b>High maintenance cost</b> - due to its complex architecture and codebase, there were frequent incidents, which sometimes took engineers from the Data team and other teams many hours to mitigate.</p></li></ul><p>Over time, as our request volume grew, the challenges of operating this pipeline became more apparent, and we realized that the system was being pushed to its limits. This realization inspired us to think about which components would be ideal candidates for replacement, and led us to build a new data pipeline.</p><p>Our first design for an improved analytics pipeline centred around the use of the <a href="https://flink.apache.org/">Apache Flink</a> stream processing system. We had previously used Flink for other data pipelines, so it was a natural choice for us. 
However, those pipelines had operated at much lower rates than the 6M requests per second we needed to process for HTTP Analytics, and we struggled to get Flink to scale to this volume - it simply couldn't keep up with the per-partition ingestion rate across all 6M HTTP requests per second.</p><p>Our colleagues on the DNS team had already built and productionized a DNS analytics pipeline atop ClickHouse. They wrote about it in the <a href="https://blog.cloudflare.com/how-cloudflare-analyzes-1m-dns-queries-per-second/">"How Cloudflare analyzes 1M DNS queries per second"</a> blog post. So, we decided to take a deeper look at ClickHouse.</p>
    <div>
      <h3>ClickHouse</h3>
      <a href="#clickhouse">
        
      </a>
    </div>
    <blockquote><p>"ClickHouse не тормозит."
Translation from Russian: ClickHouse doesn't have brakes (or isn't slow)
© ClickHouse core developers</p></blockquote><p>When exploring additional candidates for replacing some of the key infrastructure of our old pipeline, we realized that a column-oriented database might be well suited to our analytics workloads. We wanted to identify a column-oriented database that was horizontally scalable and fault-tolerant, to help us deliver good uptime guarantees, and that was extremely performant and space-efficient, such that it could handle our scale. We quickly realized that ClickHouse could satisfy these criteria, and then some.</p><p><a href="https://clickhouse.yandex/">ClickHouse</a> is an open source column-oriented database management system capable of real-time generation of analytical data reports using SQL queries. It is blazing fast, linearly scalable, hardware efficient, fault tolerant, feature rich, highly reliable, simple and handy. The ClickHouse core developers provide great help in solving issues and in merging and maintaining our PRs into ClickHouse. For example, engineers from Cloudflare have contributed a whole bunch of code back upstream:</p><ul><li><p>Aggregate function <a href="https://clickhouse.com/docs/en/sql-reference/aggregate-functions/reference/topk">topK</a> by <a href="https://github.com/vavrusa">Marek Vavruša</a></p></li><li><p>IP prefix dictionary by Marek Vavruša</p></li><li><p>SummingMergeTree engine optimizations by Marek Vavruša</p></li><li><p><a href="https://clickhouse.com/docs/en/engines/table-engines/integrations/kafka">Kafka table engine</a> by Marek Vavruša. We're considering replacing the Kafka Go consumers with this engine once it is stable enough, to ingest from Kafka into ClickHouse directly.</p></li><li><p>Aggregate function <a href="https://clickhouse.yandex/docs/en/single/index.html#summapkey-value">sumMap</a> by <a href="https://github.com/bocharov">Alex Bocharov</a>. Without this function it would be impossible to build our new Zone Analytics API.</p></li><li><p><a href="https://github.com/yandex/ClickHouse/pull/1636">Mark cache fix</a> by Alex Bocharov</p></li><li><p><a href="https://github.com/yandex/ClickHouse/pull/1844">uniqHLL12 function fix</a> for big cardinalities by Alex Bocharov. The description of the issue and its fix should make for interesting reading.</p></li></ul>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6mN4cq6YrpiHjBfbP6vpCh/7181a8e68b4a63cd48e42e0eaf807191/ClickHouse-uniq-functions.png" />
          </figure><p>Along with filing many bug reports, we also report every issue we face in our cluster, which we hope will help to improve ClickHouse in the future.</p><p>Even though DNS analytics on ClickHouse had been a great success, we were still skeptical that we would be able to scale ClickHouse to the needs of the HTTP pipeline:</p><ul><li><p>The Kafka DNS topic averages 1.5M messages per second vs 6M messages per second for the HTTP requests topic.</p></li><li><p>The Kafka DNS topic's average uncompressed message size is 130B vs 1630B for the HTTP requests topic.</p></li><li><p>A DNS query ClickHouse record consists of 40 columns vs 104 columns for an HTTP request ClickHouse record.</p></li></ul><p>After the unsuccessful attempts with Flink, we were skeptical of ClickHouse being able to keep up with the high ingestion rate. Luckily, an early prototype showed promising performance, and we decided to proceed with replacing the old pipeline. The first step in replacing the old pipeline was to design a schema for the new ClickHouse tables.</p>
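<p>Multiplying those rates and sizes out makes the scale gap concrete. A back-of-envelope sketch in Go, using only the average figures quoted above (uncompressed throughput, so an upper bound):</p>

```go
package main

import "fmt"

// gbps converts an average message rate and average message size
// into gigabits per second of uncompressed throughput.
func gbps(msgsPerSec, msgBytes float64) float64 {
	return msgsPerSec * msgBytes * 8 / 1e9
}

func main() {
	fmt.Printf("DNS topic:  ~%.1f Gbps uncompressed\n", gbps(1_500_000, 130))
	fmt.Printf("HTTP topic: ~%.1f Gbps uncompressed\n", gbps(6_000_000, 1630))
}
```

<p>That's roughly 1.6 Gbps for DNS versus 78 Gbps for HTTP - about a 50x difference in raw volume.</p>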
    <div>
      <h3>ClickHouse schema design</h3>
      <a href="#clickhouse-schema-design">
        
      </a>
    </div>
    <p>Once we identified ClickHouse as a potential candidate, we began exploring how we could port our existing Postgres/Citus schemas to make them compatible with ClickHouse.</p><p>For our <a href="https://api.cloudflare.com/#zone-analytics-dashboard">Zone Analytics API</a> we need to produce many different aggregations for each zone (domain) and time period (minutely / hourly / daily / monthly). For a deeper dive into the specifics of the aggregates, please follow the Zone Analytics API documentation or this handy <a href="https://docs.google.com/spreadsheets/d/1zQ3yI3HB2p8hiM-Jwvq1-MaeEyIouix2I-iUAPZtJYw/edit#gid=1788221216">spreadsheet</a>.</p><p>These aggregations should be available for any time range within the last 365 days. While ClickHouse is a really great tool for working with non-aggregated data, at our volume of 6M requests per second we simply cannot yet afford to store non-aggregated data for that long.</p><p>To give you an idea of how much data that is, here is some "napkin-math" capacity planning. I'm going to use an average insertion rate of 6M requests per second and $100 as a cost estimate for 1 TiB to calculate the storage cost for 1 year in different message formats:</p><table><tr><th><p><b>Metric</b></p></th><th><p><b>Cap'n Proto</b></p></th><th><p><b>Cap'n Proto (zstd)</b></p></th><th><p><b>ClickHouse</b></p></th></tr><tr><td><p>Avg message/record size</p></td><td><p>1630 B</p></td><td><p>360 B</p></td><td><p>36.74 B</p></td></tr><tr><td><p>Storage requirements for 1 year</p></td><td><p>273.93 PiB</p></td><td><p>60.5 PiB</p></td><td><p>18.52 PiB (RF x3)</p></td></tr><tr><td><p>Storage cost for 1 year</p></td><td><p>$28M</p></td><td><p>$6.2M</p></td><td><p>$1.9M</p></td></tr></table><p>And that assumes the request rate stays the same; in fact it is growing fast all the time.</p><p>Even though the storage requirements are quite scary, we're still considering storing raw (non-aggregated) request logs in ClickHouse for 1 month+. See the "Future of Data APIs" section below.</p>
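<p>The napkin math above is reproducible with a few lines of arithmetic. A quick sketch in Go, using the insertion rate, record sizes, and $100/TiB estimate from the table (the x3 replication factor applies only to the ClickHouse variant):</p>

```go
package main

import "fmt"

const (
	ratePerSec  = 6_000_000        // average HTTP requests per second
	secPerYear  = 365 * 24 * 3600  // seconds in a (non-leap) year
	bytesPerPiB = float64(1 << 50) // binary petabyte
	usdPerTiB   = 100.0            // storage cost estimate
	tiBPerPiB   = 1024.0
)

// yearlyPiB returns the storage needed for one year of logs at a given
// per-record size (bytes) and replication factor.
func yearlyPiB(recordBytes float64, replication int) float64 {
	records := float64(ratePerSec) * float64(secPerYear)
	return records * recordBytes * float64(replication) / bytesPerPiB
}

// yearlyCostUSD prices that storage at $100 per TiB.
func yearlyCostUSD(pib float64) float64 { return pib * tiBPerPiB * usdPerTiB }

func main() {
	for _, f := range []struct {
		name  string
		bytes float64
		rf    int
	}{
		{"Cap'n Proto", 1630, 1},
		{"Cap'n Proto (zstd)", 360, 1},
		{"ClickHouse (RF x3)", 36.74, 3},
	} {
		pib := yearlyPiB(f.bytes, f.rf)
		fmt.Printf("%-20s %7.1f PiB  $%.1fM\n", f.name, pib, yearlyCostUSD(pib)/1e6)
	}
}
```

<p>This prints roughly 273.9, 60.5 and 18.5 PiB, and about $28M, $6.2M and $1.9M respectively - matching the table.</p>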
    <div>
      <h4>Non-aggregated requests table</h4>
      <a href="#non-aggregated-requests-table">
        
      </a>
    </div>
    <p>We store <a href="https://docs.google.com/spreadsheets/d/1zQ3yI3HB2p8hiM-Jwvq1-MaeEyIouix2I-iUAPZtJYw/edit?usp=sharing">100+ columns</a>, collecting lots of different kinds of metrics about each request that passes through Cloudflare. Some of these columns are also available in our <a href="https://support.cloudflare.com/hc/en-us/articles/216672448-Enterprise-Log-Share-Logpull-REST-API">Enterprise Log Share</a> product; however, the ClickHouse non-aggregated requests table has more fields.</p><p>With so many columns to store and huge storage requirements, we decided to proceed with the aggregated-data approach, which had worked well for us in the old pipeline and which provides backward compatibility.</p>
    <div>
      <h4>Aggregations schema design #1</h4>
      <a href="#aggregations-schema-design-1">
        
      </a>
    </div>
    <p>According to the <a href="https://api.cloudflare.com/#zone-analytics-dashboard">API documentation</a>, we need to provide many different request breakdowns, and to satisfy these requirements we decided to test the following approach:</p><ol><li><p>Create ClickHouse <a href="https://clickhouse.com/docs/en/sql-reference/statements/create/view">materialized views</a> with the <a href="https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/aggregatingmergetree">ReplicatedAggregatingMergeTree</a> engine, pointing to the non-aggregated requests table and containing minutely aggregated data for each of the breakdowns:</p><ul><li><p><b>Requests totals</b> - containing numbers like total requests, bytes, threats, uniques, etc.</p></li><li><p><b>Requests by colo</b> - containing requests, bytes, etc. broken down by edgeColoId - each of 120+ Cloudflare datacenters</p></li><li><p><b>Requests by http status</b> - containing a breakdown by HTTP status code, e.g. 200, 404, 500, etc.</p></li><li><p><b>Requests by content type</b> - containing a breakdown by response content type, e.g. HTML, JS, CSS, etc.</p></li><li><p><b>Requests by country</b> - containing a breakdown by client country (based on IP)</p></li><li><p><b>Requests by threat type</b> - containing a breakdown by threat type</p></li><li><p><b>Requests by browser</b> - containing a breakdown by browser family, extracted from the user agent</p></li><li><p><b>Requests by ip class</b> - containing a breakdown by client IP class</p></li></ul></li><li><p>Write code that gathers data from all 8 materialized views, using two approaches:</p><ul><li><p>Querying all 8 materialized views at once using a JOIN</p></li><li><p>Querying each of the 8 materialized views separately, in parallel</p></li></ul></li><li><p>Run a performance-testing benchmark against common Zone Analytics API queries</p></li></ol>
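<p>The second gathering approach maps naturally onto Go. A minimal sketch of fanning out one query per materialized view and merging the partial results (the view names, the stub query function, and the flat result shape are illustrative, not our actual API code):</p>

```go
package main

import (
	"fmt"
	"sync"
)

// queryView stands in for one ClickHouse query against a single
// materialized view, returning a flat set of named counters.
func queryView(view string) map[string]uint64 {
	return map[string]uint64{view + ".requests": 42}
}

// gatherParallel fans out one goroutine per materialized view and
// merges the partial results under a mutex.
func gatherParallel(views []string) map[string]uint64 {
	var (
		mu     sync.Mutex
		wg     sync.WaitGroup
		merged = make(map[string]uint64)
	)
	for _, v := range views {
		wg.Add(1)
		go func(v string) {
			defer wg.Done()
			res := queryView(v)
			mu.Lock()
			defer mu.Unlock()
			for k, n := range res {
				merged[k] = n
			}
		}(v)
	}
	wg.Wait()
	return merged
}

func main() {
	views := []string{"totals", "colo", "http_status", "content_type",
		"country", "threat_type", "browser", "ip_class"}
	fmt.Println(len(gatherParallel(views))) // one merged result per view
}
```

<p>This sidesteps the pairwise-JOIN limitation entirely, at the cost of issuing 8 queries per API request.</p>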
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3iQvOQSik5xLDR7lkG0X9H/ee505f826947f6844c3e001bb83c7e7b/Schema-design--1-1.jpg" />
          </figure><p>Schema design #1 didn't work out well. ClickHouse's JOIN syntax forces one to write a monstrous query of over 300 lines of SQL, repeating the selected columns many times, because you can only do <a href="https://github.com/yandex/ClickHouse/issues/873">pairwise joins</a> in ClickHouse.</p><p>As for querying each of the materialized views separately in parallel, the benchmark showed promising but moderate results - query throughput would be only a little better than with our old Citus-based pipeline.</p>
    <div>
      <h4>Aggregations schema design #2</h4>
      <a href="#aggregations-schema-design-2">
        
      </a>
    </div>
    <p>In our second iteration of the schema design, we strove to keep a similar structure to our existing Citus tables. To do this, we experimented with the SummingMergeTree engine, which is described in detail by the excellent ClickHouse <a href="https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/summingmergetree">documentation</a>:</p><blockquote><p>In addition, a table can have nested data structures that are processed in a special way. If the name of a nested table ends in 'Map' and it contains at least two columns that meet the following criteria... then this nested table is interpreted as a mapping of key =&gt; (values...), and when merging its rows, the elements of two data sets are merged by 'key' with a summation of the corresponding (values...).</p></blockquote><p>We were pleased to find this feature, because the SummingMergeTree engine allowed us to significantly reduce the number of tables required as compared to our initial approach. At the same time, it allowed us to match the structure of our existing Citus tables. 
The reason was that the ClickHouse Nested structure ending in 'Map' was similar to the <a href="https://www.postgresql.org/docs/9.6/static/hstore.html">Postgres hstore</a> data type, which we had used extensively in the old pipeline.</p><p>However, there were two existing issues with ClickHouse maps:</p><ul><li><p>SummingMergeTree performs aggregation for all records with the same primary key, but the final aggregation across all shards needs to be done using some aggregate function, which didn't exist in ClickHouse.</p></li><li><p>For storing uniques (unique visitors based on IP), we needed to use the AggregateFunction data type, and although SummingMergeTree allows you to create a column of such a data type, it will not perform aggregation on it for records with the same primary key.</p></li></ul><p>To resolve problem #1, we had to create a new aggregate function, <a href="https://clickhouse.yandex/docs/en/single/index.html#summapkey-value">sumMap</a>. Luckily, the ClickHouse source code is of excellent quality, and its core developers are very helpful in reviewing and merging requested changes.</p><p>As for problem #2, we had to put uniques into a separate materialized view, which uses the ReplicatedAggregatingMergeTree engine and supports merging AggregateFunction states for records with the same primary key. We're considering adding the same functionality to SummingMergeTree, which would simplify our schema even more.</p><p>We also created a separate materialized view for the Colo endpoint, because it has much lower usage (5% for Colo endpoint queries, 95% for Zone dashboard queries), so its more dispersed primary key does not affect the performance of Zone dashboard queries.</p>
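<p>To make the sumMap merge rule concrete, here is its semantics expressed in Go: two key => (values...) maps are merged by key, summing the corresponding values. (The real function is implemented inside ClickHouse and operates on Nested 'Map' columns; this is just the rule it applies, shown on a single value column.)</p>

```go
package main

import "fmt"

// sumMaps merges two key => value maps by key, summing values -
// the behaviour sumMap provides when combining per-shard rows.
func sumMaps(a, b map[uint16]uint64) map[uint16]uint64 {
	out := make(map[uint16]uint64, len(a)+len(b))
	for k, v := range a {
		out[k] += v
	}
	for k, v := range b {
		out[k] += v
	}
	return out
}

func main() {
	// e.g. per-shard breakdowns of requests by HTTP status code
	shard1 := map[uint16]uint64{200: 90, 404: 5}
	shard2 := map[uint16]uint64{200: 10, 500: 1}
	fmt.Println(sumMaps(shard1, shard2)) // map[200:100 404:5 500:1]
}
```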
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ZmeMrcqIgAI99UJgfx1HK/80923c6f467d95a1bd8d74e838f91d99/Schema-design--2.jpg" />
          </figure><p>Once schema design was acceptable, we proceeded to performance testing.</p>
    <div>
      <h3>ClickHouse performance tuning</h3>
      <a href="#clickhouse-performance-tuning">
        
      </a>
    </div>
    <p>We explored a number of avenues for performance improvement in ClickHouse. These included tuning index granularity and improving the merge performance of the SummingMergeTree engine.</p><p>By default, ClickHouse recommends an index granularity of 8192. There is a <a href="https://medium.com/@f1yegor/clickhouse-primary-keys-2cf2a45d7324">nice article</a> explaining ClickHouse primary keys and index granularity in depth.</p><p>While the default index granularity might be an excellent choice for most use cases, in our case we chose the following index granularities:</p><ul><li><p>For the main non-aggregated requests table we chose an index granularity of 16384. For this table, the number of rows read in a query is typically on the order of millions to billions, so a large index granularity does not make a huge difference to query performance.</p></li><li><p>For the aggregated requests_* tables, we chose an index granularity of 32. A low index granularity makes sense when we only need to scan and return a few rows. It made a huge difference in API performance - query latency decreased by 50% and throughput increased by ~3 times when we changed the index granularity 8192 → 32.</p></li></ul><p>Not strictly performance-related, but we also disabled the min_execution_speed setting, so queries scanning just a few rows won't return an exception because of a "slow" scanning rate in rows per second.</p><p>On the aggregation/merge side, we've made some ClickHouse optimizations as well, like <a href="https://github.com/yandex/ClickHouse/pull/1330">increasing SummingMergeTree maps merge speed</a> by x7, which we contributed back into ClickHouse for everyone's benefit.</p><p>Once we had completed the performance tuning for ClickHouse, we could bring it all together into a new data pipeline. Next, we describe the architecture of our new, ClickHouse-based data pipeline.</p>
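<p>A rough mental model of why low granularity pays off for the aggregated tables: the sparse primary index only locates data to granule boundaries, so a range read is charged whole granules even when it needs a handful of rows. An illustrative sketch (a deliberate simplification of ClickHouse's actual read accounting):</p>

```go
package main

import "fmt"

// rowsTouched is a toy model of MergeTree reads: the sparse primary
// index resolves a range only to granule boundaries, so the rows read
// are rounded up to whole granules.
func rowsTouched(matchingRows, granularity int) int {
	granules := matchingRows/granularity + 2 // +2: partial granule at each end
	return granules * granularity
}

func main() {
	// A zone dashboard query over one day of minutely aggregates
	// matches ~1440 consecutive rows.
	fmt.Println("granularity 8192:", rowsTouched(1440, 8192), "rows touched")
	fmt.Println("granularity 32:  ", rowsTouched(1440, 32), "rows touched")
}
```

<p>Under this model, a query needing ~1440 rows touches 16384 rows at granularity 8192 but only 1504 at granularity 32 - the intuition behind the latency and throughput gains above.</p>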
    <div>
      <h3>New data pipeline</h3>
      <a href="#new-data-pipeline">
        
      </a>
    </div>
    <p>The new pipeline architecture re-uses some of the components from the old pipeline, but it replaces its weakest components.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2UaoX6fPvFuhN1ecJpbe6v/8aee3bd1ab395cb4a4ac2b8db9575a23/New-system-architecture.jpg" />
          </figure><p>New components include:</p><ul><li><p><b>Kafka consumers</b> - 106 Go consumers, one per partition, consume Cap'n Proto raw logs and extract/prepare the needed 100+ ClickHouse fields. The consumers no longer do any aggregation logic.</p></li><li><p><b>ClickHouse cluster</b> - 36 nodes with a x3 replication factor. It handles the ingestion of non-aggregated request logs and then produces aggregates using materialized views.</p></li><li><p><b>Zone Analytics API</b> - a rewritten and optimized version of the API in Go, with many meaningful metrics, healthchecks, and failover scenarios.</p></li></ul><p>As you can see, the architecture of the new pipeline is much simpler and more fault-tolerant. It provides analytics for all our 7M+ customers' domains, totalling more than 2.5 billion monthly unique visitors and over 1.5 trillion monthly page views.</p><p>On average we process 6M HTTP requests per second, with peaks of up to 8M requests per second.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/zjubA3CgetykRE8Angxln/7509dd5dea82ad7408339a5f78561b41/HTTP-Logfwdr-throughput.png" />
          </figure><p>The average log message size in <a href="https://capnproto.org/">Cap’n Proto</a> format used to be ~1630B, but thanks to the amazing job on Kafka compression by our Platform Operations team, it has decreased significantly. Please see the <a href="https://blog.cloudflare.com/squeezing-the-firehose/">"Squeezing the firehose: getting the most from Kafka compression"</a> blog post for a deeper dive into those optimisations.</p>
    <div>
      <h4>Benefits of new pipeline</h4>
      <a href="#benefits-of-new-pipeline">
        
      </a>
    </div>
    <ul><li><p><b>No SPOF</b> - removed all SPOFs and bottlenecks; everything has at least a x3 replication factor.</p></li><li><p><b>Fault-tolerant</b> - it's more fault-tolerant: even if a Kafka consumer, a ClickHouse node, or a Zone Analytics API instance fails, it doesn't impact the service.</p></li><li><p><b>Scalable</b> - we can add more Kafka brokers or ClickHouse nodes and scale ingestion as we grow. We are less confident about query performance once the cluster grows to hundreds of nodes. However, the Yandex team has managed to scale their cluster to 500+ nodes, distributed geographically between several data centers, using two-level sharding.</p></li><li><p><b>Reduced complexity</b> - by removing the messy crons and consumers that did aggregations, and by <a href="https://www.cloudflare.com/learning/cloud/how-to-refactor-applications/">refactoring</a> the API code, we were able to:</p><ul><li><p>Shut down the Postgres RollupDB instance and free it up for reuse.</p></li><li><p>Shut down the 12-node Citus cluster and free it up for reuse. As we won't use Citus for serious workloads anymore, we can reduce our operational and support costs.</p></li><li><p>Delete tens of thousands of lines of old Go, SQL, Bash, and PHP code.</p></li><li><p>Remove the WWW PHP API dependency and its extra latency.</p></li></ul></li><li><p><b>Improved API throughput and latency</b> - with the previous pipeline, the Zone Analytics API was struggling to serve more than 15 queries per second, so we had to introduce temporary hard rate limits for the largest users. With the new pipeline we were able to remove the hard rate limits, and now we are serving ~40 queries per second. We went further and did intensive load testing of the new API; with the current setup and hardware we are able to serve up to ~150 queries per second, and this scales with additional nodes.
</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1TCHFdGAndAank2N3agHQj/5db6811ef0918e8fc66b8543531f9733/Zone-Analytics-API-requests-latency-quantiles.png" />
          </figure><p></p></li><li><p><b>Easier to operate</b> - with the shutdown of many unreliable components, we are finally at the point where it's relatively easy to operate this pipeline. ClickHouse's quality helps us a lot in this matter.</p></li><li><p><b>Fewer incidents</b> - with the new, more reliable pipeline, we now have fewer incidents than before, which has ultimately reduced the on-call burden. Finally, we can sleep peacefully at night :-).</p></li></ul><p>Recently, we've improved the throughput and latency of the new pipeline even further with better hardware. I'll provide details about this cluster below.</p>
    <div>
      <h4>Our ClickHouse cluster</h4>
      <a href="#our-clickhouse-cluster">
        
      </a>
    </div>
    <p>In total we have 36 ClickHouse nodes. The new hardware is a big upgrade for us:</p><ul><li><p><b>Chassis</b> - Quanta D51PH-1ULH chassis instead of Quanta D51B-2U chassis (2x less physical space)</p></li><li><p><b>CPU</b> - 40 logical cores E5-2630 v4 @ 2.20 GHz instead of 32 logical cores E5-2630 v3 @ 2.40 GHz</p></li><li><p><b>RAM</b> - 256 GB RAM instead of 128 GB RAM</p></li><li><p><b>Disks</b> - 12 x 10 TB Seagate ST10000NM0016-1TT101 disks instead of 12 x 6 TB Toshiba MG04ACA600E disks</p></li><li><p><b>Network</b> - 2 x 25G Mellanox ConnectX-4 in MC-LAG instead of 2 x 10G Intel 82599ES</p></li></ul><p>Our Platform Operations team noticed that ClickHouse is not great at running heterogeneous clusters yet, so we need to gradually replace all 36 nodes in the existing cluster with the new hardware. The process is fairly straightforward; it's no different from replacing a failed node. The problem is that <a href="https://github.com/yandex/ClickHouse/issues/1821">ClickHouse doesn't throttle recovery</a>.</p><p>Here is more information about our cluster:</p><ul><li><p><b>Avg insertion rate</b> - all our pipelines together insert 11M rows per second.</p></li><li><p><b>Avg insertion bandwidth</b> - 47 Gbps.</p></li><li><p><b>Avg queries per second</b> - on average the cluster serves ~40 queries per second, with frequent peaks up to ~80 queries per second.</p></li><li><p><b>CPU time</b> - after the recent hardware upgrade and all the optimizations, our cluster's CPU time is quite low.
</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/61sIThxxM4s9mgA8nQSibn/4e09df1a744f0c1e2cd92b8bf5bfdd5f/ClickHouse-CPU-usage.png" />
          </figure><p></p></li><li><p><b>Max disk IO</b> (device time) - it's low as well.
</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1rNrJvGvXd6VNvPRH2qe0l/f7813d61e834b420515047062e25e791/Max-disk-IO.png" />
          </figure><p></p><p></p></li></ul><p>In order to make the switch to the new pipeline as seamless as possible, we performed a transfer of historical data from the old pipeline. Next, I discuss the process of this data transfer.</p>
    <div>
      <h4>Historical data transfer</h4>
      <a href="#historical-data-transfer">
        
      </a>
    </div>
    <p>As we have 1-year storage requirements, we had to do a one-time ETL (Extract, Transform, Load) from the old Citus cluster into ClickHouse.</p><p>At Cloudflare we love Go and its goroutines, so it was quite straightforward to write a simple ETL job which:</p><ul><li><p>For each minute/hour/day/month extracts data from the Citus cluster</p></li><li><p>Transforms the Citus data into ClickHouse format and applies the needed business logic</p></li><li><p>Loads the data into ClickHouse</p></li></ul><p>The whole process took a couple of days, and 60+ billion rows of data were transferred successfully with consistency checks. The completion of this process finally led to the shutdown of the old pipeline. However, our work does not end there, and we are constantly looking to the future. In the next section, I'll share some details about what we are planning.</p>
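<p>Such a job is a natural fit for a goroutine worker pool. A hypothetical sketch of its shape - the extract/transform/load functions here are stubs standing in for the real Citus reads, the format conversion with business logic, and the ClickHouse inserts:</p>

```go
package main

import (
	"fmt"
	"sync"
)

// chunk identifies one time period (minute/hour/day/month) to migrate.
type chunk struct{ period string }

// Stubs for the three ETL stages; the real job issued SQL against
// Citus, converted rows to the ClickHouse schema, and batch-inserted.
func extract(c chunk) []string          { return []string{"row@" + c.period} }
func transform(rows []string) []string  { return rows }
func load(rows []string) int            { return len(rows) }

// runETL feeds chunks to a fixed pool of workers and returns the
// total number of rows loaded.
func runETL(chunks []chunk, workers int) int {
	in := make(chan chunk)
	var (
		wg    sync.WaitGroup
		mu    sync.Mutex
		total int
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for c := range in {
				n := load(transform(extract(c)))
				mu.Lock()
				total += n
				mu.Unlock()
			}
		}()
	}
	for _, c := range chunks {
		in <- c
	}
	close(in)
	wg.Wait()
	return total
}

func main() {
	chunks := []chunk{{"2017-01"}, {"2017-02"}, {"2017-03"}}
	fmt.Println(runETL(chunks, 2), "rows loaded")
}
```

<p>Per-chunk work parallelizes cleanly because each time period is independent; the final row count can then be checked against the source for consistency.</p>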
    <div>
      <h3>Future of Data APIs</h3>
      <a href="#future-of-data-apis">
        
      </a>
    </div>
    
    <div>
      <h4>Log Push</h4>
      <a href="#log-push">
        
      </a>
    </div>
    <p>We're currently working on something called "Log Push". Log Push allows you to specify a desired data endpoint and have your HTTP request logs sent there automatically at regular intervals. At the moment, it's in private beta and will support sending logs to:</p><ul><li><p>Amazon S3 bucket</p></li><li><p>Google Cloud Storage bucket</p></li><li><p>Other storage services and platforms</p></li></ul><p>It's expected to be generally available soon, but if you are interested in this new product and want to try it out, please contact our Customer Support team.</p>
    <div>
      <h4>Logs SQL API</h4>
      <a href="#logs-sql-api">
        
      </a>
    </div>
    <p>We're also evaluating the possibility of building a new product called Logs SQL API. The idea is to provide customers access to their logs via a flexible API which supports standard SQL syntax and JSON/CSV/TSV/XML response formats.</p><p>Queries can extract:</p><ul><li><p><b>Raw request log fields</b> (e.g. SELECT field1, field2, ... FROM requests WHERE ...)</p></li><li><p><b>Aggregated data from request logs</b> (e.g. SELECT clientIPv4, count() FROM requests GROUP BY clientIPv4 ORDER BY count() DESC LIMIT 10)</p></li></ul><p>Google BigQuery provides a similar <a href="https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query">SQL API</a>, and Amazon has a product called <a href="https://docs.aws.amazon.com/kinesisanalytics/latest/sqlref/analytics-sql-reference.html">Kinesis Data Analytics</a> with SQL API support as well.</p><p>Another option we're exploring is to provide syntax similar to the <a href="https://api.cloudflare.com/#dns-analytics-properties">DNS Analytics API</a>, with filters and dimensions.</p><p>We're excited to hear your feedback and learn more about your analytics use cases. It can help us a lot in building new products!</p>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>All this would not have been possible without hard work across multiple teams! First of all, thanks to the other Data team engineers for their tremendous efforts to make this all happen. The Platform Operations team made significant contributions to this project, especially Ivan Babrou and Daniel Dao. Contributions from Marek Vavruša on the DNS team were also very helpful.</p><p>Finally, the Data team at Cloudflare is a small team, so if you're interested in building and operating distributed services, you stand to have some great problems to work on. Check out the <a href="https://boards.greenhouse.io/cloudflare/jobs/613800">Distributed Systems Engineer - Data</a> and <a href="https://boards.greenhouse.io/cloudflare/jobs/688056">Data Infrastructure Engineer</a> roles in London, UK and San Francisco, US, and let us know what you think.</p> ]]></content:encoded>
            <category><![CDATA[Analytics]]></category>
            <category><![CDATA[Data]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Kafka]]></category>
            <category><![CDATA[Cap'n Proto]]></category>
            <category><![CDATA[API]]></category>
            <category><![CDATA[php]]></category>
            <category><![CDATA[Load Balancing]]></category>
            <category><![CDATA[NGINX]]></category>
            <guid isPermaLink="false">6VEE3i8wXN2CDKWJJ16uXS</guid>
            <dc:creator>Alex Bocharov</dc:creator>
        </item>
        <item>
            <title><![CDATA[The end of the road for Server: cloudflare-nginx]]></title>
            <link>https://blog.cloudflare.com/end-of-the-road-for-cloudflare-nginx/</link>
            <pubDate>Mon, 11 Dec 2017 14:00:00 GMT</pubDate>
            <description><![CDATA[ Six years ago when I joined Cloudflare the company had a capital F, about 20 employees, and a software stack that was mostly NGINX, PHP and PowerDNS (there was even a little Apache).  ]]></description>
            <content:encoded><![CDATA[ <p>Six years ago when I joined Cloudflare the company had a capital F, about 20 employees, and a software stack that was mostly NGINX, PHP and PowerDNS (there was even a little Apache). Today, things are quite different.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1UEWCugiE60sJmoWgMYpBv/b17ace62d5ae01b8ddc78b8c4f0d5aa9/3733477729_fa27bde743_b.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by-sa/2.0/">CC BY-SA 2.0</a> <a href="https://www.flickr.com/photos/zoramite/3733477729/in/photolist-6FV3RF-CqdQpX-4roTxQ-qzv62p-9uUbhA-rE2Mda-2fBsw1-k1XGt-9NHJGZ-jXtiMZ-bLV5QH-oMQho4-2xxq76-dYNEZq-kzq28Y-jqrziP-qmeRsG-boVvyx-eubDfn-dML8WR-7GhV1v-qNf6aL-avp1Tz-eeRo4Z-qNJH1n-b2BN1r-T8fJ2a-dLe1uE-dPmYEQ-gDk2pQ-89woTE-nakE9N-6E6snR-hw8v4e-hdq5No-Yzai76-5URDVP-9rTNQD-agYaM2-d7TVdd-ayKcf5-6noHQy-eg545m-9Wmvpo-iKLTMQ-kymTqW-WMcZJd-dFuhFJ-6E3wLd-gWHXxH">image</a> by <a href="https://www.flickr.com/photos/zoramite/">Randy Merrill</a></p><p>The F got lowercased, there are now more than 500 people and the software stack has changed radically. PowerDNS is gone and has been replaced with our own DNS server, <a href="/tag/rrdns/">RRDNS</a>, written in Go. The PHP code that used to handle the business logic of dealing with our customers’ HTTP requests is now Lua code, Apache is long gone and new technologies like <a href="https://www.cloudflare.com/website-optimization/railgun/">Railgun</a>, <a href="https://www.cloudflare.com/products/cloudflare-warp/">Warp</a>, <a href="https://www.cloudflare.com/argo/">Argo</a> and Tiered Cache have been added to our ‘edge’ stack.</p><p>And yet our servers still identify themselves in HTTP responses with</p>
            <pre><code>Server: cloudflare-nginx</code></pre>
            <p>Of course, NGINX is still a part of our stack, but the code that handles HTTP requests goes well beyond the capabilities of NGINX alone. It’s also not hard to imagine a time when the role of NGINX diminishes further. We currently run four instances of NGINX on each edge machine (one for SSL, one for non-SSL, one for caching and one for connections between data centers). We used to have a fifth, but it’s been deprecated, and we are planning to merge the SSL and non-SSL instances.</p><p>As we have done with other bits of software (such as the <a href="http://fallabs.com/kyototycoon/">KyotoTycoon</a> distributed key-value store or PowerDNS) we’re quite likely to write our own caching or web serving code at some point. The time may come when we no longer use NGINX for caching, for example. And so, now is a good time to switch away from <code>Server: cloudflare-nginx</code>.</p><p>We like to write our own when the cost of customizing or configuring existing open source software becomes too high. For example, we switched away from PowerDNS because it was becoming too complicated to implement all the logic we need for the services we provide.</p><p>Over the next month we will be transitioning to simply:</p>
            <pre><code>Server: cloudflare</code></pre>
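            <p>As an illustrative sketch (not official Cloudflare code), software that detects Cloudflare from the <code>Server</code> header can tolerate the transition by accepting both values:</p>
            <pre><code>&lt;?php
// Accept both the old and the new Server header values
// during the December 2017 - January 2018 transition period.
function isCloudflareServer(string $serverHeader): bool
{
    return in_array(strtolower($serverHeader), ['cloudflare', 'cloudflare-nginx'], true);
}</code></pre>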
            <p>If you have software that looks for <code>cloudflare-nginx</code> in the <code>Server</code> header, it’s time to update it.</p><p>We’ve worked closely with companies that rely on the <code>Server</code> header to determine whether a website, application or API uses Cloudflare, so that their software or service can be updated. We’ll be rolling out this change in stages between December 18, 2017 and January 15, 2018. Between those dates, Cloudflare-powered HTTP responses may contain either <code>Server: cloudflare-nginx</code> or <code>Server: cloudflare</code>.</p> ]]></content:encoded>
            <category><![CDATA[NGINX]]></category>
            <category><![CDATA[SSL]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[php]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Programming]]></category>
            <guid isPermaLink="false">5vx7GVqTU8t5yGDpLVSbrD</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare Wants to Buy Your Meetup Group Pizza]]></title>
            <link>https://blog.cloudflare.com/cloudflare-wants-to-buy-your-meetup-group-pizza/</link>
            <pubDate>Fri, 10 Nov 2017 15:00:00 GMT</pubDate>
            <description><![CDATA[ If you’re a web dev / devops / etc. meetup group that also works toward building a faster, safer Internet, I want to support your awesome group by buying you pizza.  ]]></description>
            <content:encoded><![CDATA[ <p></p><p>If you’re a web dev / devops / etc. meetup group that also works toward building a faster, safer Internet, I want to support your awesome group by buying you pizza. If your group’s focus falls within one of the subject categories below and you’re willing to give us a 30 second shout out and tweet a photo of your group and <a href="https://twitter.com/Cloudflare">@Cloudflare</a>, your meetup’s pizza expense will be reimbursed.</p><p><a href="https://docs.google.com/document/d/1wvJjiS1iuzf_XYNaah20xTwNfjcBQeiqjmrHi9CEFWM/edit?usp=sharing">Get Your Pizza $ Reimbursed »</a></p>
    <div>
      <h3>Developer Relations at Cloudflare &amp; why we’re doing this</h3>
      <a href="#developer-relations-at-cloudflare-why-were-doing-this">
        
      </a>
    </div>
    <p>I’m <a href="https://twitter.com/fitchaj">Andrew Fitch</a> and I work on the Developer Relations team at <a href="https://www.cloudflare.com/">Cloudflare</a>. One of the things I like most about working in DevRel is empowering community members who are already doing great things out in the world. Whether they’re starting conferences, hosting local meetups, or writing educational content, I think it’s important to support them in their efforts and reward them for doing what they do. Community organizers are the glue that holds developers together socially. Let’s support them and make their lives easier by taking care of the pizza part of the equation.</p>
            <figure>
            <a href="https://www.cloudflare.com/apps/">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2LFJE5nWDC74sln8mrtauZ/de98413ff1b6f611461c5c4649444c60/FingerPhotography-Cloudfare-8733.jpg" />
            </a>
            </figure>
    <div>
      <h4>What’s in it for Cloudflare?</h4>
      <a href="#whats-in-it-for-cloudflare">
        
      </a>
    </div>
    <ol><li><p>We want web developers to target the <a href="https://www.cloudflare.com/apps/">apps platform</a></p></li><li><p>We want more people to think about <a href="https://www.cloudflare.com/careers/">working at Cloudflare</a></p></li><li><p>Some people only know of <a href="https://www.cloudflare.com/">Cloudflare</a> as a CDN, but it’s actually a security company and does much more than that. General awareness helps tell the story of what Cloudflare is about.</p></li></ol>
    <div>
      <h4>What kinds of groups we most want to support</h4>
      <a href="#what-kinds-of-groups-we-most-want-to-support">
        
      </a>
    </div>
    <p>We want to work with groups that are closely aligned with Cloudflare, groups which focus on web development, web security, devops, or tech ops. We often work with language-specific meetups (such as Go London User Group) or diversity &amp; inclusion meetups (such as Women Who Code Portland). We’d also love to work with college student groups and any other coding groups which are focused on sharing technical knowledge with the community.</p><p>To get sponsored pizza, groups must be focused on...</p><ul><li><p>Web performance &amp; optimization</p></li><li><p>Front-end frameworks</p></li><li><p>Language-specific (JavaScript / PHP / Go / Lua / etc.)</p></li><li><p>Diversity &amp; inclusion meetups for web devs / devops / tech ops</p></li><li><p>Workshops / classes for web devs / devops / tech ops</p></li></ul>
    <div>
      <h4>How it works</h4>
      <a href="#how-it-works">
        
      </a>
    </div>
    <ol><li><p>Interested groups need to read our <a href="https://docs.google.com/document/d/1wvJjiS1iuzf_XYNaah20xTwNfjcBQeiqjmrHi9CEFWM/edit">Cloudflare Meetup Group Pizza Reimbursement Rules document</a>.</p></li><li><p>If the group falls within the requirements, the group should proceed to schedule their meetup and pull the <a href="https://docs.google.com/presentation/d/1afluXpf2eYeQ7SkmV2ONrojeZwosMTNifhnZR1chNTw/edit?usp=sharing">Cloudflare Introductory Slides</a> ahead of time.</p></li><li><p>At the event, an organizer should present the Cloudflare Introductory Slides at the beginning, giving us a 30 second shout out, and take a photo of the group in front of the slides. Someone from the group should Tweet the photo and @Cloudflare, so our Community Manager may retweet.</p></li><li><p>After the event, an organizer will need to fill out our <a href="https://docs.google.com/forms/d/e/1FAIpQLSe7iY-mjum-hfEtu8W08uf8e6H5IMDevRgCZ_YXtzaEwkLEag/viewform">reimbursement form</a> where they will upload a <a href="https://www.irs.gov/pub/irs-pdf/fw9.pdf">completed W-9</a>, their mailing address, group details, and the link to the Tweet.</p></li><li><p>Within a month, the organizer should have a check from Cloudflare (and some laptop stickers for their group, if they want) in their hands.</p></li></ol>
    <div>
      <h4>Here are some examples of groups we’ve worked with so far</h4>
      <a href="#here-are-some-examples-of-groups-weve-worked-with-so-far">
        
      </a>
    </div>
    <p><a href="https://www.meetup.com/preview/Go-London-User-Group"><b>Go London User Group (GLUG)</b></a></p>
            <figure>
            <a href="https://twitter.com/LondonGophers/status/920353032746995712">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4oWPqXK3ZLOH0BA3mbJO7M/ddc7337f74e01171d7e91655dfa35f0e/Screen-Shot-2017-11-01-at-2.32.00-PM.png" />
            </a>
            </figure><p>The Go London User Group is a London-based community for anyone interested in the Go programming language.</p><p>GLUG provides offline opportunities to:</p><ul><li><p>Discuss Go and related topics</p></li><li><p>Socialize with friendly people who are interested in Go</p></li><li><p>Find or fill Go-related jobs</p></li></ul><p>They want GLUG to be a diverse and inclusive community. As such all attendees, organizers and sponsors are required to follow the <a href="https://golang.org/conduct">community code of conduct</a>.</p><p>This group’s mission is very much in alignment with Cloudflare’s guidelines, so we sponsored the pizza for their <a href="https://www.meetup.com/preview/Go-London-User-Group/events/243800263">October Gophers</a> event.</p><p><a href="https://www.meetup.com/preview/Women-Who-Code-Portland"><b>Women Who Code Portland</b></a></p>
            <figure>
            <a href="https://twitter.com/superissarah/status/923370521684688896">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1t1S26RRHOCUC2b3YwEAMD/5aec722ff115aab5edbb7da5700bee35/Screen-Shot-2017-11-01-at-2.37.09-PM.png" />
            </a>
            </figure><p>Women Who Code’s mission statement</p><p>Women Who Code is a global nonprofit dedicated to inspiring women to excel in technology careers by creating a global, connected community of women in technology.</p><p>What they offer</p><ul><li><p>Monthly Networking Nights</p></li><li><p>Study Nights (JavaScript, React, DevOps, Design + Product, Algorithms)</p></li><li><p>Technical Workshops</p></li><li><p>Community Events (Hidden Figures screening, Women Who Strength Train, Happy Hours, etc.)</p></li><li><p>Free or discounted tickets to conferences</p></li></ul><p>Women Who Code are a great example of an inclusive technical group. Cloudflare reimbursed them for their pizza at their <a href="https://www.meetup.com/preview/Women-Who-Code-Portland/events/242929022">October JavaScript Study Night</a>.</p><p><a href="https://www.meetup.com/preview/PHX-Android"><b>GDG Phoenix / PHX Android</b></a></p>
            <figure>
            <a href="https://twitter.com/GDGPhoenix/status/924314879875465216">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2PyMqJ00eEnGypcoxjmQST/878da7d678babf0a506f8ff099449db8/Screen-Shot-2017-11-01-at-2.41.46-PM.png" />
            </a>
            </figure><p>The GDG Phoenix / PHX Android group is for anyone interested in Android design and development. All skill levels are welcome. Participants are welcome to come to the meetup to hang out and watch, or to be part of the conversation at events.</p><p>This group is associated with the <a href="https://developers.google.com/groups/">Google Developer Group</a> program and coordinates activities with the national organization. Group activities are not exclusively Android-related; they can cover any technology.</p><p>Cloudflare sponsored their October event “<a href="https://www.meetup.com/preview/PHX-Android/events/242897401">How espresso works . . . Android talk with Michael Bailey</a>”.</p><hr /><p>I hope a lot of awesome groups around the world will take advantage of our offer. <b>Enjoy the pizza!</b></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5kKeJF3uYWwJyAkZ67TmQg/1f948496f05932879b8165e87730e456/giphy-downsized.gif" />
            
            </figure><p><a href="https://docs.google.com/document/d/1wvJjiS1iuzf_XYNaah20xTwNfjcBQeiqjmrHi9CEFWM/edit?usp=sharing">Get Your Pizza $ Reimbursed »</a></p><p>© 2017 Cloudflare, Inc. All rights reserved. The Cloudflare logo and marks are trademarks of Cloudflare. All other company and product names may be trademarks of the respective companies with which they are associated.</p> ]]></content:encoded>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[JavaScript]]></category>
            <category><![CDATA[php]]></category>
            <category><![CDATA[LUA]]></category>
            <category><![CDATA[Go]]></category>
            <category><![CDATA[MeetUp]]></category>
            <category><![CDATA[Cloudflare Meetups]]></category>
            <category><![CDATA[Community]]></category>
            <guid isPermaLink="false">5luIi07exijfTyC4t91k0H</guid>
            <dc:creator>Andrew Fitch</dc:creator>
        </item>
        <item>
            <title><![CDATA[A New API Binding: cloudflare-php]]></title>
            <link>https://blog.cloudflare.com/cloudflare-php-api-binding/</link>
            <pubDate>Sat, 23 Sep 2017 00:01:35 GMT</pubDate>
            <description><![CDATA[ Back in May last year, one of my colleagues blogged about the introduction of our Python binding for the Cloudflare API and drew reference to our other bindings in Go and Node. Today we are complementing this range by introducing a new official binding, this time in PHP. ]]></description>
            <content:encoded><![CDATA[ <p>Back in May last year, one of my colleagues blogged about the introduction of our <a href="/python-cloudflare/">Python binding for the Cloudflare API</a> and drew reference to our other bindings in <a href="https://github.com/cloudflare/cloudflare-go">Go</a> and <a href="https://github.com/cloudflare/node-cloudflare">Node</a>. Today we are complementing this range by introducing a new official binding, this time in <a href="https://github.com/cloudflare/cloudflare-php">PHP</a>.</p><p>This binding is available via Packagist as <a href="https://packagist.org/packages/cloudflare/sdk">cloudflare/sdk</a>; you can install it using Composer simply by running <code>composer require cloudflare/sdk</code>. We have documented various use cases in our <a href="https://support.cloudflare.com/hc/en-us/articles/115001661191">"Cloudflare PHP API Binding" KB article</a> to help you get started.</p><p>Alternatively, should you wish to contribute, or just give us a star on GitHub, feel free to browse the <a href="https://github.com/cloudflare/cloudflare-php">cloudflare-php source code</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1cb8c0aCRK1QxtwReVtMDc/de48b0c9191df68bab840c53f5efea8b/installing-cloudflare-php.png" />
            
            </figure><p>PHP is a controversial language, and there is no doubt there are elements of bad design within the language (as is the case with many other languages). However, love it or hate it, PHP is a language of high adoption; as of September 2017 <a href="https://w3techs.com/technologies/overview/programming_language/all">W3Techs</a> report that PHP is used by 82.8% of all the websites whose server-side programming language is known. In creating this binding the question clearly wasn't about the merits of PHP, but about whether we wanted to help drive improvements to the developer experience for the sizeable number of developers who integrate with us using PHP.</p><p>To help those looking to contribute to or build upon this library, I'm writing this blog post to explain some of the design decisions made in putting it together.</p>
    <div>
      <h3>Exclusively for PHP 7</h3>
      <a href="#exclusively-for-php-7">
        
      </a>
    </div>
    <p>PHP 5 initially introduced the ability to type hint on the basis of classes and interfaces; this opened up (albeit seldom used) parametric polymorphic behaviour in PHP. Type hinting on the basis of interfaces made it easier for those developing in PHP to follow the Gang of Four's famous guidance: "Program to an 'interface', not an 'implementation'."</p><p>Type hinting has slowly developed in PHP: in PHP 7.0, Scalar Type Hinting was released after a few rounds of RFCs. Additionally, PHP 7.0 introduced Return Type Declarations, allowing return values to be type hinted in a similar way to argument type hinting. In this library we extensively use Scalar Type Hinting and Return Type Declarations, thereby giving up backward compatibility with PHP 5.</p><p>Had we kept backward compatibility, these improvements to type hinting simply would not have been usable and the associated benefits would be lost. With Active Support <a href="http://php.net/supported-versions.php">no longer being offered for PHP 5.6</a> and Security Support for the entirety of PHP 5.x little over a year away from disappearing, we decided the additional coverage wasn't worth the cost.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/RJX9jaiMJuc3J7e8OBCor/2998f5d9da54264b9774aa8e18fe00b6/php-support.png" />
            
            </figure>
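    <p>As a quick illustration of what these features give us (a generic sketch, not code from the binding itself), scalar type hints and a return type declaration look like this; under <code>strict_types</code>, a mis-typed argument throws a <code>TypeError</code> rather than being silently coerced:</p>
            <pre><code>&lt;?php
declare(strict_types=1);

// Scalar type hints on both arguments, plus a return type declaration:
function requestsPerSecond(int $requests, float $seconds): float
{
    return $requests / $seconds;
}

// requestsPerSecond(6000000, 1.0) returns float(6000000)
// requestsPerSecond("many", 1.0) would throw a TypeError under strict_types</code></pre>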
    <div>
      <h3>Object Composition</h3>
      <a href="#object-composition">
        
      </a>
    </div>
    <blockquote><p>What do we mean by a software architecture? To me the term architecture conveys a notion of the core elements of the system, the pieces that are difficult to change. A foundation on which the rest must be built. <a href="https://www.martinfowler.com/articles/designDead.html">Martin Fowler</a></p></blockquote><p>When getting started with this package, you'll notice there are 3 classes you'll need to instantiate:</p>
            <pre><code>$key     = new \Cloudflare\API\Auth\APIKey('user@example.com', 'apiKey');
$adapter = new Cloudflare\API\Adapter\Guzzle($key);
$user    = new \Cloudflare\API\Endpoints\User($adapter);
    
echo $user-&gt;getUserID();</code></pre>
            <p>The first class being instantiated is called <code>APIKey</code> (a few other classes for authentication are available). We then instantiate the <code>Guzzle</code> class, injecting the <code>APIKey</code> object into its constructor. The <code>Auth</code> interface that the <code>APIKey</code> class implements is fairly simple:</p>
            <pre><code>namespace Cloudflare\API\Auth;

interface Auth
{
    public function getHeaders(): array;
}</code></pre>
            <p>The <code>Adapter</code> interface (which the <code>Guzzle</code> class implements) makes explicit that an object built on the <code>Auth</code> interface is expected to be injected into the constructor:</p>
            <pre><code>namespace Cloudflare\API\Adapter;

use Cloudflare\API\Auth\Auth;
use Psr\Http\Message\ResponseInterface;

interface Adapter
{
...
    public function __construct(Auth $auth, string $baseURI);
...
}</code></pre>
            <p>In doing so, we define that classes which implement the <code>Adapter</code> interface are to be composed using objects made from classes which implement the <code>Auth</code> interface.</p><p>So why am I explaining basic Dependency Injection here? It is critical to understand that, as the design of our API changes, the mechanisms for Authentication may vary independently of the HTTP Client or indeed the API Endpoints themselves. Similarly, the HTTP Client or the API Endpoints may vary independently of the other elements involved. Indeed, this package already contains three classes for the purpose of authentication (<code>APIKey</code>, <code>UserServiceKey</code> and <code>None</code>) which need to be used interchangeably. This package therefore considers the possibility of changes to different components in the API and seeks to allow these components to vary independently.</p><p>Dependency Injection is also used where the parameters for an API Endpoint become more complicated than what is permitted by simpler variable types; for example, this is done for defining the Target or Configuration when configuring a Page Rule:</p>
            <pre><code>require_once('vendor/autoload.php');

$key = new \Cloudflare\API\Auth\APIKey('mjsa@junade.com', 'apiKey');
$adapter = new Cloudflare\API\Adapter\Guzzle($key);
$zones = new \Cloudflare\API\Endpoints\Zones($adapter);

$zoneID = $zones-&gt;getZoneID("junade.com");

$pageRulesTarget = new \Cloudflare\API\Configurations\PageRulesTargets('https://junade.com/noCache/*');

$pageRulesConfig = new \Cloudflare\API\Configurations\PageRulesActions();
$pageRulesConfig-&gt;setCacheLevel('bypass');

$pageRules = new \Cloudflare\API\Endpoints\PageRules($adapter);
$pageRules-&gt;createPageRule($zoneID, $pageRulesTarget, $pageRulesConfig, true, 6);</code></pre>
            <p>The structure of this project is overall based on simple object composition; this provides a far simpler object model for the long term and a design with higher flexibility. For example, should we later want to create an Endpoint class which is a <a href="https://en.wikipedia.org/wiki/Composite_pattern">composite</a> of other Endpoints, it becomes fairly trivial to build this by implementing the same interface as the other Endpoint classes. As more code is added, we are able to keep the design of the software relatively thinly layered.</p>
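            <p>To make the composite idea concrete, here is a hypothetical sketch; the interface and class names below are illustrative only and not part of cloudflare/sdk:</p>
            <pre><code>&lt;?php
// A hypothetical shared interface for Endpoint classes.
interface Endpoint
{
    public function getName(): string;
}

class ZonesEndpoint implements Endpoint
{
    public function getName(): string
    {
        return 'zones';
    }
}

class PageRulesEndpoint implements Endpoint
{
    public function getName(): string
    {
        return 'pageRules';
    }
}

// The composite implements the same interface as the Endpoints it wraps,
// so callers can treat it exactly like a single Endpoint.
class CompositeEndpoint implements Endpoint
{
    private $endpoints;

    public function __construct(Endpoint ...$endpoints)
    {
        $this-&gt;endpoints = $endpoints;
    }

    public function getName(): string
    {
        return implode('+', array_map(function (Endpoint $e) {
            return $e-&gt;getName();
        }, $this-&gt;endpoints));
    }
}</code></pre>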
    <div>
      <h3>Testing/Mocking HTTP Requests</h3>
      <a href="#testing-mocking-http-requests">
        
      </a>
    </div>
    <p>If you're interested in helping contribute to this repository, there are two key ways you can help:</p><ol><li><p>Building out coverage of endpoints on our API</p></li><li><p>Building out test coverage of those endpoint classes</p></li></ol><p>The PHP-FIG (PHP Framework Interop Group) put together a standard on how HTTP responses can be represented in an interface; this is described in the <a href="http://www.php-fig.org/psr/psr-7/">PSR-7 standard</a>. This response interface is utilised by our HTTP <code>Adapter</code> interface, in which responses to API requests are type hinted to this interface (<code>Psr\Http\Message\ResponseInterface</code>).</p><p>By using this standard, it's easier to add further abstractions for additional HTTP clients and to mock HTTP responses for unit testing. Let's assume the JSON response is stored in the <code>$response</code> variable and we want to test the <code>listIPs</code> method in the <code>IPs</code> Endpoint class:</p>
            <pre><code>public function testListIPs() {
  $stream = GuzzleHttp\Psr7\stream_for($response);
  $response = new GuzzleHttp\Psr7\Response(200, ['Content-Type' =&gt; 'application/json'], $stream);
  $mock = $this-&gt;getMockBuilder(\Cloudflare\API\Adapter\Adapter::class)-&gt;getMock();
  $mock-&gt;method('get')-&gt;willReturn($response);

  $mock-&gt;expects($this-&gt;once())
    -&gt;method('get')
    -&gt;with($this-&gt;equalTo('ips'), $this-&gt;equalTo([]));

  $ips = new \Cloudflare\API\Endpoints\IPs($mock);
  $ips = $ips-&gt;listIPs();
  $this-&gt;assertObjectHasAttribute("ipv4_cidrs", $ips);
  $this-&gt;assertObjectHasAttribute("ipv6_cidrs", $ips);
}</code></pre>
            <p>We are able to build a simple mock of our <code>Adapter</code> interface by using the standardised PSR-7 response format; when we do so, we are able to define what parameters PHPUnit expects to be passed to this mock. With a mock <code>Adapter</code> class in place, we are able to test the <code>IPs</code> Endpoint class as if it were using a real HTTP client.</p>
    <div>
      <h3>Conclusions</h3>
      <a href="#conclusions">
        
      </a>
    </div>
    <p>Through building on modern versions of PHP, using good Object-Oriented Programming theory and allowing for effective testing, we hope our PHP API binding provides a developer experience that is pleasant to build upon.</p><p>If you're interested in helping improve the design of this codebase, I'd encourage you to take a look at the <a href="https://github.com/cloudflare/cloudflare-php">PHP API binding source code</a> on GitHub (and optionally give us a star).</p><p>If you work with Go or PHP and you're interested in helping Cloudflare turn our high-traffic customer-facing API into an ever more modern service-oriented environment, we're hiring for Web Engineers in <a href="https://boards.greenhouse.io/cloudflare/jobs/745994#.WcWKLtOGPys">San Francisco</a>, <a href="https://boards.greenhouse.io/cloudflare/jobs/682927#.WcWKPdOGPys">Austin</a> and <a href="https://boards.greenhouse.io/cloudflare/jobs/574636#.WcWKQdOGPys">London</a>.</p> ]]></content:encoded>
            <category><![CDATA[php]]></category>
            <category><![CDATA[API]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Python]]></category>
            <guid isPermaLink="false">1NRqPHr4Wm8QNvqlHrAE4c</guid>
            <dc:creator>Junade Ali</dc:creator>
        </item>
        <item>
            <title><![CDATA[Using Guzzle and PHPUnit for REST API Testing]]></title>
            <link>https://blog.cloudflare.com/using-guzzle-and-phpunit-for-rest-api-testing/</link>
            <pubDate>Wed, 28 Dec 2016 14:10:46 GMT</pubDate>
            <description><![CDATA[ APIs are increasingly becoming the backbone of the modern internet - whether you're ordering food from an app on your phone or browsing a blog using a modern JavaScript framework, chances are those requests are flowing through an API.  ]]></description>
            <content:encoded><![CDATA[ <p>APIs are increasingly becoming the backbone of the modern internet - whether you're ordering food from an app on your phone or browsing a blog using a modern JavaScript framework, chances are those requests are flowing through an API. Given the need for APIs to evolve through refactoring and extension, having great automated tests allows you to develop fast without needing to slow down to run manual tests to work out what’s broken. Additionally, by having tests in place you’re able to firmly identify the requirements that your API should meet; your API tests effectively form a tangible and executable specification. API testing offers an end-to-end mechanism of testing the behaviour of your API, which has advantages in both reliability and development productivity.</p><p>In this post I'll be demonstrating how you can test RESTful APIs in an automated fashion using PHP, by building a testing framework through creative use of two packages - Guzzle and PHPUnit. The resulting tests will be something you can run outside of your API as part of your deployment or CI (Continuous Integration) process.</p><p>Guzzle acts as a powerful HTTP client which we can use to simulate HTTP requests against our API. PHPUnit is a unit testing framework (based on xUnit); in this case we will use it to test the HTTP responses we get back from our API via Guzzle.</p>
    <div>
      <h2>Preparing our Environment</h2>
      <a href="#preparing-our-environment">
        
      </a>
    </div>
    <p>In order to pull in the required packages, we’ll be using Composer - a dependency manager for PHP. Inside our Composer project, we can simply require the dependencies we’re after:</p>
            <pre><code>$ composer require phpunit/phpunit
$ composer require guzzlehttp/guzzle
$ composer update</code></pre>
            <p>When we ran <code>composer require</code> for each of the two packages, Composer downloaded the packages we want; these are stored in the <code>vendor</code> directory. Additionally, when we ran <code>composer update</code>, Composer updated its <a href="http://www.php-fig.org/psr/psr-4/">PSR-4</a> autoload script, which allows us to pull in all the dependencies we’ve required with one file include; you can find this in <code>vendor/autoload.php</code>.</p><p>With our dependencies in place, we can now configure PHPUnit to use Guzzle. To do this, we need to tell PHPUnit where our Composer autoload file is and where our tests are located. We can do this by writing a <code>phpunit.xml</code> in the root directory of our project:</p>
            <pre><code>&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;phpunit bootstrap="vendor/autoload.php"&gt;
    &lt;testsuites&gt;
        &lt;testsuite name="REST API Test Suite"&gt;
            &lt;directory suffix=".php"&gt;./tests/&lt;/directory&gt;
        &lt;/testsuite&gt;
    &lt;/testsuites&gt;
&lt;/phpunit&gt;</code></pre>
            <p>In the XML above, the two noteworthy elements are the opening <code>phpunit</code> tag, whose <code>bootstrap</code> attribute points to our Composer autoload script, and the <code>testsuite</code> element, which defines our test suite (its child <code>directory</code> element specifies where the tests live). From here, we can just add an empty directory called <code>tests</code> for our tests to reside in.</p><p>If we now run PHPUnit (through the command <code>./vendor/bin/phpunit</code>), we should see an output similar to the one I get below:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/76DXsZeFPl1ujtAumTaWzn/6de8b3a7caf0b5fd35e5ae6e4076e463/runningPHPUnitNoTests.png" />
            
            </figure><p>With our environment defined, we're ready to move on to the next step. First, purely for the sake of convenience, I've added a shortcut to my <code>composer.json</code> file so that when I run <code>composer test</code> it will point to <code>./vendor/bin/phpunit</code>. You can do this by adding the following JSON to your <code>composer.json</code> file:</p>
            <pre><code>  "scripts": {
    "test": "./vendor/bin/phpunit"
  }</code></pre>
            
    <div>
      <h2>Writing our Tests</h2>
      <a href="#writing-our-tests">
        
      </a>
    </div>
    <p>As an example, I'll be writing tests against an endpoint at httpbin.org. The first test I'll write will be against the <code>/user-agent</code> endpoint, so I'll create a file called <code>UserAgentTest.php</code>; be sure to extend the <code>PHPUnit_Framework_TestCase</code> class:</p>
            <pre><code>&lt;?php

class UserAgentTest extends PHPUnit_Framework_TestCase
{
}</code></pre>
            <p>Before each test, PHPUnit will run the <code>setUp</code> method and after the test has executed it will run the <code>tearDown</code> method in the class (if they exist). By utilising these methods we can instantiate our Guzzle HTTP client ready for each test and then return to a clean slate afterwards:</p>
            <pre><code>&lt;?php

class UserAgentTest extends PHPUnit_Framework_TestCase
{
    private $http;

    public function setUp()
    {
        $this-&gt;http = new GuzzleHttp\Client(['base_uri' =&gt; 'https://httpbin.org/']);
    }

    public function tearDown()
    {
        $this-&gt;http = null;
    }
}</code></pre>
            <p>Note that if you feel even more adventurous, you can use an environment variable (read via the <code>getenv</code> function) to set the <code>base_uri</code> - for this tutorial, however, I'll be keeping things simple.</p><p>With our <code>setUp</code> and <code>tearDown</code> methods in place, we can now go ahead and actually create our test methods. As I'll start off by testing the <code>GET</code> HTTP verb, I'll name the first test method <code>testGet</code>. From here, we can make the request and then check the properties we get back.</p>
            <pre><code>public function testGet()
{
    $response = $this-&gt;http-&gt;request('GET', 'user-agent');

    $this-&gt;assertEquals(200, $response-&gt;getStatusCode());

    $contentType = $response-&gt;getHeaders()["Content-Type"][0];
    $this-&gt;assertEquals("application/json", $contentType);

    $userAgent = json_decode($response-&gt;getBody())-&gt;{"user-agent"};
    $this-&gt;assertRegExp('/Guzzle/', $userAgent);
}</code></pre>
            <p>In the method above, I've made a GET request to the user-agent endpoint. The first assertion checks that the response code I got back was indeed <code>200</code>. The next assertion checks whether the <code>Content-Type</code> header indicates the response is JSON. Finally, I check that the JSON body itself actually contains the phrase "Guzzle" in the user-agent property.</p><p>We can add additional assertions as required, but we can also add additional methods for other HTTP verbs. For example, here's a simple test to see if I get a <code>405</code> status code when I make a PUT request to the <code>/user-agent</code> endpoint:</p>
            <pre><code>public function testPut()
{
    $response = $this-&gt;http-&gt;request('PUT', 'user-agent', ['http_errors' =&gt; false]);

    $this-&gt;assertEquals(405, $response-&gt;getStatusCode());
}</code></pre>
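<p>The checks in these tests are framework-agnostic. As a rough illustration, here is the same logic from <code>testGet</code> sketched in JavaScript; <code>checkResponse</code> and the plain-object response shape are hypothetical, invented for this sketch, and are not part of Guzzle or PHPUnit:</p>

```javascript
// Hypothetical helper mirroring the three PHPUnit assertions above:
// status code, Content-Type header, and a regex over the JSON body.
function checkResponse(response) {
  const errors = [];
  if (response.status !== 200) {
    errors.push(`expected status 200, got ${response.status}`);
  }
  if (response.headers['Content-Type'] !== 'application/json') {
    errors.push(`unexpected Content-Type: ${response.headers['Content-Type']}`);
  }
  const body = JSON.parse(response.body);
  if (!/Guzzle/.test(body['user-agent'])) {
    errors.push('user-agent does not mention Guzzle');
  }
  return errors;
}

// A response satisfying all three assertions yields no errors.
const ok = checkResponse({
  status: 200,
  headers: { 'Content-Type': 'application/json' },
  body: '{"user-agent": "GuzzleHttp/6.2.1 PHP/7.1.0"}',
});
console.log(ok.length); // 0
```

<p>Each failed expectation produces an error message, just as each failed assertion fails the PHPUnit test.</p>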
            <p>Next time we run PHPUnit, we can see if our tests pass successfully and also get insight into some statistics surrounding the execution of these tests:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6vHnEkW9LJGUDKdYPJKFpC/93bf5e18e2478b2ab1512da9490204bd/Screen-Shot-2016-12-28-at-08.36.57.png" />
            
            </figure>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>That's all there is to this simple approach to API testing. If you want some insight into the overall code, feel free to review the project files in the <a href="https://github.com/IcyApril/Test-the-REST">GitHub repository</a>.</p><p>If you find yourself using this testing set-up, be sure to review the <a href="http://docs.guzzlephp.org/en/latest/request-options.html">Guzzle Request Options</a> to learn what kind of HTTP requests you can run with Guzzle, and also check out the <a href="https://phpunit.de/manual/current/en/appendixes.assertions.html">types of assertions</a> you can run with PHPUnit.</p> ]]></content:encoded>
            <category><![CDATA[php]]></category>
            <guid isPermaLink="false">3KG09Q7frQnsa3Y3MXhmyH</guid>
            <dc:creator>Junade Ali</dc:creator>
        </item>
        <item>
            <title><![CDATA[Accelerating Node.js applications with HTTP/2 Server Push]]></title>
            <link>https://blog.cloudflare.com/accelerating-node-js-applications-with-http-2-server-push/</link>
            <pubDate>Tue, 16 Aug 2016 12:17:19 GMT</pubDate>
            <description><![CDATA[ In April, we announced support for HTTP/2 Server Push via the HTTP Link header. My coworker John has demonstrated how easy it is to add Server Push to an example PHP application.  We wanted to make it easy to improve the performance of contemporary websites built with Node.js.  ]]></description>
            <content:encoded><![CDATA[ <p>In April, we announced support for <a href="https://www.cloudflare.com/http2/server-push/">HTTP/2 Server Push</a> via the HTTP <code>Link</code> header. My coworker John has demonstrated how easy it is to <a href="/using-http-2-server-push-with-php/">add Server Push to an example PHP</a> application.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2cTvkkBKYrTnHNyFrfyrYI/05d92efb1903cd4700d8a9020e430b40/489477622_594bf9e3d9_z.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a> <a href="https://www.flickr.com/photos/nickyfern/489477622/in/photolist-KfGDm-4WuA7B-4WySRS-a8ijnt-4WuByk-4WuB9M-bE3e6V-665C6K-eeQRx1-b97viM-qzYJ8z-9n2aTS-8EsaQK-aqxV42-jWDkD2-jKyWKv-jzs7yy-9TKcYn-4iAQTa-ECjYJ-96MvXy-bP43v2-rEmKWR-5p3k5r-pVZhDo-os1Njy-5CEEnU-8RhgUX-7JV4jr-9bC7me-sSKqA-72Mppz-maZfyL-6KfUge-dDvw4S-ngyFgY-aPxqsP-7b24QV-5opdXV-sEA7Xj-mZohS-8GFkMr-twfh93-7ZXqtJ-dvRbXj-a8mRiU-4NCzeD-qFXKpj-4n7K7j-34D66Q">image</a> by <a href="https://www.flickr.com/photos/nickyfern/">Nicky Fernandes</a></p><p>We wanted to make it easy to improve the performance of contemporary websites built with <a href="https://nodejs.org/">Node.js</a>, so we developed the <a href="https://github.com/cloudflare/netjet">netjet</a> middleware to parse the generated HTML and automatically add the <code>Link</code> headers. When used with an example Express application, you can see the headers being added:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4yzfZDbCTajZyJDNnbqrUm/ab28553b5d5e878e10b93f115a0f602e/2016-08-11_13-32-45.png" />
            
            </figure><p>We use <a href="https://ghost.org/">Ghost</a> to power this blog, so if your browser supports HTTP/2 you have already benefited from Server Push without realizing it! More on that below.</p><p>In netjet, we use the <a href="https://github.com/posthtml/posthtml">PostHTML</a> project to parse the HTML with a custom plugin. Right now it is looking for images, scripts and external stylesheets. You can implement this same technique in other environments too.</p><p>Putting an HTML parser in the response stack has a downside: it will increase the page load latency (or "time to first byte"). In most cases, the added latency will be overshadowed by other parts of your application, such as database access. However, netjet includes an adjustable LRU cache keyed by <code>ETag</code> headers, allowing netjet to insert <code>Link</code> headers quickly on pages already parsed.</p><p>If you are designing a brand new application, however, you should consider storing metadata on embedded resources alongside your content, eliminating the HTML parse, and possible latency increase, entirely.</p><p>Netjet is compatible with any Node.js HTML framework that supports Express-like middleware. Getting started is as simple as adding netjet to the beginning of your middleware chain.</p>
            <pre><code>var express = require('express');
var netjet = require('netjet');
var root = '/path/to/static/folder';

express()
  .use(netjet({
    cache: {
      max: 100
    }
  }))
  .use(express.static(root))
  .listen(1337);</code></pre>
            <p>With a little more work, you can even use netjet without frameworks.</p>
            <pre><code>var http = require('http');
var netjet = require('netjet');

var port = 1337;
var hostname = 'localhost';
var preload = netjet({
  cache: {
    max: 100
  }
});

var server = http.createServer(function (req, res) {
  preload(req, res, function () {
      res.statusCode = 200;
      res.setHeader('Content-Type', 'text/html');
      res.end('&lt;!doctype html&gt;&lt;h1&gt;Hello World&lt;/h1&gt;');
  });
});

server.listen(port, hostname, function () {
  console.log('Server running at http://' + hostname + ':' + port + '/');
});</code></pre>
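<p>The adjustable LRU cache keyed by <code>ETag</code> headers mentioned earlier can be sketched with a JavaScript <code>Map</code>, which preserves insertion order. This is an illustration of the idea only, not netjet's actual implementation:</p>

```javascript
// A minimal ETag-keyed LRU cache sketch with a fixed capacity. It relies
// on Map preserving insertion order: the first key is always the least
// recently used entry.
class LruCache {
  constructor(max) {
    this.max = max;
    this.map = new Map();
  }
  get(etag) {
    if (!this.map.has(etag)) return undefined;
    // Re-insert to mark this entry as most recently used.
    const links = this.map.get(etag);
    this.map.delete(etag);
    this.map.set(etag, links);
    return links;
  }
  set(etag, links) {
    this.map.delete(etag);
    this.map.set(etag, links);
    if (this.map.size > this.max) {
      // Evict the least recently used entry (first key in the Map).
      this.map.delete(this.map.keys().next().value);
    }
  }
}

const cache = new LruCache(2);
cache.set('"etag-a"', ['</app.css>; rel=preload; as=style']);
cache.set('"etag-b"', ['</app.js>; rel=preload; as=script']);
cache.get('"etag-a"');     // touch "a" so "b" becomes least recently used
cache.set('"etag-c"', []); // evicts "b"
console.log(cache.get('"etag-b"')); // undefined
```

<p>Keying on <code>ETag</code> means a cached set of <code>Link</code> headers is reused only while the underlying page is unchanged.</p>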
            <p>See the <a href="https://www.npmjs.com/package/netjet">netjet documentation</a> for more information on the supported options.</p>
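<p>Under the hood, netjet parses the HTML with PostHTML and emits one <code>Link</code> entry per discovered asset. As a greatly simplified sketch (a naive regex scan instead of a real parser; the <code>buildLinkHeader</code> name is our own), the idea looks like this:</p>

```javascript
// Scan an HTML string for images, scripts, and stylesheets, and build the
// corresponding Link header value with rel=preload and an `as` hint.
function buildLinkHeader(html) {
  const patterns = [
    { regex: /<img[^>]+src="([^"]+)"/g, as: 'image' },
    { regex: /<script[^>]+src="([^"]+)"/g, as: 'script' },
    { regex: /<link[^>]+rel="stylesheet"[^>]+href="([^"]+)"/g, as: 'style' },
  ];
  const links = [];
  for (const { regex, as } of patterns) {
    let m;
    while ((m = regex.exec(html)) !== null) {
      links.push(`<${m[1]}>; rel=preload; as=${as}`);
    }
  }
  return links.join(',');
}

const header = buildLinkHeader(
  '<img src="/logo.png"><script src="/app.js"></script>'
);
console.log(header);
// </logo.png>; rel=preload; as=image,</app.js>; rel=preload; as=script
```

<p>A real implementation needs a proper parser - regexes break on unquoted attributes, single quotes, and many other valid HTML forms - which is exactly why netjet builds on PostHTML.</p>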
    <div>
      <h2>Seeing what’s pushed</h2>
      <a href="#seeing-whats-pushed">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2qCaQEt8gl74hBdoZVEzBS/d29a6028893b87b888cad5ba438b3a78/2016-08-02_10-49-33.png" />
            
            </figure><p>Chrome's Developer Tools make it easy to verify that your site is using Server Push. The Network tab shows pushed assets with "Push" included as part of the initiator.</p><p>Unfortunately, Firefox's Developer Tools don't yet directly expose whether a resource was pushed. You can, however, check for the <code>cf-h2-pushed</code> header in the page's response headers, which contains a list of resources that CloudFlare offered browsers over Server Push.</p><p>Contributions to improve netjet or the documentation are greatly appreciated. I'm excited to hear where people are using netjet.</p>
    <div>
      <h2>Ghost and Server Push</h2>
      <a href="#ghost-and-server-push">
        
      </a>
    </div>
    <p>Ghost is one such exciting integration. With the aid of the Ghost team, I've integrated netjet, and it has been available as an opt-in beta since version 0.8.0.</p><p>If you are running a Ghost instance, you can enable Server Push by modifying the server's <code>config.js</code> file and add the <code>preloadHeaders</code> option to the <code>production</code> configuration block.</p>
            <pre><code>production: { 
  url: 'https://my-ghost-blog.com', 
  preloadHeaders: 100, 
  // ... 
}</code></pre>
            <p>Ghost has put together <a href="http://support.ghost.org/preload-headers/">a support article</a> for Ghost(Pro) customers.</p>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>With netjet, your Node.js applications can start to use browser preloading and, when used with CloudFlare, HTTP/2 Server Push today.</p><p>At CloudFlare, we're excited to make tools to help increase the performance of websites. If you find this interesting, we are <a href="https://www.cloudflare.com/join-our-team/">hiring</a> in Austin, Texas; Champaign, Illinois; London; San Francisco; and Singapore.</p> ]]></content:encoded>
            <category><![CDATA[HTTP2]]></category>
            <category><![CDATA[php]]></category>
            <category><![CDATA[Server Push]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[JavaScript]]></category>
            <guid isPermaLink="false">257O1UR4TetOqSjjwVfDCV</guid>
            <dc:creator>Terin Stock</dc:creator>
        </item>
        <item>
            <title><![CDATA[HTTP/2 Server Push with multiple assets per Link header]]></title>
            <link>https://blog.cloudflare.com/http-2-server-push-with-multiple-assets-per-link-header/</link>
            <pubDate>Thu, 30 Jun 2016 12:09:46 GMT</pubDate>
            <description><![CDATA[ In April we announced that we had added experimental support for HTTP/2 Server Push to all CloudFlare web sites. We did this so that our customers could iterate on this new functionality.

 ]]></description>
            <content:encoded><![CDATA[ <p>In April we <a href="/announcing-support-for-http-2-server-push-2/">announced</a> that we had added experimental support for <a href="https://www.cloudflare.com/http2/server-push/">HTTP/2 Server Push</a> to all CloudFlare web sites. We did this so that our customers could iterate on this new functionality.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1Sxm8MH00rIckiWtjvxz71/9f4c2227696b67a4e19db037e18bb80d/1673801831_a93ecfc3c4_z.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a> <a href="https://www.flickr.com/photos/mryipyop/1673801831/in/photolist-3xUEYg-nMS6rx-jvSsLx-9oBV3s-8KH2YK-o54g8R-8h4bmw-eapSfS-cj4No3-nMSXLP-nMRZQJ-iNpH9k-hcr3m3-nMS6ui-5RDVrK-fAaYES-nMXwba-cNFT3N-iZDTWg-k5VzKt-jeJE8T-7bM32E-d3rznG-89jZ9e-aVLkBT-4y2kdD-qtKzjY-62Yv2h-5WcKeC-87jJdA-5Es3vn-bQ8W5a-7DhbKd-hE5oza-6NutL7-5WLwFt-hcnrny-5WAoU1-5mSoVV-8RSQ7A-gXiVcG-5Wp6pS-8GFkMr-hcoeN9-bC1zH5-ePG8BQ-hcokfv-hcpgme-hcqQ59-qUkfTa">image</a> by <a href="https://www.flickr.com/photos/mryipyop/">Mike</a></p><p>Our implementation of Server Push made use of the <a href="/announcing-support-for-http-2-server-push-2/#fnref:1">HTTP <code>Link</code></a> header as detailed in the W3C <a href="https://www.w3.org/TR/preload/">Preload</a> Working Draft.</p><p>We also showed how to make Server Push work from within <a href="/using-http-2-server-push-with-php/">PHP code</a> and many people started testing and using this feature.</p><p>However, there was a serious restriction in our initial version: it was not possible to specify more than one asset per <code>Link</code> header for Server Push, and many CMS and web development platforms would not allow multiple <code>Link</code> headers.</p><p>We have now addressed that problem, and it is possible to request that multiple assets be pushed in a single <code>Link</code> header. This change is live and was used to push assets in this blog post to your browser if your browser supports HTTP/2.</p><p>When CloudFlare reads a <code>Link</code> header sent by an origin web server, it will remove assets that it pushes from the <code>Link</code> header passed on to the web browser. That made it a little difficult to debug problems with <code>Link</code> and Server Push, so we have added a header called <code>Cf-H2-Pushed</code> which contains the assets that were pushed.</p><p>For example, browsing to this <a href="/the-complete-guide-to-golang-net-http-timeouts/">recent</a> blog post results in the origin web server sending the following headers:</p>
            <pre><code>Cache-Control: public, max-age=0
Content-Encoding: gzip
Content-Length: 33640
Content-Type: text/html; charset=utf-8
Date: Wed, 29 Jun 2016 16:09:37 GMT
Expires: Wed, 29 Jun 2016 16:10:07 GMT
Link: &lt;/assets/css/screen.css?v=5fc240c512&gt;; rel=preload; as=style,&lt;//cdn.bizible.com/scripts/bizible.js&gt;; rel=preload; as=script,&lt;/content/images/2016/06/Timeouts-001.png&gt;; rel=preload; as=image,&lt;/content/images/2016/06/Timeouts-002.png&gt;; rel=preload; as=image,&lt;//platform.linkedin.com/in.js&gt;; rel=preload; as=script,&lt;https://code.jquery.com/jquery-1.11.3.min.js&gt;; rel=preload; as=script,&lt;/assets/js/jquery.fitvids.js?v=5fc240c512&gt;; rel=preload; as=script
Vary: Accept-Encoding
X-Ghost-Cache-Status:From Cache
X-Powered-By: Express </code></pre>
            <p>CloudFlare decides to HTTP/2 Server Push the assets <code>/assets/css/screen.css?v=5fc240c512</code>, <code>/content/images/2016/06/Timeouts-001.png</code>, <code>/content/images/2016/06/Timeouts-002.png</code> and <code>/assets/js/jquery.fitvids.js?v=5fc240c512</code>. As the response passes through CloudFlare those assets are removed from the <code>Link</code> header, pushed and added to the <code>Cf-H2-Pushed</code> header:</p>
            <pre><code>cache-control:public, max-age=30
cf-cache-status:EXPIRED
cf-h2-pushed:&lt;/assets/css/screen.css?v=5fc240c512&gt;,&lt;/content/images/2016/06/Timeouts-001.png&gt;,&lt;/content/images/2016/06/Timeouts-002.png&gt;,&lt;/assets/js/jquery.fitvids.js?v=5fc240c512&gt;
content-encoding:gzip
content-type:text/html; charset=utf-8
date:Wed, 29 Jun 2016 16:09:37 GMT
expires:Wed, 29 Jun 2016 16:10:07 GMT
link:&lt;//cdn.bizible.com/scripts/bizible.js&gt;; rel=preload; as=script,&lt;//platform.linkedin.com/in.js&gt;; rel=preload; as=script,&lt;https://code.jquery.com/jquery-1.11.3.min.js&gt;; rel=preload; as=script
server:cloudflare-nginx
status:200 OK
vary:Accept-Encoding
x-ghost-cache-status:From Cache
x-powered-by:Express</code></pre>
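<p>The rewriting shown above can be sketched as a small function that splits a <code>Link</code> header into pushed and pass-through assets. This sketch only reproduces the observable behaviour - same-host paths are pushed, while protocol-relative and absolute URLs are left in the header - and the helper name is our own; CloudFlare's real decision logic is more involved:</p>

```javascript
// Split a multi-asset Link header into the assets to push (same-host
// paths) and the assets to pass through to the browser unchanged.
function splitLinkHeader(link) {
  const parts = link.split(/,(?=<)/); // split between assets, not inside them
  const pushed = [];
  const remaining = [];
  for (const part of parts) {
    const uri = part.match(/^<([^>]+)>/)[1];
    if (uri.startsWith('/') && !uri.startsWith('//')) {
      pushed.push(`<${uri}>`);
    } else {
      remaining.push(part);
    }
  }
  return {
    'cf-h2-pushed': pushed.join(','),
    link: remaining.join(','),
  };
}

const result = splitLinkHeader(
  '</screen.css>; rel=preload; as=style,<//cdn.example.com/a.js>; rel=preload; as=script'
);
console.log(result['cf-h2-pushed']); // </screen.css>
console.log(result.link); // <//cdn.example.com/a.js>; rel=preload; as=script
```

<p>Note how the pushed entries lose their <code>rel</code>/<code>as</code> attributes, matching the bare URI list seen in the <code>cf-h2-pushed</code> header above.</p>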
            <p>In Google Chrome Canary's Developer view the pushed assets can be clearly seen.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2jgGCv0Nr84rpBYTGKMQ5l/b27d125a7a1e3f827e15118c213e4666/Screen-Shot-2016-06-30-at-11-15-02.png" />
            
            </figure>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Please let us know how you use Server Push. We're particularly interested in experiences with pushing different types of resources (images vs. styles vs. scripts) and working out the optimal number of items to push (we currently allow up to 50 resources per page).</p> ]]></content:encoded>
            <category><![CDATA[HTTP2]]></category>
            <category><![CDATA[php]]></category>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[Server Push]]></category>
            <guid isPermaLink="false">4z1Btq5Q3lFHwOHSsHQG6n</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
        <item>
            <title><![CDATA[Using HTTP/2 Server Push with PHP]]></title>
            <link>https://blog.cloudflare.com/using-http-2-server-push-with-php/</link>
            <pubDate>Fri, 13 May 2016 13:11:28 GMT</pubDate>
            <description><![CDATA[ Two weeks ago CloudFlare announced that it was supporting HTTP/2 Server Push for all our customers. By simply adding a Link header to an HTTP response specifying preload CloudFlare would automatically push items to web browsers that support Server Push. ]]></description>
            <content:encoded><![CDATA[ <p>Two weeks ago CloudFlare announced that it was supporting <a href="/announcing-support-for-http-2-server-push-2/">HTTP/2 Server Push</a> for all our customers. By simply adding a <code>Link</code> header to an HTTP response specifying <code>preload</code>, CloudFlare would automatically push items to web browsers that support Server Push.</p><p>To illustrate how easy this is, I created a small PHP page that uses the PHP <a href="https://secure.php.net/manual/en/function.header.php"><code>header</code></a> function to insert appropriate <code>Link</code> headers to push images to the web browser via CloudFlare. The web page looks like this when loaded:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2KEen2kNGsrXXEKZZnQSni/27615c7b083c1aff54f087eb5225c1c6/Screen-Shot-2016-05-13-at-11-50-00-AM.png" />
            
            </figure><p>There are two images loaded from the same server both of which are pushed if the web browser supports Server Push. This is achieved by inserting two <code>Link</code> headers in the HTTP response. The response looks like:</p>
            <pre><code>HTTP/1.1 200 OK
Server: nginx/1.9.15
Date: Fri, 13 May 2016 10:52:13 GMT
Content-Type: text/html
Transfer-Encoding: chunked
Connection: keep-alive
Link: &lt;/images/drucken.jpg&gt;; rel=preload; as=image
Link: &lt;/images/empire.jpg&gt;; rel=preload; as=image</code></pre>
            <p>At the bottom are the two <code>Link</code> headers corresponding to the two images on the page, with the <code>rel=preload</code> directive as specified in the W3C <a href="https://www.w3.org/TR/preload/#server-push-http-2">preload draft</a>.</p><p>The complete code can be found in this <a href="https://gist.github.com/jgrahamc/df91229381366105c7ff0f88c8c38485">gist</a>, but the core of the code looks like this:</p>
            <pre><code>&lt;?php
function pushImage($uri) {
    header("Link: &lt;{$uri}&gt;; rel=preload; as=image", false);
    return &lt;&lt;&lt;HTML
&lt;img src="{$uri}"&gt;
HTML;
}

$image1 = pushImage("/images/drucken.jpg");
$image2 = pushImage("/images/empire.jpg");
?&gt;

&lt;html&gt;

&lt;head&gt;&lt;title&gt;PHP Server Push&lt;/title&gt;&lt;/head&gt;
&lt;body&gt;

&lt;h1&gt;PHP Server Push&lt;/h1&gt;

&lt;?php
echo ccbysa($image1, "https://bit.ly/1Wu5bYx",
    "https://www.flickr.com/photos/hiperactivo/", "Javier Candeira");

echo ccbynd($image2, "https://bit.ly/24PHue3",
    "https://www.flickr.com/photos/bobsfever/", "Robert McGoldrick");
?&gt;

&lt;/body&gt;
&lt;/html&gt;</code></pre>
            <p>Since you have to call the PHP <code>header</code> function before any output (such as HTML) has occurred, the code makes two calls to a helper function called <code>pushImage</code> first. <code>pushImage</code> adds the appropriate <code>Link</code> header and returns the HTML needed to insert the actual image in the page.</p><p>Later, the variables <code>$image1</code> and <code>$image2</code> (which contain the HTML needed to display the images) are inserted using two other helper functions (<code>ccbysa</code> and <code>ccbynd</code>) that add captions. Those two helper functions don't play any part in the Server Push; they just ensure that the HTML is placed correctly on the page with a caption.</p><p>Notice that the <code>Link</code> header is added as follows:</p>
            <pre><code>header("Link: &lt;{$uri}&gt;; rel=preload; as=image", false);</code></pre>
            <p>The <code>false</code> second parameter tells <code>header</code> to <i>not</i> override an existing <code>Link</code> header. With that option specified multiple <code>Link</code> headers can be added. Without it the last call to <code>pushImage</code> would win.</p>
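<p>The same append-versus-replace distinction can be sketched over a plain headers object (a hypothetical helper, for illustration; in Node.js you would instead pass an array of values to <code>response.setHeader</code>):</p>

```javascript
// Mimic PHP's header($h, $replace): replace = true clobbers any existing
// value, replace = false appends an additional header of the same name.
function addHeader(headers, name, value, replace = true) {
  if (replace || !(name in headers)) {
    headers[name] = [value];
  } else {
    headers[name].push(value);
  }
  return headers;
}

const headers = {};
addHeader(headers, 'Link', '</images/drucken.jpg>; rel=preload; as=image', false);
addHeader(headers, 'Link', '</images/empire.jpg>; rel=preload; as=image', false);
console.log(headers['Link'].length); // 2: both images will be pushed

addHeader(headers, 'Link', '</images/empire.jpg>; rel=preload; as=image', true);
console.log(headers['Link'].length); // 1: the last call "won"
```

<p>With <code>replace = false</code> both <code>Link</code> headers survive, which is exactly why the PHP code passes <code>false</code> as the second argument.</p>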
    <div>
      <h2>Effect of Server Push</h2>
      <a href="#effect-of-server-push">
        
      </a>
    </div>
    <p>To understand the effect of Server Push on the example I used Google Chrome Canary and loaded the page twice (once without the <code>Link</code> headers so that no push would occur and once with).</p><p>Here's the simple waterfall over HTTP/2 with no Server Push:</p>
            <figure>
            <a href="http://staging.blog.mrk.cfdata.org/content/images/2016/05/Screen-Shot-2016-05-13-at-12-22-24-PM.png">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5oN4WtS0666At1eTeB2Xkr/aa2a06fb0da68d4e5a068a944c1c6da9/Screen-Shot-2016-05-13-at-12-22-24-PM.png" />
            </a>
            </figure><p>The page load time was 651ms. You can see the initial page HTML load in 175ms followed by the two images.</p><p>Now here's the waterfall with HTTP/2 Server Push:</p>
            <figure>
            <a href="http://staging.blog.mrk.cfdata.org/content/images/2016/05/Screen-Shot-2016-05-13-at-12-20-58-PM.png">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2thpvABMv76ref5uY2LIhX/771d0c7d56cec749b84306fbbdd48ca7/Screen-Shot-2016-05-13-at-12-20-58-PM.png" />
            </a>
            </figure><p>The page load time was 251ms (the HTML loaded in 168ms) and Chrome is showing that the two images were pushed (see the Initiator column).</p><p>It's not 100% obvious what happened there, but digging into chrome://net-internals/ it's possible to see a detailed HTTP/2 timeline. I've edited out a few details of the protocol (such as the window size changes) to focus on the requests and responses.</p><p>The <code>t</code> value gives the tick time (1ms per tick).</p>
            <pre><code>t=910212 [st=   1]    HTTP2_SESSION_SEND_HEADERS
                      --&gt; exclusive = true
                      --&gt; fin = true
                      --&gt; has_priority = true
                      --&gt; :method: GET
                          :scheme: https
                          :path: /index.php
                          pragma: no-cache
                          cache-control: no-cache
                          upgrade-insecure-requests: 1
                          accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
                          accept-encoding: gzip, deflate, sdch, br
                          accept-language: en-US,en;q=0.8
                          cookie: [52 bytes were stripped]
                      --&gt; parent_stream_id = 0
                      --&gt; priority = 0
                      --&gt; stream_id = 1

[...]

t=910404 [st= 193]    HTTP2_SESSION_RECV_HEADERS
                      --&gt; fin = false
                      --&gt; :status: 200
                          date: Fri, 13 May 2016 11:28:16 GMT
                          content-type: text/html
                          content-encoding: gzip
                      --&gt; stream_id = 1
t=910405 [st= 194]    HTTP2_SESSION_RECV_PUSH_PROMISE
                      --&gt; :method: GET
                          :path: /images/drucken.jpg
                          :scheme: https
                          accept-encoding: gzip, deflate, sdch, br
                      --&gt; id = 1
                      --&gt; promised_stream_id = 2
t=910405 [st= 194]    HTTP2_SESSION_RECV_PUSH_PROMISE
                      --&gt; :method: GET
                          :path: /images/empire.jpg
                          :scheme: https
                          accept-encoding: gzip, deflate, sdch, br
                      --&gt; id = 1
                      --&gt; promised_stream_id = 4
t=910405 [st= 194]    HTTP2_SESSION_RECV_DATA
                      --&gt; fin = false
                      --&gt; size = 298
                      --&gt; stream_id = 1
t=910405 [st= 194]    HTTP2_SESSION_RECV_DATA
                      --&gt; fin = true
                      --&gt; size = 0
                      --&gt; stream_id = 1
t=910409 [st= 198]    HTTP2_SESSION_RECV_HEADERS
                      --&gt; fin = false
                      --&gt; :status: 200
                          date: Fri, 13 May 2016 11:28:16 GMT
                          content-type: image/jpeg
                          content-length: 49852
                          set-cookie: [124 bytes were stripped]
                          etag: "5735aac0-12f99"
                          last-modified: Fri, 13 May 2016 10:21:52 GMT
                          cf-cache-status: HIT
                          vary: Accept-Encoding
                          expires: Fri, 13 May 2016 15:28:16 GMT
                          cache-control: public, max-age=14400
                          accept-ranges: bytes
                          server: cloudflare-nginx
                      --&gt; stream_id = 2
t=910409 [st= 198]    HTTP2_SESSION_RECV_DATA
                      --&gt; fin = false
                      --&gt; size = 987
                      --&gt; stream_id = 2
t=910409 [st= 198]    HTTP2_SESSION_RECV_DATA
                      --&gt; fin = false
                      --&gt; size = 1369
                      --&gt; stream_id = 2
t=910410 [st= 199]    HTTP2_SESSION_RECV_DATA
                      --&gt; fin = false
                      --&gt; size = 1369
                      --&gt; stream_id = 2</code></pre>
            <p>The web browser sends an <code>HTTP2_SESSION_SEND_HEADERS</code> request asking for the web page (that's the first item above) and subsequently receives the response headers and page followed immediately by two push promises for the two images. It then immediately starts receiving image data (see the <code>stream_id</code> 2).</p><p>CloudFlare's web server is pushing the image contents before the browser asked for them. When CloudFlare pushes an item specified in a <code>Link</code> header the header itself is stripped (to prevent the browser from re-requesting the resource).</p>
    <div>
      <h2>The Future Starts Here</h2>
      <a href="#the-future-starts-here">
        
      </a>
    </div>
    <p>We released HTTP/2 Server Push support to help kick start innovative use of this critical feature of HTTP/2. We would love people to start experimenting with it to see how much of a speed improvement is possible with their specific websites.</p><p>As can be seen from this blog post making use of Server Push in a web application is easy: just insert <code>Link</code> headers with the appropriate format. Server Push will be particularly fast if the items being pushed are also stored in CloudFlare's cache (as was the case with the examples above).</p><p>Please let us know how you use Server Push. We're particularly interested in experiences with pushing different types of resources (images vs. styles vs. scripts) and working out the optimal number of items to push (we currently allow up to 50 resources per page).</p> ]]></content:encoded>
            <category><![CDATA[HTTP2]]></category>
            <category><![CDATA[Server Push]]></category>
            <category><![CDATA[php]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <guid isPermaLink="false">3nkrcgnTrCQJfUc4OyxsDC</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
        <item>
            <title><![CDATA[A Look at the New WordPress Brute Force Amplification Attack]]></title>
            <link>https://blog.cloudflare.com/a-look-at-the-new-wordpress-brute-force-amplification-attack/</link>
            <pubDate>Fri, 16 Oct 2015 17:14:32 GMT</pubDate>
            <description><![CDATA[ Recently, a new brute force attack method for WordPress instances was identified by Sucuri. This latest technique allows attackers to try a large number of WordPress username and password login combinations in a single HTTP request. ]]></description>
            <content:encoded><![CDATA[ <p>Recently, a new brute force attack method for WordPress instances was identified by Sucuri. This latest technique allows attackers to try a large number of WordPress username and password login combinations in a single HTTP request.</p><p>The vulnerability can easily be abused by a simple script to try a significant number of username and password combinations with a relatively small number of HTTP requests. The following diagram shows a 4-fold amplification of login attempts relative to HTTP requests, but this can trivially be expanded to a thousand logins per request.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6jdeVX3b5JHTMwZvY4J7wY/3db3519b348381101ef3bdd943e6e636/wordpress-xmlrpc-brute-force-amplification.png" />
            
            </figure><p>This form of brute force attack is harder to detect, since you won’t necessarily see a flood of requests. Fortunately, all CloudFlare paid customers have the option to enable a Web Application Firewall ruleset to stop this new attack method.</p>
    <div>
      <h3>What is XML-RPC?</h3>
      <a href="#what-is-xml-rpc">
        
      </a>
    </div>
    <p>To understand the vulnerability, it’s important to understand the basics of the XML remote procedure protocol (XML-RPC).</p><p>XML-RPC uses XML encoding over HTTP to provide a remote procedure call protocol. It’s commonly used to execute various <a href="https://codex.wordpress.org/XML-RPC_WordPress_API">functions in a WordPress instance</a> for APIs and other automated tasks. Requests that modify, manipulate, or view data using XML-RPC require user credentials with sufficient permissions.</p><p>Here is an example that requests a list of the user’s blogs:</p>
            <pre><code>&lt;?xml version="1.0" encoding="iso-8859-1"?&gt;
&lt;methodCall&gt;
&lt;methodName&gt;wp.getUsersBlogs&lt;/methodName&gt;
&lt;params&gt;
 &lt;param&gt;
  &lt;value&gt;
   &lt;string&gt;admin&lt;/string&gt;
  &lt;/value&gt;
 &lt;/param&gt;
 &lt;param&gt;
  &lt;value&gt;
   &lt;string&gt;password123&lt;/string&gt;
  &lt;/value&gt;
 &lt;/param&gt;
&lt;/params&gt;
&lt;/methodCall&gt;</code></pre>
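            <p>For illustration, Python's standard <code>xmlrpc.client</code> module can serialize the same request; the credentials here are the same illustrative ones from the XML above:</p>

```python
import xmlrpc.client

# Serialize a wp.getUsersBlogs call; this produces XML equivalent to the
# hand-written request above (credentials are illustrative).
payload = xmlrpc.client.dumps(("admin", "password123"), "wp.getUsersBlogs")
print(payload)
```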
            <p>The server responds with an XML message containing the requested information. The <code>isAdmin</code> name-value pair tells us our credentials were correct:</p>
            <pre><code>&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;methodResponse&gt;
&lt;params&gt;
	&lt;param&gt;
	  &lt;value&gt;
	  &lt;array&gt;&lt;data&gt;
	  &lt;value&gt;&lt;struct&gt;
	  &lt;member&gt;
		&lt;name&gt;isAdmin&lt;/name&gt;
		&lt;value&gt;&lt;boolean&gt;1&lt;/boolean&gt;&lt;/value&gt;
	  &lt;/member&gt;
	  &lt;member&gt;
		&lt;name&gt;url&lt;/name&gt;
		&lt;value&gt;&lt;string&gt;http://example.com/&lt;/string&gt;&lt;/value&gt;
	  &lt;/member&gt;
	  &lt;member&gt;
		&lt;name&gt;blogid&lt;/name&gt;
		&lt;value&gt;&lt;string&gt;1&lt;/string&gt;&lt;/value&gt;
	  &lt;/member&gt;
	  &lt;member&gt;
		&lt;name&gt;blogName&lt;/name&gt;
		&lt;value&gt;&lt;string&gt;testing&lt;/string&gt;&lt;/value&gt;
	  &lt;/member&gt;
	  &lt;member&gt;
		&lt;name&gt;xmlrpc&lt;/name&gt;
		&lt;value&gt;&lt;string&gt;http://example.com/xmlrpc.php&lt;/string&gt;&lt;/value&gt;
	  &lt;/member&gt;
	  &lt;/struct&gt;&lt;/value&gt;
	  &lt;/data&gt;&lt;/array&gt;
	  &lt;/value&gt;
	&lt;/param&gt;
&lt;/params&gt;
&lt;/methodResponse&gt;</code></pre>
            <p>As shown in this request, you must provide proper authentication to get a successful response. You can, in theory, create a script that tries different combinations of the username and password, but that is a noisy option that isn’t very effective and is easily detected (the server logs would show a flood of failed login attempts).</p><p>This is where the <code>system.multicall</code> functionality comes into play. You can run multiple methods with a single HTTP request. This is useful for mass editing blogs or deleting large numbers of comments, etc. Any method that requires authentication can be abused to brute force credentials. Here is what a sample XML <code>system.multicall</code> payload would look like:</p>
            <pre><code>&lt;?xml version="1.0"?&gt;
&lt;methodCall&gt;
&lt;methodName&gt;system.multicall&lt;/methodName&gt;
&lt;params&gt;
  &lt;param&gt;&lt;value&gt;&lt;array&gt;&lt;data&gt;
  &lt;value&gt;&lt;struct&gt;
  &lt;member&gt;
	&lt;name&gt;methodName&lt;/name&gt;
	&lt;value&gt;&lt;string&gt;wp.getUsersBlogs&lt;/string&gt;&lt;/value&gt;
  &lt;/member&gt;
  &lt;member&gt;
	&lt;name&gt;params&lt;/name&gt;&lt;value&gt;&lt;array&gt;&lt;data&gt;
	&lt;value&gt;&lt;array&gt;&lt;data&gt;
	&lt;value&gt;&lt;string&gt;admin&lt;/string&gt;&lt;/value&gt;
	&lt;value&gt;&lt;string&gt;password&lt;/string&gt;&lt;/value&gt;
	&lt;/data&gt;&lt;/array&gt;&lt;/value&gt;
	&lt;/data&gt;&lt;/array&gt;&lt;/value&gt;
  &lt;/member&gt;
  &lt;/struct&gt;&lt;/value&gt;
  &lt;value&gt;&lt;struct&gt;
  &lt;member&gt;
	&lt;name&gt;methodName&lt;/name&gt;
	&lt;value&gt;&lt;string&gt;wp.getUsersBlogs&lt;/string&gt;&lt;/value&gt;
  &lt;/member&gt;
  &lt;member&gt;
	&lt;name&gt;params&lt;/name&gt;
	&lt;value&gt;&lt;array&gt;&lt;data&gt;
	&lt;value&gt;&lt;array&gt;&lt;data&gt;
	  &lt;value&gt;&lt;string&gt;admin&lt;/string&gt;&lt;/value&gt;
	  &lt;value&gt;&lt;string&gt;password&lt;/string&gt;&lt;/value&gt;
	  &lt;/data&gt;&lt;/array&gt;&lt;/value&gt;
	&lt;/data&gt;&lt;/array&gt;&lt;/value&gt;
  &lt;/member&gt;
  &lt;/struct&gt;&lt;/value&gt;
  &lt;/data&gt;&lt;/array&gt;&lt;/value&gt;
  &lt;/param&gt;
&lt;/params&gt;
&lt;/methodCall&gt;</code></pre>
            <p>As you can see, this can lead to very obvious abuse.</p>
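            <p>A multicall payload like the one above can also be generated programmatically. Here is a hedged sketch using Python's standard <code>xmlrpc.client</code> module; the wordlist is a hypothetical stand-in:</p>

```python
import xmlrpc.client

# Hypothetical wordlist; each guess becomes one wp.getUsersBlogs call
# packed into a single system.multicall request body.
guesses = ["password", "letmein", "123456"]
calls = [
    {"methodName": "wp.getUsersBlogs", "params": ["admin", pw]}
    for pw in guesses
]
# One HTTP POST of this payload to xmlrpc.php attempts every guess at once.
payload = xmlrpc.client.dumps((calls,), "system.multicall")
print(payload.count("wp.getUsersBlogs"))  # one entry per guess
```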
    <div>
      <h3>Exploitation</h3>
      <a href="#exploitation">
        
      </a>
    </div>
    <p>During testing, I was able to call the method <code>wp.getUsersBlogs</code> 1,000 times in a single HTTP request (limited only by PHP memory issues). If a user creates a simple shell loop that executes one thousand times and runs a PHP script that crafts an HTTP request with one thousand method calls, all requiring authentication, then that user would be able to try one million unique logins in a very short period of time.</p><p>This makes brute forcing the login very fast and can exhaust a fairly large wordlist quickly. Also note that <code>wp.getUsersBlogs</code> isn’t the only RPC call requiring authentication: any RPC method which requires authentication can be used to attempt logins and brute force the WordPress credentials.</p>
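    <p>The amplification arithmetic above can be sketched directly (the per-request limit of 1,000 calls is the one observed in testing):</p>

```python
# 1,000 method calls per system.multicall request, repeated across 1,000
# HTTP requests in a shell loop, yields one million credential guesses.
calls_per_request = 1000
http_requests = 1000
total_guesses = calls_per_request * http_requests
print(total_guesses)  # 1000000
```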
    <div>
      <h3>CloudFlare Customers Are Protected</h3>
      <a href="#cloudflare-customers-are-protected">
        
      </a>
    </div>
    <p>When using CloudFlare with a Pro level plan or higher, you have the ability to turn on the Web Application Firewall (WAF) and take advantage of the new WordPress ruleset I created to mitigate this attack—all without any major interaction or supervision on your end.</p><p>Our WAF works by inspecting HTTP requests for patterns that line up with known attacks and malicious activity. If a request does appear to be malicious, we drop it at the edge so it never even reaches the customer’s origin server.</p><p>To enable the rule, navigate to your CloudFlare Firewall dashboard, and reference the rule named "Blocks amplified brute force attempts to xmlrpc.php" with the rule ID WP0018.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4UjJTAaaGerRJ928DM0NGv/cc569d79f7675aafdf10d83ef73ea4cc/enabling-xmlrpc-waf-rule.png" />
            
            </figure><p>That’s all there is to it. Now you are protected from the new WordPress XML-RPC brute force amplification attack.</p>
    <div>
      <h3>The Manual Solution</h3>
      <a href="#the-manual-solution">
        
      </a>
    </div>
    <p>Another way to mitigate this attack is by disabling the ability to call the <code>system.multicall</code> method in your WordPress installation by editing your <code>functions.php</code> file. Adding the function <code>mmx_remove_xmlrpc_methods()</code> will alleviate the problem, like so:</p>
            <pre><code>// Remove system.multicall from the list of available XML-RPC methods.
function mmx_remove_xmlrpc_methods( $methods ) {
	unset( $methods['system.multicall'] );
	return $methods;
}
// Hook into the xmlrpc_methods filter so the removal takes effect.
add_filter( 'xmlrpc_methods', 'mmx_remove_xmlrpc_methods' );</code></pre>
            
    <div>
      <h3>Final Thoughts</h3>
      <a href="#final-thoughts">
        
      </a>
    </div>
    <p>XML-RPC can be a useful tool for making changes to WordPress and other web applications; however, improper implementation of certain features can result in unintended consequences. Default-on methods like <code>system.multicall</code> and <code>pingback.ping</code> (we have a <a href="/wordpress-pingback-attacks-and-our-waf/">WAF rule for that one</a>, too) are just two examples of features that attackers can abuse.</p><p>Properly configuring the CloudFlare Web Application Firewall for your Internet-facing properties will protect you from such attacks with no changes to your server configuration.</p> ]]></content:encoded>
            <category><![CDATA[WordPress]]></category>
            <category><![CDATA[php]]></category>
            <category><![CDATA[Attacks]]></category>
            <category><![CDATA[Reliability]]></category>
            <guid isPermaLink="false">6gLyPOxWVFfvjEC44cNrMg</guid>
            <dc:creator>Pasha Kravtsov</dc:creator>
        </item>
        <item>
            <title><![CDATA[The CLAMP Stack]]></title>
            <link>https://blog.cloudflare.com/the-clamp-stack/</link>
            <pubDate>Mon, 15 Aug 2011 20:51:00 GMT</pubDate>
            <description><![CDATA[ Much of the modern Internet is built on what is known as the LAMP Stack. LAMP originally stood for Linux + Apache + MySQL + PHP. You'll find this basic stack at the core of companies like Google, Facebook, Yahoo, and more.  ]]></description>
            <content:encoded><![CDATA[ <p>Much of the modern Internet is built on what is known as the LAMP Stack. LAMP originally stood for Linux + Apache + MySQL + PHP. You'll find this basic stack at the core of companies like Google, Facebook, Yahoo, and more. While some argue that the P should stand for Python, and we at CloudFlare tend to prefer Postgres to MySQL and NGINX to Apache, the general point is the same: with these free foundational tools, great Internet applications can be made.</p><p>We were tickled the other day when we heard someone giving a talk on the "CLAMP" stack: CloudFlare + LAMP. More and more companies are using CloudFlare to extend the scalability of the LAMP stack from an individual machine to a global network. CloudFlare cuts the load on a typical LAMP stack web server by nearly 70%. That allows you to put twice the load on a single server without the LAMP burning out.</p><p>With integration into more and more of the <a href="http://www.cloudflare.com/hosting-partners">top hosting providers</a>, it's easier than ever to turn your LAMP into a CLAMP.</p> ]]></content:encoded>
            <category><![CDATA[php]]></category>
            <category><![CDATA[NGINX]]></category>
            <guid isPermaLink="false">6Zlw1FWfm0hsa5wYOuH9H2</guid>
            <dc:creator>Matthew Prince</dc:creator>
        </item>
    </channel>
</rss>