
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Wed, 08 Apr 2026 00:50:05 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Introducing HAR Sanitizer: secure HAR sharing]]></title>
            <link>https://blog.cloudflare.com/introducing-har-sanitizer-secure-har-sharing/</link>
            <pubDate>Thu, 26 Oct 2023 13:20:05 GMT</pubDate>
            <description><![CDATA[ As a follow-up to the most recent Okta breach, we are making a HAR file sanitizer available to everyone, not just Cloudflare customers, at no cost. ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6SLumYs48sPRowoZrb0BCp/84d8ff0d78c5d4a8498edfc136541bbe/image2-8.png" />
            
</figure><p>On Wednesday, October 18th, 2023, Cloudflare’s Security Incident Response Team (SIRT) discovered an attack on our systems that originated from an <a href="/how-cloudflare-mitigated-yet-another-okta-compromise/">authentication token stolen from one of Okta’s support systems</a>. No Cloudflare customer information or systems were impacted by the incident, thanks to the real-time detection and rapid action of our SIRT in tandem with our <a href="https://www.cloudflare.com/learning/security/glossary/what-is-zero-trust/">Zero Trust security posture</a> and use of hardware keys. With that said, we’d rather not repeat the experience — and so we have built a new security tool that can help organizations render this type of attack obsolete for good.</p><p>The bad actor in the Okta breach compromised user sessions by capturing session tokens from administrators at Cloudflare and other impacted organizations. They did this by infiltrating Okta’s customer support system and stealing one of the most common mechanisms for troubleshooting — an HTTP Archive (HAR) file.</p><p>HAR files contain a record of a user’s browser session, a kind of step-by-step audit that a user can share with someone like a help desk agent to diagnose an issue. However, the file can also contain sensitive information that can be used to launch an attack.</p><p>As a follow-up to the Okta breach, we are making a <a href="http://har-sanitizer.pages.dev/">HAR file sanitizer</a> available to everyone, not just Cloudflare customers, at no cost. We are publishing this tool under an <a href="https://github.com/cloudflare/har-sanitizer">open source license</a> and are making it available to any support, engineering or security team. At Cloudflare, we are committed to making the Internet a better place, and using HAR files without the threat of stolen sessions should be part of the future of the Internet.</p>
    <div>
      <h2>HAR Files - a look back in time</h2>
      <a href="#har-files-a-look-back-in-time">
        
      </a>
    </div>
    <p>Imagine being able to rewind time and revisit every single step a user took during a web session, scrutinizing each request and the responses the browser received.</p><p><a href="https://en.wikipedia.org/wiki/HAR_%28file_format%29">HAR (HTTP Archive)</a> files are JSON-formatted archives of a web browser’s interaction with a web application. HAR files provide a detailed snapshot of every request, including headers, cookies, and other types of data sent to a web server by the browser. This makes them an invaluable resource for troubleshooting web application issues, especially for complex, layered web applications.</p><p>The snapshot that a HAR file captures can contain the following information:</p><p><b>Complete Request and Response Headers:</b> Every piece of data sent and received, including method types (GET, POST, etc.), status codes, URLs, cookies, and more.</p><p><b>Payload Content:</b> Details of what was actually exchanged between the client and server, which can be essential for diagnosing issues related to data submission or retrieval.</p><p><b>Timing Information:</b> Precise timing breakdowns of each phase – from DNS lookup, connection time, and SSL handshake, to content download – giving insight into performance bottlenecks.</p><p>This information can be difficult to gather from an application’s logs due to the diverse nature of devices, browsers and networks used to access an application. Collecting it manually would take a user dozens of steps; a HAR file gives them a one-click option to share diagnostic information with another party. The file format is also standard, providing the developers, support teams, and administrators on the other side of the exchange with a consistent input to their own tooling. This minimizes the frustrating back-and-forth where teams try to recreate a user-reported problem, ensuring that everyone is, quite literally, on the same page.</p>
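<p>Since a HAR file is plain JSON, it can be inspected with a few lines of code. A minimal sketch in Python (the field names follow the HAR format; the entry itself is a made-up example, normally read from a file exported via the browser DevTools):</p>

```python
import json

# A tiny in-memory HAR document; real files come from the browser's
# "Save all as HAR" option in DevTools.
har = json.loads("""
{
  "log": {
    "entries": [
      {
        "request": {
          "method": "GET",
          "url": "https://example.com/dashboard",
          "cookies": [{"name": "session_id", "value": "abc123"}],
          "headers": [{"name": "User-Agent", "value": "Mozilla/5.0"}]
        },
        "response": {"status": 200},
        "time": 42.0
      }
    ]
  }
}
""")

# List every request plus any cookies it carried -- the data a help desk
# agent wants to see, and exactly what an attacker would look for.
for entry in har["log"]["entries"]:
    req = entry["request"]
    cookie_names = [c["name"] for c in req.get("cookies", [])]
    print(req["method"], req["url"], entry["response"]["status"], cookie_names)
```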
    <div>
      <h2>HAR files as an attack vector</h2>
      <a href="#har-files-as-an-attack-vector">
        
      </a>
    </div>
    <p>HAR files, while powerful, come with a cautionary note. Within the set of information they contain, session cookies make them a target for malicious actors.</p>
    <div>
      <h3>The Role of Session Cookies</h3>
      <a href="#the-role-of-session-cookies">
        
      </a>
    </div>
    <p>Before diving into the risks, it's crucial to understand the role of session cookies. A session cookie is sent from a server and stored on a user's browser to maintain stateful information across web sessions for that user. In simpler terms, it’s how the browser keeps you logged into an application for a period of time even if you close the page. Generally, these cookies live in local memory on a user’s browser and are not often shared. However, a HAR file is one of the most common ways that a session cookie could be inadvertently shared.</p>
    <div>
      <h3>Dangers of a stolen session cookie</h3>
      <a href="#dangers-of-a-stolen-session-cookie">
        
      </a>
    </div>
    <p>If a HAR file with a valid session cookie is shared, there are a number of potential security threats that the user, and their company, may be exposed to:</p><p><b>Unauthorized Access:</b> The biggest risk is unauthorized access. If a HAR file with a session cookie lands in the wrong hands, it grants entry to the user’s account for that application. For platforms that store personal data or financial details, the consequences of such a breach can be catastrophic, especially if the session cookie of a user with administrative or elevated permissions is stolen.</p><p><b>Session Hijacking:</b> Armed with a session cookie, attackers can impersonate legitimate users, a tactic known as session hijacking. This can lead to a range of malicious activities, from spreading misinformation to siphoning off funds.</p><p><b>Persistent Exposure:</b> Unlike other forms of data, a session cookie's exposure risk doesn't necessarily end when a user session does. Depending on the cookie's lifespan, malicious actors could gain prolonged access, repeatedly compromising a user's digital interactions.</p><p><b>Gateway to Further Attacks:</b> With access to a user's session, especially an administrator’s, attackers can probe for other vulnerabilities, exploit platform weaknesses, or jump to other applications.</p>
    <div>
      <h2>Mitigating the impact of a stolen HAR file</h2>
      <a href="#mitigating-the-impact-of-a-stolen-har-file">
        
      </a>
    </div>
    <p>Thankfully, there are ways to render a HAR file inert even if stolen by an attacker. One of the most effective methods is to “sanitize” a HAR file of any session-related information before sharing it for debugging purposes.</p><p>The <a href="http://har-sanitizer.pages.dev/">HAR sanitizer</a> we are introducing today allows a user to upload any HAR file, and the tool will strip out any session-related cookies or JSON Web Tokens (JWT). The tool is built entirely on Cloudflare Workers, and all sanitization is done client-side, which means Cloudflare never sees the full contents of the session token.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/50hGJi9BlyGNoT428LPEJL/1119b4ec03eebf7de4eefa7a5561638c/image1-8.png" />
            
            </figure>
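<p>The heart of any such sanitizer is a walk over the archive’s entries that blanks out session-bearing values. A rough illustration of the idea in Python (this is not the tool’s actual code, and the cookie and header names treated as sensitive here are examples only):</p>

```python
import json

# Example names only -- a real sanitizer would use a much longer list.
SENSITIVE_COOKIES = {"session_id", "CF_Authorization"}
SENSITIVE_HEADERS = {"authorization", "cookie", "set-cookie"}

def sanitize_har(har: dict) -> dict:
    """Redact session-related cookie and header values in place."""
    for entry in har["log"]["entries"]:
        for section in (entry.get("request", {}), entry.get("response", {})):
            for cookie in section.get("cookies", []):
                if cookie["name"] in SENSITIVE_COOKIES:
                    cookie["value"] = "REDACTED"
            for header in section.get("headers", []):
                if header["name"].lower() in SENSITIVE_HEADERS:
                    header["value"] = "REDACTED"
    return har

har = {"log": {"entries": [{"request": {
    "cookies": [{"name": "session_id", "value": "secret"}],
    "headers": [{"name": "Cookie", "value": "session_id=secret"}]}}]}}
clean = sanitize_har(har)
print(json.dumps(clean))
```

The same walk works client-side in the browser, which is what keeps the unsanitized token off any third-party server.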
    <div>
      <h3>Just enough sanitization</h3>
      <a href="#just-enough-sanitization">
        
      </a>
    </div>
    <p>By default, the sanitizer will remove all session-related cookies and tokens — but there are some cases where these are essential for troubleshooting. For these scenarios, we are implementing a way to conditionally strip “just enough” data from the HAR file to render it safe, while still giving support teams the information they need.</p><p>The first product we’ve optimized the HAR sanitizer for is <a href="https://developers.cloudflare.com/cloudflare-one/policies/access/">Cloudflare Access</a>. Access relies on a user’s <a href="https://developers.cloudflare.com/cloudflare-one/identity/authorization-cookie/application-token/">JWT</a> — a compact token often used for secure authentication — to verify that a user should have access to the requested resource. This means a JWT plays a crucial role in troubleshooting issues with Cloudflare Access. We have tuned the HAR sanitizer to strip the cryptographic signature out of the Access JWT, rendering it inert, while still providing useful information for internal admins and Cloudflare support to debug issues.</p><p>Because HAR files can include a diverse array of data types, selectively sanitizing them is not a case of ‘one size fits all’. We will continue to expand support for other popular authentication tools to ensure we strip out “just enough” information.</p>
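<p>The signature-stripping trick works because a JWT is three base64url-encoded segments, header.payload.signature; removing the third segment keeps the claims readable for debugging while making the token cryptographically useless. A hypothetical sketch in Python (the token and claims here are invented for illustration):</p>

```python
import base64
import json

def strip_jwt_signature(token: str) -> str:
    """Keep the JWT header and payload (useful for debugging) but drop
    the signature, so the token can no longer authenticate anyone."""
    header, payload, _signature = token.split(".")
    return f"{header}.{payload}."

def b64url(obj: dict) -> str:
    """Base64url-encode a JSON object, without padding, as JWTs do."""
    raw = json.dumps(obj, separators=(",", ":")).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

# A made-up token for illustration.
token = ".".join([b64url({"alg": "RS256", "typ": "JWT"}),
                  b64url({"sub": "user@example.com", "exp": 1700000000}),
                  "fake-signature"])
print(strip_jwt_signature(token))
```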
    <div>
      <h2>What’s next</h2>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>Over the coming months, we will launch additional security controls in Cloudflare Zero Trust to further mitigate attacks stemming from session tokens stolen from HAR files. This will include:</p><ul><li><p>Enhanced Data Loss Prevention (DLP) file type scanning to include HAR file and session token detections, to ensure users in your organization cannot share unsanitized files.</p></li><li><p>Expanded API CASB scanning to detect HAR files with session tokens in collaboration tools like Zendesk, Jira, Drive and O365.</p></li><li><p>Automated HAR sanitization of data in popular collaboration tools.</p></li></ul><p>As always, we continue to expand our Cloudflare One Zero Trust suite to protect organizations of all sizes against an ever-evolving array of threats. Ready to get started? <a href="https://www.cloudflare.com/products/zero-trust/">Sign up here</a> to begin using Cloudflare One at no cost for teams of up to 50 users.</p>
            <category><![CDATA[Tools]]></category>
            <category><![CDATA[Open Source]]></category>
            <guid isPermaLink="false">5Le8RmeoVTzjhB1qvPodhM</guid>
            <dc:creator>Kenny Johnson</dc:creator>
        </item>
        <item>
            <title><![CDATA[Project Crossbow: Lessons from Refactoring a Large-Scale Internal Tool]]></title>
            <link>https://blog.cloudflare.com/project-crossbow-lessons-from-refactoring-a-large-scale-internal-tool/</link>
            <pubDate>Tue, 07 Apr 2020 07:00:00 GMT</pubDate>
            <description><![CDATA[ Crossbow is a tool that now allows Cloudflare’s Technical Support Engineers to perform diagnostic activities from running commands (like traceroutes, cURL requests and DNS queries) to debugging product features and performance using bespoke tools. ]]></description>
            <content:encoded><![CDATA[ 
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4DD8CfPAPxEZXvXNS1u6cW/46988ed6b45099cd2827d63a190ee721/Crossbow-tool_2x-1.png" />
          </figure><p>Cloudflare’s <a href="https://www.cloudflare.com/network/">global network</a> currently spans 200 cities in more than 90 countries. Engineers working in product, technical support and operations often need to be able to debug network issues from particular locations or individual servers.</p><p>Crossbow is the internal tool for doing just this, allowing Cloudflare’s Technical Support Engineers to perform diagnostic activities from running commands (like traceroutes, cURL requests and DNS queries) to debugging product features and performance using bespoke tools.</p><p>In September last year, an Engineering Manager at Cloudflare asked to transition Crossbow from a Product Engineering team to the Support Operations team. The tool had been a secondary focus and had been transitioned through multiple engineering teams without any of them developing subject-matter knowledge.</p><p>The Support Operations team at Cloudflare is closely aligned with Cloudflare’s Technical Support Engineers, developing diagnostic tooling and Natural Language Processing technology to drive efficiency. Based on this alignment, it was decided that Support Operations was the best team to own this tool.</p>
    <div>
      <h3>Learning from Sisyphus</h3>
      <a href="#learning-from-sisyphus">
        
      </a>
    </div>
    <p>Whilst seeking advice on the transition process, an SRE Engineering Manager in Cloudflare suggested reading “<a href="https://landing.google.com/sre/resources/practicesandprocesses/case-study-community-driven-software-adoption/">A Case Study in Community-Driven Software Adoption</a>”. This book proved a truly invaluable read for anyone thinking of doing internal tool development or contributing to such tooling. The book describes why multiple tools are often created for the same purpose by different autonomous teams, and how this issue can be overcome. The book also describes challenges and approaches to gaining adoption of tooling, especially where this requires some behaviour change for the engineers who use such tools.</p><p>That said, there are some things we learnt along the way while taking over Crossbow and refactoring and revamping a large-scale internal tool. This blog post seeks to be an addendum to such guidance and provide some further practical advice.</p><p>In this blog post we won’t dwell too much on the work of the Cloudflare Support Operations team, but this can be found in the SRECon talk: “<a href="https://www.usenix.org/conference/srecon19emea/presentation/ali">Support Operations Engineering: Scaling Developer Products to the Millions</a>”. The software development methodology used in Cloudflare’s Support Operations Group closely resembles <a href="http://www.extremeprogramming.org/">Extreme Programming</a>.</p>
    <div>
      <h3>Cutting The Fat</h3>
      <a href="#cutting-the-fat">
        
      </a>
    </div>
    <p>There were two ways of using Crossbow: a CLI (command line interface) and a UI embedded in the internal tool used by Cloudflare’s Technical Support Engineers. Maintaining both interfaces clearly had significant overhead for improvement efforts, and we took the decision to deprecate one of them. This allowed us to focus our efforts on one platform to achieve large-scale improvements across technology, usability and functionality.</p><p>We set up a poll to allow engineering, operations, solutions engineering and technical support teams to provide their feedback on how they used the tooling. Polling was not only critical for gaining vital information on how different teams used the tool, but also ensured that, prior to deprecation, people knew their views were taken on board. We polled not only on the option people preferred, but also on which options they felt were necessary to them and the reasons why.</p><p>We found that the reasons for favouring the web UI primarily revolved around the absence of documentation and training. By contrast, we discovered those who used the CLI found it far more critical to their workflow. Product Engineering teams do not routinely have access to the support UI but some found it necessary to use Crossbow for their jobs, and users wanted to be able to automate commands with shell scripts.</p><p>Technically, the UI was in JavaScript with an <a href="https://www.cloudflare.com/learning/security/api/what-is-an-api-gateway/">API Gateway</a> service that converted HTTP requests to gRPC, alongside some configuration to allow it to work in the support UI. The CLI directly interfaced with the gRPC API, so it was a simpler system. Given the Cloudflare Support Operations team primarily works on Systems Engineering projects and had limited UI resources, the decision to deprecate the UI was also in our own interest.</p><p>We rolled out a new internal Crossbow user group, trained up teams, created new documentation, provided advance notification of deprecation, and finally retired the source code of these services. We also dramatically improved the CLI user experience through simple improvements to the help information and easier CLI usage.</p>
    <div>
      <h3>Rearchitecting Pub/Sub with Cloudflare Access</h3>
      <a href="#rearchitecting-pub-sub-with-cloudflare-access">
        
      </a>
    </div>
    <p>One of the primary challenges we encountered was how the system architecture for Crossbow was designed many years ago. A gRPC API ran commands at Cloudflare’s edge network using a configuration management tool which the SRE team expressed a desire to deprecate (with Crossbow being the last user of it).</p><p>During a visit to the Singapore office, the local Edge SRE Engineering Manager wanted his team to understand Crossbow and how to contribute to it. During this meeting, we provided an overview of the current architecture, and the team there were forthcoming in providing potential refactoring ideas to handle global network stability and move away from the old pipeline. This provided invaluable insight into the common issues with the existing technical approach and instances where the tool would fail, requiring Technical Support Engineers to consult the SRE team.</p><p>We decided to adopt a simpler pub/sub pipeline: instead, the edge network would expose a gRPC daemon that would listen for new jobs, execute them, and then make a callback to the API service with the results (which would be relayed on to the client).</p><p>For authentication between the API service and the client, or the API service and the network edge, we implemented a <a href="https://developers.cloudflare.com/access/setting-up-access/json-web-token/">JWT authentication</a> scheme. For a CLI user, authentication was done by querying an HTTP endpoint behind Cloudflare Access <a href="https://developers.cloudflare.com/access/cli/connecting-from-cli/">using cloudflared</a>, which provided a JWT the client could use for <a href="https://grpc.io/docs/guides/auth/">authentication with gRPC</a>. 
In practice, this looks something like this:</p><ol><li><p>CLI makes request to authentication server using cloudflared</p></li><li><p>Authentication server responds with signed JWT token</p></li><li><p>CLI makes gRPC request with JWT authentication token to API service</p></li><li><p>API service validates token using a public key</p></li></ol><p>The gRPC API endpoint was placed on <a href="https://www.cloudflare.com/products/cloudflare-spectrum/">Cloudflare Spectrum</a>; as users were authenticated using Cloudflare Access, we could remove the requirement for users to be on the company VPN to use the tool. The new authentication pipeline, combined with a single user interface, also allowed us to improve the collection of metrics and usage logs of the tool.</p>
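<p>The four steps above can be sketched end to end. This is an illustration only: HMAC with a shared demo key stands in for the real scheme (which, as described, validates tokens with a public key), and the claim names are made up:</p>

```python
import base64, hashlib, hmac, json, time

SECRET = b"demo-signing-key"  # stand-in for the auth server's private key

def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs do."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def issue_token(user: str) -> str:
    """Step 2: the authentication server responds with a signed JWT."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps({"sub": user, "exp": int(time.time()) + 3600}).encode())
    sig = b64url(hmac.new(SECRET, f"{header}.{payload}".encode(), hashlib.sha256).digest())
    return f"{header}.{payload}.{sig}"

def validate_token(token: str) -> bool:
    """Step 4: the API service checks the signature before running a job."""
    header, payload, sig = token.split(".")
    expected = b64url(hmac.new(SECRET, f"{header}.{payload}".encode(), hashlib.sha256).digest())
    return hmac.compare_digest(sig, expected)

token = issue_token("tse@example.com")        # steps 1-2: CLI obtains a token
print(validate_token(token))                  # steps 3-4: API accepts it
print(validate_token(token + "x"))            # a tampered signature is rejected
```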
    <div>
      <h3>Risk Management</h3>
      <a href="#risk-management">
        
      </a>
    </div>
    <blockquote><p>Risk is inherent in the activities undertaken by engineering professionals, meaning that members of the profession have a significant role to play in managing and limiting it.
- <a href="https://www.engc.org.uk/standards-guidance/guidance/guidance-on-risk/">Guidance on Risk</a>, Engineering Council</p></blockquote><p>As with all engineering projects, it was critical to manage risk. However, the risk to manage differs between engineering projects. Availability wasn’t the largest factor, given that Technical Support Engineers could escalate issues to the SRE team if the tool wasn’t available. The main risk was the security of the Cloudflare network and ensuring Crossbow did not affect the availability of any other services. To this end we took methodical steps to improve isolation and engaged the InfoSec team early to assist with specification and code reviews of the new pipeline. Where a risk to availability existed, we ensured the risk/reward trade-off was properly communicated to the support team and the internal Crossbow user group.</p>
    <div>
      <h3>Feedback, Build, Refactor, Measure</h3>
      <a href="#feedback-build-refactor-measure">
        
      </a>
    </div>
    <p>The Support Operations team at Cloudflare works using a methodology based on Extreme Programming. A key tenet of Extreme Programming is Test Driven Development, often described as a “red-green-green” pattern or “<a href="https://www.codecademy.com/articles/tdd-red-green-refactor">red-green-refactor</a>”. First the engineer enshrines the requirements in tests, then they make those tests pass, and then they refactor to improve code quality before pushing the software.</p><p>As we took on this project, the Cloudflare Support and SRE teams were working on Project Baton - an effort to allow Technical Support Engineers to handle more customer escalations without handover to the SRE teams.</p><p>As part of this effort, they had already created an invaluable resource in the form of a feature wish list for Crossbow. We associated JIRAs with all these items and prioritised this work to deliver such feature requests using a Test Driven Development workflow and the introduction of Continuous Integration. Critically, we measured such improvements once deployed. Adding simple functionality like support for MTR (a Linux network diagnostic tool) and exposing support for different cURL flags provided improvements in usage.</p><p>We were also able to embed Crossbow support for other tools available at the network edge created by other teams, allowing them to maintain such tools and expose features to Crossbow users. Through the creation of an improved development environment and documentation, we were able to drive Product Engineering teams to contribute functionality that was in the mutual interest of them and the customer support team.</p><p>Finally, we owned a number of tools which Technical Support Engineers used to discover what Cloudflare configuration was applied to a given URL and to perform distributed performance testing; we deprecated these tools and rolled them into Crossbow. Another tool, called Edge Worker Debug and owned by the <a href="https://workers.cloudflare.com/">Cloudflare Workers</a> team, was also rolled into Crossbow, and that team deprecated their standalone tool.</p>
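<p>The red-green-refactor loop is easy to show in miniature. A hypothetical sketch, where <code>parse_colo</code> is an invented helper used only for illustration: the test is written first (red), then the minimal implementation that makes it pass (green), then the code is cleaned up while the test keeps us honest (refactor):</p>

```python
import unittest

# Red: enshrine the requirement in a test first. This fails until
# parse_colo exists and behaves correctly.
class TestParseColo(unittest.TestCase):
    def test_parses_datacenter_number(self):
        self.assertEqual(parse_colo("192.0.2.1 reached colo 107"), 107)

# Green: the minimal implementation that makes the test pass.
# Refactor: tidy it up afterwards, re-running the test each time.
def parse_colo(line: str) -> int:
    """Extract the trailing data center number from a diagnostic line."""
    return int(line.rsplit(" ", 1)[-1])

suite = unittest.TestLoader().loadTestsFromTestCase(TestParseColo)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```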
    <div>
      <h3>Results</h3>
      <a href="#results">
        
      </a>
    </div>
    <p>From implementing user analytics on the tool on 16 December 2019 to the week ending 22 January 2020, we found a usage increase of 4.5x. This growth primarily happened within a four-week period; by adding the most wanted functionality, we were able to achieve a critical saturation of usage amongst Technical Support Engineers.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/48JaaCMLrwL9EWw9WtDNP5/4958cf9e7a94f0759983f475ca168d4b/image1.png" />
          </figure><p>Beyond this point, it became critical to use the number of checks being run as a metric to evaluate how useful the tool was. For example, the week starting January 27 saw no meaningful increase in unique users (a 14% usage increase over the previous week - within the normal fluctuation of stable usage). However, over the same timeframe, we saw a 2.6x increase in the number of tests being run - coinciding with the introduction of a number of new high-usage features.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ndjgHwGG3rVpuwOczuc6L/8158b2a71dbf87aa7dfeaf4650118ed8/pasted-image-0--6-.png" />
          </figure>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Through removing low-value/high-maintenance functionality and merciless refactoring, we were able to dramatically improve the quality of Crossbow and therefore the velocity of delivery. We drove up usage by adding the means to measure usage, running feature-request feedback loops with users, and practising test-driven development. Consolidation of tooling reduced the overhead of developing support tooling across the business, providing a common framework for developing and exposing functionality for Technical Support Engineers.</p><p>There are two key counterintuitive learnings from this project. The first is that cutting functionality can drive usage, provided this is done intelligently. In our case, the web UI contained no additional functionality that wasn’t in the CLI, yet caused substantial engineering overhead for maintenance. By deprecating this functionality, we were able to reduce technical debt and thereby improve the velocity of delivering more important functionality. This effort requires effective communication of the decision-making process and involvement from those who are impacted by such a decision.</p><p>Secondly, tool development efforts are often guided by user feedback but lack a means of objectively measuring such improvements. When logging is added, it is often done purely for security and audit purposes. Whilst feedback loops with users are invaluable, it is critical to have an objective measure of how successful a feature is and how it is used. 
Effective measurement drives the decision making process of future tooling and therefore, in the long run, the usage data can be more important than the original feature itself.</p><p>If you're interested in debugging interesting technical problems on a network with these tools, we're hiring for <a href="https://www.cloudflare.com/careers/jobs/?department=Customer+Support">Support Engineers</a> (including Security Operations, Technical Support and Support Operations Engineering) in San Francisco, Austin, Champaign, London, Lisbon, Munich and Singapore.</p> ]]></content:encoded>
            <category><![CDATA[Tools]]></category>
            <category><![CDATA[Cloudflare Access]]></category>
            <category><![CDATA[Support]]></category>
            <category><![CDATA[Spectrum]]></category>
            <guid isPermaLink="false">17EMKPLIbfOeVwXlT4cDK8</guid>
            <dc:creator>Junade Ali</dc:creator>
            <dc:creator>Peter Weaver</dc:creator>
        </item>
        <item>
            <title><![CDATA[When Bloom filters don't bloom]]></title>
            <link>https://blog.cloudflare.com/when-bloom-filters-dont-bloom/</link>
            <pubDate>Mon, 02 Mar 2020 13:00:00 GMT</pubDate>
            <description><![CDATA[ Last month I finally had an opportunity to use Bloom filters. I became fascinated with the promise of this data structure, but I quickly realized it had some drawbacks. This blog post is the tale of my brief love affair with Bloom filters. ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4bQ9cbvVLCJTUntwCHSGSp/570583980831b19e4da88411fdff5eda/bloom-filter_2x.png" />
            
            </figure><p>I've known about <a href="https://en.wikipedia.org/wiki/Bloom_filter">Bloom filters</a> (named after Burton Bloom) since university, but I haven't had an opportunity to use them in anger. Last month this changed - I became fascinated with the promise of this data structure, but I quickly realized it had some drawbacks. This blog post is the tale of my brief love affair with Bloom filters.</p><p>While doing research about <a href="/the-root-cause-of-large-ddos-ip-spoofing/">IP spoofing</a>, I needed to examine whether the source IP addresses extracted from packets reaching our servers were legitimate, depending on the geographical location of our data centers. For example, source IPs belonging to a legitimate Italian ISP should not arrive in a Brazilian datacenter. This problem might sound simple, but in the ever-evolving landscape of the internet this is far from easy. Suffice it to say I ended up with many large text files with data like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4EyhhPLD0IymvVMr3R8pMz/2c30f18b77fdee9184c438f80e94781b/Screenshot-from-2020-03-01-23-57-10.png" />
            
            </figure><p>This reads as: the IP 192.0.2.1 was recorded reaching Cloudflare data center number 107 with a legitimate request. This data came from many sources, including our active and passive probes, logs of certain domains we own (like cloudflare.com), public sources (like the BGP table), etc. The same line would usually be repeated across multiple files.</p><p>I ended up with a gigantic collection of data of this kind. At some point I counted 1 billion lines across all the harvested sources. I usually write bash scripts to pre-process the inputs, but at this scale this approach wasn't working. For example, removing duplicates from this tiny file of a meager 600MiB and 40M lines took... about an eternity:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7lzqpR8gHJRr6XC4fbcepJ/91185ef0b6036ecd484bd791f5a72651/Screenshot-from-2020-03-01-23-25-19a.png" />
            
            </figure><p>Suffice it to say that deduplicating lines using the usual bash commands like 'sort' in various configurations (see '--parallel', '--buffer-size' and '--unique') was not optimal for such a large data set.</p>
    <div>
      <h2>Bloom filters to the rescue</h2>
      <a href="#bloom-filters-to-the-rescue">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/vR5EB9TdKmVJDjStuvpwL/fe51959f24694cc5c072f3263aa42ab0/Bloom_filter.png" />
            
            </figure><p><a href="https://en.wikipedia.org/wiki/Bloom_filter#/media/File:Bloom_filter.svg">Image</a> by <a href="https://commons.wikimedia.org/wiki/User:David_Eppstein">David Eppstein</a>, Public Domain</p><p>Then I had a brainwave - it's not necessary to sort the lines! I just need to remove duplicated lines - using some kind of "set" data structure should be much faster. Furthermore, I roughly know the cardinality of the input file (the number of unique lines), and I can live with some data points being lost - using a probabilistic data structure is fine!</p><p>Bloom filters are a perfect fit!</p><p>While you should go and read the <a href="https://en.wikipedia.org/wiki/Bloom_filter#Algorithm_description">Wikipedia page on Bloom filters</a>, here is how I look at this data structure.</p><p>How would you implement a "<a href="https://en.wikipedia.org/wiki/Set_(abstract_data_type)">set</a>"? Given a perfect hash function and infinite memory, we could just create an infinite bit array and set the bit number 'hash(item)' for each item we encounter. This would give us a perfect "set" data structure. Right? Trivial. Sadly, hash functions have collisions and infinite memory doesn't exist, so we have to compromise in our reality. But we can calculate and manage the probability of collisions. For example, imagine we have a good hash function and 128GiB of memory. We can calculate that the probability of the second item added to the bit array colliding with the first is 1 in 1,099,511,627,776 (the number of bits in 128GiB). The probability of collision worsens as we add more items and fill up the bit array.</p><p>Furthermore, we could use more than one hash function, and end up with a denser bit array. This is exactly what Bloom filters optimize for. 
A Bloom filter is a bunch of math on top of four variables:</p><ul><li><p>'n' - The number of input elements (cardinality)</p></li><li><p>'m' - Memory used by the bit-array</p></li><li><p>'k' - Number of hash functions computed for each input</p></li><li><p>'p' - Probability of a false positive match</p></li></ul><p>Given the 'n' input cardinality and the 'p' desired false positive probability, the Bloom filter math returns the required 'm' memory and 'k' number of hash functions.</p><p>Check out this excellent visualization by Thomas Hurst showing how the parameters influence each other:</p><ul><li><p><a href="https://hur.st/bloomfilter/">https://hur.st/bloomfilter/</a></p></li></ul>
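<p>The relationship between these four variables is easy to play with in code. The following sketch uses the textbook Bloom filter formulas; it is not code from 'mmuniq-bloom' itself, and the helper name is mine:</p>

```go
package main

import (
	"fmt"
	"math"
)

// bloomParams returns the optimal bit-array size m and number of
// hash functions k for n expected items and a target false-positive
// probability p, using the standard Bloom filter formulas:
//   m = -n*ln(p)/(ln 2)^2,  k = (m/n)*ln 2
func bloomParams(n int, p float64) (m, k int) {
	mf := -float64(n) * math.Log(p) / (math.Ln2 * math.Ln2)
	return int(math.Ceil(mf)), int(math.Round(mf / float64(n) * math.Ln2))
}

func main() {
	// 40M unique lines, one false positive per 10k lines.
	m, k := bloomParams(40000000, 0.0001)
	fmt.Printf("m = %d bits (about %.0f MiB), k = %d\n",
		m, float64(m)/8/(1<<20), k)
}
```

<p>For the 40M-line scenario this lands at roughly 91 MiB and k=13; rounding the bit array up to the next power of two gives the 128MiB filter that appears later in the post.</p>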
    <div>
      <h2>mmuniq-bloom</h2>
      <a href="#mmuniq-bloom">
        
      </a>
    </div>
    <p>Guided by this intuition, I set out on a journey to add a new tool to my toolbox - 'mmuniq-bloom', a probabilistic tool that, given input on STDIN, returns only unique lines on STDOUT, hopefully much faster than the 'sort' + 'uniq' combo!</p><p>Here it is:</p><ul><li><p><a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-02-mmuniq/mmuniq-bloom.c">'mmuniq-bloom.c'</a></p></li></ul><p>For simplicity and speed I designed 'mmuniq-bloom' with a couple of assumptions. First, unless otherwise instructed, it uses 8 hash functions (k=8). This seems to be a close-to-optimal number for the data sizes I'm working with, and the hash function can quickly output 8 decent hashes. Then we align 'm', the number of bits in the bit array, to be a power of two. This is to avoid the pricey % modulo operation, which compiles down to slow assembly 'div'. With power-of-two sizes we can just do a bitwise AND. (For a fun read, see <a href="https://stackoverflow.com/questions/41183935/why-does-gcc-use-multiplication-by-a-strange-number-in-implementing-integer-divi">how compilers can optimize some divisions by using multiplication by a magic constant</a>.)</p><p>We can now run it against the same data file we used before:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7eWA2sCxdTwbHW0JVE4dLm/ffffd6f6823305f41a15d0091d1acaef/image11.png" />
            
</figure><p>Oh, this is so much better! 12 seconds is much more manageable than the 2 minutes before. But hold on... The program is using an optimized data structure, a relatively limited memory footprint, optimized line-parsing and good output buffering... 12 seconds is still an eternity compared to the 'wc -l' tool:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/10yS38GPKuBTqMMd4ZJVSe/199b9ceea61eb91700012362ef92e0ab/image5.png" />
            
</figure><p>What is going on? I understand that counting lines with 'wc' is <i>easier</i> than figuring out unique lines, but is it really worth the 26x difference? Where does all the CPU in 'mmuniq-bloom' go?</p><p>It must be my hash function. 'wc' doesn't need to spend all this CPU performing all this strange math for each of the 40M input lines. I'm using a pretty non-trivial 'siphash24' hash function, so it surely burns the CPU, right? Let's check by running the code that computes the hash function but does <i>not</i> do any Bloom filter operations:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4cmgeZbT0nJNh9bwRC4gEH/b5a71c8c573d506f20f6b6bc01173028/image2.png" />
            
</figure><p>This is strange. Computing the hash function indeed costs about 2s, but the program took 12s in the previous run. The Bloom filter alone takes 10 seconds? How is that possible? It's such a simple data structure...</p>
    <div>
      <h2>A secret weapon - a profiler</h2>
      <a href="#a-secret-weapon-a-profiler">
        
      </a>
    </div>
    <p>It was time to use a proper tool for the task - let's fire up a profiler and see where the CPU goes. First, let's run 'strace' to confirm we are not running any unexpected syscalls:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/29N25pGnOTso9WBxn4SckO/4a8bbb0dccf4dfb5fd2a248625d29bd9/image14.png" />
            
</figure><p>Everything looks good. The 10 calls to 'mmap', each taking 4ms (3971 us), are intriguing, but they're fine. We pre-populate memory up front with 'MAP_POPULATE' to save on page faults later.</p><p>What is the next step? Of course Linux's 'perf'!</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/632ygHAtmgUaNH18WsUPAG/b9f3990786d96bf84b2bc28d5a55f0e3/image10.png" />
            
            </figure><p>Then we can see the results:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/61GXvk0ujwFsP0vEDyrD6c/ab4ec96b0e748f5dd8e6e496d1b499b4/image6.png" />
            
</figure><p>Right, so we indeed burn 87.2% of cycles in our hot code. Let's see exactly where. Running 'perf annotate process_line --source' quickly shows something I didn't expect.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3i10E5RjlO3vvJTuA751wv/da7db1a7614e3936218206689767e752/image3.png" />
            
</figure><p>You can see 26.90% of the CPU burned in the 'mov', but that's not all of it! The compiler correctly inlined the function and unrolled the loop 8-fold. Summed up, that 'mov' (the 'uint64_t v = *p' line) accounts for a great majority of the cycles!</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1sTjVPwEz6BILccJz8pdDa/2e3435805a1e092460dcb916f4d6ecb2/image4.png" />
            
</figure><p>Clearly 'perf' must be mistaken. How can such a simple line cost so much? We can repeat the benchmark with any other profiler and it will show us the same problem. For example, I like using 'google-perftools' with kcachegrind since they emit eye-candy charts:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7fWWh8EZelfOMfwA5XMT16/4ded280402c936f2a6f67e2930115f5e/Screenshot-from-2020-03-02-00-08-23.png" />
            
            </figure><p>The rendered result looks like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2p5hkIX3zVAO7IFFWEVZuF/14f1f9505b91e15c904dab7309a11907/image13.png" />
            
</figure><p>Allow me to summarise what we found so far.</p><p>The generic 'wc' tool takes 0.45s of CPU time to process a 600MiB file. Our optimized 'mmuniq-bloom' tool takes 12 seconds, and the CPU is burned on one 'mov' instruction dereferencing memory...</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1eRwMPBq6BEAyWB0g10EDo/3802664ab1b47d94b96845f487338a08/6784957048_4661ea7dfc_c.jpg" />
            
            </figure><p><a href="https://flickr.com/photos/jonicdao/6784957048">Image</a> by <a href="https://flickr.com/photos/jonicdao/">Jose Nicdao</a> CC BY/2.0</p><p>Oh! I how could I have forgotten. Random memory access <i>is</i> slow! It's very, very, very slow!</p><p>According to the general rule <a href="http://highscalability.com/blog/2011/1/26/google-pro-tip-use-back-of-the-envelope-calculations-to-choo.html">"latency numbers every programmer should know about"</a>, one RAM fetch is about 100ns. Let's do the math: 40 million lines, 8 hashes counted for each line. Since our Bloom filter is 128MiB, on <a href="/gen-x-performance-tuning/">our older hardware</a> it doesn't fit into L3 cache! The hashes are uniformly distributed across the large memory range - each hash generates a memory miss. Adding it together that's...</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/ORPcRAqG2H2xqeEEdGbmh/ef6968c82bfe0aa44706a4f36e59bb1c/Screenshot-from-2020-03-02-00-34-29.png" />
            
            </figure><p>That suggests 32 seconds burned just on memory fetches. The real program is faster, taking only 12s. This is because, although the Bloom filter data does not completely fit into L3 cache, it still gets some benefit from caching. It's easy to see with 'perf stat -d':</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7pgVS7zKpVhueO7rAhQDRI/5145db41e7b06f5875d1da4adfa92142/image9.png" />
            
</figure><p>Right, so we should have had at least 320M LLC-load-misses, but we had only 280M. This still doesn't explain why the program was running only 12 seconds. But it doesn't really matter. What matters is that the number of cache misses is a real problem and we can only fix it by reducing the number of memory accesses. Let's try tuning the Bloom filter to use only one hash function:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/30BFuINOqYvVgCCdYXCzUB/0893742e7bc41b923af6e67f53629049/image12.png" />
            
</figure><p>Ouch! That really hurt! The Bloom filter required 64 GiB of memory to get our desired false positive probability of one error per 10k lines. This is terrible!</p><p>Also, it doesn't seem like we improved much. It took the OS 22 seconds to prepare memory for us, but we still burned 11 seconds in userspace. I guess this time any benefits from hitting memory less often were offset by a lower cache-hit probability due to the drastically increased memory size. In previous runs we required only 128MiB for the Bloom filter!</p>
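<p>The 64 GiB figure checks out with back-of-the-envelope math: with one hash function the false positive probability after n inserts is roughly n/m, so the filter needs about n/p bits, which the tool then rounds up to a power of two so it can replace modulo with a bitwise AND. A sketch (helper names are mine):</p>

```go
package main

import (
	"fmt"
	"math"
)

// singleHashBits estimates the bit-array size a k=1 Bloom filter
// needs: the false-positive rate is roughly n/m, so m ~= n/p bits.
func singleHashBits(n int, p float64) float64 {
	return float64(n) / p
}

// nextPow2 rounds a bit count up to a power of two, mirroring the
// power-of-two sizing that lets 'h & (m-1)' stand in for 'h % m'.
func nextPow2(bits float64) uint64 {
	return uint64(1) << uint(math.Ceil(math.Log2(bits)))
}

func main() {
	raw := singleHashBits(40000000, 0.0001) // 4e11 bits
	m := nextPow2(raw)                      // 2^39 bits
	fmt.Printf("%.1f GiB raw, %d GiB rounded\n", raw/8/(1<<30), m/8/(1<<30))
	// prints: 46.6 GiB raw, 64 GiB rounded
}
```

<p>Four hundred billion bits is about 46.6 GiB, and the next power of two is 2^39 bits - exactly the 64 GiB the benchmark showed.</p>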
    <div>
      <h2>Dumping Bloom filters altogether</h2>
      <a href="#dumping-bloom-filters-altogether">
        
      </a>
    </div>
    <p>This is getting ridiculous. To get the same false positive guarantees we must either use many hash functions in the Bloom filter (like 8), and therefore many memory operations, or use 1 hash function with enormous memory requirements.</p><p>We aren't really constrained by available memory; instead we want to optimize for fewer memory accesses. All we need is a data structure that requires at most 1 memory miss per item and uses less than 64 GiB of RAM...</p><p>While we could think of more sophisticated data structures like a <a href="https://en.wikipedia.org/wiki/Cuckoo_filter">Cuckoo filter</a>, maybe we can be simpler. How about a good old simple hash table with linear probing?</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6PVz5hd2DqyiraxgMJ7KIp/37706f6ffcc1cd52f544d626381deaeb/linear-probing.png" />
            
            </figure><p><a href="https://www.sysadmins.lv/blog-en/array-search-hash-tables-behind-the-scenes.aspx">Image</a> by <a href="https://www.sysadmins.lv/about.aspx">Vadims Podāns</a></p>
    <div>
      <h2>Welcome mmuniq-hash</h2>
      <a href="#welcome-mmuniq-hash">
        
      </a>
    </div>
    <p>Here you can find a tweaked version of mmuniq-bloom, but using a hash table:</p><ul><li><p><a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-02-mmuniq/mmuniq-hash.c">'mmuniq-hash.c'</a></p></li></ul><p>Instead of storing bits as for the Bloom filter, we are now storing 64-bit hashes from the <a href="https://idea.popcount.org/2013-01-24-siphash/">'siphash24' function</a>. This gives us much stronger probability guarantees, with a probability of false positives much better than one error in 10k lines.</p><p>Let's do the math. Adding a new item to a hash table containing, say, 40M entries has a '40M/2^64' chance of hitting a hash collision. This is about one in 461 billion - a reasonably low probability. But we are not adding one item to a pre-filled set! Instead we are adding 40M lines to the initially empty set. As per the <a href="https://en.wikipedia.org/wiki/Birthday_problem">birthday paradox</a>, this has much higher chances of hitting a collision at some point. A decent approximation is 'n^2/2m', which in our case is '(40M^2)/(2*(2^64))'. This is a chance of one in 23000. In other words, assuming we are using a good hash function, one in every 23 thousand random sets of 40M items will have a hash collision. This chance of hitting a collision is non-negligible, but it's still better than a Bloom filter and totally acceptable for my use case.</p><p>The hash table code runs faster, has better memory access patterns and better false positive probability than the Bloom filter approach.</p>
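<p>The one-in-23-thousand figure follows directly from that approximation. A quick check, assuming ideally uniform 64-bit hashes (the helper is hypothetical):</p>

```go
package main

import "fmt"

// birthdayBound approximates the probability that any two of n
// uniformly random 64-bit hashes collide, using the n^2/(2m)
// birthday approximation with m = 2^64 possible hash values.
func birthdayBound(n float64) float64 {
	const m = 1 << 64
	return n * n / (2 * m)
}

func main() {
	p := birthdayBound(40000000)
	fmt.Printf("p = %.2e, about 1 in %.0f\n", p, 1/p)
	// prints: p = 4.34e-05, about 1 in 23058
}
```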
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7i3tnYhlVqJ7NEfutgywVD/dafe27be80727cd61533f548457bc4a3/image7.png" />
            
</figure><p>Don't be scared by the "hash conflicts" line; it just indicates how full the hash table was. We are using linear probing, so when a bucket is already used, we just pick the next empty bucket. In our case we had to skip over 0.7 buckets on average to find an empty slot in the table. This is fine and, since we iterate over the buckets in linear order, we can expect the memory to be nicely prefetched.</p><p>From the previous exercise we know our hash function takes about 2 seconds of this time. Therefore, it's fair to say the 40M memory hits take around 4 seconds.</p>
    <div>
      <h2>Lessons learned</h2>
      <a href="#lessons-learned">
        
      </a>
    </div>
    <p>Modern CPUs are really good at sequential memory access when it's possible to predict memory fetch patterns (see <a href="https://en.wikipedia.org/wiki/Cache_prefetching#Methods_of_hardware_prefetching">Cache prefetching</a>). Random memory access, on the other hand, is very costly.</p><p>Advanced data structures are very interesting, but beware. Modern computers require cache-optimized algorithms. When working with large datasets that don't fit in L3, prefer optimizing for a reduced number of loads over optimizing the amount of memory used.</p><p>I guess it's fair to say that Bloom filters are great, as long as they fit into the L3 cache. The moment this assumption is broken, they are terrible. This is not news: Bloom filters optimize for memory usage, not for memory access. For example, see <a href="https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf">the Cuckoo Filters paper</a>.</p><p>Another thing is the everlasting discussion about hash functions. Frankly - in most cases it doesn't matter. The cost of computing even complex hash functions like 'siphash24' is small compared to the cost of random memory access. In our case simplifying the hash function will bring only small benefits. The CPU time is simply spent somewhere else - waiting for memory!</p><p>One colleague often says: "You can assume modern CPUs are infinitely fast. They run at infinite speed until they <a href="http://www.di-srv.unisa.it/~vitsca/SC-2011/DesignPrinciplesMulticoreProcessors/Wulf1995.pdf">hit the memory wall</a>".</p><p>Finally, don't follow my mistakes - everyone should start profiling with 'perf stat -d' and look at the "Instructions per cycle" (IPC) counter. If it's below 1, it generally means the program is stuck waiting for memory. Values above 2 would be great; it would mean the workload is mostly CPU-bound. Sadly, I have yet to see high values in the workloads I'm dealing with...</p>
    <div>
      <h2>Improved mmuniq</h2>
      <a href="#improved-mmuniq">
        
      </a>
    </div>
    <p>With the help of my colleagues I've prepared a further improved version of the hash-table-based 'mmuniq' tool. See the code:</p><ul><li><p><a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-02-mmuniq/mmuniq.c">'mmuniq.c'</a></p></li></ul><p>It can dynamically resize the hash table to support inputs of unknown cardinality. Then, by using batching, it can effectively use the 'prefetch' CPU hint, speeding up the program by 35-40%. Beware, sprinkling the code with 'prefetch' rarely works. Instead, I specifically changed the flow of the algorithm to take advantage of this instruction. With all the improvements I got the run time down to 2.1 seconds:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6cnwoob2q1SVG27I7zqals/87621808c31bd4707e08e816a4a58480/Screenshot-from-2020-03-01-23-52-18.png" />
            
            </figure>
    <div>
      <h2>The end</h2>
      <a href="#the-end">
        
      </a>
    </div>
    <p>Writing this basic tool, which tries to be faster than the 'sort | uniq' combo, revealed some hidden gems of modern computing. With a bit of work we were able to speed it up from more than two minutes to 2 seconds. During this journey we learned about random memory access latency and the power of cache-friendly data structures. Fancy data structures are exciting, but in practice reducing random memory loads often brings better results.</p> ]]></content:encoded>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[Optimization]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[Tools]]></category>
            <guid isPermaLink="false">3CPWTXjZJXbtWVNIawBWsd</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[Three little tools: mmsum, mmwatch, mmhistogram]]></title>
            <link>https://blog.cloudflare.com/three-little-tools-mmsum-mmwatch-mmhistogram/</link>
            <pubDate>Tue, 04 Jul 2017 10:32:20 GMT</pubDate>
            <description><![CDATA[ In a recent blog post, my colleague Marek talked about some SSDP-based DDoS activity we'd been seeing recently. In that blog post he used a tool called mmhistogram to output an ASCII histogram. ]]></description>
            <content:encoded><![CDATA[ <p>In a recent blog post, my colleague <a href="/author/marek-majkowski/">Marek</a> talked about some <a href="/ssdp-100gbps/">SSDP-based DDoS</a> activity we'd been seeing recently. In that blog post he used a tool called <code>mmhistogram</code> to output an ASCII histogram.</p><p>That tool is part of a small suite of command-line tools that can be handy when messing with data. Since a reader asked for them to be open sourced... here they are.</p>
    <div>
      <h3>mmhistogram</h3>
      <a href="#mmhistogram">
        
      </a>
    </div>
    <p>Suppose you have the following CSV of the ages of major Star Wars characters at the time of Episode IV:</p>
            <pre><code>Anakin Skywalker (Darth Vader),42
Boba Fett,32
C-3PO,32
Chewbacca,200
Count Dooku,102
Darth Maul,54
Han Solo,29
Jabba the Hutt,600
Jango Fett,66
Jar Jar Binks,52
Lando Calrissian,31
Leia Organa (Princess Leia),19
Luke Skywalker,19
Mace Windu,72
Obi-Wan Kenobi,57
Palpatine,82
Qui-Gon Jinn,92
R2-D2,32
Shmi Skywalker,72
Wedge Antilles,21
Yoda,896</code></pre>
            <p>You can get an ASCII histogram of the ages as follows using the <code>mmhistogram</code> tool.</p>
            <pre><code>$ cut -d, -f2 epiv | mmhistogram -t "Age"
Age min:19.00 avg:123.90 med=54.00 max:896.00 dev:211.28 count:21
Age:
 value |-------------------------------------------------- count
     0 |                                                   0
     1 |                                                   0
     2 |                                                   0
     4 |                                                   0
     8 |                                                   0
    16 |************************************************** 8
    32 |                         ************************* 4
    64 |             ************************************* 6
   128 |                                            ****** 1
   256 |                                                   0
   512 |                                      ************ 2</code></pre>
            <p>Handy for getting a quick sense of the data. (These charts are inspired by the <a href="/revenge-listening-sockets/">ASCII output from systemtap</a>).</p>
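<p>The buckets in the chart are powers of two: reading the output above, a value appears to land in bucket 'b' when b &lt; value &le; 2b. That rule is my inference from the chart, not mmhistogram's actual source; a sketch:</p>

```go
package main

import (
	"fmt"
	"math/bits"
)

// bucket returns the power-of-two bucket label for v, assuming
// bucket b covers the range b < v <= 2b (my reading of the chart).
func bucket(v uint64) uint64 {
	if v <= 1 {
		return v
	}
	return uint64(1) << uint(bits.Len64(v-1)-1)
}

func main() {
	// The Star Wars ages from the CSV above.
	ages := []uint64{42, 32, 32, 200, 102, 54, 29, 600, 66, 52, 31,
		19, 19, 72, 57, 82, 92, 32, 72, 21, 896}
	counts := map[uint64]int{}
	for _, a := range ages {
		counts[bucket(a)]++
	}
	for _, b := range []uint64{16, 32, 64, 128, 256, 512} {
		fmt.Printf("%3d: %d\n", b, counts[b])
	}
	// Reproduces the chart's counts: 8, 4, 6, 1, 0 and 2.
}
```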
    <div>
      <h3>mmwatch</h3>
      <a href="#mmwatch">
        
      </a>
    </div>
    <p>The <code>mmwatch</code> tool is handy if you want to look at output from a command-line tool that provides a snapshot of values, but you need rates instead.</p><p>For example, here's <code>df -H</code> on my machine:</p>
            <pre><code>$ df -H
Filesystem             Size   Used  Avail Capacity  iused   ifree %iused  Mounted on
/dev/disk1             250G   222G    28G    89% 54231161 6750085   89%   /
devfs                  384k   384k     0B   100%     1298       0  100%   /dev
map -hosts             0B     0B     0B   100%        0       0  100%   /net
map auto_home          0B     0B     0B   100%        0       0  100%   /home
/dev/disk4             7.3G    50M   7.2G     1%    12105 1761461    1%   /Volumes/LANGDON</code></pre>
            <p>Now imagine you were interested in understanding the rate of change in iused and ifree. You can with <code>mmwatch</code>. It's just like <code>watch</code> but looks for changing numbers and interprets them as rates:</p>
            <pre><code>$ mmwatch 'df -H'</code></pre>
            <p>Here's a short GIF showing it working:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7n0JNaLVIvtD7cqC9kfasI/c1626e6be3fb8bf8b5086ddf7ecde187/mmwatch.gif" />
            
            </figure>
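<p>The core idea can be sketched in a few lines: take two snapshots of a command's output and, wherever a numeric token changed, report the difference divided by the sampling interval. This is a toy illustration, not mmwatch's actual implementation:</p>

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

var numRe = regexp.MustCompile(`\d+`)

// rates pairs up the numeric tokens of two snapshots of the same
// command's output and converts each number's change into a
// per-second rate over the sampling interval.
func rates(prev, cur string, intervalSec float64) []float64 {
	a := numRe.FindAllString(prev, -1)
	b := numRe.FindAllString(cur, -1)
	out := make([]float64, 0, len(b))
	for i := 0; i < len(a) && i < len(b); i++ {
		x, _ := strconv.ParseFloat(a[i], 64)
		y, _ := strconv.ParseFloat(b[i], 64)
		out = append(out, (y-x)/intervalSec)
	}
	return out
}

func main() {
	// Two hypothetical df snapshots, two seconds apart.
	prev := "iused 54231161 ifree 6750085"
	cur := "iused 54231561 ifree 6749685"
	fmt.Println(rates(prev, cur, 2)) // prints: [200 -200]
}
```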
    <div>
      <h3>mmsum</h3>
      <a href="#mmsum">
        
      </a>
    </div>
    <p>And the final tool is <code>mmsum</code>, which simply sums a list of floating-point numbers (one per line).</p><p>Suppose you are downloading real-time rainfall data from the UK's Environment Agency and would like to know the total current rainfall. <code>mmsum</code> can help:</p>
            <pre><code>$ curl -s 'https://environment.data.gov.uk/flood-monitoring/id/measures?parameter=rainfall' | jq -e '.items[].latestReading.value+0' | ./mmsum
40.2</code></pre>
            <p>All these tools can be found on the Cloudflare <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2017-06-29-ssdp/">GitHub</a>.</p> ]]></content:encoded>
            <category><![CDATA[Tools]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[ASCII]]></category>
            <category><![CDATA[DDoS]]></category>
            <category><![CDATA[Attacks]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Reliability]]></category>
            <guid isPermaLink="false">4y5kxTJk106JdQKEiQDj95</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
        <item>
            <title><![CDATA[Building the simplest Go static analysis tool]]></title>
            <link>https://blog.cloudflare.com/building-the-simplest-go-static-analysis-tool/</link>
            <pubDate>Wed, 27 Apr 2016 15:01:15 GMT</pubDate>
            <description><![CDATA[ Go native vendoring (a.k.a. GO15VENDOREXPERIMENT) allows you to freeze dependencies by putting them in a vendor folder in your project. The compiler will then look there before searching the GOPATH. ]]></description>
            <content:encoded><![CDATA[ <p><a href="https://docs.google.com/document/d/1Bz5-UB7g2uPBdOx-rw5t9MxJwkfpx90cqG9AFL0JAYo/edit">Go native vendoring</a> (a.k.a. GO15VENDOREXPERIMENT) allows you to freeze dependencies by putting them in a <code>vendor</code> folder in your project. The compiler will then look there before searching the GOPATH.</p><p>The only annoyance compared to using a per-project GOPATH, which is what we used to do, is that you might forget to vendor a package that you have in your GOPATH. The program will build for you, but it won't for anyone else. Back to the <a href="https://www.urbandictionary.com/define.php?term=wfm">WFM</a> times!</p><p>I decided I wanted something, a tool, to check that all my (non-stdlib) dependencies were vendored.</p><p>At first I thought of using <a href="https://golang.org/cmd/go/#hdr-List_packages"><code>go list</code></a>, which Dave Cheney appropriately called a <a href="http://dave.cheney.net/2014/09/14/go-list-your-swiss-army-knife">swiss army knife</a>, but while it can show the entire recursive dependency tree (format <code>.Deps</code>), there's no way to know from the templating engine if a dependency is in the standard library.</p><p>We could just pass each output back into <code>go list</code> to check for <code>.Standard</code>, but I thought this would be a good occasion to build a very simple static analysis tool. Go's simplicity and libraries make it a very easy task, as you will see.</p>
    <div>
      <h3>First, loading the program</h3>
      <a href="#first-loading-the-program">
        
      </a>
    </div>
    <p>We use <a href="https://godoc.org/golang.org/x/tools/go/loader"><code>golang.org/x/tools/go/loader</code></a> to load the packages passed as arguments on the command line, including the test files based on a flag.</p>
            <pre><code>var conf loader.Config
for _, p := range flag.Args() {
    if *tests {
        conf.ImportWithTests(p)
    } else {
        conf.Import(p)
    }
}
prog, err := conf.Load()
if err != nil {
    log.Fatal(err)
}
for p := range prog.AllPackages {
    fmt.Println(p.Path())
}</code></pre>
            <p>With these few lines we already replicated <code>go list -f {{ .Deps }}</code>!</p><p>The only missing loading feature here is wildcard (<code>./...</code>) support. That code <a href="https://github.com/golang/go/blob/87bca88c703c1f14fe8473dc2f07dc521cf2b989/src/cmd/go/main.go#L365">is in the go tool source</a> and it's unexported. There's an <a href="https://github.com/golang/go/issues/8768">issue</a> about exposing it, but for now packages <a href="https://github.com/golang/lint/blob/58f662d2fc0598c6c36a92ae29af1caa6ec89d7a/golint/import.go">are just copy-pasting it</a>. We'll use a packaged version of that code, <a href="https://github.com/kisielk/gotool"><code>github.com/kisielk/gotool</code></a>:</p>
            <pre><code>for _, p := range gotool.ImportPaths(flag.Args()) {</code></pre>
            <p>Finally, since we are only interested in the dependency tree today, we instruct the parser to go only as far as the import statements, and we ignore the resulting "not used" errors:</p>
            <pre><code>conf.ParserMode = parser.ImportsOnly
conf.AllowErrors = true
conf.TypeChecker.Error = func(error) {}</code></pre>
            
    <div>
      <h3>Then, the actual logic</h3>
      <a href="#then-the-actual-logic">
        
      </a>
    </div>
    <p>We now have a <code>loader.Program</code> object, which holds references to various <code>loader.PackageInfo</code> objects, which in turn are a combination of package, AST and type information: everything you need to perform any kind of complex analysis. Not that we are going to do that today :)</p><p>We'll just replicate <a href="https://github.com/golang/go/blob/87bca88c703c1f14fe8473dc2f07dc521cf2b989/src/cmd/go/pkg.go#L183-L194">the <code>go list</code> logic to recognize stdlib packages</a> and remove the packages passed on the command line from the list:</p>
            <pre><code>initial := make(map[*loader.PackageInfo]bool)
for _, pi := range prog.InitialPackages() {
    initial[pi] = true
}

var packages []*loader.PackageInfo
for _, pi := range prog.AllPackages {
    if initial[pi] {
        continue
    }
    if len(pi.Files) == 0 {
        continue // virtual stdlib package
    }
    filename := prog.Fset.File(pi.Files[0].Pos()).Name()
    if !strings.HasPrefix(filename, build.Default.GOROOT) ||
        !isStandardImportPath(pi.Pkg.Path()) {
        packages = append(packages, pi)
    }
}</code></pre>
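<p>The snippet above leans on <code>isStandardImportPath</code>, borrowed from the go tool source linked earlier. The heuristic is that a standard library import path has no dot in its first element (no domain name, so it is not go-gettable):</p>

```go
package main

import (
	"fmt"
	"strings"
)

// isStandardImportPath reports whether the import path belongs to
// the standard library: its first path element contains no dot, so
// it cannot be a remote, domain-qualified import path.
func isStandardImportPath(path string) bool {
	i := strings.Index(path, "/")
	if i < 0 {
		i = len(path)
	}
	return !strings.Contains(path[:i], ".")
}

func main() {
	fmt.Println(isStandardImportPath("net/http"))             // prints: true
	fmt.Println(isStandardImportPath("github.com/miekg/dns")) // prints: false
}
```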
            <p>Then we just have to print a warning if any remaining package is not in a <code>/vendor/</code> folder:</p>
            <pre><code>for _, pi := range packages {
    if strings.Index(pi.Pkg.Path(), "/vendor/") == -1 {
        fmt.Println("[!] dependency not vendored:", pi.Pkg.Path())
    }
}</code></pre>
            <p>Done! You can find the tool here: <a href="https://github.com/FiloSottile/vendorcheck">https://github.com/FiloSottile/vendorcheck</a></p>
    <div>
      <h3>Further reading</h3>
      <a href="#further-reading">
        
      </a>
    </div>
    <p><a href="https://github.com/golang/example/tree/master/gotypes#gotypes-the-go-type-checker">This document</a> maintained by Alan Donovan will tell you more than I'll ever know about the static analysis tooling.</p><p>Note that you might be tempted to use <code>go/importer</code> and <code>types.Importer[From]</code> instead of <code>x/go/loader</code>. Don't do that. That doesn't load the source but reads compiled <code>.a</code> files, which <b>can be stale or missing</b>. Static analysis tools that spit out "package not found" for existing packages or, worse, incorrect results because of this are a pet peeve of mine.</p><p><i>If you now feel the urge to write static analysis tools, know that the CloudFlare Go team </i><a href="https://www.cloudflare.com/join-our-team/"><i>is hiring in London, San Francisco and Singapore</i></a><i>!</i></p> ]]></content:encoded>
            <category><![CDATA[Tools]]></category>
            <category><![CDATA[Go]]></category>
            <category><![CDATA[Programming]]></category>
            <guid isPermaLink="false">7f5NBXh02bwJ9WyQmBdtZK</guid>
            <dc:creator>Filippo Valsorda</dc:creator>
        </item>
        <item>
            <title><![CDATA[DNS parser, meet Go fuzzer]]></title>
            <link>https://blog.cloudflare.com/dns-parser-meet-go-fuzzer/</link>
            <pubDate>Thu, 06 Aug 2015 13:40:40 GMT</pubDate>
            <description><![CDATA[ Here at CloudFlare we are heavy users of the github.com/miekg/dns Go DNS library and we make sure to contribute to its development as much as possible. Therefore when Dmitry Vyukov published go-fuzz and started to uncover tens of bugs in the Go standard library, our task was clear. ]]></description>
            <content:encoded><![CDATA[ <p>Here at CloudFlare we are heavy users of the <a href="https://github.com/miekg/dns"><code>github.com/miekg/dns</code></a> Go DNS library and we make sure to contribute to its development as much as possible. Therefore when <a href="https://github.com/dvyukov">Dmitry Vyukov</a> published go-fuzz and started to uncover tens of bugs in the Go standard library, our task was clear.</p>
    <div>
      <h3>Hot Fuzz</h3>
      <a href="#hot-fuzz">
        
      </a>
    </div>
    <p>Fuzzing is the technique of <i>testing software by continuously feeding it inputs that are automatically mutated</i>. For C/C++, the wildly successful <a href="http://lcamtuf.coredump.cx/afl/">afl-fuzz</a> tool by Michał Zalewski uses instrumented source coverage to judge which mutations pushed the program into new paths, <i>eventually hitting many rarely-tested branches</i>.</p><p><a href="https://github.com/dvyukov/go-fuzz"><i>go-fuzz</i></a><i> applies the same technique to Go programs</i>, instrumenting the source by rewriting it (<a href="/go-has-a-debugger-and-its-awesome/">like godebug does</a>). An interesting difference between afl-fuzz and go-fuzz is that the former normally operates on file inputs to unmodified programs, while the latter asks you to <i>write a Go function and passes inputs to that</i>. The former usually forks a new process for each input, while the latter keeps calling the function, restarting only rarely.</p><p>There is no strong technical reason for this difference (and indeed afl recently gained the ability to behave like go-fuzz), but it's likely due to the <i>different ecosystems</i> in which they operate: Go programs often expose <i>well-documented, well-behaved APIs</i> which enable the tester to write a good wrapper that doesn't contaminate state across calls. Also, Go programs are often easier to dive into and <i>more predictable</i>, thanks obviously to GC and memory management, but also to the general community repulsion towards unexpected global states and side effects. On the other hand, many legacy C code bases are so intractable that the easy and stable file input interface is worth the performance tradeoff.</p><p>Back to our DNS library. RRDNS, our in-house DNS server, uses <code>github.com/miekg/dns</code> for all its parsing needs, and it has proved to be up to the task. However, it's a bit fragile on the edge cases and has a track record of panicking on malformed packets. 
Thankfully, this is Go, not <a href="/a-deep-look-at-cve-2015-5477-and-how-cloudflare-virtual-dns-customers-are-protected/">BIND</a> C, and we can afford to <code>recover()</code> panics without worrying about ending up with insane memory states. Here's what we are doing:</p>
            <pre><code>func ParseDNSPacketSafely(buf []byte, msg *old.Msg) (err error) {
	defer func() {
		panicked := recover()

		if panicked != nil {
			err = errors.New("ParseError")
		}
	}()

	err = msg.Unpack(buf)

	return
}</code></pre>
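            <p>To see the pattern in isolation, here is a self-contained sketch (with a hypothetical <code>parsePanicky</code> standing in for the library's <code>Unpack</code>):</p>

```go
package main

import (
	"errors"
	"fmt"
)

// parsePanicky stands in for a parser that panics on malformed input.
func parsePanicky(buf []byte) {
	if len(buf) < 4 {
		panic("short packet")
	}
}

// parseSafely mirrors the ParseDNSPacketSafely wrapper above: the deferred
// closure runs even when parsePanicky panics, and assigning to the named
// return value err turns the panic into an ordinary error for the caller.
func parseSafely(buf []byte) (err error) {
	defer func() {
		if recover() != nil {
			err = errors.New("ParseError")
		}
	}()
	parsePanicky(buf)
	return
}

func main() {
	fmt.Println(parseSafely([]byte{0x01}))       // ParseError
	fmt.Println(parseSafely([]byte{0, 0, 0, 0})) // <nil>
}
```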
            <p>We saw an opportunity to make the library more robust so we wrote this initial simple fuzzing function:</p>
            <pre><code>func Fuzz(rawMsg []byte) int {
    msg := &amp;dns.Msg{}

    if unpackErr := msg.Unpack(rawMsg); unpackErr != nil {
        return 0
    }

    if _, packErr := msg.Pack(); packErr != nil {
        println("failed to pack back a message")
        spew.Dump(msg)
        panic(packErr)
    }

    return 1
}</code></pre>
            <p>To create a corpus of initial inputs we took our stress and regression test suites and used <code>github.com/miekg/pcap</code> to write a file per packet.</p>
            <pre><code>package main

import (
	"crypto/rand"
	"encoding/hex"
	"log"
	"os"
	"strconv"

	"github.com/miekg/pcap"
)

func fatalIfErr(err error) {
	if err != nil {
		log.Fatal(err)
	}
}

func main() {
	handle, err := pcap.OpenOffline(os.Args[1])
	fatalIfErr(err)

	b := make([]byte, 4)
	_, err = rand.Read(b)
	fatalIfErr(err)
	prefix := hex.EncodeToString(b)

	i := 0
	for pkt := handle.Next(); pkt != nil; pkt = handle.Next() {
		pkt.Decode()

		f, err := os.Create("p_" + prefix + "_" + strconv.Itoa(i))
		fatalIfErr(err)
		_, err = f.Write(pkt.Payload)
		fatalIfErr(err)
		fatalIfErr(f.Close())

		i++
	}
}</code></pre>
            
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1JJf54JXErYsYRprnQO8nk/aec6c0359fc16d482d96bea2d37e8c3c/11597106396_a1927f8c71_z.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a> <a href="https://www.flickr.com/photos/jdhancock/11597106396/in/photolist-iENcGh-5GuU8g-4G3SzJ-cybzvf-ej9ytf-5PT2gy-2wCHkp-oTNLKN-4T5TVk-pikg74-64fbtb-64fbny-6iPZrk-6WSbWA-gTwR9P-6JEbMJ-uS5Qoe-p3LoLt-8rTRPb-gzJBbc-6u4Ko7-4uXbz8-bX4rtL-6HoBT8-cybFb7-pDtnkY-doskG2-a9tSqx-3NX4E-978gS2-4iW5fs-4VhKK2-7EpKqc-7EtB6Y-7EtAiN-7EpH54-7EpGgX-poTg8-55WEef-qfzP-dt83Gq-naDJvs-aCKhDG-drR492-aTSFS6-aTSEER-aTSC2t-aTSAUR-qqFXz2-ftsXnd">image</a> by <a href="https://www.flickr.com/photos/jdhancock/">JD Hancock</a></p><p>We then compiled our <code>Fuzz</code> function with go-fuzz, and launched the fuzzer on a lab server. The first thing go-fuzz does is minimize the corpus by throwing away packets that trigger the same code paths, then it starts mutating the inputs and passing them to <code>Fuzz()</code> in a loop. The mutations that don't fail (<code>return 1</code>) and <i>expand code coverage</i> are kept and iterated over. When the program panics, a small report (input and output) is saved and the program restarted. If you want to learn more about go-fuzz watch <a href="https://www.youtube.com/watch?v=a9xrxRsIbSU">the author's GopherCon talk</a> or read <a href="https://github.com/dvyukov/go-fuzz">the README</a>.</p><p><i>Crashes, mostly "index out of bounds", started to surface.</i> go-fuzz becomes pretty slow and ineffective when the program crashes often, so while the CPUs burned I started fixing the bugs.</p><p>In some cases I just decided to change some parser patterns, for example <a href="https://github.com/miekg/dns/commit/b5133fead4c0571c20eea405a917778f011dde02">reslicing and using <code>len()</code> instead of keeping offsets</a>. 
However, these can be potentially disruptive changes—I'm far from perfect—so I adapted the Fuzz function to keep an eye on differences between the old and the new, fixed parser, and to crash if the new parser started refusing good packets or changed its behavior:</p>
            <pre><code>func Fuzz(rawMsg []byte) int {
    var (
        msg, msgOld = &amp;dns.Msg{}, &amp;old.Msg{}
        buf, bufOld = make([]byte, 100000), make([]byte, 100000)
        res, resOld []byte

        unpackErr, unpackErrOld error
        packErr, packErrOld     error
    )

    unpackErr = msg.Unpack(rawMsg)
    unpackErrOld = ParseDNSPacketSafely(rawMsg, msgOld)

    if unpackErr != nil &amp;&amp; unpackErrOld != nil {
        return 0
    }

    if unpackErr != nil &amp;&amp; unpackErr.Error() == "dns: out of order NSEC block" {
        // 97b0a31 - rewrite NSEC bitmap [un]packing to account for out-of-order
        return 0
    }

    if unpackErr != nil &amp;&amp; unpackErr.Error() == "dns: bad rdlength" {
        // 3157620 - unpackStructValue: drop rdlen, reslice msg instead
        return 0
    }

    if unpackErr != nil &amp;&amp; unpackErr.Error() == "dns: bad address family" {
        // f37c7ea - Reject a bad EDNS0_SUBNET family on unpack (not only on pack)
        return 0
    }

    if unpackErr != nil &amp;&amp; unpackErr.Error() == "dns: bad netmask" {
        // 6d5de0a - EDNS0_SUBNET: refactor netmask handling
        return 0
    }

    if unpackErr != nil &amp;&amp; unpackErrOld == nil {
        println("new code fails to unpack valid packets")
        panic(unpackErr)
    }

    res, packErr = msg.PackBuffer(buf)

    if packErr != nil {
        println("failed to pack back a message")
        spew.Dump(msg)
        panic(packErr)
    }

    if unpackErrOld == nil {

        resOld, packErrOld = msgOld.PackBuffer(bufOld)

        if packErrOld == nil &amp;&amp; !bytes.Equal(res, resOld) {
            println("new code changed behavior of valid packets:")
            println()
            println(hex.Dump(res))
            println(hex.Dump(resOld))
            os.Exit(1)
        }

    }

    return 1
}</code></pre>
            <p>I was pretty happy about the robustness gain, but since we used the <code>ParseDNSPacketSafely</code> wrapper in RRDNS I didn't expect to find security vulnerabilities. I was wrong!</p><p>DNS names are made of labels, usually shown separated by dots. In a space-saving effort, labels can be replaced by pointers to other names, so that if we know we encoded <code>example.com</code> at offset 15, <code>www.example.com</code> can be packed as <code>www.</code> + <i>PTR(15)</i>. What we found is <a href="https://github.com/FiloSottile/dns/commit/b364f94">a bug in the handling of pointers to empty names</a>: when encountering the end of a name (<code>0x00</code>), if no labels had been read, <code>"."</code> (the empty name) was returned as a special case. The problem is that this special case was unaware of pointers, and it would instruct the parser to resume reading from the end of the pointed-to empty name instead of the end of the original name.</p><p>For example, if the parser encountered a pointer to offset 15 at offset 60, and <code>msg[15] == 0x00</code>, parsing would resume from offset 16 instead of 61, causing an infinite loop. This is a potential Denial of Service vulnerability.</p>
            <pre><code>A) Parse up to position 60, where a DNS name is found

| ... |  15  |  16  |  17  | ... |  58  |  59  |  60  |  61  |
| ... | 0x00 |      |      | ... |      |      | -&gt;15 |      |

-------------------------------------------------&gt;     

B) Follow the pointer to position 15

| ... |  15  |  16  |  17  | ... |  58  |  59  |  60  |  61  |
| ... | 0x00 |      |      | ... |      |      | -&gt;15 |      |

         ^                                        |
         ------------------------------------------      

C) Return an empty name ".", special case triggers

D) Erroneously resume from position 16 instead of 61

| ... |  15  |  16  |  17  | ... |  58  |  59  |  60  |  61  |
| ... | 0x00 |      |      | ... |      |      | -&gt;15 |      |

                 --------------------------------&gt;   

E) Rinse and repeat</code></pre>
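            <p>To make the correct behavior concrete, here is a minimal, self-contained sketch of name decompression (illustrative only, not the <code>miekg/dns</code> code): the offset to resume from is latched when the <i>first</i> pointer is followed, so even a pointer to an empty name cannot redirect the parser.</p>

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// unpackName is a toy sketch of DNS name decompression, not the miekg/dns
// implementation. The detail at the heart of the bug above: next, the offset
// to resume parsing from, is latched when the *first* pointer is followed,
// so a pointed-to empty name can't decide where the parser resumes.
func unpackName(msg []byte, off int) (name string, next int, err error) {
	var b strings.Builder
	next = -1       // unknown until the name ends or a pointer is followed
	ptrBudget := 10 // guard against pointer loops
	for {
		if off >= len(msg) {
			return "", 0, errors.New("truncated name")
		}
		c := int(msg[off])
		switch {
		case c == 0x00: // root label: end of this name
			if next < 0 {
				next = off + 1
			}
			if b.Len() == 0 {
				return ".", next, nil // empty name, but next is still correct
			}
			return b.String(), next, nil
		case c&0xC0 == 0xC0: // compression pointer: 2 bytes, 14-bit target
			if off+1 >= len(msg) {
				return "", 0, errors.New("truncated pointer")
			}
			if ptrBudget--; ptrBudget < 0 {
				return "", 0, errors.New("too many pointers")
			}
			if next < 0 {
				next = off + 2 // resume after the pointer, never after its target
			}
			off = (c&0x3F)<<8 | int(msg[off+1])
		case c&0xC0 == 0x00: // ordinary label of length c
			if off+1+c > len(msg) {
				return "", 0, errors.New("truncated label")
			}
			b.Write(msg[off+1 : off+1+c])
			b.WriteByte('.')
			off += 1 + c
		default:
			return "", 0, errors.New("reserved label type")
		}
	}
}

func main() {
	// Reproduce the scenario above: an empty name at offset 15,
	// and a pointer to it at offset 60.
	msg := make([]byte, 64)
	msg[15] = 0x00
	msg[60], msg[61] = 0xC0, 15
	name, next, _ := unpackName(msg, 60)
	fmt.Println(name, next) // ". 62": parsing resumes right after the pointer
}
```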
            <p>We sent the fixes privately to the library maintainer while we patched our servers and we <a href="https://github.com/miekg/dns/pull/237">opened a PR</a> once done. (Two bugs were independently found and fixed by Miek while we released our RRDNS updates, as it happens.)</p>
    <div>
      <h3>Not just crashes and hangs</h3>
      <a href="#not-just-crashes-and-hangs">
        
      </a>
    </div>
    <p>Thanks to its flexible fuzzing API, go-fuzz lends itself nicely not only to the search for crashing inputs, but <i>can be used to explore all scenarios where edge cases are troublesome</i>.</p><p>Useful applications range from checking output validation by adding crashing assertions to your <code>Fuzz()</code> function, to comparing the two ends of an unpack-pack chain, and even comparing the behavior of two different versions or implementations of the same functionality.</p><p>For example, while preparing our <a href="/tag/dnssec/">DNSSEC</a> engine for launch, I faced a weird bug that would happen only in production or under stress tests: <i>NSEC records that were supposed to have only a couple of bits set in their types bitmap would sometimes look like this:</i></p>
            <pre><code>deleg.filippo.io.  IN  NSEC    3600    \000.deleg.filippo.io. NS WKS HINFO TXT AAAA LOC SRV CERT SSHFP RRSIG NSEC TLSA HIP TYPE60 TYPE61 SPF</code></pre>
            <p>The catch was that our "pack and send" code <i>pools </i><code><i>[]byte</i></code><i> buffers to reduce GC and allocation churn</i>, so buffers passed to <code>dns.msg.PackBuffer(buf []byte)</code> can be "dirty" from previous uses.</p>
            <pre><code>var bufpool = sync.Pool{
    New: func() interface{} {
        return make([]byte, 0, 2048)
    },
}

[...]

    data := bufpool.Get().([]byte)
    defer bufpool.Put(data)

    if data, err = r.Response.PackBuffer(data); err != nil {</code></pre>
            <p>However, <code>buf</code> not being an array of zeroes was not handled by some <code>github.com/miekg/dns</code> packers, including the NSEC rdata one, which would <i>just OR in the present bits, without clearing the ones that are supposed to be absent</i>.</p>
            <pre><code>case `dns:"nsec"`:
    lastwindow := uint16(0)
    length := uint16(0)
    for j := 0; j &lt; val.Field(i).Len(); j++ {
        t := uint16((fv.Index(j).Uint()))
        window := uint16(t / 256)
        if lastwindow != window {
            off += int(length) + 3
        }
        length = (t - window*256) / 8
        bit := t - (window * 256) - (length * 8)

        msg[off] = byte(window) // window #
        msg[off+1] = byte(length + 1) // octets length

        // Setting the bit value for the type in the right octet
---&gt;    msg[off+2+int(length)] |= byte(1 &lt;&lt; (7 - bit)) 

        lastwindow = window
    }
    off += 2 + int(length)
    off++
}</code></pre>
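            <p>The failure mode is easy to reproduce in miniature (a toy sketch, not the actual packer):</p>

```go
package main

import "fmt"

func main() {
	// A fresh buffer, and a "dirty" one recycled from a pool.
	clean := []byte{0x00}
	dirty := []byte{0xFF}

	// An OR-only packer sets the bits it wants...
	clean[0] |= 1 << 3
	dirty[0] |= 1 << 3

	// ...but never clears the bits that should be absent.
	fmt.Printf("%08b %08b\n", clean[0], dirty[0]) // 00001000 11111111
}
```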
            <p>The fix was clear and easy: we benchmarked a few different ways to zero a buffer and updated the code like this:</p>
            <pre><code>// zeroBuf is a big buffer of zero bytes, used to zero out the buffers passed
// to PackBuffer.
var zeroBuf = make([]byte, 65535)

var bufpool = sync.Pool{
    New: func() interface{} {
        return make([]byte, 0, 2048)
    },
}

[...]

    data := bufpool.Get().([]byte)
    defer bufpool.Put(data)
    copy(data[0:cap(data)], zeroBuf)

    if data, err = r.Response.PackBuffer(data); err != nil {</code></pre>
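            <p>For reference, here are the two zeroing shapes in question (an illustrative sketch; our actual benchmark variants aren't shown):</p>

```go
package main

import "fmt"

// zeroBuf mirrors the big zero-filled source buffer above.
var zeroBuf = make([]byte, 65535)

// zeroCopy clears b up to its capacity by copying zeroes over it,
// as in the fix above.
func zeroCopy(b []byte) {
	copy(b[0:cap(b)], zeroBuf)
}

// zeroRange clears b with a range loop, the form that newer compilers
// can turn into a memclr call.
func zeroRange(b []byte) {
	b = b[0:cap(b)]
	for i := range b {
		b[i] = 0
	}
}

func main() {
	b := []byte{1, 2, 3}
	zeroCopy(b)
	fmt.Println(b) // [0 0 0]
	b = []byte{4, 5, 6}
	zeroRange(b)
	fmt.Println(b) // [0 0 0]
}
```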
            <p>Note: <a href="https://github.com/golang/go/commit/f03c9202c43e0abb130669852082117ca50aa9b1">a recent optimization</a> turns zeroing range loops into <code>memclr</code> calls, so once Go 1.5 lands that will be much faster than <code>copy()</code>.</p><p>But this was a boring fix! Wouldn't it be nicer if we could trust our library to work with any buffer we pass it? Luckily, this is exactly what coverage-based fuzzing is good for: <i>making sure all code paths behave in a certain way</i>.</p><p>What I did next was write a <code>Fuzz()</code> function that would first parse a message, and then pack it to two different buffers: one filled with zeroes and one filled with ones. <i>Any difference between the two results signals a case where the underlying buffer is leaking into the output.</i></p>
            <pre><code>func Fuzz(rawMsg []byte) int {
    var (
        msg         = &amp;dns.Msg{}
        buf, bufOne = make([]byte, 100000), make([]byte, 100000)
        res, resOne []byte

        unpackErr, packErr error
    )

    if unpackErr = msg.Unpack(rawMsg); unpackErr != nil {
        return 0
    }

    if res, packErr = msg.PackBuffer(buf); packErr != nil {
        return 0
    }

    for i := range res {
        bufOne[i] = 1
    }

    resOne, packErr = msg.PackBuffer(bufOne)
    if packErr != nil {
        println("Pack failed only with a filled buffer")
        panic(packErr)
    }

    if !bytes.Equal(res, resOne) {
        println("buffer bits leaked into the packed message")
        println(hex.Dump(res))
        println(hex.Dump(resOne))
        os.Exit(1)
    }

    return 1
}</code></pre>
            <p>I wish I could show a PR fixing all the bugs here, too, but go-fuzz did its job almost too well, and we are still triaging and fixing what it finds.</p><p>Anyway, once the fixes are done and go-fuzz falls silent, we will be free to drop the buffer zeroing step without worry, with no need to audit the whole codebase!</p><p><i>Do you fancy fuzzing the libraries that serve 43 billion queries per day? We are </i><a href="https://www.cloudflare.com/join-our-team"><i>hiring</i></a><i> in London, San Francisco and Singapore!</i></p> ]]></content:encoded>
            <category><![CDATA[RRDNS]]></category>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[Tools]]></category>
            <category><![CDATA[Go]]></category>
            <guid isPermaLink="false">7zu5Cq14O6t3QJfjOHY6b7</guid>
            <dc:creator>Filippo Valsorda</dc:creator>
        </item>
        <item>
            <title><![CDATA[Go has a debugger—and it's awesome!]]></title>
            <link>https://blog.cloudflare.com/go-has-a-debugger-and-its-awesome/</link>
            <pubDate>Thu, 18 Jun 2015 11:14:00 GMT</pubDate>
            <description><![CDATA[ Something that often, uh... bugs Go developers is the lack of a proper debugger. Builds are ridiculously fast and easy, but sometimes it would be nice to just set a breakpoint and step through that endless if chain or print a bunch of values without recompiling ten times. ]]></description>
            <content:encoded><![CDATA[ <p>Something that often, uh... <i>bugs</i><a href="#fn1">[1]</a> Go developers is the <b>lack of a proper debugger</b>. Sure, builds are ridiculously fast and easy, and <code>println(hex.Dump(b))</code> is your friend, but sometimes it would be nice to just set a breakpoint and step through that endless <code>if</code> chain or print a bunch of values without recompiling ten times.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1PlA0CeyPc2zJ6Ai6u9f9u/77f31929ad59993a59bb24367a46d852/12294903084_3a3d128ae7_z.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a> <a href="https://www.flickr.com/photos/62766743@N07/12294903084/in/photolist-jJsAkE-hiHrhB-9TNjzG-9TMKnB-9TKuyt-9TKuQx-4rHRku-9TNj1L-dCD4Ay-bbk7in-ngEQwy-q577yv-qmsFPs-qXFbRy-dCMyqk-rmqu1H-tncWw9-fzkCLf-54MZxq-9ZCivM-fdC6b-5jvVQ7-q4YkxA-2vVkpu-aY6pnx-9TNiVC-j8TKCC-9TNji3-dKjVwD-eRrMtP-dVJA3D-bwjW2u-ohnZh9-iRdXBy-dWXXKe-fdT8VT-ePmAs-ecdQqy-ieu7sA-iFi5z-j6m1Qs-ncgQ2q-7W3hJi-r17FpD-ekipUs-jYbRdy-ckWNBh-gT4VL-9TNjvC-9TNjpL">image</a> by <a href="https://www.flickr.com/photos/62766743@N07/">Carl Milner</a></p><p>You <i>could</i> try to use some dirty gdb hacks that will work if you built your binary with a certain linker and ran it on some architectures when the moon was in a waxing crescent phase, but let's be honest, it isn't an enjoyable experience.</p><p>Well, worry no more! <a href="https://github.com/mailgun/godebug">godebug</a> is here!</p><p><b>godebug is an awesome cross-platform debugger</b> created by the Mailgun team. You can read <a href="http://blog.mailgun.com/introducing-a-new-cross-platform-debugger-for-go/">their introduction</a> for some under-the-hood details, but here's the cool bit: instead of wrestling with half a dozen different ptrace interfaces that would not be portable, <b>godebug rewrites your source code</b> and injects function calls like <code>godebug.Line</code> on every line, <code>godebug.Declare</code> at every variable declaration, and <code>godebug.SetTrace</code> for breakpoints (i.e. wherever you type <code>_ = "breakpoint"</code>).</p><p>I find this solution brilliant. What you get out of it is a (possibly cross-compiled) debug-enabled binary that you can drop on a staging server just like you would with a regular binary. When a breakpoint is reached, the program will stop inline and wait for you on stdin. 
<b>It's the single-binary, zero-dependencies philosophy of Go that we love applied to debugging.</b> Builds everywhere, runs everywhere, with no need for tools or permissions on the server. It even compiles to JavaScript with gopherjs (check out the Mailgun post above—show-offs ;) ).</p><p>You might ask, "But does it get a decent runtime speed or work with big applications?" Well, the other day I was seeing RRDNS—our in-house Go DNS server—hit a weird branch, so I placed a breakpoint a couple lines above the <i>if</i> in question, <b>recompiled the whole of RRDNS with godebug instrumentation</b>, dropped the binary on a staging server, and replayed some DNS traffic.</p>
            <pre><code>filippo@staging:~$ ./rrdns -config config.json
-&gt; _ = "breakpoint"
(godebug) l

    q := r.Query.Question[0]

--&gt; _ = "breakpoint"

    if !isQtypeSupported(q.Qtype) {
        return
(godebug) n
-&gt; if !isQtypeSupported(q.Qtype) {
(godebug) q
dns.Question{Name:"filippo.io.", Qtype:0x1, Qclass:0x1}
(godebug) c</code></pre>
            <p>Boom. The request and the debug log paused (make sure to terminate any timeout you have in your tools), waiting for me to step through the code.</p><p>Sold yet? Here's how you use it: simply run <code>godebug {build|run|test}</code> instead of <code>go {build|run|test}</code>. <a href="https://github.com/mailgun/godebug/pull/32/commits">We adapted godebug</a> to resemble the go tool as much as possible. Remember to use <code>-instrument</code> if you want to be able to step into packages that are not <i>main</i>.</p><p>For example, here is part of the RRDNS Makefile:</p>
            <pre><code>bin/rrdns:
ifdef GODEBUG
	GOPATH="${PWD}" go install github.com/mailgun/godebug
	GOPATH="${PWD}" ./bin/godebug build -instrument "${GODEBUG}" -o bin/rrdns rrdns
else
	GOPATH="${PWD}" go install rrdns
endif

test:
ifdef GODEBUG
	GOPATH="${PWD}" go install github.com/mailgun/godebug
	GOPATH="${PWD}" ./bin/godebug test -instrument "${GODEBUG}" rrdns/...
else
	GOPATH="${PWD}" go test rrdns/...
endif</code></pre>
            <p>Debugging is just a <code>make bin/rrdns GODEBUG=rrdns/...</code> away.</p><p>This tool is still young, but in my experience, perfectly functional. The UX could use some love if you can spare some time (as you can see above it's pretty spartan), but it should be easy to build on what's there already.</p>
    <div>
      <h2>About source rewriting</h2>
      <a href="#about-source-rewriting">
        
      </a>
    </div>
    <p>Before closing, I'd like to say a few words about the technique of source rewriting in general. It powers many different Go tools, like <a href="https://blog.golang.org/cover">test coverage</a>, <a href="https://github.com/dvyukov/go-fuzz">fuzzing</a> and, indeed, debugging. It's made possible primarily by Go’s blazing-fast compiles, and it enables amazing cross-platform tools to be built easily.</p><p>However, since it's such a handy and powerful pattern, I feel like <b>there should be a standard way to apply it in the context of the build process</b>. After all, all the source rewriting tools need to implement a subset of the following features:</p><ul><li><p>Wrap the main function</p></li><li><p>Conditionally rewrite source files</p></li><li><p>Keep global state</p></li></ul><p>Why should every tool have to reinvent all the boilerplate to copy the source files, rewrite the source, make sure stale objects are not used, build the right packages, run the right tests, and interpret the CLI? Basically, all of <a href="https://github.com/mailgun/godebug/blob/f8742f647adb8ee17a1435de3b1929d36df590c8/cmd.go">godebug/cmd.go</a>. And what about <a href="http://getgb.io/">gb</a>, for example?</p><p>I think we need a framework for Go source code rewriting tools. (Spoiler, spoiler, ...)</p><p><i>If you’re interested in working on Go servers at scale and developing tools to do it better, remember </i><a href="https://www.cloudflare.com/join-our-team"><i>we’re hiring in London, San Francisco, and Singapore</i></a><i>!</i></p><hr /><ol><li><p>I'm sorry. <a href="#fnref1">↩︎</a></p></li></ol> ]]></content:encoded>
            <category><![CDATA[RRDNS]]></category>
            <category><![CDATA[Tools]]></category>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[Go]]></category>
            <category><![CDATA[Developers]]></category>
            <guid isPermaLink="false">7rlszh5ZEwkE3JfCjkJkZv</guid>
            <dc:creator>Filippo Valsorda</dc:creator>
        </item>
    </channel>
</rss>