
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Fri, 10 Apr 2026 11:36:19 GMT</lastBuildDate>
        <item>
            <title><![CDATA[NEON is the new black: fast JPEG optimization on ARM server]]></title>
            <link>https://blog.cloudflare.com/neon-is-the-new-black/</link>
            <pubDate>Fri, 13 Apr 2018 16:38:47 GMT</pubDate>
            <description><![CDATA[ Cloudflare's jpegtran implementation was optimized for Intel CPUs. Now that we intend to integrate ARMv8 processors, new optimizations for those are required. ]]></description>
            <content:encoded><![CDATA[ <p>As engineers at Cloudflare quickly adapt our software stack to run on ARM, a few parts of that stack have not been performing as well on ARM processors as they currently do on our Xeon® Silver 4116 CPUs. For the most part this is a matter of Intel-specific optimizations, some of which utilize SIMD or other special instructions.</p><p>One such example is the venerable jpegtran, one of the workhorses behind our Polish image optimization service.</p><p>A while ago I <a href="/doubling-the-speed-of-jpegtran/">optimized</a> our version of jpegtran for Intel processors. So when I ran a comparison on my <a href="/content/images/2015/10/print_poster_0025.jpg">test image</a>, I was expecting that the Xeon would outperform ARM:</p>
            <pre><code>vlad@xeon:~$ time  ./jpegtran -outfile /dev/null -progressive -optimise -copy none test.jpg

real    0m2.305s
user    0m2.059s
sys     0m0.252s</code></pre>
            
            <pre><code>vlad@arm:~$ time ./jpegtran -outfile /dev/null -progressive -optimise -copy none test.jpg

real    0m8.654s
user    0m8.433s
sys     0m0.225s</code></pre>
            <p>Ideally we want the ARM to perform at or above 50% of the Xeon performance per core. This would ensure no performance regressions, and a net performance gain, since the ARM CPUs have double the core count of our current two-socket setup.</p><p>In this case, however, I was disappointed to discover an almost 4X slowdown.</p><p>Not one to despair, I figured that applying the same optimizations I did for Intel would be trivial. Surely the NEON instructions map neatly to the SSE instructions I used before?</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3P8ertRFNQ4EZJnWHFTahh/c5b5e7013434c25e9d7816d7556e4d08/2535574361_30730d9a7b_o_d.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by-sa/2.0/">CC BY-SA 2.0</a> <a href="https://www.flickr.com/photos/vizzzual-dot-com/2535574361/in/photolist-4S4tVk-8yutL-ge8Q-5fXXRQ-wJcRY-yrvgf-9vXGvq-hGx4N-4NZ4L-5cjA7-iJrnwJ-7VXhaz-866BCb-auGuG-68BjeT-92L4-9rXsu-3Pfaz-5GZs6n-oAKSY-LhdKa-7BLZ96-VRGZ2H-ofm5ZJ-8xC1bv-5DePNQ-ZouKg-afu4r-49ThBC-7VyeQT-qfr6P3-4zpUM8-hgPbs-naTubk-S7khvM-6hTftH-ByLAg-9sNftz-8G4os2-zsq8-oBEf6L-5y1nxR-7aTZfD-9fYMH-wJcnp-8yhgk-wJcDS-qsPXVn-kbW5A6-bPx3gX">image</a> by <a href="https://www.flickr.com/photos/vizzzual-dot-com/">viZZZual.com</a></p>
    <div>
      <h3>What is NEON</h3>
      <a href="#what-is-neon">
        
      </a>
    </div>
    <p>NEON is the ARMv8 implementation of SIMD, the Single Instruction, Multiple Data paradigm, where a single instruction performs (generally) the same operation on several operands.</p><p>NEON operates on 32 dedicated 128-bit registers, similarly to Intel SSE. It can perform operations on 32-bit and 64-bit floating point numbers, or on 8-bit, 16-bit, 32-bit and 64-bit signed or unsigned integers.</p><p>As with SSE, you can program either in assembly language or in C using intrinsics. The intrinsics are usually easier to use and, depending on the application and the compiler, can provide better performance; however, intrinsics-based code tends to be quite verbose.</p><p>If you opt to use the NEON intrinsics you have to include <code>&lt;arm_neon.h&gt;</code>. While the SSE intrinsics use <code>__m128i</code> for all SIMD integer operations, the NEON intrinsics have a distinct type for each integer and float width. For example, operations on signed 16-bit integers use the <code>int16x8_t</code> type, which we are going to use. Similarly, there is a <code>uint16x8_t</code> type for unsigned integers, as well as <code>int8x16_t</code>, <code>int32x4_t</code> and <code>int64x2_t</code> and their <code>uint</code> counterparts, which are self-explanatory.</p>
    <div>
      <h3>Getting started</h3>
      <a href="#getting-started">
        
      </a>
    </div>
    <p>Running perf tells me that the same two culprits are responsible for most of the CPU time spent:</p>
            <pre><code>perf record ./jpegtran -outfile /dev/null -progressive -optimise -copy none test.jpeg
perf report
  71.24%  lt-jpegtran  libjpeg.so.9.1.0   [.] encode_mcu_AC_refine
  15.24%  lt-jpegtran  libjpeg.so.9.1.0   [.] encode_mcu_AC_first</code></pre>
            <p>Aha, <code>encode_mcu_AC_refine</code> and <code>encode_mcu_AC_first</code>, my old nemeses!</p>
    <div>
      <h3>The straightforward approach</h3>
      <a href="#the-straightforward-approach">
        
      </a>
    </div>
    
    <div>
      <h4>encode_mcu_AC_refine</h4>
      <a href="#encode_mcu_ac_refine">
        
      </a>
    </div>
    <p>Let's recap the optimizations we applied to <code>encode_mcu_AC_refine</code> previously. The function has two loops, with the heavier loop performing the following operation:</p>
            <pre><code>for (k = cinfo-&gt;Ss; k &lt;= Se; k++) {
  temp = (*block)[natural_order[k]];
  if (temp &lt; 0)
    temp = -temp;      /* temp is abs value of input */
  temp &gt;&gt;= Al;         /* apply the point transform */
  absvalues[k] = temp; /* save abs value for main pass */
  if (temp == 1)
    EOB = k;           /* EOB = index of last newly-nonzero coef */
}</code></pre>
            <p>And the SSE solution to this problem was:</p>
            <pre><code>__m128i x1 = _mm_setzero_si128(); // Load 8 16-bit values sequentially
x1 = _mm_insert_epi16(x1, (*block)[natural_order[k+0]], 0);
x1 = _mm_insert_epi16(x1, (*block)[natural_order[k+1]], 1);
x1 = _mm_insert_epi16(x1, (*block)[natural_order[k+2]], 2);
x1 = _mm_insert_epi16(x1, (*block)[natural_order[k+3]], 3);
x1 = _mm_insert_epi16(x1, (*block)[natural_order[k+4]], 4);
x1 = _mm_insert_epi16(x1, (*block)[natural_order[k+5]], 5);
x1 = _mm_insert_epi16(x1, (*block)[natural_order[k+6]], 6);
x1 = _mm_insert_epi16(x1, (*block)[natural_order[k+7]], 7);

x1 = _mm_abs_epi16(x1);       // Get absolute value of 16-bit integers
x1 = _mm_srli_epi16(x1, Al);  // &gt;&gt; 16-bit integers by Al bits

_mm_storeu_si128((__m128i*)&amp;absvalues[k], x1);   // Store

x1 = _mm_cmpeq_epi16(x1, _mm_set1_epi16(1));     // Compare to 1
unsigned int idx = _mm_movemask_epi8(x1);        // Extract byte mask
EOB = idx? k + 16 - __builtin_clz(idx)/2 : EOB;  // Compute index</code></pre>
            <p>For the most part the transition to NEON is indeed straightforward.</p><p>To initialize a register to all zeros, we can use the <code>vdupq_n_s16</code> intrinsic, which duplicates a given value across all lanes of a register. The insertions are performed with the <code>vsetq_lane_s16</code> intrinsic. Use <code>vabsq_s16</code> to get the absolute values.</p><p>The shift right instruction made me pause for a while. I simply couldn't find an instruction that can shift right by a non-constant integer value. It doesn't exist. However, the solution is very simple: you shift left by a negative amount! The intrinsic for that is <code>vshlq_s16</code>.</p><blockquote><p>The absence of a right shift instruction is no coincidence. Unlike the x86 instruction set, which can theoretically support arbitrarily long instructions and thus doesn't have to think twice before adding a new instruction, no matter how specialized or redundant, the ARMv8 instruction set only supports 32-bit instructions and has a very limited opcode space. For this reason the instruction set is much more concise, and many instructions are in fact aliases of other instructions. Even the most basic MOV instruction is an alias for ORR (binary or). That means that programming for ARM and NEON sometimes requires greater creativity.</p></blockquote><p>The final step of the loop is comparing each element to 1, then getting the mask. Comparing for equality is performed with <code>vceqq_s16</code>. But again, there is no operation to extract the mask. That is a problem. However, instead of getting a bitmask, it is possible to extract a whole byte from every lane into a 64-bit value, by first applying <code>vuzp1q_u8</code> to the comparison result. <code>vuzp1q_u8</code> selects the even-indexed bytes of two vectors (whereas <code>vuzp2q_u8</code> selects the odd-indexed ones). So the solution looks something like this:</p>
            <pre><code>int16x8_t zero = vdupq_n_s16(0);
int16x8_t one = vdupq_n_s16(1);  // Used below for the compare-to-1 step
int16x8_t al_neon = vdupq_n_s16(-Al);
int16x8_t x0 = zero;
int16x8_t x1 = zero;

// Load 8 16-bit values sequentially
x1 = vsetq_lane_s16((*block)[natural_order[k+0]], x1, 0);
// Interleave the loads to compensate for latency
x0 = vsetq_lane_s16((*block)[natural_order[k+1]], x0, 1);
x1 = vsetq_lane_s16((*block)[natural_order[k+2]], x1, 2);
x0 = vsetq_lane_s16((*block)[natural_order[k+3]], x0, 3);
x1 = vsetq_lane_s16((*block)[natural_order[k+4]], x1, 4);
x0 = vsetq_lane_s16((*block)[natural_order[k+5]], x0, 5);
x1 = vsetq_lane_s16((*block)[natural_order[k+6]], x1, 6);
x0 = vsetq_lane_s16((*block)[natural_order[k+7]], x0, 7);
int16x8_t x = vorrq_s16(x1, x0);

x = vabsq_s16(x);            // Get absolute value of 16-bit integers
x = vshlq_s16(x, al_neon);   // &gt;&gt; 16-bit integers by Al bits

vst1q_s16(&amp;absvalues[k], x); // Store
uint8x16_t is_one = vreinterpretq_u8_u16(vceqq_s16(x, one));  // Compare to 1
is_one = vuzp1q_u8(is_one, is_one);  // Compact the compare result into 64 bits

uint64_t idx = vgetq_lane_u64(vreinterpretq_u64_u8(is_one), 0); // Extract
EOB = idx ? k + 8 - __builtin_clzl(idx)/8 : EOB;                // Get the index</code></pre>
            <p>Note the intrinsics for explicit type casts. They don't actually emit any instructions, since regardless of the type the operands always occupy the same registers.</p><p>On to the second loop:</p>
            <pre><code>if ((temp = absvalues[k]) == 0) {
  r++;
  continue;
}</code></pre>
            <p>The SSE solution was:</p>
            <pre><code>__m128i t = _mm_loadu_si128((__m128i*)&amp;absvalues[k]);
t = _mm_cmpeq_epi16(t, _mm_setzero_si128()); // Compare to 0
int idx = _mm_movemask_epi8(t);              // Extract byte mask
if (idx == 0xffff) {                         // Skip all zeros
  r += 8;
  k += 8;
  continue;
} else {                                     // Skip up to the first nonzero
  int skip = __builtin_ctz(~idx)/2;
  r += skip;
  k += skip;
  if (k&gt;Se) break;      // Stop if gone too far
}
temp = absvalues[k];    // Load the next nonzero value</code></pre>
            <p>But we already know that there is no way to extract the byte mask. Instead of using NEON, I chose to simply skip four zero values at a time using 64-bit integers, like so:</p>
            <pre><code>uint64_t tt, *t = (uint64_t*)&amp;absvalues[k];
if ( (tt = *t) == 0) while ( (tt = *++t) == 0); // Skip while all zeroes
int skip = __builtin_ctzl(tt)/16 + ((int64_t)t - 
           (int64_t)&amp;absvalues[k])/2;           // Get index of next nonzero
k += skip;
r += skip;
temp = absvalues[k];</code></pre>
            <p>How fast are we now?</p>
            <pre><code>vlad@arm:~$ time ./jpegtran -outfile /dev/null -progressive -optimise -copy none test.jpg

real    0m4.008s
user    0m3.770s
sys     0m0.241s</code></pre>
            <p>Wow, that is incredible. Over 2X speedup!</p>
    <div>
      <h4>encode_mcu_AC_first</h4>
      <a href="#encode_mcu_ac_first">
        
      </a>
    </div>
    <p>The other function is quite similar, but the logic slightly differs on the first pass:</p>
            <pre><code>temp = (*block)[natural_order[k]];
if (temp &lt; 0) {
  temp = -temp;             // Temp is abs value of input
  temp &gt;&gt;= Al;              // Apply the point transform
  temp2 = ~temp;
} else {
  temp &gt;&gt;= Al;              // Apply the point transform
  temp2 = temp;
}
t1[k] = temp;
t2[k] = temp2;</code></pre>
            <p>Here we need to assign the absolute value of temp to <code>t1[k]</code>, and its binary inverse to <code>t2[k]</code> if temp is negative; otherwise <code>t2[k]</code> is assigned the same value as <code>t1[k]</code>.</p><p>To get the inverse of a value we use the <code>vmvnq_s16</code> intrinsic; to check whether the values are non-negative we compare with zero using <code>vcgezq_s16</code>; and finally we select based on the mask using <code>vbslq_s16</code>.</p>
            <pre><code>int16x8_t zero = vdupq_n_s16(0);
int16x8_t al_neon = vdupq_n_s16(-Al);

int16x8_t x0 = zero;
int16x8_t x1 = zero;

// Load 8 16-bit values sequentially
x1 = vsetq_lane_s16((*block)[natural_order[k+0]], x1, 0);
// Interleave the loads to compensate for latency
x0 = vsetq_lane_s16((*block)[natural_order[k+1]], x0, 1);
x1 = vsetq_lane_s16((*block)[natural_order[k+2]], x1, 2);
x0 = vsetq_lane_s16((*block)[natural_order[k+3]], x0, 3);
x1 = vsetq_lane_s16((*block)[natural_order[k+4]], x1, 4);
x0 = vsetq_lane_s16((*block)[natural_order[k+5]], x0, 5);
x1 = vsetq_lane_s16((*block)[natural_order[k+6]], x1, 6);
x0 = vsetq_lane_s16((*block)[natural_order[k+7]], x0, 7);
int16x8_t x = vorrq_s16(x1, x0);

uint16x8_t is_positive = vcgezq_s16(x); // Get positive mask

x = vabsq_s16(x);                 // Get absolute value of 16-bit integers
x = vshlq_s16(x, al_neon);        // &gt;&gt; 16-bit integers by Al bits
int16x8_t n = vmvnq_s16(x);       // Binary inverse
n = vbslq_s16(is_positive, x, n); // Select based on positive mask

vst1q_s16(&amp;t1[k], x); // Store
vst1q_s16(&amp;t2[k], n);</code></pre>
            <p>And the moment of truth:</p>
            <pre><code>vlad@arm:~$ time ./jpegtran -outfile /dev/null -progressive -optimise -copy none test.jpg

real    0m3.480s
user    0m3.243s
sys     0m0.241s</code></pre>
            <p>Overall 2.5X speedup from the original C implementation, but still 1.5X slower than Xeon.</p>
    <div>
      <h3>Batch benchmark</h3>
      <a href="#batch-benchmark">
        
      </a>
    </div>
    <p>While the improvement for the single image was impressive, it is not necessarily representative of all jpeg files. To understand the impact on overall performance I ran jpegtran over a set of 34,159 actual images from one of our caches. The total size of those images was 3,325,253KB. The total size after jpegtran was 3,067,753KB, or 8% improvement on average.</p><p>Using one thread, the Intel Xeon managed to process all those images in 14 minutes and 43 seconds. The original jpegtran on our ARM server took 29 minutes and 34 seconds. The improved jpegtran took only 13 minutes and 52 seconds, slightly outperforming even the Xeon processor, despite losing on the test image.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/koIIgyZHWMqH2zFQlXE4Q/9f9de84ab448b9a0e75568d8c71d82d5/jpegtran.png" />
            
            </figure>
    <div>
      <h3>Going deeper</h3>
      <a href="#going-deeper">
        
      </a>
    </div>
    <p>3.48 seconds, down from 8.654, represents a respectable 2.5X speedup.</p><p>It definitely meets the goal of being at least 50% as fast as the Xeon, and it is faster in the batch benchmark, but it still feels like it is slower than it could be.</p><p>While going over the ARMv8 NEON instruction set, I found several unique instructions that have no equivalent in SSE.</p><p>The first such instruction is <code>TBL</code>. It works as a lookup table that can look up 8 or 16 bytes from one to four consecutive registers. In the single register variant it is similar to the <code>pshufb</code> SSE instruction. In the four register variant, however, it can simultaneously look up 16 bytes in a 64-byte table! What sorcery is that?</p><p>The intrinsic to use the 4 register variant is <code>vqtbl4q_u8</code>. Interestingly, there is an instruction that can look up 64 bytes in AVX-512, but we don't want to <a href="/on-the-dangers-of-intels-frequency-scaling/">use that</a>.</p><p>The next interesting thing I found are instructions that can load or store and de/interleave data at the same time. They can load or store up to four registers simultaneously, while de/interleaving two, three or even four elements of any supported width. The specifics are well presented <a href="https://community.arm.com/processors/b/blog/posts/coding-for-neon---part-1-load-and-stores">here</a>. The load intrinsics used are of the form <code>vldNq_uW</code>, where N can be 1, 2, 3 or 4 to indicate the interleave factor and W can be 8, 16, 32 or 64. Similarly, <code>vldNq_sW</code> is used for signed types.</p><p>Finally, the shift-and-insert instructions <code>SLI</code> and <code>SRI</code> are very interesting. They shift the elements left or right like a regular shift would; however, instead of shifting in zero bits, the zeros are replaced with the original bits of the destination register! An intrinsic for that would look like <code>vsliq_n_u16</code> or <code>vsriq_n_u32</code>.</p>
    <div>
      <h4>Applying the new instructions</h4>
      <a href="#applying-the-new-instructions">
        
      </a>
    </div>
    <p>It might not be visible at first how those new instructions can help. Since I didn't have much time to dig into libjpeg or the JPEG spec, I had to resort to heuristics.</p><p>From a quick look it became apparent that <code>*block</code> is defined as an array of 64 16-bit values. <code>natural_order</code> is an array of 32-bit integers that varies in length depending on the real block size, but is always padded with 16 entries. Also, despite the fact that it uses integers, the values are indices in the range [0..63].</p><p>Another interesting observation is that blocks of size 64 are by far the most common for both <code>encode_mcu_AC_refine</code> and <code>encode_mcu_AC_first</code>. And it always makes sense to optimize for the most common case.</p><p>So essentially what we have here is a 64-entry lookup table, <code>*block</code>, that uses <code>natural_order</code> as indices. Hmm, a 64-entry lookup table, where did I see that before? Of course, the <code>TBL</code> instruction. Although <code>TBL</code> looks up bytes and we need to look up shorts, that is easy to handle, since NEON lets us load and deinterleave the shorts into bytes in a single instruction using <code>LD2</code>; then we can use two lookups, one for each byte, and finally interleave again with <code>ZIP1</code> and <code>ZIP2</code>. 
Similarly, although the indices are integers and we only need the least significant byte of each, we can use <code>LD4</code> to deinterleave them into bytes (the kosher way, of course, would be to rewrite the library to use bytes, but I wanted to avoid big changes).</p><p>After the data loading step is done, the point transforms for both functions remain the same, but in the end, to get a single bitmask for all 64 values, we can use <code>SLI</code> and <code>SRI</code> to align the bits such that only one bit of each comparison mask remains, using <code>TBL</code> again to combine them.</p><p>For whatever reason the compiler produces somewhat suboptimal code in this case, so I had to revert to assembly language for this specific optimization.</p><p>The code for <code>encode_mcu_AC_refine</code>:</p>
            <pre><code>    # Load and deinterleave the block
    ld2 {v0.16b - v1.16b}, [x0], 32
    ld2 {v16.16b - v17.16b}, [x0], 32
    ld2 {v18.16b - v19.16b}, [x0], 32
    ld2 {v20.16b - v21.16b}, [x0]

    mov v4.16b, v1.16b
    mov v5.16b, v17.16b
    mov v6.16b, v19.16b
    mov v7.16b, v21.16b
    mov v1.16b, v16.16b
    mov v2.16b, v18.16b
    mov v3.16b, v20.16b
    # Load the order 
    ld4 {v16.16b - v19.16b}, [x1], 64
    ld4 {v17.16b - v20.16b}, [x1], 64
    ld4 {v18.16b - v21.16b}, [x1], 64
    ld4 {v19.16b - v22.16b}, [x1]
    # Table lookup, LSB and MSB independently
    tbl v20.16b, {v0.16b - v3.16b}, v16.16b
    tbl v16.16b, {v4.16b - v7.16b}, v16.16b
    tbl v21.16b, {v0.16b - v3.16b}, v17.16b
    tbl v17.16b, {v4.16b - v7.16b}, v17.16b
    tbl v22.16b, {v0.16b - v3.16b}, v18.16b
    tbl v18.16b, {v4.16b - v7.16b}, v18.16b
    tbl v23.16b, {v0.16b - v3.16b}, v19.16b
    tbl v19.16b, {v4.16b - v7.16b}, v19.16b
    # Interleave MSB and LSB back
    zip1 v0.16b, v20.16b, v16.16b
    zip2 v1.16b, v20.16b, v16.16b
    zip1 v2.16b, v21.16b, v17.16b
    zip2 v3.16b, v21.16b, v17.16b
    zip1 v4.16b, v22.16b, v18.16b
    zip2 v5.16b, v22.16b, v18.16b
    zip1 v6.16b, v23.16b, v19.16b
    zip2 v7.16b, v23.16b, v19.16b
    # -Al
    neg w3, w3
    dup v16.8h, w3
    # Absolute then shift by Al
    abs v0.8h, v0.8h
    sshl v0.8h, v0.8h, v16.8h
    abs v1.8h, v1.8h
    sshl v1.8h, v1.8h, v16.8h
    abs v2.8h, v2.8h
    sshl v2.8h, v2.8h, v16.8h
    abs v3.8h, v3.8h
    sshl v3.8h, v3.8h, v16.8h
    abs v4.8h, v4.8h
    sshl v4.8h, v4.8h, v16.8h
    abs v5.8h, v5.8h
    sshl v5.8h, v5.8h, v16.8h
    abs v6.8h, v6.8h
    sshl v6.8h, v6.8h, v16.8h
    abs v7.8h, v7.8h
    sshl v7.8h, v7.8h, v16.8h
    # Store
    st1 {v0.16b - v3.16b}, [x2], 64
    st1 {v4.16b - v7.16b}, [x2]
    # Constant 1
    movi v16.8h, 0x1
    # Compare with 0 for zero mask
    cmeq v17.8h, v0.8h, #0
    cmeq v18.8h, v1.8h, #0
    cmeq v19.8h, v2.8h, #0
    cmeq v20.8h, v3.8h, #0
    cmeq v21.8h, v4.8h, #0
    cmeq v22.8h, v5.8h, #0
    cmeq v23.8h, v6.8h, #0
    cmeq v24.8h, v7.8h, #0
    # Compare with 1 for EOB mask
    cmeq v0.8h, v0.8h, v16.8h
    cmeq v1.8h, v1.8h, v16.8h
    cmeq v2.8h, v2.8h, v16.8h
    cmeq v3.8h, v3.8h, v16.8h
    cmeq v4.8h, v4.8h, v16.8h
    cmeq v5.8h, v5.8h, v16.8h
    cmeq v6.8h, v6.8h, v16.8h
    cmeq v7.8h, v7.8h, v16.8h
    # For both masks -&gt; keep only one byte for each comparison
    uzp1 v0.16b, v0.16b, v1.16b
    uzp1 v1.16b, v2.16b, v3.16b
    uzp1 v2.16b, v4.16b, v5.16b
    uzp1 v3.16b, v6.16b, v7.16b

    uzp1 v17.16b, v17.16b, v18.16b
    uzp1 v18.16b, v19.16b, v20.16b
    uzp1 v19.16b, v21.16b, v22.16b
    uzp1 v20.16b, v23.16b, v24.16b
    # Shift left and insert (int16) to get a single bit from even to odd bytes
    sli v0.8h, v0.8h, 15
    sli v1.8h, v1.8h, 15
    sli v2.8h, v2.8h, 15
    sli v3.8h, v3.8h, 15

    sli v17.8h, v17.8h, 15
    sli v18.8h, v18.8h, 15
    sli v19.8h, v19.8h, 15
    sli v20.8h, v20.8h, 15
    # Shift right and insert (int32) to get two bits from odd to even indices
    sri v0.4s, v0.4s, 18
    sri v1.4s, v1.4s, 18
    sri v2.4s, v2.4s, 18
    sri v3.4s, v3.4s, 18

    sri v17.4s, v17.4s, 18
    sri v18.4s, v18.4s, 18
    sri v19.4s, v19.4s, 18
    sri v20.4s, v20.4s, 18
    # Regular shift right to align the 4 bits at the bottom of each int64
    ushr v0.2d, v0.2d, 12
    ushr v1.2d, v1.2d, 12
    ushr v2.2d, v2.2d, 12
    ushr v3.2d, v3.2d, 12

    ushr v17.2d, v17.2d, 12
    ushr v18.2d, v18.2d, 12
    ushr v19.2d, v19.2d, 12
    ushr v20.2d, v20.2d, 12
    # Shift left and insert (int64) to combine all 8 bits into one byte
    sli v0.2d, v0.2d, 36
    sli v1.2d, v1.2d, 36
    sli v2.2d, v2.2d, 36
    sli v3.2d, v3.2d, 36

    sli v17.2d, v17.2d, 36
    sli v18.2d, v18.2d, 36
    sli v19.2d, v19.2d, 36
    sli v20.2d, v20.2d, 36
    # Combine all the byte masks into a 64-bit mask for EOB and zero masks
    ldr d4, .shuf_mask
    tbl v5.8b, {v0.16b - v3.16b}, v4.8b
    tbl v6.8b, {v17.16b - v20.16b}, v4.8b
    # Extract lanes
    mov x0, v5.d[0]
    mov x1, v6.d[0]
    # Compute EOB
    rbit x0, x0
    clz x0, x0
    mov x2, 64
    sub x0, x2, x0
    # Not of zero mask (so 1 bits indicate non-zeroes)
    mvn x1, x1
    ret</code></pre>
            <p>If you look carefully at the code, you will see that since generating a mask to find EOB is useful, I decided to use the same method to generate a mask for the zero values, and then find the next nonzero value and the zero run length this way:</p>
            <pre><code>uint64_t skip =__builtin_clzl(zero_mask &lt;&lt; k);
r += skip;
k += skip;</code></pre>
            <p>Similarly for <code>encode_mcu_AC_first</code>:</p>
            <pre><code>    # Load the block
    ld2 {v0.16b - v1.16b}, [x0], 32
    ld2 {v16.16b - v17.16b}, [x0], 32
    ld2 {v18.16b - v19.16b}, [x0], 32
    ld2 {v20.16b - v21.16b}, [x0]

    mov v4.16b, v1.16b
    mov v5.16b, v17.16b
    mov v6.16b, v19.16b
    mov v7.16b, v21.16b
    mov v1.16b, v16.16b
    mov v2.16b, v18.16b
    mov v3.16b, v20.16b

    # Load the order 
    ld4 {v16.16b - v19.16b}, [x1], 64
    ld4 {v17.16b - v20.16b}, [x1], 64
    ld4 {v18.16b - v21.16b}, [x1], 64
    ld4 {v19.16b - v22.16b}, [x1]
    # Table lookup, LSB and MSB independently
    tbl v20.16b, {v0.16b - v3.16b}, v16.16b
    tbl v16.16b, {v4.16b - v7.16b}, v16.16b
    tbl v21.16b, {v0.16b - v3.16b}, v17.16b
    tbl v17.16b, {v4.16b - v7.16b}, v17.16b
    tbl v22.16b, {v0.16b - v3.16b}, v18.16b
    tbl v18.16b, {v4.16b - v7.16b}, v18.16b
    tbl v23.16b, {v0.16b - v3.16b}, v19.16b
    tbl v19.16b, {v4.16b - v7.16b}, v19.16b
    # Interleave MSB and LSB back
    zip1 v0.16b, v20.16b, v16.16b
    zip2 v1.16b, v20.16b, v16.16b
    zip1 v2.16b, v21.16b, v17.16b
    zip2 v3.16b, v21.16b, v17.16b
    zip1 v4.16b, v22.16b, v18.16b
    zip2 v5.16b, v22.16b, v18.16b
    zip1 v6.16b, v23.16b, v19.16b
    zip2 v7.16b, v23.16b, v19.16b
    # -Al
    neg w4, w4
    dup v24.8h, w4
    # Compare with 0 to get negative mask
    cmge v16.8h, v0.8h, #0
    # Absolute value and shift by Al
    abs v0.8h, v0.8h
    sshl v0.8h, v0.8h, v24.8h
    cmge v17.8h, v1.8h, #0
    abs v1.8h, v1.8h
    sshl v1.8h, v1.8h, v24.8h
    cmge v18.8h, v2.8h, #0
    abs v2.8h, v2.8h
    sshl v2.8h, v2.8h, v24.8h
    cmge v19.8h, v3.8h, #0
    abs v3.8h, v3.8h
    sshl v3.8h, v3.8h, v24.8h
    cmge v20.8h, v4.8h, #0
    abs v4.8h, v4.8h
    sshl v4.8h, v4.8h, v24.8h
    cmge v21.8h, v5.8h, #0
    abs v5.8h, v5.8h
    sshl v5.8h, v5.8h, v24.8h
    cmge v22.8h, v6.8h, #0
    abs v6.8h, v6.8h
    sshl v6.8h, v6.8h, v24.8h
    cmge v23.8h, v7.8h, #0
    abs v7.8h, v7.8h
    sshl v7.8h, v7.8h, v24.8h
    # ~
    mvn v24.16b, v0.16b
    mvn v25.16b, v1.16b
    mvn v26.16b, v2.16b
    mvn v27.16b, v3.16b
    mvn v28.16b, v4.16b
    mvn v29.16b, v5.16b
    mvn v30.16b, v6.16b
    mvn v31.16b, v7.16b
    # Select
    bsl v16.16b, v0.16b, v24.16b
    bsl v17.16b, v1.16b, v25.16b
    bsl v18.16b, v2.16b, v26.16b
    bsl v19.16b, v3.16b, v27.16b
    bsl v20.16b, v4.16b, v28.16b
    bsl v21.16b, v5.16b, v29.16b
    bsl v22.16b, v6.16b, v30.16b
    bsl v23.16b, v7.16b, v31.16b
    # Store t1
    st1 {v0.16b - v3.16b}, [x2], 64
    st1 {v4.16b - v7.16b}, [x2]
    # Store t2
    st1 {v16.16b - v19.16b}, [x3], 64
    st1 {v20.16b - v23.16b}, [x3]
    # Compute zero mask like before
    cmeq v17.8h, v0.8h, #0
    cmeq v18.8h, v1.8h, #0
    cmeq v19.8h, v2.8h, #0
    cmeq v20.8h, v3.8h, #0
    cmeq v21.8h, v4.8h, #0
    cmeq v22.8h, v5.8h, #0
    cmeq v23.8h, v6.8h, #0
    cmeq v24.8h, v7.8h, #0

    uzp1 v17.16b, v17.16b, v18.16b
    uzp1 v18.16b, v19.16b, v20.16b
    uzp1 v19.16b, v21.16b, v22.16b
    uzp1 v20.16b, v23.16b, v24.16b

    sli v17.8h, v17.8h, 15
    sli v18.8h, v18.8h, 15
    sli v19.8h, v19.8h, 15
    sli v20.8h, v20.8h, 15

    sri v17.4s, v17.4s, 18
    sri v18.4s, v18.4s, 18
    sri v19.4s, v19.4s, 18
    sri v20.4s, v20.4s, 18

    ushr v17.2d, v17.2d, 12
    ushr v18.2d, v18.2d, 12
    ushr v19.2d, v19.2d, 12
    ushr v20.2d, v20.2d, 12

    sli v17.2d, v17.2d, 36
    sli v18.2d, v18.2d, 36
    sli v19.2d, v19.2d, 36
    sli v20.2d, v20.2d, 36

    ldr d4, .shuf_mask
    tbl v6.8b, {v17.16b - v20.16b}, v4.8b

    mov x0, v6.d[0]
    mvn x0, x0
    ret</code></pre>
            
    <div>
      <h2>Final results and power</h2>
      <a href="#final-results-and-power">
        
      </a>
    </div>
    <p>The final version of our jpegtran managed to reduce the test image in 2.756 seconds, an extra 1.26X speedup that brings it incredibly close to the performance of the Xeon on that image. As a bonus, batch performance also improved!</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4kEvqeb1Ddf7x4WbeyjEx8/f8752cb3a6e2c084ede51509d9e7ca15/jpegtran-asm-1.png" />
            
            </figure><p>Another favorite part of working with the Qualcomm Centriq CPU is the ability to take power readings, and to be pleasantly surprised every time.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2j5DWevxhY9nVnAW2Z7g31/fa34949f6f797487efbfbf9da646d20b/jpegtran-power-1.png" />
            
            </figure><p>With the new implementation Centriq outperforms the Xeon at batch reduction for every number of workers. We usually run Polish with four workers, for which Centriq is now 1.3 times faster while also 6.5 times more power efficient.</p>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>It is evident that the Qualcomm Centriq is a powerful processor that definitely provides good bang for the buck. However, years of Intel leadership in the server and desktop space mean that a lot of software is better optimized for Intel processors.</p><p>For the most part, writing optimizations for ARMv8 is not difficult, and we will be adjusting our software as needed, publishing our efforts as we go.</p><p>You can find the updated code on our <a href="https://github.com/cloudflare/jpegtran">GitHub</a> page.</p>
    <div>
      <h3>Useful resources</h3>
      <a href="#useful-resources">
        
      </a>
    </div>
    <ul><li><p><a href="https://developer.arm.com/docs/100069/latest">Arm Compiler armasm User Guide</a></p></li><li><p><a href="http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf">Procedure Call Standard for the ARM 64-bit Architecture</a></p></li><li><p><a href="http://infocenter.arm.com/help/topic/com.arm.doc.ihi0073a/IHI0073A_arm_neon_intrinsics_ref.pdf">ARM NEON Intrinsics Reference</a></p></li><li><p><a href="https://community.arm.com/processors/b/blog/posts/coding-for-neon---part-1-load-and-stores">Coding for NEON</a></p></li></ul> ]]></content:encoded>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Cloudflare Polish]]></category>
            <guid isPermaLink="false">3a8OlLzKyPgsfQg8iON6iO</guid>
            <dc:creator>Vlad Krasnov</dc:creator>
        </item>
        <item>
            <title><![CDATA[Writing complex macros in Rust: Reverse Polish Notation]]></title>
            <link>https://blog.cloudflare.com/writing-complex-macros-in-rust-reverse-polish-notation/</link>
            <pubDate>Wed, 31 Jan 2018 12:11:15 GMT</pubDate>
            <description><![CDATA[ Among other interesting features, Rust has a powerful macro system. Unfortunately, even after reading The Book and various tutorials, when it came to trying to implement a macro which involved processing complex lists of different elements, I still struggled to understand how it should be done. ]]></description>
            <content:encoded><![CDATA[ <p>(<i>This is a crosspost of a tutorial </i><a href="https://rreverser.com/writing-complex-macros-in-rust/"><i>originally published</i></a><i> on my personal blog</i>)</p><p>Among other interesting features, Rust has a powerful macro system. Unfortunately, even after reading The Book and various tutorials, when it came to trying to implement a macro which involved processing complex lists of different elements, I still struggled to understand how it should be done, and it took some time till I got to that "ding" moment and started misusing macros for everything :) <i>(ok, not everything as in the i-am-using-macros-because-i-dont-want-to-use-functions-and-specify-types-and-lifetimes everything like I've seen some people do, but anywhere it's actually useful)</i></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5H983fEVulWNryLr4wz4Al/cb9ff5c7818560c6dd186183a2c6270a/25057125240_939a41249f_z.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a> <a href="https://www.flickr.com/photos/conchur/25057125240/in/photolist-EbdjmG-8NSN1q-qXhueG-YTYnm3-odaneQ-DxCKQA-228jg4t-DU8Axz-XTQfdD-4p6nJk-UKVzbn-YFeKcW-osZ2XM-e6qefx-Tb3a6Q-dCw1zk-Et3kKh-dbAR9x-zHP8TR-a9cqw4-9JQHRy-Et1Ag5-PqFtx1-7x3Ukq-67VJc6-cvoKSo-qH2S9L-zHJAr9-XmCLsL-8AMWXX-ZV2hHh-XGPiHq-ZKpFSB-yqd2P1-23hMiaC-zETYYa-Wj7BVi-PNP4YA-LCNm6c-8AnkrZ-KA7qmt-KjYPxC-SzQsZD-Cxwvqg-GuZ3nn-J4jBaA-TzyjpB-DcYJA1-YQYNA3-My1uu8">image</a> by <a href="https://www.flickr.com/photos/conchur/">Conor Lawless</a></p><p>So, here is my take on describing the principles behind writing such macros. It assumes you have read the <a href="https://doc.rust-lang.org/book/first-edition/macros.html">Macros</a> section from The Book and are familiar with basic macro definitions and token types.</p><p>I'll take <a href="https://en.wikipedia.org/wiki/Reverse_Polish_notation">Reverse Polish Notation</a> as an example for this tutorial. It's interesting because it's simple enough, you might already be familiar with it from school, and yet to implement it statically at compile time, you already need to use a recursive macro approach.</p><p>Reverse Polish Notation (also called postfix notation) uses a stack for all its operations, so that any operand is pushed onto the stack, and any <i>[binary]</i> operator takes two operands from the stack, evaluates the result and puts it back. So an expression like the following:</p>
            <pre><code>2 3 + 4 *</code></pre>
            <p>translates into:</p><ol><li><p>Put <code>2</code> onto the stack.</p></li><li><p>Put <code>3</code> onto the stack.</p></li><li><p>Take the last two values from the stack (<code>3</code> and <code>2</code>), apply operator <code>+</code> and put the result (<code>5</code>) back onto the stack.</p></li><li><p>Put <code>4</code> onto the stack.</p></li><li><p>Take the last two values from the stack (<code>4</code> and <code>5</code>), apply operator <code>*</code> (<code>4 * 5</code>) and put the result (<code>20</code>) back onto the stack.</p></li><li><p>End of expression, the single value on the stack is the result (<code>20</code>).</p></li></ol><p>In a more common infix notation, used in math and most modern programming languages, the expression would look like <code>(2 + 3) * 4</code>.</p><p>So let's write a macro that would evaluate RPN at compile-time by converting it into an infix notation that Rust understands.</p>
            <pre><code>macro_rules! rpn {
  // TODO
}

println!("{}", rpn!(2 3 + 4 *)); // 20</code></pre>
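Before filling in the macro, it may help to see the same stack algorithm as an ordinary run-time function (purely illustrative; the function name and error handling are mine, not part of the macro we're building):

```rust
// Evaluate a whitespace-separated RPN expression at run time
// using an explicit stack, mirroring the steps listed above.
fn eval_rpn(input: &str) -> Option<i64> {
    let mut stack: Vec<i64> = Vec::new();
    for token in input.split_whitespace() {
        match token {
            "+" | "-" | "*" | "/" => {
                // An operator pops the two most recent operands...
                let b = stack.pop()?;
                let a = stack.pop()?;
                // ...and pushes the result back onto the stack.
                stack.push(match token {
                    "+" => a + b,
                    "-" => a - b,
                    "*" => a * b,
                    _ => a / b,
                });
            }
            // Anything else must be a number: push it onto the stack.
            num => stack.push(num.parse().ok()?),
        }
    }
    // A valid expression leaves exactly one value on the stack.
    if stack.len() == 1 { stack.pop() } else { None }
}

fn main() {
    assert_eq!(eval_rpn("2 3 + 4 *"), Some(20));
}
```

The macro version will do exactly this, except that the "stack" will be a compile-time token sequence and the "evaluation" will produce an infix expression.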
            <p>Let's start with pushing numbers onto the stack.</p><p>Macros currently don't allow matching literals, and <code>expr</code> won't work for us because it can accidentally match a sequence like <code>2 + 3 ...</code> instead of taking just a single number, so we'll resort to <code>tt</code> - a generic token matcher that matches only one token tree (whether it's a primitive token like literal/identifier/lifetime/etc. or a <code>()</code>/<code>[]</code>/<code>{}</code>-parenthesized expression containing more tokens):</p>
            <pre><code>macro_rules! rpn {
  ($num:tt) =&gt; {
    // TODO
  };
}</code></pre>
            <p>Now, we'll need a variable for the stack.</p><p>Macros can't use real variables, because we want this stack to exist only at compile time. So, instead, the trick is to have a separate token sequence that can be passed around, and so used as a kind of accumulator.</p><p>In our case, let's represent it as a comma-separated sequence of <code>expr</code> (since we will be using it not only for simple numbers but also for intermediate infix expressions) and wrap it into brackets to separate it from the rest of the input:</p>
            <pre><code>macro_rules! rpn {
  ([ $($stack:expr),* ] $num:tt) =&gt; {
    // TODO
  };
}</code></pre>
            <p>Now, a token sequence is not really a variable - you can't modify it in-place and do something afterwards. Instead, you can create a new copy of this token sequence with the necessary modifications, and recursively call the same macro again.</p><p>If you are coming from a functional language background or have worked with any library providing immutable data before, both of these approaches - mutating data by creating a modified copy and processing lists with recursion - are likely already familiar to you:</p>
            <pre><code>macro_rules! rpn {
  ([ $($stack:expr),* ] $num:tt) =&gt; {
    rpn!([ $num $(, $stack)* ])
  };
}</code></pre>
            <p>Now, obviously, the case with just a single number is rather unlikely and not very interesting to us, so we'll need to match anything else after that number as a sequence of zero or more <code>tt</code> tokens, which can be passed to the next invocation of our macro for further matching and processing:</p>
            <pre><code>macro_rules! rpn {
  ([ $($stack:expr),* ] $num:tt $($rest:tt)*) =&gt; {
      rpn!([ $num $(, $stack)* ] $($rest)*)
  };
}</code></pre>
            <p>At this point we're still missing operator support. How do we match operators?</p><p>If our RPN were a sequence of tokens that we wanted to process in exactly the same way, we could simply use a list like <code>$($token:tt)*</code>. Unfortunately, that wouldn't give us the ability to go through the list and either push an operand or apply an operator depending on each token.</p><p>The Book says that "macro system does not deal with parse ambiguity at all", and that's true for a single macro branch - we can't match a sequence of numbers followed by an operator like <code>$($num:tt)* +</code> because <code>+</code> is also a valid token and could be matched by the <code>tt</code> group, but this is where recursion helps again.</p><p>If you have different branches in your macro definition, Rust will try them one by one, so we can put our operator branches before the numeric one and, this way, avoid any conflict:</p>
            <pre><code>macro_rules! rpn {
  ([ $($stack:expr),* ] + $($rest:tt)*) =&gt; {
    // TODO
  };
  
  ([ $($stack:expr),* ] - $($rest:tt)*) =&gt; {
    // TODO
  };
  
  ([ $($stack:expr),* ] * $($rest:tt)*) =&gt; {
    // TODO
  };
  
  ([ $($stack:expr),* ] / $($rest:tt)*) =&gt; {
    // TODO
  };

  ([ $($stack:expr),* ] $num:tt $($rest:tt)*) =&gt; {
    rpn!([ $num $(, $stack)* ] $($rest)*)
  };
}</code></pre>
            <p>As I said earlier, operators are applied to the last two numbers on the stack, so we'll need to match them separately, "evaluate" the result (construct a regular infix expression) and put it back:</p>
            <pre><code>macro_rules! rpn {
  ([ $b:expr, $a:expr $(, $stack:expr)* ] + $($rest:tt)*) =&gt; {
    rpn!([ $a + $b $(, $stack)* ] $($rest)*)
  };

  ([ $b:expr, $a:expr $(, $stack:expr)* ] - $($rest:tt)*) =&gt; {
    rpn!([ $a - $b $(, $stack)* ] $($rest)*)
  };

  ([ $b:expr, $a:expr $(, $stack:expr)* ] * $($rest:tt)*) =&gt; {
    rpn!([ $a * $b $(,$stack)* ] $($rest)*)
  };

  ([ $b:expr, $a:expr $(, $stack:expr)* ] / $($rest:tt)*) =&gt; {
    rpn!([ $a / $b $(,$stack)* ] $($rest)*)
  };

  ([ $($stack:expr),* ] $num:tt $($rest:tt)*) =&gt; {
    rpn!([ $num $(, $stack)* ] $($rest)*)
  };
}</code></pre>
            <p>I'm not really a fan of such obvious repetition, but, just like with literals, there is no special token type to match operators.</p><p>What we can do, however, is add a helper that would be responsible for the evaluation, and delegate any explicit operator branch to it.</p><p>In macros, you can't really use an external helper, but the one thing you can be sure about is that your macro is already in scope, so the usual trick is to have a branch in the same macro "marked" with some unique token sequence, and call it recursively like we did in regular branches.</p><p>Let's use <code>@op</code> as such a marker, and accept any operator via <code>tt</code> inside it (<code>tt</code> would be unambiguous in such a context because we'll be passing only operators to this helper).</p><p>And the stack does not need to be expanded in each separate branch anymore - since we wrapped it into <code>[]</code> brackets earlier, it can be matched as any other token tree (<code>tt</code>), and then passed into our helper:</p>
            <pre><code>macro_rules! rpn {
  (@op [ $b:expr, $a:expr $(, $stack:expr)* ] $op:tt $($rest:tt)*) =&gt; {
    rpn!([ $a $op $b $(, $stack)* ] $($rest)*)
  };

  ($stack:tt + $($rest:tt)*) =&gt; {
    rpn!(@op $stack + $($rest)*)
  };
  
  ($stack:tt - $($rest:tt)*) =&gt; {
    rpn!(@op $stack - $($rest)*)
  };

  ($stack:tt * $($rest:tt)*) =&gt; {
    rpn!(@op $stack * $($rest)*)
  };
  
  ($stack:tt / $($rest:tt)*) =&gt; {
    rpn!(@op $stack / $($rest)*)
  };

  ([ $($stack:expr),* ] $num:tt $($rest:tt)*) =&gt; {
    rpn!([ $num $(, $stack)* ] $($rest)*)
  };
}</code></pre>
            <p>Now all tokens are processed by the corresponding branches, and we just need to handle the final case, when the stack contains a single item and no more tokens are left:</p>
            <pre><code>macro_rules! rpn {
  // ...
  
  ([ $result:expr ]) =&gt; {
    $result
  };
}</code></pre>
            <p>At this point, if you invoke this macro with an empty stack and an RPN expression, it will already produce a correct result:</p><p><a href="https://play.rust-lang.org/?gist=cd56f6d7335e2d27c05e7fa89545b2cd&amp;version=stable">Playground</a></p>
            <pre><code>println!("{}", rpn!([] 2 3 + 4 *)); // 20</code></pre>
            <p>However, our stack is an implementation detail and we really wouldn't want every consumer to pass an empty stack in, so let's add another catch-all branch at the end that would serve as an entry point and add <code>[]</code> automatically:</p><p><a href="https://play.rust-lang.org/?gist=d94abc0e20aa5c7f689706af06fd1923&amp;version=stable">Playground</a></p>
            <pre><code>macro_rules! rpn {
  // ...

  ($($tokens:tt)*) =&gt; {
    rpn!([] $($tokens)*)
  };
}

println!("{}", rpn!(2 3 + 4 *)); // 20</code></pre>
            <p>Our macro even works for more complex expressions, like the one <a href="https://en.wikipedia.org/wiki/Reverse_Polish_notation#Example">from Wikipedia page about RPN</a>!</p>
            <pre><code>println!("{}", rpn!(15 7 1 1 + - / 3 * 2 1 1 + + -)); // 5</code></pre>
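For reference, here is the whole macro assembled from the pieces developed above. Rust tries the branches top to bottom, so their order matters: the `@op` helper and the operator branches must come before the greedy `$num:tt` branch, and the catch-all entry point must come last:

```rust
macro_rules! rpn {
    // Helper: pop two values, apply the operator, push the result back.
    (@op [ $b:expr, $a:expr $(, $stack:expr)* ] $op:tt $($rest:tt)*) => {
        rpn!([ $a $op $b $(, $stack)* ] $($rest)*)
    };

    // Operator branches delegate to the @op helper.
    ($stack:tt + $($rest:tt)*) => { rpn!(@op $stack + $($rest)*) };
    ($stack:tt - $($rest:tt)*) => { rpn!(@op $stack - $($rest)*) };
    ($stack:tt * $($rest:tt)*) => { rpn!(@op $stack * $($rest)*) };
    ($stack:tt / $($rest:tt)*) => { rpn!(@op $stack / $($rest)*) };

    // Any other single token is an operand: push it onto the stack.
    ([ $($stack:expr),* ] $num:tt $($rest:tt)*) => {
        rpn!([ $num $(, $stack)* ] $($rest)*)
    };

    // No tokens left and exactly one value on the stack: the result.
    ([ $result:expr ]) => { $result };

    // Entry point: start the recursion with an empty stack.
    ($($tokens:tt)*) => { rpn!([] $($tokens)*) };
}

fn main() {
    assert_eq!(rpn!(2 3 + 4 *), 20);
    assert_eq!(rpn!(15 7 1 1 + - / 3 * 2 1 1 + + -), 5);
}
```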
            
    <div>
      <h3>Error handling</h3>
      <a href="#error-handling">
        
      </a>
    </div>
    <p>Now everything seems to work smoothly for correct RPN expressions, but for a macro to be production-ready we need to be sure that it can handle invalid input as well, with a reasonable error message.</p><p>First, let's try to insert another number in the middle and see what happens:</p>
            <pre><code>println!("{}", rpn!(2 3 7 + 4 *));</code></pre>
            <p>Output:</p>
            <pre><code>error[E0277]: the trait bound `[{integer}; 2]: std::fmt::Display` is not satisfied
  --&gt; src/main.rs:36:20
   |
36 |     println!("{}", rpn!(2 3 7 + 4 *));
   |                    ^^^^^^^^^^^^^^^^^ `[{integer}; 2]` cannot be formatted with the default formatter; try using `:?` instead if you are using a format string
   |
   = help: the trait `std::fmt::Display` is not implemented for `[{integer}; 2]`
   = note: required by `std::fmt::Display::fmt`</code></pre>
            <p>Okay, that definitely doesn't look helpful as it doesn't provide any information relevant to the actual mistake in the expression.</p><p>In order to figure out what happened, we will need to debug our macro. For that, we'll use the <a href="https://doc.rust-lang.org/unstable-book/language-features/trace-macros.html"><code>trace_macros</code></a> feature (and, like for any other optional compiler feature, you'll need a nightly version of Rust). We don't want to trace the <code>println!</code> call, so we'll separate our RPN calculation into a variable:</p><p><a href="https://play.rust-lang.org/?gist=610bc0c241aacda3d30a916f89b244cd&amp;version=nightly">Playground</a></p>
            <pre><code>#![feature(trace_macros)]

macro_rules! rpn { /* ... */ }

fn main() {
  trace_macros!(true);
  let e = rpn!(2 3 7 + 4 *);
  trace_macros!(false);
  println!("{}", e);
}</code></pre>
            <p>In the output we'll now see how our macro is being recursively evaluated step by step:</p>
            <pre><code>note: trace_macro
  --&gt; src/main.rs:39:13
   |
39 |     let e = rpn!(2 3 7 + 4 *);
   |             ^^^^^^^^^^^^^^^^^
   |
   = note: expanding `rpn! { 2 3 7 + 4 * }`
   = note: to `rpn ! ( [  ] 2 3 7 + 4 * )`
   = note: expanding `rpn! { [  ] 2 3 7 + 4 * }`
   = note: to `rpn ! ( [ 2 ] 3 7 + 4 * )`
   = note: expanding `rpn! { [ 2 ] 3 7 + 4 * }`
   = note: to `rpn ! ( [ 3 , 2 ] 7 + 4 * )`
   = note: expanding `rpn! { [ 3 , 2 ] 7 + 4 * }`
   = note: to `rpn ! ( [ 7 , 3 , 2 ] + 4 * )`
   = note: expanding `rpn! { [ 7 , 3 , 2 ] + 4 * }`
   = note: to `rpn ! ( @ op [ 7 , 3 , 2 ] + 4 * )`
   = note: expanding `rpn! { @ op [ 7 , 3 , 2 ] + 4 * }`
   = note: to `rpn ! ( [ 3 + 7 , 2 ] 4 * )`
   = note: expanding `rpn! { [ 3 + 7 , 2 ] 4 * }`
   = note: to `rpn ! ( [ 4 , 3 + 7 , 2 ] * )`
   = note: expanding `rpn! { [ 4 , 3 + 7 , 2 ] * }`
   = note: to `rpn ! ( @ op [ 4 , 3 + 7 , 2 ] * )`
   = note: expanding `rpn! { @ op [ 4 , 3 + 7 , 2 ] * }`
   = note: to `rpn ! ( [ 3 + 7 * 4 , 2 ] )`
   = note: expanding `rpn! { [ 3 + 7 * 4 , 2 ] }`
   = note: to `rpn ! ( [  ] [ 3 + 7 * 4 , 2 ] )`
   = note: expanding `rpn! { [  ] [ 3 + 7 * 4 , 2 ] }`
   = note: to `rpn ! ( [ [ 3 + 7 * 4 , 2 ] ] )`
   = note: expanding `rpn! { [ [ 3 + 7 * 4 , 2 ] ] }`
   = note: to `[(3 + 7) * 4, 2]`</code></pre>
            <p>If we carefully look through the trace, we'll notice that the problem originates in these steps:</p>
            <pre><code>   = note: expanding `rpn! { [ 3 + 7 * 4 , 2 ] }`
   = note: to `rpn ! ( [  ] [ 3 + 7 * 4 , 2 ] )`</code></pre>
            <p>Since <code>[ 3 + 7 * 4 , 2 ]</code> was not matched by the <code>([$result:expr]) =&gt; ...</code> branch as a final expression, it was caught by our final catch-all <code>($($tokens:tt)*) =&gt; ...</code> branch instead, prepended with an empty stack <code>[]</code>, and then the original <code>[ 3 + 7 * 4 , 2 ]</code> was matched by the generic <code>$num:tt</code> and pushed onto the stack as a single final value.</p><p>In order to prevent this from happening, let's insert another branch between these last two that would match any stack.</p><p>It would be hit only when we have run out of tokens but the stack doesn't have exactly one final value, so we can treat it as a compile error and produce a more helpful error message using the built-in <a href="https://doc.rust-lang.org/std/macro.compile_error.html"><code>compile_error!</code></a> macro.</p><p>Note that we can't use <code>format!</code> in this context since it uses runtime APIs to format a string, and instead we'll have to limit ourselves to the built-in <code>concat!</code> and <code>stringify!</code> macros to format a message:</p><p><a href="https://play.rust-lang.org/?gist=e56be9422387bcae54aab3b8405a11e7&amp;version=stable">Playground</a></p>
            <pre><code>macro_rules! rpn {
  // ...

  ([ $result:expr ]) =&gt; {
    $result
  };

  ([ $($stack:expr),* ]) =&gt; {
    compile_error!(concat!(
      "Could not find final value for the expression, perhaps you missed an operator? Final stack: ",
      stringify!([ $($stack),* ])
    ))
  };

  ($($tokens:tt)*) =&gt; {
    rpn!([] $($tokens)*)
  };
}</code></pre>
            <p>The error message is now more meaningful and contains at least some details about the current state of evaluation:</p>
            <pre><code>error: Could not find final value for the expression, perhaps you missed an operator? Final stack: [ (3 + 7) * 4 , 2 ]
  --&gt; src/main.rs:31:9
   |
31 |         compile_error!(concat!("Could not find final value for the expression, perhaps you missed an operator? Final stack: ", stringify!([$($stack),*])))
   |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
40 |     println!("{}", rpn!(2 3 7 + 4 *));
   |                    ----------------- in this macro invocation</code></pre>
            <p>But what if, instead, we miss some number?</p><p><a href="https://play.rust-lang.org/?gist=ce40630b8c1aa610c46b94557fdc9905&amp;version=stable">Playground</a></p>
            <pre><code>println!("{}", rpn!(2 3 + *));</code></pre>
            <p>Unfortunately, this one is still not too helpful:</p>
            <pre><code>error: expected expression, found `@`
  --&gt; src/main.rs:15:14
   |
15 |         rpn!(@op $stack * $($rest)*)
   |              ^
...
40 |     println!("{}", rpn!(2 3 + *));
   |                    ------------- in this macro invocation</code></pre>
            <p>If you try to use <code>trace_macros</code>, for some reason even it won't expand the stack here, but, luckily, it's relatively clear what's going on - <code>@op</code> has very specific conditions as to what should be matched (it expects at least two values on the stack), and, when it can't match, <code>@</code> gets matched by the same way-too-greedy <code>$num:tt</code> and pushed onto the stack.</p><p>To avoid this, again, we'll add another branch to match anything starting with <code>@op</code> that wasn't matched already, and produce a compile error:</p><p><a href="https://play.rust-lang.org/?gist=8729a8f3c96fa58ed62d35804c48782d&amp;version=stable">Playground</a></p>
            <pre><code>macro_rules! rpn {
  (@op [ $b:expr, $a:expr $(, $stack:expr)* ] $op:tt $($rest:tt)*) =&gt; {
    rpn!([ $a $op $b $(, $stack)* ] $($rest)*)
  };

  (@op $stack:tt $op:tt $($rest:tt)*) =&gt; {
    compile_error!(concat!(
      "Could not apply operator `",
      stringify!($op),
      "` to the current stack: ",
      stringify!($stack)
    ))
  };

  // ...
}</code></pre>
            <p>Let's try again:</p>
            <pre><code>error: Could not apply operator `*` to the current stack: [ 2 + 3 ]
  --&gt; src/main.rs:9:9
   |
9  |         compile_error!(concat!("Could not apply operator ", stringify!($op), " to current stack: ", stringify!($stack)))
   |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
46 |     println!("{}", rpn!(2 3 + *));
   |                    ------------- in this macro invocation</code></pre>
            <p>Much better! Now our macro can evaluate any RPN expression at compile-time, and gracefully handles most common mistakes, so let's call it a day and say it's production-ready :)</p><p>There are many more small improvements we could add, but I'd like to leave them outside this demonstration tutorial.</p><p>Feel free to let me know if this has been useful and/or what topics you'd like to see better covered <a href="https://twitter.com/RReverser">on Twitter</a>!</p> ]]></content:encoded>
            <category><![CDATA[Rust]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[Cloudflare Polish]]></category>
            <guid isPermaLink="false">57apSexjoH5EJBBcLZNqv4</guid>
            <dc:creator>Ingvar Stepanyan</dc:creator>
        </item>
        <item>
            <title><![CDATA[A Very WebP New Year from Cloudflare]]></title>
            <link>https://blog.cloudflare.com/a-very-webp-new-year-from-cloudflare/</link>
            <pubDate>Wed, 21 Dec 2016 14:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare has an automatic image optimization feature called Polish, available to customers on paid plans. It recompresses images and strips excess data, speeding up delivery to browsers. ]]></description>
            <content:encoded><![CDATA[ <p>Cloudflare has an automatic image optimization feature called <a href="/introducing-polish-automatic-image-optimizati/">Polish</a>, available to customers on paid plans. It recompresses images and removes unnecessary data so that they are delivered to browsers more quickly.</p><p>Up until now, Polish has not changed image types when optimizing (even if, for example, a PNG might sometimes have been smaller than the equivalent JPEG). But a new feature in Polish allows us to swap out an image for an equivalent image compressed using Google’s WebP format when the browser is capable of handling WebP and delivering that type of image would be quicker.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6RCs2VzrEL7pYO3RNWBRYa/5434d04bd47f702aa548ea48985c2aff/holly.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by/2.0/">CC-BY 2.0</a> <a href="https://www.flickr.com/photos/john47kent/5307525503/">image</a> by <a href="https://www.flickr.com/photos/john47kent/">John Stratford</a></p>
    <div>
      <h3>What is WebP?</h3>
      <a href="#what-is-webp">
        
      </a>
    </div>
    <p>The main image formats used on the web haven’t changed much since the early days (apart from the SVG vector format, PNG was the last one to establish itself, <a href="https://en.wikipedia.org/wiki/Portable_Network_Graphics#History_and_development">almost two decades ago</a>).</p><p><a href="https://en.wikipedia.org/wiki/WebP">WebP</a> is a newer image format for the web, proposed by Google. It takes advantage of progress in <a href="https://www.cloudflare.com/learning/performance/glossary/what-is-image-compression/">image compression techniques</a> since formats such as JPEG and PNG were designed. It is often able to compress the images into a significantly smaller amount of data than the older formats.</p><p>WebP is versatile and able to replace the three main raster image formats used on the web today:</p><ul><li><p>WebP can do lossy compression, so it can be used instead of JPEG for photographic and photo-like images.</p></li><li><p>WebP can do lossless compression, and supports an alpha channel meaning images can have transparent regions. So it can be used instead of PNG, such as for images with sharp transitions that should be reproduced exactly (e.g. line art and graphic design elements).</p></li><li><p>WebP images can be animated, so it can be used as a replacement for animated GIF images.</p></li></ul><p>Currently, the main browser that supports WebP is Google’s Chrome (both on desktop and mobile devices). See <a href="http://caniuse.com/#feat=webp">the WebP page on caniuse.com</a> for more details.</p>
    <div>
      <h3>Polish WebP conversion</h3>
      <a href="#polish-webp-conversion">
        
      </a>
    </div>
    <p>Customers on the Pro, Business, and Enterprise plans can enable the automatic creation of WebP images by checking the WebP box in the Polish settings for a zone (these are found on the “Speed” page of the dashboard):</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5Beloplqsl1SNVyjFSOyl4/d6853afc514b5a6fc9cf94758d629695/polish-webp.png" />
            
            </figure><p>When this is enabled, Polish will optimize images just as it always has. But it will also convert the image to WebP, if WebP can shrink the image data more than the original format. These WebP images are only returned to web browsers that indicate they support WebP (e.g. Google Chrome), so most websites using Polish should be able to benefit from WebP conversion.</p><p>(Although Polish can now produce WebP images by converting them from other formats, it can't consume WebP images to optimize them. If you put a WebP image on an origin site, Polish won't do anything with it. Until the WebP ecosystem grows and matures, it is uncertain that attempting to optimize WebP is worthwhile.)</p><p>Polish has two modes: <i>lossless</i> and <i>lossy</i>. In lossless mode, JPEG images are optimized to remove unnecessary data, but the image displayed is unchanged. In lossy mode, Polish reduces the quality of JPEG images in a way that should not have a significant visible effect, but allows it to further reduce the size of the image data.</p><p>These modes are respected when JPEG images are converted to WebP. In lossless mode, the conversion is done in a way that preserves the image as faithfully as possible (due to the nature of the conversion, the resulting WebP might not be exactly identical, but there are unlikely to be any visible differences). In lossy mode, the conversion sacrifices a little quality in order to shrink the image data further, but as before, there should not be a significant visible effect.</p><p>These modes do not affect PNGs and GIFs, as these are lossless formats and so Polish will preserve images in those formats exactly.</p><p>Note that WebP conversion does not change the URLs of images, even if the file extension in the URL implies a different format. For example, a JPEG image at <code>https://example.com/picture.jpg</code> that has been converted to WebP will still have that same URL. 
The “Content-Type” HTTP header tells the browser the true format of an image.</p>
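Browsers signal WebP support in the <code>Accept</code> request header (Chrome, for example, includes <code>image/webp</code> there). The decision can be sketched roughly as follows; the helper below is a hypothetical illustration, not Cloudflare's actual implementation:

```rust
// Hypothetical sketch of WebP content negotiation: serve the WebP
// variant only when the browser's Accept header advertises support
// AND the WebP version is actually smaller than the original.
fn pick_format(accept: &str, original_len: usize, webp_len: Option<usize>) -> &'static str {
    let supports_webp = accept
        .split(',')
        .any(|part| part.trim().starts_with("image/webp"));
    match webp_len {
        Some(len) if supports_webp && len < original_len => "image/webp",
        // Otherwise keep the original format (JPEG here, for example).
        _ => "image/jpeg",
    }
}

fn main() {
    // A Chrome-style Accept header gets the smaller WebP variant...
    assert_eq!(pick_format("image/webp,image/apng,*/*", 1000, Some(600)), "image/webp");
    // ...while a browser without WebP support keeps the JPEG.
    assert_eq!(pick_format("image/png,*/*", 1000, Some(600)), "image/jpeg");
}
```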
    <div>
      <h3>By the Numbers</h3>
      <a href="#by-the-numbers">
        
      </a>
    </div>
    <p>A few studies have been published on how well WebP compresses images compared with established formats. These studies provide a useful picture of how WebP performs. But before we released our WebP support, we decided to do a survey based on the context in which we planned to use WebP:</p><ul><li><p>We evaluated WebP based on a collection of images gathered from the websites of our customers. The corpus consisted of 23,500 images (JPEG, PNG and GIFs).</p></li><li><p>Some studies compare WebP with JPEG by taking uncompressed images and compressing them to JPEG and WebP directly. But we wanted to know what happens when we convert an image that has already been compressed as a JPEG. In a sense this is an unfair test, because a JPEG may contain artifacts due to compression that would not be present in the original raw image, and conversion to WebP may try to retain those artifacts. But it is such conversions that matter for our use of WebP (this consideration does not apply to PNG and GIF conversions, because they are lossless).</p></li><li><p>We’re not just interested in whether WebP conversion can shrink images found on the web. We want to know how much WebP allows Polish to reduce the size further than it already does, thus providing a real end-user benefit. So our survey also includes the results of Polish without WebP.</p></li><li><p>In some cases, converting to WebP does not produce a result smaller than the optimized image in the original format. In such cases, we discard the WebP image. So the figures presented below do not penalize WebP for such cases.</p></li></ul><p>Here is a chart showing the results of Polish, with and without WebP conversion. For each format, the average original image size is normalized to 100%, and the average sizes after Polishing are shown relative to that.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5YazeQmjUBmLEir39hYrUh/bab2f4768bf57f5a1f8aecd3f07d9d9a/webp-chart.png" />
            
            </figure><p>Here are the average savings corresponding to the chart:</p><table><tr><td><p><b>Original Format</b></p></td><td><p><b>Polish without WebP</b></p></td><td><p><b>Polish using WebP</b></p></td></tr><tr><td><p>JPEG (with Polish lossless mode)</p></td><td><p>9%</p></td><td><p>19%</p></td></tr><tr><td><p>JPEG (with Polish lossy mode)</p></td><td><p>34%</p></td><td><p>47%</p></td></tr><tr><td><p>PNG</p></td><td><p>16%</p></td><td><p>38%</p></td></tr><tr><td><p>GIF</p></td><td><p>3%</p></td><td><p>16%</p></td></tr></table><p>(The saving is calculated as 100% - (polished size) / (original size).)</p><p>As you can see, WebP conversion achieves significant size improvements not only for JPEG images, but also for PNG and GIF images. We believe supporting WebP will result in lower bandwidth and faster website delivery.</p>
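As a quick sanity check of the savings formula above, using hypothetical byte counts (for example, a 100 KB JPEG polished with WebP down to 53 KB corresponds to the 47% figure in the lossy row):

```rust
// Saving as defined above: 100% - (polished size) / (original size),
// computed in whole percent using integer arithmetic.
fn saving_percent(original_bytes: u64, polished_bytes: u64) -> u64 {
    100 - polished_bytes * 100 / original_bytes
}

fn main() {
    // Hypothetical sizes matching two rows of the table above.
    assert_eq!(saving_percent(100_000, 53_000), 47); // JPEG, lossy mode, with WebP
    assert_eq!(saving_percent(100_000, 62_000), 38); // PNG, with WebP
}
```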
    <div>
      <h3>… and a WebP New Year</h3>
      <a href="#and-a-webp-new-year">
        
      </a>
    </div>
    <p>WebP does not yet have the same level of browser support as JPEG, PNG and GIF, but we are excited about its potential to streamline web pages. Polish WebP conversion allows our customers to adopt WebP with a simple change to the settings in the Cloudflare dashboard. So, if you are on one of our <a href="https://www.cloudflare.com/plans/">paid plans</a>, we encourage you to try it out today.</p><p>PS — Want to help optimize the web? We’re <a href="https://www.cloudflare.com/join-our-team/">hiring</a>.</p> ]]></content:encoded>
            <category><![CDATA[Cloudflare Polish]]></category>
            <category><![CDATA[Compression]]></category>
            <category><![CDATA[Optimization]]></category>
            <category><![CDATA[WebP]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <guid isPermaLink="false">7M6kU8b93kMUiHWhgQTkjv</guid>
            <dc:creator>David Wragg</dc:creator>
        </item>
        <item>
            <title><![CDATA[Doubling the speed of jpegtran with SIMD]]></title>
            <link>https://blog.cloudflare.com/doubling-the-speed-of-jpegtran/</link>
            <pubDate>Thu, 08 Oct 2015 10:10:41 GMT</pubDate>
            <description><![CDATA[ It is no secret that at CloudFlare we put a great effort into accelerating our customers' websites. One way to do it is to reduce the size of the images on the website. This is what our Polish product is for.  ]]></description>
            <content:encoded><![CDATA[ <p>It is no secret that at CloudFlare we put great effort into accelerating our customers' websites. One way to do it is to reduce the size of the images on the website. This is what our <a href="/introducing-polish-automatic-image-optimizati/">Polish</a> product is for. It takes various images and makes them smaller using open source tools, such as jpegtran, gifsicle and pngcrush.</p><p>However, those tools are computationally expensive, and making them go faster makes our servers go faster, and subsequently our customers' websites as well.</p><p>Recently, I noticed that we spent ten times as much time "polishing" JPEG images as we do polishing PNGs.</p><p>We already improved the performance of <a href="https://github.com/cloudflare/pngcrush">pngcrush</a> by using our supercharged version of <a href="/cloudflare-fights-cancer/">zlib</a>. So it was time to look at what could be done for jpegtran (part of the <a href="http://www.ijg.org/">libjpeg</a> distribution).</p>
    <div>
      <h3>Quick profiling</h3>
      <a href="#quick-profiling">
        
      </a>
    </div>
    <p>To get fast results I usually use the Linux perf utility. It gives a nice, if simple, view of the hotspots in the code. I used this image for my benchmark.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/etDiZy3jlhasH4jzoAVTD/44548413e30760906deae91d0b367f73/print_poster_0025.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</a> <a href="http://www.eso.org/public/products/print_posters/print_poster_0025/">image</a> by <a href="http://www.eso.org/">ESO</a></p><p><code>perf record ./jpegtran -outfile /dev/null -progressive -optimise -copy none test.jpeg</code></p><p>And we get:</p><pre><code>perf report
54.90% lt-jpegtran libjpeg.so.9.1.0 [.] encode_mcu_AC_refine
28.58% lt-jpegtran libjpeg.so.9.1.0 [.] encode_mcu_AC_first</code></pre><p>Wow. That is 83.5% of the time spent in just two functions! A quick look suggested that both functions deal with Huffman coding of the jpeg image. That is a fairly simple compression technique where frequently occurring characters are replaced with short encodings, and rare characters get longer encodings.</p><p>The time elapsed for the tested file was 6 seconds.</p>
    <div>
      <h3>Where to start?</h3>
      <a href="#where-to-start">
        
      </a>
    </div>
    <p>First let's look at the most used function, <i>encode_mcu_AC_refine</i>. It is the heaviest function, and therefore optimizing it would benefit us the most.</p><p>The function has two loops. The first loop:</p>
            <pre><code>for (k = cinfo-&gt;Ss; k &lt;= Se; k++) {
    temp = (*block)[natural_order[k]];
    if (temp &lt; 0)
      temp = -temp;      /* temp is abs value of input */
    temp &gt;&gt;= Al;         /* apply the point transform */
    absvalues[k] = temp; /* save abs value for main pass */
    if (temp == 1)
      EOB = k;           /* EOB = index of last newly-nonzero coef */
  }</code></pre>
            <p>This loop iterates over one array (<i>natural_order</i>), reads indices from it that it uses to read from *<i>block</i> into <i>temp</i>, computes the absolute value of <i>temp</i>, and shifts it to the right by some amount. In addition it keeps the <i>EOB</i> variable pointing to the last index for which <i>temp</i> was <i>1</i>.</p><p>Looping over an array is the textbook exercise for SIMD (Single Instruction Multiple Data) instructions; however, in this case we have two problems: 1) indirect indices are hard for SIMD, and 2) the <i>EOB</i> value must be kept updated.</p>
    <div>
      <h3>Optimizing</h3>
      <a href="#optimizing">
        
      </a>
    </div>
    <p>The first problem cannot be solved at the function level. While Haswell introduced new gather instructions that allow loading data using indices, those instructions are still slow, and more importantly they operate only on 32-bit and 64-bit values; we are dealing with 16-bit values here.</p><p>Our only option is therefore to load all the values sequentially. One SIMD XMM register is 128 bits wide and can hold eight 16-bit integers, so we load eight values on each iteration of the loop. (Note: we could also use 256-bit wide YMM registers here, but they only work on the newest CPUs, while gaining little performance.)</p>
            <pre><code>    __m128i x1 = _mm_setzero_si128(); // Load 8 16-bit values sequentially
    x1 = _mm_insert_epi16(x1, (*block)[natural_order[k+0]], 0);
    x1 = _mm_insert_epi16(x1, (*block)[natural_order[k+1]], 1);
    x1 = _mm_insert_epi16(x1, (*block)[natural_order[k+2]], 2);
    x1 = _mm_insert_epi16(x1, (*block)[natural_order[k+3]], 3);
    x1 = _mm_insert_epi16(x1, (*block)[natural_order[k+4]], 4);
    x1 = _mm_insert_epi16(x1, (*block)[natural_order[k+5]], 5);
    x1 = _mm_insert_epi16(x1, (*block)[natural_order[k+6]], 6);
    x1 = _mm_insert_epi16(x1, (*block)[natural_order[k+7]], 7);</code></pre>
            <p>What you see above are intrinsic functions. Such intrinsics usually map to a specific CPU instruction. By using intrinsics we can integrate high performance code inside a C function, without writing assembly code. Usually they give more than enough flexibility and performance. The <i>__m128i</i> type corresponds to a 128-bit XMM register.</p><p>Performing the transformation is straightforward:</p>
            <pre><code>    x1 = _mm_abs_epi16(x1);        // Get absolute value of 16-bit integers
    x1 = _mm_srli_epi16(x1, Al);   // &gt;&gt; 16-bit integers by Al bits
    _mm_storeu_si128((__m128i*)&amp;absvalues[k], x1); // Store</code></pre>
            <p>As you can see, we need only two instructions (plus a store) to perform the transformation on eight values and, as a bonus, no branch is required.</p><p>The next tricky part is updating <i>EOB</i> so that it points to the index of the last value equal to 1. For that we need to find the highest lane inside <i>x1</i> holding a 1, and if there is one, update <i>EOB</i> accordingly.</p><p>We can easily compare all the values inside <i>x1</i> to one:</p>
            <pre><code>    x1 = _mm_cmpeq_epi16(x1, _mm_set1_epi16(1));  // Compare to 1</code></pre>
            <p>The result of this operation is a mask: in every lane that compared equal all bits are set to 1, and in every other lane all bits are cleared to 0. But how do we find the highest lane whose bits are all ones? There is an instruction that extracts the top bit of each byte inside an <i>__m128i</i> value and puts them into a regular integer. From there we can find the lane by locating the top set bit in that integer:</p>
            <pre><code>    unsigned int idx = _mm_movemask_epi8(x1);        // Extract byte mask
    EOB = idx? k + 16 - __builtin_clz(idx)/2 : EOB;  // Compute index</code></pre>
            <p>That is it. EOB will be updated correctly.</p><p>What does that simple optimization give us? Running jpegtran again shows that the same image now takes 5.2 seconds! That is a 1.15x speedup right there. But no reason to stop now.</p>
    <div>
      <h3>Keep going</h3>
      <a href="#keep-going">
        
      </a>
    </div>
    <p>Next the loop iterates over the array we just prepared (<i>absvalues</i>). At the beginning of the loop we see:</p>
            <pre><code>  if ((temp = absvalues[k]) == 0) {
      r++;
      continue;
    }</code></pre>
            <p>So it simply skips all the zero values in the array until it finds a nonzero value. Is it worth optimizing? Running a quick test, I found that long sequences of zero values are common, and iterating over them individually is expensive. Let's apply the same mask-extraction trick here:</p>
            <pre><code>    __m128i t = _mm_loadu_si128((__m128i*)&amp;absvalues[k]);
    t = _mm_cmpeq_epi16(t, _mm_setzero_si128()); // Compare to 0
    int idx = _mm_movemask_epi8(t);              // Extract byte mask
    if (idx == 0xffff) {                         // If all the values are zero skip them all
      r += 8;
      k += 8;
      continue;
    } else {             // If some are nonzero, skip up to the first nonzero
      int skip = __builtin_ctz(~idx)/2;
      r += skip;
      k += skip;
      if (k&gt;Se) break;   // Stop if gone too far
    }
    temp = absvalues[k]; // Load the first nonzero value</code></pre>
            <p>What do we get now? Our test image now takes only 3.7 seconds. We are already at a 1.62x speedup.</p><p>And what does perf show?</p><pre><code>45.94% lt-jpegtran libjpeg.so.9.1.0 [.] encode_mcu_AC_first
28.73% lt-jpegtran libjpeg.so.9.1.0 [.] encode_mcu_AC_refine</code></pre><p>We can see that the tables have turned. Now <i>encode_mcu_AC_first</i> is the slower function of the two, and takes almost 50% of the time.</p>
    <div>
      <h3>Moar optimizations!</h3>
      <a href="#moar-optimizations">
        
      </a>
    </div>
            <p>Looking at <i>encode_mcu_AC_first</i> we see one loop that performs operations similar to those in the function we already optimized. Why not reuse the optimizations we already know are good?</p><p>We split the one big loop into two smaller loops: the first performs the required transformations and stores the results into auxiliary arrays; the second reads those values, skipping the zero entries.</p><p>Note that the transformations are a little different now:</p>
            <pre><code>  if (temp &lt; 0) {
      temp = -temp;
      temp &gt;&gt;= Al;
      temp2 = ~temp;
    } else {
      temp &gt;&gt;= Al;
      temp2 = temp;
    }</code></pre>
            <p>The equivalent for SIMD would be:</p>
            <pre><code>    __m128i neg = _mm_cmpgt_epi16(_mm_setzero_si128(), x1);// Negative mask
    x1 = _mm_abs_epi16(x1);                                // Absolute value
    x1 = _mm_srli_epi16(x1, Al);                           // &gt;&gt;
    __m128i x2 = _mm_andnot_si128(x1, neg);                // Invert x1, and apply negative mask
    x2 = _mm_xor_si128(x2, _mm_andnot_si128(neg, x1));     // Apply non-negative mask</code></pre>
            <p>Skipping zero values is performed in the same manner as before.</p><p>Measuring the performance now gives us 2.65 seconds for the test image! That is 2.25x faster than the original jpegtran.</p>
    <div>
      <h3>Conclusions</h3>
      <a href="#conclusions">
        
      </a>
    </div>
    <p>Let's take a final look at the perf output:</p><pre><code>40.78% lt-jpegtran libjpeg.so.9.1.0 [.] encode_mcu_AC_refine
26.01% lt-jpegtran libjpeg.so.9.1.0 [.] encode_mcu_AC_first</code></pre><p>We can see that the two functions we optimized still take a significant portion of the execution time, but now it is just 67% of the entire program. If you are asking yourself how we got a 2.25x speedup with those functions only going from 83.5% to 67%, read about the <a href="https://en.wikipedia.org/wiki/Potato_paradox">Potato Paradox</a>.</p><p>It is definitely worth making further optimizations here, but it might be better to look outside the functions, at how the data is arranged and passed around.</p><p>The bottom line is: optimizing code is fun, and using SIMD instructions can give you a significant boost, as in this case.</p><p>We at CloudFlare make sure that our servers run at top notch performance, so our customers' websites do as well!</p><p>You can find the final code on our GitHub repository: <a href="https://github.com/cloudflare/jpegtran">https://github.com/cloudflare/jpegtran</a></p> ]]></content:encoded>
            <category><![CDATA[Speed]]></category>
            <category><![CDATA[Cloudflare Polish]]></category>
            <category><![CDATA[Open Source]]></category>
            <category><![CDATA[Optimization]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Programming]]></category>
            <guid isPermaLink="false">4HR5elkONeiCUjDAOCBRNk</guid>
            <dc:creator>Vlad Krasnov</dc:creator>
        </item>
        <item>
            <title><![CDATA[Prepare Your Site for Traffic Spikes this Holiday Season]]></title>
            <link>https://blog.cloudflare.com/prepare-your-site-for-traffic-spikes-this-holiday-season/</link>
            <pubDate>Mon, 17 Nov 2014 19:46:32 GMT</pubDate>
            <description><![CDATA[ The holiday season is approaching, and everyone is thinking about gifts for their friends and family. As people increasingly shop online, this means huge spikes in traffic for web sites---especially ecommerce sites. ]]></description>
            <content:encoded><![CDATA[ <p>The holiday season is approaching, and everyone is thinking about gifts for their friends and family. As people increasingly shop online, this means huge spikes in traffic for web sites---especially <a href="https://www.cloudflare.com/ecommerce/">ecommerce sites</a>. We want you to get the most out of this year’s surge in web traffic, so we’ve created a list of tips to help you prepare your site to ensure your visitors have a reliable and fast experience.</p>
    <div>
      <h3>Make sure your site can handle traffic spikes:</h3>
      <a href="#make-sure-your-site-can-handle-traffic-spikes">
        
      </a>
    </div>
    
    <div>
      <h4>1) Contact your hosting provider to understand the limits of your hosting plan</h4>
      <a href="#1-contact-your-hosting-provider-to-understand-the-limits-of-your-hosting-plan">
        
      </a>
    </div>
    <p>Even though CloudFlare absorbs most of the load on your website via caching and request filtering, a certain amount of traffic will still pass through to your host. Knowing the limits of your plan can help prevent your hosting plan from becoming a bottleneck.</p>
    <div>
      <h4>2) Reduce the number of unwanted requests to your infrastructure</h4>
      <a href="#2-reduce-the-number-of-unwanted-requests-to-your-infrastructure">
        
      </a>
    </div>
    <p>CloudFlare allows you to block IP addresses individually, or all IPs from entire regions. If you don’t want or need traffic from certain IPs or regions, you can block them using your Threat Control panel. This is useful for sites that know where their visitors usually come from.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3s8MP4aiaRxucDbOkhJMMi/d0542918b04f81f95cdc51d6fa56476b/Screen-Shot-2014-11-14-at-2-03-26-PM.png" />
            
            </figure>
    <div>
      <h4>3) Use CloudFlare IP addresses to your advantage</h4>
      <a href="#3-use-cloudflare-ip-addresses-to-your-advantage">
        
      </a>
    </div>
    <p>Take action to prevent attacks on your site during peak season by configuring your firewall to accept traffic only from CloudFlare IP addresses during the holidays. If you only accept CloudFlare IPs, you can prevent attackers from reaching your origin IP address directly and knocking your site offline.</p>
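<p>For instance, on an nginx origin this lockdown could be a short allowlist. This is a hypothetical sketch, not a drop-in config: only two of CloudFlare’s published ranges are shown, and the full, current list at <a href="https://www.cloudflare.com/ips">cloudflare.com/ips</a> is what you should actually use:</p>

```nginx
# Accept proxied traffic only from CloudFlare's published ranges;
# everything else is refused before it reaches the site.
allow 103.21.244.0/22;
allow 103.22.200.0/22;
# ... remaining ranges from https://www.cloudflare.com/ips ...
deny all;
```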
    <div>
      <h4>4) Ensure CloudFlare IPs are allowlisted</h4>
      <a href="#4-ensure-cloudflare-ips-are-allowlisted">
        
      </a>
    </div>
    <p>CloudFlare operates as a reverse proxy for your site, so all connections come from CloudFlare IPs; restricting our IPs can cause issues for visitors trying to access your site. The list of our IPs can be found here: <a href="https://www.cloudflare.com/ips">https://www.cloudflare.com/ips</a></p>
    <div>
      <h4>5) Go beyond default caching for the fastest site possible</h4>
      <a href="#5-go-beyond-default-caching-for-the-fastest-site-possible">
        
      </a>
    </div>
    <p>By default CloudFlare <a href="https://support.cloudflare.com/hc/en-us/articles/200172516-Which-file-extensions-does-CloudFlare-cache-for-static-content-">caches static content</a> with our CDN; however, you can extend our caching by creating custom Page Rules. Under the Page Rules section of your account, you can set a pattern--either your entire website, or a section of your site--then turn on the “Cache everything” option. Creating a page rule and setting the Cache Everything option helps reduce the number of times CloudFlare has to hit your origin to download cacheable items.</p><p>Setting up a custom Page Rule like this is ideal if you have a campaign going on over the holiday season. With the Cache Everything option enabled, CloudFlare will be serving your entire site, taking the load off of your server completely, making your site as fast as possible.</p><p>Edge Cache Expire TTL and Browser Cache Expire TTL allow you to determine how long we cache resources at our edge, and how long browsers will cache assets.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1qlwZv5c3n2HxcBFsvSgFp/044f535a5d8180f798b2fe640f3b57c3/page-rule.png" />
            
            </figure>
    <div>
      <h3>Further suggestions for optimizing CloudFlare:</h3>
      <a href="#further-suggestions-for-optimizing-cloudflare">
        
      </a>
    </div>
    
    <div>
      <h4>1) Make sure your back-end analytics are accurate</h4>
      <a href="#1-make-sure-your-back-end-analytics-are-accurate">
        
      </a>
    </div>
    <p>Because CloudFlare proxies your traffic, our IP addresses will show up in your back-end server logs by default. To make sure you are logging visitors’ actual IP addresses, install mod_cloudflare, which restores the original visitor IP to your server logs.</p>
    <div>
      <h4>2) Turn on Auto Minification to send as little data as possible</h4>
      <a href="#2-turn-on-auto-minification-to-send-as-little-data-as-possible">
        
      </a>
    </div>
    <p>Auto Minification is a method that helps your site send as little information as possible to <a href="https://www.cloudflare.com/solutions/ecommerce/optimization/">increase performance</a>. It works by taking JavaScript, CSS, and HTML and removing all comments and white spaces.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3aOvBeekAAcIcjiNZCD2on/2aadbf3fb6bd3421453da6c16fe77e12/minify.png" />
            
            </figure>
    <div>
      <h4>3) Turn on Rocket Loader to send data in the right order</h4>
      <a href="#3-turn-on-rocket-loader-to-send-data-in-the-right-order">
        
      </a>
    </div>
    <p>Rocket Loader is an asynchronous JavaScript loader. It ensures that individual scripts on your page won’t block other content from loading, loads third party scripts in the order they are ready, and bundles all scripts into a single request so multiple responses can be streamed---in short, it makes your page render much faster on any device.</p><p>At a high level, Rocket Loader works like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5wqhAv5VJwSzBeQRASEphv/cd283c9c7fd2de32d7e54649f2f5770b/rocket_loader_diagram-png-scaled500-1.png" />
            
            </figure>
    <div>
      <h4>4) Turn on Mirage to lazy load images</h4>
      <a href="#4-turn-on-mirage-to-lazy-load-images">
        
      </a>
    </div>
    <p><i>(Available for Pro, Business, and Enterprise plans)</i> Mirage will determine which images are visible to the end user and send those first; other images that are off screen will be lazy loaded as needed. This feature is especially useful for sites with many images, like most ecommerce sites.</p><p>For example: if your site has images of seventy different t-shirts for sale, rather than having the customer wait for all seventy images to load, Mirage quickly delivers the images immediately visible to the user, then loads the rest of the images as the customer scrolls down. By having the most important images load lightning fast, the end user’s experience is improved.</p>
    <div>
      <h4>5) Turn on Polish to compress images</h4>
      <a href="#5-turn-on-polish-to-compress-images">
        
      </a>
    </div>
    <p><i>(Available for Pro, Business, and Enterprise plans)</i> Polish recompresses images to make them as small as possible in order to increase site performance.</p><p>For example: You may have images on your site that are not optimally compressed. When CloudFlare puts those images into cache it will automatically recompress them, making them smaller and allowing them to be loaded as quickly as possible. CloudFlare can <a href="https://www.cloudflare.com/learning/performance/glossary/what-is-image-compression/">compress images</a> in a lossless or lossy way.</p>
    <div>
      <h4>6) Make sure those last minute changes are seen</h4>
      <a href="#6-make-sure-those-last-minute-changes-are-seen">
        
      </a>
    </div>
    <p>If you want to make quick changes to your page and have your visitors see that change immediately, you can purge individual files from CloudFlare’s cache.</p><p>For example, if you are running a sale only for Black Friday, and you want new content displayed to your visitors, you can purge a single page so that CloudFlare will return to your origin server to fetch a new version of that page for our cache.</p><p>Please Note: if you purge your entire cache your origin will receive a flood of traffic until CloudFlare gets all of your assets back into our cache.</p>
    <div>
      <h3>One last thing</h3>
      <a href="#one-last-thing">
        
      </a>
    </div>
    <p>Even if you don’t implement any of the suggestions above you are still ahead of the game by being on CloudFlare’s network. Since we have 28 data centers around the world, we bring your site close to your visitors. And since we run an Anycast network, visitors are automatically directed to the closest data center, meaning that your site will be faster because requests travel a shorter distance.</p><p>We wish you all the best this holiday season! Good luck!</p> ]]></content:encoded>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Holidays]]></category>
            <category><![CDATA[eCommerce]]></category>
            <category><![CDATA[Rocket Loader]]></category>
            <category><![CDATA[Mirage]]></category>
            <category><![CDATA[Cloudflare Polish]]></category>
            <category><![CDATA[AutoMinify]]></category>
            <guid isPermaLink="false">2JgfRGd3lltJN4wMWrFngF</guid>
            <dc:creator>Andrew A. Schafer</dc:creator>
        </item>
        <item>
            <title><![CDATA[Experimenting with mozjpeg 2.0]]></title>
            <link>https://blog.cloudflare.com/experimenting-with-mozjpeg-2-0/</link>
            <pubDate>Tue, 29 Jul 2014 02:29:00 GMT</pubDate>
            <description><![CDATA[ One of the services that CloudFlare provides to paying customers is called Polish. Polish automatically recompresses images cached by CloudFlare to ensure that they are as small as possible and can be delivered to web browsers as quickly as possible. ]]></description>
<content:encoded><![CDATA[ <p>One of the services that CloudFlare provides to paying customers is called <a href="/introducing-polish-automatic-image-optimizati">Polish</a>. Polish automatically recompresses images cached by CloudFlare to ensure that they are as small as possible and can be delivered to web browsers as quickly as possible.</p><p>We've recently rolled out a new version of Polish that uses updated techniques (and was completely rewritten from a collection of programs into a single executable written in Go). As part of that rewrite we looked at the performance of the recently released <a href="https://blog.mozilla.org/research/2014/07/15/mozilla-advances-jpeg-encoding-with-mozjpeg-2-0/">mozjpeg 2.0 project</a> for JPEG compression.</p><p>To get a sense of its performance (both in terms of compression and in terms of CPU usage) when compared to <a href="http://libjpeg-turbo.virtualgl.org/">libjpeg-turbo</a>, I randomly selected 10,000 JPEG images (totaling 2,564,135,285 bytes, for an average image size of about 256KB) cached by CloudFlare and recompressed them using the jpegtran program provided by libjpeg-turbo 1.3.1 and mozjpeg 2.0. The exact command used in both cases was:</p><p><code>jpegtran -outfile out.jpg -optimise -copy none in.jpg</code></p><p>Of the 10,000 images in cache, mozjpeg 2.0 failed to make 691 of them any smaller, compared with 3,471 for libjpeg-turbo. So mozjpeg 2.0 was significantly better at recompressing images.</p><p>On average, images were compressed by 3.0% using mozjpeg 2.0 (ignoring images that weren't compressed at all) and by 2.5% using libjpeg-turbo (again ignoring images that weren't compressed at all). This seems broadly in line with Mozilla's reported 5% improvement over libjpeg-turbo.</p><p>So, mozjpeg 2.0 achieved better compression on this set of files and compressed many more of them (93.1% vs. 65.3%).</p><p>As an example, here's an image, not from the sample set. Its original size was 1,984,669 bytes. When compressed with libjpeg-turbo it is 1,956,200 bytes (2.4% removed); when compressed with mozjpeg 2.0 it is 1,874,491 (5.6% removed). (The mozjpeg 2.0 version is 4.2% smaller than the libjpeg-turbo version.)</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3jPHn2DLb6zGPcmjvyvK9K/681edacabe75330cf88c92881195c389/photo_3.jpg" />
            
            </figure><p>The distribution of compression ratios seen using mozjpeg 2.0 is shown below.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1X8WfaJ2oVW143Cgp4v8oI/c029620bf7434141c60044bfd6a9042b/compression.png" />
            
            </figure><p>This improved compression comes at a price. The run time for the complete compression (including where compression failed to create an improvement) was 273 seconds for libjpeg-turbo and 474 seconds for mozjpeg 2.0. So mozjpeg 2.0 took about 1.7x longer, but, of course, achieved better compression on more of the files.</p><p>Because we'd like to get the highest compression possible we've assigned an engineer internally to look at optimization of mozjpeg 2.0 (specifically for the Intel processors we use) and will contribute back improvements to the project.</p><p>We're investing quite heavily in optimization projects (such as improvements to <a href="http://readwrite.com/2014/07/28/cloudflare-matthew-prince-security-protection-builders">gzip</a> (code <a href="https://github.com/cloudflare/zlib">here</a>) and <a href="http://luajit.org/sponsors.html#sponsorship_perf">LuaJIT</a>, and things like a very fast <a href="https://github.com/cloudflare/lua-aho-corasick">Aho-Corasick</a> implementation). If you're interested in low-level optimization for Intel processors, think about <a href="https://www.cloudflare.com/join-our-team">joining us</a>.</p><p>PS After this blog post was published some folks <a href="https://news.ycombinator.com/item?id=8105640">pointed</a> out that the best comparison would be when the -progressive flag is used. I went back and checked and I had in fact done that in the 10,000 file test, so the data there is correct. However, the command shown above is not. The actual command used was:</p><p><code>jpegtran -outfile out.jpg -optimise -progressive -copy none in.jpg</code></p><p>Also, the image shown above was generated using the incorrect command because I did it outside the 10,000 file test. That paragraph should say:</p><p>As an example, here's an image, not from the sample set. Its original size was 1,984,669 bytes. When compressed with libjpeg-turbo it is 1,885,090 bytes (5% removed); when compressed with mozjpeg 2.0 it is 1,874,491 (5.6% removed). (The mozjpeg 2.0 version is 0.6% smaller than the libjpeg-turbo version.)</p> ]]></content:encoded>
            <category><![CDATA[Cloudflare Polish]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <guid isPermaLink="false">6GNfdVtKxUlqHKi5NBn3J3</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
        <item>
            <title><![CDATA[Why mobile performance is difficult]]></title>
            <link>https://blog.cloudflare.com/why-mobile-performance-is-difficult/</link>
            <pubDate>Thu, 28 Jun 2012 06:48:00 GMT</pubDate>
            <description><![CDATA[ Mobile web browsing is very different, at the network level, to browsing on a desktop machine connected to the Internet. Yet both use the very same protocols, and although TCP was designed to perform well on the fixed-line Internet, it doesn't perform as well on mobile networks.  ]]></description>
<content:encoded><![CDATA[ <p>Mobile web browsing is very different, at the network level, from browsing on a desktop machine connected to the Internet. Yet both use the very same protocols, and although TCP was designed to perform well on the fixed-line Internet, it doesn't perform as well on mobile networks. This post looks at why, and at how CloudFlare is helping.</p><p>We start with a simple ping. Here's a ping from my laptop (which is connected via 802.11g WiFi to a 20Mbps broadband connection) to a machine at Google. Looks like I'm getting a round trip time of about 20ms.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4SPeLtpDF4TnhjvUGYVQjI/7d91273e0b26466302edeeba9a2a155e/Screen_Shot_2012-06-27_at_4.22.50_PM.png.scaled500.png" />
            
            </figure><p>Here's the same ping done from my iPhone on the same WiFi network at the same location in the house. The ping time has gone up to about 60ms. So, in this instance, the round trip time tripled just by going from laptop to phone.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3JLqXO7BfXiHbgvw3UugBE/066ea75ff0c3e6097c7c64fe8283136a/IMG_3783.PNG.scaled500.png" />
            
            </figure><p>But to see the real cost of mobile it's necessary to switch off WiFi and onto 3G. Here's the ping time on 3G to the same machine. Here it's both much higher (we're now into 1/10 to 1/5 of a second territory) and also more variable:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6ndn2280K7W1PByFoFpGLW/92fa0ee7f48ee38ea3e26d87bb99928a/IMG_3777.PNG.scaled500.png" />
            
            </figure><p>And then I get up and move to the front of the house and try again. The ping time has changed completely (the number of bars didn't) and I'm seeing between 0.5s and 1s of round trip time. That will have a serious effect on web browsing.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7Gi1744fWZEvvbYSMDifD0/8d44a485457558e4678b85750459e8f4/IMG_3778.PNG.scaled500.png" />
            
            </figure><p>And for a final test I return to my original location and grip the iPhone firmly in my hand. The number of bars falls away and the round trip time becomes infinite! Pings simply aren't working any more.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/dzdtT2vNyMnykyPsFSqM5/aabcce5584a061aecc511d0830c85d22/IMG_3781.PNG.scaled500.png" />
            
            </figure><p>What this illustrates is something that any smartphone user knows instinctively: network performance on a phone is highly variable and susceptible to location and environment. TCP would actually work just fine on a phone except for one small detail: phones don't stay in one location. Because they move around (while using the Internet), the parameters of the network (such as the latency) between the phone and the web server are constantly changing, and TCP wasn't designed to detect the sort of change that's happening.</p><p>In past posts I've looked at the effect of <a href="/the-bandwidth-of-a-boeing-747-and-its-impact">high latency on web browsing</a> and <a href="/the-bandwidth-of-a-boeing-747-and-its-impact">TCP's connection and slow start cost</a>. One of the fundamental parts of the TCP specification covers congestion avoidance: the <a href="http://en.wikipedia.org/wiki/Transmission_Control_Protocol#Congestion_control">detection and avoidance of congestion</a> on the Internet. At the start of a connection TCP's slow start prevents it from blasting out packets until it has detected the maximum possible speed it can transmit at, and during a connection TCP actively watches for signs of congestion. The smooth running of the Internet as a whole relies on protocols like TCP being able to detect congestion and slow down. If they couldn't, there'd likely be a <a href="http://en.wikipedia.org/wiki/Congestive_collapse#Congestive_collapse">congestion collapse</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5VrcDARG9qzMgNjzXE0bsS/b37ca6778d6044db491d102e60667c4a/7454479488_9cf64433d6.jpeg.scaled500.jpg" />
            
            </figure><p>Image credit: <a href="http://www.parliamentlive.tv/Main/Player.aspx?meetingId=10948&amp;wfs=true">joiseyshowaa</a></p><p>TCP spots congestion by watching for lost packets. On the wired Internet lost packets are a sign of congestion: they're a sign that a buffer in a router or server somewhere along the route packets are taking is full and is simply dropping packets. When lost packets are detected by TCP it slows down.</p><p>That all falls apart on mobile networks because packets get lost for other reasons: you move around your house while surfing a web page, or you're on the train, or you just block the signal some other way. When that happens it's not congestion, but TCP thinks it is, and reacts by slowing down the connection.</p><p>It seems like it might be a simple matter to change the congestion avoidance algorithm in TCP to take into account the challenges of mobile networks, but it's actually an area of <a href="http://en.wikipedia.org/wiki/TCP_congestion_avoidance_algorithm">active research</a> with many different possible replacements for the existing basic algorithm. It's hard because trying to balance maximizing throughput, preventing congestion on the Internet, dealing with actual congestion, and spotting phony congestion is complex.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2Pbjdr8LCoKszcWmPWU6uF/4da4d491fb6c86da01303f01390fa0c1/6031969871_19086b6f70.jpeg.scaled500.jpg" />
            
            </figure><p>Image credit: <a href="http://www.flickr.com/photos/mikecogh/">mikecogh</a></p><p>And if that weren't enough, mobile networks introduce another tricky problem: packet reordering. Although TCP is designed to cope with packets arriving out of order (they may have followed different routes between source and destination), large-scale reordering can occur on mobile networks when a phone is handed off from one tower to the next.</p><p>For example, a stream of packets transmitted by a moving mobile user (perhaps sending a large email) might be split, with some packets going through one tower and the rest through a different tower by a different route.</p><p>This causes problems for some of the newer congestion avoidance algorithms (such as <a href="http://en.wikipedia.org/wiki/TCP_congestion_avoidance_algorithm#TCP_New_Reno">TCP New Reno</a>) and can cause additional slowdowns.</p><p>CloudFlare helps solve these problems for our customers in two ways. First, we tune the parameters of the TCP stacks in our web servers for the best possible performance; second, we actively monitor and classify the connections from people surfing our customers' sites.</p><p>Classifying connections lets us dynamically determine the best way to behave on each one. We know whether a connection is likely to be a high-latency mobile phone browsing session or a high-bandwidth broadband connection in someone's home or office. That allows us to give the best performance to end users, and to ensure that customers' web sites are snappy wherever and however they are accessed.</p><p>And we continually look for ways to improve network performance for our customers by tuning TCP, monitoring performance, opening new data centers, and introducing features like <a href="/combining-javascript-css-a-better-way">Rocket Loader</a>, <a href="/introducing-mirage-intelligent-image-loading">Mirage</a>, <a href="/introducing-polish-automatic-image-optimizati">Polish</a>, <a href="/what-makes-spdy-speedy">SPDY</a>, and <a href="https://www.cloudflare.com/railgun">Railgun</a>.</p> ]]></content:encoded>
            <category><![CDATA[Mobile]]></category>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Mirage]]></category>
            <category><![CDATA[Cloudflare Polish]]></category>
            <category><![CDATA[spdy]]></category>
            <category><![CDATA[Rocket Loader]]></category>
            <category><![CDATA[Product News]]></category>
            <guid isPermaLink="false">1MVrMDyoyJFXjugBzfD5Zg</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
        <item>
            <title><![CDATA[Introducing Mirage: Automatic Responsive Web Design via Intelligent Image Loading]]></title>
            <link>https://blog.cloudflare.com/introducing-mirage-intelligent-image-loading/</link>
            <pubDate>Wed, 06 Jun 2012 02:08:00 GMT</pubDate>
            <description><![CDATA[ Yesterday, we announced Polish, which helps to automatically optimize the images on your site and increase performance. Today we're releasing something more ambitious: a system to automatically manage the loading of images in order to maximize your site's performance which we call Mirage. ]]></description>
            <content:encoded><![CDATA[ <p>Yesterday, we <a href="/introducing-polish-automatic-image-optimizati">announced Polish</a>, which helps to automatically optimize the images on your site to decrease their size and thereby increase performance. Today we're releasing something more ambitious: a system to automatically manage the loading of images in order to maximize your site's performance, which we call Mirage.</p>
    <div>
      <h3><b>Impact of Images</b></h3>
      <a href="#impact-of-images">
        
      </a>
    </div>
    <p>Images are more than 50% of the data that makes up a typical website. A tool like CloudFlare's Polish substantially reduces the size of images. What would be even better than reducing image sizes would be not loading image data that isn't needed for the page. That's what Mirage does automatically.</p><p>To understand this, imagine a typical blog. A blog is usually a long page with a series of stories, the most recent of which are on top. Your browser window is a certain size, which is known as the viewport. When you load a page, you can only see the images within that window. However, most browsers happily download all the images on the page before the page is ready. This not only slows down page performance, but if you never scroll down the page then it also wastes bandwidth downloading images that will never be seen.</p>
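    <p>The viewport idea is easy to model. Given the scroll position, the window height, and each image's vertical position on the page, only a few images actually intersect the viewport; everything else can wait. The sketch below is purely illustrative (the page layout and function names are invented, and the real Mirage works differently, via Javascript in the browser):</p>

```python
def visible_images(images, scroll_top, viewport_height):
    """Return the images that intersect the current viewport.
    `images` maps a name to its (top, height) position on the page,
    in pixels from the top of the document."""
    viewport_bottom = scroll_top + viewport_height
    return [name for name, (top, height) in images.items()
            if top < viewport_bottom and top + height > scroll_top]

# A long blog page: with an 800px-tall window scrolled to the top,
# only the first two images need to be fetched immediately.
page = {"hero": (0, 400), "story2": (500, 300), "footer_logo": (2400, 100)}
print(visible_images(page, scroll_top=0, viewport_height=800))
```

<p>Everything the function leaves out is exactly the image data a browser downloads today without ever showing it.</p>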
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6yy1q1BGBLlwCJgSrVFS3A/7eb7717d24980b9b943217124ee6c18f/inviewport_vs_outofviewport.png.scaled500.png" />
            
            </figure>
    <div>
      <h4><b>Responsive Design: Lazy Loading and Auto Resizing</b></h4>
      <a href="#responsive-design-lazy-loading-and-auto-resizing">
        
      </a>
    </div>
    <p>CloudFlare Mirage is aware of the images on your page, the size of a visitor's viewport, and the type of device and network connection the visitor has. It then automatically optimizes image loading to deliver only the images that are necessary. Mirage prioritizes loading the images that are in the viewport, then loads the rest as they are needed or as spare network capacity is available. You can see Mirage in action, loading images as a visitor scrolls down the page, in this video.</p><p>In addition to lazy loading, Mirage can also automatically resize images to the size and resolution best suited to the page and device. One of the biggest wastes on the web is uploading a large image and then resizing it down to a thumbnail or smaller size within HTML. Mirage detects cases like this and instructs the CloudFlare CDN to do the image resizing at the server. This means that if you have a 1000px x 1000px image that you're only displaying at 100px x 100px, we deliver it to your visitors at the appropriate size, without wasting unnecessary bandwidth.</p>
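    <p>A back-of-the-envelope sketch of why server-side resizing matters: the pixel count, and very roughly the byte count, scales with the square of the displayed dimensions. The bytes-per-pixel figure below is a ballpark for compressed images chosen for illustration, not a measurement of Mirage:</p>

```python
def resized_savings(orig_w, orig_h, disp_w, disp_h, bytes_per_pixel=0.25):
    """Rough estimate of the bytes wasted by shipping a full-size image
    that the page only displays at a smaller size. Assumes compressed
    size is proportional to pixel count, which is only approximately true."""
    original = orig_w * orig_h * bytes_per_pixel
    resized = disp_w * disp_h * bytes_per_pixel
    return original - resized

# The 1000x1000 image displayed at 100x100 from the example above:
# 99% of the pixels, and roughly that share of the bytes, are never seen.
wasted_bytes = resized_savings(1000, 1000, 100, 100)
```

<p>The exact ratio depends on the image and codec, but the hundredfold gap between uploaded and displayed pixels is the point.</p>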
    <div>
      <h3><b>Mobile, Mobile, Mobile, Mobile, Mobile</b></h3>
      <a href="#mobile-mobile-mobile-mobile-mobile">
        
      </a>
    </div>
    <p>Mirage significantly improves mobile performance in multiple ways. It automatically detects whether a visitor is connecting over a mobile operator's network (where bandwidth is limited) or over a wifi connection (where bandwidth is less of a concern) and adjusts its image loading algorithm accordingly. So, for example, it will only load images as a visitor scrolls them into the viewport if you're on a 3G connection, whereas it will lazy load all the images in the background, prioritized by where they appear on the page, if you're on wifi.</p><p>Mirage's dynamic sizing also takes into account the size of the viewport on a mobile device. A background image that is 2,000 pixels wide is wasted on an iPhone screen that is only 960 by 640 pixels. Mirage can automatically downsize the image so that you never deliver more pixels than the device can display. Delivering less data to a mobile device not only improves a site's performance, but also helps visitors with limited data plans avoid saturating their caps.</p>
    <div>
      <h3><b>Automatic Responsive Web Design</b></h3>
      <a href="#automatic-responsive-web-design">
        
      </a>
    </div>
    <p>Responsive web design is the principle that your site's layout should respond to the environment. Usually this requires significant code changes to your underlying site, which many existing sites have difficulty retrofitting into their current stack. CloudFlare's Mirage automatically enables some of the biggest wins from responsive web design without you having to change a single line of HTML.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/nv1wsBGzFDEubxFsFTKsU/0ae7ac4944cc44f6fc113d4a0120a9e0/mirage_ui.png.scaled500.png" />
            
            </figure><p>You can pick the particular Mirage features you want to enable and they will be applied across your site. Mirage works via Javascript, but has a safe fallback for visitors with Javascript disabled. Since Mirage adds additional resource load to our network in terms of additional storage and CPU usage, we're limiting the feature to <a href="https://www.cloudflare.com/plans">paid customers</a>. You can turn on Mirage from the Performance tab of your CloudFlare Settings. Together, Polish and Mirage make a one-two punch that ensures your site is handling images in the best possible way for each visitor to your site.</p> ]]></content:encoded>
            <category><![CDATA[Mirage]]></category>
            <category><![CDATA[Mobile]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Cloudflare Polish]]></category>
            <category><![CDATA[Product News]]></category>
            <guid isPermaLink="false">2BDLorQZlKej8uCpeJkEng</guid>
            <dc:creator>Matthew Prince</dc:creator>
        </item>
        <item>
            <title><![CDATA[Introducing Polish: Automatic Image Optimization]]></title>
            <link>https://blog.cloudflare.com/introducing-polish-automatic-image-optimizati/</link>
            <pubDate>Mon, 04 Jun 2012 23:08:00 GMT</pubDate>
            <description><![CDATA[ Today, the average web page has more than 85 objects (images, Javascript, CSS, etc.) that make up more than 750 KB of data. All that data needs to be downloaded when a page loads. On the average web page, more than 50% of the data is made up of images.  ]]></description>
            <content:encoded><![CDATA[ <p>Today, the average web page has more than 85 objects (images, Javascript, CSS, etc.) that make up more than 750 KB of data. All that data needs to be downloaded when a page loads. On the average web page, more than 50% of the data is made up of images. And images are only getting larger as screen resolutions on devices improve.</p><p>A big part of making the web fast is reducing the amount of data that makes up each page. If you can cut a page's size in half, it's effectively as good as making the web connection twice as fast. This is especially important for mobile devices, which have limited bandwidth. At CloudFlare, we've just rolled out two new features that help improve image performance: Polish &amp; Mirage. This post is about Polish, the simpler of the two features. Tomorrow we'll send out another update about Mirage.</p>
    <div>
      <h3>Introducing Polish: One-Click Simple Image Optimization</h3>
      <a href="#introducing-polish-one-click-simple-image-optimization">
        
      </a>
    </div>
    <p>Polish automatically optimizes the images on your site. When an image is fetched from your origin, our systems automatically optimize it in our cache. Subsequent requests for the same image get the smaller, faster, optimized version. We've been quietly testing Polish for the last few months and the results are impressive. On average, the data transfer savings, and therefore performance gains, from Polish far exceed those of more common techniques like code minification.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2pp1pYo5chOYY7YDMslrKC/7d8af736bfc8b72eef074d9b39557b95/polish_settings.png.scaled500.png" />
            
            </figure>
    <div>
      <h3>How Much Do You Want To Save?</h3>
      <a href="#how-much-do-you-want-to-save">
        
      </a>
    </div>
    <p>You can enable one of two modes: <b>Lossless</b> or <b>Lossy</b>. Lossless mode removes all the unnecessary bloat from an image file, such as header and metadata bytes, without removing any image data. This means images will appear exactly the same as they did before. In our tests, Lossless mode produces an average file size reduction of 21% across actual CloudFlare customer images.</p><p>Lossy mode removes the same unnecessary bloat, and also applies a compression algorithm to suitable images. We chose the compression algorithm to minimize any perceptible visual difference first, and only then to get the most data savings. Using Lossy mode, we've seen an average file size reduction of 48% across actual CloudFlare customer images.</p><p>I'm one of those people who worries about losing image quality, so the engineering team gave me a challenge: they'd show me two images, one that went through the Polish Lossy algorithm and one that did not. While I'm sure there are some people in the world who can spot the difference, for the vast majority of images I have to confess that I couldn't. For most sites, Lossy mode will be a big win, but you can experiment with both and see what works best for your site.</p>
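    <p>Plugging this post's own averages together gives a feel for the payoff: a 750 KB average page, half of it images, with 21% (Lossless) or 48% (Lossy) savings on the image share. The function below is just arithmetic on those quoted figures; any individual page will differ:</p>

```python
def page_after_polish(page_kb, image_fraction, savings):
    """Estimated page weight after Polish: only the image share
    of the page shrinks, everything else is untouched."""
    image_kb = page_kb * image_fraction
    return page_kb - image_kb * savings

lossless = page_after_polish(750, 0.5, 0.21)  # about 671 KB
lossy = page_after_polish(750, 0.5, 0.48)     # about 570 KB
```

<p>So on these averages, Lossy mode takes roughly a quarter off the total page weight before a single line of the site changes.</p>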
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3Yejgc0tKLdzBpgVLt5qrr/7e085358a303e6f7df9365a9ff1a34cc/kangaroos.png.scaled500.png" />
            
            </figure>
    <div>
      <h3>Image Optimization for the Win</h3>
      <a href="#image-optimization-for-the-win">
        
      </a>
    </div>
    <p>Reducing image sizes has a particularly large impact on wireless devices with limited bandwidth. If a large number of your visitors access your site on a mobile device, Polish will be a big performance win.</p><p>Because compressing images adds CPU overhead, Polish is only available for paid accounts. If you already have a paid account, you can find it in your CloudFlare Settings under the Performance tab. (For those beta testers who helped us build the feature over the last few months, you'll continue to enjoy Polish even if you don't have a paid account — thanks for all your help!)</p><p>Watch this space tomorrow for another new feature called Mirage, which helps with image optimization in a revolutionary new way. Stay tuned!</p>
            <category><![CDATA[Mobile]]></category>
            <category><![CDATA[Cloudflare Polish]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Speed]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <guid isPermaLink="false">1ldS15FqvXCN7ULSrx1AXK</guid>
            <dc:creator>Matthew Prince</dc:creator>
        </item>
    </channel>
</rss>