(Photo credit: Hy Chalmé/Instagram)
Now that Hurricane Sandy has passed and the flood waters have begun to recede, we wanted to recap what we saw over the last 24 hours across the CloudFlare network.
Our network is designed to survive hurricanes and other natural disasters, so we were confident that, even if data centers in the hurricane's path failed, traffic would immediately be transferred to the next closest facility. That said, our preference is always that all our data centers remain online and able to continue to serve traffic.
Yesterday morning our ops team met to plan for the potential loss of our facilities in Newark, NJ, which we refer to by the airport code EWR, and potentially Ashburn, VA, which we refer to by the airport code IAD. Our equipment is located in an Equinix facility in both locations and we confirmed that they had taken steps to ensure their systems were tested and as hurricane-ready as they could be.
Data centers are set up so that, if power from the grid is disrupted, they switch to stored backup power until generators can kick in. In EWR, power is stored in what are, effectively, a series of car batteries. Enough power is stored in the batteries that the data center can continue to run without a new source of power for several minutes. The diesel generators are set up to kick in within that window, usually less than a minute after a power failure is detected. The generators are intended to be able to power the facilities indefinitely so long as there is sufficient fuel. Most of the data centers from which we operate worldwide are considered "critical infrastructure" and, during an emergency, they are second in line, behind only hospitals, for delivery of diesel fuel.
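The failover sequence above (grid, then batteries, then generators) can be sketched as a simple timeline model. This is purely illustrative; the function name and the runtime and start-up figures are assumptions consistent with the "several minutes" and "less than a minute" described above, not Equinix's actual parameters.

```python
def power_source(t_since_grid_loss, battery_runtime=300, generator_start=45):
    """Return which source carries the load t seconds after grid power is lost.

    battery_runtime: seconds the batteries can carry the load (several minutes).
    generator_start: seconds until the diesel generators pick up (under a minute).
    These defaults are illustrative assumptions.
    """
    if t_since_grid_loss < 0:
        return "grid"                # grid power still up
    if t_since_grid_loss < generator_start:
        # Batteries only need to bridge the generator start-up window.
        return "battery" if t_since_grid_loss < battery_runtime else "outage"
    return "generator"               # generators carry the load indefinitely

# The key design property: battery runtime comfortably exceeds generator
# start-up time, so the load is never dropped during the transfer.
assert power_source(-1) == "grid"
assert power_source(10) == "battery"
assert power_source(120) == "generator"
```

Because the batteries hold for several minutes while the generators start in under one, the equipment never sees an interruption, which matches what we observed at 01:31 EDT.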
The generators at our EWR facility would get a test as the storm passed overhead. At 01:31 EDT, several hours after the storm had made landfall, we received notice EWR had lost grid power. As designed, power was immediately transferred to the batteries and then to the generators. The incident description read: "Equinix IBX reported a utility power disturbance and transferred customer loads to generator power. No customers have been impacted and Site Staff reports that sufficient fuel supplies are available. Next update will be when a significant change to the situation occurs." Our systems continued to run and we did not detect any power surge or interruption.
At 08:32 EDT, we received notice that one of the EWR generators had failed: "Equinix IBX reports customer loads are on generator power, however they have a loss of redundancy do to the failure of generator 4. Engineers are investigating the issue. Next update will be when a significant change to the situation occurs." Data centers are designed for redundancy, so losing a single generator would not cause a power loss. Our systems continued to function as normal and the remaining generators powered our equipment throughout the day. At 19:09 EDT, eleven and a half hours after the generators first kicked in, we received notice that grid power had been restored: "Equinix IBX AMFO reports that utility power has been restored."
Elsewhere on the Internet
While we were fortunate that all of CloudFlare's facilities stayed online, other data centers and networks experienced issues. Around 02:10 EDT, our network ops team noticed a change in routing from traffic that usually transited via Level(3)'s Yellow/Atlantic Crossing-2 (AC-2) undersea cable. The cable runs from Bude, United Kingdom to Bellport, New York. While routing changed, it did not impact our customers and our network routed around the problem. We later confirmed with other network operators that AC-2 had experienced a failure.
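The "routed around the problem" behavior is ordinary BGP-style best-path selection: when the preferred path over AC-2 was withdrawn, the next-best remaining path took over automatically. The sketch below is a toy illustration of that selection logic, not CloudFlare's actual routing configuration; the path names and preference values are hypothetical.

```python
def best_path(paths):
    """Pick the available path with the lowest preference value (most preferred)."""
    available = [p for p in paths if p["up"]]
    return min(available, key=lambda p: p["pref"])["via"] if available else None

paths = [
    {"via": "Level3 AC-2 (transatlantic)", "pref": 10, "up": True},
    {"via": "alternate transit",           "pref": 20, "up": True},
]

# Under normal conditions, traffic prefers the AC-2 path.
assert best_path(paths) == "Level3 AC-2 (transatlantic)"

# Simulate the cable failure observed around 02:10 EDT: the preferred
# path goes down and traffic shifts to the next-best path automatically.
paths[0]["up"] = False
assert best_path(paths) == "alternate transit"
```

This is why the routing change was visible to our network operations team but invisible to customers: selection of the backup path happens without any manual intervention.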
Several regional data centers experienced outages which caused interruption to their customers' sites. In some cases, our customers had their origin data centers knocked offline. When this happens, CloudFlare's Always Online functionality kicks in and continues to serve a static version of the site until the origin is restored. The graph below illustrates the deviation from normal of websites that have triggered Always Online. At the height of the storm, from roughly 22:30 EDT until 00:30 EDT, the number of sites on our network whose origin servers were offline, and for which we were serving static copies, was 2.5 standard deviations above normal.
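The core idea behind Always Online can be sketched in a few lines: serve and snapshot the live page while the origin is reachable, and fall back to the last static snapshot when it is not. This is a minimal illustration of the concept only; the function names and cache structure are hypothetical, not CloudFlare's implementation.

```python
static_cache = {}

def fetch_from_origin(url, origin_up):
    """Stand-in for a request to the customer's origin server."""
    if not origin_up:
        raise ConnectionError("origin unreachable")
    return f"<html>live copy of {url}</html>"

def serve(url, origin_up=True):
    """Serve the live page if possible, otherwise the cached static copy."""
    try:
        page = fetch_from_origin(url, origin_up)
        static_cache[url] = page                      # refresh the snapshot
        return page, "live"
    except ConnectionError:
        if url in static_cache:
            return static_cache[url], "always-online"  # serve the stale copy
        return "error page", "unavailable"

# Normal operation populates the cache; an origin outage (as many customers
# experienced during the storm) falls back to the static copy.
assert serve("example.com/")[1] == "live"
assert serve("example.com/", origin_up=False)[1] == "always-online"
```

The trade-off is that visitors see a slightly stale static page rather than an error while the origin is down, which is exactly the behavior that spiked 2.5 standard deviations above normal at the height of the storm.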
One thing that was somewhat surprising is that traffic to our EWR and IAD data centers dropped less than 1% versus a regular Monday night. We had speculated that, with power outages affecting a large number of homes and businesses throughout the Northeastern United States, traffic to those data centers would be more affected. Our best guess is that while fewer people may have been online, those who still had connectivity were glued to their computers and surfing more than usual.
The thoughts of everyone at CloudFlare are with the people of the Northeastern United States as they begin the process of cleaning up from this extremely destructive storm. Thanks to the police, firefighters, rescue workers, and the teams on the ground in the region that kept the lights on and allowed us to continue to operate from the region uninterrupted.