Were You Affected by Google Cloud's Outage? An Analysis, and How Ongoing Monitoring Could Have Helped
This is a republish of my original blog post from 2019–06–03 at https://www.netrounds.com/newsroom/blog/articles/2019/06/03/google-cloud-incident-19009-analysis/ .
Yesterday, on the 2nd of June 2019, Google Cloud experienced a major issue with its services, tracked by Google as “Incident #19009” (https://status.cloud.google.com/incident/cloud-networking/19009). As far as we have seen, Google has not yet provided a detailed root cause analysis.
Like many other modern-day companies, we use cloud services for development, research and production. Since we specialize in testing and monitoring of services, we have been able to capture data on how yesterday's issue affected Google's cloud customers.
For our own research we operate a number of Netrounds Test Agents in selected locations of the major public clouds (AWS, GCP and Azure). These measure network availability and performance at rates of up to 1,000 times per second, giving us millisecond resolution on real network and service performance and availability events.
For the purposes of this blog post we are initially going to focus on analyzing the results from three Test Agents located in the Google Cloud regions us-central1, us-west1 and us-east1 (as shown in the graph below). All dates and times are in the US/Pacific timezone. As a baseline we continuously run a full-mesh measurement between the three regions; below is how it normally looks in our GUI (the data in the image is from the 1st of June 2019):

The green bars show that packet loss, delay and delay variation/jitter are all normal for this 24-hour period: every measurement is below its KPI threshold and no issues or abnormalities are observed. The one-way latency values are stable at ~18 ms from us-east to us-central and ~33 ms from us-east to us-west.
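To illustrate the kind of per-interval evaluation behind such a view, here is a minimal sketch in Python. It is not the Netrounds implementation; the data structure, field names and threshold values are invented for this example.

```python
from dataclasses import dataclass

# Hypothetical per-interval measurement between two Test Agents (illustrative only).
@dataclass
class Interval:
    sent: int               # probe packets sent during the interval
    received: int           # probe packets received
    avg_delay_ms: float     # average one-way delay
    jitter_ms: float        # delay variation

# Example KPI thresholds -- invented for this sketch, not Netrounds defaults.
LOSS_PCT_MAX = 0.5
DELAY_MS_MAX = 50.0
JITTER_MS_MAX = 5.0

def is_green(iv: Interval) -> bool:
    """Return True when every KPI for the interval is within its threshold."""
    loss_pct = 100.0 * (iv.sent - iv.received) / iv.sent
    return (loss_pct <= LOSS_PCT_MAX
            and iv.avg_delay_ms <= DELAY_MS_MAX
            and iv.jitter_ms <= JITTER_MS_MAX)

# A normal us-east1 -> us-central1 interval (~18 ms one-way, no loss) passes:
print(is_green(Interval(sent=1000, received=1000, avg_delay_ms=18.2, jitter_ms=0.4)))  # True
```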
However, looking at the same view for the 2nd of June 2019, when the outage occurred, we instead see this:

The red/orange markings indicate issues. The Netrounds Test Agents detected a KPI degradation and an SLA violation for traffic going from us-east1 to us-west1 and us-central1. The first dropped packets were observed at 11:49 and the last at 14:43.
By zooming in on the traffic between 11:00 and 16:00 during the outage, we can see more clearly where and when the issues occurred.

Immediately we see that traffic from us-east to us-west/central is mainly affected. Traffic going in the opposite direction does not suffer from dropped packets, so the issue is asymmetric, which is typical of congestion-related problems. We also see that the issue was slightly more severe in the us-east to us-west direction than in the us-east to us-central direction.
Drilling down into the details of the us-east->us-west stream, we see this graph:

Here we see that the average latency started to increase at 11:48 and that packet loss appeared at the same time. At 12:07 the latency was back to normal, but packet loss was still present; this may have been due to mitigation actions taken by Google to restore the service.
Doing the same drill-down into the us-east->us-central stream, we see this graph:

Here we see no slow build-up of latency; instead we only see intermittent packet loss, though not as frequent as in the us-east->us-west direction. In this direction we also see a jump in latency from 18 ms to 21 ms at 12:42, which probably comes from re-routing in the Google network to mitigate the congestion. This slightly longer route was in use until 15:43, after which the latency went back to the usual 18 ms.
Did coast-to-coast traffic leave the US during the incident?
So far this looks like a normal congestion issue that affected traffic in only one direction, but we made a very interesting observation in the stream going from us-west to us-east; the latency/packet loss graph for this direction is shown here:

What is remarkable here is the latency jump between 13:22 and 13:53. The latency jumped from its usual 34 ms to a consistent 191 ms, without any packet loss occurring.
191 ms multiplied by two-thirds of the speed of light (the approximate propagation speed in optical fiber) gives a distance of roughly 38,000 km. The Earth's circumference is about 40,000 km, which suggests that traffic from the US West Coast to the East Coast may have traversed the globe in the opposite direction during these 31 minutes.
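For reference, the back-of-the-envelope calculation looks like this (assuming the full 191 ms is one-way propagation delay in fiber):

```python
SPEED_OF_LIGHT_KM_S = 299_792    # speed of light in vacuum, km/s
FIBER_FACTOR = 2 / 3             # approximate propagation speed in optical fiber
one_way_delay_s = 0.191          # observed one-way latency during the event

distance_km = one_way_delay_s * SPEED_OF_LIGHT_KM_S * FIBER_FACTOR
print(f"{distance_km:,.0f} km")  # ~38,173 km, close to the Earth's ~40,000 km circumference
```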
My understanding is that traffic between GCP regions is normally encrypted and protected in several layers (https://cloud.google.com/security/encryption-in-transit/), meaning that security was likely not compromised, but it is still a remarkable and interesting event.
Unfortunately, GCP does not provide the option to do traceroutes internally within a VPC; otherwise the Netrounds Path Trace feature (https://www.netrounds.com/download/network-path-tracing-with-netrounds/) would have been able to show the exact route taken during this event.
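For readers unfamiliar with path tracing in general, here is a minimal sketch of the classic TTL-based technique that tools like traceroute rely on. This is not the Netrounds Path Trace implementation, and, as noted above, it would not reveal intermediate hops inside a GCP VPC. It assumes scapy is installed and root privileges are available.

```python
from scapy.all import IP, UDP, ICMP, sr1  # pip install scapy; run as root

def trace(dst: str, max_hops: int = 30, timeout: float = 2.0) -> None:
    """Send UDP probes with increasing TTL and print which router answers at each hop."""
    for ttl in range(1, max_hops + 1):
        probe = IP(dst=dst, ttl=ttl) / UDP(dport=33434)
        reply = sr1(probe, timeout=timeout, verbose=0)
        if reply is None:
            print(f"{ttl:2d}  *")                                   # no reply within timeout
        elif reply.haslayer(ICMP) and reply[ICMP].type == 11:
            print(f"{ttl:2d}  {reply.src}")                         # ICMP Time Exceeded: intermediate hop
        else:
            print(f"{ttl:2d}  {reply.src}  (destination reached)")  # e.g. ICMP Port Unreachable
            break

trace("8.8.8.8")
```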
Simultaneous loss of us-east4 and us-west2
Additionally, we operate Netrounds Test Agent instances in us-east4 and us-west2. These were completely unreachable during the incident and reported 100% packet loss on all streams between 11:47 and 15:23.
us-east1 <-> us-east4 loss graph:

us-west1 <-> us-west2 loss graph:

It is surprising that we observed a complete loss of traffic between us-west1 and us-west2, since the Google incident ticket describes the issue as congestion on the east coast. The loss of traffic was not preceded by a build-up of latency or loss; it came suddenly and without prior warning of congestion.
Conclusion
Being able to immediately discover, analyze and mitigate outages is critical for running a professional 24/7 service. By proactively measuring the network, the Netrounds Test Agents discovered and alerted on this cloud outage just a few seconds after it happened. This gives an unprecedentedly clear view of where an issue originates and how services are impacted; it quickly ends any blame games, so the focus can be spent on troubleshooting and mitigating the incident to get services back online again.