We are investigating an issue with Google BigQuery.

Google Cloud Platform Status

Mar 8, 2019, 4:24:08 AM
to bigquery-dow...@googlegroups.com
We are investigating an issue with Google BigQuery.

Google Cloud Platform Status

Mar 8, 2019, 4:24:11 AM
to bigquery-dow...@googlegroups.com
We are investigating an issue with Google BigQuery. We will provide more
information by Friday, 2019-03-08 02:30 US/Pacific.

Google Cloud Platform Status

Mar 8, 2019, 5:30:31 AM
to bigquery-dow...@googlegroups.com
Mitigation work is currently underway by our Engineering Team. We will
provide another status update by Friday, 2019-03-08 03:30 US/Pacific with
current details.

Google Cloud Platform Status

Mar 8, 2019, 5:51:31 AM
to bigquery-dow...@googlegroups.com
The issue with Google BigQuery API returning 503 errors has been resolved
for all affected projects as of 01:30 US/Pacific. We will conduct an
internal investigation of this issue and make appropriate improvements to
our systems to help prevent or minimize future recurrence.

Google Cloud Platform Status

Mar 18, 2019, 2:25:15 PM
to bigquery-dow...@googlegroups.com
ISSUE SUMMARY

On Friday 8 March 2019, Google BigQuery’s jobs.insert API in the US region
experienced an average elevated error rate of 51.21% for a duration of 45
minutes. BigQuery’s Streaming API was unaffected during this period. We
understand how important BigQuery’s availability is to our customers’
business analytics and we sincerely apologize for the impact caused by this
incident. We are taking immediate steps detailed below to prevent this
situation from happening again.

DETAILED DESCRIPTION OF IMPACT

On Friday 8 March 2019 from 00:45 - 01:30 US/Pacific, BigQuery’s
jobs.insert [1] API (responsible for import/export, query, and copy jobs)
in the US region experienced an average error rate of 51.21%. Affected
customers received error responses such as “Error encountered during
Execution, retrying may solve the problem” and “Read timed out” when
sending requests to BigQuery. BigQuery’s Streaming API was not impacted by
this incident.

The following is a breakdown of the errors experienced during the incident:

- 64.01% of jobs.insert API requests to BigQuery (US) received HTTP 503
errors

- The jobs.insert API experienced an average error rate of 51.21% and a
peak error rate of 75.96% at 01:21 US/Pacific

- 17.93% of BigQuery projects in the region were impacted
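
For clients that see transient errors like the HTTP 503 responses listed above, a common handling pattern is to retry with exponential backoff and jitter. The following is a minimal sketch only, not an official client implementation: `send` is a hypothetical zero-argument callable that issues the request and returns a `(status, body)` pair, and the attempt count and delay values are illustrative assumptions.

```python
import random
import time


class TransientError(Exception):
    """Raised when every attempt returned a retryable status."""


def call_with_backoff(send, max_attempts=5, base_delay=1.0, max_delay=32.0,
                      sleep=time.sleep, rng=random.random):
    """Call send() until it returns a non-503 status or attempts run out.

    `send` is a hypothetical callable returning (status, body); it is an
    assumption of this sketch, not part of any BigQuery client library.
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status != 503:
            return status, body
        if attempt == max_attempts - 1:
            break  # no point sleeping after the final attempt
        # Exponential backoff with full jitter: wait a random time in
        # [0, min(max_delay, base_delay * 2**attempt)).
        cap = min(max_delay, base_delay * (2 ** attempt))
        sleep(rng() * cap)
    raise TransientError(f"still failing after {max_attempts} attempts")
```

The `sleep` and `rng` parameters exist only so the waiting behaviour can be replaced in tests. Retrying is only safe when repeating the request cannot create duplicate work, so whether a given request can be retried this way depends on how it is issued.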


ROOT CAUSE

A recent change to BigQuery’s shuffle scheduling service [2] introduced the
potential for the service to enter a state where it was unable to process
shuffle jobs. A new canary release was deployed to fix the potential issue.
However, this release contained an unrelated issue which placed an overly
restrictive rate limit on the shuffle service, preventing it from operating
nominally. This strict rate limit created a large job backlog for the
BigQuery Job Server, which resulted in BigQuery returning errors such as
“Error encountered during Execution, retrying may solve the problem” and
“Read timed out” to users.
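
To illustrate why an overly strict rate limit leads to a growing backlog, here is a toy simulation. It is a sketch under stated assumptions, not a model of BigQuery's actual scheduler: the function name, the tick-based timing, and the arrival and limit numbers are all hypothetical.

```python
def simulate_backlog(arrivals_per_tick, rate_limit, ticks):
    """Return the queue length after each tick.

    Each tick, `arrivals_per_tick` jobs arrive and at most `rate_limit`
    jobs are processed. All values are illustrative assumptions.
    """
    backlog = 0
    history = []
    for _ in range(ticks):
        backlog += arrivals_per_tick
        processed = min(backlog, rate_limit)
        backlog -= processed
        history.append(backlog)
    return history


# Limit above the arrival rate: the queue stays empty.
healthy = simulate_backlog(arrivals_per_tick=100, rate_limit=150, ticks=5)

# Limit below the arrival rate: the queue grows by the difference every tick.
throttled = simulate_backlog(arrivals_per_tick=100, rate_limit=40, ticks=5)
```

Whenever the limit is below the arrival rate, the backlog grows linearly without bound, so requests waiting in the queue eventually exceed their deadlines and surface as timeouts or retryable errors.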


REMEDIATION AND PREVENTION

Google Engineers were automatically alerted at 00:47 and immediately began
their investigation. The root cause was discovered at 01:23, and our
engineers worked quickly to mitigate the issue by redirecting traffic away
from the impacted datacenter at 01:27. The incident was fully resolved by
01:30.
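
The timeline above can be turned into elapsed minutes from the start of impact (00:45 US/Pacific) with a short calculation. The timestamps come from this report; the event labels and variable names are illustrative.

```python
from datetime import datetime

FMT = "%H:%M"
impact_start = datetime.strptime("00:45", FMT)

# Times as stated in this report (US/Pacific, 8 March 2019).
events = {
    "alerted": "00:47",
    "root cause found": "01:23",
    "traffic redirected": "01:27",
    "resolved": "01:30",
}

# Minutes elapsed since the start of impact for each event.
minutes_from_start = {
    name: int((datetime.strptime(t, FMT) - impact_start).total_seconds() // 60)
    for name, t in events.items()
}

for name, minutes in minutes_from_start.items():
    print(f"{name}: +{minutes} min")
```

The final value, 45 minutes, matches the incident duration given in the issue summary.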

We are taking immediate action to prevent recurrence. First, we have
implemented a fix to prevent the shuffle service from potentially entering
a state where it is unable to process jobs. Second, we are allocating
additional capacity to BigQuery’s US region to reduce the impact of traffic
redirections on adjacent datacenters running the service. Additionally, we
are increasing the precision of our monitoring to enable swifter and more
accurate diagnosis of BigQuery issues going forward.


[1] https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/insert
[2] https://cloud.google.com/blog/products/gcp/in-memory-query-execution-in-google-bigquery