Google Cloud Platform Status
Jun 27, 2018, 12:22:09 PM
to bigquery-dow...@googlegroups.com
ISSUE SUMMARY
On Friday 22 June 2018, Google BigQuery experienced increased query
failures for a duration of 1 hour 6 minutes. We apologize for the impact of
this issue on our customers and are making changes to mitigate and prevent
a recurrence.
DETAILED DESCRIPTION OF IMPACT
On Friday 22 June 2018 from 12:06 to 13:12 PDT, up to 50% of total requests
to the BigQuery API failed with error code 503. Error rates varied during
the incident, with some customers experiencing 100% failure rate for their
BigQuery table jobs. bigquery.tabledata.insertAll jobs were unaffected.
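
A minimal sketch of how a client might tolerate transient 503 responses like
those described above, by retrying with exponential backoff and jitter. The
`send_request` callable, the function name, and all retry parameters are
hypothetical and chosen for illustration; they are not part of the BigQuery
API or any client library.

```python
import random
import time


def call_with_backoff(send_request, max_attempts=5, base_delay=1.0,
                      max_delay=32.0, sleep=time.sleep):
    """Call send_request() and retry on HTTP 503 with exponential backoff.

    send_request is a hypothetical zero-argument callable that returns a
    (status_code, body) tuple. It stands in for whatever HTTP call the
    client actually makes.
    """
    for attempt in range(max_attempts):
        status, body = send_request()
        if status != 503:
            # Success or a non-retryable error: hand it back to the caller.
            return status, body
        if attempt == max_attempts - 1:
            # Out of attempts: give up and return the last 503.
            break
        # The base delay doubles on each attempt and is capped at max_delay.
        delay = min(max_delay, base_delay * (2 ** attempt))
        # Random jitter keeps many clients from retrying in lockstep.
        sleep(delay + random.uniform(0, delay / 2))
    return status, body
```

Backing off, rather than retrying immediately, matters here because immediate
retries add load to a service that is already rejecting requests.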
ROOT CAUSE
A new release of the BigQuery API introduced a software defect that caused
the API component to return larger-than-normal responses to the BigQuery
router server. The router server is responsible for examining each request,
routing it to a backend server, and returning the response to the client.
To process these large responses, the router server allocated more memory,
which led to an increase in garbage collection. This resulted in an
increase in CPU utilization, which caused our automated load balancing
system to shrink the server capacity as a safeguard against abuse. With the
reduced capacity and now comparatively large throughput of requests, the
denial of service protection system used by BigQuery responded by rejecting
user requests, causing a high rate of 503 errors.
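
The last step of this chain, constant demand meeting a smaller server pool,
can be illustrated with a toy calculation. This is only a sketch under
simplifying assumptions: every number below is hypothetical, and the model
(each server handles a fixed request rate and any excess is rejected) is not
a description of how the BigQuery router or its protection system actually
works.

```python
def rejected_fraction(request_rate, servers, per_server_capacity):
    """Fraction of requests rejected when demand exceeds serving capacity."""
    capacity = servers * per_server_capacity
    if request_rate <= capacity:
        return 0.0
    return (request_rate - capacity) / request_rate


# Hypothetical numbers: demand stays constant while the pool is shrunk.
demand = 1000  # requests per second
for servers in (10, 8, 6):
    frac = rejected_fraction(demand, servers, per_server_capacity=100)
    print(f"{servers} servers -> {frac:.0%} of requests rejected")
```

In this simplified model, rejections begin as soon as the pool drops below the
size needed for current demand, even though demand itself never changed.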
REMEDIATION AND PREVENTION
Google engineers initially mitigated the issue by increasing the capacity
of the BigQuery router server, which prevented overload and allowed API
traffic to resume normally. The issue was fully resolved by identifying and
reverting the change that caused large response sizes.
To prevent future occurrences, BigQuery engineers will adjust capacity
alerts to improve monitoring of server overutilization.
We apologize once again for the impact of this incident on your business.