Google Cloud Platform Status
ISSUE SUMMARY
On Wednesday 28 June 2017, streaming data into Google BigQuery experienced
elevated error rates for a total of 57 minutes across two windows. We
apologize to all users whose data ingestion pipelines were affected by this
issue. We understand the importance of reliability for a process as crucial
as data ingestion and are taking concrete steps to prevent a recurrence.
DETAILED DESCRIPTION OF IMPACT
On Wednesday 28 June 2017 from 18:00 to 18:20 and from 18:40 to 19:17
US/Pacific time, BigQuery's streaming insert service returned an elevated
rate of errors to clients in all projects. The proportion of failures varied
over time, peaking at 43% of streaming requests returning HTTP response
code 500 or 503. Data streamed from clients without retry logic that
encountered these errors was not saved to the target tables during this
period.
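Clients that retried on HTTP 500 and 503 responses would have ridden out
both error windows without data loss. Below is a minimal retry sketch,
assuming the google-cloud-bigquery Python client; the table ID and row
payload are illustrative placeholders, not values from this incident.

    import time

    from google.cloud import bigquery
    from google.api_core import exceptions

    client = bigquery.Client()
    TABLE_ID = "my-project.my_dataset.my_table"  # hypothetical table

    def insert_with_retry(rows, max_attempts=5):
        """Retry streaming inserts on transient 500/503 responses with
        exponential backoff so brief service errors do not drop data."""
        for attempt in range(max_attempts):
            try:
                # insert_rows_json returns per-row errors (empty on success).
                errors = client.insert_rows_json(TABLE_ID, rows)
                if not errors:
                    return
                raise RuntimeError("row-level insert errors: %s" % errors)
            except (exceptions.InternalServerError,    # HTTP 500
                    exceptions.ServiceUnavailable):    # HTTP 503
                if attempt == max_attempts - 1:
                    raise
                time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, ...

    insert_with_retry([{"event": "page_view", "ts": "2017-06-28T18:05:00Z"}])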
ROOT CAUSE
Streaming requests are routed to different datacenters for processing based
on the table ID of the destination table. A sudden increase in traffic to
the BigQuery streaming service, combined with diminished capacity in one
datacenter, caused that datacenter to return a significant number of errors
for the tables whose IDs were routed to it. Other datacenters processing
streaming data into BigQuery were unaffected.
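The report does not describe the routing mechanism beyond its dependence on
the table ID, but a deterministic hash over the table ID, as in this
hypothetical Python sketch, shows why only the subset of tables mapped to
the degraded datacenter saw errors. The datacenter names are invented.

    import hashlib

    DATACENTERS = ["dc-a", "dc-b", "dc-c"]  # illustrative pool

    def route(table_id: str) -> str:
        """Deterministically map a table ID to one datacenter, so all
        streaming traffic for a given table lands in the same place."""
        digest = hashlib.sha256(table_id.encode()).digest()
        index = int.from_bytes(digest[:8], "big") % len(DATACENTERS)
        return DATACENTERS[index]

    # Every insert into this table hits the same datacenter; tables whose
    # IDs hashed to the unhealthy datacenter failed, while others succeeded.
    print(route("my_dataset.events"))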
REMEDIATION AND PREVENTION
Google engineers were notified of the event at 18:20 and immediately began
investigating. The first set of errors had subsided, but starting at 18:40
error rates increased again. At 19:17 Google engineers redirected traffic
away from the affected datacenter; its table IDs were redistributed to the
remaining healthy streaming servers, and error rates began to subside.
To prevent the issue from recurring, Google engineers are improving the
load balancing configuration so that spikes in streaming traffic can be
distributed more evenly amongst the available streaming servers.
Additionally, engineers are adding further monitoring, and tuning existing
monitoring, to reduce the time it takes to alert engineers of issues with
the streaming service. Finally, Google engineers are evaluating
rate-limiting strategies for the backends to prevent them from becoming
overloaded.
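The rate-limiting strategy under evaluation is not specified in the report;
a token bucket is one common approach, sketched below with purely
illustrative rates.

    import time

    class TokenBucket:
        """Admit at most `rate` requests per second on average, with
        bursts up to `capacity`; excess requests are rejected early
        instead of overloading the backend."""

        def __init__(self, rate: float, capacity: float):
            self.rate = rate
            self.capacity = capacity
            self.tokens = capacity
            self.last = time.monotonic()

        def allow(self) -> bool:
            now = time.monotonic()
            # Refill tokens in proportion to the elapsed time.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False  # shed load; the caller can return HTTP 429 or 503

    limiter = TokenBucket(rate=1000, capacity=2000)
    if not limiter.allow():
        pass  # reject early so clients back off and retry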