Intermittent Connection reset by peer in us-east4

Daniel DeSousa

Jun 16, 2021, 10:50:58 AM
to gce-discussion

BACKGROUND

I have a long-running Discord bot (3+ years) written with discord.py that has always run on GCP in zone us-east4-a. The bot runs in k8s using discord.py 1.7.2 and Python 3.9.

PROBLEM

In the past month or two, I have started to see an increasing number of connection interruptions: [Errno 104] Connection reset by peer. The resets are not directly tied to the amount of activity on the bot. They happen intermittently throughout the day in production (every few minutes on average).

These resets cause random failures of calls to the Discord HTTP API and a high rate of disconnects on the WebSocket. Many of the shard disconnects are able to RESUME, but many (~200 per day) end up issuing an IDENTIFY call, as for a new connection, and sometimes trigger extended backoff waits and partial outages.

EXAMPLE

Here is an example of a disconnect:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.9/site-packages/discord/shard.py", line 187, in reconnect
    self.ws = await asyncio.wait_for(coro, timeout=60.0)
  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 481, in wait_for
    return fut.result()
  File "/opt/venv/lib/python3.9/site-packages/discord/gateway.py", line 305, in from_client
    gateway = gateway or await client.http.get_gateway()
  File "/opt/venv/lib/python3.9/site-packages/discord/http.py", line 967, in get_gateway
    data = await self.request(Route('GET', '/gateway'))
  File "/opt/venv/lib/python3.9/site-packages/discord/http.py", line 192, in request
    async with self.__session.request(method, url, **kwargs) as r:
  File "/opt/venv/lib/python3.9/site-packages/aiohttp/client.py", line 1117, in __aenter__
    self._resp = await self._coro
  File "/opt/venv/lib/python3.9/site-packages/aiohttp/client.py", line 544, in _request
    await resp.start(conn)
  File "/opt/venv/lib/python3.9/site-packages/aiohttp/client_reqrep.py", line 890, in start
    message, payload = await self._protocol.read()  # type: ignore
  File "/opt/venv/lib/python3.9/site-packages/aiohttp/streams.py", line 604, in read
    await self._waiter
aiohttp.client_exceptions.ClientOSError: [Errno 104] Connection reset by peer
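
The failure above surfaces in a plain HTTP call, so a single transient reset could in principle be absorbed by retrying that call. The following is a minimal sketch of such a wrapper, not discord.py's own retry logic: `retry_on_reset`, `make_call`, and the delay values are hypothetical names and numbers chosen for illustration. It assumes the reset reaches the caller as an `OSError` whose `errno` is `ECONNRESET`; whether `aiohttp`'s `ClientOSError` satisfies that in a given version should be checked rather than taken from this sketch.

```python
import asyncio
import errno
import random


async def retry_on_reset(make_call, attempts=4, base_delay=0.5, sleep=asyncio.sleep):
    """Await make_call(), retrying with jittered exponential backoff on ECONNRESET.

    make_call is a zero-argument callable returning a fresh awaitable each time
    (hypothetical; for example a lambda wrapping one HTTP request).
    """
    for attempt in range(attempts):
        try:
            return await make_call()
        except OSError as exc:
            # Re-raise anything that is not a reset, and the final failed attempt.
            if exc.errno != errno.ECONNRESET or attempt == attempts - 1:
                raise
            # Delay doubles each attempt; jitter avoids synchronized retries.
            await sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```

A wrapper like this only hides isolated resets on request/response calls. It does nothing for the long-lived WebSocket, where a reset still forces a RESUME or IDENTIFY.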

EXPERIMENT TO ISOLATE THE PROBLEM

I performed an experiment to isolate the cause. I deployed a container with my bot to a VM (not k8s), isolated it so that it only communicates with Discord (no outside database), and automatically sent it commands to simulate user behavior and load (about 60 commands per minute in the same server -- well under my production load). I ran this for 20 minutes, or until connection resets appeared, and observed the following:

  • In us-east4-a, I am able to reproduce intermittent connection resets.
  • In us-east4-b, I am able to reproduce intermittent connection resets.
  • In us-east4-c, I am able to reproduce intermittent connection resets.
  • In us-central1-a, I am not able to reproduce any connection resets (even after 3 hours -- no shard disconnects at all).
  • In us-east1-b, I am not able to reproduce any connection resets.
  • On my laptop (residential internet on the east coast), I am not able to reproduce any connection resets.

All experiments use the same container, same machine-type and same test procedure.

I repeated the experiment in us-east4-a with multiple machine types up to 8 vCPU, and with both the premium and standard network tiers, and I still see resets. I also tried another VM in a different project, but the connection issues persist in us-east4.
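
A further experiment that takes discord.py and aiohttp out of the picture would be a standard-library probe that requests one HTTPS endpoint repeatedly and tallies how many attempts end in a reset. This is a minimal sketch under stated assumptions: `classify`, `fetch`, and `run_probe` are hypothetical names, the target URL is left to whoever runs it, and each attempt opens a fresh connection, so it exercises new connections rather than a long-lived WebSocket.

```python
import errno
import time
import urllib.request
from collections import Counter


def classify(exc):
    """Map an exception to a short label; ECONNRESET is the case of interest."""
    # urllib may wrap the socket error in URLError; unwrap to reach the OSError.
    reason = getattr(exc, "reason", None)
    if isinstance(reason, BaseException):
        exc = reason
    if isinstance(exc, OSError) and exc.errno == errno.ECONNRESET:
        return "reset"
    return type(exc).__name__


def fetch(url, timeout=10.0):
    """Open a new connection, read the whole response, and discard it."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()


def run_probe(url, attempts, delay=1.0, fetcher=fetch, sleep=time.sleep):
    """Request url `attempts` times and return a Counter of outcome labels."""
    tally = Counter()
    for _ in range(attempts):
        try:
            fetcher(url)
            tally["ok"] += 1
        except Exception as exc:
            tally[classify(exc)] += 1
        sleep(delay)
    return tally
```

Calling `run_probe(url, attempts=1200)` with the default one-second delay covers at least the same 20-minute window as the test above. Running the identical script from a VM in each zone would show whether the "reset" count differs by region independently of the bot.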

I have a support case open with GCP, as this appears to be a region-specific issue.

Are there any additional experiments I could provide to attempt to narrow down the cause of this? Are there any common GCP configuration problems that could result in this problem?

Short of moving to another region, I feel as though I am out of options.

Daniel DeSousa

Jun 16, 2021, 10:53:53 AM
to gce-discussion
I should add that it is not isolated to Discord connectivity. I'm also experiencing resets on the GitLab runners in my k8s cluster (when hitting the GitLab API, or when building with pip and hitting PyPI).

Digil (Google Cloud Platform Support)

Jun 17, 2021, 12:28:40 PM
to gce-discussion
Hello Daniel,

I am not sure whether any other external tools are available for connectivity testing. Google Cloud Platform has a help center article about 'Network Intelligence Center', which helps prevent networking outages and performance issues. As explained there, you can use Connectivity Tests to troubleshoot connectivity issues quickly and to predict the impact of configuration changes, reducing the risk of network failures. The help center article about 'connectivity tests' explains how to run them in a few easy steps. It may or may not be the right tool for you, but I recommend going through those documents to see whether they help you narrow down the cause of your regional connectivity issue.

On an additional note, a public issue report has also been raised here for the same concern. The Google Cloud Compute Engine team is already investigating this regional issue in 'us-east4'. You can expect further updates, including the RCA (if any), in the public issue tracker report. Feel free to comment there as well.