upgraded to 1.4.2 and requests got stuck (load 1.0)


Peter Lada

Jan 9, 2020, 11:22:12 AM
to pylons-discuss
GitHub Dependabot opened a PR for me to upgrade to 1.4.2 (thanks for the release), and I merged it yesterday around 1730.

At around 1930, one of the 6 dynos (Heroku 1X: single CPU, 0.5 GB RAM) got pegged at 1.0 load and timed out every subsequent request (the Heroku router cuts the connection after 30s).

At 1945 it happened to another dyno. 

At around 2000 I restarted the dynos and the problem went away, probably only temporarily.

I've reverted to 1.4.1 and the issue has not surfaced since (12+ hours).

Has anyone else used 1.4.2 in production? Any issues?

Sadly I cannot provide more info beyond the 1-, 5-, and 15-minute load average graphs, as the logs just show timed-out requests and nothing else.

--peter
Formsort.com
[Attachment: Screen Shot 2020-01-09 at 10.42.23.png]

Michael Merickel

Jan 9, 2020, 11:31:25 AM
to Pylons
I've been using 1.4.2 in production on Heroku with 5 Standard-1X dynos and have not observed this issue across a decent amount of traffic. That said, we'll probably need more to go on than your observations so far to reproduce this or turn it into any sort of bugfix. I'm not really sure what questions to ask to diagnose what's going on, other than finding a raw request payload that triggers an error in the parser. The diff for 1.4.2 is quite small.

- Michael


Jonathan Vanasco

Jan 9, 2020, 5:16:12 PM
to pylons-discuss
Is there a reason you're both using `1.4.2` and not `1.4.9`?

Jonathan Vanasco

Jan 9, 2020, 5:17:40 PM
to pylons-discuss


On Thursday, January 9, 2020 at 5:16:12 PM UTC-5, Jonathan Vanasco wrote:
Is there a reason you're both using `1.4.2` and not `1.4.9`?

Never mind, I just realized you're talking about Waitress, which just pushed 1.4.2, not Pyramid.

Bert JW Regeer

Jan 9, 2020, 10:08:08 PM
to Pylons Project
Does Heroku do any sort of request logging? Is there something you can turn on temporarily while investigating? 

The change between versions 1.4.1 and 1.4.2 is very small:


Without seeing what request is causing it to go haywire, I am left guessing as to what it may be.

Bert


Bert JW Regeer

Jan 9, 2020, 10:14:41 PM
to Pylons Project
Another question:

How long were you using 1.4.1 before you upgraded to 1.4.2? Did you use 1.4.0 for any period of time before that? I am trying to narrow down what the potential changeset could be.

Thanks,
Bert JW Regeer


Peter Lada

Jan 10, 2020, 9:21:36 PM
to pylons-discuss
These are great questions!

On Dec 26 we moved to 1.4.1 (again by merging a PR opened by Dependabot).
On Dec 23 we moved from 1.3.1 to 1.4.0.

So it seems each of these bumps was in place for only a week or less; still, in those roughly two weeks we didn't have any incidents.



In other news, today (on 1.4.1) we had a similar incident where a process was stuck at 100% CPU, but it recovered after about 2-3 minutes. We lost some requests to the 30-second timeout on the Heroku router.

I'm thinking about making a few changes now:

1) move to 1.4.2 (since 1.4.1 exhibits the same behavior)
2) use --expose-tracebacks
3) use --channel-timeout=29 (to make sure Waitress times out before the socket is closed by the Heroku router)
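
For reference, here is roughly what items 2 and 3 would look like through the Python API rather than the CLI flags (the trivial WSGI app below is just a stand-in for the real application):

from waitress import serve

def app(environ, start_response):
    # Placeholder WSGI app, standing in for the real application.
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'ok\n']

serve(
    app,
    listen='*:8080',          # placeholder bind; on Heroku the port comes from $PORT
    expose_tracebacks=True,   # item 2 above
    channel_timeout=29,       # item 3 above: give up before the 30s router cutoff
)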

If anyone has good insight into how to enable further request logging (beyond the path that Heroku already gives me), that would be great.

--peter



Jonathan Vanasco

Jan 13, 2020, 2:52:38 PM
to pylons-discuss

Not Heroku, but in general...



On Friday, January 10, 2020 at 9:21:36 PM UTC-5, Peter Lada wrote:

If anyone has good insight into how to enable further request logging (beyond the path that Heroku already gives me), that would be great.

You could write a simple tween, modeled after pyramid_exclog (https://github.com/Pylons/pyramid_exclog), that times the request (and possibly checks the CPU load), then logs pertinent information about the request to a database if thresholds are met. I've used this approach to grab GET/POST variables and paths to identify and recreate edge cases.
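
Something along these lines would be a starting point (the factory name, settings key, and 5-second threshold are placeholders, not anything from pyramid_exclog):

import logging
import time

log = logging.getLogger(__name__)

def slow_request_tween_factory(handler, registry):
    # Hypothetical setting; pick a threshold below Heroku's 30s router cutoff.
    threshold = float(registry.settings.get('slow_request.threshold', 5.0))

    def slow_request_tween(request):
        start = time.monotonic()
        try:
            return handler(request)
        finally:
            elapsed = time.monotonic() - start
            if elapsed > threshold:
                # Swap this for a database insert (or a Sentry event) as needed.
                log.warning("slow request: %.1fs %s %s params=%r",
                            elapsed, request.method, request.path, dict(request.params))

    return slow_request_tween

It would be registered in the Pyramid configurator with something like config.add_tween('myapp.tweens.slow_request_tween_factory').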

You could also do this with Sentry; their API allows custom events beyond standard exception logging.

I'd also look at your HTTP and Google Analytics logs for any patterns around the times these issues spike.

I don't know if this is possible on Heroku, but I still use StatsD timing to log certain types of server activity (like db connection management actions) and look for any changes in behavior after a release.
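
For example, something as small as this (assuming the statsd package from PyPI and a StatsD server on localhost; the metric name is made up):

import time
import statsd

stats = statsd.StatsClient('localhost', 8125, prefix='myapp')

with stats.timer('db.connection.checkout'):   # made-up metric name
    time.sleep(0.05)                          # stand-in for the real server activity

Graphing that timer per release makes regressions like this one much easier to spot.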

But...

Based on what you've disclosed:

1) Nothing makes this look particularly Waitress-related, as opposed to something like traffic spikes or problematic routes. I know nothing about your app, but most sites I've worked on have their lowest traffic of the year between Dec 23 and Jan 7.

2) If this were Waitress-related, I'd be suspicious of the entire 1.4 release family, not just the versions you identified. There were a handful of changes in 1.4.0 (https://docs.pylonsproject.org/projects/waitress/en/stable/#id4).

Bert JW Regeer

Feb 3, 2020, 12:56:17 AM
to Pylons Project
Just quickly following up on this.

Thanks to Fil Zembowicz, an issue was found in the regular expression used to parse incoming headers, which could lead to a denial of service. This has now been fixed in Waitress 1.4.3; please upgrade as soon as possible.
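
For anyone unfamiliar with this class of bug, here is a deliberately bad regex that shows the catastrophic-backtracking behavior (illustrative only; this is not the actual Waitress header regex):

import re
import time

evil = re.compile(r'^(a+)+$')      # nested quantifiers invite backtracking
payload = 'a' * 28 + '!'           # an almost-matching input forces the worst case

start = time.monotonic()
evil.match(payload)                # returns None, but only after an exponential search
print('took %.1fs' % (time.monotonic() - start))

A single request with a crafted header like that can pin a worker thread at 100% CPU, which matches the symptoms described earlier in this thread.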


Peter Lada

Feb 5, 2020, 2:05:13 PM
to pylons-discuss
Yeah, he sits right next to me :)



Cooper Baird

Nov 7, 2020, 1:42:45 PM
to pylons-discuss
Did you find this was resolved with 1.4.3? I recently switched my API from Gunicorn to Waitress (1.4.4) and have seen this behavior only twice in the past three weeks, where every request gets the Heroku H12 30-second timeout. Both times I restarted the dyno and the issue was fixed. This is the command I use to spin up the API:

waitress-serve --port=$PORT --connection-limit=2000 --threads=8 --cleanup-interval=15 --channel-timeout=60 --asyncore-use-poll run:app

Bert JW Regeer

Nov 8, 2020, 3:01:30 AM
to Pylons Project
I am going to need more information on what is going on, but the issue from earlier in this thread was resolved: the regex should no longer backtrack.


Cooper Baird

Nov 8, 2020, 12:11:54 PM
to pylons-discuss
I didn't see any error output on my server either time this happened. I need to enable Waitress's web traffic logging with TransLogger to see if that gives me better information next time.
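
Roughly what I have in mind (the inline WSGI app is a placeholder for my real run:app entrypoint, and the serve() call uses only a subset of the settings from the command above):

import logging

from paste.translogger import TransLogger
from waitress import serve

def app(environ, start_response):
    # Placeholder for the real run:app entrypoint.
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'ok\n']

logging.basicConfig(level=logging.INFO)        # TransLogger writes to the 'wsgi' logger
wrapped = TransLogger(app, setup_console_handler=False)
serve(wrapped, port=8080, threads=8, channel_timeout=60)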