Having issues with EB webhooks and Google Cloud Functions

Andrew Stillman

Jul 23, 2021, 11:03:00 AM

to Eventbrite Developers

Hi all,

Over the past 6 months, we've had half a dozen or so instances of webhooks "disappearing" from Eventbrite, which makes it hard to maintain the integrity of our business processes. I recently implemented a monitoring system to detect these disappearances, so I have somewhat better forensic info about what might be causing them, but I would like to know if others have experience with this issue and whether my hypotheses are viable.

1) The webhooks that disappear only seem to do so for receivers built in Google Cloud Functions (GCF), an on-demand, serverless environment similar to AWS Lambda. Is it possible there's a low-probability response from GCF's infrastructure that is causing Eventbrite to automatically unsubscribe (e.g. a "410 Gone")?

2) The Google Cloud Functions occasionally have maximum response times approaching 60s (especially when they are undergoing a "cold start") but, tellingly, never greater, and these requests appear to correlate with 408 errors in the EB logs. Can I assume that EB requires a maximum 60s response time? Is there some logic being applied to delete webhooks in relation to the frequency of 408 errors?

In either instance, I believe my use case likely requires a reconsideration of how Eventbrite handles webhook listeners. I also work extensively with Stripe webhooks and GCFs, for example, and I don't encounter the same issues where all other variables are the same.

Is there someone on the EB API team who can communicate with me directly about this? I've tried your support chat several times and my tickets have gone nowhere.

Best,

Andrew

nata...@eventbrite.com

Aug 9, 2021, 6:32:31 PM

to Eventbrite Developers

Hi Andrew!

Your remote server must process the webhook request in less than 3.5 seconds, or Eventbrite will consider it a failed request and return a 408 error. After a high volume of errors, we will temporarily disable the webhook subscription for 24 hours, then if a high volume of errors persists, we will permanently delete the webhook.

It is best practice to return a 200 status quickly and do additional processing async. For more information and best practices, feel free to review our Webhooks Documentation here.
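
As a rough illustration of that pattern, here is a minimal Python sketch that acknowledges the request immediately and defers the slow work. Everything in it is an assumption made for illustration: `handle_webhook`, `slow_business_logic`, and the in-process `work_queue` are hypothetical names, not part of any Eventbrite or Google API. On a serverless platform the hand-off would more likely go to a durable queue, since work left running after the response may not complete.

```python
import json
import queue
import threading
import time

# Hypothetical in-process queue; a stand-in for a durable queue in production.
work_queue = queue.Queue()
processed = []  # records finished payloads so the sketch can be inspected


def slow_business_logic(payload):
    """Placeholder for the long-running processing (hypothetical)."""
    time.sleep(0.2)  # simulate slow downstream work
    processed.append(payload)


def worker():
    """Drain the queue in the background, one payload at a time."""
    while True:
        payload = work_queue.get()
        try:
            slow_business_logic(payload)
        finally:
            work_queue.task_done()


def handle_webhook(raw_body):
    """Validate, enqueue, and return a status right away."""
    try:
        payload = json.loads(raw_body)
    except ValueError:
        return ("invalid JSON", 400)
    work_queue.put(payload)  # hand off; do not wait for processing
    return ("", 200)


# Start the background worker once, at import time.
threading.Thread(target=worker, daemon=True).start()
```

The handler only parses and enqueues, so its response time stays independent of how long the downstream processing takes.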

Apologies for the difficulties you've been facing with your webhooks - I hope this information is helpful!

Thanks,

Natalie

Andrew Stillman

Aug 12, 2021, 3:59:51 PM

to Eventbrite Developers

Hi Natalie,

Thanks for the helpful update. As of several weeks ago, I in fact changed our architecture to return in ~2 sec and use an async pattern to handle longer running processes (per your recommendation), and this has improved things some...

Unfortunately, I'm still seeing unexplained deletion of webhooks, and unexplained timeouts (408 errors) on the Eventbrite side, even though 99% of our requests are now completing in under 1s according to Google Cloud Platform (see report below, for the last 7 days), and none appear to be taking more than 2s.

Can anything be done to track down the cause of this? The webhook that was most recently deleted was 8795822.

[Attachment: Screen Shot 2021-08-12 at 3.51.33 PM.png]

Best,

Andrew

Andrew Stillman

Aug 13, 2021, 4:32:02 PM

to Eventbrite Developers

Hi Natalie,

Here's another case where the webhook was deleted (8795812) and a graph of the execution time over the last 7 days according to Google Cloud Platform.

As you can see, it's very rare that requests complete in >3.5s, but it does occasionally happen. This is common with the "cold start" behavior of ephemeral cloud functions. I'm curious whether the algorithm Eventbrite uses to decide when to pause or "kill" a webhook is oversensitive to this kind of implementation and in need of revisiting. Given the downstream costs to the developer of recovering the data integrity of an integration, I'd definitely suggest you adopt thresholds that are less sensitive. Is this something the team is open to discussing?
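
To put a number on how rare the slow requests are, something like the following Python sketch could be run over exported execution times. It assumes the latencies are available as a plain list of seconds. The `slow_share` helper and the sample values are invented for illustration; only the 3.5s limit comes from Natalie's reply.

```python
def slow_share(latencies_s, limit_s=3.5):
    """Return the fraction of requests that took longer than limit_s seconds."""
    if not latencies_s:
        return 0.0
    slow = sum(1 for t in latencies_s if t > limit_s)
    return slow / len(latencies_s)


# Invented sample: mostly warm invocations plus one cold start.
samples = [0.4, 0.6, 0.5, 0.8, 4.2, 0.7, 0.5, 0.9, 0.6, 0.3]
print(f"{slow_share(samples):.1%} of requests exceeded 3.5s")
```

Comparing this share against the 408 rate Eventbrite reports would show whether the two sides are measuring the same thing.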

[Attachment: Screen Shot 2021-08-13 at 4.22.54 PM.png]