New timeout errors (H12) - diagnosis help

Katy Tabero

Jul 8, 2021, 9:34:46 AM
to oTree help & discussion
Hello,

I posted before about timeout issues when running sessions of a small (9 subjects) but relatively calculation-intensive programme. I was getting 6-8 timeout errors in every test at the same point (when the calculations are done after 35 minutes of real effort tasks). First I set about making the programme more efficient, and when I was still risking a couple of timeouts per session, I upgraded to a Performance M dyno. Until yesterday I had run 5-6 sessions in a row without any timeouts and was satisfied that the smaller dyno must have been causing the problem before. However, since then I have run 2 sessions, both with 3 timeout errors.

Does anyone know why this might have changed when nothing else has? I have included images of the metric data provided by Heroku below. The dyno load is not high, so I am not convinced that increasing it again is the answer. I know the support documents mention this could be a database bottleneck; however, everything I have looked at implies that the database I am using (Postgres Standard 0) should be sufficient.

[Attachments: MetricsCapture1.PNG, MetricsCapture2.PNG, MetricsCapture3.PNG]

Any advice would be really appreciated. I am unable to make too many changes to the programme itself (at least no changes the participants would see) as we are close to halfway through our experiment programme, but background changes might be acceptable.

Best wishes, 

Katy

Chris @ oTree

Jul 8, 2021, 10:52:56 AM
to oTree help & discussion
oTree Hub's Pro plan has a performance analysis feature that might help to identify the bottleneck here. Otherwise, you could use print-statement debugging with time.time() to understand which functions are taking the longest to execute.
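The print-statement approach mentioned above could look roughly like the following. This is a minimal sketch, not code from the app in question: the helper `timed` and the function `calculate_results` are hypothetical names standing in for whatever function you suspect is slow.

```python
import time


def timed(label, func, *args, **kwargs):
    """Run func, print how long it took, and return (result, elapsed_seconds)."""
    start = time.time()
    result = func(*args, **kwargs)
    elapsed = time.time() - start
    print(f"{label} took {elapsed:.3f} s")
    return result, elapsed


def calculate_results(scores):
    # Hypothetical stand-in for the app's real results calculation.
    return sum(scores) / len(scores)


average, seconds = timed("calculate_results", calculate_results, [3, 5, 7])
```

On Heroku the printed lines should show up in the app's log output, so the slowest call can be spotted by comparing the reported durations.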

Katy Tabero

Jul 8, 2021, 1:12:14 PM
to oTree help & discussion
Hi Chris, 

Thank you for your quick reply.

I registered for the oTree Hub Pro plan and ran a couple of sessions to see what might be going on. However, I am not really sure what to do with the output. Here is what I see in the 'Performance Culprit' section following all 9 subjects timing out:

oTree Hub Analysis 2 - Phase 2.PNG

The average execution times for 'GET' in the waiting_room and 'POST' in the real_effort pages are very high, which is what I expected as this is when the timeout occurs (specifically when moving from either a task or the waiting room to the results page, where all the calculations occur). Beyond that, I am not sure what to do with this information.

Best wishes, 

Katy

Chris @ oTree

Jul 8, 2021, 1:22:38 PM
to oTree help & discussion
It is quite unusual for a single page load to take that long. Could you share your project code with me? (e.g. send me the otreezip)

Chris @ oTree

Jul 8, 2021, 1:23:14 PM
to oTree help & discussion
And add ch...@otree.org as a collaborator on the Heroku app.

Katy Tabero

Jul 8, 2021, 2:37:27 PM
to oTree help & discussion
Hi Chris, 

Of course, I have added you as a collaborator on Heroku and here is the file. I am not a proficient coder (very much self-taught within the last year), so I know that it is not perfect, but anything you can suggest or find would be really helpful.

If you have any questions about the programme or its running history, just let me know.

Best wishes, 

Katy

exoH.otreezip

Chris @ oTree

Jul 8, 2021, 3:47:51 PM
to oTree help & discussion
Thanks, I'm taking a look now. I don't see anything here that would cause performance issues. It's pretty standard stuff. Did you ever experience this issue locally, or only on Heroku?
Were you running multiple dynos? (should use just 1 dyno)

Katy Tabero

Jul 8, 2021, 6:47:16 PM
to oTree help & discussion
No, I didn't experience any errors locally and it ran pretty quickly; it was only when taking it to Heroku that I started having problems. I have always used one worker dyno and one web dyno (and just changed the dyno type), as I noticed on other threads that multiple web dynos were causing other people problems. Would one of each cause the same problems?

Chris @ oTree

Jul 8, 2021, 7:52:05 PM
to oTree help & discussion
The setup seems fine. I need a bit of time to look into this issue and will be in touch about it.

Chris @ oTree

Jul 8, 2021, 9:31:49 PM
to oTree help & discussion
Hi, I will follow up with you individually about how to investigate further.

Chris @ oTree

Jul 10, 2021, 12:55:46 PM
to oTree help & discussion
(Replying to the group since I think it could be useful for others)

I have investigated this app, and my best guess is that the problem occurs when the timeout fires while many rounds remain. For example, in the real_effort task, oTree has to skip several hundred pages for each player (70 rounds x 6 pages per round). On top of that, the timeout occurs for all players simultaneously (since it's a group timeout), so multiply that by the number of players. This creates a sudden pileup on the server. Once a request takes more than 30 seconds, Heroku cuts it off and gives a timeout error.
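To put rough numbers on that pileup, here is a small worked calculation using the figures from this thread (70 rounds, 6 pages per round, 9 subjects). The function name `pages_to_skip` is only illustrative, and the count assumes every remaining page in every remaining round has to be skipped.

```python
ROUNDS = 70
PAGES_PER_ROUND = 6
PLAYERS = 9


def pages_to_skip(rounds_completed, rounds=ROUNDS, pages_per_round=PAGES_PER_ROUND):
    """Pages still ahead of one player who has finished `rounds_completed` rounds."""
    return (rounds - rounds_completed) * pages_per_round


per_player_worst_case = pages_to_skip(0)
all_players_worst_case = per_player_worst_case * PLAYERS
per_player_after_20 = pages_to_skip(20)

print(per_player_worst_case)   # 420
print(all_players_worst_case)  # 3780
print(per_player_after_20)     # 300
```

Even a player who has already finished 20 rounds still has 300 pages to skip, and all players hit the server at the same moment.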

The most reliable solution would be to change the game to use live pages. Then all the interaction can be done within a single page in 1 round. Other advantages are that the data export file would be much smaller, and the user experience would be smoother since you don't need to reload the page. If you are interested in going this route, I could sketch out a bit what that would look like.
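As a rough illustration of the live page idea, the sketch below shows the kind of handler that would process each answer on the server without a page reload. It is a sketch under stated assumptions, not the actual app: the `Player` dataclass is a stand-in so the code runs on its own, and `TASKS`, `tasks_completed`, and `tasks_correct` are hypothetical names. In oTree the function would be attached to a Page class as its `live_method`, with the browser calling `liveSend(...)` and receiving the reply in `liveRecv(...)`.

```python
from dataclasses import dataclass

# Hypothetical task list: each task asks for the sum of two numbers.
TASKS = [(3, 4), (10, 5), (8, 9)]


@dataclass
class Player:
    # Stand-in for an oTree Player model so the sketch is self-contained.
    id_in_group: int
    tasks_completed: int = 0
    tasks_correct: int = 0


def live_method(player, data):
    """Score one submitted answer and send the next task back to the same player."""
    a, b = TASKS[player.tasks_completed % len(TASKS)]
    if data.get("answer") == a + b:
        player.tasks_correct += 1
    player.tasks_completed += 1
    next_a, next_b = TASKS[player.tasks_completed % len(TASKS)]
    # The return value maps id_in_group to the payload for that player.
    return {
        player.id_in_group: {
            "next_task": [next_a, next_b],
            "completed": player.tasks_completed,
        }
    }


player = Player(id_in_group=1)
reply = live_method(player, {"answer": 7})
print(reply)
```

Because every answer is stored as it arrives, there is only one page and one round, so nothing has to be skipped when the timer runs out.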

If you do not want to switch to live pages, you could reduce the number of rounds, or increase the task timeout (or get rid of the timeout entirely). You could also switch to a faster Postgres plan. That will reduce the risk of exceeding Heroku's 30-second limit, but I can't guarantee it will eliminate it.

If a timeout happens, does it disrupt the experiment completely? If the user reloads their page, can they continue? The performance metrics you showed indicate that this is just a temporary spike, rather than a systematic performance issue. Maybe you can just live with the occasional timeout and tell users that if something goes wrong, they should wait 10 seconds and refresh the page.

But anyway live pages would be the way to really solve this issue.


Chris @ oTree

Jul 10, 2021, 2:10:38 PM
to oTree help & discussion
Here is another idea:

I noticed that in your app, 2 out of the 6 pages in the page_sequence are only displayed in the first round, and 2 are only displayed in the last round. If you split into 3 apps (intro, main, results), then only the middle app needs to have many rounds. That reduces the number of pages to skip by 2/3. Additionally, when a timeout happens you could use app_after_this_page to skip to the results app. Replace the is_displayed with this:

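@staticmethod
def app_after_this_page(player: Player, upcoming_apps):
    # Once the task timer has run out, jump straight to the next app
    # instead of skipping every remaining page one by one.
    if get_timeout_seconds(player) < 0:
        return upcoming_apps[0]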

Performance-wise, that is faster than using is_displayed on each page.
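To make the three-app split concrete, here is a minimal sketch of how the session config in settings.py might look, together with the page count it implies. The app names `intro` and `results` and the session name are assumptions for illustration; only `real_effort` appears in this thread.

```python
# Hypothetical session config: only the middle app needs 70 rounds.
SESSION_CONFIGS = [
    dict(
        name="real_effort_study",  # assumed name
        app_sequence=["intro", "real_effort", "results"],
        num_demo_participants=9,
    ),
]

ROUNDS = 70
PAGES_PER_ROUND_BEFORE = 6  # intro pages + task pages + results pages in one app
PAGES_PER_ROUND_AFTER = 2   # only the task pages stay in the repeated app

pages_before = ROUNDS * PAGES_PER_ROUND_BEFORE
pages_after = ROUNDS * PAGES_PER_ROUND_AFTER
print(pages_before, pages_after)  # 420 140
```

With app_after_this_page on top of this, the remaining pages in the middle app would not need to be visited at all once the timer has expired.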

Katy Tabero

Jul 13, 2021, 7:11:36 AM
to oTree help & discussion
Hi Chris, 

While I am playing around with these ideas and testing (sadly taking a very long time, as sometimes I get timeouts and other times not), I was wondering whether it would be possible to use local storage to prevent the data from the HTML being lost if a timeout does occur. For example, if I had an HTML timer in the task page, could I save this data to local storage and then recall it in a later page to save as a separate variable?

Best wishes, 

Katy

Chris @ oTree

Jul 13, 2021, 9:04:44 AM
to Katy Tabero, oTree help & discussion
You can use local storage, but I don’t see why that would help in this case, since as far as I understand the timeout only happens after you hit the next button. So in either case you are reloading the page completely, but if a timeout happens you just need to reload again, right?

Chris @ oTree

Jul 13, 2021, 9:11:36 AM
to Katy Tabero, oTree help & discussion
(I mean, as far as I understand, the error only happens when the timer runs out, which is going to force a move to a different page even when it works correctly.)

Katy Tabero

Jul 13, 2021, 10:31:49 AM
to oTree help & discussion
My thought was that if data can be stored and accessed across pages with the same domain, then if the H12 error occurs and the submitted data is lost, it should still be saved somewhere. I could then recall it using JavaScript in the next page, even if the subjects have to refresh to re-join the experiment session. I am trying to think of ways to back up this data in case I cannot find a solution. As you noted above, H12 errors would not be an issue, other than not looking great, if it weren't for the data lost.

I have run several tests over the last few days and am still unsure what the best way forward is. Sometimes I can get through a run of the original file without any H12 errors and think that I may have found the right balance of resources, but then running it again I will get them. I tried a few run-throughs removing the 2 final pages in the real_effort app (the wait page and the result page), but they still time out often unless I upgrade the dynos to at least Standard 1X. The problem is that the H12 errors are inconsistent: I managed to run 5 sessions in a row without any and then they turned up again. So even in testing I cannot be sure whether I have solved a problem, or whether it is just one of the error-free runs.

Chris @ oTree

Jul 13, 2021, 12:18:35 PM
to oTree help & discussion
Did you also remove the first 2 (intro) pages from the app sequence? They can be moved to an 'intro' app.

Katy Tabero

Jul 22, 2021, 6:18:48 AM
to oTree help & discussion
Hi all, 

I am just updating this thread as, after some help from Chris, I have reached a solution - so I thought I should share it in case anyone finds themselves in a similar situation.

The main takeaway I have is that, if possible, use a live page! Unfortunately I did not know about these when I started building the programme, and because I had already run several sessions with my programme before the problem popped up, I was unable to switch to live pages (no changes could be made that would be visible to the subjects).

Given this, I found a workaround with my current programme. The issue was being caused by a bottleneck: once the timer for the real effort task hit 0, the programme had to iterate through up to 420 pages (70 rounds with 6 pages each) for 9 subjects simultaneously, and then calculate their results. Even with relatively high resources I would get H12 errors at this stage and lose data, because if the page errored, the data recorded there was not submitted. I was unable to reduce the number of rounds as we needed to ensure that there were enough tasks that no subject would run out, and even after reducing the pages in each round I was still getting timeout errors (albeit fewer).

My workaround was to identify the subject who had completed the most rounds (e.g. 20 out of 70 potential tasks) and calculate everyone's results in that round + 1. To do this, I created dummy variables for having played in a given round. Once the timer hit 0, the programme located the first round in which no subject had played and set this as the condition for displaying the pre-results wait page and the result page. In the result page a second dummy variable was set to 1 and carried forward for the remainder of the 70 tasks, so that the next page to display was at the beginning of the next app (you could also use the app_after_this_page function). While this may mean iterating through many empty pages after the results page, if there is an H12 error then refreshing the page will allow subjects to continue without any data loss. Having said that, in testing this new programme I have not yet had an H12 error, even using the free testing resources.
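The core of the workaround described above, finding the first round in which no subject played, can be sketched as follows. This is an illustration only and not the code from the actual app: the function names and the `played_by_round` layout (one list of 0/1 dummy flags per round, one flag per subject) are assumptions. In oTree the check would presumably sit inside `is_displayed` for the wait page and the result page.

```python
def first_unplayed_round(played_by_round):
    """Return the 1-based number of the first round in which no subject played.

    played_by_round[r - 1] holds one 0/1 dummy flag per subject for round r.
    Returns None if at least one subject played in every round.
    """
    for round_number, flags in enumerate(played_by_round, start=1):
        if not any(flags):
            return round_number
    return None


def show_results_in(round_number, played_by_round):
    """True only in the round where results should be calculated and shown."""
    return round_number == first_unplayed_round(played_by_round)


# Three subjects who completed 3, 1 and 2 rounds out of 5 possible rounds.
played = [
    [1, 1, 1],
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 0],
    [0, 0, 0],
]
results_round = first_unplayed_round(played)
print(results_round)  # 4, i.e. the most rounds completed (3) plus 1
```

Showing the results in that round means the submitted data is saved before the long run of empty pages begins, which is why a refresh after an H12 error no longer loses anything.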

Thank you again Chris for all of your help with this. 

Best wishes, 

Katy

Chris @ oTree

Jul 22, 2021, 11:08:13 AM
to oTree help & discussion
Thank you for the update!