dealing with reports that study stopped/froze

Becky Gilbert

Feb 28, 2018, 1:29:58 PM
to JATOS
Hi JATOS users,

I sometimes get reports of studies stopping/freezing (i.e. the person just sees a blank page, nothing happens), which is super difficult to debug because I usually can't reproduce the problem. Today I had this issue, where someone reported that they just got a blank page after one of the components finished. I can see that they must have started accessing the next component (the one that didn't load) because the data was sent back to the server for the previous component, and I have a start time and results ('none') for the component that didn't load.

I looked in the Nginx error log and didn't see any problems. In the access log, I can see that their last interaction with the server is this:

<IP address> - - [28/Feb/2018:16:07:29 +0000] "GET /publix/31/167/initData?srid=816 HTTP/1.1" 200 1249 "<our domain>/publix/31/167/start?srid=816" "<user agent string>"

So it looks to me like they were successfully redirected to the study component with ID 167, and I didn't see any 4xx/5xx errors. Does anyone have thoughts on what might have happened here, and/or what else I could check?

Also, if this happens again, is there a way to allow this person to start where they left off by sending them a link to one particular component (rather than the normal study link), based on the study ID, component ID and their study result ID? For instance, in this case, could I send the person this link: <our domain>/publix/31/167/start?srid=816 to have them start from the beginning of component 167?

Thanks!
Becky

Kristian Lange

Mar 2, 2018, 11:40:56 AM
to Becky Gilbert, JATOS
Hi Becky,

sorry for answering so late but better late than never :)

Can you somehow quantify how often this 'studies stopping/freezing' happens? If it is rare, I tend to say that this is just the usual Internet stutter. But if it happens more often, the cause might lie in the JATOS server or the study's scripts.

In your Nginx log example I agree with you: the component was started and then the init data were requested:
1) '/publix/31/167/start?srid=816' - the component 167 of study 31 with the study result 816 was started
2) '/publix/31/167/initData?srid=816' - jatos.js requested the init data for same study result 816

From there on I can only guess what went wrong: there is no way I can be sure that the JATOS server actually delivered the init data back to the client. That said, you found no problems in the Nginx error log, which is a good sign that nothing went wrong on the server. What does the JATOS log say around [28/Feb/2018:16:07:29 +0000], the time of your Nginx log entry? Is there anything unusual? Did JATOS maybe restart unexpectedly? Is there any 'init' request around this time (each request from a client gets logged)? It should look something like:
2018-03-02 16:17:13,019 [INFO] - publix_access - GET /publix/19/48/initData?srid=8

Then you asked about whether there is a way to send a link to this worker from which they could continue the study run. Probably.
As you suggested, you can send your worker the link <our domain>/publix/31/167/start?srid=816. JATOS would then continue the study from where it was left off, but only under these conditions: 1) the component is set to 'reloadable', 2) the worker uses the same browser on the same computer and hasn't deleted JATOS' cookies, and 3) the worker hasn't started more than 10 JATOS studies in parallel after starting study 31. Condition 3 is very unlikely to be a problem, and for condition 1 you can just set it in the component's settings. Condition 2 is more difficult: it requires the worker to return to the same computer and browser they used to run the study before.
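
As a small illustration, the continuation link can be assembled from the three IDs. This is only a sketch that follows the URL pattern visible in the logs above; the function name and the example domain are made up for illustration and are not part of JATOS.

```python
from urllib.parse import urlencode


def continuation_link(base_url: str, study_id: int, component_id: int,
                      study_result_id: int) -> str:
    """Build <base>/publix/<study>/<component>/start?srid=<study result>.

    Hypothetical helper: it only mirrors the URL pattern seen in the logs.
    """
    query = urlencode({"srid": study_result_id})
    return f"{base_url.rstrip('/')}/publix/{study_id}/{component_id}/start?{query}"


# The domain below is a placeholder, not a real JATOS server.
print(continuation_link("https://jatos.example.org", 31, 167, 816))
# prints https://jatos.example.org/publix/31/167/start?srid=816
```

The link only helps if the three conditions above hold, so it is worth checking the 'reloadable' setting before sending it.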

So far my best,
Kristian


--
You received this message because you are subscribed to the Google Groups "JATOS" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jatos+un...@googlegroups.com.
To post to this group, send email to ja...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/jatos/90e996d0-4233-4bbc-b12e-b47c16c0f5e5%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Becky Gilbert

Mar 5, 2018, 8:54:47 AM
to JATOS
Hi Kristian,

Thanks very much for the helpful reply, as always!

This issue happens intermittently. Whenever we recruit a batch of participants, most people can do the whole experiment just fine, but a few people report that the study stopped/froze. This typically happens at the start of a new component. There's no systematic pattern in the browsers that people are using when the study freezes. This makes me think that it's an issue with the server or the internet connection, rather than a problem with the study's scripts (but I could be wrong). 

I would agree that it could just be the usual internet stutter, except there are a few things that concern me. One is that this tends to happen in 'batches', i.e. multiple people will report the same problem around the same time. Another issue is that often people will try to reload the study component multiple times before giving up. The fact that they can reload the component means that they still have internet access, yet the component still doesn't load. 

We do load a large set of audio files at the start of some components, so maybe this is causing problems. But even so, I still don't know how to figure out exactly what the problem is, e.g. the server can't handle the requests, the worker's browser storage limits are exceeded, etc. Any suggestions for how to figure this out?

> What does the JATOS log say around [28/Feb/2018:16:07:29 +0000], the time of your Nginx log? Is there anything unusual? Did JATOS maybe restart unexpectedly? Is there any 'init' request around this time (each request from a client gets logged)? It should look something like:
> 2018-03-02 16:17:13,019 [INFO] - publix_access - GET /publix/19/48/initData?srid=8

No I don't see anything unusual in the JATOS log around that same time. JATOS didn't restart and there were no warnings. Here are the relevant lines for this worker when they were unable to load the component:

2018-02-28 16:07:29,318 [INFO] - publix_access - PUT /publix/31/166/resultData?srid=816
2018-02-28 16:07:29,318 [INFO] - c.p.Publix - .submitResultData: studyId 31, componentId 166, studyResultId 816
2018-02-28 16:07:29,397 [INFO] - publix_access - POST /publix/31/studySessionData?srid=816
2018-02-28 16:07:29,398 [INFO] - c.p.Publix - .setStudySessionData: studyId 31, studyResultId 816
2018-02-28 16:07:29,432 [INFO] - publix_access - GET /publix/31/nextComponent/start?srid=816
2018-02-28 16:07:29,432 [INFO] - c.p.Publix - .startNextComponent: studyId 31, studyResultId 816
2018-02-28 16:07:29,469 [INFO] - publix_access - GET /publix/31/167/start?srid=816
2018-02-28 16:07:29,469 [INFO] - c.p.Publix - .startComponent: studyId 31, componentId 167, studyResultId 816
2018-02-28 16:07:29,897 [INFO] - publix_access - GET /publix/31/167/initData?srid=816
2018-02-28 16:07:29,898 [INFO] - c.p.Publix - .getInitData: studyId 31, componentId 167, studyResultId 816
2018-02-28 16:07:29,910 [INFO] - publix_access - POST /publix/31/heartbeat?srid=816
2018-02-28 16:07:30,023 [INFO] - c.p.GeneralSingleBatchChannel - .open: studyId 31, studyResultId 816

It looks to me like everything was fine and this worker should have started component 167 (I don't know what the very last line is reporting - GeneralSingleBatchChannel - but I assume this isn't an error). So I'm not sure how to debug any further, because I can't find any errors...
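
For anyone doing the same kind of check, pulling one worker's lines out of a long JATOS log can be scripted. The following is a minimal sketch that assumes the log line layout shown above (timestamp, level, logger, message); the regular expressions and the function name are my own and have only been checked against these example lines.

```python
import re
from datetime import datetime

# Layout assumed from the example lines: "<timestamp> [LEVEL] - <logger> - <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\[(?P<level>\w+)\] - (?P<logger>\S+) - (?P<msg>.*)$"
)
# The study result ID appears as "srid=816" or as "studyResultId 816".
SRID_RE = re.compile(r"(?:srid=|studyResultId )(\d+)")


def events_for_study_result(lines, study_result_id):
    """Return (timestamp, level, logger, message) tuples for one study result."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not follow the assumed layout
        srid = SRID_RE.search(match["msg"])
        if srid and int(srid.group(1)) == study_result_id:
            ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S,%f")
            events.append((ts, match["level"], match["logger"], match["msg"]))
    return events


sample = [
    "2018-02-28 16:07:29,469 [INFO] - publix_access - GET /publix/31/167/start?srid=816",
    "2018-02-28 16:07:29,898 [INFO] - c.p.Publix - .getInitData: studyId 31, componentId 167, studyResultId 816",
    "2018-03-02 16:17:13,019 [INFO] - publix_access - GET /publix/19/48/initData?srid=8",
]
for ts, level, logger, msg in events_for_study_result(sample, 816):
    print(ts.time(), level, logger, msg)
```

Filtering on the level field (anything other than INFO) would be the natural next step when looking for warnings around the time of a reported freeze.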

By the way, we have had other reports that a study stopped/froze during a component. However in these cases I don't think the problem could be with the server or internet connection, because the study should simply run locally once all of the resources have been loaded at the start. So it must be a problem with the study code, right? Or is it the case that any jatos functions called during a component (e.g. jatos.appendResultData, heartbeat) could cause the study to stop unexpectedly if they fail? 

Thanks very much for letting me know how to send a link to allow a worker to continue a study run - that all makes sense and is very useful!

Best wishes,

Becky

Kristian Lange

Mar 5, 2018, 3:32:11 PM
to Becky Gilbert, JATOS
Hi Becky,

nice to know I was of some help. I should put this study run continuation possibility into the docs somewhere. It might be useful for others too.

Let me answer in reverse order:

>By the way, we have had other reports that a study stopped/froze during a component. However in these cases I don't think the problem could be with the server or internet connection, because the study should simply run locally once all of the resources have been loaded at the start. So it must be a problem with the study code, right? Or is it the case that any jatos functions called during a component (e.g. jatos.appendResultData, heartbeat) could cause the study to stop unexpectedly if they fail? 
I would say it's very unlikely that any jatos.js function causes the study to stop or fail. Even if one of the Ajax-calling functions like jatos.appendResultData or the heartbeat fails, they are called asynchronously and the study just continues as normal. It's difficult to say what causes those freezes. Is there any pattern, like always Firefox, or always on a Linux OS, or always at a certain time?


>It looks to me like everything was fine and this worker should have started component 167 (I don't know what the very last line is reporting - GeneralSingleBatchChannel - but I assume this isn't an error). So I'm not sure how to debug any further, because I can't find any errors...
Yes, in the log everything looks fine: first the component starts, then it gets the init data, then we have the first heartbeat, and then the batch channel opens. It really looks like the component started and jatos.js was initialized normally.
2018-02-28 16:07:29,469 [INFO] - c.p.Publix - .startComponent: studyId 31, componentId 167, studyResultId 816
2018-02-28 16:07:29,897 [INFO] - publix_access - GET /publix/31/167/initData?srid=816
2018-02-28 16:07:29,898 [INFO] - c.p.Publix - .getInitData: studyId 31, componentId 167, studyResultId 816
2018-02-28 16:07:29,910 [INFO] - publix_access - POST /publix/31/heartbeat?srid=816
2018-02-28 16:07:30,023 [INFO] - c.p.GeneralSingleBatchChannel - .open: studyId 31, studyResultId 816
>We do load a large set of audio files at the start of some components, so maybe this is causing problems. But even so, I still don't know how to figure out exactly what the problem is, e.g. the server can't handle the requests, the worker's browser storage limits are exceeded, etc. Any suggestions for how to figure this out?
You could do a little load test with your JATOS server. There are useful tools like 'siege' or 'ab'. They just request an HTTP resource over and over again. This HTTP resource could be a link to a personal multiple run. But this should be done very carefully: with those load test tools one can easily annoy the admins who are responsible for the server. It's best to ask them beforehand if it is okay. If you need help with the exact commands I can help there.
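
To give an idea of what such tools do, here is a rough Python sketch of the same principle: request one URL many times with a few parallel workers and count the status codes. It is not a replacement for 'siege' or 'ab'. The function names, the default numbers and the example URL are assumptions for illustration only.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import urllib.error
import urllib.request


def fetch_status(url: str, timeout: float = 30.0) -> int:
    """Request the URL once; return the HTTP status code, or 0 if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered with a 4xx/5xx status
    except OSError:
        return 0  # connection refused, DNS failure, timeout, ...


def load_test(url, total_requests=50, concurrency=5, fetch=fetch_status):
    """Send total_requests requests with `concurrency` workers; count the status codes."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return Counter(pool.map(fetch, [url] * total_requests))


# Example call (placeholder URL; only run this with the admins' permission):
# print(load_test("https://jatos.example.org/some/study/link", 100, 10))
```

A result that contains anything other than 200 (for example 502, 504, or 0) under moderate concurrency would point towards the server side rather than the participants' connections.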

Another idea I had is to store your audio files somewhere else. The JATOS server doesn't have to deliver all static files itself. This can be done by any other file delivery service, e.g. Dropbox (http://stefan.antoni.io/using-dropbox-for-static-file-delivery/). If your audio files are large and numerous, this would take a lot of load off the JATOS server.

So far my best,
Kristian

Becky Gilbert

Mar 13, 2018, 10:46:43 AM
to JATOS
Hi Kristian,

Thanks very much for your response, and sorry for the slow reply! We're still having problems with this so I really appreciate the help.

> I would say it's very unlikely that any jatos.js function causes the study to stop or fail. Even if one of the Ajax-calling functions like jatos.appendResultData or the heartbeat fails, they are called asynchronously and the study just continues as normal. It's difficult to say what causes those freezes. Is there any pattern, like always Firefox, or always on a Linux OS, or always at a certain time?

Ok great, this is good to know. Unfortunately there's no clear pattern that I can pick out in terms of OS or browser. It doesn't seem to happen at a particular time. When we collect a batch of data at once, usually a minority of participants (e.g. 1/4 - 1/3) will report a problem with a study component not loading, and the study will run fine for everyone else. 

> You could do a little load test with your JATOS server. There are useful tools like 'siege' or 'ab'. They just request an HTTP resource over and over again. This HTTP resource could be a link to a personal multiple run. But this should be done very carefully: with those load test tools one can easily annoy the admins who are responsible for the server. It's best to ask them beforehand if it is okay. If you need help with the exact commands I can help there.


Thanks, this is a good idea. I'll speak to my IT team about it. Have you done this with JATOS before? I'm wondering what URL to use since there's no 'home' page - I guess just one of the study links? (I may email you about this in the future, if that's ok)


> Another idea I had is to store your audio files somewhere else. The JATOS server doesn't have to deliver all static files itself. This can be done by any other file delivery service, e.g. Dropbox (http://stefan.antoni.io/using-dropbox-for-static-file-delivery/). If your audio files are large and numerous, this would take a lot of load off the JATOS server.

That's another good idea - thanks! 

One thing I might try is to figure out whether the problems are due to exceeding browser cache limits for some participants, since we're pre-loading a large number of audio files so that they're immediately accessible during the task. It looks like it's possible to query the amount of available storage, e.g. https://stackoverflow.com/a/35696506 so maybe that would give us some useful information.

But one thing I'm still confused about is the lack of errors in the logs. If our JATOS server is failing to deliver the study resources because it is overloaded, shouldn't I be seeing an error in the Nginx log, e.g. 504? Likewise, if a client's browser cache is full, then some of the resources would not be delivered successfully, so shouldn't that be reflected in the HTTP status codes? I don't know enough about how this works and need to do some more reading. 

Let me know if you have any other ideas. Otherwise I'll update if/when I figure out what's going on.

Thanks again,

Becky

Kristian Lange

Mar 13, 2018, 5:29:39 PM
to Becky Gilbert, JATOS
Hi Becky,


>But one thing I'm still confused about is the lack of errors in the logs. If our JATOS server is failing to deliver the study resources because it is overloaded, shouldn't I be seeing an error in the Nginx log, e.g. 504?
If JATOS is failing you should indeed see some 5xx status codes in the Nginx log. What the JATOS log would show I'm not sure right now: maybe some timeouts, a restart, or some Java exceptions. But since you don't see any 5xx responses in the Nginx log, it's unlikely to be a JATOS problem.
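
Checking a whole access log for 5xx responses by eye is tedious, so here is a minimal sketch of how it could be scripted. It assumes the access log format of the line quoted at the start of this thread (IP, timestamp, request, status, size); the regular expression, the function name and the sample lines with their IP address are illustrative only.

```python
import re
from collections import Counter

# Matches lines like:
# <ip> - - [28/Feb/2018:16:07:29 +0000] "GET /path HTTP/1.1" 200 1249 "<referer>" "<agent>"
ACCESS_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def status_summary(lines):
    """Count responses per status class and collect all 5xx entries."""
    counts = Counter()
    server_errors = []
    for line in lines:
        match = ACCESS_RE.match(line)
        if not match:
            continue  # ignore lines in a different format
        status = int(match["status"])
        counts[f"{status // 100}xx"] += 1
        if status >= 500:
            server_errors.append((match["time"], match["path"], status))
    return counts, server_errors


sample = [
    '203.0.113.5 - - [28/Feb/2018:16:07:29 +0000] "GET /publix/31/167/initData?srid=816 HTTP/1.1" 200 1249 "-" "test-agent"',
    '203.0.113.5 - - [28/Feb/2018:16:07:31 +0000] "GET /publix/31/167/start?srid=816 HTTP/1.1" 504 0 "-" "test-agent"',
]
counts, server_errors = status_summary(sample)
print(dict(counts))
print(server_errors)
```

An empty list of 5xx entries over the period in question supports the conclusion that the server answered every request it received.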

>Likewise, if a client's browser cache is full, then some of the resources would not be delivered successfully, so shouldn't that be reflected in the HTTP status codes? I don't know enough about how this works and need to do some more reading.
No, the HTTP status codes are determined by the responding server, in this case JATOS or, if it fails to answer, Nginx. If a problem happens in the browser, you usually get messages in the browser's console.


>One thing I might try is to figure out whether the problems are due to exceeding browser cache limits for some participants, since we're pre-loading a large number of audio files so that they're immediately accessible during the task. It looks like it's possible to query the amount of available storage, e.g. https://stackoverflow.com/a/35696506 so maybe that would give us some useful information.
I can't imagine that browser cache limits affect the loading of your study's sound files. Modern browsers allow quite big file loads. As long as your files aren't extremely large it should be no issue. The Stack Overflow link you are referring to actually talks about a different kind of storage in a browser, not the one where media files are stored.

Keep me updated. I can help you with the load tests. Some time ago I did some load tests with JATOS using ab, siege and another tool called Gatling, which allows more elaborate load tests.

Best,
Kristian


Becky Gilbert

Mar 15, 2018, 2:30:41 PM
to Kristian Lange, JATOS
Hi Kristian,

Thanks for your response - this all makes sense.

I think we finally figured this out. The problem was that some people had very slow internet connections (less than 10 Mbps), and I hadn't tested the study under these conditions. The study actually times out if the files don't load within a certain duration, and some proportion of participants were timing out for this reason, specifically on the components that require a lot of audio files. We've increased the timeout duration and added a warning message that the experiment might take a few minutes to load, and so far we haven't had this problem again.

Even though it ended up that this issue wasn't related to JATOS, this discussion has helped me learn a lot about how to differentiate among all of the possible server/client issues. So thanks again for your input.

Best,
Becky



Kristian Lange

Mar 16, 2018, 4:57:56 AM
to Becky Gilbert, JATOS
Hi Becky,

That's nice to hear. Your study seems to have some huge sound files if 10 Mbps isn't enough ;)

I'm just curious: Where did the timeout error occur, in the browser? And how did you increase the timeout?

Best,
Kristian


Becky Gilbert

Mar 16, 2018, 2:08:24 PM
to JATOS
Hi Kristian,

Yeah, we need to preload 300+ wav files totalling about 65 MB. The problem is that there was a loading time limit (60 seconds), so if the connection speed was less than ~7 Mbps then the person would reach this time out before loading all of the files, and then see an error message. It seems surprising to me that so many people would have connections that are this slow. But I suppose it could happen, especially in rural areas, and some people will have connections that are generally faster but unreliable.
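
The relationship between total file size, time limit and connection speed can be checked with a quick calculation. This is a back-of-the-envelope sketch that assumes 1 MB = 8 megabits and ignores protocol overhead, latency and parallel requests, so real loading times will be somewhat worse. The function names are made up for illustration.

```python
def min_speed_mbps(total_megabytes: float, time_limit_s: float) -> float:
    """Slowest connection (in Mbps) that still loads everything within the limit."""
    return total_megabytes * 8 / time_limit_s


def load_time_s(total_megabytes: float, speed_mbps: float) -> float:
    """Seconds needed to load the given amount of data at a given speed."""
    return total_megabytes * 8 / speed_mbps


print(round(min_speed_mbps(65, 60), 2))  # prints 8.67
print(round(load_time_s(65, 7), 1))      # prints 74.3
```

Under these assumptions, about 65 MB needs roughly 8 to 9 Mbps to finish within 60 seconds, and a 7 Mbps connection would need about 74 seconds, which is past the limit.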

Yes the time out occurred in the browser. We're using jsPsych, which has an 'audio_preload' option that will load all the audio at the start. What I didn't realise is that there is a default time limit of 60 seconds, which you can change with a 'max_load_time' parameter. So I didn't intentionally set this time limit myself, but people were reaching it and then seeing the default error message: "The experiment failed to load". Then people were just telling us 'your study didn't load' but we didn't know they were seeing an error message. Anyway, it was an easy fix: we just changed the 'max_load_time' parameter and warned people that the task may take a few minutes to load.

Hope that helps! Let me know if you have any other questions.



Kristian Lange

Mar 19, 2018, 11:54:47 AM
to Becky Gilbert, JATOS
Hi Becky,

300+ wav files preloading - whoo! That's a lot!

Interesting, I didn't know about this 'max_load_time' parameter in jsPsych. 

Like I suggested before, if your JATOS server gets into trouble delivering all the static audio files, you can put them on some external file delivery service like Dropbox. They allow fast file downloads from everywhere in the world; that's their business. But of course this works best with static files (files that do not change), like audio, video and pictures. I wouldn't do this with JavaScript files or similar.

Best
Kristian
