sporadic problems accessing REST interface from Edison


Anubhav Jain

Jun 28, 2016, 1:47:33 PM
to Materials Project Development Group
I have some compute jobs that try to determine the stability of a material in the middle of a workflow. The stability determination requires calling the MPRester.get_stability() function. The jobs are running on Edison.

Most of the time this operation completes fine, so the code itself is OK. But sporadically (maybe 10-15% of the time) I get the following error:

Traceback (most recent call last):
  File "/global/project/projectdirs/m2439/aj_thermoelectrics/codes/fireworks/fireworks/core/rocket.py", line 213, in run
    m_action = t.run_task(my_spec)
  File "/global/project/projectdirs/m2439/aj_thermoelectrics/codes/MatMethods/matmethods/vasp/firetasks/glue_tasks.py", line 166, in run_task
    stored_data = mpr.get_stability([my_entry])[0]
  File "/global/project/projectdirs/m2439/aj_thermoelectrics/codes/pymatgen/pymatgen/matproj/rest.py", line 848, in get_stability
    raise MPRestError(str(ex))
MPRestError: HTTPSConnectionPool(host='www.materialsproject.org', port=443): Max retries exceeded with url: /rest/v2/phase_diagram/calculate_stability (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x2aabdad2dbd0>: Failed to establish a new connection: [Errno -2] Name or service not known',))

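As you can see from the bottom of the stack trace, the failure happens while establishing the HTTPS connection: the host name www.materialsproject.org could not be resolved ("[Errno -2] Name or service not known").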
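Since "[Errno -2] Name or service not known" is the message Python reports when host name lookup fails, a quick check on an affected node could separate DNS trouble from problems with the REST service itself. The following is only a minimal sketch: the helper name can_resolve and its resolver parameter are made up for illustration and are not part of pymatgen or FireWorks.

```python
import socket


def can_resolve(host, port=443, resolver=socket.getaddrinfo):
    """Return (True, None) if `host` resolves, else (False, error message).

    `resolver` is a hypothetical injection point so the check can be
    exercised without network access; by default it is the standard
    library's socket.getaddrinfo.
    """
    try:
        resolver(host, port)
    except socket.gaierror as exc:
        # Name lookup failed, e.g. "[Errno -2] Name or service not known".
        return False, str(exc)
    return True, None


# Example call on a compute node (performs a real DNS lookup):
#   ok, err = can_resolve("www.materialsproject.org")
```

If the lookup fails on the same node and at the same time as the REST call, the problem is in name resolution rather than in get_stability() itself.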

I tried to reproduce the error with a script that calls get_stability() 100 consecutive times on an Edison login node, but the error never appeared. Any idea what could be going wrong? Could this be a problem with certain compute nodes?
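For reference, a reproduction script of this kind might look like the sketch below. It is not the script actually used; count_failures is a hypothetical helper, and the commented usage assumes an mpr object and my_entry like those in the traceback. It also records the host name, which may help tell whether failures cluster on particular compute nodes.

```python
import socket
from collections import Counter


def count_failures(call, n=100):
    """Run `call()` n times and tally exceptions by type name.

    Returns a dict with the host name, the number of attempts, and a
    Counter mapping exception class names to how often they occurred.
    """
    failures = Counter()
    for _ in range(n):
        try:
            call()
        except Exception as exc:  # record every failure, keep going
            failures[type(exc).__name__] += 1
    return {
        "host": socket.gethostname(),
        "attempts": n,
        "failures": failures,
    }


# Assumed usage, with mpr and my_entry set up as in the failing task:
#   print(count_failures(lambda: mpr.get_stability([my_entry])))
```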

Anubhav Jain

Jun 28, 2016, 3:26:26 PM
to Materials Project Development Group
Note: I also just ran the test script on an Edison compute node (as opposed to a login node), and I can issue 100 REST requests without any errors.

So either it is some temporary/sporadic thing, or an issue with some of the compute nodes.

Alireza Faghaninia

Jun 28, 2016, 3:32:24 PM
to Anubhav Jain, Materials Project Development Group
I do not think this is an Edison-specific problem. I remember getting a similar error ("Max retries exceeded with url...") when calling MPRester from a local cluster. I was also calling a different endpoint, NOT calculate_stability. Unfortunately, I could not reproduce the error even though I tried a couple of times just now. I will post here when I get a similar error again.

Best,
Alireza

Donny Winston

Jun 29, 2016, 2:02:21 PM
to Materials Project Development Group
Got the following response from Shreyas:
 
An educated guess is that this has to do with the Edison RSIP interface (which is how the Cray compute nodes talk to the external world). RSIP is fairly limited in terms of network bandwidth, which would explain the occasional failures.
I believe NERSC is working to improve this on Cori, and the expectation is that this should be significantly better once Cori has been configured to support external connectivity (this is currently in progress).
For now, I would say that connectivity from the compute nodes will be limited unless you proxy it through a login node.
Let me know if that helps.
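
If the failures really are transient, retrying the call with a growing delay may be enough to get past them without setting up a proxy. The sketch below is an assumption-laden illustration, not a tested fix for RSIP limits: call_with_retries and all of its parameters are hypothetical names, and in the failing task one would presumably pass retry_on=(MPRestError,), the exception class shown in the traceback.

```python
import time


def call_with_retries(call, attempts=5, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call `call()` and retry with exponential backoff on failure.

    Waits base_delay, then 2*base_delay, 4*base_delay, ... seconds
    between attempts. The last failure is re-raised unchanged.
    `sleep` is injectable so the delays can be checked in a test.
    """
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** attempt)


# Assumed usage inside the task that currently fails:
#   stored_data = call_with_retries(
#       lambda: mpr.get_stability([my_entry]))[0]
```

Restricting retry_on to connection-type errors keeps genuine bugs (bad input, API changes) from being silently retried.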

Anubhav Jain

Jun 30, 2016, 1:06:50 PM
to Materials Project Development Group
Thanks! For now I guess I will just resubmit the calculations and see if trouble reappears often enough to re-open this.