I have some compute jobs that try to determine the stability of a material in the middle of a workflow. The stability determination requires calling the MPRester.get_stability() function. The jobs are running on Edison.
Most of the time this operation completes fine, so the code itself is OK. But sporadically (maybe 10-15% of the time) I get the following error:
Traceback (most recent call last):
  File "/global/project/projectdirs/m2439/aj_thermoelectrics/codes/fireworks/fireworks/core/rocket.py", line 213, in run
    m_action = t.run_task(my_spec)
  File "/global/project/projectdirs/m2439/aj_thermoelectrics/codes/MatMethods/matmethods/vasp/firetasks/glue_tasks.py", line 166, in run_task
    stored_data = mpr.get_stability([my_entry])[0]
  File "/global/project/projectdirs/m2439/aj_thermoelectrics/codes/pymatgen/pymatgen/matproj/rest.py", line 848, in get_stability
    raise MPRestError(str(ex))
MPRestError: HTTPSConnectionPool(host='www.materialsproject.org', port=443): Max retries exceeded with url: /rest/v2/phase_diagram/calculate_stability (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x2aabdad2dbd0>: Failed to establish a new connection: [Errno -2] Name or service not known',))
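As the last line of the stack trace shows, the failure occurs while opening the connection to www.materialsproject.org: the underlying error is "[Errno -2] Name or service not known", which is what a failed hostname (DNS) lookup reports.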
I've tried to reproduce the error by running a script that calls get_stability() 100 consecutive times on an Edison login node, but it never failed there. Any idea what could be going wrong? Could this be a problem specific to certain compute nodes?
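To narrow this down, something like the following could be run from inside a job on the compute nodes rather than on a login node. It is only a minimal sketch using the Python standard library: check_dns and summarize are hypothetical helper names of mine, not part of pymatgen or FireWorks, and the attempt count and delay are arbitrary. It does not call get_stability() at all; it only tests whether the node can resolve the hostname and records which node it ran on.

```python
import socket
import time


def check_dns(host, attempts=5, delay=1.0, resolver=socket.getaddrinfo):
    """Try to resolve `host` several times.

    Returns a list of (attempt_index, ok, detail) tuples. `resolver` is
    injectable so the function can be exercised without network access.
    """
    results = []
    for i in range(attempts):
        try:
            resolver(host, 443)
            results.append((i, True, "resolved"))
        except socket.gaierror as ex:
            # Name-resolution failures such as "[Errno -2] Name or service
            # not known" are raised as socket.gaierror.
            results.append((i, False, str(ex)))
        if delay and i < attempts - 1:
            time.sleep(delay)
    return results


def summarize(results):
    """One-line summary that includes the name of the node running the check."""
    failed = sum(1 for _, ok, _ in results if not ok)
    return "%s: %d/%d lookups failed" % (socket.gethostname(), failed, len(results))
```

Calling print(summarize(check_dns("www.materialsproject.org"))) at the start of each job would show whether the lookup failures cluster on particular compute nodes.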