Cluster execution - Execute pypsa-eur with snakemake on HPCC


Willem Dafoe

Mar 26, 2021, 7:29:54 AM3/26/21
to pypsa
Dear community,

I have a problem regarding the cluster execution of pypsa-eur with snakemake. Because I want to calculate larger networks with pypsa-eur, I wanted to switch to my university's computer cluster (ETH Euler Cluster) and build the networks there. It is a Linux cluster using bsub commands.

I shifted my whole repo over there and installed the environment accordingly. When I run the workflow, I usually run prepare_all_networks in the virtual front end on the login node, since its memory requirements are limited. In the last step, however, I want to submit the rule solve_all_networks to the cluster, and here I hit a roadblock: however I try to submit the rule to the cluster, it does not work, most of the time because the memory is not allocated correctly. Unfortunately, I am neither a Linux nor a snakemake pro, so the info I find on the snakemake readthedocs is not sufficient. I apologize if my questions come across as naive.

I tried the following:
snakemake -j 4 --cluster bsub solve_all_networks

This is what the snakemake docs recommend as the standard way to submit jobs. The job is submitted, but terminated shortly after because only the default resources are allocated (1 core for 4 hours and 1 GB of memory, which is obviously not enough for a 200-node network).

Then I tried to work with the recommended additional cluster configuration file cluster.yaml (from snakemake readthedocs: https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html#cluster-configuration-deprecated)
snakemake --cluster-config cluster.yaml --cluster bsub -j 4  solve_all_networks      

My cluster-config file contains just the following:
{
    "__default__" :
    {
        "nCPUs"     : "4",
        "memory"    : 200000,
        "resources" : "rusage[mem=200000]"
    }
}
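(Side note: the file is named cluster.yaml but written in JSON-style notation. Here is my attempt at the same content in plain YAML syntax, in case the notation is part of the problem; untested, and the key names are just the ones I chose above:)

```yaml
# Same cluster config in YAML syntax. The bracketed LSF resource string
# must be quoted so YAML does not parse it as a flow sequence.
__default__:
  nCPUs: "4"
  memory: 200000
  resources: "rusage[mem=200000]"
```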

However, I get the same errors as before; the parameters from cluster.yaml are somehow not passed to the cluster. Then I tried the modification:

snakemake --cluster-config cluster.yaml --cluster "bsub -R {cluster.resources}" -j 1 solve_all_networks

This looks already better, but I still get the error:

(base) [wlaumen@eu-login-12 pypsa-eur]$ snakemake --cluster-config cluster.yaml --cluster "bsub -R {cluster.resources}" -j 1 solve_all_networks
Using license file /cluster/apps/nss/gurobi/9.1.1/x86_64/gurobi.lic
Set parameter TokenServer to value lic-gurobi.ethz.ch
No parameters matching '_test' found
Building DAG of jobs...
Using shell: /cluster/apps/sfos/bin/bash
Provided cluster nodes: 1
Job counts:
        count   jobs
        1       solve_all_networks
        8       solve_network
        9

[Fri Mar 26 12:26:22 2021]
rule solve_network:
    input: networks/elec_s300_200_ec_lcopt_1H.nc
    output: results/networks/elec_s300_200_ec_lcopt_1H.nc
    log: logs/solve_network/elec_s300_200_ec_lcopt_1H_solver.log, logs/solve_network/elec_s300_200_ec_lcopt_1H_python.log, logs/solve_network/elec_s300_200_ec_lcopt_1H_memory.log
    jobid: 8
    benchmark: benchmarks/solve_network/elec_s300_200_ec_lcopt_1H
    wildcards: simpl=300, clusters=200, ll=copt, opts=1H
    threads: 4
    resources: mem=147000

Requested memory, 200000 MB, is greater than 128000 MB.
Request aborted by esub. Job not submitted.
Error submitting jobscript (exit code 255):

To those here who have already managed to run snakemake on a computer cluster: could you help me resolve these errors and allocate memory to the pypsa-eur snakemake execution in the correct way? I am running a bit short on time in my master's thesis, so I would be very grateful for any help.

Best,
Willem

Johannes Hampp

Mar 26, 2021, 7:56:18 AM3/26/21
to Willem Dafoe, pypsa
Dear Willem,

At first glance, it appears to me that the cluster cancels your job
("Request aborted by esub") due to an excessive memory request.

Could it be that jobs are limited to allocating 128000 MB?
You are trying to allocate 200000 MB, causing the cluster to reject your job.

Best,
Johannes



Best regards,
Johannes Hampp (he/him)

Justus Liebig University Giessen (JLU)
Center for international Development and Environmental Research (ZEU)

mailto: johanne...@zeu.uni-giessen.de

Office 110
Senckenbergstr. 3
DE-35392 Giessen
https://uni-giessen.de/zeu


Willem Dafoe

Mar 26, 2021, 8:30:55 AM3/26/21
to pypsa
Thanks for your answer Johannes,

I will check with the cluster support about that. Do you see anything else in my cluster.yaml that could lead to these errors?

Or, to phrase the question differently: how can I make the memory request to the cluster dynamic, so that it simply copies the memory requested by the snakemake rule?

For example: 
rule solve_network:
    input: networks/elec_s300_130_ec_lcopt_1H.nc
    output: results/networks/elec_s300_130_ec_lcopt_1H.nc
    log: logs/solve_network/elec_s300_130_ec_lcopt_1H_solver.log, logs/solve_network/elec_s300_130_ec_lcopt_1H_python.log, logs/solve_network/elec_s300_130_ec_lcopt_1H_memory.log
    jobid: 0
    benchmark: benchmarks/solve_network/elec_s300_130_ec_lcopt_1H
    wildcards: simpl=300, clusters=130, ll=copt, opts=1H
    threads: 4
    resources: mem=106050
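If I read the snakemake docs correctly, the --cluster template accepts Python-style placeholders such as {threads} and {resources.mem} that are filled in per job from the rule itself, so something like --cluster "bsub -n {threads} -R 'rusage[mem={resources.mem}]'" might do what I want (this is an assumption from the docs; I have not tested it on Euler yet). A tiny Python sketch of the substitution I imagine snakemake doing:

```python
from types import SimpleNamespace

# Sketch (untested assumption): snakemake fills the --cluster template
# once per job via Python string formatting, using the rule's own values.
template = "bsub -n {threads} -R 'rusage[mem={resources.mem}]'"

# Values taken from the solve_network rule shown above:
cmd = template.format(threads=4, resources=SimpleNamespace(mem=106050))
print(cmd)  # bsub -n 4 -R 'rusage[mem=106050]'
```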

Best,
Willem

Willem Dafoe

Mar 26, 2021, 11:39:50 AM3/26/21
to pypsa
Dear Community,

@Johannes, thanks again. I tried it with the command

snakemake --cluster-config cluster.yaml --cluster "bsub -n {cluster.nCPUs} -R {cluster.resources}" -j 1 results/networks/elec_s300_130_ec_lcopt_1H.nc

and it now runs through. However, in the final step, when snakemake tries to export the network, I get the following error:

INFO:pypsa.linopt:No model basis stored
INFO:pypsa.linopf:Optimization successful. Objective value: 5.37e+09
INFO:pypsa.io:Exported network elec_s300_130_ec_lcopt_1H.nc has storage_units, loads, links, buses, lines, carriers, generators
Traceback (most recent call last):
  File "/cluster/home/wlaumen/.local/lib/python3.8/site-packages/xarray/backends/file_manager.py", line 199, in _acquire_with_cache_info
    file = self._cache[self._key]
  File "/cluster/home/wlaumen/.local/lib/python3.8/site-packages/xarray/backends/lru_cache.py", line 53, in __getitem__
    value = self._cache[key]
KeyError: [<class 'netCDF4._netCDF4.Dataset'>, ('/cluster/scratch/wlaumen/pypsa-eur/.snakemake/shadow/tmpcxakew7x/results/networks/elec_s300_130_ec_lcopt_1H.nc',), 'a', (('clobber', True), ('diskless', False), ('format', 'NETCDF4'), ('persist', False))]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/cluster/scratch/wlaumen/pypsa-eur/.snakemake/shadow/tmpcxakew7x/.snakemake/scripts/tmp4i1i_qic.solve_network.py", line 291, in <module>
[...]
  File "src/netCDF4/_netCDF4.pyx", line 2330, in netCDF4._netCDF4.Dataset.__init__
  File "src/netCDF4/_netCDF4.pyx", line 1948, in netCDF4._netCDF4._ensure_nc_success
PermissionError: [Errno 13] Permission denied: b'/cluster/scratch/wlaumen/pypsa-eur/.snakemake/shadow/tmpcxakew7x/results/networks/elec_s300_130_ec_lcopt_1H.nc'
[Fri Mar 26 15:47:04 2021]
Error in rule solve_network:
    jobid: 0
    output: results/networks/elec_s300_130_ec_lcopt_1H.nc
    log: logs/solve_network/elec_s300_130_ec_lcopt_1H_solver.log, logs/solve_network/elec_s300_130_ec_lcopt_1H_python.log, logs/solve_network/elec_s300_130_ec_lcopt_1H_memory.log (check log file(s) for error message)

I don't really understand the error, nor have I ever seen it on my computer when running smaller networks locally. Do you know what this error means and how to resolve it?

For context: the scratch directory is a working directory for writing files and output, and has much more storage than the usual home directory on the cluster. However, I do not know where the .snakemake/shadow/tmpcxakew7x part comes from.

Best,
Willem

Johannes Hampp

Mar 26, 2021, 1:03:57 PM3/26/21
to Willem Dafoe, pypsa
Dear Willem,

Happy to hear you got the execution working!

The .snakemake directory stores information during and after each
snakemake execution and is created automatically.
Within it, the .snakemake/shadow directory is created if you execute a
workflow with shadow rules, i.e. rules containing the "shadow" keyword.
The rule "solve_network" is such a rule. The snakemake documentation
offers a bit more information [1].

[1]
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html?highlight=#shadow-rules

As for "why is there a shadow directory at all?" maybe someone else can
shed some light on it (pun intended).

As for why this error comes up, I can't help you based solely on
the information at hand. The error could be related to problems with
file locking in the underlying file system. I have also seen it, e.g., on
virtualised filesystems built upon NTFS file systems, but that does not
seem to be the case here.
That is presumably also why you never encountered it before on
your local machine: the environment is different.

Have you ever been able to successfully solve any network on the
cluster, or is this your first attempt?
If it is your first attempt, try a small network which solves on your
computer and submit it to the cluster once with exactly 1 CPU core and once
with >=2 CPU cores for solving (for debugging).

Things you can try (without further debugging):

* Delete the .snakemake directory

* Remove the "shadow" keyword from the rule
(I don't know why it is there and see no obvious reason for it;
maybe it is just a remnant of the past)

* Specify a different shadow directory, e.g. /tmp (if this is available
and usable on the cluster nodes), i.e. add
"--shadow-prefix /tmp/snakemake" to your command


Maybe that helps.

Best,
Johannes





Jonas Hörsch

Mar 29, 2021, 5:24:41 PM3/29/21
to Johannes Hampp, Willem Dafoe, pypsa
Hi,

On 26. Mar 2021, at 18:03, Johannes Hampp <johanne...@zeu.uni-giessen.de> wrote:

As for "why is there a shadow directory at all?" maybe someone else can
shed some light on it (pun intended).


solve_network is set as a shadow rule because the interaction between pyomo and Gurobi in versions 7 and 8 always created a gurobi.log file in the working directory before moving it to the logs folder (once pyomo finally communicated the solver
options). When you then ran multiple *local* jobs, there were repeated situations in which Gurobi processes aborted because *their* logfile had been moved away by another Gurobi process.

This will not affect you on the cluster (since each cluster node runs only a single process), so you can safely un-shadow it.
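Concretely, un-shadowing just means deleting the shadow line from the rule in the Snakefile. From memory (the rule body is abbreviated, so treat the details as approximate), it looks something like:

```
rule solve_network:
    input: "networks/elec_s{simpl}_{clusters}_ec_l{ll}_{opts}.nc"
    output: "results/networks/elec_s{simpl}_{clusters}_ec_l{ll}_{opts}.nc"
    shadow: "shallow"   # <- delete this line to un-shadow the rule
    script: "scripts/solve_network.py"
```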

Best,
Jonas

Willem Dafoe

Apr 1, 2021, 7:03:47 AM4/1/21
to pypsa
Dear Jonas & Johannes,


thanks for your responses. I managed to solve the problem in the meantime, but would like to post how I solved it, in case anyone else runs into the same problem and looks for a solution (and also: nice pun, Johannes, lol)!

The first error was indeed caused by a per-job memory request limit on the ETH cluster, so I could resolve it by simply requesting less memory.

The "permission error" was indeed caused by snakemake trying to write too much information to the temporary working directory.
I have not yet tried removing the "shadow" keyword, but I could also resolve the issue by simply setting a tmpdir variable in the config file:

solving:
  tmpdir: /cluster/scratch/wlaumen/pypsa-eur/tmp  # (this was the actual working directory in my case)
  options:  
    formulation: kirchhoff
    load_shedding: true
    noisy_costs: true
    min_iterations: 4
    max_iterations: 6
    clip_p_max_pu: 0.01

This did the trick in my case; I hope it helps anyone having the same issue. Also thanks to Chiara Anselmetti for flagging this option.

Best,
Willem