ParGenes


Norbert

Jun 14, 2020, 10:55:29 AM
to raxml
Hi,

I am having problems running ParGenes in multi-node mode and would greatly appreciate your advice.

I have been using a fresh installation of the current version of ParGenes in my home directory on an HPC system.
After installation, I ran checker.sh successfully. The tests did not fully pass, but I suspect only because the installation skipped building them due to a missing dependency. I quote from the installation log:
> "Could NOT find GTest (missing: GTEST_LIBRARY GTEST_INCLUDE_DIR GTEST_MAIN_LIBRARY)
> -- GTest not found
> CMake Warning at test/src/CMakeLists.txt:5 (message):
>   Skipping building tests."
Here is the result from run_tests.py:
"cd tests && python run_tests.py
> python /home/ki/opt/ParGenes/pargenes/pargenes.py -h
> [help]: Success! (0s)
> python /home/ki/opt/ParGenes/pargenes/pargenes.py -a /home/ki/opt/ParGenes/tests/smalldata/fasta_files -o /home/ki/opt/ParGenes/tests/tests_outputs/test_ml_search/pargenes -r
> /home/ki/opt/ParGenes/tests/smalldata/raxml_global_options.txt -c 4 -s 3 -p 3
> [ml_search_pargenes]: Success! (1s)
> python /home/ki/opt/ParGenes/pargenes/pargenes.py -a /home/ki/opt/ParGenes/tests/smalldata/fasta_files -o /home/ki/opt/ParGenes/tests/tests_outputs/test_modeltest/pargenes -c 4  -m
> --modeltest-global-parameters /home/ki/opt/ParGenes/tests/smalldata/only_1_models.txt
> [modeltest_pargenes]: Success! (1s)
> python /home/ki/opt/ParGenes/pargenes/pargenes.py -a /home/ki/opt/ParGenes/tests/smalldata/fasta_files -o /home/ki/opt/ParGenes/tests/tests_outputs/test_bootstraps/pargenes -r
> /home/ki/opt/ParGenes/tests/smalldata/raxml_global_options.txt -c 4  -b 3
> [bootstrapspargenes]: Success! (1s)
> python /home/ki/opt/ParGenes/pargenes/pargenes.py -a /home/ki/opt/ParGenes/tests/smalldata/fasta_files -o /home/ki/opt/ParGenes/tests/tests_outputs/test_astral/pargenes -r
> /home/ki/opt/ParGenes/tests/smalldata/raxml_global_options.txt -c 4 -s 3 -p 3 --use-astral
> [astral_pargenes]: Success! (2s)
> python /home/ki/opt/ParGenes/pargenes/pargenes.py -a /home/ki/opt/ParGenes/tests/smalldata/fasta_files -o /home/ki/opt/ParGenes/tests/tests_outputs/test_all/pargenes -r
> /home/ki/opt/ParGenes/tests/smalldata/raxml_global_options.txt -c 4 -m -b 3 -s 3 -p 3 --use-astral  --modeltest-global-parameters /home/ki/opt/ParGenes/tests/smalldata/only_1_models.txt
> [all_pargenes]: Success! (3s)
> Traceback (most recent call last):
>   File "run_tests.py", line 207, in <module>
>     test_all(pargenes_script)
>   File "run_tests.py", line 193, in test_all
>     check_all(output, True, True, True, True)
>   File "run_tests.py", line 94, in check_all
>     check_astral(run_dir)
>   File "run_tests.py", line 81, in check_astral
>     assert(len(input_gene_trees_lines) == 4)
> AssertionError"
Hence I believe this does not affect the functioning of the program.

Using pargenes-hpc.py with a small test set of only a handful of MSAs and requesting the cores of two nodes, the runs regularly fail already during the parsing step, with the following error message in the report file: "mpi-scheduler: /home/ki/opt/ParGenes/raxml-ng/src/main.cpp:170: void init_part_info(RaxmlInstance&): Assertion `parted_msa.part_count() > 0' failed." (see report.txt, attached)

Using pargenes.py and requesting the cores of a single node only, the run does not fail, but job efficiency sharply decreases when, I think, bootstrapping with raxml-ng starts.

Using pargenes-hpc-debug.py and requesting the cores of two nodes, the run does not fail either, but again job efficiency, which initially approaches 100% (all cores busy), then sharply decreases. (The same happens when only the cores of a single node are requested.) The logs.txt in the mlsearch directory of the output shows that the scheduler is called with the argument "--onecore-scheduler 64", and it contains the warning:
"A process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged."
The raxml logs consequently state: "parallelization: NONE/sequential"
See the attached files pargenes.logs.txt, logs.txt and accD_fa_bs0.raxml.log.

Any idea what is going wrong here?

Best regards,
Norbert Kilian





accD_fa_bs0.raxml.log
logs.txt
pargenes_logs.txt
report.txt

Benoît Morel

Jun 15, 2020, 7:51:11 AM
to raxml
Dear Norbert,

Regarding parallel efficiency: I think you get poor parallel efficiency because of the autoMRE mode. In autoMRE mode, ParGenes cannot parallelize over the bootstrap trees of a given family, and computes them all with a single job per family (here, a single core per family).
In your case:
- you have very few families (8)
- the bootstrap convergence test does not seem to converge, and raxml stops after the maximum limit (1000 bs trees)
At the beginning, ParGenes uses all cores because it has enough tasks to assign (50*8 ML jobs and 8 bootstrap jobs). But the 8 bootstrap jobs are much more computationally expensive: all the other jobs finish earlier, and you end up with very few busy cores.
For this specific test run (few families, and bootstrap convergence never reached), I suggest disabling the autoMRE mode and fixing the number of bootstrap trees at 1000. I expect the same result (1000 bs trees computed), but with a much better parallel efficiency.
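A toy scheduling model illustrates this effect. The code below is not ParGenes code, and the task durations are made-up illustrative numbers: 400 short ML jobs plus 8 long single-core autoMRE bootstrap jobs assigned greedily to 64 cores yield a low overall parallel efficiency, because only 8 cores stay busy once the ML jobs are done.

```python
import heapq

def simulate(durations, cores):
    """Assign each task (longest first) to the earliest-free core.

    Returns (makespan, parallel_efficiency)."""
    free_at = [0.0] * cores            # next time each core becomes free
    heapq.heapify(free_at)
    busy_time = 0.0
    for d in sorted(durations, reverse=True):
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + d)
        busy_time += d
    makespan = max(free_at)
    return makespan, busy_time / (makespan * cores)

# Illustrative durations: 50 ML searches per family are short,
# while one autoMRE bootstrap job per family runs much longer on one core.
ml_jobs = [1.0] * (50 * 8)
bs_jobs = [60.0] * 8
makespan, eff = simulate(ml_jobs + bs_jobs, cores=64)
print(f"makespan: {makespan}, efficiency: {eff:.0%}")
```

With these numbers the run is dominated by the 8 long jobs, so efficiency lands well below 50% even though the scheduler itself does nothing wrong.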


Regarding the issue with pargenes-hpc.py: run_tests.py seems to fail on your installation (AssertionError at the end). It is hard to tell whether it is the same issue, but I would first try to fix the tests.
I just updated the test script this morning to give a better overview of the failing tests. Could you call "git pull" in the repository, run the tests again, and send me the content of the directory tests/tests_outputs? You don't need to reinstall; I only changed the Python test script.

Don't worry about the statement "parallelization: NONE/sequential" in the raxml logs. In your test run, each individual raxml job runs sequentially because the sequences are quite short. This does not mean that ParGenes fails to parallelize over the different raxml jobs.

Best,
Benoit


Norbert Kilian

Jun 15, 2020, 10:08:12 AM
to ra...@googlegroups.com

Dear Benoit,

Thank you very much for your speedy response.

The information that ParGenes cannot parallelize the bootstrapping of individual MSAs in autoMRE mode is particularly valuable; I must have overlooked this point. Well, that explains a lot! Since bootstrap convergence can also be checked afterwards, this is actually no disadvantage.
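For reference, the after-the-fact convergence check mentioned here can be done with raxml-ng's a posteriori bootstrap convergence command. The file names below are placeholders, not files from this thread; this is a sketch of the general pattern:

```shell
# Collect one family's bootstrap trees into a single file (placeholder names),
# then run raxml-ng's a posteriori bootstrap convergence check on them.
cat accD_fa_bs*.raxml.bootstraps > accD_all.bootstraps
raxml-ng --bsconverge --bs-trees accD_all.bootstraps \
         --prefix accD_bscheck --seed 42
```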

The updated tests ran fine; run_tests.py terminates with "All tests ran successfully". Please find the output attached.

Best regards,

Norbert


On 15.06.20 at 13:51, Benoît Morel wrote:
--
You received this message because you are subscribed to the Google Groups "raxml" group.
To unsubscribe from this group and stop receiving emails from it, send an email to raxml+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/raxml/e065b425-7bec-4129-a584-b5b3017101e6o%40googlegroups.com.
tests_outputs.tar.gz

Benoît Morel

Jun 16, 2020, 2:20:43 PM
to raxml
Dear Norbert,

It's good that you brought up this point: the information was missing in the documentation and I just added it.

The error you get (Assertion `parted_msa.part_count() > 0') should never happen. Could you provide the input files you used when you got this error, so that I can reproduce it on my machine?

Best,
Benoit

Kilian, Norbert

Jun 17, 2020, 2:09:57 AM
to ra...@googlegroups.com

Dear Benoit,

 

Thank you. I have attached the 7 MSAs (out of the >70 single plastid gene MSAs) that I used for the test runs.

Does this mean that my installation works as it should, given that it passes ParGenes' own tests?

 

Best regards,

Norbert


test_msas.zip

Benoît Morel

Jun 17, 2020, 3:29:22 AM
to raxml
Dear Norbert,

Your installation works fine with ParGenes' own tests.

I found the issue. I think you created a directory "out" inside the folder containing your input MSAs.
ParGenes considers all files in this directory to be input MSAs.

We will add a safety check, either in ParGenes or raxml-ng, to notify the user when this happens.
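A minimal sketch of what such a safety check could look like (hypothetical code, not the actual ParGenes implementation): skip entries in the MSA folder that cannot possibly be alignments, such as subdirectories or empty files, and warn the user.

```python
import os
import sys

def list_input_msas(msa_dir):
    """Return the regular files in msa_dir; warn about entries that
    cannot be alignments (subdirectories, empty files)."""
    msas = []
    for name in sorted(os.listdir(msa_dir)):
        path = os.path.join(msa_dir, name)
        if os.path.isdir(path):
            # e.g. an accidentally created "out" directory
            print(f"[Warning] skipping directory in MSA folder: {name}",
                  file=sys.stderr)
        elif os.path.getsize(path) == 0:
            print(f"[Warning] skipping empty file: {name}", file=sys.stderr)
        else:
            msas.append(path)
    return msas
```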

Best regards,
Benoit

Kilian, Norbert

Jun 17, 2020, 4:21:22 AM
to ra...@googlegroups.com

Dear Benoit,

 

Oh yes, you are right; a recent and, apparently, ill-advised alteration! Many thanks for identifying our problem!

 

We find ParGenes very useful and thus greatly appreciate that now we will be able to continue our work with it.


Benoît Morel

Jun 17, 2020, 4:31:48 AM
to ra...@googlegroups.com
Dear Norbert,

thanks a lot for your positive feedback!

Best regards,
Benoit


n.ki...@bgbm.org

Jun 18, 2020, 11:17:57 AM
to ra...@googlegroups.com

Dear Benoit,

 

I made a new trial and thought I had considered everything, but it failed again; I really don't know what else to do.

Could you please have a look at the report file attached?

 

Best regards,

Norbert






report.txt

Benoît Morel

Jun 19, 2020, 3:32:09 AM
to ra...@googlegroups.com
Dear Norbert,

Did you change anything from the last runs? Does this happen only with the pargenes-hpc.py script? Does it happen with fewer cores?

I made a change in the code to get more information about the error you get. Can you run the following commands from your repository?
./gitpull.sh
./install_scheduler_only.sh

This will update and recompile the component that parallelizes the raxml jobs, and hopefully we will then know more about the issue.
Could you then send me the new report file?

Best regards,
Benoit

Norbert Kilian

Jun 19, 2020, 12:57:02 PM
to ra...@googlegroups.com

Dear Benoit,

Thank you! The only thing I changed was the placement of the output folder.

I have implemented your changes to the code. The problem arises only with the pargenes-hpc.py script (this time I requested only 2 nodes and 64 cores); please find the batch script and the report file attached. The pargenes.py script, with only 1 node and at most 32 cores requested, works fine (again) now.

Best regards,

Norbert



On 19.06.20 at 09:31, Benoît Morel wrote:
report.txt
ParGenes_new.sh

Benoît Morel

Jun 19, 2020, 2:54:52 PM
to ra...@googlegroups.com
Dear Norbert,

ParGenes cannot find the file:
/dev/shm/ki_ParGenes_output_cp3/parse_run/parse_command.txt

This file is created and written by the ParGenes Python script (which is sequential) and then read by a parallel component of the program from all cores/nodes.
You need to check that this file exists and that it is accessible from all nodes. I am not an expert in file systems and clusters, but here is an idea: if you can only reproduce the issue with several nodes, maybe something is wrong with accessing this file from a node different from the one on which it was created...

Best regards,
Benoit

Kilian, Norbert

Jun 24, 2020, 3:21:41 AM
to ra...@googlegroups.com

Dear Benoit,

 

Thank you for this valuable hint. After changing the location of ParGenes' working directory on the cluster, I managed to run it with multiple nodes without any problems. In fact, the simple reason for my former fruitless attempts was that the location on the cluster I had previously used is node-local, as I have now learned.

 

I also tested your new export.py, which I find excellent for getting and saving the relevant results in a compact and lucid way.

 

Many thanks for all your support and your patience!

 

Best wishes,
