CPU core oversubscription detected!


Kaján Győző

Oct 19, 2020, 6:08:37 AM
to ra...@googlegroups.com

Dear Developers,

I am very grateful for the continuous development of RAxML-NG. Version 1, with its parallel workers, was long anticipated and very welcome! The startup heuristics are also excellent and a great help to less experienced users.

I have a question about the CPU core oversubscription check. I recently analysed a long alignment for which the recommended number of threads was 8 (log attached). After a few short test runs, however, I decided to use only 1 thread (core) per tree search and to run 20 tree searches in parallel on the 20 cores of a single node (--threads 20 --workers 20). I then received a CPU core oversubscription error (attached). I did not understand the reason, since I was not asking for 20 threads (in other words, cores) on a single tree search, so I forced the job, and it finished with an average time of 67 sec/tree. That is almost a 12x speedup compared to single-core usage, so I found this setup effective.
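For readers following along, the thread arithmetic behind "--threads 20 --workers 20" can be sketched in a few lines of Python. This is only an illustration of the division described above. The function name split_threads is made up for this example, and how RAxML-NG itself treats a split that does not divide evenly is not stated in this thread.

```python
def split_threads(total_threads: int, workers: int) -> tuple:
    """Return (threads per worker, leftover threads) for a given setup.

    Hypothetical helper for illustration only; it mirrors the plain
    division described in the post, not RAxML-NG's internal logic.
    """
    if total_threads < 1 or workers < 1:
        raise ValueError("total_threads and workers must be positive")
    if workers > total_threads:
        raise ValueError("cannot have more workers than threads")
    return divmod(total_threads, workers)


# The setup from the post: 20 threads shared by 20 parallel tree searches.
per_worker, leftover = split_threads(20, 20)
print(f"{per_worker} thread(s) per tree search, {leftover} left over")
```

With 20 threads and 20 workers, each tree search runs on a single thread, which is the coarse-grained extreme discussed in the replies.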

So the question arises: wouldn't it be more logical to check the ratio of cores to threads?

Or should I have experimented with thread pinning?


Best regards,

Győző

PRRSV1_over13K_alignment_edit_alignparse.raxml.log
slurm-1406507.out

Alexey Kozlov

Oct 19, 2020, 7:59:37 AM
to ra...@googlegroups.com
Dear Győző,

thank you for your feedback, I'm glad to hear you liked the new RAxML-NG v1.0!

Regarding the oversubscription check: it does not compare number of cores vs. number of threads, but
instead tries to detect whether a synchronization (thread barrier) takes "too long". Admittedly,
this is a dirty heuristic that could give false positives, but I have tested it on a variety of
systems, and it was quite reliable in practice. Now, in case of coarse-grained parallelization -
especially in your "extreme" case with 1 thread/tree - oversubscription is less of a problem since
there is little/no synchronization between threads. So you might not observe a noticeable slowdown
even if you oversubscribe.
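To make the idea of "a barrier that takes too long" concrete, here is a minimal Python sketch. It is not RAxML-NG's actual implementation: the function names, the number of rounds, and the threshold are all assumptions made for illustration, and Python's global interpreter lock means the measured times only demonstrate the principle.

```python
import threading
import time


def measure_barrier_waits(n_threads: int, rounds: int = 5) -> list:
    """Return, per round, the longest time any thread waited at a barrier."""
    barrier = threading.Barrier(n_threads)  # reusable across rounds
    worst = [0.0] * rounds
    lock = threading.Lock()

    def worker() -> None:
        for r in range(rounds):
            start = time.perf_counter()
            barrier.wait()  # every thread must arrive before any continues
            waited = time.perf_counter() - start
            with lock:
                worst[r] = max(worst[r], waited)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return worst


def looks_oversubscribed(waits, threshold_s: float) -> bool:
    """Flag the run if any barrier wait exceeded the (hypothetical) threshold."""
    return max(waits) > threshold_s


waits = measure_barrier_waits(n_threads=4)
print("slowest barrier wait per round (s):", waits)
print("suspicious:", looks_oversubscribed(waits, threshold_s=0.1))
```

When several threads share one core, some of them reach the barrier late because they had to wait for CPU time, so the others see long waits. With 1 thread per tree search there are hardly any barriers to time, which fits the remark that oversubscription matters less in that setup.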

Could you please show me your SLURM submission script and raxml-ng output with "--log debug" and
without "--force"? Also, you can try thread pinning to check whether your job gets all 20 physical cores.
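A quick way to see how many CPUs a job was actually granted, independent of RAxML-NG, is to ask the operating system from inside the job. The sketch below is one possible check, with two caveats: os.sched_getaffinity is only available on Linux, and it counts logical CPUs (hardware threads), not physical cores. The helper names are invented for this example.

```python
import os


def usable_cpus() -> int:
    """Number of logical CPUs the current process is allowed to run on."""
    if hasattr(os, "sched_getaffinity"):  # Linux only
        return len(os.sched_getaffinity(0))
    # Fallback: all CPUs in the machine, which may overstate what a
    # batch scheduler has really assigned to the job.
    return os.cpu_count() or 1


def fits(requested_threads: int) -> bool:
    """True if the requested thread count does not exceed the usable CPUs."""
    return requested_threads <= usable_cpus()


requested = 20  # the value passed to --threads in this discussion
print(f"usable logical CPUs: {usable_cpus()}")
print(f"{requested} threads fit without sharing CPUs: {fits(requested)}")
```

Running this inside the same SLURM allocation as the analysis shows whether the scheduler handed the job fewer CPUs than the number of threads requested.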

Best,
Alexey

Kaján Győző

Oct 20, 2020, 7:30:35 AM
to ra...@googlegroups.com
Dear Alexey,

Thank you for your prompt reply!
This seems to be a case of the inexperienced user again, sorry about that. Please find some test submission scripts and RAxML-NG logs attached. In short, everything went fine both when I set nothing regarding thread pinning and when I enabled it (--extra thread-pin). The only case in which I got the alarm was when I explicitly turned thread pinning off (--extra thread-nopin).
Previously, I had been using RAxML-NG v1.0.1 built without MPI support. This was the version giving the alarm. Then I found that if a newer OpenMPI version (openmpi/3.1.2-gcc-8.2.0) is loaded, I can install RAxML-NG with MPI support. Our system default OpenMPI is openmpi/1.8.5-intel. Now I wonder whether it is not even the version number that matters, but something to do with gcc: the system default gcc is 8.2.0, so maybe I need the matching openmpi/3.1.2-gcc-8.2.0 build of OpenMPI. I don't know. Either way, a properly installed RAxML-NG with OpenMPI support did not give a false alarm.

Best regards,
Győző

RAxML_over13k_Alexey_no_pinning.bash
PRRSV1_over13k_Alexey_no_pinning.raxml.log
PRRSV1_over13k_Alexey_w_pinning.raxml.log
PRRSV1_over13k_Alexey_wo_pinning.raxml.log
RAxML_over13k_Alexey_w_pinning.bash
RAxML_over13k_Alexey_wo_pinning.bash

Alexey Kozlov

Oct 20, 2020, 8:09:53 AM
to ra...@googlegroups.com
Hi Győző,

thanks for investigating! Now the case is clear: this wasn't a false alarm. When you explicitly
disabled thread pinning, multiple threads were running on a single core. For whatever reason, this
often happens with MPI, which is why RAxML-NG enables thread pinning by default under such circumstances.
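The effect of pinning can be illustrated with a small Python sketch that assigns each thread its own CPU and reports any CPU that ends up hosting more than one thread. This is a toy model under stated assumptions, not how RAxML-NG pins threads; pin_plan and shared_cpus are hypothetical names, and the CPU ids are made up.

```python
from collections import Counter


def pin_plan(n_threads: int, cpus) -> dict:
    """Map thread index -> CPU id, cycling through the available CPUs."""
    cpu_list = sorted(cpus)
    if not cpu_list:
        raise ValueError("need at least one CPU")
    return {t: cpu_list[t % len(cpu_list)] for t in range(n_threads)}


def shared_cpus(plan: dict) -> set:
    """CPUs that host more than one thread under the given plan."""
    counts = Counter(plan.values())
    return {cpu for cpu, n in counts.items() if n > 1}


# Enough CPUs: every thread gets a CPU of its own.
good = pin_plan(4, {0, 1, 2, 3})
print(good, "shared:", shared_cpus(good))

# Too few CPUs: threads are forced to share, i.e. oversubscription.
bad = pin_plan(4, {0, 1})
print(bad, "shared:", shared_cpus(bad))
```

Without pinning, the operating system is free to place two threads on the same CPU even when others are idle, which corresponds to the second case here.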

So ideally, you should get even better performance with thread pinning! :)

Best,
Alexey
