2025 GPU for BEAST X and BEAST 2

Gautier Richard

Jul 14, 2025, 2:47:15 PM
to beast-users
Hello everyone,

We are setting up a small datacenter to perform some machine learning, deep learning and BEAST X computations using GPUs.

We were wondering which GPU specs matter most for BEAST X. We are mainly interested in analyzing influenza phylodynamics and phylogeography, as well as reassortment dynamics, with BEAST X and BEAST 2.

After a few hours of research, it seems that FP64 (double-precision) performance is the most important spec. My conclusion was thus that A100 GPUs are probably the best option (I am not sure we have the budget for an H100 or H200), and that this would rule out the L40S and L4 GPUs. What about the RTX 5090? Currently we have an RTX 4070 in a desktop computer for the demanding models, and it is indeed faster than CPUs. We are, however, limited in the number of GPUs at the moment.

All the best and many thanks,

Gautier Richard, PhD
Project Manager in Molecular Epidemiology,
Swine Immunology Virology Unit, Ploufragan, France

Guy Baele

Jul 21, 2025, 3:46:35 PM
to beast-users
The main requirement is indeed FP64 performance; in my experience, only the (expensive) NVIDIA and AMD cards are a real option.
The problem I had last year was that I couldn't actually get my hands on A100 GPUs, as I was told that I would have to wait at least 1 year without any guarantee of actually getting any.
So I ended up purchasing half the number of GPUs by going for the H100 over the A100, which was a painful choice given that the performance benefit (for phylogenetics) doesn't warrant the price increase of an H100 over an A100 GPU.

As best I can tell, the RTX 5090 has an FP64 performance of 1.637 TFLOPS, whereas an A100 GPU has 9.7 TFLOPS and an H100 has 34 TFLOPS.
But note that these numbers don't easily translate to BEAST performance increases, so it's best to try to get as many modern GPUs as possible for your budget.
I would indeed suggest going for the A100 (though I am not sure you can still get older GPUs) and focusing on a decent memory size, e.g. the 80 GB version.
But nowadays AMD Instinct cards may be the better option; go for an MI210 or higher (the MI300 and higher have very nice benchmark numbers).

Best regards,
Guy

On Monday, July 14, 2025 at 20:47:15 UTC+2, Gautier Richard wrote:

Pfeiffer, Wayne

Jul 23, 2025, 4:21:37 PM
to gautier....@gmail.com, beast...@googlegroups.com
Hi Richard,

I am the project leader for the CIPRES science gateway. We run tens of BEAST, BEASTX, and BEAST2 jobs every day on our Expanse cluster that has AMD cores and NVIDIA V100 and A100 GPUs. Currently 26 jobs are running: 10 are using BEAST or BEASTX, and 16 are using BEAST2. Two jobs using BEAST 1.10.4 are running on GPUs, while all of the other jobs are running on cores.

To decide how to run these jobs cost-effectively, I have done extensive benchmarking of BEAST, BEASTX, and BEAST2 on both cores and GPUs. From these benchmarks I developed rules for choosing the number of cores or GPUs for cost-effective execution. The most important parameters for the rules are

- the number of patterns,
- the number of partitions, and
- whether or not the data set has amino acid sequences.

Before each job we run a script that parses the input xml file to get the needed parameters. GPUs are generally much faster than cores on Expanse, but the usage charge for GPUs is correspondingly higher: 1 GPU hour = 20 AMD core hours. Thus we only use GPUs when their expected speedup over cores is at least 5. For really large data sets, we see GPU-over-core speedups of ≥30 for BEASTX and ≥100 for BEAST2. We have just a few A100s, so we use them only when they are expected to be at least 1.3x faster than V100s.

With that as background, the rules that we use for BEASTX and BEAST2 are appended here. Feel free to contact me if you want more information.

Best regards,
Wayne

---

* Rules for running BEASTX 10.5.0 on Expanse via the CIPRES gateway

The runs use varying numbers of cores and GPUs within a single node of Expanse
depending upon the data set.

- Ask the user for the following:

  whether the data set has amino acids (AAs),
  the number of partitions in the data set,
  the total number of patterns in the data set, and
  whether the analysis needs extra memory.

  If the data set does not contain AAs, assume that it is DNA.

- Specify the Slurm partition, threads, beagle_instances, cores, GPU type, and
  Slurm memory according to the following table. Also, use the additional BEAGLE
  parameters listed in the examples, including -beagle_scaling dynamic unless
  the user specifies otherwise.

   Data         Data     Memory      Slurm             beagle_          GPU   Slurm
partitions    patterns   needed   partition  threads instances  cores  type  memory  

DNA data

    <8          <3,000  regular    shared       3        3        3             6G
    <8    3,000-79,999  regular  gpu-shared     1                10    V100    90G
    <8        >=80,000  regular  gpu-shared     1                10    A100    90G
   >=8         <10,000  regular    shared       3        3        3             6G
   >=8        >=10,000  regular    shared       4        4        4             8G
   any           any     extra     shared       6        6        6            12G
                                                                    
AA data

   any           any    regular  gpu-shared     1                10    V100    90G

---

* Rules for running BEAST2 2.x.x on Expanse via the CIPRES gateway

The runs use varying numbers of cores and GPUs within a single node of Expanse
depending upon the type of analysis and the data set.

- Ask the user for the following:

  whether the analysis uses Path Sampling, SNAPP, or both,
  whether the data set has amino acids (AAs),
  the number of partitions in the data set,
  the total number of patterns in the data set, and
  whether the analysis needs extra memory.

- Specify the Slurm partition, threads, instances, cores, GPUs, and memory according
  to the following table. Also, use the additional BEAGLE parameters listed in the
  examples, including -beagle_scaling dynamic unless the user specifies otherwise.

   Data          Data     Extra    Slurm                                         Slurm 
partitions     patterns  memory  partition  -threads -instances  cores   GPUs   memory

Any data with Path Sampling but without SNAPP

    any           any       no    shared        6         1         6             11G

Any data with SNAPP but without Path Sampling

    any           any       no    shared       24         1        24             46G     

Any data with SNAPP and Path Sampling

    any           any       no    shared       25         1        25             50G 
    any           any      yes    compute      25         1        25            243G

DNA data without Path Sampling or SNAPP

  1 to 3         <5,000     no    shared        3         3         3              6G
  1 to 3   5,000-37,999     no  gpu-shared      1         1        10    1 V100   90G
  1 to 3  38,000-99,999     no  gpu-shared      1         1        10    1 A100   90G
  1 to 3      >=100,000     no  gpu-shared      2         2        20    2 A100  180G

  4 to 17        <1,200     no    shared        1         1         1              2G
  4 to 17   1,200-4,999     no    shared        3         1         3              6G
  4 to 17  5,000-19,999     no    shared        6         2         6             12G
  4 to 17      >=20,000     no  gpu-shared      1         1        10    1 V100   90G

   >=18          <8,000     no    shared        2         1         2              4G
   >=18    8,000-13,999     no    shared        3         1         3              6G
   >=18   14,000-39,999     no    shared        6         2         6             12G
   >=18        >=40,000     no  gpu-shared      1         1        10    1 V100   90G

    any           any      yes    shared       12         1        12             24G

AA data without Path Sampling or SNAPP

     1          <12,000     no  gpu-shared      1         1        10    1 V100   90G
     1         >=12,000     no      gpu         4         4        40    4 V100  360G

  2 to 39         any       no  gpu-shared      1         1        10    1 V100   90G

   >=40           any       no    shared       24         1        24             46G


Martin Gunnill

Sep 3, 2025, 3:32:03 PM
to beast-users
Dear Wayne

Am I correct in assuming that 'pattern' here refers to codons?

I also notice that BEAST 2 outputs the number of patterns to the screen before the MCMC run, e.g.:
```
Alignment(2023-03-01 56 WGS D8)
  56 taxa
  15678 sites
  266 patterns
```
Do you have any suggestions for estimating the number of patterns before running BEAST 2?
I am working on a pipelining tool for BEAST 2 and was wondering about the feasibility of encoding some of the decisions from your lookup table.

Yours, Martin

Pfeiffer, Wayne

Sep 3, 2025, 6:02:49 PM
to beast...@googlegroups.com, Pfeiffer, Wayne
On Sep 3, 2025, at 7:04 AM, Martin Gunnill <Martin....@phac-aspc.gc.ca> wrote:

Dear Wayne

Am I correct in assuming that 'pattern' here refers to codons?

No. The number of patterns is the number of unique columns in the multiple sequence alignment. It can be much smaller than the number of sites, as shown by your example alignment.
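
As a concrete illustration of that definition (not part of our CIPRES tooling), the pattern count of a small aligned FASTA file can be computed by collecting its distinct columns. A minimal gawk sketch, assuming equal-length sequences in a placeholder file aln.fasta and ignoring any special handling of gaps or ambiguity codes:

```
# Count patterns (unique columns) in an aligned FASTA; needs gawk for length(array).
gawk '/^>/ { n++; next }
      { seq[n] = seq[n] $0 }          # concatenate wrapped sequence lines
      END {
        L = length(seq[1])
        for (i = 1; i <= L; i++) {
          col = ""
          for (s = 1; s <= n; s++) col = col substr(seq[s], i, 1)
          pat[col] = 1                # one array entry per distinct column
        }
        print length(pat), "patterns"
      }' aln.fasta
```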

I also notice that BEAST 2 outputs the number of patterns to the screen before the MCMC run, e.g.:
```
Alignment(2023-03-01 56 WGS D8)
  56 taxa
  15678 sites
  266 patterns
```
Do you have any suggestions for estimating the number of patterns before running BEAST 2?
I am working on a pipelining tool for BEAST 2 and was wondering about the feasibility of encoding some of the decisions from your lookup table.

BEASTX includes npatterns in the xml file :) However, BEAST2 does not :(

To schedule a BEASTX job, we first run a bash parser script that extracts npatterns from the xml file.
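
For illustration, such a parser can be as simple as a few greps. The following is a minimal sketch rather than our actual script; it assumes a BEAUti-generated xml with npatterns= comments and a dataType attribute, both of which may vary by BEAUti version, and analysis.xml is a placeholder name:

```
#!/usr/bin/env bash
xml=analysis.xml

# Total patterns: sum the npatterns= comments in the xml.
npatterns=$(grep -o 'npatterns=[0-9]*' "$xml" | cut -d= -f2 | paste -sd+ - | bc)

# Partitions: one npatterns= comment per partition.
npartitions=$(grep -c 'npatterns=' "$xml")

# Amino acid data? (attribute value assumed; check your BEAUti output)
grep -q 'dataType="amino acid"' "$xml" && aa=yes || aa=no

echo "patterns=$npatterns partitions=$npartitions aa=$aa"
```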

To get the number of patterns for a BEAST2 analysis, we run a BEAST2 pilot job just far enough to output the needed number. The pilot job uses a temporary xml file that includes

  chainLength="0"
  preBurnin="0"

The temporary xml file can be generated with the following commands:

  cp *xml temp.xml
  sed -i -e 's/chainLength="[0-9]*"/chainLength="0"/' temp.xml
  sed -i -e 's/preBurnin="[0-9]*"/preBurnin="0"/' temp.xml

The number of patterns is then extracted from the pilot job output file using a bash parser script.

Usually the pilot job finishes in less than 60 seconds, but if it doesn’t, we set the number of patterns to a default value of 1000.
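
Put together, the pilot step might look roughly like the sketch below; the file names, the use of timeout for the 60-second limit, and the grep pattern are illustrative, not our actual scripts:

```
# Run the zero-length chain built above; give up after 60 seconds.
timeout 60 beast temp.xml > pilot.log 2>&1

# Pull "NNN patterns" from the screen output (cf. Martin's example);
# head -1 keeps the first alignment's count, so a multi-partition xml
# would need summing instead. Default to 1000 if the pilot timed out.
npatterns=$(grep -Eo '[0-9]+ patterns' pilot.log | head -1 | awk '{print $1}')
echo "patterns=${npatterns:-1000}"
```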

Martin Gunnill

unread,
Sep 18, 2025, 2:08:33 PMSep 18
to beast-users
Just found out that the BEAST 2 command `beast -validate path-to-your.xml` lists the patterns in the sequence alignment in the xml without running the entire analysis.
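
A hedged example of how a pipeline might use it; the -validate flag is as described above, while the grep post-processing is just one way to pull out the counts:

```
# Print the pattern counts without running the analysis.
beast -validate path-to-your.xml 2>&1 | grep -Eo '[0-9]+ patterns'
```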

Jie Zha

Sep 20, 2025, 6:42:10 AM
to beast-users
Dear Wayne,

My dataset has 6 partitions with unlinked substitution and clock models but a linked tree model. If I use BEAST 2 and assign these partitions to 6 GPUs in parallel, will the computing speed increase by a factor of 6?

Best,

Jie

Pfeiffer, Wayne

Sep 20, 2025, 11:27:41 AM
to beast...@googlegroups.com
Hi Jie,

No. With 6 partitions, you will probably need 100,000 or more patterns to see much speedup using 2 GPUs over 1 GPU.

Best regards,
Wayne

Jie Zha

Sep 20, 2025, 9:00:18 PM
to beast-users
Dear Wayne,

Assume I have a very large sequence alignment. If I split this alignment into many small partitions and assign them to different GPUs in parallel, will the computing speed increase significantly?

Best,
Jie

Pfeiffer, Wayne

Sep 21, 2025, 9:04:43 AM
to beast...@googlegroups.com
Hi Jie,

Thanks for asking me about this again, since what I said before was not quite right.

I made benchmark runs on V100 GPUs with BEAST2 2.7.8 for 8 data sets from CIPRES users and got the results in the following table. I had expected the speedup to increase as the number of patterns/partition increased, but that is not the case. Instead, there is no consistent dependence of the speedup on the number of patterns/partition.

- The average speedup is 1.54 for 2 GPUs and 2.04 for 4 GPUs.
- On 2 GPUs, the best speedup is 1.77, and the worst speedup is 1.28.
- On 4 GPUs, the best speedup is 2.78, and the worst speedup is 1.52.

6 GPUs will probably not give much more speedup than 4 GPUs.

                    Parti-  Patterns/
Data set  Patterns   tions  partition  GPUs  Speedup

46766        1,787      22         81     1     1.00
                                          2     1.52
                                          4     2.16

03AFF        3,697      22        168     1     1.00
                                          2     1.50
                                          4     1.91

8F0C5          695       5        246     1     1.00
                                          2     1.28
                                          4     1.52

0A9C8       11,294      25        452     1     1.00
                                          2     1.77     
                                          4     2.78

B4469        4,680       8        585     1     1.00
                                          2     1.70
                                          4     2.05

11663        6,707       4      1,677     1     1.00
                                          2     1.62    
                                          4     2.13

3DFAD       31,017       8      3,877     1     1.00
                                          2     1.34
                                          4     1.81
 
jessicag   112,661      16      7,041     1     1.00
                                          2     1.58
                                          4     1.97

The command line that I used to run on 4 GPUs was

  -beagle_GPU -beagle_order 1,2,3,4 -beagle_scaling dynamic -threads 4 -instances 1
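
(For anyone reproducing this: those flags are passed straight to the beast launcher, so a complete invocation would look roughly as follows, with analysis.xml as a placeholder input file.)

```
beast -beagle_GPU -beagle_order 1,2,3,4 -beagle_scaling dynamic \
      -threads 4 -instances 1 analysis.xml
```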

Best regards,
Wayne

Jie Zha

Sep 21, 2025, 11:04:55 AM
to beast-users
Dear Wayne,

Would you mind explaining how you assign your partitions using the command line? For "-instances 1", does it mean you have only 1 partition?

Best,

Jie

Pfeiffer, Wayne

Sep 21, 2025, 1:40:30 PM
to beast...@googlegroups.com
Hi Jie,

Each data set has the number of partitions indicated in my table.

Parallelization is controlled by the threads and instances input parameters.

-threads is the number of parallel threads to use, which should be set to the number of GPUs.
-instances is the number of logical partitions into which each partition of the multiple sequence alignment is divided, with a thread assigned to each.

For these data sets with few patterns per partition, using more than 1 instance slows down the calculation.

The speedup you get will depend upon the absolute speed of your GPU. The results in my table were for a relatively old V100. For the data sets in my table, the newer A100 is a little faster than the V100 only for the last and largest data set. For some of the smaller data sets, the A100 is slower than the V100!

Incidentally, the speedups for the largest data set were incorrect in my previous table. The correct values are listed in the following table excerpt.

                    Parti-  Patterns/
Data set  Patterns   tions  partition  GPUs  Speedup
 
jessicag   112,661      16      7,041     1     1.00
                                          2     1.41
                                          4     1.77

Best regards,
Wayne

Pfeiffer, Wayne

Sep 21, 2025, 4:27:28 PM
to beast...@googlegroups.com
Hi Jie,

I think that I may have misunderstood your original question regarding partitions.

Partitioning the multiple sequence alignment by gene will give only the modest speedups noted in my original table.

On the other hand, adding logical partitions via the instance parameter can lead to more substantial speedups for an otherwise unpartitioned MSA. This is shown by the additional table below with results from BEAST2 2.7.3 for the three largest data sets without MSA partitions that I have analyzed. 

- For all three data sets the speedup on 4 V100s is greater than 2.5 and reaches 2.99 for the last data set.
- For the first data set the speedup on A100s is similar to that on V100s, whereas for the other two data sets the A100 speedup is much less than the V100 speedup on 4 GPUs.
- The last column with the relative run times shows that the A100 is much faster than the V100 for these large data sets by 1.63x to 1.71x on 1 or 2 GPUs. However, the speed advantage of the A100 drops to 1.27x and 1.12x on 4 GPUs for the two largest data sets. Nonetheless, having a faster GPU is still helpful for large data sets.

                             V100     A100  V100 time/                 
Data set  Patterns  GPUs  speedup  speedup  A100 time

47335      117,899     1     1.00     1.00       1.63
                       2     1.63     1.69       1.70  
                       4     2.52     2.52       1.64
 
065EF      259,428     1     1.00     1.00       1.69
                       2     1.58     1.53       1.64
                       4     2.70     2.03       1.27

4763C      311,817     1     1.00     1.00       1.73
                       2     1.70     1.68       1.71
                       4     2.99     1.94       1.12

The command line that I used to run on 4 GPUs was

  -beagle_GPU -beagle_order 1,2,3,4 -beagle_scaling dynamic -threads 4 -instances 4

Best regards,
Wayne

Jie Zha

Sep 22, 2025, 6:55:59 AM
to beast-users
Hi Wayne,

Thank you very much for your replies! I will equip a supercomputer in my lab to handle the Bayesian analyses, and if I have further questions on how to speed up computing, may I trouble you for expert advice later?

Best,

Jie

Jie Zha

Sep 25, 2025, 9:39:21 AM
to beast-users
Hi Wayne,

I am trying to use the Epi tree prior in BEAST v2.7.8 for epidemic analysis, which uses the particle filter method to find a good starting point relatively efficiently. However, my analysis runs very slowly as a regular BEAST run. Would you mind if I send my xml file to you for speedup suggestions?

Best,

Jie

Pfeiffer, Wayne

Sep 25, 2025, 10:32:22 AM
to beast...@googlegroups.com, Pfeiffer, Wayne
Hi Jie,

Yes, feel free to send your xml file to my email address.

Best regards,
Wayne

Jie Zha

Sep 25, 2025, 7:17:25 PM
to beast-users
Hi Wayne,

Would you mind posting your email address here, so I can send my xml file to you?

Best,

Jie

Pfeiffer, Wayne

Sep 25, 2025, 7:25:25 PM
to beast...@googlegroups.com, Pfeiffer, Wayne
Hi Jie,

My email address is pfei...@sdsc.edu .

Best regards,
Wayne
