2025 GPU for BEAST X and BEAST 2

Gautier Richard

Jul 14, 2025, 2:47:15 PM
to beast-users
Hello everyone,

We are setting up a small datacenter to perform some machine learning, deep learning and BEAST X computations using GPUs.

We were wondering which GPU specs matter most for BEAST X. We are mainly interested in analyzing influenza phylodynamics and phylogeography, as well as reassortment dynamics, with BEAST X and BEAST 2.

After a few hours of research, it seems that FP64 (double-precision) performance is the most important spec. My conclusion was thus that A100 GPUs are probably the best option (I am not sure we have the budget for an H100 or H200), and that this would rule out the L40S and L4 GPUs. What about the RTX 5090? Currently we have an RTX 4070 in a desktop computer for the demanding models, and it is indeed faster than CPUs. We are, however, limited in the number of GPUs at the moment.

All the best and many thanks,

Gautier Richard, PhD
Project Manager in Molecular Epidemiology,
Swine Immunology Virology Unit, Ploufragan, France

Guy Baele

Jul 21, 2025, 3:46:35 PM
to beast-users
The main requirement is indeed FP64 performance; in my experience, only the (expensive) NVIDIA and AMD cards are a real option.
The problem I had last year was that I couldn't actually get my hands on A100 GPUs, as I was told that I would have to wait at least 1 year without any guarantee of actually getting any.
So I ended up purchasing half the number of GPUs by going for the H100 over the A100, which was a painful choice given that the performance benefit (for phylogenetics) doesn't warrant the price increase of an H100 over an A100 GPU.

As best I can tell, the RTX 5090 has an FP64 performance of 1.637 TFLOPS, whereas an A100 GPU has 9.7 TFLOPS and an H100 has 34 TFLOPS.
But note that these numbers don't easily translate to BEAST performance increases, so it's best to try to get as many modern GPUs as possible for your budget.
I would indeed suggest going for the A100 (though I am not sure you can still get older GPUs) and focusing on a decent memory size, e.g. the 80 GB version.
But nowadays AMD Instinct cards may be the better option; go for an MI210 or higher (the MI300 and higher have very nice benchmark numbers).

Best regards,
Guy

On Monday, July 14, 2025 at 20:47:15 UTC+2, Gautier Richard wrote:

Pfeiffer, Wayne

Jul 23, 2025, 4:21:37 PM
to gautier....@gmail.com, beast...@googlegroups.com
Hi Richard,

I am the project leader for the CIPRES science gateway. We run tens of BEAST, BEASTX, and BEAST2 jobs every day on our Expanse cluster that has AMD cores and NVIDIA V100 and A100 GPUs. Currently 26 jobs are running: 10 are using BEAST or BEASTX, and 16 are using BEAST2. Two jobs using BEAST 1.10.4 are running on GPUs, while all of the other jobs are running on cores.

To decide how to run these jobs cost-effectively, I have done extensive benchmarking of BEAST, BEASTX, and BEAST2 on both cores and GPUs. From these benchmarks I developed rules for choosing the number of cores or GPUs for cost-effective execution. The most important parameters for the rules are

- the number of patterns,
- the number of partitions, and
- whether or not the data set has amino acid sequences.

Before each job we run a script that parses the input xml file to get the needed parameters. GPUs are generally much faster than cores on Expanse, but the usage charge for GPUs is correspondingly higher: 1 GPU hour = 20 AMD core hours. Thus we only use GPUs when their expected speedup over cores is at least 5. For really large data sets, we see GPU-over-core speedups of ≥30 for BEASTX and ≥100 for BEAST2. We have just a few A100s, so we use them only when they are expected to be at least 1.3x faster than V100s.

With that as background, the rules that we use for BEASTX and BEAST2 are appended here. Feel free to contact me if you want more information.

Best regards,
Wayne

---

* Rules for running BEASTX 10.5.0 on Expanse via the CIPRES gateway

The runs use varying numbers of cores and GPUs within a single node of Expanse
depending upon the data set.

- Ask the user for the following:

  whether the data set has amino acids (AAs),
  the number of partitions in the data set,
  the total number of patterns in the data set, and
  whether the analysis needs extra memory.

  If the data set does not contain AAs, assume that it is DNA.

- Specify the Slurm partition, threads, beagle_instances, cores, GPU type, and
  Slurm memory according to the following table. Also, use the additional BEAGLE
  parameters listed in the examples, including -beagle_scaling dynamic unless
  the user specifies otherwise.

   Data         Data     Memory      Slurm             beagle_          GPU   Slurm
partitions    patterns   needed   partition  threads instances  cores  type  memory  

DNA data

    <8          <3,000  regular    shared       3        3        3             6G
    <8    3,000-79,999  regular  gpu-shared     1                10    V100    90G
    <8        >=80,000  regular  gpu-shared     1                10    A100    90G
   >=8         <10,000  regular    shared       3        3        3             6G
   >=8        >=10,000  regular    shared       4        4        4             8G
   any           any     extra     shared       6        6        6            12G
                                                                    
AA data

   any           any    regular  gpu-shared     1                10    V100    90G

---

* Rules for running BEAST2 2.x.x on Expanse via the CIPRES gateway

The runs use varying numbers of cores and GPUs within a single node of Expanse
depending upon the type of analysis and the data set.

- Ask the user for the following:

  whether the analysis uses Path Sampling, SNAPP, or both,
  whether the data set has amino acids (AAs),
  the number of partitions in the data set,
  the total number of patterns in the data set, and
  whether the analysis needs extra memory.

- Specify the Slurm partition, threads, instances, cores, GPUs, and memory according
  to the following table. Also, use the additional BEAGLE parameters listed in the
  examples, including -beagle_scaling dynamic unless the user specifies otherwise.

   Data          Data     Extra    Slurm                                         Slurm 
partitions     patterns  memory  partition  -threads -instances  cores   GPUs   memory

Any data with Path Sampling but without SNAPP

    any           any       no    shared        6         1         6             11G

Any data with SNAPP but without Path Sampling

    any           any       no    shared       24         1        24             46G     

Any data with SNAPP and Path Sampling

    any           any       no    shared       25         1        25             50G 
    any           any      yes    compute      25         1        25            243G

DNA data without Path Sampling or SNAPP

  1 to 3         <5,000     no    shared        3         3         3              6G
  1 to 3   5,000-37,999     no  gpu-shared      1         1        10    1 V100   90G
  1 to 3  38,000-99,999     no  gpu-shared      1         1        10    1 A100   90G
  1 to 3      >=100,000     no  gpu-shared      2         2        20    2 A100  180G

  4 to 17        <1,200     no    shared        1         1         1              2G
  4 to 17   1,200-4,999     no    shared        3         1         3              6G
  4 to 17  5,000-19,999     no    shared        6         2         6             12G
  4 to 17      >=20,000     no  gpu-shared      1         1        10    1 V100   90G

   >=18          <8,000     no    shared        2         1         2              4G
   >=18    8,000-13,999     no    shared        3         1         3              6G
   >=18   14,000-39,999     no    shared        6         2         6             12G
   >=18        >=40,000     no  gpu-shared      1         1        10    1 V100   90G

    any           any      yes    shared       12         1        12             24G

AA data without Path Sampling or SNAPP

     1          <12,000     no  gpu-shared      1         1        10    1 V100   90G
     1         >=12,000     no      gpu         4         4        40    4 V100  360G

  2 to 39         any       no  gpu-shared      1         1        10    1 V100   90G

   >=40           any       no    shared       24         1        24             46G


Martin Gunnill

Sep 3, 2025, 3:32:03 PM
to beast-users
Dear Wayne

Am I correct in assuming that 'pattern' here refers to codons?

I also notice that BEAST 2 outputs the number of patterns to the screen before the MCMC run, e.g.:
```
Alignment(2023-03-01 56 WGS D8)
  56 taxa
  15678 sites
  266 patterns
```
Do you have any suggestions for estimating the number of patterns before running BEAST 2?
I am working on a pipelining tool for BEAST 2 and was wondering about the feasibility of encoding some of the decisions from your lookup table.

Yours, Martin

Pfeiffer, Wayne

Sep 3, 2025, 6:02:49 PM
to beast...@googlegroups.com, Pfeiffer, Wayne
On Sep 3, 2025, at 7:04 AM, Martin Gunnill <Martin....@phac-aspc.gc.ca> wrote:

Dear Wayne

Am I correct in assuming that 'pattern' here refers to codons?

No. The number of patterns is the number of unique columns in the multiple sequence alignment. It can be much smaller than the number of sites, as shown by your example alignment.
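
As a concrete illustration of that definition (not part of our CIPRES tooling), the pattern count of a small aligned FASTA file can be computed by collecting its distinct columns. A minimal gawk sketch, assuming equal-length sequences in a placeholder file aln.fasta and ignoring any special handling of gaps or ambiguity codes:

```
# Count patterns (unique columns) in an aligned FASTA; needs gawk for length(array).
gawk '/^>/ { n++; next }
      { seq[n] = seq[n] $0 }          # concatenate wrapped sequence lines
      END {
        L = length(seq[1])
        for (i = 1; i <= L; i++) {
          col = ""
          for (s = 1; s <= n; s++) col = col substr(seq[s], i, 1)
          pat[col] = 1                # one array entry per distinct column
        }
        print length(pat), "patterns"
      }' aln.fasta
```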

I also notice that BEAST 2 outputs the number of patterns to the screen before the MCMC run, e.g.:
```
Alignment(2023-03-01 56 WGS D8)
  56 taxa
  15678 sites
  266 patterns
```
Do you have any suggestions for estimating the number of patterns before running BEAST 2?
I am working on a pipelining tool for BEAST 2 and was wondering about the feasibility of encoding some of the decisions from your lookup table.

BEASTX includes npatterns in the xml file :) However, BEAST2 does not :(

To schedule a BEASTX job, we first run a bash parser script that extracts npatterns from the xml file.
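
For illustration, such a parser can be as simple as a few greps. The following is a minimal sketch rather than our actual script; it assumes a BEAUti-generated xml with npatterns= comments and a dataType attribute, both of which may vary by BEAUti version, and analysis.xml is a placeholder name:

```
#!/usr/bin/env bash
xml=analysis.xml

# Total patterns: sum the npatterns= comments in the xml.
npatterns=$(grep -o 'npatterns=[0-9]*' "$xml" | cut -d= -f2 | paste -sd+ - | bc)

# Partitions: one npatterns= comment per partition.
npartitions=$(grep -c 'npatterns=' "$xml")

# Amino acid data? (attribute value assumed; check your BEAUti output)
grep -q 'dataType="amino acid"' "$xml" && aa=yes || aa=no

echo "patterns=$npatterns partitions=$npartitions aa=$aa"
```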

To get the number of patterns for a BEAST2 analysis, we run a BEAST2 pilot job just far enough to output the needed number. The pilot job uses a temporary xml file that includes

  chainLength="0"
  preBurnin="0"

The temporary xml file can be generated with the following commands:

  cp *xml temp.xml
  sed -i -e 's/chainLength="[0-9]*"/chainLength="0"/' temp.xml
  sed -i -e 's/preBurnin="[0-9]*"/preBurnin="0"/' temp.xml

The number of patterns is then extracted from the pilot job output file using a bash parser script.

Usually the pilot job finishes in less than 60 seconds, but if it doesn’t, we set the number of patterns to a default value of 1000.
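
Put together, the pilot step might look roughly like the sketch below; the file names, the use of timeout for the 60-second limit, and the grep pattern are illustrative, not our actual scripts:

```
# Run the zero-length chain built above; give up after 60 seconds.
timeout 60 beast temp.xml > pilot.log 2>&1

# Pull "NNN patterns" from the screen output (cf. Martin's example);
# head -1 keeps the first alignment's count, so a multi-partition xml
# would need summing instead. Default to 1000 if the pilot timed out.
npatterns=$(grep -Eo '[0-9]+ patterns' pilot.log | head -1 | awk '{print $1}')
echo "patterns=${npatterns:-1000}"
```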

Martin Gunnill

unread,
Sep 18, 2025, 2:08:33 PMSep 18
to beast-users
Just found out that the BEAST 2 command `beast -validate path-to-your.xml` lists the patterns in the sequence alignment in the xml without running the entire analysis.
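
A hedged example of how a pipeline might use it; the -validate flag is as described above, while the grep post-processing is just one way to pull out the counts:

```
# Print the pattern counts without running the analysis.
beast -validate path-to-your.xml 2>&1 | grep -Eo '[0-9]+ patterns'
```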

Jie Zha

Sep 20, 2025, 6:42:10 AM
to beast-users
Dear Wayne,

My dataset has 6 partitions with unlinked substitution and clock models but a linked tree model. If I use BEAST 2 and assign these partitions to 6 GPUs in parallel, will the computing speed increase by a factor of 6?

Best,

Jie

Pfeiffer, Wayne

Sep 20, 2025, 11:27:41 AM
to beast...@googlegroups.com
Hi Jie,

No. With 6 partitions, you will probably need 100,000 or more patterns to see much speedup using 2 GPUs over 1 GPU.

Best regards,
Wayne

Jie Zha

Sep 20, 2025, 9:00:18 PM
to beast-users
Dear Wayne,

Assume I have a very large sequence alignment. If I split this alignment into many small partitions and assign them to different GPUs in parallel, will the computing speed increase significantly?

Best,
Jie

Pfeiffer, Wayne

Sep 21, 2025, 9:04:43 AM
to beast...@googlegroups.com
Hi Jie,

Thanks for asking me about this again, since what I said before was not quite right.

I made benchmark runs on V100 GPUs with BEAST2 2.7.8 for 8 data sets from CIPRES users and got the results in the following table. I had expected the speedup to increase as the number of patterns/partition increased, but that is not the case. Instead, there is no consistent dependence of the speedup on the number of patterns/partition.

- The average speedup is 1.54 for 2 GPUs and 2.04 for 4 GPUs.
- On 2 GPUs, the best speedup is 1.77, and the worst speedup is 1.28.
- On 4 GPUs, the best speedup is 2.78, and the worst speedup is 1.52.

6 GPUs will probably not give much more speedup than 4 GPUs.

                    Parti-  Patterns/
Data set  Patterns   tions  partition  GPUs  Speedup

46766        1,787      22         81     1     1.00
                                          2     1.52
                                          4     2.16

03AFF        3,697      22        168     1     1.00
                                          2     1.50
                                          4     1.91

8F0C5          695       5        246     1     1.00
                                          2     1.28
                                          4     1.52

0A9C8       11,294      25        452     1     1.00
                                          2     1.77     
                                          4     2.78

B4469        4,680       8        585     1     1.00
                                          2     1.70
                                          4     2.05

11663        6,707       4      1,677     1     1.00
                                          2     1.62    
                                          4     2.13

3DFAD       31,017       8      3,877     1     1.00
                                          2     1.34
                                          4     1.81
 
jessicag   112,661      16      7,041     1     1.00
                                          2     1.58
                                          4     1.97

The command line that I used to run on 4 GPUs was

  -beagle_GPU -beagle_order 1,2,3,4 -beagle_scaling dynamic -threads 4 -instances 1
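
(For anyone reproducing this: those flags are passed straight to the beast launcher, so a complete invocation would look roughly as follows, with analysis.xml as a placeholder input file.)

```
beast -beagle_GPU -beagle_order 1,2,3,4 -beagle_scaling dynamic \
      -threads 4 -instances 1 analysis.xml
```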

Best regards,
Wayne

Jie Zha

Sep 21, 2025, 11:04:55 AM
to beast-users
Dear Wayne,

Would you mind explaining how you assign your partitions using the command line? For "-instances 1", does it mean you have only 1 partition?

Best,

Jie

Pfeiffer, Wayne

Sep 21, 2025, 1:40:30 PM
to beast...@googlegroups.com
Hi Jie,

Each data set has the number of partitions indicated in my table.

Parallelization is controlled by the threads and instances input parameters.

-threads is the number of parallel threads to use, which should be set to the number of GPUs.
-instances is the number of logical partitions into which each partition of the multiple sequence alignment is divided, with a thread assigned to each.

For these data sets with few patterns per partition, using more than 1 instance slows down the calculation.

The speedup you get will depend upon the absolute speed of your GPU. The results in my table were for a relatively old V100. For the data sets in my table, the newer A100 is a little faster than the V100 only for the last and largest data set. For some of the smaller data sets, the A100 is slower than the V100!

Incidentally, the speedups for the largest data set were incorrect in my previous table. The correct values are listed in the following table excerpt.

                    Parti-  Patterns/
Data set  Patterns   tions  partition  GPUs  Speedup
 
jessicag   112,661      16      7,041     1     1.00
                                          2     1.41
                                          4     1.77

Best regards,
Wayne

Pfeiffer, Wayne

Sep 21, 2025, 4:27:28 PM
to beast...@googlegroups.com
Hi Jie,

I think that I may have misunderstood your original question regarding partitions.

Partitioning the multiple sequence alignment by gene will give only the modest speedups noted in my original table.

On the other hand, adding logical partitions via the instance parameter can lead to more substantial speedups for an otherwise unpartitioned MSA. This is shown by the additional table below with results from BEAST2 2.7.3 for the three largest data sets without MSA partitions that I have analyzed. 

- For all three data sets the speedup on 4 V100s is greater than 2.5 and reaches 2.99 for the last data set.
- For the first data set the speedup on A100s is similar to that on V100s, whereas for the other two data sets the A100 speedup is much less than the V100 speedup on 4 GPUs.
- The last column with the relative run times shows that the A100 is much faster than the V100 for these large data sets by 1.63x to 1.71x on 1 or 2 GPUs. However, the speed advantage of the A100 drops to 1.27x and 1.12x on 4 GPUs for the two largest data sets. Nonetheless, having a faster GPU is still helpful for large data sets.

                             V100     A100  V100 time/                 
Data set  Patterns  GPUs  speedup  speedup  A100 time

47335      117,899     1     1.00     1.00       1.63
                       2     1.63     1.69       1.70  
                       4     2.52     2.52       1.64
 
065EF      259,428     1     1.00     1.00       1.69
                       2     1.58     1.53       1.64
                       4     2.70     2.03       1.27

4763C      311,817     1     1.00     1.00       1.73
                       2     1.70     1.68       1.71
                       4     2.99     1.94       1.12

The command line that I used to run on 4 GPUs was

  -beagle_GPU -beagle_order 1,2,3,4 -beagle_scaling dynamic -threads 4 -instances 4

Best regards,
Wayne

Jie Zha

Sep 22, 2025, 6:55:59 AM
to beast-users
Hi Wayne,

Thank you very much for your replies! I will equip a supercomputer in my lab to handle the Bayesian analyses, and if I have further questions on how to speed up computing, may I trouble you for expert advice later?

Best,

Jie

Jie Zha

Sep 25, 2025, 9:39:21 AM
to beast-users
Hi Wayne,

I am trying to use the Epi tree prior in BEAST v2.7.8 for epidemic analysis, which uses the particle filter method to find a good starting point relatively efficiently. However, my analysis runs very slowly as a regular BEAST run. Would you mind if I send my xml file to you for speedup suggestions?

Best,

Jie

Pfeiffer, Wayne

Sep 25, 2025, 10:32:22 AM
to beast...@googlegroups.com, Pfeiffer, Wayne
Hi Jie,

Yes, feel free to send your xml file to my email address.

Best regards,
Wayne

Jie Zha

Sep 25, 2025, 7:17:25 PM
to beast-users
Hi Wayne,

Would you mind posting your email address here, so I can send my xml file to you?

Best,

Jie

Pfeiffer, Wayne

Sep 25, 2025, 7:25:25 PM
to beast...@googlegroups.com, Pfeiffer, Wayne
Hi Jie,

My email address is pfei...@sdsc.edu .

Best regards,
Wayne
