Ways to Subset Data for Large, Complex Datasets

Alex Krohn

Sep 13, 2021, 2:28:05 PM

to IQ-TREE

Hi there,

I'm loving IQ-TREE and would love to keep using it. I often have large RADseq datasets with many individuals and lots of missing data. Unfortunately, these large datasets often take a very long time to run through IQ-TREE on my machine (e.g., one week to evaluate one model!).

I know I can speed up the analysis by evaluating only GTR models. I've asked previously about removing constant sites (which was not advised).

Are there other ways I can speed up the analysis? Removing some individuals? Removing individuals with high amounts of missing data?
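To make the last idea concrete, here is a minimal sketch of dropping individuals above a missing-data threshold. It assumes a relaxed sequential PHYLIP file (one "name sequence" line per individual) and that N, - and ? mark missing data. The function names and the 0.5 cutoff are hypothetical choices for illustration, not anything IQ-TREE provides or recommends.

```python
import io

# Assumption: these characters count as missing data in the alignment.
MISSING = frozenset("Nn-?")


def read_phylip(handle):
    """Parse relaxed sequential PHYLIP: a 'ntax nchar' header line, then
    one 'name sequence' line per individual."""
    ntax, nchar = (int(x) for x in handle.readline().split())
    records = []
    for line in handle:
        if not line.strip():
            continue  # skip blank lines
        name, seq = line.split(None, 1)
        seq = "".join(seq.split())  # drop spaces and the trailing newline
        if len(seq) != nchar:
            raise ValueError(f"{name}: expected {nchar} sites, got {len(seq)}")
        records.append((name, seq))
    if len(records) != ntax:
        raise ValueError(f"expected {ntax} sequences, got {len(records)}")
    return records


def missing_fraction(seq):
    """Fraction of sites in one sequence that are missing."""
    return sum(ch in MISSING for ch in seq) / len(seq)


def drop_high_missing(records, max_missing):
    """Keep only individuals whose missing fraction is <= max_missing."""
    return [(n, s) for n, s in records if missing_fraction(s) <= max_missing]


def write_phylip(records, handle):
    """Write records back out with an updated header."""
    nchar = len(records[0][1]) if records else 0
    handle.write(f"{len(records)} {nchar}\n")
    for name, seq in records:
        handle.write(f"{name} {seq}\n")


# Tiny in-memory example; the 0.5 cutoff is an arbitrary illustration.
demo = io.StringIO("3 8\nind1 ACGTACGT\nind2 ACNNNN--\nind3 ACGTNNGT\n")
kept = drop_high_missing(read_phylip(demo), max_missing=0.5)
out = io.StringIO()
write_phylip(kept, out)
print(out.getvalue())
```

This holds every sequence in memory, so for an alignment of tens of millions of columns it would probably need to be reworked to process one individual at a time. Whether removing individuals is a sound choice for a given analysis is a separate question from whether it runs faster.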

Alternatively, are there other programs that might run faster with such large and unruly datasets? RAxML is another popular program, but it takes similarly long to run (and without the helpful log files and information of IQ-TREE!).

Any suggestions you have would be much appreciated.

Thanks,

Alex

As an example, here's a dataset that I'm working on now:

iqtree2 -s 88clust.phy -mset GTR -mrate I+G,I+R

IQ-TREE multicore version 2.1.2 COVID-edition for Linux 64-bit built Mar 30 2021

Developed by Bui Quang Minh, James Barbetti, Nguyen Lam Tung,

Olga Chernomor, Heiko Schmidt, Dominik Schrempf, Michael Woodhams.

Host: tbc-comp1 (SSE4.2, 125 GB RAM)

Command: iqtree2 -s 88clust.phy -mset GTR -mrate I+G,I+R

Seed: 676885 (Using SPRNG - Scalable Parallel Random Number Generator)

Time: Mon Sep 13 14:11:39 2021

Kernel: SSE2 - 1 threads (80 CPU cores detected)

HINT: Use -nt option to specify number of threads because your CPU has 80 cores!

HINT: -nt AUTO will automatically determine the best number of threads to use.

Reading alignment file 88clust.phy ... Phylip format detected

Alignment most likely contains DNA/RNA sequences

WARNING: 271837 sites contain only gaps or ambiguous characters.

Alignment has 207 sequences with 31735669 columns, 2344971 distinct patterns

143311 parsimony-informative, 92902 singleton sites, 31499456 constant sites
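For scale, the site counts in the log above can be put into proportions with a few lines of arithmetic. This is only a sketch using the numbers copied from the log; the variable names are mine. As I understand it, identical columns are compressed into patterns for the likelihood calculation, so the pattern count is probably the more relevant size than the raw column count.

```python
# Counts copied from the IQ-TREE log above.
columns = 31_735_669
patterns = 2_344_971
informative = 143_311
singleton = 92_902
constant = 31_499_456

# The three site classes should add up to the full alignment length.
assert informative + singleton + constant == columns

variable = informative + singleton
print(f"variable sites:    {variable} ({variable / columns:.2%} of columns)")
print(f"constant sites:    {constant} ({constant / columns:.2%} of columns)")
print(f"distinct patterns: {patterns} ({patterns / columns:.2%} of columns)")
print(f"columns per distinct pattern: {columns / patterns:.1f}")
```

Roughly 99% of the columns are constant, and fewer than 1% are variable.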

Minh Bui

Sep 14, 2021, 6:41:04 PM

to iqt...@googlegroups.com, Alex Krohn

Hi Alex,

You haven’t used the multi-threading feature! So the quick answer for now is to use many threads. This should be very efficient for your dataset, which has many sites/patterns (e.g., I’m expecting an almost 10-fold speedup with 10 threads). See: http://www.iqtree.org/doc/Tutorial#utilizing-multi-core-cpus

Minh

Alex Krohn

Sep 15, 2021, 9:18:08 AM

to Minh Bui, IQ-TREE

Apologies, I usually use -T 20 instead of -T AUTO, but I forgot to in that example run. Even with -T 20, datasets of this size usually take ~1 week to evaluate one substitution model. Any other suggestions?