IQ-TREE for targeted sequencing of a large population

Apr 11, 2023, 10:39:44 AM
Hi all,

We are using IQ-TREE to infer the phylogenetic tree of a large plant population (>10,000 accessions in total) with heavily uneven numbers of accessions from about 10 species of the same genus (the domesticated species plus some wild relatives). In this study we are mainly focused on the domestication process. We are using genome-wide SNP data from targeted amplicon sequencing. In more detail, we are using 5,000 markers that were chosen to discriminate among accessions belonging to the most represented species. Some colleagues of ours raised a concern about possible problems in the results due to marker ascertainment bias.

When inferring the tree with IQ-TREE we used the following command:

iqtree2 -s ${PHYLIP} \
--mset GTR -mrate ASC+R --cmin 10 --cmax 20 --prefix ${NAME} -alrt 1000 -B 1000 \
--mem ${MEM}% -T ${THREADS} --nmax 1000 --bnni --cptime 201 --safe -o $OUTGROUP

Do you believe using "-mrate ASC+R" is enough to save us from errors coming from ascertainment bias? Is preliminary LD pruning also needed?
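
One precondition of the ASC setting in the command above can be checked directly: ascertainment-corrected models presuppose that the alignment contains only variable sites. The following is a minimal sketch, not part of our pipeline, that counts columns showing at most one distinct unambiguous base. The function name `count_constant_sites` is hypothetical, and the sketch assumes a sequential PHYLIP file (one header line, then one "name sequence" pair per line); IQ-TREE's own handling of ambiguity codes may differ.

```shell
# Count alignment columns with at most one distinct unambiguous base (A/C/G/T).
# Assumption: sequential PHYLIP input, i.e. a header line followed by
# one "name sequence" pair per line. Hypothetical helper, not an IQ-TREE tool.
count_constant_sites() {
  awk '
    NR == 1 { next }                 # skip the "ntax nsites" header
    NF >= 2 {
      seq = toupper($2)
      n = length(seq)
      if (n > maxlen) maxlen = n
      for (i = 1; i <= n; i++) {
        c = substr(seq, i, 1)
        # record each unambiguous base only once per column
        if (c ~ /[ACGT]/ && !((i, c) in seen)) {
          seen[i, c] = 1
          distinct[i]++
        }
      }
    }
    END {
      constant = 0
      for (i = 1; i <= maxlen; i++)
        if (distinct[i] <= 1) constant++
      print constant
    }
  ' "$1"
}
```

Calling `count_constant_sites "${PHYLIP}"` prints a single number; a result of 0 would be consistent with a SNP-only alignment, while anything larger suggests filtering those columns before using an ASC model.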

Concerns were also raised that we should first remove duplicate and hybrid accessions (there are actually not many of them). Do you think this is really needed as well?



Apr 17, 2023, 8:13:55 PM
Hi Giuseppe,

It's hard to answer your questions without really knowing the study system in great detail. 

The ASC model helps to alleviate a very specific kind of ascertainment bias - that which derives from including only variable sites in your analysis. 
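
To make that concrete, here is a sketch of the usual form of the correction (the conditional likelihood of Lewis 2001), written under the assumption that only constant sites were excluded. The symbols are introduced here for illustration: T is the tree, θ the model parameters, D_i the pattern at site i, and n the number of sites.

```latex
L_{\mathrm{ASC}}(T,\theta)
  = \prod_{i=1}^{n} \frac{P(D_i \mid T,\theta)}{1 - P_{\mathrm{const}}(T,\theta)},
\qquad
P_{\mathrm{const}}(T,\theta)
  = \sum_{s \in \{A,C,G,T\}} P(\text{all taxa show state } s \mid T,\theta)
```

Each site likelihood is conditioned on the site being variable, and nothing more. How the markers were picked (for example, to separate particular species) does not enter the formula, so that kind of bias is left uncorrected.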

To me, the more important issue with the analysis you've described is that you're assuming that a single tree exists for all the 10K accessions you're studying. From what you've written, it sounds like you have ~5K SNPs for each of ~10K terminals from ~10 species total. I see no good reason to assume that the 5K SNPs should follow a single tree within each species, nor (assuming that some introgression and/or ILS is likely between the species in your sample) between species.

It seems to me that it would be more appropriate in this case to use methods like SNAPP or SVDquartets, or SNP-based network approaches like SNaQ.

Of course, the best method really depends on the questions you're asking and a deep understanding of the system and the data. You're best placed to make those calls!

