SNP-HELP

林子熙

May 5, 2023, 12:05:46 AM
to raxml
Dear developers and community,

I am using SNP data obtained with the GBS-SNP-CROP software. Can I convert the data into FASTA format for a raxml-ng run? The command I use is raxml-ng --all --msa gekko.fas --outgroup Gekko_japonicus --model GTR+G --tree pars{10} --bs-trees 1000 --force. Is a phylogenetic tree built from SNP data alone not credible?

Best regards,
 
Xu Jun

Alexandros Stamatakis

May 5, 2023, 3:43:44 AM
to ra...@googlegroups.com
Dear Xu,

> I am using SNPs data obtained by GBS-SNP-crop software. Can I convert
> the data into fas format for raxml-ng run?

Yes, converting it to FASTA and then running it with RAxML should work.

> The command I use is raxml-ng
> --all --msa gekko.fas --outgroup Gekko_japonicus --model GTR+G --tree
> pars{10} --bs-trees 1000 --force, so using only SNPs data to build
> evolutionary tree books is not credible?

I am not sure why you are using --force. Is there a specific reason? Also, what
do you mean by credible?

One thing you should definitely do when building trees on SNP data is to
use an evolutionary model with ascertainment bias correction:

https://academic.oup.com/sysbio/article/64/6/1032/1669226?login=true
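
To illustrate the point above: RAxML-NG model strings accept ascertainment-correction suffixes such as +ASC_LEWIS (for example GTR+G+ASC_LEWIS), and a Lewis-type correction assumes the alignment contains variable sites only. The following is a minimal Python sketch, not part of RAxML-NG, for checking which columns of a FASTA alignment are variable. The helper names and the toy sequences are hypothetical, and treating N, gaps and ambiguity codes as missing data is an assumption that may differ from how RAxML-NG itself classifies a column.

```python
# Sketch only: helper names are hypothetical, not part of RAxML-NG.
UNAMBIGUOUS = set("ACGT")


def read_fasta(text):
    """Parse FASTA-formatted text into a {name: sequence} dict."""
    seqs = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            seqs[name] = ""
        elif name is not None:
            seqs[name] += line.upper()
    return seqs


def variable_columns(seqs):
    """Return indices of columns with at least two distinct A/C/G/T states.

    Assumption: N, gaps and ambiguity codes count as missing data.
    """
    rows = list(seqs.values())
    if len({len(r) for r in rows}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    return [
        i for i, column in enumerate(zip(*rows))
        if len(set(column) & UNAMBIGUOUS) >= 2
    ]


def keep_columns(seqs, columns):
    """Return a new alignment restricted to the given column indices."""
    return {name: "".join(seq[i] for i in columns) for name, seq in seqs.items()}


# Toy alignment; the sample names other than the outgroup are made up.
example = """>Gekko_japonicus
ACGTAC
>sample_1
ACGTTC
>sample_2
ACCTTN
"""
seqs = read_fasta(example)
cols = variable_columns(seqs)
snps_only = keep_columns(seqs, cols)
print(cols)
print(snps_only)
```

If the alignment still contains invariant columns, either drop them as above before using an +ASC model, or keep the full fragments and run without the correction.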

Alexis



--
Alexandros (Alexis) Stamatakis

ERA Chair, Institute of Computer Science, Foundation for Research and
Technology - Hellas
Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology

www.biocomp.gr (Crete lab)
www.exelixis-lab.org (Heidelberg lab)

林子熙

May 5, 2023, 9:17:46 PM
to raxml
ok

林子熙

May 5, 2023, 9:32:20 PM
to raxml
Dear Alexis,

I think I have to correct my wording: we are using whole fragments to build the evolutionary trees, not just the SNPs. Can the results of this approach be trusted?

I use --force because I get this error on the run:

ERROR: Too few patterns per thread! RAxML-NG will terminate now to avoid wasting resources.
NOTE: Please reduce the number of threads (see guidelines above).
NOTE: This check can be disabled with the '--force' option.

The command is the same one I used to build an evolutionary tree from mitochondrial genes, with only the --model partitions parameter removed.

Finally, our earlier results from the bpp method with nuclear genes were not very consistent with our raxml-ng results, so we wanted to confirm that a tree built from the entire SNP fragments is correct, provided of course that the correct model is used.

Best regards,

Xu Jun
On Friday, May 5, 2023 at 15:43:44 UTC+8, Alexandros Stamatakis wrote:

Grimm

May 9, 2023, 5:53:24 AM
to raxml
Hi Jun;

The only principal difference between using complete sequences and using only SNPs is that you add invariable sites and thus would not need to correct for the so-called ascertainment bias.
"Too few patterns" indicates that the overall divergence in your data is already low. Depending on the hierarchical level (taxonomy-wise) you are working at and the gene set you extracted from the GBS run, the divergence between the tips in your data set (the OTUs) may simply be too low to infer a probabilistic tree.
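
To see how few patterns an alignment really has, one can count its distinct columns and relate that to the thread count, since the error message itself asks to reduce the number of threads. Below is a rough Python sketch under stated assumptions: the function names are hypothetical, the count of distinct columns only approximates what RAxML-NG reports as patterns, and the default of 500 patterns per thread and the cap of 8 threads are illustrative values, not official RAxML-NG thresholds.

```python
# Sketch only: names and default values are assumptions, not RAxML-NG settings.

def count_site_patterns(seqs):
    """Count distinct alignment columns (an approximation of site patterns)."""
    rows = list(seqs.values())
    if len({len(r) for r in rows}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    return len(set(zip(*rows)))


def suggest_threads(n_patterns, patterns_per_thread=500, max_threads=8):
    """Suggest a thread count so each thread gets enough patterns.

    patterns_per_thread and max_threads are assumed, illustrative defaults.
    """
    return max(1, min(max_threads, n_patterns // patterns_per_thread))


# Toy alignment; sample names other than the outgroup are made up.
alignment = {
    "Gekko_japonicus": "ACGTACGA",
    "sample_1": "ACGTTCGA",
    "sample_2": "ACCTTCGA",
}
n_patterns = count_site_patterns(alignment)
print(n_patterns, suggest_threads(n_patterns))
```

The suggested value can then be passed to raxml-ng with its --threads option instead of overriding the check with --force.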

Try to establish the Pythia score: is your data fit for tree-ing?

The main reason for poor Pythia scores in phylogenomic data is the generally low divergence between the genes that can be identified as homologues, but more importantly, especially at the genus level, we have a lot of nearly identical or only randomly differing tips. That is, we often feed a lot of topology-indifferent data into our tree-inference programmes.

If your data ends up in the unfortunate-to-impossible score range:
Calculate the pairwise distances (simple Hamming will do) and make a heat-map and a neighbour-net. Use the circular arrangement in the neighbour-net to sort the heat-map to see which groups of tips are trivial (high ingroup coherence, distinct from any other set of tips), which ones are pointless to tree (generating very flat terminal subtrees, spider-cocoon-like graph portions in the neighbour-net), and where there is something an ML tree inference can work with. Then reduce the tip set to a set of placeholders that are sufficiently divergent to tree (i.e. a set producing a better Pythia score).
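
The pairwise-distance step above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed workflow: the function names and tip labels are hypothetical, and normalising the Hamming distance by the number of positions where both sequences have an unambiguous base (skipping N, gaps and ambiguity codes) is an assumption. The heat-map and the neighbour-net themselves are left to dedicated tools.

```python
from itertools import combinations

# Sketch only: names are hypothetical; sequences are assumed to be uppercase.
VALID = set("ACGT")


def hamming_distance(a, b):
    """Proportion of differing sites among positions where both have A/C/G/T."""
    compared = 0
    differing = 0
    for x, y in zip(a, b):
        if x in VALID and y in VALID:
            compared += 1
            if x != y:
                differing += 1
    return differing / compared if compared else float("nan")


def distance_matrix(seqs):
    """Return (names, nested dict) holding the symmetric pairwise distances."""
    names = list(seqs)
    dist = {n: {m: 0.0 for m in names} for n in names}
    for n, m in combinations(names, 2):
        d = hamming_distance(seqs[n], seqs[m])
        dist[n][m] = d
        dist[m][n] = d
    return names, dist


# Toy data with made-up tip names.
tips = {
    "tip_A": "ACGTACGT",
    "tip_B": "ACGTACGA",
    "tip_C": "TCGAACGA",
    "tip_D": "ACGTNCGT",
}
names, dist = distance_matrix(tips)
for n in names:
    print(n, " ".join(f"{dist[n][m]:.3f}" for m in names))
```

Pairs with a distance of zero or close to zero (tip_A and tip_D here) are the kind of nearly identical tips that can be collapsed into a single placeholder.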

Cheers, Guido

Here is an example of a very quick genomic similarity assessment. The sections of maples are signal-wise trivial, but within the sections (coloured; note the very lush green areas in the heat-map) it is often random noise that ends up building the tree (each tip bubble represents an individual of a distinct species, with long terminal edges). The centre part is spider-web-like but fully resolved in an ML tree; this is where probabilistic methods go beyond the trivial in their inference. Note that this is not a worst-case data scenario. I haven't calculated the Pythia score, but I suppose it is well within the possible-to-tree range. Pics are from the related Res.I.P. posts [Big Data = No Brain? #1] [Big Data = No Brain? #2].

AB2021NrDNANNet.png

HeatMapClassification.jpg

林子熙

May 17, 2023, 1:18:40 AM
to raxml
Dear all,

Thank you very much for such a meticulous answer!

Xu Jun
