Data format for reading Dart data into SplitsTree

Helen Kennedy

May 18, 2022, 3:48:13 AM
to dartR
Hi all, 

I was wondering if anyone had tried using their DArT SNP data in SplitsTree and what format they used to do so?
The SplitsTree manual says it accepts Nexus, ClustalW, PhylipParsimony, FastA or Newick.

I've tried converting my genlight to Phylip (gl2phylip), Nexus (gl2nexus) and fasta (gl2fasta). Both gl2phylip and gl2nexus create files, but when I try to open them in SplitsTree I get errors regarding formatting. For gl2fasta I just get this error:
 " Fatal Error: Parameter method out of range."

I would appreciate your thoughts!

Helen 

Jose Luis Mijangos

May 18, 2022, 11:29:57 PM
to dartR
Hi Helen,

SplitsTree is intended to be used with sequence data. In this type of data, large chunks of DNA are sequenced, so when we find polymorphic sites along the sequence, we can be sure that all of them are on the same chromosome. Data for which we know the chromosome each allele comes from are commonly known as phased data. Data obtained from reduced-representation genome sequencing approaches, like DArTseq, are usually unphased, i.e. we don’t know which chromosome each allele comes from. So, if we use unphased data to construct phylogenetic networks, we will obtain biased results.
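
To make the phasing problem concrete, here is a minimal base-R sketch (the two SNPs and their alleles are made up for illustration). For one diploid individual that is heterozygous at two sites, the unphased genotypes are consistent with two different pairs of haplotypes, and the genotype data alone cannot tell them apart:

```r
# Unphased genotypes at two SNPs in one diploid individual (illustrative data).
geno <- list(snp1 = c("A", "G"), snp2 = c("C", "T"))

# Enumerate the haplotype pairs consistent with these genotypes.
# The first allele of snp1 is fixed on haplotype 1; snp2 is tried in both orders.
phasings <- lapply(list(c(1, 2), c(2, 1)), function(ord) {
  hap1 <- paste0(geno$snp1[1], geno$snp2[ord[1]])
  hap2 <- paste0(geno$snp1[2], geno$snp2[ord[2]])
  c(hap1 = hap1, hap2 = hap2)
})
phasings <- do.call(rbind, phasings)

# Each row is one possible phasing of the same unphased genotypes.
print(phasings)
```

With k heterozygous sites in an individual there are 2^(k-1) such possibilities, which is why concatenating unphased SNPs into a single "sequence" involves an arbitrary choice.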

I can think of 4 different ways to deal with this:

1. One way is to “haploidize” each genotype by randomly choosing one allele from heterozygous genotypes, as in Ellegren, Hans, et al. "The genomic landscape of species divergence in Ficedula flycatchers." Nature 491.7426 (2012): 756-760. We can do this with the dartR function “gl2fasta” using method 4. For this approach to work, your dataset needs to be free of missing data and should not be filtered by minor allele frequency. Be aware that there is not much information in the literature about the validity of this approach.

2. Another way to deal with unphased data, and probably a more reliable one, is to phase the genotypes with more complex approaches like those implemented in the programs Beagle and DnaSP; see for example Al Bkhetan, Ziad, et al. "Exploring effective approaches for haplotype block phasing." BMC bioinformatics 20.1 (2019): 1-14.

3. Another option would be to use a distance matrix as input for SplitsTree. See the code example below. If you need the distance matrices to carry the labels of individuals or populations, you would need to install the beta version of dartR. We also have a nice tutorial about the genetic distances that can be calculated in dartR: http://georges.biomatix.org/dartR

4. The last option would be to use dartR's function to plot trees using a Euclidean distance matrix.

See some code below exemplifying the approaches discussed above.

Cheers,
Luis 

# installing beta version of dartR
library(devtools)
install_github("green-striped-gecko/dartR@beta", build_vignettes = TRUE)
library(dartR)

# filtering data: keep only loci without missing data, then drop monomorphic loci
test <- gl.filter.callrate(platypus.gl, threshold = 1)
test <- gl.filter.monomorphs(test)
test <- gl.filter.allna(test)

# converting to fasta, concatenating SNPs
gl2fasta(test, outpath = getwd(), method = 4)

# using distances between individuals
# (the first line of the file is the number of rows, followed by the labelled matrix)
ind_dis <- as.matrix(gl.dist.ind(test))
write.table(nrow(ind_dis), file = "ind_dis.txt", sep = " ", row.names = FALSE, col.names = FALSE, quote = FALSE)
write.table(ind_dis, file = "ind_dis.txt", sep = " ", append = TRUE, col.names = FALSE, quote = FALSE)

# using distances between populations
pop_dis <- as.matrix(gl.dist.pop(test))
write.table(nrow(pop_dis), file = "pop_dis.txt", sep = " ", row.names = FALSE, col.names = FALSE, quote = FALSE)
write.table(pop_dis, file = "pop_dis.txt", sep = " ", append = TRUE, col.names = FALSE, quote = FALSE)

# plotting trees in dartR
gl.tree.nj(test, type = "unrooted")
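
For readers who want to see what "haploidizing" in approach 1 amounts to, the following is a rough base-R sketch of the idea only. It is not the code behind gl2fasta; the toy genotype matrix, the helper name haploidize() and the output file name are all hypothetical:

```r
set.seed(1)  # make the random draws reproducible

# Toy genotype matrix: rows are individuals, columns are SNPs, coded as
# two-letter diploid genotypes (illustrative data, no missing values).
geno <- matrix(
  c("AA", "AG", "CT",
    "AG", "GG", "CC",
    "GG", "AG", "TT"),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("ind1", "ind2", "ind3"), NULL)
)

# Hypothetical helper: keep one randomly chosen allele per genotype.
# Homozygotes give the same base whichever allele is drawn.
haploidize <- function(g) {
  pick <- sample(1:2, length(g), replace = TRUE)
  out <- substring(g, pick, pick)
  dim(out) <- dim(g)
  dimnames(out) <- dimnames(g)
  out
}

hap <- haploidize(geno)

# Concatenate the SNPs of each individual into one pseudo-sequence
# and write them out as a simple FASTA file.
seqs <- apply(hap, 1, paste0, collapse = "")
fasta <- as.vector(rbind(paste0(">", names(seqs)), seqs))
writeLines(fasta, file.path(tempdir(), "haploid_example.fasta"))
```

Because the allele kept at each heterozygous site is random, repeating the draw with different seeds and comparing the resulting networks is one way to get a feeling for how much the result depends on that choice.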


k K

Nov 21, 2022, 1:16:08 PM
to dartR
Hi,
it is not true that SplitsTree is designed to work with sequence data only, since it is very popular with AFLPs or microsatellites.
I came across this thread because I wanted to use SNPs in SplitsTree to look at the reticulate network. However, I have a problem loading the exported data into SplitsTree. I tried exporting SNPs as fasta (gl2fasta) using all four methods and also as a nexus file (SNAPP, gl2snapp), but the files are not recognized by SplitsTree. I understand this is probably a problem with the program rather than with dartR, but since dartR already has so many useful functions, maybe there is a simple solution or a possibility to include one more export option? SplitsTree is a fairly popular program for reconstructions, so maybe this could be useful :)
Or maybe somebody has an easy solution to this problem other than the one mentioned? Distances work, but it would also be useful to look at the reticulations.
Best regards,
Kamil

Jose Luis Mijangos

Nov 21, 2022, 6:09:01 PM
to dartR
Hi Kamil,

I did not have any problem opening the file generated by the dartR function gl2fasta() with method = 4 in SplitsTree4.

I used Menu File > Open > Files of Type > All files and then selected the output of gl2fasta (output.fasta).

Your assertion that SplitsTree is very popular with AFLPs or microsatellites does not show that the program is not designed to work with sequence data only; see for example the bandwagon fallacy: https://yourlogicalfallacyis.com/bandwagon.

Chapter 5 of the User Manual for SplitsTree4 explains that the program assumes that the input is sequence data. The link to the user manual is:

There are many methods in which SNPs can be used for phylogenetic analyses, and each of them has different assumptions, advantages and limitations; see for example: Leaché, Adam D., and Jamie R. Oaks. "The utility of single nucleotide polymorphism (SNP) data in phylogenetics." Annual Review of Ecology, Evolution, and Systematics 48.1 (2017): 69-84.

Check in particular the concatenation method in section 5, "METHODS FOR ESTIMATING SNP PHYLOGENIES".

Cheers,
Luis

Arthur Georges

Nov 21, 2022, 6:53:17 PM
to da...@googlegroups.com
Hi Kamil,

I would recommend SNAPPER for phylogenetics using SNPs: https://github.com/rbouckaert/snapper. This package overcomes the data-size limitations of SNAPP.

Arthur

k K

Nov 23, 2022, 1:57:32 PM
to dartR
Hi,
it seems that my problem was related to SplitsTree. When reading the fasta files produced by dartR, I always got a message that I either have incorrect characters or duplicate names ("Can't determine datatype ...", then "import failed..."). Neither of these was true, but reducing the amount of missing data (N) in the datasets by applying more stringent filtering helped. So after gl.filter.callrate with threshold=1 and generating fasta files using method 2 or 4, I was able to read the files into SplitsTree. SplitsTree should handle ambiguity codes, so I don't know why options 1 and 2 don't work. This situation is a bit strange, but I'm temporarily content with this solution because the PCoA of the filtered dataset still has the same structure as before filtering. Also, this problem occurs with SplitsTree 4 but not with SplitsTree 5, which reads files produced by all four methods.
Regarding the other things, I don't think that it is a bandwagon fallacy. The main concept of SplitsTree is to reconstruct networks. It is a very popular program, especially because there are many methods to reconstruct phylogenetic trees but not that many that can produce networks. This is an actual problem, and this may be the fallacy that you were referring to, because many people reconstruct relationships as bifurcating trees instead of inferring a network even when they deal with mixed-ploidy datasets, polyploids, or hybrids. With SplitsTree you can reconstruct a network from many input data types. The basic principles are at the beginning of the manual; for example, on page 8 you can see how 01 data can be reconstructed as a network. This is the section 5 that you mentioned, but "sequence" here doesn't mean a "DNA sequence" but a sequence of characters in general. You can find another hint in section "18.2 Choose Datatype", where you can see that you can import more formats (01 data, RNA, protein, etc.). Additionally, it is also possible to read in a distance matrix or trees directly, for example a set of trees from the posterior distribution. So the program is very useful, and so far I'm not aware of any more efficient program that reconstructs networks.
Thank you for the link to SNAPPER. I will certainly have a look at it with my next dataset. The one I have now consists of different populations from the same species (though different subspecies) sampled along an elevational gradient, so I'm looking more for methods that deal with relationships among individuals, not phylogeny. Also, because I have hybrids, I was looking for a method that can show their relationships with other individuals in the dataset.
Thank you for your very fast answers. I really admire what you are doing. DArTseq is quite efficient in my studies, so dartR has become one of my favorite R packages now :)
Best regards,
Kamil
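
Before blaming either program, it can help to inspect the exported FASTA file for the things the error messages above mention. The sketch below is a base-R helper written for this thread (the name check_fasta() and the example file are hypothetical); it reports duplicated names, unequal sequence lengths, the proportion of N per sequence, and any characters outside the IUPAC nucleotide codes. Whether these are exactly the conditions that make SplitsTree 4 reject a file is an assumption based on the report above, not something taken from the SplitsTree documentation:

```r
# Hypothetical helper: basic sanity checks on a nucleotide FASTA file.
check_fasta <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(lines)]
  is_header <- startsWith(lines, ">")
  ids <- sub("^>", "", lines[is_header])

  # Assign each sequence line to the most recent header, then join per record.
  record <- factor(cumsum(is_header)[!is_header], levels = seq_along(ids))
  seqs <- vapply(split(lines[!is_header], record),
                 paste0, character(1), collapse = "")
  seqs <- toupper(unname(seqs))

  chars <- strsplit(seqs, "")
  names(chars) <- ids
  allowed <- strsplit("ACGTRYSWKMBDHVN-", "")[[1]]  # IUPAC codes plus gap

  list(
    n_sequences      = length(ids),
    duplicated_names = unique(ids[duplicated(ids)]),
    equal_length     = length(unique(nchar(seqs))) <= 1,
    prop_N           = vapply(chars, function(x) mean(x == "N"), numeric(1)),
    unexpected       = setdiff(unique(unlist(chars)), allowed)
  )
}

# Small made-up file with a duplicated name, unequal lengths and some Ns.
tmp <- tempfile(fileext = ".fasta")
writeLines(c(">ind1", "ACGTN", ">ind2", "ACNNT", ">ind1", "ACG"), tmp)
res <- check_fasta(tmp)
str(res)
```

If prop_N is high for many individuals, stricter call-rate filtering before export (as described above) is the first thing to try.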

Arthur Georges

Nov 23, 2022, 3:02:53 PM
to da...@googlegroups.com
Thanks Kamil,

I have learnt a lot from this exchange. I agree that phylogenetic applications are largely founded on a bifurcating process and that, strictly, the terminals subject to analysis should be well defined. Of course, as our attention comes to fall more closely on very recent and contemporary processes, this breaks down, and other approaches that handle gene flow between the terminals, reticulation and hybridization are required. It sounds like SplitsTree is a good option for people to explore.

If you get the time, perhaps you could work with us on a gl2splittree script that captures the idiosyncrasies so people do not have to reinvent the wheel.

Best, A