Hi Helen,
SplitsTree is intended to be used with sequence data. In this type of data, large chunks of DNA are sequenced, so when we find polymorphic sites along the sequence, we can be sure that all of them are on the same chromosome. Data for which we know which chromosome each allele comes from is commonly known as phased data. Data obtained from reduced-representation genome-sequencing approaches, like DArTseq, are usually unphased, i.e. we don’t know which chromosome each allele comes from. So, if we use unphased data to construct phylogenetic networks, the results are likely to be biased.
I can think of four different ways to deal with this:
1. One way is to “haploidize” each genotype by randomly choosing one allele from heterozygous genotypes, as in Ellegren, Hans, et al. "The genomic landscape of species divergence in Ficedula flycatchers." Nature 491.7426 (2012): 756-760. We can do this with the dartR function “gl2fasta” using method = 4. For this approach to work, your dataset needs to be free of missing data and must not have been filtered by minor allele frequency. Be aware that there is not a lot of information in the literature about the validity of this approach.
2. Another way to deal with unphased data, and probably a more reliable one, is to phase the genotypes with more complex approaches like those implemented in the programs Beagle and DnaSP; see for example Al Bkhetan, Ziad, et al. "Exploring effective approaches for haplotype block phasing." BMC bioinformatics 20.1 (2019): 1-14.
3. Another option would be to use a distance matrix as input for SplitsTree. See the code example below. If you need the distance matrices to carry the labels of individuals or populations, you will need to install the beta version of dartR. We also have a nice tutorial about the genetic distances that can be calculated in dartR: http://georges.biomatix.org/dartR
4. The last option would be to use dartR's function for plotting trees (gl.tree.nj), which uses a Euclidean distance matrix.
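To make approach 1 more concrete, here is a small base-R sketch of what randomly resolving heterozygotes means. It is only a toy illustration and not the code used by gl2fasta: the function name haploidize, the 0/1/2 genotype coding (count of the alternate allele) and the ref/alt vectors are assumptions made for this example.

```r
# Toy illustration only: this is NOT the implementation used by gl2fasta.
# Assumed genotype coding (count of the alternate allele):
# 0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate.
haploidize <- function(geno, ref, alt) {
  stopifnot(ncol(geno) == length(ref), length(ref) == length(alt))
  stopifnot(!anyNA(geno))  # this sketch assumes there is no missing data
  seqs <- vapply(seq_len(nrow(geno)), function(i) {
    g <- geno[i, ]
    # heterozygotes: pick the reference or the alternate allele at random;
    # homozygotes: keep the allele they carry
    pick_alt <- ifelse(g == 1, runif(length(g)) < 0.5, g == 2)
    paste(ifelse(pick_alt, alt, ref), collapse = "")
  }, character(1))
  names(seqs) <- rownames(geno)
  seqs
}

# two made-up individuals genotyped at three made-up SNPs
set.seed(1)
geno <- matrix(c(0, 1, 2,
                 2, 1, 0),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("ind1", "ind2"), NULL))
ref <- c("A", "C", "G")
alt <- c("T", "G", "A")
haploidize(geno, ref, alt)
```

Each individual ends up with a single concatenated sequence. Only the heterozygous site (the second SNP here) depends on the random draw, which is why repeated runs can give different sequences unless a seed is set.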
The code below illustrates the approaches discussed above.
Cheers,
Luis
# installing the beta version of dartR
library(devtools)
install_github("green-striped-gecko/dartR@beta", build_vignettes = TRUE)
# loading dartR so that the functions used below are available
library(dartR)
# filtering data
# keeping only loci without missing data (call rate = 1)
test <- gl.filter.callrate(platypus.gl, threshold = 1)
test <- gl.filter.monomorphs(test)
test <- gl.filter.allna(test)
# converting to FASTA, concatenating SNPs
gl2fasta(test, outpath = getwd(), method = 4)
# using distances between individuals
ind_dis <- as.matrix(gl.dist.ind(test))
# first line of the file: number of individuals
write.table(nrow(ind_dis), file = "ind_dis.txt", sep = " ", row.names = FALSE, col.names = FALSE, quote = FALSE)
# remaining lines: one labelled row of distances per individual
write.table(ind_dis, file = "ind_dis.txt", sep = " ", append = TRUE, col.names = FALSE, quote = FALSE)
# using distances between populations
pop_dis <- as.matrix(gl.dist.pop(test))
# first line of the file: number of populations
write.table(nrow(pop_dis), file = "pop_dis.txt", sep = " ", row.names = FALSE, col.names = FALSE, quote = FALSE)
# remaining lines: one labelled row of distances per population
write.table(pop_dis, file = "pop_dis.txt", sep = " ", append = TRUE, col.names = FALSE, quote = FALSE)
# plotting trees in dartR
gl.tree.nj(test, type = "unrooted")
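If you want to check the layout of the distance files before loading them into SplitsTree, the two write.table calls can be wrapped in a small helper and tried on toy data. This is just a sketch in base R: write_dist_file is a hypothetical name (it is not a dartR function), and the three individuals and their coordinates are made up.

```r
# Hypothetical helper (not part of dartR): writes the number of taxa on the
# first line, followed by one labelled row of distances per taxon, i.e. the
# same layout produced by the two write.table calls in the script above.
write_dist_file <- function(d, file) {
  d <- as.matrix(d)
  stopifnot(nrow(d) == ncol(d), !is.null(rownames(d)))
  writeLines(as.character(nrow(d)), file)
  write.table(d, file = file, sep = " ", append = TRUE,
              col.names = FALSE, row.names = TRUE, quote = FALSE)
  invisible(file)
}

# toy data: Euclidean distances between three made-up individuals
toy <- matrix(c(0, 0,
                3, 4,
                6, 8),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("indA", "indB", "indC"), NULL))
out_file <- file.path(tempdir(), "toy_dis.txt")
write_dist_file(dist(toy), out_file)

# print the file to inspect the layout
readLines(out_file)
```

The same helper could be called on the matrices from the script, for example write_dist_file(ind_dis, "ind_dis.txt"), provided the matrix has row names (which, as noted above, requires the beta version of dartR).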