silicodart more resolution than dartSNP?

Kwan TN

unread,

Mar 12, 2024, 2:17:53 AMMar 12

to dartR

Hi Luis and co,

I am working with a fish population in a confined lake and early studies seem to indicate poor population structuring.

I am finding it interesting that all my phylogenetic tree, PCA and DAPC of silicodart data hint of clustering but never at all with the respective dartSNP dataset. How does that say anything about the dominant versus co-dominant markers? Most papers I read seem to indicate comparable results. In both dataset, I used 95% filter for reproducibility, ind and loc call rate, removed sex linked loci, but did not (not yet) filter MAF, Hamming and Linkage disequilibrium.

My next question- i have trouble using gl.run.structure with my silicodart dataset. The error i got is 'You need to provide a SNP genlight object (ploidy=2)!' I think it was expecting SNP data. Evanno et al., 2005 used AFLP data, and the structure manual (Section 4.1, pg 14) seems to address using dominant markers. Can I format the data?

Thank you so much for considering, and I have benefited from this group a lot.

Cheers

kwan

Jose Luis Mijangos

unread,

May 1, 2024, 3:58:24 AMMay 1

to dartR

Hi Kwan,

One of the possible reasons for the contrasting results you see between your SNP and SilicoDArT data might be your filtering approach. It might be that there is a "real" genetic structure occurring on your study species which might be driven by specific mutations at one or both of the restriction enzyme recognition sites that cut the DNA. As a result, SNPs of individuals that have these mutations are called as missing data. So, when you filter by call rate (missing data) you are removing the SNPs containing this information. These mutations in the recognition sites is what SilicoDArT data reports.

To test this hypothesis, you could run your analyses again without filtering your SNP data. To speed up the analyses you might want to subsample your data using the function gl.subsample.loci. Also, you could run the function gl.report.pa to calculate private alleles and fixed difference in both datasets and check whether you get the same proportion of them in both datasets.

The calculation of private alleles in SilicoDArT data is available in the developing version of dartR.base which you can install as below.

install.packages("dartrVerse")
library(dartRverse)
dartRverse_install(package = "dartR.base", rep = "github", branch = "dev")

For the moment, the function to analyse SIlicoDArT data is not enabled in gl.run.structure, but you could try to convert your data into structure format using gl2structure.

Cheers,
Luis

KWAN

unread,

May 7, 2024, 3:00:35 AMMay 7

to dartR

Hi Luis,

I appreciate the reply and I have gone away working on your suggestions. I tried analyzing using the unfiltered SNP data, and the PCA output is still the same i.e. it showed no clustering as opposed to those derived from the silicodart. I also tried quickly doing Fst of the two dataset types and the score (comparing two assumed populations) is 0.034 for SNP and a whooping 0.560 for silicodart.

Appreciate your insights on this.

Cheers

Kwan

Jose Luis Mijangos

unread,

May 27, 2024, 9:45:40 PMMay 27

to dartR

Hi Kwan,

Based on the dataset you shared with me, the contrasting patterns between your SilicoDArT and SNP datasets can be attributed to the presence of individuals from two different species/subspecies and low DNA quality/quantity in some samples.

Missing data in a SNP dataset typically arises from two main sources:

1. Low DNA quality/quantity.
2. Mutations at restriction enzyme recognition sites. Restriction enzymes and DArT's calling pipelines are designed to target one species. Individuals from a different species/subspecies will have mutations at these recognition sites, causing restriction enzymes to not bind to the target sites, resulting in missing data in the SNP dataset.

In contrast, in SilicoDArT data, these mutations at the recognition sites are shown as absences rather than missing data.

Given the patterns in your dataset, which result from a combination of missing data due to low DNA quality and the presence of individuals from different species/subspecies, simply using a threshold of missing data to remove individuals is not ideal.

Instead, we can use the presence/absence data from the SilicoDArT to set a threshold to identify individuals from different species/subspecies, as demonstrated in the code below and the attached smearplot in which the individuals fro m a different species/subspecies are shown at the bottom of the plot in blue.

After removing individuals from different species/subspecies, you can see that the PCAs from SilicoDArT and SNP data are now quite similar (attached).

Another option to improve your data quality would be to ask DArT to re-analyse your data separately for each of the two species/subspecies.

Cheers,

Luis

library(dartRverse)
# reading SilicoDArT
t2 <- gl.read.silicodart(your_silicoDArT)
# calculating mean presence/absence by individual
ind_abs_pres <- round(rowMeans(as.matrix(t2),na.rm = T),2)
t2a <- t2
# concatenating presence/absence with individual names to visualise in smearplot
indNames(t2a) <- paste0(ind_abs_pres,"_",indNames(t2a))
# subsampling loci
t2a <- gl.subsample.loci(t2a,n=5000)
# using ind.labels = T to plot individuals in alphabetical order in a smearplot
gl.smearplot(t2a,ind.labels = TRUE)
# histogram and table to decide on a threshold to drop individuals
hist(ind_abs_pres,breaks = 50)
table(ind_abs_pres)
# getting individual names to drop with leass than 0.9 presence/absence
ind_drop <- indNames(t2)[which(ind_abs_pres < 0.9)]
# dropping individuals
t2wo <- gl.drop.ind(t2,ind.list =ind_drop )
# filtering on call rate
t2wo <- gl.filter.callrate(t2wo,threshold = 1)
# pca
pca_t2wo <- gl.pcoa(t2wo)
# pca plot
gl.pcoa.plot(pca_t2wo,t2wo)
# reading SNP data
t1 <- gl.read.dart(your_DArT_report)
# dropping individuals
t1wo <- gl.drop.ind(t1,ind.list = ind_drop)
# filtering on call rate
t1wo <- gl.filter.callrate(t1wo,threshold = 1)
# pca
pcoa_t1wo <- gl.pcoa(t1wo)
# pca plot
gl.pcoa.plot(pcoa_t1wo,t1wo)

pca_snp.png

smearplot.png

pca_silico.png

KWAN

unread,

May 27, 2024, 10:52:01 PMMay 27

to dartR

Hi Luis, this is incredibly helpful - this is my thought following from yours. Because of the scattering nature of those 'weird' individuals that were clearly separated from the rests, I am inclined to think they are not of a sub-species. If they were indeed a sub-species, i would expect them to form a cluster themselves, away from the rests. I have also thought about mis-id of animals during sampling and my sampling team had indicated very low possibility due to obvious morphology. Poor DNA is thus the likelier explanation as they were quite old samples. I am following your work steps to pull out the poor DNA individuals.

Thanks heaps DA team..

Kwan

Jose Luis Mijangos

unread,

May 28, 2024, 2:33:23 AMMay 28

to dartR

Hi Kwan,

Yes, that sounds about right and could be one of reasons why I did not see any difference between loci that were mapped to the reference genome and unmapped loci.

However, it is important to note that missing data is among the main factors causing distortions in PCA, see for example page 31 of the below ms:

Distances and their visualization in studies of spatial-temporal genetic variation using single nucleotide polymorphisms (SNPs) (biorxiv.org)