Combining datasets

Peter K

Apr 21, 2022, 8:57:37 PM
to dartR
Hi folks,
I want to combine two separate DArT SNP datasets using dartR, if possible.

The datasets are for the same species. The initial dataset was larger; the second was for a batch of supplementary individuals whose samples were received after the initial batch had been sent off for sequencing. Consequently, there were two separate sequencing runs, and the resulting loci for the two datasets are somewhat different. However, there is a large (probably large enough) overlap of loci that are common to both datasets.

I could fairly easily remove the loci that occur in only one of the two datasets and retain only the loci common to both. That still leaves the issue that the locus metrics calculated by DArT will differ between datasets for most loci.

Is there a function (or functions) in dartR that can be used to combine datasets after they've been imported and then recalculate the locus metrics? Or is there some other suggestion?

Thanks,
Peter

Berry, Olly (NCMI, IOMRC Crawley)

Apr 21, 2022, 11:53:55 PM
to da...@googlegroups.com
Hi Peter,
I don’t advise combining datasets as you describe. A better approach would be to ask your SNP provider (Diversity Arrays in your case) to re-analyse the two runs together and re-call the SNPs. That will give you the best confidence in your SNP calls.
Cheers,
Olly

CSIRO Environomics Future Science Platform

Jose Luis Mijangos

Apr 29, 2022, 12:55:15 AM
to dartR
Hi Peter,

As Olly said, it is not advisable to merge two datasets with different numbers of loci and different individuals, so asking DArT to reanalyse the two datasets together is the safest option. This issue has been discussed previously in the forum, see:


Having said that, please find below some code showing how to do the merging. It should work, assuming that DArT assigned the same locus names in all the jobs for the same species. I asked the DArT bioinformaticians about locus names in different jobs from the same species, and they said:

"The SNP names which contain ID numbers less than 100,000,000 would be the same between reports in general, however, it also depends on whether or not the 2 species have been assayed under the same organism designation.
I would definitely recommend to request a co-analysis rather than try to combine the results."
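
As a rough illustration of the ID rule quoted above, the sketch below flags which loci have an ID below 100,000,000. It assumes locus names follow a "CloneID-SNPposition-ref/alt" pattern; the names in loc_names are made up for the example, and with a real genlight object they would come from locNames().

```r
# Hypothetical locus names in the assumed "CloneID-SNPposition-ref/alt" format
loc_names <- c("12345678-24-A/G", "100234567-10-C/T", "98765432-5-G/A")

# The clone ID is everything before the first hyphen
clone_id <- as.numeric(sub("-.*$", "", loc_names))

# Per the DArT comment, only IDs below 100,000,000 are expected to be
# consistent between reports
comparable <- clone_id < 1e8

# Loci whose names should, in general, match across reports
loc_names[comparable]
```

Loci that fail this check may still overlap by name between reports, but going by the comment above their names cannot be assumed to refer to the same locus.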

Another potential source of bias is that the reference allele is part of each locus name. The reference allele is the allele with the higher frequency, which means it could differ between datasets that sample different populations.
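
One way to look for this problem is to match loci on clone ID and SNP position only, then check whether the two alleles appear in opposite order. The sketch below does this in base R. It assumes the "CloneID-SNPposition-ref/alt" naming pattern; loc_a, loc_b, parse_loc and flip are hypothetical names used only for this example.

```r
# Hypothetical locus names from two separate DArT reports
loc_a <- c("12345678-24-A/G", "23456789-10-C/T", "34567890-5-G/A")
loc_b <- c("12345678-24-G/A", "23456789-10-C/T", "45678901-7-T/C")

# Split each locus name into a key (CloneID-position) and its ref/alt alleles
parse_loc <- function(x) {
  parts <- strsplit(x, "-", fixed = TRUE)
  data.frame(
    name = x,
    key = vapply(parts, function(p) paste(p[1], p[2], sep = "-"), character(1)),
    alleles = vapply(parts, function(p) p[3], character(1)),
    stringsAsFactors = FALSE
  )
}

a <- parse_loc(loc_a)
b <- parse_loc(loc_b)

# Match loci on clone ID and position, ignoring which allele is the reference
both <- merge(a, b, by = "key", suffixes = c("_a", "_b"))

# Turn "A/G" into "G/A" so that swapped alleles can be detected
flip <- function(al) {
  vapply(strsplit(al, "/", fixed = TRUE),
         function(p) paste(rev(p), collapse = "/"), character(1))
}

# TRUE where the same two alleles appear in opposite order
both$swapped <- both$alleles_a != both$alleles_b &
  both$alleles_a == flip(both$alleles_b)

both[, c("name_a", "name_b", "swapped")]
```

A locus flagged as swapped would not match by its full name, so merging on exact locus names would silently drop it. Keeping such a locus would also mean recoding its genotypes in one dataset (0 becomes 2 and 2 becomes 0) before merging.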

If you have a reference genome, one possible solution would be to map the trimmed sequences from both datasets to it using the function gl.blast.

Cheers,
Luis 

library(dartR)

# test dataset
df_test <- platypus.gl
df_test <- gl.filter.callrate(df_test, threshold = 1)
df_test <- gl.filter.monomorphs(df_test)

# test dataset 1
test <- gl.keep.ind(df_test, ind.list = indNames(df_test)[1:5])
test <- gl.keep.loc(test, loc.list = locNames(test)[1:6])

# test dataset 2
test_2 <- gl.keep.ind(df_test, ind.list = indNames(df_test)[6:10])
test_2 <- gl.keep.loc(test_2, loc.list = locNames(test_2)[6:11])

# finding loci in common
common_loc_test <- locNames(test)[locNames(test) %in% locNames(test_2)]
common_loc_test_2 <- locNames(test_2)[locNames(test_2) %in% locNames(test)]

# keeping only loci in common
test <- gl.keep.loc(test, loc.list = common_loc_test)
test_2 <- gl.keep.loc(test_2, loc.list = common_loc_test_2)

# ordering loci
test <- test[, order(locNames(test))]
test_2 <- test_2[, order(locNames(test_2))]

# testing that loci names are equal and in the same order in both datasets
# the result of the command below should be 0
sum(locNames(test) != locNames(test_2))

# merging datasets
merge_gl <- rbind.genlight(test, test_2)

# running compliance function
merge_gl <- gl.compliance.check(merge_gl)

# recalculating metrics
merge_gl <- gl.recalc.metrics(merge_gl)