Callrate threshold / using raw data with dartR

Luana Sousa

Jul 13, 2022, 1:40:12 PM
to dartR
Hey everyone! 
I have a couple of questions, so let's go!

I am having some difficulty working out the best approach for my DArT data. I want to work independently with three different groups of individuals from my sequencing run. I know how to separate them using either gl.drop.pop or gl.drop.ind.

I recalculated the metrics for one of these groups, removed the monomorphic loci, and filtered on reproducibility with a 0.99 threshold. To avoid losing all of my SNPs when filtering by callrate, I had to set the threshold to 0.65, which took me from 53k to 25k SNPs. With a 0.75 threshold I am left with 17k SNPs. (1.) What is the best way to choose the callrate threshold?

2. Do I lose biological information if I subset the data? If the different groups were separated, would it be preferable to retrieve the SNPs straight from the raw data?

I have fastq.gz files containing the raw sequencing data, and I'm trying to figure out how to use these data more effectively. I reached out to Andrzej Kilian for help, but he stopped replying to my emails when I tried to explain this to him. When I asked about using Stacks to retrieve the SNPs, I got the following response:

" While some of our users go back to SNP calling from raw sequences most would plug our report into application like dartR and process marker data we produce using a range of algorithms available there. (...) but for now I would suggest you consider using your analytical skills “downstream” from marker data extraction."

I looked through the tutorials and this group, but I couldn't find a way to do it. Can you tell me how to do this and what the best course of action would be? (3. How can I use the raw sequencing data in dartR?)

Although it's a lot, I do hope you can assist me!

Thanks!
Luana

Luana Sousa

Jul 13, 2022, 1:43:39 PM
to dartR
I forgot to mention that without the callrate filter I have 40.62% missing data. With a callrate threshold of 0.75 it drops to 16.76%.

Jose Luis Mijangos

Jul 13, 2022, 8:09:41 PM
to dartR
Hi Luana,

1. You need to decide your filtering thresholds based on the characteristics of your data and the questions you want to answer. For this, you need to read about how missing data and other factors affect the analyses you want to do. I recommend reading the following papers:
- Torkamaneh, Davoud, and Francois Belzile. "Scanning and filling: ultra-dense SNP genotyping combining genotyping-by-sequencing, SNP array and whole-genome resequencing data." PloS one 10.7 (2015): e0131533.
- Schmidt, Thomas L., et al. "Unbiased population heterozygosity estimates from genome‐wide sequence data." Methods in Ecology and Evolution 12.10 (2021): 1888-1898.
- O'Leary, Shannon J., et al. "These aren't the loci you are looking for: Principles of effective SNP filtering for molecular ecologists." Molecular Ecology 27.16 (2018): 3193-3206.

You only need a couple of thousand reliable SNPs to answer some questions using genetic data. Read, for example:
- Allendorf, Fred W., Paul A. Hohenlohe, and Gordon Luikart. "Genomics and the future of conservation genetics." Nature reviews genetics 11.10 (2010): 697-709.
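
One way to look at the trade-off behind a callrate threshold is to tabulate, for several candidate thresholds, how many loci survive and how much missing data is left. The following is a minimal Python sketch on a made-up genotype matrix, not dartR code: the data, the function names and the thresholds are all hypothetical and only illustrate the arithmetic. In dartR itself, the corresponding steps would be done with its callrate report and filter functions.

```python
# Toy illustration only: rows are individuals, columns are loci.
# Genotypes are coded 0/1/2; None marks a missing call.
# All data and function names here are hypothetical.

def locus_call_rates(genotypes):
    """Fraction of individuals with a non-missing call, per locus."""
    n_ind = len(genotypes)
    n_loc = len(genotypes[0])
    return [
        sum(row[j] is not None for row in genotypes) / n_ind
        for j in range(n_loc)
    ]

def missing_fraction(genotypes, keep):
    """Overall fraction of missing calls across the kept loci."""
    total = len(genotypes) * len(keep)
    if total == 0:
        return 0.0
    missing = sum(row[j] is None for row in genotypes for j in keep)
    return missing / total

def threshold_table(genotypes, thresholds):
    """For each threshold: (threshold, loci kept, remaining missing fraction)."""
    rates = locus_call_rates(genotypes)
    table = []
    for t in thresholds:
        keep = [j for j, r in enumerate(rates) if r >= t]
        table.append((t, len(keep), missing_fraction(genotypes, keep)))
    return table

genotypes = [
    [0, 1, None, 2, None],
    [0, None, None, 2, 1],
    [1, 1, None, 0, None],
    [0, 2, 1, None, None],
]

for t, n_kept, miss in threshold_table(genotypes, [0.25, 0.5, 0.75, 1.0]):
    print(f"threshold {t:.2f}: {n_kept} loci kept, {miss:.1%} missing")
```

A table like this does not pick the threshold for you, but it shows where raising the threshold starts to cost many loci for little gain in completeness, which can then be weighed against what the planned analyses tolerate.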

2. Resources needed to analyse raw data:
- Access to a supercomputer or computing cluster to analyse gigabytes of data that would take months on a personal computer.
- Experience with Unix/Linux systems, high-performance computing environments and scripting languages.
- Experience with a dozen programs and with how the dozens of parameters of each one affect the data.

Acquiring this bioinformatics experience and the required resources would take months.

SNPs provided by DArT have already been processed using their proprietary analytical pipelines, which are fine-tuned for their particular sequencing technology. See, for example:

Kilian, Andrzej, et al. "Diversity arrays technology: a generic genome profiling technology on open platforms." Data production and analysis in population genomics. Humana Press, Totowa, NJ, 2012. 67-89.

Providing SNPs ready to be used is one of the main advantages of DArT over other technologies. 

If you are interested, the bioinformatics GitHub page from the University of Sydney has a nice pipeline to analyse raw data:

Cheers,
Luis

Luana Sousa

Jul 14, 2022, 11:00:40 AM
to dartR
I appreciate the information.
I have access to all the tools necessary to work with the raw data. My concern was whether it could be used with dartR, but OK, it turns out that it cannot. I'm not sure if it calls SNPs separately from the subset of raw data or if it simply divides the SNP data with dartR. It would make a difference, given that there are three separate species. What do you think?

In the future, I want to be able to join data from different sequencing runs, but I read here that it is not a good idea to merge SNPs that were filtered independently. So I think working with the raw data would be my best choice. But I am the first person in my lab to use DArT, so I'm not entirely sure how to proceed.

Jose Luis Mijangos

Jul 15, 2022, 1:20:21 AM
to dartR
Hi Luana,

1. Could you please expand on the following comment, and maybe explain it with an example: "I'm not sure if it calls SNPs separately from the subset of raw data or if it simply divides the SNP data with dartR."

2. You can read a previous post about joining different sequencing jobs: https://groups.google.com/g/dartr/c/C5djrsEhlw0/m/MuSJF3pQAAAJ

Cheers,
Luis

Luana Sousa

Jul 15, 2022, 12:54:37 PM
to dartR
In my sequencing, I have 3 data sets (1. Petunia altiplana, 2. P. scheideana + P. guarapuavensis, and 3. 8 Calibrachoa sp.), all aligned to P. axillaris and P. inflata. Because of their biological significance and the nature of our goal, all three must be analysed separately. My question is whether it makes any difference to separate these data sets before or after SNP calling, and whether filtering them together loses some information about a particular species. All the SNPs were called together in the DArT pipeline (we don't have access to its details, only to what is in the articles).
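
The filtering half of this question can be pictured with a small toy example. The Python sketch below (hypothetical data and names, not dartR code) compares a callrate filter applied once to the pooled individuals with the same filter applied after splitting into groups. It only illustrates the filtering step; whether SNP calling itself would produce different markers if the groups were processed separately is a different question that this sketch does not address.

```python
# Toy illustration only: hypothetical data and names, not dartR code.
# Each individual is (group label, genotypes); None marks a missing call.

def loci_passing(individuals, threshold):
    """Indices of loci whose call rate among `individuals` is >= threshold."""
    n_ind = len(individuals)
    n_loc = len(individuals[0][1])
    passing = set()
    for j in range(n_loc):
        called = sum(geno[j] is not None for _, geno in individuals)
        if called / n_ind >= threshold:
            passing.add(j)
    return passing

data = [
    ("A", [0, 1, 2, None]),
    ("A", [1, 1, 0, None]),
    ("B", [0, None, 1, 2]),
    ("B", [2, None, 1, 0]),
    ("C", [0, None, None, 1]),
    ("C", [1, None, None, 1]),
]
threshold = 0.75

# Filter once on the pooled data, as when all groups are processed together.
pooled = loci_passing(data, threshold)

# Subset first, then filter each group on its own call rates.
per_group = {}
for label in ("A", "B", "C"):
    subset = [ind for ind in data if ind[0] == label]
    per_group[label] = loci_passing(subset, threshold)

print("pooled:", sorted(pooled))
for label, loci in per_group.items():
    print(label, "kept:", sorted(loci), "lost by pooled filter:", sorted(loci - pooled))
```

In this made-up matrix, loci that are fully genotyped within one group but mostly missing in the others fail the pooled filter, which is why metrics are normally recalculated on each subset before filtering.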
Andrzej was replying to my questions, but then he stopped.

And thank you for the link!
Luana

Jose Luis Mijangos

Jul 18, 2022, 12:42:52 AM
to dartR
Hi Luana,

I recommend contacting DArT and asking them to process the data separately.

In my opinion, DArT is a sequencing provider, and its service does not include technical or academic advice. It is the responsibility of the user to investigate how to analyse the sequencing data.

Cheers,
Luis