Hi Luana,
1. You need to decide your filtering thresholds based on the characteristics of your data and the questions you want to answer. To do this, you need to read about how missing data and other factors affect the analyses you want to run. I recommend the following papers:
- Torkamaneh, Davoud, and Francois Belzile. "Scanning and filling: ultra-dense SNP genotyping combining genotyping-by-sequencing, SNP array and whole-genome resequencing data." PLoS ONE 10.7 (2015): e0131533.
- Schmidt, Thomas L., et al. "Unbiased population heterozygosity estimates from genome‐wide sequence data." Methods in Ecology and Evolution 12.10 (2021): 1888-1898.
- O'Leary, Shannon J., et al. "These aren't the loci you are looking for: Principles of effective SNP filtering for molecular ecologists." Molecular Ecology 27.16 (2018): 3193-3206.
You only need a couple of thousand reliable SNPs to answer some questions using genetic data. Read, for example:
- Allendorf, Fred W., Paul A. Hohenlohe, and Gordon Luikart. "Genomics and the future of conservation genetics." Nature reviews genetics 11.10 (2010): 697-709.
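To give a feel for how two common thresholds interact, here is a minimal Python sketch. It is only an illustration under my own assumptions, not DArT's pipeline or any package's API: the function name `filter_loci`, the genotype coding (0, 1, 2 for the count of the alternate allele, `None` for missing) and the default threshold values are all hypothetical and should be replaced by values justified by your data and the papers above.

```python
def filter_loci(genotypes, min_call_rate=0.8, min_maf=0.05):
    """Return the indices of loci that pass both thresholds.

    genotypes: list of loci; each locus is a list with one entry per
    individual, coded 0/1/2 (alternate-allele count) or None (missing).
    The default thresholds are placeholders, not recommendations.
    """
    kept = []
    for idx, locus in enumerate(genotypes):
        called = [g for g in locus if g is not None]
        if not called:
            continue  # nothing was genotyped at this locus
        # Call rate: proportion of individuals with a genotype call.
        call_rate = len(called) / len(locus)
        # Minor allele frequency, computed from called genotypes only.
        alt_freq = sum(called) / (2 * len(called))
        maf = min(alt_freq, 1 - alt_freq)
        if call_rate >= min_call_rate and maf >= min_maf:
            kept.append(idx)
    return kept


# Toy data: 4 loci scored in 5 individuals.
toy = [
    [0, 1, 2, 1, 0],           # complete and polymorphic
    [0, None, None, 1, None],  # call rate 0.4
    [0, 0, 0, 0, 0],           # monomorphic (MAF = 0)
    [2, 2, 1, None, 2],        # call rate 0.8, MAF 0.125
]

print(filter_loci(toy))                     # [0, 3]
print(filter_loci(toy, min_call_rate=0.9))  # [0]
```

Tightening the call-rate threshold from 0.8 to 0.9 drops the last locus, which shows how quickly the number of retained SNPs can change with the thresholds you choose.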
2. To analyse raw data, you would need the following resources:
- Access to a supercomputer or computing cluster to analyse gigabytes of data, which would take months on a personal computer.
- Experience with Unix/Linux systems, high-performance computing environments and scripting languages.
- Experience with a dozen programs and an understanding of how the dozens of parameters of each one affect the data.
Acquiring this bioinformatics experience and the required resources would take months.
SNPs provided by DArT have already been processed using their proprietary analytical pipelines, which are fine-tuned for their particular sequencing technology. See, for example:
Kilian, Andrzej, et al. "Diversity arrays technology: a generic genome profiling technology on open platforms." Data production and analysis in population genomics. Humana Press, Totowa, NJ, 2012. 67-89.
Providing SNPs ready to be used is one of the main advantages of DArT over other technologies.
If you are interested, the bioinformatics GitHub page from the University of Sydney has a nice pipeline to analyse raw data:
Cheers,
Luis