Dear Kamvar,
I'm pretty new to population genetic analysis. I came across this group and would really appreciate any suggestions that might improve my analysis. I've found several R packages for population analysis, such as poppr, adegenet and strataG, but I'm not sure which would be the most suitable option.
For my thesis project I have to analyze genotype data from sole. My dataset is an .xlsx file with individual IDs (rows), populations (meaning sampling locations) and SNP loci (columns). The genotypes are coded as 0101, 0103, etc., and I assume that each allele is coded as 01, 02, 03 or 04, whereas 00 means a missing allele (e.g. 0000 is an NA value). At first glance, the data all seem to be biallelic, and I read in a previous post (
https://groups.google.com/d/msg/poppr/KJhszeCKIDA/8wIPtL9tGgAJ) that you recommended the genlight object. However, I'm not sure about this, and I don't know which are the major and minor alleles in my data. How can I retrieve this information? Do you think that finding and replacing the values in Excel would be a good option?
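To make the major/minor allele question concrete, here is a minimal sketch of the counting I have in mind, written in plain Python rather than with poppr or adegenet. It assumes that every genotype is a 4-digit code made of two 2-digit alleles and that 00 marks a missing allele, as described above. The function names (`split_genotype`, `allele_summary`, `minor_allele_dosage`) are hypothetical names of mine, not part of any package, and defining the "major" allele as the most frequent one per locus is my own assumption.

```python
from collections import Counter

def split_genotype(code):
    """Split a 4-digit genotype such as '0103' into its two allele codes.

    Returns None when either allele is '00' (assumed to mean missing).
    zfill restores leading zeros in case Excel stored 0101 as the integer 101.
    """
    code = str(code).strip().zfill(4)
    first, second = code[:2], code[2:]
    if first == "00" or second == "00":
        return None
    return (first, second)

def allele_summary(genotypes):
    """Count alleles at one locus and report the major and minor allele."""
    counts = Counter()
    for genotype in genotypes:
        alleles = split_genotype(genotype)
        if alleles is not None:
            counts.update(alleles)
    # Most frequent allele first; ties are broken by allele code.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {
        "counts": dict(counts),
        "biallelic": len(counts) <= 2,
        "major": ranked[0][0] if ranked else None,
        "minor": ranked[1][0] if len(ranked) > 1 else None,
    }

def minor_allele_dosage(genotypes, minor):
    """Recode each genotype as the number of copies (0, 1 or 2) of `minor`."""
    dosages = []
    for genotype in genotypes:
        alleles = split_genotype(genotype)
        dosages.append(None if alleles is None else alleles.count(minor))
    return dosages

# One locus from a made-up example, not from my real data.
locus = ["0101", "0103", "0303", "0000", "0103", "0101"]
summary = allele_summary(locus)
dosage = minor_allele_dosage(locus, summary["minor"])
```

If this logic is right, I imagine the 0/1/2 recoding is roughly what a genlight object would need, but I am not sure, which is why I am asking before doing a find and replace by hand.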
Otherwise, I thought about using the df2genind function to import my data as a genind object and counting allele occurrences from there, especially because my dataset is quite small (436 samples, 426 SNPs) and I read that the genlight object is suggested when you have thousands of SNPs and a VCF file is available.
Moreover, the genind object has different slots such as population and strata. From this example (
https://grunwaldlab.github.io/Population_Genetics_in_R/Population_Strata.html) it seems to me that strata can be used to store information such as the year of sampling, but I'm not sure, and from the documentation I couldn't work out the real difference between genind and genpop objects or when each should be used.
Finally, I have a question about NA values. My data have several missing values for both individuals and loci, and in previous related work I read about deleting individuals/loci with more than 10% NAs. However, an entire population is deleted using this approach. I read about functions such as missingno and its options, but I cannot find references explaining when each option is preferable (removing loci/individuals, replacing with the average allele count, or treating 00 as an additional allele). Can you recommend any?
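To show what I mean about a whole population disappearing, here is a rough sketch in plain Python (it does not use missingno, and it is only my reading of the "more than 10% NAs" rule). It assumes the same 4-digit coding with 00 as missing. The toy table, the population labels and the function names (`is_missing`, `missing_rates`, `surviving_per_population`) are all made up for illustration.

```python
def is_missing(code):
    """A genotype counts as missing if either 2-digit allele is '00'."""
    code = str(code).strip().zfill(4)
    return code[:2] == "00" or code[2:] == "00"

def missing_rates(table):
    """Return the missing-data fraction per individual and per locus.

    `table` maps an individual ID to its list of genotype codes,
    one code per locus, with the same locus order for everyone.
    """
    n_loci = len(next(iter(table.values())))
    per_individual = {
        ind: sum(is_missing(g) for g in row) / n_loci
        for ind, row in table.items()
    }
    per_locus = [
        sum(is_missing(row[j]) for row in table.values()) / len(table)
        for j in range(n_loci)
    ]
    return per_individual, per_locus

def surviving_per_population(per_individual, populations, cutoff=0.10):
    """Count how many individuals per population stay at or below the cutoff."""
    kept = {pop: 0 for pop in set(populations.values())}
    for ind, rate in per_individual.items():
        if rate <= cutoff:
            kept[populations[ind]] += 1
    return kept

# Toy data: population B has much more missing data than population A.
table = {
    "A1": ["0101", "0103", "0303", "0101"],
    "A2": ["0101", "0101", "0103", "0303"],
    "B1": ["0000", "0103", "0000", "0101"],
    "B2": ["0103", "0000", "0303", "0000"],
}
populations = {"A1": "A", "A2": "A", "B1": "B", "B2": "B"}

per_individual, per_locus = missing_rates(table)
kept = surviving_per_population(per_individual, populations)
```

In this toy case every individual from population B exceeds the cutoff, so the population ends up with zero individuals, which is the situation I see in my real data.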
Thank you very much in advance.
Regards,
Elisabetta