pop genetic analysis: genlight, genind and other questions


EPiazza

Feb 11, 2019, 9:52:26 AM
to poppr


Dear Kamvar,
I'm pretty new to population genetic analysis. I came across this group and would really appreciate any suggestions that might improve my analysis. I've found several R packages for population analysis, such as poppr, adegenet and strataG, but I'm not sure which would be a suitable option.

For my thesis project I have to analyze genotype data of sole. My dataset is an .xlsx file with individual IDs (rows), populations (meaning sampling locations) and SNP loci (columns). The genotypes are coded as 0101, 0103, etc., and I suppose that each allele is coded as 01, 02, 03 or 04, whereas 00 means no allele (e.g. 0000 is an NA value). At first glance, the data all seem to be biallelic, and I read in a previous post (https://groups.google.com/d/msg/poppr/KJhszeCKIDA/8wIPtL9tGgAJ) that you recommended the genlight object. However, I'm not sure about this, and I don't know which are the major and minor alleles in my data. How can I retrieve this information? Do you think that finding and replacing the values in Excel would be a good option?
Otherwise, I thought about using the df2genind function to import my data as a genind object and count the allele occurrences from there, especially because my dataset is quite small (436 samples, 426 SNPs) and I read that genlight objects are suggested when you have thousands of SNPs and a VCF file is available.

Moreover, the genind object has different slots such as population and strata. From this example (https://grunwaldlab.github.io/Population_Genetics_in_R/Population_Strata.html) it seems to me that strata can be used to define information such as the year of sampling, but I'm not sure, and from the documentation I couldn't work out the real difference between genind and genpop objects or when to use each.

Finally, I have a question about NA values: my data show several missing values for both individuals and loci. In previous related work I read about deleting individuals/loci with more than 10% NAs. However, an entire population is deleted using this approach. I read about functions such as missingno and its options, but I cannot find references to understand when each of them is preferable (remove loci/individuals, replace with the average allele count, or treat 00 as an additional allele). Can you recommend any?

Thank you very much in advance

Regards,
Elisabetta
 

Zhian Kamvar

Feb 11, 2019, 10:12:57 PM
to EPiazza, poppr
Hi Elisabetta,

See my answers below.

Best,
Zhian

On Feb 11, 2019, at 23:52 , EPiazza <elizabet...@gmail.com> wrote:



Dear Kamvar,
I'm pretty new to population genetic analysis. I came across this group and would really appreciate any suggestions that might improve my analysis. I've found several R packages for population analysis, such as poppr, adegenet and strataG, but I'm not sure which would be a suitable option.

You can use all of these packages together. I believe strataG has a conversion from genind to gtypes.


For my thesis project I have to analyze genotype data of sole. My dataset is an .xlsx file with individual IDs (rows), populations (meaning sampling locations) and SNP loci (columns). The genotypes are coded as 0101, 0103, etc., and I suppose that each allele is coded as 01, 02, 03 or 04, whereas 00 means no allele (e.g. 0000 is an NA value). At first glance, the data all seem to be biallelic, and I read in a previous post (https://groups.google.com/d/msg/poppr/KJhszeCKIDA/8wIPtL9tGgAJ) that you recommended the genlight object. However, I'm not sure about this, and I don't know which are the major and minor alleles in my data. How can I retrieve this information? Do you think that finding and replacing the values in Excel would be a good option?

I would suggest never using Excel for data cleaning. Always save your original data as read-only and create an R script specifically to generate clean data from the original data. I have an example of what I mean here: https://github.com/everhartlab/sclerotinia-366/blob/master/results/data-comparison.md
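
As a minimal sketch of such a cleaning script (not the approach used in the linked example), assuming the raw table has an id column, a site column and one column per SNP: the column names, the toy values and the output file name below are all hypothetical, and in practice the raw table would be read from the untouched original file instead of being typed in.

```r
# Toy stand-in for the raw table (hypothetical column names and values).
raw <- data.frame(
  id   = c("s1", "s2", "s3", "s4"),
  site = c("A", "A", "B", "B"),
  snp1 = c("0101", "0103", "0000", "0101"),
  snp2 = c("0202", "0000", "0204", "0202"),
  stringsAsFactors = FALSE
)

# Work on a copy so the raw data are never altered.
clean <- raw
loci  <- setdiff(names(clean), c("id", "site"))

# Recode the missing genotype "0000" as NA, in the locus columns only.
clean[loci] <- lapply(clean[loci], function(g) {
  g[g == "0000"] <- NA
  g
})

# Write the cleaned table to a new file; the original stays as it was.
out <- file.path(tempdir(), "sole_clean.csv")
write.csv(clean, out, row.names = FALSE)
```

Because every change lives in the script, the cleaned file can be regenerated from the original at any time and each recoding step is documented.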

Otherwise, I thought about using the df2genind function to import my data as a genind object and count the allele occurrences from there, especially because my dataset is quite small (436 samples, 426 SNPs) and I read that genlight objects are suggested when you have thousands of SNPs and a VCF file is available.

Given the small number of SNPs, you are correct that the genind object is more appropriate.
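
A sketch of that import, assuming genotypes are four-character strings with two characters per allele and "00" as the missing allele, as described in the question. The toy data and the names geno and site are made up for illustration. Counting alleles afterwards is one way to see which allele is the major and which the minor one at each locus.

```r
library(adegenet)

# Hypothetical toy genotypes: rows are individuals, columns are SNP loci.
geno <- data.frame(
  snp1 = c("0101", "0103", "0000", "0101"),
  snp2 = c("0202", "0000", "0204", "0202"),
  row.names = c("s1", "s2", "s3", "s4"),
  stringsAsFactors = FALSE
)
site <- c("A", "A", "B", "B")

# ncode = 2: each allele is two characters wide; "00" marks a missing allele.
x <- df2genind(geno, ncode = 2, NA.char = "00", ploidy = 2, pop = site)

# Total count of each allele across individuals, ignoring missing calls.
allele_counts <- colSums(tab(x), na.rm = TRUE)

# Group the counts by locus; the rarer allele at each locus is the minor one.
split(allele_counts, locFac(x))
```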


Moreover, the genind object has different slots such as population and strata. From this example (https://grunwaldlab.github.io/Population_Genetics_in_R/Population_Strata.html) it seems to me that strata can be used to define information such as the year of sampling, but I'm not sure, and from the documentation I couldn't work out the real difference between genind and genpop objects or when to use each.

The population strata can be a data frame that defines anything that can group your data, including year. For example, the "monpop" data set in poppr contains three levels: Tree, Year, and Symptom (https://grunwaldlab.github.io/poppr/reference/monpop.html).
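
To illustrate, here is a small sketch with made-up data, assuming two hypothetical strata columns, Site and Year, listed in the same order as the individuals. The strata only store the groupings; setPop() chooses which of them currently defines the populations.

```r
library(adegenet)

# Hypothetical toy genotypes (two characters per allele, "00" = missing).
geno <- data.frame(
  snp1 = c("0101", "0103", "0000", "0101"),
  snp2 = c("0202", "0000", "0204", "0202"),
  row.names = c("s1", "s2", "s3", "s4"),
  stringsAsFactors = FALSE
)
x <- df2genind(geno, ncode = 2, NA.char = "00", ploidy = 2)

# One row of metadata per individual (hypothetical values).
strata(x) <- data.frame(
  Site = c("A", "A", "B", "B"),
  Year = c(2017, 2018, 2017, 2018)
)

# Populations defined by site and year combined.
setPop(x) <- ~Site/Year
n_site_year <- nPop(x)

# Populations defined by sampling site only.
setPop(x) <- ~Site
n_site <- nPop(x)
```

Switching the population definition this way does not change the genotype data, so the same object can be analyzed by site, by year, or by both.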

You may not use genpop objects much. They are a simplification of genind objects that stores the count of each allele per population, and they can be used to create bootstrapped population dendrograms (See "Bootstrapping": https://www.frontiersin.org/articles/10.3389/fgene.2015.00208/full).
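
The difference is easy to see on a toy example (hypothetical data, same coding as in the question): the genind table has one row per individual, while the genpop table has one row per population.

```r
library(adegenet)

# Hypothetical toy genotypes with a population label per individual.
geno <- data.frame(
  snp1 = c("0101", "0103", "0000", "0101"),
  snp2 = c("0202", "0000", "0204", "0202"),
  row.names = c("s1", "s2", "s3", "s4"),
  stringsAsFactors = FALSE
)
x <- df2genind(geno, ncode = 2, NA.char = "00", ploidy = 2,
               pop = c("A", "A", "B", "B"))

# Collapse individuals into per-population allele counts.
gp <- genind2genpop(x, quiet = TRUE)

dim(tab(x))   # individuals x alleles
dim(tab(gp))  # populations x alleles
```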

Finally, I have a question about NA values: my data show several missing values for both individuals and loci. In previous related work I read about deleting individuals/loci with more than 10% NAs. However, an entire population is deleted using this approach. I read about functions such as missingno and its options, but I cannot find references to understand when each of them is preferable (remove loci/individuals, replace with the average allele count, or treat 00 as an additional allele). Can you recommend any?

Many analyses will either handle or ignore missing data. 10% is more of a "this feels right" measure, but there's no hard-and-fast rule for this. 
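
One way to explore this before deciding on a threshold is to tabulate the missing data first and then compare filters. The sketch below uses made-up data and arbitrary cutoffs (0.3 for loci, 0.5 for individuals) purely for illustration; they are not recommendations.

```r
library(adegenet)
library(poppr)

# Hypothetical toy genotypes: snp3 is missing in 3 of 5 individuals,
# and s5 is missing 2 of its 3 loci.
geno <- data.frame(
  snp1 = c("0101", "0103", "0303", "0101", "0103"),
  snp2 = c("0202", "0204", "0202", "0404", "0000"),
  snp3 = c("0102", "0000", "0000", "0101", "0000"),
  row.names = c("s1", "s2", "s3", "s4", "s5"),
  stringsAsFactors = FALSE
)
x <- df2genind(geno, ncode = 2, NA.char = "00", ploidy = 2,
               pop = c("A", "A", "B", "B", "B"))

# Missing data per locus and population; nothing is removed here.
miss <- info_table(x, type = "missing", plot = FALSE)

# Drop loci whose missing rate exceeds the cutoff.
x_loci <- missingno(x, type = "loci", cutoff = 0.3, quiet = TRUE)

# Alternatively, drop individuals whose missing rate exceeds the cutoff.
x_geno <- missingno(x, type = "genotypes", cutoff = 0.5, quiet = TRUE)
```

Checking the table per population before filtering shows whether a given cutoff would remove an entire population, which is the problem described in the question.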


