Defining strata and populations from vcf file

535 views
Skip to first unread message

Victoria Glynn

unread,
Mar 15, 2022, 11:40:16 PM3/15/22
to poppr
Hello everyone, 

I am a new user to poppr but would like to use it to define multilocus genotypes for 170 whole genome sequence samples. These samples followed the GenPipes pipeline for SNP calling, quality control, and mapping to my reference genome. I also filtered the resulting VCF file (e.g. MAF, missingness, depth, etc.), using plink, as per this tutorial.  

My input file is a VCF for poppr, which I first converted into a genid file, and then a genclone object in R. However, I do not have strata and population defined, and I am struggling to do this in R with poppr.

I have tried using the addStrata command and defining my strata as a vector in R, but the resulting genclone object does not have all my defined strata -- I defined 6, but under "Population information" only 2 strata are listed. Likewise, I tried using grep to define strata, as my sample names are defined as per the monpop tutorial (i.e. site_gulf_year), but I cannot access each sample's unique identifier. I have also been unable to find in the documentation how to set populations within an already created genclone object.

Any guidance would be deeply appreciated. 

Thank you in advance for your time and help. 

Best regards,

Victoria

Brian Knaus

unread,
Mar 16, 2022, 4:43:17 PM3/16/22
to Victoria Glynn, poppr
Hi Victoria,

I think we've documented some options below.


Good luck!
Brian

--
You received this message because you are subscribed to the Google Groups "poppr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to poppr+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/poppr/9b9622be-14af-4ce6-b2cb-e9d6b571b51an%40googlegroups.com.


--
Brian J. Knaus, Ph.D.
he/him/his
Corvallis, Oregon, USA
brianknaus.com

Victoria Glynn

unread,
Mar 23, 2022, 10:48:54 PM3/23/22
to poppr
Hello Brian,

Thank you for your message. I am already following the tutorial you shared -- it is very useful! 

I may be doing something incorrectly, but when I reach the mlg.filter() step, it appears I need to have a genclone and not a genid object. Unfortunately, upon converting my genid to a genclone object, the population data I loaded via genlight is removed. Below is the output I am receiving:

gid_filt <- mlg.filter(gid, distance = bruvo.dist) <- 1 + .Machine$double.eps^0.5

Warning: mlg.filter<- only has an effect on genclone objects.
If you want to utilize this functionality, please convert to a genclone object.
No modifications are taking place. 

My R commands to go from (1) my VCF file to a genlight object, (2) a genlight to genid object, and then (3) a genid to genclone object, are as follows: 

gl.rubi <- vcfR2genlight(vcfR)

ploidy(gl.rubi) <- 2

pop(gl.rubi) <- pop.data$reef

gid <- gl2gi(gl.rubi, probar = FALSE, verbose = NULL)

gid

locNames(gid) <- paste("SNP",1:nLoc(gid),sep=".")

loci <- locNames(gid)

gc <- as.genclone(gid)

However, in reaching this step, I cannot create a genclone object as I receive the following error:

Error in .local(.Object, ...) :
more than one '.' in column names; please name column as [LOCUS].[ALLELE] 

My locName(gid) output is as follows (first 10 lines):

[1] "SNP.1"    "SNP.2"    "SNP.3"    "SNP.4"    "SNP.5"    "SNP.6"    "SNP.7"    "SNP.8"  
[9] "SNP.9"    "SNP.10"

All my loci names are unique [anyDuplicated(loci) = 0]. Likewise, I saw from a previous post that it is recommended to remove the first row from the CSV file if this error appears. But in my case, I have a VCF file as my input. 

Any further guidance would be appreciated. 

Thank you again for your time and consideration.

Best regards,

Victoria

Zhian Kamvar

unread,
Mar 24, 2022, 10:43:51 AM3/24/22
to Victoria Glynn, poppr
Hi Victoria,

The steps you want to take are:

1. convert from vcf to genlight
3. confirm that the ploidy, populations, strata, and loci are correct
4. filter

it might look something like this:


My R commands to go from (1) my VCF file to a genlight object, (2) a genlight to genid object, and then (3) a genid to genclone object, are as follows: 

gl <- vcfR2genlight(vcfR) # convert to genlight
gl <- as.snpclone(gl) # add mlg information by converting to snpclone
ploidy(gl) <- 2 # enforce the ploidy
strata(gl) <- pop.data # set the strata
setPop(gl) <- ~reef # set the population (from the strata)
mlg.filter(gl) <- 0.05 + .Machine$double.eps^0.5 # filter to collapse clones with a maximum distance of 0.05% changes. Change this to whatever suits your needs.

The error message you received from mlg.filter() was correct, but confusing because in poppr because you can not convert a genlight to a genclone. There are four kinds of objects to consider:

genind and genlight objects handle multi-allelic and biallelic data, respectively, but do not contain any information about MLG.
genclone and snpclone objects both handle MLG data, and are complements of genind and genlight, converted with as.genclone() and as.snpclone(), respectively, but because of the differing types of underlying data, you can not convert a genlight to a genclone or a genind to a snpclone.

I should also note that you attempted to use Bruvo's distance in your filter command, which is only suitable for microsatellite data, not SNPs.

I hope that helps.

All the best,
Zhian

Victoria Glynn

unread,
Apr 13, 2022, 12:04:36 AM4/13/22
to Zhian Kamvar, poppr
Hello Zhian,

Thank you for your email. The steps you shared worked perfectly -- thank you again!

I was able to call MLGs from a SNPclone object following your steps above, and then filtering to collapse clones with a maximum distance of 0.05%. I now wish to calculate population genetic parameters, like Fst and observed/expected heterozygosity. Many of the tutorials I am following use as an input a genid object, to then convert this to a “hierfstat” object for the analyses, such as in this tutorial

I have been unable to find a command/package to convert a SNPclone to a genid object. If you have any insights on how to do this, or if you have another suggested route to calculate these metrics, I would deeply appreciate it. 

Thank you again for all your help and support thus far. 

Best regards,

Victoria
--
Victoria Marie Glynn
B.S. Environmental Sciences | Honors Program
University of California, Berkeley | Class of 2019 
Phi Beta Kappa 
Ronald E. McNair Post-Baccalaureate Achievement Program
Reply all
Reply to author
Forward
0 new messages